mirror of
https://github.com/ksyasuda/SubMiner.git
synced 2026-02-27 18:22:41 -08:00
1.7 KiB
1.7 KiB
Agent: codex-narrow-space-tokenizer-20260220T061716Z-p97s
- alias:
codex-narrow-space-tokenizer - mission:
Fix narrow/invisible subtitle spacing causing incorrect tokenizer boundaries. - status:
done - branch:
main - started_at:
2026-02-20T06:17:31Z - heartbeat_minutes:
5
Current Work (newest first)
- [2026-02-20T06:20:07Z] handoff: normalized invisible separators in tokenizer input; added regression test; targeted tests green.
- [2026-02-20T06:19:20Z] test:
bun run build && node --test dist/subtitle/stages/normalize.test.js(pass, 1/1);node --test dist/core/services/tokenizer.test.js(pass, 43/43). - [2026-02-20T06:18:38Z] edit: updated
normalizeTokenizerInputto mapU+200B/U+2060/U+FEFFto regular spaces before whitespace collapsing. - [2026-02-20T06:18:02Z] test: added failing regression for subtitle sample with
\u200Bseparator. - [2026-02-20T06:17:31Z] intent: create TASK-90; TDD-first regression for narrow Unicode spacing in subtitle line
キリキリと かかってこい. - [2026-02-20T06:17:31Z] progress: coordination started; index row added; scanning tokenizer normalization points next.
Files Touched
docs/subagents/INDEX.mddocs/subagents/agents/codex-narrow-space-tokenizer-20260220T061716Z-p97s.mddocs/subagents/collaboration.mdbacklog/tasks/task-90 - Normalize-narrow-Unicode-whitespace-in-tokenizer-input.mdsrc/subtitle/stages/normalize.tssrc/subtitle/stages/normalize.test.ts
Assumptions
- issue likely Unicode spacing code point treated as token boundary.
- target behavior: collapse/normalize narrow spacing to standard spacing before lookup token grouping.
Open Questions / Blockers
- possible overlap with TASK-85 refactor touching tokenizer paths.
Next Step
- done.