mirror of
https://github.com/ksyasuda/SubMiner.git
synced 2026-02-28 06:22:45 -08:00
1.4 KiB
| id | title | status | assignee | created_date | updated_date | labels | dependencies | priority | ordinal |
|---|---|---|---|---|---|---|---|---|---|
| TASK-90 | Normalize narrow Unicode whitespace in tokenizer input | Done | | 2026-02-20 06:17 | 2026-02-22 07:49 | | | medium | 94000 |
Description
Fix tokenizer behavior where subtitle lines containing narrow or invisible Unicode spacing between Japanese segments are split or grouped differently from lines that use regular spaces.
Acceptance Criteria
- #1 A regression test reproduces the subtitle sample containing narrow/invisible Unicode spacing and fails before the fix.
- #2 Tokenizer normalization treats narrow/invisible spacing variants consistently with regular spacing for grouping/highlight behavior.
- #3 Existing tokenizer tests still pass.
Implementation Notes
Linked from subagent session codex-narrow-space-tokenizer-20260220T061716Z-p97s.
Added a regression test in src/subtitle/stages/normalize.test.ts for a \u200B separator in the subtitle sample, and updated normalizeTokenizerInput to map U+200B/U+2060/U+FEFF to regular spaces before whitespace collapsing.
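The mapping described above can be sketched roughly as follows. This is an illustration only, not the actual normalizeTokenizerInput code: the function name `normalizeTokenizerInputSketch`, the constant name, and the choice to collapse whitespace runs with `/\s+/g` are all assumptions made for the example.

```typescript
// Hypothetical sketch; the real normalizeTokenizerInput may differ.
// U+200B ZERO WIDTH SPACE, U+2060 WORD JOINER, U+FEFF ZERO WIDTH NO-BREAK SPACE.
const INVISIBLE_SPACING = /[\u200B\u2060\uFEFF]/g;

function normalizeTokenizerInputSketch(text: string): string {
  return text
    // Step 1: map the invisible separators to a regular space.
    .replace(INVISIBLE_SPACING, " ")
    // Step 2 (assumed collapsing rule): squeeze whitespace runs to one space.
    .replace(/\s+/g, " ");
}

// A \u200B-separated line ends up identical to a space-separated one.
const withZeroWidth = normalizeTokenizerInputSketch("猫\u200Bが\u2060いる");
const withSpaces = normalizeTokenizerInputSketch("猫 が  いる");
console.log(withZeroWidth === withSpaces);
```

Doing the mapping before the collapse matters because JavaScript's `\s` class does not match U+200B or U+2060, so a collapse step alone would leave those characters in place.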
Validation:
bun run build && node --test dist/subtitle/stages/normalize.test.js
node --test dist/core/services/tokenizer.test.js