Files
SubMiner/backlog/tasks/task-90 - Normalize-narrow-Unicode-whitespace-in-tokenizer-input.md

1.3 KiB

id, title, status, assignee, created_date, updated_date, labels, dependencies, priority
id title status assignee created_date updated_date labels dependencies priority
TASK-90 Normalize narrow Unicode whitespace in tokenizer input Done
2026-02-20 06:17 2026-02-20 06:20
medium

Description

Fix tokenizer behavior where subtitle lines containing narrow/invisible Unicode spacing between Japanese segments can be split/grouped incorrectly compared with normal space handling.

Acceptance Criteria

  • #1 A regression test reproduces the subtitle sample containing narrow/invisible Unicode spacing and fails before fix.
  • #2 Tokenizer normalization treats narrow/invisible spacing variants consistently with regular spacing for grouping/highlight behavior.
  • #3 Existing tokenizer tests still pass.

Implementation Notes

Linked from subagent session codex-narrow-space-tokenizer-20260220T061716Z-p97s.

Added src/subtitle/stages/normalize.test.ts regression for \u200B separator in subtitle sample and updated normalizeTokenizerInput to map U+200B/U+2060/U+FEFF to regular spaces before whitespace collapsing.

Validation:

  • bun run build && node --test dist/subtitle/stages/normalize.test.js
  • node --test dist/core/services/tokenizer.test.js