SubMiner/backlog/tasks/task-90 - Normalize-narrow-Unicode-whitespace-in-tokenizer-input.md at 904ca3f3bbff358a6ec1fbdf34c09850623f4153 - SubMiner

mirror of https://github.com/ksyasuda/SubMiner.git synced 2026-02-28 06:22:45 -08:00

sudacode c480fe6ad4 update docs

2026-02-22 02:15:12 -08:00

| id | title | status | assignee | created_date | updated_date | labels | dependencies | priority | ordinal |
|---|---|---|---|---|---|---|---|---|---|
| TASK-90 | Normalize narrow Unicode whitespace in tokenizer input | Done | | 2026-02-20 06:17 | 2026-02-22 07:49 | | | medium | 94000 |

Description

Fix tokenizer behavior where subtitle lines that contain narrow or invisible Unicode spacing between Japanese segments are split or grouped differently from lines that use a regular space.

Acceptance Criteria

#1 A regression test reproduces the subtitle sample containing narrow/invisible Unicode spacing and fails before the fix.
#2 Tokenizer normalization treats narrow/invisible spacing variants consistently with regular spacing for grouping/highlight behavior.
#3 Existing tokenizer tests still pass.

Implementation Notes

Linked from subagent session codex-narrow-space-tokenizer-20260220T061716Z-p97s.

Added a regression test in src/subtitle/stages/normalize.test.ts for a \u200B separator in the subtitle sample, and updated normalizeTokenizerInput to map U+200B (zero width space), U+2060 (word joiner), and U+FEFF (zero width no-break space) to regular spaces before whitespace collapsing.

Validation:

bun run build && node --test dist/subtitle/stages/normalize.test.js
node --test dist/core/services/tokenizer.test.js
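
The normalization described in the notes above can be sketched roughly as follows. This is a minimal illustration, not the code in the repository: only the function name normalizeTokenizerInput and the three code points come from the task. The collapsing rule (`\s+` to a single space), the final trim, and the sample subtitle text are assumptions made for the example.

```typescript
// Hypothetical sketch; the real normalizeTokenizerInput in SubMiner may differ.
// Code points named in the task:
//   U+200B ZERO WIDTH SPACE, U+2060 WORD JOINER, U+FEFF ZERO WIDTH NO-BREAK SPACE
const INVISIBLE_SPACING = /[\u200B\u2060\uFEFF]/g;

function normalizeTokenizerInput(text: string): string {
  return text
    // Step 1: map invisible separators to a regular space.
    .replace(INVISIBLE_SPACING, " ")
    // Step 2 (assumed rule): collapse any run of whitespace to one space.
    .replace(/\s+/g, " ")
    // Step 3 (assumed): drop leading and trailing whitespace.
    .trim();
}

// Made-up sample line, not the subtitle from the actual regression test.
const withZeroWidth = normalizeTokenizerInput("キリキリと\u200Bかかってこい");
const withSpace = normalizeTokenizerInput("キリキリと かかってこい");

// Both inputs should normalize to the same string.
console.log(withZeroWidth === withSpace);
```

Mapping the invisible characters to a space before collapsing, rather than deleting them, keeps the segment boundary that the subtitle author intended, so the tokenizer sees the same input it would for a regular space.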
