| id | title | status | assignee | created_date | updated_date | labels | dependencies | priority |
|---|---|---|---|---|---|---|---|---|
| TASK-93 | Replace subtitle tokenizer with left-to-right Yomitan scanning parser | Done | | 2026-03-06 09:02 | 2026-03-06 09:14 | | | high |
Description
Replace the current parseText candidate-selection tokenizer with a GSM-style left-to-right Yomitan scanning tokenizer for all subtitles. Preserve downstream token contracts for rendering, JLPT/frequency/N+1 annotation, and MeCab enrichment while improving full-term matching for names and katakana compounds.
Acceptance Criteria
- #1 Subtitle tokenization uses a left-to-right Yomitan scanning strategy instead of parseText candidate selection.
- #2 Token surfaces, readings, headwords, and offsets remain compatible with existing renderer and annotation stages.
- #3 Known problematic name cases such as カズマ and バニール resolve to full-token dictionary matches when Yomitan can match them.
- #4 Regression tests cover left-to-right exact-match scanning, unmatched text handling, and downstream tokenizeSubtitle integration.
Final Summary
Replaced the live subtitle tokenization path with a left-to-right scanner built on Yomitan termsFind. The scanner advances greedily through the normalized subtitle text and preserves the downstream MergedToken contracts used by the renderer, MeCab enrichment, and JLPT, frequency, and N+1 annotation. Added runtime and integration coverage for exact-match scanning and for name cases like カズマ, and kept a compatibility fallback for older mocked parseText-style test payloads.
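
For illustration, the greedy left-to-right scan described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code that shipped with this task: `ScanToken`, `DictionaryMatch`, `FindLongestMatch`, `scanLeftToRight`, and the toy dictionary are all hypothetical names, and the lookup callback merely stands in for a Yomitan termsFind call whose real request and response shapes are not shown here.

```typescript
// Hypothetical token shape for illustration; NOT the project's MergedToken type.
interface ScanToken {
  surface: string;
  headword: string | null; // null when no dictionary entry matched
  reading: string | null;
  startPos: number; // inclusive UTF-16 offset into the subtitle text
  endPos: number; // exclusive UTF-16 offset
}

// Hypothetical result of a dictionary lookup anchored at the start of the input.
interface DictionaryMatch {
  matchedLength: number;
  headword: string;
  reading: string;
}

// Stand-in for a Yomitan-backed lookup (assumption): returns the longest
// entry that prefixes `text`, or null when nothing matches.
type FindLongestMatch = (text: string) => DictionaryMatch | null;

function scanLeftToRight(text: string, findLongestMatch: FindLongestMatch): ScanToken[] {
  const tokens: ScanToken[] = [];
  let pos = 0;
  let unmatchedStart = -1; // start of the current run of unmatched text, or -1

  // Emit any pending unmatched run as a single token without a headword.
  const flushUnmatched = (end: number): void => {
    if (unmatchedStart < 0) return;
    tokens.push({
      surface: text.slice(unmatchedStart, end),
      headword: null,
      reading: null,
      startPos: unmatchedStart,
      endPos: end,
    });
    unmatchedStart = -1;
  };

  while (pos < text.length) {
    const match = findLongestMatch(text.slice(pos));
    if (match !== null && match.matchedLength > 0) {
      flushUnmatched(pos);
      const end = pos + match.matchedLength;
      tokens.push({
        surface: text.slice(pos, end),
        headword: match.headword,
        reading: match.reading,
        startPos: pos,
        endPos: end,
      });
      pos = end; // greedy: jump past the whole match
    } else {
      if (unmatchedStart < 0) unmatchedStart = pos;
      // Advance by one code point so surrogate pairs are not split.
      const codePoint = text.codePointAt(pos) ?? 0;
      pos += codePoint > 0xffff ? 2 : 1;
    }
  }
  flushUnmatched(text.length);
  return tokens;
}

// Toy dictionary (assumption) containing both a full name and its shorter prefix.
const toyDictionary = new Map<string, string>([
  ["カズマ", "かずま"],
  ["カズ", "かず"],
  ["です", "です"],
]);

const toyLookup: FindLongestMatch = (text) => {
  for (let len = text.length; len > 0; len--) {
    const candidate = text.slice(0, len);
    const reading = toyDictionary.get(candidate);
    if (reading !== undefined) {
      return { matchedLength: len, headword: candidate, reading };
    }
  }
  return null;
};

const tokens = scanLeftToRight("カズマです!", toyLookup);
console.log(tokens.map((t) => t.surface)); // surfaces: カズマ, です, !
```

Because the lookup prefers the longest prefix and the scanner jumps past each match, カズマ stays a single token even though the shorter カズ is also in the toy dictionary. Unmatched text is grouped into one headword-less token per run, so offsets remain contiguous across the whole subtitle line.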