| id | title | status | assignee | created_date | updated_date | labels | dependencies | priority |
|---|---|---|---|---|---|---|---|---|
| TASK-88 | Remove MeCab fallback tokenizer and simplify Yomitan token flow | Done | | 2026-02-20 00:00 | 2026-02-23 01:44 | | | medium |
Description
Remove the MeCab fallback tokenization path and associated merge-selection complexity in subtitle tokenization. Treat Yomitan parser output as the single source of token boundaries/grouping, and keep only minimal normalization needed for downstream known-word, JLPT, and frequency annotation.
Action Steps
- Remove MeCab fallback execution from `tokenizeSubtitle` and delete dead fallback-specific branches.
- Remove merge/candidate-selection code that is only needed to reconcile MeCab-vs-Yomitan tokenization strategies.
- Keep Yomitan parsing pipeline with minimal structural token normalization only.
- Update MeCab usage so it is no longer required for tokenization fallback (retain only explicitly needed behavior, if any).
- Update docs/config notes to reflect Yomitan-only tokenization flow.
- Add regression tests for Yomitan-only success/failure paths and token annotation continuity.
Acceptance Criteria
- #1 Subtitle tokenization no longer falls back to MeCab when Yomitan parsing fails.
- #2 Token grouping logic is simplified to rely on Yomitan structure; redundant custom merge-selection logic removed.
- #3 Known-word, JLPT, frequency, and N+1 annotations still work on Yomitan-derived tokens.
- #4 If Yomitan parsing fails, behavior is explicit and tested (for example, `tokens: null` without a MeCab fallback path).
- #5 Documentation reflects that tokenization flow is Yomitan-first and Yomitan-only.
Implementation Notes
Removed MeCab fallback tokenization from src/core/services/tokenizer.ts; tokenizeSubtitle now returns tokens: null when Yomitan parsing/selecting yields no tokens.
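A minimal sketch of the Yomitan-only contract described in this note, not the actual code in src/core/services/tokenizer.ts. The `Token` and `SubtitleData` shapes, the injected `parseWithYomitan` callback, and the choice to treat a thrown parser error the same as an empty result are all assumptions made for illustration. The sketch is kept synchronous for brevity; the real function may well be asynchronous.

```typescript
// Hypothetical types; the real definitions may differ.
interface Token {
  surface: string;
  reading?: string;
}

interface SubtitleData {
  text: string;
  tokens: Token[] | null;
}

// Stand-in for the Yomitan parse/select step, injected so the sketch is self-contained.
type YomitanParse = (text: string) => Token[] | null;

function tokenizeSubtitle(text: string, parseWithYomitan: YomitanParse): SubtitleData {
  let tokens: Token[] | null = null;
  try {
    tokens = parseWithYomitan(text);
  } catch {
    // Assumption: a parser failure is handled like "no tokens"; no other tokenizer is tried.
    tokens = null;
  }
  // An empty or missing result is reported explicitly as tokens: null.
  if (tokens === null || tokens.length === 0) {
    return { text, tokens: null };
  }
  return { text, tokens };
}
```

Callers can then branch on `tokens === null` to skip annotation for that subtitle line instead of relying on a second tokenizer.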
Simplified parse candidate selection in src/core/services/tokenizer/parser-selection-stage.ts to scanning-parser sources only; added null behavior when only mecab parse candidates are present.
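The selection rule in this note can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of parser-selection-stage.ts: the `ParseCandidate` shape, the function name `selectParseCandidate`, and the "first non-empty scanning-parser candidate wins" policy are hypothetical. Only the two source labels and the null result for mecab-only input come from the note itself.

```typescript
// Hypothetical candidate shape; the source labels follow the wording of the note.
type ParseSource = "scanning-parser" | "mecab";

interface ParseCandidate {
  source: ParseSource;
  tokens: string[];
}

function selectParseCandidate(candidates: ParseCandidate[]): string[] | null {
  for (const candidate of candidates) {
    // Candidates from any source other than the scanning parser are ignored entirely.
    if (candidate.source === "scanning-parser" && candidate.tokens.length > 0) {
      return candidate.tokens;
    }
  }
  // Nothing usable from the scanning parser (including mecab-only input): report null.
  return null;
}
```

Because mecab candidates are never considered, there is no merge or tie-breaking step left between the two strategies, which is the simplification the task asks for.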
Updated tokenizer regression suites to reflect Yomitan-only flow while preserving annotation continuity checks (known-word, JLPT, frequency, N+1) in src/core/services/tokenizer.test.ts and src/core/services/tokenizer/parser-selection-stage.test.ts.
Updated docs to remove MeCab fallback positioning and clarify Yomitan-only tokenization in docs/usage.md and docs/troubleshooting.md.
Validation run passed:
- `bun test src/core/services/tokenizer/parser-selection-stage.test.ts src/core/services/tokenizer.test.ts`
- `bun test src/core/services/subtitle-processing-controller.test.ts`
- `bun run build`
- `bun run docs:build`
Definition of Done
- #1 `src/core/services/tokenizer.ts` no longer contains a MeCab fallback tokenization branch.
- #2 Tests cover Yomitan-only pipeline and failure behavior regressions.
- #3 Any removed MeCab-only merge helpers are deleted with no unused exports/imports.
- #4 Build and relevant tokenizer/subtitle tests pass.