SubMiner/backlog/tasks/task-88 - Remove-MeCab-fallback-tokenizer-and-simplify-Yomitan-token-flow.md
2026-02-22 19:35:19 -08:00


id: TASK-88
title: Remove MeCab fallback tokenizer and simplify Yomitan token flow
status: Done
assignee:
created_date: 2026-02-20 00:00
updated_date: 2026-02-23 01:44
labels: tokenizer, refactor
dependencies:
priority: medium

Description

Remove the MeCab fallback tokenization path and associated merge-selection complexity in subtitle tokenization. Treat Yomitan parser output as the single source of token boundaries/grouping, and keep only minimal normalization needed for downstream known-word, JLPT, and frequency annotation.

Action Steps

  1. Remove MeCab fallback execution from tokenizeSubtitle and delete dead fallback-specific branches.
  2. Remove merge/candidate-selection code that is only needed to reconcile MeCab-vs-Yomitan tokenization strategies.
  3. Keep Yomitan parsing pipeline with minimal structural token normalization only.
  4. Update MeCab usage so it is no longer required for tokenization fallback (retain only explicitly needed behavior, if any).
  5. Update docs/config notes to reflect Yomitan-only tokenization flow.
  6. Add regression tests for Yomitan-only success/failure paths and token annotation continuity.

Acceptance Criteria

  • #1 Subtitle tokenization no longer falls back to MeCab when Yomitan parsing fails.
  • #2 Token grouping logic is simplified to rely on Yomitan structure; redundant custom merge-selection logic removed.
  • #3 Known-word, JLPT, frequency, and N+1 annotations still work on Yomitan-derived tokens.
  • #4 If Yomitan parsing fails, behavior is explicit and tested (for example tokens: null without MeCab fallback path).
  • #5 Documentation reflects that the tokenization flow is Yomitan-first and Yomitan-only.
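
Criterion #3 can be pictured as an annotation pass that only depends on the token list, not on which tokenizer produced it. The sketch below is illustrative only: the type names, the `annotateTokens` and `findNPlusOneTarget` helpers, and the sample data are all hypothetical, and it assumes N+1 means "exactly one unknown token in the line", which may not match the project's actual rule.

```typescript
// Hypothetical token shape; the real SubMiner types may differ.
interface YomitanToken {
  surface: string;
  headword: string;
}

interface AnnotatedToken extends YomitanToken {
  isKnown: boolean;
  jlptLevel: string | null;
  frequencyRank: number | null;
}

// Hypothetical lookup sources for the three annotation kinds.
interface AnnotationSources {
  knownWords: Set<string>;
  jlptLevels: Map<string, string>;
  frequencyRanks: Map<string, number>;
}

// Annotate each Yomitan-derived token by headword lookup only.
function annotateTokens(tokens: YomitanToken[], sources: AnnotationSources): AnnotatedToken[] {
  return tokens.map((token) => ({
    ...token,
    isKnown: sources.knownWords.has(token.headword),
    jlptLevel: sources.jlptLevels.get(token.headword) ?? null,
    frequencyRank: sources.frequencyRanks.get(token.headword) ?? null,
  }));
}

// Assumed N+1 rule: the line qualifies when exactly one token is unknown.
function findNPlusOneTarget(tokens: AnnotatedToken[]): AnnotatedToken | null {
  const unknown = tokens.filter((token) => !token.isKnown);
  return unknown.length === 1 ? unknown[0] ?? null : null;
}

// Made-up sample data for illustration.
const sampleSources: AnnotationSources = {
  knownWords: new Set(["猫", "が"]),
  jlptLevels: new Map([["猫", "N5"]]),
  frequencyRanks: new Map([["好き", 500]]),
};

const annotated = annotateTokens(
  [
    { surface: "猫", headword: "猫" },
    { surface: "が", headword: "が" },
    { surface: "好き", headword: "好き" },
  ],
  sampleSources,
);
const nPlusOneTarget = findNPlusOneTarget(annotated);
```

Because the lookups key on the token's headword alone, removing the MeCab path should not change annotation results as long as Yomitan supplies the same headwords.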

Implementation Notes

Removed MeCab fallback tokenization from src/core/services/tokenizer.ts; tokenizeSubtitle now returns tokens: null when Yomitan parsing/selecting yields no tokens.
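
A minimal sketch of the resulting control flow, under stated assumptions: the `Token` and `SubtitleData` shapes and the injected `parseWithYomitan` function are hypothetical, the real function is presumably asynchronous, and treating a thrown parser error the same as an empty result is an assumption rather than confirmed behavior.

```typescript
// Hypothetical shapes; the real types in src/core/services/tokenizer.ts may differ.
interface Token {
  surface: string;
  reading: string;
}

interface SubtitleData {
  text: string;
  tokens: Token[] | null;
}

// Hypothetical stand-in for the Yomitan parsing/selection step.
type YomitanParser = (text: string) => Token[] | null;

function tokenizeSubtitle(text: string, parseWithYomitan: YomitanParser): SubtitleData {
  let tokens: Token[] | null = null;
  try {
    tokens = parseWithYomitan(text);
  } catch {
    // Assumption: a parser error is handled like "no tokens".
    tokens = null;
  }
  // There is no second tokenizer to try, so an empty result becomes tokens: null.
  if (tokens === null || tokens.length === 0) {
    return { text, tokens: null };
  }
  return { text, tokens };
}

const parsedLine = tokenizeSubtitle("猫が好き", () => [
  { surface: "猫", reading: "ねこ" },
  { surface: "が", reading: "が" },
  { surface: "好き", reading: "すき" },
]);
const failedLine = tokenizeSubtitle("猫が好き", () => null);
```

Here `parsedLine.tokens` holds the three Yomitan tokens, while `failedLine.tokens` is `null`, so callers can tell "not tokenized" apart from a tokenized line without guessing which tokenizer ran.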

Simplified parse candidate selection in src/core/services/tokenizer/parser-selection-stage.ts to consider scanning-parser sources only; selection now yields null when only mecab parse candidates are present.
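
The selection rule described here might look roughly like the following. This is a sketch, not the contents of parser-selection-stage.ts: the `ParseCandidate` shape and the `selectParseCandidate` name are hypothetical, and picking the first non-empty scanning-parser candidate is an assumed tie-break.

```typescript
// Hypothetical candidate shape: each parse result is tagged with the parser that produced it.
type ParseSource = "scanning-parser" | "mecab";

interface ParseCandidate {
  source: ParseSource;
  index: number;
  // Each inner array is one line of parsed segments.
  content: { text: string; reading: string }[][];
}

function selectParseCandidate(candidates: ParseCandidate[]): ParseCandidate | null {
  // No cross-strategy merge or scoring: mecab candidates are never considered.
  const selected = candidates.find(
    (candidate) =>
      candidate.source === "scanning-parser" &&
      candidate.content.some((line) => line.length > 0),
  );
  return selected ?? null;
}

const mecabOnly: ParseCandidate[] = [
  { source: "mecab", index: 0, content: [[{ text: "猫", reading: "ねこ" }]] },
];
const mixed: ParseCandidate[] = [
  { source: "mecab", index: 0, content: [[{ text: "猫", reading: "ねこ" }]] },
  { source: "scanning-parser", index: 1, content: [[{ text: "猫", reading: "ねこ" }]] },
];

const fromMecabOnly = selectParseCandidate(mecabOnly);
const fromMixed = selectParseCandidate(mixed);
```

With these inputs `fromMecabOnly` is `null` and `fromMixed` is the scanning-parser candidate, which mirrors the "null when only mecab candidates are present" behavior noted above.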

Updated tokenizer regression suites to reflect Yomitan-only flow while preserving annotation continuity checks (known-word, JLPT, frequency, N+1) in src/core/services/tokenizer.test.ts and src/core/services/tokenizer/parser-selection-stage.test.ts.

Updated docs to remove MeCab fallback positioning and clarify Yomitan-only tokenization in docs/usage.md and docs/troubleshooting.md.

Validation runs (all passed):

  • bun test src/core/services/tokenizer/parser-selection-stage.test.ts src/core/services/tokenizer.test.ts
  • bun test src/core/services/subtitle-processing-controller.test.ts
  • bun run build
  • bun run docs:build

Definition of Done

  • #1 src/core/services/tokenizer.ts no longer contains MeCab fallback tokenization branch.
  • #2 Tests cover Yomitan-only pipeline and failure behavior regressions.
  • #3 Any removed MeCab-only merge helpers are deleted with no unused exports/imports.
  • #4 Build and relevant tokenizer/subtitle tests pass.