SubMiner/backlog/tasks/task-88 - Remove-MeCab-fallback-tokenizer-and-simplify-Yomitan-token-flow.md at 904ca3f3bbff358a6ec1fbdf34c09850623f4153 - SubMiner

sudacode/SubMiner

mirror of https://github.com/ksyasuda/SubMiner.git synced 2026-02-27 18:22:41 -08:00

sudacode 64acf22292 update docs

2026-02-22 19:35:19 -08:00

3.3 KiB

id: TASK-88
title: Remove MeCab fallback tokenizer and simplify Yomitan token flow
status: Done
assignee: (none)
created_date: 2026-02-20 00:00
updated_date: 2026-02-23 01:44
labels: tokenizer, refactor
dependencies: (none)
priority: medium

Description

Remove the MeCab fallback tokenization path and associated merge-selection complexity in subtitle tokenization. Treat Yomitan parser output as the single source of token boundaries/grouping, and keep only minimal normalization needed for downstream known-word, JLPT, and frequency annotation.

Action Steps

Remove MeCab fallback execution from tokenizeSubtitle and delete dead fallback-specific branches.
Remove merge/candidate-selection code that is only needed to reconcile MeCab-vs-Yomitan tokenization strategies.
Keep Yomitan parsing pipeline with minimal structural token normalization only.
Update MeCab usage so it is no longer required for tokenization fallback (retain only explicitly needed behavior, if any).
Update docs/config notes to reflect Yomitan-only tokenization flow.
Add regression tests for Yomitan-only success/failure paths and token annotation continuity.

Acceptance Criteria

#1 Subtitle tokenization no longer falls back to MeCab when Yomitan parsing fails.
#2 Token grouping logic is simplified to rely on Yomitan structure; redundant custom merge-selection logic removed.
#3 Known-word, JLPT, frequency, and N+1 annotations still work on Yomitan-derived tokens.
#4 If Yomitan parsing fails, behavior is explicit and tested (for example, tokens: null with no MeCab fallback path).
#5 Documentation reflects that tokenization flow is Yomitan-first and Yomitan-only.
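
To make criteria #1 and #4 concrete, here is a minimal sketch of a Yomitan-only flow. It is not the code in src/core/services/tokenizer.ts: the `Token` and `SubtitleData` shapes, the `YomitanParse` callback, and the synchronous signature are assumptions made for illustration (the real parser call is likely asynchronous).

```typescript
// Hypothetical shapes; the real types in the tokenizer service may differ.
interface Token {
  surface: string;
}

interface SubtitleData {
  text: string;
  tokens: Token[] | null;
}

// Hypothetical stand-in for the Yomitan parser call.
type YomitanParse = (text: string) => Token[] | null;

function tokenizeSubtitle(text: string, parseWithYomitan: YomitanParse): SubtitleData {
  let tokens: Token[] | null;
  try {
    tokens = parseWithYomitan(text);
  } catch {
    // Parsing failed: there is no MeCab fallback, so report the failure as null.
    tokens = null;
  }
  // Treat an empty result like a failure so callers see a single explicit signal.
  if (tokens === null || tokens.length === 0) {
    return { text, tokens: null };
  }
  return { text, tokens };
}
```

Returning `tokens: null` rather than an empty array keeps "nothing to annotate" distinguishable from "annotated zero tokens" for downstream consumers.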

Implementation Notes

Removed MeCab fallback tokenization from src/core/services/tokenizer.ts; tokenizeSubtitle now returns tokens: null when Yomitan parsing/selecting yields no tokens.

Simplified parse candidate selection in src/core/services/tokenizer/parser-selection-stage.ts to consider scanning-parser sources only; selection now yields null when only mecab parse candidates are present.
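
The selection rule described above can be sketched roughly as follows. The `ParseCandidate` shape, the `source` field name, and the `selectParseCandidate` function name are hypothetical; only the "scanning-parser" and "mecab" source labels come from this note.

```typescript
// Hypothetical candidate shape; the real one in parser-selection-stage.ts may differ.
interface ParseCandidate {
  source: string; // assumed values: "scanning-parser" or "mecab"
  tokens: { surface: string }[];
}

function selectParseCandidate(candidates: ParseCandidate[]): ParseCandidate | null {
  // Only scanning-parser candidates with at least one token are eligible.
  const eligible = candidates.filter(
    (candidate) => candidate.source === "scanning-parser" && candidate.tokens.length > 0,
  );
  // With only mecab candidates (or none at all), nothing is selected.
  return eligible[0] ?? null;
}
```

Because mecab candidates are filtered out before any comparison, no merge or scoring step between the two tokenization strategies is needed.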

Updated tokenizer regression suites to reflect Yomitan-only flow while preserving annotation continuity checks (known-word, JLPT, frequency, N+1) in src/core/services/tokenizer.test.ts and src/core/services/tokenizer/parser-selection-stage.test.ts.
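
As an illustration of what an annotation continuity check exercises, the sketch below annotates Yomitan-derived token surfaces and derives an N+1 target. All names here (`AnnotatedToken`, `AnnotationSources`, `annotateTokens`, `findNPlusOneTarget`) are hypothetical, and N+1 is assumed to mean "exactly one token in the line is not yet known"; the actual test suites may check different fields.

```typescript
// Hypothetical annotation shapes used only for this sketch.
interface AnnotatedToken {
  surface: string;
  isKnown: boolean;
  jlptLevel: string | null;
  frequencyRank: number | null;
}

interface AnnotationSources {
  knownWords: Set<string>;
  jlptLevels: Map<string, string>;
  frequencyRanks: Map<string, number>;
}

function annotateTokens(surfaces: string[], sources: AnnotationSources): AnnotatedToken[] {
  return surfaces.map((surface) => ({
    surface,
    isKnown: sources.knownWords.has(surface),
    jlptLevel: sources.jlptLevels.get(surface) ?? null,
    frequencyRank: sources.frequencyRanks.get(surface) ?? null,
  }));
}

// Assumed N+1 rule: the line qualifies when exactly one token is unknown.
function findNPlusOneTarget(tokens: AnnotatedToken[]): AnnotatedToken | null {
  const unknown = tokens.filter((token) => !token.isKnown);
  return unknown.length === 1 ? (unknown[0] ?? null) : null;
}
```

A regression test in this style would feed token surfaces produced by the Yomitan-only path and assert that each annotation field is still populated as before.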

Updated docs to remove MeCab fallback positioning and clarify Yomitan-only tokenization in docs/usage.md and docs/troubleshooting.md.

Validation run passed:

bun test src/core/services/tokenizer/parser-selection-stage.test.ts src/core/services/tokenizer.test.ts
bun test src/core/services/subtitle-processing-controller.test.ts
bun run build
bun run docs:build

Definition of Done

#1 src/core/services/tokenizer.ts no longer contains MeCab fallback tokenization branch.
#2 Tests cover Yomitan-only pipeline and failure behavior regressions.
#3 Any removed MeCab-only merge helpers are deleted with no unused exports/imports.
#4 Build and relevant tokenizer/subtitle tests pass.
