---
id: TASK-77
title: Split tokenizer pipeline into parser selection enrichment and annotation stages
status: To Do
assignee: []
created_date: '2026-02-18 11:43'
updated_date: '2026-02-18 11:43'
labels:
  - tokenizer
  - subtitles
  - refactor
dependencies: []
priority: high
---

## Description

`src/core/services/tokenizer.ts` mixes parser-window lifecycle, candidate scoring, MeCab enrichment, and post-annotation (known/JLPT/frequency/N+1). This task decomposes the tokenizer into explicit pipeline stages with stable contracts.

## Suggestions

- Use a stage pipeline contract (`TokenizationInput -> StageOutput`) to isolate concerns.
- Isolate Yomitan parser-window lifecycle into a dedicated runtime adapter module.
- Keep heuristics/test fixtures in one place to make scoring behavior reviewable.

## Action Steps

1. Define stage interfaces: source parsing, candidate selection, POS enrichment, semantic annotation.
2. Extract parser runtime adapter (`ensure parser window`, `execute parse`) from annotation logic.
3. Extract candidate scoring/selection into pure module with fixture-based tests.
4. Extract annotation passes (known-word, frequency, JLPT, N+1) into composable functions.
5. Add regression tests with fixed subtitle inputs to lock behavior.
6. Update tokenizer architecture notes in docs.

## Acceptance Criteria

- [ ] #1 Tokenizer code is split into explicit stages with narrow interfaces
- [ ] #2 Candidate selection logic is pure + directly testable
- [ ] #3 Parser lifecycle concerns are separated from annotation passes
- [ ] #4 Existing tokenization behavior preserved in regression tests

## Definition of Done

- [ ] #1 Tokenizer-related test suites pass
- [ ] #2 New stage-level tests exist for scoring and annotation
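
## Implementation Notes

The stage contract from Suggestions and Action Steps 1 to 4 could look roughly like the sketch below. It is only an illustration. Every name in it (`Token`, `Candidate`, `Stage`, `selectCandidate`, `composePasses`, `markKnown`, `markJlpt`, `markNPlusOne`, `runPipeline`) is hypothetical and not taken from `tokenizer.ts`. The fewest-tokens scoring rule is a placeholder for the real heuristics. The N+1 pass assumes "exactly one unknown token in the line", which may differ from the existing behavior.

```typescript
// Hypothetical shapes; the real tokenizer types may differ.
interface Token {
  surface: string;
  pos?: string;               // set by POS enrichment (e.g. MeCab)
  isKnown?: boolean;          // set by the known-word pass
  jlptLevel?: string;         // set by the JLPT pass
  isNPlusOneTarget?: boolean; // set by the N+1 pass
}

interface Candidate {
  source: string; // which parser or dictionary produced this segmentation
  tokens: Token[];
}

// A stage maps one typed value to the next. Allowing a Promise lets the
// parser-window adapter sit behind the same contract as the pure stages.
type Stage<In, Out> = (input: In) => Out | Promise<Out>;

// Pure candidate selection. Placeholder heuristic: fewest tokens wins,
// and the earliest candidate wins ties.
function selectCandidate(candidates: Candidate[]): Candidate | null {
  if (candidates.length === 0) return null;
  return candidates.reduce((best, c) =>
    c.tokens.length < best.tokens.length ? c : best,
  );
}

// Annotation passes share one signature so they can be composed in order.
type AnnotationPass = (tokens: Token[]) => Token[];

function composePasses(...passes: AnnotationPass[]): AnnotationPass {
  return (tokens) => passes.reduce((acc, pass) => pass(acc), tokens);
}

function markKnown(known: ReadonlySet<string>): AnnotationPass {
  return (tokens) =>
    tokens.map((t) => ({ ...t, isKnown: known.has(t.surface) }));
}

function markJlpt(levels: ReadonlyMap<string, string>): AnnotationPass {
  return (tokens) =>
    tokens.map((t) => {
      const level = levels.get(t.surface);
      return level === undefined ? t : { ...t, jlptLevel: level };
    });
}

// Assumed N+1 rule: flag the single unknown token when there is exactly one.
// Must run after markKnown, since it reads isKnown.
const markNPlusOne: AnnotationPass = (tokens) => {
  const unknown = tokens.filter((t) => t.isKnown === false);
  if (unknown.length !== 1) return tokens;
  const target = unknown[0];
  return tokens.map((t) =>
    t === target ? { ...t, isNPlusOneTarget: true } : t,
  );
};

// Wires the four stages together; parse and enrich are injected so tests
// can pass fixtures instead of a real parser window or MeCab process.
async function runPipeline(
  text: string,
  parse: Stage<string, Candidate[]>,
  enrich: Stage<Token[], Token[]>,
  annotate: AnnotationPass,
): Promise<Token[]> {
  const candidates = await parse(text);
  const selected = selectCandidate(candidates);
  if (selected === null) return [];
  const enriched = await enrich(selected.tokens);
  return annotate(enriched);
}
```

Only `parse` and `enrich` touch external runtimes in this sketch, so selection and annotation can be tested with plain fixture arrays. Pass order matters in `composePasses`: an N+1 pass placed before the known-word pass would see no `isKnown` flags and flag nothing.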