mirror of
https://github.com/ksyasuda/SubMiner.git
synced 2026-02-27 18:22:41 -08:00
2.0 KiB
| id | title | status | assignee | created_date | updated_date | labels | dependencies | priority |
|---|---|---|---|---|---|---|---|---|
| TASK-77 | Split tokenizer pipeline into parser selection enrichment and annotation stages | To Do | | 2026-02-18 11:43 | 2026-02-18 11:43 | | | high |
Description
src/core/services/tokenizer.ts mixes parser-window lifecycle, candidate scoring, MeCab enrichment, and post-annotation (known/JLPT/frequency/N+1). This task decomposes the tokenizer into explicit pipeline stages with stable contracts.
Suggestions
- Use a stage pipeline contract (`TokenizationInput -> StageOutput`) to isolate concerns.
- Isolate Yomitan parser-window lifecycle into a dedicated runtime adapter module.
- Keep heuristics/test fixtures in one place to make scoring behavior reviewable.
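
The stage contract suggested above could look roughly like the sketch below. All names and fields other than `TokenizationInput` and `StageOutput` (for example `Stage`, `runPipeline`, `splitOnSpaces`, `tagAllAsNoun`) are hypothetical and not taken from the current `tokenizer.ts`; the field layout of both types is also an assumption. The sketch is synchronous for brevity; a stage that talks to the Yomitan parser window would presumably need an async variant.

```typescript
// Hypothetical shapes: the real fields in tokenizer.ts may differ.
interface TokenizationInput {
  text: string;
}

interface Token {
  surface: string;
  pos?: string;
}

interface StageOutput {
  text: string;
  tokens: Token[];
}

// Each stage receives the previous stage's output and returns a new one.
type Stage = (state: StageOutput) => StageOutput;

// Runs the stages in order, threading the state through each of them.
function runPipeline(input: TokenizationInput, stages: Stage[]): StageOutput {
  let state: StageOutput = { text: input.text, tokens: [] };
  for (const stage of stages) {
    state = stage(state);
  }
  return state;
}

// Toy stages standing in for source parsing and POS enrichment.
const splitOnSpaces: Stage = (state) => ({
  ...state,
  tokens: state.text
    .split(" ")
    .filter((part) => part.length > 0)
    .map((surface) => ({ surface })),
});

const tagAllAsNoun: Stage = (state) => ({
  ...state,
  tokens: state.tokens.map((t) => ({ ...t, pos: "noun" })),
});
```

Because every stage has the same signature, a stage can be tested in isolation by passing it a hand-built `StageOutput`, and stages can be reordered or swapped without touching the runner.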
Action Steps
- Define stage interfaces: source parsing, candidate selection, POS enrichment, semantic annotation.
- Extract parser runtime adapter (`ensure parser window`, `execute parse`) from annotation logic.
- Extract candidate scoring/selection into a pure module with fixture-based tests.
- Extract annotation passes (known-word, frequency, JLPT, N+1) into composable functions.
- Add regression tests with fixed subtitle inputs to lock behavior.
- Update tokenizer architecture notes in docs.
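
As an illustration of the "composable functions" step, annotation passes could share one signature and be chained with a small combinator. This is a sketch under assumptions: `AnnotatedToken`, `AnnotationPass`, `markKnown`, `markJlpt`, `markNPlusOne`, and `composePasses` are hypothetical names, and the N+1 rule used here (exactly one unknown token in the line) is an assumed simplification, not the project's actual rule. A frequency pass would follow the same pattern and is omitted.

```typescript
// Hypothetical token shape: field names are illustrative only.
interface AnnotatedToken {
  surface: string;
  isKnown?: boolean;
  jlptLevel?: string;
  isNPlusOneTarget?: boolean;
}

type AnnotationPass = (tokens: AnnotatedToken[]) => AnnotatedToken[];

// Pass factories close over their lookup data, so each pass stays pure.
const markKnown = (known: Set<string>): AnnotationPass => (tokens) =>
  tokens.map((t) => ({ ...t, isKnown: known.has(t.surface) }));

const markJlpt = (levels: Map<string, string>): AnnotationPass => (tokens) =>
  tokens.map((t) => {
    const level = levels.get(t.surface);
    return level === undefined ? t : { ...t, jlptLevel: level };
  });

// Assumed rule: a line is N+1 when exactly one token is unknown.
// Must run after markKnown, since it reads isKnown.
const markNPlusOne: AnnotationPass = (tokens) => {
  const unknownTokens = tokens.filter((t) => t.isKnown === false);
  if (unknownTokens.length !== 1) return tokens;
  const target = unknownTokens[0];
  return tokens.map((t) => (t === target ? { ...t, isNPlusOneTarget: true } : t));
};

// Applies the passes left to right.
const composePasses = (...passes: AnnotationPass[]): AnnotationPass =>
  (tokens) => passes.reduce((acc, pass) => pass(acc), tokens);

const annotate = composePasses(
  markKnown(new Set(["猫", "が"])),
  markJlpt(new Map([["猫", "N5"]])),
  markNPlusOne,
);

const annotated = annotate([
  { surface: "猫" },
  { surface: "が" },
  { surface: "鳴く" },
]);
```

Ordering dependencies between passes (here, N+1 depending on the known-word pass) become visible in the `composePasses` call instead of being buried in one large function.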
Acceptance Criteria
- #1 Tokenizer code is split into explicit stages with narrow interfaces
- #2 Candidate selection logic is pure and directly testable
- #3 Parser lifecycle concerns are separated from annotation passes
- #4 Existing tokenization behavior is preserved in regression tests
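
To make criterion #2 concrete, a pure selection function can be exercised directly from a fixture table with no parser window involved. The sketch below is only an illustration: `ParseCandidate`, `scoreCandidate`, `selectCandidate`, and the scoring heuristic itself (penalize single-character tokens, then token count) are hypothetical and do not reflect the scoring currently in `tokenizer.ts`.

```typescript
// Hypothetical candidate shape: one parse result for a subtitle line.
interface ParseCandidate {
  source: string; // label for whichever parser produced the candidate
  tokens: string[];
}

// Illustrative heuristic only: fewer single-character tokens is better,
// then fewer tokens overall. Higher score wins.
function scoreCandidate(candidate: ParseCandidate): number {
  const singles = candidate.tokens.filter((t) => t.length === 1).length;
  return -(singles * 10 + candidate.tokens.length);
}

// Pure selection: no I/O, no shared state; ties keep the earliest candidate.
function selectCandidate(candidates: ParseCandidate[]): ParseCandidate | undefined {
  let best: ParseCandidate | undefined;
  let bestScore = -Infinity;
  for (const candidate of candidates) {
    const score = scoreCandidate(candidate);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}

// Fixture table: input candidates and the expected winner.
const selectionFixtures = [
  {
    name: "prefers fewer fragments",
    candidates: [
      { source: "a", tokens: ["食", "べ", "る"] },
      { source: "b", tokens: ["食べる"] },
    ],
    expectedSource: "b",
  },
  {
    name: "keeps the first candidate on a tie",
    candidates: [
      { source: "a", tokens: ["今日", "は"] },
      { source: "b", tokens: ["今日", "は"] },
    ],
    expectedSource: "a",
  },
];
```

A test runner would loop over `selectionFixtures` and compare `selectCandidate(fixture.candidates)?.source` with `fixture.expectedSource`, so a change to the heuristic shows up as a reviewable fixture diff.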
Definition of Done
- #1 Tokenizer-related test suites pass
- #2 New stage-level tests exist for scoring and annotation