SubMiner/backlog/tasks/task-77 - Split-tokenizer-pipeline-into-parser-selection-enrichment-and-annotation-stages.md at 10b94ce88954fd0af1424cfe29f697c14b2dfa57 - SubMiner


mirror of https://github.com/ksyasuda/SubMiner.git synced 2026-02-27 18:22:41 -08:00


kyasuda f299f2a19e chore: switch texthooker-ui workflow to pnpm and add backlog tasks

2026-02-18 18:05:42 -08:00

2.0 KiB


id: TASK-77
title: Split tokenizer pipeline into parser, selection, enrichment, and annotation stages
status: To Do
assignee:
created_date: 2026-02-18 11:43
updated_date:
labels: tokenizer, subtitles, refactor
dependencies:
priority: high

Description

src/core/services/tokenizer.ts currently mixes four concerns in one module: parser-window lifecycle, candidate scoring, MeCab enrichment, and post-processing annotation (known-word, JLPT, frequency, and N+1 marking). This task decomposes the tokenizer into explicit pipeline stages with stable contracts.

Suggestions

Use a stage pipeline contract (TokenizationInput -> StageOutput) to isolate concerns.
Isolate Yomitan parser-window lifecycle into a dedicated runtime adapter module.
Keep heuristics/test fixtures in one place to make scoring behavior reviewable.
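
As a rough illustration of the stage pipeline contract suggested above, a minimal sketch might look like the following. Every name here (`TokenizationInput`, `StageOutput`, `SketchToken`, `PipelineStage`, `runPipeline`, and the two toy stages) is hypothetical and not taken from the current SubMiner code. The stages are synchronous to keep the sketch short; a real parser stage would probably need to be async.

```typescript
// Hypothetical types: a sketch of the contract, not SubMiner's actual API.
interface TokenizationInput {
  text: string; // raw subtitle line
}

interface SketchToken {
  surface: string;
  pos?: string; // would be filled in by a POS enrichment stage
}

interface StageOutput {
  text: string;
  tokens: SketchToken[];
}

// Each stage receives the previous stage's output and returns a new value.
type PipelineStage = (input: StageOutput) => StageOutput;

function runPipeline(input: TokenizationInput, stages: PipelineStage[]): StageOutput {
  let current: StageOutput = { text: input.text, tokens: [] };
  for (const stage of stages) {
    current = stage(current);
  }
  return current;
}

// Toy stand-in for source parsing: splits on whitespace only.
const splitOnWhitespace: PipelineStage = (s) => ({
  ...s,
  tokens: s.text
    .split(/\s+/)
    .filter((part) => part.length > 0)
    .map((surface) => ({ surface })),
});

// Toy stand-in for POS enrichment: tags every token with the same label.
const tagEverythingAsNoun: PipelineStage = (s) => ({
  ...s,
  tokens: s.tokens.map((t) => ({ ...t, pos: "noun" })),
});

const pipelineResult = runPipeline({ text: "猫 が 好き" }, [
  splitOnWhitespace,
  tagEverythingAsNoun,
]);
```

Because each stage only sees a `StageOutput`, a stage can be swapped or tested alone without standing up the others.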

Action Steps

Define stage interfaces: source parsing, candidate selection, POS enrichment, semantic annotation.
Extract parser runtime adapter (ensure parser window, execute parse) from annotation logic.
Extract candidate scoring/selection into pure module with fixture-based tests.
Extract annotation passes (known-word, frequency, JLPT, N+1) into composable functions.
Add regression tests with fixed subtitle inputs to lock behavior.
Update tokenizer architecture notes in docs.
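
The step about composable annotation passes could be sketched along these lines. The type and function names (`AnnotatedToken`, `AnnotationPass`, `markKnown`, `markJlpt`, `markNPlusOne`, `composePasses`) are hypothetical, and the N+1 rule used here (a line qualifies when exactly one token is unknown) is an assumption made for illustration, not a description of the existing behavior.

```typescript
// Hypothetical shapes: not taken from the current tokenizer.ts.
interface AnnotatedToken {
  surface: string;
  isKnown?: boolean;
  jlptLevel?: string;
  isNPlusOneTarget?: boolean;
}

// A pass is a pure function from tokens to new tokens.
type AnnotationPass = (tokens: readonly AnnotatedToken[]) => AnnotatedToken[];

const markKnown = (known: ReadonlySet<string>): AnnotationPass => (tokens) =>
  tokens.map((t) => ({ ...t, isKnown: known.has(t.surface) }));

const markJlpt = (levels: ReadonlyMap<string, string>): AnnotationPass => (tokens) =>
  tokens.map((t) => {
    const level = levels.get(t.surface);
    return level === undefined ? { ...t } : { ...t, jlptLevel: level };
  });

// Assumed rule: the single unknown token in a line is the N+1 target.
// This pass relies on markKnown having run earlier in the composition.
const markNPlusOne: AnnotationPass = (tokens) => {
  const unknownCount = tokens.filter((t) => t.isKnown === false).length;
  return tokens.map((t) => ({
    ...t,
    isNPlusOneTarget: unknownCount === 1 && t.isKnown === false,
  }));
};

// Runs passes left to right, feeding each result into the next pass.
const composePasses = (...passes: AnnotationPass[]): AnnotationPass => (tokens) =>
  passes.reduce<AnnotatedToken[]>((acc, pass) => pass(acc), [...tokens]);

const annotate = composePasses(
  markKnown(new Set(["猫", "が"])),
  markJlpt(new Map([["猫", "N5"]])),
  markNPlusOne,
);

const annotated = annotate([{ surface: "猫" }, { surface: "が" }, { surface: "好き" }]);
```

Keeping each pass pure makes the ordering dependency (known-word before N+1) visible at the composition site rather than hidden inside one large function.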

Acceptance Criteria

#1 Tokenizer code is split into explicit stages with narrow interfaces
#2 Candidate selection logic is pure and directly testable
#3 Parser lifecycle concerns are separated from annotation passes
#4 Existing tokenization behavior is preserved, as verified by regression tests
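
To make criterion #2 concrete, here is one way a pure, fixture-tested selection module might be shaped. The names (`ParseCandidate`, `scoreCandidate`, `selectBestCandidate`) are hypothetical, and the scoring rule is a placeholder chosen only for the example: it rejects candidates that do not reproduce the input text and otherwise prefers fewer tokens. It is not the heuristic the tokenizer uses today.

```typescript
// Hypothetical candidate shape: not SubMiner's actual data model.
interface ParseCandidate {
  source: string; // label for whichever parser produced the candidate
  tokens: string[];
}

// Placeholder heuristic (an assumption for this sketch):
// candidates must reproduce the text exactly; fewer tokens score higher.
function scoreCandidate(text: string, candidate: ParseCandidate): number {
  if (candidate.tokens.join("") !== text) return Number.NEGATIVE_INFINITY;
  return -candidate.tokens.length;
}

// Pure selection: no I/O, no parser window, same input gives same output.
// Ties keep the earliest candidate; returns undefined if none is valid.
function selectBestCandidate(
  text: string,
  candidates: readonly ParseCandidate[],
): ParseCandidate | undefined {
  let best: ParseCandidate | undefined;
  let bestScore = Number.NEGATIVE_INFINITY;
  for (const candidate of candidates) {
    const score = scoreCandidate(text, candidate);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}

// A fixture pairs a fixed subtitle line with the candidates to choose from.
const selectionFixture = {
  text: "猫が好き",
  candidates: [
    { source: "a", tokens: ["猫", "が", "好", "き"] },
    { source: "b", tokens: ["猫", "が", "好き"] },
    { source: "c", tokens: ["猫が", "好"] }, // does not reproduce the text
  ],
};

const selected = selectBestCandidate(selectionFixture.text, selectionFixture.candidates);
```

With the scoring isolated like this, a change to the heuristic shows up as a fixture diff that can be reviewed on its own.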

Definition of Done

#1 Tokenizer-related test suites pass
#2 New stage-level tests exist for scoring and annotation
