SubMiner/backlog/tasks/task-77 - Split-tokenizer-pipeline-into-parser-selection-enrichment-and-annotation-stages.md at a6d85def3446c0ca56ce196cca68e2e64a780424 - SubMiner

sudacode/SubMiner

mirror of https://github.com/ksyasuda/SubMiner.git synced 2026-02-27 18:22:41 -08:00

sudacode c480fe6ad4 update docs

2026-02-22 02:15:12 -08:00


| id | title | status | assignee | created_date | updated_date | labels | dependencies | priority | ordinal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TASK-77 | Split tokenizer pipeline into parser selection enrichment and annotation stages | Done | @opencode | 2026-02-18 11:43 | 2026-02-22 07:49 | tokenizer, subtitles, refactor | | high | 80000 |

Description

src/core/services/tokenizer.ts currently mixes four concerns in one module: parser-window lifecycle, candidate scoring, MeCab enrichment, and post-tokenization annotation (known/JLPT/frequency/N+1). This task decomposes the tokenizer into explicit pipeline stages with stable contracts.

Suggestions

Use a stage pipeline contract (TokenizationInput -> StageOutput) to isolate concerns.
Isolate Yomitan parser-window lifecycle into a dedicated runtime adapter module.
Keep heuristics/test fixtures in one place to make scoring behavior reviewable.
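A minimal sketch of what such a stage contract could look like. The task names TokenizationInput and StageOutput but does not define them, so every field below, along with `PipelineToken`, `Stage`, `runPipeline`, and the two toy stages, is a hypothetical illustration rather than the repository's actual types.

```typescript
// Hypothetical shapes: the task names TokenizationInput and StageOutput
// but does not define their fields.
interface PipelineToken {
  surface: string;
  isKnown?: boolean;
}

interface TokenizationInput {
  text: string;
}

interface StageOutput {
  text: string;
  tokens: PipelineToken[];
}

// Each stage receives the previous stage's output and returns a new value.
// Kept synchronous for brevity; a parser-backed stage would likely be async.
type Stage = (previous: StageOutput) => StageOutput;

function runPipeline(input: TokenizationInput, stages: readonly Stage[]): StageOutput {
  const initial: StageOutput = { text: input.text, tokens: [] };
  return stages.reduce((current, stage) => stage(current), initial);
}

// Toy "source parsing" stage: splits on whitespace (a stand-in for a real parser).
const splitOnWhitespace: Stage = (previous) => ({
  ...previous,
  tokens: previous.text
    .split(/\s+/)
    .filter((surface) => surface.length > 0)
    .map((surface) => ({ surface })),
});

// Toy "annotation" stage: marks tokens found in a known-word set.
const markKnown = (known: ReadonlySet<string>): Stage => (previous) => ({
  ...previous,
  tokens: previous.tokens.map((token) => ({ ...token, isKnown: known.has(token.surface) })),
});
```

Because every stage shares one signature, a stage can be tested by calling it directly with a hand-built StageOutput, without constructing the rest of the pipeline.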

Action Steps

Define stage interfaces: source parsing, candidate selection, POS enrichment, semantic annotation.
Extract parser runtime adapter (ensure parser window, execute parse) from annotation logic.
Extract candidate scoring/selection into pure module with fixture-based tests.
Extract annotation passes (known-word, frequency, JLPT, N+1) into composable functions.
Add regression tests with fixed subtitle inputs to lock behavior.
Update tokenizer architecture notes in docs.
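The "composable functions" step for annotation passes might look roughly like the sketch below. All names (`AnnotatedToken`, `AnnotationPass`, `composePasses`, and the individual passes) are hypothetical, and the N+1 rule shown, flagging the one unknown token in a line that has exactly one, is an assumed definition that may differ from SubMiner's.

```typescript
// Hypothetical token shape for illustration only.
interface AnnotatedToken {
  surface: string;
  isKnown?: boolean;
  frequencyRank?: number;
  isNPlusOneTarget?: boolean;
}

// Every pass has the same signature, so passes can be chained in any order.
type AnnotationPass = (tokens: readonly AnnotatedToken[]) => AnnotatedToken[];

// Runs the passes left to right, feeding each result into the next pass.
const composePasses = (...passes: AnnotationPass[]): AnnotationPass =>
  (tokens) => passes.reduce<AnnotatedToken[]>((acc, pass) => pass(acc), [...tokens]);

const knownWordPass = (known: ReadonlySet<string>): AnnotationPass => (tokens) =>
  tokens.map((token) => ({ ...token, isKnown: known.has(token.surface) }));

const frequencyPass = (ranks: ReadonlyMap<string, number>): AnnotationPass => (tokens) =>
  tokens.map((token) => {
    const rank = ranks.get(token.surface);
    return rank === undefined ? { ...token } : { ...token, frequencyRank: rank };
  });

// Assumed N+1 rule: if exactly one token is unknown, flag it as the target.
// Relies on knownWordPass having run earlier in the chain.
const nPlusOnePass: AnnotationPass = (tokens) => {
  const unknownCount = tokens.filter((token) => token.isKnown === false).length;
  return tokens.map((token) =>
    unknownCount === 1 && token.isKnown === false
      ? { ...token, isNPlusOneTarget: true }
      : { ...token },
  );
};
```

Ordering matters in this sketch: `nPlusOnePass` reads the `isKnown` flag, so it has to come after `knownWordPass` in the composed chain.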

Acceptance Criteria

#1 Tokenizer code is split into explicit stages with narrow interfaces
#2 Candidate selection logic is pure + directly testable
#3 Parser lifecycle concerns are separated from annotation passes
#4 Existing tokenization behavior preserved in regression tests

Implementation Plan

Extract pure parser-selection stage from src/core/services/tokenizer.ts into src/core/services/tokenizer/parser-selection-stage.ts (parse-result mapping + candidate scoring/selection) and add direct stage tests for source preference/tie-break scoring.
Extract MeCab POS1 enrichment stage into src/core/services/tokenizer/parser-enrichment-stage.ts with direct tests for overlap and surface-sequence fallback behavior.
Extract annotation stage into src/core/services/tokenizer/annotation-stage.ts to handle known-word/frequency/JLPT/N+1 passes behind a narrow API, with new stage-level tests.
Separate parser window/runtime lifecycle into src/core/services/tokenizer/yomitan-parser-runtime.ts, keep tokenizer.ts as a thin orchestrator, run the tokenizer and core src/dist test gates, then finalize TASK-77 AC/DoD evidence in Backlog MCP.
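To illustrate why a pure selection stage is easy to test, here is a sketch of candidate scoring with a source-preference tie-break. The candidate shape, the source labels, and the scoring heuristic (fewer tokens scores higher) are all assumptions made for this example; the plan does not spell out the actual rules in parser-selection-stage.ts.

```typescript
// Hypothetical candidate shape; the real parse-result mapping is not shown in the task.
interface ParseCandidate {
  source: string; // assumed labels such as "scanning-parser" or "mecab"
  tokens: string[]; // token surfaces in order
}

// Assumed preference order, used only to break score ties.
const SOURCE_PREFERENCE: readonly string[] = ["scanning-parser", "mecab"];

// Assumed heuristic: fewer tokens means longer matches, so it scores higher.
function scoreCandidate(candidate: ParseCandidate): number {
  return -candidate.tokens.length;
}

// Lower rank is better; unknown sources rank after all listed ones.
function sourceRank(source: string): number {
  const index = SOURCE_PREFERENCE.indexOf(source);
  return index === -1 ? SOURCE_PREFERENCE.length : index;
}

// Pure function: no parser window, no I/O, same input always gives the same output.
function selectCandidate(candidates: readonly ParseCandidate[]): ParseCandidate | null {
  let best: ParseCandidate | null = null;
  for (const candidate of candidates) {
    if (best === null) {
      best = candidate;
      continue;
    }
    const scoreDiff = scoreCandidate(candidate) - scoreCandidate(best);
    const winsTie = scoreDiff === 0 && sourceRank(candidate.source) < sourceRank(best.source);
    if (scoreDiff > 0 || winsTie) {
      best = candidate;
    }
  }
  return best;
}
```

Since the function takes plain data and returns plain data, fixture-based tests can feed it recorded parser output and assert on the chosen candidate directly.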

Implementation Notes

2026-02-21: started execution pass in current session; loaded Backlog context and tokenizer module/tests before drafting implementation plan via writing-plans skill.

Implemented tokenizer pipeline split with new stage modules: src/core/services/tokenizer/parser-selection-stage.ts, src/core/services/tokenizer/parser-enrichment-stage.ts, src/core/services/tokenizer/annotation-stage.ts, and parser lifecycle runtime in src/core/services/tokenizer/yomitan-parser-runtime.ts; reduced src/core/services/tokenizer.ts to an orchestration facade over the stages.
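As an illustration of the overlap behavior that the enrichment stage tests target, the sketch below copies POS1 from the MeCab token whose character span overlaps a parsed token the most. The span-based token shapes and the function names are hypothetical, and the surface-sequence fallback is not covered here because the notes do not describe it.

```typescript
// Hypothetical span-based shapes; `end` is exclusive.
interface SpanToken {
  surface: string;
  start: number;
  end: number;
  pos1?: string;
}

interface MecabSpan {
  start: number;
  end: number;
  pos1: string;
}

// Number of characters shared by two half-open spans.
function spanOverlap(a: { start: number; end: number }, b: { start: number; end: number }): number {
  return Math.max(0, Math.min(a.end, b.end) - Math.max(a.start, b.start));
}

// For each token, take pos1 from the MeCab span with the largest overlap.
// Tokens with no overlapping MeCab span are returned without a pos1.
function enrichWithPos1(tokens: readonly SpanToken[], mecab: readonly MecabSpan[]): SpanToken[] {
  return tokens.map((token) => {
    let bestPos1: string | undefined;
    let bestOverlap = 0;
    for (const span of mecab) {
      const shared = spanOverlap(token, span);
      if (shared > bestOverlap) {
        bestOverlap = shared;
        bestPos1 = span.pos1;
      }
    }
    return bestPos1 === undefined ? { ...token } : { ...token, pos1: bestPos1 };
  });
}
```

In this sketch, ties in overlap go to the earlier MeCab span because only a strictly larger overlap replaces the current best.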

Added direct stage-level tests: src/core/services/tokenizer/parser-selection-stage.test.ts, src/core/services/tokenizer/parser-enrichment-stage.test.ts, and src/core/services/tokenizer/annotation-stage.test.ts, and wired them into test:core:src + test:core:dist scripts in package.json.

Verification:
bun test src/core/services/tokenizer.test.ts src/core/services/tokenizer/annotation-stage.test.ts src/core/services/tokenizer/parser-selection-stage.test.ts src/core/services/tokenizer/parser-enrichment-stage.test.ts PASS (53/53)
bun run test:core:src PASS (219 pass, 6 skip)
bun run build PASS
bun run test:core:dist PASS (214 pass, 10 skip)

Final Summary

Split tokenizer internals into explicit stages while preserving external behavior: parser candidate mapping/selection moved to parser-selection-stage, MeCab POS1 enrichment moved to parser-enrichment-stage, post-token annotation (known-word, frequency, JLPT, N+1) moved to annotation-stage, and Yomitan parser window lifecycle isolated in yomitan-parser-runtime. src/core/services/tokenizer.ts now acts as a thin orchestrator that normalizes subtitle text, requests parser output, runs the stage pipeline, and handles MeCab fallback.
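The orchestrator role described above can be sketched as a function that receives its collaborators as injected dependencies. Everything here is hypothetical: the `TokenizerDeps` shape, the whitespace normalization, and the rule that MeCab is used when the primary parser returns nothing are assumptions for illustration, and the calls are synchronous only to keep the sketch short.

```typescript
// Hypothetical token and dependency shapes; not the repository's actual API.
interface OrchestratedToken {
  surface: string;
  pos1?: string;
  isKnown?: boolean;
}

interface TokenizerDeps {
  // Returns null when the parser runtime is unavailable (assumed convention).
  parseWithYomitan: (text: string) => OrchestratedToken[] | null;
  parseWithMecab: (text: string) => OrchestratedToken[];
  enrich: (tokens: OrchestratedToken[]) => OrchestratedToken[];
  annotate: (tokens: OrchestratedToken[]) => OrchestratedToken[];
}

function tokenizeSubtitle(rawText: string, deps: TokenizerDeps): OrchestratedToken[] {
  // Assumed normalization: collapse whitespace runs and trim the ends.
  const text = rawText.replace(/\s+/g, " ").trim();
  if (text.length === 0) {
    return [];
  }

  // Prefer the primary parser; fall back to MeCab when it yields nothing.
  const parsed = deps.parseWithYomitan(text);
  const base = parsed !== null && parsed.length > 0 ? parsed : deps.parseWithMecab(text);

  // The orchestrator only sequences stages; it holds no scoring or annotation logic.
  return deps.annotate(deps.enrich(base));
}
```

Injecting the parser runtime this way lets the orchestration path, including the fallback branch, be exercised with stub functions instead of a live parser window.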

Added direct stage-level tests for scoring/selection and annotation semantics (parser-selection-stage.test.ts, parser-enrichment-stage.test.ts, annotation-stage.test.ts) and included them in both source and dist core test lanes via package.json. Validation passed across targeted tokenizer tests plus full core gates (test:core:src, build, test:core:dist) with no tokenizer regression.

Definition of Done

#1 Tokenizer-related test suites pass
#2 New stage-level tests exist for scoring and annotation
