SubMiner/backlog/tasks/task-93 - Replace-subtitle-tokenizer-with-left-to-right-Yomitan-scanning-parser.md

id: TASK-93
title: Replace subtitle tokenizer with left-to-right Yomitan scanning parser
status: Done
assignee:
created_date: 2026-03-06 09:02
updated_date: 2026-03-06 09:14
labels: tokenizer, yomitan, refactor
dependencies:
priority: high

Description

Replace the current parseText candidate-selection tokenizer with a GSM-style left-to-right Yomitan scanning tokenizer for all subtitles. Preserve downstream token contracts for rendering, JLPT/frequency/N+1 annotation, and MeCab enrichment while improving full-term matching for names and katakana compounds.

Acceptance Criteria

  • #1 Subtitle tokenization uses a left-to-right Yomitan scanning strategy instead of parseText candidate selection.
  • #2 Token surfaces, readings, headwords, and offsets remain compatible with existing renderer and annotation stages.
  • #3 Known problematic name cases such as カズマ and バニール resolve to full-token dictionary matches when Yomitan can match them.
  • #4 Regression tests cover left-to-right exact-match scanning, unmatched text handling, and downstream tokenizeSubtitle integration.
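
The scanning strategy in the criteria above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `ScanToken`, `FindLongestMatch`, `scanLeftToRight`, and the toy dictionary are all hypothetical names, and the synchronous `find` callback only stands in for whatever Yomitan lookup the real scanner calls (which is likely asynchronous).

```typescript
// Hypothetical token shape; the real downstream token contract is not shown in this task.
interface ScanToken {
  surface: string;
  headword: string | null; // null when no dictionary entry matched
  reading: string | null;
  startPos: number; // inclusive offset into the scanned text
  endPos: number; // exclusive offset
}

// Hypothetical lookup: longest dictionary match anchored at the start of `text`, or null.
type FindLongestMatch = (
  text: string,
) => { matchedLength: number; headword: string; reading: string } | null;

function scanLeftToRight(text: string, find: FindLongestMatch): ScanToken[] {
  const tokens: ScanToken[] = [];
  let pos = 0;
  let unmatchedStart = -1; // start of the current run of unmatched text, or -1

  // Emit any pending unmatched run as a single token without dictionary data.
  const flushUnmatched = (end: number): void => {
    if (unmatchedStart < 0) return;
    tokens.push({
      surface: text.slice(unmatchedStart, end),
      headword: null,
      reading: null,
      startPos: unmatchedStart,
      endPos: end,
    });
    unmatchedStart = -1;
  };

  while (pos < text.length) {
    const match = find(text.slice(pos));
    if (match !== null && match.matchedLength > 0) {
      flushUnmatched(pos);
      const end = pos + match.matchedLength;
      tokens.push({
        surface: text.slice(pos, end),
        headword: match.headword,
        reading: match.reading,
        startPos: pos,
        endPos: end,
      });
      pos = end; // greedy: jump past the whole match
    } else {
      if (unmatchedStart < 0) unmatchedStart = pos;
      // Advance one code point so surrogate pairs are not split.
      const codePoint = text.codePointAt(pos) ?? 0;
      pos += codePoint > 0xffff ? 2 : 1;
    }
  }
  flushUnmatched(text.length);
  return tokens;
}

// Toy dictionary (assumption, for illustration only): surface -> reading.
const toyDictionary = new Map<string, string>([
  ["カズマ", "かずま"],
  ["カ", "か"],
  ["は", "は"],
]);

const toyFind: FindLongestMatch = (text) => {
  for (let len = text.length; len > 0; len--) {
    const candidate = text.slice(0, len);
    const reading = toyDictionary.get(candidate);
    if (reading !== undefined) {
      return { matchedLength: len, headword: candidate, reading };
    }
  }
  return null;
};

const scanned = scanLeftToRight("カズマは!?", toyFind);
console.log(scanned.map((t) => `${t.surface}[${t.startPos},${t.endPos})`).join(" "));
```

In this toy run the scanner prefers the full name カズマ over the shorter entry カ because the lookup returns the longest match at the current position, and the trailing punctuation is kept as one unmatched token so offsets still cover the whole line.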

Final Summary

Replaced the live subtitle tokenization path with a left-to-right Yomitan termsFind scanner that greedily advances through the normalized subtitle text. The change preserves the downstream MergedToken contracts for the renderer, MeCab enrichment, and JLPT, frequency, and N+1 annotation. Added runtime and integration coverage for exact-match scanning and for name cases such as カズマ, and kept compatibility fallback handling for older mocked parseText-style test payloads.
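
One way a regression test might guard the offset side of the token contract is an invariant check like the sketch below. It assumes, hypothetically, that tokens carry `surface`, `startPos`, and `endPos` fields and that they tile the normalized text with no gaps or overlaps; the actual MergedToken fields and guarantees may differ, and `OffsetToken` and `findOffsetViolations` are illustrative names only.

```typescript
// Hypothetical minimal view of a token; not the real MergedToken definition.
interface OffsetToken {
  surface: string;
  startPos: number; // inclusive
  endPos: number; // exclusive
}

// Returns human-readable descriptions of every offset problem found.
function findOffsetViolations(text: string, tokens: OffsetToken[]): string[] {
  const problems: string[] = [];
  let expectedStart = 0;

  tokens.forEach((token, index) => {
    // Tokens should be contiguous: each starts where the previous one ended.
    if (token.startPos !== expectedStart) {
      problems.push(`token ${index} starts at ${token.startPos}, expected ${expectedStart}`);
    }
    // The surface should be exactly the text slice the offsets point at.
    if (text.slice(token.startPos, token.endPos) !== token.surface) {
      problems.push(`token ${index} surface does not match text slice`);
    }
    expectedStart = token.endPos;
  });

  // Assumed contract: the tokens cover the whole text.
  if (expectedStart !== text.length) {
    problems.push(`tokens end at ${expectedStart}, text length is ${text.length}`);
  }
  return problems;
}

const line = "カズマは";
const contiguousTokens: OffsetToken[] = [
  { surface: "カズマ", startPos: 0, endPos: 3 },
  { surface: "は", startPos: 3, endPos: 4 },
];
const gappedTokens: OffsetToken[] = [
  { surface: "カズ", startPos: 0, endPos: 2 },
  { surface: "は", startPos: 3, endPos: 4 },
];

const contiguousProblems = findOffsetViolations(line, contiguousTokens);
const gappedProblems = findOffsetViolations(line, gappedTokens);
console.log(contiguousProblems.length, gappedProblems.length);
```

A check of this kind is independent of how tokens were produced, so it could run against both the scanner output and the older mocked parseText-style payloads.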