SubMiner/backlog/tasks/task-92 - Fix-merged-Yomitan-headword-selection-for-katakana-subtitle-tokens.md

id: TASK-92
title: Fix merged Yomitan headword selection for katakana subtitle tokens
status: Done
assignee:
created_date: 2026-03-06 08:43
updated_date: 2026-03-06 08:43
labels: bug, tokenizer, yomitan
dependencies:
priority: medium

Description

Tokenizer/parser-selection bug: when a scanning-parser line is merged from multiple segments, the merged token currently keeps the first segment's headword even if a later segment provides the full dictionary-backed term. This truncates katakana names such as バニール to バニ in the lookup payload and prevents correct dictionary matching. Also align kana classification so that the katakana prolonged sound mark (ー) is treated as kana in tokenizer heuristics.
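
The kana-classification part of the description can be illustrated with a minimal sketch. The helper names `isKanaChar` and `isAllKana` and the exact code point ranges are assumptions made for illustration, not SubMiner's actual implementation. The point is that a classifier built only from the hiragana and katakana letter ranges misses ー (U+30FC), so it has to be included explicitly.

```typescript
// Hypothetical helpers; names and ranges are illustrative assumptions,
// not taken from the SubMiner codebase.
const PROLONGED_SOUND_MARK = 0x30fc; // ー (katakana-hiragana prolonged sound mark)

function isKanaChar(ch: string): boolean {
  const cp = ch.codePointAt(0);
  if (cp === undefined) return false;
  const isHiraganaLetter = cp >= 0x3041 && cp <= 0x3096; // ぁ..ゖ
  const isKatakanaLetter = cp >= 0x30a1 && cp <= 0x30fa; // ァ..ヺ
  // The letter ranges above stop short of U+30FC, so treat it as kana explicitly.
  return isHiraganaLetter || isKatakanaLetter || cp === PROLONGED_SOUND_MARK;
}

function isAllKana(text: string): boolean {
  // Array.from iterates by code point rather than by UTF-16 code unit.
  return text.length > 0 && Array.from(text).every(isKanaChar);
}
```

With this sketch, `isAllKana("バニール")` is true only because ー is accepted. Without the explicit check, the name would be classified as containing a non-kana character.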

Acceptance Criteria

  • #1 Merged scanning-parser tokens prefer a full cross-segment headword when one segment expands to the full term.
  • #2 Standalone later segment headwords do not override the primary token headword in normal content-word + auxiliary merges.
  • #3 Katakana prolonged sound mark is treated as kana in tokenizer heuristics.
  • #4 Regression tests cover the merged katakana headword case.

Final Summary

Adjusted merged scanning-parser headword selection so that later segments override the first headword only when they provide an expanded cross-segment dictionary term. This fixes katakana lookups that were truncated, such as バニール being looked up as バニ. Also updated kana classification to include the katakana prolonged sound mark and added regression coverage for both the expanded-headword case and the normal content-word-plus-auxiliary case.
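
The selection rule summarized above might look roughly like the following sketch. The `Segment` shape, the function name `selectMergedHeadword`, and the test for an "expanded cross-segment" headword (here: a later segment's headword equals the full merged surface text) are all assumptions chosen for illustration. The real parser data and criterion may differ.

```typescript
// Hypothetical segment shape; the actual scanning-parser output may differ.
interface Segment {
  text: string; // surface text covered by this segment
  headword: string; // dictionary headword reported for this segment
}

function selectMergedHeadword(segments: Segment[]): string {
  const [first, ...rest] = segments;
  if (first === undefined) return "";

  // Surface text of the merged token.
  const surface = segments.map((s) => s.text).join("");

  for (const seg of rest) {
    // Assumed criterion: the later headword is not just this segment's own
    // term but covers the whole merged surface.
    const isExpandedCrossSegment =
      seg.headword !== seg.text && seg.headword === surface;
    if (isExpandedCrossSegment) return seg.headword;
  }

  // Otherwise keep the primary (first-segment) headword, which covers the
  // normal content-word + auxiliary merge.
  return first.headword;
}
```

Under these assumptions, segments `バニ` (headword `バニ`) and `ール` (headword `バニール`) yield `バニール`, while `食べ` (headword `食べる`) followed by the auxiliary `た` (headword `た`) keeps `食べる`.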