SubMiner/backlog/tasks/task-92 - Fix-merged-Yomitan-headword-selection-for-katakana-subtitle-tokens.md at e659b5d8f4045db5c2b7b94b8ffa3a286e552ab0 - SubMiner

sudacode/SubMiner

mirror of https://github.com/ksyasuda/SubMiner.git synced 2026-03-07 03:22:17 -08:00

sudacode 8c2c950564

feat: merge AniList character dictionaries by recent usage

2026-03-06 01:01:31 -08:00

id, title, status, assignee, created_date, updated_date, labels, dependencies, priority

TASK-92

Fix merged Yomitan headword selection for katakana subtitle tokens

Done

2026-03-06 08:43

bug, tokenizer, yomitan

medium

Description

Tokenizer/parser-selection bug: when a scanning-parser line is merged from multiple segments, the merged token keeps the first segment's headword even if a later segment provides the full dictionary-backed term. This truncates katakana names such as バニール to バニ in the lookup payload and prevents correct dictionary matching. Also align kana classification so that the katakana prolonged sound mark (ー) is treated as kana in tokenizer heuristics.

Acceptance Criteria

#1 Merged scanning-parser tokens prefer a full cross-segment headword when one segment expands to the full term.
#2 Standalone later segment headwords do not override the primary token headword in normal content-word + auxiliary merges.
#3 Katakana prolonged sound mark is treated as kana in tokenizer heuristics.
#4 Regression tests cover the merged katakana headword case.
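
Criteria #1 and #2 can be illustrated with a minimal sketch. This is not SubMiner's actual implementation: the `ParsedSegment` shape, the `selectMergedHeadword` name, and the exact "expanded" test (the later headword is longer than its own segment text, longer than the primary headword, and is a prefix of the merged surface text) are assumptions made for illustration only.

```typescript
// Hypothetical segment shape; the real scanning-parser output may differ.
interface ParsedSegment {
  text: string;            // surface text covered by this segment
  headword: string | null; // dictionary headword reported for this segment
}

// Pick the headword for a token merged from several segments.
// The first segment's headword is the default (criterion #2). A later
// segment only overrides it when its headword expands across segments,
// i.e. it covers more than that segment's own text (criterion #1).
function selectMergedHeadword(segments: ParsedSegment[]): string | null {
  if (segments.length === 0) {
    return null;
  }
  const [first, ...rest] = segments;
  const primary = first.headword ?? first.text;
  const surface = segments.map((s) => s.text).join("");

  for (const segment of rest) {
    const candidate = segment.headword;
    if (candidate === null) {
      continue;
    }
    const expandsBeyondOwnSegment = candidate.length > segment.text.length;
    const longerThanPrimary = candidate.length > primary.length;
    const matchesMergedSurface = surface.startsWith(candidate);
    if (expandsBeyondOwnSegment && longerThanPrimary && matchesMergedSurface) {
      return candidate;
    }
  }
  return primary;
}

// Expanded cross-segment headword: the later segment carries the full term.
const katakanaName = selectMergedHeadword([
  { text: "バニ", headword: "バニ" },
  { text: "ール", headword: "バニール" },
]);

// Content word + auxiliary: the standalone later headword must not win.
const verbWithAuxiliary = selectMergedHeadword([
  { text: "食べ", headword: "食べる" },
  { text: "た", headword: "た" },
]);
```

Under these assumptions, `katakanaName` is the full term バニール, while `verbWithAuxiliary` stays 食べる because the auxiliary's headword does not extend past its own segment.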

Final Summary

Adjusted merged scanning-parser headword selection so that later segments override the first headword only when they provide an expanded cross-segment dictionary term. This fixes katakana lookups that were being truncated, such as バニール to バニ. Also updated kana classification to include the katakana prolonged sound mark (ー), and added regression coverage for both the expanded-headword case and the normal content-word-plus-auxiliary case.
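
The kana-classification change might look roughly like the sketch below. The function names `isKanaChar` and `isKanaOnly` are hypothetical, and the code-point ranges are the standard Unicode hiragana and katakana letter ranges rather than values taken from the SubMiner source. The relevant detail is that the prolonged sound mark ー (U+30FC) sits outside the katakana letter range U+30A1 to U+30FA, so it has to be accepted explicitly.

```typescript
// Hypothetical helper: true when the first UTF-16 unit of `ch` is kana.
function isKanaChar(ch: string): boolean {
  const code = ch.charCodeAt(0); // NaN for an empty string, so all checks fail
  const isHiragana = code >= 0x3041 && code <= 0x3096; // ぁ..ゖ
  const isKatakana = code >= 0x30a1 && code <= 0x30fa; // ァ..ヺ
  const isProlongedSoundMark = code === 0x30fc;        // ー
  return isHiragana || isKatakana || isProlongedSoundMark;
}

// Hypothetical helper: true when every character of a non-empty string is kana.
function isKanaOnly(text: string): boolean {
  if (text.length === 0) {
    return false;
  }
  for (let i = 0; i < text.length; i++) {
    if (!isKanaChar(text.charAt(i))) {
      return false;
    }
  }
  return true;
}
```

With the explicit U+30FC check, a name such as バニール is classified as kana-only; without it, the ー would cause the string to be treated as mixed.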
