fix: improve yomitan subtitle name lookup

2026-04-30 04:19:25 -07:00 · 2026-03-06 01:28:58 -08:00
parent ebe9515486
commit 746696b1a4
9 changed files with 1041 additions and 34 deletions
--- a/Replace-subtitle-tokenizer-with-left-to-right-Yomitan-scanning-parser.md
+++ b/Replace-subtitle-tokenizer-with-left-to-right-Yomitan-scanning-parser.md
@@ -0,0 +1,34 @@
+---
+id: TASK-93
+title: Replace subtitle tokenizer with left-to-right Yomitan scanning parser
+status: Done
+assignee: []
+created_date: '2026-03-06 09:02'
+updated_date: '2026-03-06 09:14'
+labels:
+  - tokenizer
+  - yomitan
+  - refactor
+dependencies: []
+priority: high
+---
+
+## Description
+
+<!-- SECTION:DESCRIPTION:BEGIN -->
+Replace the current parseText candidate-selection tokenizer with a GSM-style left-to-right Yomitan scanning tokenizer for all subtitles. Preserve downstream token contracts for rendering, JLPT/frequency/N+1 annotation, and MeCab enrichment while improving full-term matching for names and katakana compounds.
+<!-- SECTION:DESCRIPTION:END -->
+
+## Acceptance Criteria
+<!-- AC:BEGIN -->
+- [x] #1 Subtitle tokenization uses a left-to-right Yomitan scanning strategy instead of parseText candidate selection.
+- [x] #2 Token surfaces, readings, headwords, and offsets remain compatible with existing renderer and annotation stages.
+- [x] #3 Known problematic name cases such as カズマ and バニール resolve to full-token dictionary matches when Yomitan can match them.
+- [x] #4 Regression tests cover left-to-right exact-match scanning, unmatched text handling, and downstream tokenizeSubtitle integration.
+<!-- AC:END -->
+
+## Final Summary
+
+<!-- SECTION:FINAL_SUMMARY:BEGIN -->
+Replaced the live subtitle tokenization path with a left-to-right Yomitan `termsFind` scanner that greedily advances through the normalized subtitle text, preserving downstream `MergedToken` contracts for renderer, MeCab enrichment, JLPT, frequency, and N+1 annotation. Added runtime and integration coverage for exact-match scanning plus name cases like カズマ and kept compatibility fallback handling for older mocked parseText-style test payloads.
+<!-- SECTION:FINAL_SUMMARY:END -->
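
Reviewer note on TASK-93: the greedy left-to-right scan described in the summary can be sketched roughly as below. This is a minimal illustration, not the committed implementation. `ScanToken`, `LongestMatch`, `scanLeftToRight`, and `toyLookup` are hypothetical names, and the toy dictionary stands in for the Yomitan `termsFind` call, whose real signature is not shown in this commit.

```typescript
// Hypothetical token shape; the real MergedToken contract is not shown in this commit.
interface ScanToken {
  surface: string;
  headword: string | null; // null marks text with no dictionary match
  start: number;
  end: number;
}

// Hypothetical lookup: longest dictionary match anchored at the start of `text`, or null.
type LongestMatch = (text: string) => { matched: string; headword: string } | null;

function scanLeftToRight(text: string, findLongest: LongestMatch): ScanToken[] {
  const tokens: ScanToken[] = [];
  let pos = 0;
  let unmatchedStart = -1;

  // Emit any pending run of unmatched characters as a single token.
  const flushUnmatched = (end: number): void => {
    if (unmatchedStart >= 0) {
      tokens.push({ surface: text.slice(unmatchedStart, end), headword: null, start: unmatchedStart, end });
      unmatchedStart = -1;
    }
  };

  while (pos < text.length) {
    const hit = findLongest(text.slice(pos));
    if (hit !== null && hit.matched.length > 0) {
      flushUnmatched(pos);
      const end = pos + hit.matched.length;
      tokens.push({ surface: text.slice(pos, end), headword: hit.headword, start: pos, end });
      pos = end; // greedy: jump past the whole match
    } else {
      if (unmatchedStart < 0) unmatchedStart = pos;
      // Step over one character, keeping surrogate pairs together.
      const code = text.charCodeAt(pos);
      pos += code >= 0xd800 && code <= 0xdbff && pos + 1 < text.length ? 2 : 1;
    }
  }
  flushUnmatched(text.length);
  return tokens;
}

// Toy stand-in for a dictionary backend: tries the longest prefix first.
const toyDictionary = new Map<string, string>([
  ["カズマ", "カズマ"],
  ["は", "は"],
]);

const toyLookup: LongestMatch = (text) => {
  for (let len = text.length; len > 0; len--) {
    const candidate = text.slice(0, len);
    const headword = toyDictionary.get(candidate);
    if (headword !== undefined) return { matched: candidate, headword };
  }
  return null;
};
```

With this toy dictionary, scanning `カズマは!?` yields the full-name token, the particle, and one unmatched run for the trailing punctuation, with offsets that cover the input without gaps.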
--- a/Add-kana-aliases-for-AniList-character-dictionary-entries.md
+++ b/Add-kana-aliases-for-AniList-character-dictionary-entries.md
@@ -0,0 +1,40 @@
+---
+id: TASK-94
+title: Add kana aliases for AniList character dictionary entries
+status: Done
+assignee: []
+created_date: '2026-03-06 09:20'
+updated_date: '2026-03-06 09:23'
+labels:
+  - dictionary
+  - tokenizer
+  - anilist
+dependencies: []
+references:
+  - >-
+    /home/sudacode/projects/japanese/SubMiner/src/main/character-dictionary-runtime.ts
+  - >-
+    /home/sudacode/projects/japanese/SubMiner/src/main/character-dictionary-runtime.test.ts
+priority: high
+---
+
+## Description
+
+<!-- SECTION:DESCRIPTION:BEGIN -->
+Generate katakana/hiragana-friendly aliases from AniList romanized character names so that katakana names in subtitles, such as カズマ, match character dictionary entries even when the AniList native name is kanji.
+<!-- SECTION:DESCRIPTION:END -->
+
+## Acceptance Criteria
+<!-- AC:BEGIN -->
+- [x] #1 AniList character dictionary generation adds kana aliases for romanized names when the native name is not already kana-only
+- [x] #2 Generated dictionary entries allow katakana subtitle names like カズマ to resolve against a kanji-native AniList character entry
+- [x] #3 Regression tests cover alias generation and resulting term bank output
+<!-- AC:END -->
+
+## Final Summary
+
+<!-- SECTION:FINAL_SUMMARY:BEGIN -->
+Added katakana aliases synthesized from AniList romanized character names during character dictionary generation, so kanji-native entries such as 佐藤和真 / Satou Kazuma now also emit terms like カズマ and サトウカズマ with hiragana readings. Added regression coverage verifying generated term-bank output for the Konosuba case.
+
+Verified with `bun test src/main/character-dictionary-runtime.test.ts` and `bun run tsc --noEmit`.
+<!-- SECTION:FINAL_SUMMARY:END -->
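
Reviewer note on TASK-94: one way to synthesize kana aliases from a romanized name is sketched below. It is an assumption-laden illustration rather than the code in `character-dictionary-runtime.ts`. The function names, the reduced Hepburn table (no yōon such as `kya`, no long-vowel marks), and the choice to emit each name part plus the joined full name are all hypothetical.

```typescript
// Reduced Hepburn table (assumption: basic syllables only, no combinations like "kya").
const ROMAJI_TO_KATAKANA: Record<string, string> = {
  a: "ア", i: "イ", u: "ウ", e: "エ", o: "オ",
  ka: "カ", ki: "キ", ku: "ク", ke: "ケ", ko: "コ",
  sa: "サ", shi: "シ", su: "ス", se: "セ", so: "ソ",
  ta: "タ", chi: "チ", tsu: "ツ", te: "テ", to: "ト",
  na: "ナ", ni: "ニ", nu: "ヌ", ne: "ネ", no: "ノ",
  ha: "ハ", hi: "ヒ", fu: "フ", he: "ヘ", ho: "ホ",
  ma: "マ", mi: "ミ", mu: "ム", me: "メ", mo: "モ",
  ya: "ヤ", yu: "ユ", yo: "ヨ",
  ra: "ラ", ri: "リ", ru: "ル", re: "レ", ro: "ロ",
  wa: "ワ", wo: "ヲ", n: "ン",
  ga: "ガ", gi: "ギ", gu: "グ", ge: "ゲ", go: "ゴ",
  za: "ザ", ji: "ジ", zu: "ズ", ze: "ゼ", zo: "ゾ",
  da: "ダ", de: "デ", do: "ド",
  ba: "バ", bi: "ビ", bu: "ブ", be: "ベ", bo: "ボ",
  pa: "パ", pi: "ピ", pu: "プ", pe: "ペ", po: "ポ",
};

// Returns null when the input contains a syllable the reduced table cannot express.
function romajiToKatakana(romaji: string): string | null {
  const src = romaji.toLowerCase().replace(/[^a-z]/g, "");
  let out = "";
  let i = 0;
  while (i < src.length) {
    const ch = src.charAt(i);
    // A doubled consonant (other than "n") becomes a small tsu.
    if (ch === src.charAt(i + 1) && "aiueon".indexOf(ch) < 0) {
      out += "ッ";
      i += 1;
      continue;
    }
    let matched = false;
    for (const len of [3, 2, 1]) {
      const kana = ROMAJI_TO_KATAKANA[src.slice(i, i + len)];
      if (kana) {
        out += kana;
        i += len;
        matched = true;
        break;
      }
    }
    if (!matched) return null;
  }
  return out;
}

// Katakana U+30A1..U+30F6 sit exactly 0x60 above their hiragana counterparts.
function katakanaToHiragana(text: string): string {
  return text.replace(/[\u30a1-\u30f6]/g, (ch) => String.fromCharCode(ch.charCodeAt(0) - 0x60));
}

const KANA_ONLY = /^[\u3040-\u309f\u30a0-\u30ff]+$/;

interface KanaAlias {
  term: string; // katakana surface form
  reading: string; // hiragana reading
}

// Hypothetical helper: skip kana-native names, otherwise emit each part and the joined name.
function buildKanaAliases(romanizedName: string, nativeName: string): KanaAlias[] {
  if (KANA_ONLY.test(nativeName)) return [];
  const parts = romanizedName.trim().split(/\s+/).filter((p) => p.length > 0);
  const seen = new Map<string, KanaAlias>();
  for (const candidate of [...parts, parts.join("")]) {
    const term = romajiToKatakana(candidate);
    if (term !== null && term.length > 0 && !seen.has(term)) {
      seen.set(term, { term, reading: katakanaToHiragana(term) });
    }
  }
  return Array.from(seen.values());
}
```

For the Konosuba case in the summary, `buildKanaAliases("Satou Kazuma", "佐藤和真")` produces カズマ and サトウカズマ (plus サトウ in this sketch), each with a hiragana reading.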
--- a/Invalidate-old-character-dictionary-snapshots-after-kana-alias-schema-change.md
+++ b/Invalidate-old-character-dictionary-snapshots-after-kana-alias-schema-change.md
@@ -0,0 +1,39 @@
+---
+id: TASK-95
+title: Invalidate old character dictionary snapshots after kana alias schema change
+status: Done
+assignee: []
+created_date: '2026-03-06 09:25'
+updated_date: '2026-03-06 09:28'
+labels:
+  - dictionary
+  - cache
+dependencies: []
+references:
+  - >-
+    /home/sudacode/projects/japanese/SubMiner/src/main/character-dictionary-runtime.ts
+  - >-
+    /home/sudacode/projects/japanese/SubMiner/src/main/character-dictionary-runtime.test.ts
+priority: high
+---
+
+## Description
+
+<!-- SECTION:DESCRIPTION:BEGIN -->
+Bump character dictionary snapshot format/version so cached AniList snapshots created before kana alias generation are rebuilt automatically on next auto-sync or generation run.
+<!-- SECTION:DESCRIPTION:END -->
+
+## Acceptance Criteria
+<!-- AC:BEGIN -->
+- [x] #1 Old cached character dictionary snapshots are treated as invalid after the schema/version bump
+- [x] #2 Current snapshot generation tests cover rebuild behavior across version mismatch
+- [x] #3 No manual cache deletion is required for users to pick up kana alias term generation
+<!-- AC:END -->
+
+## Final Summary
+
+<!-- SECTION:FINAL_SUMMARY:BEGIN -->
+Bumped the character dictionary snapshot format version so cached AniList snapshots created before kana alias generation are automatically treated as stale and rebuilt. Added regression coverage that seeds an older-format snapshot and verifies `getOrCreateCurrentSnapshot` fetches fresh data and overwrites the stale cache.
+
+Verified with `bun test src/main/character-dictionary-runtime.test.ts` and `bun run tsc --noEmit`.
+<!-- SECTION:FINAL_SUMMARY:END -->
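
Reviewer note on TASK-95: the version-gated cache check can be pictured as follows. This is a simplified, synchronous sketch under stated assumptions. An in-memory `Map` stands in for the on-disk snapshot cache, and `SNAPSHOT_FORMAT_VERSION`, `Snapshot`, `isCurrentSnapshot`, and `loadOrRebuildSnapshot` are hypothetical names. The real `getOrCreateCurrentSnapshot` signature is not shown in this commit.

```typescript
// Hypothetical version constant; bump it whenever the snapshot schema changes.
const SNAPSHOT_FORMAT_VERSION = 2;

interface Snapshot {
  formatVersion: number;
  mediaId: number;
  terms: string[];
}

// Accept only well-formed snapshots written with the current format version.
function isCurrentSnapshot(value: unknown): value is Snapshot {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Partial<Snapshot>;
  return (
    v.formatVersion === SNAPSHOT_FORMAT_VERSION &&
    typeof v.mediaId === "number" &&
    Array.isArray(v.terms)
  );
}

// `cache` maps a media id to serialized JSON, standing in for snapshot files on disk.
function loadOrRebuildSnapshot(
  cache: Map<number, string>,
  mediaId: number,
  fetchFresh: (mediaId: number) => string[],
): Snapshot {
  const raw = cache.get(mediaId);
  if (raw !== undefined) {
    try {
      const parsed: unknown = JSON.parse(raw);
      if (isCurrentSnapshot(parsed)) return parsed;
    } catch {
      // A corrupt cache entry is treated the same as a stale one.
    }
  }
  // Stale, corrupt, or missing: rebuild and overwrite the cached entry.
  const fresh: Snapshot = {
    formatVersion: SNAPSHOT_FORMAT_VERSION,
    mediaId,
    terms: fetchFresh(mediaId),
  };
  cache.set(mediaId, JSON.stringify(fresh));
  return fresh;
}
```

Because the version check happens on read, a snapshot written before the bump is rebuilt on the next access, and no manual cache deletion is needed.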