fix: improve yomitan subtitle name lookup

2026-03-06 01:28:58 -08:00
parent ebe9515486
commit 746696b1a4
9 changed files with 1041 additions and 34 deletions

@@ -0,0 +1,34 @@
---
id: TASK-93
title: Replace subtitle tokenizer with left-to-right Yomitan scanning parser
status: Done
assignee: []
created_date: '2026-03-06 09:02'
updated_date: '2026-03-06 09:14'
labels:
- tokenizer
- yomitan
- refactor
dependencies: []
priority: high
---
## Description
<!-- SECTION:DESCRIPTION:BEGIN -->
Replace the current parseText candidate-selection tokenizer with a GSM-style left-to-right Yomitan scanning tokenizer for all subtitles. Preserve downstream token contracts for rendering, JLPT/frequency/N+1 annotation, and MeCab enrichment while improving full-term matching for names and katakana compounds.
<!-- SECTION:DESCRIPTION:END -->
## Acceptance Criteria
<!-- AC:BEGIN -->
- [x] #1 Subtitle tokenization uses a left-to-right Yomitan scanning strategy instead of parseText candidate selection.
- [x] #2 Token surfaces, readings, headwords, and offsets remain compatible with existing renderer and annotation stages.
- [x] #3 Known problematic name cases such as カズマ and バニール resolve to full-token dictionary matches when Yomitan can match them.
- [x] #4 Regression tests cover left-to-right exact-match scanning, unmatched text handling, and downstream tokenizeSubtitle integration.
<!-- AC:END -->
## Final Summary
<!-- SECTION:FINAL_SUMMARY:BEGIN -->
Replaced the live subtitle tokenization path with a left-to-right Yomitan `termsFind` scanner that greedily advances through the normalized subtitle text while preserving the downstream `MergedToken` contracts for the renderer, MeCab enrichment, and JLPT, frequency, and N+1 annotation. Added runtime and integration coverage for exact-match scanning and for name cases such as カズマ. Kept the compatibility fallback for older mocked `parseText`-style test payloads.
<!-- SECTION:FINAL_SUMMARY:END -->
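
The greedy left-to-right scan described in this task can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `ScannedToken`, `TermMatch`, `TermLookup`, `scanLeftToRight`, and the toy lookup are hypothetical names, and the synchronous lookup callback only stands in for whatever Yomitan's `termsFind` actually returns.

```typescript
// Hypothetical token shape; the project's real MergedToken carries more fields.
interface ScannedToken {
  surface: string;
  headword: string | null; // null when no dictionary entry matched
  reading: string | null;
  startPos: number; // inclusive offset into the normalized text
  endPos: number; // exclusive offset
}

// Hypothetical result of a longest-prefix dictionary lookup.
interface TermMatch {
  matchedLength: number; // UTF-16 code units consumed from the start of the input
  headword: string;
  reading: string;
}

// Stand-in for the dictionary backend (an assumption, not Yomitan's real API).
type TermLookup = (text: string) => TermMatch | null;

function scanLeftToRight(text: string, lookup: TermLookup): ScannedToken[] {
  const tokens: ScannedToken[] = [];
  let pos = 0;
  let unmatchedStart = -1; // -1 means no unmatched run is open

  // Emit the pending unmatched run (if any) as a single token.
  const flushUnmatched = (end: number): void => {
    if (unmatchedStart < 0) return;
    tokens.push({
      surface: text.slice(unmatchedStart, end),
      headword: null,
      reading: null,
      startPos: unmatchedStart,
      endPos: end,
    });
    unmatchedStart = -1;
  };

  while (pos < text.length) {
    const match = lookup(text.slice(pos));
    if (match !== null && match.matchedLength > 0) {
      flushUnmatched(pos);
      const end = Math.min(pos + match.matchedLength, text.length);
      tokens.push({
        surface: text.slice(pos, end),
        headword: match.headword,
        reading: match.reading,
        startPos: pos,
        endPos: end,
      });
      pos = end; // greedy: resume scanning right after the match
    } else {
      if (unmatchedStart < 0) unmatchedStart = pos;
      // Step over one code point so surrogate pairs are never split.
      pos += (text.codePointAt(pos) ?? 0) > 0xffff ? 2 : 1;
    }
  }
  flushUnmatched(text.length);
  return tokens;
}

// Toy lookup over a three-entry map, trying the longest prefix first.
const toyEntries = new Map<string, string>([
  ["カズマ", "かずま"],
  ["カ", "か"],
  ["さん", "さん"],
]);
const toyLookup: TermLookup = (text) => {
  for (let len = Math.min(text.length, 8); len > 0; len--) {
    const prefix = text.slice(0, len);
    const reading = toyEntries.get(prefix);
    if (reading !== undefined) return { matchedLength: len, headword: prefix, reading };
  }
  return null;
};
```

With the toy lookup, `scanLeftToRight("カズマさん、!", toyLookup)` yields カズマ as one full-name token rather than the shorter カ, then さん, then a single unmatched token for the trailing punctuation. Offsets stay contiguous, which is what lets later stages keep relying on them.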

@@ -0,0 +1,40 @@
---
id: TASK-94
title: Add kana aliases for AniList character dictionary entries
status: Done
assignee: []
created_date: '2026-03-06 09:20'
updated_date: '2026-03-06 09:23'
labels:
- dictionary
- tokenizer
- anilist
dependencies: []
references:
- >-
/home/sudacode/projects/japanese/SubMiner/src/main/character-dictionary-runtime.ts
- >-
/home/sudacode/projects/japanese/SubMiner/src/main/character-dictionary-runtime.test.ts
priority: high
---
## Description
<!-- SECTION:DESCRIPTION:BEGIN -->
Generate katakana/hiragana-friendly aliases from AniList romanized character names so that katakana names in subtitles, such as カズマ, match character dictionary entries even when the AniList native name is kanji.
<!-- SECTION:DESCRIPTION:END -->
## Acceptance Criteria
<!-- AC:BEGIN -->
- [x] #1 AniList character dictionary generation adds kana aliases for romanized names when the native name is not already kana-only
- [x] #2 Generated dictionary entries allow katakana subtitle names like カズマ to resolve against a kanji-native AniList character entry
- [x] #3 Regression tests cover alias generation and resulting term bank output
<!-- AC:END -->
## Final Summary
<!-- SECTION:FINAL_SUMMARY:BEGIN -->
Added katakana aliases synthesized from AniList romanized character names during character dictionary generation, so kanji-native entries such as 佐藤和真 / Satou Kazuma now also emit terms like カズマ and サトウカズマ with hiragana readings. Added regression coverage verifying generated term-bank output for the Konosuba case.
Verified with `bun test src/main/character-dictionary-runtime.test.ts` and `bun run tsc --noEmit`.
<!-- SECTION:FINAL_SUMMARY:END -->
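
As a rough illustration of the alias synthesis described in this task, the sketch below converts a Hepburn-style romanized name to katakana and derives a hiragana reading. Everything here is an assumption made for the example: the function names (`romajiToKatakana`, `buildKanaAliases`), the `KanaAlias` shape, and the deliberately partial syllable table are hypothetical and not taken from `character-dictionary-runtime.ts`. Ambiguous spellings are resolved by longest match, and names with macrons or other non-ASCII letters are skipped.

```typescript
// Basic kana rows in a-i-u-e-o order (partial table, for illustration only).
const ROWS: Record<string, string> = {
  "": "アイウエオ", k: "カキクケコ", g: "ガギグゲゴ", s: "サシスセソ", z: "ザジズゼゾ",
  t: "タチツテト", d: "ダヂヅデド", n: "ナニヌネノ", h: "ハヒフヘホ", b: "バビブベボ",
  p: "パピプペポ", m: "マミムメモ", r: "ラリルレロ",
};
// Hepburn spellings that the row table does not produce.
const EXTRAS: Record<string, string> = {
  shi: "シ", chi: "チ", tsu: "ツ", fu: "フ", ji: "ジ",
  ya: "ヤ", yu: "ユ", yo: "ヨ", wa: "ワ", wo: "ヲ",
};
// Consonant clusters that take a small ャ/ュ/ョ after an i-column kana.
const YOUON: Record<string, string> = {
  ky: "キ", gy: "ギ", sh: "シ", j: "ジ", ch: "チ", ny: "ニ",
  hy: "ヒ", by: "ビ", py: "ピ", my: "ミ", ry: "リ",
};

function buildTable(): Map<string, string> {
  const table = new Map<string, string>(Object.entries(EXTRAS));
  for (const [consonant, kana] of Object.entries(ROWS)) {
    ["a", "i", "u", "e", "o"].forEach((vowel, i) => table.set(consonant + vowel, kana.charAt(i)));
  }
  for (const [prefix, base] of Object.entries(YOUON)) {
    table.set(prefix + "a", base + "ャ");
    table.set(prefix + "u", base + "ュ");
    table.set(prefix + "o", base + "ョ");
  }
  return table;
}
const TABLE = buildTable();

// Convert one romanized word to katakana, or return null if it cannot be parsed.
function romajiToKatakana(word: string): string | null {
  const input = word.toLowerCase();
  if (!/^[a-z]+$/.test(input)) return null; // macrons, digits, etc. are out of scope
  let out = "";
  let pos = 0;
  while (pos < input.length) {
    let consumed = 0;
    for (const len of [3, 2, 1]) { // longest syllable first
      const chunk = input.slice(pos, pos + len);
      const kana = TABLE.get(chunk);
      if (kana !== undefined) {
        out += kana;
        consumed = chunk.length;
        break;
      }
    }
    if (consumed === 0) {
      const ch = input.charAt(pos);
      if (ch === "n") out += "ン"; // syllabic n
      else if (ch === input.charAt(pos + 1)) out += "ッ"; // doubled consonant
      else return null;
      consumed = 1;
    }
    pos += consumed;
  }
  return out;
}

// Katakana and hiragana blocks are offset by 0x60 code points.
function katakanaToHiragana(text: string): string {
  return text.replace(/[\u30a1-\u30f6]/g, (ch) => String.fromCharCode(ch.charCodeAt(0) - 0x60));
}

interface KanaAlias {
  term: string; // katakana surface form
  reading: string; // hiragana reading
}

const KANA_ONLY = /^[\u3040-\u309f\u30a0-\u30ff]+$/;

function buildKanaAliases(romanizedName: string, nativeName: string): KanaAlias[] {
  // A kana-only native name already matches katakana subtitles; no alias needed.
  if (KANA_ONLY.test(nativeName.replace(/\s+/g, ""))) return [];
  const parts: string[] = [];
  for (const word of romanizedName.trim().split(/\s+/)) {
    const katakana = romajiToKatakana(word);
    if (katakana === null) return []; // give up rather than emit a wrong alias
    parts.push(katakana);
  }
  // Each name part plus the full concatenation, de-duplicated.
  const terms = new Set<string>([...parts, parts.join("")]);
  return Array.from(terms).map((term) => ({ term, reading: katakanaToHiragana(term) }));
}
```

For the Konosuba case, `buildKanaAliases("Satou Kazuma", "佐藤和真")` produces aliases including カズマ and サトウカズマ with readings かずま and さとうかずま. Returning an empty list on any unparsable word is a conservative choice for the sketch: a missing alias only loses a match, while a wrong alias would add a bad dictionary entry.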

@@ -0,0 +1,39 @@
---
id: TASK-95
title: Invalidate old character dictionary snapshots after kana alias schema change
status: Done
assignee: []
created_date: '2026-03-06 09:25'
updated_date: '2026-03-06 09:28'
labels:
- dictionary
- cache
dependencies: []
references:
- >-
/home/sudacode/projects/japanese/SubMiner/src/main/character-dictionary-runtime.ts
- >-
/home/sudacode/projects/japanese/SubMiner/src/main/character-dictionary-runtime.test.ts
priority: high
---
## Description
<!-- SECTION:DESCRIPTION:BEGIN -->
Bump character dictionary snapshot format/version so cached AniList snapshots created before kana alias generation are rebuilt automatically on next auto-sync or generation run.
<!-- SECTION:DESCRIPTION:END -->
## Acceptance Criteria
<!-- AC:BEGIN -->
- [x] #1 Old cached character dictionary snapshots are treated as invalid after the schema/version bump
- [x] #2 Current snapshot generation tests cover rebuild behavior across version mismatch
- [x] #3 No manual cache deletion is required for users to pick up kana alias term generation
<!-- AC:END -->
## Final Summary
<!-- SECTION:FINAL_SUMMARY:BEGIN -->
Bumped the character dictionary snapshot format version so cached AniList snapshots created before kana alias generation are automatically treated as stale and rebuilt. Added regression coverage that seeds an older-format snapshot and verifies `getOrCreateCurrentSnapshot` fetches fresh data and overwrites the stale cache.
Verified with `bun test src/main/character-dictionary-runtime.test.ts` and `bun run tsc --noEmit`.
<!-- SECTION:FINAL_SUMMARY:END -->
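
The version-based invalidation described in this task might look roughly like the sketch below. It is illustrative only: the version number, the `Snapshot` fields, the `SnapshotStore` interface, and the synchronous `build` callback are all assumptions, and the hypothetical `getOrCreateSnapshot` is not the project's `getOrCreateCurrentSnapshot`, which presumably fetches from AniList asynchronously.

```typescript
// Hypothetical current format version; bump it whenever the snapshot schema changes.
const SNAPSHOT_FORMAT_VERSION = 2;

// Hypothetical snapshot shape for illustration.
interface Snapshot {
  formatVersion: number;
  mediaId: number;
  terms: string[];
}

// Minimal storage abstraction so the sketch does not depend on the file system.
interface SnapshotStore {
  read(mediaId: number): string | null;
  write(mediaId: number, json: string): void;
}

function createMemoryStore(): SnapshotStore {
  const files = new Map<number, string>();
  return {
    read: (mediaId) => files.get(mediaId) ?? null,
    write: (mediaId, json) => {
      files.set(mediaId, json);
    },
  };
}

// Return the cached snapshot only if it parses and matches the current format version.
function parseCurrentSnapshot(raw: string | null): Snapshot | null {
  if (raw === null) return null;
  try {
    const parsed = JSON.parse(raw) as Partial<Snapshot>;
    if (parsed.formatVersion !== SNAPSHOT_FORMAT_VERSION) return null; // stale schema
    if (typeof parsed.mediaId !== "number" || !Array.isArray(parsed.terms)) return null;
    return parsed as Snapshot;
  } catch {
    return null; // corrupt or non-object JSON is treated like a cache miss
  }
}

function getOrCreateSnapshot(
  mediaId: number,
  store: SnapshotStore,
  build: (mediaId: number) => string[],
): { snapshot: Snapshot; rebuilt: boolean } {
  const cached = parseCurrentSnapshot(store.read(mediaId));
  if (cached !== null && cached.mediaId === mediaId) {
    return { snapshot: cached, rebuilt: false };
  }
  // Missing, corrupt, or old-format cache: rebuild and overwrite it.
  const snapshot: Snapshot = {
    formatVersion: SNAPSHOT_FORMAT_VERSION,
    mediaId,
    terms: build(mediaId),
  };
  store.write(mediaId, JSON.stringify(snapshot));
  return { snapshot, rebuilt: true };
}
```

Because a version mismatch is handled exactly like a cache miss, seeding the store with an older-format snapshot and calling `getOrCreateSnapshot` rebuilds and overwrites it, with no manual cache deletion needed. A second call then returns the cached copy.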