feat: merge AniList character dictionaries by recent usage

2026-03-06 01:01:31 -08:00
parent e2b51c6306
commit 8c2c950564
17 changed files with 1386 additions and 517 deletions

@@ -0,0 +1,57 @@
---
id: TASK-89
title: Replace per-anime Yomitan imports with merged usage-based character dictionary
status: Done
assignee:
- '@codex'
created_date: '2026-03-06 07:59'
updated_date: '2026-03-06 08:09'
labels:
- character-dictionary
- yomitan
- anilist
dependencies: []
references:
- >-
  /home/sudacode/projects/japanese/SubMiner/src/main/character-dictionary-runtime.ts
- >-
  /home/sudacode/projects/japanese/SubMiner/src/main/runtime/character-dictionary-auto-sync.ts
- >-
  /home/sudacode/projects/japanese/SubMiner/src/config/definitions/defaults-integrations.ts
priority: high
---
## Description
<!-- SECTION:DESCRIPTION:BEGIN -->
Replace TTL-based per-anime character dictionary imports with a single merged Yomitan dictionary built from locally stored per-media metadata snapshots. Retain only the most recently used anime, up to the configured `maxLoaded`. Rebuild the merged import when the retained set's membership or order changes, and skip the rebuild on revisits that leave the retained set unchanged.
<!-- SECTION:DESCRIPTION:END -->
## Acceptance Criteria
<!-- AC:BEGIN -->
- [x] #1 Character dictionary retention becomes usage-based rather than TTL-based.
- [x] #2 Only one Yomitan character dictionary import is maintained and updated as a merged dictionary.
- [x] #3 Local storage keeps only metadata/snapshots needed to rebuild the merged dictionary; per-anime source zip cache is removed.
- [x] #4 Merged dictionary rebuild occurs when retained-set membership or order changes, not on unchanged revisits.
- [x] #5 Tests cover merged rebuild, MRU eviction, and no-op revisits.
<!-- AC:END -->
## Implementation Notes
<!-- SECTION:NOTES:BEGIN -->
Replaced per-media auto-sync imports with one merged Yomitan dictionary. Added snapshot persistence in `src/main/character-dictionary-runtime.ts` so auto-sync stores normalized per-media term/image metadata locally under `character-dictionaries/snapshots/` and rebuilds `merged.zip` from the MRU retained media ids.
Updated `src/main/runtime/character-dictionary-auto-sync.ts` to keep only MRU `activeMediaIds` plus merged revision/title state, rebuild/import the merged dictionary only when retained-set membership/order changes or the merged import is missing/stale, and skip rebuild on unchanged revisits.
Kept manual `generateForCurrentMedia` support by generating a one-off per-media zip from the stored snapshot, but removed the old per-media zip cache path from auto-sync state.
Updated config/help text to describe usage-based merged retention and mark legacy TTL/eviction knobs as ignored.
<!-- SECTION:NOTES:END -->
## Final Summary
<!-- SECTION:FINAL_SUMMARY:BEGIN -->
Implemented MRU-based merged character dictionary sync. Auto-sync now stores per-media normalized snapshots locally, rebuilds a single merged Yomitan dictionary when the retained anime set/order changes, and keeps `maxLoaded` as the cap on most-recently-used anime included in that merged import. Unchanged revisits no longer rebuild/import the dictionary.
Validation: `bun test src/main/runtime/character-dictionary-auto-sync.test.ts src/main/character-dictionary-runtime.test.ts`, `bun run tsc --noEmit`.
<!-- SECTION:FINAL_SUMMARY:END -->
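
The retention rule described in this task can be sketched as follows. This is a minimal illustration, not the code from `character-dictionary-auto-sync.ts`: the `touchMedia` helper and the `RetainedUpdate` shape are hypothetical names, and media ids are assumed to be numbers. Only `activeMediaIds` and `maxLoaded` come from the task text.

```typescript
// Hypothetical sketch of usage-based (MRU) retention; names other than
// activeMediaIds and maxLoaded are assumptions made for illustration.
interface RetainedUpdate {
  activeMediaIds: number[]; // most recently used first
  evicted: number[]; // ids that fell off the end of the retained list
  needsRebuild: boolean; // true when membership or order changed
}

function touchMedia(
  activeMediaIds: readonly number[],
  mediaId: number,
  maxLoaded: number,
): RetainedUpdate {
  // Guard against a zero or fractional cap (assumed behavior).
  const cap = Math.max(1, Math.floor(maxLoaded));

  // Move the visited id to the front, dropping any earlier occurrence.
  const reordered = [mediaId, ...activeMediaIds.filter((id) => id !== mediaId)];
  const next = reordered.slice(0, cap);
  const evicted = reordered.slice(cap);

  // A rebuild is needed only if the ordered list differs from the previous one.
  const needsRebuild =
    next.length !== activeMediaIds.length ||
    next.some((id, index) => id !== activeMediaIds[index]);

  return { activeMediaIds: next, evicted, needsRebuild };
}
```

Under these assumptions, revisiting the anime already at the front of the list reports `needsRebuild: false`, while visiting a new anime or an older retained one changes the ordered list and triggers a rebuild.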

@@ -0,0 +1,35 @@
---
id: TASK-91
title: >-
  Keep unsupported subtitle characters visible while excluding them from token
  hover
status: Done
assignee: []
created_date: '2026-03-06 08:29'
updated_date: '2026-03-06 08:32'
labels:
- bug
- tokenizer
- renderer
dependencies: []
priority: medium
---
## Description
<!-- SECTION:DESCRIPTION:BEGIN -->
Tokenizer/rendering bug: symbols and other unsupported characters with no lookup result are removed from the rendered subtitle line after tokenization, causing the displayed line to diverge from the source subtitle text. Update rendering so unsupported spans remain visible as plain text but are not tokenized/hoverable, and add regression coverage.
<!-- SECTION:DESCRIPTION:END -->
## Acceptance Criteria
<!-- AC:BEGIN -->
- [x] #1 Subtitle rendering preserves unsupported symbols and special characters from the original line.
- [x] #2 Unsupported symbols and special characters do not create interactive token hover targets.
- [x] #3 Regression tests cover a mixed line containing tokenizable text plus unsupported characters.
<!-- AC:END -->
## Final Summary
<!-- SECTION:FINAL_SUMMARY:BEGIN -->
Updated tokenized subtitle rendering to preserve unsupported punctuation and symbol spans as plain text while keeping only matched tokens interactive. Added renderer and alignment regression coverage for mixed lines so hover offsets stay correct after non-tokenizable characters remain visible.
<!-- SECTION:FINAL_SUMMARY:END -->
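
One way to picture the rendering change is as a pass that walks the source line and emits a plain-text segment for every gap between tokens. The sketch below is illustrative only: `TokenSpan`, `Segment`, and `buildSegments` are hypothetical names, and it assumes the tokenizer reports sorted, non-overlapping character offsets into the original line, which the task text does not confirm.

```typescript
// Hypothetical sketch: keep unsupported spans visible as plain text and
// mark only matched tokens as interactive.
interface TokenSpan {
  start: number; // inclusive offset into the source line (assumed)
  end: number; // exclusive offset into the source line (assumed)
}

type Segment =
  | { kind: "token"; text: string; tokenIndex: number }
  | { kind: "plain"; text: string };

function buildSegments(line: string, tokens: readonly TokenSpan[]): Segment[] {
  const segments: Segment[] = [];
  let cursor = 0;

  for (let tokenIndex = 0; tokenIndex < tokens.length; tokenIndex += 1) {
    const token = tokens[tokenIndex];
    // Skip spans that are missing, empty, overlapping, or out of range.
    if (!token || token.start < cursor || token.end <= token.start || token.end > line.length) {
      continue;
    }
    // Text between the previous token and this one stays visible but inert.
    if (token.start > cursor) {
      segments.push({ kind: "plain", text: line.slice(cursor, token.start) });
    }
    segments.push({ kind: "token", text: line.slice(token.start, token.end), tokenIndex });
    cursor = token.end;
  }

  // Trailing unsupported characters are preserved as well.
  if (cursor < line.length) {
    segments.push({ kind: "plain", text: line.slice(cursor) });
  }
  return segments;
}
```

Concatenating the `text` of every segment reproduces the original line, so the rendered subtitle cannot diverge from the source, and a renderer would attach hover handlers only to segments whose `kind` is `"token"`.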

@@ -0,0 +1,34 @@
---
id: TASK-92
title: Fix merged Yomitan headword selection for katakana subtitle tokens
status: Done
assignee: []
created_date: '2026-03-06 08:43'
updated_date: '2026-03-06 08:43'
labels:
- bug
- tokenizer
- yomitan
dependencies: []
priority: medium
---
## Description
<!-- SECTION:DESCRIPTION:BEGIN -->
Tokenizer/parser-selection bug: when a scanning-parser line is merged from multiple segments, the merged token currently keeps the first segment headword even if a later segment provides the full dictionary-backed term. This truncates katakana names such as バニール to バニ in the lookup payload and prevents correct dictionary matching. Also align kana classification so the prolonged sound mark is treated as kana in tokenizer heuristics.
<!-- SECTION:DESCRIPTION:END -->
## Acceptance Criteria
<!-- AC:BEGIN -->
- [x] #1 Merged scanning-parser tokens prefer a full cross-segment headword when one segment expands to the full term.
- [x] #2 Standalone later segment headwords do not override the primary token headword in normal content-word + auxiliary merges.
- [x] #3 Katakana prolonged sound mark is treated as kana in tokenizer heuristics.
- [x] #4 Regression tests cover the merged katakana headword case.
<!-- AC:END -->
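
Criteria #1 and #2 can be illustrated with a small selection function. This is a sketch under an assumed rule, not the project's actual heuristic: it treats a later headword as an "expanded cross-segment term" only when it is longer than the primary headword and begins with it. `ParsedSegment` and `selectMergedHeadword` are hypothetical names.

```typescript
// Hypothetical sketch of merged headword selection. The "longer and starts
// with the primary headword" test is an assumption made for illustration.
interface ParsedSegment {
  surface: string; // text of this segment as it appears in the subtitle
  headword: string; // dictionary headword reported for this segment
}

function selectMergedHeadword(segments: readonly ParsedSegment[]): string {
  const first = segments[0];
  if (!first) {
    return "";
  }
  const primary = first.headword;
  let best = primary;

  for (const segment of segments.slice(1)) {
    const candidate = segment.headword;
    const expandsPrimary =
      primary.length > 0 &&
      candidate.length > best.length &&
      candidate.startsWith(primary);
    // Standalone later headwords (for example an auxiliary) never win.
    if (expandsPrimary) {
      best = candidate;
    }
  }
  return best;
}
```

With this rule, segments carrying the headwords バニ and バニール resolve to バニール, while a content word followed by an auxiliary (headwords 食べる and た) keeps 食べる.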
## Final Summary
<!-- SECTION:FINAL_SUMMARY:BEGIN -->
Adjusted merged scanning-parser headword selection so that a later segment overrides the first headword only when it provides an expanded cross-segment dictionary term. This fixes katakana lookups that were truncated, such as バニール being looked up as バニ. Also updated kana classification to include the katakana prolonged sound mark (ー) and added regression coverage for both the expanded-headword case and the normal content-word-plus-auxiliary case.
<!-- SECTION:FINAL_SUMMARY:END -->
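
The kana classification change can be sketched as below. The helper names `isKanaChar` and `isKanaOnly` are hypothetical, and the letter ranges are one reasonable choice rather than the project's actual check. The point of the example is that the prolonged sound mark ー (U+30FC) lies outside the katakana letter range U+30A1 to U+30FA, so a letters-only range test misses it unless it is included explicitly.

```typescript
// Hypothetical sketch: classify kana, counting the prolonged sound mark.
const PROLONGED_SOUND_MARK = 0x30fc; // ー KATAKANA-HIRAGANA PROLONGED SOUND MARK

function isKanaChar(ch: string): boolean {
  const code = ch.codePointAt(0);
  if (code === undefined) {
    return false;
  }
  const isHiraganaLetter = code >= 0x3041 && code <= 0x3096;
  const isKatakanaLetter = code >= 0x30a1 && code <= 0x30fa;
  return isHiraganaLetter || isKatakanaLetter || code === PROLONGED_SOUND_MARK;
}

function isKanaOnly(text: string): boolean {
  // Array.from iterates by code point, so surrogate pairs are not split.
  const chars = Array.from(text);
  return chars.length > 0 && chars.every((ch) => isKanaChar(ch));
}
```

With the explicit check in place, a name such as バニール is classified as kana-only in this sketch, whereas dropping the `PROLONGED_SOUND_MARK` comparison would make the same call return `false`.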