feat(tokenizer): refine Yomitan grouping and parser tooling

- map segmented Yomitan lines into single logical tokens and improve candidate selection heuristics
- limit frequency lookup to selected token text with POS-based exclusions and add debug logging hook
- add standalone Yomitan parser test script, deterministic utility-script shutdown, and docs/backlog updates
2026-06-15 15:13:31 -07:00 · 2026-02-16 17:41:24 -08:00
parent 0eb2868805
commit 457e6f0f10
17 changed files with 1667 additions and 293 deletions
@@ -0,0 +1,32 @@
+---
+id: TASK-58
+title: >-
+  Add standalone script to exercise SubMiner Yomitan parser and candidate
+  selection
+status: Done
+assignee: []
+created_date: '2026-02-16 22:04'
+updated_date: '2026-02-16 22:06'
+labels: []
+dependencies: []
+---
+
+## Description
+
+<!-- SECTION:DESCRIPTION:BEGIN -->
+Create `scripts/test-yomitan-parser.ts` as a standalone CLI tool that reuses SubMiner's Yomitan parser logic to inspect parse output and candidate selection behavior for test inputs.
+<!-- SECTION:DESCRIPTION:END -->
+
+## Acceptance Criteria
+<!-- AC:BEGIN -->
+- [x] #1 A new script exists at `scripts/test-yomitan-parser.ts`.
+- [x] #2 The script can be run standalone and accepts input text for parsing.
+- [x] #3 The script uses existing SubMiner parser logic rather than duplicating parser behavior.
+- [x] #4 The script prints parsed results and candidate selection details (including which candidate is chosen).
+<!-- AC:END -->
+
+## Final Summary
+
+<!-- SECTION:FINAL_SUMMARY:BEGIN -->
+Added a standalone parser test script at `scripts/test-yomitan-parser.ts` that reuses `tokenizeSubtitleService` from SubMiner, initializes optional Yomitan+Electron runtime, fetches raw parse candidates via Yomitan `parseText`, and reports which candidate(s) match the final selected token output. Added package scripts `test-yomitan-parser` and `test-yomitan-parser:electron` for direct and Electron-backed runs.
+<!-- SECTION:FINAL_SUMMARY:END -->
@@ -0,0 +1,29 @@
+---
+id: TASK-59
+title: Restrict Yomitan frequency lookup to selected headword only
+status: Done
+assignee: []
+created_date: '2026-02-16 22:16'
+updated_date: '2026-02-16 22:18'
+labels: []
+dependencies: []
+---
+
+## Description
+
+<!-- SECTION:DESCRIPTION:BEGIN -->
+Update tokenizer and related scripts/tests so frequency lookup no longer uses Yomitan headword variant lists and instead only uses the selected headword returned by Yomitan.
+<!-- SECTION:DESCRIPTION:END -->
+
+## Acceptance Criteria
+<!-- AC:BEGIN -->
+- [x] #1 Frequency ranking for Yomitan tokens uses only the token headword (with existing fallback behavior) and not `frequencyLookupTerms` variants.
+- [x] #2 Tokenizer tests reflect the new headword-only lookup behavior.
+- [x] #3 Parser testing script output no longer implies variant-based frequency lookup.
+<!-- AC:END -->
+
+## Final Summary
+
+<!-- SECTION:FINAL_SUMMARY:BEGIN -->
+Updated frequency lookup to use only the selected token lookup text (headword first, fallback to reading/surface only when headword is absent) and removed Yomitan variant-term usage. Removed `frequencyLookupTerms` from token mapping/types, updated tokenizer tests for headword-only behavior, and aligned helper scripts (`scripts/get_frequency.ts`, `scripts/test-yomitan-parser.ts`) so diagnostics/output no longer imply variant-based lookup.
+<!-- SECTION:FINAL_SUMMARY:END -->
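
Reviewer sketch for TASK-59: the headword-first fallback described in the summary might look roughly like this. `TokenLike` and `selectFrequencyLookupText` are hypothetical names, not taken from the SubMiner source, and trying reading before surface is an assumption (the summary only says "reading/surface").

```typescript
// Hypothetical token shape; the real MergedToken type in SubMiner may differ.
interface TokenLike {
  surface: string;
  reading?: string;
  headword?: string;
}

// Pick the single text used for frequency lookup: headword first, then
// reading, then surface (assumed order). No variant-term lists are consulted.
function selectFrequencyLookupText(token: TokenLike): string | null {
  const candidates = [token.headword, token.reading, token.surface];
  for (const candidate of candidates) {
    const trimmed = candidate?.trim();
    if (trimmed) {
      return trimmed;
    }
  }
  // Nothing usable: the caller should skip the frequency lookup entirely.
  return null;
}
```

Returning one string instead of a list keeps the lookup to exactly one dictionary query per token, which is the behavior the acceptance criteria ask for.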
@@ -0,0 +1,29 @@
+---
+id: TASK-60
+title: Remove hard-coded particle term exclusions from frequency lookup
+status: Done
+assignee: []
+created_date: '2026-02-16 22:20'
+updated_date: '2026-02-16 22:21'
+labels: []
+dependencies: []
+---
+
+## Description
+
+<!-- SECTION:DESCRIPTION:BEGIN -->
+Update tokenizer frequency filtering to rely on MeCab POS information instead of a hard-coded set of particle surface forms.
+<!-- SECTION:DESCRIPTION:END -->
+
+## Acceptance Criteria
+<!-- AC:BEGIN -->
+- [x] #1 `FREQUENCY_EXCLUDED_PARTICLES` hard-coded term list is removed.
+- [x] #2 Frequency exclusion for particles/auxiliaries is driven by POS metadata.
+- [x] #3 Tokenizer tests cover POS-driven exclusion behavior.
+<!-- AC:END -->
+
+## Final Summary
+
+<!-- SECTION:FINAL_SUMMARY:BEGIN -->
+Removed hard-coded particle surface exclusions (`FREQUENCY_EXCLUDED_PARTICLES`) from tokenizer frequency logic. Frequency skip now relies on POS metadata only: `partOfSpeech` (`particle`/`bound_auxiliary`) and MeCab-enriched `pos1` (`助詞`/`助動詞`) for Yomitan tokens. Added tokenizer test `tokenizeSubtitleService skips frequency rank when Yomitan token is enriched as particle by mecab pos1` to validate POS-driven exclusion.
+<!-- SECTION:FINAL_SUMMARY:END -->
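
Reviewer sketch for TASK-60: a minimal illustration of a POS-only exclusion check. The POS values (`particle`, `bound_auxiliary`, `助詞`, `助動詞`) come from the summary above; the type and function names are hypothetical and the actual tokenizer code may be organized differently.

```typescript
// Hypothetical subset of the token fields relevant to frequency exclusion.
interface PosTaggedToken {
  partOfSpeech?: string; // e.g. 'particle', 'bound_auxiliary', 'noun'
  pos1?: string; // MeCab top-level POS, e.g. '助詞', '助動詞', '名詞'
}

const EXCLUDED_PART_OF_SPEECH = new Set<string>(['particle', 'bound_auxiliary']);
const EXCLUDED_MECAB_POS1 = new Set<string>(['助詞', '助動詞']);

// True when the token should not receive a frequency rank. The decision uses
// POS metadata only; no surface-form list is involved.
function shouldSkipFrequencyLookup(token: PosTaggedToken): boolean {
  if (token.partOfSpeech !== undefined && EXCLUDED_PART_OF_SPEECH.has(token.partOfSpeech)) {
    return true;
  }
  if (token.pos1 !== undefined && EXCLUDED_MECAB_POS1.has(token.pos1)) {
    return true;
  }
  return false;
}
```

Checking `pos1` separately covers the case exercised by the new test, where a Yomitan token carries a non-particle `partOfSpeech` but is enriched as a particle by MeCab.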
@@ -0,0 +1,29 @@
+---
+id: TASK-61
+title: Ensure parser utility scripts exit immediately after output
+status: Done
+assignee: []
+created_date: '2026-02-16 22:35'
+updated_date: '2026-02-16 22:37'
+labels: []
+dependencies: []
+---
+
+## Description
+
+<!-- SECTION:DESCRIPTION:BEGIN -->
+Update `scripts/test-yomitan-parser.ts` and `scripts/get_frequency.ts` so they clean up Electron parser resources and terminate immediately after producing results, avoiding hangs.
+<!-- SECTION:DESCRIPTION:END -->
+
+## Acceptance Criteria
+<!-- AC:BEGIN -->
+- [x] #1 `scripts/test-yomitan-parser.ts` exits promptly after printing output.
+- [x] #2 `scripts/get_frequency.ts` exits promptly after printing output.
+- [x] #3 Electron-related resources (parser window/app loop) are cleaned up on both success and error paths.
+<!-- AC:END -->
+
+## Final Summary
+
+<!-- SECTION:FINAL_SUMMARY:BEGIN -->
+Added deterministic shutdown to both utility scripts. `scripts/get_frequency.ts` now destroys parser windows in a `finally` block, calls `app.quit()` when Electron is loaded, and uses explicit `.then/.catch` exits so the process terminates immediately after output with correct exit codes. `scripts/test-yomitan-parser.ts` now mirrors this pattern with runtime cleanup (`shutdownYomitanRuntime`) and explicit process exit handling.
+<!-- SECTION:FINAL_SUMMARY:END -->
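
Reviewer sketch for TASK-61: the shutdown pattern described above (cleanup in `finally`, explicit success/error exit codes) can be outlined without Electron. `ParserResources`, `runAndCleanUp`, and `toExitCode` are hypothetical stand-ins; the real scripts work with actual parser windows and `app.quit()`.

```typescript
// Hypothetical stand-ins for the Electron parser window and app handle.
interface ParserResources {
  destroyWindow: () => void;
  quitApp?: () => void; // only present when Electron was loaded
}

// Run the script body and always release resources, on success and on error.
async function runAndCleanUp<T>(
  work: () => Promise<T>,
  resources: ParserResources,
): Promise<T> {
  try {
    return await work();
  } finally {
    resources.destroyWindow();
    resources.quitApp?.();
  }
}

// Map the outcome to an exit code so the caller can terminate explicitly
// instead of waiting for the event loop to drain.
function toExitCode(
  run: Promise<unknown>,
  reportError: (error: unknown) => void,
): Promise<number> {
  return run.then(
    () => 0,
    (error: unknown) => {
      reportError(error);
      return 1;
    },
  );
}
```

In a Node script the last step would be something like `toExitCode(runAndCleanUp(main, resources), console.error).then((code) => process.exit(code))`, so the process ends right after output even if a handle is still open.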
@@ -0,0 +1,48 @@
+---
+id: TASK-62
+title: Color full Japanese term when Yomitan splits lookup into multiple tokens
+status: Done
+assignee: []
+created_date: '2026-02-16 23:03'
+updated_date: '2026-02-16 23:11'
+labels: []
+dependencies: []
+priority: medium
+---
+
+## Description
+
+<!-- SECTION:DESCRIPTION:BEGIN -->
+Users should see one continuous highlight for a looked-up term even when Yomitan returns the term as multiple adjacent tokens, so color feedback matches the selected word/phrase.
+<!-- SECTION:DESCRIPTION:END -->
+
+## Acceptance Criteria
+<!-- AC:BEGIN -->
+- [x] #1 When a looked-up Japanese term is represented as multiple adjacent tokens from Yomitan, the UI applies highlight color to the entire contiguous term instead of only one token.
+- [x] #2 Existing highlighting behavior for single-token matches remains unchanged.
+- [x] #3 Automated coverage or reproducible verification demonstrates the multi-token case is rendered correctly.
+<!-- AC:END -->
+
+## Implementation Plan
+
+<!-- SECTION:PLAN:BEGIN -->
+1. Update Yomitan parse-result mapping so each parse line is treated as one logical token (combine segment text/reading and preserve the selected headword from segment metadata).
+2. Add regression coverage for furigana-split parse lines to ensure frequency/highlight metadata applies to the full combined token.
+3. Rebuild and run tokenizer tests to verify that multi-segment and single-segment behavior remains correct.
+<!-- SECTION:PLAN:END -->
+
+## Implementation Notes
+
+<!-- SECTION:NOTES:BEGIN -->
+Implemented line-level token mapping in `src/core/services/tokenizer-service.ts` so segmented Yomitan line parts (e.g. furigana-split pieces) are merged into one `MergedToken` with one headword, one surface span, and one reading string.
+
+Added/updated tokenizer tests in `src/core/services/tokenizer-service.test.ts` covering segmented-line behavior and aligned several existing fixtures/assertions to current runtime behavior so the full tokenizer suite is green.
+
+Validation run: `pnpm run build && node dist/core/services/tokenizer-service.test.js` (38/38 passing).
+<!-- SECTION:NOTES:END -->
+
+## Final Summary
+
+<!-- SECTION:FINAL_SUMMARY:BEGIN -->
+Fixed partial token coloring caused by Yomitan segmented parse lines by changing tokenizer mapping to treat each parse line as one logical token instead of one token per segment. The new mapping concatenates segment text/reading, carries the selected headword from segment metadata, and preserves correct span offsets so frequency/known-word/JLPT classifications apply to the full term span. Added regression coverage for furigana-split tokens and updated related parser fixture tests to reflect line-level token semantics. Verified with `pnpm run build` and `node dist/core/services/tokenizer-service.test.js` (38 passing).
+<!-- SECTION:FINAL_SUMMARY:END -->
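
Reviewer sketch for TASK-62: one way to express "one token per parse line" is shown below. The segment shape is a simplified assumption about Yomitan parse output, and `mapParseLinesToTokens` plus the `startPos`/`endPos` field names are hypothetical; the real mapping lives in `src/core/services/tokenizer-service.ts` and may differ in detail.

```typescript
// Hypothetical segment shape, loosely modelled on a furigana-split parse line.
interface ParseSegment {
  text: string;
  reading: string; // assumed to be empty for segments that need no furigana
  headword?: string;
}

interface LineToken {
  surface: string;
  reading: string;
  headword: string;
  startPos: number; // offsets are in UTF-16 code units
  endPos: number;
}

// Treat each parse line as one logical token instead of one token per segment.
function mapParseLinesToTokens(lines: ParseSegment[][]): LineToken[] {
  const tokens: LineToken[] = [];
  let offset = 0;
  for (const segments of lines) {
    const surface = segments.map((segment) => segment.text).join('');
    if (surface.length === 0) {
      continue;
    }
    // Assumption: a segment without a reading contributes its own text.
    const reading = segments.map((segment) => segment.reading || segment.text).join('');
    // Carry the first headword found in segment metadata; fall back to surface.
    const headword = segments.find((segment) => segment.headword)?.headword ?? surface;
    tokens.push({ surface, reading, headword, startPos: offset, endPos: offset + surface.length });
    offset += surface.length;
  }
  return tokens;
}
```

Because the span covers the whole line, any per-token classification (frequency, known-word, JLPT) applies to the full term instead of a single furigana piece.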
@@ -0,0 +1,52 @@
+---
+id: TASK-63
+title: Add runtime toggle to log selected Yomitan token groups
+status: Done
+assignee: []
+created_date: '2026-02-16 23:38'
+updated_date: '2026-02-16 23:41'
+labels: []
+dependencies: []
+priority: low
+---
+
+## Description
+
+<!-- SECTION:DESCRIPTION:BEGIN -->
+Provide an in-app debug toggle that logs the selected Yomitan token grouping for each subtitle parse so users can verify token boundaries live without rebuilding.
+<!-- SECTION:DESCRIPTION:END -->
+
+## Acceptance Criteria
+<!-- AC:BEGIN -->
+- [x] #1 A runtime option exists to enable/disable Yomitan group debug logging without app restart.
+- [x] #2 When enabled, subtitle tokenization logs the selected Yomitan grouped tokens (with enough detail to verify boundaries/headwords).
+- [x] #3 When disabled, no additional Yomitan group debug logs are emitted.
+- [x] #4 Related tests/build pass for touched modules.
+<!-- AC:END -->
+
+## Implementation Plan
+
+<!-- SECTION:PLAN:BEGIN -->
+1. Add a boolean runtime option for Yomitan-group debug logging in the centralized runtime option registry and expose it in config metadata.
+2. Extend tokenizer dependency wiring so main runtime can pass the current toggle value to tokenization without restart.
+3. Log selected Yomitan token groups (surface/headword/reading/span) only when the toggle is enabled.
+4. Add tests for registry presence and enabled/disabled logging behavior, then run build/tests.
+<!-- SECTION:PLAN:END -->
+
+## Implementation Notes
+
+<!-- SECTION:NOTES:BEGIN -->
+Added runtime option `anki.debugYomitanGroups` (`Debug Yomitan Groups`) with default `false`, mapped to `ankiConnect.behavior.debugYomitanGroups`.
+
+Wired `main.ts` tokenizer deps to read the runtime option value live via `RuntimeOptionsManager`, with config fallback.
+
+Implemented conditional tokenizer logging (`Selected Yomitan token groups`) in `tokenizer-service` and covered enabled/disabled behavior with unit tests.
+
+Validation run: `pnpm run build && node dist/core/services/tokenizer-service.test.js && node --test dist/config/config.test.js` (all passing).
+<!-- SECTION:NOTES:END -->
+
+## Final Summary
+
+<!-- SECTION:FINAL_SUMMARY:BEGIN -->
+Implemented a live runtime debug toggle to inspect Yomitan token grouping. Added `anki.debugYomitanGroups` to the runtime option registry and config defaults, wired it through `main.ts` into tokenizer deps, and added conditional logging in tokenizer parsing that emits selected groups with surface/headword/reading/span for each parsed subtitle. Logging is gated by the toggle and disabled by default. Added tests for runtime registry presence and tokenizer logging on/off behavior, then validated with build + tokenizer + config tests.
+<!-- SECTION:FINAL_SUMMARY:END -->
@@ -0,0 +1,50 @@
+---
+id: TASK-63.1
+title: Drive Yomitan group debug logging from overlay debug mode (Y-D)
+status: Done
+assignee: []
+created_date: '2026-02-16 23:48'
+updated_date: '2026-02-16 23:50'
+labels: []
+dependencies: []
+parent_task_id: TASK-63
+priority: low
+---
+
+## Description
+
+<!-- SECTION:DESCRIPTION:BEGIN -->
+Remove dedicated runtime/config toggle for Yomitan group logging and instead enable logs only when overlay debug mode is active via the existing Y-D debug flow.
+<!-- SECTION:DESCRIPTION:END -->
+
+## Acceptance Criteria
+<!-- AC:BEGIN -->
+- [x] #1 No runtime option or config key is required for Yomitan group logging.
+- [x] #2 Yomitan group logs are emitted only when overlay debug mode is enabled (Y-D/DevTools debug state).
+- [x] #3 When overlay debug mode is disabled, Yomitan group logs are not emitted.
+- [x] #4 Build/tests for touched modules pass.
+<!-- AC:END -->
+
+## Implementation Plan
+
+<!-- SECTION:PLAN:BEGIN -->
+1. Remove the `debugYomitanGroups` runtime/config option wiring from types/config registries so it no longer appears in runtime options.
+2. Keep the tokenizer-level debug logging gate but drive it from the existing overlay debug state (`overlayDebugVisualizationEnabled`), which is toggled by the Y-D/DevTools flow.
+3. Rebuild and run tokenizer/config/runtime-option tests to confirm behavior and no registry regressions.
+<!-- SECTION:PLAN:END -->
+
+## Implementation Notes
+
+<!-- SECTION:NOTES:BEGIN -->
+Removed `anki.debugYomitanGroups` from runtime option ID union and removed `ankiConnect.behavior.debugYomitanGroups` from config defaults/registry entries.
+
+Updated `main.ts` tokenizer dependency wiring so `getYomitanGroupDebugEnabled` now directly reads `appState.overlayDebugVisualizationEnabled` (the existing debug visualization state toggled via Y-D/DevTools).
+
+Validated with `pnpm run build && node dist/core/services/tokenizer-service.test.js && node --test dist/config/config.test.js dist/core/services/runtime-options-ipc-service.test.js`.
+<!-- SECTION:NOTES:END -->
+
+## Final Summary
+
+<!-- SECTION:FINAL_SUMMARY:BEGIN -->
+Switched Yomitan group debug logging to follow the existing overlay debug mode (Y-D/DevTools state) and removed the dedicated runtime/config option surface. The tokenizer still logs `Selected Yomitan token groups` only when the debug gate is true, but the gate now comes from `appState.overlayDebugVisualizationEnabled` in main runtime wiring. Removed the temporary runtime-option/config definitions and updated related registry expectations. Build and relevant tests are passing.
+<!-- SECTION:FINAL_SUMMARY:END -->
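
Reviewer sketch for TASK-63.1: the gate wiring can be pictured as a getter that reads the overlay debug state at call time. `getYomitanGroupDebugEnabled`, `overlayDebugVisualizationEnabled`, and the log message come from the notes above; the `log` dependency, `createDebugDeps`, and `logSelectedGroups` are hypothetical simplifications of the real wiring in `main.ts`.

```typescript
interface OverlayAppState {
  overlayDebugVisualizationEnabled: boolean;
}

interface TokenizerDebugDeps {
  getYomitanGroupDebugEnabled: () => boolean;
  log: (message: string, payload: unknown) => void; // hypothetical logger hook
}

// Build tokenizer deps whose gate reads the app state live, so toggling
// overlay debug mode takes effect on the next tokenization without a restart.
function createDebugDeps(
  appState: OverlayAppState,
  log: (message: string, payload: unknown) => void,
): TokenizerDebugDeps {
  return {
    getYomitanGroupDebugEnabled: () => appState.overlayDebugVisualizationEnabled,
    log,
  };
}

// Emit the group log only while the gate is open.
function logSelectedGroups(deps: TokenizerDebugDeps, groups: unknown[]): void {
  if (!deps.getYomitanGroupDebugEnabled()) {
    return;
  }
  deps.log('Selected Yomitan token groups', groups);
}
```

Passing a getter instead of a boolean snapshot is what lets the tokenizer follow the Y-D toggle without any dedicated runtime option.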