SubMiner/backlog/tasks/task-174 - Fix-missing-frequency-highlights-for-merged-tokenizer-tokens.md
sudacode 46fbea902a Harden stats APIs and fix Electron Yomitan debug runtime
- Validate stats session IDs/limits and add AnkiConnect request timeouts
- Stabilize stats window/runtime lifecycle and tighten window security defaults
- Fix Electron CLI debug startup by unsetting `ELECTRON_RUN_AS_NODE` and wiring Yomitan session state
- Expand regression coverage for tracker queries/events ordering and session aggregates
- Update docs for stats dashboard usage and Yomitan lookup troubleshooting
2026-03-17 20:05:07 -07:00

id: TASK-174
title: Fix missing frequency highlights for merged tokenizer tokens
status: In Progress
assignee: codex
created_date: 2026-03-15 10:18
updated_date: 2026-03-15 10:40
labels: bug, tokenizer, frequency-highlighting
dependencies: (none)
references:
  • /Users/sudacode/projects/japanese/SubMiner/src/core/services/tokenizer.ts
  • /Users/sudacode/projects/japanese/SubMiner/src/core/services/tokenizer/parser-selection-stage.ts
  • /Users/sudacode/projects/japanese/SubMiner/src/core/services/tokenizer/yomitan-parser-runtime.ts
  • /Users/sudacode/projects/japanese/SubMiner/scripts/get_frequency.ts
  • /Users/sudacode/projects/japanese/SubMiner/scripts/test-yomitan-parser.ts
priority: high

Description

Frequency highlighting can miss words that should be colored under the configured top-X limit. This happens when tokenizer candidate selection keeps merged Yomitan units that combine a content word with trailing function text: the annotation stage then conservatively clears frequency for the whole merged token, so visible high-frequency words lose their highlighting. The standalone debug CLIs also fail to initialize the shared Yomitan runtime, which blocks reliable repro for this class of bug.
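The failure mode described above can be illustrated with a minimal sketch. The `Token` shape, the `mergedWithFunctionText` flag, both function names, and the sample surface forms and ranks are hypothetical and chosen only for illustration; SubMiner's actual types and annotation logic may differ.

```typescript
// Hypothetical token shape; not SubMiner's real type.
interface Token {
  surface: string;
  // True when the parser fused a content word with trailing function text.
  mergedWithFunctionText: boolean;
  // null means "no frequency information".
  frequencyRank: number | null;
}

// Conservative annotation: a merged token loses its rank entirely,
// because the rank belongs to the content word, not the merged surface.
function annotateFrequency(token: Token): Token {
  if (token.mergedWithFunctionText) {
    return { ...token, frequencyRank: null };
  }
  return token;
}

// A token is highlighted only if it still carries a rank within the limit.
// The inclusive comparison is an assumption.
function isHighlighted(token: Token, topX: number): boolean {
  return token.frequencyRank !== null && token.frequencyRank <= topX;
}

// Illustrative data only: the rank 500 is made up.
const mergedToken = annotateFrequency({
  surface: "食べたから",
  mergedWithFunctionText: true,
  frequencyRank: 500,
});
const splitToken = annotateFrequency({
  surface: "食べた",
  mergedWithFunctionText: false,
  frequencyRank: 500,
});

console.log(isHighlighted(mergedToken, 10000), isHighlighted(splitToken, 10000));
```

In this sketch the same content word is highlighted when it stands alone but not when it is merged, which is why the choice of parse candidate upstream decides whether the highlight survives.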

Acceptance Criteria

  • #1 Tokenizer no longer drops frequency highlighting for content words in merged-token cases where a better scanning parse candidate would preserve highlightable tokens.
  • #2 A regression test covers the reported sentence shape and fails before the fix.
  • #3 The standalone frequency/parser debug path can initialize the shared Yomitan runtime well enough to reproduce tokenizer output instead of immediately reporting runtime/session wiring errors.

Implementation Plan

  1. Add a regression test for the reported merged-token frequency miss, centered on Yomitan scanning candidate selection and downstream frequency annotation.
  2. Update tokenizer candidate selection so merged content+function tokens do not win over candidates that preserve highlightable content tokens.
  3. Repair the standalone frequency/parser debug scripts so their Electron/Yomitan runtime wiring matches current shared runtime expectations.
  4. Verify with targeted tokenizer/parser tests and the standalone debug repro command.
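
Steps 1 and 2 of the plan could be sketched roughly as follows. This is not the code in parser-selection-stage.ts: the `CandidateToken` and `ParseCandidate` shapes, the scoring rule, and the function names are assumptions made for illustration.

```typescript
// Hypothetical shapes; the real parser-selection stage may model this differently.
interface CandidateToken {
  surface: string;
  isContent: boolean;
  // True when a content word has been fused with trailing function text.
  hasTrailingFunctionText: boolean;
}
type ParseCandidate = CandidateToken[];

// Count tokens that the annotation stage could still highlight.
function countHighlightable(candidate: ParseCandidate): number {
  return candidate.filter((t) => t.isContent && !t.hasTrailingFunctionText).length;
}

// Prefer the candidate that preserves the most highlightable content tokens.
// Ties keep the earlier candidate, so existing ordering is otherwise unchanged.
function selectCandidate(candidates: ParseCandidate[]): ParseCandidate | undefined {
  let best: ParseCandidate | undefined;
  let bestScore = -1;
  for (const candidate of candidates) {
    const score = countHighlightable(candidate);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}

// Regression-style check: a merged candidate listed first must not win.
const mergedCandidate: ParseCandidate = [
  { surface: "食べたから", isContent: true, hasTrailingFunctionText: true },
];
const splitCandidate: ParseCandidate = [
  { surface: "食べた", isContent: true, hasTrailingFunctionText: false },
  { surface: "から", isContent: false, hasTrailingFunctionText: false },
];
const selected = selectCandidate([mergedCandidate, splitCandidate]);
console.log(selected === splitCandidate);
```

A count-based score is only one possible rule; the point of the sketch is that a regression test can pin the selection outcome for a specific sentence shape before the selection logic is changed.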

Implementation Notes

Initial triage: shared frequency class logic looks correct; likely failure is upstream tokenizer candidate selection producing merged content+function tokens that annotation later excludes from frequency. Standalone debug scripts also fail to initialize a usable Electron/Yomitan runtime, blocking reliable repro from the current CLI path.

After the standalone Electron wrapper was fixed, the repro does not support the original highlight claim for 誰でもいいから かかってこいよ: the tokenizer reports かかってこい with `frequencyRank` 63098, so it correctly stays uncolored at `--color-top-x 10000` and becomes colorable once the threshold is raised above that rank.

The concrete bug fixed in this pass is the standalone Electron debug path. Package scripts now unset `ELECTRON_RUN_AS_NODE`, and the scripts normalize Electron imports/guards so that `get-frequency:electron` can reach real Electron/Yomitan runtime state instead of immediately falling back to Node-mode diagnostics. `test-yomitan-parser:electron` still shows extension/service-worker issues against the existing profile and was not stabilized in this pass.
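
Both observations in these notes can be restated as a small sketch. The function names are hypothetical, and the inclusive threshold comparison is an assumption; the only external fact relied on is that Electron starts as a plain Node.js process when `ELECTRON_RUN_AS_NODE` is set in its environment.

```typescript
// Threshold check: a rank is colorable only when it falls within the top-X limit.
// Whether the real comparison is inclusive is an assumption.
function isWithinTopX(frequencyRank: number | null, topX: number): boolean {
  return frequencyRank !== null && frequencyRank <= topX;
}

// Build a child-process environment without ELECTRON_RUN_AS_NODE, so that a
// spawned Electron binary starts as Electron rather than as plain Node.js.
function withoutElectronRunAsNode(
  env: Record<string, string | undefined>,
): Record<string, string | undefined> {
  const copy = { ...env };
  delete copy["ELECTRON_RUN_AS_NODE"];
  return copy;
}

// Rank reported in the notes above for かかってこい.
const reportedRank = 63098;
console.log(isWithinTopX(reportedRank, 10000)); // below the threshold: stays uncolored
console.log(isWithinTopX(reportedRank, 70000)); // threshold above the rank: colorable

const cleanedEnv = withoutElectronRunAsNode({
  ELECTRON_RUN_AS_NODE: "1",
  PATH: "/usr/bin",
});
console.log("ELECTRON_RUN_AS_NODE" in cleanedEnv);
```

The cleaned environment object could be passed as the `env` option when spawning the debug script, which mirrors what the package scripts do by unsetting the variable before launch.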