SubMiner/backlog/tasks/task-350 - Fix-known-highlighting-for-Yomitan-compound-tokens.md at e8f10fe8a9164f24baecf339ff239cf5a6964d9f - SubMiner

sudacode/SubMiner

mirror of https://github.com/ksyasuda/SubMiner.git synced 2026-05-13 08:12:54 -07:00

sudacode ca796bfe6a

fix: macOS overlay z-order and Yomitan compound token known highlighting

- Release always-on-top when tracked mpv loses foreground on macOS
- Skip visible overlay blur restacking on macOS to avoid covering unrelated windows
- Prefer Yomitan internal parse tokens over fragmented scanner output for known-word decisions
- Add regression tests for both behaviors

2026-05-12 02:34:28 -07:00

id: TASK-350
title: Fix known highlighting for Yomitan compound tokens
status: Done
assignee: codex
created_date: 2026-05-12 09:08
updated_date: 2026-05-12 09:29
labels: bug, tokenizer
dependencies: (none)
modified_files:
- src/core/services/tokenizer/yomitan-parser-runtime.ts
- src/core/services/tokenizer/yomitan-parser-runtime.test.ts
- src/core/services/tokenizer.test.ts
- changes/350-known-yomitan-token-highlighting.md
priority: high
ordinal: 184500

Description

Subtitle known-word coloring should respect the lexical token selected by Yomitan. If Yomitan emits a compound or inflected expression as one token, SubMiner must not mark that displayed token known solely because MeCab/POS enrichment can decompose it into known component words.

Acceptance Criteria

#1 A Yomitan token such as 取り組んで with headword 取り組む remains not-known when only component words like 取る or 組む are known.
#2 Frequency/JLPT/POS enrichment still works for the selected Yomitan token without leaking component known-word status into isKnown.
#3 Regression coverage demonstrates the compound-token case and fails on current behavior before the fix.
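
Criterion #1 can be illustrated with a minimal sketch. It assumes a hypothetical `DisplayToken` shape and a known-word set keyed by headword; neither is SubMiner's actual type or API, and the real lookup may differ.

```typescript
// Hypothetical token shape for illustration only; not SubMiner's actual type.
interface DisplayToken {
  surface: string;
  headword: string;
}

// Decide known status from the selected token's own headword.
// Component words of a compound are deliberately never consulted.
function isTokenKnown(
  token: DisplayToken,
  knownWords: ReadonlySet<string>,
): boolean {
  return knownWords.has(token.headword);
}

const compound: DisplayToken = { surface: "取り組んで", headword: "取り組む" };

// Only the components are known, so the compound stays not-known.
const componentsOnly = isTokenKnown(compound, new Set(["取る", "組む"]));

// Once the compound's own headword is known, the token becomes known.
const compoundKnown = isTokenKnown(
  compound,
  new Set(["取る", "組む", "取り組む"]),
);
```

Keying the decision on the selected token's headword, rather than on any sub-span, is what keeps component words from leaking into `isKnown`.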

Implementation Plan

1. Add a regression in src/core/services/tokenizer.test.ts for a Yomitan-selected compound token: Yomitan emits 取り組んで with headword 取り組む; MeCab splits the same span into component tokens whose headwords include known component words such as 組む; the expected result is one displayed token with isKnown === false when only the components are known.
2. Verify the regression fails on current code.
3. Patch MeCab enrichment so it only contributes POS metadata used by annotation filters/exclusions. It must preserve the Yomitan token's surface, headword, reading, offsets, and existing lexical annotation state, especially isKnown.
4. Re-run the targeted tokenizer test, then a relevant fast test lane if practical.

Code inspection showed that MeCab enrichment already writes only POS metadata, so it is not the source of the bug. The observed component coloring can instead come from SubMiner's custom Yomitan scanning path fragmenting a phrase differently than Yomitan's internal parser. The regression should therefore exercise the requestYomitanScanTokens fallback/parser behavior as seen by tokenizeSubtitle, and the fix should prefer Yomitan internal parse token identity while keeping MeCab limited to filtering/POS metadata.
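
The constraint on MeCab enrichment can be sketched as follows. The `LexicalToken` and `MecabToken` shapes, the `enrichWithPos` helper, and the sample offsets and POS tags are all hypothetical and chosen for illustration; they are not SubMiner's actual types or MeCab's actual output.

```typescript
// Hypothetical shapes for illustration; field names are assumptions.
interface LexicalToken {
  surface: string;
  headword: string;
  reading: string;
  start: number;
  end: number;
  isKnown: boolean;
  pos?: string[];
}

interface MecabToken {
  start: number;
  end: number;
  headword: string;
  pos: string;
}

// Attach POS tags from MeCab tokens that fall inside the lexical token's span.
// Every other field, including isKnown, is copied through unchanged.
function enrichWithPos(
  token: LexicalToken,
  mecabTokens: MecabToken[],
): LexicalToken {
  const pos = mecabTokens
    .filter((m) => m.start >= token.start && m.end <= token.end)
    .map((m) => m.pos);
  return { ...token, pos };
}

const yomitanToken: LexicalToken = {
  surface: "取り組んで",
  headword: "取り組む",
  reading: "とりくんで",
  start: 0,
  end: 5,
  isKnown: false,
};

// Illustrative MeCab split of the same span into component tokens.
const mecabSplit: MecabToken[] = [
  { start: 0, end: 2, headword: "取る", pos: "動詞" },
  { start: 2, end: 4, headword: "組む", pos: "動詞" },
  { start: 4, end: 5, headword: "で", pos: "助詞" },
];

const enriched = enrichWithPos(yomitanToken, mecabSplit);
```

The component headwords are present in the MeCab data but never read, so they cannot influence the lexical identity or known state of the enriched token.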

Implementation Notes

User clarified MeCab is intended only to help filter unwanted characters/particles/sound effects/etc., not to alter lexical tokenization or known-word decisions.

Implementation settled on parse-first token identity: requestYomitanScanTokens now reads Yomitan internal parse tokens first. It still runs the scanner to keep scanner metadata when spans agree, but returns parse tokens when the scanner fragments the parse token. MeCab remains POS/filter enrichment only.
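
A minimal sketch of the parse-first idea, under stated assumptions: the `SpanToken` shape and `selectParseFirst` helper are hypothetical, and the per-token span comparison shown here is a simplification. The actual granularity used by requestYomitanScanTokens is not described in this task.

```typescript
// Hypothetical token shape; wordClasses stands in for scanner-only metadata.
interface SpanToken {
  surface: string;
  headword: string;
  start: number;
  end: number;
  wordClasses?: string[];
}

// Parse tokens define token identity. Scanner metadata is merged in only
// when a scanner token covers exactly the same span as the parse token.
function selectParseFirst(
  parseTokens: SpanToken[],
  scannerTokens: SpanToken[],
): SpanToken[] {
  const scannerBySpan = new Map<string, SpanToken>();
  for (const t of scannerTokens) {
    scannerBySpan.set(`${t.start}:${t.end}`, t);
  }
  return parseTokens.map((p) => {
    const match = scannerBySpan.get(`${p.start}:${p.end}`);
    // No exact span match means the scanner fragmented this token.
    return match ? { ...p, wordClasses: match.wordClasses } : p;
  });
}

const parseTokens: SpanToken[] = [
  { surface: "取り組んで", headword: "取り組む", start: 0, end: 5 },
  { surface: "います", headword: "いる", start: 5, end: 8 },
];

// The scanner fragments the first token but agrees on the second.
const scannerTokens: SpanToken[] = [
  { surface: "取り", headword: "取る", start: 0, end: 2 },
  { surface: "組んで", headword: "組む", start: 2, end: 5 },
  { surface: "います", headword: "いる", start: 5, end: 8, wordClasses: ["v1"] },
];

const selected = selectParseFirst(parseTokens, scannerTokens);
```

Here the fragmented span keeps the parse token's headword 取り組む, while the agreeing span picks up the scanner's word classes.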

Final Summary

Fixed known-word highlighting for Yomitan compound tokens by preferring Yomitan internal parse token spans over fragmented scanner output. When scanner output agrees with parse spans, scanner metadata such as name-match and word classes is preserved; when it fragments a Yomitan token, the parse token identity wins so known component words do not color the larger unknown token green.

Added regressions for 取り組んで with known component words (取る, 組む, もらう) and for parser-runtime token selection/metadata behavior. Added a changelog fragment.

Validation run:

- bun test src/core/services/tokenizer.test.ts src/core/services/tokenizer/yomitan-parser-runtime.test.ts src/core/services/tokenizer/parser-selection-stage.test.ts src/core/services/tokenizer/parser-enrichment-stage.test.ts
- bun run typecheck
- bun x prettier --check src/core/services/tokenizer.test.ts src/core/services/tokenizer/yomitan-parser-runtime.ts src/core/services/tokenizer/yomitan-parser-runtime.test.ts changes/350-known-yomitan-token-highlighting.md
- bun run changelog:lint
- git diff --check
