feat(tokenizer): use Yomitan word classes for subtitle POS filtering (#57)

* feat(tokenizer): use Yomitan word classes for subtitle POS filtering

- Carry matched headword wordClasses from termsFind into YomitanScanToken
- Map recognized Yomitan wordClasses to SubMiner coarse POS before annotation
- MeCab enrichment now fills only missing POS fields, preserving existing coarse pos1
- Exclude standalone grammar particles, して helper fragments, and single-kana surfaces from annotations
- Respect source-text punctuation gaps when counting N+1 sentence words
- Preserve known-word highlight on excluded kanji-containing tokens
- Add backlog tasks 304 (N+1 boundary bug) and 305 (wordClasses POS, done)

* fix(tokenizer): preserve annotation and enrichment behavior

* fix: restore jlpt subtitle underlines

* fix: exclude kana-only n+1 targets

* fix: refresh overlay on Hyprland fullscreen

* fix: address fullscreen and n-plus-one review notes

* fix: address CodeRabbit review comments

* fix: accept modified digits for multi-line sentence mining

* Cancel pending Linux MPV fullscreen overlay refresh bursts

- return a cancel handle from the Linux refresh burst scheduler
- clear pending refresh bursts when overlays hide or windows close
- tighten the burst test polling to wait for the async refresh

* fix: suppress N+1 for kana-only candidates and fix minSentenceWords coun

- Treat kana-only tokens with surrounding subtitle punctuation (…, ―, etc.) as kana-only so they are not promoted to N+1 targets
- Exclude unknown tokens filtered from N+1 targeting from the minSentenceWords count so filtered kana-only unknowns cannot satisfy sentence length threshold
- Add regression tests for kana-only candidate suppression and filtered-unknown padding cases

* Suppress subtitle annotations for grammar fragments

- Hide annotation metadata for auxiliary inflection and ja-nai endings
- Preserve lexical `くれる` forms and add regression coverage

* Fix kana-only N+1 tokenizer regression test

- Use a pure-kana fixture for the subtitle token N+1 case
- Update task notes for the latest CodeRabbit follow-up

* Fix managed playback exit and tokenizer grammar splits

- Ignore background stats daemons during regular app startup
- Split standalone grammar endings before applying annotations
- Clear helper-span annotations for auxiliary-only tokens

* fix: refresh current subtitle after known-word mining

* fix: suppress sigh interjection annotations

* fix: preserve jlpt underline color after lookup

* Replace grammar-ending permutations with shared matcher; preserve word a

- Extract `grammar-ending.ts` with `isStandaloneGrammarEndingText` / `isSubtitleGrammarEndingText` pattern matchers
- Replace `STANDALONE_GRAMMAR_ENDINGS` set in parser-selection-stage with shared matcher
- Replace generated phrase sets in subtitle-annotation-filter with shared matcher
- Remove stale duplicate subtitle-exclusion constants and helpers from annotation-stage
- Manual clipboard card updates now write only to the sentence audio field, leaving word/expression audio untouched

* fix: CI changelog, annotation options threading, and Jellyfin quit

- Add `type: fixed` / `area:` frontmatter to `changes/319` to pass `changelog:lint`
- Thread `TokenizerAnnotationOptions` through `stripSubtitleAnnotationMetadata` so `sourceText` is honored
- Include `jellyfinPlay` in `shouldQuitOnDisconnectWhenOverlayRuntimeInitialized` predicate
- Make mouse test `elementFromPoint` stubs coordinate-sensitive
- Make Lua test `.tmp` mkdir portable on Windows

* Preserve overlay across macOS flaps and mpv playlist changes

- keep visible overlays alive during transient macOS tracker loss
- reuse the running mpv overlay path on playlist navigation
- update regression coverage and changelog fragments

* fix: restore stats daemon deferral

* fix: keep subtitle prefetch alive after cache hits

* Fix JLPT underline color drift and AniList skipped-threshold sync

- Replace JLPT `text-decoration` underlines with `border-bottom` so Chromium selection/hover cannot repaint them to another annotation's color
- Lock JLPT underline color for combined annotation selectors (known, n+1, frequency) and character hover/selection states
- Trigger AniList post-watch check on every mpv time-position update to catch skipped completion thresholds
- Fall back to filename-parser season/episode when guessit omits them

* fix: address coderabbit feedback

* fix: sync AniList after seeked completion

* fix: preserve ordinal frequency annotations

* fix: preserve known highlighting for filtered tokens

* fix: address PR #57 CodeRabbit feedback

- Acquire AniList post-watch in-flight lock before async gating to prevent duplicate writes
- Isolate manual watched mark result from AniList post-watch callback failures
- Report known-word cache clears as mutations during immediate append when state existed
- Add regression tests for each fix

* fix: stop AniList setup reopening on Linux when keyring token exists

- Gate setup success on token persistence: `saveToken` now returns `boolean`; on failure, keeps the setup window open instead of reporting success
- Config reload passes `allowSetupPrompt: false` so playback reloads don't re-open the setup window
- Add regression test for persistence-failure path

* fix: suppress known highlights for subtitle particles

* fix: retry transient AniList safeStorage failures

* fix: hide overlay focus ring

* fix: align Hyprland fullscreen overlays

* fix: restore subtitle playback keybindings

* fix: align Hyprland overlay windows to mpv and stop pinning them

- Force-apply exact Hyprland move/resize/setprop dispatches when bounds are provided
- Stop pinning overlay windows; toggle pin off when Hyprland reports pinned=true
- Compensate stats overlay outer placement for Electron/Wayland content insets
- Make stats overlay window and page opaque so mpv cannot show through transparent insets
- Constrain stats app to h-screen with internal scroll so content covers mpv from y=0
- Lock overlay/stats window titles against page-title-updated events
- Add regression coverage for placement dispatches, inset compensation, and CSS overlay mode

* fix: retain frequency rank for honorific prefix-noun tokens

- Add `shouldAllowHonorificPrefixNounFrequency` to exempt お/ご/御 + noun merged tokens from frequency exclusion
- Add regression test for `ご機嫌` asserting rank 5484 is preserved after MeCab enrichment and annotation
- Close TASK-341

* fix: map openCharacterDictionary session action to --open-character-dict

- Add missing Lua CLI dispatch entry for openCharacterDictionary
- Add regression test for Alt+Meta+A binding and CLI flag forwarding

* fix: keep macOS overlay interactive while mpv remains active

- Overlay no longer hides or becomes click-through during tracker refreshes when mpv is the focused window
- Preserve already-visible overlay when tracker is temporarily not ready but mpv target signal is active
- Add regression tests for active-mpv tracker refresh and transient tracker-not-ready paths

* fix: address coderabbit subtitle follow-ups

* fix: resolve media detail from sessions when lifetime summary is absent

- Change `getMediaDetail` JOIN to LEFT JOIN on `imm_lifetime_media` and fall back to aggregated session metrics when no lifetime row exists
- Add filter `AND (lm.video_id IS NOT NULL OR s.session_id IS NOT NULL)` to keep results valid
- Add regression test covering the session-visible / media-detail-missing mismatch

* fix: address PR-57 CodeRabbit findings and CI failures

- use filtered word counts in media detail session token aggregation
- cancel fullscreen refresh burst on exit via updateLinuxMpvFullscreenOverlayRefreshBurst
- guard Hyprland JSON.parse in try/catch; exclude windowtitle from geometry events
- narrow focus suppression from :focus to :focus-visible
- apply JLPT lock selectors to word-name-match tokens (N1–N5)

* fix: macOS overlay z-order and Yomitan compound token known highlighting

- Release always-on-top when tracked mpv loses foreground on macOS
- Skip visible overlay blur restacking on macOS to avoid covering unrelated windows
- Prefer Yomitan internal parse tokens over fragmented scanner output for known-word decisions
- Add regression tests for both behaviors

* fix: macOS visible-overlay blur no longer invokes Windows-only blur call

- Split win32/darwin branches in handleOverlayWindowBlurred so darwin visible blur returns early without calling onWindowsVisibleOverlayBlur
- Add regression test asserting Windows callback stays inactive on macOS visible overlay blur
- Close TASK-347
This commit is contained in:
2026-05-12 12:08:09 -07:00
committed by GitHub
parent b68d17614d
commit 430373f010
176 changed files with 8174 additions and 569 deletions
@@ -533,7 +533,7 @@ test('requestYomitanTermFrequencies caches repeated term+reading lookups', async
assert.equal(frequencyCalls, 1);
});
test('requestYomitanScanTokens uses left-to-right termsFind scanning instead of parseText', async () => {
test('requestYomitanScanTokens prefers parseText tokenization over termsFind fragments', async () => {
const scripts: string[] = [];
const deps = createDeps(async (script) => {
scripts.push(script);
@@ -549,6 +549,138 @@ test('requestYomitanScanTokens uses left-to-right termsFind scanning instead of
],
};
}
if (script.includes('parseText')) {
return [
{
source: 'scanning-parser',
index: 0,
content: [
[
{
text: '取り組んで',
reading: 'とりくんで',
headwords: [[{ term: '取り組む' }]],
},
],
],
},
];
}
return [
{
surface: '取り',
reading: 'とり',
headword: '取る',
startPos: 0,
endPos: 2,
},
{
surface: '組んで',
reading: 'くんで',
headword: '組む',
startPos: 2,
endPos: 5,
},
];
});
const result = await requestYomitanScanTokens('取り組んで', deps, {
error: () => undefined,
});
assert.deepEqual(result, [
{
surface: '取り組んで',
reading: 'とりくんで',
headword: '取り組む',
startPos: 0,
endPos: 5,
},
]);
assert.ok(scripts.some((script) => script.includes('parseText')));
assert.ok(scripts.some((script) => script.includes('termsFind')));
});
test('requestYomitanScanTokens keeps scanner metadata when parse spans agree', async () => {
const deps = createDeps(async (script) => {
if (script.includes('optionsGetFull')) {
return {
profileCurrent: 0,
profiles: [
{
options: {
scanning: { length: 40 },
},
},
],
};
}
if (script.includes('parseText')) {
return [
{
source: 'scanning-parser',
index: 0,
content: [
[
{
text: 'アクア',
reading: 'あくあ',
headwords: [[{ term: 'アクア' }]],
},
],
],
},
];
}
return [
{
surface: 'アクア',
reading: 'あくあ',
headword: 'アクア',
startPos: 0,
endPos: 3,
isNameMatch: true,
wordClasses: ['n'],
},
];
});
const result = await requestYomitanScanTokens('アクア', deps, {
error: () => undefined,
});
assert.deepEqual(result, [
{
surface: 'アクア',
reading: 'あくあ',
headword: 'アクア',
startPos: 0,
endPos: 3,
isNameMatch: true,
wordClasses: ['n'],
},
]);
});
test('requestYomitanScanTokens falls back to left-to-right termsFind scanning', async () => {
const scripts: string[] = [];
const deps = createDeps(async (script) => {
scripts.push(script);
if (script.includes('optionsGetFull')) {
return {
profileCurrent: 0,
profiles: [
{
options: {
scanning: { length: 40 },
},
},
],
};
}
if (script.includes('parseText')) {
return [];
}
return [
{
surface: 'カズマ',
@@ -573,6 +705,7 @@ test('requestYomitanScanTokens uses left-to-right termsFind scanning instead of
endPos: 3,
},
]);
assert.ok(scripts.some((script) => script.includes('parseText')));
const scannerScript = scripts.find((script) => script.includes('termsFind'));
assert.ok(scannerScript, 'expected termsFind scanning request script');
assert.doesNotMatch(scannerScript ?? '', /parseText/);
@@ -891,6 +1024,105 @@ test('requestYomitanScanTokens can use frequency from later exact secondary-matc
]);
});
test('requestYomitanScanTokens uses exact frequency entry when selected reading differs', async () => {
let scannerScript = '';
const deps = createDeps(async (script) => {
if (script.includes('termsFind')) {
scannerScript = script;
return [];
}
if (script.includes('optionsGetFull')) {
return {
profileCurrent: 0,
profileIndex: 0,
scanLength: 40,
dictionaries: ['JPDBv2㋕', 'Jiten', 'CC100'],
dictionaryPriorityByName: {
'JPDBv2㋕': 0,
Jiten: 1,
CC100: 2,
},
dictionaryFrequencyModeByName: {
'JPDBv2㋕': 'rank-based',
Jiten: 'rank-based',
CC100: 'rank-based',
},
profiles: [
{
options: {
scanning: { length: 40 },
dictionaries: [
{ name: 'JPDBv2㋕', enabled: true, id: 0 },
{ name: 'Jiten', enabled: true, id: 1 },
{ name: 'CC100', enabled: true, id: 2 },
],
},
},
],
};
}
return null;
});
await requestYomitanScanTokens('第二走者', deps, {
error: () => undefined,
});
const result = (await runInjectedYomitanScript(scannerScript, (action, params) => {
if (action !== 'termsFind') {
throw new Error(`unexpected action: ${action}`);
}
const text = (params as { text?: string } | undefined)?.text ?? '';
if (!text.startsWith('第二')) {
return { originalTextLength: 0, dictionaryEntries: [] };
}
return {
originalTextLength: 2,
dictionaryEntries: [
{
headwords: [
{
term: '第二',
reading: 'だいに',
sources: [{ originalText: '第二', isPrimary: true, matchType: 'exact' }],
},
],
frequencies: [],
},
{
headwords: [
{
term: '第二',
reading: '',
sources: [{ originalText: '第二', isPrimary: false, matchType: 'exact' }],
},
],
frequencies: [
{
headwordIndex: 0,
dictionary: 'JPDBv2㋕',
frequency: 189513,
displayValue: '1820,189513句',
},
],
},
],
};
})) as Array<Record<string, unknown>>;
assert.deepEqual(result?.[0], {
surface: '第二',
reading: 'だいに',
headword: '第二',
startPos: 0,
endPos: 2,
isNameMatch: false,
frequencyRank: 1820,
});
});
test('requestYomitanScanTokens marks tokens backed by SubMiner character dictionary entries', async () => {
const deps = createDeps(async (script) => {
if (script.includes('optionsGetFull')) {
@@ -1049,6 +1281,60 @@ test('requestYomitanScanTokens marks grouped entries when SubMiner dictionary al
assert.equal((result as Array<{ isNameMatch?: boolean }>)[0]?.isNameMatch, true);
});
test('requestYomitanScanTokens preserves matched headword word classes', async () => {
let scannerScript = '';
const deps = createDeps(async (script) => {
if (script.includes('termsFind')) {
scannerScript = script;
return [];
}
if (script.includes('optionsGetFull')) {
return {
profileCurrent: 0,
profiles: [
{
options: {
scanning: { length: 40 },
},
},
],
};
}
return null;
});
await requestYomitanScanTokens('は', deps, { error: () => undefined });
const result = await runInjectedYomitanScript(scannerScript, (action, params) => {
if (action !== 'termsFind') {
throw new Error(`unexpected action: ${action}`);
}
const text = (params as { text?: string } | undefined)?.text;
if (text !== 'は') {
return { originalTextLength: 0, dictionaryEntries: [] };
}
return {
originalTextLength: 1,
dictionaryEntries: [
{
headwords: [
{
term: 'は',
reading: 'は',
wordClasses: ['prt'],
sources: [{ originalText: 'は', isPrimary: true, matchType: 'exact' }],
},
],
},
],
};
});
assert.deepEqual((result as Array<{ wordClasses?: string[] }>)[0]?.wordClasses, ['prt']);
});
test('requestYomitanScanTokens skips fallback fragments without exact primary source matches', async () => {
const deps = createDeps(async (script) => {
if (script.includes('optionsGetFull')) {