feat(tokenizer): use Yomitan word classes for subtitle POS filtering (#57)

* feat(tokenizer): use Yomitan word classes for subtitle POS filtering
  - Carry matched headword wordClasses from termsFind into YomitanScanToken
  - Map recognized Yomitan wordClasses to SubMiner coarse POS before annotation
  - MeCab enrichment now fills only missing POS fields, preserving existing coarse pos1
  - Exclude standalone grammar particles, して helper fragments, and single-kana surfaces from annotations
  - Respect source-text punctuation gaps when counting N+1 sentence words
  - Preserve known-word highlight on excluded kanji-containing tokens
  - Add backlog tasks 304 (N+1 boundary bug) and 305 (wordClasses POS, done)
* fix(tokenizer): preserve annotation and enrichment behavior
* fix: restore jlpt subtitle underlines
* fix: exclude kana-only n+1 targets
* fix: refresh overlay on Hyprland fullscreen
* fix: address fullscreen and n-plus-one review notes
* fix: address CodeRabbit review comments
* fix: accept modified digits for multi-line sentence mining
* Cancel pending Linux MPV fullscreen overlay refresh bursts
  - return a cancel handle from the Linux refresh burst scheduler
  - clear pending refresh bursts when overlays hide or windows close
  - tighten the burst test polling to wait for the async refresh
* fix: suppress N+1 for kana-only candidates and fix minSentenceWords coun
  - Treat kana-only tokens with surrounding subtitle punctuation (…, ―, etc.) as kana-only so they are not promoted to N+1 targets
  - Exclude unknown tokens filtered from N+1 targeting from the minSentenceWords count so filtered kana-only unknowns cannot satisfy sentence length threshold
  - Add regression tests for kana-only candidate suppression and filtered-unknown padding cases
* Suppress subtitle annotations for grammar fragments
  - Hide annotation metadata for auxiliary inflection and ja-nai endings
  - Preserve lexical `くれる` forms and add regression coverage
* Fix kana-only N+1 tokenizer regression test
  - Use a pure-kana fixture for the subtitle token N+1 case
  - Update task notes for the latest CodeRabbit follow-up
* Fix managed playback exit and tokenizer grammar splits
  - Ignore background stats daemons during regular app startup
  - Split standalone grammar endings before applying annotations
  - Clear helper-span annotations for auxiliary-only tokens
* fix: refresh current subtitle after known-word mining
* fix: suppress sigh interjection annotations
* fix: preserve jlpt underline color after lookup
* Replace grammar-ending permutations with shared matcher; preserve word a
  - Extract `grammar-ending.ts` with `isStandaloneGrammarEndingText` / `isSubtitleGrammarEndingText` pattern matchers
  - Replace `STANDALONE_GRAMMAR_ENDINGS` set in parser-selection-stage with shared matcher
  - Replace generated phrase sets in subtitle-annotation-filter with shared matcher
  - Remove stale duplicate subtitle-exclusion constants and helpers from annotation-stage
  - Manual clipboard card updates now write only to the sentence audio field, leaving word/expression audio untouched
* fix: CI changelog, annotation options threading, and Jellyfin quit
  - Add `type: fixed` / `area:` frontmatter to `changes/319` to pass `changelog:lint`
  - Thread `TokenizerAnnotationOptions` through `stripSubtitleAnnotationMetadata` so `sourceText` is honored
  - Include `jellyfinPlay` in `shouldQuitOnDisconnectWhenOverlayRuntimeInitialized` predicate
  - Make mouse test `elementFromPoint` stubs coordinate-sensitive
  - Make Lua test `.tmp` mkdir portable on Windows
* Preserve overlay across macOS flaps and mpv playlist changes
  - keep visible overlays alive during transient macOS tracker loss
  - reuse the running mpv overlay path on playlist navigation
  - update regression coverage and changelog fragments
* fix: restore stats daemon deferral
* fix: keep subtitle prefetch alive after cache hits
* Fix JLPT underline color drift and AniList skipped-threshold sync
  - Replace JLPT `text-decoration` underlines with `border-bottom` so Chromium selection/hover cannot repaint them to another annotation's color
  - Lock JLPT underline color for combined annotation selectors (known, n+1, frequency) and character hover/selection states
  - Trigger AniList post-watch check on every mpv time-position update to catch skipped completion thresholds
  - Fall back to filename-parser season/episode when guessit omits them
* fix: address coderabbit feedback
* fix: sync AniList after seeked completion
* fix: preserve ordinal frequency annotations
* fix: preserve known highlighting for filtered tokens
* fix: address PR #57 CodeRabbit feedback
  - Acquire AniList post-watch in-flight lock before async gating to prevent duplicate writes
  - Isolate manual watched mark result from AniList post-watch callback failures
  - Report known-word cache clears as mutations during immediate append when state existed
  - Add regression tests for each fix
* fix: stop AniList setup reopening on Linux when keyring token exists
  - Gate setup success on token persistence: `saveToken` now returns `boolean`; on failure, keeps the setup window open instead of reporting success
  - Config reload passes `allowSetupPrompt: false` so playback reloads don't re-open the setup window
  - Add regression test for persistence-failure path
* fix: suppress known highlights for subtitle particles
* fix: retry transient AniList safeStorage failures
* fix: hide overlay focus ring
* fix: align Hyprland fullscreen overlays
* fix: restore subtitle playback keybindings
* fix: align Hyprland overlay windows to mpv and stop pinning them
  - Force-apply exact Hyprland move/resize/setprop dispatches when bounds are provided
  - Stop pinning overlay windows; toggle pin off when Hyprland reports pinned=true
  - Compensate stats overlay outer placement for Electron/Wayland content insets
  - Make stats overlay window and page opaque so mpv cannot show through transparent insets
  - Constrain stats app to h-screen with internal scroll so content covers mpv from y=0
  - Lock overlay/stats window titles against page-title-updated events
  - Add regression coverage for placement dispatches, inset compensation, and CSS overlay mode
* fix: retain frequency rank for honorific prefix-noun tokens
  - Add `shouldAllowHonorificPrefixNounFrequency` to exempt お/ご/御 + noun merged tokens from frequency exclusion
  - Add regression test for `ご機嫌` asserting rank 5484 is preserved after MeCab enrichment and annotation
  - Close TASK-341
* fix: map openCharacterDictionary session action to --open-character-dict
  - Add missing Lua CLI dispatch entry for openCharacterDictionary
  - Add regression test for Alt+Meta+A binding and CLI flag forwarding
* fix: keep macOS overlay interactive while mpv remains active
  - Overlay no longer hides or becomes click-through during tracker refreshes when mpv is the focused window
  - Preserve already-visible overlay when tracker is temporarily not ready but mpv target signal is active
  - Add regression tests for active-mpv tracker refresh and transient tracker-not-ready paths
* fix: address coderabbit subtitle follow-ups
* fix: resolve media detail from sessions when lifetime summary is absent
  - Change `getMediaDetail` JOIN to LEFT JOIN on `imm_lifetime_media` and fall back to aggregated session metrics when no lifetime row exists
  - Add filter `AND (lm.video_id IS NOT NULL OR s.session_id IS NOT NULL)` to keep results valid
  - Add regression test covering the session-visible / media-detail-missing mismatch
* fix: address PR-57 CodeRabbit findings and CI failures
  - use filtered word counts in media detail session token aggregation
  - cancel fullscreen refresh burst on exit via updateLinuxMpvFullscreenOverlayRefreshBurst
  - guard Hyprland JSON.parse in try/catch; exclude windowtitle from geometry events
  - narrow focus suppression from :focus to :focus-visible
  - apply JLPT lock selectors to word-name-match tokens (N1–N5)
* fix: macOS overlay z-order and Yomitan compound token known highlighting
  - Release always-on-top when tracked mpv loses foreground on macOS
  - Skip visible overlay blur restacking on macOS to avoid covering unrelated windows
  - Prefer Yomitan internal parse tokens over fragmented scanner output for known-word decisions
  - Add regression tests for both behaviors
* fix: macOS visible-overlay blur no longer invokes Windows-only blur call
  - Split win32/darwin branches in handleOverlayWindowBlurred so darwin visible blur returns early without calling onWindowsVisibleOverlayBlur
  - Add regression test asserting Windows callback stays inactive on macOS visible overlay blur
  - Close TASK-347
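
The first entry's two POS rules (map recognized Yomitan wordClasses to a coarse POS, then let MeCab fill only what is still missing) can be sketched roughly as below. This is a minimal illustration, not the code from this commit: `CoarsePos`, `PosFields`, `coarsePosFromWordClasses`, `fillMissingPos`, and the tag table are all hypothetical names. Only the `prt` and `n` tags appear in the tests of this diff; the other tags are assumed for illustration.

```typescript
// Hypothetical coarse POS labels; SubMiner's real values may differ.
type CoarsePos = 'noun' | 'verb' | 'adjective' | 'particle';

interface PosFields {
  pos1?: CoarsePos;
  pos2?: string;
}

// Assumed tag table. Only 'prt' and 'n' are taken from this commit's tests;
// the remaining tags are illustrative guesses.
const WORD_CLASS_TO_POS = new Map<string, CoarsePos>([
  ['prt', 'particle'],
  ['n', 'noun'],
  ['v1', 'verb'],
  ['v5', 'verb'],
  ['adj-i', 'adjective'],
]);

// Return the coarse POS of the first recognized word class, if any.
function coarsePosFromWordClasses(wordClasses?: readonly string[]): CoarsePos | undefined {
  for (const wordClass of wordClasses ?? []) {
    const pos = WORD_CLASS_TO_POS.get(wordClass);
    if (pos !== undefined) return pos;
  }
  return undefined;
}

// Enrichment fills only the fields that are still undefined,
// so a pos1 that came from Yomitan is never overwritten.
function fillMissingPos(existing: PosFields, mecab: PosFields): PosFields {
  return {
    pos1: existing.pos1 ?? mecab.pos1,
    pos2: existing.pos2 ?? mecab.pos2,
  };
}

// Usage: the Yomitan-derived pos1 survives, pos2 is taken from MeCab.
const fromYomitan: PosFields = { pos1: coarsePosFromWordClasses(['prt']) };
const enriched = fillMissingPos(fromYomitan, { pos1: 'noun', pos2: 'detail' });
```

Keeping the merge as a pure "first defined value wins" function makes the precedence between the two sources explicit and easy to cover with regression tests.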
2026-07-30 07:21:32 -07:00 · 2026-05-12 12:08:09 -07:00
parent b68d17614d
commit 430373f010
176 changed files with 8174 additions and 569 deletions
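
The tests changed in the diff below describe three outcomes for `requestYomitanScanTokens`: parseText spans replace fragmented termsFind output, scanner metadata is kept when both sources agree on a span, and termsFind results are used alone when parseText returns nothing. A rough sketch of that selection rule, assuming a hypothetical `preferParseTokens` helper and a simplified `ScanToken` shape (neither is the commit's actual implementation):

```typescript
// Simplified token shape based on the fields asserted in the tests;
// the real YomitanScanToken may carry more fields.
interface ScanToken {
  surface: string;
  reading: string;
  headword: string;
  startPos: number;
  endPos: number;
  isNameMatch?: boolean;
  wordClasses?: string[];
}

// Hypothetical helper: choose tokens span by span.
function preferParseTokens(parseTokens: ScanToken[], scannerTokens: ScanToken[]): ScanToken[] {
  // No parse result: fall back to left-to-right scanner output unchanged.
  if (parseTokens.length === 0) return scannerTokens;

  return parseTokens.map((parsed) => {
    // When the scanner found the exact same span, keep its richer metadata
    // (isNameMatch, wordClasses); otherwise the parse token wins.
    const sameSpan = scannerTokens.find(
      (scanned) => scanned.startPos === parsed.startPos && scanned.endPos === parsed.endPos,
    );
    return sameSpan ?? parsed;
  });
}

// Usage: two scanner fragments are replaced by the single compound span.
const merged = preferParseTokens(
  [{ surface: '取り組んで', reading: 'とりくんで', headword: '取り組む', startPos: 0, endPos: 5 }],
  [
    { surface: '取り', reading: 'とり', headword: '取る', startPos: 0, endPos: 2 },
    { surface: '組んで', reading: 'くんで', headword: '組む', startPos: 2, endPos: 5 },
  ],
);
```

Matching on `startPos`/`endPos` alone is a simplification; a stricter variant could also compare `surface` before reusing scanner metadata.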
@@ -533,7 +533,7 @@ test('requestYomitanTermFrequencies caches repeated term+reading lookups', async
  assert.equal(frequencyCalls, 1);
 });

-test('requestYomitanScanTokens uses left-to-right termsFind scanning instead of parseText', async () => {
+test('requestYomitanScanTokens prefers parseText tokenization over termsFind fragments', async () => {
  const scripts: string[] = [];
  const deps = createDeps(async (script) => {
    scripts.push(script);
@@ -549,6 +549,138 @@ test('requestYomitanScanTokens uses left-to-right termsFind scanning instead of
        ],
      };
    }
+    if (script.includes('parseText')) {
+      return [
+        {
+          source: 'scanning-parser',
+          index: 0,
+          content: [
+            [
+              {
+                text: '取り組んで',
+                reading: 'とりくんで',
+                headwords: [[{ term: '取り組む' }]],
+              },
+            ],
+          ],
+        },
+      ];
+    }
+    return [
+      {
+        surface: '取り',
+        reading: 'とり',
+        headword: '取る',
+        startPos: 0,
+        endPos: 2,
+      },
+      {
+        surface: '組んで',
+        reading: 'くんで',
+        headword: '組む',
+        startPos: 2,
+        endPos: 5,
+      },
+    ];
+  });
+
+  const result = await requestYomitanScanTokens('取り組んで', deps, {
+    error: () => undefined,
+  });
+
+  assert.deepEqual(result, [
+    {
+      surface: '取り組んで',
+      reading: 'とりくんで',
+      headword: '取り組む',
+      startPos: 0,
+      endPos: 5,
+    },
+  ]);
+  assert.ok(scripts.some((script) => script.includes('parseText')));
+  assert.ok(scripts.some((script) => script.includes('termsFind')));
+});
+
+test('requestYomitanScanTokens keeps scanner metadata when parse spans agree', async () => {
+  const deps = createDeps(async (script) => {
+    if (script.includes('optionsGetFull')) {
+      return {
+        profileCurrent: 0,
+        profiles: [
+          {
+            options: {
+              scanning: { length: 40 },
+            },
+          },
+        ],
+      };
+    }
+    if (script.includes('parseText')) {
+      return [
+        {
+          source: 'scanning-parser',
+          index: 0,
+          content: [
+            [
+              {
+                text: 'アクア',
+                reading: 'あくあ',
+                headwords: [[{ term: 'アクア' }]],
+              },
+            ],
+          ],
+        },
+      ];
+    }
+    return [
+      {
+        surface: 'アクア',
+        reading: 'あくあ',
+        headword: 'アクア',
+        startPos: 0,
+        endPos: 3,
+        isNameMatch: true,
+        wordClasses: ['n'],
+      },
+    ];
+  });
+
+  const result = await requestYomitanScanTokens('アクア', deps, {
+    error: () => undefined,
+  });
+
+  assert.deepEqual(result, [
+    {
+      surface: 'アクア',
+      reading: 'あくあ',
+      headword: 'アクア',
+      startPos: 0,
+      endPos: 3,
+      isNameMatch: true,
+      wordClasses: ['n'],
+    },
+  ]);
+});
+
+test('requestYomitanScanTokens falls back to left-to-right termsFind scanning', async () => {
+  const scripts: string[] = [];
+  const deps = createDeps(async (script) => {
+    scripts.push(script);
+    if (script.includes('optionsGetFull')) {
+      return {
+        profileCurrent: 0,
+        profiles: [
+          {
+            options: {
+              scanning: { length: 40 },
+            },
+          },
+        ],
+      };
+    }
+    if (script.includes('parseText')) {
+      return [];
+    }
    return [
      {
        surface: 'カズマ',
@@ -573,6 +705,7 @@ test('requestYomitanScanTokens uses left-to-right termsFind scanning instead of
      endPos: 3,
    },
  ]);
+  assert.ok(scripts.some((script) => script.includes('parseText')));
  const scannerScript = scripts.find((script) => script.includes('termsFind'));
  assert.ok(scannerScript, 'expected termsFind scanning request script');
  assert.doesNotMatch(scannerScript ?? '', /parseText/);
@@ -891,6 +1024,105 @@ test('requestYomitanScanTokens can use frequency from later exact secondary-matc
  ]);
 });

+test('requestYomitanScanTokens uses exact frequency entry when selected reading differs', async () => {
+  let scannerScript = '';
+  const deps = createDeps(async (script) => {
+    if (script.includes('termsFind')) {
+      scannerScript = script;
+      return [];
+    }
+    if (script.includes('optionsGetFull')) {
+      return {
+        profileCurrent: 0,
+        profileIndex: 0,
+        scanLength: 40,
+        dictionaries: ['JPDBv2㋕', 'Jiten', 'CC100'],
+        dictionaryPriorityByName: {
+          'JPDBv2㋕': 0,
+          Jiten: 1,
+          CC100: 2,
+        },
+        dictionaryFrequencyModeByName: {
+          'JPDBv2㋕': 'rank-based',
+          Jiten: 'rank-based',
+          CC100: 'rank-based',
+        },
+        profiles: [
+          {
+            options: {
+              scanning: { length: 40 },
+              dictionaries: [
+                { name: 'JPDBv2㋕', enabled: true, id: 0 },
+                { name: 'Jiten', enabled: true, id: 1 },
+                { name: 'CC100', enabled: true, id: 2 },
+              ],
+            },
+          },
+        ],
+      };
+    }
+    return null;
+  });
+
+  await requestYomitanScanTokens('第二走者', deps, {
+    error: () => undefined,
+  });
+
+  const result = (await runInjectedYomitanScript(scannerScript, (action, params) => {
+    if (action !== 'termsFind') {
+      throw new Error(`unexpected action: ${action}`);
+    }
+
+    const text = (params as { text?: string } | undefined)?.text ?? '';
+    if (!text.startsWith('第二')) {
+      return { originalTextLength: 0, dictionaryEntries: [] };
+    }
+
+    return {
+      originalTextLength: 2,
+      dictionaryEntries: [
+        {
+          headwords: [
+            {
+              term: '第二',
+              reading: 'だいに',
+              sources: [{ originalText: '第二', isPrimary: true, matchType: 'exact' }],
+            },
+          ],
+          frequencies: [],
+        },
+        {
+          headwords: [
+            {
+              term: '第二',
+              reading: '',
+              sources: [{ originalText: '第二', isPrimary: false, matchType: 'exact' }],
+            },
+          ],
+          frequencies: [
+            {
+              headwordIndex: 0,
+              dictionary: 'JPDBv2㋕',
+              frequency: 189513,
+              displayValue: '1820,189513句',
+            },
+          ],
+        },
+      ],
+    };
+  })) as Array<Record<string, unknown>>;
+
+  assert.deepEqual(result?.[0], {
+    surface: '第二',
+    reading: 'だいに',
+    headword: '第二',
+    startPos: 0,
+    endPos: 2,
+    isNameMatch: false,
+    frequencyRank: 1820,
+  });
+});
+
 test('requestYomitanScanTokens marks tokens backed by SubMiner character dictionary entries', async () => {
  const deps = createDeps(async (script) => {
    if (script.includes('optionsGetFull')) {
@@ -1049,6 +1281,60 @@ test('requestYomitanScanTokens marks grouped entries when SubMiner dictionary al
  assert.equal((result as Array<{ isNameMatch?: boolean }>)[0]?.isNameMatch, true);
 });

+test('requestYomitanScanTokens preserves matched headword word classes', async () => {
+  let scannerScript = '';
+  const deps = createDeps(async (script) => {
+    if (script.includes('termsFind')) {
+      scannerScript = script;
+      return [];
+    }
+    if (script.includes('optionsGetFull')) {
+      return {
+        profileCurrent: 0,
+        profiles: [
+          {
+            options: {
+              scanning: { length: 40 },
+            },
+          },
+        ],
+      };
+    }
+    return null;
+  });
+
+  await requestYomitanScanTokens('は', deps, { error: () => undefined });
+
+  const result = await runInjectedYomitanScript(scannerScript, (action, params) => {
+    if (action !== 'termsFind') {
+      throw new Error(`unexpected action: ${action}`);
+    }
+
+    const text = (params as { text?: string } | undefined)?.text;
+    if (text !== 'は') {
+      return { originalTextLength: 0, dictionaryEntries: [] };
+    }
+
+    return {
+      originalTextLength: 1,
+      dictionaryEntries: [
+        {
+          headwords: [
+            {
+              term: 'は',
+              reading: 'は',
+              wordClasses: ['prt'],
+              sources: [{ originalText: 'は', isPrimary: true, matchType: 'exact' }],
+            },
+          ],
+        },
+      ],
+    };
+  });
+
+  assert.deepEqual((result as Array<{ wordClasses?: string[] }>)[0]?.wordClasses, ['prt']);
+});
+
 test('requestYomitanScanTokens skips fallback fragments without exact primary source matches', async () => {
  const deps = createDeps(async (script) => {
    if (script.includes('optionsGetFull')) {