feat(tokenizer): use Yomitan word classes for subtitle POS filtering

- Carry matched headword wordClasses from termsFind into YomitanScanToken
- Map recognized Yomitan wordClasses to SubMiner coarse POS before annotation
- MeCab enrichment now fills only missing POS fields, preserving existing coarse pos1
- Exclude standalone grammar particles, して helper fragments, and single-kana surfaces from annotations
- Respect source-text punctuation gaps when counting N+1 sentence words
- Preserve known-word highlight on excluded kanji-containing tokens
- Add backlog tasks 304 (N+1 boundary bug) and 305 (wordClasses POS, done)
2026-04-25 23:08:33 -07:00
parent d8934647a9
commit 6b7d0553a7
11 changed files with 926 additions and 39 deletions


@@ -0,0 +1,27 @@
---
id: TASK-304
title: Fix N+1 sentence boundary counting across Yomitan punctuation gaps
status: In Progress
assignee: []
created_date: '2026-04-26 05:33'
labels:
- bug
- tokenizer
- annotations
dependencies: []
priority: medium
---
## Description
<!-- SECTION:DESCRIPTION:BEGIN -->
N+1 target selection should respect sentence-ending punctuation from the original subtitle text even when Yomitan token output omits punctuation tokens. Current behavior can treat multiple subtitle sentences as one token span and incorrectly satisfy the minimum content-token threshold.
<!-- SECTION:DESCRIPTION:END -->
## Acceptance Criteria
<!-- AC:BEGIN -->
- [ ] #1 A subtitle like `てんめ!ふざけんなよ!` does not mark a single-content-token second sentence such as `ふざけん` as N+1 when the minimum sentence word count is 3.
- [ ] #2 N+1 sentence segmentation uses original subtitle text offsets or equivalent source-boundary data, not only punctuation tokens returned by Yomitan.
- [ ] #3 Existing annotation exclusion behavior for particles/grammar tokens remains unchanged.
- [ ] #4 Regression tests cover Yomitan-style token streams where punctuation is absent from the token list.
<!-- AC:END -->
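The acceptance criteria above can be sketched as follows. This is a hypothetical illustration, not SubMiner's actual API: the names `splitSourceSentences`, `sentenceMeetsThreshold`, and `MIN_SENTENCE_WORDS`, the token shape, and the punctuation set are all assumptions. The point is that sentence spans come from the original subtitle text, so missing punctuation tokens in the Yomitan stream cannot merge two sentences.

```typescript
// Assumed sentence-ending punctuation for Japanese subtitles (illustrative set).
const SENTENCE_ENDERS = /[。！？!?]/;

/** Split a subtitle line into sentence spans using source-text punctuation,
 *  independent of whether the tokenizer emitted punctuation tokens. */
function splitSourceSentences(line: string): { start: number; end: number }[] {
  const spans: { start: number; end: number }[] = [];
  let start = 0;
  for (let i = 0; i < line.length; i++) {
    if (SENTENCE_ENDERS.test(line[i])) {
      if (i > start) spans.push({ start, end: i }); // exclude the punctuation mark
      start = i + 1;
    }
  }
  if (start < line.length) spans.push({ start, end: line.length });
  return spans;
}

/** A token with its character offset in the original line (Yomitan-style,
 *  punctuation possibly absent). */
interface Token { surface: string; offset: number; }

// Hypothetical threshold matching AC #1.
const MIN_SENTENCE_WORDS = 3;

/** N+1 eligibility is checked per source sentence span, not per token stream. */
function sentenceMeetsThreshold(
  tokens: Token[],
  span: { start: number; end: number },
): boolean {
  const inSpan = tokens.filter(t => t.offset >= span.start && t.offset < span.end);
  return inSpan.length >= MIN_SENTENCE_WORDS;
}
```

With this shape, `てんめ!ふざけんなよ!` yields two spans, and the second span's token count is evaluated on its own, so it falls below the threshold even though no punctuation tokens separate the sentences.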


@@ -0,0 +1,55 @@
---
id: TASK-305
title: Use Yomitan word classes for subtitle token POS filtering
status: Done
assignee: []
created_date: '2026-04-26 05:56'
updated_date: '2026-04-26 05:59'
labels:
- tokenizer
- yomitan
dependencies: []
priority: medium
---
## Description
<!-- SECTION:DESCRIPTION:BEGIN -->
Subtitle annotation filtering currently uses Yomitan token spans, then enriches those spans by running MeCab over the full normalized subtitle line. Add support for carrying Yomitan headword wordClasses from termsFind into SubMiner tokens so dictionary-backed tokens can provide coarse POS/tag metadata without vendored Yomitan changes. MeCab whole-line enrichment should remain a fallback/source of detailed POS data when Yomitan classes are absent.
<!-- SECTION:DESCRIPTION:END -->
## Acceptance Criteria
<!-- AC:BEGIN -->
- [x] #1 Yomitan scanner tokens preserve matched headword wordClasses when termsFind returns them.
- [x] #2 Subtitle tokenization maps recognized Yomitan wordClasses to coarse PartOfSpeech/POS metadata before annotation filtering.
- [x] #3 Whole-line MeCab enrichment remains available for missing or more detailed POS metadata and does not break existing subtitle annotation behavior.
- [x] #4 Focused tokenizer tests cover wordClasses extraction and POS mapping.
<!-- AC:END -->
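A minimal sketch of AC #1, carrying the matched headword's wordClasses onto the scan token. The `termsFind` result shape here is an assumption for illustration (the real Yomitan API returns richer dictionary entries), and `toScanToken` is a hypothetical helper, not the app's actual scanner code.

```typescript
// Assumed, simplified shapes for dictionary lookup results.
interface TermHeadword { term: string; wordClasses: string[]; }
interface TermEntry { headwords: TermHeadword[]; }

interface YomitanScanToken {
  text: string;
  /** Word classes of the matched headword, when the dictionary provides them. */
  wordClasses?: string[];
}

/** Attach the matched headword's wordClasses to the scan token, if present;
 *  tokens without dictionary classes fall through to MeCab enrichment. */
function toScanToken(text: string, entries: TermEntry[]): YomitanScanToken {
  const matched = entries
    .flatMap(e => e.headwords)
    .find(h => h.term === text && h.wordClasses.length > 0);
  return matched ? { text, wordClasses: matched.wordClasses } : { text };
}
```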
## Implementation Plan
<!-- SECTION:PLAN:BEGIN -->
1. Add focused regression coverage for Yomitan scanner wordClasses payload and subtitle POS mapping.
2. Extend the app-owned Yomitan scanner payload to carry matched headword wordClasses when present.
3. Map recognized Yomitan wordClasses to SubMiner coarse PartOfSpeech/POS metadata before annotation filtering.
4. Keep MeCab whole-line enrichment as fallback/detail-fill for missing POS fields.
5. Run focused tokenizer tests and typecheck.
<!-- SECTION:PLAN:END -->
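Step 4's fill-only-missing merge can be sketched like this. The `pos1`/`pos2`/`pos3` field names follow the task notes; the token shape and `enrichWithMecab` name are illustrative assumptions, not the real tokenizer types.

```typescript
// Assumed, simplified POS field shape shared by tokens and MeCab output.
interface PosFields { pos1?: string; pos2?: string; pos3?: string; }

/** Fill missing POS fields from MeCab while preserving any existing coarse
 *  pos1 derived from Yomitan wordClasses. */
function enrichWithMecab(token: PosFields, mecab: PosFields): PosFields {
  // Tokens with complete pos1/pos2/pos3 are skipped entirely.
  if (token.pos1 && token.pos2 && token.pos3) return token;
  return {
    pos1: token.pos1 ?? mecab.pos1, // existing coarse pos1 wins
    pos2: token.pos2 ?? mecab.pos2,
    pos3: token.pos3 ?? mecab.pos3,
  };
}
```

The asymmetry is deliberate: Yomitan-derived coarse classes are never overwritten, while MeCab remains the source of detail where the dictionary said nothing.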
## Implementation Notes
<!-- SECTION:NOTES:BEGIN -->
Implemented app-only wordClasses extraction from termsFind results; no vendored Yomitan changes required. Recognized classes currently map prt, aux, v*, adj-i/adj-ix, adj-na, and noun-like classes to SubMiner POS metadata. MeCab enrichment now skips only tokens with complete pos1/pos2/pos3 and otherwise fills missing fields while preserving existing coarse pos1.
Verification:
- bun test src/core/services/tokenizer/yomitan-parser-runtime.test.ts src/core/services/tokenizer.test.ts
- bun run typecheck
<!-- SECTION:NOTES:END -->
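The class mapping described in the notes above can be sketched as a small lookup. The class list (prt, aux, v*, adj-i/adj-ix, adj-na, noun-like) comes from the notes; the `CoarsePos` names and `mapWordClass` helper are illustrative, not SubMiner's actual enum or function.

```typescript
// Hypothetical coarse POS categories for annotation filtering.
type CoarsePos = "particle" | "auxiliary" | "verb" | "adjective" | "noun";

/** Map a recognized Yomitan/JMdict-style word class to a coarse POS;
 *  unrecognized classes return undefined and fall through to MeCab. */
function mapWordClass(wc: string): CoarsePos | undefined {
  if (wc === "prt") return "particle";
  if (wc === "aux" || wc.startsWith("aux-")) return "auxiliary";
  if (wc.startsWith("v")) return "verb"; // v1, v5r, vs, ...
  if (wc === "adj-i" || wc === "adj-ix" || wc === "adj-na") return "adjective";
  if (wc === "n" || wc.startsWith("n-")) return "noun";
  return undefined;
}
```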
## Final Summary
<!-- SECTION:FINAL_SUMMARY:BEGIN -->
Implemented app-only Yomitan wordClasses support for subtitle token annotation filtering. The scanner now carries matched headword wordClasses from termsFind results, the tokenizer maps recognized classes into SubMiner coarse POS metadata before annotation, and MeCab whole-line enrichment continues to fill missing detailed POS fields, all without requiring vendored Yomitan changes.
Tests run:
- bun test src/core/services/tokenizer/yomitan-parser-runtime.test.ts src/core/services/tokenizer.test.ts
- bun run typecheck
Note: the working tree already had unrelated tokenizer/annotation edits and task-304 before this work; those were left intact.
<!-- SECTION:FINAL_SUMMARY:END -->