| id | title | status | assignee | created_date | updated_date | labels | dependencies | priority |
|---|---|---|---|---|---|---|---|---|
| TASK-298 | Exclude kana grammar-helper merges like ことに from subtitle annotations | Done | | 2026-04-26 00:08 | 2026-04-26 00:15 | | | medium |
Description
Investigate and fix subtitle tokenizer annotation behavior where all-hiragana grammar-helper merged tokens such as ことに can be marked as N+1. Current likely path: Yomitan emits ことに with headword こと; MeCab enrichment supplies content-led POS (名詞|助詞, likely 非自立|格助詞); shared subtitle annotation filter does not exclude this family unless it matches narrower rules such as これで or explanatory endings.
Acceptance Criteria
- #1 ことに-style kana grammar-helper merges are not marked known, N+1, JLPT, or frequency-highlighted when their MeCab metadata indicates a non-independent noun plus helper particle.
- #2 Regression coverage demonstrates the reported subtitle phrase does not mark ことに as N+1 while preserving annotation for real lexical content tokens.
- #3 Existing tokenizer annotation tests pass.
Implementation Plan
Approved approach (user: "let's do it"):
- Add a regression test for the reported ことに case using Yomitan token ことに -> headword こと and MeCab metadata 名詞|助詞 / 非自立|格助詞; assert all annotation fields are stripped while nearby lexical content can still be N+1.
- Verify the new test fails before production changes.
- Update the shared subtitle annotation filter to exclude conservative kana-only grammar-helper merges: merged surface differs from headword, surface is kana-only, first POS component is 名詞, first POS2 component is 非自立, and remaining POS components are grammar helpers (助詞/助動詞).
- Run targeted tokenizer/annotation tests and update the task acceptance criteria/final notes.
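The filter rule in the plan above can be sketched as a single predicate. This is only an illustration under assumptions: the `MergedToken` shape, the field names `pos1`/`pos2`, the `|`-joined POS encoding, and the function name `isKanaGrammarHelperMerge` are hypothetical and not taken from the project's code.

```typescript
// Hypothetical token shape; the real project's token type may differ.
interface MergedToken {
  surface: string;  // text as shown in the subtitle, e.g. "ことに"
  headword: string; // dictionary form from Yomitan, e.g. "事"
  pos1: string;     // MeCab POS per merged component, "|"-joined, e.g. "名詞|助詞"
  pos2: string;     // MeCab POS subcategory per component, e.g. "非自立|格助詞"
}

// Hiragana and katakana Unicode blocks (U+3040 to U+30FF).
const KANA_ONLY = /^[\u3040-\u30ff]+$/;

// POS values treated as grammar helpers, per the plan.
const GRAMMAR_HELPER_POS = new Set<string>(["助詞", "助動詞"]);

function isKanaGrammarHelperMerge(token: MergedToken): boolean {
  // The merged surface must differ from the headword.
  if (token.surface === token.headword) return false;
  // The surface must be kana-only.
  if (!KANA_ONLY.test(token.surface)) return false;

  const pos1Parts = token.pos1.split("|");
  const pos2Parts = token.pos2.split("|");

  // A merge needs a leading component plus at least one trailing helper.
  if (pos1Parts.length < 2) return false;
  // The leading component must be a non-independent noun.
  if (pos1Parts[0] !== "名詞" || pos2Parts[0] !== "非自立") return false;
  // Every remaining component must be a grammar helper.
  return pos1Parts.slice(1).every((pos) => GRAMMAR_HELPER_POS.has(pos));
}
```

Requiring every condition at once keeps the rule conservative: a kanji surface, an independent noun, or a trailing content word each causes the token to keep its annotations.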
Implementation Notes
The intended red (failing) test initially passed with headword こと, because こと is already in JLPT_EXCLUDED_TERMS and the shared subtitle annotation filter checks that set. The regression was updated to the live-risk shape (surface=ことに, headword=事, MeCab POS 名詞|助詞 / 非自立|格助詞); this version failed before the filter change and passed after.
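To illustrate why the first version of the regression could not go red, here is a minimal sketch of a headword-based exclusion check. The set contents and the helper name `isExcludedByHeadword` are assumptions for illustration; only the presence of こと in JLPT_EXCLUDED_TERMS is stated in the notes above.

```typescript
// Stand-in for the project's JLPT_EXCLUDED_TERMS; only こと is confirmed by the notes.
const JLPT_EXCLUDED_TERMS = new Set<string>(["こと"]);

// Hypothetical simplification of the pre-change check: exclusion is keyed on the headword.
function isExcludedByHeadword(headword: string): boolean {
  return JLPT_EXCLUDED_TERMS.has(headword);
}

// First attempt: headword こと is already excluded, so the test passed before any fix.
const firstAttemptExcluded = isExcludedByHeadword("こと");

// Live-risk shape: kanji headword 事 is not in the set, so the token kept its annotations.
const liveShapeExcluded = isExcludedByHeadword("事");
```

Because the check is keyed on the headword alone, the same surface ことに behaves differently depending on whether the dictionary returns こと or 事, which is why the regression had to use the kanji headword.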
Final Summary
Implemented a conservative shared subtitle annotation filter for kana-only non-independent noun helper merges. Tokens such as ことに with a kanji dictionary headword like 事 are now stripped of known-word, N+1, JLPT, and frequency metadata when MeCab shows the first component as 名詞/非自立 and trailing components as grammar helpers.
Added unit coverage in src/core/services/tokenizer/annotation-stage.test.ts and an integration-style tokenizer regression for the reported phrase shape in src/core/services/tokenizer.test.ts, verifying ことに stays plain while a real lexical token can still become the N+1 target.
Validation: bun test src/core/services/tokenizer/annotation-stage.test.ts; bun test src/core/services/tokenizer.test.ts; bun run test:fast; bun run changelog:lint.