SubMiner/backlog/tasks/task-209 - Exclude-grammar-tail-そうだ-from-subtitle-annotations.md
sudacode 5feed360ca feat: add app-owned YouTube subtitle flow with absPlayer-style parsing (#31)
2026-03-24 00:01:24 -07:00


id: TASK-209
title: Exclude grammar-tail そうだ from subtitle annotations
status: Done
assignee: codex
created_date: 2026-03-20 04:06
updated_date: 2026-03-23 03:22
labels: bug, tokenizer
dependencies: (none listed)
references:
  /Users/sudacode/projects/japanese/SubMiner/src/core/services/tokenizer/annotation-stage.ts
  /Users/sudacode/projects/japanese/SubMiner/src/core/services/tokenizer/annotation-stage.test.ts
  /Users/sudacode/projects/japanese/SubMiner/src/core/services/tokenizer.test.ts
priority: high
ordinal: 120500

Description

Sentence-final grammar-tail そうだ tokens can still receive subtitle annotation styling, including frequency highlighting, when Yomitan returns a standalone そうだ token and MeCab enriches it as an auxiliary-stem/copula pattern (pos1 名詞|助動詞, pos3 助動詞語幹). Keep the subtitle text visible, but treat this grammar tail like other grammar-only endings so that it renders without annotation metadata.

Acceptance Criteria

  • #1 Sentence-final grammar-tail そうだ tokens enriched as auxiliary-stem/copula patterns do not receive frequency highlighting or other subtitle annotation metadata.
  • #2 The preceding lexical token in cases like 与えるそうだ keeps its existing annotation behavior.
  • #3 Regression tests cover the annotation-stage exclusion and end-to-end subtitle tokenization for the そうだ grammar-tail case.

Implementation Plan

  1. Add focused regression coverage for the reported 与えるそうだ case at both annotation-stage and tokenizeSubtitle levels.
  2. Reproduce failure by modeling the MeCab-enriched grammar-tail shape (名詞|助動詞, 特殊, 助動詞語幹) that currently keeps frequency metadata.
  3. Update subtitle-annotation exclusion logic to recognize auxiliary-stem/copula grammar tails via POS metadata plus normalized tail text, not a raw sentence-specific string match.
  4. Re-run targeted tokenizer and annotation-stage tests, then record the verification commands and outcome in the task notes.

Implementation Notes

Investigated the reported 与えるそうだ case. MeCab tags そう as 名詞,特殊,助動詞語幹 and だ as 助動詞; after overlap enrichment the Yomitan token becomes pos1=名詞|助動詞, pos2=特殊, pos3=助動詞語幹. Before this fix, that shape escaped subtitle-annotation exclusion and could keep a frequency rank.
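The merged POS shape described above can be reproduced with a minimal sketch. Everything here is an assumption made for illustration: the `MecabMorpheme` interface, the `mergePosField` and `mergeOverlappingPos` helpers, and the `*` placeholder values for だ are hypothetical and are not taken from SubMiner's enrichment code.

```typescript
// Hypothetical shape for one MeCab morpheme; the field names are assumptions.
interface MecabMorpheme {
  surface: string;
  pos1: string;
  pos2: string;
  pos3: string;
}

// Join the distinct values of one POS field with "|", skipping empty
// strings and the "*" placeholder MeCab emits for unset fields.
function mergePosField(values: string[]): string {
  const distinct: string[] = [];
  for (const value of values) {
    if (value !== "" && value !== "*" && !distinct.includes(value)) {
      distinct.push(value);
    }
  }
  return distinct.join("|");
}

// Collapse the morphemes that overlap a single Yomitan token into one
// merged POS record.
function mergeOverlappingPos(morphemes: MecabMorpheme[]): MecabMorpheme {
  return {
    surface: morphemes.map((m) => m.surface).join(""),
    pos1: mergePosField(morphemes.map((m) => m.pos1)),
    pos2: mergePosField(morphemes.map((m) => m.pos2)),
    pos3: mergePosField(morphemes.map((m) => m.pos3)),
  };
}

// Assumed MeCab output for the two halves of そうだ.
const merged = mergeOverlappingPos([
  { surface: "そう", pos1: "名詞", pos2: "特殊", pos3: "助動詞語幹" },
  { surface: "だ", pos1: "助動詞", pos2: "*", pos3: "*" },
]);
```

Under these assumptions `merged` carries pos1=名詞|助動詞, pos2=特殊, pos3=助動詞語幹, which is the shape the regression tests need to model.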

Implemented a POS-shape subtitle-annotation exclusion for MeCab-enriched auxiliary-stem grammar tails. The new predicate keys off merged tokens whose POS tags stay within 名詞/助動詞/助詞 and whose POS3 includes 助動詞語幹, which clears annotation metadata for そうだ-style tails without hard-coding the full subtitle text.
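A rough sketch of what such a POS-shape predicate might look like follows. The names `PosTaggedToken`, `splitPos`, and `isAuxiliaryStemGrammarTail` are hypothetical, and the real check in annotation-stage.ts may differ (for example, it may also consult normalized tail text, as the plan suggests).

```typescript
// Hypothetical token shape; only the POS fields the predicate reads.
interface PosTaggedToken {
  surface: string;
  pos1?: string; // merged values joined with "|", e.g. "名詞|助動詞"
  pos3?: string;
}

// POS1 tags a grammar tail is allowed to contain, per the note above.
const GRAMMAR_TAIL_POS1 = new Set(["名詞", "助動詞", "助詞"]);

// Split a "|"-joined POS field into its non-empty parts.
function splitPos(value: string | undefined): string[] {
  return (value ?? "")
    .split("|")
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

// True when every POS1 part is within the allowed set and POS3
// includes 助動詞語幹. No subtitle text is compared.
function isAuxiliaryStemGrammarTail(token: PosTaggedToken): boolean {
  const pos1Parts = splitPos(token.pos1);
  if (pos1Parts.length === 0) {
    return false;
  }
  const withinAllowedPos1 = pos1Parts.every((part) => GRAMMAR_TAIL_POS1.has(part));
  const hasAuxiliaryStem = splitPos(token.pos3).includes("助動詞語幹");
  return withinAllowedPos1 && hasAuxiliaryStem;
}

const grammarTail: PosTaggedToken = { surface: "そうだ", pos1: "名詞|助動詞", pos3: "助動詞語幹" };
const lexicalVerb: PosTaggedToken = { surface: "与える", pos1: "動詞", pos3: "" };
const mixedToken: PosTaggedToken = { surface: "例", pos1: "名詞|動詞", pos3: "助動詞語幹" };

const tailResult = isAuxiliaryStemGrammarTail(grammarTail);
const verbResult = isAuxiliaryStemGrammarTail(lexicalVerb);
const mixedResult = isAuxiliaryStemGrammarTail(mixedToken);
```

In this sketch only the そうだ-shaped token matches. The verb fails the POS1 check, and the mixed token fails because 動詞 falls outside the allowed set even though its POS3 includes 助動詞語幹.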

Verification:

  • bun test src/core/services/tokenizer/annotation-stage.test.ts
  • bun test src/core/services/tokenizer.test.ts --test-name-pattern 'explanatory ending|interjection|single-kana merged tokens from frequency highlighting|auxiliary-stem そうだ grammar tails|composite function/content token from frequency highlighting|keeps frequency for content-led merged token with trailing colloquial suffixes'

Final Summary

Added regression coverage for 与えるそうだ and updated subtitle annotation exclusion logic to drop annotation metadata for MeCab-enriched auxiliary-stem grammar tails. The fix is POS-driven rather than sentence-specific, so そうだ-style grammar endings stay visible/hoverable as plain text while neighboring lexical tokens keep their existing frequency/JLPT behavior.
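The behaviour summarized above, plain text for the grammar tail and untouched metadata for its neighbour, can be illustrated with a small sketch. The `AnnotatedToken` fields, the `clearExcludedAnnotations` helper, and the sample frequency and JLPT values are all made up for illustration and do not reflect SubMiner's actual token type or data.

```typescript
// Hypothetical annotated token; the field names are assumptions.
interface AnnotatedToken {
  surface: string;
  pos3?: string;
  frequencyRank?: number;
  jlptLevel?: string;
}

// Return a copy of the token list in which excluded tokens keep their
// surface text but lose their annotation metadata.
function clearExcludedAnnotations(
  tokens: AnnotatedToken[],
  shouldExclude: (token: AnnotatedToken) => boolean,
): AnnotatedToken[] {
  return tokens.map((token) =>
    shouldExclude(token)
      ? { ...token, frequencyRank: undefined, jlptLevel: undefined }
      : token,
  );
}

// Sample values are invented purely for the example.
const annotated: AnnotatedToken[] = [
  { surface: "与える", pos3: "", frequencyRank: 1200, jlptLevel: "N3" },
  { surface: "そうだ", pos3: "助動詞語幹", frequencyRank: 300 },
];

// Stand-in predicate keyed on POS3 only, not on the subtitle text.
const cleaned = clearExcludedAnnotations(
  annotated,
  (token) => (token.pos3 ?? "").split("|").includes("助動詞語幹"),
);
```

Passing the exclusion check in as a function keeps the clearing step independent of any particular grammar rule, so other grammar-only endings could share the same path.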