SubMiner/backlog/tasks/task-176 - Exclude-interjections-and-sound-effects-from-subtitle-annotations.md

id: TASK-176
title: Exclude interjections and sound effects from subtitle annotations
status: Done
assignee: codex
created_date: 2026-03-15 12:07
updated_date: 2026-03-16 05:13
labels: bug, tokenizer, renderer
dependencies: (none)
references:
  /home/sudacode/projects/japanese/SubMiner/src/core/services/tokenizer.ts
  /home/sudacode/projects/japanese/SubMiner/src/core/services/tokenizer/annotation-stage.ts
  /home/sudacode/projects/japanese/SubMiner/src/core/services/tokenizer.test.ts
  /home/sudacode/projects/japanese/SubMiner/src/renderer/subtitle-render.ts
  /home/sudacode/projects/japanese/SubMiner/src/renderer/subtitle-render.test.ts
priority: high
ordinal: 16500

Description

Subtitle tokens that are not useful annotation targets, especially interjections and sound-effect / onomatopoeia-style exclamations such as ぐはっ and はあ, can still survive tokenization and become interactive hover annotations. Keep the subtitle text visible, but remove these tokens from annotation payloads so they do not render hover targets or dictionary popovers.

Acceptance Criteria

  • #1 Interjection / sound-effect style tokens are excluded from subtitle annotation payloads and do not create interactive hover spans.
  • #2 Excluded tokens remain visible in rendered subtitle text as plain text.
  • #3 Regression tests cover at least one MeCab-tagged interjection case and one rendering-visible/plain-text case.

Implementation Plan

  1. Add regression coverage proving excluded tokens still come through visibly in subtitle text but no longer survive as annotation tokens.
  2. Introduce a shared annotation-eligibility predicate in the tokenizer annotation stage for interjections / SFX-like tokens.
  3. Filter subtitle token payloads through that predicate before renderer hover ranges/spans are built.
  4. Verify with targeted tokenizer and renderer tests.
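
Steps 2 and 3 of the plan could look roughly like the sketch below. It is an illustration only, not the code in annotation-stage.ts: the `SketchToken` shape, the function names, and the two kana heuristics (identical repeated kana, kana run ending in a small っ/ッ) are assumptions made for this example. Tokens such as はあ are assumed to be caught by the MeCab 感動詞 tag rather than by a surface heuristic.

```typescript
// Hypothetical token shape; the real SubMiner token type is not shown in this task.
interface SketchToken {
  surface: string; // text exactly as it appears in the subtitle
  pos1?: string;   // MeCab top-level part of speech, when available
}

// MeCab's part-of-speech label for interjections.
const INTERJECTION_POS = "感動詞";

// Assumed heuristic: one kana character repeated two or more times (e.g. ああ).
const REPEATED_KANA = /^([\u3041-\u3096\u30A1-\u30FA\u30FC])\1+$/;

// Assumed heuristic: a kana-only run that ends in a small tsu (e.g. ぐはっ).
const KANA_ENDING_IN_SOKUON = /^[\u3041-\u3096\u30A1-\u30FA\u30FC]+[\u3063\u30C3]$/;

// Returns true when the token should keep its hover annotation.
function isAnnotationEligible(token: SketchToken): boolean {
  if (token.pos1 === INTERJECTION_POS) return false;
  if (REPEATED_KANA.test(token.surface)) return false;
  if (KANA_ENDING_IN_SOKUON.test(token.surface)) return false;
  return true;
}

// Filters the annotation payload only; the subtitle text itself is not touched.
function filterAnnotationTokens(tokens: SketchToken[]): SketchToken[] {
  return tokens.filter(isAnnotationEligible);
}
```

Keeping the check in a single predicate means the tokenizer and any later stage can agree on which tokens are excluded, which is the point of "shared" in step 2.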

Final Summary

Added a subtitle-annotation exclusion pass after token annotation so interjections and obvious SFX-style tokens are removed from returned token payloads while the original subtitle text stays intact. Coverage now includes MeCab-tagged 感動詞, repeated-kana interjections such as ああ, a mixed ぐはっ 猫 tokenizer case, and a renderer check proving omitted tokens stay visible as plain text instead of interactive hover spans.
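
The renderer behaviour described above (omitted tokens stay visible as plain text) can be sketched as follows. This is a simplified model under stated assumptions, not the logic in subtitle-render.ts: `HoverToken`, `Segment`, and `buildSegments` are hypothetical names, and tokens are assumed to carry character offsets into the subtitle text.

```typescript
// Hypothetical shapes for illustration; not SubMiner's real renderer types.
interface HoverToken {
  surface: string;
  start: number; // inclusive offset into the subtitle text
  end: number;   // exclusive offset into the subtitle text
}

interface Segment {
  text: string;
  interactive: boolean; // true => would render as a hover span
}

// Splits the subtitle into interactive and plain segments.
// Any text not covered by a token (including excluded tokens) stays plain.
function buildSegments(text: string, tokens: HoverToken[]): Segment[] {
  const segments: Segment[] = [];
  const sorted = tokens.slice().sort((a, b) => a.start - b.start);
  let cursor = 0;

  for (const token of sorted) {
    // Skip tokens that overlap earlier ones or fall outside the text.
    if (token.start < cursor || token.end > text.length || token.end <= token.start) {
      continue;
    }
    if (token.start > cursor) {
      segments.push({ text: text.slice(cursor, token.start), interactive: false });
    }
    segments.push({ text: text.slice(token.start, token.end), interactive: true });
    cursor = token.end;
  }

  if (cursor < text.length) {
    segments.push({ text: text.slice(cursor), interactive: false });
  }
  return segments;
}

// Example: ぐはっ was already removed from the payload, so only 猫 is a token.
const exampleSegments = buildSegments("ぐはっ 猫", [{ surface: "猫", start: 4, end: 5 }]);
```

Because the segments are cut from the original string, concatenating them always reproduces the full subtitle, whichever tokens were filtered out.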