SubMiner/backlog/tasks/task-333 - Suppress-aru-subtitle-annotations.md

id: TASK-333
title: Suppress aru subtitle annotations
status: Done
assignee:
created_date: 2026-05-04 04:39
updated_date: 2026-05-04 05:02
labels: tokenizer, annotations, bug
dependencies:
priority: medium

Description

Add ある / 有る to the subtitle annotation suppression path so that aru tokens remain hoverable but never receive N+1, JLPT, frequency, or name-match annotation metadata. Known-word highlighting is the one exception: if a filtered aru token is known and known highlighting is enabled, it should still render as known.

Acceptance Criteria

  • #1 ある and kanji headword/surface variants such as 有る are excluded by the subtitle annotation filter.
  • #2 Annotation stripping clears N+1, JLPT, frequency, and name metadata for aru tokens while preserving token hover data.
  • #3 Known-word highlighting still applies to filtered tokens, including aru, when known-word lookup marks them known.
  • #4 Regression coverage fails before the fix and passes after.
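
Criterion #1 can be pictured with a minimal sketch of a hard-exclusion check. The `SubtitleToken` shape, the `HARD_EXCLUDED_TERMS` name, and the `isAnnotationExcluded` helper are hypothetical names chosen for illustration and are not taken from the SubMiner source; only the three terms come from this task.

```typescript
// Hypothetical token shape; the field names are assumptions, not SubMiner's actual types.
interface SubtitleToken {
  surface: string; // text as it appears in the subtitle line
  headword: string; // dictionary form produced by the tokenizer
}

// The three aru terms named in this task.
const HARD_EXCLUDED_TERMS: ReadonlySet<string> = new Set(["ある", "有る", "在る"]);

// A token is excluded when either its headword or its surface form is a listed term,
// so conjugated surfaces such as あり still match through the headword.
function isAnnotationExcluded(token: SubtitleToken): boolean {
  return (
    HARD_EXCLUDED_TERMS.has(token.headword) ||
    HARD_EXCLUDED_TERMS.has(token.surface)
  );
}
```

Checking both fields is a design assumption here: it covers the case where the tokenizer reports a kana headword for a kanji surface, or the reverse.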

Implementation Plan

  1. Add ある/有る/在る to the shared subtitle annotation hard-exclusion terms.
  2. Preserve/recompute known-word status for filtered tokens while stripping N+1, JLPT, frequency, and name metadata.
  3. Add RED/GREEN unit and tokenizer regression coverage, plus a changelog fragment.
  4. Run targeted tests and full handoff gate.

Implementation Notes

TDD path: added failing annotation-stage coverage first. The initial implementation made the targeted tests pass, but broader tokenizer coverage then revealed an older fixture that expected ある to remain lexical; that integration expectation was updated to match the newly requested behavior. Follow-up correction: known-word highlighting is the only annotation exception for filtered tokens, so the strip path now preserves known state, and annotateTokens recomputes known status for filtered tokens while still clearing N+1/JLPT/frequency/name metadata.
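
The corrected strip behaviour described above might look roughly like the following sketch. It is not the actual annotateTokens code: the `FilteredToken` fields, the `stripFilteredAnnotations` helper, and the `isKnownWord` callback are all hypothetical, and treating `isKnown` as false when known highlighting is disabled is an assumption.

```typescript
// Hypothetical annotated-token shape; every field name here is an assumption.
interface FilteredToken {
  surface: string;
  headword: string;
  reading?: string; // hover data, which must survive stripping
  isKnown: boolean;
  isNPlusOneTarget: boolean;
  jlptLevel?: string;
  frequencyRank?: number;
  isNameMatch: boolean;
}

// Clears N+1, JLPT, frequency, and name metadata for a filtered token,
// while keeping hover data and keeping or recomputing the known flag.
function stripFilteredAnnotations(
  token: FilteredToken,
  isKnownWord: (headword: string) => boolean, // hypothetical known-word lookup
  knownHighlightingEnabled: boolean,
): FilteredToken {
  return {
    ...token,
    // Known status is the only annotation that survives for filtered tokens.
    isKnown:
      knownHighlightingEnabled &&
      (token.isKnown || isKnownWord(token.headword)),
    isNPlusOneTarget: false,
    jlptLevel: undefined,
    frequencyRank: undefined,
    isNameMatch: false,
  };
}
```

Returning a new object instead of mutating the input keeps the helper easy to assert against in a unit test, which fits the RED/GREEN approach used for this task.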

Final Summary

Suppressed non-known subtitle annotations for aru existence verbs by adding ある, 有る, and 在る to the shared hard-exclusion list. Corrected the filtered-token path so known-word highlighting still applies whenever known highlighting is enabled; filtered tokens now keep/gain isKnown but still lose N+1, JLPT, frequency, and name metadata.

Added and updated annotation-stage and tokenizer regression coverage for aru, particles, helper fragments, interjections, and other filtered known tokens. Added changes/333-aru-annotation-filter.md.

Validation passed. RED failures were observed before the implementation and before the correction. Commands run:

  • bun test src/core/services/tokenizer/annotation-stage.test.ts
  • bun test src/core/services/tokenizer.test.ts
  • bun run typecheck
  • bun run format:check:src
  • bun run changelog:lint
  • bun run test:fast
  • bun run test:env
  • bun run build
  • bun run test:smoke:dist