SubMiner/backlog/tasks/task-304 - Fix-N1-sentence-boundary-counting-across-Yomitan-punctuation-gaps.md
sudacode 6b7d0553a7 feat(tokenizer): use Yomitan word classes for subtitle POS filtering
- Carry matched headword wordClasses from termsFind into YomitanScanToken
- Map recognized Yomitan wordClasses to SubMiner coarse POS before annotation
- MeCab enrichment now fills only missing POS fields, preserving existing coarse pos1
- Exclude standalone grammar particles, して helper fragments, and single-kana surfaces from annotations
- Respect source-text punctuation gaps when counting N+1 sentence words
- Preserve known-word highlight on excluded kanji-containing tokens
- Add backlog tasks 304 (N+1 boundary bug) and 305 (wordClasses POS, done)
2026-04-25 23:08:33 -07:00


id: TASK-304
title: Fix N+1 sentence boundary counting across Yomitan punctuation gaps
status: In Progress
assignee: (none)
created_date: 2026-04-26 05:33
labels: bug, tokenizer, annotations
dependencies: (none)
priority: medium

Description

N+1 target selection should respect sentence-ending punctuation from the original subtitle text even when Yomitan token output omits punctuation tokens. Current behavior can treat multiple subtitle sentences as one token span and incorrectly satisfy the minimum content-token threshold.

Acceptance Criteria

  • #1 A subtitle like てんめ!ふざけんなよ! does not mark ふざけん, or any similar second sentence with a single content token, as N+1 when the minimum sentence word count is 3.
  • #2 N+1 sentence segmentation uses original subtitle text offsets or equivalent source-boundary data, not only punctuation tokens returned by Yomitan.
  • #3 Existing annotation exclusion behavior for particles/grammar tokens remains unchanged.
  • #4 Regression tests cover Yomitan-style token streams where punctuation is absent from the token list.
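
One possible reading of criteria #2 and #4 is sketched below. It is not SubMiner's implementation: the `OffsetToken` shape, the `splitBySourceGaps` and `contentCounts` helpers, the `SENTENCE_END` character set, and the tokenization of the example subtitle are all assumptions made for illustration. The idea is to inspect the source text between consecutive tokens and start a new sentence whenever that gap contains sentence-ending punctuation, so boundaries survive even when the token list carries no punctuation tokens.

```typescript
// Hypothetical token shape; SubMiner's real YomitanScanToken may differ.
interface OffsetToken {
  surface: string;
  start: number; // offset into the original subtitle text (UTF-16 code units)
  end: number; // exclusive end offset
  isContent: boolean; // assumed to be decided already by the existing exclusion rules
}

// Assumed sentence-ending characters; the real set may be broader.
const SENTENCE_END = /[。！？!?]/;

// Group tokens into sentences by looking at the source text that sits
// between tokens, rather than relying on punctuation tokens being present.
function splitBySourceGaps(source: string, tokens: OffsetToken[]): OffsetToken[][] {
  const sentences: OffsetToken[][] = [];
  let current: OffsetToken[] = [];
  let cursor = 0; // end offset of the last token seen so far

  for (const token of tokens) {
    const gap = source.slice(cursor, token.start);
    if (current.length > 0 && SENTENCE_END.test(gap)) {
      sentences.push(current);
      current = [];
    }
    current.push(token);
    cursor = Math.max(cursor, token.end);
  }
  if (current.length > 0) sentences.push(current);
  return sentences;
}

// Count content tokens per sentence, for comparison against the minimum.
function contentCounts(sentences: OffsetToken[][]): number[] {
  return sentences.map((s) => s.filter((t) => t.isContent).length);
}

// Yomitan-style stream with no punctuation tokens (tokenization is assumed).
const source = "てんめ!ふざけんなよ!";
const tokens: OffsetToken[] = [
  { surface: "てんめ", start: 0, end: 3, isContent: true },
  { surface: "ふざけん", start: 4, end: 8, isContent: true },
  { surface: "な", start: 8, end: 9, isContent: false },
  { surface: "よ", start: 9, end: 10, isContent: false },
];

const minSentenceWords = 3;
const sentences = splitBySourceGaps(source, tokens);
const eligible = contentCounts(sentences).map((n) => n >= minSentenceWords);
console.log(sentences.map((s) => s.map((t) => t.surface)), eligible);
```

Under these assumptions the stream splits into two sentences, neither of which reaches the minimum of 3, so ふざけん would not become an N+1 target. The sketch only checks gaps between tokens; a token stream that does include punctuation tokens would need an additional check on the token surfaces themselves.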