SubMiner/backlog/tasks/task-304 - Fix-N1-sentence-boundary-counting-across-Yomitan-punctuation-gaps.md at 6b7d0553a77c076f9be3ad3e9916d0fd17d1d325 - SubMiner

mirror of https://github.com/ksyasuda/SubMiner.git synced 2026-04-26 04:19:27 -07:00

feat(tokenizer): use Yomitan word classes for subtitle POS filtering

- Carry matched headword wordClasses from termsFind into YomitanScanToken
- Map recognized Yomitan wordClasses to SubMiner coarse POS before annotation
- MeCab enrichment now fills only missing POS fields, preserving existing coarse pos1
- Exclude standalone grammar particles, して helper fragments, and single-kana surfaces from annotations
- Respect source-text punctuation gaps when counting N+1 sentence words
- Preserve known-word highlight on excluded kanji-containing tokens
- Add backlog tasks 304 (N+1 boundary bug) and 305 (wordClasses POS, done)

2026-04-25 23:08:33 -07:00


| id | title | status | assignee | created_date | labels | dependencies | priority |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TASK-304 | Fix N+1 sentence boundary counting across Yomitan punctuation gaps | In Progress | | 2026-04-26 05:33 | bug, tokenizer, annotations | | medium |

Description

N+1 target selection should respect sentence-ending punctuation from the original subtitle text even when Yomitan token output omits punctuation tokens. Current behavior can treat multiple subtitle sentences as one token span and incorrectly satisfy the minimum content-token threshold.
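A minimal sketch of the idea, not SubMiner's actual tokenizer code: walk the token list in order, locate each token's surface in the original subtitle text, and start a new sentence whenever the skipped gap contains sentence-ending punctuation. `SketchToken`, `SENTENCE_END`, and `splitBySourceGaps` are hypothetical names, the punctuation set is an assumption, and the sketch assumes each surface appears verbatim in the source text.

```typescript
// Hypothetical token shape: only the surface text, in source order.
interface SketchToken {
  surface: string;
}

// Sentence-ending punctuation assumed for this sketch (not SubMiner's actual list).
const SENTENCE_END = /[。！？!?]/;

// Group tokens into sentences using gaps in the source text, so punctuation
// that the tokenizer dropped still acts as a boundary.
function splitBySourceGaps(sourceText: string, tokens: SketchToken[]): SketchToken[][] {
  const sentences: SketchToken[][] = [];
  let current: SketchToken[] = [];
  let cursor = 0; // end offset of the previously matched token

  for (const token of tokens) {
    const start = sourceText.indexOf(token.surface, cursor);
    if (start === -1) {
      // Surface not found (e.g. normalized text): keep it in the current sentence.
      current.push(token);
      continue;
    }
    // Text between the previous token and this one was omitted by the tokenizer.
    const gap = sourceText.slice(cursor, start);
    if (SENTENCE_END.test(gap) && current.length > 0) {
      sentences.push(current);
      current = [];
    }
    current.push(token);
    cursor = start + token.surface.length;
  }
  if (current.length > 0) sentences.push(current);
  return sentences;
}

// Token stream without punctuation tokens, as described above.
const groups = splitBySourceGaps("てんめ！ふざけんなよ！", [
  { surface: "てんめ" },
  { surface: "ふざけん" },
  { surface: "な" },
  { surface: "よ" },
]);
// groups, by surface: [["てんめ"], ["ふざけん", "な", "よ"]]
```

Searching forward from a cursor keeps repeated surfaces aligned with their own occurrence in the source. Tokens that already carry source offsets would make the `indexOf` lookup unnecessary.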

Acceptance Criteria

#1 For a subtitle like てんめ！ふざけんなよ！, ふざけん (or any similar second sentence with a single content token) is not marked as N+1 when the minimum sentence word count is 3.
#2 N+1 sentence segmentation uses original subtitle text offsets or equivalent source-boundary data, not only punctuation tokens returned by Yomitan.
#3 Existing annotation exclusion behavior for particles/grammar tokens remains unchanged.
#4 Regression tests cover Yomitan-style token streams where punctuation is absent from the token list.
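The criteria above could be exercised with a regression check along these lines. This is a sketch under stated assumptions: `OffsetToken`, its fields, `sentenceIndexFromSource`, and `findNPlusOneTargets` are hypothetical, the N+1 rule used here (at least `minWords` content tokens and exactly one unknown content token per sentence) is an assumed simplification, and the subtitle おい！待て！逃げるな！ is a made-up example rather than one from the task.

```typescript
// Hypothetical token shape carrying a start offset into the source subtitle.
interface OffsetToken {
  surface: string;
  start: number;
  isContent: boolean; // false for particles/grammar tokens excluded from annotation
  isKnown: boolean;
}

// Assumed sentence-ending characters for this sketch.
const SENTENCE_END_CHARS = "。！？!?";

// Sentence index = number of sentence-ending characters before the token in the source.
function sentenceIndexFromSource(sourceText: string, token: OffsetToken): number {
  let index = 0;
  for (let i = 0; i < token.start && i < sourceText.length; i++) {
    if (SENTENCE_END_CHARS.indexOf(sourceText.charAt(i)) !== -1) index += 1;
  }
  return index;
}

// Assumed N+1 rule: a sentence qualifies when it has at least minWords content
// tokens and exactly one of them is unknown.
function findNPlusOneTargets(
  tokens: OffsetToken[],
  minWords: number,
  sentenceOf: (token: OffsetToken) => number,
): string[] {
  const bySentence: Record<number, OffsetToken[]> = {};
  for (const token of tokens) {
    const key = sentenceOf(token);
    if (!bySentence[key]) bySentence[key] = [];
    bySentence[key].push(token);
  }
  const targets: string[] = [];
  for (const key of Object.keys(bySentence)) {
    const content = bySentence[Number(key)].filter((t) => t.isContent);
    const unknownContent = content.filter((t) => !t.isKnown);
    if (content.length >= minWords && unknownContent.length === 1) {
      targets.push(unknownContent[0].surface);
    }
  }
  return targets;
}

// Token stream with no punctuation tokens.
const source = "おい！待て！逃げるな！";
const stream: OffsetToken[] = [
  { surface: "おい", start: 0, isContent: true, isKnown: true },
  { surface: "待て", start: 3, isContent: true, isKnown: true },
  { surface: "逃げる", start: 6, isContent: true, isKnown: false },
  { surface: "な", start: 9, isContent: false, isKnown: true },
];

// Boundary-blind grouping treats the whole subtitle as one sentence.
const blindTargets = findNPlusOneTargets(stream, 3, () => 0);
// Source-aware grouping splits on the punctuation the token list omitted.
const awareTargets = findNPlusOneTargets(stream, 3, (t) => sentenceIndexFromSource(source, t));
```

With boundary-blind grouping the three content tokens are pooled and 逃げる is selected; with source-aware grouping each sentence has a single content token, so nothing is selected. The `isContent` flag is left untouched by the grouping step, which is the property criterion #3 asks to preserve.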
