SubMiner/backlog/tasks/task-108 - Exclude-single-kana-tokens-from-frequency-highlighting.md at 5d96f9d535973938d628f0660fd46e3ddadc53f7 - SubMiner

mirror of https://github.com/ksyasuda/SubMiner.git synced 2026-03-07 03:22:17 -08:00

sudacode 1d76e05cd3

fix(subtitle): tighten frequency token filtering

2026-03-07 01:28:37 -08:00

id: TASK-108
title: Exclude single kana tokens from frequency highlighting
status: Done
assignee:
created_date: 2026-03-07 01:18
updated_date: 2026-03-07 01:22
labels:
dependencies:
priority: medium
ordinal: 9008

Description

Suppress frequency highlighting for single-character hiragana or katakana tokens. Scope is frequency-only: known/N+1/JLPT behavior stays unchanged.

Acceptance Criteria

#1 Single-character hiragana tokens do not retain frequencyRank.
#2 Single-character katakana tokens do not retain frequencyRank.
#3 Regression coverage exists at annotation-stage and tokenizer levels.

Final Summary

Added a frequency-only suppression rule for single-character kana tokens based on token surface, so bogus merged fragments like た and standalone one-character kana no longer keep frequencyRank. Regression coverage now exists both in the annotation stage and in the tokenizer path, while multi-character tokens and N+1/JLPT behavior remain unchanged.
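The rule summarized above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the `Token` shape, the `jlptLevel` field, and both function names are hypothetical, and the choice to match only kana letters (excluding marks such as ー) is an assumption. Only the `frequencyRank` field name comes from the task text.

```typescript
// Hypothetical token shape; only `frequencyRank` is named in the task.
interface Token {
  surface: string;
  frequencyRank?: number;
  jlptLevel?: string; // assumed field, used to show non-frequency data is untouched
}

// Exactly one hiragana letter (U+3041-U+3096) or katakana letter (U+30A1-U+30FA).
// Assumption: marks such as the prolonged sound mark (U+30FC) are not covered.
const SINGLE_KANA = /^[\u3041-\u3096\u30A1-\u30FA]$/u;

function isSingleKanaSurface(surface: string): boolean {
  return SINGLE_KANA.test(surface);
}

// Returns the token without `frequencyRank` when its surface is a single kana;
// every other field, and every other token, passes through unchanged.
function suppressSingleKanaFrequency(token: Token): Token {
  if (!isSingleKanaSurface(token.surface)) {
    return token;
  }
  const { frequencyRank: _dropped, ...rest } = token;
  return rest;
}
```

Keying the check on the token surface rather than on a dictionary form matches the summary's wording; under that reading, a merged fragment whose surface is た loses its rank while a multi-character token such as たべる keeps it.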

Verification:

bun test src/core/services/tokenizer/annotation-stage.test.ts --timeout 20000
bun test src/core/services/tokenizer.test.ts --timeout 20000
