SubMiner/backlog/tasks/task-60 - Remove-hard-coded-particle-term-exclusions-from-frequency-lookup.md at 48f93f4344aa63eb15a4ceb21b51e2b38aa93c18 - SubMiner

mirror of https://github.com/ksyasuda/SubMiner.git synced 2026-02-27 18:22:41 -08:00


kyasuda 457e6f0f10 feat(tokenizer): refine Yomitan grouping and parser tooling

- map segmented Yomitan lines into single logical tokens and improve candidate selection heuristics

- limit frequency lookup to selected token text with POS-based exclusions and add debug logging hook

- add standalone Yomitan parser test script, deterministic utility-script shutdown, and docs/backlog updates

2026-02-16 17:41:24 -08:00


| id | title | status | assignee | created_date | updated_date | labels | dependencies |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TASK-60 | Remove hard-coded particle term exclusions from frequency lookup | Done | | 2026-02-16 22:20 | 2026-02-16 22:21 | | |

Description

Update tokenizer frequency filtering to rely on MeCab POS information instead of a hard-coded set of particle surface forms.

Acceptance Criteria

#1 FREQUENCY_EXCLUDED_PARTICLES hard-coded term list is removed.
#2 Frequency exclusion for particles/auxiliaries is driven by POS metadata.
#3 Tokenizer tests cover POS-driven exclusion behavior.

Final Summary

Removed the hard-coded particle surface-form exclusions (FREQUENCY_EXCLUDED_PARTICLES) from the tokenizer frequency logic. The frequency skip now relies on POS metadata only: partOfSpeech (particle/bound_auxiliary) and MeCab-enriched pos1 (助詞/助動詞) for Yomitan tokens. Added the tokenizer test "tokenizeSubtitleService skips frequency rank when Yomitan token is enriched as particle by mecab pos1" to validate POS-driven exclusion.
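
As a rough illustration of the POS-driven skip, here is a minimal TypeScript sketch. It is not the actual SubMiner code. The `Token` shape, the helper names `shouldSkipFrequencyLookup` and `frequencyRankFor`, and the `lookup` callback are all assumptions made for this example. Only the field names `partOfSpeech` and `pos1` and their excluded values come from the summary above.

```typescript
// Hypothetical token shape: field names follow the summary, not the real SubMiner types.
interface Token {
  surface: string;
  partOfSpeech?: string; // e.g. "particle", "bound_auxiliary", "noun"
  pos1?: string; // MeCab top-level POS, e.g. "助詞", "助動詞", "名詞"
}

// POS values excluded from frequency lookup, as listed in the summary.
const EXCLUDED_PART_OF_SPEECH = new Set<string>(["particle", "bound_auxiliary"]);
const EXCLUDED_MECAB_POS1 = new Set<string>(["助詞", "助動詞"]);

// Decide from POS metadata alone; the surface text is never consulted.
function shouldSkipFrequencyLookup(token: Token): boolean {
  if (token.partOfSpeech !== undefined && EXCLUDED_PART_OF_SPEECH.has(token.partOfSpeech)) {
    return true;
  }
  if (token.pos1 !== undefined && EXCLUDED_MECAB_POS1.has(token.pos1)) {
    return true;
  }
  return false;
}

// Hypothetical wrapper: returns no rank for excluded tokens,
// otherwise delegates to a caller-supplied lookup.
function frequencyRankFor(
  token: Token,
  lookup: (text: string) => number | undefined,
): number | undefined {
  return shouldSkipFrequencyLookup(token) ? undefined : lookup(token.surface);
}
```

Because the decision depends only on POS fields, a token enriched by MeCab as 助詞 is skipped even when its `partOfSpeech` says otherwise, which is the case the new tokenizer test is described as covering.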
