mirror of
https://github.com/ksyasuda/SubMiner.git
synced 2026-02-27 18:22:41 -08:00
- map segmented Yomitan lines into single logical tokens and improve candidate selection heuristics
- limit frequency lookup to selected token text with POS-based exclusions and add debug logging hook
- add standalone Yomitan parser test script, deterministic utility-script shutdown, and docs/backlog updates
1.2 KiB
| id | title | status | assignee | created_date | updated_date | labels | dependencies |
|---|---|---|---|---|---|---|---|
| TASK-60 | Remove hard-coded particle term exclusions from frequency lookup | Done | | 2026-02-16 22:20 | 2026-02-16 22:21 | | |
Description
Update tokenizer frequency filtering to rely on MeCab POS information instead of a hard-coded set of particle surface forms.
Acceptance Criteria
- #1 `FREQUENCY_EXCLUDED_PARTICLES` hard-coded term list is removed.
- #2 Frequency exclusion for particles/auxiliaries is driven by POS metadata.
- #3 Tokenizer tests cover POS-driven exclusion behavior.
Final Summary
Removed the hard-coded particle surface-form exclusions (`FREQUENCY_EXCLUDED_PARTICLES`) from the tokenizer frequency logic. The frequency skip now relies on POS metadata only: `partOfSpeech` (`particle`/`bound_auxiliary`) and MeCab-enriched `pos1` (助詞/助動詞) for Yomitan tokens. Added the tokenizer test `tokenizeSubtitleService skips frequency rank when Yomitan token is enriched as particle by mecab pos1` to validate POS-driven exclusion.
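
The POS-driven skip described above could look roughly like the following. This is a minimal sketch, not the project's actual code: the `Token` shape, `shouldSkipFrequency`, `frequencyRankFor`, and the `Map`-based rank table are hypothetical names chosen for illustration. Only the field names `partOfSpeech` and `pos1` and the excluded values come from the summary.

```typescript
// Hypothetical token shape. The field names follow the summary above,
// but the real SubMiner types may differ.
interface Token {
  surface: string;
  partOfSpeech?: string; // e.g. "particle", "bound_auxiliary", "noun"
  pos1?: string; // MeCab top-level POS, e.g. "助詞", "助動詞", "名詞"
}

// POS values whose tokens should not receive a frequency rank.
const EXCLUDED_PART_OF_SPEECH = new Set<string>(["particle", "bound_auxiliary"]);
const EXCLUDED_MECAB_POS1 = new Set<string>(["助詞", "助動詞"]);

// Decide from POS metadata alone; the surface form is never consulted.
function shouldSkipFrequency(token: Token): boolean {
  if (token.partOfSpeech !== undefined && EXCLUDED_PART_OF_SPEECH.has(token.partOfSpeech)) {
    return true;
  }
  if (token.pos1 !== undefined && EXCLUDED_MECAB_POS1.has(token.pos1)) {
    return true;
  }
  return false;
}

// Hypothetical lookup: returns undefined when the token is excluded
// or when the rank table has no entry for its surface form.
function frequencyRankFor(token: Token, ranks: Map<string, number>): number | undefined {
  if (shouldSkipFrequency(token)) {
    return undefined;
  }
  return ranks.get(token.surface);
}
```

With a rank table containing both は and 猫, a token `{ surface: "は", pos1: "助詞" }` gets no rank even though its surface form is present in the table, while `{ surface: "猫", pos1: "名詞" }` is looked up normally. Keying the decision on POS rather than on a surface-form list means no particle list has to be maintained by hand.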