SubMiner/backlog/tasks/task-60 - Remove-hard-coded-particle-term-exclusions-from-frequency-lookup.md
kyasuda 457e6f0f10 feat(tokenizer): refine Yomitan grouping and parser tooling
- map segmented Yomitan lines into single logical tokens and improve candidate selection heuristics

- limit frequency lookup to selected token text with POS-based exclusions and add debug logging hook

- add standalone Yomitan parser test script, deterministic utility-script shutdown, and docs/backlog updates
2026-02-16 17:41:24 -08:00


id: TASK-60
title: Remove hard-coded particle term exclusions from frequency lookup
status: Done
assignee: (none)
created_date: 2026-02-16 22:20
updated_date: 2026-02-16 22:21
labels: (none)
dependencies: (none)

Description

Update tokenizer frequency filtering to rely on MeCab POS information instead of a hard-coded set of particle surface forms.

Acceptance Criteria

  • #1 FREQUENCY_EXCLUDED_PARTICLES hard-coded term list is removed.
  • #2 Frequency exclusion for particles/auxiliaries is driven by POS metadata.
  • #3 Tokenizer tests cover POS-driven exclusion behavior.

Final Summary

Removed the hard-coded particle surface-form exclusions (FREQUENCY_EXCLUDED_PARTICLES) from the tokenizer frequency logic. Frequency skipping now relies on POS metadata only: partOfSpeech (particle/bound_auxiliary) and MeCab-enriched pos1 (助詞/助動詞) for Yomitan tokens. Added the tokenizer test "tokenizeSubtitleService skips frequency rank when Yomitan token is enriched as particle by mecab pos1" to validate POS-driven exclusion.
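
The POS-driven skip described in the summary could look roughly like the sketch below. This is not the project's actual code: the Token shape, the helper names shouldSkipFrequency and frequencyRankFor, and the lookup callback are all hypothetical, and TypeScript is assumed as the language. Only the field names partOfSpeech and pos1 and their excluded values come from the summary.

```typescript
// Hypothetical token shape. Only partOfSpeech and pos1 are named in the task;
// the surface field and everything else here is an assumption for illustration.
interface Token {
  surface: string;
  partOfSpeech?: string; // e.g. "particle", "bound_auxiliary", "noun"
  pos1?: string; // MeCab top-level POS, e.g. "助詞", "助動詞", "名詞"
}

// POS values that suppress frequency lookup (taken from the task summary).
const EXCLUDED_PART_OF_SPEECH = new Set<string>(["particle", "bound_auxiliary"]);
const EXCLUDED_MECAB_POS1 = new Set<string>(["助詞", "助動詞"]);

// Returns true when the token's POS metadata marks it as a particle or auxiliary.
// The surface form is never consulted, so no hard-coded term list is needed.
function shouldSkipFrequency(token: Token): boolean {
  if (token.partOfSpeech !== undefined && EXCLUDED_PART_OF_SPEECH.has(token.partOfSpeech)) {
    return true;
  }
  if (token.pos1 !== undefined && EXCLUDED_MECAB_POS1.has(token.pos1)) {
    return true;
  }
  return false;
}

// Hypothetical wrapper: looks up a rank only for tokens that are not excluded.
function frequencyRankFor(
  token: Token,
  lookup: (text: string) => number | undefined,
): number | undefined {
  return shouldSkipFrequency(token) ? undefined : lookup(token.surface);
}
```

One consequence of this design is that a token with no POS metadata at all is not excluded, even if its surface form is a common particle such as は; the exclusion is only as good as the POS enrichment.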