SubMiner/backlog/tasks/task-60 - Remove-hard-coded-particle-term-exclusions-from-frequency-lookup.md
kyasuda 457e6f0f10 feat(tokenizer): refine Yomitan grouping and parser tooling
- map segmented Yomitan lines into single logical tokens and improve candidate selection heuristics

- limit frequency lookup to selected token text with POS-based exclusions and add debug logging hook

- add standalone Yomitan parser test script, deterministic utility-script shutdown, and docs/backlog updates
2026-02-16 17:41:24 -08:00

---
id: TASK-60
title: Remove hard-coded particle term exclusions from frequency lookup
status: Done
assignee: []
created_date: '2026-02-16 22:20'
updated_date: '2026-02-16 22:21'
labels: []
dependencies: []
---
## Description
<!-- SECTION:DESCRIPTION:BEGIN -->
Update tokenizer frequency filtering to rely on MeCab POS information instead of a hard-coded set of particle surface forms.
<!-- SECTION:DESCRIPTION:END -->
## Acceptance Criteria
<!-- AC:BEGIN -->
- [x] #1 `FREQUENCY_EXCLUDED_PARTICLES` hard-coded term list is removed.
- [x] #2 Frequency exclusion for particles/auxiliaries is driven by POS metadata.
- [x] #3 Tokenizer tests cover POS-driven exclusion behavior.
<!-- AC:END -->
## Final Summary
<!-- SECTION:FINAL_SUMMARY:BEGIN -->
Removed the hard-coded particle surface-form list (`FREQUENCY_EXCLUDED_PARTICLES`) from the tokenizer frequency logic. The frequency-rank skip now relies on POS metadata only: the token's `partOfSpeech` (`particle`/`bound_auxiliary`) and, for Yomitan tokens, the MeCab-enriched `pos1` (`助詞`/`助動詞`). Added the tokenizer test `tokenizeSubtitleService skips frequency rank when Yomitan token is enriched as particle by mecab pos1` to validate POS-driven exclusion.
<!-- SECTION:FINAL_SUMMARY:END -->
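
The POS-driven check described in the summary could look roughly like the sketch below. It is illustrative only and not the actual SubMiner code: the `Token` shape, the `shouldSkipFrequency` name, and the two set names are assumptions. Only the `partOfSpeech` and `pos1` field names and the four excluded values come from the summary.

```typescript
// Hypothetical token shape; only `partOfSpeech` and `pos1` are named in the task summary.
interface Token {
  surface: string;
  partOfSpeech?: string; // e.g. "particle", "bound_auxiliary", "noun"
  pos1?: string; // MeCab top-level POS, e.g. "助詞", "助動詞", "名詞"
}

// Excluded values taken from the summary; the constant names are assumed.
const EXCLUDED_PART_OF_SPEECH = new Set<string>(["particle", "bound_auxiliary"]);
const EXCLUDED_POS1 = new Set<string>(["助詞", "助動詞"]);

// Returns true when the token should not receive a frequency rank.
// The decision uses POS metadata only; the surface form is never consulted.
function shouldSkipFrequency(token: Token): boolean {
  if (token.partOfSpeech !== undefined && EXCLUDED_PART_OF_SPEECH.has(token.partOfSpeech)) {
    return true;
  }
  if (token.pos1 !== undefined && EXCLUDED_POS1.has(token.pos1)) {
    return true;
  }
  return false;
}
```

With this shape, a token whose surface is `は` is skipped only when its POS metadata marks it as a particle or auxiliary. The same surface form tagged as a noun would still get a frequency lookup, which is the behavioral difference from a hard-coded surface list.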