mirror of
https://github.com/ksyasuda/SubMiner.git
synced 2026-05-02 04:19:27 -07:00
fix(subtitle): tighten frequency token filtering
This commit is contained in:
@@ -0,0 +1,42 @@
|
||||
---
|
||||
id: TASK-107
|
||||
title: 'Fix Yomitan scan-token fallback fragmentation on exact-source misses'
|
||||
status: Done
|
||||
assignee: []
|
||||
created_date: '2026-03-07 01:10'
|
||||
updated_date: '2026-03-07 01:12'
|
||||
labels: []
|
||||
dependencies: []
|
||||
priority: high
|
||||
ordinal: 9007
|
||||
---
|
||||
|
||||
## Description
|
||||
|
||||
<!-- SECTION:DESCRIPTION:BEGIN -->
|
||||
|
||||
Left-to-right Yomitan scanning can emit bogus fallback tokens when `termsFind` returns entries but none of their headwords carries an exact primary source for the consumed substring. Repro: `だが それでも届かぬ高みがあった` currently yields trailing fragments like `があ` / `た`, which blocks the real `あった` token from receiving frequency highlighting.
|
||||
|
||||
<!-- SECTION:DESCRIPTION:END -->
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
<!-- AC:BEGIN -->
|
||||
|
||||
- [x] #1 Scanner skips `termsFind` fallback entries that are not backed by an exact primary source for the consumed substring.
|
||||
- [x] #2 Repro line no longer yields bogus trailing fragments such as `があ`.
|
||||
- [x] #3 Regression coverage added for the scan-token path.
|
||||
|
||||
<!-- AC:END -->
|
||||
|
||||
## Final Summary
|
||||
|
||||
<!-- SECTION:FINAL_SUMMARY:BEGIN -->
|
||||
|
||||
Removed the scan-token helper fallback that previously emitted a token from the first returned headword even when Yomitan did not report an exact primary source for the consumed substring. Added a focused regression test covering `だが それでも届かぬ高みがあった`, ensuring bogus `があ` fragmentation is skipped so the later `あった` exact match can still be tokenized and highlighted.
|
||||
|
||||
Verification:
|
||||
|
||||
- `bun test src/core/services/tokenizer/yomitan-parser-runtime.test.ts src/core/services/tokenizer.test.ts --timeout 20000`
|
||||
|
||||
<!-- SECTION:FINAL_SUMMARY:END -->
|
||||
@@ -0,0 +1,43 @@
|
||||
---
|
||||
id: TASK-108
|
||||
title: 'Exclude single kana tokens from frequency highlighting'
|
||||
status: Done
|
||||
assignee: []
|
||||
created_date: '2026-03-07 01:18'
|
||||
updated_date: '2026-03-07 01:22'
|
||||
labels: []
|
||||
dependencies: []
|
||||
priority: medium
|
||||
ordinal: 9008
|
||||
---
|
||||
|
||||
## Description
|
||||
|
||||
<!-- SECTION:DESCRIPTION:BEGIN -->
|
||||
|
||||
Suppress frequency highlighting for single-character hiragana or katakana tokens. Scope is frequency-only: known/N+1/JLPT behavior stays unchanged.
|
||||
|
||||
<!-- SECTION:DESCRIPTION:END -->
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
<!-- AC:BEGIN -->
|
||||
|
||||
- [x] #1 Single-character hiragana tokens do not retain `frequencyRank`.
|
||||
- [x] #2 Single-character katakana tokens do not retain `frequencyRank`.
|
||||
- [x] #3 Regression coverage exists at annotation-stage and tokenizer levels.
|
||||
|
||||
<!-- AC:END -->
|
||||
|
||||
## Final Summary
|
||||
|
||||
<!-- SECTION:FINAL_SUMMARY:BEGIN -->
|
||||
|
||||
Added a frequency-only suppression rule for single-character kana tokens based on token `surface`, so bogus merged fragments like `た` and standalone one-character kana no longer keep `frequencyRank`. Regression coverage now exists both in the annotation stage and in the tokenizer path, while multi-character tokens and N+1/JLPT behavior remain unchanged.
|
||||
|
||||
Verification:
|
||||
|
||||
- `bun test src/core/services/tokenizer/annotation-stage.test.ts --timeout 20000`
|
||||
- `bun test src/core/services/tokenizer.test.ts --timeout 20000`
|
||||
|
||||
<!-- SECTION:FINAL_SUMMARY:END -->
|
||||
Reference in New Issue
Block a user