mirror of
https://github.com/ksyasuda/SubMiner.git
synced 2026-05-02 16:19:25 -07:00
Fix kana-only N+1 tokenizer regression test
- Use a pure-kana fixture for the subtitle token N+1 case
- Update task notes for the latest CodeRabbit follow-up
@@ -2,9 +2,10 @@
 id: TASK-305
 title: Use Yomitan word classes for subtitle token POS filtering
 status: Done
-assignee: []
+assignee:
+  - Codex
 created_date: '2026-04-26 05:56'
-updated_date: '2026-04-26 05:59'
+updated_date: '2026-05-02 22:47'
 labels:
   - tokenizer
   - yomitan
@@ -34,22 +35,30 @@ Subtitle annotation filtering currently uses Yomitan token spans, then enriches
 3. Map recognized Yomitan wordClasses to SubMiner coarse PartOfSpeech/POS metadata before annotation filtering.
 4. Keep MeCab whole-line enrichment as fallback/detail-fill for missing POS fields.
 5. Run focused tokenizer tests and typecheck.
+
+2026-05-02 review follow-up: inspect latest CodeRabbit review on PR #57, classify each finding as actionable/not actionable, patch scoped issues, run focused verification, then update final notes. User request to address/assess the review is the approval for this follow-up.
 <!-- SECTION:PLAN:END -->
 
 ## Implementation Notes
 
 <!-- SECTION:NOTES:BEGIN -->
 Implemented app-only wordClasses extraction from termsFind results; no vendored Yomitan changes required. Recognized classes currently map prt, aux, v*, adj-i/adj-ix, adj-na, and noun-like classes to SubMiner POS metadata. MeCab enrichment now skips only tokens with complete pos1/pos2/pos3 and otherwise fills missing fields while preserving existing coarse pos1. Verification: bun test src/core/services/tokenizer/yomitan-parser-runtime.test.ts src/core/services/tokenizer.test.ts; bun run typecheck.
+
+2026-05-02 CodeRabbit latest review assessment: only current actionable finding was in src/core/services/tokenizer/annotation-stage.test.ts, where a kana-only regression fixture used mixed-script/punctuation surface text. Earlier CodeRabbit findings in this PR were already marked addressed by prior commits. Patched the fixture to use pure-kana surface/headword and renamed the test to match the exercised behavior. Verification: bun test src/core/services/tokenizer/annotation-stage.test.ts; bun run typecheck.
 <!-- SECTION:NOTES:END -->
 
 ## Final Summary
 
 <!-- SECTION:FINAL_SUMMARY:BEGIN -->
-Implemented app-only Yomitan wordClasses support for subtitle token annotation filtering. The scanner now carries matched headword wordClasses from termsFind results, tokenizer maps recognized classes into SubMiner coarse POS metadata before annotation, and MeCab whole-line enrichment continues to fill missing detailed POS fields without requiring vendored Yomitan changes.
+Implemented app-only Yomitan wordClasses support for subtitle token annotation filtering. The scanner carries matched headword wordClasses from termsFind results, tokenizer maps recognized classes into SubMiner coarse POS metadata before annotation, and MeCab whole-line enrichment continues to fill missing detailed POS fields without requiring vendored Yomitan changes.
 
+2026-05-02 CodeRabbit follow-up:
+- Assessed the latest CodeRabbit review on PR #57. Only one new actionable finding remained: the kana-only N+1 regression test used a mixed/punctuated surface.
+- Updated the fixture in src/core/services/tokenizer/annotation-stage.test.ts to use a pure-kana unknown target and renamed the test accordingly.
+
 Tests run:
-- bun test src/core/services/tokenizer/yomitan-parser-runtime.test.ts src/core/services/tokenizer.test.ts
+- bun test src/core/services/tokenizer/annotation-stage.test.ts
 - bun run typecheck
 
-Note: the working tree already had unrelated tokenizer/annotation edits and task-304 before this work; those were left intact.
+Note: earlier CodeRabbit findings on this PR were already marked addressed in prior commits; no further latest-review issues were left unresolved in this pass.
 <!-- SECTION:FINAL_SUMMARY:END -->
 
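The task notes above describe two rules: recognized Yomitan word classes are mapped to a coarse POS before annotation filtering, and MeCab enrichment skips only tokens whose pos1/pos2/pos3 are all present, otherwise filling the gaps while keeping an existing pos1. A minimal sketch of those two rules follows. Every name in it (`CoarsePos`, `TokenPos`, `mapWordClass`, `coarsePosFromWordClasses`, `enrichWithMecab`) is hypothetical, and the target labels, the treatment of `adj-na`, and the definition of "noun-like" are assumptions, not the SubMiner implementation.

```typescript
// Hypothetical coarse POS labels; the real SubMiner values may differ.
type CoarsePos = '助詞' | '助動詞' | '動詞' | '形容詞' | '名詞';

// Hypothetical token shape carrying only the POS fields named in the notes.
interface TokenPos {
  pos1?: string;
  pos2?: string;
  pos3?: string;
}

// Map one Yomitan word class to a coarse POS, or undefined if unrecognized.
function mapWordClass(wordClass: string): CoarsePos | undefined {
  if (wordClass === 'prt') return '助詞';
  if (wordClass === 'aux') return '助動詞';
  if (wordClass.startsWith('v')) return '動詞'; // the "v*" family
  if (wordClass === 'adj-i' || wordClass === 'adj-ix') return '形容詞';
  // Assumption: adj-na and noun-like classes ("n", "n-...") share one label here.
  if (wordClass === 'adj-na' || wordClass === 'n' || wordClass.startsWith('n-')) {
    return '名詞';
  }
  return undefined;
}

// Return the first recognized class; unrecognized classes are ignored.
function coarsePosFromWordClasses(wordClasses: string[]): CoarsePos | undefined {
  for (const wordClass of wordClasses) {
    const pos = mapWordClass(wordClass);
    if (pos) return pos;
  }
  return undefined;
}

// Skip tokens that already have all three fields; otherwise fill only the
// missing ones, so an existing coarse pos1 is preserved.
function enrichWithMecab(token: TokenPos, mecab: TokenPos): TokenPos {
  if (token.pos1 && token.pos2 && token.pos3) return token;
  return {
    pos1: token.pos1 || mecab.pos1,
    pos2: token.pos2 || mecab.pos2,
    pos3: token.pos3 || mecab.pos3,
  };
}
```

In this sketch a token that already received `pos1` from a word class keeps it even when MeCab disagrees, and MeCab only supplies the finer `pos2`/`pos3` detail.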
@@ -669,7 +669,7 @@ test('annotateTokens N+1 handoff marks expected target when threshold is satisfi
   assert.equal(result[2]?.isNPlusOneTarget, false);
 });
 
-test('annotateTokens does not mark kana-only unknown target with subtitle punctuation as N+1', () => {
+test('annotateTokens does not mark kana-only unknown target as N+1', () => {
   const tokens = [
     makeToken({
       surface: '何やら',
@@ -688,12 +688,12 @@ test('annotateTokens does not mark kana-only unknown target with subtitle punctu
       endPos: 6,
     }),
     makeToken({
-      surface: 'スイッチ…',
+      surface: 'すいっち',
-      headword: 'スイッチ',
+      headword: 'すいっち',
       reading: 'スイッチ',
       pos1: '名詞',
       startPos: 6,
-      endPos: 11,
+      endPos: 10,
     }),
   ];
 
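The fixture change above swaps a surface that ends in an ellipsis for one made only of kana, and shortens `endPos` to match. A minimal sketch of why the old surface did not count as pure kana, and why the span shrinks by one, is given below. The helper `isKanaOnly` and its regular expression are hypothetical and not taken from SubMiner; the sketch also assumes that `endPos` is `startPos` plus the surface length in UTF-16 code units.

```typescript
// Hypothetical helper: true when every character falls in the Unicode
// Hiragana (U+3040-U+309F) or Katakana (U+30A0-U+30FF) blocks.
const KANA_ONLY = /^[\u3040-\u309F\u30A0-\u30FF]+$/;

function isKanaOnly(text: string): boolean {
  return KANA_ONLY.test(text);
}

// Old fixture surface: katakana followed by U+2026 (horizontal ellipsis).
const oldSurface = 'スイッチ…';
// New fixture surface: hiragana only.
const newSurface = 'すいっち';

const startPos = 6;
// Assumption: positions count UTF-16 code units from the start of the line.
const oldEndPos = startPos + oldSurface.length;
const newEndPos = startPos + newSurface.length;

console.log(isKanaOnly(oldSurface), oldEndPos);
console.log(isKanaOnly(newSurface), newEndPos);
```

Under these assumptions the old surface fails the kana-only check because of the trailing punctuation, while the new one passes, and the computed end positions agree with the `endPos` values on each side of the diff.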