refactor(tokenizer): remove MeCab fallback tokenization path

2026-02-22 18:03:38 -08:00
parent f1dc418e2d
commit badb82280a
9 changed files with 212 additions and 480 deletions


@@ -86,3 +86,4 @@ Read first. Keep concise.
| `codex-task109-discord-presence-20260222T220537Z-lkfv` | `codex-task109-discord-presence` | `Execute TASK-109 Discord Rich Presence integration end-to-end with plan-first workflow (no commit)` | `handoff` | `docs/subagents/agents/codex-task109-discord-presence-20260222T220537Z-lkfv.md` | `2026-02-22T22:36:40Z` |
| `opencode-task103-jellyfin-main-composer-20260222T221152Z-n3p7` | `opencode-task103-jellyfin-main-composer` | `Implement TASK-103 Jellyfin runtime wiring extraction from main.ts into composer module(s), tests, docs, and required validations (no commit).` | `in_progress` | `docs/subagents/agents/opencode-task103-jellyfin-main-composer-20260222T221152Z-n3p7.md` | `2026-02-22T22:11:52Z` |
| `opencode-task109-discord-presence-20260223T011027Z-j9r4` | `opencode-task109-discord-presence` | `Finalize TASK-109 Discord Rich Presence with plan-first workflow and backlog closure.` | `in_progress` | `docs/subagents/agents/opencode-task109-discord-presence-20260223T011027Z-j9r4.md` | `2026-02-23T01:15:39Z` |
| `codex-task88-yomitan-flow-20260223T012755Z-x4m2` | `codex-task88-yomitan-flow` | `Execute TASK-88 remove MeCab fallback tokenizer and simplify Yomitan token flow via plan-first workflow (no commit).` | `handoff` | `docs/subagents/agents/codex-task88-yomitan-flow-20260223T012755Z-x4m2.md` | `2026-02-23T01:44:16Z` |


@@ -0,0 +1,40 @@
# Agent: `codex-task88-yomitan-flow-20260223T012755Z-x4m2`
- alias: `codex-task88-yomitan-flow`
- mission: `Execute TASK-88 remove MeCab fallback tokenizer and simplify Yomitan token flow via plan-first workflow (no commit).`
- status: `handoff`
- branch: `main`
- started_at: `2026-02-23T01:27:55Z`
- heartbeat_minutes: `5`
## Current Work (newest first)
- [2026-02-23T01:44:16Z] handoff: implementation + docs updates complete for TASK-88 scope; tokenizer fallback removed, parser-selection simplified to scanning-parser-only, focused tokenizer/subtitle tests + build + docs build green.
- [2026-02-23T01:44:16Z] test: `bun test src/core/services/tokenizer/parser-selection-stage.test.ts src/core/services/tokenizer.test.ts` pass (47); `bun test src/core/services/subtitle-processing-controller.test.ts` pass (6); `bun run build` pass; `bun run docs:build` pass.
- [2026-02-23T01:30:00Z] progress: wrote plan at `docs/plans/2026-02-23-task-88-yomitan-only-token-flow.md` via writing-plans skill and executed via executing-plans skill.
- [2026-02-23T01:27:55Z] intent: load backlog context for TASK-88, write plan with writing-plans skill, execute with executing-plans skill, validate via focused/full tests, no commit.
## Files Touched
- `docs/subagents/agents/codex-task88-yomitan-flow-20260223T012755Z-x4m2.md`
- `docs/subagents/INDEX.md`
- `docs/subagents/collaboration.md`
- `docs/plans/2026-02-23-task-88-yomitan-only-token-flow.md`
- `src/core/services/tokenizer.ts`
- `src/core/services/tokenizer/parser-selection-stage.ts`
- `src/core/services/tokenizer/parser-selection-stage.test.ts`
- `src/core/services/tokenizer.test.ts`
- `docs/usage.md`
- `docs/troubleshooting.md`
## Assumptions
- Backlog is initialized and TASK-88 title/context from MCP search is authoritative despite stale `task_view` collision on legacy TASK-88.
## Open Questions / Blockers
- Backlog MCP `task_view TASK-88` resolves to a legacy completed TASK-88 entry; current TASK-88 content had to be read from `backlog/tasks/task-88 - Remove-MeCab-fallback-tokenizer-and-simplify-Yomitan-token-flow.md`.
## Next Step
- If needed, repair duplicate TASK-88 ID collision in Backlog MCP so `task_view`/`task_edit` target the active To Do ticket.


@@ -148,3 +148,5 @@ Shared notes. Append-only.
- [2026-02-23T01:10:27Z] [opencode-task109-discord-presence-20260223T011027Z-j9r4|opencode-task109-discord-presence] starting TASK-109 closure pass via Backlog MCP + writing-plans/executing-plans; scope validate existing Discord config/runtime/docs changes, close remaining DoD evidence, and finalize task status if gates pass.
- [2026-02-23T01:15:39Z] [opencode-task109-discord-presence-20260223T011027Z-j9r4|opencode-task109-discord-presence] user feedback from real Discord session: status resumed to Playing with noticeable delay; tuned default `discordPresence.updateIntervalMs` from 15000 to 3000 in defaults/docs/examples and updated focused config expectations; reran focused config + discord presence tests green.
- [2026-02-23T01:27:55Z] [codex-task88-yomitan-flow-20260223T012755Z-x4m2|codex-task88-yomitan-flow] starting TASK-88 via Backlog MCP + writing-plans/executing-plans; expected overlap in tokenizer modules (`src/core/services/tokenizer*`, Yomitan flow wiring/tests); will keep scope to MeCab fallback removal and token flow simplification.
- [2026-02-23T01:44:16Z] [codex-task88-yomitan-flow-20260223T012755Z-x4m2|codex-task88-yomitan-flow] completed TASK-88 implementation pass: removed MeCab fallback branch from `tokenizeSubtitle`, restricted parser-selection to `scanning-parser` candidates, refreshed tokenizer regressions for Yomitan-only flow, updated usage/troubleshooting docs, and verified tokenizer+subtitle suites/build/docs-build green.
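
The parser-selection change described in the entry above can be pictured with a minimal sketch. It assumes candidates carry a `source` label and a token list; the `Token` and `ParseCandidate` shapes, the helper names, and the "first non-empty candidate wins" rule are all hypothetical and are not taken from `src/core/services/tokenizer/parser-selection-stage.ts`.

```typescript
// Hypothetical shapes for illustration; the real types in the repository may differ.
interface Token {
  surface: string;
  reading?: string;
}

interface ParseCandidate {
  source: string; // only "scanning-parser" is considered in this sketch
  tokens: Token[];
}

// Returns the tokens of the first non-empty scanning-parser candidate,
// or null when no such candidate exists. The selection rule is an assumption.
function selectScanningParserTokens(candidates: ParseCandidate[]): Token[] | null {
  const match = candidates.find(
    (c) => c.source === "scanning-parser" && c.tokens.length > 0,
  );
  return match ? match.tokens : null;
}

// Yomitan-only flow: when nothing is selected, return an empty token list
// instead of handing the text to a second tokenizer.
function tokenizeSubtitleSketch(
  text: string,
  candidates: ParseCandidate[],
): { text: string; tokens: Token[] } {
  const tokens = selectScanningParserTokens(candidates) ?? [];
  return { text, tokens };
}
```

In this sketch a candidate from any other source is simply ignored, so a missing or empty scanning-parser result produces an untokenized subtitle rather than a fallback parse.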


@@ -103,7 +103,7 @@ If you installed from the AppImage and see this error, the package may be incomp
**"MeCab not found on system"**
-This is informational, not an error. SubMiner uses Yomitan's internal parser as the primary tokenizer and falls back to MeCab when needed. If MeCab is not installed, Yomitan handles all tokenization.
+This is informational, not an error. SubMiner tokenization is driven by Yomitan's internal parser. MeCab availability checks may still run for auxiliary token metadata, but MeCab is not used as a tokenization fallback path.
To install MeCab:
@@ -113,10 +113,10 @@ To install MeCab:
**Words are not segmented correctly**
-Japanese word boundaries depend on the tokenizer. If segmentation seems wrong:
+Japanese word boundaries depend on Yomitan parser output. If segmentation seems wrong:
-- Install MeCab for improved accuracy as a fallback.
-- Note that CJK characters without spaces are segmented using `Intl.Segmenter` or character-level fallback, which is not always perfect.
+- Verify Yomitan dictionaries are installed and active.
+- Note that CJK characters without spaces are segmented using parser heuristics, which is not always perfect.
## Media Generation


@@ -209,7 +209,7 @@ These keybindings only work when the overlay window has focus. See [Configuratio
1. MPV runs with an IPC socket at `/tmp/subminer-socket`
2. The overlay connects and subscribes to subtitle changes
-3. Subtitles are tokenized with Yomitan's internal parser, with MeCab fallback when needed
+3. Subtitles are tokenized with Yomitan's internal parser
4. Words are displayed as clickable spans
5. Clicking a word triggers Yomitan popup for dictionary lookup
6. Texthooker server runs at `http://127.0.0.1:5174` for external tools
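
Steps 3 and 4 of the list above (tokenized subtitle in, clickable spans out) could look roughly like the sketch below. The `DisplayToken` shape, the `word` class, and the `data-token-index` attribute are assumptions made for illustration; the overlay's actual rendering code is not shown in this commit.

```typescript
// Hypothetical token shape; only the surface text is needed for rendering here.
interface DisplayToken {
  surface: string;
}

// Escape characters that would otherwise be interpreted as HTML markup.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wrap each token in a span so a click handler can map the element
// back to its token via the (assumed) data-token-index attribute.
function renderTokensAsSpans(tokens: DisplayToken[]): string {
  return tokens
    .map(
      (token, index) =>
        `<span class="word" data-token-index="${index}">${escapeHtml(token.surface)}</span>`,
    )
    .join("");
}
```

Keeping the index on the element, rather than the token text, lets a click handler look up the full token object for the dictionary popup without re-parsing the subtitle.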