Persist stats exclusions in DB and fix word metrics filtering (#60)

2026-07-08 01:08:53 -07:00 · 2026-05-03 20:06:13 -07:00
parent db30c61327
commit 0915b23dc8
33 changed files with 1890 additions and 208 deletions
@@ -0,0 +1,38 @@
+---
+id: TASK-325
+title: Fix session chart known-word percentage denominator
+status: Done
+assignee: []
+created_date: '2026-05-04 01:19'
+updated_date: '2026-05-04 01:23'
+labels:
+  - stats
+dependencies: []
+priority: medium
+---
+
+## Description
+
+<!-- SECTION:DESCRIPTION:BEGIN -->
+Session detail known-word percentages should use the same filtered vocabulary occurrence rows for both known and total word counts. Current chart can divide known persisted word occurrences by raw token totals, causing excluded tokens to depress the known percentage.
+<!-- SECTION:DESCRIPTION:END -->
+
+## Acceptance Criteria
+<!-- AC:BEGIN -->
+- [x] #1 Session known-word timeline API exposes cumulative filtered total word counts alongside known counts.
+- [x] #2 Session detail chart computes known/unknown areas from filtered totals, not raw timeline token counts, when known-word data is available.
+- [x] #3 Session summary known-word rate uses filtered persisted word totals where available and preserves safe fallback behavior when known-word data is unavailable.
+- [x] #4 Regression tests cover filtered denominator behavior for the API and chart data path.
+<!-- AC:END -->
+
+## Implementation Notes
+
+<!-- SECTION:NOTES:BEGIN -->
+Implemented in-place fix using existing persisted word occurrence rows. `/api/stats/sessions/:id/known-words-timeline` now returns cumulative `totalWordsSeen` from filtered persisted occurrences, and session known-word rates divide by the same filtered total. Session detail chart builds known/unknown areas from `totalWordsSeen` instead of raw timeline `tokensSeen`.
+<!-- SECTION:NOTES:END -->
+
+## Final Summary
+
+<!-- SECTION:FINAL_SUMMARY:BEGIN -->
+Known-word percentages on session charts now use filtered persisted word totals for both numerator and denominator. No migration/backfill required; data comes from existing `imm_word_line_occurrences`. Added regression coverage for the API response/rate and chart data builder.
+<!-- SECTION:FINAL_SUMMARY:END -->
@@ -0,0 +1,42 @@
+---
+id: TASK-326
+title: Make stats word metrics honor filtering rules
+status: Done
+assignee: []
+created_date: '2026-05-04 01:35'
+updated_date: '2026-05-04 02:08'
+labels:
+  - stats
+dependencies: []
+priority: high
+---
+
+## Description
+
+<!-- SECTION:DESCRIPTION:BEGIN -->
+Audit stats app metrics that show or derive from word totals and make them use filtered persisted vocabulary occurrences where the UI concept is learned/seen words. Preserve raw telemetry only where it is intentionally playback/token telemetry.
+<!-- SECTION:DESCRIPTION:END -->
+
+## Acceptance Criteria
+<!-- AC:BEGIN -->
+- [x] #1 Stats UI word totals, word rates, lookup-per-word rates, and chart word series use filtered persisted word occurrences where available.
+- [x] #2 Known-word metrics continue to use the same filtered denominator as known counts.
+- [x] #3 Trend, overview, library, session, and episode surfaces are audited with regression coverage for changed data paths.
+- [x] #4 Fallback behavior remains safe for sessions without persisted vocabulary occurrences.
+<!-- AC:END -->
+
+## Implementation Notes
+
+<!-- SECTION:NOTES:BEGIN -->
+Audit finding: raw `tokensSeen` / `totalTokensSeen` still feeds overview hints, dashboard aggregation, trends activity/progress/anime cumulative/library summary, lookup-per-100-word rates, session rows/recent sessions/episode sessions, and library/anime/media headers. Vocabulary and known unique word summaries already use persisted filtered vocabulary rows. Recommended design: query-time filtered word totals from existing `imm_word_line_occurrences`, with raw-token fallback only when a session has no persisted occurrence rows.
+
+Implemented shared query-time filtered word counts. Session summaries, overview hints, daily/monthly rollups, anime/media library/detail rows, anime episode rows, episode/media sessions, trends activity/progress/anime cumulative, library summary, and lookup-per-100-word ratios now use filtered persisted word occurrences. Fallback remains raw token totals only for sessions with no persisted subtitle-line rows.
+
+Follow-up implemented: Vocab frequency tables now apply the same tokenizer vocabulary predicate at read time, because old `imm_words` rows can predate current tokenizer exclusion rules. Vocabulary persistence and cleanup also mirror the broader subtitle-annotation grammar filters. Added common frequency stop terms observed in the stats vocabulary list to the shared tokenizer exclusion set so those rows are filtered consistently across subtitle annotations, persistence, cleanup, stats reads, and SQL word-count aggregates.
+<!-- SECTION:NOTES:END -->
+
+## Final Summary
+
+<!-- SECTION:FINAL_SUMMARY:BEGIN -->
+Stats word metrics now honor filtering rules through the read-model query layer. Existing persisted `imm_word_line_occurrences` provide the filtered denominator; no migration/backfill needed. Vocab tables filter stored rows on read using tokenizer vocabulary rules, so legacy noisy rows stop appearing without a migration. Added regressions for session/overview/rollup fallback behavior, trends/library lookup-rate behavior, vocabulary read filtering, cleanup filtering, and shared stop-term filtering.
+<!-- SECTION:FINAL_SUMMARY:END -->
@@ -0,0 +1,42 @@
+---
+id: TASK-327
+title: Persist stats page exclusion list in database
+status: Done
+assignee: []
+created_date: '2026-05-04 01:39'
+updated_date: '2026-05-04 01:49'
+labels:
+  - feature
+  - stats
+  - database
+dependencies: []
+priority: medium
+---
+
+## Description
+
+<!-- SECTION:DESCRIPTION:BEGIN -->
+Add database-backed persistence for the stats page exclusion list. On first load with the new schema, seed the new table from the existing exclusion list source so existing user choices are preserved. After migration, update database rows whenever the exclusion list is changed or saved so it persists across browser sessions indefinitely.
+<!-- SECTION:DESCRIPTION:END -->
+
+## Acceptance Criteria
+<!-- AC:BEGIN -->
+- [x] #1 A new small database table stores stats page exclusion entries.
+- [x] #2 First load with the new schema seeds the table from the existing exclusion list source.
+- [x] #3 Subsequent exclusion list save/change operations update the database-backed list.
+- [x] #4 Regression coverage verifies migration/seed behavior and persistence updates.
+<!-- AC:END -->
+
+## Implementation Notes
+
+<!-- SECTION:NOTES:BEGIN -->
+Implemented DB-backed stats exclusion list using schema version 18 and new `imm_stats_excluded_words` table. Added read/replace query helpers, service methods, and `/api/stats/excluded-words` GET/PUT routes. Stats frontend now loads exclusions from DB, seeds the empty DB table from legacy `localStorage` on first load, and writes each toggle/restore/clear through the API while keeping localStorage in sync for compatibility. Added focused regression coverage for schema/read-replace, API routes, API client, and frontend bootstrap/update behavior. Verification: `bun run typecheck` passed; `bun test src/core/services/__tests__/stats-server.test.ts stats/src/lib/api-client.test.ts stats/src/hooks/useExcludedWords.test.ts` passed; `bun test src/core/services/immersion-tracker/storage-session.test.ts` passed; `bun run docs:test` passed; `bun run format:check:stats` passed; `bun run changelog:lint` passed. Blocked/unrelated: `bun run typecheck:stats` fails in existing stats files (`AnilistSelector.tsx`, `reading-utils*`, `session-grouping.test.ts`, `yomitan-lookup.test.tsx`); `bun run test:immersion:sqlite:src` fails existing `recordSubtitleLine counts exact Yomitan tokens for session metrics` expected 4 got 3; `bun run docs:build` fails missing `@catppuccin/vitepress/theme/macchiato/mauve.css` import.
+
+Added `src/core/services/__tests__/stats-server.test.ts` and `stats/src/hooks/useExcludedWords.test.ts` to the `test:core:src` allowlist so the new DB exclusion route/client/store regressions run in the maintained fast source lane.
+<!-- SECTION:NOTES:END -->
+
+## Final Summary
+
+<!-- SECTION:FINAL_SUMMARY:BEGIN -->
+Persisted the stats vocabulary exclusion list in SQLite with new schema version 18 table `imm_stats_excluded_words`. Added backend read/replace helpers and `/api/stats/excluded-words` GET/PUT routes, then wired the stats frontend exclusion store to load DB rows, seed an empty DB from legacy browser localStorage on first load, and update the DB on toggle/restore/clear. Updated docs and added changelog fragment. Focused tests and root typecheck pass; broader stats/docs/sqlite gates are blocked by unrelated existing failures recorded in notes.
+<!-- SECTION:FINAL_SUMMARY:END -->
@@ -0,0 +1,43 @@
+---
+id: TASK-329
+title: Keep JLPT subtitle styling underline-only
+status: Done
+assignee: []
+created_date: '2026-05-04 02:13'
+labels:
+  - bug
+  - renderer
+  - jlpt
+dependencies: []
+references:
+  - src/renderer/style.css
+  - src/renderer/subtitle-render.test.ts
+priority: medium
+---
+
+## Description
+
+<!-- SECTION:DESCRIPTION:BEGIN -->
+Fix subtitle token styling so JLPT metadata never changes token text color. JLPT should only render the level marker/underline affordance while known, n+1, name-match, and frequency colors retain priority.
+<!-- SECTION:DESCRIPTION:END -->
+
+## Acceptance Criteria
+<!-- AC:BEGIN -->
+- [x] #1 JLPT-only subtitle tokens do not set token text color.
+- [x] #2 JLPT level marker/underline still uses configured JLPT color.
+- [x] #3 Existing known, n+1, name-match, and frequency text colors remain unchanged.
+<!-- AC:END -->
+
+## Final Summary
+
+<!-- SECTION:FINAL_SUMMARY:BEGIN -->
+Changed subtitle JLPT styling from text color to underline decoration and updated renderer CSS regression coverage.
+
+Verification:
+- `bun test src/renderer/subtitle-render.test.ts`
+- `bunx prettier --check src/renderer/subtitle-render.test.ts src/renderer/style.css`
+- `bun run typecheck`
+
+Blocked:
+- `bun run test:fast` fails in existing dirty stats/session work: `recordSubtitleLine counts exact Yomitan tokens for session metrics` expects `tokensSeen` 4 but gets 3.
+<!-- SECTION:FINAL_SUMMARY:END -->
@@ -0,0 +1,70 @@
+---
+id: TASK-330
+title: Fix PR 60 CI failures and CodeRabbit feedback
+status: Done
+assignee:
+  - codex
+created_date: '2026-05-04 02:50'
+updated_date: '2026-05-04 02:59'
+labels:
+  - ci
+  - pr-review
+dependencies: []
+references:
+  - 'https://github.com/ksyasuda/SubMiner/pull/60'
+priority: high
+---
+
+## Description
+
+<!-- SECTION:DESCRIPTION:BEGIN -->
+Resolve failing GitHub Actions checks and actionable unresolved CodeRabbit review feedback on PR #60 (Persist stats exclusions in DB and fix word metrics filtering). Keep fixes scoped to the PR behavior and preserve existing project patterns.
+<!-- SECTION:DESCRIPTION:END -->
+
+## Acceptance Criteria
+<!-- AC:BEGIN -->
+- [x] #1 Failing GitHub Actions checks for PR #60 have an identified root cause and local fix.
+- [x] #2 All actionable unresolved CodeRabbit review comments on PR #60 are addressed locally or explicitly documented as non-actionable.
+- [x] #3 Relevant local verification passes for the changed code paths.
+- [x] #4 Task notes summarize CI failure context, review-comment handling, and any residual verification gaps.
+<!-- AC:END -->
+
+## Implementation Plan
+
+<!-- SECTION:PLAN:BEGIN -->
+1. Resolve PR #60 context and inspect GitHub Actions failures with the gh-fix-ci workflow.
+2. Fetch unresolved review threads with the gh-address-comments workflow, focusing on CodeRabbit actionable comments.
+3. Read the touched files/tests around the failing paths and comments; identify root cause before edits.
+4. Apply minimal fixes with regression coverage where appropriate.
+5. Run targeted verification first, then broader repo gates as time permits.
+6. Update Backlog notes/acceptance criteria with CI/comment outcomes and residual risks.
+<!-- SECTION:PLAN:END -->
+
+## Implementation Notes
+
+<!-- SECTION:NOTES:BEGIN -->
+Resolved PR #60 CI failure by restoring raw `tokensSeen` for session summaries while keeping filtered persisted word counts in aggregate/known-word paths. Addressed CodeRabbit feedback: fixed missing `headword` test fixture binding; paged vocabulary stats past filtered rows; preserved lifetime/rollup totals when retained-session recomputation is partial; emitted flat known-word timeline points for zero-visible-word line gaps; restored localStorage mocks; added rollback/retry behavior for excluded-word store persistence/initialization.
+<!-- SECTION:NOTES:END -->
+
+## Final Summary
+
+<!-- SECTION:FINAL_SUMMARY:BEGIN -->
+Fixed the PR #60 CI failure and addressed actionable CodeRabbit feedback.
+
+Key changes:
+- Restored exact Yomitan token counts for session summary metrics while leaving filtered word counts for aggregate and known-word calculations.
+- Fixed malformed query test fixtures by binding `headword` into `imm_words` inserts.
+- Updated vocabulary stats to page until enough visible rows are collected after post-query filtering.
+- Made library/detail/rollup read models preserve lifetime or stored rollup totals when retained-session recomputation is partial, including dashboard rollup-derived word metrics.
+- Kept known-word timeline line positions stable by emitting flat points for missing line indexes.
+- Made excluded-word persistence rollback on failed writes, allow initialization retries after transient load failures, and restored mocked `localStorage` in tests.
+
+Verification passed:
+- `bun run typecheck`
+- `bun run test:fast`
+- `bun run test:env`
+- `bun run build`
+- `bun run test:smoke:dist`
+- `bun run format:check:src`
+- `git diff --check`
+<!-- SECTION:FINAL_SUMMARY:END -->