mirror of
https://github.com/ksyasuda/SubMiner.git
synced 2026-03-21 12:11:28 -07:00
docs: add stats dashboard design docs, plans, and knowledge base
- Stats dashboard redesign design and implementation plans
- Episode detail and Anki card link design
- Internal knowledge base restructure
- Backlog tasks for testing, verification, and occurrence tracking
New file: docs/plans/2026-03-14-immersion-occurrence-tracking-design.md (115 lines)
# Immersion Occurrence Tracking Design

**Problem:** `imm_words` and `imm_kanji` only store global aggregates. They cannot answer "where did this word/kanji appear?" at the anime, episode, timestamp, or subtitle-line level.

**Goals:**

- Map normalized words and kanji back to exact subtitle lines.
- Preserve repeated tokens inside one subtitle line.
- Avoid duplicating token text for each repeated token in the same line.
- Keep the change additive and compatible with current top-word/top-kanji stats.

**Non-Goals:**

- Exact token character offsets inside a subtitle line.
- Full stats UI redesign in the same change.
- Replacing existing aggregate tables or existing vocabulary queries.

## Recommended Approach

Add a normalized subtitle-line table plus counted bridge tables from lines to canonical word and kanji rows. Keep `imm_words` and `imm_kanji` as canonical lexeme aggregates, then link them to `imm_subtitle_lines` through one row per unique lexeme per line with `occurrence_count`.

This preserves total frequency within a line without duplicating token text or needing one row per repeated token. Reverse mapping becomes a simple join from canonical lexeme to line row to video/anime context.
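The "one row per unique lexeme per line" idea reduces to counting tokens. A minimal sketch (illustrative Python; the token list is a made-up stand-in for the project's real tokenizer output):

```python
from collections import Counter

# Hypothetical pre-normalized tokens for one subtitle line; in practice
# they would come from the existing tokenizer and POS/noise filter.
tokens = ["猫", "が", "猫", "を", "見る"]

# One (lexeme, occurrence_count) pair per unique token -- the shape stored
# in the counted bridge tables, so a repeated token costs no extra text.
occurrences = Counter(tokens)
```

Here the repeated 猫 collapses into a single entry with a count of 2, while the total frequency across the line is still recoverable by summing the counts.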
## Data Model

### `imm_subtitle_lines`

One row per recorded subtitle line.

Suggested fields:

- `line_id INTEGER PRIMARY KEY AUTOINCREMENT`
- `session_id INTEGER NOT NULL`
- `event_id INTEGER`
- `video_id INTEGER NOT NULL`
- `anime_id INTEGER`
- `line_index INTEGER NOT NULL`
- `segment_start_ms INTEGER`
- `segment_end_ms INTEGER`
- `text TEXT NOT NULL`
- `created_date INTEGER`
- `last_update_date INTEGER`

Notes:

- `event_id` links back to `imm_session_events` when the subtitle-line event is written.
- `anime_id` is nullable because some rows may predate anime linkage or come from unresolved media.
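The field list above translates to roughly the following DDL, shown here via Python's stdlib `sqlite3` for illustration; exact types and index choices would follow the project's existing schema conventions:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE imm_subtitle_lines (
        line_id          INTEGER PRIMARY KEY AUTOINCREMENT,
        session_id       INTEGER NOT NULL,
        event_id         INTEGER,   -- links to imm_session_events
        video_id         INTEGER NOT NULL,
        anime_id         INTEGER,   -- nullable: media may be unresolved
        line_index       INTEGER NOT NULL,
        segment_start_ms INTEGER,
        segment_end_ms   INTEGER,
        text             TEXT NOT NULL,
        created_date     INTEGER,
        last_update_date INTEGER
    )
    """
)

# A row with no anime link yet: anime_id stays NULL but the
# line/video/session mapping is fully preserved.
row_id = db.execute(
    "INSERT INTO imm_subtitle_lines (session_id, video_id, line_index, text)"
    " VALUES (?, ?, ?, ?)",
    (1, 10, 0, "猫が猫を見る"),
).lastrowid
```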
### `imm_word_line_occurrences`

One row per normalized word per subtitle line.

Suggested fields:

- `line_id INTEGER NOT NULL`
- `word_id INTEGER NOT NULL`
- `occurrence_count INTEGER NOT NULL`
- `PRIMARY KEY(line_id, word_id)`

`word_id` points at the canonical row in `imm_words`.
### `imm_kanji_line_occurrences`

One row per kanji per subtitle line.

Suggested fields:

- `line_id INTEGER NOT NULL`
- `kanji_id INTEGER NOT NULL`
- `occurrence_count INTEGER NOT NULL`
- `PRIMARY KEY(line_id, kanji_id)`

`kanji_id` points at the canonical row in `imm_kanji`.
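Both bridge tables share the same counted shape, and the composite primary key is what enforces "at most one row per lexeme per line". A sketch with stdlib `sqlite3`:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    -- Repeats inside a line collapse into occurrence_count; the composite
    -- primary key guarantees one row per unique lexeme per line.
    CREATE TABLE imm_word_line_occurrences (
        line_id          INTEGER NOT NULL,
        word_id          INTEGER NOT NULL,
        occurrence_count INTEGER NOT NULL,
        PRIMARY KEY (line_id, word_id)
    );
    CREATE TABLE imm_kanji_line_occurrences (
        line_id          INTEGER NOT NULL,
        kanji_id         INTEGER NOT NULL,
        occurrence_count INTEGER NOT NULL,
        PRIMARY KEY (line_id, kanji_id)
    );
    """
)

db.execute("INSERT INTO imm_word_line_occurrences VALUES (1, 101, 2)")
try:
    # A second row for the same (line, word) pair violates the primary key.
    db.execute("INSERT INTO imm_word_line_occurrences VALUES (1, 101, 1)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
```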
## Write Path

During `recordSubtitleLine(...)`:

1. Normalize and validate the line as today.
2. Compute counted word and kanji occurrences for the line.
3. Upsert canonical `imm_words` / `imm_kanji` rows as today.
4. Insert one `imm_subtitle_lines` row for the line.
5. Insert counted bridge rows for each normalized word and kanji found in that line.

Counting rules:

- Words: count repeated allowed tokens in the token list; skip tokens excluded by the existing POS/noise filter.
- Kanji: count repeated kanji characters from the visible subtitle line text.
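Steps 4 and 5 plus the word-counting rule can be sketched as follows. This is illustrative Python with stdlib `sqlite3`; `record_subtitle_line` and its arguments are hypothetical names, and steps 1-3 are assumed to have already produced the normalized word ids:

```python
import sqlite3
from collections import Counter

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE imm_subtitle_lines (
        line_id INTEGER PRIMARY KEY AUTOINCREMENT,
        session_id INTEGER NOT NULL, video_id INTEGER NOT NULL,
        line_index INTEGER NOT NULL, text TEXT NOT NULL);
    CREATE TABLE imm_word_line_occurrences (
        line_id INTEGER NOT NULL, word_id INTEGER NOT NULL,
        occurrence_count INTEGER NOT NULL,
        PRIMARY KEY (line_id, word_id));
    """
)

def record_subtitle_line(session_id, video_id, line_index, text, token_word_ids):
    """Steps 4-5 of the write path: one line row, then counted bridge rows.

    token_word_ids stands in for the already-normalized, already-upserted
    canonical word ids produced by steps 1-3.
    """
    cur = db.execute(
        "INSERT INTO imm_subtitle_lines (session_id, video_id, line_index, text)"
        " VALUES (?, ?, ?, ?)",
        (session_id, video_id, line_index, text),
    )
    line_id = cur.lastrowid
    # Counting rule: repeats of the same word collapse into occurrence_count.
    for word_id, count in Counter(token_word_ids).items():
        db.execute(
            "INSERT INTO imm_word_line_occurrences VALUES (?, ?, ?)",
            (line_id, word_id, count),
        )
    return line_id

# Word id 101 appears twice in the line's token list.
line_id = record_subtitle_line(1, 10, 0, "猫が猫を見る", [101, 102, 101, 103, 104])
```

The kanji side would follow the same pattern, with the counts taken over kanji characters in the visible text rather than over filtered tokens.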
## Query Shape

Add reverse-mapping query functions for:

- word -> recent occurrence rows
- kanji -> recent occurrence rows

Each row should include enough context for drilldown:

- anime id/title
- video id/title
- session id
- line index
- segment start/end
- subtitle text
- occurrence count within that line

Existing top-word/top-kanji aggregate queries stay in place.
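The word-to-occurrences direction is then a two-join query. A sketch against a toy dataset (stdlib `sqlite3`; the column subset and limit are illustrative, and the real function would also join the anime/video title tables for the drilldown fields):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE imm_words (word_id INTEGER PRIMARY KEY, word TEXT NOT NULL);
    CREATE TABLE imm_subtitle_lines (
        line_id INTEGER PRIMARY KEY, video_id INTEGER NOT NULL,
        session_id INTEGER NOT NULL, line_index INTEGER NOT NULL,
        segment_start_ms INTEGER, text TEXT NOT NULL);
    CREATE TABLE imm_word_line_occurrences (
        line_id INTEGER NOT NULL, word_id INTEGER NOT NULL,
        occurrence_count INTEGER NOT NULL, PRIMARY KEY (line_id, word_id));

    INSERT INTO imm_words VALUES (1, '猫');
    INSERT INTO imm_subtitle_lines VALUES (1, 10, 5, 0, 1500, '猫が猫を見る');
    INSERT INTO imm_word_line_occurrences VALUES (1, 1, 2);
    """
)

# word -> recent occurrence rows, with enough context for drilldown.
rows = db.execute(
    """
    SELECT l.video_id, l.session_id, l.line_index,
           l.segment_start_ms, l.text, o.occurrence_count
    FROM imm_words w
    JOIN imm_word_line_occurrences o ON o.word_id = w.word_id
    JOIN imm_subtitle_lines l        ON l.line_id = o.line_id
    WHERE w.word = ?
    ORDER BY l.line_id DESC
    LIMIT 20
    """,
    ("猫",),
).fetchall()
```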
## Edge Cases

- Repeated tokens in one line: store once per lexeme per line with `occurrence_count > 1`.
- Duplicate identical lines in one session: each subtitle event gets its own `imm_subtitle_lines` row.
- No anime link yet: keep `anime_id` null and still preserve the line/video/session mapping.
- Legacy DBs: additive migration only; no destructive rebuild of existing word/kanji data.
## Testing Strategy

Start with focused DB-backed tests:

- schema test for new line/bridge tables and indexes
- service test for counted word/kanji line persistence
- query tests for reverse mapping from word/kanji to line/anime/video context
- migration test for existing DBs gaining the new tables cleanly

Primary verification lane: `bun run test:immersion:sqlite:src`, then broader lanes only if API/runtime surfaces widen.
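The migration test in particular only needs to assert that an existing DB gains the new tables without touching old rows. A minimal sketch of that check (illustrative Python with simplified column sets; the real tests run in the project's Bun/SQLite harness):

```python
import sqlite3

# Simulate a legacy DB that already holds aggregate word data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE imm_words (word_id INTEGER PRIMARY KEY, word TEXT, count INTEGER)")
db.execute("INSERT INTO imm_words VALUES (1, '猫', 42)")

# Additive migration: only CREATE TABLE IF NOT EXISTS, no destructive rebuild.
db.executescript(
    """
    CREATE TABLE IF NOT EXISTS imm_subtitle_lines (
        line_id INTEGER PRIMARY KEY AUTOINCREMENT,
        session_id INTEGER NOT NULL, video_id INTEGER NOT NULL,
        line_index INTEGER NOT NULL, text TEXT NOT NULL);
    CREATE TABLE IF NOT EXISTS imm_word_line_occurrences (
        line_id INTEGER NOT NULL, word_id INTEGER NOT NULL,
        occurrence_count INTEGER NOT NULL, PRIMARY KEY (line_id, word_id));
    """
)

# New tables exist and legacy aggregates are untouched.
tables = {r[0] for r in db.execute("SELECT name FROM sqlite_master WHERE type='table'")}
legacy_count = db.execute("SELECT count FROM imm_words WHERE word_id = 1").fetchone()[0]
```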