Immersion Occurrence Tracking Design
Problem: imm_words and imm_kanji only store global aggregates. They cannot answer "where did this word/kanji appear?" at the anime, episode, timestamp, or subtitle-line level.
Goals:
- Map normalized words and kanji back to exact subtitle lines.
- Preserve repeated tokens inside one subtitle line.
- Avoid storing token text repeatedly for each repeated token in the same line.
- Keep the change additive and compatible with current top-word/top-kanji stats.
Non-Goals:
- Exact token character offsets inside a subtitle line.
- Full stats UI redesign in the same change.
- Replacing existing aggregate tables or existing vocabulary queries.
Recommended Approach
Add a normalized subtitle-line table plus counted bridge tables from lines to canonical word and kanji rows. Keep imm_words and imm_kanji as canonical lexeme aggregates, then link them to imm_subtitle_lines through one row per unique lexeme per line with occurrence_count.
This preserves total frequency within a line without duplicating token text or needing one row per repeated token. Reverse mapping becomes a simple join from canonical lexeme to line row to video/anime context.
Data Model
imm_subtitle_lines
One row per recorded subtitle line.
Suggested fields:
- line_id INTEGER PRIMARY KEY AUTOINCREMENT
- session_id INTEGER NOT NULL
- event_id INTEGER
- video_id INTEGER NOT NULL
- anime_id INTEGER
- line_index INTEGER NOT NULL
- segment_start_ms INTEGER
- segment_end_ms INTEGER
- text TEXT NOT NULL
- CREATED_DATE INTEGER
- LAST_UPDATE_DATE INTEGER
Notes:
- event_id links back to imm_session_events when the subtitle-line event is written.
- anime_id is nullable because some rows may predate anime linkage or come from unresolved media.
imm_word_line_occurrences
One row per normalized word per subtitle line.
Suggested fields:
- line_id INTEGER NOT NULL
- word_id INTEGER NOT NULL
- occurrence_count INTEGER NOT NULL
- PRIMARY KEY(line_id, word_id)
word_id points at the canonical row in imm_words.
imm_kanji_line_occurrences
One row per kanji per subtitle line.
Suggested fields:
- line_id INTEGER NOT NULL
- kanji_id INTEGER NOT NULL
- occurrence_count INTEGER NOT NULL
- PRIMARY KEY(line_id, kanji_id)
kanji_id points at the canonical row in imm_kanji.
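The three tables above could be created by an additive migration along the following lines. This is only a sketch: the table and column names come from the suggested fields, while the constant name, the IF NOT EXISTS guards, and the two reverse-lookup index names are assumptions made for illustration. Foreign-key constraints are left out because the design does not specify them.

```typescript
// Sketch of the additive schema. Column lists mirror the suggested fields;
// index names and IF NOT EXISTS guards are assumptions, not part of the design.
const OCCURRENCE_SCHEMA_STATEMENTS: readonly string[] = [
  `CREATE TABLE IF NOT EXISTS imm_subtitle_lines (
     line_id INTEGER PRIMARY KEY AUTOINCREMENT,
     session_id INTEGER NOT NULL,
     event_id INTEGER,
     video_id INTEGER NOT NULL,
     anime_id INTEGER,
     line_index INTEGER NOT NULL,
     segment_start_ms INTEGER,
     segment_end_ms INTEGER,
     text TEXT NOT NULL,
     CREATED_DATE INTEGER,
     LAST_UPDATE_DATE INTEGER
   )`,
  `CREATE TABLE IF NOT EXISTS imm_word_line_occurrences (
     line_id INTEGER NOT NULL,
     word_id INTEGER NOT NULL,
     occurrence_count INTEGER NOT NULL,
     PRIMARY KEY (line_id, word_id)
   )`,
  `CREATE TABLE IF NOT EXISTS imm_kanji_line_occurrences (
     line_id INTEGER NOT NULL,
     kanji_id INTEGER NOT NULL,
     occurrence_count INTEGER NOT NULL,
     PRIMARY KEY (line_id, kanji_id)
   )`,
  // Hypothetical indexes: the primary keys lead with line_id, so reverse
  // mapping (lexeme -> lines) benefits from an index that leads with the lexeme id.
  `CREATE INDEX IF NOT EXISTS idx_imm_word_line_occurrences_word
     ON imm_word_line_occurrences (word_id, line_id)`,
  `CREATE INDEX IF NOT EXISTS idx_imm_kanji_line_occurrences_kanji
     ON imm_kanji_line_occurrences (kanji_id, line_id)`,
];
```

Because every statement only creates new objects, running the list against a legacy database leaves imm_words and imm_kanji untouched.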
Write Path
During recordSubtitleLine(...):
- Normalize and validate the line as today.
- Compute counted word and kanji occurrences for the line.
- Upsert canonical imm_words/imm_kanji rows as today.
- Insert one imm_subtitle_lines row for the line.
- Insert counted bridge rows for each normalized word and kanji found in that line.
Counting rules:
- Words: count repeated allowed tokens in the token list; skip tokens excluded by the existing POS/noise filter.
- Kanji: count repeated kanji characters from the visible subtitle line text.
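The counting rules can be sketched as two small pure functions. The Token shape, the function names, and the use of the Unicode Han script property to decide what counts as kanji are all assumptions for illustration; the real tokenizer output and the existing POS/noise filter may look different.

```typescript
// Hypothetical token shape; the real tokenizer output may differ.
interface Token {
  normalized: string; // normalized form used as the canonical lexeme key
  excluded: boolean; // true when the existing POS/noise filter rejects the token
}

// Count allowed tokens per normalized word, keeping repeats within the line.
function countWordOccurrences(tokens: readonly Token[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const token of tokens) {
    if (token.excluded) continue; // skipped tokens never reach the bridge table
    counts.set(token.normalized, (counts.get(token.normalized) ?? 0) + 1);
  }
  return counts;
}

// Assumption: a "kanji" is any character with Unicode Script=Han.
const KANJI_PATTERN = /\p{Script=Han}/u;

// Count kanji characters in the visible subtitle text, keeping repeats.
function countKanjiOccurrences(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const ch of text) {
    // for...of walks the string by code point, so astral characters stay whole
    if (!KANJI_PATTERN.test(ch)) continue;
    counts.set(ch, (counts.get(ch) ?? 0) + 1);
  }
  return counts;
}
```

Each map entry corresponds to one bridge row: the key identifies the lexeme and the value becomes occurrence_count. For the line 日本の日は長い, the kanji map holds 日 with a count of 2 and 本 and 長 with a count of 1 each.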
Query Shape
Add reverse-mapping query functions for:
- word -> recent occurrence rows
- kanji -> recent occurrence rows
Each row should include enough context for drilldown:
- anime id/title
- video id/title
- session id
- line index
- segment start/end
- subtitle text
- occurrence count within that line
Existing top-word/top-kanji aggregate queries stay in place.
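A reverse-mapping query could take roughly the following shape. This is a sketch under assumptions: the row type, the function name, and the choice of line_id descending as the meaning of "recent" are illustrative. Anime and video titles are omitted because the tables that hold them are not named in this design; they would be added with further joins on anime_id and video_id.

```typescript
// Hypothetical row shape returned to callers for drilldown.
interface OccurrenceRow {
  lineId: number;
  sessionId: number;
  videoId: number;
  animeId: number | null; // null when the line has no anime link yet
  lineIndex: number;
  segmentStartMs: number | null;
  segmentEndMs: number | null;
  text: string;
  occurrenceCount: number; // repeats of the lexeme within this line
}

type LexemeKind = "word" | "kanji";

// Build the SQL for "lexeme -> recent occurrence rows".
// Parameters, in order: the canonical word_id or kanji_id, then the row limit.
function buildOccurrenceQuery(kind: LexemeKind): string {
  const bridge =
    kind === "word" ? "imm_word_line_occurrences" : "imm_kanji_line_occurrences";
  const key = kind === "word" ? "word_id" : "kanji_id";
  return `
    SELECT l.line_id          AS lineId,
           l.session_id       AS sessionId,
           l.video_id         AS videoId,
           l.anime_id         AS animeId,
           l.line_index       AS lineIndex,
           l.segment_start_ms AS segmentStartMs,
           l.segment_end_ms   AS segmentEndMs,
           l.text             AS text,
           o.occurrence_count AS occurrenceCount
    FROM ${bridge} AS o
    JOIN imm_subtitle_lines AS l ON l.line_id = o.line_id
    WHERE o.${key} = ?
    ORDER BY l.line_id DESC -- assumption: newest inserted line first
    LIMIT ?
  `;
}
```

Only fixed table and column names are interpolated into the SQL text; the lexeme id and the limit stay as bound parameters.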
Edge Cases
- Repeated tokens in one line: store once per lexeme per line with occurrence_count > 1.
- Duplicate identical lines in one session: each subtitle event gets its own imm_subtitle_lines row.
- No anime link yet: keep anime_id null and still preserve the line/video/session mapping.
- Legacy DBs: additive migration only; no destructive rebuild of existing word/kanji data.
Testing Strategy
Start with focused DB-backed tests:
- schema test for new line/bridge tables and indexes
- service test for counted word/kanji line persistence
- query tests for reverse mapping from word/kanji to line/anime/video context
- migration test for existing DBs gaining the new tables cleanly
Primary verification lane: bun run test:immersion:sqlite:src, then broader lanes only if API/runtime surfaces widen.