SubMiner/docs/plans/2026-03-14-immersion-occurrence-tracking-design.md
sudacode cc5d270b8e docs: add stats dashboard design docs, plans, and knowledge base
- Stats dashboard redesign design and implementation plans
- Episode detail and Anki card link design
- Internal knowledge base restructure
- Backlog tasks for testing, verification, and occurrence tracking
2026-03-17 20:01:23 -07:00


Immersion Occurrence Tracking Design

Problem: imm_words and imm_kanji only store global aggregates. They cannot answer "where did this word/kanji appear?" at the anime, episode, timestamp, or subtitle-line level.

Goals:

  • Map normalized words and kanji back to exact subtitle lines.
  • Preserve repeated tokens inside one subtitle line.
  • Avoid duplicating stored token text for each repeat of a token within a line.
  • Keep the change additive and compatible with current top-word/top-kanji stats.

Non-Goals:

  • Exact token character offsets inside a subtitle line.
  • Full stats UI redesign in the same change.
  • Replacing existing aggregate tables or existing vocabulary queries.

Approach

Add a normalized subtitle-line table plus counted bridge tables from lines to canonical word and kanji rows. Keep imm_words and imm_kanji as canonical lexeme aggregates, then link them to imm_subtitle_lines through one row per unique lexeme per line with occurrence_count.

This preserves total frequency within a line without duplicating token text or needing one row per repeated token. Reverse mapping becomes a simple join from canonical lexeme to line row to video/anime context.

Data Model

imm_subtitle_lines

One row per recorded subtitle line.

Suggested fields:

  • line_id INTEGER PRIMARY KEY AUTOINCREMENT
  • session_id INTEGER NOT NULL
  • event_id INTEGER
  • video_id INTEGER NOT NULL
  • anime_id INTEGER
  • line_index INTEGER NOT NULL
  • segment_start_ms INTEGER
  • segment_end_ms INTEGER
  • text TEXT NOT NULL
  • CREATED_DATE INTEGER
  • LAST_UPDATE_DATE INTEGER

Notes:

  • event_id links back to imm_session_events when the subtitle-line event is written.
  • anime_id is nullable because some rows may predate anime linkage or come from unresolved media.
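The field list above can be sketched as DDL. This is an illustrative sketch only, assuming SQLite; the index name idx_subtitle_lines_video is hypothetical and not part of the field list.

```python
import sqlite3

# Illustrative DDL for the proposed imm_subtitle_lines table.
# Types and constraints follow the suggested field list; the index is an
# assumption to support per-video drilldown in line order.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE imm_subtitle_lines (
    line_id          INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id       INTEGER NOT NULL,
    event_id         INTEGER,            -- links to imm_session_events when written
    video_id         INTEGER NOT NULL,
    anime_id         INTEGER,            -- nullable: unresolved media / pre-linkage rows
    line_index       INTEGER NOT NULL,
    segment_start_ms INTEGER,
    segment_end_ms   INTEGER,
    text             TEXT NOT NULL,
    CREATED_DATE     INTEGER,
    LAST_UPDATE_DATE INTEGER
);
CREATE INDEX idx_subtitle_lines_video ON imm_subtitle_lines(video_id, line_index);
""")
```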

imm_word_line_occurrences

One row per normalized word per subtitle line.

Suggested fields:

  • line_id INTEGER NOT NULL
  • word_id INTEGER NOT NULL
  • occurrence_count INTEGER NOT NULL
  • PRIMARY KEY(line_id, word_id)

word_id points at the canonical row in imm_words.

imm_kanji_line_occurrences

One row per kanji per subtitle line.

Suggested fields:

  • line_id INTEGER NOT NULL
  • kanji_id INTEGER NOT NULL
  • occurrence_count INTEGER NOT NULL
  • PRIMARY KEY(line_id, kanji_id)

kanji_id points at the canonical row in imm_kanji.
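Both bridge tables share the same shape, sketched below. The CHECK constraint and the two inverse indexes are assumptions beyond the field lists: the indexes are what make lexeme-to-line reverse lookup cheap.

```python
import sqlite3

# Illustrative DDL for the counted bridge tables. The composite primary key
# enforces one row per unique lexeme per line; occurrence_count carries
# within-line frequency.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE imm_word_line_occurrences (
    line_id          INTEGER NOT NULL,
    word_id          INTEGER NOT NULL,
    occurrence_count INTEGER NOT NULL CHECK (occurrence_count > 0),
    PRIMARY KEY (line_id, word_id)
);
CREATE TABLE imm_kanji_line_occurrences (
    line_id          INTEGER NOT NULL,
    kanji_id         INTEGER NOT NULL,
    occurrence_count INTEGER NOT NULL CHECK (occurrence_count > 0),
    PRIMARY KEY (line_id, kanji_id)
);
-- Assumed inverse indexes for reverse mapping (lexeme -> lines).
CREATE INDEX idx_word_line_by_word   ON imm_word_line_occurrences(word_id, line_id);
CREATE INDEX idx_kanji_line_by_kanji ON imm_kanji_line_occurrences(kanji_id, line_id);
""")
```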

Write Path

During recordSubtitleLine(...):

  1. Normalize and validate the line as today.
  2. Compute counted word and kanji occurrences for the line.
  3. Upsert canonical imm_words / imm_kanji rows as today.
  4. Insert one imm_subtitle_lines row for the line.
  5. Insert counted bridge rows for each normalized word and kanji found in that line.
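Steps 4 and 5 can be sketched as a single transaction. This is a minimal sketch, not the real service code: the trimmed-down tables exist only to make it runnable, and word_counts stands in for the output of steps 1-3 (normalization, counting, canonical upserts).

```python
import sqlite3
from collections import Counter

# Minimal tables so the sketch runs; the full schema is in Data Model above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE imm_subtitle_lines (
    line_id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id INTEGER NOT NULL, video_id INTEGER NOT NULL,
    anime_id INTEGER, line_index INTEGER NOT NULL, text TEXT NOT NULL);
CREATE TABLE imm_word_line_occurrences (
    line_id INTEGER NOT NULL, word_id INTEGER NOT NULL,
    occurrence_count INTEGER NOT NULL, PRIMARY KEY (line_id, word_id));
""")

def record_subtitle_line(conn, session_id, video_id, anime_id,
                         line_index, text, word_counts):
    """Steps 4-5: insert the line row, then one counted bridge row per
    unique word_id. word_counts maps canonical word_id -> count."""
    with conn:  # one transaction per subtitle line
        cur = conn.execute(
            "INSERT INTO imm_subtitle_lines "
            "(session_id, video_id, anime_id, line_index, text) "
            "VALUES (?, ?, ?, ?, ?)",
            (session_id, video_id, anime_id, line_index, text))
        line_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO imm_word_line_occurrences VALUES (?, ?, ?)",
            [(line_id, wid, n) for wid, n in word_counts.items()])
    return line_id

# Hypothetical sample data: word_id 1 appears twice in the line.
line_id = record_subtitle_line(conn, 5, 7, None, 0, "猫が猫を見た",
                               Counter({1: 2, 2: 1}))
```

Wrapping the line insert and its bridge rows in one transaction keeps the mapping consistent if a write fails partway.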

Counting rules:

  • Words: count repeated allowed tokens in the token list; skip tokens excluded by the existing POS/noise filter.
  • Kanji: count repeated kanji characters from the visible subtitle line text.
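The counting rules amount to two multiset counts per line. A minimal sketch, with is_allowed_token standing in for the existing POS/noise filter and is_kanji as a hypothetical character predicate (a basic CJK-range check; the real check may differ):

```python
from collections import Counter

def count_line_occurrences(tokens, text, is_allowed_token, is_kanji):
    """Counted occurrences for one subtitle line.

    tokens: normalized word tokens from the existing tokenizer (assumed).
    text:   the visible subtitle line text.
    """
    # Words: repeated allowed tokens are counted, filtered tokens skipped.
    word_counts = Counter(t for t in tokens if is_allowed_token(t))
    # Kanji: repeated kanji characters counted from the visible text.
    kanji_counts = Counter(ch for ch in text if is_kanji(ch))
    return word_counts, kanji_counts

def is_kanji(ch):
    # Assumption: basic CJK Unified Ideographs block only.
    return "\u4e00" <= ch <= "\u9fff"

words, kanji = count_line_occurrences(
    ["猫", "猫", "見る"], "猫が猫を見た", lambda t: True, is_kanji)
print(words["猫"], kanji["猫"])  # prints: 2 2
```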

Query Shape

Add reverse-mapping query functions for:

  • word -> recent occurrence rows
  • kanji -> recent occurrence rows

Each row should include enough context for drilldown:

  • anime id/title
  • video id/title
  • session id
  • line index
  • segment start/end
  • subtitle text
  • occurrence count within that line

Existing top-word/top-kanji aggregate queries stay in place.
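The reverse-mapping query is a two-table join from the bridge table to the line table. A sketch under assumptions: trimmed schemas, hypothetical sample data, and a recency ordering by line_id (the real query may order by timestamp instead).

```python
import sqlite3

# Trimmed schemas, just enough to run the join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE imm_subtitle_lines (
    line_id INTEGER PRIMARY KEY, session_id INTEGER, video_id INTEGER,
    anime_id INTEGER, line_index INTEGER,
    segment_start_ms INTEGER, segment_end_ms INTEGER, text TEXT);
CREATE TABLE imm_word_line_occurrences (
    line_id INTEGER, word_id INTEGER, occurrence_count INTEGER,
    PRIMARY KEY (line_id, word_id));
""")
conn.execute("INSERT INTO imm_subtitle_lines VALUES (10, 5, 7, 3, 42, 61000, 63000, '猫が猫を見た')")
conn.execute("INSERT INTO imm_word_line_occurrences VALUES (10, 1, 2)")

# word -> recent occurrence rows, with drilldown context per row.
rows = conn.execute("""
SELECT l.anime_id, l.video_id, l.session_id, l.line_index,
       l.segment_start_ms, l.segment_end_ms, l.text, o.occurrence_count
FROM imm_word_line_occurrences o
JOIN imm_subtitle_lines l ON l.line_id = o.line_id
WHERE o.word_id = ?
ORDER BY l.line_id DESC
LIMIT 50
""", (1,)).fetchall()
```

The kanji query is identical with imm_kanji_line_occurrences and kanji_id; joining anime/video title tables adds one join each.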

Edge Cases

  • Repeated tokens in one line: store once per lexeme per line with occurrence_count > 1.
  • Duplicate identical lines in one session: each subtitle event gets its own imm_subtitle_lines row.
  • No anime link yet: keep anime_id null and still preserve the line/video/session mapping.
  • Legacy DBs: additive migration only; no destructive rebuild of existing word/kanji data.
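The legacy-DB bullet suggests an idempotent migration. One possible shape, shown for the line table only (the bridge tables would follow the same pattern): CREATE TABLE IF NOT EXISTS is a no-op on already-migrated databases and never touches existing imm_words / imm_kanji rows.

```python
import sqlite3

# Additive migration sketch: safe on both fresh and legacy databases.
MIGRATION = """
CREATE TABLE IF NOT EXISTS imm_subtitle_lines (
    line_id          INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id       INTEGER NOT NULL,
    event_id         INTEGER,
    video_id         INTEGER NOT NULL,
    anime_id         INTEGER,
    line_index       INTEGER NOT NULL,
    segment_start_ms INTEGER,
    segment_end_ms   INTEGER,
    text             TEXT NOT NULL,
    CREATED_DATE     INTEGER,
    LAST_UPDATE_DATE INTEGER
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(MIGRATION)
conn.executescript(MIGRATION)  # re-running on a migrated DB is a no-op
```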

Testing Strategy

Start with focused DB-backed tests:

  • schema test for new line/bridge tables and indexes
  • service test for counted word/kanji line persistence
  • query tests for reverse mapping from word/kanji to line/anime/video context
  • migration test for existing DBs gaining the new tables cleanly

Primary verification lane: `bun run test:immersion:sqlite:src`; run broader lanes only if API/runtime surfaces widen.