SubMiner/docs/plans/2026-03-14-immersion-occurrence-tracking-design.md
sudacode cc5d270b8e docs: add stats dashboard design docs, plans, and knowledge base
- Stats dashboard redesign design and implementation plans
- Episode detail and Anki card link design
- Internal knowledge base restructure
- Backlog tasks for testing, verification, and occurrence tracking
2026-03-17 20:01:23 -07:00


Immersion Occurrence Tracking Design

Problem: imm_words and imm_kanji only store global aggregates. They cannot answer "where did this word/kanji appear?" at the anime, episode, timestamp, or subtitle-line level.

Goals:

  • Map normalized words and kanji back to exact subtitle lines.
  • Preserve repeated tokens inside one subtitle line.
  • Avoid duplicating stored token text for each repeat of a token within a line.
  • Keep the change additive and compatible with current top-word/top-kanji stats.

Non-Goals:

  • Exact token character offsets inside a subtitle line.
  • Full stats UI redesign in the same change.
  • Replacing existing aggregate tables or existing vocabulary queries.

Approach

Add a normalized subtitle-line table plus counted bridge tables from lines to canonical word and kanji rows. Keep imm_words and imm_kanji as canonical lexeme aggregates, then link them to imm_subtitle_lines through one row per unique lexeme per line with occurrence_count.

This preserves total frequency within a line without duplicating token text or needing one row per repeated token. Reverse mapping becomes a simple join from canonical lexeme to line row to video/anime context.

Data Model

imm_subtitle_lines

One row per recorded subtitle line.

Suggested fields:

  • line_id INTEGER PRIMARY KEY AUTOINCREMENT
  • session_id INTEGER NOT NULL
  • event_id INTEGER
  • video_id INTEGER NOT NULL
  • anime_id INTEGER
  • line_index INTEGER NOT NULL
  • segment_start_ms INTEGER
  • segment_end_ms INTEGER
  • text TEXT NOT NULL
  • CREATED_DATE INTEGER
  • LAST_UPDATE_DATE INTEGER

Notes:

  • event_id links back to imm_session_events when the subtitle-line event is written.
  • anime_id is nullable because some rows may predate anime linkage or come from unresolved media.
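The field list above can be sketched as DDL. This is an illustrative sketch only, assuming SQLite; the index name idx_subtitle_lines_video is hypothetical and not part of the field list.

```python
import sqlite3

# Illustrative DDL for the proposed imm_subtitle_lines table.
# Types and constraints follow the suggested field list; the index is an
# assumption to support per-video drilldown in line order.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE imm_subtitle_lines (
    line_id          INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id       INTEGER NOT NULL,
    event_id         INTEGER,            -- links to imm_session_events when written
    video_id         INTEGER NOT NULL,
    anime_id         INTEGER,            -- nullable: unresolved media / pre-linkage rows
    line_index       INTEGER NOT NULL,
    segment_start_ms INTEGER,
    segment_end_ms   INTEGER,
    text             TEXT NOT NULL,
    CREATED_DATE     INTEGER,
    LAST_UPDATE_DATE INTEGER
);
CREATE INDEX idx_subtitle_lines_video ON imm_subtitle_lines(video_id, line_index);
""")
```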

imm_word_line_occurrences

One row per normalized word per subtitle line.

Suggested fields:

  • line_id INTEGER NOT NULL
  • word_id INTEGER NOT NULL
  • occurrence_count INTEGER NOT NULL
  • PRIMARY KEY(line_id, word_id)

word_id points at the canonical row in imm_words.

imm_kanji_line_occurrences

One row per kanji per subtitle line.

Suggested fields:

  • line_id INTEGER NOT NULL
  • kanji_id INTEGER NOT NULL
  • occurrence_count INTEGER NOT NULL
  • PRIMARY KEY(line_id, kanji_id)

kanji_id points at the canonical row in imm_kanji.
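Both bridge tables share the same shape, sketched below. The CHECK constraint and the two inverse indexes are assumptions beyond the field lists: the indexes are what make lexeme-to-line reverse lookup cheap.

```python
import sqlite3

# Illustrative DDL for the counted bridge tables. The composite primary key
# enforces one row per unique lexeme per line; occurrence_count carries
# within-line frequency.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE imm_word_line_occurrences (
    line_id          INTEGER NOT NULL,
    word_id          INTEGER NOT NULL,
    occurrence_count INTEGER NOT NULL CHECK (occurrence_count > 0),
    PRIMARY KEY (line_id, word_id)
);
CREATE TABLE imm_kanji_line_occurrences (
    line_id          INTEGER NOT NULL,
    kanji_id         INTEGER NOT NULL,
    occurrence_count INTEGER NOT NULL CHECK (occurrence_count > 0),
    PRIMARY KEY (line_id, kanji_id)
);
-- Assumed inverse indexes for reverse mapping (lexeme -> lines).
CREATE INDEX idx_word_line_by_word   ON imm_word_line_occurrences(word_id, line_id);
CREATE INDEX idx_kanji_line_by_kanji ON imm_kanji_line_occurrences(kanji_id, line_id);
""")
```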

Write Path

During recordSubtitleLine(...):

  1. Normalize and validate the line as today.
  2. Compute counted word and kanji occurrences for the line.
  3. Upsert canonical imm_words / imm_kanji rows as today.
  4. Insert one imm_subtitle_lines row for the line.
  5. Insert counted bridge rows for each normalized word and kanji found in that line.
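Steps 4 and 5 can be sketched as a single transaction. This is a minimal sketch, not the real service code: the trimmed-down tables exist only to make it runnable, and word_counts stands in for the output of steps 1-3 (normalization, counting, canonical upserts).

```python
import sqlite3
from collections import Counter

# Minimal tables so the sketch runs; the full schema is in Data Model above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE imm_subtitle_lines (
    line_id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id INTEGER NOT NULL, video_id INTEGER NOT NULL,
    anime_id INTEGER, line_index INTEGER NOT NULL, text TEXT NOT NULL);
CREATE TABLE imm_word_line_occurrences (
    line_id INTEGER NOT NULL, word_id INTEGER NOT NULL,
    occurrence_count INTEGER NOT NULL, PRIMARY KEY (line_id, word_id));
""")

def record_subtitle_line(conn, session_id, video_id, anime_id,
                         line_index, text, word_counts):
    """Steps 4-5: insert the line row, then one counted bridge row per
    unique word_id. word_counts maps canonical word_id -> count."""
    with conn:  # one transaction per subtitle line
        cur = conn.execute(
            "INSERT INTO imm_subtitle_lines "
            "(session_id, video_id, anime_id, line_index, text) "
            "VALUES (?, ?, ?, ?, ?)",
            (session_id, video_id, anime_id, line_index, text))
        line_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO imm_word_line_occurrences VALUES (?, ?, ?)",
            [(line_id, wid, n) for wid, n in word_counts.items()])
    return line_id

# Hypothetical sample data: word_id 1 appears twice in the line.
line_id = record_subtitle_line(conn, 5, 7, None, 0, "猫が猫を見た",
                               Counter({1: 2, 2: 1}))
```

Wrapping the line insert and its bridge rows in one transaction keeps the mapping consistent if a write fails partway.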

Counting rules:

  • Words: count repeated allowed tokens in the token list; skip tokens excluded by the existing POS/noise filter.
  • Kanji: count repeated kanji characters from the visible subtitle line text.
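The counting rules amount to two multiset counts per line. A minimal sketch, with is_allowed_token standing in for the existing POS/noise filter and is_kanji as a hypothetical character predicate (a basic CJK-range check; the real check may differ):

```python
from collections import Counter

def count_line_occurrences(tokens, text, is_allowed_token, is_kanji):
    """Counted occurrences for one subtitle line.

    tokens: normalized word tokens from the existing tokenizer (assumed).
    text:   the visible subtitle line text.
    """
    # Words: repeated allowed tokens are counted, filtered tokens skipped.
    word_counts = Counter(t for t in tokens if is_allowed_token(t))
    # Kanji: repeated kanji characters counted from the visible text.
    kanji_counts = Counter(ch for ch in text if is_kanji(ch))
    return word_counts, kanji_counts

def is_kanji(ch):
    # Assumption: basic CJK Unified Ideographs block only.
    return "\u4e00" <= ch <= "\u9fff"

words, kanji = count_line_occurrences(
    ["猫", "猫", "見る"], "猫が猫を見た", lambda t: True, is_kanji)
print(words["猫"], kanji["猫"])  # prints: 2 2
```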

Query Shape

Add reverse-mapping query functions for:

  • word -> recent occurrence rows
  • kanji -> recent occurrence rows

Each row should include enough context for drilldown:

  • anime id/title
  • video id/title
  • session id
  • line index
  • segment start/end
  • subtitle text
  • occurrence count within that line

Existing top-word/top-kanji aggregate queries stay in place.
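The reverse-mapping query is a two-table join from the bridge table to the line table. A sketch under assumptions: trimmed schemas, hypothetical sample data, and a recency ordering by line_id (the real query may order by timestamp instead).

```python
import sqlite3

# Trimmed schemas, just enough to run the join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE imm_subtitle_lines (
    line_id INTEGER PRIMARY KEY, session_id INTEGER, video_id INTEGER,
    anime_id INTEGER, line_index INTEGER,
    segment_start_ms INTEGER, segment_end_ms INTEGER, text TEXT);
CREATE TABLE imm_word_line_occurrences (
    line_id INTEGER, word_id INTEGER, occurrence_count INTEGER,
    PRIMARY KEY (line_id, word_id));
""")
conn.execute("INSERT INTO imm_subtitle_lines VALUES (10, 5, 7, 3, 42, 61000, 63000, '猫が猫を見た')")
conn.execute("INSERT INTO imm_word_line_occurrences VALUES (10, 1, 2)")

# word -> recent occurrence rows, with drilldown context per row.
rows = conn.execute("""
SELECT l.anime_id, l.video_id, l.session_id, l.line_index,
       l.segment_start_ms, l.segment_end_ms, l.text, o.occurrence_count
FROM imm_word_line_occurrences o
JOIN imm_subtitle_lines l ON l.line_id = o.line_id
WHERE o.word_id = ?
ORDER BY l.line_id DESC
LIMIT 50
""", (1,)).fetchall()
```

The kanji query is identical with imm_kanji_line_occurrences and kanji_id; joining anime/video title tables adds one join each.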

Edge Cases

  • Repeated tokens in one line: store once per lexeme per line with occurrence_count > 1.
  • Duplicate identical lines in one session: each subtitle event gets its own imm_subtitle_lines row.
  • No anime link yet: keep anime_id null and still preserve the line/video/session mapping.
  • Legacy DBs: additive migration only; no destructive rebuild of existing word/kanji data.
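The legacy-DB bullet suggests an idempotent migration. One possible shape, shown for the line table only (the bridge tables would follow the same pattern): CREATE TABLE IF NOT EXISTS is a no-op on already-migrated databases and never touches existing imm_words / imm_kanji rows.

```python
import sqlite3

# Additive migration sketch: safe on both fresh and legacy databases.
MIGRATION = """
CREATE TABLE IF NOT EXISTS imm_subtitle_lines (
    line_id          INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id       INTEGER NOT NULL,
    event_id         INTEGER,
    video_id         INTEGER NOT NULL,
    anime_id         INTEGER,
    line_index       INTEGER NOT NULL,
    segment_start_ms INTEGER,
    segment_end_ms   INTEGER,
    text             TEXT NOT NULL,
    CREATED_DATE     INTEGER,
    LAST_UPDATE_DATE INTEGER
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(MIGRATION)
conn.executescript(MIGRATION)  # re-running on a migrated DB is a no-op
```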

Testing Strategy

Start with focused DB-backed tests:

  • schema test for new line/bridge tables and indexes
  • service test for counted word/kanji line persistence
  • query tests for reverse mapping from word/kanji to line/anime/video context
  • migration test for existing DBs gaining the new tables cleanly

Primary verification lane: `bun run test:immersion:sqlite:src`; run broader lanes only if API/runtime surfaces widen.