training and synthetic data generation code
manga_ocr_dev/README.md
# Project structure

```
assets/                        # assets (see description below)
manga_ocr/                     # release code (inference only)
manga_ocr_dev/                 # development code
    env.py                     # global constants
    data/                      # data preprocessing
    synthetic_data_generator/  # generation of synthetic image-text pairs
    training/                  # model training
```
## assets
### fonts.csv

List of fonts with metadata, used by the synthetic data generator. CSV with columns:

- `font_path`: path to the font file, relative to `FONTS_ROOT`
- `supported_chars`: string of characters supported by this font
- `num_chars`: number of supported characters
- `label`: `common`/`regular`/`special` (used to sample regular fonts more often than special ones)

The provided file is just an example; you have to generate a similar file for your own set of fonts,
using the `manga_ocr_dev/synthetic_data_generator/scan_fonts.py` script.
Note that `label` is filled with `regular` by default, so you have to label your special fonts manually.
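Label-based font sampling could be sketched as below. The `LABEL_WEIGHTS` values and the toy rows are made-up placeholders for illustration, not values used by the actual generator:

```python
import numpy as np
import pandas as pd

# Made-up weights: sample "regular" fonts more often than "special" ones.
LABEL_WEIGHTS = {"common": 0.3, "regular": 0.6, "special": 0.1}

def sample_font(fonts_df: pd.DataFrame, rng: np.random.Generator) -> str:
    """Pick a font_path, weighting rows by their label."""
    weights = fonts_df["label"].map(LABEL_WEIGHTS).to_numpy(dtype=float)
    weights /= weights.sum()
    idx = rng.choice(len(fonts_df), p=weights)
    return fonts_df.iloc[idx]["font_path"]

# Toy stand-in for assets/fonts.csv.
fonts = pd.DataFrame({
    "font_path": ["a.ttf", "b.ttf", "c.ttf"],
    "supported_chars": ["abc", "abcd", "ab"],
    "num_chars": [3, 4, 2],
    "label": ["regular", "common", "special"],
})
print(sample_font(fonts, np.random.default_rng(0)))
```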
### lines_example.csv

Example of a CSV used for synthetic data generation. CSV with columns:

- `source`: source of the text
- `id`: unique id of the line
- `line`: line from a language corpus
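A minimal stand-in in the same format can be parsed as follows; the rows below are invented for illustration, not taken from the actual file:

```python
import io

import pandas as pd

# Invented rows in the lines_example.csv format (source, id, line).
csv_text = (
    "source,id,line\n"
    "corpus_a,0,これはペンです。\n"
    "corpus_a,1,漫画が好きです。\n"
)
lines = pd.read_csv(io.StringIO(csv_text))
print(list(lines.columns))
```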
### len_to_p.csv

CSV with columns:

- `len`: length of the text
- `p`: probability of a text of this length occurring in manga

Used by the synthetic data generator to roughly match the natural distribution of text lengths.
Computed from the Manga109-s dataset.
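Drawing lengths from such a table is a simple categorical sample. The toy probabilities below are placeholders, not the real values from `len_to_p.csv`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for assets/len_to_p.csv (the real file is computed from Manga109-s).
len_to_p = pd.DataFrame({"len": [1, 2, 3, 4], "p": [0.1, 0.4, 0.3, 0.2]})

def sample_text_length(df: pd.DataFrame, rng: np.random.Generator) -> int:
    """Draw a text length following the stored length distribution."""
    return int(rng.choice(df["len"], p=df["p"]))

rng = np.random.default_rng(0)
lengths = [sample_text_length(len_to_p, rng) for _ in range(1000)]
print(min(lengths), max(lengths))
```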
### vocab.csv

List of all characters supported by the tokenizer.
# Training OCR

`env.py` contains global constants used across the repo. Set your data paths etc. there.
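A minimal sketch of the kind of constants `env.py` holds; the example paths below are placeholders you would replace with your own (`MANGA109_ROOT`, `FONTS_ROOT`, and `DATA_SYNTHETIC_ROOT` are the names referenced elsewhere in this README):

```python
# Sketch of env.py-style constants; the paths below are placeholders.
from pathlib import Path

MANGA109_ROOT = Path("/data/manga109")         # Manga109-s dataset location
FONTS_ROOT = Path("/data/fonts")               # your font files
DATA_SYNTHETIC_ROOT = Path("/data/synthetic")  # synthetic data root
```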
1. Download the [Manga109-s](http://www.manga109.org/en/download_s.html) dataset.
2. Set `MANGA109_ROOT` so that your directory structure looks like this:

   ```
   <MANGA109_ROOT>/
       Manga109s_released_2021_02_28/
           annotations/
           annotations.v2018.05.31/
           images/
           books.txt
           readme.txt
   ```
3. Preprocess Manga109-s with `data/process_manga109s.py`.
4. Optionally, generate synthetic data (see below).
5. Train with `manga_ocr_dev/training/train.py`.
# Synthetic data generation

Generated data is split into packages (named `0000`, `0001`, etc.) for easier management of a large dataset.
Each package is assumed to have a similar data distribution, so that a properly balanced dataset
can be built from any subset of packages.
The data generation pipeline assumes the following directory structure:

```
<DATA_SYNTHETIC_ROOT>/
    img/      # generated images (output of the generation pipeline)
        0000/
        0001/
        ...
    lines/    # lines from a corpus (input to the generation pipeline)
        0000.csv
        0001.csv
        ...
    meta/     # metadata (output of the generation pipeline)
        0000.csv
        0001.csv
        ...
```
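The layout above can be created with a short helper; this is just a convenience sketch (the pipeline itself may create its own output directories), using the zero-padded 4-digit package naming shown above:

```python
import tempfile
from pathlib import Path

def make_synthetic_dirs(root: Path, n_packages: int) -> None:
    """Create the img/lines/meta layout expected by the generation pipeline."""
    for i in range(n_packages):
        # Packages are zero-padded 4-digit ids: 0000, 0001, ...
        (root / "img" / f"{i:04d}").mkdir(parents=True, exist_ok=True)
    (root / "lines").mkdir(parents=True, exist_ok=True)
    (root / "meta").mkdir(parents=True, exist_ok=True)

root = Path(tempfile.mkdtemp())
make_synthetic_dirs(root, 2)
print(sorted(p.name for p in (root / "img").iterdir()))
```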
To use a language corpus for data generation, `lines/*.csv` files must be provided.
For a small example of such a file, see `assets/lines_example.csv`.

To generate synthetic data:

1. Generate backgrounds with `data/generate_backgrounds.py`.
2. Put your fonts in `<FONTS_ROOT>`.
3. Generate font metadata with `synthetic_data_generator/scan_fonts.py`.
4. Optionally, manually label your fonts with `common`/`regular`/`special` labels.
5. Provide `<DATA_SYNTHETIC_ROOT>/lines/*.csv`.
6. Run `synthetic_data_generator/run_generate.py` for each package.
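One way to drive the last step is to build one invocation per package. The exact command-line interface of `run_generate.py` is an assumption here; check the script itself for its real arguments:

```python
# Hypothetical helper: build one run_generate.py invocation per package id.
# Passing the package id as a positional argument is an assumption,
# not the script's documented CLI.
def generation_commands(package_ids):
    script = "manga_ocr_dev/synthetic_data_generator/run_generate.py"
    return [f"python {script} {pkg}" for pkg in package_ids]

commands = generation_commands([f"{i:04d}" for i in range(3)])
print(commands[0])
```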