training and synthetic data generation code
manga_ocr_dev/README.md
# Project structure

```
assets/                        # assets (see description below)
manga_ocr/                     # release code (inference only)
manga_ocr_dev/                 # development code
    env.py                     # global constants
    data/                      # data preprocessing
    synthetic_data_generator/  # generation of synthetic image-text pairs
    training/                  # model training
```
## assets
### fonts.csv

List of fonts with metadata, used by the synthetic data generator. CSV with columns:

- `font_path`: path to the font file, relative to `FONTS_ROOT`
- `supported_chars`: string of characters supported by this font
- `num_chars`: number of supported characters
- `label`: `common`/`regular`/`special` (used to sample regular fonts more often than special ones)

The provided file is just an example; you have to generate a similar file for your own set of fonts,
using the `manga_ocr_dev/synthetic_data_generator/scan_fonts.py` script.
Note that `label` is filled with `regular` by default, so you have to label your special fonts manually.
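Label-based font sampling could be sketched as below. The `LABEL_WEIGHTS` values and the toy rows are made-up placeholders for illustration, not values used by the actual generator:

```python
import numpy as np
import pandas as pd

# Made-up weights: sample "regular" fonts more often than "special" ones.
LABEL_WEIGHTS = {"common": 0.3, "regular": 0.6, "special": 0.1}

def sample_font(fonts_df: pd.DataFrame, rng: np.random.Generator) -> str:
    """Pick a font_path, weighting rows by their label."""
    weights = fonts_df["label"].map(LABEL_WEIGHTS).to_numpy(dtype=float)
    weights /= weights.sum()
    idx = rng.choice(len(fonts_df), p=weights)
    return fonts_df.iloc[idx]["font_path"]

# Toy stand-in for assets/fonts.csv.
fonts = pd.DataFrame({
    "font_path": ["a.ttf", "b.ttf", "c.ttf"],
    "supported_chars": ["abc", "abcd", "ab"],
    "num_chars": [3, 4, 2],
    "label": ["regular", "common", "special"],
})
print(sample_font(fonts, np.random.default_rng(0)))
```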
### lines_example.csv

Example of a CSV used for synthetic data generation. CSV with columns:

- `source`: source of the text
- `id`: unique id of the line
- `line`: line from a language corpus
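A minimal stand-in in the same format can be parsed as follows; the rows below are invented for illustration, not taken from the actual file:

```python
import io

import pandas as pd

# Invented rows in the lines_example.csv format (source, id, line).
csv_text = (
    "source,id,line\n"
    "corpus_a,0,これはペンです。\n"
    "corpus_a,1,漫画が好きです。\n"
)
lines = pd.read_csv(io.StringIO(csv_text))
print(list(lines.columns))
```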
### len_to_p.csv

CSV with columns:

- `len`: length of the text
- `p`: probability of a text of this length occurring in manga

Used by the synthetic data generator to roughly match the natural distribution of text lengths.
Computed from the Manga109-s dataset.
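Drawing lengths from such a table is a simple categorical sample. The toy probabilities below are placeholders, not the real values from `len_to_p.csv`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for assets/len_to_p.csv (the real file is computed from Manga109-s).
len_to_p = pd.DataFrame({"len": [1, 2, 3, 4], "p": [0.1, 0.4, 0.3, 0.2]})

def sample_text_length(df: pd.DataFrame, rng: np.random.Generator) -> int:
    """Draw a text length following the stored length distribution."""
    return int(rng.choice(df["len"], p=df["p"]))

rng = np.random.default_rng(0)
lengths = [sample_text_length(len_to_p, rng) for _ in range(1000)]
print(min(lengths), max(lengths))
```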
### vocab.csv

List of all characters supported by the tokenizer.
# Training OCR

`env.py` contains global constants used across the repo. Set your data paths etc. there.
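A minimal sketch of the kind of constants `env.py` holds; the example paths below are placeholders you would replace with your own (`MANGA109_ROOT`, `FONTS_ROOT`, and `DATA_SYNTHETIC_ROOT` are the names referenced elsewhere in this README):

```python
# Sketch of env.py-style constants; the paths below are placeholders.
from pathlib import Path

MANGA109_ROOT = Path("/data/manga109")         # Manga109-s dataset location
FONTS_ROOT = Path("/data/fonts")               # your font files
DATA_SYNTHETIC_ROOT = Path("/data/synthetic")  # synthetic data root
```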
1. Download the [Manga109-s](http://www.manga109.org/en/download_s.html) dataset.
2. Set `MANGA109_ROOT` so that your directory structure looks like this:

   ```
   <MANGA109_ROOT>/
       Manga109s_released_2021_02_28/
           annotations/
           annotations.v2018.05.31/
           images/
           books.txt
           readme.txt
   ```
3. Preprocess Manga109-s with `data/process_manga109s.py`.
4. Optionally, generate synthetic data (see below).
5. Train with `manga_ocr_dev/training/train.py`.
# Synthetic data generation

Generated data is split into packages (named `0000`, `0001`, etc.) for easier management of a large dataset.
Each package is assumed to have a similar data distribution, so that a properly balanced dataset
can be built from any subset of packages.
The data generation pipeline assumes the following directory structure:

```
<DATA_SYNTHETIC_ROOT>/
    img/      # generated images (output of the generation pipeline)
        0000/
        0001/
        ...
    lines/    # lines from a corpus (input to the generation pipeline)
        0000.csv
        0001.csv
        ...
    meta/     # metadata (output of the generation pipeline)
        0000.csv
        0001.csv
        ...
```
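The layout above can be created with a short helper; this is just a convenience sketch (the pipeline itself may create its own output directories), using the zero-padded 4-digit package naming shown above:

```python
import tempfile
from pathlib import Path

def make_synthetic_dirs(root: Path, n_packages: int) -> None:
    """Create the img/lines/meta layout expected by the generation pipeline."""
    for i in range(n_packages):
        # Packages are zero-padded 4-digit ids: 0000, 0001, ...
        (root / "img" / f"{i:04d}").mkdir(parents=True, exist_ok=True)
    (root / "lines").mkdir(parents=True, exist_ok=True)
    (root / "meta").mkdir(parents=True, exist_ok=True)

root = Path(tempfile.mkdtemp())
make_synthetic_dirs(root, 2)
print(sorted(p.name for p in (root / "img").iterdir()))
```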
To use a language corpus for data generation, `lines/*.csv` files must be provided.
For a small example of such a file, see `assets/lines_example.csv`.

To generate synthetic data:

1. Generate backgrounds with `data/generate_backgrounds.py`.
2. Put your fonts in `<FONTS_ROOT>`.
3. Generate font metadata with `synthetic_data_generator/scan_fonts.py`.
4. Optionally, manually label your fonts with `common`/`regular`/`special` labels.
5. Provide `<DATA_SYNTHETIC_ROOT>/lines/*.csv`.
6. Run `synthetic_data_generator/run_generate.py` for each package.
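One way to drive the last step is to build one invocation per package. The exact command-line interface of `run_generate.py` is an assumption here; check the script itself for its real arguments:

```python
# Hypothetical helper: build one run_generate.py invocation per package id.
# Passing the package id as a positional argument is an assumption,
# not the script's documented CLI.
def generation_commands(package_ids):
    script = "manga_ocr_dev/synthetic_data_generator/run_generate.py"
    return [f"python {script} {pkg}" for pkg in package_ids]

commands = generation_commands([f"{i:04d}" for i in range(3)])
print(commands[0])
```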