Tesseract - spectrumbranch/retro-translation-project GitHub Wiki
Training
Ground truth
Reading: https://github.com/tesseract-ocr/tesstrain?tab=readme-ov-file#provide-ground-truth-data
Data: https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip
Overview
- 1 instance of ground truth == 1 x.tif file + 1 x.gt.txt file, where x is the unique content identifier, i.e. the shared filename prefix (iog-0001 etc.)
- there are restrictions on what is considered acceptable ground truth:
  - the image content must be a single line of text only
  - the transcription has to agree with the training character set: every character inside the text file should be included in the charset and not excluded from it. There are ways to limit the domain of characters, but I believe that for the most part we are going to use the DEFAULT settings that already exist in the Japanese charsets that mort uses and Tesseract provides.
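
The rules above can be checked mechanically before a training run. The following is only a minimal sketch, not part of tesstrain: the function name `check_ground_truth` and the `allowed_chars` parameter are made up for illustration, and it assumes UTF-8 text files and the `.tif` / `.gt.txt` naming described above. It only inspects the text side, so it cannot tell whether an image really contains a single line.

```python
from pathlib import Path


def check_ground_truth(directory, allowed_chars=None):
    """Return a list of problems found in a ground-truth directory.

    Hypothetical helper, not a tesstrain API. `allowed_chars` is an optional
    string or set of permitted characters (an assumed stand-in for the charset).
    """
    root = Path(directory)
    problems = []

    # Collect the filename prefixes (iog-0001 etc.) on each side of the pair.
    images = {p.name[: -len(".tif")] for p in root.glob("*.tif")}
    texts = {p.name[: -len(".gt.txt")] for p in root.glob("*.gt.txt")}

    # Every image needs a transcription and vice versa.
    for prefix in sorted(images - texts):
        problems.append(f"{prefix}: missing .gt.txt")
    for prefix in sorted(texts - images):
        problems.append(f"{prefix}: missing .tif")

    for prefix in sorted(texts):
        text = (root / f"{prefix}.gt.txt").read_text(encoding="utf-8")
        line = text.rstrip("\r\n")  # a trailing newline is fine

        # The transcription must be exactly one non-empty line.
        if "\n" in line or "\r" in line:
            problems.append(f"{prefix}: transcription spans more than one line")
        if not line.strip():
            problems.append(f"{prefix}: transcription is empty")

        # Optionally flag characters that are outside the charset.
        if allowed_chars is not None:
            unknown = sorted(set(line) - set(allowed_chars) - {"\n", "\r"})
            if unknown:
                problems.append(
                    f"{prefix}: characters not in charset: {''.join(unknown)}"
                )

    return problems
```

Calling `check_ground_truth("path/to/ground-truth")` (the path is a placeholder) returns an empty list when every pair is complete and every transcription is a single line. Passing `allowed_chars` is optional, since we expect to rely on the default charsets.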