Tesseract - spectrumbranch/retro-translation-project GitHub Wiki

Training

One instance of ground truth is one x.tif file plus one x.gt.txt file, where x is the unique content identifier, i.e. the filename prefix (iog-0001 etc.).
There are restrictions on what counts as acceptable ground truth:
The image content must be a single line of text only.
The text must agree with the training character set: every character in the text file should be included in the charset and not excluded from it. (There are ways to limit the domain of characters, but I believe that for the most part we are going to use the DEFAULT settings that already exist in the Japanese charsets that mort uses and Tesseract provides.)
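
The rules above can be checked mechanically before training. What follows is a minimal sketch, not part of the project's tooling: the function name check_ground_truth and the allowed_chars parameter are hypothetical, and it assumes all ground truth files sit in one flat directory and are UTF-8 encoded. It can only verify the text side (one line per .gt.txt file); whether the image itself holds a single line still has to be checked by eye.

```python
from pathlib import Path


def check_ground_truth(directory, allowed_chars=None):
    """Return a list of problems found in a ground truth directory.

    Hypothetical helper: checks that every x.tif has an x.gt.txt (and the
    reverse), that each .gt.txt holds exactly one line of text, and,
    if allowed_chars is given, that no character falls outside that set.
    """
    directory = Path(directory)
    problems = []

    # Collect the identifier (filename prefix) on each side of the pair.
    tif_ids = {p.name[: -len(".tif")] for p in directory.glob("*.tif")}
    txt_ids = {p.name[: -len(".gt.txt")] for p in directory.glob("*.gt.txt")}

    for ident in sorted(tif_ids - txt_ids):
        problems.append(f"{ident}: missing {ident}.gt.txt")
    for ident in sorted(txt_ids - tif_ids):
        problems.append(f"{ident}: missing {ident}.tif")

    for ident in sorted(txt_ids):
        text = (directory / f"{ident}.gt.txt").read_text(encoding="utf-8")

        # Ignore leading/trailing whitespace so a final newline is fine.
        lines = text.strip().splitlines()
        if len(lines) != 1:
            problems.append(
                f"{ident}: expected 1 line of text, found {len(lines)}"
            )

        if allowed_chars is not None:
            # Line terminators are not part of the transcription itself.
            unknown = sorted(set(text) - set(allowed_chars) - {"\n", "\r"})
            if unknown:
                problems.append(
                    f"{ident}: characters not in charset: {''.join(unknown)}"
                )

    return problems
```

Calling check_ground_truth("path/to/ground-truth") returns an empty list when every pair is complete and single-line. Passing allowed_chars is optional; since we mostly plan to rely on the default charsets, the character check is only useful if we later decide to restrict the domain ourselves.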