Home - sakin070/train-tesseract GitHub Wiki
Training workflow for Tesseract 4 as a Makefile for dependency tracking and building the required software from source.
Install
leptonica, tesseract
You will need a recent version (>= 4.0.0beta1) of tesseract built with the training tools and matching leptonica bindings. Build instructions and more can be found in the Tesseract project wiki.
Alternatively, you can build leptonica and tesseract from source and install them to a subdirectory ./usr in the repo:
make leptonica tesseract
Tesseract will be built from the git repository, which requires CMake, autotools (including autoconf-archive) and some additional libraries for the training tools. See the installation notes in the tesseract repository.
You need a recent version of Python 3.x. The Python library Pillow is used for image processing. If you don't have a global installation of it, install it from the provided requirements file with pip install -r requirements.txt.
Choose a name for your model. By convention, Tesseract stack models including language-specific resources use (lowercase) three-letter codes defined in ISO 639 with additional information separated by underscore. E.g., chi_tra_vert for traditional Chinese with vertical typesetting. Language-independent (i.e. script-specific) models use the capitalized name of the script type as identifier. E.g., Hangul_vert for Hangul script with vertical typesetting. In the following, the model name is referenced by MODEL_NAME.
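As a rough illustration of the naming convention, the following Python sketch classifies a candidate model name. The function name and the regular expressions are hypothetical; they encode the convention as described here and are not a check that Tesseract or the Makefile performs.

```python
import re

# Hypothetical patterns encoding the convention described above; Tesseract
# itself does not enforce them.
# Language-specific: lowercase three-letter code, then optional "_extra" parts.
LANGUAGE_MODEL = re.compile(r"[a-z]{3}(_[A-Za-z0-9]+)*")
# Script-specific: capitalized script name, then optional "_extra" parts.
SCRIPT_MODEL = re.compile(r"[A-Z][a-z]+(_[A-Za-z0-9]+)*")


def classify_model_name(name: str) -> str:
    """Return 'language', 'script', or 'unconventional' for a model name."""
    if LANGUAGE_MODEL.fullmatch(name):
        return "language"
    if SCRIPT_MODEL.fullmatch(name):
        return "script"
    return "unconventional"


if __name__ == "__main__":
    for candidate in ("chi_tra_vert", "Hangul_vert", "my-model"):
        print(candidate, "->", classify_model_name(candidate))
```

A name reported as unconventional can still be used; the convention only helps keep model names consistent with those of the stock models.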
Place ground truth consisting of line images and transcriptions in the folder data/MODEL_NAME-ground-truth. This list of files will be split into training and evaluation data; the ratio is defined by the RATIO_TRAIN variable.
Images must be either TIFF with the extension .tif, or PNG with the extension .png, .bin.png or .nrm.png.
Transcriptions must be single-line plain text and have the same name as the line image but with the image extension replaced by .gt.txt.
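Before training, it can help to confirm that every line image has a matching transcription. The Python sketch below pairs images with their .gt.txt files and splits the pairs by a ratio. All function names are hypothetical, and the simple ordered split is an assumption made for illustration; the Makefile may select training and evaluation files differently.

```python
from pathlib import Path

# Accepted line-image extensions. Longer suffixes come first so that
# "line.bin.png" maps to "line.gt.txt" rather than "line.bin.gt.txt".
IMAGE_SUFFIXES = (".bin.png", ".nrm.png", ".png", ".tif")


def transcription_for(image: Path) -> Path:
    """Return the expected .gt.txt path for a line image."""
    for suffix in IMAGE_SUFFIXES:
        if image.name.endswith(suffix):
            stem = image.name[: -len(suffix)]
            return image.with_name(stem + ".gt.txt")
    raise ValueError(f"unsupported image extension: {image.name}")


def collect_pairs(ground_truth_dir: Path):
    """Pair each line image with its transcription.

    Returns (pairs, missing); 'missing' lists images whose .gt.txt is absent.
    """
    pairs, missing = [], []
    for image in sorted(ground_truth_dir.iterdir()):
        if not image.name.endswith(IMAGE_SUFFIXES):
            continue  # skips .gt.txt files and anything else
        text = transcription_for(image)
        (pairs if text.is_file() else missing).append((image, text))
    return pairs, missing


def split_pairs(pairs, ratio_train: float):
    """Split pairs into (training, evaluation) lists by the given ratio."""
    cut = int(len(pairs) * ratio_train)
    return pairs[:cut], pairs[cut:]
```

Calling collect_pairs(Path("data/MODEL_NAME-ground-truth")) with your own model name substituted returns the usable pairs together with any images that still lack a transcription.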
make training MODEL_NAME=name-of-the-resulting-model
If your model is not in the tessdata directory, move it there. Use your model by specifying the name of the .traineddata file (without the extension) as the language when running Tesseract.
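One way to call the trained model from Python is to invoke the tesseract command-line tool, as in the minimal sketch below. The helper names are hypothetical. It assumes tesseract is on your PATH, and it uses the -l and --tessdata-dir options and the stdout output base of the Tesseract CLI.

```python
import subprocess


def build_ocr_command(image_path, model_name, tessdata_dir=None):
    """Build the tesseract command line for one image.

    model_name is the .traineddata file name without its extension.
    """
    # "stdout" as the output base makes tesseract print the recognized text.
    cmd = ["tesseract", str(image_path), "stdout", "-l", model_name]
    if tessdata_dir is not None:
        # Only needed when the model is not in the default tessdata directory.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def recognize(image_path, model_name, tessdata_dir=None) -> str:
    """Run tesseract on an image and return the recognized text."""
    result = subprocess.run(
        build_ocr_command(image_path, model_name, tessdata_dir),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout
```

Passing tessdata_dir is an alternative to moving the model: it points Tesseract at whichever directory holds the .traineddata file.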