Home - sakin070/train-tesseract GitHub Wiki

tesstrain

Training workflow for Tesseract 4 as a Makefile for dependency tracking and building the required software from source.

Install

leptonica, tesseract

You will need a recent version (>= 4.0.0beta1) of tesseract built with the training tools and matching leptonica bindings. Build instructions and more can be found in the Tesseract project wiki.

Alternatively, you can build leptonica and tesseract from within this project and install them to a subdirectory ./usr in the repo:

make leptonica tesseract

Tesseract will be built from the git repository, which requires CMake, autotools (including autoconf-archive) and some additional libraries for the training tools. See the installation notes in the tesseract repository.

Python

You need a recent version of Python 3.x. The Python library Pillow is used for image processing. If you don't have a global installation of it, install it from the provided requirements file: pip install -r requirements.txt

Choose model name

Choose a name for your model. By convention, Tesseract models that include language-specific resources use (lowercase) three-letter codes defined in ISO 639, with additional information separated by underscores, e.g. chi_tra_vert for traditional Chinese with vertical typesetting. Language-independent (i.e. script-specific) models use the capitalized name of the script as identifier, e.g. Hangul_vert for Hangul script with vertical typesetting. In the following, the model name is referred to as MODEL_NAME.
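As a rough illustration of the naming convention above, the following sketch sorts a candidate name into one of the two styles. The function name classify_model_name is hypothetical and not part of this project, and the patterns are only an approximation of the convention, not a rule enforced by Tesseract.

```shell
# Hypothetical helper: guess which naming convention a model name follows.
classify_model_name() {
    case $1 in
        # three lowercase letters, optionally followed by _extra_info
        [[:lower:]][[:lower:]][[:lower:]] | [[:lower:]][[:lower:]][[:lower:]]_?*)
            echo "language-specific" ;;
        # capitalized script name, e.g. Hangul or Hangul_vert
        [[:upper:]]*)
            echo "script-specific" ;;
        *)
            echo "unconventional" ;;
    esac
}

classify_model_name chi_tra_vert   # prints: language-specific
classify_model_name Hangul_vert    # prints: script-specific
```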

Provide ground truth

Place ground truth consisting of line images and transcriptions in the folder data/MODEL_NAME-ground-truth. This list of files will be split into training and evaluation data; the ratio is defined by the RATIO_TRAIN variable.

Images must be either TIFF with the extension .tif, or PNG with the extension .png, .bin.png or .nrm.png.

Transcriptions must be single-line plain text and have the same name as the line image but with the image extension replaced by .gt.txt.
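Before training, it can help to verify that every line image has a matching transcription. The sketch below is one way to do that with plain POSIX shell; check_ground_truth is a hypothetical helper, not something shipped with this project, and it only applies the file naming rules described above.

```shell
# Hypothetical helper: report line images without a usable .gt.txt file.
# Usage: check_ground_truth data/MODEL_NAME-ground-truth
check_ground_truth() {
    dir=$1
    problems=0
    for img in "$dir"/*.tif "$dir"/*.png; do
        [ -e "$img" ] || continue        # glob matched nothing
        # strip the image extension (.tif, .png, .bin.png or .nrm.png)
        base=${img%.tif}
        base=${base%.png}
        base=${base%.bin}
        base=${base%.nrm}
        gt="$base.gt.txt"
        if [ ! -f "$gt" ]; then
            echo "missing transcription: $gt"
            problems=$((problems + 1))
        elif [ $(( $(wc -l < "$gt") )) -gt 1 ]; then
            echo "not single-line: $gt"
            problems=$((problems + 1))
        fi
    done
    # succeed only if no problems were found
    [ "$problems" -eq 0 ]
}
```

The function prints one line per problem and returns a non-zero status if any were found, so it can be chained in front of the training step with &&.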

Train

make training MODEL_NAME=name-of-the-resulting-model
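Because RATIO_TRAIN is a Makefile variable, it can presumably be overridden on the command line in the same way as MODEL_NAME. The wrapper below is a minimal sketch under that assumption; train_model is a hypothetical name, and the value 0.90 is only an illustrative ratio. It refuses to start if the ground truth folder described earlier does not exist.

```shell
# Hypothetical wrapper around "make training".
# $1 = model name, $2 = training ratio (illustrative default: 0.90)
train_model() {
    model=$1
    ratio=${2:-0.90}
    gt_dir="data/$model-ground-truth"
    if [ ! -d "$gt_dir" ]; then
        echo "ground truth folder not found: $gt_dir" >&2
        return 1
    fi
    # variables given on the make command line override Makefile defaults
    make training MODEL_NAME="$model" RATIO_TRAIN="$ratio"
}
```

Run it from the root of the repository, for example as train_model my_model 0.80.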

Using your newly trained model

If your model is not in the tessdata directory, move it there. Use your model by specifying the name of the .traineddata file, without the extension, as the language when running Tesseract.
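The following sketch shows how this could look on the command line. It uses the standard tesseract options -l and --tessdata-dir. The function names and all paths passed to them are hypothetical, and the location of your tessdata directory depends on your installation.

```shell
# Derive the language name Tesseract expects from a .traineddata path.
lang_from_traineddata() {
    basename "$1" .traineddata
}

# Hypothetical wrapper: make sure the model is in the tessdata directory,
# then recognize one image with it and print the text to standard output.
# $1 = model file, $2 = tessdata directory, $3 = input image
ocr_with_model() {
    if [ ! -f "$2/$(basename "$1")" ]; then
        cp "$1" "$2/" || return 1
    fi
    tesseract "$3" stdout --tessdata-dir "$2" -l "$(lang_from_traineddata "$1")"
}

lang_from_traineddata /some/path/chi_tra_vert.traineddata   # prints: chi_tra_vert
```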
