Home - tesseract-ocr/tesstrain GitHub Wiki

Welcome to the tesstrain wiki!

tesstrain (formerly ocrd-train) is a collection of scripts and documentation for training Tesseract's LSTM-based recognition engine (supported by Tesseract 4 and newer releases).

It currently includes a Makefile that trains from real line images paired with ground truth (text transcriptions). Such data is available from a number of sources; see https://github.com/cneud/ocr-gt for a list.
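
Because training needs every line image to come with a transcription, it can help to check the pairing before starting a run. The following is a minimal sketch, not part of tesstrain itself. It assumes that line images end in `.tif`, `.png`, `.bin.png` or `.nrm.png`, and that the transcription has the same base name with the suffix `.gt.txt`. The helper names `base_name` and `find_unpaired` are hypothetical, and the suffixes should be adjusted if your data is organised differently.

```python
from pathlib import Path

# Assumed naming convention (adjust to your data). Compound suffixes are
# listed first so that "page.bin.png" is not reduced to "page.bin".
IMAGE_SUFFIXES = (".bin.png", ".nrm.png", ".png", ".tif")
GT_SUFFIX = ".gt.txt"


def base_name(image: Path) -> str:
    """Return the file name with its (possibly compound) image suffix removed."""
    for suffix in IMAGE_SUFFIXES:
        if image.name.endswith(suffix):
            return image.name[: -len(suffix)]
    raise ValueError(f"not a recognised line image: {image}")


def find_unpaired(gt_dir: Path) -> list[Path]:
    """Return line images in gt_dir that lack a non-empty transcription file."""
    missing = []
    for image in sorted(gt_dir.iterdir()):
        if not image.name.endswith(IMAGE_SUFFIXES):
            continue  # skip transcriptions and unrelated files
        gt_file = gt_dir / (base_name(image) + GT_SUFFIX)
        has_text = gt_file.is_file() and gt_file.read_text(encoding="utf-8").strip()
        if not has_text:
            missing.append(image)
    return missing
```

Calling `find_unpaired(Path("my-ground-truth"))` on a hypothetical directory returns the images that still need a transcription; an empty list means every line image is paired.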

Training from synthetic images is supported by training scripts (Shell, Python) which are still part of the Tesseract code base.

Examples

Training Fraktur with Austrian Newspapers
Training Fraktur with Neue Zürcher Zeitung
Training Fraktur with GT4HistOCR
Training Fraktur and Handwriting with German primers
Training Arabic Handwriting
Training Handwritten Text with German Konzilsprotokolle