Home - tesseract-ocr/tesstrain GitHub Wiki
Welcome to the tesstrain wiki!
tesstrain (formerly ocrd-train) is a collection of scripts and documentation for training Tesseract's LSTM models (supported by Tesseract 4 and newer releases).
It currently includes a Makefile that allows training from real line images with ground truth (text transcriptions).
Such data is available from a number of sources; see https://github.com/cneud/ocr-gt for a list.
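Because training relies on every line image having a matching transcription, it can help to check a data set before starting. The following is a minimal sketch, not part of tesstrain itself: the function name `pair_ground_truth` is hypothetical, and it assumes that images use a `.png` or `.tif` extension and that each transcription sits next to its image under the same base name with a `.gt.txt` extension. Adjust both assumptions to match your data.

```python
from pathlib import Path

# Assumption: line images are stored as .png or .tif files.
IMAGE_SUFFIXES = {".png", ".tif"}

# Assumption: the transcription for "line.png" is named "line.gt.txt".
GROUND_TRUTH_SUFFIX = ".gt.txt"


def pair_ground_truth(directory):
    """Pair each line image in `directory` with its transcription file.

    Returns (pairs, missing): `pairs` is a list of (image, transcription)
    paths, and `missing` lists the images that have no transcription.
    """
    pairs = []
    missing = []
    for image in sorted(Path(directory).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip transcriptions and unrelated files
        transcription = image.with_name(image.stem + GROUND_TRUTH_SUFFIX)
        if transcription.is_file():
            pairs.append((image, transcription))
        else:
            missing.append(image)
    return pairs, missing
```

Calling `pair_ground_truth("my-ground-truth")` (a hypothetical directory name) returns the usable pairs together with the images that still need a transcription, so gaps can be fixed before a training run.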
Training from synthetic images is supported by training scripts (Shell, Python) which are still part of the Tesseract code base.
Examples
- Training Fraktur with Austrian Newspapers
- Training Fraktur with Neue Zürcher Zeitung
- Training Fraktur with GT4HistOCR
- Training Fraktur and Handwriting with German primers
- Training Arabic Handwriting
- Training Handwritten Text with German Konzilsprotokolle