Training via text and fonts - sakin070/train-tesseract GitHub Wiki

Welcome to the train-tesseract wiki!

Tesseract 4.00 includes a new neural network-based recognition engine that delivers significantly higher accuracy (on document images) than the previous versions, in return for a significant increase in required compute power. On complex languages, however, it may actually be faster than base Tesseract.

Neural networks require significantly more training data and train a lot slower than base Tesseract. For Latin-based languages, the existing model data provided has been trained on about 400000 text lines spanning about 4500 fonts. For other scripts, not as many fonts are available, but they have still been trained on a similar number of text lines. Instead of taking a few minutes to a couple of hours to train, Tesseract 4.00 takes a few days to a couple of weeks. Even with all this new training data, you might find it inadequate for your particular problem, and therefore you are here wanting to retrain it.

Setup required for training

Additional Libraries Required

sudo apt-get install libicu-dev libpango1.0-dev libcairo2-dev

Once the above additional libraries have been installed, run the following from the Tesseract source directory:

./configure

If you plan to run in Docker (or do not require graphics):

./configure --disable-graphics

If you have the required dependencies for training, your output should look like:

checking for pkg-config... [some valid path]

checking for lept >= 1.74... yes

checking for libarchive... yes

checking for icu-uc >= 52.1... yes

checking for icu-i18n >= 52.1... yes

checking for pango >= 1.22.0... yes

checking for cairo... yes

[...]

If configure does not report that the training tools can be built, you still need to add libraries or ensure that pkg-config can find them.

After configuring, the training tools can be built and installed with:

make

make training

sudo make training-install

It is also useful, but not required, to build ScrollView.jar:

make ScrollView.jar

export SCROLLVIEW_PATH=$PWD/java

Overview of the Training Process

  1. Prepare the training text.
  2. Render text to image + box file. (Or create hand-made box files for existing image data.)
  3. Make unicharset file. (Can be partially specified, i.e. created manually.)
  4. Make a starter traineddata from the unicharset and optional dictionary data.
  5. Run tesseract to process image + box file to make training data set.
  6. Run training on training data set.
  7. Combine data files.

Data files required

To train for another language, you have to create some data files in the tessdata subdirectory, and then crunch these together into a single file, using combine_tessdata. The naming convention is languagecode.file_name. Language codes for released files follow the ISO 639-3 standard, but any string can be used. The files used for English are:

  • tessdata/eng.config
  • tessdata/eng.unicharset
  • tessdata/eng.unicharambigs
  • tessdata/eng.inttemp
  • tessdata/eng.pffmtable
  • tessdata/eng.normproto
  • tessdata/eng.punc-dawg
  • tessdata/eng.word-dawg
  • tessdata/eng.number-dawg
  • tessdata/eng.freq-dawg

... and the final crunched file is:

  • tessdata/eng.traineddata
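Assuming the component files above exist, the crunching can be sketched with combine_tessdata; the `tessdata/` prefix here is illustrative and must match where your component files actually live.

```shell
# Pack all tessdata/eng.* component files into tessdata/eng.traineddata.
combine_tessdata tessdata/eng.

# List the components inside an existing traineddata file.
combine_tessdata -d tessdata/eng.traineddata

# Unpack the components back out again, e.g. for inspection.
combine_tessdata -u tessdata/eng.traineddata tessdata/eng.
```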

Requirements for text input files

Text input files (lang.config, lang.unicharambigs, font_properties, box files, wordlists for dictionaries...) need to meet these criteria:

  • ASCII or UTF-8 encoding without BOM
  • Unix end-of-line marker ('\n')
  • The last character must be an end of line marker ('\n'). Some text editors will show this as an empty line at the end of file. If you omit this you will get an error message containing last_char == '\n':Error:Assert failed....
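The newline and BOM criteria above can be checked (and a missing final newline repaired) from the shell. The filename `wordlist.txt` is just an example stand-in for any of your text input files:

```shell
# Create an example input file that violates the rules: no trailing newline.
printf 'last line without newline' > wordlist.txt

# Check whether the file ends with '\n'; tail -c 1 prints the final byte,
# and command substitution strips a trailing newline, so non-empty means bad.
if [ -n "$(tail -c 1 wordlist.txt)" ]; then
  echo "missing final newline; appending one"
  printf '\n' >> wordlist.txt
fi

# Check for a UTF-8 BOM (hex ef bb bf) at the start of the file.
head -c 3 wordlist.txt | od -An -tx1
```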

As with base Tesseract, the completed LSTM model and everything else it needs is collected in the traineddata file. A starter traineddata file is given during training, and has to be set up in advance. It can contain:

  • Config file providing control parameters.
  • Unicharset defining the character set.
  • Unicharcompress, aka the recoder, which maps the unicharset further to the codes actually used by the neural network recognizer.
  • Punctuation pattern dawg, with patterns of punctuation allowed around words.
  • Word dawg. The system word-list language model.
  • Number dawg, with patterns of numbers that are allowed.

The unicharset and the recoder must be provided. The others are optional, but if any of the dawgs is provided, the punctuation dawg must also be provided. A new tool, combine_lang_model, is provided to make a starter traineddata from a unicharset and optional wordlists.
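A sketch of building a starter traineddata with combine_lang_model follows. The ../langdata directory layout and the wordlist filenames are assumptions about your setup; the optional --words/--puncs/--numbers inputs feed the dawgs described above.

```shell
# Build a starter traineddata from a unicharset plus optional wordlists.
# Only the unicharset is strictly required; the wordlist inputs are optional.
combine_lang_model \
  --input_unicharset ../langdata/eng/eng.unicharset \
  --script_dir ../langdata \
  --words ../langdata/eng/eng.wordlist \
  --puncs ../langdata/eng/eng.punc \
  --numbers ../langdata/eng/eng.numbers \
  --output_dir ./starter \
  --lang eng
```

The result lands in ./starter/eng/eng.traineddata, ready to pass to lstmtraining.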

During training, the trainer writes checkpoint files, which is standard behavior for neural network trainers. This allows training to be stopped and continued again later if desired. Any checkpoint can be converted to a full traineddata for recognition by using the --stop_training command-line flag.

Prepare a text file

The first step is to determine the full character set to be used and prepare a text or word processor file containing a set of examples. The most important points to bear in mind when creating a training file are:

Make sure there are a minimum number of samples of each character. 10 is good, but 5 is OK for rare characters.

There should be more samples of the more frequent characters - at least 20.

Don't make the mistake of grouping all the non-letters together. Make the text more realistic.
For example:

The quick brown fox jumps over the lazy dog. 0123456789 !@#$%^&(),.{}<>/?

is terrible! Much better is:

The (quick) brown {fox} jumps! over the $3,456.78 <lazy> #90 dog & duck/goose, as 12.5% of E-mail from [email protected] is spam?

This gives the text-line finding code a much better chance of getting sensible baseline metrics for the special characters.

Prepare a UTF-8 text file (training_text.txt) containing your training text according to the above specification.
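To check that each character appears often enough (at least 10 samples, and at least 20 for frequent characters, per the guidance above), a small shell pipeline can count per-character frequencies. The short training text written here is only an example stand-in for your real training_text.txt:

```shell
# Write a tiny example training text (stands in for your real training_text.txt).
printf 'The (quick) brown {fox} jumps! over the $3,456.78 dog.\n' > training_text.txt

# Count occurrences of each non-space character, rarest first.
# Characters near the top of this list may need more samples added.
grep -o . training_text.txt | grep -v '^ $' | sort | uniq -c | sort -n | head
```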

Obtain truetype/opentype font files for the fonts that you wish to recognize.

training/text2image --text=training_text.txt --outputbase=[lang].[fontname].exp0 --font='Font Name' --fonts_dir=/path/to/your/fonts

Note that the argument to --font may contain spaces, and thus must be quoted. For example:

training/text2image --text=training_text.txt --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman Bold' --fonts_dir=/usr/share/fonts

To list all fonts in your system which can render the training text, run:

training/text2image --text=training_text.txt --outputbase=eng --fonts_dir=/usr/share/fonts  --find_fonts --min_coverage=1.0 --render_per_font=false

In this example, the training_text.txt file contains text written in English. An 'eng.fontlist.txt' file will be created.

There are a lot of other command-line arguments available to text2image. Run text2image --help to get more information.

If you used text2image, you can move on to the Run Tesseract for Training step.

Creating Training Data

If you use tesstrain.sh, then the required synthetic training data (box/tiff pairs and lstmf files) is created from the training text and a given list of fonts.

Using tesstrain.sh

The setup for running tesstrain.sh is the same as for base Tesseract. Use --linedata_only option for LSTM training. Note that it is beneficial to have more training text and make more pages though, as neural nets don’t generalize as well and need to train on something similar to what they will be running on. If the target domain is severely limited, then all the dire warnings about needing a lot of training data may not apply, but the network specification may need to be changed.

Training data is created using tesstrain.sh as follows:

src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain

And the following is printed out after a successful run:

Created starter traineddata for LSTM training of language 'eng'

Run 'lstmtraining' command to continue LSTM training for language 'eng'

The following example shows the command line for training from scratch. Try it with the default training data created with the command-lines above.

mkdir -p ~/tesstutorial/engoutput

training/lstmtraining --debug_interval 100 \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --model_output ~/tesstutorial/engoutput/base --learning_rate 20e-4 \
  --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log

To evaluate a model:

training/lstmeval --model ~/tesstutorial/engoutput/base_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt

Combining the Output Files

training/lstmtraining --stop_training \
  --continue_from ~/tesstutorial/eng_from_chi/base_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --model_output ~/tesstutorial/eng_from_chi/eng.traineddata

Using your newly trained model

If your model is not in the tessdata directory, move it there. Use your model by passing the part of its filename before .traineddata as the language when running tesseract.
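For example, if the finished model was installed as tessdata/foo.traineddata (where foo is a placeholder model name), it can be selected with -l; the --tessdata-dir flag can point at a non-default tessdata directory:

```shell
# Hypothetical invocation: recognize image.png with the custom 'foo' model,
# writing the recognized text to output.txt.
tesseract image.png output --tessdata-dir ./tessdata -l foo
```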
