tesstrain.sh - nickjwhite/tesseract GitHub Wiki

tesstrain.sh is a script that automatically calls the appropriate programs to create a new training for a language.

It uses various programs for training, so you need to build them with 'make training' before using it.

Example usage

$ tesstrain.sh --lang kan --langdata_dir /path/to/mylangdata

This is the most common usage of tesstrain.sh. See the comments at the top of tesstrain.sh for a description of all of the arguments it accepts.

Langdata

tesstrain.sh needs certain files to use in the training process. These are normally stored in a 'langdata' directory. The langdata for the languages officially supported by Tesseract is stored in the langdata repository, but you can keep your own langdata wherever you want.

Required files

At a minimum, the following file must be provided in your langdata directory (note that [lang] is the language code, e.g. 'kan' or 'eng'):

[lang]/[lang].training_text (alternatively this can be specified on the command line with --training_text /path/to/textfile)

Optional files

These files can optionally be in your langdata directory and will be used by tesstrain.sh to improve the quality of the training:

[lang]/[lang].wordlist (alternatively this can be specified on the command line with --wordlist /path/to/wordlist)
[lang]/[lang].word.bigrams
[lang]/[lang].numbers
[lang]/[lang].punc
[lang]/[lang].config
[lang]/[lang].unicharambigs
[lang]/[lang].training_text.bigram_freqs
[lang]/[lang].training_text.unigram_freqs
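
Before starting a training run it can help to confirm which of these files are actually in place. The following is a minimal sketch, not part of tesstrain.sh: check_langdata is a hypothetical helper name, and it only tests for the file names listed above under the [lang]/[lang].* layout.

```shell
# Hypothetical helper (not part of tesstrain.sh): report which langdata
# files exist for a language.
# Usage: check_langdata /path/to/mylangdata kan
check_langdata() {
    langdata_dir=$1
    lang=$2
    base="$langdata_dir/$lang/$lang"

    # The training text is the only required file
    # (unless it is passed with --training_text instead).
    if [ ! -f "$base.training_text" ]; then
        echo "missing required file: $base.training_text" >&2
        return 1
    fi

    # Optional files from the list above; print the ones that are present.
    for ext in wordlist word.bigrams numbers punc config unicharambigs \
               training_text.bigram_freqs training_text.unigram_freqs; do
        if [ -f "$base.$ext" ]; then
            echo "found optional file: $lang.$ext"
        fi
    done
}
```

The function returns a non-zero status when the training text is missing, so it can be used as a guard in a wrapper script before calling tesstrain.sh.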

General files

These are general files that can affect multiple languages, but may be edited if needed (note that [Script] is the script name, e.g. 'Greek' or 'Hebrew'):

[Script].xheights
[Script].unicharset
font_properties

Langdata files

training_text

The training text is a text file that will be used to train Tesseract for the language. It should contain several samples of each character, and be as close to a realistic sample of text as possible.

wordlist

The wordlist is a text file with a list of words, one per line, ordered by decreasing frequency (so the most common word first).
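
If you do not already have a wordlist, one rough way to produce a list in this order is to count the words in a representative text. This is only a sketch: make_wordlist is a hypothetical helper name, words are split on whitespace only, and punctuation attached to words is not stripped, so the result will usually need cleaning by hand.

```shell
# Hypothetical helper: print the distinct words in a text file,
# most frequent first, one per line.
# Usage: make_wordlist corpus.txt > kan.wordlist
make_wordlist() {
    tr -s '[:space:]' '\n' < "$1" |   # put each word on its own line
        grep -v '^$' |                # drop empty lines
        sort | uniq -c |              # count each distinct word
        sort -k1,1nr -k2 |            # highest count first, ties by word
        awk '{ print $2 }'            # keep only the word
}
```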

training_text.bigram_freqs

This is a text file with a list of bigrams (here meaning two characters) and the frequency with which they appear next to each other in the training_text, one bigram per line.

training_text.unigram_freqs

This is a text file with a list of unigrams (characters) and the frequency with which each one appears in the training_text, one unigram per line.
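
Character counts like these can be gathered with standard tools. The sketch below is illustrative only: unigram_freqs is a hypothetical helper name, and the "character count" column layout it prints is an assumption, so compare it with an existing .unigram_freqs file from the langdata repository before relying on it. It uses grep -o, which GNU and BSD grep provide, and needs a UTF-8 locale to treat multi-byte characters as single characters.

```shell
# Hypothetical helper: count how often each non-whitespace character
# occurs in a file. The "<character> <count>" layout is an assumption.
# Usage: unigram_freqs kan.training_text
unigram_freqs() {
    grep -o '[^[:space:]]' "$1" |     # one character per line
        sort | uniq -c |              # count each distinct character
        sort -k1,1nr -k2 |            # most frequent first
        awk '{ print $2, $1 }'        # print "character count"
}
```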

word.bigrams

This is a text file with a list of bigrams (here meaning two words) separated by a space, one per line, sorted by how often they appear next to each other (most common at the top).
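
A list in this shape can be approximated by counting adjacent word pairs in a representative text. The following is a sketch under simple assumptions: word_bigrams is a hypothetical helper name, words are split on whitespace only, and the last word of one line is paired with the first word of the next.

```shell
# Hypothetical helper: print adjacent word pairs, most frequent first,
# as "word1 word2", one pair per line.
# Usage: word_bigrams corpus.txt > kan.word.bigrams
word_bigrams() {
    # Emit every pair of neighbouring words, carrying the previous
    # word across line breaks.
    awk '{
        for (i = 1; i <= NF; i++) {
            if (prev != "") print prev, $i
            prev = $i
        }
    }' "$1" |
        sort | uniq -c |              # count each distinct pair
        sort -k1,1nr -k2 |            # most frequent first
        awk '{ print $2, $3 }'        # drop the count
}
```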

numbers, punc, config, unicharambigs

These files are all combined directly into the final training. See the TrainingTesseract page for details of their function and how to use them.

Options in `language-specific.sh`

You can optionally set training options and per-language defaults (such as the fonts to use) by editing the training/language-specific.sh script.

To set options for your language, look towards the end of language-specific.sh where variables are set according to the language code (the set_lang_specific_parameters() function).

Fonts

You can either set the fonts to use as an argument to tesstrain.sh (as --fontlist 'Font 1+Font 2') or by specifying the fonts in language-specific.sh.
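
When the font names come from a list in a wrapper script, the '+'-separated value can be built rather than typed by hand. This is a small sketch: join_fonts is a hypothetical helper name, and 'Font 1' and 'Font 2' are placeholders for real font names.

```shell
# Hypothetical helper: join font names with '+' to form the value
# passed to --fontlist in the format described above.
join_fonts() {
    fontlist=$1
    shift
    for font in "$@"; do
        fontlist="$fontlist+$font"
    done
    printf '%s\n' "$fontlist"
}

# Example invocation (font names are placeholders):
# tesstrain.sh --lang kan --langdata_dir /path/to/mylangdata \
#     --fontlist "$(join_fonts 'Font 1' 'Font 2')"
```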

Other options

For other variables that can be set, you're best off just reading the set_lang_specific_parameters() function to see what they do, and tesstrain_utils.sh to see exactly how they are used.

tesstrain_utils.sh

Note that the majority of the work that tesstrain.sh does is actually imported from the training/tesstrain_utils.sh file, so if you're curious about exactly what it is doing, or want to change its behaviour, look there.