Tesseract - lmmx/devnotes GitHub Wiki

Installing dependencies and compiling Tesseract from source

GitHub source for Tesseract
Tesseract docs: 'tessdoc'
Install additional languages via apt as tesseract-ocr-{lang}, where {lang} is the language code, or as tesseract-ocr-all for everything that is packaged
- Oddly, grc (Ancient Greek) isn't packaged, although ell (Modern Greek) is. It doesn't seem to be in all either (check the dependency list with apt show)
- run as tesseract input.png out -l eng+ell etc.
After installing, run pip install pytesseract and pass the same -l string (e.g. eng+ell) as the lang parameter
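A minimal sketch of building that -l string in Python. build_lang_arg is a hypothetical helper (it is not part of pytesseract), and the commented-out call assumes that pytesseract and Pillow are installed and that input.png exists.

```python
def build_lang_arg(codes):
    """Join language codes into the 'eng+ell' form used by -l and lang=."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


lang = build_lang_arg(["eng", "ell"])
print(lang)  # eng+ell

# Assumed usage once pytesseract and Pillow are installed:
#   import pytesseract
#   from PIL import Image
#   text = pytesseract.image_to_string(Image.open("input.png"), lang=lang)
```

Keeping the codes in a list makes it easier to add or drop a language than editing the joined string by hand.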

There are multiple interfacing functions to pytesseract:

pytesseract.image_to_boxes
- these are per-character bounding boxes
pytesseract.image_to_data
- the bounding boxes are for entire words (with extra rows for the enclosing page, block, paragraph and line)
- the result is a TSV string, so it's more readable to wrap it in io.StringIO and pass that to pandas.read_csv with sep="\t" to load it into a DF (that can be paged with pydoc)
- this page suggests supplying pytesseract.Output.DICT as the output type (I didn't try this yet...)
- "conf is the model's confidence for the prediction for the word within that bounding box. If conf is -1, that means that the corresponding bounding box contains a block of text, rather than just a single word."
pytesseract.image_to_osd
- OSD stands for 'orientation and script detection'
- see Combined Script and Page Orientation Estimation using the Tesseract OCR engine
pytesseract.image_to_string
- this is the transcribed string, i.e. conjoined OCR-detected characters, with spaces
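As a rough sketch of reading the image_to_data output without pandas, the TSV string can be parsed with the standard library. SAMPLE_TSV below is made-up data in the column layout that image_to_data emits, and words_from_tsv is a hypothetical helper. Rows with a conf of -1 (the non-word rows described in the quote above) are skipped.

```python
import csv
import io

# Invented sample: one block-level row (conf -1) and two word-level rows.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "2\t1\t1\t0\t0\t0\t10\t10\t200\t40\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t10\t90\t40\t96\tHello\n"
    "5\t1\t1\t1\t1\t2\t110\t10\t100\t40\t91\tworld\n"
)


def words_from_tsv(tsv_text):
    """Return word rows (text, confidence, box) from image_to_data-style TSV."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])
        if conf < 0 or not text:
            continue  # not a recognised word
        box = tuple(int(row[key]) for key in ("left", "top", "width", "height"))
        words.append({"text": text, "conf": conf, "box": box})
    return words


for word in words_from_tsv(SAMPLE_TSV):
    print(word["text"], word["conf"], word["box"])
```

With a real image, the tsv_text argument would be the string returned by pytesseract.image_to_data.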

The image_to_osd function fails for me because the apt-packaged version of Tesseract is 4.0.0, which has a documented bug: it raises an error in the wrapper, so the OSD function never returns a result.

The issue was fixed in commit a209a6... on 20 Oct 2019 and released in version 4.1.1 in December 2019 (glancing at the releases, this was the next release after the fix and is the most recent one).

To get version 4.1.1 you need to build from source, as the Ubuntu PPA on apt doesn't provide it. Searching for 'tesseract' in the Launchpad PPA listings brings up a development PPA with v5.0 and a packaged leptonlib.

Before downloading this leptonlib (which is v1.78, the latest is 1.80) you can 'view package details' and download the .deb from here
- run cat /etc/os-release if lsb_release -a doesn't tell you which version of Ubuntu your OS is based on; in my case this is bionic
If you scroll to the section on 'built packages' under 'package details' you can see this is actually a dev version of the library which is packaged and available in apt already! The package name is liblept5 and it doesn't show up in apt search as its description is just "image processing library".
- This isn't mentioned on the Leptonica website, though a 2008 version of the docs does mention the name liblept as being its shortname.
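The /etc/os-release check can be scripted. This is a sketch only: parse_os_release and ubuntu_codename are hypothetical helpers, and SAMPLE is illustrative text in the KEY=value format of that file, not output from my machine.

```python
from pathlib import Path

# Illustrative content for an Ubuntu-derived distro (assumed values).
SAMPLE = '''NAME="Linux Mint"
ID=linuxmint
ID_LIKE=ubuntu
VERSION_CODENAME=tricia
UBUNTU_CODENAME=bionic
'''


def parse_os_release(text):
    """Parse KEY=value lines, dropping comments and surrounding quotes."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key.strip()] = value.strip().strip('"').strip("'")
    return info


def ubuntu_codename(info):
    """Prefer the Ubuntu base codename, else the distro's own codename."""
    return info.get("UBUNTU_CODENAME") or info.get("VERSION_CODENAME")


print(ubuntu_codename(parse_os_release(SAMPLE)))  # bionic

os_release = Path("/etc/os-release")
if os_release.exists():
    text = os_release.read_text(encoding="utf-8", errors="replace")
    print(ubuntu_codename(parse_os_release(text)))
```

The codename is what you match against when picking the right .deb on the package details page.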

So in short, to satisfy the Leptonica dependency:

sudo apt install liblept5

On my computer this was already satisfied (I presume it came with the distro). Although the dependency is met, the docs for tesseract compilation also suggest you should get the dev library:

sudo apt-get install libleptonica-dev

"but if you are using an oldish version of Linux you will need to build from source"

The rest of the dependencies are listed at the tesseract docs and were all already met on my machine.

I also installed the developer tools:

sudo apt install libtesseract-dev

Installing Tesseract from Git

Linux.com gives a very swift guide to what is made to look quite complex, but it may be out of date (from 2016)

To install everything (man pages and training tool dependencies):

sudo apt-get install automake ca-certificates g++ git libtool libleptonica-dev make pkg-config
sudo apt-get install --no-install-recommends asciidoc docbook-xsl xsltproc
sudo apt-get install libpango1.0-dev

I'm going to build in ~/opt/ (make install still installs under /usr/local by default). I'm going to clone the latest release using --branch, which also accepts a tag (in this case 4.1.1), and just the most recent commit with --depth 1 (to avoid storing a large commit history on disk):

cd ~/opt
git clone --depth 1 --branch 4.1.1 --single-branch https://github.com/tesseract-ocr/tesseract.git

Because 4.1.1 is a release tag rather than a branch, git prints the message:

You are in 'detached HEAD' state. You can look around, make experimental changes and commit them, and you can discard any commits you make in this state without impacting any branches by switching back to a branch.

This is fine, ignore it. If you just clone master you currently get the v5 alpha.

cd tesseract/
./autogen.sh
./configure
make
sudo make install
sudo ldconfig

The make step takes a while...

...but once it's done you should be able to run tesseract --version and get 4.1.1 (or whichever version you installed).
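The version check can also be done from Python, which is handy before calling image_to_osd. This is a sketch under assumptions: parse_tesseract_version and has_osd_fix are hypothetical helpers, and the (4, 1, 1) threshold is taken from the fix release noted above.

```python
import re
import shutil
import subprocess

OSD_FIX_VERSION = (4, 1, 1)  # release containing the fix, per the notes above


def parse_tesseract_version(output):
    """Extract (major, minor, patch) from `tesseract --version` output."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def has_osd_fix(version):
    """True if the version is at least the release with the OSD fix."""
    return version is not None and version >= OSD_FIX_VERSION


exe = shutil.which("tesseract")
if exe is None:
    print("tesseract not found on PATH")
else:
    proc = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Combine both streams, since which one is used may vary by version.
    version = parse_tesseract_version(proc.stdout + proc.stderr)
    print(exe, version, "OSD fix present:", has_osd_fix(version))
```

Because it resolves tesseract through PATH, this also shows whether the source build or the apt build is the one being picked up.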

Getting the best LSTM trained data

I want to use multiple languages, and the issue #1579 thread mentions a repo of 'best' trained data which should be used for OCR.

The repo is tesseract-ocr/tessdata_best, and the README says:

These models only work with the LSTM OCR engine of Tesseract 4.

which means that you don't need to pull a particular release: you can just pull the master branch (whereas the tesseract engine's master branch has the v5 alpha)

To install the .traineddata files, you can follow the instructions here, which mention that the 'best' models will be slower (but for my use case here, slower is acceptable). tessdata_fast trades some accuracy for speed (its README says these are the models that ship with Linux distros), while tessdata_best sacrifices "a lot of speed" for "slightly better accuracy".

The tessdata_best files are dated September 2017.

Those instructions don't say where the data files should go, though. Inside ~/opt/tesseract there's a subdirectory tessdata which contains no .traineddata files (verify with: find ./ -iname "*.traineddata" 2> /dev/null)

There probably should be: this unanswered StackOverflow question [whose author should probably have used the Google Groups forum specifically for Tesseract] mentions having the eng.traineddata file directly under their tessdata directory

The data for my Linux distro's version of tesseract [the one I obtained via apt but am not going to use] is at /usr/share/tesseract-ocr/4.00/tessdata/, and this directory contains

configs/
ell.traineddata
eng.traineddata
osd.traineddata
pdf.ttf
tessconfigs/

while which tesseract points to /usr/local/bin/tesseract

It's not clear how this program knows where to find the tessdata directory, but the man page notes:

To use a non-standard language pack named foo.traineddata, set the TESSDATA_PREFIX environment variable so the file can be found at TESSDATA_PREFIX/tessdata/foo.traineddata and give Tesseract the argument -l foo.

In other words, you'll probably want to switch between tessdata_best and tessdata_fast by supplying their location in the environment variable TESSDATA_PREFIX. I'll keep things simple by just cloning them both [separately] into ~/opt for now. Again I only want the most recent commit in the history, to avoid git-induced bloat.

cd ~/opt/
git clone --depth 1 git@github.com:tesseract-ocr/tessdata_fast.git
git clone --depth 1 git@github.com:tesseract-ocr/tessdata_best.git

(Go take a break as this will take a while)

After looking at the language codes (and remembering that ancient Greek, grc, was not shown, and that osd is also there for "orientation and script detection"), I deleted a lot of them (but take a look for your own needs!)

for dirname in tessdata_{best,fast}; do
  cd $dirname
  rm {[a-c]*,da?,di?,dz?,enm,ep?,es?,eu?,fa?,fi?,fr[^a],gl?,gu?,[h-k]*,lao,lav,li?,lt?,[m-n]*,o?i,[p-z]*}.traineddata
  cd ..
done
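An alternative to the deletion glob is to state which languages to keep and remove everything else. The sketch below assumes the two clones live in ~/opt; prune_traineddata is a hypothetical helper, and KEEP mirrors the set of files I ended up with. It defaults to a dry run so nothing is deleted until you ask for it.

```python
from pathlib import Path

# Language codes to keep (stems of the .traineddata file names).
KEEP = {"deu", "ell", "eng", "equ", "fra", "grc", "lat", "osd"}


def prune_traineddata(directory, keep, dry_run=True):
    """List (and optionally delete) top-level .traineddata files not in keep."""
    removed = []
    for path in sorted(Path(directory).glob("*.traineddata")):
        if path.stem not in keep:
            removed.append(path.name)
            if not dry_run:
                path.unlink()
    return removed


for name in ("tessdata_best", "tessdata_fast"):
    repo = Path.home() / "opt" / name  # assumed clone location
    if repo.is_dir():
        doomed = prune_traineddata(repo, KEEP)  # dry run: nothing is deleted
        print(name, "would remove", len(doomed), "files")
```

Call it again with dry_run=False once the list looks right. Like the shell loop, it only touches files at the top level of each directory.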

This leaves:

for dirname in tessdata_{best,fast}; do
  cd $dirname
  echo $dirname
  ls *.traineddata
  echo
  cd ..
done

⇣

tessdata_best
deu.traineddata  ell.traineddata  eng.traineddata  fra.traineddata
grc.traineddata  lat.traineddata  osd.traineddata

tessdata_fast
deu.traineddata  ell.traineddata  eng.traineddata  equ.traineddata
fra.traineddata  grc.traineddata  lat.traineddata  osd.traineddata

i.e. German, Greek (both modern and ancient), English, French, Latin, Orientation and Script Detection, and equations

Equations (equ) is only in tessdata_fast, for some reason

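From browsing Q&A forums, there's a common mistake people make when supplying this data: they set TESSDATA_PREFIX to the path of the tessdata/ directory itself (be it standard, fast, or best). As the variable name suggests [albeit ambiguously], and as the man page excerpt above describes, the variable should instead be the path of the parent directory of tessdata, i.e. the prefix to which tessdata can be appended to get the full path to it. Be aware that this appears to be version-dependent: the error message printed by Tesseract 4.x asks for TESSDATA_PREFIX to be set to the tessdata directory itself, so if languages aren't found, run tesseract --list-langs to check which ones it can actually see.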

Since I couldn't figure out how these are supposed to be stored, I made a directory tessdatas (plural!) with two subdirectories, best and fast. I then moved tessdata_best and tessdata_fast into the corresponding subdirectories and renamed each of them to tessdata, as this seems to be the expected directory name:

~/opt/tessdatas/best/tessdata/ (formerly tessdata_best)
~/opt/tessdatas/fast/tessdata/ (formerly tessdata_fast)

I want to use the 'best' one, so I will export the environment variable TESSDATA_PREFIX in my .bashrc as:

export TESSDATA_PREFIX="$HOME/opt/tessdatas/best/"

Note that if you comment out this line, tesseract falls back to the default data directory compiled into the binary, and the language packs there will be used instead
- Unless you changed the configure line above to ./configure --prefix=/usr, this will be at /usr/local/share/tessdata (not the apt-installed directory under /usr/share)
- See the post-install instructions
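To see what a given TESSDATA_PREFIX value would expose, the small sketch below lists the language files found both in prefix/tessdata (the layout in the man page excerpt) and directly in the prefix. describe_prefix is a hypothetical helper. It only inspects the filesystem and makes no claim about which layout a particular Tesseract build reads.

```python
import os
from pathlib import Path


def describe_prefix(prefix):
    """Report language codes found under prefix/tessdata and under prefix."""
    root = Path(prefix).expanduser()

    def langs(directory):
        if not directory.is_dir():
            return []
        return sorted(p.stem for p in directory.glob("*.traineddata"))

    return {"nested": langs(root / "tessdata"), "direct": langs(root)}


prefix = os.environ.get("TESSDATA_PREFIX")
if prefix:
    found = describe_prefix(prefix)
    print("in prefix/tessdata:", found["nested"])
    print("directly in prefix:", found["direct"])
else:
    print("TESSDATA_PREFIX is not set")
```

If both lists come back empty, the variable points somewhere with no language files at all, which is worth ruling out before debugging anything else.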

You can provide the path to a tesseract executable as a raw string, as the README for pytesseract mentions, like this:

import pytesseract
# assign on the module itself: rebinding an imported name has no effect on pytesseract
pytesseract.pytesseract.tesseract_cmd = r"/usr/local/bin/tesseract"

which is where the make install line put it by default (so I don't think this needs to be done, but FYI in case you installed it somewhere else). Check where tesseract is with which tesseract.
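A sketch of doing the which tesseract lookup from Python rather than hard-coding the path. locate_tesseract and its fallback paths are assumptions for illustration. The import is guarded so the snippet still runs where pytesseract isn't installed.

```python
import os
import shutil


def locate_tesseract(fallbacks=("/usr/local/bin/tesseract", "/usr/bin/tesseract")):
    """Return the tesseract on PATH, else the first executable fallback."""
    found = shutil.which("tesseract")
    if found:
        return found
    for candidate in fallbacks:
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


try:
    import pytesseract
except ImportError:  # pytesseract is optional for this sketch
    pytesseract = None

exe = locate_tesseract()
if pytesseract is not None and exe is not None:
    # Set the attribute on the submodule so the wrapper uses this binary.
    pytesseract.pytesseract.tesseract_cmd = exe
print("tesseract executable:", exe)
```

The lookup follows PATH order, so it picks the same binary that the shell would run.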