Tesseract - lmmx/devnotes GitHub Wiki
Installing dependencies and compiling Tesseract from source
- GitHub source for Tesseract
- Tesseract docs: 'tessdoc'
- Install new languages via `apt` as `tesseract-ocr-{lang}`, where `{lang}` is the language code or `all`
  - Oddly, `grc` (Ancient Greek) doesn't exist here, but `ell` is Modern Greek. It doesn't seem to be in `all` either (see the dependency list with `apt show`)
  - run as `tesseract input.png out -l eng+ell` etc.
- After installing, use `pip install pytesseract` and set `lang` to the `-l` string as a parameter
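As a minimal sketch of building that `-l` string, the helper below joins language codes with `+`. The function name `build_lang_arg` is made up for illustration and is not part of pytesseract.

```python
def build_lang_arg(codes):
    """Join Tesseract language codes into the string expected by `-l`."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


lang = build_lang_arg(["eng", "ell"])
print(lang)
# With pytesseract installed, this string would be passed as the `lang`
# keyword, e.g. pytesseract.image_to_string(image, lang=lang)
```

The pytesseract call is left as a comment so the snippet runs without Tesseract installed.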
There are multiple interfacing functions to pytesseract:
pytesseract.image_to_boxes
- these are per-letter bounding boxes
pytesseract.image_to_data
- the bounding boxes are for entire words (with a confidence for each)
- it's more readable to use `io.StringIO` to wrap this string in the argument to `pandas.read_csv` with `sep="\t"` to load into a DF (that can be paged with `pydoc`)
- this page suggests supplying `pytesseract.Output.DICT` as the output type (I didn't try this yet...)
- "conf is the model's confidence for the prediction for the word within that bounding box. If conf is -1, that means that the corresponding bounding box contains a block of text, rather than just a single word."
pytesseract.image_to_osd
- OSD stands for 'orientation and script' detection
- see Combined Script and Page Orientation Estimation using the Tesseract OCR engine
pytesseract.image_to_string
- this is the transcribed string, i.e. conjoined OCR-detected characters, with spaces
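To illustrate reading the tab-separated output of `image_to_data`, here is a sketch that uses the standard-library `csv` module instead of pandas, so it runs without extra packages. The sample TSV is invented data in the same column layout, and the helper name `words_with_confidence` is hypothetical.

```python
import csv
import io

# Made-up sample in the column layout that image_to_data emits as TSV.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "2\t1\t1\t0\t0\t0\t10\t10\t200\t40\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t10\t90\t40\t96\tHello\n"
    "5\t1\t1\t1\t1\t2\t110\t10\t100\t40\t91\tworld\n"
)


def words_with_confidence(tsv_text):
    """Return (text, conf, box) tuples for rows that hold a single word."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        conf = float(row["conf"])
        if conf < 0:  # conf of -1 marks a block of text, not a single word
            continue
        box = tuple(int(row[key]) for key in ("left", "top", "width", "height"))
        words.append((row["text"], conf, box))
    return words


for text, conf, box in words_with_confidence(SAMPLE_TSV):
    print(text, conf, box)
```

Dropping the `-1` rows leaves only the word-level boxes, which matches the meaning of `conf` quoted above.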
The `image_to_osd` function fails for me as the `apt`-packaged version is 4.0.0, and there is a documented bug which raises an error for the wrapper and prevents the OSD function from returning.
The issue was resolved in version 4.1.1, released in December 2019 (commit a209a6... took place on 20 Oct 2019, and glancing at the releases this was the next and most recent one).
To get version 4.1.1, you need to build from source as the Ubuntu PPA on `apt` doesn't provide it, and searching for 'tesseract' on the launchpad PPA listings brings up a development PPA with v5.0 and a packaged leptonlib
- Before downloading this leptonlib (which is v1.78, the latest is 1.80) you can 'view package details' and download the `.deb` from here
  - check `cat /etc/os-release` if you need to check what version of Ubuntu your OS is based on, if not given in `lsb_release -a`; in my case this is `bionic`
- If you scroll to the section on 'built packages' under 'package details' you can see this is actually a dev version of the library which is packaged and available in `apt` already! The package name is `liblept5` and it doesn't show up in `apt search` as its description is just "image processing library".
  - This isn't mentioned on the Leptonica website, though a 2008 version of the docs does mention the name `liblept` as being its shortname.
So in short, to satisfy the Leptonica dependency:
sudo apt install liblept5
On my computer this is already satisfied; I presume it came with the distro. Although the dependency is met, the docs for tesseract compilation also suggest you should get the dev library:
sudo apt-get install libleptonica-dev
"but if you are using an oldish version of Linux you will need to build from source"
The rest of the dependencies are listed at the tesseract docs and were all already met on my machine.
I also installed the developer tools:
sudo apt install libtesseract-dev
Linux.com gives a very swift guide to what is made to look quite complex, but it may be out of date (from 2016)
To install everything (man pages and training tool dependencies):
sudo apt-get install automake ca-certificates g++ git libtool libleptonica-dev make pkg-config
sudo apt-get install --no-install-recommends asciidoc docbook-xsl xsltproc
sudo apt-get install libpango1.0-dev
I'm going to keep the source in `~/opt/`, and I'm going to clone the latest release using `--branch` (in this case `4.1.1`; `--branch` also accepts a tag) and just the most recent commit with `--depth 1` (to avoid storing a large commit history on disk):
cd ~/opt
git clone --depth 1 https://github.com/tesseract-ocr/tesseract.git --branch 4.1.1 --single-branch
Because `4.1.1` is a release tag rather than a branch, git gives the message:
You are in 'detached HEAD' state. You can look around, make experimental changes and commit them, and you can discard any commits you make in this state without impacting any branches by switching back to a branch.
This is fine, ignore it. If you just clone master you currently get the v5 alpha.
cd tesseract/
./autogen.sh
./configure
make
sudo make install
sudo ldconfig
The `make` step takes a while, but once it's done you should be able to run `tesseract --version` and get 4.1.1 (or whichever version you installed).
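The version check can also be scripted. The sketch below parses the first version number out of `tesseract --version` output and compares it with 4.1.1, the release the OSD fix is attributed to above. Both function names are hypothetical, and the regular expression assumes the output starts with `tesseract` followed by a dotted version.

```python
import re
import shutil
import subprocess


def parse_tesseract_version(output):
    """Extract (major, minor, patch) from `tesseract --version` output."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("no version number found")
    return tuple(int(part) for part in match.groups())


def installed_tesseract_version(executable="tesseract"):
    """Run the executable and parse its version, or return None if absent."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some builds print the banner to stderr
        text=True,
        check=False,
    )
    return parse_tesseract_version(result.stdout)


version = installed_tesseract_version()
if version is None:
    print("tesseract not found on PATH")
else:
    print(version, "OSD fix expected" if version >= (4, 1, 1) else "older than 4.1.1")
```

Comparing tuples of integers avoids the string-comparison pitfall where `"4.10.0"` sorts before `"4.9.0"`.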
Getting the best LSTM trained data
I want to use multiple languages, and in the issue #1579 thread, it's mentioned that there is a repo of 'best' trained data which should be used for OCR.
The repo is tesseract-ocr/tessdata_best, and the README says:
These models only work with the LSTM OCR engine of Tesseract 4.
which means that you don't need to pull the release, you can just pull the master branch (whereas the tesseract engine's master branch has the v5 alpha)
To install the `.traineddata` files, you can follow the instructions here, which mention that these will be slower (but for my use case here, slower is acceptable). `tessdata` is what ships with Linux distros, `tessdata_best` sacrifices "a lot of speed" for "slightly better accuracy", and `tessdata_fast` is the least accurate.
The `tessdata_best` files are dated September 2017.
There are no instructions for how to install these data files, but inside `~/opt/tesseract` there's a subdirectory `tessdata` which contains no `.traineddata` files (verify with: `find ./ -iname "*.traineddata" 2> /dev/null`).
There probably should be: this unanswered StackOverflow question [whose author should probably have used the Google Groups forum specifically for Tesseract] mentions having his `eng.traineddata` file directly underneath the location of his `tessdata` directory.
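The `find` check above has a rough Python equivalent, sketched here with `pathlib`. The function name `find_traineddata` is made up. It matches the extension case-insensitively, as `-iname` does.

```python
from pathlib import Path


def find_traineddata(root):
    """Return sorted paths of *.traineddata files found anywhere under root."""
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.name.lower().endswith(".traineddata")
    )


# Example: list whatever is under the current directory (may be nothing).
for path in find_traineddata("."):
    print(path)
```

An empty result for the cloned source tree would agree with the observation that the build does not bundle any language data.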
The data for my Linux distro version of tesseract [the one I'm not going to use but obtained via `apt`] is at `/usr/share/tesseract-ocr/4.00/tessdata/`, and this directory contains
configs/
ell.traineddata
eng.traineddata
osd.traineddata
pdf.ttf
tessconfigs/
while `which tesseract` points to `/usr/local/bin/tesseract`
It's not clear how this program knows where to find the `tessdata` directory, but the man page notes:
"To use a non-standard language pack named `foo.traineddata`, set the `TESSDATA_PREFIX` environment variable so the file can be found at `TESSDATA_PREFIX/tessdata/foo.traineddata` and give Tesseract the argument `-l foo`."
In other words, you'll probably want to switch in either `tessdata_best` or `tessdata_fast` by supplying its location with the environment variable `TESSDATA_PREFIX`. I'll keep things simple by just cloning them both [separately] into `~/opt` for now.
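The path rule in the man-page quote can be written as a small function. This is only a sketch of that quoted wording: it does not ask Tesseract itself, and other Tesseract versions may resolve the variable differently. The name `expected_traineddata_path` is hypothetical.

```python
import os
from pathlib import Path


def expected_traineddata_path(lang, prefix=None):
    """Path at which the quoted man-page text says `lang` should be found."""
    if prefix is None:
        prefix = os.environ.get("TESSDATA_PREFIX", "")
    if not prefix:
        raise ValueError("TESSDATA_PREFIX is not set and no prefix was given")
    # Man-page wording: TESSDATA_PREFIX/tessdata/<lang>.traineddata
    return Path(prefix) / "tessdata" / f"{lang}.traineddata"


print(expected_traineddata_path("foo", prefix="/some/prefix"))
```

Here `/some/prefix` is a placeholder, not a path from these notes.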
Again I only want the most recent commit in the history to avoid `git`-induced bloat.
cd ~/opt/
git clone --depth 1 git@github.com:tesseract-ocr/tessdata_fast.git
git clone --depth 1 git@github.com:tesseract-ocr/tessdata_best.git
(Go take a break as this will take a while)
After looking at the language codes (and remembering that ancient Greek, `grc`, was not shown, and that `osd` is also there for "orientation and script detection"), I deleted a lot of them (but take a look for your own needs!)
for dirname in tessdata_{best,fast}; do
  cd "$dirname"
  rm {[a-c]*,da?,di?,dz?,enm,ep?,es?,eu?,fa?,fi?,fr[^a],gl?,gu?,[h-k]*,lao,lav,li?,lt?,[m-n]*,o?i,[p-z]*}.traineddata
  cd ..
done
This leaves:
for dirname in tessdata_{best,fast}; do
  cd "$dirname"
  echo "$dirname"
  ls *.traineddata
  echo
  cd ..
done
⇣
tessdata_best
deu.traineddata ell.traineddata eng.traineddata fra.traineddata
grc.traineddata lat.traineddata osd.traineddata
tessdata_fast
deu.traineddata ell.traineddata eng.traineddata equ.traineddata
fra.traineddata grc.traineddata lat.traineddata osd.traineddata
i.e. German, Greek (both modern and ancient), English, French, Latin, Orientation and Script Detection, and equations
- Equations (`equ`) is only in `tessdata_fast`, for some reason
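An alternative to the `rm` glob is an explicit keep-list, which is easier to read back later. The sketch below is one way to do that. The names `KEEP`, `files_to_remove` and `prune` are made up, and the keep-list simply mirrors the languages listed above. It defaults to a dry run so nothing is deleted by accident.

```python
from pathlib import Path

# Language codes to keep, taken from the listing above.
KEEP = {"deu", "ell", "eng", "equ", "fra", "grc", "lat", "osd"}


def files_to_remove(directory, keep=KEEP):
    """List *.traineddata files in directory whose language code is not kept."""
    return sorted(
        path
        for path in Path(directory).glob("*.traineddata")
        if path.stem not in keep
    )


def prune(directory, keep=KEEP, dry_run=True):
    """Delete unwanted files; with dry_run=True only report their names."""
    doomed = files_to_remove(directory, keep)
    if not dry_run:
        for path in doomed:
            path.unlink()
    return [path.name for path in doomed]
```

Calling `prune("tessdata_best")` would only list what would go, and passing `dry_run=False` performs the deletion.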
From browsing Q&A forums, there's a common mistake people make when trying to supply this data: they set the environment variable `TESSDATA_PREFIX` to be the path to the `tessdata/` directory itself (be it standard, fast, or best), whereas, as the variable name suggests [though ambiguously], you actually want to set this variable to be the path of the parent directory of the `tessdata` directory, i.e. the prefix to which `tessdata` can be appended to get the full path to it.
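A quick way to see which of the two layouts a given path has is to look for a language file on disk. The sketch below only inspects the filesystem. It does not run Tesseract, so it cannot confirm how any particular version interprets the variable. The function name and its return labels are invented for illustration.

```python
from pathlib import Path


def describe_prefix(prefix, lang="eng"):
    """Report where <lang>.traineddata sits relative to a candidate prefix."""
    prefix = Path(prefix)
    filename = f"{lang}.traineddata"
    if (prefix / "tessdata" / filename).is_file():
        return "parent-of-tessdata"  # the layout described above
    if (prefix / filename).is_file():
        return "tessdata-itself"  # the path points at the data directory
    return "not-found"
```

For example, `describe_prefix(os.environ["TESSDATA_PREFIX"])` (with `os` imported) would show at a glance which case the current setting falls into.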
Since I can't figure out how exactly I'm supposed to store these, I chose to make two new directories `best` and `fast`, both as subdirectories of `tessdatas` (plural!), and then I moved `tessdata_best` and `tessdata_fast` into the corresponding subdirectories `best` and `fast` and renamed them to `tessdata`, as this seems to be the expected directory name:
- `~/opt/tessdatas/best/tessdata/` (formerly `tessdata_best`)
- `~/opt/tessdatas/fast/tessdata/` (formerly `tessdata_fast`)
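The move-and-rename step could be scripted along these lines. This is a sketch under the directory names used above, and the function name `nest_tessdata` is hypothetical. It skips any variant that is missing or already moved.

```python
import shutil
from pathlib import Path


def nest_tessdata(opt_dir, variants=("best", "fast")):
    """Move opt_dir/tessdata_<v> to opt_dir/tessdatas/<v>/tessdata."""
    opt_dir = Path(opt_dir)
    moved = []
    for variant in variants:
        source = opt_dir / f"tessdata_{variant}"
        target = opt_dir / "tessdatas" / variant / "tessdata"
        if not source.is_dir() or target.exists():
            continue  # nothing to move, or already done
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(source), str(target))
        moved.append(target)
    return moved
```

Running `nest_tessdata(Path.home() / "opt")` would produce the two paths listed above, and running it a second time does nothing.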
I want to use the 'best' one, so I will export the environment variable `TESSDATA_PREFIX` in my `.bashrc` as:
export TESSDATA_PREFIX="$HOME/opt/tessdatas/best/"
- Note that if you comment out this line, the standard quality, default (i.e. included in `/usr/share`) language packs will be used instead
  - Unless you changed the `configure` line above to `./configure --prefix=/usr`, this will be at `/usr/local/share/tessdata`
  - See the post-install instructions
You can provide the path to a tesseract executable as a raw string, as the README for pytesseract mentions, like this:
import pytesseract
# Assign to the module attribute; rebinding an imported name would have no effect
pytesseract.pytesseract.tesseract_cmd = r"/usr/local/bin/tesseract"
which is where the `make install` line put it by default (so I don't think this needs to be done, but FYI in case you installed it somewhere else). Check where `tesseract` is with `which tesseract`.
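The `which tesseract` lookup can be done from Python too. This sketch uses `shutil.which` and falls back to the default install location mentioned above. The function name `locate_tesseract` is made up, and the fallback is an assumption that only holds for a default `make install`.

```python
import shutil


def locate_tesseract(fallback="/usr/local/bin/tesseract"):
    """Return the tesseract found on PATH, else the assumed fallback path."""
    return shutil.which("tesseract") or fallback


tesseract_path = locate_tesseract()
print(tesseract_path)
# With pytesseract installed, the result could then be assigned with:
#   pytesseract.pytesseract.tesseract_cmd = tesseract_path
```

The pytesseract assignment is left as a comment so the snippet runs without the package installed.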