OCR - byuawsfhtl/RLL_computer_vision GitHub Wiki
Intro
OCR (Optical Character Recognition) is used on printed characters, whereas HWR (Handwriting Recognition) is used on handwriting. Plenty of OCR models are available, including those from AWS, Meta, Apple, Adobe, and Google. We normally use Google's Tesseract OCR, which works on images of printed text. Tesseract will also do a small amount of segmentation for predictable formats.
Creating the project
Option 1: Google Colab
https://colab.research.google.com/drive/1blzHW256dY6RyhPKE_qCc4SpKWg874x3?usp=sharing
All that needs to be done here is to change the file path to the Google Drive folder where your images have been uploaded. The CSV files will save in that same folder.
Option 2: Lab or Remote Desktop
Start the environment
First, we need to enter the correct Python environment. Do this by opening the command prompt and typing the following:
NOTE: THE R DRIVE IS NO LONGER IN USE; USE THE V DRIVE IN THE COMMANDS BELOW
R:
cd JoePriceResearch\Python\Miniconda\Scripts
activate ml_rec_env
spyder
You will need to click through an error message, but have no fear: Spyder will load shortly after. Within Spyder, navigate to V:\FHSS-JoePriceResearch\data\ocr and open the ocr text file.
You will need to replace the first file path with the location of the images you will be using. The second file path will stay the same: it points to the Tesseract program used to read the images.
This code needs to be changed quite a bit depending on the kind of image you are dealing with. For example, as written it is meant for a record with 13 vertical entries, like a census: it crops the image into 13 pieces and reads each one individually. It is also set up to find the last number on each line and remove any characters after that number, which is useful for columns of a record that always end in a number, such as a date. Expect to do a fair amount of customization for your project.
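The cropping and trailing-number cleanup described above can be sketched as two small helpers. This is a minimal illustration, not the actual code in the ocr file; the names row_boxes and trim_after_last_digit are made up here.

```python
import re

def row_boxes(width, height, n_rows=13):
    """Split an image's bounding box into n_rows equal horizontal strips.

    Returns (left, upper, right, lower) tuples, the box format that
    PIL's Image.crop expects, so each strip can be read individually.
    """
    step = height / n_rows
    return [(0, round(i * step), width, round((i + 1) * step))
            for i in range(n_rows)]

def trim_after_last_digit(line):
    """Drop any characters after the last digit in a line.

    Handy for columns that always end in a number (such as a date),
    where Tesseract often tacks stray marks onto the end.
    """
    m = re.match(r".*\d", line)  # greedy match runs through the last digit
    return m.group(0) if m else line
```

For a census-style record you would crop with each box from row_boxes, OCR each strip, and then clean numeric columns with trim_after_last_digit.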
Wrappers
Pytesseract
In Python, we often use the Pytesseract wrapper to access Tesseract. It wraps the Tesseract command-line interface, which becomes inefficient for large datasets because the model is reloaded on every call. It is otherwise very user-friendly and effective.
Pytesseract can also be threaded (reading several images at the same time) to improve throughput.
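A minimal sketch of that threading pattern using Python's standard ThreadPoolExecutor; ocr_batch is an illustrative name, and in practice the read_fn you pass in would call pytesseract.image_to_string.

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_batch(paths, read_fn, workers=4):
    """Run read_fn over many image paths in parallel threads.

    In practice read_fn would be something like
    lambda p: pytesseract.image_to_string(Image.open(p)).
    Threads help here because each Pytesseract call spends most of
    its time waiting on the external Tesseract process.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so texts line up with paths
        return list(pool.map(read_fn, paths))
```

Keeping read_fn injectable also makes the batching logic easy to test with a stub function before pointing it at real images.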
Tesserocr
If speed is an issue, or you want word- or character-level attributes such as boldness, letter case (uppercase vs. lowercase), or font type, use Tesserocr. This wrapper binds to Tesseract's C++ API and loads the model only once; threading is also available. Here is consolidated documentation of many of the basic things you can do with Tesserocr.
Unlike Pytesseract, Tesserocr handles most preprocessing, such as skew adjustment and line recognition, internally, so you write less code.
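A minimal sketch of the load-once pattern, assuming tesserocr is installed (pip install tesserocr); read_images is an illustrative name.

```python
def read_images(paths, lang="eng"):
    """Read several images while loading the Tesseract model only once.

    Requires tesserocr (pip install tesserocr). Contrast with
    Pytesseract, which reloads the model on every call.
    """
    from tesserocr import PyTessBaseAPI  # imported inside the function
                                         # so this sketch parses without
                                         # tesserocr installed
    texts = []
    with PyTessBaseAPI(lang=lang) as api:  # model loaded once here
        for path in paths:
            api.SetImageFile(path)
            texts.append(api.GetUTF8Text())
    return texts
```

The same api object also exposes the word- and character-level attributes mentioned above via its iterator methods.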
Preprocessing
Like most computer vision projects, you will need to perform some kind of preprocessing prior to passing an image through an OCR. There are many ways to do this, but one of the easiest and most efficient packages is OpenCV.
A Colab tutorial for how to use it for preprocessing is available on the Getting Started page.
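A minimal sketch of a typical OpenCV preprocessing step (grayscale plus Otsu binarization), assuming opencv-python is installed; binarize is an illustrative name, and your project may need additional steps such as deskewing or denoising.

```python
def binarize(path):
    """Minimal OpenCV preprocessing: grayscale + Otsu threshold.

    Requires opencv-python (pip install opencv-python). Returns a
    black-and-white array that usually OCRs better than a raw scan.
    """
    import cv2  # imported inside the function so this sketch parses
                # without OpenCV installed
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Otsu picks the threshold automatically from the image histogram
    _, bw = cv2.threshold(gray, 0, 255,
                          cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return bw
```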
Installing Tesseract
To use Tesseract locally, you will need to download the entire model and its dependencies. It is also available remotely on Google Colab.
Locally
macOS
If you want to use Tesseract on your Mac, a normal pip install will not work. You will need to install it via Homebrew, a supplementary package manager for macOS.
To install Homebrew:
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Once Homebrew is installed, use one of the following commands, depending on whether your Mac has an Apple Silicon (M-series) chip or an Intel processor.
If your computer has Apple Silicon:
$ arch -arm64 brew install tesseract
otherwise:
$ brew install tesseract
Tesseract on the Supercomputer
Setup
To use Tesseract, you'll need to run the following:
module load tesseract
If this doesn't work for you, try running these lines before loading the module:
module restore
export MODULEPATH="$MODULEPATH:/nobackup/scratch/grp/fslg_JoePriceResearch/tools/modulefiles"
module save
Running tesseract
The basic run format of tesseract is:
tesseract input output [options]
Where:
- input is either "stdin" or a file name
- output is either "stdout" or a file name (without a file extension)
- options are an optional list of space separated modifiers (see below)
Examples
tesseract test.jpg stdout
will extract the text from test.jpg and print the results to the terminal.
tesseract test.jpg outFile
will extract the text from test.jpg and save the results in outFile.txt (a text file is the default output; options can change this).
Options
option | description |
---|---|
--psm num | Specify page segmentation mode |
--user-words path | Specify a custom dictionary to use |
--user-patterns path | Specify a custom pattern list to use |
--dpi value | Specify the DPI for an image |
-l language | Specify the language |
--list-langs | List available languages |
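For example, a hypothetical run that combines several of these options (the file names census.jpg and out are placeholders) might look like:

```
tesseract census.jpg out --dpi 300 --psm 4 -l eng
```

This tells Tesseract the scan's resolution, asks it to treat the page as a single column of text of variable sizes (see the segmentation modes below), and reads it as English, saving the result to out.txt.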
Segmentation modes
mode | meaning |
---|---|
0 | Orientation and script detection (OSD) only. |
1 | Automatic page segmentation with OSD. |
2 | (not implemented) |
3 | Fully automatic page segmentation, but no OSD. (Default) |
4 | Assume a single column of text of variable sizes. |
5 | Assume a single uniform block of vertically aligned text. |
6 | Assume a single uniform block of text. |
7 | Treat the image as a single text line. |
8 | Treat the image as a single word. |
9 | Treat the image as a single word in a circle. |
10 | Treat the image as a single character. |
11 | Sparse text. Find as much text as possible in no particular order. |
12 | Sparse text with OSD. |
13 | Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific. |
Troubleshooting
If there are issues with loading or saving the module, you can always run the executable directly:
~/fsl_groups/fslg_JoePriceResearch/compute/tools/tesseract/5.2.0/bin/tesseract
You can also create an alias for the long command:
alias tesseract="~/fsl_groups/fslg_JoePriceResearch/compute/tools/tesseract/5.2.0/bin/tesseract"
If there are other issues, reach out on Slack.
Additional resources
Tesseract has good documentation pages that dive into how to create user word files and user pattern files, further troubleshooting, and more.