OCR

Intro

OCR (Optical Character Recognition) is used on printed characters, whereas HWR (Handwriting Recognition) is used on handwriting. There are plenty of OCR models available, including offerings from AWS, Meta, Apple, Adobe, and Google. We normally use Google's Tesseract OCR, which works on images of printed text. Tesseract will also do a small amount of segmentation for predictable formats.

Creating the project

Option 1: Google Colab

https://colab.research.google.com/drive/1blzHW256dY6RyhPKE_qCc4SpKWg874x3?usp=sharing

Essentially all that needs to be done here is to change the file path to the folder in Google Drive where your images have been uploaded. The CSV files will save to that same folder.
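
If the notebook follows the usual Colab pattern, the setup looks something like this (a minimal sketch; the folder name "census_images" is a placeholder for your own Drive folder):

from google.colab import drive
drive.mount('/content/drive')

# Point this at the Google Drive folder that holds your images.
# "census_images" is a placeholder; the CSVs will be written back to this same folder.
image_dir = '/content/drive/MyDrive/census_images'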

Option 2: Lab or Remote Desktop

Start the environment

First, we need to enter the correct Python environment. Do this by opening the command prompt and typing the following:

THE R DRIVE IS NO LONGER IN USE; USE THE V DRIVE

R:
cd JoePriceResearch\Python\Miniconda\Scripts
activate ml_rec_env
spyder

You will need to click through an error message, but have no fear; Spyder will load shortly after. Within Spyder, navigate to V:\FHSS-JoePriceResearch\data\ocr and open the ocr text file.

You will need to replace the first file path with the location of the images you will be using. The second file path will stay the same; it points to the Tesseract program used to read the images.

This code needs to be changed quite a bit depending on the kind of image you are dealing with. For example, as currently set up it is meant for a record with 13 vertical entries, like a census: it crops the image into 13 pieces and reads each one individually. It is also set up to find the last number on each line and remove any characters after that number, which can be useful for columns of a record that always end in a number, such as a date. You will need to do quite a bit of customization for your own project.
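
As a rough illustration of what that customization might look like (a sketch, not the actual lab script; it assumes the 13 entries are stacked vertically as rows and uses Pytesseract):

import re
import pytesseract
from PIL import Image

def ocr_census_rows(image_path, n_rows=13):
    """Crop the record into n_rows horizontal strips, OCR each strip,
    and keep only the text up to the last digit on each line."""
    img = Image.open(image_path)
    width, height = img.size
    row_height = height // n_rows
    rows = []
    for i in range(n_rows):
        strip = img.crop((0, i * row_height, width, (i + 1) * row_height))
        text = pytesseract.image_to_string(strip).strip()
        last_digit = re.search(r'\d(?!.*\d)', text)  # position of the last digit
        if last_digit:
            text = text[:last_digit.end()]  # drop trailing characters after the final number
        rows.append(text)
    return rows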

Wrappers

Pytesseract

In Python, we often use the Pytesseract wrapper to access Tesseract. This is a wrapper around the Tesseract CLI, which becomes inefficient for large datasets because the model is reloaded on every call. It is otherwise very user-friendly and effective.

There are options to thread (read several images at the same time) with Pytesseract that can improve throughput.
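
Here is a minimal sketch of basic Pytesseract use plus threading with concurrent.futures; the folder name, thread count, and executable path are placeholders:

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import pytesseract
from PIL import Image

# On Windows you may need to point Pytesseract at the executable, e.g.
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

def read_image(path):
    # Each call starts a separate tesseract process, so threads can overlap the work.
    return path.name, pytesseract.image_to_string(Image.open(path))

image_paths = list(Path('images').glob('*.jpg'))  # placeholder folder
with ThreadPoolExecutor(max_workers=4) as pool:
    for name, text in pool.map(read_image, image_paths):
        print(name, text[:60])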

Tesserocr

If speed is an issue, or you would like to access word or character attributes such as boldness, letter case, or font type, you should use Tesserocr. This wrapper calls the Tesseract C++ API directly and loads the model only once. Threading is also available. Here is consolidated documentation of many of the basic things you can do with Tesserocr.

Unlike Pytesseract, Tesserocr handles most of the preprocessing, such as skew adjustment and line recognition, in-house, resulting in less written code for you.
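
A minimal sketch of the Tesserocr workflow; the file names are placeholders, and note that font attributes may be unavailable with the default LSTM engine:

from tesserocr import PyTessBaseAPI, RIL, iterate_level

image_paths = ['page1.jpg', 'page2.jpg']  # placeholder file names

# The model is loaded once when the API object is created and reused for every image.
with PyTessBaseAPI() as api:
    for path in image_paths:
        api.SetImageFile(path)
        print(api.GetUTF8Text())  # full-page text; recognition runs on this call

        # Word-level attributes (bold, italic, font name, etc.) come from the result iterator.
        iterator = api.GetIterator()
        for word in iterate_level(iterator, RIL.WORD):
            print(word.GetUTF8Text(RIL.WORD), word.WordFontAttributes())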

Preprocessing

Like most computer vision projects, you will need to perform some kind of preprocessing before passing an image to the OCR. There are many ways to do this, but one of the easiest and most efficient packages is OpenCV.
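
A minimal OpenCV sketch of a common cleanup pass (grayscale, light denoising, and Otsu binarization); the file names are placeholders and the right steps depend on your images:

import cv2

def preprocess(path):
    """Grayscale, light denoise, and Otsu binarization before OCR."""
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    blur = cv2.medianBlur(gray, 3)  # remove small specks
    _, binary = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

clean = preprocess('scan.jpg')          # placeholder file name
cv2.imwrite('scan_clean.jpg', clean)    # feed the cleaned image to Tesseract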

A Colab tutorial on using OpenCV for preprocessing is available on the Getting Started page.

Installing Tesseract

To use Tesseract locally, you will need to download the entire model and its dependencies. It is also available remotely on Google Colab.

Locally

macOS

If you want to use Tesseract on your Mac, a normal pip install will not work. You will need to install it via Homebrew, a supplementary package manager for macOS.

To install Homebrew: $ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Once Homebrew is installed, use one of the following commands, depending on whether your Mac has an Apple Silicon (M-series) chip or an Intel processor.

If your computer has Apple Silicon: $ arch -arm64 brew install tesseract

Otherwise: $ brew install tesseract

Tesseract on the Supercomputer

Setup

To use Tesseract, you'll need to run the following:

module load tesseract

If this doesn't work for you, try running these lines before loading the module:

module restore
export MODULEPATH="$MODULEPATH:/nobackup/scratch/grp/fslg_JoePriceResearch/tools/modulefiles"
module save

Running Tesseract

The basic run format of tesseract is:

tesseract input output [options]

Where:

  • input is either "stdin" or a file name
  • output is either "stdout" or a file name (without a file extension)
  • options are an optional list of space separated modifiers (see below)

Examples

tesseract test.jpg stdout will extract the text from test.jpg and print the results to the terminal.

tesseract test.jpg outFile will extract the text from test.jpg and save the results in outFile.txt (a text file is the default output; options can change this).

Options

option description
--psm num Specify page segmentation mode
--user-words path Specify a custom dictionary to use
--user-patterns path Specify a custom pattern list to use
--dpi value Specify the DPI for an image
-l language Specify the language
--list-langs List available languages
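
For example, tesseract test.jpg outFile --psm 6 -l eng --dpi 300 reads test.jpg as a single uniform block of English text scanned at 300 DPI and saves the result in outFile.txt (see the segmentation modes below).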

Segmentation modes

mode meaning
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 (not implemented)
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.

Troubleshooting

If there are issues loading the module or setting it up, you can always run the executable directly:

~/fsl_groups/fslg_JoePriceResearch/compute/tools/tesseract/5.2.0/bin/tesseract

You can also create an alias for the long command:

alias tesseract="~/fsl_groups/fslg_JoePriceResearch/compute/tools/tesseract/5.2.0/bin/tesseract"

If there are other issues, reach out on Slack.

Additional resources

Tesseract has a good documentation page that dives into how to create user word files, user pattern files, more troubleshooting, and so on.