OCR tips - stb-tester/stb-tester GitHub Wiki

OCR Tips

Tesseract is the open-source Optical Character Recognition (OCR) engine that stb-tester uses to read text from images. stbt.ocr allows you to customise tesseract's parameters; this document contains advice for improving tesseract's accuracy in specific scenarios.

Tesseract is optimised for reading pages of prose, so it doesn't always yield good results when reading the rather disjointed text typical of a graphical user interface.

Tesseract uses a dictionary to help choose the correct word even if individual characters were misread. This helps when reading real words, but it can get in the way when reading strings with a different structure, such as serial numbers.

General tips

Crop the region in which you perform OCR tightly around the text, using the region parameter to ocr and match_text.

Matching some known text

Use the tesseract_user_words or tesseract_user_patterns parameters to ocr and match_text to tell the OCR engine what you're expecting.

Use fuzzy matching to check whether the returned text matches what you were expecting, for example with a function like:

  import difflib

  def fuzzy_match(string1, string2, threshold=0.8):
      # ratio() is 1.0 for identical strings and falls towards 0.0 as they diverge.
      return difflib.SequenceMatcher(None, string1, string2).ratio() >= threshold

Example

Looking for the text "EastEnders":

import stbt

text = stbt.ocr(region=stbt.Region(52, 34, 120, 50))
assert fuzzy_match(text, "EastEnders")
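
When the screen could show one of several known strings, the two tips above can be combined: pass the candidates to the OCR engine as user words, then pick the closest candidate with fuzzy matching. The following is only a sketch. identify_text is a made-up helper name, "Emmerdale" is an illustrative second candidate, and the ocr function is passed in as an argument (it is expected to behave like stbt.ocr) so that the helper can be exercised without a device under test.

```python
import difflib

def identify_text(ocr, candidates, cutoff=0.8, **ocr_kwargs):
    """Read text with `ocr` and return the closest candidate, or None.

    `ocr` is assumed to accept the same keyword arguments as stbt.ocr.
    """
    # Tell the OCR engine which words we are expecting.
    text = ocr(tesseract_user_words=list(candidates), **ocr_kwargs)
    # Tolerate a few misread characters by choosing the closest candidate.
    matches = difflib.get_close_matches(
        text.strip(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical usage in a test script:
# import stbt
# name = identify_text(stbt.ocr, ["EastEnders", "Emmerdale"],
#                      region=stbt.Region(52, 34, 120, 50))
```

The cutoff plays the same role as the threshold in fuzzy_match: raise it if different candidates are easily confused with one another.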

Matching serial numbers

For example, here is a code generated at random by one user's set-top box, for the purpose of pairing with a second-screen device:

>>> import cv2, stbt
>>> ujjm2lge = cv2.imread("images/UJJM2LGE.png")

The code consists of 8 characters randomly chosen from the upper case letters and the digits. Tesseract is not expecting this and we need to give it some help to recognise any characters at all:

>>> stbt.ocr(frame=ujjm2lge)
u''

The mode parameter is useful in this situation. It controls tesseract's segmentation, that is, how tesseract expects the text to be laid out in the image. The default, OcrMode.PAGE_SEGMENTATION_WITHOUT_OSD, tells tesseract to expect a page of text. Here we are expecting a single string, so we try OcrMode.SINGLE_WORD:

>>> stbt.ocr(frame=ujjm2lge, mode=stbt.OcrMode.SINGLE_WORD)
u'UJJMZLGE'

Close, but no cigar (tesseract read the "2" as a "Z"). Tesseract thinks that a word with a 2 in the middle is unlikely. We can tell tesseract to expect an 8-character word consisting of letters or digits, by specifying tesseract_user_patterns:

>>> stbt.ocr(frame=ujjm2lge, mode=stbt.OcrMode.SINGLE_WORD,
...          tesseract_user_patterns=[r'\n\n\n\n\n\n\n\n'])
u'UJJM2LGE'

Success.
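
Even with a pattern, the OCR result may occasionally come back with the wrong length or with stray characters, so it can be worth checking the shape of the text before using it. A minimal sketch with Python's re module follows. The helper name is_pairing_code is made up for this example, and the format (exactly 8 upper case letters or digits) is taken from the description of the code above.

```python
import re

# Shape of the code described above: exactly 8 upper case letters or digits.
PAIRING_CODE = re.compile(r"[A-Z0-9]{8}\Z")

def is_pairing_code(text):
    """Return True if `text` (e.g. the result of stbt.ocr) has the expected shape."""
    return PAIRING_CODE.match(text.strip()) is not None
```

This only checks the shape. The misread 'UJJMZLGE' from earlier would still pass, so a check like this complements tesseract_user_patterns rather than replacing it.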

Unfortunately the tesseract pattern language is a little idiosyncratic. It looks a little like regular expressions but is incompatible and much more limited. The only documentation is in a header file in the tesseract sources. Very roughly, regexes and tesseract patterns match up like this:

tesseract	regex
`\c`	`[a-zA-Z]`
`\d`	`[0-9]`
`\n`	`[a-zA-Z0-9]`
`\p`	`[[:punct:]]`
`\a`	`[a-z]`
`\A`	`[A-Z]`
`\*`	`*`
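
The table above can be turned into a small translator, which is handy for checking in Python that a string returned by OCR fits the same pattern you gave to tesseract. This is a rough sketch that inherits all the "very roughly" caveats of the table. The function names are made up for this example, `\p` is approximated with ASCII punctuation only, and any character that is not one of the escapes in the table is treated as a literal.

```python
import re
import string

# Rough regex equivalents of tesseract's pattern escapes, per the table above.
_ESCAPES = {
    "c": "[a-zA-Z]",
    "d": "[0-9]",
    "n": "[a-zA-Z0-9]",
    "p": "[" + re.escape(string.punctuation) + "]",  # ASCII punctuation only
    "a": "[a-z]",
    "A": "[A-Z]",
    "*": "*",  # repeats the preceding element
}

def tesseract_pattern_to_regex(pattern):
    """Translate a tesseract user pattern into an approximate Python regex."""
    parts = []
    i = 0
    while i < len(pattern):
        following = pattern[i + 1:i + 2]  # empty string at the end of the pattern
        if pattern[i] == "\\" and following in _ESCAPES:
            parts.append(_ESCAPES[following])
            i += 2
        else:
            parts.append(re.escape(pattern[i]))  # everything else is literal
            i += 1
    return "".join(parts)

def matches_pattern(pattern, text):
    """Check that the whole of `text` fits the translated pattern."""
    regex = tesseract_pattern_to_regex(pattern) + r"\Z"
    return re.match(regex, text) is not None
```

For example, matches_pattern(r'\n' * 8, text) checks a result against the same 8-character pattern used in the serial number example.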

Note: tesseract_user_patterns requires tesseract 3.03 or later.

Other parameters that might be useful

I haven't tried these myself. You would set them via stbt.ocr's tesseract_config parameter.

To use only the user-supplied words and patterns, man tesseract recommends:

load_system_dawg     F
load_freq_dawg       F

From the tesseract FAQ entry "How do I recognize only digits?":

tessedit_char_whitelist 0123456789
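
As a sketch of how the settings above might be passed to stbt.ocr: this assumes that tesseract_config accepts a dict mapping parameter names to values, and it is as untried as the settings themselves. build_tesseract_config is a made-up helper name.

```python
def build_tesseract_config(user_words_only=False, digits_only=False):
    """Collect the tesseract parameters discussed above into one dict."""
    config = {}
    if user_words_only:
        # Disable tesseract's built-in dictionaries, as `man tesseract` suggests.
        config["load_system_dawg"] = "F"
        config["load_freq_dawg"] = "F"
    if digits_only:
        # Restrict recognition to digits, as the tesseract FAQ suggests.
        config["tessedit_char_whitelist"] = "0123456789"
    return config

# Hypothetical usage in a test script (assumes tesseract_config takes a dict):
# import stbt
# text = stbt.ocr(tesseract_config=build_tesseract_config(digits_only=True))
```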

Interesting options from tesseract --print-parameters:

tessedit_parallelize    0       Run in parallel where possible