OCR Tips
Tesseract is the open-source Optical Character Recognition (OCR) engine that
stb-tester uses to read text from images. stbt.ocr allows you to customise
tesseract's parameters; this document contains advice for improving tesseract's
accuracy in specific scenarios.
Tesseract is optimised for reading pages of prose, so it doesn't always yield good results when reading the rather disjointed text typical of a graphical user interface.
Tesseract uses a dictionary to help choose the correct word even if individual characters were misread. This is helpful when reading real words but it can get in the way when reading characters and words with a different structure.
General tips
- Crop the region in which you are performing OCR tightly around the text, using the
region parameter to ocr and match_text.
Matching some known text
- Use the tesseract_user_words or tesseract_user_patterns parameters to ocr and match_text to tell the OCR engine what you're expecting.
- Use fuzzy matching to check if the returned text matches what you were expecting, e.g. a function like:
def fuzzy_match(string1, string2, threshold=0.8):
    import difflib
    return difflib.SequenceMatcher(None, string1, string2).ratio() >= threshold
Example
Looking for the text "EastEnders":
text = stbt.ocr(region=stbt.Region(52, 34, 120, 50))
assert fuzzy_match(text, "EastEnders")
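To get a feel for the threshold, the sketch below repeats the helper and adds a hypothetical `similarity` function (not part of stbt) that exposes the raw ratio. The misread strings are made-up examples of what OCR might return, not real tesseract output.

```python
import difflib


def similarity(string1, string2):
    """Return difflib's similarity ratio between 0.0 and 1.0."""
    return difflib.SequenceMatcher(None, string1, string2).ratio()


def fuzzy_match(string1, string2, threshold=0.8):
    """Return True if the two strings are at least `threshold` similar."""
    return similarity(string1, string2) >= threshold


# Made-up OCR results compared against the text we expect to see.
for candidate in ["EastEnders", "EastEnclers", "Emmerdale"]:
    print(candidate, round(similarity(candidate, "EastEnders"), 2),
          fuzzy_match(candidate, "EastEnders"))
```

A single-character misread of a ten-letter title still passes at the default threshold, while a different title does not. Short strings are less forgiving, so the threshold may need tuning for them.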
Matching serial numbers
For example, here is a code generated at random by one user's set-top box, for the purpose of pairing with a second-screen device:
>>> import cv2, stbt
>>> ujjm2lge = cv2.imread("images/UJJM2LGE.png")

The code consists of 8 characters randomly chosen from the upper case letters and the digits. Tesseract is not expecting this and we need to give it some help to recognise any characters at all:
>>> stbt.ocr(frame=ujjm2lge)
u''
The mode parameter is useful in this situation. It controls tesseract's
segmentation -- how tesseract expects the text to be laid out in the image. The
default OcrMode.PAGE_SEGMENTATION_WITHOUT_OSD tells tesseract to expect a
page of text. In this case we are expecting a single string so we try
OcrMode.SINGLE_WORD:
>>> stbt.ocr(frame=ujjm2lge, mode=stbt.OcrMode.SINGLE_WORD)
u'UJJMZLGE'
Close, but no cigar (tesseract read the "2" as a "Z"). Tesseract thinks that a
word with a 2 in the middle is unlikely. We can tell tesseract to expect an
8-character word consisting of letters or digits, by specifying
tesseract_user_patterns:
>>> stbt.ocr(frame=ujjm2lge, mode=stbt.OcrMode.SINGLE_WORD,
... tesseract_user_patterns=[r'\n\n\n\n\n\n\n\n'])
u'UJJM2LGE'
Success.
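Because the shape of the code is known, the OCR result can also be sanity-checked before it is used. This is a minimal sketch, assuming the code is always exactly 8 upper-case letters or digits as described above; `looks_like_pairing_code` is a hypothetical helper, not part of stbt.

```python
import re

# Assumption: exactly 8 characters, each an upper-case letter or a digit.
PAIRING_CODE_RE = re.compile(r"[A-Z0-9]{8}\Z")


def looks_like_pairing_code(text):
    """Return True if `text` has the expected shape of a pairing code."""
    return PAIRING_CODE_RE.match(text.strip()) is not None


print(looks_like_pairing_code("UJJM2LGE"))  # well-formed code
print(looks_like_pairing_code(""))          # empty OCR result
```

A check like this catches empty or truncated results, but it cannot catch the "2" read as "Z" seen earlier, because both characters are valid in the code.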
Unfortunately, the tesseract pattern language is a little idiosyncratic. It looks a bit like regular expressions but is incompatible and much more limited. The only documentation is in a header file in the tesseract sources. Very roughly, regexes and tesseract patterns match up like this:
| tesseract | regex |
|---|---|
| \c | [a-zA-Z] |
| \d | [0-9] |
| \n | [a-zA-Z0-9] |
| \p | [:punct:] |
| \a | [a-z] |
| \A | [A-Z] |
| \* | * |
Note: tesseract_user_patterns requires tesseract 3.03 or later.
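The rough correspondence above can be turned into a small translator, which is handy for checking OCR results in Python against the same pattern you gave tesseract. This is only a sketch: `pattern_to_regex` is a hypothetical helper, it covers just the escapes listed in the table, it approximates `\p` with ASCII punctuation, and it assumes `\*` always follows another element.

```python
import re
import string

# Rough equivalents from the table above; not the full pattern syntax.
_ESCAPES = {
    "c": "[a-zA-Z]",
    "d": "[0-9]",
    "n": "[a-zA-Z0-9]",
    "p": "[" + re.escape(string.punctuation) + "]",  # ASCII approximation
    "a": "[a-z]",
    "A": "[A-Z]",
    "*": "*",  # repeats the preceding element
}


def pattern_to_regex(pattern):
    """Translate a tesseract user pattern into a compiled Python regex."""
    parts = []
    i = 0
    while i < len(pattern):
        char = pattern[i]
        if char == "\\" and i + 1 < len(pattern):
            escaped = pattern[i + 1]
            # Unknown escapes (such as a doubled backslash) become literals.
            parts.append(_ESCAPES.get(escaped, re.escape(escaped)))
            i += 2
        else:
            parts.append(re.escape(char))
            i += 1
    return re.compile("".join(parts) + r"\Z")


code_regex = pattern_to_regex(r"\n\n\n\n\n\n\n\n")
print(bool(code_regex.match("UJJM2LGE")))
```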
Other parameters that might be useful
I haven't tried these myself. You would set these parameters via
stbt.ocr's tesseract_config parameter.
To use only the user-supplied words & patterns, man tesseract recommends:
load_system_dawg F
load_freq_dawg F
From the FAQ entry "How do I recognize only digits?":
tessedit_char_whitelist 0123456789
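As a sketch of how the settings above might be collected for tesseract_config, assuming it accepts a dict mapping tesseract parameter names to values: `restricted_vocabulary_config` is a hypothetical helper, passing "F" as a string simply mirrors the man-page syntax, and none of this has been tried against a real device.

```python
def restricted_vocabulary_config(whitelist=None):
    """Build a tesseract config dict that disables the built-in dictionaries.

    If `whitelist` is given, recognition is also limited to those characters.
    """
    config = {
        "load_system_dawg": "F",
        "load_freq_dawg": "F",
    }
    if whitelist is not None:
        config["tessedit_char_whitelist"] = whitelist
    return config


digits_only = restricted_vocabulary_config(whitelist="0123456789")
print(digits_only)

# Hypothetical usage (requires stb-tester and a captured frame):
#   stbt.ocr(frame=frame, tesseract_config=digits_only)
```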
Interesting options from tesseract --print-parameters:
tessedit_parallelize 0 Run in parallel where possible