How to turn off OCR (useful if you only want metadata extraction) - chrismattmann/tika-python GitHub Wiki

Problem

Even when parser.from_file(x, service = 'meta') is used, Tika still extracts the content. For files that need OCR, this can take a long time.

Tika offers several ways to turn off OCR. Since tika-python uses a Tika Server, the request-header approach can be used:

parser.from_file(x, service = 'meta', headers = {"X-Tika-OCRskipOcr": 'true'})

This also works with service = 'all', in which case any content that can be extracted without OCR is still returned.
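As a minimal sketch, the call above could be wrapped in a small helper so the header is defined in one place. The names SKIP_OCR_HEADERS and extract_metadata are hypothetical and not part of tika-python. The sketch also assumes that the dict returned by parser.from_file exposes the metadata under a 'metadata' key.

```python
# Header name and value are taken from the call shown above.
SKIP_OCR_HEADERS = {"X-Tika-OCRskipOcr": "true"}


def extract_metadata(path, service="meta"):
    """Return the metadata dict for `path`, asking Tika to skip OCR.

    `extract_metadata` is a hypothetical helper; it only wraps
    tika-python's parser.from_file.
    """
    # Imported lazily so this module can be loaded without tika installed.
    from tika import parser

    parsed = parser.from_file(path, service=service, headers=SKIP_OCR_HEADERS)
    # Assumption: the result is a dict with a 'metadata' entry.
    # Fall back to an empty dict if the entry is missing or None.
    return parsed.get("metadata") or {}
```

Calling extract_metadata('scan.pdf') (with any file path in place of the placeholder 'scan.pdf') should then return the metadata without waiting for OCR. Passing service='all' uses the same header.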