David's Tesseract branch - dhendrix/tesseract GitHub Wiki
Welcome to my Tesseract branch!
Please visit the main Tesseract repository here: https://github.com/tesseract-ocr
What
This is a branch of Tesseract at b68be44 with one modification: the addition of an optional command-line argument to take a second image file.
The two images supplied must have exactly the same textual content and layout; otherwise the text recognized in one image will not line up with the other, and the search functionality will be broken. Do not try this with images that differ in any way other than the application of rudimentary filters to help text stand out.
Why
- I want to use Tesseract to generate OCR data for a "cleaned-up" image, but I also want the final searchable PDF to look like my original scanned image. This means I don't have to keep a copy of the original image to print or share in the future.
- In my experience, cleaning up the scan increased OCR accuracy significantly. Unfortunately, OCR-optimized images often look significantly different from (and subjectively uglier than) the original.
- I had difficulty generating a searchable PDF from raw hOCR data using other tools such as hocr2pdf.
- Tesseract is already capable of generating a searchable PDF. Making it output a searchable PDF that used the original image seemed like the easiest way to get what I want.
How
Simply use the "--visible-pdf-image" argument to specify the image you wish to be visible in the output PDF.
Example: tesseract -l eng --visible-pdf-image "original_image.png" "cleaned_up_image.png" out pdf
This will generate out.pdf using cleaned_up_image.png for OCR and original_image.png for the actual image embedded in the PDF.
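If you run this often, the command can be wrapped in a small shell function. The following is only a sketch: the function name make_searchable_pdf and its argument order are made up for illustration, and it assumes the tesseract on your PATH was built from this branch, since "--visible-pdf-image" is the modification described above.

```shell
# make_searchable_pdf ORIGINAL CLEANED OUTBASE
# Hypothetical helper (not part of Tesseract): OCR the CLEANED image but
# embed ORIGINAL as the visible image in OUTBASE.pdf.
# Assumes "tesseract" is a build of this branch.
make_searchable_pdf() {
    if [ "$#" -ne 3 ]; then
        echo "usage: make_searchable_pdf ORIGINAL CLEANED OUTBASE" >&2
        return 2
    fi

    # Fail early if either input image is missing.
    for f in "$1" "$2"; do
        if [ ! -f "$f" ]; then
            echo "missing input image: $f" >&2
            return 1
        fi
    done

    # Same invocation as the example: visible image first, OCR image second.
    tesseract -l eng --visible-pdf-image "$1" "$2" "$3" pdf
}
```

Calling make_searchable_pdf "original_image.png" "cleaned_up_image.png" out should then be equivalent to the example command. The function only checks that both files exist; it cannot verify that they share the same content and layout.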
To clean up the original image, I suggest "mogrify -trim" from ImageMagick to cut off excess whitespace at edges (especially for scanned images) and textcleaner from Fred's ImageMagick scripts: http://www.fmwconcepts.com/imagemagick/textcleaner/index.php
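One way to chain these two tools is sketched below. The function name prepare_images is hypothetical, and the sketch assumes that mogrify is installed and that the textcleaner script has been downloaded, made executable, and placed on your PATH. The cleaned image is derived from the trimmed copy rather than from the raw scan, so that both files keep the same dimensions and layout.

```shell
# prepare_images SCAN TRIMMED CLEANED
# Hypothetical helper: writes a trimmed copy of SCAN to TRIMMED, then
# writes an OCR-friendly version of that trimmed copy to CLEANED.
prepare_images() {
    if [ "$#" -ne 3 ]; then
        echo "usage: prepare_images SCAN TRIMMED CLEANED" >&2
        return 2
    fi

    # mogrify edits files in place, so work on a copy and leave SCAN untouched.
    cp "$1" "$2" || return 1

    # Cut off excess whitespace at the edges.
    mogrify -trim "$2" || return 1

    # textcleaner is invoked as: textcleaner [options] infile outfile.
    # No options are passed here; tune them for your own scans.
    textcleaner "$2" "$3"
}
```

After prepare_images scan.png trimmed.png cleaned.png, the trimmed file would be the one to pass to "--visible-pdf-image" and the cleaned file the one to OCR, for example: tesseract -l eng --visible-pdf-image "trimmed.png" "cleaned.png" out pdf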