LP OCR.py - haltosan/RA-python-tools GitHub Wiki
## Layout Parser

This page is documentation for the script `LP-OCR.py`. It uses a model trained to find distinct regions within a page and runs OCR on each region. Region detection improves OCR performance and makes it possible to handle more complicated layouts (tables, parallel columns of text, etc.).
If you need another worked example, there is a link to a colab that does layout parsing, OCR, and retraining.
## Prerequisites

These commands should install all requirements that aren't already on the system:

```shell
pip install layoutparser torchvision
pip install "git+https://github.com/facebookresearch/[email protected]#egg=detectron2"
pip install "layoutparser[ocr]"
sudo apt install tesseract-ocr
pip install pytesseract
```
A GPU is required for any retraining.
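Before starting a retraining run, it can save time to confirm a GPU is actually visible. This is a minimal stdlib-only sketch (not part of `LP-OCR.py`) that looks for `nvidia-smi`; if torch is installed, `torch.cuda.is_available()` is the more direct check.

```python
import shutil
import subprocess

def gpu_available() -> bool:
    """Rough check for a usable NVIDIA GPU by probing for nvidia-smi."""
    smi = shutil.which("nvidia-smi")
    if smi is None:
        return False
    # nvidia-smi exits non-zero when no driver/GPU is usable
    return subprocess.run([smi], capture_output=True).returncode == 0
```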
## Usage

```shell
python LP-OCR.py [options] pdfDirectory pdfName startPage endPage model
```
Options (assume the opposite behavior is the default):

- `--delImages` / `-d`: delete images after running (aka don't keep a cache)
- `--cache` / `-c`: use an existing image cache
- `--imageOnly` / `-i`: only create images from the pdf, don't do OCR

Acceptable values for `model`:

- a directory containing `model_final.pth` and `config.yaml`
- `prima`: Prima layout, good for general use
- `lsr`: trained PubLayNet for land patent records
| option | description |
|---|---|
| pdfDirectory | the directory you want the script to run in |
| pdfName | the name of the pdf; please make sure this doesn't navigate any directories (like `Documents/file.pdf`), just the pdf name |
| startPage | page to start with |
| endPage | page to end on (inclusive) |
| model | which model to use to parse out the text regions |
| delImages | during runtime, an image of each page of the pdf is created; if this flag is set, those images are deleted afterward |
| cache | creating images from the pdf can take a while; if these images are kept, later runs on the same document are faster, and `-c` ensures the existing images are used to speed up the program |
| imageOnly | exits as soon as the images from the pdf are created; this option will not do any OCR |
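As a quick illustration of the interface above, here is a hypothetical `argparse` reconstruction of the documented CLI. The flag and positional names come from the table; the real script's parsing code may differ.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch of LP-OCR.py's CLI, based on the docs above.
    p = argparse.ArgumentParser(prog="LP-OCR.py")
    p.add_argument("--delImages", "-d", action="store_true",
                   help="delete page images after running")
    p.add_argument("--cache", "-c", action="store_true",
                   help="use an existing image cache")
    p.add_argument("--imageOnly", "-i", action="store_true",
                   help="only create images from the pdf, skip OCR")
    p.add_argument("pdfDirectory", help="directory to run in")
    p.add_argument("pdfName", help="pdf name only, no directories")
    p.add_argument("startPage", type=int)
    p.add_argument("endPage", type=int, help="inclusive")
    p.add_argument("model", help="'prima', 'lsr', or a model directory")
    return p

# Example invocation: reuse the image cache for pages 1-12 of deeds.pdf
args = build_parser().parse_args(
    ["-c", "scans", "deeds.pdf", "1", "12", "prima"])
```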
There is a hidden option `-o`, which was made for a specific project. If set, the only command line arguments are as follows:

```shell
python LP-OCR.py -o imageDirectory model
```

All images in `imageDirectory` will then be transcribed into one text file called `output.txt`. `model` has the same expected values as above.
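The `-o` mode can be sketched as: collect the images in the directory, OCR each in order, and concatenate the results into one file. This is an illustrative stdlib-only sketch, not the script's actual code; the OCR step is passed in as a function, since the real script presumably uses something like `pytesseract.image_to_string`.

```python
from pathlib import Path
from typing import Callable

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}

def transcribe_directory(image_dir: str,
                         ocr: Callable[[Path], str],
                         out_name: str = "output.txt") -> Path:
    """Run `ocr` on every image in image_dir (in sorted order) and
    concatenate the results into one text file."""
    images = sorted(p for p in Path(image_dir).iterdir()
                    if p.suffix.lower() in IMAGE_EXTS)
    out = Path(image_dir) / out_name
    out.write_text("\n".join(ocr(p) for p in images))
    return out
```

With pytesseract installed, `ocr` could be `lambda p: pytesseract.image_to_string(str(p))`.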
## Models

### Current selections

There are currently 2 models to pick from: prima and customPubLayout.

- prima: This is the PrimaLayout model that came with the layout parser pretrained model zoo. It does a decent job on most layouts.
- customPubLayout: This is a retrained version of the PubLayNet model that came with the layout parser pretrained zoo. It was trained on the land patent records (selected on the command line with the `lsr` value).
### Custom models

To use a different model, the only required files are `model_final.pth` and `config.yaml`. The config file is specific to layout parser; see the layout parser website for how to get a valid config file, and see the later sections on retraining an existing model. Both `model_final.pth` and `config.yaml` need to be in the same directory if you are passing the model as a command line argument.
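A quick way to catch a misconfigured model directory before a long run is to check for the two required files up front. This is a small illustrative helper (not part of `LP-OCR.py`):

```python
from pathlib import Path

REQUIRED_FILES = ("model_final.pth", "config.yaml")

def missing_model_files(model_dir: str) -> list:
    """Return the names of required model files missing from model_dir."""
    d = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (d / name).is_file()]
```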
## Debugging

Here are some issues that may arise and the recommended fixes:

- DLL issues
  - ensure that `pythoncom37.dll` and related files aren't found multiple times in the PATH
  - see https://github.com/mhammond/pywin32/issues/1709 and https://github.com/Azure/azure-cli/issues/17986
- `OSError: [Errno 22] Invalid argument: 'C:\Users\USER/.torch/iopath_cache\s/h7th27jfv19rxiy\model_final.pth?dl=1.lock'`
  - ensure the model weights are correctly downloaded and/or loaded; this may also be an issue with pytesseract
  - see https://stackoverflow.com/questions/68094922/pytorch-throws-oserror-on-detectron2layoutmodel
- `Cannot find field 'gt_masks' in the given Instances!`
  - ensure that you have masks defined in the training data, or switch to a model that doesn't require masks (Faster R-CNN versus Mask R-CNN)
  - see https://github.com/facebookresearch/detectron2/issues/485
- Can't find model weights after training a model

Just a good link to have: https://towardsdatascience.com/auto-parse-and-understand-any-document-5d72e81b0be9
## Model Retraining

For a detectron2 training tutorial, see this colab.

You'll need the following git repo to have a config file that works in the next stages. Here's an example of those stages, with some explanations to supplement the documentation already on the github page.
```shell
git clone https://github.com/Layout-Parser/layout-model-training.git
cd layout-model-training
python tools/train_net.py \
    --dataset_name "lsr" \
    --json_annotation_train "/content/training/coco.json" \
    --image_path_train "/content/training" \
    --config-file "/content/layout-model-training/pubLay.yml" \
    MODEL.WEIGHTS "/content/layout-model-training/pubLay.pth" \
    OUTPUT_DIR "/content/out" \
    SOLVER.MAX_ITER 300 \
    SOLVER.IMS_PER_BATCH 2 \
    SOLVER.BASE_LR 0.00025 \
    MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE 128 \
    MODEL.ROI_HEADS.NUM_CLASSES 1
```
`train_net.py` takes 2 different types of arguments: script-specific and detectron2-specific. See the github project for documentation on any argument starting with `--`, while all other arguments (in all caps) are detectron2 config overrides. The `--json_annotation_train` argument expects a json file that follows the COCO dataset format. The config files can be found in the layout parser pretrained zoo. No clear documentation exists on how to get a config file for other models.
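For reference, a COCO-format annotation file has three top-level lists. The sketch below builds a minimal example with made-up values (the field names follow the standard COCO detection format; the file name, sizes, and category are illustrative only). Note that `category_id` values must line up with `MODEL.ROI_HEADS.NUM_CLASSES`, and polygon `segmentation` entries are what supply `gt_masks` for mask-based models.

```python
import json

# Minimal COCO-style annotation structure (illustrative values only).
coco = {
    "images": [
        {"id": 1, "file_name": "page_001.png", "width": 2550, "height": 3300},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,              # must stay within NUM_CLASSES
            "bbox": [100, 150, 800, 400],  # [x, y, width, height]
            "area": 800 * 400,
            "iscrowd": 0,
            "segmentation": [],            # polygon masks, if the model needs gt_masks
        }
    ],
    "categories": [{"id": 1, "name": "text_region"}],
}

coco_json = json.dumps(coco, indent=2)
```

Writing `coco_json` to `coco.json` gives a file in the shape `--json_annotation_train` expects.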
After training is done, there should be `model_final.pth` and `config.yaml` files near the working directory. These are the same files that are required for a custom model in the OCR script.