
Layout Parser

This script uses a model trained to find distinct regions within a page, then runs OCR on each region separately. Region detection improves OCR quality and makes it easier to handle more complicated layouts (tables, parallel columns of text, etc.). This page is documentation for the script LP-OCR.py.

Here is a link to a colab that does layout parsing, OCR, and retraining if you need another example of the workflow.
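
For reference, the core of what the script does per page looks roughly like this in Python. This is a minimal sketch based on the Layout Parser tutorial rather than the script itself; the image path, padding, and score threshold are placeholder values, and it assumes the prerequisites below are installed.

import layoutparser as lp
import cv2

# Load one page image (the script renders each pdf page to an image first)
image = cv2.imread("page-1.png")   # placeholder file name
image = image[..., ::-1]           # OpenCV loads BGR; Layout Parser expects RGB

# Load a layout model from the Layout Parser model zoo (here: PrimaLayout)
model = lp.Detectron2LayoutModel(
    "lp://PrimaLayout/mask_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    label_map={1: "TextRegion", 2: "ImageRegion", 3: "TableRegion",
               4: "MathsRegion", 5: "SeparatorRegion", 6: "OtherRegion"},
)

# Detect regions, then run Tesseract on each region separately
layout = model.detect(image)
ocr_agent = lp.TesseractAgent(languages="eng")
for block in layout:
    segment = block.pad(left=5, right=5, top=5, bottom=5).crop_image(image)
    block.set(text=ocr_agent.detect(segment), inplace=True)

# Print the recovered text, region by region
for block in layout:
    print(block.text)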

Prerequisites

These commands should install all requirements that aren't already on the system.

pip install layoutparser torchvision && pip install "git+https://github.com/facebookresearch/detectron2.git#egg=detectron2"
pip install "layoutparser[ocr]"
sudo apt install tesseract-ocr
pip install pytesseract

A GPU is required for any retraining.
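
A quick sanity check that the installs worked and that a GPU is visible (just a throwaway snippet, not part of the script):

import torch
import layoutparser
import pytesseract

print("CUDA available:", torch.cuda.is_available())              # should be True if you plan to retrain
print("Tesseract version:", pytesseract.get_tesseract_version())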

Usage

python LP-OCR.py [options] pdfDirectory pdfName startPage endPage model
  Options: (assume opposite behavior is default)
     --delImages | -d      delete images after running (aka don't keep a cache) 
     --cache     | -c      use an existing image cache
     --imageOnly | -i      only create images from pdf, don't do ocr

 acceptable values for model:
     [directory containing model_final.pth and config.yaml]
     prima      prima layout, good for general use
     lsr        trained pubLayout for land patent records
Option / Argument   Description
pdfDirectory        the directory you want the script to run in
pdfName             the name of the pdf; make sure this doesn't include any directory components (like Documents/file.pdf), just the pdf name
startPage           page to start with
endPage             page to end on (inclusive)
model               which model to use to parse out the text regions
delImages           images of each page of the pdf are created at run time; if this flag is set, those images are deleted afterwards
cache               creating the images from the pdf can take a while; if they are kept, later runs on the same document are faster. -c makes the script reuse these existing images
imageOnly           exits as soon as the images from the pdf are created; no OCR is done
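
As an example, a run that OCRs pages 1 through 25 of a pdf with the prima model and reuses any cached page images could look like this (the directory and file names are made up):

python LP-OCR.py -c landRecords records.pdf 1 25 prima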

There is a hidden option, -o, which was made for a specific project. If it is set, the only command line arguments are as follows:

python LP-OCR.py -o imageDirectory model

All images in imageDirectory are then transcribed into one text file called 'output.txt'. model accepts the same values as described above.
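
For example, to transcribe every image in a (made-up) scans directory with the prima model:

python LP-OCR.py -o scans prima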


Models

Current selections

There are currently two models to pick from: prima and customPubLayout (the latter is what the lsr keyword selects on the command line).

  • prima: This is the PrimaLayout model from the Layout Parser pretrained model zoo. It does a decent job on most layouts.
  • customPubLayout: This is a retrained version of the PubLayNet model from the same pretrained zoo. It was trained on the land patent records.

Custom models

To use a different model, the only required files are model_final.pth and config.yaml. The config file is specific to Layout Parser; see the Layout Parser website for how to get a valid config file. See the section below on retraining an existing model.

Both model_final.pth and config.yaml need to be in the same directory if you are passing the model as a directory on the command line.
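
If you want to sanity-check a custom model directly in Python before pointing the script at it, loading it looks roughly like this (a sketch; the directory name and label map are placeholders and must match how the model was actually trained):

import layoutparser as lp

model = lp.Detectron2LayoutModel(
    config_path="myModel/config.yaml",     # local config produced by training
    model_path="myModel/model_final.pth",  # local weights
    label_map={0: "TextRegion"},           # placeholder; use the classes the model was trained on
)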

Debugging

Here are some issues that may arise and the recommended fixes:

Model Retraining

For a detectron2 training tutorial, see this colab.

You'll need the following git repo in order to have a config file that works in the later stages. Below is an example of those stages with some explanations to supplement the documentation already on the GitHub page.

git clone https://github.com/Layout-Parser/layout-model-training.git
cd layout-model-training
python tools/train_net.py \
    --dataset_name "lsr" \
    --json_annotation_train "/content/training/coco.json" \
    --image_path_train "/content/training" \
    --config-file "/content/layout-model-training/pubLay.yml" \
    MODEL.WEIGHTS "/content/layout-model-training/pubLay.pth" \
    OUTPUT_DIR "/content/out" \
    SOLVER.MAX_ITER 300 \
    SOLVER.IMS_PER_BATCH 2 \
    SOLVER.BASE_LR 0.00025 \
    MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE 128 \
    MODEL.ROI_HEADS.NUM_CLASSES 1

train_net.py takes two kinds of arguments: script-specific arguments and detectron2 arguments. See the GitHub project for documentation on any argument starting with --; all other arguments (in all caps) are detectron2 config overrides. The --json_annotation_train argument expects a json file that follows the COCO dataset format. The config files can be found in the Layout Parser pretrained zoo; no clear documentation exists on how to get a config file for other models.
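
For reference, the json passed to --json_annotation_train has roughly this shape, shown here as a Python dict (the ids, file names, and boxes are made up; mask-based models may also require segmentation polygons on each annotation):

coco = {
    "images": [
        {"id": 1, "file_name": "page-1.png", "width": 2550, "height": 3300},
    ],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [100, 250, 1800, 400],   # [x, y, width, height] in pixels
         "area": 720000, "iscrowd": 0},
    ],
    "categories": [
        {"id": 1, "name": "TextRegion"},
    ],
}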

After training is done, model_final.pth and config.yaml should appear in the training output directory (OUTPUT_DIR in the example above). These are the same files that are required for a custom model in the OCR script.
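
Assuming the OUTPUT_DIR from the example command above, a follow-up run of the OCR script with the newly trained model would then look something like this (the pdf directory and name are made up):

python LP-OCR.py pdfs records.pdf 1 10 /content/out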