LP OCR.py - haltosan/RA-python-tools Wiki

Layout Parser

This script uses a model trained to find distinct regions within a page, then runs OCR on each region separately. Region detection improves OCR performance and makes it possible to parse more complicated layouts (tables, parallel columns of text, etc.). This page is documentation for the script LP-OCR.py.
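As a sketch of this region-then-OCR pipeline using the layoutparser API (the PRImA config path is one of the library's pretrained models; the image path, padding, and score threshold are illustrative, not the script's actual settings):

```python
import layoutparser as lp
import cv2

# Load a page image (OpenCV reads BGR; flip to RGB for the model)
image = cv2.imread("page.png")[..., ::-1]

# Detect layout regions with a pretrained model (PRImA here, as an example)
model = lp.Detectron2LayoutModel(
    "lp://PrimaLayout/mask_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
)
layout = model.detect(image)

# Run Tesseract OCR on each detected region separately
ocr_agent = lp.TesseractAgent(languages="eng")
for block in layout:
    # Pad each region slightly so characters at the edges aren't clipped
    segment = block.pad(left=5, right=5, top=5, bottom=5).crop_image(image)
    text = ocr_agent.detect(segment)
    print(text)
```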

Here is a link to a colab that does layout parsing, OCR, and retraining, if you need another example.

Prerequisites

These commands should install all requirements that aren't already on the system.

pip install layoutparser torchvision && pip install "git+https://github.com/facebookresearch/[email protected]#egg=detectron2"
pip install "layoutparser[ocr]"
sudo apt install tesseract-ocr
pip install pytesseract

GPU is required for any retraining.

Usage

python LP-OCR.py [options] pdfDirectory pdfName startPage endPage model
  Options: (the opposite behavior is the default)
     --delImages | -d      delete images after running (i.e. don't keep a cache)
     --cache     | -c      use an existing image cache
     --imageOnly | -i      only create images from the pdf, don't do OCR

  Acceptable values for model:
     [directory containing model_final.pth and config.yaml]
     prima      prima layout model, good for general use
     lsr        retrained pubLayout model for land patent records
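For example, a hypothetical run over pages 1 through 10 of a PDF in ./pdfs using the prima model, deleting the image cache afterwards (the directory and PDF name here are made up):

```shell
python LP-OCR.py -d ./pdfs deeds 1 10 prima
```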

There is a hidden option -o which was made for obits. If set, the only command line arguments are as follows:

python LP-OCR.py -o imageDirectory model

All images in imageDirectory will then be transcribed into a single text file called 'output.txt'. The model argument accepts the same values listed above.
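A minimal sketch of the -o behavior, assuming pytesseract and a flat directory of PNG images (the real script also applies the chosen layout model; this sketch skips that step, and the directory name is a placeholder):

```python
from pathlib import Path

import pytesseract
from PIL import Image

image_dir = Path("imageDirectory")  # hypothetical path

# Transcribe every image in the directory into one combined text file
with open("output.txt", "w") as out:
    for img_path in sorted(image_dir.glob("*.png")):
        text = pytesseract.image_to_string(Image.open(img_path))
        out.write(text + "\n")
```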


Models

Current selections

There are currently two models to pick from: prima and lsr (a custom-trained pubLayout model).

Custom models

To use a different model, the only required files are model_final.pth and config.yaml. The config file is specific to Layout Parser; see the Layout Parser website for how to get a valid config file, and see the later section on retraining an existing model.

Both model_final.pth and config.yaml need to be in the same directory if you are passing the model directory as a command line argument.
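Loading such a custom model with layoutparser looks roughly like this (the directory name and label map are placeholders; the label map must match the classes the model was trained on):

```python
import layoutparser as lp

# Point layoutparser at the two files produced by training
model = lp.Detectron2LayoutModel(
    config_path="myModel/config.yaml",       # hypothetical directory
    model_path="myModel/model_final.pth",
    label_map={0: "Record"},                 # must match NUM_CLASSES from training
)
```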

Debugging

Here are some issues that may arise and the recommended fixes:

Model Retraining

For a detectron2 training tutorial, see this colab.

You'll need the following git repo to get a config file that works in the next stages. Here's an example of those stages, with some explanations to supplement the documentation already on the GitHub page.

git clone https://github.com/Layout-Parser/layout-model-training.git
cd layout-model-training
python tools/train_net.py \
    --dataset_name "lsr" \
    --json_annotation_train "/content/training/coco.json" \
    --image_path_train "/content/training" \
    --config-file "/content/layout-model-training/pubLay.yml" \
    MODEL.WEIGHTS "/content/layout-model-training/pubLay.pth" \
    OUTPUT_DIR "/content/out" \
    SOLVER.MAX_ITER 300 \
    SOLVER.IMS_PER_BATCH 2 \
    SOLVER.BASE_LR 0.00025 \
    MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE 128 \
    MODEL.ROI_HEADS.NUM_CLASSES 1

train_net.py takes two different kinds of arguments: script-specific arguments and detectron2 arguments. See the GitHub project for documentation on any argument starting with --; all other arguments (in all caps) are detectron2 config overrides. The --json_annotation_train argument expects a JSON file that follows the COCO dataset format. The config files can be found in the Layout Parser pretrained model zoo. No clear documentation exists on how to get a config file for other models.
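For reference, a minimal COCO-format annotation file has three top-level lists: images, annotations, and categories. The sketch below writes one (the file name, dimensions, box coordinates, and the single "record" category are made up for illustration):

```python
import json

# Minimal COCO-style annotation structure for one image with one labeled region
coco = {
    "images": [
        {"id": 1, "file_name": "page_001.png", "width": 1700, "height": 2200}
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,       # references images[].id
            "category_id": 1,    # references categories[].id
            "bbox": [100, 150, 800, 400],  # [x, y, width, height] in pixels
            "area": 800 * 400,
            "iscrowd": 0,
        }
    ],
    "categories": [{"id": 1, "name": "record"}],
}

with open("coco.json", "w") as f:
    json.dump(coco, f, indent=2)
```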

After training is done, model_final.pth and config.yaml should appear in the output directory (OUTPUT_DIR, /content/out in the example above). These are the same files required for a custom model in the OCR script.