LP OCR.py - haltosan/RA-python-tools GitHub Wiki
## Layout Parser

This page is documentation for the script `LP-OCR.py`. It uses a model trained to find distinct regions within a page and runs OCR on each region. Region detection improves OCR performance and makes it possible to handle more complicated layouts (tables, parallel columns of text, etc.).
If you need another worked example, there is a link to a colab that does layout parsing, OCR, and retraining.
## Prerequisites

These commands should install all requirements that aren't already on the system:

```shell
pip install layoutparser torchvision
pip install "git+https://github.com/facebookresearch/[email protected]#egg=detectron2"
pip install "layoutparser[ocr]"
sudo apt install tesseract-ocr
pip install pytesseract
```
A GPU is required for any retraining.
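Before starting a retraining run, it can save time to confirm a GPU is actually visible. This is a minimal stdlib-only sketch (not part of `LP-OCR.py`) that looks for `nvidia-smi`; if torch is installed, `torch.cuda.is_available()` is the more direct check.

```python
import shutil
import subprocess

def gpu_available() -> bool:
    """Rough check for a usable NVIDIA GPU by probing for nvidia-smi."""
    smi = shutil.which("nvidia-smi")
    if smi is None:
        return False
    # nvidia-smi exits non-zero when no driver/GPU is usable
    return subprocess.run([smi], capture_output=True).returncode == 0
```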
## Usage

```shell
python LP-OCR.py [options] pdfDirectory pdfName startPage endPage model
```
Options (assume the opposite behavior is the default):

- `--delImages` / `-d`: delete images after running (aka don't keep a cache)
- `--cache` / `-c`: use an existing image cache
- `--imageOnly` / `-i`: only create images from the pdf, don't do OCR

Acceptable values for `model`:

- a directory containing `model_final.pth` and `config.yaml`
- `prima`: Prima layout, good for general use
- `lsr`: trained PubLayNet for land patent records
| option | description |
|---|---|
| pdfDirectory | the directory you want the script to run in |
| pdfName | the name of the pdf; please make sure this doesn't navigate any directories (like `Documents/file.pdf`), just the pdf name |
| startPage | page to start with |
| endPage | page to end on (inclusive) |
| model | which model to use to parse out the text regions |
| delImages | during runtime, an image of each page of the pdf is created; if this flag is set, those images are deleted afterward |
| cache | creating images from the pdf can take a while; if these images are kept, later runs on the same document are faster, and `-c` ensures the existing images are used to speed up the program |
| imageOnly | exits as soon as the images from the pdf are created; this option will not do any OCR |
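As a quick illustration of the interface above, here is a hypothetical `argparse` reconstruction of the documented CLI. The flag and positional names come from the table; the real script's parsing code may differ.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch of LP-OCR.py's CLI, based on the docs above.
    p = argparse.ArgumentParser(prog="LP-OCR.py")
    p.add_argument("--delImages", "-d", action="store_true",
                   help="delete page images after running")
    p.add_argument("--cache", "-c", action="store_true",
                   help="use an existing image cache")
    p.add_argument("--imageOnly", "-i", action="store_true",
                   help="only create images from the pdf, skip OCR")
    p.add_argument("pdfDirectory", help="directory to run in")
    p.add_argument("pdfName", help="pdf name only, no directories")
    p.add_argument("startPage", type=int)
    p.add_argument("endPage", type=int, help="inclusive")
    p.add_argument("model", help="'prima', 'lsr', or a model directory")
    return p

# Example invocation: reuse the image cache for pages 1-12 of deeds.pdf
args = build_parser().parse_args(
    ["-c", "scans", "deeds.pdf", "1", "12", "prima"])
```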
There is a hidden option `-o`, which was made for a specific project. If set, the only command line arguments are as follows:

```shell
python LP-OCR.py -o imageDirectory model
```

All images in `imageDirectory` will then be transcribed into one text file called `output.txt`. `model` has the same expected values as above.
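The `-o` mode can be sketched as: collect the images in the directory, OCR each in order, and concatenate the results into one file. This is an illustrative stdlib-only sketch, not the script's actual code; the OCR step is passed in as a function, since the real script presumably uses something like `pytesseract.image_to_string`.

```python
from pathlib import Path
from typing import Callable

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}

def transcribe_directory(image_dir: str,
                         ocr: Callable[[Path], str],
                         out_name: str = "output.txt") -> Path:
    """Run `ocr` on every image in image_dir (in sorted order) and
    concatenate the results into one text file."""
    images = sorted(p for p in Path(image_dir).iterdir()
                    if p.suffix.lower() in IMAGE_EXTS)
    out = Path(image_dir) / out_name
    out.write_text("\n".join(ocr(p) for p in images))
    return out
```

With pytesseract installed, `ocr` could be `lambda p: pytesseract.image_to_string(str(p))`.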
## Models

### Current selections

There are currently 2 models to pick from: prima and customPubLayout.

- prima: This is the PrimaLayout model that came with the layout parser pretrained model zoo. It does a decent job on most layouts.
- customPubLayout: This is a retrained version of the PubLayNet model that came with the layout parser pretrained zoo. It was trained on the land patent records (selected on the command line with the `lsr` value).
### Custom models

To use a different model, the only required files are `model_final.pth` and `config.yaml`. The config file is specific to layout parser; see the layout parser website for how to get a valid config file, and see the later sections on retraining an existing model. Both `model_final.pth` and `config.yaml` need to be in the same directory if you are passing the model as a command line argument.
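A quick way to catch a misconfigured model directory before a long run is to check for the two required files up front. This is a small illustrative helper (not part of `LP-OCR.py`):

```python
from pathlib import Path

REQUIRED_FILES = ("model_final.pth", "config.yaml")

def missing_model_files(model_dir: str) -> list:
    """Return the names of required model files missing from model_dir."""
    d = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (d / name).is_file()]
```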
## Debugging

Here are some issues that may arise and the recommended fixes:

- DLL issues
  - ensure that `pythoncom37.dll` and related files aren't found multiple times in the PATH
  - see https://github.com/mhammond/pywin32/issues/1709 and https://github.com/Azure/azure-cli/issues/17986
- `OSError: [Errno 22] Invalid argument: 'C:\Users\USER/.torch/iopath_cache\s/h7th27jfv19rxiy\model_final.pth?dl=1.lock'`
  - ensure the model weights are correctly downloaded and/or loaded; this may also be an issue with pytesseract
  - see https://stackoverflow.com/questions/68094922/pytorch-throws-oserror-on-detectron2layoutmodel
- `Cannot find field 'gt_masks' in the given Instances!`
  - ensure that you have masks defined in the training data, or switch to a model that doesn't require masks (Faster R-CNN versus Mask R-CNN)
  - see https://github.com/facebookresearch/detectron2/issues/485
- Can't find model weights after training a model

Just a good link to have: https://towardsdatascience.com/auto-parse-and-understand-any-document-5d72e81b0be9
## Model Retraining

For a detectron2 training tutorial, see this colab.

You'll need the following git repo to have a config file that works in the next stages. Here's an example of those stages, with some explanations to supplement the documentation already on the github page.
```shell
git clone https://github.com/Layout-Parser/layout-model-training.git
cd layout-model-training
python tools/train_net.py \
    --dataset_name "lsr" \
    --json_annotation_train "/content/training/coco.json" \
    --image_path_train "/content/training" \
    --config-file "/content/layout-model-training/pubLay.yml" \
    MODEL.WEIGHTS "/content/layout-model-training/pubLay.pth" \
    OUTPUT_DIR "/content/out" \
    SOLVER.MAX_ITER 300 \
    SOLVER.IMS_PER_BATCH 2 \
    SOLVER.BASE_LR 0.00025 \
    MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE 128 \
    MODEL.ROI_HEADS.NUM_CLASSES 1
```
`train_net.py` takes 2 different types of arguments: script-specific and detectron2-specific. See the github project for documentation on any argument starting with `--`, while all other arguments (in all caps) are detectron2 config overrides. The `--json_annotation_train` argument expects a json file that follows the COCO dataset format. The config files can be found in the layout parser pretrained zoo. No clear documentation exists on how to get a config file for other models.
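For reference, a COCO-format annotation file has three top-level lists. The sketch below builds a minimal example with made-up values (the field names follow the standard COCO detection format; the file name, sizes, and category are illustrative only). Note that `category_id` values must line up with `MODEL.ROI_HEADS.NUM_CLASSES`, and polygon `segmentation` entries are what supply `gt_masks` for mask-based models.

```python
import json

# Minimal COCO-style annotation structure (illustrative values only).
coco = {
    "images": [
        {"id": 1, "file_name": "page_001.png", "width": 2550, "height": 3300},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,              # must stay within NUM_CLASSES
            "bbox": [100, 150, 800, 400],  # [x, y, width, height]
            "area": 800 * 400,
            "iscrowd": 0,
            "segmentation": [],            # polygon masks, if the model needs gt_masks
        }
    ],
    "categories": [{"id": 1, "name": "text_region"}],
}

coco_json = json.dumps(coco, indent=2)
```

Writing `coco_json` to `coco.json` gives a file in the shape `--json_annotation_train` expects.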
After training is done, there should be `model_final.pth` and `config.yaml` files near the working directory. These are the same files that are required for a custom model in the OCR script.