LP OCR.py - haltosan/RA-python-tools Wiki
This script uses a model trained to find distinct regions within a page and runs OCR on each region. Region detection improves OCR accuracy and makes it possible to handle more complicated layouts (tables, parallel columns of text, etc.). This page is documentation for the script LP-OCR.py.
Here is a link to a colab that does layout parsing, OCR, and retraining, if you need another example.
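The overall approach, detecting regions and then stitching their OCR output back together in reading order, can be sketched with plain data structures. The `sort_regions` helper and region format below are illustrative only, not taken from LP-OCR.py:

```python
# Each detected region is a bounding box (x1, y1, x2, y2) plus its OCR text.
# Sorting regions top-to-bottom, then left-to-right, approximates reading
# order for simple pages; parallel columns and tables need extra handling,
# which is why region detection matters for this kind of data.

def sort_regions(regions):
    """Sort region dicts by their top edge, then by their left edge."""
    return sorted(regions, key=lambda r: (r["box"][1], r["box"][0]))

def join_text(regions):
    """Concatenate region text in approximate reading order."""
    return "\n".join(r["text"] for r in sort_regions(regions))

if __name__ == "__main__":
    page = [
        {"box": (300, 50, 500, 200), "text": "right column"},
        {"box": (10, 50, 250, 200), "text": "left column"},
        {"box": (10, 300, 500, 400), "text": "footer"},
    ]
    print(join_text(page))
```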
These lines should get all requirements that aren't already on the system.
```
pip install layoutparser torchvision
pip install "git+https://github.com/facebookresearch/[email protected]#egg=detectron2"
pip install "layoutparser[ocr]"
sudo apt install tesseract-ocr
pip install pytesseract
```
GPU is required for any retraining.
```
python LP-OCR.py [options] pdfDirectory pdfName startPage endPage model
```

Options (the opposite behavior is the default):

- `--delImages` | `-d`: delete images after running (i.e. don't keep a cache)
- `--cache` | `-c`: use an existing image cache
- `--imageOnly` | `-i`: only create images from the pdf; don't do OCR

Acceptable values for model:

- a directory containing model_final.pth and config.yaml
- prima: PrimaLayout, good for general use
- lsr: a retrained PubLayNet model for land patent records
pdfDirectory - the directory you want the script to run in
pdfName - the name of the pdf; please make sure this is just the file name and doesn't navigate any directories (like Documents/file.pdf)
startPage - page to start with
endPage - page to end on (inclusive)
model - which model to use to parse out the text regions
delImages - during run time, images of each page of the pdf will be created; if this flag is set, those images will be deleted
cache - creating images from the pdf can take a while. If these pictures are kept, subsequent runs on the same document will be faster; -c ensures the existing images are reused to speed up the program
imageOnly - exits as soon as images from the pdf are created; this option will not do any ocr
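For reference, the documented interface maps onto Python's `argparse` roughly as follows. This is a hypothetical reconstruction for illustration, not the actual parsing code in LP-OCR.py:

```python
import argparse

# Sketch of the documented command line; the real script may differ.
parser = argparse.ArgumentParser(prog="LP-OCR.py")
parser.add_argument("--delImages", "-d", action="store_true",
                    help="delete page images after running (no cache)")
parser.add_argument("--cache", "-c", action="store_true",
                    help="reuse an existing image cache")
parser.add_argument("--imageOnly", "-i", action="store_true",
                    help="only create images from the pdf, skip OCR")
parser.add_argument("pdfDirectory", help="directory to run in")
parser.add_argument("pdfName", help="pdf file name, no directories")
parser.add_argument("startPage", type=int)
parser.add_argument("endPage", type=int, help="inclusive")
parser.add_argument("model", help="model directory, prima, or lsr")

# Example invocation: reuse cached images for pages 1-10 with prima.
args = parser.parse_args(["-c", "scans", "patents.pdf", "1", "10", "prima"])
print(args.cache, args.startPage, args.endPage, args.model)
```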
There is a hidden option, -o, which was made for obituaries. If set, the only command line arguments are as follows:
```
python LP-OCR.py -o imageDirectory model
```
All images in imageDirectory will then be transcribed into one text file called 'output.txt'. The model argument has the same expected values as above.
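In spirit, the -o mode just runs OCR on every image in a directory and collects the results into one text file. A stdlib-only sketch, with a stand-in `ocr` function in place of the script's real OCR call:

```python
import os

def ocr(image_path):
    """Stand-in for the real OCR call (pytesseract in the actual script)."""
    return f"text from {os.path.basename(image_path)}"

def transcribe_directory(image_dir, image_names):
    """Return OCR output for every image, one entry per line."""
    lines = [ocr(os.path.join(image_dir, name))
             for name in sorted(image_names)]
    return "\n".join(lines)

if __name__ == "__main__":
    # Collect everything into a single output.txt, as -o does.
    with open("output.txt", "w") as out:
        out.write(transcribe_directory("images", ["p1.png", "p2.png"]))
```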
There are currently two models to pick from: prima and customPubLayout.
- prima: This is the PrimaLayout that came with the pretrained model zoo with layout parser. This does a decent job at most layouts.
- customPubLayout: This is a retrained version of the PubLayNet model that came with the layout parser pretrained zoo. It was trained on the land records.
To use a different model, the only required files are model_final.pth and config.yaml. The config file is specific to layout parser; see the layout parser website for how to get a valid config file. See the later sections on retraining an existing model.
Both model_final.pth and config.yaml need to be in the same directory if you are passing a model directory on the command line.
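Before pointing the script at a custom model directory, it can be worth checking that both files are actually present. A small hypothetical helper, not part of the script:

```python
from pathlib import Path

# The two files a custom model directory must contain.
REQUIRED = ("model_final.pth", "config.yaml")

def is_valid_model_dir(path):
    """True if the directory holds both files a custom model needs."""
    d = Path(path)
    return d.is_dir() and all((d / name).is_file() for name in REQUIRED)
```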
Here are some issues that may arise and the fixes that are recommended:
```
OSError: [Errno 22] Invalid argument: 'C:\Users\USER/.torch/iopath_cache\s/h7th27jfv19rxiy\model_final.pth?dl=1.lock'
```
- ensure the model weights are correctly downloaded and/or loaded; may also be an issue with pytesseract
- see https://stackoverflow.com/questions/68094922/pytorch-throws-oserror-on-detectron2layoutmodel
```
Cannot find field 'gt_masks' in the given Instances!
```
- ensure that you have masks defined in the training data, or switch to a model that doesn't require masks (Faster R-CNN versus Mask R-CNN)
Can't find model weights after training a model
- check the OUTPUT_DIR argument passed to train_net.py; the trained weights are written there (see the retraining section below)
Just a good link to have: https://towardsdatascience.com/auto-parse-and-understand-any-document-5d72e81b0be9
For a detectron2 training tutorial, see this colab.
You'll need the following git repo to get a config file that works in the next stages. Here's an example of those stages, with some explanations to supplement the documentation already on the GitHub page.
```
git clone https://github.com/Layout-Parser/layout-model-training.git
cd layout-model-training
python tools/train_net.py \
    --dataset_name "lsr" \
    --json_annotation_train "/content/training/coco.json" \
    --image_path_train "/content/training" \
    --config-file "/content/layout-model-training/pubLay.yml" \
    MODEL.WEIGHTS "/content/layout-model-training/pubLay.pth" \
    OUTPUT_DIR "/content/out" \
    SOLVER.MAX_ITER 300 \
    SOLVER.IMS_PER_BATCH 2 \
    SOLVER.BASE_LR 0.00025 \
    MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE 128 \
    MODEL.ROI_HEADS.NUM_CLASSES 1
```
train_net.py takes two kinds of arguments: script-specific arguments and detectron2 arguments. See the GitHub project for documentation on any argument starting with --; all other arguments (in all caps) are detectron2 config overrides. The --json_annotation_train argument expects a json file that follows the COCO dataset format. The config files can be found in the layout parser pretrained zoo; no clear documentation exists on how to get a config file for other models.
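For reference, a minimal illustration of the COCO structure that --json_annotation_train expects. All ids, file names, and box values here are made up, and real training files for a mask-based model also need segmentation data (see the gt_masks error above):

```python
import json

# Skeleton of a COCO-format annotation file: three top-level lists that
# cross-reference each other by id.
coco = {
    "images": [
        {"id": 1, "file_name": "page_001.png", "width": 2550, "height": 3300},
    ],
    "categories": [
        {"id": 1, "name": "text_region"},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,                 # references images[].id
            "category_id": 1,              # references categories[].id
            "bbox": [100, 150, 800, 400],  # [x, y, width, height]
            "area": 800 * 400,
            "iscrowd": 0,
        },
    ],
}

with open("coco.json", "w") as f:
    json.dump(coco, f, indent=2)
```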
After training is done, model_final.pth and config.yaml should appear in the output directory (OUTPUT_DIR). These are the same files that are required for a custom model in the OCR script.