LP OCR rotating.py - haltosan/RA-python-tools GitHub Wiki

Layout Parser (with automated image rotating)

[note: the -cleaned version of this file has replaced the normal version]

This is a more specialized LP-OCR project. The data set is a large number of images that aren't always rotated the correct way, and we need to extract text from them. The general structure is:

for image in images:
    while True:
        layout = layout_parser(image)
        text = tesseract_ocr(image, layout)
        image = rotate90(image)
        if not is_garbage(text):
            break
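The is_garbage() check decides when to stop rotating. A minimal sketch of one possible heuristic (the thresholds here are assumptions for illustration, not necessarily what the script uses): OCR run on a sideways page tends to produce short, symbol-heavy output, so we can flag text that is too short or too low on letters.

```python
import re

def is_garbage(text, min_length=20, min_alpha_ratio=0.6):
    """Heuristic garbage check (assumed thresholds, not the script's real ones).

    OCR on a wrongly rotated image usually yields short, symbol-heavy
    strings; treat text as garbage if it's too short or too few of its
    characters are letters/spaces."""
    stripped = re.sub(r"\s+", " ", text).strip()
    if len(stripped) < min_length:
        return True
    alpha = sum(c.isalpha() or c == " " for c in stripped)
    return alpha / len(stripped) < min_alpha_ratio
```

If all four rotations come back as garbage, the image is likely unreadable and can be skipped rather than looped forever.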

This is also meant to run in an environment where access to network drives can fail at unknown times and at any moment. This causes file reads to fail, library calls to fail (the network drives host the environment we're running in), and just about anything else you can think of to fail. There are large error-detection blocks for this reason.
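One common way to survive those intermittent failures is to wrap the flaky calls in a retry helper; a sketch of the idea (the helper name, retry count, and delay are assumptions, not the script's actual error-handling code):

```python
import time

def with_retries(func, *args, attempts=3, delay=1.0, logger=print, **kwargs):
    """Call func(*args, **kwargs), retrying on any exception.

    Network-drive hiccups can make file reads and even library calls
    raise unpredictably, so we catch broadly, wait, and try again.
    Returns None if every attempt fails, so the caller can skip the
    current item instead of crashing the whole batch run."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as exc:  # broad on purpose: the failure mode is unpredictable
            logger(f"attempt {attempt}/{attempts} failed: {exc}")
            if attempt < attempts:
                time.sleep(delay)
    return None
```

A call like `with_retries(open, path)` then tolerates a drive dropping out for a moment without killing a run that is hours in.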

Usage

python LP-OCR-rotating.py [starting folder]

The starting folder argument sets the "working folder" of the script. It moves there and accesses files in every folder and subfolder inside it. Transcriptions are saved in batches of 4 images; each batch is appended to a file called '#-save.txt', where # is the batch number (see appending_save() for details on saving). A log is generated and saved as '#-lp.log' (same #). Both files are saved in SAVE_FILE_PATH.

This has some hardcoded paths/variables. These all need to point at the correct files:

| variable | description |
| --- | --- |
| pytesseract.pytesseract.tesseract_cmd | the path to tesseract.exe |
| CONFIG | the config .yaml file for a Layout Parser model |
| MODEL | the model weights (.pth file) for a Layout Parser / Detectron2 model |
| LABEL_MAP | a Layout Parser-specific map (see Layout Parser's website) |
| TEXT_LABEL | whatever string (from LABEL_MAP) indicates your region of interest |
| IMAGE_FILE_TYPE | file type; needs to be all caps |
| SAVE_FILE_PATH | folder for saving output |
| TASK_BATCH_STRING | don't touch this; it gets set automatically later |
| CURRENT_PROGRESS | also don't touch; same thing |
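Taken together, the top of the script might look something like this. Every path and value below is a placeholder for illustration, not the project's real configuration:

```python
# Placeholder configuration -- point these at your own installs/files.
# The tesseract path would normally be assigned to
# pytesseract.pytesseract.tesseract_cmd once pytesseract is imported:
# pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

CONFIG = "model/config.yaml"           # Layout Parser model config (.yaml)
MODEL = "model/model_final.pth"        # model weights (.pth) for Detectron2
LABEL_MAP = {0: "text", 1: "title"}    # example map; use your model's own
TEXT_LABEL = "text"                    # label (from LABEL_MAP) to OCR
IMAGE_FILE_TYPE = "JPG"                # must be all caps
SAVE_FILE_PATH = "output/"             # where '#-save.txt' and '#-lp.log' go

TASK_BATCH_STRING = None               # set automatically at runtime
CURRENT_PROGRESS = None                # set automatically at runtime
```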

A model set up for Layout Parser is required for this project. Layout Parser provides pretrained models that are ready to go. LP-OCR.py in the wiki details how to load a custom model instead of a pretrained one.