German Konzilsprotokolle - tesseract-ocr/tesstrain GitHub Wiki

Training Tesseract with handwritten text: German Konzilsprotokolle

Whether Tesseract can be used for handwritten text recognition has been asked multiple times. This page documents an experiment that may help to answer that question.

The data

We used a data set which was created in the context of the READ project by Dirk Alvermann (Universitätsarchiv Greifswald) and has been published via Zenodo:

Tobias Grüning, Gundram Leifert, Johannes Michael, Tobias Strauß, Max Weidemann, Roger Labahn. (2016). read_dataset_german_konzilsprotokolle [Data set]. Zenodo. http://doi.org/10.5281/zenodo.215383

It contains 8 770 transcribed text lines of handwritten historical documents from the late 18th century. They are represented as image-PAGE-XML pairs.

Preparations

Download and extraction result in the following directory structure:

└── german_konzilsprotokolle
    ├── data
    │   └── Greifswald_Alvermann
    │       ├── Copy_of_1794-95
    │       │   ├── page
    │       │   └── tif
    │       ├── Copy_of_1795-96
    │       │   ├── page
    │       │   └── tif
    │       ├── Copy_of_1796-97
    │       │   ├── page
    │       │   └── tif
    │       └── Copy_of_AA_1794-95
    │           ├── page
    │           └── tif
    └── lists
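Before moving any files, it can be worth checking that every image has a PAGE XML counterpart. The following is a minimal sketch, not part of the original workflow: the function name missing_page_xml is made up for illustration, and it assumes that the images sit directly inside each volume's tif directory and the XML files directly inside its page directory.

```shell
#!/usr/bin/env bash
# Print the base name of every image in <volume>/tif that has no
# matching <volume>/page/<base>.xml file.
# Assumption: files sit directly in tif/ and page/ (no sub-directories).
missing_page_xml() {
    local volume=$1 img base
    for img in "$volume"/tif/*.tif; do
        [ -e "$img" ] || continue            # the glob matched nothing
        base=$(basename "$img" .tif)
        [ -f "$volume/page/$base.xml" ] || echo "$base"
    done
}

# Usage (the path is a placeholder):
# for v in /path/to/german_konzilsprotokolle/data/Greifswald_Alvermann/Copy_of_*; do
#     echo "$v: $(missing_page_xml "$v" | wc -l) image(s) without PAGE XML"
# done
```

An empty result for a volume means that all of its images can be paired with a PAGE XML file.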

Since tesstrain expects text-image pairs at the line level, the first step is to extract them from the (page) images using the coordinates given in the PAGE XML. Luckily, ocrd_segment offers an easy-to-use processor for this task. Entering the OCR-D ecosystem also gives us access to various image preprocessing operations such as (superior) binarization and denoising.
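To get a rough idea of how many line pairs the extraction should yield, one can count the TextLine elements in the PAGE XML files. This is only a sketch: count_text_lines is a hypothetical helper, and it assumes that the element is written without a namespace prefix (i.e. as <TextLine ...>), which may not hold for every PAGE XML producer.

```shell
#!/usr/bin/env bash
# Count opening <TextLine> tags in all *.xml files below a directory.
# Assumption: element names carry no namespace prefix.
count_text_lines() {
    find "$1" -name '*.xml' -exec cat {} + \
        | grep -o '<TextLine[ >]' \
        | wc -l \
        | tr -d ' '
}

# Usage (the path is a placeholder):
# count_text_lines /path/to/german_konzilsprotokolle/data
```

The result can be compared with the number of transcribed lines stated for the data set.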

Creating and filling an OCR-D workspace

We assume a working installation of OCR-D's core module and the OCR-D modules ocrd_cis, ocrd_olena and ocrd_segment. Navigate to a directory of your choice and run

ocrd workspace init .

Images and XML files may be added to the workspace via

mkdir -p IMG PAGE
src=/path/to/german_konzilsprotokolle/data/Greifswald_Alvermann/Copy_of_1794-95
# Move each image and its PAGE XML file into the workspace, prefixed with the volume name.
for i in $(find "$src" -name "*.tif"); do base=$(basename "$i" .tif); echo "$base"; mv "$i" "IMG/1794-95_${base}.tif"; mv "$src/page/${base}.xml" "PAGE/1794-95_${base}.xml"; done
# Register the images and the PAGE XML files with the workspace.
for i in $(find IMG -name "*.tif"); do base=$(basename "$i" .tif); ocrd workspace add "$i" -G IMG -i "${base}_img" -g "$base" -m 'image/tiff'; done
for i in $(find PAGE -name "*.xml"); do base=$(basename "$i" .xml); ocrd workspace add "$i" -G PAGE -i "${base}_page" -g "$base" -m 'application/vnd.prima.page+xml'; done

For technical reasons, the attribute imageFilename of the element Page in the PAGE XML files has to be adjusted to match the new image locations and names. E.g.,

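cd PAGE/
# Prepend "IMG/1794-95_" to the imageFilename attribute of every PAGE XML file of this volume.
for i in $(find . -name "1794-95_*"); do echo "$i"; sed -i 's/imageFilename="/imageFilename="IMG\/1794-95_/' "$i"; done
cd ..  # back to the workspace root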
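To see what the substitution does before modifying files in place, the same sed expression can be tried on a single sample line. The helper name fix_image_filename and the sample attribute values are made up for illustration.

```shell
#!/usr/bin/env bash
# Read PAGE XML from stdin and prefix the imageFilename value with
# "IMG/<prefix>_"; the result is written to stdout, no file is changed.
fix_image_filename() {
    sed "s|imageFilename=\"|imageFilename=\"IMG/${1}_|"
}

# Usage with a made-up sample line:
echo '<Page imageFilename="0001.tif" imageWidth="100">' | fix_image_filename 1794-95
# prints: <Page imageFilename="IMG/1794-95_0001.tif" imageWidth="100">
```

Note that running the in-place command twice on the same file would add the prefix twice, so it should be applied only once per file.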

Repeat these steps for the other Copy_of_ directories as well.
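When repeating the steps, each volume needs its own file name prefix so that files from different volumes do not collide. One possible convention, sketched below, is to drop the leading "Copy_of_" from the directory name. This reproduces the prefix 1794-95 used above; the prefixes for the remaining volumes (e.g. AA_1794-95) are an assumption, not something prescribed by the data set.

```shell
#!/usr/bin/env bash
# Derive a file name prefix from a volume directory by stripping "Copy_of_".
# volume_prefix is a hypothetical helper for illustration.
volume_prefix() {
    local name
    name=$(basename "$1")
    echo "${name#Copy_of_}"
}

# Dry run: show which prefix each volume would get (the path is a placeholder).
# for v in /path/to/german_konzilsprotokolle/data/Greifswald_Alvermann/Copy_of_*; do
#     echo "$v -> $(volume_prefix "$v")"
# done
```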

Preparing and extracting line images for training Tesseract

As an initial setup, we choose binarization following Wolf et al. (2002) and denoising (a.k.a. despeckling) as provided by ocropy. Both operations can be conveniently applied to the existing line annotations using the corresponding OCR-D interfaces:

ocrd-olena-binarize -I PAGE -O WOLF,WOLF-IMG -m mets.xml -p <(echo '{"impl":"wolf"}')
ocrd-cis-ocropy-denoise -I WOLF -O DENOISE,DENOISE-IMG -m mets.xml -p '{"level-of-operation": "line"}'

This results in clean line images.

[Figure: Example for an optimized line.]

These can be extracted together with the corresponding ground truth (GT) using

ocrd-segment-extract-lines -I DENOISE -O LINES -m mets.xml
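After the extraction it may be useful to look for lines whose transcription is empty, since such pairs are of little use for training. The sketch below assumes that the transcriptions end up as *.gt.txt files directly inside the output directory. Both that naming and the helper name empty_gt_files are assumptions that should be checked against the actual contents of LINES.

```shell
#!/usr/bin/env bash
# List ground-truth files that are empty or contain only whitespace.
# Assumption: transcriptions are stored as <dir>/*.gt.txt.
empty_gt_files() {
    local f
    for f in "$1"/*.gt.txt; do
        [ -e "$f" ] || continue                    # the glob matched nothing
        grep -q '[^[:space:]]' "$f" || echo "$f"   # no non-blank character found
    done
}

# Usage:
# empty_gt_files LINES
```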