German Konzilsprotokolle - tesseract-ocr/tesstrain GitHub Wiki
Training Tesseract with handwritten text: German Konzilsprotokolle
Whether Tesseract works for handwritten text recognition has been asked many times. This page documents an experiment that might help answer the question.
The data
We used a data set which was created in the context of the READ project by Dirk Alvermann (Universitätsarchiv Greifswald) and has been published via Zenodo:
Tobias Grüning, Gundram Leifert, Johannes Michael, Tobias Strauß, Max Weidemann, Roger Labahn. (2016). read_dataset_german_konzilsprotokolle [Data set]. Zenodo. http://doi.org/10.5281/zenodo.215383
It contains 8,770 transcribed text lines of handwritten historical documents from the late 18th century. They are represented as image-PAGE-XML pairs.
Preparations
Download and extraction result in the following directory structure:
└── german_konzilsprotokolle
    ├── data
    │   └── Greifswald_Alvermann
    │       ├── Copy_of_1794-95
    │       │   ├── page
    │       │   └── tif
    │       ├── Copy_of_1795-96
    │       │   ├── page
    │       │   └── tif
    │       ├── Copy_of_1796-97
    │       │   ├── page
    │       │   └── tif
    │       ├── Copy_of_AA_1794-95
    │       │   ├── page
    │       │   └── tif
    └── lists
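Since everything that follows relies on image-PAGE-XML pairs, it can be worth checking that every image really has a PAGE XML counterpart before moving files around. The following is a minimal sketch, assuming the layout shown above (a tif and a page subdirectory per volume, with matching base names). The function name check_pairs and the DATA variable are made up for illustration.

```shell
#!/usr/bin/env bash
# Report images that lack a PAGE XML file with the same base name.
# Assumes each volume directory contains tif/ and page/ subdirectories.
check_pairs() {
    local volume=$1 missing=0 img base
    for img in "$volume"/tif/*.tif; do
        [ -e "$img" ] || continue              # glob did not match anything
        base=$(basename "$img" .tif)
        if [ ! -f "$volume/page/$base.xml" ]; then
            echo "missing PAGE XML for $img" >&2
            missing=$((missing + 1))
        fi
    done
    echo "$missing"                            # number of unpaired images
}

# Adjust the path to wherever the archive was extracted.
DATA=${DATA:-/path/to/german_konzilsprotokolle/data/Greifswald_Alvermann}
for volume in "$DATA"/Copy_of_*; do
    [ -d "$volume" ] || continue
    echo "$(basename "$volume"): $(check_pairs "$volume") image(s) without PAGE XML"
done
```

Unpaired images are listed on standard error, and a count is printed per volume.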
Since tesstrain expects text-image pairs at the line level, the first step is to extract them from the page images using the coordinates given in the PAGE XML. Luckily, ocrd_segment offers an easy-to-use processor for this task. Entering the OCR-D ecosystem also gives us access to various image preprocessing operations such as (superior) binarization and denoising.
Creating and filling an OCR-D workspace
We assume a working installation of OCR-D's core module and the OCR-D modules ocrd_cis, ocrd_olena and ocrd_segment. Navigate to a directory of your choice and run
ocrd workspace init .
Images and XML files may be added to the workspace via
mkdir IMG
mkdir PAGE
for i in `find /path/to/german_konzilsprotokolle/data/Greifswald_Alvermann/Copy_of_1794-95 -name "*.tif"`; do base=`basename $i .tif`; echo $base; mv $i IMG/1794-95_${base}.tif; mv /path/to/german_konzilsprotokolle/data/Greifswald_Alvermann/Copy_of_1794-95/page/${base}.xml PAGE/1794-95_${base}.xml; done
for i in `find IMG -name "*.tif"`; do base=`basename $i .tif`; ocrd workspace add $i -G IMG -i ${base}_img -g $base -m 'image/tiff'; done
for i in `find PAGE -name "*.xml"`; do base=`basename $i .xml`; ocrd workspace add $i -G PAGE -i ${base}_page -g $base -m 'application/vnd.prima.page+xml'; done
For technical reasons, the attribute imageFilename of the element Page in each PAGE XML file has to be adjusted to point to the renamed image. E.g.,
cd PAGE/
for i in `find . -name "1794-95_*"`; do echo $i; sed -i 's/imageFilename="/imageFilename="IMG\/1794-95_/' $i; done
cd ..
Repeat these steps for the other Copy_of_ directories as well.
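Rather than repeating the commands by hand, the move and the imageFilename adjustment can be wrapped in a small function and applied to every volume that has not been imported yet. This is only a sketch under a few assumptions: it is run from the workspace root, GNU sed is available, and the file prefix is derived by stripping Copy_of_ from the directory name (which yields 1794-95 for the volume used above). The function name import_volume and the DATA variable are hypothetical.

```shell
#!/usr/bin/env bash
# Move one volume into IMG/ and PAGE/ and fix the imageFilename attribute.
# Run from the workspace root. Uses GNU sed (-i without a backup suffix).
import_volume() {
    local src=$1 prefix
    prefix=$(basename "$src")
    prefix=${prefix#Copy_of_}                  # e.g. Copy_of_1794-95 -> 1794-95
    mkdir -p IMG PAGE
    find "$src" -name '*.tif' | while IFS= read -r img; do
        base=$(basename "$img" .tif)
        mv "$img" "IMG/${prefix}_${base}.tif"
        mv "$src/page/${base}.xml" "PAGE/${prefix}_${base}.xml"
        # Point the PAGE XML at the renamed image inside the workspace.
        sed -i "s/imageFilename=\"/imageFilename=\"IMG\/${prefix}_/" \
            "PAGE/${prefix}_${base}.xml"
    done
}

# Adjust the path to wherever the archive was extracted.
DATA=${DATA:-/path/to/german_konzilsprotokolle/data/Greifswald_Alvermann}
for volume in "$DATA"/Copy_of_*; do
    [ -d "$volume" ] || continue
    import_volume "$volume"
done
```

The function only moves and patches files. Registering them still requires the two ocrd workspace add loops shown earlier, which iterate over all of IMG and PAGE and can therefore be run once after all volumes are in place.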
Preparing and extracting line images for training Tesseract
As an initial setup, we choose binarization following Wolf et al. (2002) and denoising (a.k.a. despeckling) as provided by ocropy. Both operations can be conveniently applied to the existing line annotations using the corresponding OCR-D interfaces:
ocrd-olena-binarize -I PAGE -O WOLF,WOLF-IMG -m mets.xml -p <(echo '{"impl":"wolf"}')
ocrd-cis-ocropy-denoise -I WOLF -O DENOISE,DENOISE-IMG -m mets.xml -p '{"level-of-operation": "line"}'
This results in clean line images.
These can be extracted together with the corresponding ground truth (GT) using
ocrd-segment-extract-lines -I DENOISE -O LINES -m mets.xml
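Before feeding the result to tesstrain, a quick look at the extracted pairs can catch empty transcriptions or missing images. The sketch below rests on an assumption about file naming that should be checked against the actual contents of LINES: each line is taken to consist of a name.gt.txt file next to an image called name.png, name.SOMETHING.png or name.tif. The function name check_lines is made up for illustration.

```shell
#!/usr/bin/env bash
# Count line transcriptions in a directory and flag those that are empty or
# have no image next to them. The naming scheme is an assumption:
# <name>.gt.txt alongside <name>.png, <name>.<feature>.png or <name>.tif.
check_lines() {
    local dir=$1 total=0 bad=0 gt base img found
    for gt in "$dir"/*.gt.txt; do
        [ -e "$gt" ] || continue               # glob did not match anything
        total=$((total + 1))
        base=${gt%.gt.txt}
        found=0
        for img in "$base".png "$base".*.png "$base".tif; do
            if [ -f "$img" ]; then found=1; fi
        done
        # A line is problematic without an image or with a zero-byte GT file.
        if [ "$found" -eq 0 ] || [ ! -s "$gt" ]; then
            echo "check $gt" >&2
            bad=$((bad + 1))
        fi
    done
    echo "$total lines, $bad problematic"
}

check_lines "${1:-LINES}"
```

Called without an argument from the workspace root, it inspects the LINES directory; suspicious files are listed on standard error.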