Extracting Character Images

Training and evaluating an OCR classifier requires a dataset of labeled character images from the corpus. To extract them, text blocks belonging to folds annotated in the text ground truth are cropped manually. After binarization, horizontal and vertical projection profiles are computed for skew correction.
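As a rough illustration of how projection profiles can drive skew correction (this is a common technique and an assumption about the implementation, not code from the repository): the crop is rotated over a small range of candidate angles, and the angle that maximizes the combined variance of the row and column profiles is kept, since a well-aligned grid produces sharply alternating dark and light bands.

```python
import cv2
import numpy as np

def estimate_skew(binary: np.ndarray, max_angle: float = 3.0, step: float = 0.1) -> float:
    """Return the rotation angle (degrees) that maximizes profile variance.

    `binary` is an 8-bit image with dark text on a white background.
    Function name and parameter values are illustrative assumptions.
    """
    h, w = binary.shape
    center = (w / 2, h / 2)
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step, step):
        m = cv2.getRotationMatrix2D(center, angle, 1.0)
        rotated = cv2.warpAffine(binary, m, (w, h), borderValue=255)
        dark = rotated < 128
        # rows and columns of a well-aligned block are either mostly dark or
        # mostly white, so both profiles become "spiky" at the right angle
        score = dark.sum(axis=1).var() + dark.sum(axis=0).var()
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```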

The same profiles can be used to build a grid. This works well in many cases since columns are generously spaced and thus easily separable by the vertical projections (first image). However, wherever rows are not perfectly aligned, the method fails (second image, e.g. in the third row):


Computing horizontal projection profiles column-wise instead of for the whole crop seems to solve the problem (second image, third row looks better now) but introduces new problems where characters contain white space (first image, third column from the left: the character 二 is split up; second image, eighth column: the character 主 is split up):


The solution is a hybrid approach that computes horizontal projections both globally and locally. The global profile serves as a backbone, and the per-column separation is corrected whenever the local, per-column horizontal projection yields a separator within a proximity threshold (e.g. 6 px); a rough sketch is given below. After fine-tuning the parameters*, this resolves almost all problematic cases I could find manually and outputs a grid that can easily be mapped to the text ground truth.

*the threshold within which the adjustment described above is allowed; the prominence of and minimum distance between peaks in the projection profiles; the RLSA threshold for emphasizing character locations vs. gaps before computing the local projection profile; the offset constant for better binarization
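A minimal sketch of this hybrid row segmentation, under stated assumptions: the RLSA step mentioned in the footnote is omitted, the peak parameters are placeholder values, and the function names are not taken from the repository.

```python
import numpy as np
from scipy.signal import find_peaks

def find_separators(profile: np.ndarray, min_dist: int = 20, prominence: float = 5.0) -> np.ndarray:
    """Whitespace gaps correspond to minima of the dark-pixel profile."""
    peaks, _ = find_peaks(-profile, distance=min_dist, prominence=prominence)
    return peaks

def hybrid_row_separators(binary: np.ndarray, col_bounds: list, snap_px: int = 6) -> list:
    """For each column (given by its left/right bounds), return row separators."""
    dark = (binary < 128).astype(np.uint8)
    global_seps = find_separators(dark.sum(axis=1))        # backbone from the whole crop
    per_column = []
    for left, right in col_bounds:
        local_profile = dark[:, left:right].sum(axis=1)    # this column only
        local_seps = find_separators(local_profile)
        seps = []
        for g in global_seps:
            # snap to a nearby local separator, otherwise keep the global one
            if len(local_seps) and np.abs(local_seps - g).min() <= snap_px:
                seps.append(int(local_seps[np.abs(local_seps - g).argmin()]))
            else:
                seps.append(int(g))
        per_column.append(seps)
    return per_column
```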


As long as the grid layout is adhered to, this approach is robust enough to handle skewed and sparsely typeset text blocks:

However, any inconsistency in the grid will render the method unusable:

The indentation has to be dealt with manually by adding "empty" characters to the annotation at the positions of the corresponding slots. Apart from that, the method relies entirely on correct annotation. Missing lines are easy to detect because they cause the number of detected columns in the image to exceed the number of lines in the corresponding annotation. Missing or extra characters within a line are not detected automatically and would shift the character assignment by one slot (disastrous!), so annotations have to be double-checked. Wrong or swapped characters that do not affect the line length (e.g. in the sparse layout example above, 莫 was accidentally mis-annotated as 墨 and 侖 lún was wrongly annotated as 倫 lún) are hard to detect and will unavoidably lead to lower accuracy.
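A trivial sanity check along these lines might look as follows (hypothetical helper, not part of the repository); it only catches mismatching line counts, not missing or extra characters within a line.

```python
def check_annotation(column_count: int, annotated_lines: list) -> None:
    """Flag crops whose detected column count disagrees with the annotation."""
    if column_count != len(annotated_lines):
        raise ValueError(
            f"detected {column_count} columns but annotation has "
            f"{len(annotated_lines)} lines -- crop needs manual review"
        )
```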

During the extraction of single-character images, the following method is used for normalization and contrast enhancement (a rough sketch follows the list):

  1. Globally (for the whole crop): Apply adaptive thresholding: every pixel whose value is greater (= brighter) than the average of its surrounding 7x7 neighbourhood is set to 255 (white). Separately, every pixel whose value is greater (= brighter) than the median of the image (called "threshold" below) is set to 255. Every other pixel keeps its grey value. The median is chosen on the assumption that there are more background pixels than content pixels.

  2. Locally (after cropping rectangles containing one character each): Ignoring white pixels, linearly rescale pixel values from [min_val,threshold] to [0,255]. min_val refers to the darkest pixel in the image. This allows even very lightly printed characters to appear bolder, with their decisive features more clearly separated from the background, as for the character 當 in this example:
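A sketch of the two steps above, assuming 8-bit greyscale crops and OpenCV; the 7x7 kernel and the median threshold follow the description, while the function names and everything else are assumptions.

```python
import cv2
import numpy as np

def whiten_background(crop: np.ndarray) -> tuple:
    """Step 1: global whitening of likely background pixels."""
    local_mean = cv2.blur(crop, (7, 7))
    threshold = np.median(crop)            # more background than content pixels
    out = crop.copy()
    out[crop > local_mean] = 255           # brighter than the 7x7 neighbourhood
    out[crop > threshold] = 255            # brighter than the global median
    return out, threshold

def stretch_character(char_img: np.ndarray, threshold: float) -> np.ndarray:
    """Step 2: per-character contrast stretch, ignoring white pixels."""
    mask = char_img < 255
    if not mask.any():
        return char_img
    min_val = char_img[mask].min()
    out = char_img.astype(np.float32)
    # map [min_val, threshold] linearly to [0, 255]; brighter non-white pixels clip to 255
    out[mask] = (out[mask] - min_val) / max(threshold - min_val, 1) * 255
    return np.clip(out, 0, 255).astype(np.uint8)
```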

After manually cropping 637 grid-consistent text blocks and assigning the corresponding section of the ground truth text to each, the above method yields 72,246 images of 3086 unique characters. Since these images are not square and it is not trivial to know where to crop to obtain a square, white padding is added instead (either at the top and bottom or at the left and right; a sketch is given below). A 70-30 train-dev split (of entire crops, not single character images) yields e.g. these 91 squared training set images of the character 當:
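The padding step could look roughly like this (hypothetical helper, assuming 2D greyscale arrays):

```python
import numpy as np

def pad_to_square(img: np.ndarray, fill: int = 255) -> np.ndarray:
    """Pad a non-square character image to a square with white pixels."""
    h, w = img.shape
    size = max(h, w)
    top = (size - h) // 2       # only one axis actually gets padding
    left = (size - w) // 2
    return np.pad(img, ((top, size - h - top), (left, size - w - left)),
                  mode="constant", constant_values=fill)
```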