Morphological Opening to Connect Text Blocks - hcts-hra/ecpo-fulltext-experiments GitHub Wiki

NOTE: Due to image size this wiki is probably more convenient to read in a half-width window.

The following section will use this crop as an example:

This approach roughly follows the description in section 2.1 in Liu et al. (2010).

1. Binarization

The original image is first binarized by adaptive mean thresholding. After removing contours (connected black pixels) larger than a heuristically chosen value, we get dense text blocks separated by white space:

2. Morphological Operations

We can now use morphological opening to remove smaller white space between black pixels while retaining larger white space, making the text blocks connect to a single contour. Again, we need a heuristic for the depth of opening (= number of erosion and dilation iterations):

If opening causes unwanted pixel bridges between separate text blocks to build up, additional closing might be needed to get rid of these since they would negatively impact the next step:

3. Draw Boxes

Drawing boxes around all connected objects greater than a (again, heuristically chosen) value allows for a fairly good approximation of the location of all text blocks, including a handleable amount of noise.

Heuristics

Heuristics used in this example (in px):

  1. Kernel size and offset for adaptive mean thresholding: 305; 10
  2. Size of contours to be removed after binarization: 100
  3. Kerzel size and no. of erosion/dilation iterations for morphological opening: 3; 5
  4. Size of contours to draw boxes around after opening: 200

Problems

The main problem with this approach lies in the difficulty to decide upon heuristics that don't only work for one image section or even one image, but e.g. a whole year's collection of issues or even the entire image corpus. One might consider implementing basic algorithms to adapt the above values for each fold.

Another issue lies in drawing rectangles during the final step, as not every text box is rectangular. It also comes with the need to de-rotate the images, as many of the scans have been done at a slight angle which step 3 (drawing boxes) is not dealing with.

The image below is a different section of the same fold used in the examples above, showing that even within the same fold some heuristics might be suitable in one part but yield bad results in another: The lower blue box is a bad approximation. I manually added two red boxes to show the desired result. They demonstrate how

  1. heuristic 2 was chosen too high, as the vertical separator between them is too short and thus missed out, leading to the connection of actually separate text blocks during step 2 (morphological opening); however lowering the value for heuristic 2 will lead to headings being removed in step 1 due to multiple characters ending up connected after binarization;
  2. the polygonal shape of the right red box cannot be approximated during step 3 (drawing boxes);
  3. the heading is missed out: Headings are often too far away to connect with the text block during step 2.
  4. We also see how heuristic 4 was chosen too high, as a whole text block at the top of the image is missed out; however lowering heuristic 4 leads to considerable increase in noise.

Finally, this approach doesn't seem fit to extract text blocks from entire folds of the whole corpus, especially due to substantial variation in spacing over the years, issues and even within one single fold, making step 2 rather hard to automate. The method however does remain useful for separating multiple text blocks extracted with different approaches.

Read next: Finding and Connecting Separators or go back to Home.