Column separator module - AdrianC2000/InvoiceScannerApp GitHub Wiki

The purpose of the Column separator module is to create a list of Column objects from a table image.

ColumnsSeparator

  1. The class receives a previously extracted table image.
  2. The image is processed:
    1. The table image is converted to a binary image.
    2. The ImageRotator class calculates the skew angle of the binary image and corrects it.
  3. Using the ContoursDefiner class:
    1. The table's contours are calculated.
    2. The contour positions are recalculated: the position of each line is taken as the mean of its coordinate values, so that the contours are even and straight.
  4. A list of Column objects is then prepared and returned:
    1. Column -> list of Position objects (the cells in a single column)
    2. Position -> starting_x, starting_y, ending_x, ending_y
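
The last step can be sketched as follows. This is a minimal illustration, not the project's actual code: the Position field names come from the description above, while the `cells` attribute, the `build_columns` helper and the sample coordinates are hypothetical. It assumes the straightened contours have already been reduced to the x positions of the vertical lines and the y positions of the horizontal lines.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Position:
    """Bounding box of a single cell, in pixel coordinates."""
    starting_x: int
    starting_y: int
    ending_x: int
    ending_y: int


@dataclass
class Column:
    """One table column; `cells` (a hypothetical name) is ordered top to bottom."""
    cells: List[Position] = field(default_factory=list)


def build_columns(vertical_xs: List[int], horizontal_ys: List[int]) -> List[Column]:
    """Hypothetical helper: turn straight grid lines into Column objects.

    Every pair of neighbouring vertical lines bounds one column, and every
    pair of neighbouring horizontal lines bounds one cell in that column.
    """
    xs = sorted(vertical_xs)
    ys = sorted(horizontal_ys)
    columns = []
    for left, right in zip(xs, xs[1:]):
        cells = [Position(left, top, right, bottom) for top, bottom in zip(ys, ys[1:])]
        columns.append(Column(cells))
    return columns


# Made-up grid: 3 vertical lines and 4 horizontal lines.
columns = build_columns([0, 100, 250], [0, 40, 80, 120])
```

With three vertical and four horizontal lines, the sketch produces two columns of three cells each.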

Example with images

  1. Table image

[Image: Extracted table]

  2. Binary image with the skew fixed

[Image: Binary table]

  3. Original table contours

[Image: Original contours]

  4. Fixed table contours

[Image: Fixed contours]

  5. Extracted list of columns drawn on the image

[Image: Table with bounding boxes]

ImageRotator

ImageRotator receives a table image and fixes its skew angle. Scanned invoices are often not perfectly straight, so this class corrects the rotation. The procedure is as follows:

  1. Separate the horizontal lines of the table using ContoursDefiner.
  2. Take the first horizontal line (the line above the header row).
  3. Find the first and last points of that line (the x and y coordinates of its leftmost and rightmost points).
  4. Calculate the angle from those two points and rotate the whole table by that angle.
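
The angle calculation in steps 3 and 4 can be illustrated with plain trigonometry. The sketch below is an assumption about how such a calculation might look, not the ImageRotator implementation: the function names and the sample points are hypothetical, and it rotates a single point rather than a whole image to keep the example self-contained.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def skew_angle_degrees(leftmost: Point, rightmost: Point) -> float:
    """Angle of the line through two points, in image coordinates (y grows downward)."""
    dx = rightmost[0] - leftmost[0]
    dy = rightmost[1] - leftmost[1]
    return math.degrees(math.atan2(dy, dx))


def deskew_point(point: Point, center: Point, angle_degrees: float) -> Point:
    """Rotate `point` around `center` so that a line tilted by `angle_degrees` becomes horizontal."""
    theta = math.radians(angle_degrees)
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    x = center[0] + dx * math.cos(theta) + dy * math.sin(theta)
    y = center[1] - dx * math.sin(theta) + dy * math.cos(theta)
    return (x, y)


# Made-up end points of the top line: it drops 10 px over a width of 1000 px.
left, right = (0.0, 100.0), (1000.0, 110.0)
angle = skew_angle_degrees(left, right)
fixed_right = deskew_point(right, left, angle)
```

After the rotation, both end points share the same y coordinate, which is what a straightened header line should look like. Applying the same transform to every pixel of the image is the job of an image library and is left out here.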

Example with image

[Image]

ContoursDefiner

ContoursDefiner is a class that extracts contours from the table image. The contours are returned as an ndarray. The class also removes the redundant part of the table: the current approach processes only the rectangular table content, because the part below it is usually a table summary, which is not processed yet.

Horizontal and vertical lines are extracted using the cv2 library:

contours, _ = cv2.findContours(table_contours_image, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

Besides calculating the contours, the ContoursDefiner class also calculates fixed contours, so that every cell is a rectangle. The horizontal and vertical lines are separated from each other, and then:

  1. For each horizontal line, the mean y coordinate is calculated
  2. For each vertical line, the mean x coordinate is calculated
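
The averaging step can be sketched as below. This is an illustration under assumptions, not the ContoursDefiner code: it represents each line as a plain list of (x, y) points rather than an ndarray, and the function names and sample points are hypothetical.

```python
from statistics import mean
from typing import List, Tuple

Point = Tuple[int, int]


def straighten_horizontal(line: List[Point]) -> List[Point]:
    """Replace every y coordinate with the line's mean y, keeping x unchanged."""
    y = round(mean(p[1] for p in line))
    return [(x, y) for x, _ in line]


def straighten_vertical(line: List[Point]) -> List[Point]:
    """Replace every x coordinate with the line's mean x, keeping y unchanged."""
    x = round(mean(p[0] for p in line))
    return [(x, y) for _, y in line]


# Made-up, slightly wobbly lines as they might come out of contour detection.
wobbly_row = [(0, 49), (120, 50), (240, 51), (360, 50)]
wobbly_col = [(99, 0), (101, 40), (100, 80)]

straight_row = straighten_horizontal(wobbly_row)
straight_col = straighten_vertical(wobbly_col)
```

Because every point on a horizontal line ends up with the same y and every point on a vertical line with the same x, the cells bounded by these lines are rectangles.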

Example with image

  1. Extracted contours:

[Image: Original contours]

  2. Fixed contours (with the redundant part of the table removed):

[Image: Fixed contours]