Extractors module - AdrianC2000/InvoiceScannerApp GitHub Wiki

The Extractors module consists of 2 extractors:

  1. KeyDataExtractor - class for extracting key data such as the currency, invoice ID, and buyer and seller info
  2. TableExtractor - class for extracting the table from the invoice

KeyDataExtractor

The KeyDataExtractor module consists of 3 extractors:

  1. BlocksExtractor - class for extracting blocks of text
  2. KeyValuesExtractor - class for extracting key values: invoice number, currency, and listing date
  3. PersonValuesExtractor - class for extracting person values: name, address, and NIP number

BlocksExtractor

Blocks of text are extracted using the page.blocks annotation supplied by the Google Vision OCR API. Then, for each block:

  • rows are extracted - each row is assigned its text and position
  • the position of the whole block is calculated

The class therefore produces a BlockPosition object holding the block's rows and position.
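As a rough illustration of the last step, the block position can be computed as the bounding box that encloses all of its rows. This is only a sketch: BlockPosition is named in the text above, but its fields, the Position and RowPosition helper types, and the build_block function are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Position:
    """Axis-aligned bounding box in pixels (hypothetical helper type)."""
    start_x: int
    start_y: int
    end_x: int
    end_y: int


@dataclass
class RowPosition:
    """One row of a block: its text and its position (hypothetical)."""
    text: str
    position: Position


@dataclass
class BlockPosition:
    """Block rows plus the position of the whole block (field names assumed)."""
    position: Position
    rows: List[RowPosition]


def build_block(rows: List[RowPosition]) -> BlockPosition:
    """Compute the block position as the box enclosing every row."""
    if not rows:
        raise ValueError("a block needs at least one row")
    position = Position(
        start_x=min(r.position.start_x for r in rows),
        start_y=min(r.position.start_y for r in rows),
        end_x=max(r.position.end_x for r in rows),
        end_y=max(r.position.end_y for r in rows),
    )
    return BlockPosition(position=position, rows=rows)
```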

KeyValuesExtractor

Extracts the key values: invoice number, currency, and listing date. The class receives a list of blocks with matching key words. The process is as follows:

  1. Preliminary search - each key value is looked for in the same block
  2. Deep search - run only if not every value was found

KeyValuesExtractor uses a simple resolver for each key value. The candidate key value is chosen based on the position of the key word (or of the last key word from the set). If the key value is not next to the key word, the next line is searched.
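A minimal sketch of this lookup, assuming rows are already split into words: the words after the last key word in a row are checked first, and then the next row. The function name find_value_near_keyword, its parameters, and the is_valid callback are hypothetical and not taken from the application.

```python
from typing import Callable, List, Optional, Set


def find_value_near_keyword(
    rows: List[List[str]],
    keywords: Set[str],
    is_valid: Callable[[str], bool],
) -> Optional[str]:
    """Return the first valid word after the key word, or on the next row."""
    for row_idx, row in enumerate(rows):
        # Normalize words so that "Nr:" matches the key word "nr".
        lowered = [word.lower().rstrip(":") for word in row]
        positions = [i for i, word in enumerate(lowered) if word in keywords]
        if not positions:
            continue
        # Assumption: the value follows the last key word found in the row.
        after_keyword = row[positions[-1] + 1:]
        next_row = rows[row_idx + 1] if row_idx + 1 < len(rows) else []
        for word in after_keyword + next_row:
            if is_valid(word):
                return word
        return None
    return None
```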

Currency resolver

Requirements: letters only, 2 or 3 characters long, and the previous word is a number.
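These three requirements could be checked roughly as follows. This is a sketch, not the application's resolver: the resolve_currency name and the NUMBER pattern (digits with an optional decimal part) are assumptions.

```python
import re
from typing import List, Optional

# Assumed number format: digits with an optional decimal part ("100", "123,45").
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")


def resolve_currency(words: List[str]) -> Optional[str]:
    """Return the first 2-3 letter word that directly follows a number."""
    for previous, word in zip(words, words[1:]):
        if word.isalpha() and len(word) in (2, 3) and NUMBER.fullmatch(previous):
            return word
    return None
```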

Invoice number resolver

Requirements: contains a number.

Listing date resolver

Requirements: matches a date format.
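The two remaining checks are simple enough to sketch in a few lines. The accepted date formats are not listed on this page, so DATE_FORMATS below is an assumed set, and both function names are hypothetical.

```python
from datetime import datetime
from typing import Optional

# Assumed set of accepted formats; the application may support others.
DATE_FORMATS = ("%d.%m.%Y", "%d-%m-%Y", "%d/%m/%Y", "%Y-%m-%d")


def looks_like_invoice_number(word: str) -> bool:
    """An invoice number candidate must contain at least one digit."""
    return any(ch.isdigit() for ch in word)


def parse_listing_date(word: str) -> Optional[datetime]:
    """Return the parsed date if the word matches a known format, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(word, fmt)
        except ValueError:
            continue
    return None
```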

Each resolver returns a SearchResponse with a ValueFindingStatus (FOUND, VALUE_BELOW, VALUE_ON_THE_RIGHT, VALUE_BELOW_OR_ON_THE_RIGHT). When a key value is not found, its location can still be estimated from the preliminary search. These statuses are then used by the deep search:

  1. For status VALUE_BELOW, the whole first line of the block below is searched
  2. For status VALUE_ON_THE_RIGHT, only the row at the same y position in the block on the right is searched
  3. For status VALUE_BELOW_OR_ON_THE_RIGHT, the whole first line of the block below is searched, and if nothing is found, the row at the same y position in the block on the right is searched

After this process every key value should be found; any that are still missing receive the VALUE_MISSING status.
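The status-driven deep search can be sketched as a small dispatch function. The status names come from the text above; the deep_search signature, the Resolver type, and the way neighbouring rows are passed in are assumptions made for this example.

```python
from enum import Enum, auto
from typing import Callable, List, Optional, Tuple


class ValueFindingStatus(Enum):
    FOUND = auto()
    VALUE_BELOW = auto()
    VALUE_ON_THE_RIGHT = auto()
    VALUE_BELOW_OR_ON_THE_RIGHT = auto()
    VALUE_MISSING = auto()


# Hypothetical resolver type: takes the words of a row, returns a value or None.
Resolver = Callable[[List[str]], Optional[str]]


def deep_search(
    status: ValueFindingStatus,
    line_below: Optional[List[str]],
    row_on_right: Optional[List[str]],
    resolver: Resolver,
) -> Tuple[Optional[str], ValueFindingStatus]:
    """Search the places indicated by the status, below first, then right."""
    candidates: List[Optional[List[str]]] = []
    if status in (ValueFindingStatus.VALUE_BELOW,
                  ValueFindingStatus.VALUE_BELOW_OR_ON_THE_RIGHT):
        candidates.append(line_below)
    if status in (ValueFindingStatus.VALUE_ON_THE_RIGHT,
                  ValueFindingStatus.VALUE_BELOW_OR_ON_THE_RIGHT):
        candidates.append(row_on_right)
    for words in candidates:
        if words is None:  # no neighbouring block in that direction
            continue
        value = resolver(words)
        if value is not None:
            return value, ValueFindingStatus.FOUND
    return None, ValueFindingStatus.VALUE_MISSING
```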

PersonValuesExtractor

Extracts the person key values: name, address, and NIP. The class receives a list of blocks with matching key words. The process is as follows:

  1. Preliminary search - each key value is looked for in the same block
  2. Deep search - run only if not every value was found

PersonValuesExtractor uses extended resolvers for each key value.

Because the key word positions of the individual values depend on one another, this extractor is more complex than the previous one:

  1. First, the address row index is calculated - words like "ul", "al", "os" are searched for
  2. Then the ZIP code row index is calculated - based on the regexp r'[0-9]{2}-[0-9]{3}'
  3. Finally, the NIP row index is calculated based on the word "nip"

The first 2 steps give two indexes, and all rows between them are classified as the address. If only one index is found, the other key value is searched for below or on the right. The name is usually above the address, so it is extracted as the rows above the address rows. The NIP is simply searched for next to the key word.
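The row classification described above might look like the sketch below. The ZIP code pattern and the "ul", "al", "os" and "nip" words come from the text. Everything else is assumed: the PersonValues type, the function names, treating the address range as inclusive, and skipping the NIP row during the ZIP search (a hyphenated NIP such as 123-456-32-18 would otherwise match the ZIP pattern). The search below or on the right is left out.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

ADDRESS_WORDS = {"ul", "al", "os"}
ZIP_CODE = re.compile(r"[0-9]{2}-[0-9]{3}")


@dataclass
class PersonValues:
    """Result container (hypothetical)."""
    name: List[str] = field(default_factory=list)
    address: List[str] = field(default_factory=list)
    nip: Optional[str] = None


def first_index(flags: List[bool]) -> Optional[int]:
    """Index of the first True flag, or None."""
    return flags.index(True) if True in flags else None


def split_person_rows(rows: List[str]) -> PersonValues:
    tokens = [re.findall(r"\w+", row.lower()) for row in rows]
    nip_idx = first_index(["nip" in t for t in tokens])
    address_idx = first_index([bool(ADDRESS_WORDS & set(t)) for t in tokens])
    # Assumption: the NIP row is skipped so its digits are not read as a ZIP code.
    zip_idx = first_index(
        [i != nip_idx and bool(ZIP_CODE.search(row)) for i, row in enumerate(rows)]
    )

    result = PersonValues()
    found = [i for i in (address_idx, zip_idx) if i is not None]
    if found:
        first, last = min(found), max(found)
        result.address = rows[first:last + 1]  # rows between both indexes
        result.name = rows[:first]             # name sits above the address
    if nip_idx is not None:
        result.nip = re.sub(r"\D", "", rows[nip_idx]) or None
    return result
```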

If any value is not found, it is searched for below or on the right. The address case is a bit tricky: if the status after the preliminary search is not FOUND but some value is passed, the extended search looks for the missing part of the address and merges it with the part already found.

TableExtractor

The table extractor is much simpler. Currently, the application only supports reading contoured tables; borderless tables are not supported. Because of that, table extraction is a simple task: using the cv2 library, morphological closure is applied, and then the outer contours of each element with borders are found:

import cv2

# thr is assumed to be the binarized (thresholded) grayscale invoice image
# Morphological closure on the inverted image
close = cv2.morphologyEx(255 - thr, cv2.MORPH_CLOSE, cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3)))
# Finding outer contours only
contours, _ = cv2.findContours(close, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)

Then the biggest contoured object is classified as the table. This approach is far from perfect and will be reconsidered when handling of borderless tables is attempted.