Classifiers module - AdrianC2000/InvoiceScannerApp GitHub Wiki

The purpose of the Classifiers module is to extract and classify elements that match a fixed list of typical words.

The Classifiers module consists of two submodules:

BlockClassifier, which returns a list of MatchingBlock objects: blocks assigned to a classified type based on their content.
HeadersClassifier, which returns a list of MatchingHeader objects. Each MatchingHeader holds the phrase from a header cell and a ConfidenceCalculation object, which contains the confidence percentage and the specific header type that matches the cell content.

BlockClassifier

Given a list of BlockPosition objects (blocks whose content is divided into rows), this class searches for blocks containing characteristic words that correspond to the searched key values. The key_words_database.json file lists every key word that is searched for, each with a list of pattern sets. A pattern set is an object with two fields:

patterns - the list of words that are searched for
enough_fit - the number of words from the patterns list that must match for the set to count as a fit

The order of the pattern sets for each key word matters: each list is iterated starting from the most accurate fit (the one requiring the most words). This avoids the situation where, for example, the word "invoice" appears several times in the document and the algorithm picks a block arbitrarily. The most accurate set is checked first, and the requirements are then relaxed until some fit is found.
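The strictest-first iteration described above can be sketched as follows. This is only an illustration, not the project's code: the pattern set contents, the `find_block` function, the exact (case-insensitive) word comparison, and the representation of a block as a plain list of row strings are all assumptions.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical pattern sets for a single key word, ordered from the
# strictest (most words required) to the loosest.
INVOICE_NUMBER_SETS: List[Dict] = [
    {"patterns": ["invoice", "number"], "enough_fit": 2},
    {"patterns": ["invoice", "no"], "enough_fit": 2},
    {"patterns": ["invoice"], "enough_fit": 1},
]


def find_block(blocks: List[List[str]],
               pattern_sets: List[Dict]) -> Optional[Tuple[int, int]]:
    """Return (block_index, row_index) of the first row that satisfies
    the strictest pattern set any row can satisfy, or None."""
    for pattern_set in pattern_sets:  # strictest set first
        for block_index, rows in enumerate(blocks):
            for row_index, row in enumerate(rows):
                words = set(row.lower().split())
                # Count how many pattern words occur in this row.
                hits = sum(1 for p in pattern_set["patterns"] if p in words)
                if hits >= pattern_set["enough_fit"]:
                    return block_index, row_index
    return None


# Assumed sample data: two blocks both contain the word "invoice".
blocks = [["Invoice"], ["Seller", "Invoice number 12/2023"]]
print(find_block(blocks, INVOICE_NUMBER_SETS))  # (1, 1)
```

Because the two-word set is tried on every block before the one-word set, the second block wins even though the first block also contains "invoice".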

A list of MatchingBlock objects is returned, containing only the blocks in which a searched phrase was found. A MatchingBlock consists of:

BlockPosition object - the block rows and the position of the whole block
key_word - the key word that the block was assigned to
row_index - integer, the index of the row in which the key word pattern was found
last_word_index - the index of the last word that matches the key word pattern

The last two parameters are used in the KeyValuesExtractor to extract the actual key values.
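To make the role of row_index and last_word_index concrete, here is a minimal sketch. The `MatchingBlockSketch` class and the `text_after_match` helper are hypothetical, the BlockPosition object is replaced by a plain list of row strings, and the way KeyValuesExtractor actually consumes these fields is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MatchingBlockSketch:
    """Simplified stand-in for MatchingBlock (not the real class)."""
    rows: List[str]        # stands in for the BlockPosition rows
    key_word: str          # key word the block was assigned to
    row_index: int         # row in which the pattern was found
    last_word_index: int   # index of the last word matching the pattern


def text_after_match(match: MatchingBlockSketch) -> str:
    """Return the words that follow the matched pattern in its row.

    Assumption: the key value sits right after the matched words.
    """
    words = match.rows[match.row_index].split()
    return " ".join(words[match.last_word_index + 1:])


match = MatchingBlockSketch(
    rows=["Seller", "Invoice number 12/2023"],
    key_word="invoice_number",
    row_index=1,
    last_word_index=1,  # "number" is the last matched word
)
print(text_after_match(match))  # 12/2023
```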

HeadersClassifier

Given a RowContent (a list of strings, one per header cell, with the words in the correct order), this class matches each column header with its type. The table_headers_database.json file lists all possible header types and the corresponding values that are searched for. The order matters: words are checked from the beginning of each list, so that the searched phrase can be matched against every searched value.

If a header cell contains more than one word, a summarized confidence value is calculated as the mean of the per-word confidences (their sum divided by the word count).
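As a small illustration of the summarized confidence, the sketch below averages per-word confidences. The `summarized_confidence` function is hypothetical, it assumes confidences are expressed as fractions between 0 and 1, and it leaves open how the per-word values themselves are computed.

```python
from typing import List


def summarized_confidence(word_confidences: List[float]) -> float:
    """Mean of per-word confidences: sum divided by the word count."""
    if not word_confidences:
        return 0.0  # assumption: an empty cell has zero confidence
    return sum(word_confidences) / len(word_confidences)


# A two-word header where one word matches fully and the other half-way.
print(summarized_confidence([1.0, 0.5]))  # 0.75
```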

The algorithm compares each cell with each header pattern set and returns the header type with the highest confidence value. If that confidence is higher than 90%, subsequent header cells skip this header type, since such a confidence is enough to be sure that the header was matched with the correct type.
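The selection and skipping logic above might look roughly like this. The `classify_headers` function, its signature, and the toy score table are assumptions made for illustration; the confidence function is passed in because the source does not specify how it is computed.

```python
from typing import Callable, List, Optional, Set, Tuple

Result = Tuple[str, Optional[str], float]


def classify_headers(cells: List[str],
                     header_types: List[str],
                     confidence: Callable[[str, str], float],
                     threshold: float = 0.9) -> List[Result]:
    """Assign each cell the header type with the highest confidence.

    A type matched with confidence above the threshold is skipped
    for all following cells.
    """
    locked: Set[str] = set()
    results: List[Result] = []
    for cell in cells:
        best_type, best_conf = None, 0.0
        for header_type in header_types:
            if header_type in locked:
                continue  # already matched with high confidence
            conf = confidence(cell, header_type)
            if conf > best_conf:
                best_type, best_conf = header_type, conf
        if best_type is not None and best_conf > threshold:
            locked.add(best_type)
        results.append((cell, best_type, best_conf))
    return results


# Toy confidences (assumed values) keyed by (cell, header type).
scores = {
    ("Name", "name"): 1.0,
    ("Name", "quantity"): 0.1,
    ("Name of goods", "name"): 0.95,
    ("Name of goods", "quantity"): 0.2,
}
result = classify_headers(
    ["Name", "Name of goods"],
    ["name", "quantity"],
    lambda cell, header_type: scores.get((cell, header_type), 0.0),
)
print(result)
```

In this toy run the first cell locks the "name" type, so the second cell can only be compared against "quantity", even though it would otherwise score higher for "name".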