dataMatching.py - haltosan/RA-python-tools GitHub Wiki

Table Data Matching

This is used to merge csv files that contain similar data ([name, place] with [name, birth, death] based on name). Still a work in progress.

Note:

This requires the file_analysis.py and predicates.py files to be in the same directory. Feel free to use these functions provided with the two libraries.

General Functions

Syntax:

getColumn(csvFile: list, key: str) -> list:

Get a column from a csv file (read in by file_analysis.get()) based on key ('id').


matchOnValue(tableA: list, tableB: list, columnKey: str) -> tuple:

Generator. Return a tuple that has the rows where both databases matched on the column key value. The first element is from table tableA, the second from tableB. columnKey indicates which column to use to match.


fuzzyMatch(tableA: list, tableB: list, columnKey: str, matchFunction) -> tuple:

Generator. Same as above, but uses matchFunction to match. If matchFunction returns true, the rows are considered matched. matchFunction needs to be in the following form: f(a: str, b: str) -> bool (this is similar to the predicate functions, it just accepts 2 parameters).


join(tableA: list, tableB: list, columnKey: str, matchFunction = lambda x,y : x == y, reason: str = None) -> list:

Comines all rows from tableA and tableB where tableA[_columnKey] matches tableB[columnKey], as determined by matchFunction. matchFunction should follow the format of f(x: str, y: str) -> bool. reason adds a final column ('reason') that indicates the join reason (close birthdates, matching surname, etc.). If reason is None, no column is added.


tableKeys(table: list) -> list:

Returns the keys for a given table (the first row)


project(table: list, newHeader: list) -> list:

Change column ordering and inclusion depending on the newHeader. For example, a table that had ['id','name','place'] as a header could be changed to ['place','id'] and the place column would come first, then the id column. The name column would be removed.


projectNumerical(table: list, newHeader: list) -> list:

Same as project but uses column numbers instead of string keys. For example: if a table had a header ['id', 'name', 'place'] and was projected using [1,0] as the header, it would become ['name', 'id']. This is useful if the same key appears twice in a header.


matchingColumns(table: list, column1: int, column2: int) -> list:

Generator. Only keeps rows such that table[row][column1] == table[row][column2].


getAll(generator) -> list:

Returns a list that contains all values from a generator.


Example

>>> a = get('a.csv')
>>> b = get('b'.csv')
>>> matchGenerator = fuzzyMatch(a, b, 'date', lambda dateA, dateB: abs(int(dateA) - int(dateB)) < 2)
>>> for match in matchGenerator:
>>>     print(match)

('0,Bill,Nebraska,4', '0,Nebraska,4,blue')
('0,Bill,Nebraska,4', '1,Nebraska,5,red')
('1,Bob,New York,5', '0,Nebraska,4,blue')
('1,Bob,New York,5', '1,Nebraska,5,red')
('2,Joe,California,4', '0,Nebraska,4,blue')
('2,Joe,California,4', '1,Nebraska,5,red')
('3,Fred,Utah,5', '0,Nebraska,4,blue')
('3,Fred,Utah,5', '1,Nebraska,5,red')
('4,Ted,Washington,5', '0,Nebraska,4,blue')
('4,Ted,Washington,5', '1,Nebraska,5,red')

This loads in 2 csv files and runs a fuzzy match on them to see what dates are within 1 day of each other.


>>> a = get('a.csv')
>>> b = get('b.csv')
>>> tableKeys(a)
['id', 'name', 'place', 'date']
>>> tableKeys(b)
['id', 'place', 'date', 'color']
>>> aNamePlace = project(a, ['name', 'place'])
>>> bPlaceColor = project(b, ['place', 'color'])
>>> abJoin = join(aNamePlace, bPlaceColor, 'place', reason = 'Placed matched')
>>> tableKeys(abJoin)
['name', 'place', 'place', 'color', 'reason']
>>> abJoin = projectNumerical(abJoin, [0, 1, 3, 4])
>>> for x in abJoin:
	print(x)

name,place,color,reason
Bill,Nebraska,blue,Placed matched
Bill,Nebraska,red,Placed matched
Joe,California,purple,Placed matched

Read in 2 tables (a, b) and look at the column keys we have. We can filter out the columns we don't need on both and join all table rows together that have the same location. We can then remove the extra 'place' column and print out the results.

Dictionary Based Matching Functions

This uses dictionary lookups to perform much faster data matching. This is a lot faster than the other match functions, but it only matches exactly.

Syntax

buildDict(keyCol: list, dataCol: list) -> dict:

Creates a dictionary that has a key column value correspond to some data column value. For example, if key=name and data=id, dict['Bob'] would return a set of id's that correspond to the name Bob. Another use is looking up rows by value. If key=id and data=tableRows, dict['5'] would be the rows that have an id of 5. This is used in later functions.


dictMatch(tableA: list, tableB: list, keyA: str, keyB: str, header: list, matchFunction = lambda dbLine, obitsLine: dbLine == obitsLine) -> list:

Match tableA rows to tableB if tableA[keyA] == tableB[keyB]. header is the new header values to place on the table. matchFunction is used on both rows as a later check to determine if they match. For example, keyA=name keyB=name and matchFunction matches address of both lines. It needs to have the following signature def f(tableARow, tableBRow) -> boolean:.


The rest of the functions are used with old features/functionality or refined to a specific use case.


Examples

>>> nameLookup = buildDict(getColumn(db, 'firstname'), getColumn(db, 'surname'))
>>> nameLookup['bob']
{'eldridge', 'fraser', 'stevenson'}
>>> 
>>> nameLookup['not in the table']     
Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    nameLookup['not in the table']
KeyError: 'not in the table'

Create a dictionary that allows for searches based on first name that gives all surnames that are on the same rows as 'bob'. We see there is a 'bob eldridge', a 'bob fraser', and 'bob stevenson'. If we attempt to lookup a name that is not contained in the db table, we get a KeyError.


>>> matchingRows = dictMatch(db, obits, 'firstname', 'First_1', header, namesAndDatesExactly)
>>> matchingRows[:2]
['id, firstname, lastname, birthyear, deathyear, First_1, Last_1, B_Year, D_Year, imageLocation',
    '7, bob, eldridge, 1901, 1985, bob, eldridge, 1901, 1985, images/1103.jpg',
    '112, peter, smith, 1899, 1957, peter, smith, 1899, 1957, images/2404.jpg']

dictMatch first narrows down our search by only comparing rows that match exactly on a first name. Once we find rows that have a first name matching, we run namesAndDatesExactly (a function written earlier) on both rows. If this returns true, then we combine the rows and return them together. This allows us to quickly find all rows that match exactly on name and date information. This matched table can now create a connection between 'id' and 'imageLocation'.

⚠️ **GitHub.com Fallback** ⚠️