Update dataframe with new values - lvphj/epydemiology GitHub Wiki

A function that identifies new values in a second dataframe that are not present in a first dataframe

phjUpdateLUT()

myDF = epy.phjUpdateLUT(phjExistDF,
                        phjNewDF,
                        phjIDName,
                        phjVarNameList,
                        phjMissStr,
                        phjMissCode,
                        phjIgnoreCase = True,
                        phjPrintResults = False)

Description

This function was written primarily to facilitate updating look-up tables (LUTs) in relational databases based on new datasets being entered into a database. As an example, a LUT may have two columns, an 'id' column as the primary key and a column containing a variable definition. A new update of the data may identify new entries that need to be entered into the LUT. It is important that new items can be added to the LUT without changing the existing 'id' values which may already have been used to link to data in other tables.

As an illustrative example, imagine a database has a table of employees and a lookup key that links to a table of departments in a company:

employee                                 dept
|----|---------------|---------|         |----|-----------------|-----------|
| id |          name | dept_id |         | id |       dept_name | other_var |
|----|---------------|---------|         |----|-----------------|-----------|
| 01 | David Jones   |       1 |         |  1 | Human Resources |       abc |
| 02 | Jenny Moore   |       1 |         |  2 |   Manufacturing |       def |
| 03 | Peter Jenkins |       2 |         |----|-----------------|-----------|
| 04 | John Smith    |       2 |
| 05 | Anne Fellows  |       1 |
|----|---------------|---------|

If a new employee from a different department (e.g. IT Support) is added to the employee table then it is necessary to add IT Support to the dept LUT with a new ID value. It is important to ensure that the pre-existing id values are not altered; so, for example, it wouldn't be appropriate to list all unique department names in order and allocate an id because, in this hypothetical example, alphabetical order would put departments in the order Human Resources - IT Support - Manufacturing which means that id value for Manufacturing would change from 2 to 3.

This function identifies new values that don't already exist in the LUT. It then calculates a new range of id values that follow on from pre-existing id values but don't duplicate. The function also recognises that a missing value code and string need to be added to the LUT. If a missing value is not already included in the LUT then details are added to the lookup table.

In the above example, the function will return a Pandas dataframe that gives new values to be entered into the LUT such as:

    id   dept_name
0    3  IT Support
1  999     missing

The contents of this dataframe can be added to the LUT using Pandas df.to_sql() with append parameter set to True.

Function parameters

phjExistDF

Pandas dataframes of existing values taken from the LUT in an SQL database.
phjNewDF

Dataframe containing potentially new values.
phjIDName

Name of column that contains the id of the LUT.
phjVarName

Name of column or list of column names.
phjMissStr

Missing value string (e.g. 'missing')
phjMissCode

Integer code used to code missing values.
phjIgnoreCase (default = True)

Case is ignored when comparing what information already exists in database. However, if new data needs to be added, then the case of the data entries as it exists in the new dataframe is maintained.
phjPrintResults (default = False)

Print results

Exceptions raised

None

Returns

Pandas dataframe containing new values that were containing in second dataframe but not in the first.

Other notes

None.

Example

Testing phjUpdateLUT() function with dataframe with single column

old_df = pd.DataFrame({'id':[1,2,3,4,5,6],
                       'm':['a','b','c','d','e','f']})
 
new_df = pd.DataFrame({'id':[1,2,3,4],
                       'm':['b','e','g','h']})
 
update_df = epy.phjUpdateLUT(phjExistDF = old_df,
                             phjNewDF = new_df,
                             phjIDName = 'id',
                             phjVarNameList = ['m'],
                             phjMissStr = 'missing',
                             phjMissCode = 999,
                             phjIgnoreCase = False,
                             phjPrintResults = True)

To give results:

Existing dataframe
------------------
   id  m
0   1  a
1   2  b
2   3  c
3   4  d
4   5  e
5   6  f


New dataframe
-------------
   id  m
0   1  b
1   2  e
2   3  g
3   4  h


Dataframe of new values
-----------------------
   id  m
2   3  g
3   4  h


List of new ID values
---------------------
[7, 8]


New dataframe with new ID
-------------------------
   id  m
2   7  g
3   8  h


Returned dataframe
------------------
    id        m
0    7        g
1    8        h
2  999  missing

Testing phjUpdateLUT() function with dataframe with multiple columns

old_df = pd.DataFrame({'id':[1,2,3,4,5,6],
                       'm':['a','b','c','d','e','f'],
                       'n':['A','B','C','D','E','F']})
 
new_df = pd.DataFrame({'id':[1,2,3,4,5],
                       'm':['b','e','g','h','a'],
                       'n':['BB','e','GG','H','a']})
 
update_df = epy.phjUpdateLUT(phjExistDF = old_df,
                             phjNewDF = new_df,
                             phjIDName = 'id',
                             phjVarNameList = ['m','n'],
                             phjMissStr = 'missing',
                             phjMissCode = 999,
                             phjIgnoreCase = True,
                             phjPrintResults = True)
 
print('Updated dataframe')
print('-----------------')
print(update_df)

To give results:

Existing dataframe
------------------
   id  m  n
0   1  a  A
1   2  b  B
2   3  c  C
3   4  d  D
4   5  e  E
5   6  f  F


New dataframe
-------------
   id  m   n
0   1  b  BB
1   2  e   e
2   3  g  GG
3   4  h   H
4   5  a   a


Dataframe of new values
-----------------------
   id  m   n
0   1  b  BB
2   3  g  GG
3   4  h   H


List of new ID values
---------------------
[7, 8, 9]


New dataframe with new ID
-------------------------
   id  m   n
0   7  b  BB
2   8  g  GG
3   9  h   H


Returned dataframe
------------------
    id        m        n
0    7        b       BB
1    8        g       GG
2    9        h        H
3  999  missing  missing


Updated dataframe
-----------------
    id        m        n
0    7        b       BB
1    8        g       GG
2    9        h        H
3  999  missing  missing