Update dataframe with new values - lvphj/epydemiology GitHub Wiki
A function that identifies new values in a second dataframe that are not present in a first dataframe
phjUpdateLUT()
myDF = epy.phjUpdateLUT(phjExistDF,
phjNewDF,
phjIDName,
phjVarNameList,
phjMissStr,
phjMissCode,
phjIgnoreCase = True,
phjPrintResults = False)
Description
This function was written primarily to facilitate updating look-up tables (LUTs) in relational databases based on new datasets being entered into a database. As an example, a LUT may have two columns, an 'id' column as the primary key and a column containing a variable definition. A new update of the data may identify new entries that need to be entered into the LUT. It is important that new items can be added to the LUT without changing the existing 'id' values which may already have been used to link to data in other tables.
As an illustrative example, imagine a database has a table of employees and a lookup key that links to a table of departments in a company:
employee dept
|----|---------------|---------| |----|-----------------|-----------|
| id | name | dept_id | | id | dept_name | other_var |
|----|---------------|---------| |----|-----------------|-----------|
| 01 | David Jones | 1 | | 1 | Human Resources | abc |
| 02 | Jenny Moore | 1 | | 2 | Manufacturing | def |
| 03 | Peter Jenkins | 2 | |----|-----------------|-----------|
| 04 | John Smith | 2 |
| 05 | Anne Fellows | 1 |
|----|---------------|---------|
If a new employee from a different department (e.g. IT Support) is added to the employee table then it is necessary to add IT Support to the dept LUT with a new ID value. It is important to ensure that the pre-existing id values are not altered; so, for example, it wouldn't be appropriate to list all unique department names in order and allocate an id because, in this hypothetical example, alphabetical order would put departments in the order Human Resources - IT Support - Manufacturing which means that id value for Manufacturing would change from 2 to 3.
This function identifies new values that don't already exist in the LUT. It then calculates a new range of id values that follow on from pre-existing id values but don't duplicate. The function also recognises that a missing value code and string need to be added to the LUT. If a missing value is not already included in the LUT then details are added to the lookup table.
In the above example, the function will return a Pandas dataframe that gives new values to be entered into the LUT such as:
id dept_name
0 3 IT Support
1 999 missing
The contents of this dataframe can be added to the LUT using Pandas df.to_sql()
with append parameter set to True
.
Function parameters
-
phjExistDF
Pandas dataframes of existing values taken from the LUT in an SQL database.
-
phjNewDF
Dataframe containing potentially new values.
-
phjIDName
Name of column that contains the id of the LUT.
-
phjVarName
Name of column or list of column names.
-
phjMissStr
Missing value string (e.g. 'missing')
-
phjMissCode
Integer code used to code missing values.
-
phjIgnoreCase (default = True)
Case is ignored when comparing what information already exists in database. However, if new data needs to be added, then the case of the data entries as it exists in the new dataframe is maintained.
-
phjPrintResults (default = False)
Print results
Exceptions raised
None
Returns
Pandas dataframe containing new values that were containing in second dataframe but not in the first.
Other notes
None.
Example
Testing phjUpdateLUT() function with dataframe with single column
old_df = pd.DataFrame({'id':[1,2,3,4,5,6],
'm':['a','b','c','d','e','f']})
new_df = pd.DataFrame({'id':[1,2,3,4],
'm':['b','e','g','h']})
update_df = epy.phjUpdateLUT(phjExistDF = old_df,
phjNewDF = new_df,
phjIDName = 'id',
phjVarNameList = ['m'],
phjMissStr = 'missing',
phjMissCode = 999,
phjIgnoreCase = False,
phjPrintResults = True)
To give results:
Existing dataframe
------------------
id m
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
5 6 f
New dataframe
-------------
id m
0 1 b
1 2 e
2 3 g
3 4 h
Dataframe of new values
-----------------------
id m
2 3 g
3 4 h
List of new ID values
---------------------
[7, 8]
New dataframe with new ID
-------------------------
id m
2 7 g
3 8 h
Returned dataframe
------------------
id m
0 7 g
1 8 h
2 999 missing
Testing phjUpdateLUT() function with dataframe with multiple columns
old_df = pd.DataFrame({'id':[1,2,3,4,5,6],
'm':['a','b','c','d','e','f'],
'n':['A','B','C','D','E','F']})
new_df = pd.DataFrame({'id':[1,2,3,4,5],
'm':['b','e','g','h','a'],
'n':['BB','e','GG','H','a']})
update_df = epy.phjUpdateLUT(phjExistDF = old_df,
phjNewDF = new_df,
phjIDName = 'id',
phjVarNameList = ['m','n'],
phjMissStr = 'missing',
phjMissCode = 999,
phjIgnoreCase = True,
phjPrintResults = True)
print('Updated dataframe')
print('-----------------')
print(update_df)
To give results:
Existing dataframe
------------------
id m n
0 1 a A
1 2 b B
2 3 c C
3 4 d D
4 5 e E
5 6 f F
New dataframe
-------------
id m n
0 1 b BB
1 2 e e
2 3 g GG
3 4 h H
4 5 a a
Dataframe of new values
-----------------------
id m n
0 1 b BB
2 3 g GG
3 4 h H
List of new ID values
---------------------
[7, 8, 9]
New dataframe with new ID
-------------------------
id m n
0 7 b BB
2 8 g GG
3 9 h H
Returned dataframe
------------------
id m n
0 7 b BB
1 8 g GG
2 9 h H
3 999 missing missing
Updated dataframe
-----------------
id m n
0 7 b BB
1 8 g GG
2 9 h H
3 999 missing missing