Update LUT to latest values - lvphj/epydemiology GitHub Wiki

def epy.phjUpdateLUTToLatestValues(phjDF,
                                   phjIDVarName,
                                   phjGroupbyVarName,
                                   phjAddCountCol = True,
                                   phjPrintResults = False)

Description

It is assumed that values in the phjIDVarName indicate a relative order in which values have been added (i.e. rows with higher ID values will have been added later). The ``phjIDVarName``` variable is used to sort the dataframe rows within groupby groups.

Example

# Define 2 dataframes. The values in df2 will be used to update the values in
# the df1, either by changing values that already exist or by adding new values.

df1 = pd.DataFrame({'id':[1,2,3,4,5,6,7,8],
                   'name':['a','b','c','d','e','f','g','h'],
                   'value':[999,22,33,44,55,66,999,88]})

df2 = pd.DataFrame({'id':[9,10,11,12],
                    'name':['a','i','d','g'],
                    'value':[11,99,None,77]})

# Append df2 to the end of df1
df = df1.append(df2).sort_values(by = ['name','id'])

print('First dataframe')
print('---------------')
print(df1)
print('\n')

print('Second dataframe')
print('----------------')
print(df2)
print('\n')

print('Joined dataframes')
print('-----------------')
print(df)

The initial dataframes look like:

First dataframe
---------------
   id name  value
0   1    a    999
1   2    b     22
2   3    c     33
3   4    d     44
4   5    e     55
5   6    f     66
6   7    g    999
7   8    h     88


Second dataframe
----------------
   id name  value
0   9    a   11.0
1  10    i   99.0
2  11    d    NaN
3  12    g   77.0


Joined dataframes
-----------------
   id name  value
0   1    a  999.0
0   9    a   11.0
1   2    b   22.0
2   3    c   33.0
3   4    d   44.0
2  11    d    NaN
4   5    e   55.0
5   6    f   66.0
6   7    g  999.0
3  12    g   77.0
7   8    h   88.0
1  10    i   99.0

The first dataframe is updated as follows:

myDF = epy.phjUpdateLUTToLatestValues(phjDF = df,
                                      phjIDVarName = 'id',
                                      phjGroupbyVarName = 'name',
                                      phjAddCountCol = True,
                                      phjPrintResults = True)

This produces the following output:

Original sorted dataframe with count variable
---------------------------------------------
   id name  value  n
0   1    a  999.0  2
0   9    a   11.0  2
1   2    b   22.0  1
2   3    c   33.0  1
3   4    d   44.0  2
2  11    d    NaN  2
4   5    e   55.0  1
5   6    f   66.0  1
6   7    g  999.0  2
3  12    g   77.0  2
7   8    h   88.0  1
1  10    i   99.0  1


Updated dataframe
-----------------
   id name  value  n
0   1    a   11.0  2
1   2    b   22.0  1
2   3    c   33.0  1
3   4    d    NaN  2
4   5    e   55.0  1
5   6    f   66.0  1
6   7    g   77.0  2
7   8    h   88.0  1
8  10    i   99.0  1