Collapse on patient ID - lvphj/epydemiology GitHub Wiki
This function collapses a dataframe based on patient ID.
phjCollapseOnPatientID()
df = epy.phjCollapseOnPatientID(phjAllDataDF,
phjPatientIDVarName,
phjConsultationIDVarName = None,
phjConsultationDateVarName = None,
phjFreeTextVarName = None,
phjAggDict = None,
phjPrintResults = False)
Description
This function collapses a dataframe of individual consultation data based on patient ID and returns the data as a Pandas dataframe.
Function parameters
-
phjAllDataDF
Pandas dataframe containing all data that needs to be included in the collapsed output dataframe.
-
phjPatientIDVarName
-
phjConsultationIDVarName (default = None)
-
phjConsultationDateVarName (default = None)
-
phjFreeTextVarName (default = None)
-
phjAggDict (default = None)
This parameter is a dictionary or ordered dictionary that defines how individual columns in the dataframe should be collapsed. If phjAggDict is left as None, the function aggregates a few key variables in a specific way: the consultation ID variable is aggregated by counting the number of consultations (and the variable is renamed 'count'), the consultation date variable is aggregated by identifying the first and last consultation date for each individual, and the free-text field is aggregated by concatenating each row in order with '///' as the separator between fields. All other variables are aggregated by taking the last data entry for each patient; for example, a column containing postcode data for a patient who has moved several times will be collapsed by taking the final postcode entry.
However, individual variables can be collapsed using different functions by defining those functions in the phjAggDict parameter. Examples of some commonly used functions are:
count
lambda x: ' /// '.join(x.fillna('EMPTY FIELD')) – concatenates fields separated by ' /// '
lambda x:x.value_counts().index[0] – gets the most common value (mode) for the group
lambda x: sum(i == 'yes' for i in x) – counts how many times 'yes' occurs in the group
np.sum
np.max
np.min
['first','last'] – finds first and last values (and creates a multi-index)
An ordered dictionary defining aggregation methods for variables in the dataframe might be defined as:
import collections
phjAggOrderedDict = collections.OrderedDict()
phjAggOrderedDict['postcode'] = "last"
phjAggOrderedDict['clinicID'] = lambda x:x.value_counts().index[0] # Gets the most common (i.e. mode)
phjAggOrderedDict['consultID'] = "count"
phjAggOrderedDict['consultDate'] = ["first","last"]
phjAggOrderedDict['clinicalFreetext'] = lambda x: ' /// '.join(x.fillna('EMPTY FIELD'))
All remaining variables in the dataframe will be aggregated using "last".
-
phjPrintResults (default = False)
Print the results of the function.
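The aggregation functions listed for phjAggDict can be tried out directly with a plain pandas groupby, which may help when deciding what to put in the dictionary. The following is a minimal sketch only: it does not call phjCollapseOnPatientID() and makes no claim about how the function is implemented internally. The dataframe, the column names and the values are all hypothetical.

```python
import pandas as pd

# Toy consultation data; every column name and value here is hypothetical.
df = pd.DataFrame({
    "patientID": [1, 1, 1, 2, 2],
    "consultID": [101, 102, 103, 201, 202],
    "clinicID": ["A", "B", "B", "C", "C"],
    "vaccinated": ["yes", "no", "yes", "no", "no"],
    "clinicalFreetext": ["Limping", None, "Recheck", "Vomiting", "Improved"],
})

# One aggregation rule per column, mirroring the functions listed above.
agg_dict = {
    # Number of consultations per patient
    "consultID": "count",
    # Most common clinic (mode) per patient
    "clinicID": lambda x: x.value_counts().index[0],
    # Number of rows where the value is 'yes'
    "vaccinated": lambda x: sum(i == "yes" for i in x),
    # Concatenate free text, marking missing entries explicitly
    "clinicalFreetext": lambda x: " /// ".join(x.fillna("EMPTY FIELD")),
}

# Plain pandas equivalent of collapsing on patient ID (an illustration,
# not the library's own code path).
collapsed = df.groupby("patientID").agg(agg_dict)
print(collapsed)
```

Each lambda receives the column values for one patient as a pandas Series, so any function that reduces a Series to a single value can be used in the same way.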
Exceptions raised
None.
Returns
Pandas dataframe containing collapsed data.
Other notes
None.
Example
An example of the function in use is given below.
# The following libraries are imported automatically but are included here for completeness.
import pandas as pd
import re
import collections
import epydemiology as epy
myTempDF = epy.phjCollapseOnPatientID(phjAllDataDF,
phjPatientIDVarName,
phjConsultationIDVarName = None,
phjConsultationDateVarName = None,
phjFreeTextVarName = None,
phjAggDict = None,
phjPrintResults = False)
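To make the default behaviour (phjAggDict = None) more concrete, the sketch below reproduces the described aggregation using plain pandas: consultations are counted, the first and last consultation dates are kept, free text is joined with '///', and the remaining column takes its last value. This is an approximation of the behaviour described on this page, not the function's actual implementation. The dataframe and the output column names (firstDate, lastDate, freetext) are assumptions chosen for illustration, and the real function may name and arrange its output columns differently.

```python
import pandas as pd

# Toy consultation data; column names and values are hypothetical.
df = pd.DataFrame({
    "patientID": [1, 1, 1, 2],
    "consultID": [101, 102, 103, 201],
    "consultDate": pd.to_datetime(
        ["2020-01-05", "2020-03-10", "2021-07-01", "2019-11-20"]
    ),
    "postcode": ["CH64 7TE", "CH64 7TE", "L3 5TR", "M1 1AA"],
    "clinicalFreetext": ["Limping", "Recheck", "Booster", "Vomiting"],
})

# Sort first so that 'first' and 'last' follow consultation date order.
ordered = df.sort_values(["patientID", "consultDate"])

# Named aggregation keeps the output columns flat; the output names
# on the left-hand side are illustrative assumptions.
collapsed = ordered.groupby("patientID").agg(
    count=("consultID", "count"),                           # number of consultations
    firstDate=("consultDate", "first"),                     # earliest consultation
    lastDate=("consultDate", "last"),                       # latest consultation
    freetext=("clinicalFreetext", lambda x: "///".join(x)), # joined free text
    postcode=("postcode", "last"),                          # final recorded postcode
)
print(collapsed)
```

Passing ["first", "last"] as a list inside a dictionary, as in the ordered dictionary shown earlier, would instead produce multi-index columns; named aggregation is used here only to keep the toy output easy to read.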