Collapse on patient ID - lvphj/epydemiology GitHub Wiki

This function collapses a dataframe based on patient ID.

phjCollapseOnPatientID()

df = epy.phjCollapseOnPatientID(phjAllDataDF,
                                phjPatientIDVarName,
                                phjConsultationIDVarName = None,
                                phjConsultationDateVarName = None,
                                phjFreeTextVarName = None,
                                phjAggDict = None,
                                phjPrintResults = False)

Description

This function collapses a dataframe of individual consultation data based on patient ID and returns the data as a Pandas dataframe.

Function parameters

phjAllDataDF

Pandas dataframe containing all data that needs to be included in the collapsed output dataframe.
phjPatientIDVarName
phjConsultationIDVarName (default = None)
phjConsultationDateVarName (default = None)
phjFreeTextVarName (default = None)
phjAggDict (default = None)

This parameter is a dictionary or ordered dictionary that defines how individual columns in the dataframe should be collapsed. If phjAggDict is left as None, the function aggregates a few key variables in a specific way: the consultation ID variable is aggregated by counting the number of consultations (and the variable is renamed 'count'); the consultation date variable is aggregated by identifying the first and last consultation date for each individual; and the free-text field is aggregated by concatenating the rows in order, with '///' as the separator between fields. All other variables are aggregated by taking the last data entry for each patient; for example, a column containing postcode data for a patient who has moved several times will be collapsed by taking the final postcode entry. However, individual variables can be collapsed using different functions by defining those functions in the phjAggDict parameter. Examples of some commonly used functions are:
1. count
2. lambda x: ' /// '.join(x.fillna('EMPTY FIELD')) – concatenates fields separated by ' /// '
3. lambda x: x.value_counts().index[0] – gets the most common value (the mode) in the group
4. lambda x: sum(i == 'yes' for i in x) – counts how many times 'yes' occurs in the group
5. np.sum
6. np.max
7. np.min
8. ['first','last'] – finds first and last values (and creates a multi-index)
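
As a rough illustration of what the functions listed above do, the sketch below applies them directly with pandas groupby().agg() rather than through phjCollapseOnPatientID(). The dataframe, its column names and the output column names are all hypothetical. The string names 'sum', 'max' and 'min' stand in for np.sum, np.max and np.min because recent pandas versions may warn when NumPy callables are passed to agg().

```python
import pandas as pd

# Hypothetical consultation data; every column name here is illustrative only.
df = pd.DataFrame({
    'patientID': [1, 1, 1, 2, 2],
    'consultID': [101, 102, 103, 104, 105],
    'clinicID': ['A', 'B', 'A', 'C', 'C'],
    'vaccinated': ['yes', 'no', 'yes', 'no', 'no'],
    'weight': [4.0, 4.5, 5.0, 20.0, 22.0],
    'freetext': ['itchy', None, 'better', 'lame', 'still lame'],
})

# One output column per aggregation, using pandas named aggregation.
collapsed = df.groupby('patientID').agg(
    n_consults=('consultID', 'count'),
    notes=('freetext', lambda x: ' /// '.join(x.fillna('EMPTY FIELD'))),
    usual_clinic=('clinicID', lambda x: x.value_counts().index[0]),
    n_yes=('vaccinated', lambda x: sum(i == 'yes' for i in x)),
    total_weight=('weight', 'sum'),
    max_weight=('weight', 'max'),
    min_weight=('weight', 'min'),
)

# A list of functions produces multi-index columns:
# ('consultID', 'first') and ('consultID', 'last').
first_last = df.groupby('patientID').agg({'consultID': ['first', 'last']})

print(collapsed)
print(first_last)
```

Each patient ends up as a single row, and the list-valued aggregation is the one case that changes the column layout.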

An ordered dictionary defining aggregation methods for variables in the dataframe might be defined as:

import collections

phjAggOrderedDict = collections.OrderedDict()
phjAggOrderedDict['postcode'] = "last"
phjAggOrderedDict['clinicID'] = lambda x:x.value_counts().index[0]   # Gets the most common (i.e. mode)
phjAggOrderedDict['consultID'] = "count"
phjAggOrderedDict['consultDate'] = ["first","last"]
phjAggOrderedDict['clinicalFreetext'] = lambda x: ' /// '.join(x.fillna('EMPTY FIELD'))

All remaining variables in the dataframe will be aggregated using last.
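
The fall-back to last can be pictured with plain pandas. The sketch below is only an assumption about the general idea, not the function's actual internals: any column missing from a partial aggregation dictionary is given 'last' before grouping. The dataframe and column names are hypothetical.

```python
import collections
import pandas as pd

# Hypothetical data; column names are illustrative only.
df = pd.DataFrame({
    'patientID': [1, 1, 2],
    'consultID': [101, 102, 103],
    'postcode': ['AB1', 'CD2', 'EF3'],
    'species': ['cat', 'cat', 'dog'],
})

# A partial aggregation dictionary that only mentions one column.
agg_dict = collections.OrderedDict()
agg_dict['consultID'] = 'count'

# Assumed behaviour: every column not listed (other than the grouping
# column) is aggregated by taking its last entry.
for col in df.columns:
    if col != 'patientID' and col not in agg_dict:
        agg_dict[col] = 'last'

collapsed = df.groupby('patientID').agg(agg_dict)
print(collapsed)
```

Here patient 1 keeps the postcode from their final row, matching the postcode example given in the phjAggDict description.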

phjPrintResults (default = False)

Print the results of the function.

Exceptions raised

None.

Returns

Pandas dataframe containing collapsed data.

Other notes

None.

Example

An example of the function in use is given below.

# The following libraries are imported automatically but are included here for completeness.
import pandas as pd
import re
import collections
import epydemiology as epy

myTempDF = epy.phjCollapseOnPatientID(phjAllDataDF,
                                      phjPatientIDVarName,
                                      phjConsultationIDVarName = None,
                                      phjConsultationDateVarName = None,
                                      phjFreeTextVarName = None,
                                      phjAggDict = None,
                                      phjPrintResults = False)
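
The call above is a template. To give a feel for the kind of output to expect, the following sketch emulates the default aggregation described for phjAggDict = None using plain pandas. It does not call epydemiology, so the exact column names and layout returned by the real function may differ. The data, the column names and the sorting by date are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical consultation-level data; all column names are assumptions.
consults = pd.DataFrame({
    'patientID': ['p1', 'p1', 'p2'],
    'consultID': [1001, 1002, 1003],
    'consultDate': pd.to_datetime(['2016-01-05', '2016-03-10', '2016-02-01']),
    'clinicalFreetext': ['Vomiting.', 'Recovered.', 'Booster given.'],
})

# Emulate the documented defaults: count consultations, take the first and
# last consultation dates, and concatenate free text in row order.
collapsed = (
    consults.sort_values(['patientID', 'consultDate'])  # assumed ordering
    .groupby('patientID')
    .agg(
        count=('consultID', 'count'),
        first_date=('consultDate', 'first'),
        last_date=('consultDate', 'last'),
        freetext=('clinicalFreetext',
                  lambda x: ' /// '.join(x.fillna('EMPTY FIELD'))),
    )
)
print(collapsed)
```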