Convert a disease summary table to a dataframe of binary outcomes - lvphj/epydemiology GitHub Wiki
phjSummaryTableToBinaryOutcomes()
import numpy as np
import pandas as pd
import epydemiology as epy
df = epy.phjSummaryTableToBinaryOutcomes(phjDF,
                                         phjVarsToIncludeList,
                                         phjSuccVarName = None,
                                         phjFailVarName = None,
                                         phjTotalVarName = None,
                                         phjOutcomeVarName = 'outcome',
                                         phjPrintResults = False)
Description
This function converts a table of summary count data (e.g. number of cases of disease per year) to a dataframe containing binary outcome data, with one row per individual observation. This is useful when fitting logistic regression models with a routine that does not have an option to include frequency weights.
As an example, the following table contains entirely hypothetical data showing the number of cases of disease (and unaffected controls) per year, together with one of any number of additional unrelated variables.
year | cases | controls | comment |
---|---|---|---|
2010 | 23 | 1023 | Small number of cases |
2011 | 34 | 1243 | Proportion increase |
2012 | 41 | 1145 | Trend continues |
2013 | 57 | 2017 | Decreased proportion |
2014 | 62 | 1876 | Increased again |
This function converts the above table (passed as a dataframe) to a new dataframe containing one row per individual, where cases are identified as '1' and controls as '0', as shown below. Any number of additional variables can be included; their contents are simply repeated on each row to which they apply.
| | year | outcome |
---|---|---|
0 | 2010 | 1 |
1 | 2010 | 1 |
2 | 2010 | 1 |
... | ... | ... |
7518 | 2014 | 0 |
7519 | 2014 | 0 |
7520 | 2014 | 0 |
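The expansion described above can be sketched in plain pandas. This is only an illustration of the idea, not the library's own implementation: the helper name `expand_counts` and its arguments are hypothetical, and the row order it produces (all cases first, then all controls) is an assumption of the sketch.

```python
import pandas as pd

# Summary table from the description above (hypothetical data)
summary = pd.DataFrame({'year': [2010, 2011, 2012, 2013, 2014],
                        'cases': [23, 34, 41, 57, 62],
                        'controls': [1023, 1243, 1145, 2017, 1876]})

def expand_counts(df, count_col, outcome_value, keep_cols, outcome_col='outcome'):
    """Hypothetical helper: repeat each row count_col times and label it."""
    # Repeating the index labels by the count yields one row per individual
    repeated = df.loc[df.index.repeat(df[count_col]), keep_cols].copy()
    repeated[outcome_col] = outcome_value
    return repeated

# Cases are coded 1 and controls 0, then stacked into a single dataframe
binary = pd.concat([expand_counts(summary, 'cases', 1, ['year']),
                    expand_counts(summary, 'controls', 0, ['year'])],
                   ignore_index=True)

print(binary.shape)             # (7521, 2)
print(binary['outcome'].sum())  # 217 cases in total
```

The total of 7521 rows is the sum of all cases (217) and all controls (7304) in the summary table.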
Function parameters
- phjDF
  The name of the Pandas dataframe that contains the summary table information.
- phjVarsToIncludeList
  A list of variables to include or retain in the returned dataframe. The variables containing 'case' and 'control' data may, or may not, be included; the function automatically excludes these variables, if present, when generating the new dataframe.
- phjSuccVarName (default = None)
  The name of the variable containing the 'case' or 'success' data, which will be coded as 1. Only 2 of the 3 variables (successes, failures and total) need to be entered; the missing variable (if required) will be calculated (assuming successes + failures = total).
- phjFailVarName (default = None)
  The name of the variable containing the 'control' or 'failure' data, which will be coded as 0. Only 2 of the 3 variables (successes, failures and total) need to be entered; the missing variable (if required) will be calculated.
- phjTotalVarName (default = None)
  The name of the variable containing the 'total' number of observations (i.e. successes + failures). Only 2 of the 3 variables (successes, failures and total) need to be entered; the missing variable (if required) will be calculated.
- phjOutcomeVarName (default = 'outcome')
  The name of the new variable that will contain the outcome data (i.e. the 0 and 1 data).
- phjPrintResults (default = False)
  Prints intermediate results. Not used in this function.
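The "2 of 3" rule for the success, failure and total variables relies on successes + failures = total. A minimal sketch of that arithmetic is given below; the helper `derive_succ_fail` is hypothetical and is not part of epydemiology (in particular, it raises a `ValueError`, whereas the library function is documented as raising assertion errors).

```python
import pandas as pd

def derive_succ_fail(df, succ_col=None, fail_col=None, total_col=None):
    """Hypothetical helper: return (successes, failures) from any two columns."""
    given = [c for c in (succ_col, fail_col, total_col) if c is not None]
    if len(given) < 2:
        raise ValueError('At least two of the three column names are required')
    if succ_col is None:
        # successes = total - failures
        return df[total_col] - df[fail_col], df[fail_col]
    if fail_col is None:
        # failures = total - successes
        return df[succ_col], df[total_col] - df[succ_col]
    return df[succ_col], df[fail_col]

# Same hypothetical data as above, but supplied as cases and totals
table = pd.DataFrame({'year': [2010, 2011],
                      'cases': [23, 34],
                      'total': [1046, 1277]})

succ, fail = derive_succ_fail(table, succ_col='cases', total_col='total')
print(fail.tolist())  # [1023, 1243]
```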
Exceptions raised
An AssertionError is raised if the parameters passed are incorrect.
Returns
Pandas dataframe containing binary outcome data.
Other notes
None.
Example
# Generate the dataframe used in the original description of the function
df = pd.DataFrame({'year':[2010,2011,2012,2013,2014],
                   'cases':[23,34,41,57,62],
                   'controls':[1023,1243,1145,2017,1876],
                   'comment':['Small number of cases',
                              'Proportion increase',
                              'Trend continues',
                              'Decreased proportion',
                              'Increased again']})
# Reorder the columns a little
df = df[['year','cases','controls','comment']]
# Convert to dataframe containing binary outcome data
newDF = epy.phjSummaryTableToBinaryOutcomes(phjDF = df,
                                            phjVarsToIncludeList = ['year','cases','controls'],
                                            phjSuccVarName = 'cases',
                                            phjFailVarName = 'controls',
                                            phjTotalVarName = None,
                                            phjOutcomeVarName = 'outcome',
                                            phjPrintResults = False)
# Print results
print('Original table of summary results\n')
print(df)
print('\n')
print('Dataframe of binary outcomes\n')
with pd.option_context('display.max_rows',6, 'display.max_columns',2):
    print(newDF)
Produces the following output:
Original table of summary results

   year  cases  controls                comment
0  2010     23      1023  Small number of cases
1  2011     34      1243    Proportion increase
2  2012     41      1145        Trend continues
3  2013     57      2017   Decreased proportion
4  2014     62      1876        Increased again

Dataframe of binary outcomes

      year  outcome
0     2010        1
1     2010        1
2     2010        1
...    ...      ...
7518  2014        0
7519  2014        0
7520  2014        0

[7521 rows x 2 columns]
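One way to sanity-check a dataframe of binary outcomes is to aggregate it back to counts per group and compare them with the original summary table. The sketch below uses a small stand-in for `newDF` (made-up values, so that the block runs on its own); with the real output, the same `groupby` should reproduce the cases and controls columns of the summary table.

```python
import pandas as pd

# Stand-in for newDF: a small, made-up dataframe of binary outcomes
newDF = pd.DataFrame({'year': [2010] * 5 + [2011] * 4,
                      'outcome': [1, 1, 0, 0, 0, 1, 0, 0, 0]})

# Sum of the 0/1 outcomes gives the cases; the row count gives the total
check = (newDF.groupby('year')['outcome']
              .agg(cases='sum', total='count')
              .reset_index())
check['controls'] = check['total'] - check['cases']

print(check)
```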