Convert a disease summary table to a dataframe of binary outcomes

phjSummaryTableToBinaryOutcomes()

import numpy as np
import pandas as pd
import epydemiology as epy

df = epy.phjSummaryTableToBinaryOutcomes(phjDF,
                                         phjVarsToIncludeList,
                                         phjSuccVarName = None,
                                         phjFailVarName = None,
                                         phjTotalVarName = None,
                                         phjOutcomeVarName = 'outcome',
                                         phjPrintResults = False)

Description

This function converts a table of summary count data (e.g. number of cases of disease per year) into a dataframe of binary outcome data. This is useful when creating logistic regression models where the model-fitting function does not have an option to include frequency weights.

As an example, the following table contains entirely hypothetical data showing the number of cases of disease (and unaffected controls) per year, together with one additional, unrelated variable (any number of such variables could be present).

year	cases	controls	comment
2010	23	1023	Small number of cases
2011	34	1243	Proportion increase
2012	41	1145	Trend continues
2013	57	2017	Decreased proportion
2014	62	1876	Increased again

This function converts the above table (passed as a dataframe) to a new dataframe containing one row per observation, where cases are identified as '1' and controls as '0', as shown below. Any number of additional variables can be included; their values are simply repeated on each row to which they apply.

	year	outcome
0	2010	1
1	2010	1
2	2010	1
...	...	...
7518	2014	0
7519	2014	0
7520	2014	0
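
The expansion described above can be sketched in plain pandas. This is not the epydemiology implementation, only an illustration under stated assumptions: the helper name `expand_summary_table` and its arguments are hypothetical, and the sketch places all cases before all controls, whereas the row order produced by the library may differ.

```python
import pandas as pd

def expand_summary_table(summary, keep_cols, succ_col, fail_col, outcome_col="outcome"):
    """Hypothetical helper: return one row per counted observation."""
    parts = []
    for count_col, value in ((succ_col, 1), (fail_col, 0)):
        counts = summary[count_col].to_numpy()
        # Repeat each row's retained columns once per counted observation
        block = summary.loc[summary.index.repeat(counts), keep_cols].copy()
        block[outcome_col] = value
        parts.append(block)
    # Stack cases on top of controls and renumber the index
    return pd.concat(parts, ignore_index=True)

# Small made-up table so that the full result is easy to inspect
summary = pd.DataFrame({"year": [2010, 2011], "cases": [2, 1], "controls": [3, 2]})
binary = expand_summary_table(summary, ["year"], "cases", "controls")
print(binary)
```

The count columns are left out of `keep_cols` here, which mirrors the way the function is described as excluding them from the returned dataframe.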

Function parameters

phjDF

The name of the Pandas dataframe that contains the summary table information.
phjVarsToIncludeList

A list of variables to include or retain in the returned dataframe. The variables containing 'case' and 'control' data may or may not be included; the function automatically excludes these variables, if present, when generating the new dataframe.
phjSuccVarName (default = None)

The name of the variable containing the 'case' or 'success' data, which will be coded as 1. Only 2 of the 3 variables (successes, failures and total) need to be entered – the missing variable (if required) will be calculated (assuming successes + failures = total).
phjFailVarName (default = None)

The name of the variable containing the 'control' or 'failure' data, which will be coded as 0. Only 2 of the 3 variables (successes, failures and total) need to be entered – the missing variable (if required) will be calculated.
phjTotalVarName (default = None)

The name of the variable containing the 'total' number of observations (i.e. successes + failures). Only 2 of the 3 variables (successes, failures and total) need to be entered – the missing variable (if required) will be calculated.
phjOutcomeVarName (default = 'outcome')

The name of the new variable that will contain the outcome data (i.e. the 0 and 1 data).
phjPrintResults (default = False)

Intended to print intermediate results. Not used in this function.
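
The rule that only two of the three count variables need to be supplied can be illustrated with a short sketch. The helper `fill_missing_count` and the default column names `successes`, `failures` and `total` are hypothetical and are not part of epydemiology. The sketch raises a `ValueError` for brevity, whereas the function itself is documented as raising an `AssertionError`.

```python
import pandas as pd

def fill_missing_count(df, succ=None, fail=None, total=None):
    """Hypothetical helper: derive the one missing count column from the other two."""
    given = [name for name in (succ, fail, total) if name is not None]
    if len(given) < 2:
        raise ValueError("At least two of succ, fail and total must be named")
    out = df.copy()
    # Relies on the stated assumption: successes + failures = total
    if succ is None:
        succ = "successes"
        out[succ] = out[total] - out[fail]
    elif fail is None:
        fail = "failures"
        out[fail] = out[total] - out[succ]
    elif total is None:
        total = "total"
        out[total] = out[succ] + out[fail]
    return out, (succ, fail, total)

# Cases and totals are supplied; the number of controls is derived
table = pd.DataFrame({"year": [2010, 2011], "cases": [23, 34], "total": [1046, 1277]})
filled, names = fill_missing_count(table, succ="cases", total="total")
print(filled)
```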

Exceptions raised

An AssertionError is raised if the parameters passed are incorrect.

Returns

Pandas dataframe containing binary outcome data.

Other notes

None.

Example

# Generate the dataframe used in the original description of the function
df = pd.DataFrame({'year':[2010,2011,2012,2013,2014],
                   'cases':[23,34,41,57,62],
                   'controls':[1023,1243,1145,2017,1876],
                   'comment':['Small number of cases',
                              'Proportion increase',
                              'Trend continues',
                              'Decreased proportion',
                              'Increased again']})

# Reorder the columns a little
df = df[['year','cases','controls','comment']]

# Convert to dataframe containing binary outcome data
newDF = epy.phjSummaryTableToBinaryOutcomes(phjDF = df,
                                            phjVarsToIncludeList = ['year','cases','controls'],
                                            phjSuccVarName = 'cases',
                                            phjFailVarName = 'controls',
                                            phjTotalVarName = None,
                                            phjOutcomeVarName = 'outcome',
                                            phjPrintResults = False)

# Print results
print('Original table of summary results\n')
print(df)

print('\n')

print('Dataframe of binary outcomes\n')
with pd.option_context('display.max_rows',6, 'display.max_columns',2):
    print(newDF)

Produces the following output:

Original table of summary results

   year  cases  controls                comment
0  2010     23      1023  Small number of cases
1  2011     34      1243    Proportion increase
2  2012     41      1145        Trend continues
3  2013     57      2017   Decreased proportion
4  2014     62      1876        Increased again


Dataframe of binary outcomes

      year  outcome
0     2010        1
1     2010        1
2     2010        1
...    ...      ...
7518  2014        0
7519  2014        0
7520  2014        0

[7521 rows x 2 columns]
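
As a sanity check, a dataframe of binary outcomes can be collapsed back to yearly counts and compared with the original summary table. The sketch below is an illustration only: to stay runnable without epydemiology installed, it builds a stand-in for `newDF` directly with `np.repeat`, and the variable names `binary` and `check` are assumptions.

```python
import numpy as np
import pandas as pd

years = [2010, 2011, 2012, 2013, 2014]
cases = [23, 34, 41, 57, 62]
controls = [1023, 1243, 1145, 2017, 1876]

# Stand-in for newDF: all cases first, then all controls (row order is an assumption)
binary = pd.DataFrame({
    "year": np.repeat(years + years, cases + controls),
    "outcome": np.repeat([1, 0], [sum(cases), sum(controls)]),
})

# Collapse back to one row per year and recover the counts
check = (binary.groupby("year")["outcome"]
               .agg(cases="sum", total="count")
               .reset_index())
check["controls"] = check["total"] - check["cases"]

print(len(binary))  # 7521, matching the row count reported above
print(check)
```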