Convert a disease summary table to a dataframe of binary outcomes

phjSummaryTableToBinaryOutcomes()

import numpy as np
import pandas as pd
import epydemiology as epy

df = epy.phjSummaryTableToBinaryOutcomes(phjDF,
                                         phjVarsToIncludeList,
                                         phjSuccVarName = None,
                                         phjFailVarName = None,
                                         phjTotalVarName = None,
                                         phjOutcomeVarName = 'outcome',
                                         phjPrintResults = False)

Description

This function converts a table containing summary count data (e.g. number of cases of disease per year) to a dataframe containing binary outcome data, with one row per individual. This is useful when fitting a logistic regression model with a function that has no option for frequency weights.

As an example, the following table contains entirely hypothetical data showing the number of cases of disease (and unaffected controls) per year, together with one additional, unrelated variable (any number of such variables may be present).

year  cases  controls  comment
2010     23      1023  Small number of cases
2011     34      1243  Proportion increase
2012     41      1145  Trend continues
2013     57      2017  Decreased proportion
2014     62      1876  Increased again

This function converts the above table (passed as a dataframe) to a new dataframe containing one row per individual, where cases are coded as 1 and controls as 0, as shown below. Any number of additional variables can be included; their values are simply repeated on every row derived from the same row of the summary table.

      year  outcome
0     2010        1
1     2010        1
2     2010        1
...    ...      ...
7518  2014        0
7519  2014        0
7520  2014        0
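
The expansion described above can be sketched in plain pandas. This is only an illustration of the idea, not the package's actual implementation: the helper name summary_to_binary and its arguments are hypothetical, and the row order it produces (all cases first, then all controls) may differ from the order returned by phjSummaryTableToBinaryOutcomes().

```python
import pandas as pd

def summary_to_binary(df, keep_vars, succ_var, fail_var, outcome_var='outcome'):
    """Hypothetical helper: expand count columns to one row per individual."""
    # Use a clean 0..n-1 index so that repeated labels select rows unambiguously
    df = df.reset_index(drop=True)
    # The count columns themselves are not carried into the result
    keep = [v for v in keep_vars if v not in (succ_var, fail_var)]
    pieces = []
    for count_var, code in ((succ_var, 1), (fail_var, 0)):
        # Repeat each summary row as many times as its count
        rows = df.index.repeat(df[count_var].to_numpy())
        block = df.loc[rows, keep].copy()
        block[outcome_var] = code
        pieces.append(block)
    return pd.concat(pieces, ignore_index=True)

# Small made-up summary table: 3 cases and 7 controls in total
summary = pd.DataFrame({'year': [2010, 2011],
                        'cases': [2, 1],
                        'controls': [3, 4]})
binary = summary_to_binary(summary, ['year', 'cases', 'controls'], 'cases', 'controls')
print(binary)
```

The result has one row per individual (10 rows here), and the sum of the outcome column equals the total number of cases.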

Function parameters

  1. phjDF

    The name of the Pandas dataframe that contains the summary table information.

  2. phjVarsToIncludeList

    A list of variables to include or retain in the returned dataframe. The variables containing 'case' and 'control' data may, or may not, be included; the function automatically excludes these variables, if present, when generating the new dataframe.

  3. phjSuccVarName (default = None)

    The name of the variable containing the 'case' or 'success' data, which will be coded as 1. Only 2 of the 3 variables (successes, failures and total) need to be entered – the missing variable (if required) will be calculated (assuming successes + failures = total).

  4. phjFailVarName (default = None)

    The name of the variable containing the 'control' or 'failure' data, which will be coded as 0. Only 2 of the 3 variables (successes, failures and total) need to be entered – the missing variable (if required) will be calculated.

  5. phjTotalVarName (default = None)

    The name of the variable containing the 'total' number of observations (i.e. successes + failures). Only 2 of the 3 variables (successes, failures and total) need to be entered – the missing variable (if required) will be calculated.

  6. phjOutcomeVarName (default = 'outcome')

    The name of the new variable that will contain the outcome data (i.e. the 0 and 1 data).

  7. phjPrintResults (default = False)

    Intended to control printing of intermediate results. Not currently used in this function.
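
The "only 2 of the 3 variables" rule for parameters 3 to 5 relies on successes + failures = total. The following minimal sketch shows how a missing count can be derived from the other two. The helper complete_counts is hypothetical and is not part of epydemiology; the column name 'total' is likewise made up for illustration.

```python
import pandas as pd

def complete_counts(df, succ_var=None, fail_var=None, total_var=None):
    """Hypothetical helper: return (successes, failures) from any two of the three names."""
    given = [v for v in (succ_var, fail_var, total_var) if v is not None]
    assert len(given) >= 2, 'At least two of the three count variables are needed.'
    # successes = total - failures, if successes were not supplied
    succ = df[succ_var] if succ_var is not None else df[total_var] - df[fail_var]
    # failures = total - successes, if failures were not supplied
    fail = df[fail_var] if fail_var is not None else df[total_var] - df[succ_var]
    assert (succ >= 0).all() and (fail >= 0).all(), 'Counts must be non-negative.'
    return succ, fail

# First two years of the example table, with totals given instead of controls
summary = pd.DataFrame({'year': [2010, 2011],
                        'cases': [23, 34],
                        'total': [1046, 1277]})
succ, fail = complete_counts(summary, succ_var='cases', total_var='total')
print(fail.tolist())
```

With the real function, the equivalent would be to pass phjSuccVarName and phjTotalVarName and leave phjFailVarName as None.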

Exceptions raised

An AssertionError is raised if the parameters passed are incorrect.

Returns

Pandas dataframe containing binary outcome data.

Other notes

None.

Example

# Generate the dataframe used in the original description of the function
df = pd.DataFrame({'year':[2010,2011,2012,2013,2014],
                   'cases':[23,34,41,57,62],
                   'controls':[1023,1243,1145,2017,1876],
                   'comment':['Small number of cases',
                              'Proportion increase',
                              'Trend continues',
                              'Decreased proportion',
                              'Increased again']})

# Reorder the columns a little
df = df[['year','cases','controls','comment']]

# Convert to dataframe containing binary outcome data
newDF = epy.phjSummaryTableToBinaryOutcomes(phjDF = df,
                                            phjVarsToIncludeList = ['year','cases','controls'],
                                            phjSuccVarName = 'cases',
                                            phjFailVarName = 'controls',
                                            phjTotalVarName = None,
                                            phjOutcomeVarName = 'outcome',
                                            phjPrintResults = False)

# Print results
print('Original table of summary results\n')
print(df)

print('\n')

print('Dataframe of binary outcomes\n')
with pd.option_context('display.max_rows',6, 'display.max_columns',2):
    print(newDF)

Produces the following output:

Original table of summary results

   year  cases  controls                comment
0  2010     23      1023  Small number of cases
1  2011     34      1243    Proportion increase
2  2012     41      1145        Trend continues
3  2013     57      2017   Decreased proportion
4  2014     62      1876        Increased again


Dataframe of binary outcomes

      year  outcome
0     2010        1
1     2010        1
2     2010        1
...    ...      ...
7518  2014        0
7519  2014        0
7520  2014        0

[7521 rows x 2 columns]
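
The row count can be checked by hand: the example table has 217 cases and 7304 controls in total, giving 7521 rows. As a further sanity check, a binary-outcome dataframe can be collapsed back to counts with a groupby. The sketch below uses a small made-up dataframe as a stand-in for newDF, and assumes the default column name 'outcome'.

```python
import pandas as pd

# Stand-in for newDF: 2 cases and 3 controls in 2010, 1 case and 4 controls in 2011
binary = pd.DataFrame({'year': [2010] * 5 + [2011] * 5,
                       'outcome': [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]})

# Sum of the 0/1 outcome gives the cases; the row count gives the total
check = (binary.groupby('year')['outcome']
               .agg(cases='sum', total='count')
               .reset_index())
check['controls'] = check['total'] - check['cases']
print(check)
```

Applied to the real output, the per-year cases and controls should match the original summary table.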