Calculate and plot proportions - lvphj/epydemiology GitHub Wiki

Python functions to calculate binomial and multinomial proportions using data stored in a Pandas dataframe.

phjCalculateBinomialProportions()

df = epy.phjCalculateBinomialProportions(phjDF,
                                         phjColumnsList,
                                         phjSuccess = 'yes',
                                         phjGroupVarName = None,
                                         phjMissingValue = 'missing',
                                         phjBinomialConfIntMethod = 'normal',
                                         phjAlpha = 0.05,
                                         phjPlotProportions = True,
                                         phjGroupsToPlotList = None,
                                         phjSortProportions = False,
                                         phjGraphTitle = None,
                                         phjPrintResults = False)

phjCalculateMultinomialProportions()

df = epy.phjCalculateMultinomialProportions(phjDF,
                                            phjCategoryVarName,
                                            phjGroupVarName = None,
                                            phjMissingValue = 'missing',
                                            phjMultinomialConfIntMethod = 'goodman',
                                            phjAlpha = 0.05,
                                            phjPlotRelFreq = True,
                                            phjCategoriesToPlotList = 'all',
                                            phjGroupsToPlotList = None,   # Currently not implemented
                                            phjGraphTitle = None,
                                            phjPrintResults = False)

Description

The above two functions – phjCalculateBinomialProportions() and phjCalculateMultinomialProportions() – are closely related and will be discussed and described together.

The functions can be used to rapidly summarise and visualise two commonly encountered (at least, in my research) types of data. The first type consists of rows of records (representing individuals) and a series of binomial (dummy-esque) variables indicating whether a characteristic is present or absent (see below). These are not true dummy variables because the categories are not necessarily mutually exclusive and each variable is considered as an individual characteristic. The confidence intervals for each category are calculated as individual binomial intervals (using StatsModels functions).

The second data structure consists of rows of data (representing individuals) and a single variable which contains numerous categories. In this case, all the categories are mutually exclusive. The proportions (or relative frequencies) are calculated for each category level and the confidence intervals are calculated as multinomial intervals (using StatsModels functions).

The series of binomial data might take the form shown on the left whilst the multinomial dataset might take the form shown on the right below:

Binomial data structure                                   Multinomial data structure
------------------------------------------------          ------------------------------
| id |    group  |       A |       B |       C |          | id |    group  |  category |
|----|-----------|---------|---------|---------|          |----|-----------|-----------|
|  1 |     case  |     yes |      no |     yes |          |  1 |     case  |    np.nan |
|  2 |     case  |     yes |  np.nan |     yes |          |  2 |     case  |   spaniel |
|  3 |  control  |      no | missing |     yes |          |  3 |     case  |   missing |
|  4 |     case  |      no |     yes |  np.nan |          |  4 |  control  |   terrier |
|  5 |  control  |      no |     yes |      no |          |  5 |  control  |    collie |
|  6 |  control  |      no |     yes |     yes |          |  6 |     case  |  labrador |
|  7 |     case  |      no |     yes |     yes |          |  7 |     case  |  labrador |
|  8 |     case  |     yes |      no |     yes |          |  8 |     case  |    collie |
|  9 |  control  | missing |      no |      no |          |  9 |  control  |   spaniel |
| 10 |     case  |     yes |      no |      no |          | 10 |  control  |   spaniel |
------------------------------------------------          | 11 |  control  |  labrador |
                                                          | 12 |  control  |    collie |
                                                          | 13 |     case  |   terrier |
                                                          | 14 |     case  |   terrier |
                                                          | 15 |     case  |   terrier |
                                                          | 16 |  control  |    collie |
                                                          | 17 |  control  |  labrador |
                                                          | 18 |  control  |  labrador |
                                                          | 19 |  control  |  labrador |
                                                          | 20 |     case  |   spaniel |
                                                          | 21 |     case  |   spaniel |
                                                          | 22 |     case  |    collie |
                                                          | 23 |     case  |    collie |
                                                          | 24 |     case  |    collie |
                                                          | 25 |   np.nan  |   terrier |
                                                          | 26 |   np.nan  |   spaniel |
                                                          ------------------------------

In both datasets, missing values can be entered either as np.nan or as a missing value string such as 'missing' which is then defined when the function is called.
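As a rough illustration of what treating the two kinds of missing value alike involves, the missing-value string can be converted to np.nan so that Pandas excludes it from counts. This is only a sketch: the internal handling used by the functions is not shown on this page and may differ, and the dataframe and variable names below are made up for the example.

```python
import numpy as np
import pandas as pd

# Small illustrative column containing both kinds of missing value
exampleDF = pd.DataFrame({'B': ['no', np.nan, 'missing', 'yes', 'yes']})

# Convert the missing-value string to np.nan so both forms are treated alike
cleanedDF = exampleDF.replace('missing', np.nan)

# count() ignores np.nan, so this is the number of usable observations
nObs = int(cleanedDF['B'].count())
print(nObs)
```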

These example datasets can be produced using the following Python code:

import numpy as np
import pandas as pd

binomDataDF = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9,10],
                            'group':['case','case','control','case','control','control','case','case','control','case'],
                            'A':['yes','yes','no','no','no','no','no','yes','missing','yes'],
                            'B':['no',np.nan,'missing','yes','yes','yes','yes','no','no','no'],
                            'C':['yes','yes','yes',np.nan,'no','yes','yes','yes','no','no']})

multinomDataDF = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26],
                               'group':['case','case','case','control','control','case','case','case','control','control','control','control','case','case','case','control','control','control','control','case','case','case','case','case',np.nan,np.nan],
                               'category':[np.nan,'spaniel','missing','terrier','collie','labrador','labrador','collie','spaniel','spaniel','labrador','collie','terrier','terrier','terrier','collie','labrador','labrador','labrador','spaniel','spaniel','collie','collie','collie','terrier','spaniel']})

The output summary tables in each case would be very similar:

    Summary table for binomial proportions
    
    |-----|---------------|-----------------|---------------|---------------|---------------|---------------|
    |     |  case_success | control_success |    case_total | control_total |     case_prop |  control_prop |
    |-----|---------------|-----------------|---------------|---------------|---------------|---------------|
    |   A |               |                 |               |               |               |               |
    |-----|---------------|-----------------|---------------|---------------|---------------|---------------|
    |   B |               |                 |               |               |               |               |
    |-----|---------------|-----------------|---------------|---------------|---------------|---------------|
    |   C |               |                 |               |               |               |               |
    |-----|---------------|-----------------|---------------|---------------|---------------|---------------|
    
    * The 'success' columns give the number of 'successes' in each variable.
      The 'total' columns give the total number of rows (missing values excluded) for each variable.
      The 'prop' columns give the proportion of successes.


    Summary table for multinomial proportions
    
    |---------|---------------|---------------|---------------|---------------|
    |         |    case_count | control_count |     case_prop |  control_prop |
    |---------|---------------|---------------|---------------|---------------|
    | spaniel |               |               |               |               |
    |---------|---------------|---------------|---------------|---------------|
    | terrier |               |               |               |               |
    |---------|---------------|---------------|---------------|---------------|
    | labrador|               |               |               |               |
    |---------|---------------|---------------|---------------|---------------|
    | collie  |               |               |               |               |
    |---------|---------------|---------------|---------------|---------------|
    
    * The 'count' columns give the absolute counts.
      The 'prop' columns give the proportion of the total.

The confidence intervals (either binomial or multinomial) are added to the table as separate columns containing lower and upper limits.
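The count and proportion columns described above can be reproduced with plain Pandas, which may help when checking what the summary tables contain. The following is a minimal sketch using the example datasets from this page (without the id column); it is not the code used inside the functions, and the intermediate variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Example datasets as defined earlier on this page (id column omitted)
binomDataDF = pd.DataFrame({'group':['case','case','control','case','control','control','case','case','control','case'],
                            'A':['yes','yes','no','no','no','no','no','yes','missing','yes'],
                            'B':['no',np.nan,'missing','yes','yes','yes','yes','no','no','no'],
                            'C':['yes','yes','yes',np.nan,'no','yes','yes','yes','no','no']})

multinomDataDF = pd.DataFrame({'group':['case','case','case','control','control','case','case','case','control','control','control','control','case','case','case','control','control','control','control','case','case','case','case','case',np.nan,np.nan],
                               'category':[np.nan,'spaniel','missing','terrier','collie','labrador','labrador','collie','spaniel','spaniel','labrador','collie','terrier','terrier','terrier','collie','labrador','labrador','labrador','spaniel','spaniel','collie','collie','collie','terrier','spaniel']})

# Treat the missing-value string in the same way as np.nan
binomDF = binomDataDF.replace('missing', np.nan)
multinomDF = multinomDataDF.replace('missing', np.nan)

# Binomial summary: successes, non-missing totals and proportions per group
cols = ['A','B','C']
successDF = binomDF[cols].eq('yes').groupby(binomDF['group']).sum().T
totalDF = binomDF.groupby('group')[cols].count().T
propDF = successDF / totalDF

# Multinomial summary: counts per category and group (rows with a missing
# category or group are dropped by crosstab), then relative frequencies
countDF = pd.crosstab(multinomDF['category'], multinomDF['group'])
relFreqDF = countDF / countDF.sum()

print(propDF)
print(relFreqDF)
```

Each binomial variable has its own denominator (the number of non-missing rows for that variable), whereas the multinomial proportions within a group share a single denominator and sum to one.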

And the data would be plotted in a similar fashion (although the method used to calculate the error bars would be different).

     R  |           |-|                                 |           |-|             
     e  |           |/|-|                               |           |/|-|           |/| case
     l  |     |-|   |/| |           |-|               P |     |-|   |/| |           
        |     | |   |/| |           |/|-|             r |     | |   |/| |           | | control
     F  |   |-| |   |/| |     |-|   |/| |       OR    o |   |-| |   |/| |     |-|   
     r  |   |/| |   |/| |   |-| |   |/| |             p |   |/| |   |/| |   |-| |   
     e  |   |/| |   |/| |   |/| |   |/| |               |   |/| |   |/| |   |/| |   
     q  |-----------------------------------            |---------------------------
             spn     ter     lab     col                      A       B       C
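Because the confidence limits are generally not symmetric about the proportion, the error bars need separate lower and upper lengths. The sketch below shows one way these lengths might be derived from a proportion and its limits; the numbers are illustrative and the variable names are hypothetical. The commented-out call relies on Matplotlib accepting a two-row yerr (lower lengths first, then upper lengths); how the functions themselves draw the plot is not shown on this page.

```python
# Illustrative proportions and confidence limits for three variables
props = {'A': 0.666667, 'B': 0.400000, 'C': 0.800000}
lowerLimits = {'A': 0.289471, 'B': 0.000000, 'C': 0.449391}
upperLimits = {'A': 1.000000, 'B': 0.829407, 'C': 1.000000}

labels = list(props)

# Error bar lengths are distances from the proportion to each limit
lowerLengths = [props[k] - lowerLimits[k] for k in labels]
upperLengths = [upperLimits[k] - props[k] for k in labels]

# Two rows: lower lengths first, then upper lengths
yerr = [lowerLengths, upperLengths]
print(yerr)

# With Matplotlib installed, the bars could then be drawn with, for example:
#   import matplotlib.pyplot as plt
#   plt.bar(labels, [props[k] for k in labels], yerr=yerr, capsize=4)
#   plt.show()
```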

Function parameters

phjCalculateBinomialProportions() function

The phjCalculateBinomialProportions() function takes the following parameters:

  1. phjDF

    The Pandas dataframe containing the data to be analysed. The dataframe does not need to be sliced before use because the data columns that need to be used are defined in the function arguments.

  2. phjColumnsList

    A list of the columns that need to be analysed. Each of these columns should be a binary variable and should contain only binary data. Missing values (either in the form of a specified missing value or an np.nan value) will be removed before analysis.

  3. phjSuccess (default = 'yes')

    The text string or value that is used to indicate a positive value or a 'success'. The default assumes that data will be coded as 'yes' or 'no'.

  4. phjGroupVarName (default = None)

    It is likely that some analyses will need to summarise data over two distinct categories (e.g. 'case' and 'control' data may be summarised separately). This variable should contain the column heading for the variable that defines group membership. The default is None. If phjGroupVarName is None, the whole dataset is analysed as a single group.

  5. phjMissingValue (default = 'missing')

    The text string or value that indicates a missing value. Values matching this string, together with np.nan values, are excluded from the analysis.

  6. phjBinomialConfIntMethod (default = 'normal')

    This argument defines the method to be used to calculate the binomial confidence intervals. The options available are those that can be handled by the statsmodels.stats.proportion.proportion_confint() function. The default is 'normal' but the full list of options (taken from the statsmodels website) is:

    1. normal : asymptotic normal approximation
    2. agresti_coull : Agresti-Coull interval
    3. beta : Clopper-Pearson interval based on Beta distribution
    4. wilson : Wilson Score interval
    5. jeffreys : Jeffreys Bayesian Interval
    6. binom_test : experimental, inversion of binom_test

  7. phjAlpha (default = 0.05)

    The desired value for alpha; the default is 0.05 (which leads to the calculation of 95% confidence intervals).

  8. phjPlotProportions (default = True)

    Determines whether a bar chart of proportions (with error bars) should be plotted.

  9. phjGroupsToPlotList (default = None; other options are a list of group names or 'all'.)

    The data may be calculated for numerous groups but it may not be desired for the plot to display all groups. This argument is a list of groups which should be displayed in the plot.

  10. phjSortProportions (default = False)

    If no group variable is given (phjGroupVarName = None) or only a single group is named to be plotted in phjGroupsToPlotList, this argument indicates whether the columns should be sorted. The default is False but other options are 'asc' or 'desc'.

  11. phjGraphTitle (default = None)

    The title of the graph.

  12. phjPrintResults (default = False)

    Indicates whether the results should be printed to screen as the function progresses.
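To make the default 'normal' option for phjBinomialConfIntMethod more concrete, the asymptotic normal (Wald) interval can be written out by hand. The sketch below uses only the Python standard library rather than statsmodels, so it is an illustration of the method and not the code the function calls. The helper name waldInterval is hypothetical, and clipping the limits to the range 0 to 1 is an assumption.

```python
from math import sqrt
from statistics import NormalDist

def waldInterval(successes, nObs, alpha=0.05):
    """Asymptotic normal (Wald) interval for a binomial proportion."""
    p = successes / nObs
    # Two-sided critical value, e.g. about 1.96 when alpha is 0.05
    z = NormalDist().inv_cdf(1 - alpha / 2)
    halfWidth = z * sqrt(p * (1 - p) / nObs)
    # Assumption: limits are clipped so they stay within [0, 1]
    return max(0.0, p - halfWidth), min(1.0, p + halfWidth)

# 4 successes out of 6 non-missing observations
lowerLimit, upperLimit = waldInterval(4, 6)
print(lowerLimit, upperLimit)
```

With small samples such as this one the normal approximation can overshoot 0 or 1, which is one reason to consider the 'wilson' or 'beta' options instead.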

phjCalculateMultinomialProportions() function

The phjCalculateMultinomialProportions() function takes the following parameters:

  1. phjDF

    The Pandas dataframe containing the data to be analysed. The dataframe does not need to be sliced before use because the data columns that need to be used are defined in the function arguments.

  2. phjCategoryVarName

    The name of the column that defines the category variable over which the multinomial confidence intervals are calculated.

  3. phjGroupVarName (default = None)

    It is likely that some analyses will need to summarise data over two distinct categories (e.g. 'case' and 'control' data may be summarised separately). This variable should contain the column heading for the variable that defines group membership. The default is None. If phjGroupVarName is None, the whole dataset is analysed as a single group.

  4. phjMissingValue (default = 'missing')

    The text string or value that indicates a missing value. Values matching this string, together with np.nan values, are excluded from the analysis.

  5. phjMultinomialConfIntMethod (default = 'goodman')

    This argument defines the method to be used to calculate the multinomial confidence intervals. The options available are those that can be handled by the statsmodels.stats.proportion.multinomial_proportions_confint() function. The default is 'goodman' but the full list of options (taken from the statsmodels website) is:

    1. goodman: based on a chi-squared approximation, valid if all values in counts are greater or equal to 5
    2. sison-glaz: less conservative than goodman, but only valid if counts has 7 or more categories (len(counts) >= 7)

  6. phjAlpha (default = 0.05)

    The desired value for alpha; the default is 0.05 (which leads to the calculation of 95% confidence intervals).

  7. phjPlotRelFreq (default = True)

    Determines whether a bar chart of proportions (with error bars) is plotted.

  8. phjCategoriesToPlotList (default = 'all')

    A list of the category names that should be plotted on the bar chart.

  9. phjGroupsToPlotList (default = None; currently not implemented)

    The data may be calculated for numerous groups but it may not be desired for the plot to display all groups. This argument is a list of groups which should be displayed in the plot.

  10. phjGraphTitle (default = None)

    The title of the graph.

  11. phjPrintResults (default = False)

    Indicates whether the results should be printed to screen as the function progresses.
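The default 'goodman' option for phjMultinomialConfIntMethod can also be written out by hand to show what is being calculated. The sketch below uses only the Python standard library and follows the usual form of Goodman's simultaneous intervals, with a Bonferroni-style chi-squared quantile on one degree of freedom. It is an illustration rather than the statsmodels implementation, and the helper name goodmanIntervals is hypothetical. The counts are those of the 'case' group in the example dataset on this page.

```python
from math import sqrt
from statistics import NormalDist

def goodmanIntervals(counts, alpha=0.05):
    """Goodman simultaneous confidence intervals for multinomial proportions."""
    k = len(counts)
    n = sum(counts.values())
    # The (1 - alpha/k) quantile of chi-squared with 1 d.f. equals the square
    # of the (1 - alpha/(2k)) quantile of the standard normal distribution
    chi2 = NormalDist().inv_cdf(1 - alpha / (2 * k)) ** 2
    intervals = {}
    for category, count in counts.items():
        p = count / n
        delta = chi2 ** 2 + 4 * n * p * chi2 * (1 - p)
        lower = (2 * n * p + chi2 - sqrt(delta)) / (2 * (chi2 + n))
        upper = (2 * n * p + chi2 + sqrt(delta)) / (2 * (chi2 + n))
        intervals[category] = (lower, upper)
    return intervals

# Counts for the 'case' group (missing categories already excluded)
caseCounts = {'spaniel': 3, 'terrier': 3, 'collie': 4, 'labrador': 2}
caseCI = goodmanIntervals(caseCounts)
for category, (lower, upper) in caseCI.items():
    print(category, round(lower, 6), round(upper, 6))
```

Note that these counts are all below 5, so the chi-squared approximation behind the Goodman method is not strictly valid here; the example is intended only to show the arithmetic.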

Exceptions raised

None

Returns

These functions both return Pandas dataframes containing a table of proportions and confidence intervals.

Other notes

None

Example

Examples of the functions in use are given below.

Example of calculating binomial proportions (using phjCalculateBinomialProportions() function)

import numpy as np
import pandas as pd
import epydemiology as epy

phjTempDF = pd.DataFrame({'group':['g1','g1','g2','g1','g2','g2','g1','g1','g2','g1'],
                          'A':['yes','yes','no','no','no','no','no','yes',np.nan,'yes'],
                          'B':['no',np.nan,np.nan,'yes','yes','yes','yes','no','no','no'],
                          'C':['yes','yes','yes',np.nan,'no','yes','yes','yes','no','no']})

print(phjTempDF)
print('\n')

phjPropDF = epy.phjCalculateBinomialProportions(phjDF = phjTempDF,
                                                phjColumnsList = ['A','B','C'],
                                                phjSuccess = 'yes',
                                                phjGroupVarName = 'group',
                                                phjMissingValue = 'missing',
                                                phjBinomialConfIntMethod = 'normal',
                                                phjAlpha = 0.05,
                                                phjPlotProportions = True,
                                                phjGroupsToPlotList = 'all',
                                                phjSortProportions = True,
                                                phjGraphTitle = None,
                                                phjPrintResults = False)

print(phjPropDF)

This produces the following results:

  group    A    B    C
0    g1  yes   no  yes
1    g1  yes  NaN  yes
2    g2   no  NaN  yes
3    g1   no  yes  NaN
4    g2   no  yes   no
5    g2   no  yes  yes
6    g1   no  yes  yes
7    g1  yes   no  yes
8    g2  NaN   no   no
9    g1  yes   no   no


   g1_95CI_lint  g1_95CI_llim  g1_95CI_uint  g1_95CI_ulim  g1_obs   g1_prop  \
A      0.377195      0.289471      0.333333      1.000000       6  0.666667   
B      0.400000      0.000000      0.429407      0.829407       5  0.400000   
C      0.350609      0.449391      0.200000      1.000000       5  0.800000   

   g1_success  g2_95CI_lint  g2_95CI_llim  g2_95CI_uint  g2_95CI_ulim  g2_obs  \
A           4      0.000000      0.000000      0.000000      0.000000       3   
B           2      0.533435      0.133232      0.333333      1.000000       3   
C           4      0.489991      0.010009      0.489991      0.989991       4   

    g2_prop  g2_success  
A  0.000000           0  
B  0.666667           2  
C  0.500000           2  

Example of calculating multinomial proportions (using phjCalculateMultinomialProportions() function)

phjTempDF = pd.DataFrame({'group':['case','case','case','control','control','case','case','case','control','control','control','control','case','case','case','control','control','control','control','case','case','case','case','case',np.nan,np.nan],
                          'category':[np.nan,'spaniel','missing','terrier','collie','labrador','labrador','collie','spaniel','spaniel','labrador','collie','terrier','terrier','terrier','collie','labrador','labrador','labrador','spaniel','spaniel','collie','collie','collie','terrier','spaniel'],
                          'catint':[1,2,3,2,3,2,1,2,1,2,3,2,3,2,3,1,2,3,2,3,2,3,2,3,1,2]})

print(phjTempDF)
print('\n')

phjRelFreqDF = epy.phjCalculateMultinomialProportions(phjDF = phjTempDF,
                                                      phjCategoryVarName = 'category',
                                                      phjGroupVarName = 'group',
                                                      phjMissingValue = 'missing',
                                                      phjMultinomialConfIntMethod = 'goodman',
                                                      phjAlpha = 0.05,
                                                      phjPlotRelFreq = True,
                                                      phjCategoriesToPlotList = 'all',
                                                      phjGroupsToPlotList = 'all',   # Currently not implemented
                                                      phjGraphTitle = 'Relative frequencies (Goodman CI)',
                                                      phjPrintResults = True)

print(phjRelFreqDF)

This produces the following results:

      group  category  catint
0      case       NaN       1
1      case   spaniel       2
2      case   missing       3
3   control   terrier       2
4   control    collie       3
5      case  labrador       2
6      case  labrador       1
7      case    collie       2
8   control   spaniel       1
9   control   spaniel       2
10  control  labrador       3
11  control    collie       2
12     case   terrier       3
13     case   terrier       2
14     case   terrier       3
15  control    collie       1
16  control  labrador       2
17  control  labrador       3
18  control  labrador       2
19     case   spaniel       3
20     case   spaniel       2
21     case    collie       3
22     case    collie       2
23     case    collie       3
24      NaN   terrier       1
25      NaN   spaniel       2



Category levels:  ['spaniel', 'terrier', 'collie', 'labrador']
Group levels:  ['case', 'control'] 

          case_count  control_count  case_prop  control_prop  case_95CI_llim  \
spaniel            3              2   0.250000           0.2        0.068217   
terrier            3              1   0.250000           0.1        0.068217   
collie             4              3   0.333333           0.3        0.108808   
labrador           2              4   0.166667           0.4        0.034702   

          case_95CI_ulim  control_95CI_llim  control_95CI_ulim  
spaniel         0.602809           0.041845           0.588663  
terrier         0.602809           0.012443           0.494901  
collie          0.671876           0.082588           0.671084  
labrador        0.526666           0.132347           0.744489  
          case_count  control_count  case_prop  control_prop  case_95CI_llim  \
spaniel            3              2   0.250000           0.2        0.068217   
terrier            3              1   0.250000           0.1        0.068217   
collie             4              3   0.333333           0.3        0.108808   
labrador           2              4   0.166667           0.4        0.034702   

          case_95CI_ulim  control_95CI_llim  control_95CI_ulim  
spaniel         0.602809           0.041845           0.588663  
terrier         0.602809           0.012443           0.494901  
collie          0.671876           0.082588           0.671084  
labrador        0.526666           0.132347           0.744489