Calculate and plot proportions - lvphj/epydemiology GitHub Wiki
Python functions to calculate binomial and multinomial proportions using data stored in a Pandas dataframe.
phjCalculateBinomialProportions()
df = epy.phjCalculateBinomialProportions(phjDF,
phjColumnsList,
phjSuccess = 'yes',
phjGroupVarName = None,
phjMissingValue = 'missing',
phjBinomialConfIntMethod = 'normal',
phjAlpha = 0.05,
phjPlotProportions = True,
phjGroupsToPlotList = None,
phjSortProportions = False,
phjGraphTitle = None,
phjPrintResults = False)
phjCalculateMultinomialProportions()
df = epy.phjCalculateMultinomialProportions(phjDF,
phjCategoryVarName,
phjGroupVarName = None,
phjMissingValue = 'missing',
phjMultinomialConfIntMethod = 'goodman',
phjAlpha = 0.05,
phjPlotRelFreq = True,
phjCategoriesToPlotList = 'all',
phjGroupsToPlotList = None, # Currently not implemented
phjGraphTitle = None,
phjPrintResults = False)
Description
The two functions above, phjCalculateBinomialProportions() and phjCalculateMultinomialProportions(), are closely related and are described together.
The functions can be used to rapidly summarise and visualise two commonly encountered (at least, in my research) types of data. The first consists of rows of records (representing individuals) and a series of binomial (dummy-esque) variables indicating whether a characteristic is present or absent (see below). These are not true dummy variables because the categories are not necessarily mutually exclusive and each variable is treated as an individual characteristic. The confidence interval for each variable is calculated as an individual binomial interval (using StatsModels functions).
The second data structure consists of rows of data (representing individuals) and a single variable which contains numerous categories. In this case, all the categories are mutually exclusive. The proportions (or relative frequencies) are calculated for each category level and the confidence intervals are calculated as multinomial intervals (using StatsModels functions).
The series of binomial data might take the first form shown below, whilst the multinomial dataset might take the second:
Binomial data structure
------------------------------------------------
| id | group     | A       | B       | C       |
|----|-----------|---------|---------|---------|
| 1  | case      | yes     | no      | yes     |
| 2  | case      | yes     | np.nan  | yes     |
| 3  | control   | no      | missing | yes     |
| 4  | case      | no      | yes     | np.nan  |
| 5  | control   | no      | yes     | no      |
| 6  | control   | no      | yes     | yes     |
| 7  | case      | no      | yes     | yes     |
| 8  | case      | yes     | no      | yes     |
| 9  | control   | missing | no      | no      |
| 10 | case      | yes     | no      | no      |
------------------------------------------------
Multinomial data structure
------------------------------
| id | group     | category  |
|----|-----------|-----------|
| 1  | case      | np.nan    |
| 2  | case      | spaniel   |
| 3  | case      | missing   |
| 4  | control   | terrier   |
| 5  | control   | collie    |
| 6  | case      | labrador  |
| 7  | case      | labrador  |
| 8  | case      | collie    |
| 9  | control   | spaniel   |
| 10 | control   | spaniel   |
| 11 | control   | labrador  |
| 12 | control   | collie    |
| 13 | case      | terrier   |
| 14 | case      | terrier   |
| 15 | case      | terrier   |
| 16 | control   | collie    |
| 17 | control   | labrador  |
| 18 | control   | labrador  |
| 19 | control   | labrador  |
| 20 | case      | spaniel   |
| 21 | case      | spaniel   |
| 22 | case      | collie    |
| 23 | case      | collie    |
| 24 | case      | collie    |
| 25 | np.nan    | terrier   |
| 26 | np.nan    | spaniel   |
------------------------------
In both datasets, missing values can be entered either as np.nan or as a missing value string such as 'missing' which is then defined when the function is called.
These example datasets can be produced using the following Python code:
import numpy as np
import pandas as pd
binomDataDF = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9,10],
'group':['case','case','control','case','control','control','case','case','control','case'],
'A':['yes','yes','no','no','no','no','no','yes','missing','yes'],
'B':['no',np.nan,'missing','yes','yes','yes','yes','no','no','no'],
'C':['yes','yes','yes',np.nan,'no','yes','yes','yes','no','no']})
multinomDataDF = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26],
'group':['case','case','case','control','control','case','case','case','control','control','control','control','case','case','case','control','control','control','control','case','case','case','case','case',np.nan,np.nan],
'category':[np.nan,'spaniel','missing','terrier','collie','labrador','labrador','collie','spaniel','spaniel','labrador','collie','terrier','terrier','terrier','collie','labrador','labrador','labrador','spaniel','spaniel','collie','collie','collie','terrier','spaniel']})
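Because missing values may appear either as np.nan or as a marker string, one way to treat both forms alike is to convert the marker to NaN before counting. The following is a minimal sketch of that idea in plain pandas; it is not the library's internal code, and the names `df`, `cleaned` and `usable` are illustrative only.

```python
import numpy as np
import pandas as pd

# Two columns from the binomial example, mixing np.nan and the 'missing' marker.
df = pd.DataFrame({'A': ['yes', 'yes', 'no', 'no', 'no', 'no', 'no', 'yes', 'missing', 'yes'],
                   'B': ['no', np.nan, 'missing', 'yes', 'yes', 'yes', 'yes', 'no', 'no', 'no']})

# Blank out every cell equal to the marker so it is handled exactly like np.nan.
cleaned = df.mask(df.eq('missing'))

# count() skips NaN, so this is the number of usable rows in each column.
usable = cleaned.count()
print(usable)
```

With both forms converted to NaN, the usual pandas tools (count(), dropna(), notna()) can be used without special cases for the marker string.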
The output summary tables in each case would be very similar:
Summary table for binomial proportions
|-----|---------------|-----------------|---------------|---------------|---------------|---------------|
| | case_success | control_success | case_total | control_total | case_prop | control_prop |
|-----|---------------|-----------------|---------------|---------------|---------------|---------------|
| A | | | | | | |
|-----|---------------|-----------------|---------------|---------------|---------------|---------------|
| B | | | | | | |
|-----|---------------|-----------------|---------------|---------------|---------------|---------------|
| C | | | | | | |
|-----|---------------|-----------------|---------------|---------------|---------------|---------------|
* The 'success' columns give the number of 'successes' in each variable.
The 'total' columns give the total number of rows (missing values excluded) for each variable.
The 'prop' columns give the proportion of successes.
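The layout of the binomial summary table can be reproduced by hand with pandas. The sketch below is only an illustration of how the 'success', 'total' and 'prop' columns relate to the raw data; it assumes 'yes' marks a success and 'missing' marks a missing value, and it does not show how the function itself is implemented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'group': ['case', 'case', 'control', 'case', 'control',
                             'control', 'case', 'case', 'control', 'case'],
                   'A': ['yes', 'yes', 'no', 'no', 'no', 'no', 'no', 'yes', 'missing', 'yes'],
                   'B': ['no', np.nan, 'missing', 'yes', 'yes', 'yes', 'yes', 'no', 'no', 'no'],
                   'C': ['yes', 'yes', 'yes', np.nan, 'no', 'yes', 'yes', 'yes', 'no', 'no']})

cols = ['A', 'B', 'C']
values = df[cols].mask(df[cols].eq('missing'))   # marker string -> NaN

# Successes and non-missing totals per group, transposed to one row per variable.
success = values.eq('yes').groupby(df['group']).sum().T
total = values.notna().groupby(df['group']).sum().T

summary = pd.concat([success.add_suffix('_success'),
                     total.add_suffix('_total'),
                     (success / total).add_suffix('_prop')], axis=1)
print(summary)
```

Note that the totals differ between variables within a group because missing values are excluded column by column rather than row by row.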
Summary table for multinomial proportions
|---------|---------------|---------------|---------------|---------------|
| | case_count | control_count | case_prop | control_prop |
|---------|---------------|---------------|---------------|---------------|
| spaniel | | | | |
|---------|---------------|---------------|---------------|---------------|
| terrier | | | | |
|---------|---------------|---------------|---------------|---------------|
| labrador| | | | |
|---------|---------------|---------------|---------------|---------------|
| collie | | | | |
|---------|---------------|---------------|---------------|---------------|
* The 'count' columns give the absolute counts.
The 'prop' columns give the proportion of the total.
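For the multinomial case, the counts and proportions can be sketched with a pandas cross-tabulation. This is an illustrative reconstruction under the assumption that rows with a missing group or category are dropped; it is not the function's own code, and the row order differs because crosstab() sorts the categories alphabetically.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': ['case', 'case', 'case', 'control', 'control', 'case', 'case', 'case',
              'control', 'control', 'control', 'control', 'case', 'case', 'case',
              'control', 'control', 'control', 'control', 'case', 'case', 'case',
              'case', 'case', np.nan, np.nan],
    'category': [np.nan, 'spaniel', 'missing', 'terrier', 'collie', 'labrador',
                 'labrador', 'collie', 'spaniel', 'spaniel', 'labrador', 'collie',
                 'terrier', 'terrier', 'terrier', 'collie', 'labrador', 'labrador',
                 'labrador', 'spaniel', 'spaniel', 'collie', 'collie', 'collie',
                 'terrier', 'spaniel']})

# Convert the marker string to NaN, then drop rows lacking a group or category.
clean = df.mask(df.eq('missing')).dropna(subset=['group', 'category'])

# Absolute counts per category and group, then proportions within each group.
counts = pd.crosstab(clean['category'], clean['group'])
props = counts / counts.sum()

summary = pd.concat([counts.add_suffix('_count'), props.add_suffix('_prop')], axis=1)
print(summary)
```

Each 'prop' column sums to 1 because the categories are mutually exclusive, which is the property that distinguishes this table from the binomial one.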
The confidence intervals (either binomial or multinomial) are added to the table as separate columns containing lower and upper limits.
And the data would be plotted in a similar fashion (although the method used to calculate the error bars would be different).
[Figure: two schematic bar charts with error bars, with bars grouped by 'case' and 'control'. The first shows relative frequency (Rel Freq) for each category (spn, ter, lab, col); the second shows proportion (Prop) for each variable (A, B, C).]
Function parameters
phjCalculateBinomialProportions() function
The phjCalculateBinomialProportions() function takes the following parameters:
- phjDF
The Pandas dataframe containing the data to be analysed. The dataframe does not need to be sliced before use because the data columns that need to be used are defined in the function arguments.
- phjColumnsList
A list of the columns that need to be analysed. Each of these columns should be a binary variable and should contain only binary data. Missing values (either in the form of a specified missing value or an np.nan value) will be removed before analysis.
- phjSuccess (default = 'yes')
The text string or value that is used to indicate a positive value or a 'success'. The default assumes that data will be coded as 'yes' or 'no'.
- phjGroupVarName (default = None)
It is likely that some analyses will need to summarise data over distinct groups (e.g. 'case' and 'control' data may be summarised separately). This argument should contain the column heading for the variable that defines group membership. If phjGroupVarName is None, the whole dataset is analysed as a single group.
- phjMissingValue (default = 'missing')
The text string or value that indicates a missing value.
- phjBinomialConfIntMethod (default = 'normal')
This argument defines the method used to calculate the binomial confidence intervals. The available options are those accepted by the statsmodels.stats.proportion.proportion_confint() function. The default is 'normal'; the full list of options (taken from the statsmodels website) is:
normal : asymptotic normal approximation
agresti_coull : Agresti-Coull interval
beta : Clopper-Pearson interval based on Beta distribution
wilson : Wilson Score interval
jeffreys : Jeffreys Bayesian Interval
binom_test : experimental, inversion of binom_test
- phjAlpha (default = 0.05)
The desired value for alpha; the default is 0.05 (which leads to the calculation of 95% confidence intervals).
- phjPlotProportions (default = True)
Determines whether a bar chart of proportions (with error bars) should be plotted.
- phjGroupsToPlotList (default = None; other options are a list of group names or 'all')
Proportions may be calculated for numerous groups, but it may not be desirable to plot them all. This argument gives the list of groups that should be displayed in the plot.
- phjSortProportions (default = False)
If no group variable is given (phjGroupVarName = None) or only a single group is named to be plotted in phjGroupsToPlotList, this argument indicates whether the columns should be sorted. The default is False but other options are 'asc' or 'desc'.
- phjGraphTitle (default = None)
The title of the graph.
- phjPrintResults (default = False)
Indicates whether the results should be printed to screen as the function progresses.
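To make the default 'normal' method concrete, the sketch below hand-computes an asymptotic normal (Wald) interval using only the standard library. It is an illustration of the formula rather than the statsmodels implementation; the function name `normal_approx_interval` is hypothetical, and clipping the limits to the range 0 to 1 is an assumption made here so that the limits remain valid proportions.

```python
from math import sqrt
from statistics import NormalDist


def normal_approx_interval(successes, total, alpha=0.05):
    """Asymptotic normal (Wald) interval for a proportion, clipped to [0, 1]."""
    prop = successes / total
    # Two-sided critical value, about 1.96 when alpha = 0.05.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(prop * (1 - prop) / total)
    return max(0.0, prop - half_width), min(1.0, prop + half_width)


# Example: 4 successes out of 6 non-missing observations.
lower, upper = normal_approx_interval(4, 6)
print(round(lower, 4), round(upper, 4))
```

With so few observations the unclipped upper limit exceeds 1, which is one reason the other methods in the list (for example 'wilson' or 'beta') may be preferable for small samples or extreme proportions.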
phjCalculateMultinomialProportions() function
The phjCalculateMultinomialProportions() function takes the following parameters:
- phjDF
The Pandas dataframe containing the data to be analysed. The dataframe does not need to be sliced before use because the data columns that need to be used are defined in the function arguments.
- phjCategoryVarName (default = None)
The name of the column that defines the category variable over which the multinomial confidence intervals are calculated.
- phjGroupVarName (default = None)
It is likely that some analyses will need to summarise data over distinct groups (e.g. 'case' and 'control' data may be summarised separately). This argument should contain the column heading for the variable that defines group membership. If phjGroupVarName is None, the whole dataset is analysed as a single group.
- phjMissingValue (default = 'missing')
The text string or value that indicates a missing value.
- phjMultinomialConfIntMethod (default = 'goodman')
This argument defines the method used to calculate the multinomial confidence intervals. The available options are those accepted by the statsmodels.stats.proportion.multinomial_proportions_confint() function. The default is 'goodman'; the full list of options (taken from the statsmodels website) is:
goodman : based on a chi-squared approximation, valid if all values in counts are greater or equal to 5
sison-glaz : less conservative than goodman, but only valid if counts has 7 or more categories (len(counts) >= 7)
- phjAlpha (default = 0.05)
The desired value for alpha; the default is 0.05 (which leads to the calculation of 95% confidence intervals).
- phjPlotRelFreq (default = True)
Determines whether a bar chart of proportions (with error bars) is plotted.
- phjCategoriesToPlotList (default = 'all')
A list of the categories that should be plotted on the bar chart.
- phjGroupsToPlotList (default = None; currently not implemented)
Proportions may be calculated for numerous groups, but it may not be desirable to plot them all. This argument gives the list of groups that should be displayed in the plot.
- phjGraphTitle (default = None)
The title of the graph.
- phjPrintResults (default = False)
Indicates whether the results should be printed to screen as the function progresses.
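To show what the default 'goodman' method computes, the sketch below hand-rolls Goodman's simultaneous intervals using only the standard library. It is an illustration under stated assumptions, not the statsmodels code: the function name `goodman_intervals` is hypothetical, and the chi-squared quantile with one degree of freedom is obtained as the square of a standard normal quantile.

```python
from math import sqrt
from statistics import NormalDist


def goodman_intervals(counts, alpha=0.05):
    """Goodman simultaneous confidence intervals for multinomial proportions."""
    n = sum(counts)
    k = len(counts)
    # Chi-squared quantile (1 d.f.) at 1 - alpha/k, via the squared normal quantile.
    z = NormalDist().inv_cdf(1 - alpha / (2 * k))
    chi2 = z * z
    intervals = []
    for x in counts:
        centre = chi2 + 2 * x
        spread = sqrt(chi2 * (chi2 + 4 * x * (n - x) / n))
        denom = 2 * (n + chi2)
        intervals.append(((centre - spread) / denom, (centre + spread) / denom))
    return intervals


# Counts for four mutually exclusive categories within one group.
intervals = goodman_intervals([3, 3, 4, 2])
for lower, upper in intervals:
    print(round(lower, 6), round(upper, 6))
```

Dividing alpha by the number of categories is what makes the intervals simultaneous, so they are wider than individual binomial intervals calculated from the same counts.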
Exceptions raised
None
Returns
These functions both return Pandas dataframes containing a table of proportions and confidence intervals.
Other notes
None
Example
Examples of the functions in use are given below.
Example of calculating binomial proportions (using phjCalculateBinomialProportions() function)
import numpy as np
import pandas as pd
import epydemiology as epy
phjTempDF = pd.DataFrame({'group':['g1','g1','g2','g1','g2','g2','g1','g1','g2','g1'],
'A':['yes','yes','no','no','no','no','no','yes',np.nan,'yes'],
'B':['no',np.nan,np.nan,'yes','yes','yes','yes','no','no','no'],
'C':['yes','yes','yes',np.nan,'no','yes','yes','yes','no','no']})
print(phjTempDF)
print('\n')
phjPropDF = epy.phjCalculateBinomialProportions(phjDF = phjTempDF,
phjColumnsList = ['A','B','C'],
phjSuccess = 'yes',
phjGroupVarName = 'group',
phjMissingValue = 'missing',
phjBinomialConfIntMethod = 'normal',
phjAlpha = 0.05,
phjPlotProportions = True,
phjGroupsToPlotList = 'all',
phjSortProportions = True,
phjGraphTitle = None,
phjPrintResults = False)
print(phjPropDF)
This produces the following results:
group A B C
0 g1 yes no yes
1 g1 yes NaN yes
2 g2 no NaN yes
3 g1 no yes NaN
4 g2 no yes no
5 g2 no yes yes
6 g1 no yes yes
7 g1 yes no yes
8 g2 NaN no no
9 g1 yes no no
g1_95CI_lint g1_95CI_llim g1_95CI_uint g1_95CI_ulim g1_obs g1_prop \
A 0.377195 0.289471 0.333333 1.000000 6 0.666667
B 0.400000 0.000000 0.429407 0.829407 5 0.400000
C 0.350609 0.449391 0.200000 1.000000 5 0.800000
g1_success g2_95CI_lint g2_95CI_llim g2_95CI_uint g2_95CI_ulim g2_obs \
A 4 0.000000 0.000000 0.000000 0.000000 3
B 2 0.533435 0.133232 0.333333 1.000000 3
C 4 0.489991 0.010009 0.489991 0.989991 4
g2_prop g2_success
A 0.000000 0
B 0.666667 2
C 0.500000 2
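In the output above, the _llim and _ulim columns hold the lower and upper confidence limits, while the _lint and _uint columns appear to hold the distances from the proportion down to the lower limit and up to the upper limit, which is the form needed for asymmetric error bars. That reading is an inference from the numbers rather than something stated by the function, and it can be checked against the g1 values with a short sketch (the variable names are illustrative):

```python
# Values copied from the g1 columns of the output above.
props = {'A': 0.666667, 'B': 0.400000, 'C': 0.800000}
lower_limits = {'A': 0.289471, 'B': 0.000000, 'C': 0.449391}
upper_limits = {'A': 1.000000, 'B': 0.829407, 'C': 1.000000}

# Distances from each proportion to its limits (asymmetric error-bar lengths).
lower_lengths = {k: round(props[k] - lower_limits[k], 6) for k in props}
upper_lengths = {k: round(upper_limits[k] - props[k], 6) for k in props}

print(lower_lengths)   # compare with the g1_95CI_lint column
print(upper_lengths)   # compare with the g1_95CI_uint column
```

Lengths in this form can be passed to a plotting routine that accepts separate lower and upper error values, such as the yerr argument of matplotlib's bar().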
Example of calculating multinomial proportions (using phjCalculateMultinomialProportions() function)
phjTempDF = pd.DataFrame({'group':['case','case','case','control','control','case','case','case','control','control','control','control','case','case','case','control','control','control','control','case','case','case','case','case',np.nan,np.nan],
'category':[np.nan,'spaniel','missing','terrier','collie','labrador','labrador','collie','spaniel','spaniel','labrador','collie','terrier','terrier','terrier','collie','labrador','labrador','labrador','spaniel','spaniel','collie','collie','collie','terrier','spaniel'],
'catint':[1,2,3,2,3,2,1,2,1,2,3,2,3,2,3,1,2,3,2,3,2,3,2,3,1,2]})
print(phjTempDF)
print('\n')
phjRelFreqDF = epy.phjCalculateMultinomialProportions(phjDF = phjTempDF,
phjCategoryVarName = 'category',
phjGroupVarName = 'group',
phjMissingValue = 'missing',
phjMultinomialConfIntMethod = 'goodman',
phjAlpha = 0.05,
phjPlotRelFreq = True,
phjCategoriesToPlotList = 'all',
phjGroupsToPlotList = 'all', # Currently not implemented
phjGraphTitle = 'Relative frequencies (Goodman CI)',
phjPrintResults = True)
print(phjRelFreqDF)
This produces the following results:
group category catint
0 case NaN 1
1 case spaniel 2
2 case missing 3
3 control terrier 2
4 control collie 3
5 case labrador 2
6 case labrador 1
7 case collie 2
8 control spaniel 1
9 control spaniel 2
10 control labrador 3
11 control collie 2
12 case terrier 3
13 case terrier 2
14 case terrier 3
15 control collie 1
16 control labrador 2
17 control labrador 3
18 control labrador 2
19 case spaniel 3
20 case spaniel 2
21 case collie 3
22 case collie 2
23 case collie 3
24 NaN terrier 1
25 NaN spaniel 2
Category levels: ['spaniel', 'terrier', 'collie', 'labrador']
Group levels: ['case', 'control']
case_count control_count case_prop control_prop case_95CI_llim \
spaniel 3 2 0.250000 0.2 0.068217
terrier 3 1 0.250000 0.1 0.068217
collie 4 3 0.333333 0.3 0.108808
labrador 2 4 0.166667 0.4 0.034702
case_95CI_ulim control_95CI_llim control_95CI_ulim
spaniel 0.602809 0.041845 0.588663
terrier 0.602809 0.012443 0.494901
collie 0.671876 0.082588 0.671084
labrador 0.526666 0.132347 0.744489
case_count control_count case_prop control_prop case_95CI_llim \
spaniel 3 2 0.250000 0.2 0.068217
terrier 3 1 0.250000 0.1 0.068217
collie 4 3 0.333333 0.3 0.108808
labrador 2 4 0.166667 0.4 0.034702
case_95CI_ulim control_95CI_llim control_95CI_ulim
spaniel 0.602809 0.041845 0.588663
terrier 0.602809 0.012443 0.494901
collie 0.671876 0.082588 0.671084
labrador 0.526666 0.132347 0.744489