Categorise a continuous variable using predefined breaks, quantiles or optimised break positions - lvphj/epydemiology GitHub Wiki
Python function to convert a continuous variable to a categorical variable using predefined breaks, quantiles or optimised break positions for data in a Pandas dataframe.
phjCategoriseContinuousVariable()
df,list = epy.phjCategoriseContinuousVariable(phjDF,
                                              phjContinuousVarName = None,
                                              phjMissingValue = 'missing',
                                              phjNumberOfCategoriesInt = 5,
                                              phjNewCategoryVarName = None,
                                              phjCategorisationMethod = 'jenks',
                                              phjNewCategoryNamesList = None,
                                              phjReturnBreaks = True,
                                              phjPrintResults = False)
or
df = epy.phjCategoriseContinuousVariable(phjDF,
                                         phjContinuousVarName = None,
                                         phjMissingValue = 'missing',
                                         phjNumberOfCategoriesInt = 5,
                                         phjNewCategoryVarName = None,
                                         phjCategorisationMethod = 'jenks',
                                         phjNewCategoryNamesList = None,
                                         phjReturnBreaks = False,
                                         phjPrintResults = False)
Description
Function parameters
phjNumberOfCategoriesInt (default = 5). Minimum = 2. The maximum number of categories is arbitrarily set to 100 or the number of values in the series, whichever is smaller.
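The limits described above can be sketched as a small check. This is only an illustration of the stated rule, not the library's own code: the helper name is_valid_number_of_categories and its arguments are hypothetical, and how the function itself reacts to an out-of-range value is not documented here.

```python
def is_valid_number_of_categories(n_categories, n_values, max_categories=100):
    """Return True if n_categories lies within the documented limits.

    Hypothetical helper: the minimum is 2 and the maximum is the smaller of
    max_categories (100 in the text above) and the number of values.
    """
    # Booleans are ints in Python, so exclude them explicitly.
    if isinstance(n_categories, bool) or not isinstance(n_categories, int):
        return False
    upper_limit = min(max_categories, n_values)
    return 2 <= n_categories <= upper_limit


# Example: 6 categories for a series of 100000 values is within the limits.
print(is_valid_number_of_categories(6, 100000))
```

With only 8 values in the series, for example, a request for 10 categories would fall outside the limits, because the upper limit becomes 8.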
Exceptions raised
None
Returns
Pandas dataframe containing the original data with the new categorical variable added as a column. If phjReturnBreaks = True, the list of breaks is also returned.
Other notes
Check:
- The extreme values in the list of breaks are extended by a small percentage to ensure that all values are included. Check whether these extended values are the ones actually returned in the list from the function.
- Calculating Jenks breaks can be very slow, so the breaks are calculated on a random sample taken from the dataframe. The lowest and highest breaks are then replaced by the data minimum and maximum (possibly extended by a small percentage, as in the first point above).
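The two notes above can be illustrated with a rough sketch. It is not the function's implementation: equally spaced quantiles are used as a stand-in for the Jenks calculation, and the helper name sample_based_breaks, the sample size and the 0.1% extension (expressed as a fraction of the data range) are all assumptions made for illustration.

```python
import numpy as np


def sample_based_breaks(values, n_categories=5, sample_size=1000,
                        extend_fraction=0.001, seed=0):
    """Hypothetical sketch: breaks from a random sample, with widened extremes."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]

    # Work on a random sample because the optimisation step can be slow.
    rng = np.random.default_rng(seed)
    n_sample = min(sample_size, values.size)
    sample = rng.choice(values, size=n_sample, replace=False)

    # Stand-in for Jenks: equally spaced quantiles of the sample.
    breaks = np.quantile(sample, np.linspace(0, 1, n_categories + 1))

    # Replace the outer breaks with the full-data minimum and maximum,
    # extended slightly (assumed fraction) so that every value is included.
    span = values.max() - values.min()
    breaks[0] = values.min() - extend_fraction * span
    breaks[-1] = values.max() + extend_fraction * span
    return breaks.tolist()


data = np.random.default_rng(1).uniform(0, 1, 10000)
breaks = sample_based_breaks(data, n_categories=6)
print(len(breaks))
```

Because the outer breaks come from the full data rather than the sample, values that were not sampled still fall inside the first or last category.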
Example
An example of the function in use is given below:
# Import required packages
import numpy as np
import pandas as pd
import epydemiology as epy

# Define example dataset
phjTempDF = pd.DataFrame({'binDepVar':['yes']*50000 + ['no']*50000,
                          'riskFactorCont':np.random.uniform(0,1,100000)})

# Print a shortened view of the dataframe
with pd.option_context('display.max_rows', 10, 'display.max_columns', 5):
    print(phjTempDF)

# Categorise a continuous variable
phjTempDF, phjBreaksList = epy.phjCategoriseContinuousVariable(phjDF = phjTempDF,
                                                               phjContinuousVarName = 'riskFactorCont',
                                                               phjMissingValue = 'missing',
                                                               phjNumberOfCategoriesInt = 6,
                                                               phjNewCategoryVarName = 'catVar',
                                                               phjCategorisationMethod = 'jenks',
                                                               phjReturnBreaks = True,
                                                               phjPrintResults = False)

# Print the categorised dataframe and the list of breaks
with pd.option_context('display.max_rows', 10, 'display.max_columns', 5):
    print('\nCategorised variable')
    print(phjTempDF)
    print('\n')
    print('Breaks')
    print(phjBreaksList)
Output
       binDepVar  riskFactorCont
0            yes        0.268203
1            yes        0.871220
2            yes        0.501282
3            yes        0.858652
4            yes        0.723276
...          ...             ...
99995         no        0.943760
99996         no        0.953255
99997         no        0.080429
99998         no        0.091481
99999         no        0.925220
[100000 rows x 2 columns]
Categorised variable
       binDepVar  riskFactorCont  catVar
0            yes        0.268203       1
1            yes        0.871220       5
2            yes        0.501282       3
3            yes        0.858652       5
4            yes        0.723276       4
...          ...             ...     ...
99995         no        0.943760       5
99996         no        0.953255       5
99997         no        0.080429       0
99998         no        0.091481       0
99999         no        0.925220       5
[100000 rows x 3 columns]
Breaks
[1.1083476758642629e-05, 0.16544335416294453, 0.32239443898189324, 0.48614660309506053, 0.65418653301496088, 0.83115022562356933, 1.0009938172219826]
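As a cross-check, the breaks listed above can be applied to the values shown in the output using pandas directly. This is a sketch only: pd.cut is used here for illustration and is not necessarily what the function does internally, and numbering the categories from 0 in ascending order is an assumption based on the catVar column shown above.

```python
import pandas as pd

# Breaks copied from the example output above
breaks = [1.1083476758642629e-05, 0.16544335416294453, 0.32239443898189324,
          0.48614660309506053, 0.65418653301496088, 0.83115022562356933,
          1.0009938172219826]

# The ten riskFactorCont values visible in the example output
values = pd.Series([0.268203, 0.871220, 0.501282, 0.858652, 0.723276,
                    0.943760, 0.953255, 0.080429, 0.091481, 0.925220])

# labels=False returns integer bin codes (0 = lowest interval);
# include_lowest=True keeps a value equal to the first break in the first bin.
codes = pd.cut(values, bins=breaks, labels=False, include_lowest=True)
print(codes.tolist())
# prints [1, 5, 3, 5, 4, 5, 5, 0, 0, 5]
```

These codes agree with the catVar values shown for the same rows in the output above.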