Convert columns of binary data to a square matrix containing co-occurrences - lvphj/epydemiology GitHub Wiki
Function to produce a Numpy array from a group of binary variables to show co-occurrence.
phjBinaryVarsToSquareMatrix()
arr = epy.phjBinaryVarsToSquareMatrix(phjDataDF,
                                      phjColumnNamesList,
                                      phjOutputFormat = 'arr',
                                      phjPrintResults = False)
Description
This function takes the dataframe columns named in phjColumnNamesList, each containing binary data, and returns a Numpy array representing a square matrix that shows how often the variables are positive together (co-occurrence).
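To make "co-occurrence" concrete, here is a minimal sketch that reproduces the matrix from the example further down this page using plain NumPy. It is not the library's implementation: the helper name `co_occurrence_sketch` is hypothetical, and the treatment of the diagonal (counting rows in which a variable is the only positive one) is an assumption inferred from the example output.

```python
import numpy as np
import pandas as pd

def co_occurrence_sketch(df, columns):
    """Hypothetical helper, not part of epydemiology."""
    x = df[columns].to_numpy(dtype=int)   # rows x variables, values 0/1
    m = x.T @ x                           # m[i, j] = number of rows where i and j are both 1
    # Assumption inferred from the example output: the diagonal counts rows
    # in which variable i is the only positive variable.
    only = x * (x.sum(axis=1, keepdims=True) == 1)
    np.fill_diagonal(m, only.sum(axis=0))
    return m

raw = pd.DataFrame({'a': [0, 1, 1, 1, 0, 0, 1, 0],
                    'b': [1, 1, 0, 0, 1, 0, 0, 1],
                    'c': [0, 0, 1, 0, 1, 1, 1, 1],
                    'd': [1, 0, 0, 0, 1, 0, 0, 0],
                    'e': [1, 0, 0, 0, 0, 1, 0, 0]})

matrix = co_occurrence_sketch(raw, ['a', 'b', 'c', 'd', 'e'])
print(matrix)
```

For instance, variables b and d are both 1 in rows 0 and 4 of the example data, so the (b, d) entry is 2.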
Function parameters
- phjDataDF: Pandas dataframe.
- phjColumnNamesList: A list of the names of the columns in the dataframe that contain binary data.
- phjOutputFormat (default = 'arr'): Output format. The default, 'arr', returns a Numpy array; the alternative, 'df', returns a Pandas dataframe.
- phjPrintResults (default = False): Print verbose output during execution. When running in a Jupyter notebook, setting phjPrintResults = True produces a lot of output and can cause problems connecting to the kernel, so it is recommended to leave phjPrintResults = False routinely.
Exceptions raised
None.
Returns
By default, the function returns a Numpy array containing the square matrix (phjOutputFormat = 'arr'). The matrix can instead be returned as a Pandas dataframe (phjOutputFormat = 'df').
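Judging by the example output further down this page, the dataframe form carries the variable names as both the row index and the column labels. If only the array is to hand, a similarly labelled frame can be built manually. The sketch below copies `arr` from the example output; the name `labelled` is illustrative, and this is not a claim about how the function constructs its own dataframe.

```python
import numpy as np
import pandas as pd

columns = ['a', 'b', 'c', 'd', 'e']

# Array copied from the example output on this page
arr = np.array([[1, 1, 2, 0, 0],
                [1, 0, 2, 2, 1],
                [2, 2, 0, 1, 1],
                [0, 2, 1, 0, 1],
                [0, 1, 1, 1, 0]])

# Attach the variable names to both axes so entries can be looked up by name
labelled = pd.DataFrame(arr, index=columns, columns=columns)
print(labelled.loc['b', 'd'])
```

Label-based lookup such as `labelled.loc['b', 'd']` is the main convenience of the dataframe form over the bare array.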
Other notes
None.
Example
Output a numpy array
import pandas as pd
import epydemiology as epy
rawDataDF = pd.DataFrame({'a':[0,1,1,1,0,0,1,0],
                          'b':[1,1,0,0,1,0,0,1],
                          'c':[0,0,1,0,1,1,1,1],
                          'd':[1,0,0,0,1,0,0,0],
                          'e':[1,0,0,0,0,1,0,0]})
columns = ['a','b','c','d','e']
print('Raw data')
print(rawDataDF)
print('\n')
# Request the co-occurrence matrix as a Numpy array
phjMatrix = epy.phjBinaryVarsToSquareMatrix(phjDataDF = rawDataDF,
                                            phjColumnNamesList = columns,
                                            phjOutputFormat = 'arr',
                                            phjPrintResults = False)
print('Returned square matrix')
print(phjMatrix)
Output:
Raw data
   a  b  c  d  e
0  0  1  0  1  1
1  1  1  0  0  0
2  1  0  1  0  0
3  1  0  0  0  0
4  0  1  1  1  0
5  0  0  1  0  1
6  1  0  1  0  0
7  0  1  1  0  0
Returned square matrix
[[1 1 2 0 0]
 [1 0 2 2 1]
 [2 2 0 1 1]
 [0 2 1 0 1]
 [0 1 1 1 0]]
Output a Pandas dataframe
rawDataDF = pd.DataFrame({'a':[0,1,1,1,0,0,1,0],
                          'b':[1,1,0,0,1,0,0,1],
                          'c':[0,0,1,0,1,1,1,1],
                          'd':[1,0,0,0,1,0,0,0],
                          'e':[1,0,0,0,0,1,0,0]})
columns = ['a','b','c','d','e']
print('Raw data')
print(rawDataDF)
print('\n')
phjMatrixDF = epy.phjBinaryVarsToSquareMatrix(phjDataDF = rawDataDF,
                                              phjColumnNamesList = columns,
                                              phjOutputFormat = 'df',
                                              phjPrintResults = False)
print('Returned square matrix dataframe')
print(phjMatrixDF)
Output:
Raw data
   a  b  c  d  e
0  0  1  0  1  1
1  1  1  0  0  0
2  1  0  1  0  0
3  1  0  0  0  0
4  0  1  1  1  0
5  0  0  1  0  1
6  1  0  1  0  0
7  0  1  1  0  0
Returned square matrix dataframe
   a  b  c  d  e
a  1  1  2  0  0
b  1  0  2  2  1
c  2  2  0  1  1
d  0  2  1  0  1
e  0  1  1  1  0