Convert columns of binary data to a square matrix containing co-occurrences - lvphj/epydemiology GitHub Wiki

Function to produce a NumPy array showing co-occurrence among a group of binary variables.

phjBinaryVarsToSquareMatrix()

arr = epy.phjBinaryVarsToSquareMatrix(phjDataDF,
                                      phjColumnNamesList,
                                      phjOutputFormat = 'arr',
                                      phjPrintResults = False)

Description

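This function takes a number of columns containing binary data and returns a NumPy array representing a square matrix of co-occurrences. Each off-diagonal entry counts the rows in which both variables are positive. Judging from the example below, each diagonal entry counts the rows in which that variable is the only positive one.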
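The calculation can be sketched with plain NumPy and pandas. This is not the library's implementation. The helper name `cooccurrence_sketch` is hypothetical, and the handling of the diagonal (rows in which a variable is the only positive one) is an assumption inferred from the example output on this page.

```python
import numpy as np
import pandas as pd

def cooccurrence_sketch(df, cols):
    """Hypothetical helper: rebuilds the co-occurrence matrix shown in the example."""
    x = df[cols].to_numpy(dtype=int)
    # x.T @ x gives, for each pair of columns, the number of rows where both are 1.
    m = x.T @ x
    # Assumption: the diagonal holds the number of rows where the variable occurs alone.
    alone = x[x.sum(axis=1) == 1].sum(axis=0)
    np.fill_diagonal(m, alone)
    return m

rawDataDF = pd.DataFrame({'a': [0, 1, 1, 1, 0, 0, 1, 0],
                          'b': [1, 1, 0, 0, 1, 0, 0, 1],
                          'c': [0, 0, 1, 0, 1, 1, 1, 1],
                          'd': [1, 0, 0, 0, 1, 0, 0, 0],
                          'e': [1, 0, 0, 0, 0, 1, 0, 0]})

matrix = cooccurrence_sketch(rawDataDF, ['a', 'b', 'c', 'd', 'e'])
print(matrix)
```

With the example data used on this page, the sketch produces the same matrix as the one shown in the Example section.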

Function parameters

phjDataDF Pandas dataframe containing the binary variables.
phjColumnNamesList A list of the names of the columns in the dataframe that contain binary data.
phjOutputFormat (default = 'arr') Output format. Default is a NumPy array ('arr'). Alternative is 'df' to return a Pandas dataframe.
phjPrintResults (default = False) Print verbose output during execution. In a Jupyter notebook, setting phjPrintResults = True produces a lot of output and can cause problems connecting to the kernel. It is recommended to leave phjPrintResults = False routinely when using Jupyter notebooks.

Exceptions raised

None.

Returns

By default, the function returns a NumPy array representing a square matrix (phjOutputFormat = 'arr'). The matrix can instead be returned as a Pandas dataframe (phjOutputFormat = 'df').
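The array form carries no row or column labels. A minimal sketch of attaching them afterwards with pandas is given here, assuming the rows and columns of the array follow the order of phjColumnNamesList (which is consistent with the example output on this page). The array values are copied from that example rather than computed, and the name `labelled` is only illustrative.

```python
import numpy as np
import pandas as pd

columns = ['a', 'b', 'c', 'd', 'e']

# Values copied from the example output on this page.
arr = np.array([[1, 1, 2, 0, 0],
                [1, 0, 2, 2, 1],
                [2, 2, 0, 1, 1],
                [0, 2, 1, 0, 1],
                [0, 1, 1, 1, 0]])

# Assumption: array order matches the order of the column names list.
labelled = pd.DataFrame(arr, index=columns, columns=columns)

# Look up the co-occurrence count for a pair of variables by name.
print(labelled.loc['b', 'd'])
```

Requesting phjOutputFormat = 'df' avoids this step when labels are wanted from the start.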

Other notes

None.

Example

Output a NumPy array

import pandas as pd
import epydemiology as epy  # provides the epy prefix used below

rawDataDF = pd.DataFrame({'a':[0,1,1,1,0,0,1,0],
                          'b':[1,1,0,0,1,0,0,1],
                          'c':[0,0,1,0,1,1,1,1],
                          'd':[1,0,0,0,1,0,0,0],
                          'e':[1,0,0,0,0,1,0,0]})

columns = ['a','b','c','d','e']

print('Raw data')
print(rawDataDF)
print('\n')

phjMatrix = epy.phjBinaryVarsToSquareMatrix(phjDataDF = rawDataDF,
                                            phjColumnNamesList = columns,
                                            phjOutputFormat = 'arr',
                                            phjPrintResults = False)

print('Returned square matrix')
print(phjMatrix)

Output:

Raw data
   a  b  c  d  e
0  0  1  0  1  1
1  1  1  0  0  0
2  1  0  1  0  0
3  1  0  0  0  0
4  0  1  1  1  0
5  0  0  1  0  1
6  1  0  1  0  0
7  0  1  1  0  0


Returned square matrix
[[1 1 2 0 0]
 [1 0 2 2 1]
 [2 2 0 1 1]
 [0 2 1 0 1]
 [0 1 1 1 0]]

Output a Pandas dataframe

rawDataDF = pd.DataFrame({'a':[0,1,1,1,0,0,1,0],
                          'b':[1,1,0,0,1,0,0,1],
                          'c':[0,0,1,0,1,1,1,1],
                          'd':[1,0,0,0,1,0,0,0],
                          'e':[1,0,0,0,0,1,0,0]})

columns = ['a','b','c','d','e']

print('Raw data')
print(rawDataDF)
print('\n')

phjMatrixDF = epy.phjBinaryVarsToSquareMatrix(phjDataDF = rawDataDF,
                                              phjColumnNamesList = columns,
                                              phjOutputFormat = 'df',
                                              phjPrintResults = False)
                                        
print('Returned square matrix dataframe')
print(phjMatrixDF)

Output:

Raw data
   a  b  c  d  e
0  0  1  0  1  1
1  1  1  0  0  0
2  1  0  1  0  0
3  1  0  0  0  0
4  0  1  1  1  0
5  0  0  1  0  1
6  1  0  1  0  0
7  0  1  1  0  0


Returned square matrix dataframe
   a  b  c  d  e
a  1  1  2  0  0
b  1  0  2  2  1
c  2  2  0  1  1
d  0  2  1  0  1
e  0  1  1  1  0