Convert long dataframe to wide format containing binary variables - lvphj/epydemiology GitHub Wiki

phjLongToWideBinary()

import numpy as np
import pandas as pd
import collections
import epydemiology as epy

myDF = epy.phjLongToWideBinary(phjDF,
                               phjGroupbyVarName,
                               phjVariablesVarName,
                               phjValuesDict = {0:0,1:1},
                               phjPrintResults = False)

Description

This function takes a long-format dataframe containing a grouping variable and a variable listing factors that may or may not be present for each group, and converts it to a wide dataframe with one binary variable per factor, indicating whether that factor is present in the group. For example, it converts:

	X	Y
0	1	a
1	1	b
2	1	d
3	2	b
4	2	c
5	3	d
6	3	e
7	3	a
8	3	f
9	4	b

to:

	X	a	b	d	c	e	f
0	1	1	1	1	0	0	0
1	2	0	1	0	1	0	0
2	3	1	0	1	0	1	1
3	4	0	1	0	0	0	0
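
The transformation above can be reproduced with plain pandas, which may help when checking results by hand. The following is a minimal sketch only, not the implementation used by phjLongToWideBinary(); it assumes pd.crosstab() followed by clipping counts to 1, and the variable name wide is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4],
                   'Y': ['a', 'b', 'd', 'b', 'c', 'd', 'e', 'a', 'f', 'b']})

# Count how often each factor occurs in each group, then cap counts at 1
# so that every cell is a 0/1 presence indicator.
wide = pd.crosstab(df['X'], df['Y']).clip(upper=1)

# crosstab sorts the columns; reorder them by first appearance in Y
# to match the layout of the table shown above.
wide = wide[list(pd.unique(df['Y']))]

# Turn the grouping variable back into an ordinary column.
wide = wide.reset_index()
wide.columns.name = None

print(wide)
```

Clipping rather than comparing with zero keeps the result as integers, so a factor repeated within a group is still recorded as 1.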

Function parameters

phjDF

Dataframe containing a grouping variable and a variable containing categories.

phjGroupbyVarName

Name of grouping variable.

phjVariablesVarName

Name of variable containing category levels.

phjValuesDict (default = {0:0,1:1})

Dictionary defining the values used to represent 0 (factor absent) and 1 (factor present) in the output. The default leaves them as 0 and 1.

phjPrintResults (default = False)

Print intermediate values. No effect in current function.
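
To illustrate the idea behind phjValuesDict, the sketch below recodes the 0/1 columns of a small wide dataframe with plain pandas. It is an assumption about how such a mapping might be applied, not the function's own code; the names values_dict, binary_cols and recoded, and the labels 'no' and 'yes', are hypothetical.

```python
import pandas as pd

# A small wide dataframe of the kind shown in the Description.
wide = pd.DataFrame({'X': [1, 2],
                     'a': [1, 0],
                     'b': [1, 1]})

# Hypothetical mapping: how 0 (absent) and 1 (present) should be represented.
values_dict = {0: 'no', 1: 'yes'}

# Recode every column except the grouping variable.
binary_cols = [col for col in wide.columns if col != 'X']
recoded = wide.copy()
recoded[binary_cols] = recoded[binary_cols].apply(lambda col: col.map(values_dict))

print(recoded)
```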

Exceptions raised

AssertionError

AssertionError raised if parameters passed to function are incorrect.
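
The page does not list which checks trigger the AssertionError. As a rough sketch of the kind of argument checking a caller could do beforehand, the helper below (check_long_to_wide_args, a hypothetical name that is not part of epydemiology) asserts that the input is a dataframe and that both column names exist.

```python
import pandas as pd

def check_long_to_wide_args(df, groupby_var, variables_var):
    """Hypothetical pre-check; not part of epydemiology."""
    assert isinstance(df, pd.DataFrame), "df must be a pandas DataFrame"
    for name in (groupby_var, variables_var):
        assert isinstance(name, str), "column names must be strings"
        assert name in df.columns, f"column '{name}' not found in dataframe"

df = pd.DataFrame({'X': [1, 1, 2],
                   'Y': ['a', 'b', 'a']})

# Passes silently: both columns exist.
check_long_to_wide_args(df, 'X', 'Y')

# Fails: 'Z' is not a column, so an AssertionError is raised and caught.
try:
    check_long_to_wide_args(df, 'X', 'Z')
except AssertionError as err:
    message = str(err)
    print('Invalid arguments:', message)
```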

Other notes

None

Example

df = pd.DataFrame({'X':[1,1,1,2,2,3,3,3,3,4],
                   'Y':['a','b','d','b','c','d','e','a','f','b']})

newDF = epy.phjLongToWideBinary(phjDF = df,
                                phjGroupbyVarName = 'X',
                                phjVariablesVarName = 'Y',
                                phjValuesDict = {0:0,1:1},
                                phjPrintResults = False)

print('Original dataframe\n')
print(df)

print('\n')

print('New wide dataframe\n')
print(newDF)

This produces the following output:

Original dataframe

   X  Y
0  1  a
1  1  b
2  1  d
3  2  b
4  2  c
5  3  d
6  3  e
7  3  a
8  3  f
9  4  b


New wide dataframe

   X  a  b  d  c  e  f
0  1  1  1  1  0  0  0
1  2  0  1  0  1  0  0
2  3  1  0  1  0  1  1
3  4  0  1  0  0  0  0