Convert long dataframe to wide format containing binary variables - lvphj/epydemiology GitHub Wiki

phjLongToWideBinary()

import numpy as np
import pandas as pd
import collections
import epydemiology as epy

myDF = epy.phjLongToWideBinary(phjDF,
                               phjGroupbyVarName,
                               phjVariablesVarName,
                               phjValuesDict = {0:0,1:1},
                               phjPrintResults = False)

Description

This function takes a long dataframe containing a grouping variable and a variable listing the factors present in each group, and converts it to a wide dataframe with one row per group and one binary variable per factor indicating whether that factor is present in the group. For example, it converts:

   X  Y
0  1  a
1  1  b
2  1  d
3  2  b
4  2  c
5  3  d
6  3  e
7  3  a
8  3  f
9  4  b

to:

   X  a  b  d  c  e  f
0  1  1  1  1  0  0  0
1  2  0  1  0  1  0  0
2  3  1  0  1  0  1  1
3  4  0  1  0  0  0  0
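
For readers who want to see the transformation itself, the following is a minimal sketch of the same long-to-wide conversion written with plain pandas. It is not the epydemiology implementation: the function name long_to_wide_binary and its argument names are hypothetical, and the handling of values_dict is an assumption based on the parameter description (0 meaning absent, 1 meaning present).

```python
import pandas as pd

def long_to_wide_binary(df, groupby_var, variables_var, values_dict=None):
    """Hypothetical stand-in for illustration; not the library's own code."""
    if values_dict is None:
        values_dict = {0: 0, 1: 1}

    # Count how often each category occurs in each group, then cap the
    # counts at 1 so that every cell is a 0/1 presence indicator.
    wide = pd.crosstab(df[groupby_var], df[variables_var]).clip(upper=1)

    # crosstab sorts the category columns; reorder them by first appearance
    # in the long dataframe to match the example output shown above.
    wide = wide[list(pd.unique(df[variables_var]))]

    # Assumed behaviour: recode 0 and 1 using the supplied dictionary.
    wide = wide.apply(lambda col: col.map(values_dict))

    # Drop the columns label left by crosstab and restore the grouping
    # variable as an ordinary column.
    wide.columns.name = None
    return wide.reset_index()


df = pd.DataFrame({'X': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4],
                   'Y': ['a', 'b', 'd', 'b', 'c', 'd', 'e', 'a', 'f', 'b']})

print(long_to_wide_binary(df, 'X', 'Y'))
```

Clipping the counts means that a factor listed more than once for the same group is still reported as 1 rather than as a count.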

Function parameters

  1. phjDF

    Dataframe containing a grouping variable and a variable containing categories.

  2. phjGroupbyVarName

    Name of grouping variable.

  3. phjVariablesVarName

    Name of variable containing category levels.

  4. phjValuesDict (default = {0:0,1:1})

    Dictionary defining the values used in the output to represent '0' (factor absent) and '1' (factor present).

  5. phjPrintResults (default = False)

    Print intermediate values. This parameter currently has no effect in this function.
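
As an illustration of what a non-default phjValuesDict is presumably for, the sketch below recodes a small 0/1 wide dataframe with plain pandas. The dictionary {0:'no',1:'yes'} and the variable names are hypothetical, and this shows only the assumed effect of the parameter, not how the library applies it internally.

```python
import pandas as pd

# A small wide dataframe of 0/1 indicators with grouping variable 'X'.
wide = pd.DataFrame({'X': [1, 2],
                     'a': [1, 0],
                     'b': [1, 1]})

# Hypothetical mapping: the keys are the binary codes, the values are the
# representations wanted in the output.
values_dict = {0: 'no', 1: 'yes'}

# Recode every indicator column and leave the grouping variable untouched.
recoded = wide.copy()
for col in recoded.columns:
    if col != 'X':
        recoded[col] = wide[col].map(values_dict)

print(recoded)
```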

Exceptions raised

  1. AssertionError

    Raised if any parameter passed to the function is incorrect.

Other notes

None

Example

df = pd.DataFrame({'X':[1,1,1,2,2,3,3,3,3,4],
                   'Y':['a','b','d','b','c','d','e','a','f','b']})

newDF = epy.phjLongToWideBinary(phjDF = df,
                                phjGroupbyVarName = 'X',
                                phjVariablesVarName = 'Y',
                                phjValuesDict = {0:0,1:1},
                                phjPrintResults = False)

print('Original dataframe\n')
print(df)

print('\n')

print('New wide dataframe\n')
print(newDF)

This produces the following output:

Original dataframe

   X  Y
0  1  a
1  1  b
2  1  d
3  2  b
4  2  c
5  3  d
6  3  e
7  3  a
8  3  f
9  4  b


New wide dataframe

   X  a  b  d  c  e  f
0  1  1  1  1  0  0  0
1  2  0  1  0  1  0  0
2  3  1  0  1  0  1  1
3  4  0  1  0  0  0  0