Convert Pandas dataframe from wide to long format - lvphj/epydemiology GitHub Wiki

Converts a 'wide' dataframe to 'long' format.

This function meets a specific data situation where an individual
subject is described in a series of categorical variables and the
additional numerical values are given in a series of variables
which are entered as separate columns. Clearly, this situation is
exactly what Pandas' wide_to_long() function is designed to do. This
wrapper, however, automatically adds an appropriate prefix to the
column headings to act as stubname and renames the column to more
intuitive values.

This function also provides the opportunity for duplicate columns and rows (as
defined by a series of named columns) to be automatically aggregated before
the wide_to_long() function is applied.

The function retains all the named descriptive columns but then
converts the series of numeric columns to just two columns, one
called 'var' (phjPrefixStr) and the other called 'value' (phjValueStr).

As a visual example, the following dataframe (df) consists of two
categorical variables defining each row, namely cat1 and cat2. The
remaining columns (stuff, things and value) represent variables that
need to be used to convert to long function.

       cat1 cat2  stuff  things  value
    0     1    a   2230     489    143
    1     2    b   1134     267     87
    2     3    c   1165     256     25
    3     4    d   1176     232     47
    4     5    e   2279     478     86
    5     6    f   1165     276     67
    6     7    g   1176     287     87
    7     8    h   3456     850    207
    8     9    i   1156     245     34
    9    10    j   1176     223     62

longdf = epy.phjWide2Long(phjDF = df,
                          phjReqVarList = ['cat1','cat2'],
                          phjReqVarAllPresent = True,
                          phjAutoAgg = True,
                          phjColAgg = 'sum', # Currently the only option
                          phjRowAgg = 'sum', # Currently the only option
                          phjNewCatColNameStr = 'category',
                          phjNewAggColNameStr = 'count',
                          phjDropZeros = False,
                          phjPrintResults = False)

print(longdf)

    cat1 cat2 category  count
0      1    a    stuff   2230
1      1    a   things    489
2      1    a    value    143
3      2    b    stuff   1134
4      2    b   things    267
5      2    b    value     87
6      3    c    stuff   1165
7      3    c   things    256
8      3    c    value     25
9      4    d    stuff   1176
10     4    d   things    232
11     4    d    value     47
12     5    e    stuff   2279
13     5    e   things    478
14     5    e    value     86
15     6    f    stuff   1165
16     6    f   things    276
17     6    f    value     67
18     7    g    stuff   1176
19     7    g   things    287
20     7    g    value     87
21     8    h    stuff   3456
22     8    h   things    850
23     8    h    value    207
24     9    i    stuff   1156
25     9    i   things    245
26     9    i    value     34
27    10    j    stuff   1176
28    10    j   things    223
29    10    j    value     62