Aggregate duplicate columns and rows in Pandas dataframe - lvphj/epydemiology GitHub Wiki

Function identifies duplicate column headings and duplicate rows (based
on subset of variables) and aggregates cells to produce a dataframe that
contains no duplicates.

Importing a dataframe from an Excel file, does not allow duplicate column
headings, and only the last duplicate column is retained. However, a dataframe
can be imported with unique column headings but, after some data wrangling,
duplicate column headings might be created. For example, a dataframe might
be imported from a spreadsheet with column headings 'other' and 'other*'.
These column headings would be considered deistinct and would be imported as
two separate columns into a Pandas dataframe. However, after cleaning, it
might be the case that the asterisk is removed resulting in two columns named
'other'. This function will aggregate these duplicated columns (default by
summing).

Similarly, rows of data may be duplicated on a selection of defining categorical
variables and the data in remaining columns therefore needs to be aggregated
across rows to produce a single output.

Parameters
----------

phjDF: Pandas dataframe

phjReqVarList: List of column headings (variables) that will be used to
identify duplicated rows.

phjReqVarAllPresent: (default = True) Boolean variable to indicate that ALL
the columns listed in phjReqVarList must be present in the dataframe. In
most cases, the required list of variables should all be present in the
dataframe. However, in some cases, it might be that the data should only be
grouped by those listed variables that are present in the dataframe. This
situation has arisen when processing multiple files and in a small number
of files, one of the variables had not been included. Therefore, if
phjReqVarAllPresent is set to False, adjust to only include those required
variables that are present in the dataframe.

phjAutoAgg: (default = False) Automatically aggregate duplicate columns and
rows; if True, simply return the original dataframe. Setting to False and
combining with phjPrintResults set to True allows the duplication to be viewed
before modifying the dataframe.

phjColAgg: (default = 'sum') Aggregation method to use to aggregate duplicate
columns. N.B. Currently, 'sum' is the only option.

phjRowAgg: (default = 'sum') Aggregation method to use to aggregate duplicate
rows. N.B. Currently, 'sum' is the only option.

phjPrintResults: (default = False) Boolean to indicate whether intermediate
results should be printed.

Returns
-------
If no errors have been encountered, the function returns a dataframe where
duplicated columns and duplicated rows have been aggregated.

Raises
------
None

See also
--------
None

Examples
--------

import numpy as np
import pandas as pd

# Define a dataframe that has duplicated column headings and duplicated rows
dupdf = pd.DataFrame({'cat1':[1,1,2,3,4,5,5,6,7,8,8,8,9,10],
                      'cat2':['a','a','b','c','d','e','e','f','g','h','h','h','i','j'],
                      'stuff':[10,20,34,65,76,45,34,65,76,87,23,46,56,76],
                      'things':[55,34,67,56,32,15,63,76,87,89,67,94,45,23],
                      'morestuff':[100,100,100,100,100,100,100,100,100,100,100,100,100,100],
                      'morethings':[200,200,200,200,200,200,200,200,200,200,200,200,200,200],
                      'yetmorestuff':[1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000],
                      'value':[67,76,87,25,47,52,34,67,87,87,66,54,34,62]
                     })
                        
print(dupdf)

        cat1 cat2  stuff  things  morestuff  morethings  yetmorestuff  value
    0      1    a     10      55        100         200          1000     67
    1      1    a     20      34        100         200          1000     76
    2      2    b     34      67        100         200          1000     87
    3      3    c     65      56        100         200          1000     25
    4      4    d     76      32        100         200          1000     47
    5      5    e     45      15        100         200          1000     52
    6      5    e     34      63        100         200          1000     34
    7      6    f     65      76        100         200          1000     67
    8      7    g     76      87        100         200          1000     87
    9      8    h     87      89        100         200          1000     87
    10     8    h     23      67        100         200          1000     66
    11     8    h     46      94        100         200          1000     54
    12     9    i     56      45        100         200          1000     34
    13    10    j     76      23        100         200          1000     62

# Clean up process that will create columns with duplicated name
dupdf = dupdf.rename(columns = {'morestuff':'stuff'})
dupdf = dupdf.rename(columns = {'yetmorestuff':'stuff'})
dupdf = dupdf.rename(columns = {'morethings':'things'})

print(dupdf)

        cat1 cat2  stuff  things  stuff  things  stuff  value
    0      1    a     10      55    100     200   1000     67
    1      1    a     20      34    100     200   1000     76
    2      2    b     34      67    100     200   1000     87
    3      3    c     65      56    100     200   1000     25
    4      4    d     76      32    100     200   1000     47
    5      5    e     45      15    100     200   1000     52
    6      5    e     34      63    100     200   1000     34
    7      6    f     65      76    100     200   1000     67
    8      7    g     76      87    100     200   1000     87
    9      8    h     87      89    100     200   1000     87
    10     8    h     23      67    100     200   1000     66
    11     8    h     46      94    100     200   1000     54
    12     9    i     56      45    100     200   1000     34
    13    10    j     76      23    100     200   1000     62


# To list the duplicated column headings and duplicated rows (as defined
# by the listed columns in phjReqVarList), run the phjAggDupColsAndRows()
# function with phjAutoAgg set to False and phjPrintResults set to True.
aggdf = epy.phjAggDupColsAndRows(phjDF = dupdf,
                                 phjReqVarList = ['cat1','cat2'],
                                 phjReqVarAllPresent = True,
                                 phjAutoAgg = False,
                                 phjColAgg = 'sum', # Currently the only option
                                 phjRowAgg = 'sum', # Currently the only option
                                 phjPrintResults = True)

    Duplicated columns
    ------------------
    Column 'stuff': number of occurrences = 3
    Column 'things': number of occurrences = 2


    Duplicated rows
    ---------------
        cat1 cat2  count
    0      1    a      2
    1      1    a      2
    5      5    e      2
    6      5    e      2
    9      8    h      3
    10     8    h      3
    11     8    h      3


    The phjAutoAgg parameter was set to False and, therefore, the original dataframe has not been altered.


# To aggregate the duplicated columns and rows, set phjAutoAgg to True
aggdf = epy.phjAggDupColsAndRows(phjDF = dupdf,
                                 phjReqVarList = ['cat1','cat2'],
                                 phjReqVarAllPresent = True,
                                 phjAutoAgg = True,
                                 phjColAgg = 'sum', # Currently the only option
                                 phjRowAgg = 'sum', # Currently the only option
                                 phjPrintResults = False)

print(aggdf)

       cat1 cat2  stuff  things  value
    0     1    a   2230     489    143
    1     2    b   1134     267     87
    2     3    c   1165     256     25
    3     4    d   1176     232     47
    4     5    e   2279     478     86
    5     6    f   1165     276     67
    6     7    g   1176     287     87
    7     8    h   3456     850    207
    8     9    i   1156     245     34
    9    10    j   1176     223     62