Function identifies duplicate column headings and duplicate rows (based
on subset of variables) and aggregates cells to produce a dataframe that
contains no duplicates.
Importing a dataframe from an Excel file, does not allow duplicate column
headings, and only the last duplicate column is retained. However, a dataframe
can be imported with unique column headings but, after some data wrangling,
duplicate column headings might be created. For example, a dataframe might
be imported from a spreadsheet with column headings 'other' and 'other*'.
These column headings would be considered deistinct and would be imported as
two separate columns into a Pandas dataframe. However, after cleaning, it
might be the case that the asterisk is removed resulting in two columns named
'other'. This function will aggregate these duplicated columns (default by
summing).
Similarly, rows of data may be duplicated on a selection of defining categorical
variables and the data in remaining columns therefore needs to be aggregated
across rows to produce a single output.
Parameters
----------
phjDF: Pandas dataframe
phjReqVarList: List of column headings (variables) that will be used to
identify duplicated rows.
phjReqVarAllPresent: (default = True) Boolean variable to indicate that ALL
the columns listed in phjReqVarList must be present in the dataframe. In
most cases, the required list of variables should all be present in the
dataframe. However, in some cases, it might be that the data should only be
grouped by those listed variables that are present in the dataframe. This
situation has arisen when processing multiple files and in a small number
of files, one of the variables had not been included. Therefore, if
phjReqVarAllPresent is set to False, adjust to only include those required
variables that are present in the dataframe.
phjAutoAgg: (default = False) Automatically aggregate duplicate columns and
rows; if True, simply return the original dataframe. Setting to False and
combining with phjPrintResults set to True allows the duplication to be viewed
before modifying the dataframe.
phjColAgg: (default = 'sum') Aggregation method to use to aggregate duplicate
columns. N.B. Currently, 'sum' is the only option.
phjRowAgg: (default = 'sum') Aggregation method to use to aggregate duplicate
rows. N.B. Currently, 'sum' is the only option.
phjPrintResults: (default = False) Boolean to indicate whether intermediate
results should be printed.
Returns
-------
If no errors have been encountered, the function returns a dataframe where
duplicated columns and duplicated rows have been aggregated.
Raises
------
None
See also
--------
None
Examples
--------
import numpy as np
import pandas as pd
# Define a dataframe that has duplicated column headings and duplicated rows
dupdf = pd.DataFrame({'cat1':[1,1,2,3,4,5,5,6,7,8,8,8,9,10],
'cat2':['a','a','b','c','d','e','e','f','g','h','h','h','i','j'],
'stuff':[10,20,34,65,76,45,34,65,76,87,23,46,56,76],
'things':[55,34,67,56,32,15,63,76,87,89,67,94,45,23],
'morestuff':[100,100,100,100,100,100,100,100,100,100,100,100,100,100],
'morethings':[200,200,200,200,200,200,200,200,200,200,200,200,200,200],
'yetmorestuff':[1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000],
'value':[67,76,87,25,47,52,34,67,87,87,66,54,34,62]
})
print(dupdf)
cat1 cat2 stuff things morestuff morethings yetmorestuff value
0 1 a 10 55 100 200 1000 67
1 1 a 20 34 100 200 1000 76
2 2 b 34 67 100 200 1000 87
3 3 c 65 56 100 200 1000 25
4 4 d 76 32 100 200 1000 47
5 5 e 45 15 100 200 1000 52
6 5 e 34 63 100 200 1000 34
7 6 f 65 76 100 200 1000 67
8 7 g 76 87 100 200 1000 87
9 8 h 87 89 100 200 1000 87
10 8 h 23 67 100 200 1000 66
11 8 h 46 94 100 200 1000 54
12 9 i 56 45 100 200 1000 34
13 10 j 76 23 100 200 1000 62
# Clean up process that will create columns with duplicated name
dupdf = dupdf.rename(columns = {'morestuff':'stuff'})
dupdf = dupdf.rename(columns = {'yetmorestuff':'stuff'})
dupdf = dupdf.rename(columns = {'morethings':'things'})
print(dupdf)
cat1 cat2 stuff things stuff things stuff value
0 1 a 10 55 100 200 1000 67
1 1 a 20 34 100 200 1000 76
2 2 b 34 67 100 200 1000 87
3 3 c 65 56 100 200 1000 25
4 4 d 76 32 100 200 1000 47
5 5 e 45 15 100 200 1000 52
6 5 e 34 63 100 200 1000 34
7 6 f 65 76 100 200 1000 67
8 7 g 76 87 100 200 1000 87
9 8 h 87 89 100 200 1000 87
10 8 h 23 67 100 200 1000 66
11 8 h 46 94 100 200 1000 54
12 9 i 56 45 100 200 1000 34
13 10 j 76 23 100 200 1000 62
# To list the duplicated column headings and duplicated rows (as defined
# by the listed columns in phjReqVarList), run the phjAggDupColsAndRows()
# function with phjAutoAgg set to False and phjPrintResults set to True.
aggdf = epy.phjAggDupColsAndRows(phjDF = dupdf,
phjReqVarList = ['cat1','cat2'],
phjReqVarAllPresent = True,
phjAutoAgg = False,
phjColAgg = 'sum', # Currently the only option
phjRowAgg = 'sum', # Currently the only option
phjPrintResults = True)
Duplicated columns
------------------
Column 'stuff': number of occurrences = 3
Column 'things': number of occurrences = 2
Duplicated rows
---------------
cat1 cat2 count
0 1 a 2
1 1 a 2
5 5 e 2
6 5 e 2
9 8 h 3
10 8 h 3
11 8 h 3
The phjAutoAgg parameter was set to False and, therefore, the original dataframe has not been altered.
# To aggregate the duplicated columns and rows, set phjAutoAgg to True
aggdf = epy.phjAggDupColsAndRows(phjDF = dupdf,
phjReqVarList = ['cat1','cat2'],
phjReqVarAllPresent = True,
phjAutoAgg = True,
phjColAgg = 'sum', # Currently the only option
phjRowAgg = 'sum', # Currently the only option
phjPrintResults = False)
print(aggdf)
cat1 cat2 stuff things value
0 1 a 2230 489 143
1 2 b 1134 267 87
2 3 c 1165 256 25
3 4 d 1176 232 47
4 5 e 2279 478 86
5 6 f 1165 276 67
6 7 g 1176 287 87
7 8 h 3456 850 207
8 9 i 1156 245 34
9 10 j 1176 223 62