03 02 Aggregating Data - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki

Summary Statistics

  • .mean()
  • .median()
  • .max()
  • .min()
  • The .agg() method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations super efficient.
# Import NumPy and create custom IQR function
import numpy as np
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Update to print IQR and median of temperature_c, fuel_price_usd_per_l, & unemployment
print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg([iqr,np.median]))

Cumulative statistics

  • .cumsum()
  • .cummax()
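
A minimal sketch of both methods, assuming the same sales DataFrame (sales_sorted is a hypothetical name):
# Sort by date so the running statistics accumulate in time order
sales_sorted = sales.sort_values("date")

# Running total of weekly_sales
sales_sorted["cum_weekly_sales"] = sales_sorted["weekly_sales"].cumsum()

# Running maximum of weekly_sales
sales_sorted["cum_max_sales"] = sales_sorted["weekly_sales"].cummax()

print(sales_sorted[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])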

Counting

  • .drop_duplicates(subset=...) drops rows with duplicate values in the given column(s)
  • .value_counts() counts unique values; pass normalize=True to get proportions
# Drop duplicate store/type combinations
store_types = sales.drop_duplicates(subset=["store","type"])
print(store_types.head())

# Drop duplicate store/department combinations
store_depts = sales.drop_duplicates(subset = ["store","department"])
print(store_depts.head())

# Subset the rows that are holiday weeks and drop duplicate dates
holiday_dates = sales[sales["is_holiday"]==True].drop_duplicates(subset="date")

# Print date col of holiday_dates
print(holiday_dates["date"])

# Count the number of stores of each type
store_counts = store_types["type"].value_counts()
print(store_counts)

# Get the proportion of stores of each type
store_props = store_types["type"].value_counts(normalize=True)
print(store_props)

# Count the number of each department number and sort
dept_counts_sorted = store_depts["department"].value_counts(sort=True)
print(dept_counts_sorted)

# Get the proportion of departments of each number and sort
dept_props_sorted = store_depts["department"].value_counts(sort=True, normalize=True)
print(dept_props_sorted)

Grouped summary statistics

  • .groupby()
  • `dogs.groupby("color")["weight_kg"].agg([min, max, sum])`
  • `dogs.groupby(["color", "breed"])["weight_kg"].min()`
# Group by type; calc total weekly sales
sales_by_type = sales.groupby("type")["weekly_sales"].sum()

# Get proportion for each type
sales_propn_by_type = sales_by_type/sales["weekly_sales"].sum()
print(sales_propn_by_type)

<output>:
type
A    0.91
B    0.09
Name: weekly_sales, dtype: float64
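
Grouping by multiple columns and aggregating with several functions can be combined; a sketch, assuming the same sales columns:
# Group by type and is_holiday; compute several stats of weekly_sales at once
sales_stats = sales.groupby(["type", "is_holiday"])["weekly_sales"].agg(["min", "max", "mean", "median"])
print(sales_stats)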

Pivot tables

  • .pivot_table() is an alternative to .groupby(): it computes grouped summary statistics (mean by default)
  • `dogs.pivot_table(values="weight_kg", index="color")`
# Import NumPy as np
import numpy as np

# Pivot for mean and median weekly_sales for each store type
mean_med_sales_by_type = sales.pivot_table(values="weekly_sales",index="type",aggfunc=[np.mean,np.median])

# Print mean_med_sales_by_type
print(mean_med_sales_by_type)

<script.py> output:
                 mean       median
         weekly_sales weekly_sales
    type                          
    A       23674.667     11943.92
    B       25696.678     13336.08

# Pivot for mean weekly_sales by store type and holiday 
mean_sales_by_type_holiday = sales.pivot_table(values="weekly_sales", index="type", columns="is_holiday")

# Print mean_sales_by_type_holiday
print(mean_sales_by_type_holiday)

<output>:
is_holiday      False    True 
type                          
A           23768.584  590.045
B           25751.981  810.705

Filling missing values and summing in pivot tables

# Print the mean weekly_sales by department and type; fill missing values with 0s; add overall stats with margins=True
print(sales.pivot_table(values="weekly_sales", index="department", columns="type", fill_value=0, margins=True))
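
Note: margins=True adds an "All" row and column holding the statistic (the mean here) computed over the original data, so the zeros inserted by fill_value do not distort it.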