03 02 Aggregating Data - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki
Summary Statistics
- `.mean()`
- `.median()`
- `.max()`
- `.min()`
The .agg() method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations super efficient.
```python
# Import NumPy and create custom IQR function
import numpy as np

def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Print IQR and median of temperature_c, fuel_price_usd_per_l, & unemployment
print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg([iqr, np.median]))
```
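The same `.agg()` pattern on a self-contained toy DataFrame (the column names and values here are invented, not the course's `sales` data); newer pandas versions prefer the string `"median"` over `np.median` inside `.agg()`:

```python
import pandas as pd

def iqr(column):
    # Interquartile range: 75th percentile minus 25th
    return column.quantile(0.75) - column.quantile(0.25)

# Made-up data for illustration only
df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]})

# .agg() applies every function in the list to each selected column at once
print(df[["a", "b"]].agg([iqr, "median"]))
```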
Cumulative statistics
- `.cumsum()`
- `.cummax()`
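A minimal sketch of what these cumulative methods return, using a toy Series with made-up numbers:

```python
import pandas as pd

# Toy weekly figures (invented values)
s = pd.Series([3, 1, 4, 1, 5])

# Running total: each element is the sum of everything up to that point
print(s.cumsum().tolist())  # [3, 4, 8, 9, 14]

# Running maximum: the largest value seen so far
print(s.cummax().tolist())  # [3, 3, 4, 4, 5]
```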
Counting
- `.drop_duplicates(subset=...)` drops rows that have duplicate values in the given column(s)
- `.value_counts()` counts unique values; pass `normalize=True` to get proportions instead of counts
```python
# Drop duplicate store/type combinations
store_types = sales.drop_duplicates(subset=["store", "type"])
print(store_types.head())

# Drop duplicate store/department combinations
store_depts = sales.drop_duplicates(subset=["store", "department"])
print(store_depts.head())

# Subset the rows that are holiday weeks and drop duplicate dates
holiday_dates = sales[sales["is_holiday"] == True].drop_duplicates(subset="date")

# Print date col of holiday_dates
print(holiday_dates["date"])

# Count the number of stores of each type
store_counts = store_types["type"].value_counts()
print(store_counts)

# Get the proportion of stores of each type
store_props = store_types["type"].value_counts(normalize=True)
print(store_props)

# Count the number of each department number and sort
dept_counts_sorted = store_depts["department"].value_counts(sort=True)
print(dept_counts_sorted)

# Get the proportion of departments of each number and sort
dept_props_sorted = store_depts["department"].value_counts(sort=True, normalize=True)
print(dept_props_sorted)
```
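A runnable miniature of the drop-then-count pattern above (store numbers and types are invented, not the course data):

```python
import pandas as pd

# Toy data: stores 1 and 2 are type A, store 3 is type B
sales = pd.DataFrame({
    "store": [1, 1, 2, 2, 3],
    "type": ["A", "A", "A", "A", "B"],
})

# Keep one row per unique store/type pair
store_types = sales.drop_duplicates(subset=["store", "type"])

# Count stores of each type, then the same as proportions
print(store_types["type"].value_counts())
print(store_types["type"].value_counts(normalize=True))
```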
```python
# Group by type; calc total weekly sales
sales_by_type = sales.groupby("type")["weekly_sales"].sum()

# Get proportion for each type
sales_propn_by_type = sales_by_type / sales["weekly_sales"].sum()
print(sales_propn_by_type)
```

Output:

```
type
A    0.91
B    0.09
Name: weekly_sales, dtype: float64
```
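The proportion calculation can be reproduced on a toy frame (the numbers below are invented so the shares come out exactly 0.91 and 0.09):

```python
import pandas as pd

# Made-up sales figures: type A totals 91, type B totals 9
sales = pd.DataFrame({
    "type": ["A", "A", "B"],
    "weekly_sales": [50.0, 41.0, 9.0],
})

# Total weekly sales per store type
sales_by_type = sales.groupby("type")["weekly_sales"].sum()

# Divide by the grand total to get each type's share of sales
sales_propn_by_type = sales_by_type / sales["weekly_sales"].sum()
print(sales_propn_by_type)  # A -> 0.91, B -> 0.09
```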
```python
# Import NumPy as np
import numpy as np

# Pivot for mean and median weekly_sales for each store type
mean_med_sales_by_type = sales.pivot_table(values="weekly_sales", index="type",
                                           aggfunc=[np.mean, np.median])

# Print mean_med_sales_by_type
print(mean_med_sales_by_type)
```

`<script.py>` output:

```
              mean       median
      weekly_sales weekly_sales
type
A        23674.667     11943.92
B        25696.678     13336.08
```

```python
# Pivot for mean weekly_sales by store type and holiday
mean_sales_by_type_holiday = sales.pivot_table(values="weekly_sales", index="type",
                                               columns="is_holiday")

# Print mean_sales_by_type_holiday
print(mean_sales_by_type_holiday)
```

Output:

```
is_holiday      False    True
type
A           23768.584  590.045
B           25751.981  810.705
```
Filling missing values and summing in pivot tables
```python
# Print the mean weekly_sales by department and type;
# fill missing values with 0s; sum all rows and cols
print(sales.pivot_table(values="weekly_sales", index="department", columns="type",
                        fill_value=0, margins=True))
```
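A sketch of what `fill_value` and `margins` actually do, on a small invented frame: department 2 has no type-B rows, so its B cell would be `NaN` without `fill_value=0`, and `margins=True` appends an `All` row and column (means here, since the default `aggfunc` is `"mean"`):

```python
import pandas as pd

# Toy data (invented): dept 2 never appears with type B
sales = pd.DataFrame({
    "department": [1, 1, 2],
    "type": ["A", "B", "A"],
    "weekly_sales": [100.0, 50.0, 30.0],
})

# fill_value=0 replaces the NaN for the missing dept-2/type-B cell;
# margins=True adds "All" totals computed from the original data
pt = sales.pivot_table(values="weekly_sales", index="department",
                       columns="type", fill_value=0, margins=True)
print(pt)
```

Note that the `All` margins are computed from the underlying data, not from the filled-in zeros.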