GroupBy - rFronteddu/general_wiki GitHub Wiki

Aggregators

The main idea behind an aggregate function is to take multiple inputs and return a single output.

The most common aggregate functions:

AVG() - returns average value (you can use ROUND() to specify precision)
COUNT() - returns number of values (simply returns the number of rows so we just use COUNT(*))
MAX() - returns maximum value
MIN() - returns minimum value
SUM() - returns the sum of all values

Aggregate function calls happen only in the SELECT clause or the HAVING clause.

SELECT MIN(replacement_cost) FROM film;

SELECT MAX(replacement_cost), MIN(replacement_cost) FROM film;

SELECT ROUND(AVG(replacement_cost), 2) 
FROM film;

SELECT SUM(replacement_cost) 
FROM film;

GROUP BY

GROUP BY allows us to aggregate columns per some category.

Note that GROUP BY must appear right after a FROM or WHERE statement.

Note that in the SELECT statement, columns must either have an aggregate function or be in the GROUP BY call.

If we are selecting a category_column in the GROUP BY it must be in the SELECT statement.
If a column has an aggregate function, it does need to be included in the GROUP BY call.
WHERE statements should not refer to the aggregation result (you should use HAVING for that).

# this will show sum of sales grouped by company first and then division
SELECT company, division, SUM(sales)
FROM finance_table
GROUP BY company, division

If you want to sort results based on the aggregate, make sure to reference the entire function

SELECT company, SUM(sales)
FROM finance_table
GROUP BY company
ORDER BY SUM(sales)
LIMIT 5;

SELECT category_col, AGG(data_col)
FROM table
GROUP BY category_col;

SELECT category_col, AGG(data_col)
FROM table
WHERE category_col != 'A'
GROUP BY category_col;

SELECT customer_id, SUM(amount)
FROM payment
GROUP BY customer_id
ORDER BY SUM(amount);

# How much each customer spent for each staff ID
SELECT customer_id, staff_id, SUM(amount)
FROM payment
GROUP BY staff_id, customer_id
ORDER BY customer_id;`

HAVING

The HAVING clause allows us to filter after an aggregation has taken place.

# We could filter before executing the GROUP BY with a where, but if we need to wait for an aggregate, we can use having.
* We cannot use WHERE to filter based of aggregates results, because those happen after WHERE is executed.
* whenever you can is better to do the WHERE first to filter out rows to process.

SELECT company, SUM(sales) FROM finance_table GROUP BY company HAVING SUM(sales) > 1000;