GroupBy - rFronteddu/general_wiki GitHub Wiki

Aggregators

The main idea behind an aggregate function is to take multiple inputs and return a single output.

The most common aggregate functions:

  • AVG() - returns average value (you can use ROUND() to specify precision)
  • COUNT() - returns number of values (simply returns the number of rows so we just use COUNT(*))
  • MAX() - returns maximum value
  • MIN() - returns minimum value
  • SUM() - returns the sum of all values

Aggregate function calls happen only in the SELECT clause or the HAVING clause.

SELECT MIN(replacement_cost) FROM film;
SELECT MAX(replacement_cost), MIN(replacement_cost) FROM film;
SELECT ROUND(AVG(replacement_cost), 2) 
FROM film;
SELECT SUM(replacement_cost) 
FROM film;

GROUP BY

GROUP BY allows us to aggregate columns per some category.

Note that GROUP BY must appear right after a FROM or WHERE statement.

Note that in the SELECT statement, columns must either have an aggregate function or be in the GROUP BY call.

  • If we are selecting a category_column in the GROUP BY it must be in the SELECT statement.
  • If a column has an aggregate function, it does need to be included in the GROUP BY call.
  • WHERE statements should not refer to the aggregation result (you should use HAVING for that).
# this will show sum of sales grouped by company first and then division
SELECT company, division, SUM(sales)
FROM finance_table
GROUP BY company, division
  • If you want to sort results based on the aggregate, make sure to reference the entire function
SELECT company, SUM(sales)
FROM finance_table
GROUP BY company
ORDER BY SUM(sales)
LIMIT 5;
SELECT category_col, AGG(data_col)
FROM table
GROUP BY category_col;
SELECT category_col, AGG(data_col)
FROM table
WHERE category_col != 'A'
GROUP BY category_col;
SELECT customer_id, SUM(amount)
FROM payment
GROUP BY customer_id
ORDER BY SUM(amount);
# How much each customer spent for each staff ID
SELECT customer_id, staff_id, SUM(amount)
FROM payment
GROUP BY staff_id, customer_id
ORDER BY customer_id;`

HAVING

The HAVING clause allows us to filter after an aggregation has taken place.

# We could filter before executing the GROUP BY with a where, but if we need to wait for an aggregate, we can use having.
* We cannot use WHERE to filter based of aggregates results, because those happen after WHERE is executed.
* whenever you can is better to do the WHERE first to filter out rows to process.

SELECT company, SUM(sales) FROM finance_table GROUP BY company HAVING SUM(sales) > 1000;