Postgres sql Window function analytics - ghdrako/doc_snipets GitHub Wiki
https://www.postgresql.org/docs/current/functions-window.html Learn
Name | Description |
---|---|
ROW_NUMBER |
Number the current row within its partition starting from 1. |
DENSE_RANK |
Rank the current row within its partition without gaps. Could also be used just as easily as RANK(), but it is defined as the rank of values, not the rank of rows. |
RANK |
Rank the current row within its partition with gaps. Is defined as the rank of rows, not the rank of values. For example, if the first two rows have a tie, the third row will get 3 from the RANK() function |
LAG |
Return a value evaluated at the row that is at a specified physical offset row before the current row within the partition. |
LEAD |
Return a value evaluated at the row that is offset rows after the current row within the partition. |
NTILE |
Divide rows in a partition as equally as possible and assign each row an integer starting from 1 to theargument value. |
SELECT ...
agregation OVER (PARTITION BY <columnes>)
FROM <liste_tables>
WHERE <predicats>
ROW_NUMBER() OVER(
PARTITION BY column_1, column_2 -- This states how you want to split your data into windows. It can be split according to one or more columns.
-- This is optional. If excluded, the window is the entire table.
ORDER BY column_3, column_4 -- This states how each window should be sorted before the function is applied.
-- This is optional in MySQL, PostgreSQL, and SQLite. It is required in Oracle and SQL Server.
)
- Use
FIRST_VALUE
andLAST_VALUE
to return the first and last rows of a window, respectively - Use
NTH_VALUE
to return a specific rank number within each window. SQL Server does not supportNTH_VALUE
. The second parameter inNTH_VALUE(name, 2)
is what specifies the second value in the window. This can be any positive integer.
SELECT *
FROM (
SELECT gender, name, babies,
ROW_NUMBER() OVER (PARTITION BY gender ORDER BY babies DESC) AS popularity
FROM baby_names) AS popularity_table
)
WHERE popularity IN (1,2);
SELECT * FROM employes ;
matricule | nom | service | salaire
-----------+----------+-------------+----------
00000001 | Dupuis | Direction | 10000.00
00000004 | Fantasio | Courrier | 4500.00
00000006 | Prunelle | Publication | 4000.00
00000020 | Lagaffe | Courrier | 3000.00
00000040 | Lebrac | Publication | 3000.00
-- suma w grupach
SELECT matricule, salaire, service,
SUM(salaire) OVER (PARTITION BY service) AS total_salaire_service
FROM employes;
matricule | salaire | service | total_salaire_service
-----------+----------+-------------+-----------------------
00000004 | 4500.00 | Courrier | 7500.00
00000020 | 3000.00 | Courrier | 7500.00
00000001 | 10000.00 | Direction | 10000.00
00000006 | 4000.00 | Publication | 7000.00
00000040 | 3000.00 | Publication | 7000.00
SELECT ...
<agregation> OVER (ORDER BY <colonnes>)
FROM <liste_tables>
WHERE <predicats>
-- suma narastajaca
SELECT matricule, salaire,
SUM(salaire) OVER (ORDER BY matricule)
FROM employes;
matricule | salaire | sum
-----------+----------+----------
00000001 | 10000.00 | 10000.00
00000004 | 4500.00 | 14500.00
00000006 | 4000.00 | 18500.00
00000020 | 3000.00 | 21500.00
00000040 | 3000.00 | 24500.00 20
SELECT country, year, production,
min(production) OVER (),
min(production) OVER (ORDER BY year)
FROM t_oil
WHERE year BETWEEN 1978 AND 1983
AND country = 'Iran';
country | year | production | min | min
---------+------+-------------+------+------
Iran | 1978 | 5302| 1321 | 5302
Iran | 1979 | 3218 | 1321 | 3218
Iran | 1980 | 1479 | 1321 | 1479
Iran | 1981 | 1321 | 1321 | 1321
Iran | 1982 | 2397 | 1321 | 1321
Iran | 1983 | 2454 | 1321 | 1321
(6 rows)
If the aggregate is used without ORDER BY, it will automatically take the minimum of the entire dataset inside your windows. This doesn’t happen if there is an ORDER BY clause. In this case, it will always be the minimum up to this point, given the order that you have defined.
SELECT ...
<agregation> OVER (PARTITION BY <colonnes> ORDER BY <colonnes>)
FROM <liste_tables>
WHERE <predicats>
-- ranking populacji krajów w ramach kontynentów
SELECT continent, pays, population,
rank() OVER (PARTITION BY continent ORDER BY population DESC) AS rang
FROM population;
-- average of the country you are dealing with
SELECT
country,
year,
production,
consumption,
avg(production) OVER (PARTITION BY country)
FROM
t_oil;
The OVER clause defines the window we are looking at. In this case, the window is the country the row belongs to. In other words, the query returns the rows according to the rest of the rows in this country.
The year column is not sorted. The query does not contain an explicit sort order, so it might be that data is returned in random order. Remember, SQL does not promise sorted output unless you explicitly state what you want.
PARTITION BY
clause takes any expression not only column to partition data
SELECT year, production,
avg(production) OVER (PARTITION BY year < 1990)
FROM t_oil
WHERE country = 'Canada'
ORDER BY year;
The point is that data is split using an expression. year < 1990 can return two values: true or false. Depending on the group a year is in, it will be assigned to the pre-1990 average or the post-1990 average.
The most important thing is that a moving window should be use with an ORDER BY clause. Otherwise, there will be major problems.PostgreSQL would actually accept the query, but the result would be totally wrong. Remember, feeding data to a sliding window without ordering it first will simply lead to random data.
SELECT
{columns},
{window_func} OVER (
PARTITION BY {partition_key}
ORDER BY {order_key}
{rangeorrows} BETWEEN {frame_start} AND {frame_end}
)
FROM
{table1};
-
{columns}
are the columns to retrieve from tables for the query, -
{window_func}
is the window function you want to use, -
{partition_key}
is the column or columns you want to partition on, -
{order_key}
is the column or columns you want to order by, -
{rangeorrows}
is either theRANGE
keyword or theROWS
keyword, -
{frame_start}
is a keyword indicating where to start the window frame, -
{frame_end}
is a keyword indicating where to end the window frame, and -
{table1}
is the table or joined tables you want to pull data from. - {frame_start} and {frame_end} can be one of the following values:
-
UNBOUNDED PRECEDING
: A keyword that, when used for {frame_start}, refers to the first record of the partition. -
{offset} PRECEDING
: A keyword referring to {offset} (an integer) rows or ranges before the current row. -
CURRENT ROW
: Refers to the current row. -
{offset} FOLLOWING
: A keyword referring to {offset} (an integer) rows or ranges after the current row. -
UNBOUNDED FOLLOWING
: A keyword that, when used for {frame_end}, refers to the last record of the partition.
-
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
defines the window. In this example, up to three rows will be in use: the current row, the one before it, and the one after the current row.
test=# SELECT *,
array_agg(id)OVER (
ORDER BY id
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
FROM generate_series(1, 5) AS id;
id | array_agg
----+-----------
1 | {1,2}
2 | {1,2,3}
3 | {2,3,4}
4 | {3,4,5}
5 | {4,5}
(5 rows)
The array_agg
function will turn a list of values into a PostgreSQL array. It will help to explain how the sliding window operates. First array contains only two values because there is no entry before 1, and therefore, the array is not full. PostgreSQL does not add null entries because they would be ignored by aggregates anyway. The same happens at the end of the data.
This keyword means that everything before the current line will be in the window.
test=# SELECT *,
array_agg(id) OVER (
ORDER BY id
ROWS BETWEEN 2 FOLLOWING AND UNBOUNDED
FOLLOWING
)
FROM generate_series(1, 5) AS id;
id | array_agg
----+-----------
1 | {3,4,5}
2 | {4,5}
3 | {5}
4 |
5 |
(5 rows)
If you want to exclude the current row from your calculation, SQL offers some syntactic sugar, as shown in the next example:
test=# SELECT year,
production,
array_agg(production) OVER (ORDER BY year
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
EXCLUDE CURRENT ROW)
FROM t_oil
WHERE country = 'USA' AND year < 1970;
year | production | array_agg
------+------------+---------------
1965 | 9014 | {9579}
1966 | 9579 | {9014,10219}
1967 | 10219 | {9579,10600}
1968 | 10600 | {10219,10828}
1969 | 10828 | {10600}
(5 rows)
{ RANGE | ROWS | GROUPS } frame_start [
frame_exclusion ]
{ RANGE | ROWS | GROUPS } BETWEEN frame_start AND
frame_end [ frame_exclusion ]