Postgres sql Window function analytics - ghdrako/doc

https://www.postgresql.org/docs/current/functions-window.html Learn

Usefull statistic functions

Name	Description
`ROW_NUMBER`	Number the current row within its partition starting from 1. Unique numbers even if duplicates
`DENSE_RANK`	Rank the current row within its partition without gaps. Could also be used just as easily as RANK(), but it is defined as the rank of values, not the rank of rows. Assigns same rank to duplicates, no gaps.
`RANK`	Rank the current row within its partition with gaps. Is defined as the rank of rows, not the rank of values. For example, if the first two rows have a tie, the third row will get 3 from the RANK() function. Assigns same rank to duplicates, skips next numbers
`LAG`	Return a value evaluated at the row that is at a specified physical offset row before the current row within the partition.
`LEAD`	Return a value evaluated at the row that is offset rows after the current row within the partition.
`NTILE`	Divide rows in a partition as equally as possible and assign each row an integer starting from 1 to theargument value.
`NTH_VALUE`	Retrieve the value from the nth row in an ordered set within a specified window. This functionality is invaluable when we need to pinpoint specific data points, such as the nth highest value or nth row in a group

SELECT ...
agregation OVER (PARTITION BY <columnes>) 
FROM <liste_tables> 
WHERE <predicats>

ROW_NUMBER() OVER(
  PARTITION BY column_1, column_2 -- This states how you want to split your data into windows. It can be split according to one or more columns.
                                  -- This is optional. If excluded, the window is the entire table.
  ORDER BY column_3, column_4     -- This states how each window should be sorted before the function is applied.
                                  -- This is optional in MySQL, PostgreSQL, and SQLite. It is required in Oracle and SQL Server.
)

Use FIRST_VALUE and LAST_VALUE to return the first and last rows of a window, respectively
Use NTH_VALUE to return a specific rank number within each window. SQL Server does not support NTH_VALUE. The second parameter in NTH_VALUE(name, 2) is what specifies the second value in the window. This can be any positive integer.

Example Return the First Two Values in Each Group

SELECT * 
  FROM (
     SELECT gender, name, babies,
            ROW_NUMBER() OVER (PARTITION BY gender ORDER BY babies DESC) AS popularity
       FROM baby_names) AS popularity_table
      )
WHERE popularity IN (1,2);

SELECT * FROM employes ;
matricule | nom | service | salaire 
-----------+----------+-------------+----------
00000001 | Dupuis | Direction | 10000.00 
00000004 | Fantasio | Courrier | 4500.00 
00000006 | Prunelle | Publication | 4000.00 
00000020 | Lagaffe | Courrier | 3000.00 
00000040 | Lebrac | Publication | 3000.00 

-- suma w grupach
SELECT matricule, salaire, service, 
       SUM(salaire) OVER (PARTITION BY service) AS total_salaire_service 
FROM employes;

matricule | salaire | service | total_salaire_service 
-----------+----------+-------------+-----------------------
00000004 | 4500.00 | Courrier | 7500.00 
00000020 | 3000.00 | Courrier | 7500.00 
00000001 | 10000.00 | Direction | 10000.00
00000006 | 4000.00 | Publication | 7000.00 
00000040 | 3000.00 | Publication | 7000.00

SELECT ...
<agregation> OVER (ORDER BY <colonnes>) 
FROM <liste_tables> 
WHERE <predicats>


-- suma narastajaca
SELECT matricule, salaire, 
       SUM(salaire) OVER (ORDER BY matricule) 
FROM employes;
matricule | salaire | sum 
-----------+----------+----------
00000001 | 10000.00 | 10000.00 
00000004 | 4500.00 | 14500.00 
00000006 | 4000.00 | 18500.00 
00000020 | 3000.00 | 21500.00 
00000040 | 3000.00 | 24500.00 20

SELECT country, year, production,
min(production) OVER (),
min(production) OVER (ORDER BY year)
FROM t_oil
WHERE year BETWEEN 1978 AND 1983
AND country = 'Iran';
country | year | production | min | min
---------+------+-------------+------+------
Iran | 1978 | 5302| 1321 | 5302
Iran | 1979 | 3218 | 1321 | 3218
Iran | 1980 | 1479 | 1321 | 1479
Iran | 1981 | 1321 | 1321 | 1321
Iran | 1982 | 2397 | 1321 | 1321
Iran | 1983 | 2454 | 1321 | 1321
(6 rows)

If the aggregate is used without ORDER BY, it will automatically take the minimum of the entire dataset inside your windows. This doesn’t happen if there is an ORDER BY clause. In this case, it will always be the minimum up to this point, given the order that you have defined.

SELECT ...
<agregation> OVER (PARTITION BY <colonnes> ORDER BY <colonnes>) 
FROM <liste_tables> 
WHERE <predicats>

-- ranking populacji krajów w ramach kontynentów
SELECT continent, pays, population, 
rank() OVER (PARTITION BY continent ORDER BY population DESC) AS rang 
FROM population;

-- average of the country you are dealing with
 SELECT
country,
year,
production,
consumption,
avg(production) OVER (PARTITION BY country)
FROM
t_oil;

The OVER clause defines the window we are looking at. In this case, the window is the country the row belongs to. In other words, the query returns the rows according to the rest of the rows in this country.

Important note

The year column is not sorted. The query does not contain an explicit sort order, so it might be that data is returned in random order. Remember, SQL does not promise sorted output unless you explicitly state what you want.

PARTITION BY clause takes any expression not only column to partition data

SELECT year, production,
avg(production) OVER (PARTITION BY year < 1990)
FROM t_oil
WHERE country = 'Canada'
ORDER BY year;

The point is that data is split using an expression. year < 1990 can return two values: true or false. Depending on the group a year is in, it will be assigned to the pre-1990 average or the post-1990 average.

Window Frame(sliding windows)

The most important thing is that a moving window should be use with an ORDER BY clause. Otherwise, there will be major problems.PostgreSQL would actually accept the query, but the result would be totally wrong. Remember, feeding data to a sliding window without ordering it first will simply lead to random data.

SELECT 
  {columns},
  {window_func} OVER (
    PARTITION BY {partition_key} 
    ORDER BY {order_key} 
    {rangeorrows} BETWEEN {frame_start} AND {frame_end}
  )
FROM 
  {table1};

{columns} are the columns to retrieve from tables for the query,
{window_func} is the window function you want to use,
{partition_key} is the column or columns you want to partition on,
{order_key} is the column or columns you want to order by,
{rangeorrows} is either the RANGE keyword or the ROWS keyword,
{frame_start} is a keyword indicating where to start the window frame,
{frame_end} is a keyword indicating where to end the window frame, and
{table1} is the table or joined tables you want to pull data from.
{frame_start} and {frame_end} can be one of the following values:
- UNBOUNDED PRECEDING: A keyword that, when used for {frame_start}, refers to the first record of the partition.
- {offset} PRECEDING: A keyword referring to {offset} (an integer) rows or ranges before the current row.
- CURRENT ROW: Refers to the current row.
- {offset} FOLLOWING: A keyword referring to {offset} (an integer) rows or ranges after the current row.
- UNBOUNDED FOLLOWING: A keyword that, when used for {frame_end}, refers to the last record of the partition.

ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING defines the window. In this example, up to three rows will be in use: the current row, the one before it, and the one after the current row.

test=# SELECT *,
array_agg(id)OVER (
ORDER BY id
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
FROM generate_series(1, 5) AS id;
id | array_agg
----+-----------
1 | {1,2}
2 | {1,2,3}
3 | {2,3,4}
4 | {3,4,5}
5 | {4,5}
(5 rows)

The array_agg function will turn a list of values into a PostgreSQL array. It will help to explain how the sliding window operates. First array contains only two values because there is no entry before 1, and therefore, the array is not full. PostgreSQL does not add null entries because they would be ignored by aggregates anyway. The same happens at the end of the data.

UNBOUNDED PRECEDING

This keyword means that everything before the current line will be in the window.

test=# SELECT *,
array_agg(id) OVER (
ORDER BY id
ROWS BETWEEN 2 FOLLOWING AND UNBOUNDED
FOLLOWING
)
FROM generate_series(1, 5) AS id;
 id | array_agg
----+-----------
1 | {3,4,5}
2 | {4,5}
3 | {5}
4 |
5 |
(5 rows)

If you want to exclude the current row from your calculation, SQL offers some syntactic sugar, as shown in the next example:

test=# SELECT year,
production,
array_agg(production) OVER (ORDER BY year
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
EXCLUDE CURRENT ROW)
FROM t_oil
WHERE country = 'USA' AND year < 1970;
year | production | array_agg
------+------------+---------------
1965 | 9014 | {9579}
1966 | 9579 | {9014,10219}
1967 | 10219 | {9579,10600}
1968 | 10600 | {10219,10828}
1969 | 10828 | {10600}
(5 rows)

difference between ROWS and RANGE

{ RANGE | ROWS | GROUPS } frame_start [
frame_exclusion ]
{ RANGE | ROWS | GROUPS } BETWEEN frame_start AND
frame_end [ frame_exclusion ]

Postgres sql Window function analytics - ghdrako/doc_snipets GitHub Wiki

Usefull statistic functions

Example Return the First Two Values in Each Group

Important note

Window Frame(sliding windows)

UNBOUNDED PRECEDING

difference between ROWS and RANGE

⚠️ GitHub.com Fallback ⚠️

Postgres sql Window function analytics - ghdrako/doc_snipets GitHub Wiki

Usefull statistic functions

Example Return the First Two Values in Each Group

Important note

Window Frame(sliding windows)

UNBOUNDED PRECEDING

difference between ROWS and RANGE

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️