Window and Analysis Functions - ignacio-alorre/Hive GitHub Wiki
Windowing allows you to create a window on a set of data further allowing aggregation surrounding that data.
Important!!: For all these examples I used a partitioned table, but the existence of a partition is not a must. Actually the PARTITION
inside the OVER()
clause can be done on any field of the hive table, independently if the table do have a partition or not.
Position functions
- LEAD
- LAG
- FIRST_VALUE
- LAST_VALUE
Creating a partitioned table to demo the results of these functions
-- Creating the sampel tables
CREATE TABLE IF NOT EXISTS D_sensor
(date String, value Int)
PARTITIONED BY (sensor String)
INSERT INTO TABLE D_sensor
PARTITION (sensor = 'S1')
(value, date)
VALUES
(10, '2020-03-01'),
(12, '2020-03-02'),
(11, '2020-03-03'),
(13, '2020-03-04')
INSERT INTO TABLE D_sensor
PARTITION (sensor = 'S2')
(value, date)
VALUES
(11, '2020-03-01'),
(12, '2020-03-02'),
(13, '2020-03-03'),
(11, '2020-03-04')
1- LEAD
This function returns the value of an expression using column values from a following row. You specify an integer offset, which designates a row position some number of rows after to the current row, by default it is one row. Any column references in the expression argument refer to column values from that later row. Returns null, or a given value as 3rd parameter, when the lead for the current row extends beyond the end of the window
Syntax
lead(expr [, offset] [, default]) OVER([partition_by_clause] order_by_clause)
Example
-- LEAD on OVER PARTITION
SELECT sensor, date, value,
LEAD(value, 1 , 0) OVER (PARTITION BY sensor ORDER BY date) as leadValue
FROM D_sensor;
2- LAG
This function returns the value of an expression using column values from a preceding row. You specify an integer offset, which designates a row position some number of rows previous to the current row, by default this value is 1. Any column references in the expression argument refer to column values from that prior row. Returns null, or a given value as 3rd parameter, when the lag for the current row extends before the beginning of the window
Syntax
lag(expr [, offset] [, default]) OVER ([partition_by_clause] order_by_clause)
Example
-- LAG on OVER PARTITION
SELECT sensor,
LAG(value, 1, 0) OVER (PARTITION BY sensor ORDER BY date) as prevValue
FROM D_sensor;
3- FIRST VALUE
This takes at most two parameters. The first parameter is the column for which you want the first value, the second (optional) parameter must be a boolean which define if we skip null values (True) or not (False, by default). Returns the expression value from the first row in the window. The return value is NULL if the input expression is NULL.
Syntax
first_value(expr,bool) OVER([partition_by_clause] order_by_clause [window_clause])
Example
-- FIRST_VALUE on OVER PARTITION
SELECT sensor, value,
FIRST_VALUE(value) OVER (PARTITION BY sensor ORDER BY date) as avgVal
FROM D_sensor;
4- LAST_VALUE
This takes at most two parameters. The first parameter is the column for which you want the last value, the second (optional) parameter must be a boolean which define if we skip null values (True) or not (False, by default). Returns the expression value from the last row in the window. The return value is NULL if the input expression is NULL.
Syntax
last_value(expr) OVER([partition_by_clause] order_by_clause [window_clause])
Example
-- LAST_VALUE on OVER PARTITION
SELECT sensor, value,
LAST_VALUE(value) OVER (PARTITION BY sensor ORDER BY date) as lastVal
FROM D_sensor;
Aggregation Functions
- COUNT
- SUM
- MIN
- MAX
- AVG
1- COUNT
Returns the COUNT
the number of rows in the partition
Example
-- COUNT on OVER PARTITION
SELECT sensor, value,
COUNT(value) OVER (PARTITION BY sensor ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as countVal
FROM D_sensor;
2- SUM
Returns the SUM
of the values which fall in the given window
Example
-- SUM on OVER PARTITION WINDOW OF 1
SELECT sensor, value,
SUM(value) OVER (PARTITION BY sensor ORDER BY date ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) as sum
FROM D_sensor;
Example
-- SUM on OVER PARTITION WINDOW OF 2
SELECT sensor, value,
SUM(value) OVER (PARTITION BY sensor ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as sum
FROM D_sensor;
MIN
Returns the MIN
value of a given column for a provided window
Example
-- MIN on OVER PARTITION
SELECT sensor, value,
MIN(value) OVER (PARTITION BY sensor ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as minVal
FROM D_sensor;
MAX
Returns the MAX
value of a given column for a provided window
Example
-- MAX on OVER PARTITION
SELECT sensor, value,
MAX(value) OVER (PARTITION BY sensor ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as maxVal
FROM D_sensor;
AVG
Returns the AVG
value of a given column for a provided window
Example
-- AVG on OVER PARTITION
SELECT sensor, value,
AVG(value) OVER (PARTITION BY sensor ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as avgVal
FROM D_sensor;
Analytics functions
- RANK
- ROW_NUMBER
- DENSE_RANK
- CUME_DIST
- PERCENT_RANK
- NTILE
Below is a special table to demo the functionality of all these functions:
INSERT INTO TABLE D_sensor
PARTITION (sensor = 'S1')
(value, date)
VALUES
(10, '2020-03-01'),
(12, '2020-03-02'),
(11, '2020-03-03'),
(12, '2020-03-03'),
(14, '2020-03-04'),
(14, '2020-03-04')
ROW_NUMBER
Assigns unique numbers to each row within the PARTITION
given the ORDER BY
clause
RANK
Behaves like ROW_NUMBER()
, except that “equal” rows are ranked the same
DENSE_RANK
DENSE_RANK()
is a rank with no gaps. Tt behaves like ROW_NUMBER()
when we add the DISTINCT
keyword.
-- RANK
-- ROW_NUMBER
-- DENSE_RANK
SELECT sensor, value,
RANK() OVER (PARTITION BY sensor ORDER BY date) AS rank,
ROW_NUMBER() OVER (PARTITION BY sensor ORDER BY date) AS rowNumber,
DENSE_RANK() OVER(PARTITION BY sensor ORDER BY date) AS denseRank
FROM D_sensor
where sensor = 'S1';
CUME_DIST
Returns the cumulative distribution of a value within a group of values; that is, the percentage of partition values less than or equal to the value in the current row. This represents the number of rows preceding or peer with the current row in the window ordering of the window partition divided by the total number of rows in the window partition. Return values range from 0 to 1.
Syntax
CUME_DIST() OVER (
[PARTITION BY partition_expression, ... ]
ORDER BY sort_expression [ASC | DESC], ...
)
PERCENT_RANK
Returns the percentage of partition values less than the value in the current row, excluding the highest value. Return values range from 0 to 1 and represent the row relative rank, calculated as the result of this formula, where rank is the row rank and rows is the number of partition rows: (rank - 1) / (rows - 1)
Syntax
PERCENT_RANK() OVER (
[PARTITION BY partition_expression, ... ]
ORDER BY sort_expression [ASC | DESC], ...
)
Example
SELECT sensor, date, value,
CUME_DIST() OVER (PARTITION BY sensor ORDER BY date) AS cumeDist,
PERCENT_RANK() OVER (PARTITION BY sensor ORDER BY date) AS pecentRank
FROM D_sensor
where sensor = 'S1';
NTILE
Firstly divides the given window into the number of percentiles provided as a parameter. Returns the percentile to which the row belows.
SELECT sensor, date, value,
NTILE(2) OVER (PARTITION BY sensor ORDER BY date) AS ntile2,
NTILE(3) OVER (PARTITION BY sensor ORDER BY date) AS ntile3,
NTILE(4) OVER (PARTITION BY sensor ORDER BY date) AS ntile4
FROM D_sensor
where sensor = 'S1';
Amazon Interview Question
Given the number of visitors in a museum, we want to know the gain and loss in the number of visitors with respect each day
For example
CREATE TABLE IF NOT EXISTS d_visitors
(mdate string, visitor string)
INSERT INTO TABLE d_visitors VALUES
('2019-01-01', 'v1'),
('2019-01-01', 'v2'),
('2019-01-01', 'v3'),
('2019-01-02', 'v2'),
('2019-01-03', 'v2'),
('2019-01-03', 'v4'),
('2019-01-03', 'v5')
Should return:
Notes:
- Since we need to compare values of two consecutive rows we need to use LAG()
- We need to
GROUP BY
mdate, so in the OVER() we musn't addPARTITION BY
mdate, otherwise we will get nulls - We can add an function inside a function, for example
LAG(COUNT(VISITORS), 1, 0)
- In case we may face duplicates or same visitors coming twice same day, we need to add in both
COUNT()
theDISTINCT
select mdate, cnt,
case when cnt-prev > 0 then cnt-prev else 0 end as visitorsGain,
case when cnt-prev < 0 then prev-cnt else 0 end as visitorsLoss
from(
select mdate,
count(distinct visitor) as cnt,
LAG(count(distinct visitor),1, 0) over (order by mdate) as prev
from d_visitors
group by mdate) vc
Sources
- https://acadgild.com/blog/windowing-functions-in-hive
- https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/language_manual/ptf-window.html
- https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
- https://dzone.com/articles/difference-between-rownumber
- https://stackoverflow.com/a/7747342/1773841