Window and Analysis Functions - ignacio-alorre/Hive GitHub Wiki

Windowing allows you to create a window on a set of data further allowing aggregation surrounding that data.

Important!!: For all these examples I used a partitioned table, but the existence of a partition is not a must. Actually the PARTITION inside the OVER() clause can be done on any field of the hive table, independently if the table do have a partition or not.

Source Code

Position functions

LEAD
LAG
FIRST_VALUE
LAST_VALUE

Creating a partitioned table to demo the results of these functions

-- Creating the sampel tables
CREATE TABLE IF NOT EXISTS D_sensor 
(date String, value Int) 
PARTITIONED BY (sensor String)

INSERT INTO TABLE D_sensor 
PARTITION (sensor = 'S1')
(value, date)
VALUES
(10, '2020-03-01'),
(12, '2020-03-02'), 
(11, '2020-03-03'), 
(13, '2020-03-04')

INSERT INTO TABLE D_sensor 
PARTITION (sensor = 'S2')
(value, date)
VALUES
(11, '2020-03-01'), 
(12, '2020-03-02'), 
(13, '2020-03-03'), 
(11, '2020-03-04')

1- LEAD

This function returns the value of an expression using column values from a following row. You specify an integer offset, which designates a row position some number of rows after to the current row, by default it is one row. Any column references in the expression argument refer to column values from that later row. Returns null, or a given value as 3rd parameter, when the lead for the current row extends beyond the end of the window

Syntax

lead(expr [, offset] [, default]) OVER([partition_by_clause] order_by_clause)

Example

-- LEAD on OVER PARTITION
SELECT sensor, date, value, 
LEAD(value, 1 , 0) OVER (PARTITION BY sensor ORDER BY date) as leadValue
FROM D_sensor;

2- LAG

This function returns the value of an expression using column values from a preceding row. You specify an integer offset, which designates a row position some number of rows previous to the current row, by default this value is 1. Any column references in the expression argument refer to column values from that prior row. Returns null, or a given value as 3rd parameter, when the lag for the current row extends before the beginning of the window

Syntax

lag(expr [, offset] [, default]) OVER ([partition_by_clause] order_by_clause)

Example

-- LAG on OVER PARTITION
SELECT sensor, 
LAG(value, 1, 0) OVER (PARTITION BY sensor ORDER BY date) as prevValue
FROM D_sensor;

3- FIRST VALUE

This takes at most two parameters. The first parameter is the column for which you want the first value, the second (optional) parameter must be a boolean which define if we skip null values (True) or not (False, by default). Returns the expression value from the first row in the window. The return value is NULL if the input expression is NULL.

Syntax

first_value(expr,bool) OVER([partition_by_clause] order_by_clause [window_clause])

Example

-- FIRST_VALUE on OVER PARTITION
SELECT sensor, value,
FIRST_VALUE(value) OVER (PARTITION BY sensor ORDER BY date) as avgVal
FROM D_sensor;

4- LAST_VALUE

This takes at most two parameters. The first parameter is the column for which you want the last value, the second (optional) parameter must be a boolean which define if we skip null values (True) or not (False, by default). Returns the expression value from the last row in the window. The return value is NULL if the input expression is NULL.

Syntax

last_value(expr) OVER([partition_by_clause] order_by_clause [window_clause])

Example

-- LAST_VALUE on OVER PARTITION
SELECT sensor, value,
LAST_VALUE(value) OVER (PARTITION BY sensor ORDER BY date) as lastVal
FROM D_sensor;

Aggregation Functions

COUNT
SUM
MIN
MAX
AVG

1- COUNT

Returns the COUNT the number of rows in the partition

Example

-- COUNT on OVER PARTITION
SELECT sensor, value,
COUNT(value) OVER (PARTITION BY sensor ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as countVal
FROM D_sensor;

2- SUM

Returns the SUM of the values which fall in the given window

Example

-- SUM on OVER PARTITION WINDOW OF 1
SELECT sensor, value,
SUM(value) OVER (PARTITION BY sensor ORDER BY date ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) as sum
FROM D_sensor;

Example

-- SUM on OVER PARTITION WINDOW OF 2
SELECT sensor, value,
SUM(value) OVER (PARTITION BY sensor ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as sum
FROM D_sensor;

MIN

Returns the MIN value of a given column for a provided window

Example

-- MIN on OVER PARTITION
SELECT sensor, value,
MIN(value) OVER (PARTITION BY sensor ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as minVal
FROM D_sensor;

MAX

Returns the MAX value of a given column for a provided window

Example

-- MAX on OVER PARTITION
SELECT sensor, value,
MAX(value) OVER (PARTITION BY sensor ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as maxVal
FROM D_sensor;

AVG

Returns the AVG value of a given column for a provided window

Example

-- AVG on OVER PARTITION
SELECT sensor, value,
AVG(value) OVER (PARTITION BY sensor ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as avgVal
FROM D_sensor;

Analytics functions

RANK
ROW_NUMBER
DENSE_RANK
CUME_DIST
PERCENT_RANK
NTILE

Below is a special table to demo the functionality of all these functions:

INSERT INTO TABLE D_sensor 
PARTITION (sensor = 'S1')
(value, date)
VALUES
(10, '2020-03-01'),
(12, '2020-03-02'), 
(11, '2020-03-03'), 
(12, '2020-03-03'),
(14, '2020-03-04'),
(14, '2020-03-04')

ROW_NUMBER

Assigns unique numbers to each row within the PARTITION given the ORDER BY clause

RANK

Behaves like ROW_NUMBER(), except that “equal” rows are ranked the same

DENSE_RANK

DENSE_RANK() is a rank with no gaps. Tt behaves like ROW_NUMBER() when we add the DISTINCT keyword.

-- RANK
-- ROW_NUMBER
-- DENSE_RANK
SELECT sensor, value,
RANK() OVER (PARTITION BY sensor ORDER BY date) AS rank,
ROW_NUMBER() OVER (PARTITION BY sensor ORDER BY date) AS rowNumber,
DENSE_RANK() OVER(PARTITION BY sensor ORDER BY date) AS denseRank
FROM D_sensor
where sensor = 'S1';

CUME_DIST

Returns the cumulative distribution of a value within a group of values; that is, the percentage of partition values less than or equal to the value in the current row. This represents the number of rows preceding or peer with the current row in the window ordering of the window partition divided by the total number of rows in the window partition. Return values range from 0 to 1.

Syntax

 CUME_DIST() OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)

PERCENT_RANK

Returns the percentage of partition values less than the value in the current row, excluding the highest value. Return values range from 0 to 1 and represent the row relative rank, calculated as the result of this formula, where rank is the row rank and rows is the number of partition rows: (rank - 1) / (rows - 1)

Syntax

PERCENT_RANK() OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)

Example

SELECT sensor, date, value,
CUME_DIST() OVER (PARTITION BY sensor ORDER BY date) AS cumeDist,
PERCENT_RANK() OVER (PARTITION BY sensor ORDER BY date) AS pecentRank
FROM D_sensor
where sensor = 'S1';

NTILE

Firstly divides the given window into the number of percentiles provided as a parameter. Returns the percentile to which the row belows.

SELECT sensor, date, value,
NTILE(2) OVER (PARTITION BY sensor ORDER BY date) AS ntile2,
NTILE(3) OVER (PARTITION BY sensor ORDER BY date) AS ntile3,
NTILE(4) OVER (PARTITION BY sensor ORDER BY date) AS ntile4
FROM D_sensor
where sensor = 'S1';

Amazon Interview Question

Given the number of visitors in a museum, we want to know the gain and loss in the number of visitors with respect each day

For example

CREATE TABLE IF NOT EXISTS d_visitors 
(mdate string, visitor string)

INSERT INTO TABLE d_visitors VALUES 
('2019-01-01', 'v1'), 
('2019-01-01', 'v2'), 
('2019-01-01', 'v3'), 
('2019-01-02', 'v2'), 
('2019-01-03', 'v2'), 
('2019-01-03', 'v4'), 
('2019-01-03', 'v5')

Should return:

Notes:

Since we need to compare values of two consecutive rows we need to use LAG()
We need to GROUP BY mdate, so in the OVER() we musn't add PARTITION BY mdate, otherwise we will get nulls
We can add an function inside a function, for example LAG(COUNT(VISITORS), 1, 0)
In case we may face duplicates or same visitors coming twice same day, we need to add in both COUNT() the DISTINCT

select mdate, cnt,
case when cnt-prev > 0 then cnt-prev else 0 end as visitorsGain,
case when cnt-prev < 0 then prev-cnt else 0 end as visitorsLoss
from(
select mdate,
count(distinct visitor) as cnt,
LAG(count(distinct visitor),1, 0) over (order by mdate) as prev 
from d_visitors
group by mdate) vc

Sources