Order, Sort, Cluster and Distribute By - ignacio-alorre/Hive GitHub Wiki

Order By

colOrder: ( ASC | DESC )
colNullOrder: (NULLS FIRST | NULLS LAST)

SELECT * FROM src orderBy

Note: In the strict mode (i.e, hive.mapred.mode=strict), the order by clause has to be followed by a "limit" clause. The limit clause is not necessary if you set hive.mapred.mode=nonstrict. The reason is in order to impose total order of all results, there has to be one reducer to sort the final output. If the number of rows in the output is too large, the single reducer could take a very long time to finish.

Sort By

colOrder: ( ASC | DESC )

SELECT * FROM src SORT BY key ASC, value DESC

Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer. The sort order will be dependent on the column types. If the column is of numeric type, then the sort order is also in numeric order. If the column is of string type, then the sort order will be lexicographical order.

Difference between Sort By and Order By

Hive supports SORT BY which sorts the data per reducer.

  • ORDER BY: Guarantees total order in the output
  • SORT BY: Only guarantees ordering of the rows within a reducer. If there are more than one reducer, "sort by" may give partially ordered final results.

Cluster By and Distribute By

Cluster By and Distribute By are used mainly with the Transform/Map-Reduce Scripts. But, it is sometimes useful in SELECT statements if there is a need to partition and sort the output of a query for subsequent queries.

Cluster By is a short-cut for both, Distribute By and Sort By.

Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.

For example:

Input:

x1
x2
x4
x3
x1

Reducer 1 got:

x1
x2
x1

Reducer 2 got:

x4
x3

Note that all rows with the same key x1 is guaranteed to be distributed to the same reducer (reducer 1 in this case), but they are not guaranteed to be clustered in adjacent positions.

Reducer 1 got:

x1
x1
x2

Reducer 2 got:

x4
x3

Instead of specifying Cluster By, the user can specify Distribute By and Sort By, so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of sort columns, but that is not required.

SELECT col1, col2 FROM t1 CLUSTER BY col1

SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SORT BY col1 ASC, col2 DESC

Sources