SQL query execution order - ignacio-alorre/Hive GitHub Wiki

Execution order of a SQL query:

FROM, JOIN
WHERE
GROUP BY
HAVING
SELECT
DISTINCT
ORDER BY
LIMIT, OFFSET

Let's use an example to clarify some of this.

select school_year, student_id, avg(gpa) as gpa_avg 
from gpa_history
where is_required = TRUE
group by school_year, student_id
having avg(gpa) > 3.5

Sinve we want to filter out optional courses before computing average GPAs, we need to add the WHERE is_required = TRUE instead HAVING, because WHERE is executed before GROUP BY and HAVING. The reason why we can not write HAVING avg_gpa >= 3.5 is that avg_gpa is defined as part of SELECT, so it cannot be referred to in steps executed before SELECT.

EXPLAIN

The EXPLAIN command shows the execution plan for a query

EXPLAIN
SELECT id, COLLECT(DISTINCT name) 
FROM D_customer
WHERE segment IS NOT null
GROUP BY id
ORDER BY id
LIMIT 2

Summarising the output of this query:

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:     
        filterExpr: segment is not null (type: boolean)         [WHERE segment IS NOT null]
        Select Operator     
          expressions: id (type: int), name (type: string)      [SELECT id, name] !! No DISTINCT or COLLECT here still
      Reduce Operator Tree:
        Group By Operator                                       [GROUP BY id] 
          aggregations: collect(DISTINCT KEY._col1:0._col0)     [COLLECT DISTINCT name]
  Stage: Stage-2
    Map Reduce
      Map Operator Tree: col0
        sort order: +                                           [SORT BY id]
      Reduce Operator Tree:
        Select Operator    
          Limit: 2                                              [LIMIT 2]
  Stage: Stage-0
    Fetch Operator
      limit: 2

The full output of this query is:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: d_customer
            filterExpr: segment is not null (type: boolean)
            Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: segment is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: id (type: int), name (type: string)
                outputColumnNames: id, name
                Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
                Group By Operator
                  aggregations: collect(DISTINCT name)
                  keys: id (type: int), name (type: string)
                  mode: hash
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    key expressions: _col0 (type: int), _col1 (type: string)
                    sort order: ++
                    Map-reduce partition columns: _col0 (type: int)
                    Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
      Reduce Operator Tree:
        Group By Operator
          aggregations: collect(DISTINCT KEY._col1:0._col0)
          keys: KEY._col0 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col0 (type: int)
              sort order: +
              Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
              TopN Hash Memory Usage: 0.1
              value expressions: _col1 (type: array<string>)
      Reduce Operator Tree:
        Select Operator
          expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: array<string>)
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
          Limit
            Number of rows: 2
            Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 2
      Processor Tree:
        ListSink

SQL query execution order - ignacio-alorre/Hive GitHub Wiki

EXPLAIN

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️