IMPALA SQL_Common - loukenny/atme GitHub Wiki

Logical Processing Order

Categories of data types

- Exact numerics
  - Whole numbers (integers, numbers without a decimal point)
    - smallint
    - tinyint
    - int
    - bigint
  - Decimal numbers
    - numeric
    - decimal
    - money
    - smallmoney
- Approximate numerics: store approximate rather than exact values, so avoid comparing them in the WHERE clause
  - float
  - real
- Date and time
  - time
  - date
  - smalldatetime
  - datetime
- Character strings (ASCII: store strings with English characters)
  - char
  - varchar
  - text
- Unicode character strings (non-ASCII: store strings with words from any language)
  - nchar
  - nvarchar
  - ntext
- Binary strings
- Other data types
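
The warning about approximate numerics in the WHERE clause can be illustrated with a minimal sketch. It assumes SQLite (through Python's `sqlite3` module) as a stand-in engine, and the `prices` table is hypothetical; the same rounding behaviour applies to any IEEE 754 floating-point column.

```python
import sqlite3

# SQLite stands in for the target engine; its REAL type is an IEEE 754
# double, i.e. an approximate numeric.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (id INTEGER, amount REAL)")  # hypothetical table
conn.execute("INSERT INTO prices VALUES (1, 0.1 + 0.2)")

# Exact equality on an approximate value matches nothing here, because
# 0.1 + 0.2 is stored as a value slightly different from 0.3.
exact_match = conn.execute(
    "SELECT id FROM prices WHERE amount = 0.3"
).fetchall()

# Comparing within a small tolerance is a common workaround.
tolerant_match = conn.execute(
    "SELECT id FROM prices WHERE ABS(amount - 0.3) < 1e-9"
).fetchall()

print(exact_match)     # []
print(tolerant_match)  # [(1,)]
```

Where exact comparisons matter, an exact type such as decimal avoids the problem altogether.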

WHERE → SELECT

Calculations on columns in the WHERE filter condition could increase query times
Applying functions to columns in the WHERE filter condition could increase query times
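Two filters can return the same result yet differ in processing time: ① takes less time than ②, so ① is the proper method
① applies the WHERE filter to the TotalRebounds column only, whereas
② must compute DRebound+ORebound for every row and then evaluate >=1000, which adds overhead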
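
The variants labelled ① and ② might look like the following sketch. Only the column names (TotalRebounds, DRebound, ORebound) come from the notes; the `PlayerStats` table, its rows, and the use of SQLite as a stand-in for Impala are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; only the column names are taken from the notes.
conn.execute(
    "CREATE TABLE PlayerStats ("
    "Player TEXT, ORebound INTEGER, DRebound INTEGER, TotalRebounds INTEGER)"
)
conn.executemany(
    "INSERT INTO PlayerStats VALUES (?, ?, ?, ?)",
    [("A", 300, 800, 1100), ("B", 100, 400, 500), ("C", 450, 550, 1000)],
)

# (1) Filter on the stored column: no arithmetic inside the WHERE clause.
query_1 = "SELECT Player FROM PlayerStats WHERE TotalRebounds >= 1000"

# (2) Same result, but DRebound + ORebound is computed for every row
#     before the comparison can be made.
query_2 = "SELECT Player FROM PlayerStats WHERE DRebound + ORebound >= 1000"

rows_1 = conn.execute(query_1).fetchall()
rows_2 = conn.execute(query_2).fetchall()
print(sorted(rows_1) == sorted(rows_2))  # True
```

On three rows the difference is invisible; the cost of variant (2) only shows up on large tables.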

Difference between WHERE and HAVING

WHERE filters individual rows
HAVING applies a numeric filter to grouped or aggregated rows
- Don't use HAVING to filter individual or ungrouped rows
- Applying a HAVING condition to individual rows makes the query take much longer to process
  → Used this way, HAVING filters the individual rows only after grouping, unnecessarily tying up resources and potentially increasing run time
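
A minimal sketch of the split between the two clauses, assuming a hypothetical `sales` table and SQLite as a stand-in engine: the row-level condition goes in WHERE, and only the aggregate condition goes in HAVING.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (employee TEXT, region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("ann", "east", 100), ("ann", "east", 250),
     ("bob", "west", 400), ("cy", "east", 50)],
)

# Preferred: rows are filtered by WHERE before grouping; HAVING only
# checks the aggregated value.
preferred = conn.execute(
    "SELECT employee, SUM(amount) AS total FROM sales "
    "WHERE region = 'east' "
    "GROUP BY employee "
    "HAVING SUM(amount) >= 100"
).fetchall()

# Discouraged: the row-level condition sits in HAVING, so rows from every
# region are grouped first and the unwanted groups are discarded afterwards.
discouraged = conn.execute(
    "SELECT employee, SUM(amount) AS total FROM sales "
    "GROUP BY employee, region "
    "HAVING region = 'east' AND SUM(amount) >= 100"
).fetchall()

print(preferred)    # [('ann', 350)]
print(discouraged)  # [('ann', 350)]
```

Both queries return the same rows; the difference is how much work is done before the filter applies.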

Interrogation after SELECT

Because interrogation and similar operations slow query processing, in some cases SQL is used only to extract the data
The subsequent analysis is then often done in tools such as R or Python
- SELECT TOP 5 col1, ... -- returns only the top 5 rows
- SELECT TOP 1 PERCENT col1, ... -- returns only the top 1 percent of rows
- Impala does not support TOP → use LIMIT together with ORDER BY ... ASC/DESC
ORDER BY is useful for data interrogation, but unless there is a good reason to sort the data in a query, try to avoid using it
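
The LIMIT plus ORDER BY replacement for TOP can be sketched as follows. SQLite, which also lacks TOP, is used here as a stand-in for Impala, and the `scores` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, points INTEGER)")  # hypothetical
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 12), ("b", 30), ("c", 7), ("d", 25),
     ("e", 18), ("f", 21), ("g", 3)],
)

# Stand-in for "SELECT TOP 5 ... ORDER BY points DESC" in engines without TOP.
top_5 = conn.execute(
    "SELECT player, points FROM scores ORDER BY points DESC LIMIT 5"
).fetchall()

# The rows arrive as plain Python tuples, ready for further analysis
# outside the database.
print(top_5)  # [('b', 30), ('d', 25), ('f', 21), ('e', 18), ('a', 12)]
```

Here the ORDER BY is justified because "top" only has a meaning once the rows are sorted.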

Managing duplicates

Duplicate rows can be the result of a poor database design, a poorly designed query or both
To remove duplicate rows:
- DISTINCT, or GROUP BY when aggregate functions are used / UNION (UNION ALL keeps duplicate rows)
- These operations can increase query processing time
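
The duplicate-handling options listed above can be compared side by side. This is a sketch with hypothetical `orders_2023` and `orders_2024` tables, run on SQLite as a stand-in engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_2023 (customer TEXT)")  # hypothetical
conn.execute("CREATE TABLE orders_2024 (customer TEXT)")  # hypothetical
conn.executemany("INSERT INTO orders_2023 VALUES (?)",
                 [("ann",), ("bob",), ("ann",)])
conn.executemany("INSERT INTO orders_2024 VALUES (?)",
                 [("bob",), ("cy",)])

# DISTINCT removes duplicate rows from a single result set.
distinct_rows = conn.execute(
    "SELECT DISTINCT customer FROM orders_2023"
).fetchall()

# GROUP BY also yields one row per customer and allows an aggregate.
grouped_rows = conn.execute(
    "SELECT customer, COUNT(*) FROM orders_2023 GROUP BY customer"
).fetchall()

# UNION removes duplicates across both inputs; UNION ALL keeps them.
union_rows = conn.execute(
    "SELECT customer FROM orders_2023 UNION SELECT customer FROM orders_2024"
).fetchall()
union_all_rows = conn.execute(
    "SELECT customer FROM orders_2023 UNION ALL SELECT customer FROM orders_2024"
).fetchall()

print(len(union_rows), len(union_all_rows))  # 3 5
```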

Differentiating Techniques

The following techniques perform similar functions

Techniques

Joins

Combine two or more tables
- simple operations/aggregations
- ex) What is the total sales per employee?
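
The example question for joins could be answered with a sketch like this one. The `employees` and `sales` tables and their columns are hypothetical, and SQLite stands in for Impala.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE sales (employee_id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(1, "ann"), (2, "bob")])
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 100), (1, 250), (2, 400)])

# Join the two tables, then aggregate: total sales per employee.
total_sales = conn.execute(
    "SELECT e.name, SUM(s.amount) AS total_sales "
    "FROM employees AS e "
    "JOIN sales AS s ON s.employee_id = e.id "
    "GROUP BY e.name"
).fetchall()

print(sorted(total_sales))  # [('ann', 350), ('bob', 400)]
```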

Correlated subqueries

Match subqueries with tables or with other subqueries
- simplify syntax, allow you to circumvent multiple, complex joins
- avoid limits of joins, join two separate columns in one table
- high processing time, which can slow queries down
- ex) Who does each employee report to in a company?
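
A sketch of the reporting-line question as a correlated subquery, assuming a hypothetical `employees` table in which `manager_id` points back at `id` in the same table. SQLite is the stand-in engine; support for correlated subqueries in the SELECT list varies between engines, so treat this as an illustration of the idea rather than Impala-specific syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "dana", None), (2, "ann", 1), (3, "bob", 1), (4, "cy", 2)],
)

# The inner query refers to e.manager_id from the outer query, so it is
# evaluated once per outer row. That is what makes it "correlated" and
# also what makes it expensive on large tables.
reports_to = conn.execute(
    "SELECT e.name, "
    "       (SELECT m.name FROM employees AS m "
    "        WHERE m.id = e.manager_id) AS manager "
    "FROM employees AS e"
).fetchall()

print(dict(reports_to))
```

Employees without a manager come back with `None` in the manager column.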

Multiple/Nested subqueries

Useful for multi-step transformations
- improve accuracy & reproducibility
- ex) What is the average deal size closed by each sales representative in the quarter?
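
The average-deal-size question can be sketched as a multi-step, nested query. The `deal_items` table, its rows, and the quarter boundaries are all hypothetical, and SQLite stands in for Impala.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE deal_items ("
    "deal_id INTEGER, rep TEXT, closed_on TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO deal_items VALUES (?, ?, ?, ?)",
    [
        (1, "ann", "2024-01-15", 100),
        (1, "ann", "2024-01-15", 300),
        (2, "ann", "2024-02-20", 200),
        (3, "bob", "2024-03-05", 500),
        (4, "bob", "2024-05-01", 900),  # closed outside the quarter
    ],
)

avg_deal_size = conn.execute("""
    SELECT rep, AVG(deal_total) AS avg_deal_size   -- step 3: average per rep
    FROM (
        SELECT rep, deal_id, SUM(amount) AS deal_total   -- step 2: one row per deal
        FROM (
            SELECT rep, deal_id, amount                  -- step 1: keep the quarter
            FROM deal_items
            WHERE closed_on >= '2024-01-01' AND closed_on < '2024-04-01'
        ) AS in_quarter
        GROUP BY rep, deal_id
    ) AS per_deal
    GROUP BY rep
""").fetchall()

print(sorted(avg_deal_size))  # [('ann', 300.0), ('bob', 500.0)]
```

Each level does one transformation, which makes the intermediate results easy to check in isolation.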

Common Table Expressions (CTEs)

Organize subqueries sequentially
Can reference other CTEs
- useful when dealing with a large number of pieces, such as building a summary table
- ex) How did the marketing, sales, growth, engineering teams perform on key metrics?
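
A sketch of sequential CTEs, where the last CTE references the two defined before it to build a small summary table. The `team_metrics` table and its values are hypothetical, and SQLite stands in for Impala.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE team_metrics (team TEXT, metric TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO team_metrics VALUES (?, ?, ?)",
    [("sales", "target", 100), ("sales", "actual", 120),
     ("marketing", "target", 200), ("marketing", "actual", 150)],
)

summary = conn.execute("""
    WITH targets AS (
        SELECT team, SUM(value) AS target
        FROM team_metrics WHERE metric = 'target' GROUP BY team
    ),
    actuals AS (
        SELECT team, SUM(value) AS actual
        FROM team_metrics WHERE metric = 'actual' GROUP BY team
    ),
    team_summary AS (
        -- A CTE can reference the CTEs defined before it.
        SELECT t.team, t.target, a.actual, a.actual - t.target AS gap
        FROM targets AS t
        JOIN actuals AS a ON a.team = t.team
    )
    SELECT team, target, actual, gap FROM team_summary
""").fetchall()

print(sorted(summary))  # [('marketing', 200, 150, -50), ('sales', 100, 120, 20)]
```

Compared with nesting, the steps read top to bottom and each one has a name that can be reused.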

Which do I use?

Depends on your database/question
- use and reuse your queries
- generate clear and accurate results