IMPALA SQL_Common - loukenny/atme GitHub Wiki
Logical Processing Order
Categories of data types
- Exact numerics
  - Whole numbers (integers; numbers without a decimal point)
    - smallint
    - tinyint
    - int
    - bigint
  - Decimal numbers
    - numeric
    - decimal
    - money
    - smallmoney
- Approximate numerics: store approximate rather than exact numeric values; avoid using them in the WHERE clause
  - float
  - real
- Date and time
  - time
  - date
  - smalldatetime
  - datetime
- Character strings (ASCII; store strings with English characters)
  - char
  - varchar
  - text
- Unicode character strings (non-ASCII; store strings in any language)
  - nchar
  - nvarchar
  - ntext
- Binary strings
- Other data types
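The warning about approximate numerics in a WHERE clause can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in engine (an assumption; the notes target Impala and SQL Server, whose float types behave similarly but are not tested here), and the `readings` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER, value REAL)")
# 0.1 + 0.2 is stored as the nearest binary double, which is not exactly 0.3.
conn.execute("INSERT INTO readings VALUES (1, 0.1 + 0.2)")

# Exact equality on an approximate column misses the row.
exact_match = conn.execute(
    "SELECT id FROM readings WHERE value = 0.3"
).fetchall()

# A small tolerance range around the target value finds it.
tolerant_match = conn.execute(
    "SELECT id FROM readings WHERE value BETWEEN 0.3 - 1e-9 AND 0.3 + 1e-9"
).fetchall()

print(exact_match)     # []
print(tolerant_match)  # [(1,)]
```

If exact comparisons are required, an exact type such as decimal is usually the safer choice.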
WHERE → SELECT
- Calculations on columns in the WHERE filter condition could increase query times
- Applying functions to columns in the WHERE filter condition could increase query times
Same result, but processing time ① < ②; ① is the proper method
① applies the WHERE filter to the TotalRebounds column only, whereas
② must compute DRebound+ORebound and then evaluate >=1000 for every row, which adds overhead
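The two filter styles compared above can be sketched as follows. This is a minimal illustration using Python's `sqlite3` as a stand-in engine (an assumption); the `PlayerStats` table name and its sample rows are hypothetical, while `DRebound`, `ORebound` and `TotalRebounds` come from the notes. On a table this small no timing difference is visible; the sketch only shows that both forms return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PlayerStats ("
    "PlayerName TEXT, DRebound INTEGER, ORebound INTEGER, TotalRebounds INTEGER)"
)
conn.executemany(
    "INSERT INTO PlayerStats VALUES (?, ?, ?, ?)",
    [("A", 800, 300, 1100), ("B", 400, 200, 600), ("C", 700, 300, 1000)],
)

# (1) Filter directly on the stored column.
rows_stored = sorted(conn.execute(
    "SELECT PlayerName FROM PlayerStats WHERE TotalRebounds >= 1000"
).fetchall())

# (2) Filter on a calculation that must be evaluated for every row.
rows_calc = sorted(conn.execute(
    "SELECT PlayerName FROM PlayerStats WHERE DRebound + ORebound >= 1000"
).fetchall())

print(rows_stored)  # [('A',), ('C',)]
print(rows_calc)    # [('A',), ('C',)]
```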
Difference between WHERE and HAVING
- WHERE: filters individual rows
- HAVING: numeric filter on grouped or aggregated rows
- Don't use HAVING to filter individual or ungrouped rows
- Applying a HAVING condition to individual rows makes the query take much longer
→ Used this way, HAVING filters the individual rows only after grouping, unnecessarily tying up resources and potentially increasing the run time
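A minimal sketch of the distinction, again using `sqlite3` as a stand-in engine (an assumption) with a hypothetical `sales` table. Both queries return the same rows, but the first removes unwanted rows before grouping while the second groups everything and only then discards the `West` group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 80), ("East", 50), ("West", 200), ("North", 30)],
)

# Preferred: WHERE filters individual rows, HAVING filters the aggregate.
proper = sorted(conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "WHERE region <> 'West' "
    "GROUP BY region HAVING SUM(amount) > 100"
).fetchall())

# Discouraged: the row-level condition is pushed into HAVING,
# so the West rows are grouped and summed before being thrown away.
misuse = sorted(conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "GROUP BY region HAVING region <> 'West' AND SUM(amount) > 100"
).fetchall())

print(proper)  # [('East', 130)]
print(misuse)  # [('East', 130)]
```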
Interrogation after SELECT
- Interrogation slows query processing, so in some cases extract only the data
- The subsequent analysis is often done in a program such as R or Python
SELECT TOP 5 col1, ...          -- return only the top 5 rows
SELECT TOP 1 PERCENT col1, ...  -- return only the top 1% of rows
- Impala does not support TOP → use LIMIT together with ORDER BY (ASC / DESC)
- ORDER BY is useful for data interrogation; unless there is a good reason to sort the data in a query, try to avoid using it
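The LIMIT plus ORDER BY replacement for TOP can be sketched like this. The query text uses only standard `ORDER BY ... DESC LIMIT n` syntax; it is run here through `sqlite3` as a stand-in engine (an assumption), and the `scores` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 10), ("b", 70), ("c", 30), ("d", 50),
     ("e", 20), ("f", 60), ("g", 40)],
)

# LIMIT keeps 5 rows; ORDER BY ... DESC decides which 5 count as the "top".
top5 = conn.execute(
    "SELECT name, points FROM scores ORDER BY points DESC LIMIT 5"
).fetchall()

print(top5)  # [('b', 70), ('f', 60), ('d', 50), ('g', 40), ('c', 30)]
```

Without the ORDER BY, LIMIT still returns 5 rows, but which rows come back is not guaranteed.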
Managing duplicates
- Duplicate rows can be the result of a poor database design, a poorly designed query or both
- To remove duplicate rows:
  - DISTINCT()
  - GROUP BY, when using aggregate functions
  - UNION (UNION ALL keeps duplicate rows)
- These can increase query processing time
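A short sketch of the duplicate-handling options listed above, using `sqlite3` as a stand-in engine (an assumption). The `online` and `offline` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE online (city TEXT)")
conn.execute("CREATE TABLE offline (city TEXT)")
conn.executemany("INSERT INTO online VALUES (?)",
                 [("Seoul",), ("Seoul",), ("Busan",)])
conn.executemany("INSERT INTO offline VALUES (?)",
                 [("Seoul",), ("Daegu",)])

# DISTINCT collapses duplicate rows.
distinct_rows = sorted(conn.execute(
    "SELECT DISTINCT city FROM online").fetchall())

# GROUP BY also collapses them and allows an aggregate per group.
grouped_rows = sorted(conn.execute(
    "SELECT city, COUNT(*) FROM online GROUP BY city").fetchall())

# UNION removes duplicates across both inputs; UNION ALL keeps every row.
union_rows = sorted(conn.execute(
    "SELECT city FROM online UNION SELECT city FROM offline").fetchall())
union_all_rows = sorted(conn.execute(
    "SELECT city FROM online UNION ALL SELECT city FROM offline").fetchall())

print(distinct_rows)        # [('Busan',), ('Seoul',)]
print(grouped_rows)         # [('Busan', 1), ('Seoul', 2)]
print(union_rows)           # [('Busan',), ('Daegu',), ('Seoul',)]
print(len(union_all_rows))  # 5
```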
Differentiating Techniques
The following techniques serve similar purposes
Techniques
Joins
- Combine 2+ tables
- simple operations/aggregations
- ex) What is the total sales per employee?
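The example question for joins could be answered along these lines. This is a sketch run through `sqlite3` as a stand-in engine (an assumption); the `employees` and `sales` tables and their columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE sales (employee_id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bob")])
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 100), (1, 50), (2, 70)])

# Combine the two tables, then aggregate per employee.
totals = sorted(conn.execute(
    "SELECT e.name, SUM(s.amount) "
    "FROM employees e JOIN sales s ON s.employee_id = e.id "
    "GROUP BY e.name"
).fetchall())

print(totals)  # [('Ann', 150), ('Bob', 70)]
```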
Correlated subqueries
- Match subqueries with tables or with other subqueries
- Simplify syntax; allow you to circumvent multiple, complex joins
- Avoid the limits of joins, e.g. join two separate columns in one table
- High processing time; can slow queries down
- ex) Who does each employee report to in a company?
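A correlated subquery for the reporting question might look like the sketch below, run through `sqlite3` as a stand-in engine (an assumption). The `employees` table with a self-referencing `manager_id` column is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Ann", None), (2, "Bob", 1), (3, "Cy", 1)],
)

# The inner query runs once per outer row and refers to e.manager_id,
# which is what makes it correlated.
reports_to = sorted(conn.execute(
    "SELECT e.name, "
    "       (SELECT m.name FROM employees m WHERE m.id = e.manager_id) "
    "FROM employees e"
).fetchall())

print(reports_to)  # [('Ann', None), ('Bob', 'Ann'), ('Cy', 'Ann')]
```

Because the inner query is evaluated for each outer row, this pattern is the one most likely to slow down on large tables.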
Multiple/Nested subqueries
- Useful for multi-step transformations
- improve accuracy & reproducibility
- ex) What is the average deal size closed by each sales representative in the quarter?
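The multi-step idea can be sketched with nested subqueries, each level doing one transformation. The code runs through `sqlite3` as a stand-in engine (an assumption); the `deals` table, its columns, and the choice of Q1 2024 as "the quarter" are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE deals (rep TEXT, status TEXT, closed_on TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO deals VALUES (?, ?, ?, ?)",
    [("Ann", "closed", "2024-01-15", 100.0),
     ("Ann", "closed", "2024-02-20", 300.0),
     ("Ann", "open",   "2024-03-01", 900.0),
     ("Bob", "closed", "2024-03-10", 50.0),
     ("Bob", "closed", "2024-05-01", 500.0)],
)

# Step 1 (innermost): keep closed deals only.
# Step 2: keep deals closed inside the quarter.
# Step 3 (outermost): average deal size per representative.
avg_deal = sorted(conn.execute(
    "SELECT rep, AVG(amount) FROM ("
    "    SELECT rep, amount FROM ("
    "        SELECT * FROM deals WHERE status = 'closed'"
    "    ) AS closed_deals"
    "    WHERE closed_on BETWEEN '2024-01-01' AND '2024-03-31'"
    ") AS quarter_deals "
    "GROUP BY rep"
).fetchall())

print(avg_deal)  # [('Ann', 200.0), ('Bob', 50.0)]
```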
Common Table Expressions (CTEs)
- Organize subqueries sequentially
- Can reference other CTEs
- Useful when dealing with a large number of pieces, such as a summary table
- ex) How did the marketing, sales, growth, engineering teams perform on key metrics?
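A sketch of sequential CTEs, where the second CTE references the first. It runs through `sqlite3` as a stand-in engine (an assumption); the `metrics` table, its values, and the "above average" comparison are hypothetical choices for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (team TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?)",
    [("marketing", 10), ("marketing", 20),
     ("sales", 50), ("engineering", 30)],
)

summary = sorted(conn.execute(
    "WITH team_totals AS ("                 # first CTE: one row per team
    "    SELECT team, SUM(value) AS total FROM metrics GROUP BY team"
    "), overall AS ("                       # second CTE reuses the first
    "    SELECT AVG(total) AS avg_total FROM team_totals"
    ") "
    "SELECT t.team, t.total, t.total >= o.avg_total AS above_average "
    "FROM team_totals t CROSS JOIN overall o"
).fetchall())

print(summary)
# [('engineering', 30, 0), ('marketing', 30, 0), ('sales', 50, 1)]
```

Naming each step keeps the final SELECT short and lets later steps reuse earlier ones without repeating the subquery text.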
Which do I use?
- Depends on your database/question
- use and reuse your queries
- generate clear and accurate results