IMPALA SQL_Common - loukenny/atme GitHub Wiki

Logical Processing Order


Categories of data types

  • Exact numerics; Integers, numbers w/o a decimal point
    • Whole numbers

      • smallint
      • tinyint
      • int -bigint
    • Decimal numbers

      • numeric
      • decimal
      • money
      • smallmoney
    • Approximate numerics; don't store exact numbers, store approximate numeric values instead, avoid using them in the WHERE clause

      • Float
      • Real
    • Date and time

      • time
      • date
      • samalldatetime
      • datetime
    • Character strings(ASCII - store strings w ENG characters)

      • char
      • varchar
      • text
    • Unicode character strings(non-ASCII - store strings w ALL LANGUAGE words)

      • nchar
      • nvarchar
      • ntext
    • Binary strings

    • Other data types


WHERE โ†’ SELECT

  • Calculations on columns in the WHERE filter condition could increase query times
  • Applying functions to columns in the WHERE filter condition could increase query times
    ๊ฐ™์€ ๊ฒฐ๊ณผ but ์ฒ˜๋ฆฌ์†๋„ โ‘  < โ‘ก, โ‘  is proper method
    โ‘  TotalRebounds ์นผ๋Ÿผ์— ๋Œ€ํ•ด์„œ๋งŒ WHERE ํ•„ํ„ฐ๊ฐ€ ์ ์šฉ๋˜๋Š” ๋ฐ˜๋ฉด,
    โ‘ก ์ „์ฒด ํ–‰์— ๋Œ€ํ•ด DRebound+ORebound ๊ณ„์‚ฐ, >=1000 ๊ณ„์‚ฐ์„ ํ•ด์•ผํ•จ์œผ๋กœ ๋ถ€๋‹ด

WHERE / HAVING ์ฐจ์ด

  • WHERE filter individual row
  • HAVING numeric filter on grouped or aggregated rows
    • Don't use HAVING to filter individual or ungrouped rows
    • ๊ฐœ๋ณ„ ํ–‰์— ๋Œ€์— HAVING ์กฐ๊ฑด์„ ๊ฑธ ๊ฒฝ์šฐ ์ฟผ๋ฆฌ ์ฒ˜๋ฆฌ ์‹œ๊ฐ„์ด ๋งŽ์ด ๊ฑธ๋ฆผ
      โ†’ ์ด ๋ฐฉ์‹์œผ๋กœ HAVING ์‚ฌ์šฉํ•  ๊ฒฝ์šฐ, should filter the individual rows after grouping, unnecessarily tying up resources and potentially increasing the time it takes for a run

Interrogation(์งˆ๋ฌธ, ์ทจ์กฐ) after SELECT

  • interrogation ๋“ฑ ์ฟผ๋ฆฌ ์ฒ˜๋ฆฌ์†๋„๊ฐ€ ๋А๋ ค์ง€๊ธฐ ๋•Œ๋ฌธ์— ๊ฒฝ์šฐ์— ๋”ฐ๋ผ data๋งŒ ์ถ”์ถœ
  • ์ดํ›„ ๋ฐ์ดํ„ฐ ๋ถ„์„์€ R, Python ๋“ฑ ํ”„๋กœ๊ทธ๋žจ์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ๋„ ๋‹ค์ˆ˜
    • SELECT TOP 5 col1, ... -- ์ƒ์œ„ 5๊ฐœ ํ–‰๋งŒ ์ถ”์ถœ
    • SELECT TOP 1 PERCENT col1, ... -- ์ƒ์œ„ 1% ๊ฐฏ์ˆ˜์— ํ•ด๋‹นํ•˜๋Š” ํ–‰๋งŒ ์ถ”์ถœ
    • Impala, top ์‚ฌ์šฉ ๋ถˆ๊ฐ€ โ†’ LIMIT, ORDER BY ASC DESC ํ•จ๊ป˜ ์‚ฌ์šฉ
  • ORDER BY is useful for data interrogation and unless there is a good reason to sort the data in a query, try to avoid using it

Managing duplicates

  • Duplicate rows can be the result of a poor database design, a poorly designed query or both
  • ์ค‘๋ณตํ–‰์„ ์ œ๊ฑฐํ•˜๊ธฐ ์œ„ํ•ด์„œ
    • DISTINCT(), GROUP BY-์ง‘๊ณ„ํ•จ์ˆ˜ ์‚ฌ์šฉ ์‹œ / UNION, UNION ALL-์ค‘๋ณตํ–‰ ํ—ˆ์šฉ
    • ์ฟผ๋ฆฌ ์ฒ˜๋ฆฌ ์†๋„๋ฅผ ์ฆ๊ฐ€์‹œํ‚ฌ ๊ฐ€๋Šฅ์„ฑ ์žˆ์Œ

Differentiating Techniques

์„œ๋กœ ์œ ์‚ฌํ•œ ๊ธฐ๋Šฅ์„ ํ•˜๋Š” ๋‹ค์Œ ์„ธ ๊ฐ€์ง€ ๊ฒฝ์šฐ

Techniques

Joins

  • Combine 2+ tables
    • simple operations/aggregations
    • ex) What is the total sales per employee?

Correlated subqueries

  • Match subqueries & tables or w subqueries
    • simplify syntax, allow you to circumvent multiple, complex joins
    • avoid limits of joins, join two seperate columns in one table
    • high processing time slow down
    • ex) Who does each employee report to in a company?

Multiple/Nested subqueries

  • Useful when multi-step transformations
    • improve accuracy & reproducibility
    • ex) What is the average deal size closed by each sales representative in the quarter?

Common Table Expressions (CTEs)

  • Organize subqueries sequentially
  • Can reference other CTEs
    • useful when dealing w large number of pieces, summary table
    • ex) How did the marketing, sales, growth, engineering teams perform on key metrics?

Which do I use?

  • Depends on your database/question
    • use and reuse your queries
    • generate clear and accurate results