SQL - robbiehume/CS-Notes GitHub Wiki

Links

SQL software

  • Squirrel, toad

Look into

  • offset, limit
  • Indexes
  • Stored procedures
  • CASE statement
  • VIEWs: virtual tables
  • Window functions

General

  • Always use single quotes (') for strings to ensure compatibility across different SQL dialects
    • To include a single quote inside a string, escape it by doubling the quote (''): SELECT 'It''s a great day!' AS message;
  • Use double quotes (") for identifiers if needed (e.g., column names with spaces or special characters)

Tips and tricks

  • Return the number of desired results + 1 so you know if there's more to get on the next page or not

Joins (link)

  • The JOIN operator lets you combine related information from multiple tables into a new table
  • Inner join vs outer join
    • Inner join: JOIN; keeps only the rows from both tables that's related to each other (in the resulting table)
      • In MySQL, a plain JOIN without an ON is treated as a CROSS JOIN
    • Outer joins: will also keep rows that are not related to the other table and missing data will be filled with NULL
      • Left (outer) join: LEFT JOIN; keeps the unrelated data from the left (the first) table
        • Keeps any rows from the first table that don't have a match with the second table
      • Right (outer) join: RIGHT JOIN; keeps the unrelated data from the right (the second) table
        • Keeps any rows from the second table that don't have a match with the first table
      • Full (outer) join: FULL JOIN; keeps all rows from both tables
    • The ON clause is similar to WHERE for SELECT
    • Syntax:
      •    SELECT pets.name AS pet_name, owners.name AS owner
           FROM pets
           JOIN owners
           ON pets.owner_id = owners.id;  
    • Should use table aliases if joining a table with itself
    • Can also do joins by having multiple tables in the FROM statement
  • Cross Join: produces a Cartesian product, combining each row from the first table with every row from the second table
    • Use: you need every combination of rows from two tables
    • In MySQL, a plain JOIN without an ON is treated as a CROSS JOIN
  • Self Join: joins a table to itself; same as an inner join with itself
    • Use: you need to compare rows within the same table

Table alisases:

  •    SELECT t.id
       FROM table_name t    // or FROM table_name AS t

Types of keys:

  • NOTE: only 3 types of keys are actually used in a database (primary, unique, and foreign). The rest are only concepts of RDBMS
  • Super key: a set of one or more than one key that can be used to identify a record uniquely in a table
    • Primary key, unique key, and alternate key are a subset of super keys
    • A super key can contain multiple attributes that might not be able to identify tuples in a table independently, but when grouped with certain keys, they can identify tuples uniquely
  • Candidate key: a set of one or more fields/columns that can identify a record uniquely in a table
    • There can be multiple candidate keys in one table and each candidate key can work as a primary key
  • Primary key: a set of one or more fields/columns of a table that can uniquely identify a record in a table
    • There is only one chosen primary key
  • Alternate key: a candidate key that currently is not a primary key
  • Composite key: a combination of more than one field/column of a table
  • Unique key: a set of one or more fields/columns of a tale that uniquely identify a record in a table
    • It's like a primary key, but it can accept only one NULLvalue and it can not have duplicate values, while PK doesn't allow any NULL or duplicate values
  • Foreign key: a field/column in the table that is the primary key in another table

General Notes

  • explain: In MySQL / MariaDB (different keyword in others?), you can add "explain" to the beginning of the query to see each step it takes
    • If any of the query steps say "Using full table", then it's potentially making your query slower
  • \G: in MySQL / MariaDB can add \G to the end before (or instead of) the semicolon to display a better row format for the results

Operators

BETWEEN:

  • Check if a date column is between two dates
  • SELECT *
    FROM events
    WHERE event_date BETWEEN '2024-11-01' AND '2024-11-30';

Aggregate functions

  • SUM(), AVG(), COUNT(), MIN(), MAX()
  • Scenario Example Requires GROUP BY?
    Standard Aggregate SELECT SUM(sales) FROM table No
    Grouped Aggregate SELECT region, SUM(sales) GROUP BY region Yes
    Window Function SELECT SUM(sales) OVER (...) FROM table No (uses OVER)

GROUP BY:

  • It's used with aggregate functions and used in collaboration with the SELECT statement to arrange identical data into groups
  • Can use column numbers (1, 2, etc.) instead of names. The numbers correspond to the order of the columns in the SELECT statement
    • This is especially beneficial when a column in the select is an expression
    • Only available in MySQL, PostgreSQL, and some other databases
    • Ex: 1 refers to CONCAT(publisher, ' Books'):
      SELECT CONCAT(publisher, ' Books'), COUNT(*)
      FROM Books
      GROUP BY 1;
    • Since this isn't accepted across all databases, column name aliases may be better for consistency:
      SELECT CONCAT(publisher, ' Books') AS publisher_books, COUNT(*)
      FROM Books
      GROUP BY publisher_books;
  • When you use a GROUP BY, all columns in the SELECT statement must either:
    • Be part of the GROUP BY clause, or
    • Be aggregated using an aggregate function (e.g., COUNT, MAX, MIN).
  • When doing a COUNT() on a query with a GROUP BY, it will give a count for each of the groups
  • For filtering a GROUP BY with an aggregate function, need to use HAVING instead of WHERE:
    • SELECT MAX(number) AS largest_single_number
      FROM (
        SELECT number
        FROM Numbers
        GROUP BY number
        HAVING COUNT(*) = 1
      );

ORDER BY:

  • You don't need to SELECT the column you're applying the ORDER BY to
  • Can order by multiple columns (will order by the first, and use the second column to order any rows that have the same first column value)

LIMIT:

  • Can provide an limit of how many results to select: LIMIT 5
  • Can also add an offset: LIKE offset, limit_num;
    • Ex: LIMIT 10, 5; gets 5 results starting at result 10 (or 9?)

LIKE:

  • On some RDBMS, LIKE is case-sensitive; on others it's not
  • Wildcards:
    • _: matches exactly one character; WHERE title LIKE '_arm' ()
      • WHERE title LIKE '_arm'  -- Matches titles where a four-character word ends with "arm" (e.g., "Farm", "Harm")
    • %: matches zero or more characters
      • WHERE title LIKE '% %'  -- Ensure the title has more than one word (contains a space between one or more characters on both sides)
        WHERE LOWER(title) NOT LIKE '%z%'  -- Exclude titles containing 'z' (case-insensitive)
    • []: matches any single character within the brackets;
      • WHERE title LIKE '[AB]%'  -- Matches titles starting with either "A" or "B"
  • Common use cases:
    • -- Search for substrings
      SELECT * FROM users WHERE username LIKE '%john%';  -- Finds usernames containing "john" anywhere in the string
      
      -- Starts with
      SELECT * FROM employees WHERE name LIKE 'A%';  -- Finds names starting with "A
      
      -- Ends with
      SELECT * FROM products WHERE category LIKE '%tools';  -- Finds categories ending with "tools"
      
      -- Search for a specific pattern
      SELECT * FROM files WHERE filename LIKE 'report_2023_%.pdf';  -- Matches filenames starting with "report_2023_" and ending with ".pdf"
  • Performance Tips:
    • Avoid using LIKE '%pattern%' on large datasets as it prevents the use of indexes and can slow down queries.
    • Consider using full-text search or indexed columns for better performance.

CASE:

  • It's a conditional expression that allows you to perform logical checks and return different values based on the conditions
    • It's similar to an if/else statement
  • It's typically used in the SELECT, WHERE, ORDER BY, or HAVING clauses to introduce conditional logic into your queries
  • Structure (ELSE is optional)
    -- Simple CASE Expression
    CASE expression
      WHEN value1 THEN result1
      WHEN value2 THEN result2
      ...
      ELSE default_result
    END
    
    -- Searched CASE Expression
    CASE 
      WHEN condition1 THEN result1
      WHEN condition2 THEN result2
      ...
      ELSE default_result
    END
  • Ex: check if there are any items
    SELECT 
      CASE 
        WHEN item_count > 0 THEN 'True'
        ELSE 'False END) AS laptop_views,
      END AS contains_items
    FROM inventory;

COUNT():

  • By default, COUNT will include duplicate values in the count. It essentially counts all rows for which there is a non-null value in the column
  • If only wanting to count the unique values in a column you can do COUNT(DISTINCT <col_name>)
  • Two ways to handle NULL values:
    • SELECT count(col_name): ignores the count of all the NULL values in the col_name column
    • SELECT COUNT(*): counts rows regardless of the NULL values
  • When doing a COUNT() on a query with a GROUP BY, it will give a count for each of the groups

SUM:

UNION: combines the results of two or more SELECT queries into a single result set

  • All duplicate rows are removed by default, unless you do UNION ALL
  • It requires all SELECT queries to have:
    • The same number of columns
    • Matching column data types in corresponding positions
  • Ex:
    • SELECT name FROM Customers
      UNION
      SELECT name FROM Suppliers;

IN:

  • If you want to get rows that have certain values you can use IN instead of multiple = statements
  • Ex: WHERE color = 'red' OR color = 'blue' --> WHERE color IN ('red', 'blue')

EXISTS:

  • If the subquery returns rows, then the result from the outer query is added to the result set
    • If it returns NULL, then it's not added (skipped)
  • Can use the NOT operator to inverse the EXISTS clause

ANY

  • True if the comparison is true for ANY of the values of the subquery

ALL

  • True if the comparison is true for ALL of the values of the subquery

Window Functions

Overview of components

  • 📊 Window Function: The calculation (e.g., SUM(), ROW_NUMBER())
  • 🪟 OVER() Clause: The window (which rows to include)
  • 🧩 PARTITION BY: How to group windows
  • <window_function>() OVER (PARTITION BY column_name ORDER BY column_name)

Function Types

  • Ranking Functions:
    • ROW_NUMBER() – Assigns a unique number to each row within a partition
    • RANK() – Assigns a rank with gaps for ties
    • DENSE_RANK() – Assigns a rank without gaps for ties
  • Analytic Functions:
    • LAG() – Finds the value of a previous row
    • LEAD() – Finds the value of the next row
    • FIRST_VALUE() – Finds the first value in the window
    • LAST_VALUE() – Finds the last value in the window
  • Aggregate Functions that can be used as Window Functions (but don't have to be):
    • SUM() – Running totals
    • AVG() – Moving averages
    • COUNT() – Cumulative counts
    • MIN()/MAX() – Sliding minimum/maximum

Window Function Clauses: OVER() and PARTITION BY

  • OVER() Clause:
    • Defines the window (set of rows) for the window function
    • It tells SQL:
      • How to partition the data (group it)
      • How to order the data (sequence it)
    • Required for all window functions (except aggregate functions). It defines the window frame
  • PARTITION BY Clause
    • Similar to GROUP BY but without reducing rows
    • It splits the result set into partitions (subgroups) before performing calculations
    • Each partition acts like a mini-table for the window function to operate on
    • It's optional
      • Without PARTITION BY:
        • The window function acts on all rows together.
      • With PARTITION BY:
        • The window function resets for each partition (group)

How to speed up (optimize) a query:

  • Indexing
  • Optimizing JOINs
  • Avoiding SELECT * (retrieve only necessary data)
  • Query caching for repeated queries
  • Use LIMIT to restrict the number of rows returned
  • Try to avoid subqueries and use JOINs where possible
  • Stored procedures?

Stored procedures

  • Stored procedures are recompiled collections of SQL statements that are saved and stored in a database for repeated use
  • Best situations for stored procedures:
    • Complex business logic that can be handled entirely within the database
    • Reducing application-database interaction when multiple queries would otherwise be required

Database indexes:

  • https://www.codecademy.com/paths/analyze-data-with-sql/tracks/analyze-data-sql-get-started-with-sql/modules/analyze-data-sql-learn-manipulation-c4b/articles/sql-indexes
  • Indexes help speed up querying by providing a method to quickly lookup requested data
  • Simply put, an index is a pointer to data in a table. It's very similar to a index in the back of a book
  • Indexes serve as lookup tables that efficiently store data for quicker retrieval
  • A table can have multiple indexes
  • Updating a table that has indexes takes more time than updating one without (b/c the indexes also need an update)
    • So only create indexes on columns that will be frequently searched against
  • Ex: getting records in the past 24 hours
    • By indexing a timestamp column you could look at the timestamp and once there's one that is > 24 hours ago, you know you can stop looking
    • Without indexing, you would have to look through ALL the records to check each timestamp and thus greatly increasing the time complexity

Subqueries (AKA nested query)

  • https://learnsql.com/blog/sql-subquery-types/
  • https://learnsql.com/blog/sql-subquery-examples/
  • https://learnsql.com/blog/sql-subqueries/
  • A subquery is a query placed within another SQL query
    • They can be included in the WHERE, FROM, or SELECT clauses of the main query
  • Can use them with the ANY or ALL keywords if the subquery can return multiple rows
  • Can also use them with the IN operator
  • Scalar subqueries: return a single value, or exactly one row and exactly one column
    • Ex:
          SELECT name, listed_price
          FROM paintings
          WHERE listed_price > (
              SELECT AVG(listed_price)
              FROM paintings
          );
  • Multirow subqueries: return either one column with multiple rows (i.e. a list of values) or multiple columns with multiple rows (i.e. tables)
  • Correlated subqueries: where the inner query relies on information from the outer query
    • They refer to the table from the outer query
    • Each subquery is processed one-by-one for each value in the outer query
    • You can even use them in the SELECT statement of the outer query
      • SELECT name, (SELECT AVG(age) FROM cats c2 WHERE c2.name = c1.name)
        FROM cats c1
  • Ex of subquery in FROM (find most number of cats in one breed)
    • SELECT MAX(number_of_cats)
      FROM (SELECT breed, COUNT(*) AS number_of_cats
            FROM cat
            GROUP BY breed) breed_count  // need this subquery table alias

CTEs (common table expressions) aka WITH

  • https://learnsql.com/blog/what-is-sql-cte/
  • A CTE, also referred to as a WITH clause, is a temporary named result set that you can reference anywhere in your query
  • CTEs can help organize and simplify long, complex hierarchical queries and improve readability by breaking them down into smaller blocks
  • If possible, CTEs are better to use than subqueries
  • Can be used in SELECT, INSERT, UPDATE, DELETE, and MERGE statements
  • Ex:
    •     
      --CTE
      WITH cte_sales AS
      	(SELECT EmployeeID, COUNT(*) AS OrderID 
      	 FROM Orders
      	 GROUP BY EmployeeID)
      --Query using CTE
      SELECT AVG(OrderID) AS average_orders_per_employee
      FROM cte_sales; 
    • This is a simple example, but can create multiple CTEs which is when you really start to see the advantages
  • Recursive CTE

Subqueries vs CTEs

  • https://learnsql.com/blog/sql-subquery-cte-difference/
  • https://learnsql.com/blog/reasons-to-use-ctes/
  • https://towardsdatascience.com/sql-for-data-analysis-subquery-vs-cte-699ef629d9eb
  • There are many cases where CTEs are better to use than subqueries
  • From a performance standpoint, there's not much difference because CTEs run as subqueries
  • Main differences / benefits
    • CTEs can be recursive
    • CTEs are reusable
    • CTEs can be more readable
    • CTEs must always have a name
  • Benefits of subqueries:
    • Can be used in the WHERE clause
    • Can do correlated subqueries
  • Comparison
    Feature Subqueries CTEs
    Readability Good for simple queries Better for complex or multi-step queries
    Reusability Limited to the specific clause Can be reused multiple times in the query
    Recursion No Yes (recursive CTEs)
    Optimization May be executed multiple times (per row) Typically optimized better, but may materialize
    Performance Can be slower for nested queries Can perform better, but large CTEs may use more memory
    Use Cases Simple filters, one-time calculations Complex logic, recursive queries, reusability

Subqueries vs JOINs

Syntax differences between MySQL (MariaDB), PostgreSQL, and SQLite

  • MySQL: not case-sensitive, can use either "" or ''
  • PostgreSQL: case-sensitive, can only use ''
  • SQLite: case-sensitive, ca use either "" or ''

Complex Queries

  • Start with the data model: Which tables store which data and the nature of relations b/w these tables
  • First step is to determine which tables will be used, determined by: All tables containing data needed to display the result If any of these tables are not directly related, also include all tables between them and figure out how to join them correctly
  • Start with smaller/simpler queries and put them together, checking the results along the way
  • 0.1. look at the question, what will I need?
    • What tables?
    • GROUP BY?
    • Subquery?
    • Outer joins?
    • Big expressions?
  • 0.2. get some working code
  • Get the joins right
    • Add one join at a time and test/verify along the way
  • Write any expressions, and verify them
    • For verification, do them in a select clause even if you may use it elsewhere
  • Write any where clause expressions, and verify them
  • Arrange the columns and rows to visualize how the grouping will take place
  • Add the grouping, visually verify
  • Add the having clause
  • Add the order by
  • Tidy up the code

pymysql

  • Can setup a connection and then get the cursor and do:
    • cursor.execute('sql with %(substitution)s', {'substitution': 'sub'})
  • In order to build a string with substitutions
def insert_keywords(cursor, r_id, keywords):
    keyword_list = ''
    sql = "INSERT INTO phrases (p_id, phrase_text) SELECT uuid(), kw.txt FROM phrases p RIGHT OUTER JOIN ("
    sql_dict = {}
    for keyword in keywords.split(','): 
        kw_trimmed = ' '.join(keyword.strip().split())
        if len(kw_trimmed) > 255 or not kw_trimmed:
            continue
        keyword_list += f'%({kw_trimmed})s, '
        sql += f'(SELECT %({kw_trimmed})s AS txt) union '
        sql_dict[f'{kw_trimmed}'] = kw_trimmed
    sql = sql[0:-7]
    sql += ") kw ON lower(p.phrase_text) = lower(kw.txt) WHERE p.p_id is null"
    
    cursor.execute(sql, sql_dict)
    
    cursor.execute(f''' INSERT INTO r_phrases (r_id, p_id) 
                            SELECT %(r_id)s, p_id 
                            FROM phrases WHERE phrase_text IN ({keyword_list[:-2]}) '''
                            , { 'r_id': r_id, **sql_dict})

Common problem types

Gaps and islands

  • Identifying and grouping consecutive rows (islands) that share a common characteristic, while recognizing the breaks (gaps) between these sequences

Top-N per Group

  • Selecting the top N rows for each category or group, which often involves ranking functions or subqueries

Running Totals / Cumulative Sums

  • Calculating an accumulated total or sum over a set of rows, usually in a time series or ordered dataset

Recursive / Hierarchical Queries

  • Querying data that has a recursive structure (such as organizational charts or bill-of-materials) using techniques like recursive common table expressions (CTEs)

Pivot and Unpivot

  • Transforming rows into columns (pivot) or columns into rows (unpivot) to reshape data for reporting or analysis.

Specific scenarios:

  • Find all rows that have the max value of a column
    • SELECT *
      FROM table_name
      WHERE col_name = (SELECT MAX(col_name) FROM table_name;
  • Date clause
    • BETWEEN or comparison brackets:
      SELECT *
      FROM orders
      WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31';
      
      -- Other options
      WHERE order_date >= '2024-01-01' AND order_date <= '2024-12-31';
      WHERE order_date > '2024-01-01';
  • Count number of rows corresponds to each unique value in a column:
    • Use COUNT() and GROUP BY clause
      SELECT publisher, COUNT(*) AS book_count
      FROM Books
      GROUP BY publisher;
⚠️ **GitHub.com Fallback** ⚠️