SQL - robbiehume/CS-Notes GitHub Wiki
- https://content.codecademy.com/courses/sql-intensive/reference-guide-queries.pdf?_gl=1*1l6dyh6*_ga*NzA5ODM3ODEzMy4xNjc5NTgzMzE4*_ga_3LRZM6TM9L*MTY3OTU4MzMyNS4xLjEuMTY3OTU4MzYyMy41OC4wLjA.
- https://www.codecademy.com/learn/learn-sql/modules/learn-sql-manipulation/cheatsheet
- https://www.codecademy.com/resources/docs/sql/commands/group-by?page_ref=catalog
- Write SQL like a pro
- Squirrel, toad
- offset, limit
- Indexes
- Stored procedures
- CASE statement
- VIEWs: virtual tables
- Window functions
- Always use single quotes (
'
) for strings to ensure compatibility across different SQL dialects- To include a single quote inside a string, escape it by doubling the quote (
''
):SELECT 'It''s a great day!' AS message;
- To include a single quote inside a string, escape it by doubling the quote (
- Use double quotes (
"
) for identifiers if needed (e.g., column names with spaces or special characters)
- Return the number of desired results + 1 so you know if there's more to get on the next page or not
Joins (link)
-
- The JOIN operator lets you combine related information from multiple tables into a new table
- Inner join vs outer join
-
Inner join:
JOIN
; keeps only the rows from both tables that's related to each other (in the resulting table)- In MySQL, a plain
JOIN
without anON
is treated as aCROSS JOIN
- In MySQL, a plain
-
Outer joins: will also keep rows that are not related to the other table and missing data will be filled with NULL
-
Left (outer) join:
LEFT JOIN
; keeps the unrelated data from the left (the first) table- Keeps any rows from the first table that don't have a match with the second table
-
Right (outer) join:
RIGHT JOIN
; keeps the unrelated data from the right (the second) table- Keeps any rows from the second table that don't have a match with the first table
-
Full (outer) join:
FULL JOIN
; keeps all rows from both tables
-
Left (outer) join:
- The
ON
clause is similar toWHERE
forSELECT
- Syntax:
-
SELECT pets.name AS pet_name, owners.name AS owner FROM pets JOIN owners ON pets.owner_id = owners.id;
-
- Should use table aliases if joining a table with itself
- Can also do joins by having multiple tables in the FROM statement
-
Inner join:
-
Cross Join: produces a Cartesian product, combining each row from the first table with every row from the second table
- Use: you need every combination of rows from two tables
- In MySQL, a plain
JOIN
without anON
is treated as aCROSS JOIN
-
Self Join: joins a table to itself; same as an inner join with itself
- Use: you need to compare rows within the same table
Table alisases:
-
SELECT t.id FROM table_name t // or FROM table_name AS t
- NOTE: only 3 types of keys are actually used in a database (primary, unique, and foreign). The rest are only concepts of RDBMS
-
Super key: a set of one or more than one key that can be used to identify a record uniquely in a table
- Primary key, unique key, and alternate key are a subset of super keys
- A super key can contain multiple attributes that might not be able to identify tuples in a table independently, but when grouped with certain keys, they can identify tuples uniquely
-
Candidate key: a set of one or more fields/columns that can identify a record uniquely in a table
- There can be multiple candidate keys in one table and each candidate key can work as a primary key
-
Primary key: a set of one or more fields/columns of a table that can uniquely identify a record in a table
- There is only one chosen primary key
- Alternate key: a candidate key that currently is not a primary key
- Composite key: a combination of more than one field/column of a table
-
Unique key: a set of one or more fields/columns of a tale that uniquely identify a record in a table
- It's like a primary key, but it can accept only one NULLvalue and it can not have duplicate values, while PK doesn't allow any NULL or duplicate values
- Foreign key: a field/column in the table that is the primary key in another table
-
explain
: In MySQL / MariaDB (different keyword in others?), you can add "explain" to the beginning of the query to see each step it takes- If any of the query steps say "Using full table", then it's potentially making your query slower
-
\G
: in MySQL / MariaDB can add\G
to the end before (or instead of) the semicolon to display a better row format for the results
- Check if a date column is between two dates
-
SELECT * FROM events WHERE event_date BETWEEN '2024-11-01' AND '2024-11-30';
-
SUM()
,AVG()
,COUNT()
,MIN()
,MAX()
-
Scenario Example Requires GROUP BY
?Standard Aggregate SELECT SUM(sales) FROM table
No Grouped Aggregate SELECT region, SUM(sales) GROUP BY region
Yes Window Function SELECT SUM(sales) OVER (...) FROM table
No (uses OVER
)
- It's used with aggregate functions and used in collaboration with the
SELECT
statement to arrange identical data into groups - Can use column numbers (1, 2, etc.) instead of names. The numbers correspond to the order of the columns in the
SELECT
statement- This is especially beneficial when a column in the select is an expression
- Only available in MySQL, PostgreSQL, and some other databases
- Ex:
1
refers toCONCAT(publisher, ' Books')
:SELECT CONCAT(publisher, ' Books'), COUNT(*) FROM Books GROUP BY 1;
- Since this isn't accepted across all databases, column name aliases may be better for consistency:
SELECT CONCAT(publisher, ' Books') AS publisher_books, COUNT(*) FROM Books GROUP BY publisher_books;
- When you use a
GROUP BY
, all columns in the SELECT statement must either:- Be part of the
GROUP BY
clause, or - Be aggregated using an aggregate function (e.g.,
COUNT
,MAX
,MIN
).
- Be part of the
- When doing a
COUNT()
on a query with aGROUP BY
, it will give a count for each of the groups - For filtering a
GROUP BY
with an aggregate function, need to useHAVING
instead ofWHERE
:-
SELECT MAX(number) AS largest_single_number FROM ( SELECT number FROM Numbers GROUP BY number HAVING COUNT(*) = 1 );
-
- You don't need to
SELECT
the column you're applying the ORDER BY to - Can order by multiple columns (will order by the first, and use the second column to order any rows that have the same first column value)
- Can provide an limit of how many results to select:
LIMIT 5
- Can also add an offset:
LIKE offset, limit_num
;- Ex:
LIMIT 10, 5
; gets 5 results starting at result 10 (or 9?)
- Ex:
- On some RDBMS,
LIKE
is case-sensitive; on others it's not -
Wildcards:
-
_
: matches exactly one character;WHERE title LIKE '_arm'
()-
WHERE title LIKE '_arm' -- Matches titles where a four-character word ends with "arm" (e.g., "Farm", "Harm")
-
-
%
: matches zero or more characters-
WHERE title LIKE '% %' -- Ensure the title has more than one word (contains a space between one or more characters on both sides) WHERE LOWER(title) NOT LIKE '%z%' -- Exclude titles containing 'z' (case-insensitive)
-
-
[]
: matches any single character within the brackets;-
WHERE title LIKE '[AB]%' -- Matches titles starting with either "A" or "B"
-
-
-
Common use cases:
-
-- Search for substrings SELECT * FROM users WHERE username LIKE '%john%'; -- Finds usernames containing "john" anywhere in the string -- Starts with SELECT * FROM employees WHERE name LIKE 'A%'; -- Finds names starting with "A -- Ends with SELECT * FROM products WHERE category LIKE '%tools'; -- Finds categories ending with "tools" -- Search for a specific pattern SELECT * FROM files WHERE filename LIKE 'report_2023_%.pdf'; -- Matches filenames starting with "report_2023_" and ending with ".pdf"
-
-
Performance Tips:
- Avoid using
LIKE '%pattern%'
on large datasets as it prevents the use of indexes and can slow down queries. - Consider using full-text search or indexed columns for better performance.
- Avoid using
- It's a conditional expression that allows you to perform logical checks and return different values based on the conditions
- It's similar to an if/else statement
- It's typically used in the
SELECT
,WHERE
,ORDER BY
, orHAVING
clauses to introduce conditional logic into your queries - Structure (
ELSE
is optional)-- Simple CASE Expression CASE expression WHEN value1 THEN result1 WHEN value2 THEN result2 ... ELSE default_result END -- Searched CASE Expression CASE WHEN condition1 THEN result1 WHEN condition2 THEN result2 ... ELSE default_result END
- Ex: check if there are any items
SELECT CASE WHEN item_count > 0 THEN 'True' ELSE 'False END) AS laptop_views, END AS contains_items FROM inventory;
- By default, COUNT will include duplicate values in the count. It essentially counts all rows for which there is a non-null value in the column
- If only wanting to count the unique values in a column you can do
COUNT(DISTINCT <col_name>)
- Two ways to handle NULL values:
-
SELECT count(col_name)
: ignores the count of all the NULL values in thecol_name
column -
SELECT COUNT(*)
: counts rows regardless of the NULL values
-
- When doing a
COUNT()
on a query with aGROUP BY
, it will give a count for each of the groups
- All duplicate rows are removed by default, unless you do
UNION ALL
- It requires all
SELECT
queries to have:- The same number of columns
- Matching column data types in corresponding positions
- Ex:
-
SELECT name FROM Customers UNION SELECT name FROM Suppliers;
-
- If you want to get rows that have certain values you can use IN instead of multiple
=
statements - Ex:
WHERE color = 'red' OR color = 'blue'
-->WHERE color IN ('red', 'blue')
- If the subquery returns rows, then the result from the outer query is added to the result set
- If it returns NULL, then it's not added (skipped)
- Can use the NOT operator to inverse the EXISTS clause
- True if the comparison is true for ANY of the values of the subquery
- True if the comparison is true for ALL of the values of the subquery
- 📊 Window Function: The calculation (e.g.,
SUM()
,ROW_NUMBER()
) - 🪟
OVER()
Clause: The window (which rows to include) - 🧩
PARTITION BY
: How to group windows -
<window_function>() OVER (PARTITION BY column_name ORDER BY column_name)
-
Ranking Functions:
-
ROW_NUMBER()
– Assigns a unique number to each row within a partition -
RANK()
– Assigns a rank with gaps for ties -
DENSE_RANK()
– Assigns a rank without gaps for ties
-
-
Analytic Functions:
-
LAG()
– Finds the value of a previous row -
LEAD()
– Finds the value of the next row -
FIRST_VALUE()
– Finds the first value in the window -
LAST_VALUE()
– Finds the last value in the window
-
-
Aggregate Functions that can be used as Window Functions (but don't have to be):
-
SUM()
– Running totals -
AVG()
– Moving averages -
COUNT()
– Cumulative counts -
MIN()
/MAX()
– Sliding minimum/maximum
-
-
OVER()
Clause:- Defines the window (set of rows) for the window function
- It tells SQL:
- How to partition the data (group it)
- How to order the data (sequence it)
- Required for all window functions (except aggregate functions). It defines the window frame
-
PARTITION BY
Clause-
Similar to
GROUP BY
but without reducing rows - It splits the result set into partitions (subgroups) before performing calculations
- Each partition acts like a mini-table for the window function to operate on
- It's optional
- Without
PARTITION BY
:- The window function acts on all rows together.
- With
PARTITION BY
:- The window function resets for each partition (group)
- Without
-
Similar to
- Indexing
- Optimizing
JOIN
s - Avoiding
SELECT *
(retrieve only necessary data) - Query caching for repeated queries
- Use
LIMIT
to restrict the number of rows returned - Try to avoid subqueries and use
JOIN
s where possible - Stored procedures?
- Stored procedures are recompiled collections of SQL statements that are saved and stored in a database for repeated use
- Best situations for stored procedures:
- Complex business logic that can be handled entirely within the database
- Reducing application-database interaction when multiple queries would otherwise be required
- https://www.codecademy.com/paths/analyze-data-with-sql/tracks/analyze-data-sql-get-started-with-sql/modules/analyze-data-sql-learn-manipulation-c4b/articles/sql-indexes
- Indexes help speed up querying by providing a method to quickly lookup requested data
- Simply put, an index is a pointer to data in a table. It's very similar to a index in the back of a book
- Indexes serve as lookup tables that efficiently store data for quicker retrieval
- A table can have multiple indexes
- Updating a table that has indexes takes more time than updating one without (b/c the indexes also need an update)
- So only create indexes on columns that will be frequently searched against
- Ex: getting records in the past 24 hours
- By indexing a timestamp column you could look at the timestamp and once there's one that is > 24 hours ago, you know you can stop looking
- Without indexing, you would have to look through ALL the records to check each timestamp and thus greatly increasing the time complexity
- https://learnsql.com/blog/sql-subquery-types/
- https://learnsql.com/blog/sql-subquery-examples/
- https://learnsql.com/blog/sql-subqueries/
- A subquery is a query placed within another SQL query
- They can be included in the
WHERE
,FROM
, orSELECT
clauses of the main query
- They can be included in the
- Can use them with the
ANY
orALL
keywords if the subquery can return multiple rows - Can also use them with the
IN
operator - Scalar subqueries: return a single value, or exactly one row and exactly one column
- Ex:
SELECT name, listed_price FROM paintings WHERE listed_price > ( SELECT AVG(listed_price) FROM paintings );
- Ex:
- Multirow subqueries: return either one column with multiple rows (i.e. a list of values) or multiple columns with multiple rows (i.e. tables)
- Correlated subqueries: where the inner query relies on information from the outer query
- They refer to the table from the outer query
- Each subquery is processed one-by-one for each value in the outer query
- You can even use them in the SELECT statement of the outer query
-
SELECT name, (SELECT AVG(age) FROM cats c2 WHERE c2.name = c1.name) FROM cats c1
-
- Ex of subquery in FROM (find most number of cats in one breed)
-
SELECT MAX(number_of_cats) FROM (SELECT breed, COUNT(*) AS number_of_cats FROM cat GROUP BY breed) breed_count // need this subquery table alias
-
- https://learnsql.com/blog/what-is-sql-cte/
- A CTE, also referred to as a WITH clause, is a temporary named result set that you can reference anywhere in your query
- CTEs can help organize and simplify long, complex hierarchical queries and improve readability by breaking them down into smaller blocks
- If possible, CTEs are better to use than subqueries
- Can be used in SELECT, INSERT, UPDATE, DELETE, and MERGE statements
- Ex:
-
--CTE WITH cte_sales AS (SELECT EmployeeID, COUNT(*) AS OrderID FROM Orders GROUP BY EmployeeID) --Query using CTE SELECT AVG(OrderID) AS average_orders_per_employee FROM cte_sales;
- This is a simple example, but can create multiple CTEs which is when you really start to see the advantages
-
- Recursive CTE
- https://learnsql.com/blog/sql-subquery-cte-difference/
- https://learnsql.com/blog/reasons-to-use-ctes/
- https://towardsdatascience.com/sql-for-data-analysis-subquery-vs-cte-699ef629d9eb
- There are many cases where CTEs are better to use than subqueries
- From a performance standpoint, there's not much difference because CTEs run as subqueries
- Main differences / benefits
- CTEs can be recursive
- CTEs are reusable
- CTEs can be more readable
- CTEs must always have a name
- Benefits of subqueries:
- Can be used in the WHERE clause
- Can do correlated subqueries
-
Comparison
Feature Subqueries CTEs Readability Good for simple queries Better for complex or multi-step queries Reusability Limited to the specific clause Can be reused multiple times in the query Recursion No Yes (recursive CTEs) Optimization May be executed multiple times (per row) Typically optimized better, but may materialize Performance Can be slower for nested queries Can perform better, but large CTEs may use more memory Use Cases Simple filters, one-time calculations Complex logic, recursive queries, reusability
- https://learnsql.com/blog/subquery-vs-join/
- https://learnsql.com/blog/converting-subqueries-joins/
- JOINs usually perform faster than subqueries, but subqueries can sometimes be more intuitive
- MySQL: not case-sensitive, can use either "" or ''
- PostgreSQL: case-sensitive, can only use ''
- SQLite: case-sensitive, ca use either "" or ''
- Start with the data model: Which tables store which data and the nature of relations b/w these tables
- First step is to determine which tables will be used, determined by: All tables containing data needed to display the result If any of these tables are not directly related, also include all tables between them and figure out how to join them correctly
- Start with smaller/simpler queries and put them together, checking the results along the way
- 0.1. look at the question, what will I need?
- What tables?
- GROUP BY?
- Subquery?
- Outer joins?
- Big expressions?
- 0.2. get some working code
- Get the joins right
- Add one join at a time and test/verify along the way
- Write any expressions, and verify them
- For verification, do them in a select clause even if you may use it elsewhere
- Write any where clause expressions, and verify them
- Arrange the columns and rows to visualize how the grouping will take place
- Add the grouping, visually verify
- Add the having clause
- Add the order by
- Tidy up the code
- Can setup a connection and then get the cursor and do:
cursor.execute('sql with %(substitution)s', {'substitution': 'sub'})
- In order to build a string with substitutions
def insert_keywords(cursor, r_id, keywords):
keyword_list = ''
sql = "INSERT INTO phrases (p_id, phrase_text) SELECT uuid(), kw.txt FROM phrases p RIGHT OUTER JOIN ("
sql_dict = {}
for keyword in keywords.split(','):
kw_trimmed = ' '.join(keyword.strip().split())
if len(kw_trimmed) > 255 or not kw_trimmed:
continue
keyword_list += f'%({kw_trimmed})s, '
sql += f'(SELECT %({kw_trimmed})s AS txt) union '
sql_dict[f'{kw_trimmed}'] = kw_trimmed
sql = sql[0:-7]
sql += ") kw ON lower(p.phrase_text) = lower(kw.txt) WHERE p.p_id is null"
cursor.execute(sql, sql_dict)
cursor.execute(f''' INSERT INTO r_phrases (r_id, p_id)
SELECT %(r_id)s, p_id
FROM phrases WHERE phrase_text IN ({keyword_list[:-2]}) '''
, { 'r_id': r_id, **sql_dict})
- Identifying and grouping consecutive rows (islands) that share a common characteristic, while recognizing the breaks (gaps) between these sequences
- Selecting the top N rows for each category or group, which often involves ranking functions or subqueries
- Calculating an accumulated total or sum over a set of rows, usually in a time series or ordered dataset
- Querying data that has a recursive structure (such as organizational charts or bill-of-materials) using techniques like recursive common table expressions (CTEs)
- Transforming rows into columns (pivot) or columns into rows (unpivot) to reshape data for reporting or analysis.
- Find all rows that have the max value of a column
-
SELECT * FROM table_name WHERE col_name = (SELECT MAX(col_name) FROM table_name;
-
- Date clause
-
BETWEEN
or comparison brackets:SELECT * FROM orders WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31'; -- Other options WHERE order_date >= '2024-01-01' AND order_date <= '2024-12-31'; WHERE order_date > '2024-01-01';
-
- Count number of rows corresponds to each unique value in a column:
- Use
COUNT()
andGROUP BY
clauseSELECT publisher, COUNT(*) AS book_count FROM Books GROUP BY publisher;
- Use