5.3.2.Use JOINS to aggregate data in SQL - sj50179/Google-Data-Analytics-Professional-Certificate GitHub Wiki

Understanding JOINS

JOIN

  • A SQL clause that is used to combine rows from two or more tables based on a related column

Common JOINS

  • INNER - A function that returns records with matching values in both tables
  • LEFT- A function that will return all the records from the left table and only the matching records from the right table
  • RIGHT - A function that will return all the records from the right table and only the matching records from the left table
  • OUTER - A function that combines RIGHT and LEFT JOIN to return all matching records in both tables

Secret identities: The importance of aliases

In this reading, you will learn about using aliasing to simplify your SQL queries. Aliases are used in SQL queries to create temporary names for a column or table. Aliases make referencing tables and columns in your SQL queries much simpler when you have table or column names that are too long or complex to make use of in queries. Imagine a table name like special_projects_customer_negotiation_mileages. That would be difficult to retype every time you use that table. With an alias, you can create a meaningful nickname that you can use for your analysis. In this case “special_projects_customer_negotiation_mileages” can be aliased to simply “mileage.” Instead of having to write out the long table name, you can use a meaningful nickname that you decide.

Basic syntax for aliasing

Aliasing is the process of using aliases. In SQL queries, aliases are implemented by making use of the AS command. The basic syntax for the AS command can be seen in the following query for aliasing a table:

SELECT column_name(s)
FROM table_name AS alias_name;

Notice that AS is preceded by the table name and followed by the new nickname. It is a similar approach to aliasing a column:

SELECT column_name AS alias_name
FROM table_name;

In both cases, you now have a new name that you can use to refer to the column or table that was aliased.

Alternate syntax for aliases

If using AS results in an error when running a query because the SQL database you are working with doesn't support it, you can leave it out. In the previous examples, the alternate syntax for aliasing a table or column would be:

  • FROM table_name alias_name
  • SELECT column_name alias_name

The key takeaway is that queries can run with or without using AS for aliasing, but using AS has the benefit of making queries more readable. It helps to make aliases stand out more clearly.

Aliasing in action

Let’s check out an example of a SQL query that uses aliasing. Let’s say that you are working with two tables: one of them has employee data and the other one has department data. The FROM statement to alias those tables could be:

FROM word_day.employees AS employees

FROM work_day.employees AS employees

These aliases still let you know exactly what is in these tables, but now you don’t have to manually input those long table names. Aliases can be really helpful for long, complicated queries. It is easier to read and write your queries when you have aliases that tell you what is included within your tables.

For more information

If you are interested in learning more about aliasing, here are some resources to help you get started:

  • SQL Aliases: This tutorial on aliasing is a really useful resource to have when you start practicing writing queries and aliasing tables on your own. It also demonstrates how aliasing works with real tables.
  • SQL Alias: This detailed introduction to aliasing includes multiple examples. This is another great resource to reference if you need more examples.
  • Using Column Aliasing: This is a guide that focuses on column aliasing specifically. Generally, you will be aliasing entire tables, but if you find yourself needing to alias just a column, this is a great resource to have bookmarked.

Using JOINs effectively

In this reading, you will review how JOINs are used and will be introduced to some resources that you can use to learn more about them. A JOIN combines tables by using a primary or foreign key to align the information coming from both tables in the combination process. JOINs use these keys to identify relationships and corresponding values across tables.

If you need a refresher on primary and foreign keys, refer to the glossary for this course, or go back to Databases in data analytics.

The general JOIN syntax

SELECT
	-- table columns from tables are inserted here
	table_name1.column_name
	table_name2.column_name
FROM
	table_name1
JOIN
	table_name2
ON table_name1.column_name = table_name2.column_name

As you can see from the syntax, the JOIN statement is part of the FROM clause of the query. JOIN in SQL indicates that you are going to combine data from two tables. ON in SQL identifies how the tables are to be matched for the correct information to be combined from both.

Type of JOINs

There are four general ways in which to conduct JOINs in SQL queries: INNER, LEFT, RIGHT, and FULL OUTER.

The circles represent left and right tables, and where they are joined is highlighted in blue

Here is what these different JOIN queries do.

INNER JOIN

INNER is optional in this SQL query because it is the default as well as the most commonly used JOIN operation. You may see this as JOIN only. INNER JOIN returns records if the data lives in both tables. For example, if you use INNER JOIN for the 'customers' and 'orders' tables and match the data using the customer_id key, you would combine the data for each customer_id that exists in both tables. If a customer_id exists in the customers table but not the orders table, data for that customer_id isn’t joined or returned by the query.

SELECT
	customers.customer_name,
	orders.product_id,
	orders.ship_date
FROM
	customers
INNER JOIN
	orders
ON customers.customer_id = orders.customer_id

The results from the query might look like the following, where customer_name is from the customers table and product_id and ship_date are from the orders table:

customer_name product_id ship_date
Martin's Ice Cream 043998 February 23, 2021
Beachside Treats 872012 February 25, 2021
Mona's Natural Flavors 724956 February 28, 2021
... etc. ... etc. ...

The data from both tables was joined together by matching the customer_id common to both tables. Notice that customer_id doesn’t show up in the query results. It is simply used to establish the relationship between the data in the two tables so the data can be joined and returned.

LEFT JOIN

You may see this as LEFT OUTER JOIN, but most users prefer LEFT JOIN. Both are correct syntax. LEFT JOIN returns all the records from the left table and only the matching records from the right table. Use LEFT JOIN whenever you need the data from the entire first table and values from the second table, if they exist. For example, in the query below, LEFT JOIN will return customer_name with the corresponding sales_rep, if it is available. If there is a customer who did not interact with a sales representative, that customer would still show up in the query results but with a NULL value for sales_rep.

SELECT
	customers.customer_name,
	sales.sales_rep
FROM
	customers
LEFT JOIN
	sales
ON cutomers.customer_id = sales.customer_id

The results from the query might look like the following where customer_name is from the customers table and sales_rep is from the sales table. Again, the data from both tables was joined together by matching the customer_id common to both tables even though customer_id wasn't returned in the query results.

customer_name sales_rep
Martin's Ice Cream Luis Reyes
Beachside Treats NULL
Mona's Natural Flavors Geri Hall
...etc. ...etc.

RIGHT JOIN

You may see this as RIGHT OUTER JOIN or RIGHT JOIN. RIGHT JOIN returns all records from the right table and the corresponding records from the left table. Practically speaking, RIGHT JOIN is rarely used. Most people simply switch the tables and stick with LEFT JOIN. But using the previous example for LEFT JOIN, the query using RIGHT JOIN would look like the following:

SELECT	
	sales.sales_rep,
	customers.customer_name
FROM
	sales
RIGHT JOIN
	customers
ON sales.customer_id = customers.customer_id

The query results are the same as the previous LEFT JOIN example.

customer_name sales_rep
Martin's Ice Cream Luis Reyes
Beachside Treats NULL
Mona's Natural Flavors Geri Hall
...etc. ...etc.

FULL OUTER JOIN

You may sometimes see this as FULL JOIN. FULL OUTER JOIN returns all records from the specified tables. You can combine tables this way, but remember that this can potentially be a large data pull as a result. FULL OUTER JOIN returns all records from both tables even if data isn’t populated in one of the tables. For example, in the query below, you will get all customers and their products’ shipping dates. Because you are using a FULL OUTER JOIN, you may get customers returned without corresponding shipping dates or shipping dates without corresponding customers. A NULL value is returned if corresponding data doesn’t exist in either table.

SELECT	
	customers.customer_name,
	orders_ship_date
FROM
	customers
FULL INNER JOIN
	orders
ON customers.customer_id = orders.customer_id

The results from the query might look like the following.

customer_name ship_date
Martin's Ice Cream February 23, 2021
Beachside Treats February 25, 2021
NULL February 25, 2021
The Daily Scoop -
Mountain Ice Cream -
Mona's Natural Flavors February 28, 2021
...etc. ...etc.

For more information

JOINs are going to be useful for working with relational databases and SQL—and you will have plenty of opportunities to practice them on your own. Here are a few other resources that can give you more information about JOINs and how to use them:

  • SQL JOINs: This is a good basic explanation of JOINs with examples. If you need a quick reminder of what the different JOINs do, this is a great resource to bookmark and come back to later.
  • Database JOINs - Introduction to JOIN Types and Concepts: This is a really thorough introduction to JOINs. Not only does this article explain what JOINs are and how to use them, but it also explains the various scenarios in more detail of when and why you would use the different JOINs. This is a great resource if you are interested in learning more about the logic behind JOINing.
  • SQL JOIN Types Explained in Visuals: This resource has a visual representation of the different JOINs. This is a really useful way to think about JOINs if you are a visual learner, and it can be a really useful way to remember the different JOINs.
  • SQL JOINs: Bringing Data Together One Join at a Time: Not only does this resource have a detailed explanation of JOINs with examples, but it also provides example data that you can use to follow along with their step-by-step guide. This is a useful way to practice JOINs with some real data.
  • SQL JOIN: This is another resource that provides a clear explanation of JOINs and uses examples to demonstrate how they work. The examples also combine JOINs with aliasing. This is a great opportunity to see how JOINs can be combined with other SQL concepts that you have been learning about in this course.

COUNT and COUNT DISTINCT

COUNT in spreadsheets

  • Can be used to count the total number of numerical values within a specific range in spreadsheets

COUNT in SQL

  • A query that returns the number of rows in a specified range

COUNT DISTINCT

  • A query that only returns the distinct values in a specified range

As a data analyst, COUNT and COUNT DISTINCT can be used anytime to answer questions about 'how many'.

Aliasing

  • When you temporarily name a table or column in your query to make it easier to read and write

Example:

SELECT 
    warehouse.state AS state,
    COUNT(DISTINCT order_id) as num_orders
FROM 
    warehouse_orders.Orders AS orders 
JOIN 
    warehouse_orders.Warehouse AS warehouse 
		ON orders.warehouse_id = warehouse.warehouse_id
GROUP BY
    warehouse.state

Question

What is the key difference between COUNT and COUNT DISTINCT in a database query?

  • COUNT returns the number of columns in a specified range. COUNT DISTINCT only returns the distinct values within those columns.
  • COUNT returns the number of rows in a specified range. COUNT DISTINCT only returns the distinct values in a specified range.
  • COUNT returns the number of values in a specified range. COUNT DISTINCT returns all values in a specified range.
  • COUNT returns the total values in a specified range. COUNT DISTINCT returns the number of rows in a specified range.

Correct. COUNT returns the number of rows in a specified range. COUNT DISTINCT only returns the distinct values in a specified range.

Test your knowledge on using JOINS to aggregate data

TOTAL POINTS 3

Question 1

A data analyst wants to retrieve only records from a database that have matching values in two different tables. Which JOIN function should they use?

  • LEFT JOIN
  • INNER JOIN
  • OUTER JOIN
  • RIGHT JOIN

Correct. To retrieve only records from a database that have matching values in two different tables, the analyst should use INNER JOIN*.*

Question 2

You are writing a SQL query to instruct a database to count values in a specified range. You only want to count each value once, even if it appears multiple times. Which function should you include in your query?

  • COUNT VALUES
  • COUNT RANGE
  • COUNT DISTINCT
  • COUNT

Correct. To tell a database to return distinct values in a specified range, the analyst should use COUNT DISTINCT in their query.

Question 3

A data analyst wants to temporarily name a column in their query to make it easier to read and write. What technique should they use?

  • Filtering
  • Naming
  • Aliasing
  • Tagging

Correct. To temporarily name a column in a query to make it easier to read and write, the analyst should use aliasing.