5.3.2.Use JOINS to aggregate data in SQL - quanganh2001/Google-Data-Analytics-Professional-Certificate-Coursera GitHub Wiki
Optional: Upload the employee dataset to BigQuery
The next video demonstrates how to use JOINS to merge and return data from two tables based on a common attribute used in both tables.
If you would like to follow along with the instructor, you will need to log in to your BigQuery account and upload the employee data provided as two CSV files. If you have hopped around courses, Using BigQuery in the Prepare Data for Exploration course covers how to set up a BigQuery account.
Prepare for the next video
- First, download the CSV files from the attachments below:
Employees Table - Understanding JOINS
Departments Table - Understanding JOINS
- Next, complete the following steps in your BigQuery console to upload the employees and departments tables.
Step 1: Open your BigQuery console and click on the project you want to upload the data to.
Step 2: In the Explorer on the left, click the Actions icon (three vertical dots) next to your project name and select Create dataset.
Step 3: Enter employee_data for the Dataset ID.
Step 4: Click CREATE DATASET (blue button) to add the dataset to your project.
Step 5: In the Explorer on the left, click to expand your project, and then click the employee_data dataset you just created.
Step 6: Click the Actions icon (three vertical dots) next to employee_data and select Open.
Step 7: Click the blue + icon at the top right to open the Create table window.
Step 8: Under Source, for the Create table from selection, choose where the data will be coming from.
- Select Upload.
- Click Browse to select the Employees Table CSV file you downloaded.
- Choose CSV from the file format drop-down.
Step 9: For Table name, enter employees if you plan to follow along with the video.
Step 10: For Schema, click the Auto detect check box.
Step 11: Click Create table (blue button). You will now see the employees table under your employee_data dataset in your project.
Step 12: Click the employee_data dataset again.
Step 13: Click the icon to open the Create table window again.
Step 14: Under Source, for the Create table from selection, choose where the data will be coming from.
- Select Upload.
- Click Browse to select the Departments Table CSV file you downloaded.
- Choose CSV from the file format drop-down.
Step 15: For Table name, enter departments if you plan to follow along with the video.
Step 16: For Schema, click the Auto detect check box.
Step 17: Click Create table (blue button). You will now see the departments table under your employee_data dataset in your project.
Step 18: Click the employees table and click the Preview tab to verify that you have the data shown below.
Step 19: Click the departments table and click the Preview tab to verify that you have the data shown below.
If your data previews match, you are ready to follow along with the next video.
Secret identities: The importance of aliases
In this reading, you will learn about using aliasing to simplify your SQL queries. Aliases are used in SQL queries to create temporary names for a column or table. Aliases make referencing tables and columns in your SQL queries much simpler when you have table or column names that are too long or complex to make use of in queries. Imagine a table name like special_projects_customer_negotiation_mileages. That would be difficult to retype every time you use that table. With an alias, you can create a meaningful nickname that you can use for your analysis. In this case “special_projects_customer_negotiation_mileages” can be aliased to simply “mileage.” Instead of having to write out the long table name, you can use a meaningful nickname that you decide.
Basic syntax for aliasing
Aliasing is the process of using aliases. In SQL queries, aliases are implemented by making use of the AS command. The basic syntax for the AS command can be seen in the following query for aliasing a table:
SELECT column_name(s)
FROM table_name AS alias_name;
Notice that AS is preceded by the table name and followed by the new nickname. It is a similar approach to aliasing a column:
SELECT column_name AS alias_name
FROM table_name;
In both cases, you now have a new name that you can use to refer to the column or table that was aliased.
Alternate syntax for aliases If using AS results in an error when running a query because the SQL database you are working with doesn't support it, you can leave it out. In the previous examples, the alternate syntax for aliasing a table or column would be:
- FROM table_name alias_name
- SELECT column_name alias_name
The key takeaway is that queries can run with or without using AS for aliasing, but using AS has the benefit of making queries more readable. It helps to make aliases stand out more clearly.
Aliasing in action
Let’s check out an example of a SQL query that uses aliasing. Let’s say that you are working with two tables: one of them has employee data and the other one has department data. The FROM statement to alias those tables could be:
FROM work_day.employees AS employees
These aliases still let you know exactly what is in these tables, but now you don’t have to manually input those long table names. Aliases can be really helpful for long, complicated queries. It is easier to read and write your queries when you have aliases that tell you what is included within your tables.
For more information
If you are interested in learning more about aliasing, here are some resources to help you get started:
- SQL Aliases : This tutorial on aliasing is a really useful resource to have when you start practicing writing queries and aliasing tables on your own. It also demonstrates how aliasing works with real tables.
- SQL Alias : This detailed introduction to aliasing includes multiple examples. This is another great resource to reference if you need more examples.
- Using Column Aliasing: This is a guide that focuses on column aliasing specifically. Generally, you will be aliasing entire tables, but if you find yourself needing to alias just a column, this is a great resource to have bookmarked.
Using JOINs effectively
In this reading, you will review how JOINs are used and will be introduced to some resources that you can use to learn more about them. A JOIN combines tables by using a primary or foreign key to align the information coming from both tables in the combination process. JOINs use these keys to identify relationships and corresponding values across tables.
If you need a refresher on primary and foreign keys, refer to the glossary for this course, or go back to Databases in data analytics .
The general JOIN syntax
SELECT
-- table columns from tables are inserted here
table_name1.column_name
table_name2.column_name
FROM
table_name1
JOIN
table_name2
ON table_name1.column_name = table_name2.column_name
As you can see from the syntax, the JOIN statement is part of the FROM clause of the query. JOIN in SQL indicates that you are going to combine data from two tables. ON in SQL identifies how the tables are to be matched for the correct information to be combined from both.
Type of JOINs
There are four general ways in which to conduct JOINs in SQL queries: INNER, LEFT, RIGHT, and FULL OUTER.
INNER JOIN
INNER is optional in this SQL query because it is the default as well as the most commonly used JOIN operation. You may see this as JOIN only. INNER JOIN returns records if the data lives in both tables. For example, if you use INNER JOIN for the 'customers' and 'orders' tables and match the data using the customer_id key, you would combine the data for each customer_id that exists in both tables. If a customer_id exists in the customers table but not the orders table, data for that customer_id isn’t joined or returned by the query.
SELECT
customers.customer_name,
orders_product_id,
orders.ship_date
FROM
customers
INNER JOIN
orders
ON customers.customer_id = orders_customer_id
The results from the query might look like the following, where customer_name is from the customers table and product_id and ship_date are from the orders table:
customer_name | product_id | ship_date |
---|---|---|
Martin's Ice Cream | 043998 | 2021-02-23 |
Beachside Treats | 872012 | 2021-02-25 |
Mona's Natural Flavors | 724956 | 2021-02-28 |
... etc. | ... etc. | ... etc. |
The data from both tables was joined together by matching the customer_id common to both tables. Notice that customer_id doesn’t show up in the query results. It is simply used to establish the relationship between the data in the two tables so the data can be joined and returned.
LEFT JOIN
You may see this as LEFT OUTER JOIN, but most users prefer LEFT JOIN. Both are correct syntax. LEFT JOIN returns all the records from the left table and only the matching records from the right table. Use LEFT JOIN whenever you need the data from the entire first table and values from the second table, if they exist. For example, in the query below, LEFT JOIN will return customer_name with the corresponding sales_rep, if it is available. If there is a customer who did not interact with a sales representative, that customer would still show up in the query results but with a NULL value for sales_rep.
SELECT
customers.customer_name,
sales.sales_rep
FROM
customers
LEFT JOIN
sales
ON customers.customer_id = sales_customer_id
The results from the query might look like the following where customer_name is from the customers table and sales_rep is from the sales table. Again, the data from both tables was joined together by matching the customer_id common to both tables even though customer_id wasn't returned in the query results.
customer_name | sales_rep |
---|---|
Martin's Ice Cream | Luis Reyes |
Beachside Treats | NULL |
Mona's Natural Flavors | Geri Hall |
... etc. | ... etc. |
RIGHT JOIN
You may see this as RIGHT OUTER JOIN or RIGHT JOIN. RIGHT JOIN returns all records from the right table and the corresponding records from the left table. Practically speaking, RIGHT JOIN is rarely used. Most people simply switch the tables and stick with LEFT JOIN. But using the previous example for LEFT JOIN, the query using RIGHT JOIN would look like the following:
SELECT
sales.sales_rep,
customers.customer_name
FROM
sales
RIGHT JOIN
customers
ON sales.customer_id = customers.customer_id
The query results are the same as the previous LEFT JOIN example.
customer_name | sales_rep |
---|---|
Martin's Ice Cream | Luis Reyes |
Beachside Treats | NULL |
Mona's Natural Flavors | Geri Hall |
... etc. | ... etc. |
FULL OUTER JOIN
You may sometimes see this as FULL JOIN. FULL OUTER JOIN returns all records from the specified tables. You can combine tables this way, but remember that this can potentially be a large data pull as a result. FULL OUTER JOIN returns all records from both tables even if data isn’t populated in one of the tables. For example, in the query below, you will get all customers and their products’ shipping dates. Because you are using a FULL OUTER JOIN, you may get customers returned without corresponding shipping dates or shipping dates without corresponding customers. A NULL value is returned if corresponding data doesn’t exist in either table.
SELECT
customers.customer_name,
orders.ship_date
FROM
customers
FULL OUTER JOIN
orders
ON customers.customer_id = orders.customer_id
The results from the query might look like the following.
customer_name | ship_date |
---|---|
Martin's Ice Cream | 2021-02-23 |
Beachside Treats | 2021-02-25 |
NULL | 2021-02-25 |
The Daily Scoop | NULL |
Mountain Ice Cream | NULL |
Mona's Natual Flavors | 2021-02-28 |
... etc. | ... etc. |
For more information
JOINs are going to be useful for working with relational databases and SQL—and you will have plenty of opportunities to practice them on your own. Here are a few other resources that can give you more information about JOINs and how to use them:
- SQL JOINs : This is a good basic explanation of JOINs with examples. If you need a quick reminder of what the different JOINs do, this is a great resource to bookmark and come back to later.
- Database JOINs - Introduction to JOIN Types and Concepts: This is a really thorough introduction to JOINs. Not only does this article explain what JOINs are and how to use them, but it also explains the various scenarios in more detail of when and why you would use the different JOINs. This is a great resource if you are interested in learning more about the logic behind JOINing.
- SQL JOIN Types Explained in Visuals: This resource has a visual representation of the different JOINs. This is a really useful way to think about JOINs if you are a visual learner, and it can be a really useful way to remember the different JOINs.
- SQL JOINs: Bringing Data Together One Join at a Time: Not only does this resource have a detailed explanation of JOINs with examples, but it also provides example data that you can use to follow along with their step-by-step guide. This is a useful way to practice JOINs with some real data.
- SQL JOIN: This is another resource that provides a clear explanation of JOINs and uses examples to demonstrate how they work. The examples also combine JOINs with aliasing. This is a great opportunity to see how JOINs can be combined with other SQL concepts that you have been learning about in this course.
Hands-On Activity: Queries for JOINS
Question 1
Activity overview
You’ve come a long way in working with relational databases and SQL. Now, you’ll gain practice writing queries that join multiple tables together.
In this activity, you’ll work with the World Bank’s International Education Dataset. By mastering JOIN statements, you’ll be able to fully harness the power of relational databases by combining data from tables linked by keys.
Load and examine the dataset
- Log in to BigQuery Sandbox. If you have a free trial version of BigQuery, you can use that instead. On the BigQuery page, click the Go to BigQuery button.
- Note: BigQuery Sandbox frequently updates its user interface. The latest changes may not be reflected in the screenshots presented in this activity, but the principles remain the same. Adapting to changes in software updates is an essential skill for data analysts, and it’s helpful for you to practice troubleshooting. You can also reach out to your community of learners on the discussion forum for help.
- If you have never created a BigQuery project before, click CREATE PROJECT on the right side of the screen. If you have created a project before, you can use an existing one or create a new one by clicking the project drop-down in the blue header bar and selecting NEW PROJECT.
- Name your project something that will help you identify it later. You can give it a unique project ID or use an auto-generated one. Don’t worry about selecting an organization if you don’t know what to put.
- Now, you’ll see the Editor interface. In the middle of the screen is a window where you can type code, and to the left is the Explorer menu where you can search for datasets.
Before you begin joining two tables together, you’ll first need to figure out which tables to join together. Recall that two tables must be connected in order to join them. Two tables can be joined if the primary key for one table is included in the other table as a foreign key.
To determine what information the tables contain and identify keys you can join them on, review the schema of the tables. To access the schema:
- Click + ADD at the top right of the Explorer menu, then scroll down and click on the Public Datasets option from the resulting drop-down list.
- In the Search Marketplace search bar, type international education and press return to find the search results.
- Click the first result, The World Bank's International Education dataset.
- Click View Dataset. This will bring you back to the BigQuery Sandbox interface in a new tab.
- Note: You may want to star the bigquery-public-data drop-down to the Explorer menu. You can use this to browse datasets and tables without navigating to the data marketplace in the future.
- In the Explorer menu, search for world_bank_intl_education. Click the drop-down arrow to expand the dataset. You may also need to click on the SHOW MORE button for the additional subsets to appear.
- Click the international_education table. This will bring up the table’s schema. If the schema doesn’t appear, click on the Schema tab in the table viewer.
- Next, select the country_summary table and examine its schema. You’ll find that the country_code column appears in both the table schemas.
- Note: Foreign keys don’t always have the same names across tables. If you’re ever unsure if the columns are the same, you can always double-check. You can click the Details tab from each table and confirm that they contain the same kinds of information.
Review JOINS
Before you join the two tables with queries, take a moment to review the various kinds of JOIN statements.
The two most common kinds of JOIN statements are INNER JOINs and OUTER LEFT JOINs (also known simply as LEFT JOINs). As a review:
- INNER JOIN: Returns only the rows where the target appears in both tables.
- LEFT JOIN: Returns every row from the left table, as well as any rows from the right table with matching keys found in the left table.
Note the difference between the INNER JOIN and LEFT JOIN and the implications for when each should be used. Consider the following queries:
-- Let's say table_1 has 100 rows and table_2 has 10 rows.
-- They share 10 keys in common.
-- Using INNER JOIN --> We get 10 rows in our results.
SELECT
COUNT(*)
FROM
table_1
INNER JOIN
table_2
ON table_1.key = table_2.key;
-- Using LEFT JOIN --> We get 100 rows in our results.
SELECT
COUNT(*)
FROM
table_1
LEFT JOIN
table_2
ON table_1.key = table_2.key;
The takeaway here is that the sort of JOIN you use matters. When writing a query, you can draw out a Venn diagram like the example graphic above to help you decide which sort of JOIN you need.
Queries with JOINS and aliases
Now, it’s time to actually query the dataset. As a starting point, try a query that pulls information from both the international_education and country_summary tables. Copy, paste, and run the following query:
SELECT
`bigquery-public-data.world_bank_intl_education.international_education`.country_name,
`bigquery-public-data.world_bank_intl_education.country_summary`.country_code,
`bigquery-public-data.world_bank_intl_education.international_education`.value
FROM
`bigquery-public-data.world_bank_intl_education.international_education`
INNER JOIN
`bigquery-public-data.world_bank_intl_education.country_summary`
ON `bigquery-public-data.world_bank_intl_education.country_summary`.country_code = `bigquery-public-data.world_bank_intl_education.international_education`.country_code
This basic query joins the tables on the country_code foreign key, and returns the country name, country code, and value column. This is quite a long, unwieldy query for such a basic result! The length of each table name (which must include the full address for each table for BigQuery to know where to pull the data from) makes this hard to read and work with.
However, you can solve this by setting an alias for each table.
Use descriptive aliases
Try using descriptive aliases that tell you what they represent. This next query is the same query as the previous one, but with aliases to improve readability. Copy, paste, and run the following query:
SELECT
edu.country_name,
summary.country_code,
edu.value
FROM
bigquery-public-data.world_bank_intl_education.international_education AS edu
INNER JOIN
bigquery-public-data.world_bank_intl_education.country_summary AS summary
ON edu.country_code = summary.country_code
This query is much easier to read and understand. Recall that you can set aliases for tables by specifying the alias for the table after the table’s name in FROM and/or JOIN statements.
For this example, the international_education table was renamed as edu, and the country_summary table as summary. Using descriptive aliases is a best practice and will help you keep your queries clean, readable, and easy to work with.
Use a JOIN to answer a question
Now that you’ve confirmed that the JOIN statement works, try to answer an actual data question using this dataset. What is the average amount of money spent per region on education? Copy, paste, and run the following query:
SELECT
AVG(edu.value) average_value, summary.region
FROM
`bigquery-public-data.world_bank_intl_education.international_education` AS edu
INNER JOIN
`bigquery-public-data.world_bank_intl_education.country_summary` AS summary
ON edu.country_code = summary.country_code
WHERE summary.region IS NOT null
GROUP BY summary.region
ORDER BY average_value DESC
Your results should appear like this:
Notice how in this query, an alias is also set to give the AVG(edu.value) a more descriptive name for the temporary table the query returns.
Also note that the WHERE statement excludes rows with any null information. This is necessary to present the data succinctly and display only seven rows for the seven regions represented in the data. However, this WHERE statement means that the results will return the same regardless of which JOIN you use. In the next section, you’ll explore a situation where you need to use a specific kind of join in your query...
INNER JOINS versus OUTER JOINS
In the last query, you used an INNER JOIN to find the average amount of money spent per region on education. Because of the WHERE statement in this query, using any kind of JOIN produces the same result.
Now, you will write a LEFT JOIN, a type of OUTER JOIN, for a situation where the type of query you use will change the result you return.
Consider this scenario:
You have been tasked to provide data for a feature sports article on Michael Jordan’s basketball career. The writer wants to include a funny twist and asks you to find out if Michael Jordan played better at schools with animal mascots.
To analyze his early career, you start with the years he played basketball in college. You need to examine National Collegiate Athletic Association (NCAA) college basketball stats from 1984.
You’ll need a list of all NCAA Division I colleges and universities; their mascots, if applicable; and their number of wins and losses. You can find this information by typing ncaa_basketball in the Explorer tab dataset search bar on BigQuery.
Next, you will start a new query tab by clicking on the blue + button. Your query should join the season statistics from one table with the mascot information from another. You need to use a LEFT JOIN instead of an INNER JOIN because not all teams have mascots. If you use an INNER JOIN, you would exclude teams with no mascot.
To demonstrate this, copy, paste, and run the following query:
SELECT
seasons.market AS university,
seasons.name AS team_name,
seasons.wins,
seasons.losses,
seasons.ties,
mascots.mascot AS team_mascot
FROM
`bigquery-public-data.ncaa_basketball.mbb_historical_teams_seasons` AS seasons
LEFT JOIN
`bigquery-public-data.ncaa_basketball.mascots` AS mascots
ON
seasons.team_id = mascots.id
WHERE
seasons.season = 1984
AND seasons.division = 1
ORDER BY
seasons.market
This is an example of when a LEFT JOIN is more helpful than an INNER JOIN. With this query, you can look at college basketball statistics to get a better sense of Michael Jordan’s early career, find out more information about which teams had mascots, and answer your business question.
Confirmation and reflection
In the last query, you use a LEFT JOIN instead of an INNER JOIN to find the correct information. Beneath the query results, you’ll find that the number of rows in your joined table is 281. If you rerun the query with an INNER JOIN instead of a LEFT JOIN, how many rows would it return?
A. 274
B. 281
C. 301
D. 324
The correct answer is A. 274. Explain: The number of rows returned by an INNER JOIN is 274. When you run the query with an INNER JOIN instead of a LEFT JOIN, you exclude universities without mascots and return fewer rows of data. Knowing which JOIN to use is very important for analyzing data. Going forward, you can use your knowledge of JOINs to properly combine data from multiple tables.
Question 2
In this activity, you used JOIN statements to combine data from multiple tables. In the text box below, write 2-3 sentences (40-60 words) in response to each of the following questions.
- Why do you think JOIN statements are important for working with databases?
- How do you distinguish INNER JOINS from OUTER JOINS?
Explain: Congratulations on completing this hands-on activity! A good response would include that JOINs allow you to combine data from linked tables, which helps you make comparisons and answer business questions.
Mastering JOIN statements is one of the most important parts of SQL, as combining data from multiple database tables is a core skill for data analysts. And when you apply for jobs, remember that JOIN statements are a common topic in data analyst interviews! The more JOIN statements you write, the more prepared you will be for a data analyst role.
Optional: Upload the warehouse dataset to BigQuery
The next video demonstrates how to use COUNT and COUNT DISTINCT in SQL to count and return the number of certain values in a dataset.
If you would like to follow along with the instructor, you will need to log in to your BigQuery account and upload the warehouse data provided as two CSV files. If you have hopped around courses, Using BigQuery in the Prepare Data for Exploration course covers how to set up a BigQuery account.
Prepare for the next video
- First, download the two CSV files from the attachments below:
- Next, complete the following steps in your BigQuery console to upload the Warehouse Orders dataset with the two Warehouse and Orders tables.
Step 1: Open your BigQuery console and click on the project you want to upload the data to.
Step 2: In the Explorer on the left, click the Actions icon (three vertical dots) next to your project name and select Create dataset.
Step 3: In the upcoming video, the name "warehouse_orders" will be used for the dataset. If you plan to follow along with the video, enter warehouse_orders for the Dataset ID.
Step 4: Click CREATE DATASET (blue button) to add the dataset to your project.
Step 5: In the Explorer on the left, click to expand your project, and then click the warehouse_orders dataset you just created.
Step 6: Click the Actions icon (three vertical dots) next to warehouse_orders and select Open.
Step 7: Click the blue + icon at the top right to open the Create table window.
Step 8: Under Source, for the Create table from selection, choose where the data will be coming from.
- Select Upload.
- Click Browse to select the Warehouse Orders - Warehouse CSV file you downloaded.
- Choose CSV from the file format drop-down.
Step 9: For Table name, enter Warehouse if you plan to follow along with the video.
Step 10: For Schema, click the Auto detect check box.
Step 11: Click Create table (blue button). You will now see the Warehouse table under your warehouse_orders dataset in your project.
Step 12: Click the warehouse_orders dataset again.
Step 13: Click the icon to open the Create table window again.
Step 14: Under Source, for the Create table from selection, choose where the data will be coming from.
- Select Upload.
- Click Browse to select the Warehouse Orders - Orders CSV file you downloaded.
- Choose CSV from the file format drop-down.
Step 15: For Table name, enter Orders if you plan to follow along with the video.
Step 16: For Schema, click the Auto detect check box.
Step 17: Click Create table (blue button). You will now see the Orders table under your warehouse_orders dataset in your project.
Step 18: Click the Warehouse table and click the Preview tab to verify that you have 10 rows of data.
Step 19: Click the Orders table and click the Preview tab to verify that you have the data shown below.
If your data previews match, you are ready to follow along with the next video.
Test your knowledge on using JOINS to aggregate data
Question 1
A data analyst wants to retrieve only records from a database that have matching values in two different tables. Which JOIN function should they use?
A. RIGHT JOIN
B. INNER JOIN
C. LEFT JOIN
D. OUTER JOIN
The correct answer is B. INNER JOIN. Explain: To retrieve only records from a database that have matching values in two different tables, the analyst should use INNER JOIN.
Question 2
You are writing a SQL query to instruct a database to count values in a specified range. You only want to count each value once, even if it appears multiple times. Which function should you include in your query?
A. COUNT VALUES
B. COUNT DISTINCT
C. COUNT RANGE
D. COUNT
The correct answer is B. COUNT DISTINCT. Explain: To tell a database to return distinct values in a specified range, the analyst should use COUNT DISTINCT in their query.
Question 3
A data analyst wants to temporarily name a column in their query to make it easier to read and write. What technique should they use?
A. Naming
B. Tagging
C. Filtering
D. Aliasing
The correct answer is D. Aliasing. Explain: To temporarily name a column in a query to make it easier to read and write, the analyst should use aliasing.