Course 5‐3 - Forestreee/Data-Analytics GitHub Wiki
Google Data Analytics Professional
Analyze Data to Answer Questions
WEEK3 - Aggregating data for analysis
As part of your analysis, you’ll often have to combine data in order to gain insights and complete business objectives. In this part of the course, you’ll explore the functions, procedures, and syntax involved in combining, or aggregating, data. You’ll learn how to do this from multiple cells in spreadsheets and from multiple database tables using SQL queries.
Learning Objectives
- Demonstrate an understanding of functions and procedures that may be used to combine data from multiple cells in a spreadsheets
- Demonstrate an understanding of functions and syntax to create SQL queries for combining data from multiple database tables
- Use VLOOKUP to query data, trim data, convert text data to numeric data, and create a summary table from a queried information
VLOOKUP for data aggregation
Aggregate data for analysis
Welcome back! In the next few videos, we'll explore something called data aggregation.
Aggregation means collecting or gathering many separate pieces into a whole. For example, the Milky Way galaxy is an aggregation of stars, dust, and gases.
So data aggregation is the process of gathering data from multiple sources in order to combine it into a single summarized collection. In data analytics, a summarized collection, or summary, describes identifying the data you need and gathering it all together in one place.
For example, let's say you have a cabinet full of different puzzles. One day, a shelf breaks, and all the boxes topple over, scattering the puzzle pieces everywhere.
To get each puzzle organized again, you need to identify the pieces that correspond to each particular puzzle, gather them together and put them back into their correct boxes. Only then can you work with these pieces and create a complete picture.
So in data, the puzzle pieces represent the data that lives in different, separate datasets. Getting them organized is the aggregation process. Then the piles of pieces that complete a single puzzle become your summary. And finally, putting those pieces back together is like analyzing them to gain important insights.
Data aggregation helps data analyst identify trends, make comparisons and gain insights that wouldn't be possible if each of the data elements were analyzed on its own.
For instance, data on high school graduations for individual students can be aggregated into a single graduation rate for an entire class. Data can also be aggregated over a given time period to provide statistics, such as averages, minimums, maximums, and sums.
For example, that same yearly graduation rate data can be aggregated once again into a summary that shows us graduation rates for districts, states, and countries.
Here's another example. Let's say you had data on real estate sales in a particular neighborhood for each of the past 10 years. If you aggregated all of that data, you'd be able to discover the average price of a home in that area and how values have increased or decreased over time.
Functions are a big help in making data aggregation possible. You'll learn how to use some of the most common ones to create your summaries soon.
In addition, we'll talk about aggregating data using something called a subquery. You've seen SQL in action, and you understand that a query is a request for information from a database.
So a subquery, also called an inner or nested query, is a query within another query.
After the next several videos, you'll know how to aggregate data and understand the tools you'll be using along the way. Let's get started!
Preparing for VLOOKUP
Hi, again. In this video, we'll prep our data for VLOOKUP, a data aggregation tool.
As you learned before, data aggregation is the process of gathering data from multiple sources in order to combine it into a single summarized collection. Data aggregation can give you all kinds of information about the data you are looking at.
For example, in marketing, you can aggregate data from an ad campaign to see how it performed over time and for particular customers. Travel companies use data aggregation to figure out how much their competitors charge for a certain flight, hotel room, or rental car type. Then, they can make sure they price their own products as competitively as possible.
One thing these businesses all have in common is that they can use VLOOKUP to help them achieve these goals. As a reminder, VLOOKUP stands for vertical lookup. Basically, it's a function that searches for a certain value in a column to return a corresponding piece of information.
Earlier, we used VLOOKUP to take the value in one cell and search for a match in another.
We were able to match a product code made up of numbers and letters that lived in one spreadsheet to the actual name of the product that lived in another.
But before any of that can happen, we need to make sure our data is properly prepared. As you've heard many times, clean data is much more likely to give you accurate results.
Let's start with the first common data-cleaning task: different data types. For example, a dataset might have dates formatted as numbers, or numbers represented as text strings instead of numeric values. When data is not in a consistent format or a format that the spreadsheet application recognizes, VLOOKUP won't know what to do with that data, and it will return an error.
Earlier, you learned how to convert numbers to dates using the Format tool.
Now, let's focus on converting text to numeric values.
Would you like to follow along with the instructor using the same spreadsheet? To use the spreadsheet template, click the link below Converting numerical and text values
To do this, you could use the Format menu to select a type of number, but you could also use the VALUE function.
VALUE is a function that converts a text string that represents a number to a numerical value.
Here's an example. In this spreadsheet, the numbers in column A are currently text strings. We can confirm this by running a simple SUM function. The syntax is equals SUM, open parenthesis, and then the items you want to add together. Here, it's A2 to A4. The colon says we're including everything between these two references. Now you can add a closed parenthesis and press Enter, or you can click and drag on the cells you want inside the parentheses to save a little bit of time. The result is zero. That's because the function doesn't work on text strings.
But if we apply the VALUE function, it automatically converts that text to a numeric value. To do that, we'll type equals VALUE, then an open parenthesis. Inside, we reference the cell whose value we want to convert, in this case A2. Now if we close the parentheses and press Enter, you'll notice that the 1, 2, 3 is numeric. If we drag it down the column, the 4, 5, 6 and 7, 8, 9 also become numeric. Now we can test it by running another SUM function. We'll type equals SUM and an open parenthesis, then B2, colon, and B4. B2, B3, and B4 are included in the sum. Close the parentheses and press Enter. Now it shows that the total is 1,368.
The next common error comes from having extra spaces in your spreadsheet. As you've learned, when data is copied from one source to another, sometimes a few leading or trailing spaces tag along. These can cause problems when using VLOOKUP. We want to make sure to use TRIM during the data- cleaning process. TRIM automatically deletes any extra spaces added to the cell.
Another typical mistake in VLOOKUP, which you can easily catch during data cleaning, are duplicates. If there are duplicate rows in the search, it will return only the first match it finds.
As you learned before, Remove duplicates is a tool that automatically searches for and eliminates duplicate entries from a spreadsheet.
Using Remove duplicates, as you saw in a video a little while ago, is a great way to get rid of duplicates and help make sure you find the right record during the lookup.
It's always good to remember that clean data is the foundation that everything else is built on. VLOOKUP can be a very useful data-cleaning tool.
In the next video, we'll keep exploring more ways you can use VLOOKUP. See you there.
VLOOKUP in action
Hi and welcome back. In a previous video, we talked about VLOOKUP for data cleaning. We also discussed the importance of preparing our spreadsheet before putting VLOOKUP to use. Now we're going to experience it in action.
As a quick reminder, VLOOKUP is a spreadsheet function that vertically searches for a certain value in a column to return a corresponding piece of information.
Let's start with VLOOKUP syntax. This example, 103 is a value to search for. A2:B26 is the range that will be searched. As you may remember, VLOOKUP will not recognize column names such as A, B, or C. We use a number to indicate the column. Lastly, FALSE tells VLOOKUP to find an exact match. If this said true, the function will return only a close match, which might not be what we want.
Now let's put VLOOKUP to use. One of the most common things data analysts do with VLOOKUP is populating data in one spreadsheet from another. Here's an example. Let's say we're working with data that exists in two different spreadsheets, but we need information from both in order to answer our business question. VLOOKUP can connect two sheets together on a matching column to populate one single sheet.
Would you like to follow along with the instructor using the same spreadsheet? To use the template for the spreadsheet, click the links below and select "Use Template." The template has two tabs for the worksheets, Employee Hours and Employee Rates. VLOOKUP in Action Example
Check it out. In this spreadsheet, we have employee ID numbers and their rates of pay.
In this spreadsheet we have the same employee ID numbers and how many hours each person worked.
We can use VLOOKUP to search for the rate of pay from the employee rates spreadsheet and add it to the employee hours spreadsheet. The formula is equals VLOOKUP open parentheses, then A2, which is the first employee ID number and the employee hours spreadsheet. Next, we add a comma, the name of the spreadsheet we want to search in, employee rates. Be sure to put single quotation marks around the spreadsheet name and add an exclamation point after it. This is the way to reference the other spreadsheet. Next, we add the range, which is A2 through B5. As you saw in a previous video, we can also choose to add dollar signs to lock the range with absolute cell references. This prevents them from changing when copying the formula to other cells. Add another comma, then a two. The two indicates that we want to search for a match in the second column, column B for rate of pay. Finally, one more comma and we add false to look up an exact match.
Drag the formula down the column and now we can use a simple multiplication formula to calculate each person's paycheck by multiplying hours worked by our newly created pay rate column. Great work.
In an upcoming reading, you'll learn even more about VLOOKUP and access some helpful VLOOKUP reminders and resources. VLOOKUP is one of the more complicated functions, so keep practicing.
Identifying common VLOOKUP errors
When people start out in data analytics, they often think that those of us who've been in the field for a while know everything. But trust me, we're all still figuring things out. And a lot of the time that means troubleshooting. Troubleshooting has to do with asking the right questions, and that's what we'll focus on in this video. We'll learn how you can use troubleshooting to solve all kinds of problems.
To do this, we'll need to talk about some of the limitations of VLOOKUP and then practice fixing some of the most common problems that data analysts face.
Some of the troubleshooting questions I like to ask myself:
How should I prioritize these issues? Trying to solve lots of problems all at once can feel overwhelming. I find it helps when you take things one at a time.
Next I ask, In a single sentence, what's the issue I'm facing? This helps to clarify what's really going on, so I don't get bogged down with extra details. After all, if you don't have a clear objective before looking at the data, you can find just about anything. It's always best to start with your own clear understanding of the situation. Then let the data tell you if you're on the right track or not.
The next question I ask myself is, What resources can help me solve the problem? The internet is one of the best resources out there. If you have a question, chances are thousands of others run into exactly the same thing. So a quick search can be really helpful. And it's good to remember that people are resources, too. Don't be afraid to ask questions. Not only is it a great way to learn, it can also help you build strong relationships with your colleagues.
And a final important question I think about: How can I stop this problem from happening in the future? If a new procedure or guideline can stop the same issue from popping up again, that's a great time-saver.
All right. Let's start by noting that VLOOKUP only returns the first match it finds, even if there are lots of possible matches.
Something else to keep in mind is that VLOOKUP can only return a value from the data to the right. It can't look left. Good news. There's a simple solution. Data analysts usually get around the problem by copying and pasting a column to the left of the data they want to look at. This way, the lookup value is in the leftmost column and the data they want is to the right of it.
Here's another problem I see a lot. Let's say the first few rows of a VLOOKUP have returned the correct result. But when you drive the function down the column, problems start popping up.
This is probably because the table array part of the function hasn't been locked or made absolute. An absolute reference is a reference that is locked so that rows and columns won't change when copied.
You can fix this issue by wrapping the table array in dollar signs. As you learned a while back, the dollar sign controls how the reference will be updated. They make sure that the corresponding part of the reference doesn't change. Something else that can throw off your VLOOKUP results are version control issues. In other words, a function worked perfectly at first, but then something in the spreadsheet it was referencing changed. For example, maybe a user inserted a column.
So now the columns in your function no longer direct VLOOKUP to the right place. When something like this happens, it'll return an incorrect value.
There are a few actions data analysts can take to ensure this doesn't happen. First, lock the spreadsheet. This stops other people from making changes. To do this in Sheets, select Data, then Protected sheets and ranges. In other spreadsheet applications, there are other tools that do the same thing.
Next, choose what you want to protect. In this case, we want to protect the entire sheet. Then you can set permissions to either show a warning or restrict who can edit. Choose only you, then Done.
But keep in mind, there will be times when other people need to work in the spreadsheet, so locking them out might make you pretty unpopular with your coworkers.
When that's the case, you can use MATCH, which is a function used to locate the position of a specific lookup value and can help you with version control. We won't get into that right now, but just know that it's an option in case you ever need it.
The final problem we'll talk about has to do with exact and approximate matching. When using VLOOKUP, you're likely to get different results, depending on whether you enter the word TRUE or FALSE within your function.
TRUE tells VLOOKUP to look for approximate matches, and FALSE tells VLOOKUP to look for exact matches.
So if a function looks like this, it's telling VLOOKUP to find the closest match to the text or number we're looking for.
It's important to know that VLOOKUP starts at the top of a specified range and searches downward vertically in each cell to find the right value. It stops searching when it finds any value that's greater than or equal to the lookup value.
That's why data analysts typically use FALSE, like this. That way VLOOKUP only returns the exact match to what you've entered in the lookup value.
VLOOKUP is one of the most popular lookup and reference functions in spreadsheets. It's also one of the trickiest.
Coming up, you'll learn about more of these common challenges. Everything you learn will help you run into fewer problems when you start using VLOOKUP as a future data analyst.
VLOOKUP core concepts
When do you need to use VLOOKUP? Two common reasons to use VLOOKUP are:
- Populating data in a spreadsheet
- Merging data from one spreadsheet with data in another
VLOOKUP syntax
search_key -The value to search for.
- For example, 42, "Cats", or I24.
range
- The range to consider for the search.
- The first column in the range is searched to locate data matching the value specified by search_key.
index
- The column index of the value to be returned, where the first column in range is numbered 1.
- If index is not between 1 and the number of columns in range, #VALUE! is returned.
is_sorted
- Indicates whether the column to be searched (the first column of the specified range) is sorted. TRUE by default.
- It’s recommended to set is_sorted to FALSE. If set to FALSE, an exact match is returned. If there are multiple matching values, the content of the cell corresponding to the first value found is returned, and #N/A is returned if no such value is found.
- If is_sorted is TRUE or omitted, the nearest match (less than or equal to the search key) is returned. If all values in the search column are greater than the search key, #N/A is returned.
What if you get #N/A?
value
- This is a required value.
- The function checks if the cell value matches the value; such as #N/A.
value_if_na
- This is a required value.
- The function returns this value if the cell value matches the value in the first argument; it returns this value when the cell value is #N/A.
Helpful VLOOKUP reminders
-
TRUE means an approximate match, FALSE means an exact match on the search key. If the data used for the search key is sorted, TRUE can be used.
-
You want the column that matches the search key in a VLOOKUP formula to be on the left side of the data. VLOOKUP only looks at data to the right after a match is found. In other words, the index for VLOOKUP indicates columns to the right only. This may require you to move columns around before you use VLOOKUP.
-
After you have populated data with the VLOOKUP formula, you may copy and paste the data as values only to remove the formulas so you can manipulate the data again.
VLOOKUP resources for Microsoft Excel VLOOKUP may slightly differ in Microsoft Excel, but the overall concepts can still be generally applied. Refer to the following resources if you are working with Excel.
-
How to use VLOOKUP in Excel: This tutorial includes a video to help you get a general understanding of how the VLOOKUP function works in Excel, as well as practical examples to look through.
-
VLOOKUP in Excel tutorial: Follow along in this video lesson and learn how to write a VLOOKUP formula in Excel and master time-saving useful tips and tricks.
-
23 things you should know about VLOOKUP in Excel: Explore this list of 23 VLOOKUP facts as well as challenges you might run into, and start to learn how to master them.
-
How to use Excel's VLOOKUP function: This article shares a specific example around how to apply VLOOKUP in your searches.
-
VLOOKUP in Excel vs Google Sheets: This guide offers a VLOOKUP comparison of Excel and Google Sheets.
Hands-On Activity: Using VLOOKUP
documentation for how to manually create a Pivot Table in Excel.
Practice Quiz: VLOOKUP
Use JOINS to aggregate data in SQL
Optional: Upload the employee dataset to BigQuery
Understanding JOINS
Hey, welcome back. So far we've checked out a few different pools you can use to aggregate data within spreadsheets.
In this video, we'll cover how to use JOIN in SQL to aggregate data and databases. First, I'll tell you a little bit about what a JOIN actually is, and then we'll explore some of the most common JOINs in action. Let's get started.
JOIN is a SQL clause that's used to combine rows from two or more tables based on a related column. Basically, you can think of a JOIN as a SQL version of vlookup, which we just covered.
There are four common JOINs data analysts use, inner, left, right and outer.
Here's a handy visualization of what each JOIN actually does. We'll use these to help us understand these functions. JOINs help you combine matching or related columns from different tables.
When we learned about relational databases we refer to these values as primary and foreign keys. Primary keys reference columns in which each value is unique to that table, but that table can have multiple foreign keys, which are primary keys in other tables. For example, in a table about employees, the employee ID is the primary key and the office ID is a foreign key. JOINs use these keys to identify relationships and corresponding values.
An INNER JOIN is a function that returns records with matching values in both tables.
If we think about our tables as a Circles of the Venn diagram, then an inner join would return the records that exist for the tables are overlapping. For the records to appear and the results table, they'll have to be key values in both tables. The records were only merged if there are matches in both tables. When we input JOIN into SQL, it usually defaults to INNER JOIN, so a lot of analysts will use JOIN as shorthand, instead of typing the whole query.
A LEFT JOIN is a function that will return all the records from the left table and only the matching records from the right table. Here's how you can figure out which table is left or right. In English and SQL, we read from left to right.
The table mentioned first is left, and the table mentioned second is right. You can also think of left as a table name to the left of the JOIN statement, and right as a table name to the right of the JOIN statement.
In this diagram, you'll notice that the entire left tables are colored in, and that's the overlap with the right table which shows us that the left table and the records it shares with the right table are being selected. Each row in the left table appears in the results, even if there are no matches in the right table.
RIGHT JOIN does the opposite it'll return all records from the right table and only the matching records from the left.
You can get the same results if you flip the order of the tables and use a LEFT JOIN.
For example, SELECT * FROM tableA LEFT JOIN tableB is the same as SELECT * FROM tableB RIGHT JOIN table A.
Finally, there's outer joints outer join combines RIGHT and LEFT JOIN to return all matching records in both table.
This means it will return all records in both tables. If there are records in one table without a match, it'll create a record with no values for the other table.
Using JOINs can make working with multiple data sources a lot easier and it can make relationships between tables more clear.
Here's an example. Let's say we're working with employee data across multiple departments. We have an employee's table and a department table which both have some columns like Department ID. We can use different JOIN clauses to help us pull different data from our tables and aggregate it. Maybe we want to get a list of employees with their department name, excluding any employee without a Department ID.
Because the department ID record is used in both tables, we can use an INNER JOIN to return a list with only those employees. As a quick reminder, analyst will sometimes just input JOIN for an INNER JOIN. But for this example, we'll write it out.
To build this query, we'll start with SELECT and AS to tell SQL how we want the column's title. Then we'll use FROM to tell it where we're getting this data.
In this case, the employees table. And then, we'll input INNER JOIN and the other table we're using, which is departments. We can specify which column and each table will contain the matching JOIN key by writing on employees.department_id = departments.department_id.
Now, let's run it. And there. Now we've got a list of employee names and department IDs for the employees that have those IDs. But we could use LEFT or RIGHT JOIN to return a list of all employee names and their departments when available.
Let's try both really quickly. This will start similar to the last word.
We'll put in SELECT * AS and FROM again, but this time we'll say LEFT JOIN and use on like we did with the last query.
When we execute the query, we get back this new list with the employee names and departments. But you'll notice there's no values. These are places for the right table which is departments in this case didn't have corresponding values.
Let's try right join just to test it out. This query will be almost the same, only difference is that we'll use the RIGHT JOIN clause to return all the rows from the right table whether they have matching values in the table to the left of the JOIN statement or not. In this case, the right table is departments.
Now, let's try out one last JOIN, OUTER. OUTER JOIN will fetch all of the employee names and departments. Again, this query will start a lot like the other ones we've done. We'll use SELECT * AS and FROM to choose what data we want and how. We'll grab this from the employees table. And put full OUTER JOIN with the departments table to get all of the records from both. We'll also use on again here.
Now we can run this. And we'll get all the employee names and departments from these tables. There will be nulls in the department_name column and the employee_name column and role columns because we've joined columns that don't have matching values.
And there, now you know how JOINs work. JOINs are super useful when you need to work with data from multiple related tables. They give you a lot of flexibility with how you combine and view that data, and if you ever have trouble remembering what inner, right, left or, outer join do, just think back to our Venn diagram. We'll keep learning about aggregating data and SQL next time. See you soon!
Secret identities: The importance of aliases
In this reading, you will learn about using aliasing to simplify your SQL queries. Aliases are used in SQL queries to create temporary names for a column or table. Aliases make referencing tables and columns in your SQL queries much simpler when you have table or column names that are too long or complex to make use of in queries. Imagine a table name like special_projects_customer_negotiation_mileages. That would be difficult to retype every time you use that table. With an alias, you can create a meaningful nickname that you can use for your analysis. In this case “special_projects_customer_negotiation_mileages” can be aliased to simply “mileage.” Instead of having to write out the long table name, you can use a meaningful nickname that you decide.
Basic syntax for aliasing
Alternate syntax for aliases
Aliasing in action
For more information If you are interested in learning more about aliasing, here are some resources to help you get started:
-
SQL Aliases : This tutorial on aliasing is a really useful resource to have when you start practicing writing queries and aliasing tables on your own. It also demonstrates how aliasing works with real tables.
-
SQL Alias : This detailed introduction to aliasing includes multiple examples. This is another great resource to reference if you need more examples.
-
Using Column Aliasing: This is a guide that focuses on column aliasing specifically. Generally, you will be aliasing entire tables, but if you find yourself needing to alias just a column, this is a great resource to have
Using JOINS effectively
In this reading, you will review how JOINs are used and will be introduced to some resources that you can use to learn more about them. A JOIN combines tables by using a primary or foreign key to align the information coming from both tables in the combination process. JOINs use these keys to identify relationships and corresponding values across tables.
If you need a refresher on primary and foreign keys, refer to the glossary for this course, or go back to Databases in data analytics.
Type of JOINs
INNER JOIN
LEFT JOIN
RIGHT JOIN
FULL OUTER JOIN
For more information JOINs are going to be useful for working with relational databases and SQL—and you will have plenty of opportunities to practice them on your own. Here are a few other resources that can give you more information about JOINs and how to use them:
-
SQL JOINs: This is a good basic explanation of JOINs with examples. If you need a quick reminder of what the different JOINs do, this is a great resource to bookmark and come back to later.
-
Database JOINs - Introduction to JOIN Types and Concepts: This is a really thorough introduction to JOINs. Not only does this article explain what JOINs are and how to use them, but it also explains the various scenarios in more detail of when and why you would use the different JOINs. This is a great resource if you are interested in learning more about the logic behind JOINing.
-
SQL JOIN Types Explained in Visuals: This resource has a visual representation of the different JOINs. This is a really useful way to think about JOINs if you are a visual learner, and it can be a really useful way to remember the different JOINs.
-
SQL JOINs: Bringing Data Together One Join at a Time: Not only does this resource have a detailed explanation of JOINs with examples, but it also provides example data that you can use to follow along with their step-by-step guide. This is a useful way to practice JOINs with some real data.
-
SQL JOIN: This is another resource that provides a clear explanation of JOINs and uses examples to demonstrate how they work. The examples also combine JOINs with aliasing. This is a great opportunity to see how JOINs can be combined with other SQL concepts that you have been learning about in this course.
Hands-On Activity: Queries for JOINS
Optional: Upload the warehouse dataset to BigQuery
COUNT and COUNT DISTINCT
Hi, it's great to have you back. By now we've discovered that spreadsheets and SQL have a lot of tools in common. Earlier in this program, we learned about COUNT in spreadsheets. Now it's time to look at similar tools in SQL: COUNT and COUNT DISTINCT. In this video, we'll talk about when you'd use these queries and check out an example. Let's get started.
COUNT can be used to count the total number of numerical values within a specific range in spreadsheets.
COUNT in SQL does the same thing. COUNT is a query that returns the number of rows in a specified range,
but COUNT DISTINCT is a little different. COUNT DISTINCT is a query that only returns the distinct values in that range. Basically, this means COUNT DISTINCT doesn't count repeating values.
As a data analyst, you'll use COUNT and COUNT DISTINCT anytime you want to answer questions about how many.
Like how many customers did this? Or how many transactions were there this month? Or how many dates are in this dataset? And you'll use them throughout the data analysis process at different stages.
For example, you might need them while you're cleaning data to check how many rows are left in your dataset. Or you might use COUNT and COUNT DISTINCT during the actual analysis to answer a "how many" question.
You'll run into these kinds of questions a lot. So COUNT and COUNT DISTINCT are really useful to know. But let's check out an example to see COUNT and COUNT DISTINCT in action. For this example, we're working with a company that manufactures socks. We have two tables: Warehouse and Orders. Let's take a quick look at these tables before we start querying. First, we'll check out the Warehouse table.
You can see the columns here: warehouse ID, warehouse alias, the maximum capacity, the total number of employees, and the state the warehouse is located in.
We'll pull up the top 100 rows of the Orders table next. We can use LIMIT here to limit the number of rows returned. This is useful if you're working with large datasets, especially if you just want to explore a small sample of that dataset.
From this query, we're actually going to start with a FROM statement so that we can alias our tables. Aliasing is when you temporarily name a table or column in your query to make it easier to read and write. Because these names are temporary, they only last for the given query. We can use our FROM statement to write in what our tables' aliases are going to be to save us some time in other parts of the query.
So we'll start with FROM and use aliasing to name the Warehouse Orders table, just "orders."
FROM warehouse_orders.Orders AS orders
Let's say we need both the warehouse details and the order details because we want to report on the distribution of orders by state. We're going to JOIN these two tables together since we want data from both of them and alias our warehouse table in the process. In this case, we're using JOIN as shorthand for INNER JOIN because we want corresponding data from both tables.
And now that we have the aliases in place, let's build out the SELECT statement that comes before FROM.
Let's run that. And there. Now we have data from both tables joined together, and we know how to create these handy aliases.
Now, we want to count how many states are in our ordered data. To do that, we'll use COUNT and COUNT DISTINCT now. We can try a simple COUNT query first. We'll JOIN the Orders and Warehouse tables in our FROM statement. And in this case we'll start with SELECT and COUNT the number of states. Let's run this query and see what we get. Wait, that's not quite right. This query returned over 9,000 states because we counted every single row that included a state.
But we actually want to count the distinct states. Let's try this again with COUNT DISTINCT. This query is going to look similar to the last one, but we'll use DISTINCT to cut out the repeated instances we got the last time. We'll use the query we just built, but replace COUNT with COUNT DISTINCT in our SELECT statement. Let's try this query. That's more like it. According to these results, we have three distinct states in our Orders data.
Let's check out what happens when we group by the state column in the warehouse table, which we'll call warehouse dot state. We'll use JOIN and GROUP BY in our FROM statement. Let's start there again. Then GROUP BY warehouse state. Now let's build out our SELECT statement on top of that. We're still going to use COUNT DISTINCT. Let's run it. Now we have three rows, one of each state represented in the Orders data. And our COUNT DISTINCT on the number of orders sums up the count we ran earlier: 9,999.
You'll find yourself using COUNT and COUNT DISTINCT during every stage of the data analysis process. Understanding what these queries are and how they are different is key. Great job, and I'll see you again soon!
Test your knowledge on using JOINS to aggregate data
Work with subqueries
Queries within queries
Hey, it's great to have you back. In this video I'll introduce you to another SQL query: subqueries.
A subquery is a SQL query that is nested inside of a larger query. Have you ever seen one of those nesting doll toys? They're also known as matryoshka Russian nesting dolls. Subqueries are a lot like nesting dolls. No, really. Your larger query can have a subquery in it and then that subquery could have a subquery, and then that subquery can have another subquery.
But when you stack them all together, they make one query. With subqueries you can combine different pieces of logic together. Because the logic of your outer query relies on the inner query, you can get more done with a single query. This means all of the logic is in one place, which makes it more efficient and easier to read.
The statement containing the subquery can also be called the outer query or the outer select.
This makes the subquery the inner query or inner select.
The inner query executes first so that the results can be passed on to the outer query to use. Subqueries can get a little confusing because there's so many layers. But if you keep in mind that the innermost query executes first, it'll be easier to order your subqueries when you want them to execute. Subqueries can also be nested inside all sorts of other queries. Usually you'll find subqueries nested in FROM or WHERE clauses. Let's try out some common subqueries.
We'll start with a subquery in a SELECT statement using the bike-sharing data from an earlier example.
If you would like to follow along with the instructor and don't know how to access the bike sharing dataset, refer to the reading Optional: Prepare to use the bike sharing dataset in BigQuery for more information.
For the first statement, let's say we want to compare the number of bikes available at a station to the average number of bikes available. We're going to use this query to pull the average number of bikes available. Then we're going to incorporate it as a subquery.
Now let's build our outer SELECT query. We want to select the station ID and the number of bikes available.
Then we'll put the SELECT query that's pulling the average number of bikes inside that outer query by using parentheses. We'll also build FROM into the subquery before closing it with another parenthesis and completing the outer query. The end of the outer join query has AS to show what we want to call this column and a final FROM statement to indicate which table we're referring to.
Now let's run it. And there! We've got a table with both the number of bikes available and the average number of bikes available at different stations.
It's really common to see subqueries nested in FROM and WHERE statements. So let's try those next. We could use a FROM statement to calculate the number of rides that have started at each station over time.
We'll start with our outer query and input SELECT station_id, name, and number_ of_ rides. We'll use AS to tell it how we want the table labeled, and FROM to tell it where we're pulling data from.
But before we finish that query, we'll add a subquery. We'll put our parenthesis here and then SELECT the start_station_id. Then we can tell it to COUNT the number_of _rides FROM the trip data and group it by the start_station_id.
After that, we'll close the subquery with a parenthesis so that we can continue building the outer query. We'll use AS again and then use INNER JOIN and ON to join it with the station ID data. Finally we'll tell it to put it in descending order.
Let's see what happens when we run that. We now have the number of rides started at each station.
SELECT
id,
name,
number_of_rides AS number_of_rides_starting_station
FROM
(
SELECT
start_station_id,
COUNT(*) number_of_rides
FROM
bigquery-public-data.london_bicycles.cycle_hire
GROUP BY
start_station_id
)
AS station_num_trips
INNER JOIN
bigquery-public-data.london_bicycles.cycle_stations ON id = start_station_id
ORDER BY
number_of_rides DESC
One last example. Let's use a WHERE statement. The bike-sharing company has two kinds of users: subscribers and one-time customers. Let's say we wanted a list of stations subscribers used.
As always, we start with the outer query. SELECT the station_ id and name FROM the public dataset we're using. This time we'll use a WHERE statement. We'll also use IN so that we can specify multiple values and this WHERE statement. Then we'll put our subquery in the parenthesis. We'll add SELECT, FROM, and WHERE again. But this time we'll tell it that we only want data on specific customers. It's good to note that you can use comparison operators in subqueries, even multiple row operators like IN, ANY, or ALL. In this case we'll use equals to indicate that we only want the subscriber user data.
Now let's run the query and we've got the station id and names for stations that fit our criteria.
That's subqueries in action. Subqueries can be challenging. There's a lot of layers to think through and you might find yourself running into errors when you practice. That's totally okay. Having to go through that challenge means you're growing.
If everything was easy, we wouldn't find new ways to grow. For me it's all about how much work and how much time I need to put in to do it. Give yourself time to practice this new concept. Coming up you'll get a chance to use subqueries to aggregate data or you can move on to the weekly challenge.
You'll take everything you've learned, using VLOOKUP, different JOINS, and subqueries and apply it to this upcoming assessment. We've been doing a lot of complex work. If you want to take a moment to review these videos before moving on, feel free. Once you've finished a challenge, I'll see you again for our next big learning adventure. See you soon.
Using subqueries to aggregate data
Hey there! So, we've got some experience nesting subqueries into our SQL statements to perform more complicated queries.
Now, it's time to talk about how to aggregate data with subqueries. Coming up, we'll learn about some new subcore statements and use them to aggregate data. The query we're going to build in this video is pretty advanced; it's going to be a little complicated, but I know you've got what it takes. Let's get started!
We've used functions like WHERE to filter our data before, but the WHERE function can't be used with aggregate functions. For example, you can use WHERE on a statement and follow it with GROUP BY. However, when you want to use GROUP BY first and then use WHERE on that output, you'll need a different function.
This is where HAVING comes in. HAVING basically allows you to add a filter to your query instead of the underlying table when you're working with aggregate functions. That way, it only returns records that meet your specific conditions.
Similarly, CASE returns records with your conditions by allowing you to include If/Then statements in your query. So, let's try to aggregate our data with subqueries and test out these new functions.
Let's say we're working with the company that makes socks that we talked about earlier.
To follow along with the instructor, you will need the Warehouse Orders dataset in BigQuery. If you didn't previously upload the CSV files for this dataset, follow the instructions in the Optional: Upload the warehouse dataset to BigQuery reading before you continue with this video.
We've been asked to calculate what percentage of the orders are fulfilled by each warehouse. Basically, we're interested in knowing which warehouses are delivering the most orders.
We've seen these tables before, but as a quick refresher, here's the Orders table, you can see the columns here, Order ID(order_id), Customer ID(customer_id), Warehouse ID(warehouse_id), Order Date(order_date), and Ship Date(ship_date). If we pull up the Warehouse table, we can check out its columns: we have the Warehouse ID(warehouse_id), Warehouse Alias(warehouse_alias), Maximum Capacity(maximum capacity), Total Number of Employees(employee_total), and the State(state) where the warehouse is located.
Before we start building the rest of our query, we'll want to alias our table names. As a reminder, aliasing is when you temporarily name a table or column in your query to make it easier to read and write. This example query is a little bit more complicated than the ones we've seen before, so aliasing will help save us some time.
We'll start by aliasing the Warehouse table in our FROM statement. The FROM statement in this query is near the end, but we'll build this first so that we can use the alias everywhere else. We'll simplify it to just "Warehouse" for the rest of this query.
We know that we're going to join these tables together, so let's add that while we're working on this part of the query anyway. We're using a LEFT JOIN here because we want all the information from our Warehouse data, even if it doesn't show up in the Orders table. Then, we'll alias the Orders table as part of the statement.
Now both of our tables have temporary names we can use, and we've already finished a JOIN statement. But before we can build the beginning of this query, let's go ahead and add our GROUP BY statement. After this JOIN, we'll GROUP these by the Warehouse ID and Name.
Now we'll go back to the beginning of the query. We'll SELECT the Warehouse ID. Then we'll use CONCAT to combine the strings with the Warehouse's State and alias as the Warehouse Name. Then we'll use COUNT to get the number of orders per Warehouse.
Next, we'll build in a subquery to pull the total number of orders placed across all warehouses. We'll input SELECT again and then write the subquery in parentheses. We'll put an asterisk after COUNT to indicate that we want to include everything from the Orders table.
Finally, we'll close out the subquery and use AS to name this column "Total Orders."
Now that our subquery logic is complete, we can use a CASE statement to create categories for our warehouses based on how many orders they fulfill. We'll represent these as percentages. You should notice COUNT in the statement a few times. We'll start by saying when the number of orders from our Orders table is less than or equal to 0.2, then the table will say "Fulfilled 0 to 20% of orders."
And then, we'll use WHEN again to indicate that when the number of orders is greater than 0.2 and less than or equal to 0.6, it'll say "Fulfilled 21% to 60% of orders." After that, we can use ELSE to have everything that doesn't meet the criteria of our CASE statement say "Fulfilled more than 60% of orders."
Then we'll use END AS to name this column "Fulfillment Summary."
That brings us back to the portion of the query we've already written, but we're going to add a HAVING statement at the very end of this query. Our Warehouse table has warehouses that are currently being built, and we want to filter those out since they aren't fulfilling orders yet. We can use HAVING to only include warehouses that have at least one order.
Now, before we execute this query, let's take a moment to look at the whole thing.
We have an outer SELECT, a COUNT subquery, a CASE statement, a JOIN, and HAVING all wrapped into one query. We've built a really complex query, so let's run it to see the new table.
And there! Now we can easily identify what percentage of our company's total orders are being fulfilled by each Warehouse.
This Warehouse has met our criteria, and we can see here in the Fulfillment Summary column the percentage categories we outlined in our CASE statement. Obviously, since we included a HAVING statement to specify only warehouses with at least one order, there aren't any warehouses currently under construction in this table.
That really complicated query we wrote created the specific table of data we can use to easily compare how these warehouses are performing. There you go! That's a quick taste of what it's like to work with subqueries and data aggregation.
Clauses like HAVING and CASE, paired with subqueries, will help you build more and more complex queries, which lets you do more and more complex things in SQL.
SQL functions and subqueries: A functional friendship
Subqueries
- Subqueries are usually nested in the SELECT, FROM, and/or WHERE clauses. Subqueries can’t be nested in SET queries.
- You can nest subqueries within SELECT, FROM, and/or WHERE clauses.
- Comparison operators such as >, <, or = help you compare data in subqueries. You can also use multiple row operators including IN, ANY, or ALL.
- You can use comparison operators such as greater than, less than, or equal to within subqueries.
- A subquery is called an inner query or inner select because it’s nested in a statement called an outer query or outer select.
- The statement containing a subquery is an outer query or outer select. Subqueries are nested within these statements, called inner queries or inner select.
- The innermost query executes first. Its parent query executes last so it can use the results returned by inner queries.
- The innermost query executes first and the parent query executes last.
- Parentheses are used to mark the beginning and end of a subquery.
- Parentheses are used to mark both the beginning and end of a subquery to make sure it returns the requested information.
- For a subquery to compare multiple columns, those columns must be selected in the main query.
- Only one column in the SELECT clause of a subquery may be selected. To compare multiple columns, select them in the main query.
- A SET command can’t have a nested subquery because it is used with UPDATE to adjust specific columns and values in a table.
- A subquery can’t be nested in a SET command because it is used with UPDATE to adjust specific columns and values in a table.
- Subqueries that return more than one row rely on multiple value operators such as the IN command.
- To return more than one row, subqueries must contain multiple value operators such as the IN command.
Justin: Where data analysis takes you
Hi, I'm Justin. I work here at Google in the Google Cloud space. I lead a small team of data analysts who answer business problems for our executive team. The first thing I would tell you about my journey to analytics was, it was not direct. I came to Google three years ago and I have been doing data analytics and I've really been enjoying that role. It's tied together this through line of excitement about data and answering questions that have an impact.
Your career path is not always straightforward. Maybe data analytics won't be my final destination. But what I would say is just keep changing little by little, figure out what's exciting about your role right now, in my case, it was, I loved avoiding politics and coalition-building and really just bringing better facts and better insights to really motivate decisions.
So figure out what you like about your current role, your current job, and then, figure out what different role could you take that would build on that, but maybe get you more of what you like. Be curious. The number one skill is really just asking why and then going and trying to answer that question. It will lead you down along whether it's Wikipedia to understand this model there is someone usually saying, "Why did they use that model?" You go and look up that model and sort of follow that thread. Or there's so many great resources for different languages, if you want to understand SQL. There's so many great tools, but I guess the number one thing is just sort of follow your curiosity.
When I'm reviewing resumes, the first thing obviously I'm looking for is those core skills, the ability to analyze data, demonstrate it, experience with some of the tooling we use. But I'm also looking for real passion in answering questions. Example is where someone's really dug in and tried to understand the why and they just kept asking, "Why is this happening, why is this happening?" and really dig in.