Course 5‐4 - Forestreee/Data-Analytics GitHub Wiki
Google Data Analytics Professional
Analyze Data to Answer Questions
WEEK4 - Performing data calculations
Calculations are one of the more common tasks that data analysts complete during analysis. In this part of the course, you’ll explore formulas, functions, and pivot tables in spreadsheets and queries in SQL, all of which will help with your calculations. You’ll also learn about the benefits of using SQL to manage temporary tables.
Learning Objectives
- Describe the use of functions to conduct basic calculations on data in spreadsheets
- Discuss the use of pivot tables to conduct calculations on data in spreadsheets
- Demonstrate an understanding of the use of SQL queries to complete calculations
- Explain the importance of the data-validation process for ensuring accuracy and consistency in analysis
- Discuss the use of SQL queries to manage temporary tables
- Reflect on how conditional statements can be used to create complex queries and functions
- Generate multiple points of summary based on a wide variety of conditions using COUNTIF, SUMIF, MAXIF, and AVERAGEIF
Get started with data calculations
Data calculations
Hi. Good to have you back. Coming up, we're going to be reviewing some familiar concepts and then using those concepts to explore new ones. As a data analyst, you'll use key tools and processes over and over, but you'll also learn new things as you grow in your job. It could be anything from building a new kind of analysis to a time-saving shortcut.
When I first got to Google, I relied on just a couple of programs and tools to access data and do my analysis. But I soon realized that I wasn't working as efficiently as I wanted to. Once I got comfortable pulling data and analyzing it using SQL, it allowed me to be a lot more efficient than before. And the better I got at SQL and pulling the data from data tables, the faster I completed my analysis. I was hooked.
Over the next few videos, I will show you some ways to be as efficient as possible while completing calculations during your analysis.
We'll start by revisiting spreadsheets where we'll look at formulas for basic calculations.
Then we'll move into conditional formulas that use the IF function to check whether a condition is met through a calculation.
After that, we'll explore the multifunctional SUMPRODUCT function. Try saying that five times quickly! SUMPRODUCT adds and multiplies all in one step, so it's very useful.
Next we'll take another look at pivot tables. If you've skipped around, and it's your first time learning about them, you'll get to know all about them. Pivot tables have tons of uses, including organizing your calculations. We'll then pivot to SQL, pun intended.
We'll show how queries and calculations go hand in hand in SQL.
We'll also look at temporary tables in SQL, which are helpful for temporarily storing your data during analysis.
We'll be covering lots of new concepts in these videos, so feel free to hit the pause button at any time to think through the problem or steps to try it on your own. And you can always review the videos as much as you need to.
So to recap, we'll have a little bit of a review, and then cover some all new concepts, all about calculations. Are you ready? Good. Me, too.
Common calculations formulas
Hey there. You probably do a lot of calculations in your daily life. Maybe it's figuring out how much to tip someone or balancing your budget.
You might do some of these calculations in your head or with paper and pencil or the calculator on your phone. You might even have shortcuts to use to make the calculations easier. You'll perform a lot of calculations as a data analyst too. But they'll involve more numbers in a wider range of calculations. That's where you'll put your data analyst tools to work. We'll show you how you could use formulas in a spreadsheet to complete some of the most basic calculations. Formulas are one of the many shortcuts that data analysts use. But rest assured, even though they're shortcuts, they'll still calculate with complete accuracy. We've covered a lot of these calculations earlier in the program. But if you skip that part and want a refresher, we'll review them here. These calculations will also be more advanced than the ones we've covered so far. But they'll also be closer to what you might use on the job. We'll be using Google Sheets in this video, but you can also use Excel. The steps might look a little different in Excel, but the outcomes will be the same.
Would you like to follow along with the instructor using the same spreadsheet? To use the template for this spreadsheet, click the link below: Discount Variety Store-433, Monthly Sales
Let's try out some calculations with sales data from a discount store chain.
We'll look at data for one of the stores in the chain. Our objective: use the existing sales data to find any trends. This is a great way to see a lot of the ways formulas can be useful in your analysis.
We'll start by finding annual sales over the years 2011-2020. The data is already organized in columns by month and in rows by year. But we don't have the total sales for each year yet.
We can use a sum function to help us figure that out. We'll add the sales from 2011 first. We'll add a heading for the annual sales column, then we can type our sum function and a formula. All formulas begin with an equal sign. We'll type that first, followed by sum and then an open parenthesis. After the open parenthesis, we need to tell the formula which cells are being added. In this case, we need data from the whole row which begins in cell B2. B2 is a cell reference we'll use. Instead of typing each cell one by one, we can put them in the formula quickly by selecting cell B2 and dragging the fill handle across the row to the last cell with sales data, M2. Now we'll complete the formula by closing the parentheses and pressing Enter. Just like that, we've calculated the total sales for 2011.
Here's another shortcut we worked with in an earlier video: the fill handle, the tiny box in the corner of a selected cell. You can use it for lots of things, like selecting multiple cells for a formula or continuing a pattern across several cells, so the fill handle definitely qualifies as a shortcut. We can use the formula we created to calculate the total sales for the other years in the dataset. All we have to do is drag the fill handle down the other cells in the annual sales column, and we'll have total sales data for the rest of the years in the dataset.
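The row totals above can also be sketched outside a spreadsheet. Here's a minimal Python sketch of the same SUM-across-a-row idea; the monthly figures are made-up placeholders, not the course dataset:

```python
# Each row holds one year's monthly sales (Jan..Dec), like cells B2:M2.
# The figures below are illustrative placeholders, not the course dataset.
monthly_sales = {
    2011: [1200, 950, 1100, 1300, 1250, 1400, 1500, 1450, 1350, 1300, 1600, 2100],
    2012: [1250, 1000, 1150, 1350, 1300, 1450, 1550, 1500, 1400, 1350, 1650, 2200],
}

# Equivalent of =SUM(B2:M2) dragged down the Annual Sales column.
annual_sales = {year: sum(months) for year, months in monthly_sales.items()}
```

Dragging the fill handle down a column is the spreadsheet's version of repeating the same calculation for every row, which is exactly what the dictionary comprehension does here.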
Let's say, we also need to find the growth in annual sales from year to year. This would be a good time to think through the problem before we try to solve it. Do we have the data we need to solve this? Not yet. Thinking backwards like this helps us plan out the steps to move forward.
The first step we'll need to do is calculate the total sales per year. Then we'll measure the rate of change between years.
We'll start by labeling a new column. In this case, we won't need to use a function or parentheses, since we're only using data from two cells. We can just use the names of those cells. We'll type an equal sign and then click cell N3, which automatically populates that cell reference in the formula. Next, we'll add a minus sign to the formula because we're subtracting to find the difference between two consecutive years. Clicking cell N2 gives us the total from 2011, which we can then subtract from the total from 2012. Then we hit Enter and get our sales growth from 2011-2012.
We're definitely getting some useful data here. Let's keep going. We can also use our sales growth to find the growth rate between the two years. We'll show this as a percentage. We'll head our column with the percent sign and growth. To do this, we'll divide the total in cell O3 by the annual sales from 2011 in cell N2. A slash is a symbol that a formula recognizes as division, so we'll place that between the two cell references and presto, there's the growth rate. Growth rates are usually shown as percentages, which can be easier than a decimal to read and understand. Let's change this number to a percentage. Time for another shortcut. All we have to do is click the percent style button and our growth rate will become a percentage. We can select the cells for both the total growth and the growth rate to populate the rest of the two columns. We have some negative numbers, but that just means that there was negative growth from one year to the next.
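The year-over-year difference and growth-rate steps can be sketched the same way. This is a hedged illustration with placeholder annual totals, not the course data:

```python
# Annual totals like column N — placeholder figures, not the course data.
annual_sales = {2011: 16500, 2012: 17150, 2013: 16000}
years = sorted(annual_sales)

growth = {}       # like =N3-N2: difference between consecutive years
growth_rate = {}  # like =O3/N2: growth divided by the earlier year's total
for prev, curr in zip(years, years[1:]):
    growth[curr] = annual_sales[curr] - annual_sales[prev]
    growth_rate[curr] = growth[curr] / annual_sales[prev]

# Percent-style formatting, like the percent style button in Sheets
print(f"{growth_rate[2012]:.1%}")
```

Negative values in `growth` simply mean negative growth from one year to the next, just as in the spreadsheet.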
We've got just a few more things to calculate for our stakeholders. The next step is finding the average sales. We want to compare sales between months to learn if there's a trend. We'll add this in a row instead of a column. This will line up our averages under each month. To find our averages, we'll calculate the total and then divide that total by the number of values added to get it. We can do this by using the AVERAGE function. Between our parentheses we'll select the cells that contain the sales data for January, B2 through B11.
We'll duplicate that formula across the row through December to look for trends.
Right away, we know that summer months and December have the highest average sales.
Since our stakeholders will want to understand our findings quickly and easily, we'll add a little visualization to the data with conditional formatting. You'll learn more about data visualizations like conditional formatting soon. But here's a sneak peek.
Conditional Formatting is a spreadsheet tool that changes how cells appear when values meet specific conditions.
Let's apply conditional formatting to the cells with the average sales by month.
We'll use a color scale to show the range of averages. We'll leave the lowest monthly average as white and apply shades of green to the rest of the values.
The brighter the green, the higher the average.
Now, when we share our analysis with our stakeholders, they will be able to tell right away which months have the highest average sales. Just a couple more steps to complete our analysis. Now we need to find the minimum and maximum for average monthly sales. With the dataset this small, it might be easy to find the minimum and maximum values without a formula, but it's still good practice to use one.
Not to mention, using a formula helps prevent human error. We'll again rely on formulas with functions to do these calculations. We'll start with the lowest monthly average. Our function here is MIN, followed by the cells with the monthly averages, B12 through M12. After we press Enter, the lowest monthly average is calculated.
We can repeat the same steps to find the highest monthly average. In this formula we'll use the same data, but we'll replace MIN with MAX, for maximum. For this store location, sales are strongest in December and weakest in January.
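The AVERAGE, MIN, and MAX steps above map directly onto built-in operations in most languages. A small sketch with placeholder sales figures (truncated to three months for brevity):

```python
# Monthly sales per year, like rows 2-11 — placeholder figures, not the course data.
data = {
    2011: [100, 90, 110],   # Jan, Feb, Mar
    2012: [120, 80, 130],
}

# =AVERAGE(B2:B11) dragged across the months: average each month over all years
num_months = 3
monthly_avg = [sum(data[y][m] for y in data) / len(data) for m in range(num_months)]

lowest = min(monthly_avg)   # like =MIN(B12:M12)
highest = max(monthly_avg)  # like =MAX(B12:M12)
```

With a dataset this small you could spot the minimum and maximum by eye, but as in the video, letting the function do it avoids human error.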
We could share these findings with stakeholders if they've met our objectives. If they haven't, we might need to continue with our analysis. Either way, I hope you've learned how spreadsheet formulas can be valuable tools when doing calculations. Coming up, we'll check out more formulas. See you soon.
Functions and conditions
Welcome back! One of the first calculations most kids learn how to do is counting. Soon after, they learn adding, and that doesn't go away. No matter what age we are, we're always counting or adding something, whether it's change at the grocery store or measurements in a recipe. Data analysts do a lot of counting and adding too. And with the amount of data you'll come across as a data analyst, you'll be grateful to have functions that can do the counting and adding for you.
So let's learn how these functions COUNTIF and SUMIF can help you do calculations for your analysis more easily and accurately.
We'll start with the COUNTIF function. You might remember COUNTIF from some of the earlier videos about data cleaning. COUNTIF returns the number of cells that match a specified value.
Earlier, we showed how COUNTIF can be used to find and count errors in a data set.
Here we'll only be counting. Just a reminder though, while we won't be actively searching for errors in this video, you'll still want to watch out for any data that doesn't look right when doing your own analysis. As a data analyst, you'll look for and fix errors every step of the way.
If you would like to follow along with this video, click the link below: Kitchen Supplies, Transactions
For this example, we'll look at a sample of data from an online kitchen supplies retailer. Our stakeholders have asked us to answer a few questions about the data to understand more about customer transactions, including the revenue they're bringing in. We've added the questions we need to answer to the spreadsheet.
We'll set up a simple summary table, which is a table used to summarize statistical information about data.
We'll use the questions to create the attributes for our table columns: count, revenue total, and average revenue per transaction.
Each of our questions asks about transactions with one item or transactions with more than one item, so those will be the observations for our rows. We'll make Quantity the heading for our observations. We'll also add borders to make the summary table nice and clear.
The first question asks, How many transactions include exactly one item? To answer this, we'll add a formula using the COUNTIF function in cell G11. We'll begin with an equal sign, COUNTIF, and an open parenthesis. Column B has data about quantity. So we'll select cells B3 through B50, followed by a comma.
Next, we need to tell the formula the value that we're looking for in the cells we've selected. We want the formula to count the number of transactions if they equal 1. In this case, between quotation marks, we'll type an equal sign and the number 1 because that's the exact value we need to count. When we add a closed parenthesis and press Enter, we get the total count for transactions with only one item, which is 25. We can follow the same steps to count values greater than one. But this time, because we only want values greater than 1, we'll type a greater-than sign in our formula instead of an equal sign. Getting this information helps us compare the data about quantity.
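The two COUNTIF conditions above — "=1" and ">1" — amount to counting rows that pass a test. A minimal sketch with placeholder quantities (not the 48 transactions in the course spreadsheet):

```python
# Quantity column, like cells B3:B50 — placeholder values, not the course data.
quantities = [1, 3, 1, 2, 1, 1, 5, 1]

# =COUNTIF(B3:B50, "=1"): count cells equal to 1
exactly_one = sum(1 for q in quantities if q == 1)

# =COUNTIF(B3:B50, ">1"): count cells greater than 1
more_than_one = sum(1 for q in quantities if q > 1)
```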
Okay, now we need to find out how much total revenue each transaction type brought in. Since the data isn't organized by quantity, we'll use the SUMIF function to help us add the revenue for transactions with one item and with more than one item separately.
SUMIF is a function that adds numeric data based on one condition.
Building a formula with SUMIF is a bit different than one with COUNTIF. They both start the same way, with an equal sign and the function, but a SUMIF formula contains the range of cells to be evaluated and the criterion (condition) to evaluate them against. In other words, SUMIF has a list of cells to check based on the criterion you set in the formula. Then the range where we want to add the numbers is placed in the formula, if that range is different from the range being evaluated. There are commas between each of these parts. Adding a space after each comma is optional.
So let's try this. In cell H11, we'll type our formula.
The range to be evaluated is in column B, so we'll select those cells.
The condition we want the data to meet is for the values in the column to be equal to one. So we'll type a comma and then inside quotes an equal sign and the number one. Then we'll select the range to be added based on whether the data from our first range is equal to one. This range is in column C, which lists the revenue for each transaction.
So every amount of revenue earned from a transaction with only one item will be added together. And there's our total. Since this is revenue, we'll change the format of the number to currency, so it shows up as dollars and cents. So the transactions with exactly one item earned $1,555.00 in revenue.
Let's see how much the transactions with more than one item earned. Okay, let's check out the results. Just like with our COUNTIF examples, the second SUMIF formula will be the same as the first, except for the condition, which will make it greater than one. When we run the formula, we discover that the revenue total is much higher, $4,735.00. This makes sense, since the revenue is coming from transactions with more than one item. Good news.
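The key idea in SUMIF — evaluate one range, sum a different range — is a filtered sum over parallel columns. A sketch with placeholder quantity and revenue values (not the course data):

```python
# Parallel columns: quantity (B3:B50) and revenue (C3:C50) — placeholder values.
quantities = [1, 3, 1, 2, 1]
revenues = [40.0, 150.0, 65.0, 120.0, 55.0]

# =SUMIF(B3:B50, "=1", C3:C50): test the quantity, add the matching revenue
rev_one = sum(r for q, r in zip(quantities, revenues) if q == 1)

# =SUMIF(B3:B50, ">1", C3:C50): same pattern with a different condition
rev_multi = sum(r for q, r in zip(quantities, revenues) if q > 1)
```

Note how `zip` keeps each quantity paired with its revenue, just as SUMIF keeps the evaluated range and the sum range row-aligned.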
To complete our objective, we'll do two more quick calculations. First, we'll find the average revenue per transaction by dividing each total by its count. This will show our stakeholders how much of a difference there is in revenue per transaction between one item and multiple item transactions. This information could be useful for lots of reasons. For example, figuring out whether to add a discount on purchases with more than one item to encourage customers to buy more.
We'll put these calculations in the last column of our summary table. You might remember that we use a slash in a formula as the operator for division calculations. The average revenue for transactions with one item is $62.20. And the average revenue for transactions with more than one item is $205.87.
And that's it for our analysis. Our summary table now gives the stakeholders and team members a snapshot of the analysis that's easy to understand. Our COUNTIF and SUMIF functions played a big role here. Using these functions to complete calculations, especially in large datasets, can help speed up your analysis. They can also make counting and adding a little more interesting. Nothing wrong with that.
And coming up, we'll explore more functions to make your calculations run smoothly. Bye for now.
Functions with multiple conditions
SUMIF to SUMIFS
- The basic syntax of a SUMIF function is: =SUMIF(range, criterion, sum_range)
- The basic syntax is: =SUMIFS(sum_range, criteria_range1, criterion1, [criteria_range2, criterion2, ...])
COUNTIF to COUNTIFS
- The basic syntax for COUNTIF is: =COUNTIF(range, criterion)
- COUNTIFS has the same basic syntax as SUMIFS: =COUNTIFS(criteria_range1, criterion1, [criteria_range2, criterion2, ...])
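The syntax lines above show that SUMIFS and COUNTIFS just chain criteria-range/criterion pairs, all of which must match. A sketch with placeholder rows (quantity, region, revenue are illustrative columns, not the course data):

```python
# Columns: quantity, region, revenue — placeholder rows, not the course data.
rows = [
    (1, "East", 40.0),
    (3, "East", 150.0),
    (2, "West", 120.0),
    (4, "East", 200.0),
]

# =COUNTIFS(A:A, ">1", B:B, "East"): every criteria pair must match
count = sum(1 for q, region, _ in rows if q > 1 and region == "East")

# =SUMIFS(C:C, A:A, ">1", B:B, "East"): note the sum_range comes FIRST in SUMIFS
total = sum(rev for q, region, rev in rows if q > 1 and region == "East")
```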
For more information

SUMIFS and COUNTIFS are just two examples of functions with multiple conditions. They help demonstrate how multiple conditions can be built into the basic syntax of a function. But there are other functions with multiple conditions that you can use in your data analysis. There are a lot of resources available online to help you get started with these other functions:
- How to use the Excel IFS function: This resource includes an explanation and example of the IFS function in Excel. This is a great reference if you are interested in learning more about IFS. The example is a useful way to understand this function and how it can be used.
- VLOOKUP in Excel with multiple criteria: Similar to the previous resource, this resource goes into more detail about how to use VLOOKUP with multiple criteria. Being able to apply VLOOKUP with multiple criteria will be a useful skill, so check out this resource for more guidance on how you can start using it on your own spreadsheet data.
- INDEX and MATCH in Excel with multiple criteria: This resource explains how to use the INDEX and MATCH functions with multiple criteria. It also includes an example which helps demonstrate how these functions work with multiple criteria and actual data.
- Using IF with AND, OR, and NOT functions in Excel: This resource combines IF with AND, OR, and NOT functions to create more complex functions. By combining these functions, you can perform your tasks more efficiently and cover more criteria at once.
Hands-On Activity: Working with conditions
Composite functions
Hi, again. Data analysts love discovering new ways to work on their analysis, especially when those new ways simplify their work. I know I'm a big fan of learning new tricks to complete tricky tasks. Instead of trying to find a new way to do something every time I do an analysis, I try to learn from other people by asking questions and getting help when I need it. The people I work with like to use the phrase, stealing with pride. All this means is that you should feel no shame for using a process in your analysis that you learned from someone or somewhere else. Fellow team members, message board posts, online searches, I've used all of these resources for ideas. With pride! Of course, I always cite my sources when I do. That's a super important step to remember.
The SUMPRODUCT function is also one of those tricks that analysts come across either on their own or from another source. You can also think of it as a shortcut for doing more complex calculations. We'll show you how SUMPRODUCT works and when you might use it to make your work life simpler.
SUMPRODUCT is a function that multiplies arrays and returns the sum of those products. Here's what the SUMPRODUCT formula looks like: an equal sign, then SUMPRODUCT, followed by an open parenthesis and the arrays being multiplied and then added together. Each array is separated by a comma.
An array is kind of like a range in a spreadsheet. But keep in mind, an array is a collection of values in cells, not the cells themselves.
To follow along using the same spreadsheet as the instructor, click the link below: Kitchen Supplies, Profit
When added to a formula, the SUMPRODUCT function multiplies each of the values in two or more arrays together.
For example, each value in the array of cells B3 through B7 can be multiplied by its corresponding value in the array of cells C3 through C7. B3 times C3, B4 times C4, and so on. It will then return the sum of all of those multiplications. Let's check out an example using data from a kitchen supplies company.
You might remember this example from our COUNTIF and SUMIF video. We've been given some data about a product order, including the quantity of each product that was sold in the order and the unit price, which tells how much one of each product cost.
Our job is to use the data in these two columns to find out the total revenue for this order. That's where SUMPRODUCT comes in. To find the total revenue, we need to do both addition and multiplication calculations.
First off, we need to find the revenue that each item brought in separately.
If we did this without SUMPRODUCT, we'd have to multiply each quantity by its unit price: 50 times $1.25, 25 times $5, and so on.
Then we'd have to add all of those revenue amounts together to get the total revenue.
Fortunately, the SUMPRODUCT function does all of that for us. Let's add the label Total Revenue in cell G5 and then click G6 to input our formula. We'll then start our formula with an equal sign and the function, followed by an open parenthesis. It's good to remind ourselves that the arrays we add to our formula should always be inside the parentheses. Next, we'll select cells B3 through B7 for the first array, followed by a comma. The comma acts as a separator between the two arrays in the formula. Then, we'll select cells C3 through C7 for the second array, followed by a closed parenthesis to complete our formula.
We don't need to include the brackets in our actual formula. We included them in the syntax example to clearly define each array for you. Then we press Enter to get our total revenue. Since we're dealing with revenue, we'll format the number as currency. We've learned the total revenue is $655. But that's not the actual profit from the sales of these kitchen supplies because we haven't included the profit margin in our calculations.
The profit margin is a percentage that indicates how many cents of profit have been generated for each dollar of sale.
In our dataset, product # 789 has a profit margin of 20 percent, meaning each product sold earns a total profit of $0.20 for every dollar. And just like the calculation for revenue, we can save time finding profit margin by using the SUMPRODUCT function. There's only one difference between the formula for profit margin and revenue in this spreadsheet. But it's an important difference.
To start, in cell G7 we type the same first part of the formula. Then we include the two arrays in the same way as well. But instead of ending our formula, we add another comma followed by another array. This time, we'll select the cells with a profit margin, D3 through D7.
We'll finish our formula, and our calculation is complete.
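Both SUMPRODUCT calls — the two-array revenue version and the three-array profit version — follow the same "multiply pairwise, then add" pattern. A sketch with placeholder quantities, prices, and margins (not the course figures):

```python
# Quantity (B3:B7), unit price (C3:C7), profit margin (D3:D7) — placeholder values.
qty    = [50, 25, 100]
price  = [1.25, 5.00, 2.00]
margin = [0.20, 0.10, 0.25]

# =SUMPRODUCT(B3:B7, C3:C7): multiply each pair, then sum the products
revenue = sum(q * p for q, p in zip(qty, price))

# =SUMPRODUCT(B3:B7, C3:C7, D3:D7): a third array folds the margin in,
# saving the separate multiply-then-add steps described above
profit = sum(q * p * m for q, p, m in zip(qty, price, margin))
```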
The SUMPRODUCT function saved us from having to multiply each individual revenue amount by each profit margin percentage, then add each profit margin amount together. Using SUMPRODUCT for calculations is a time-saver and helps you avoid making mistakes. Definitely a trick worth remembering, and there's more worth remembering about calculations coming up next.
Practice Quiz: data calculations
Pivot...pivot...pivot...
Start working with pivot tables
Hey, there. By now, we've learned a lot about functions and formulas. They are very helpful tools for your toolbox and great for finding shortcuts to complete calculations. But there's another tool out there that does some of the same things in a spreadsheet: the pivot table.
As a quick reminder, pivot tables let you view data in multiple ways to find insights and trends. We've talked before about how pivot tables help with cleaning and organizing your data including sorting and grouping data. But pivot tables can also help with calculations. For example, they're great for quickly calculating sums and averages.
Let's revisit our movie dataset to show you how pivot tables and calculations work hand in hand. Earlier, we summarized and organized this data in pivot tables. We'll do that here too. But in this case, the organization is a bonus to using pivot tables for calculations. You can do these steps in Excel as well, though some of the steps might look different. If you're using Excel, you can check out the reading after this video for more details about pivot tables in Excel spreadsheets.
Would you like to follow along with the instructor using the same spreadsheet? To use the template for the spreadsheet, click the link below: Movie Data Starter Project
In this example, your manager asks you to find some trends to help them think through new movie ideas using revenue calculations. This spreadsheet has data about movies from several years ago, so it probably wouldn't be as useful right now. Still, the steps we take to analyze the data apply then and now. So let's get into it.
First, we need to find out how much revenue was generated each year.
A pivot table is a good way to organize this. So we'll build a pivot table to show this.
In our pivot table, we can also find the average revenue per movie.
We can then check our findings for some possible trends.
We'll start by finding the revenue generated each year. This gives us the release date for each movie in column B and the box office revenue in column N. Instead of reorganizing the table by year and building a formula to calculate the revenue per year, we'll create a pivot table.
The menu option has changed slightly. To insert a pivot table, select Insert and then Pivot table.
We'll add the pivot table in a new sheet, keeping the data range of cell A1 to cell N509.
Adding a new sheet is especially helpful when working in a large dataset like this. It helps keep our calculations together in one place and separate from the rest of the data. We'll rename this sheet revenue to call out where our calculations are, both for ourselves and for anyone else on our team who might need our analysis.
Now we can build our pivot table, starting with the rows. We'll sort the rows by release date to find out the revenue for each year. You might notice this creates a row for every date on which one or more movies in this dataset was released.
Since we only need the years, we'll right-click in one of the cells in the release date column to create a pivot date group and group by year. Now we have rows, one for each year in which these movies were released.
Next, we want to work with the values. We'll add the box office revenue data here. This populates the column next to the release dates with the total box office revenue for each year. These calculations are automatic because the pivot table is already set to summarize the data using the SUM function, so there's no need to change this setting.
There are other functions in the Summarize by menu though, such as MIN for the minimum amount of revenue each year and COUNT for the number of movies that generated revenue in each year.
Okay, let's check out what we've got here. This data shows that 2014 had the highest revenue, while 2016 had the lowest.
This might be useful information, but finding the average revenue per movie would most likely be more useful, since a different number of movies was released each year. So we'll add another column for the average revenue earned by each year's movies. We can do this in the same pivot table: we'll add another value and change the function that we use to summarize from SUM to AVERAGE.
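A pivot table grouped by year with SUM, COUNT, and AVERAGE values is just a group-by aggregation. A sketch with placeholder (year, revenue) rows, not the 508-movie dataset:

```python
from collections import defaultdict

# (release year, box office revenue) pairs — placeholder rows, not the movie dataset.
movies = [(2014, 300.0), (2014, 500.0), (2015, 40.0), (2015, 60.0), (2015, 500.0)]

totals = defaultdict(float)  # Summarize by SUM
counts = defaultdict(int)    # Summarize by COUNT
for year, revenue in movies:
    totals[year] += revenue
    counts[year] += 1

# Summarize by AVERAGE: each group's sum divided by its count
averages = {year: totals[year] / counts[year] for year in totals}
```

Grouping the release dates by year in the pivot table corresponds to using the year as the dictionary key here.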
The average function gives us the average revenue per year for the movies in the dataset. We can see that the average revenue in 2015 was much lower than the other years. Since this data stands out so much, let's keep exploring to find out why. Taking your analysis to another level like this is a sign of a great analyst. When you're in your job, you want to answer the questions that your manager and stakeholders ask. But you also want to answer the ones that come up while you're doing your analysis. So let's try to figure this out.
First, we'll find out how many movies from each year were included in the dataset. We'll add a new value and use the COUNT function this time. This shows us that there are more movies in the dataset from 2015 than from any other year. But 2015 still has the second lowest total box office revenue. This could mean a few things. It's likely that a lot of the movies from 2015 just didn't earn much revenue compared to the other years, which would bring down the overall average revenue, even if the total revenue remained on par with the other years.
We'll explore just this one possibility here. But you can always go further when you analyze data in your own job. It will depend on your objectives and the questions you need to answer.
For now let's copy and paste our pivot table so we can test our hypothesis.
We'll rename the columns in our copied table to differentiate them from our original table. We'll name them based on the data we'll be looking at, which I'll explain in the next video.
Now our copied pivot table is ready for us to test our hypothesis. Next, we're going to use filters to find out how many movies earned less than $10 million revenue in 2015. Then we'll also create a calculated field to determine what percentage of the total movies from that year they represent. I'll be here when you're ready to learn more about pivot tables.
Pivot tables continued
Welcome back. In the last video, we created a pivot table of movie data and revenue calculations to help our manager think through new movie ideas. We used our pivot table to make some initial observations about annual revenue. We also discovered that the average revenue for 2015 was lower than other years, even though more movies were released that year. We hypothesized that this was because more movies that earned less than $10 million in revenue were released in 2015. To test this theory, we created a copy of our original pivot table. Now we are going to apply filters and calculated fields to explore the data more. Let's get started.
You'll remember that the filter option lets us view only the values we need. We'll select a cell in our copied pivot table and add a filter to the box office revenue column. The filter will then be applied to the entire table. When we open the filter menu, we can choose to filter the data to show specific values.
But in our case, we want to filter by condition so we can figure out how many movies in each year earned less than $10 million. The condition we'll use in our filter is "less than," and our value will be $10 million, which is why we renamed these columns earlier. We'll type our number in a dollars-and-cents format so the condition matches the data in our pivot table. This might not be necessary, but it prevents potential errors.
Now we know that 20 movies released in 2015 made less than $10 million. This seems like a high number compared to the other years. But keep in mind, there were more movies from our data set released in 2015.
Before we move on, let's use a calculated field to verify our average because it was copied from another pivot table before we filtered it. That way we can check that it's correct.
We'll create a customized column called a calculated field using our values menu.
A calculated field is a new field within a pivot table that carries out certain calculations based on the values of other fields. You can do this in Excel too using field settings and the create formula menu.
For the formula in our calculated field, we'll use the sum function and divide the sum of the box office revenue data from our original table by the count of the same data.
Because we applied our filter to this pivot table earlier, this formula will only return the average revenue of movies under $10 million. That worked.
We were able to check the accuracy of some of our data before analyzing it. Always a good thing. But it's still difficult to tell how much of an impact these lower earning movies had on the average revenue.
Let's run a quick formula to find the percentage of movies for each year that earned less than $10 million. This will make it easier to compare from year to year. Instead of a calculated field, we'll add this as a formula in a new column, that way we can pull data from both of our pivot tables.
We'll put a header for our table in cell G10 and name it percent of total movies. Then we'll add our formula to the next cell in the column: divide the number of movies in the copy table by the number of movies in the original table. Then we'll use the fill handle on the cell with the formula and drag it to apply the formula to the rest of the years. Finally, we'll format these numbers as percentages.
Now our analysis shows that 16 percent of the movies released in 2015 earned less than $10 million of revenue. The other years are all close to 10 percent. This is one possible explanation for why the average revenue is comparatively low in 2015.
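The percent-of-total calculation described above can be sketched in plain Python. The movie figures below are hypothetical sample data for illustration, not the course dataset:

```python
# A minimal sketch of the pivot-table logic from this section, in plain
# Python. The (year, revenue) pairs are made-up sample data.
from collections import defaultdict

movies = [
    (2014, 250_000_000), (2014, 8_000_000), (2014, 40_000_000),
    (2015, 5_000_000), (2015, 9_500_000), (2015, 120_000_000), (2015, 2_000_000),
    (2016, 75_000_000), (2016, 6_000_000),
]

total = defaultdict(int)       # all movies per year
under_10m = defaultdict(int)   # movies earning < $10M per year

for year, revenue in movies:
    total[year] += 1
    if revenue < 10_000_000:
        under_10m[year] += 1

# Percent of each year's movies that earned less than $10 million
percent = {year: 100 * under_10m[year] / total[year] for year in total}
for year in sorted(percent):
    print(year, f"{percent[year]:.0f}%")
```

The same grouping-then-dividing logic is what the two pivot tables and the `Percent of Total Movies` column accomplish in the spreadsheet.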
In real life, we'd most likely need to take our analysis even further depending on our goals. But for now, we're all set. You've learned how you can use pivot tables to perform data calculations. It will take practice, but pivot tables are worth it because they do more than calculate. They organize and filter data too.
Together we've covered functions, formulas, and pivot tables. All great tools to use in analysis. With practice and experience, it will feel like you've used them forever. Just take your time getting to know how they work. Keep exploring these videos and the readings. Great work.
Elements of a pivot table
Using pivot tables for analysis
Using pivot tables in analysis
Create your pivot table
Using your pivot table for analysis
You can perform a wide range of analysis tasks with your pivot tables to quickly draw meaningful insights from your data, including performing calculations, sorting, and filtering your data. Below is a list of online resources that will help you learn about performing basic calculations in pivot tables, as well as resources for learning about sorting and filtering data in your pivot tables.
Perform calculations
Resource | Description |
---|---|
Calculate values in a pivot table (Microsoft Excel) | Microsoft Support’s introduction to calculations in Excel pivot tables. This is a useful starting point if you are learning how to perform calculations with pivot tables specifically in Excel. |
Create and use pivot tables (Google Sheets) | This guide is focused on using pivot tables in Google Sheets and it provides instructions for creating calculated fields. This is a quick how-to guide you can save and reference as a quick reminder on how to add calculated fields. |
Pivot table calculated field example | This resource includes a detailed example of a pivot table being used for calculations. This step-by-step process demonstrates how calculated fields work and provides you with some idea of how they can be used for analysis. |
All about calculated field in pivot tables (Google Sheets) | This is a comprehensive guide to calculated fields for Google Sheets. If you are working with Sheets and are interested in learning more about pivot tables, this is a great resource. |
Pivot table calculated fields: step-by-step tutorial | This tutorial for creating your own calculated fields in pivot tables is a really useful resource to save and bookmark for when you start to apply calculated fields to your own spreadsheets. |
Pivot tables in Google Sheets | This beginner’s guide covers the basics of pivot tables and calculated fields in Google Sheets and uses examples and how-to videos to help demonstrate these concepts. |
Sort your data
Resource | Description |
---|---|
Sort data in a pivot table or PivotChart (Microsoft Excel) | Microsoft Support's how-to guide to sorting data in pivot tables. This is a useful reference if you are working with Excel and are interested in checking out how filtering will appear in Excel specifically. |
Customize a pivot table (Google Sheets) | This guide from Google Support focuses on sorting pivot tables in Google Sheets. This is a useful, quick reference if you are working on sorting data in Sheets and need a step-by-step guide. |
Pivot tables - Sorting data | This tutorial for sorting data in pivot tables includes an example with real data that demonstrates how sorting in Excel pivot tables works. This example is a great way to experience the entire process from start to finish. |
How to sort pivot table columns | This detailed guide uses real data to demonstrate how the sorting process for Google Sheet pivot tables will work. This is a great resource if you need a slightly more detailed guide with screenshots of the actual Sheets environment. |
How to sort a pivot table by value | This source uses an example to explain sorting by value in pivot tables. It includes a video, which is a useful guide if you need a demonstration of the process. |
Pivot table ascending and descending order | This 1-minute beginner’s guide is a great way to brush up on sorting in pivot tables if you are interested in a quick refresher. |
Filter your data
Resource | Description |
---|---|
Filter data in a pivot table (Microsoft Excel) | This resource from the Microsoft Support page provides an explanation of filtering data in pivot tables in Excel. If you are working in Excel spreadsheets, this is a great resource to have bookmarked for quick reference. |
Customize a pivot table (Google Sheets) | This is the Google Support page on filtering pivot table data. This is a useful resource if you are working with pivot tables in Google Sheets and need a quick resource to review the process. |
How to filter Excel pivot table data | This how-to guide for filtering data in pivot tables demonstrates the filtering process in an Excel spreadsheet with data and includes tips and reminders for when you start using these tools on your own. |
Filter multiple values in pivot table | This guide includes details about how to filter for multiple values in Google Sheet pivot tables. This resource expands some of the functionality that you have already learned and sets you up to create more complex filters in Google Sheets. |
Format your data
Resource | Description |
---|---|
Design the layout and format of a PivotTable (Microsoft Excel) | This Microsoft Support article describes how to change the format of the PivotTable by applying a predefined style, banded rows, and conditional formatting. |
Create and edit pivot tables (Google Sheets) | This Help Center article provides information about how to edit a pivot table to change its style, and group data. |
Pivot tables are a powerful tool that you can use to quickly perform calculations and gain meaningful insights into your data directly from the spreadsheet file you are working in! By using pivot table tools to calculate, sort, and filter your data, you can immediately make high-level observations about your data that you can share with stakeholders in reports.
But, like most tools we have covered in this course, the best way to learn is to practice. This was just a small taste of what you can do with pivot tables, but the more you work with pivot tables, the more you will discover.
Hands-On Activity: Explore movie data with pivot tables
Practice Quiz: using pivot tables
Learn more SQL calculations
Queries and calculations
By now, you probably know that there's more than one way to do the daily tasks of a data analyst. Calculations are no exception. As we've shown in earlier videos, you can complete the same calculations in lots of different ways in spreadsheets. You can also complete them using SQL.
In this video, we'll give you an overview of how SQL calculations compare to spreadsheet calculations. Let's look at the arithmetic operators that are used in both spreadsheets and SQL.
An operator is a symbol that names the type of operation or calculation to be performed in a formula.
As you learned earlier, the four basic arithmetic operators in spreadsheet formulas are the plus sign for addition, the minus or hyphen for subtraction, the asterisk for multiplication, and the forward slash for division. These same operators calculate data in the same way when writing queries in SQL. The operators are embedded in the queries when pulling data from a database. Just like spreadsheet formulas, there's a few different ways to perform calculations using queries.
Let's look at the syntax for one possible query.
The syntax of a query is its structure. It should include all the specific details of the data you want to pull into a new table, and where those details should be placed.
If you want to add values from two columns of a table, you start with the SELECT command, followed by the name of your first column, then the name of your second column. Then you'd add the names of both columns with a plus sign between them. After that, you'd type AS followed by the name you'd like to give the column with the added totals. Finally, you then complete your query by typing FROM and then the name of the table that you're pulling the data from.
Running this query would get you a table showing the two columns whose values are being added together plus a new column showing the sums of those values. The operator in this query is a plus sign since values are being added.
If you needed to subtract, multiply, or divide, you'd follow the same steps using the appropriate operators. If you need to use more than one arithmetic operator in a calculation, you'd use parentheses to control the order of the calculations.
If we included column C in our query, we could place parentheses around column A plus column B. We then add an asterisk if we're multiplying followed by column C.
This query would return in a new column, the sum of the values in column A and B multiplied by the values in column C.
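The query patterns described above can be tried out in a lightweight SQL engine. Here's a sketch using Python's built-in `sqlite3` module; the table name and values are hypothetical:

```python
# Sketch of the arithmetic-operator queries described above, run against an
# in-memory SQLite database (table and values are made up for illustration).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (column_a INTEGER, column_b INTEGER, column_c INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(10, 5, 2), (7, 3, 4)])

# Add two columns into a new, aliased column
added = conn.execute(
    "SELECT column_a, column_b, column_a + column_b AS total FROM sales"
).fetchall()
print(added)  # [(10, 5, 15), (7, 3, 10)]

# Parentheses control the order of operations: (A + B) * C
rows = conn.execute(
    "SELECT (column_a + column_b) * column_c AS result FROM sales"
).fetchall()
print(rows)  # [(30,), (40,)]
```

Without the parentheses, `column_a + column_b * column_c` would multiply first, giving a different result.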
Now, let's say you only wanted the remainder from a division calculation. Well, you need a different operator for this, the modulo operator.
The modulo operator is represented by the percent symbol. This is an operator that returns the remainder when one number is divided by another.
In a spreadsheet, you could complete the same calculation using the MOD function. This brings us to another similarity between calculations in spreadsheets and SQL. A lot of times, you can use functions instead of operators to complete calculations. For example, the SUM function can complete addition problems in spreadsheets and SQL. The AVERAGE function in a spreadsheet is the same as the AVG function in SQL. They both return the average value of a set of numbers. In SQL, these functions are considered aggregate functions because they perform a calculation on one or more values and return a single value.
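Here are the modulo operator and a couple of aggregate functions in the same SQLite sketch style (the numbers table is hypothetical):

```python
# The modulo operator (%) and aggregate functions like SUM and AVG, shown in
# SQLite with a made-up numbers table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (n INTEGER)")
conn.executemany("INSERT INTO numbers VALUES (?)", [(7,), (10,), (15,)])

# % returns the remainder when one number is divided by another
remainders = conn.execute("SELECT n, n % 3 AS remainder FROM numbers").fetchall()
print(remainders)  # [(7, 1), (10, 1), (15, 0)]

# Aggregate functions perform a calculation on many values and return one value
total, average = conn.execute("SELECT SUM(n), AVG(n) FROM numbers").fetchone()
print(total, average)  # 32 and roughly 10.67
```

`AVG` here plays the same role as the `AVERAGE` function in a spreadsheet, and `MOD(n, 3)` would be the spreadsheet equivalent of `n % 3`.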
You'll learn more about how they're used with the GROUP BY command in a query soon. Those are the basics of SQL calculations. Knowing how to write a query for a calculation is a good first step. Stay with us, and you'll learn more about calculations in SQL. Bye for now.
Optional: Upload the avocado dataset to BigQuery
Embedding simple calculations in SQL
Hi again! Earlier, we showed you how to perform calculations in SQL. There are a few different ways, and embedding them in queries is a very useful one. When you include a calculation in a query with other commands, you can do more work faster. Here's a basic query syntax that we talked about:
SELECT column1, column2, (column1 / column2) AS new_column
FROM table_name;
We start with SELECT, and then the names of the columns we want to use in our calculations. Then we add in the calculation details, including an operator like a forward slash for division. Next, we type AS, followed by the new column's name to label the column with the calculated values. Finally, we end our query with the FROM command and the name of the table that we're pulling data from.
Now, let's take it to the next level with some embedded calculations that use syntax like this. Even better, we'll do this with some data about avocados. Sorry to those of you who aren't avocado lovers like me. Okay, let's get started. Feel free to keep watching as we show you the steps using BigQuery. If you're joining in, open up your tool of choice for using SQL. Be sure to look through the instructions in the reading right before this video to help you get started. Okay, the data is already loaded, so we can jump right in.
Our goal is to find out the total number of bags of avocados on each date at each location using this data. There's already a column that shows us the total, but we want to make sure we understand how that total is calculated.
We want to make sure that the total column is just small, large, and extra-large bags added together. So, we'll add the values in those three columns together in our query and then compare them to the total bags column in the dataset.
Let's start with the SELECT command, where we pull certain columns from the table. After SELECT, we'll list the column names: Date, Region, Small_Bags, Large_Bags, XLarge_Bags, Total_Bags. We'll use underscores in column names to avoid potential issues with spaces.
Underscores are the lines used to underline words and connect text characters. Using spaces can confuse certain servers and applications, so using underscores instead helps avoid potential issues while keeping the names readable.
Now, we'll add the calculation to the query using the names of the three columns with plus signs between them: Small_Bags + Large_Bags + XLarge_Bags. Since we want this calculation in a new column, we'll use the AS command to name the column Total_Bags_Calc. We've added the word "calc" so we can compare the columns.
Now, we'll finish our query with FROM and the name of the dataset and subset we're pulling from: avocado_data.avocado_prices.
(The full table path includes your personal project name: personal-project-name.avocado_data.avocado_prices. See "Optional: Upload the avocado dataset to BigQuery.")
Let's run the query. In the "total_bags_calc" column, the data shows the sum of each date for the number of small, large, and extra-large bags of avocados sold at each location. If we compare this to the "total_bags" column, we learn that the values are the same. This lets us know that the data we want to use is the right data.
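The verification query above can be reproduced in SQLite with a few hypothetical avocado rows (the real dataset lives in BigQuery):

```python
# Recreating the Total_Bags_Calc verification from this section in SQLite.
# The two rows below are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE avocado_prices (
    Date TEXT, Region TEXT,
    Small_Bags REAL, Large_Bags REAL, XLarge_Bags REAL, Total_Bags REAL)""")
conn.executemany("INSERT INTO avocado_prices VALUES (?, ?, ?, ?, ?, ?)", [
    ("2015-01-04", "Albany", 8000.0, 70.0, 0.0, 8070.0),
    ("2015-01-04", "Boston", 4000.0, 100.0, 25.0, 4125.0),
])

rows = conn.execute("""
    SELECT Date, Region, Total_Bags,
           Small_Bags + Large_Bags + XLarge_Bags AS Total_Bags_Calc
    FROM avocado_prices
""").fetchall()

# If the calculated total matches the stored total, the data checks out
for date, region, total, calc in rows:
    print(date, region, total == calc)  # each row prints True
```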
Now that we have verified the total number of bags, we can use those values in another query. We need to find what percent of the total number of bags were small bags. This information might help stakeholders make decisions on how to package avocados or which size bag to run a sale on. Our job is to get that information to the stakeholders, so we'll set up a new query.
For this new query, we'll select the Date, Region, Total_Bags, and Small_Bags columns. Next, we'll set up a new column starting with our calculation. To find the percentage of small bags, we first divide the number of small bags by the number of total bags, using a slash as the operator. We'll put this part of the calculation in parentheses to let the server know that it should be performed first. Then we'll multiply the result by 100, using an asterisk as our operator. Multiplying by 100 gives us a value that's a percentage instead of a decimal, and percentages usually make it easier for people to understand quickly when you share results with them. We'll use the AS command to name this new column Small_Bags_Percent. Then we'll add FROM and the name of the dataset we're pulling from (avocado_data.avocado_prices), and we'll run our query.
Oops, we got an error in our results: it says that we can't divide by zero. Since we're finding percentages, dividing by 0 won't work. This means that somewhere in the dataset there's a total bags value equal to zero. We'll have to fix this in our query.
We can fix it using the WHERE command. WHERE lets us add a condition to our calculation. After we type WHERE, we'll type Total_Bags followed by a less-than sign and then a greater-than sign (<>). These symbols tell the server that the values we're calculating should not be equal to the value we specify. In this case, that value is zero, so we'll add a zero to our query.
Now, when we run the query you'll notice our new column shows the percent of small bags in the total bags count.
We'd get the same results if we used an exclamation mark followed by an equal sign (!=) in place of the less-than and greater-than signs. That's one way of doing it, but there are also functions, such as SAFE_DIVIDE in BigQuery, that let you avoid this error.
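Here's the percentage query with the `WHERE Total_Bags <> 0` guard, sketched in SQLite with hypothetical rows (note one behavioral difference: unlike BigQuery, SQLite quietly returns NULL on division by zero rather than raising an error, so the guard matters even more in engines that do raise one):

```python
# The Small_Bags_Percent calculation with a guard against dividing by zero.
# Rows are made up; the second row would trigger the divide-by-zero problem.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE avocado_prices (
    Date TEXT, Region TEXT,
    Small_Bags REAL, Large_Bags REAL, XLarge_Bags REAL, Total_Bags REAL)""")
conn.executemany("INSERT INTO avocado_prices VALUES (?, ?, ?, ?, ?, ?)", [
    ("2015-01-04", "Albany", 8000.0, 1800.0, 200.0, 10000.0),
    ("2015-01-04", "Nowhere", 0.0, 0.0, 0.0, 0.0),  # zero total bags
])

# WHERE Total_Bags <> 0 excludes the row that would break the division
rows = conn.execute("""
    SELECT Date, Region, Total_Bags, Small_Bags,
           (Small_Bags / Total_Bags) * 100 AS Small_Bags_Percent
    FROM avocado_prices
    WHERE Total_Bags <> 0
""").fetchall()
print(rows)  # [('2015-01-04', 'Albany', 10000.0, 8000.0, 80.0)]
```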
Those are just a couple of examples to get you started, but with SQL, you can complete just about any calculation you want during your analysis. Embedding the calculations in your queries will help you keep your analysis organized while getting your results. The calculation methods we showed you here are just the beginning so look for more coming up. See you soon!
Calculations with other statements
Hey, good to see you! As a data analyst, you'll find that your calculations come in all shapes and sizes. Earlier, we showed you how to do some of the more basic calculations in SQL. While basic calculations are great, sometimes you'll need to group data before completing calculations.
The GROUP BY and ORDER BY commands help you do this.
These commands are usually paired with aggregate functions like SUM or COUNT. We'll show you how you can use these commands and functions to calculate and summarize data from groups of rows in a table.
Let's explore the GROUP BY command first.
GROUP BY is a command that groups rows that have the same values from a table into summary rows. The GROUP BY command is used with SELECT statements, and in a basic SELECT-FROM or SELECT-FROM-WHERE query, GROUP BY comes at the end of the query.
Alright, let's try using GROUP BY. We'll work with a database with data from a bike-sharing system. We want to find out how many rides people took on these bikes per year.
This data has several columns, but for this task, we only need the start time column. Since this dataset is not organized by date and the start time column isn't organized by year, we'll need to include steps in our code to organize it. We also want the total number of rides each year, so we'll need to include a calculation in our query for this. Depending on the questions we've been tasked with answering, this might be the first step of many in our analysis.
We'll start our query with the SELECT command. We'll then add EXTRACT to our query.
The EXTRACT command lets us pull one part of a given date to use. We'll extract the year from the start time column.
To do this, we'll add an open parenthesis followed by YEAR, which lets the server know the part of the date we need. Then, we'll add the FROM command and start time so we can get the year from all of the start times in that column. We'll close the parentheses and then use AS and the word "year" to name the column we're creating.
On the next line of our query, we'll use the aggregate function COUNT, followed by an asterisk in parentheses. This will count the bike rides in the start time column, using the asterisk to make sure that all the start times are counted in the data. Then we'll use AS to name our column "number_of_rides," with underscores between each word instead of spaces. Next, we'll add FROM and the database we're pulling from. In this case, that's bigquery-public-data.new_york_city.citibike_trips.
And here's our GROUP BY command. We'll use this to group the data by year. So, type GROUP BY followed by year. We can further organize our results by using the ORDER BY command. Adding this after GROUP BY orders the results. We'll add year to order the data by year. It's good to know that by default, ORDER BY sorts data in ascending order.
Now we can run our query to get the results. The years are ordered starting with 2013 and ending with 2016.
If we want to change this to descending order, we can add the keyword DESC to the end of the query and run it again.
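The same grouped, ordered count can be sketched with SQLite. SQLite has no EXTRACT command, so `strftime('%Y', ...)` stands in for extracting the year; the sample trip rows are made up:

```python
# GROUP BY + ORDER BY over a year extracted from a timestamp column, in
# SQLite. strftime('%Y', ...) plays the role of BigQuery's EXTRACT(YEAR ...).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citibike_trips (starttime TEXT)")
conn.executemany("INSERT INTO citibike_trips VALUES (?)", [
    ("2013-07-01 10:00:00",), ("2014-03-15 08:30:00",),
    ("2014-11-02 17:45:00",), ("2016-05-20 09:10:00",),
])

rows = conn.execute("""
    SELECT strftime('%Y', starttime) AS year,
           COUNT(*) AS number_of_rides
    FROM citibike_trips
    GROUP BY year
    ORDER BY year DESC
""").fetchall()
print(rows)  # [('2016', 1), ('2014', 2), ('2013', 1)]
```

Dropping the DESC keyword would return the years in ascending order, ORDER BY's default.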
But whichever order you use, the GROUP BY and ORDER BY commands are great for helping us complete and organize a calculation for our analysis. This is one way to include calculations when aggregating data, and it's just one of many ways that SQL helps keep your analysis running smoothly and moving forward. There's still more to come on calculations and SQL.
Coming up, we'll learn more about data validation. See you soon!
Hands-On Activity: Calculations in SQL
Calculations and queries (Discussion Prompt)
Practice Quiz: SQL calculations
The data-validation process
Check and recheck
Hi, again. Earlier, we covered data validation, a spreadsheet function that adds drop-down lists to cells. Using data validation lets you control what can and can't be entered into your worksheet. One of its uses is protecting structured data and formulas in your spreadsheets.
But as useful as it is, the data validation function is just one part of a larger data validation process. This process involves checking and rechecking the quality of your data so that it is complete, accurate, secure, and consistent.
While the data validation process is a form of data cleaning, you should use it throughout your analysis. If this all sounds familiar to you, that's good. Ensuring you have good data is super important. And in my opinion, it's kind of fun because you can pair your knowledge of the business with your technical skills. This will help you understand your data, check that it's clean, and make sure you're aligning with your business objectives. In other words, it's what you do to make sure your data makes sense.
Keep in mind, you'll build your business knowledge with time and experience. And here's a pro tip. Asking as many questions as possible whenever you need to will make this much easier.
Okay, let's say we're analyzing some data for a furniture retailer. We want to check that the values in the purchase price column are always equal to the number of items sold times the product price.
So we'll add a formula in a new column to recalculate the purchase prices using a multiplication formula.
Now, comparing the totals, there's at least one value that doesn't match the value in the purchase price column. We need to find an answer to help us move forward with our analysis.
By doing some research and asking questions, we find that there's a discount of 30% when customers buy five or more of certain items. If we hadn't run this check, we could have missed this completely.
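The "recalculate and compare" check above can be sketched in a few lines of Python. The order rows and the exact discount rule (30% off when buying five or more of certain items) are illustrative, taken from this scenario rather than real retailer data:

```python
# Validation sketch: recompute purchase price from quantity x unit price and
# flag rows that don't match. Rows and discount rule are hypothetical.
orders = [
    # (quantity, unit_price, recorded_purchase_price)
    (2, 50.0, 100.0),
    (6, 50.0, 210.0),   # 6 * 50 = 300; with the 30% bulk discount: 210
    (3, 20.0, 60.0),
]

# Collect rows where the recorded price differs from the naive recalculation
mismatches = [row for row in orders if row[2] != row[0] * row[1]]

for qty, price, recorded in mismatches:
    # A mismatch isn't necessarily bad data -- research it first. Here the
    # 30% discount for buying 5+ items explains the difference.
    print(f"qty={qty}: recorded {recorded}, expected {qty * price}, "
          f"with 30% discount {qty * price * 0.70}")
```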
You've learned that as an analyst, calculations are a big part of your job. So it's important that whenever you do calculations, you always check to make sure you've done them in the right way. Sometimes you'll run data validation checks that are common-sense checks.
For example, let's say you're working on an analysis to figure out the effectiveness of in-store promotions for a business that's only open on weekdays.
You check to make sure that there's no sales data for Saturdays and Sundays. If your data does show sales on weekends, it might not be a problem with the data itself. It might not even be a problem at all; there might be a good reason. Maybe your business hosts special events on Saturdays and Sundays. Then you would have sales for those weekends. You still might want to leave out the weekend sales in your analysis if your objective is only to look at the weekdays.
But doing this data validation might save you from miscalculations and other errors in your analysis. You should always do data validation no matter what analysis tool you're using.
In an earlier video, we used SQL to analyze some data about avocados. One of the queries was a check to make sure the data showing the total number of bags was the sum of small, large, and extra-large bags.
By running this query, we were able to determine that the total number column was accurate. We compared our two columns briefly in that video.
But to be absolutely sure that there's no issues with the data values in those columns, we could have also run another query.
In this query, we would select all using the asterisk, and FROM the avocado prices dataset. In our WHERE clause, we'd type out that our calculated total does not equal the total bags column. If no values are returned, we can be sure that the values in the Total Bags column are accurate.
Total_Bags != Total_Bags_Calc
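That check can be sketched in SQLite with hypothetical rows. Note that the calculation is spelled out in the WHERE clause, because most SQL engines don't allow a column alias like `Total_Bags_Calc` inside WHERE:

```python
# Validation query: return every row where the stored total does NOT equal
# the recalculated total. Zero rows returned means the column checks out.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE avocado_prices (
    Small_Bags REAL, Large_Bags REAL, XLarge_Bags REAL, Total_Bags REAL)""")
conn.executemany("INSERT INTO avocado_prices VALUES (?, ?, ?, ?)", [
    (8000.0, 1800.0, 200.0, 10000.0),   # made-up rows
    (4000.0, 100.0, 25.0, 4125.0),
])

bad_rows = conn.execute("""
    SELECT * FROM avocado_prices
    WHERE Small_Bags + Large_Bags + XLarge_Bags != Total_Bags
""").fetchall()
print(len(bad_rows))  # 0 -> Total_Bags is accurate for this sample
```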
And that led us to continue our analysis.
But when we tried to find what percent of the total number of bags was small, we ran into a small problem. We received an error message about dividing by zero. We fixed that error by adjusting our query. If we had linked that query to a presentation that went to our stakeholders, they'd see the divide-by-zero error instead of the figures we wanted.
By building in these types of checks as part of your data validation process, you can avoid errors in your analysis and complete your business objectives to make everyone happy. And trust me. It's a great feeling when you do. And another great feeling is knowing that you've made it through another video and learned something new. And we have more where that came from coming soon. See you.
SELECT Date, Region, Total_Bags, Small_Bags, (Small_Bags / Total_Bags) * 100 AS Small_Bags_Percent
FROM avocado_data.avocado_prices
Types of data validation
=> Data type / Data range / Data constraints / Data consistency / Data structure / Code validation
Hands-On Activity: From spreadsheets to BigQuery
Data Validation
- Validating a data type means checking that the data matches the data type defined for the field.
- In this example, Ivan needs to validate that the type of data is correctly defined for the field.
- Validating a data range means checking that the data falls within an acceptable range of values defined for the field.
- In this example, Ivan needs to validate that data falls within an acceptable range of values defined for the field
- Validating data constraints means checking that data meets certain conditions or criteria, such as type of characters.
- In this example, Ivan needs to validate that data meets certain conditions or criteria, such as type of characters.
- Validating data consistency means checking that the data makes sense in the context of other related data.
- In this example, Ivan needs to validate the consistency of data relative to other similar data.
- Validating data structure means checking that the data follows or conforms to a set structure, such as MP3 files or HTML code.
- In this example, Ivan needs to validate data is properly structured to ensure files show up correctly when queried.
- Code validation means checking that the application code systematically performs any of the previously mentioned validations during user data input.
- Ivan needs to validate that code within the system performs previously mentioned validations during user data input.
Learning Log: Finish your data analysis checklist
Complete your checklist
Process Phase Step X: Cleaning data
- Is each variable one column?
- Is each observation one row?
Break down every phase of the data analysis process
- What are the high-level steps you need to take? How can the phases of analysis help you organize the whole process?
- What specific details are necessary to complete these high-level steps?
- How can you simplify each step as you go? Are there any shortcuts that can help you?
- Are there any steps you might be most at risk of forgetting? If so, how can your checklist remind you to complete these steps?
Access your learning log: Finish your data analysis checklist
Reflection
- Review your checklist and compare it to the tasks and activities related to this course. How is your checklist similar to or different from the organization of the course?
- How does creating a checklist help you develop your data analysis skills?
- How will your checklist help you analyze your own data?
- What else could you use your checklist for?
Practice Quiz: data validation
Using SQL with temporary tables
Temporary tables
Hello again! Now, if you're like me, you always have sticky notes nearby to write a reminder or figure out a quick math problem. Sticky notes are useful and important, but they're also disposable since you usually only need them for a short time before you recycle them. Data analysts have their own version of sticky notes when they're working in SQL; they're called temporary tables, and we're here to find out what they're all about.
A temporary table is a database table that is created and exists temporarily on a database server. Temp tables, as we call them, store subsets of data from standard data tables for a certain period, and then they're automatically deleted when you end your SQL database session. Since temp tables aren't stored permanently, they're useful when you only need a table for a short time to complete analysis tasks like calculations.
For example, you might have a lot of tables you're performing calculations on at the same time. If you have a query that needs to join seven or eight of them, you could join the two or three tables having the fewest number of rows and store their output in a temp table. You could then join this temp table to one of the other, bigger tables.
Another example is when you have lots of different databases you're running queries on; you can run these initial queries in each separate database and then use a temp table to collect the results of all these queries. The final report query would then run on the temporary table. You might not be able to make use of this reporting structure without temporary tables.
They're also useful if you've got a large number of records in a table, and you need to work with a small subset of those records repeatedly to complete some calculations or other analysis. So, instead of filtering the data over and over to return the subset, you can filter the data once and store it in a temporary table. Then you can run your queries using the temporary table you've created.
Imagine that you've been asked to analyze data about the bike-sharing system we looked at earlier. You only need to analyze the data for bike trips that were 60 minutes or longer, but you have several questions to answer about this specific data. Using a temporary table will let you run several queries about this data without having to keep filtering it.
There are different ways to create temporary tables in SQL depending on the relational database management system you're using. We'll explore some of these options soon. For this scenario, we'll use BigQuery. We'll apply a WITH clause to our query.
The WITH clause is a type of temporary table that you can query from multiple times; it approximates a temporary table. Basically, this means it creates something that does the same thing as a temporary table, even though it doesn't add a table to the database you're working in for others to see. You can still see your results, and anyone who needs to review your work can see the code that led to them.
Let's get this query started.
We'll start this query with the WITH command. We'll then name our temp table "trips_over_one_hour."
Then we'll use the SELECT FROM WHERE structure for a subquery. We'll type SELECT followed by an asterisk. The asterisk means you're selecting all the columns in the table.
Now we'll type the FROM command and name the database that we're pulling from: `bigquery-public-data.new_york_citibike_trips`.
Next, we'll add a WHERE clause with the condition that the bike trips in our temp table are greater than or equal to 60 minutes long. In the query, it goes like this: `trip_duration >= 60`. Finally, we'll add a close parenthesis on a new line to end our subquery, and that sets up our temporary table.
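Put together, the setup might look like the following. The dataset and column names here follow the video's narration; in the actual BigQuery public dataset the names and units may differ (for example, trip durations there are recorded in seconds), so verify them before running:

```sql
WITH trips_over_one_hour AS (
  SELECT *
  FROM `bigquery-public-data.new_york_citibike_trips`
  WHERE trip_duration >= 60
)
-- A WITH clause must be followed by a query that uses it:
SELECT *
FROM trips_over_one_hour
```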
Now we can run queries that will only return results for trips that lasted 60 minutes or longer. Let's try one. Since we're working in our version of a temp table, we don't need to open a new query. Instead, we'll label our queries before we add our code to describe what we're doing.
For this query, we'll type two hashtags. This tells the server that this is a description and not part of the code. Next, we'll add the query description: "How many trips are 60 plus minutes long?". Then we'll add our query: SELECT, then on a new line, COUNT with an asterisk in parentheses and AS followed by CNT to name the column with our count. Next, we'll add FROM and the name we're using for our version of a temporary table: "trips_over_one_hour."
When we run our query, the results show the total number of bike trips from the dataset that lasted 60 minutes or longer. We can keep running queries on this temp table over and over as long as we're looking to analyze bike trips that were 60 minutes and over.
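The labeled count query from this walkthrough might look like the one below. Because a WITH clause only lasts for the query it's attached to, it's repeated here so the example stands alone; names again follow the video's narration:

```sql
## How many trips are 60 plus minutes long?
WITH trips_over_one_hour AS (
  SELECT *
  FROM `bigquery-public-data.new_york_citibike_trips`
  WHERE trip_duration >= 60
)
SELECT
  COUNT(*) AS cnt
FROM trips_over_one_hour
```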
If you need to end your session and start a new runtime later, most servers store the code used in temp tables; you'll just need to recreate the table by running the code.
When you use temporary tables, you make your own work more efficient. Naming and using temp tables can help you deal with a lot of data in a more streamlined way. So you don't get lost repeating query after query with the same code that you could just include in a temp table. And here's another bonus to using temp tables: they can help your fellow team members too. With temp tables, your code is usually less complicated and easier to read and understand, which your team will appreciate. Once you start to explore temporary tables on your own, you might not be able to stop. Don't say I didn't warn you.
Coming up, we'll explore even more things you can do with temp tables. See you soon!
Hands-On Activity: Create temporary tables
Multiple tables variations
Hello there. Earlier we introduced you to temporary tables. They're a great resource to use during your analysis because they help you keep your SQL code organized and efficient. You learned how to use a WITH clause to create a type of temporary table.
Now, we'll get into some other ways you can create temp tables along with the pros and cons they present. That's one of the great things about data analytics. There's almost always more than one way to get your analysis done.
The SELECT INTO statement is a good example of how to create a temp table. This statement copies data from one table into a new table, but it doesn't add the new table to the database. It's useful if you want to make a copy of a table with a specific condition, like a query with a WHERE clause.
So far, we've been using BigQuery to show you how SQL works. But BigQuery doesn't currently recognize the SELECT INTO command. Instead, here's an example of how a SELECT INTO statement might look in another RDBMS. In the statement, a new table named Africa Sales is created using the data from the global sales database about the African region.
Using SELECT INTO is a good practice when you want to keep the database uncluttered and you don't need other people using the table. Now, if lots of people will be using the same table, then the CREATE TABLE statement might be the better option. This statement does add the table into the database.
If everyone needs access to the Africa Sales table, your query would start with CREATE TABLE, followed by the same SELECT-FROM-WHERE query as in the SELECT INTO statement.
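Here is a hedged sketch of both statements, with hypothetical `GlobalSales` and `AfricaSales` table names; the exact syntax varies by RDBMS:

```sql
-- SELECT INTO (not supported in BigQuery; SQL Server-style syntax):
-- copies the matching rows into a new AfricaSales table.
SELECT *
INTO AfricaSales
FROM GlobalSales
WHERE Region = 'Africa';

-- CREATE TABLE: the same SELECT-FROM-WHERE query, but the new
-- table is added to the database so others can use it.
CREATE TABLE AfricaSales AS
SELECT *
FROM GlobalSales
WHERE Region = 'Africa';
```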
In most relational database management systems or RDBMSs, you can add metadata to describe the data that's contained in the table you've created. This can help make the table easier to understand for anyone using it.
The CREATE TABLE statement is also useful for tables that are more complex. For example, if the code's difficult to replicate, then making a temp table in this way means it'll be safe for you to access later.
The way you create a temporary table using the WITH clause or a SELECT INTO or CREATE TABLE statement is usually up to you and your needs. The more you work in SQL, the more you might have preferences as well, especially since there's more than one way to create temporary tables.
You may also find that you're working in an RDBMS that uses a different syntax. For example, you might need to use a CREATE TEMP TABLE statement instead of CREATE TABLE. Here's some good news. The syntax that you need for each unique RDBMS is usually pretty easy to find with a quick online search.
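For instance, in BigQuery you can create a temp table inside a script or session with a CREATE TEMP TABLE statement; the table and column names below are hypothetical:

```sql
CREATE TEMP TABLE africa_sales AS
SELECT *
FROM global_sales
WHERE region = 'Africa';
```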
But no matter how or where you create temporary tables, there isn't much downside to them. It's good to note though that sometimes building a temp table can interrupt your workflow. Again, that will depend on your objectives and your preferences. You can repeat your code over and over instead of making a temp table but that usually leaves your queries less readable and more vulnerable to typos.
As you continue exploring the world of data analytics, you'll find that temporary tables are just one of the many resources you'll be able to use. The more you use them, the easier it'll be to navigate that world.
Working with temporary tables
Temporary tables are exactly what they sound like—temporary tables in a SQL database that aren’t stored permanently. In this reading, you will learn the methods to create temporary tables using SQL commands. You will also learn a few best practices to follow when working with temporary tables.
A quick refresher on what you have already learned about temporary tables
- They are automatically deleted from the database when you end your SQL session.
- They can be used as a holding area for storing values if you are making a series of calculations. This is sometimes referred to as pre-processing of the data.
- They can collect the results of multiple, separate queries. This is sometimes referred to as data staging. Staging is useful if you need to perform a query on the collected data or merge the collected data.
- They can store a filtered subset of the database. You don’t need to select and filter the data each time you work with it. In addition, using fewer SQL commands helps to keep your data clean.
It is important to point out that each database has its own unique set of commands to create and manage temporary tables. We have been working with BigQuery, so we will focus on the commands that work well in that environment. The rest of this reading will go over the ways to create temporary tables, primarily in BigQuery.
Temporary table creation in BigQuery
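A minimal sketch of the BigQuery approach using a WITH clause, with placeholder names (`new_table_data`, `source_table`, `some_column`):

```sql
-- The WITH clause holds a named, query-scoped result set.
WITH new_table_data AS (
  SELECT *
  FROM source_table
  WHERE some_column = 'some_value'
)
SELECT *
FROM new_table_data
```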
Temporary table creation in other databases (not supported in BigQuery)
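In many other databases, the SELECT INTO statement serves the same purpose. A sketch with the same placeholder names:

```sql
-- Copies the filtered rows into a new table (not BigQuery syntax).
SELECT *
INTO new_table_data
FROM source_table
WHERE some_column = 'some_value'
```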
User-managed temporary table creation
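A user-managed temp table is created explicitly and dropped when you're done with it. A sketch with placeholder names:

```sql
-- Create the temporary table.
CREATE TEMP TABLE new_table_data AS
SELECT *
FROM source_table
WHERE some_column = 'some_value';

-- Drop it when you no longer need it.
DROP TABLE new_table_data;
```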
Best practices when working with temporary tables
- Global vs. local temporary tables: Global temporary tables are made available to all database users and are deleted when all connections that use them have closed. Local temporary tables are made available only to the user whose query or connection established the temporary table. You will most likely be working with local temporary tables. If you have created a local temporary table and are the only person using it, you can drop the temporary table after you are done using it.
- Dropping temporary tables after use: Dropping a temporary table is a little different from deleting a temporary table. Dropping a temporary table not only removes the information contained in the rows of the table, but removes the table variable definitions (columns) themselves. Deleting a temporary table removes the rows of the table but leaves the table definition and columns ready to be used again. Although local temporary tables are dropped after you end your SQL session, it may not happen immediately. If a lot of processing is happening in the database, dropping your temporary tables after using them is a good practice to keep the database running smoothly.
For more information
- BigQuery Documentation for Temporary Tables: Documentation has the syntax to create temporary tables in BigQuery
- How to use temporary tables via WITH in Google BigQuery: Article describes how to use WITH
- Introduction to Temporary Tables in SQL Server: Article describes how to use SELECT INTO and CREATE TABLE
- SQL Server Temporary Tables: Article describes temporary table creation and removal
- Choosing Between Table Variables and Temporary Tables: Article describes the differences between passing variables in SQL statements vs. using temporary tables
Your intermediate guide to SQL
As you get more comfortable with SQL, you'll be able to take on even more advanced queries.
This in-depth guide will give you a more detailed introduction to some of the SQL functions you have already learned, and give you some new tools to work with. Be sure to keep it handy as you progress in your SQL learning journey:
Your-Intermediate-Guide-to-SQL.pdf
Using Connected Sheets with BigQuery
In this reading, you will learn about Connected Sheets, a tool that allows data professionals to use basic spreadsheet functions to analyze large datasets housed in BigQuery. With Connected Sheets, users don’t need to know SQL. Instead, anyone, not just data professionals, can generate insights with basic spreadsheet operations such as formulas, charts, and pivot tables.
Why would a data analytics professional use Connected Sheets? As a data analytics professional, Connected Sheets can help with several tasks, such as:
- Collaborating with partners, analysts, or other stakeholders in a familiar spreadsheet interface;
- Ensuring a single source of truth for data analysis without additional .csv exports;
- Defining variables so that all users are working with the same data;
- Sharing insights with your team in a secure environment; and
- Streamlining your reporting and dashboard workflows.

Many teams and industries benefit from Connected Sheets, such as finance, marketing, and operations teams.
A few example use cases of Connected Sheets include:
- Business planning: A user can build and prepare datasets, and then find insights from the data. For example, a data analyst can analyze sales data to determine which products sell better in different locations.
- Customer service: A user can find out which stores have the most complaints per 10,000 customers.
- Sales: A user can create internal finance and sales reports. After completing them, they can share revenue reports with sales reps.
- Logistics, fulfillment, and delivery: A user can run real-time inventory management and intelligent analytics tools.
Connected Sheets benefits
Collaborate with teammates and stakeholders
Since Connected Sheets lives in Google Workspace, you can easily collaborate with other teammates and stakeholders in your company. If you’d like to limit access, you can also control permissions for who can view, edit, or share the data.
Do more with familiar tools
With Connected Sheets, you can access billions of rows of BigQuery data directly in Sheets. This direct access makes it easier for all employees to track, forecast, and analyze their data to get to better decisions faster.
Easily visualize data
You can unlock insights from your BigQuery datasets using features you’re already familiar with in Sheets, such as pivot tables, charts, and formulas. These features help visualize large datasets more easily than using a more advanced language such as SQL. However, if you know SQL, you may prefer to use it in certain situations.
Up-to-date data
With Connected Sheets, data professionals can ensure they are making decisions based on a single source of truth by setting up automatic refreshes of BigQuery data in Sheets.
Less data integrity and security risk
While users can access big data with Connected Sheets, they won’t be able to accidentally manipulate or jeopardize the integrity of the data. There’s less security risk because data isn’t stored on individual workstations; it’s stored in the cloud.
Connected Sheets shortcomings
Limited free pricing tier
A shortcoming of Connected Sheets is that for the free pricing tier, users only receive 1 terabyte (TB) of processed query data each month. To process more data, you will need to move to a paid tier.
Data must be housed in BigQuery
Another shortcoming is that you will need access to your dataset in BigQuery. Without access to BigQuery, you won’t be able to analyze data in Connected Sheets.
Query will fail with large results
A third shortcoming is that a Connected Sheets query will fail if the results are too large. Your query will fail if your pivot table has a significant number of results, which could be anywhere from 30,000 to 50,000. To reduce your results, you can use filters or limit the number of rows per breakout.
Key takeaways
Connected Sheets provides a tremendous opportunity to analyze large datasets without specialized skills like SQL. Use familiar spreadsheet skills such as pivot tables, charts, and formulas to analyze the data. For junior data analysts in particular, Connected Sheets can help them perform key tasks within BigQuery and increase their marketable skills.
Resources for more information
- Get started with BigQuery data in Google Sheets
- Insights at scale with Google Sheets
- Connected Sheets product announcement