5.4.3. Learn more SQL calculations - quanganh2001/Google-Data-Analytics-Professional-Certificate-Coursera GitHub Wiki

Optional: Upload the avocado dataset to BigQuery

Using public datasets is a great way to practice working with SQL. Later in the course, you are going to use historical data on avocado prices to perform calculations in BigQuery. This is a step-by-step guide to help you load this data into your own BigQuery console so that you can follow along with the upcoming video.

If you have hopped around courses, Using BigQuery in the Prepare Data for Exploration course covers how to set up a BigQuery account.

Step 1: Download the CSV file from Kaggle

Avocado prices: The publicly available avocado dataset from Kaggle you are going to use (made available by Justin Kiggins under an Open Data Commons license).

You will need to create a free Kaggle account before downloading the zipped data files; after that, you can download this data onto your own device and upload it to BigQuery. Kaggle is a great resource for all types of data analytics resources, and there are other public datasets on the platform that you can download and use. You can follow these same steps to load them into your console and practice on your own!

You will find more information about the avocado dataset, including its context, content, and original source, on its Kaggle page. For now, you can simply download the file.

Step 2: Open your BigQuery console and create a new dataset

Open BigQuery. After you have downloaded the dataset from Kaggle, extract the zipped folder and note where you save the avocado CSV file, so you can upload it into your BigQuery console.

In the Explorer pane on the left side of your console, click the project where you want to add a dataset. Note that your project will have a unique name and won't be the same as the one in the example ("loyal-glass-371423"). If you already have bigquery-public-data starred, don't choose it as your project; it is a public project that you can't change.

Click the Actions icon (three vertical dots) next to your project and select Create dataset.

Here, you will name the dataset; in this case, enter avocado_data. In the Location type section, select Multi-region (US), and make sure the default Encryption method under Advanced options is set to the Google-managed encryption key. Then, click Create dataset (the blue button) at the bottom of the page. The new dataset will appear in the Explorer pane on the left of your console.

Step 3: Open the new dataset and create a new table

Navigate to the dataset in your console by clicking to expand your project and selecting the correct dataset listed. In this case, it will be avocado_data.

Navigate to the far right side of the screen and click the blue + CREATE TABLE button to open the Create table window, then complete the steps below.

Next, do the following:

  • Under Source, for the Create table from selection, select Upload.
  • Click Browse to select the unzipped CSV file titled avocado you just downloaded to your computer from Kaggle. The file format should automatically change from Avro to CSV when you select the file.
  • For Table Name, enter avocado_prices for the table.
  • For Schema, click the Auto detect check box. Then, click Create table (blue button).

In the Explorer, the avocado_prices table will appear under the dataset you created.
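
If you prefer the command line, the console steps above can also be performed with the bq tool from the Google Cloud SDK. This is a minimal sketch, assuming the SDK is installed and authenticated, and that the unzipped avocado.csv file is in your current directory:

```shell
# Create the avocado_data dataset in the US multi-region, then load the CSV
# into a new avocado_prices table with schema auto-detection.
bq --location=US mk --dataset avocado_data
bq load --autodetect --source_format=CSV avocado_data.avocado_prices ./avocado.csv
```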

After clicking the table, you can review its Schema, its Details, and a Preview of the data in the main editor window.

Now you are ready to follow along with the video and learn more about performing calculations with queries!

Further reading

Introduction to loading data: This step-by-step guide is a useful resource that you can bookmark and save for later. You can refer to it the next time you need to load data into BigQuery.

Hands-On Activity: Calculations in SQL

Question 1

Activity overview

In previous lessons, you were introduced to SQL. In this activity, you will practice using it by writing basic calculations in SQL queries.

By the time you complete this activity, you will be able to use operators in SQL queries. This will enable you to make calculations using SQL and combine multiple arithmetic operations in a single query. This will help you to work faster and more efficiently in SQL as you quickly discover significant patterns in your data—a key skill for data analysts.

Operations in SQL

In SQL, the symbols for the four basic arithmetic operations are:

  • + addition
  • - subtraction
  • * multiplication
  • / division

For this activity, imagine you’ve been asked to analyze subway ridership data to help improve the quality of the city’s public transportation. You can use basic calculations in SQL to help you make sense of this data.
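
These four operators are standard across SQL dialects, so you can try them anywhere. A minimal sketch, using Python's built-in sqlite3 module as a stand-in for BigQuery (one caveat noted in the comments):

```python
# Run the four basic arithmetic operators in an in-memory SQLite database.
# Note: SQLite truncates division between integers (6 / 2 -> 3, 7 / 2 -> 3),
# while BigQuery's / operator always returns a floating-point result.
import sqlite3

conn = sqlite3.connect(":memory:")
row = conn.execute("SELECT 6 + 2, 6 - 2, 6 * 2, 6 / 2").fetchone()
print(row)  # (8, 4, 12, 3)
```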

Access the dataset

Log in to BigQuery Sandbox. If you have a free trial version of BigQuery, you can use that instead. On the BigQuery page, click the Go to BigQuery button.

  • Note: BigQuery Sandbox frequently updates its user interface. The latest changes may not be reflected in the screenshots presented in this activity, but the principles remain the same. Adapting to changes in software updates is an essential skill for data analysts, and it’s helpful for you to practice troubleshooting. You can also reach out to your community of learners on the discussion forum for help.

In this activity, you’ll use data that describes the average weekly subway ridership in New York City from 2013-2018. The columns include: subway stations, subway routes, weekly ridership for each year (2013-2018), the change in ridership from 2017 to 2018 (both in raw numbers and percent), and the ridership rank of each station in 2018. Each row corresponds to a different station in the subway system.

To examine the dataset, follow these steps:

  1. Locate the public data by typing the word public in the search bar and then clicking on SEARCH ALL PROJECTS and SHOW MORE to access all the datasets.
  2. Make sure the bigquery-public-data item is pinned to the Explorer menu by clicking on the star located to the right of the item name in the side menu.
  3. Click on the dropdown arrow next to the bigquery-public-data to expand all the available datasets.
  4. Navigate down the list (you may have to click the SHOW MORE button twice) until you find the new_york_subway dataset. Open the dropdown and click subway_ridership_2013_present to open and examine the data table.
  • Important Update: Searching directly for the dataset currently yields no results, even though it exists in the bigquery-public-data repository. By following the four steps above, you can still locate the new_york_subway public dataset without typing its name in the search box. The Data Analytics Certificate team is investigating this issue. In the meantime, the query you will write in this activity runs correctly and will allow you to examine the data and proceed properly.

Use calculation with a single operator

The change_2018_raw column describes the change in weekly ridership from 2017 to 2018 in raw numbers. Suppose you want to find data on the change in weekly ridership from 2013 to 2014. You can use SQL to subtract the number of riders in 2013 from the number of riders in 2014.

To do that, follow these steps:

  1. In the Query editor, type SELECT. Add the names of the columns you want to use in your calculations. You’re selecting several columns, so press Enter (Windows) or Return (Mac) after SELECT. Add a comma after each column name. Include the following columns: station_name, ridership_2013, and ridership_2014.
  2. Add the calculation to the query. Enter the names of the two columns with a minus sign between them: ridership_2014 - ridership_2013.
  3. List the result in a new column. To do this, type AS followed by the new column’s name. Name it change_2014_raw since it represents the change in ridership from 2013 to 2014 in raw numbers.
  4. End your query with the FROM command and the name of the dataset and table you’re pulling data from. After FROM, press Enter or Return and type `bigquery-public-data.new_york_subway.subway_ridership_2013_present` (enclose the full table path in backticks, since the project name contains hyphens).

Your query should now read:

SELECT
station_name,
ridership_2013,
ridership_2014,
ridership_2014 - ridership_2013 AS change_2014_raw
FROM
`bigquery-public-data.new_york_subway.subway_ridership_2013_present`

  5. Now, click Run and get the results.

The results show the change in ridership from 2013 to 2014. For example, the Atlantic Av - Barclays Ctr station gained an average of 1,774 riders per week. The 4 Av station lost 321 riders.

By including a basic calculation in your query, you can get an idea of the change in ridership for each subway station in any given year. Basic calculations help you quickly gain important knowledge about your data.
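
If you'd like to test this pattern offline, here is a sketch using Python's sqlite3 module with a couple of made-up stations and ridership numbers (the real figures live in BigQuery's new_york_subway dataset):

```python
# Build a tiny stand-in for the subway_ridership_2013_present table, then run
# the same subtraction to compute change_2014_raw. All values are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE subway_ridership_2013_present (
        station_name TEXT, ridership_2013 INTEGER, ridership_2014 INTEGER)
""")
conn.executemany(
    "INSERT INTO subway_ridership_2013_present VALUES (?, ?, ?)",
    [("Station A", 10000, 11774), ("Station B", 5321, 5000)],
)
rows = conn.execute("""
    SELECT station_name, ridership_2013, ridership_2014,
           ridership_2014 - ridership_2013 AS change_2014_raw
    FROM subway_ridership_2013_present
""").fetchall()
print(rows)  # [('Station A', 10000, 11774, 1774), ('Station B', 5321, 5000, -321)]
```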

Use a calculation with multiple operators

The original dataset describes average weekly ridership for each individual year. Suppose you want to find average weekly ridership for a longer period of time, such as the multi-year period from 2013-2016.

To do this, you can combine multiple arithmetic operations in a query. The average of a set of numbers is the sum of the numbers divided by the total number of values in the set. There are four values in your new set (ridership data for 2013, 2014, 2015, 2016). You can use SQL to sum the numbers for each year and divide that sum by 4.

  1. First, click the COMPOSE NEW QUERY button to refresh the query editor.
  2. Type SELECT to select the columns you want to pull from the table. You’re selecting several columns, so press Enter or Return after SELECT and add a comma after each column name. For this query, enter station_name, ridership_2013, ridership_2014, ridership_2015, and ridership_2016.
  3. Add the calculation to the query. If you use more than one arithmetic operator in a calculation, you need to use parentheses to control the order of the operations. In this case, you want to sum the years, and then divide the sum by 4. So, put parentheses around the sum of the four column names. Type (ridership_2013 + ridership_2014 + ridership_2015 + ridership_2016). Then type a division operator / and the number 4.
  4. Next, list the result in a new column by typing AS followed by the new column’s name. Call the new column average since it represents average weekly ridership for the period 2013-2016.
  5. End your query with the FROM command and the name of the dataset and table that you’re pulling data from. After FROM, press Enter or Return and type `bigquery-public-data.new_york_subway.subway_ridership_2013_present` (enclosed in backticks, since the project name contains hyphens).

At this point, your full query should read:

SELECT
station_name,
ridership_2013,
ridership_2014,
ridership_2015,
ridership_2016,
(ridership_2013 + ridership_2014 + ridership_2015 + ridership_2016) / 4 AS average
FROM
`bigquery-public-data.new_york_subway.subway_ridership_2013_present`

  6. Click the Run button to get the results. The results clearly show the trend in ridership at each station from 2013 to 2016. For example, weekly ridership at the Atlantic Av - Barclays Ctr station increased every year since 2013. Further, for the years 2014, 2015, and 2016, weekly ridership at Atlantic Av - Barclays Ctr exceeded the overall average for the period 2013-2016 (listed in the average column).

This kind of data is useful for managing public transportation. It can help you determine which stations or routes to expand due to increased ridership. Using basic calculations in your query allows you to quickly discover significant patterns in your data.
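
The same multi-operator calculation can be sketched offline with sqlite3 and made-up numbers. One caveat: SQLite truncates division between integers, so the sketch divides by 4.0; in BigQuery, / always returns a float, so dividing by 4 works as the steps describe.

```python
# Sum four year columns and divide to get a multi-year average, mirroring the
# query above. Station name and ridership values are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE subway_ridership_2013_present (
        station_name TEXT, ridership_2013 INTEGER, ridership_2014 INTEGER,
        ridership_2015 INTEGER, ridership_2016 INTEGER)
""")
conn.execute(
    "INSERT INTO subway_ridership_2013_present VALUES ('Station A', 100, 110, 120, 130)"
)
row = conn.execute("""
    SELECT station_name,
           (ridership_2013 + ridership_2014 + ridership_2015 + ridership_2016) / 4.0
               AS average
    FROM subway_ridership_2013_present
""").fetchone()
print(row)  # ('Station A', 115.0)
```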

Confirmation and reflection

Using the same average formula as the last query you wrote, write a query to find the average weekly ridership from the years 2016-2018. What is the average weekly ridership for the Atlantic Av - Barclays Ctr station in this timeframe?

A. 49255

B. 13212.67

C. 4903.67

D. 42672.33

The correct answer is D. 42672.33. Explain: The average weekly ridership for the Atlantic Av - Barclays Ctr station from 2016-2018 is 42672.33. To find it, use a SQL calculation to add the three columns ridership_2016, ridership_2017, and ridership_2018, then divide the sum by 3. Going forward, you can use SQL to perform mathematical calculations with data and analyze patterns in real-world situations.

Question 2

In this activity, you wrote queries with single and multiple calculations to find patterns in subway ridership data. In the text box below, write 2-3 sentences (40-60 words) in response to each of the following questions:

  • What other patterns can you discover in the subway ridership dataset using calculations with multiple operators?
  • How can basic calculations in SQL help you make sense of a large amount of data?

Explain: Congratulations on completing this hands-on activity! A good response would include a pattern you found in the ridership dataset, as well as some of the following insights about SQL calculations:

Making basic calculations is a key part of any data analysis. Using basic calculations in SQL can help you quickly discover significant patterns, relationships, and trends in a large dataset. When you include a calculation in a query with your other commands, you can work faster and more efficiently.

Calculation and queries

You've been learning about analyzing data to answer questions and address business objectives. As a data analyst, you will organize, format, aggregate, and calculate data in order to look for trends and patterns or answer specific questions. You will often be given the datasets you need for analysis.

Kaggle.com/datasets contains plenty of public datasets. Use the link to explore public datasets, choose at least one to analyze, and consider the following questions as you make observations:

  • What reasons do you think are behind an upward or downward trend in the data?
  • What are the potential causes for trends you notice?
  • Why would a certain value be significantly different from the other values in a dataset?

Write two or more paragraphs (150-200 words) describing your findings, other observations, and the name of the dataset. Then, visit the discussion forum to read what other learners have written, engage with two or more posts, and share your feedback.

Test your knowledge on SQL calculations

Question 1

You are working with a database table that contains invoice data. The table includes columns for invoice_line_id (line items for each invoice), invoice_id, unit_price, and quantity (the number of purchases in each line item). Each invoice contains multiple line items. You want to know the total price for each of the first 5 line items in the table. You decide to multiply unit price by quantity to get the total price for each line item, and use the AS command to store the total in a new column called line_total.

Add a statement to your SQL query that calculates the total price for each line item and stores it in a new column as line_total.

NOTE: The three dots (...) indicate where to add the statement.

SELECT 
invoice_line_id,
invoice_id,
unit_price,
quantity,
unit_price * quantity AS line_total
FROM
invoice_item 
LIMIT 5

Output:

+-----------------+------------+------------+----------+------------+
| invoice_line_id | invoice_id | unit_price | quantity | line_total |
+-----------------+------------+------------+----------+------------+
|               1 |          1 |       0.99 |        1 |       0.99 |
|               2 |          1 |       0.99 |        1 |       0.99 |
|               3 |          2 |       0.99 |        1 |       0.99 |
|               4 |          2 |       0.99 |        1 |       0.99 |
|               5 |          2 |       0.99 |        1 |       0.99 |
+-----------------+------------+------------+----------+------------+

What total appears in row 1 of your query result?

A. 1.98

B. 7.92

C. 3.96

D. 0.99

The correct answer is D. 0.99. Explain: You add the statement unit_price * quantity AS line_total to calculate the total price for each line item and store it in a new column called line_total. The complete query is SELECT invoice_line_id, invoice_id, unit_price, quantity, unit_price * quantity AS line_total FROM invoice_item LIMIT 5. The AS command gives a temporary name to the new column. The total 0.99 appears in row 1 of your query result.
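
To check this result yourself, here is a sketch that rebuilds the invoice_item table in SQLite with rows matching the output above and runs the same query:

```python
# Recreate the five line items shown above, then compute line_total as
# unit_price * quantity.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE invoice_item (
        invoice_line_id INTEGER, invoice_id INTEGER,
        unit_price REAL, quantity INTEGER)
""")
conn.executemany(
    "INSERT INTO invoice_item VALUES (?, ?, ?, ?)",
    [(1, 1, 0.99, 1), (2, 1, 0.99, 1), (3, 2, 0.99, 1),
     (4, 2, 0.99, 1), (5, 2, 0.99, 1)],
)
rows = conn.execute("""
    SELECT invoice_line_id, invoice_id, unit_price, quantity,
           unit_price * quantity AS line_total
    FROM invoice_item
    LIMIT 5
""").fetchall()
print(rows[0])  # (1, 1, 0.99, 1, 0.99)
```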

Question 2

In a SQL query, which calculation does the modulo (%) operator perform?

A. It applies an exponent to a value

B. It converts a decimal to a percent

C. It returns the remainder of a division calculation

D. It finds the square root of a number

The correct answer is C. It returns the remainder of a division calculation. Explain: The modulo operator returns the remainder of a division calculation when included in a SQL query.
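
A quick check in SQLite, which supports the % operator; note that in BigQuery you would write MOD(10, 3) instead, since its standard SQL provides a MOD function rather than %.

```python
# The modulo operator returns the remainder of a division: 10 / 3 is 3 with
# remainder 1.
import sqlite3

conn = sqlite3.connect(":memory:")
remainder = conn.execute("SELECT 10 % 3").fetchone()[0]
print(remainder)  # 1
```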

Question 3

You are working with a dataset with the column name “firstquarterexpenses.” How can you rename this column to make it more readable?

A. first+quarter+expenses

B. first_quarter_expenses

C. Firstquarterexpenses

D. first quarter expenses

The correct answer is B. first_quarter_expenses. Explain: You can rename the column first_quarter_expenses. Using underscores between words helps avoid potential issues while keeping the names readable.
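
As an illustration, AS can apply the more readable name in a query's results; the expenses table and value below are hypothetical.

```python
# Select the awkwardly named column under a readable alias and confirm the
# result set uses the new name.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expenses (firstquarterexpenses REAL)")
conn.execute("INSERT INTO expenses VALUES (1250.0)")
cur = conn.execute(
    "SELECT firstquarterexpenses AS first_quarter_expenses FROM expenses")
print(cur.description[0][0])  # first_quarter_expenses
```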