5.1.5.Sort data using SQL - quanganh2001/Google-Data-Analytics-Professional-Certificate-Coursera GitHub Wiki
Optional Refresher: Using BigQuery
As a quick refresher, BigQuery is a data warehouse on Google Cloud that data analysts can use to query, filter large datasets, aggregate results, and perform complex operations.
In the next activity, you will use BigQuery to sort data using SQL queries. In case you jumped around in a few courses, set up a BigQuery account by following the instructions in Using BigQuery from the Prepare Data for Exploration course. If you already have a BigQuery account, you may mark this reading as complete and begin the next activity.
Hands-On Activity: SQL sorting queries
Question 1
Activity overview
So far, you’ve learned about SQL and used SQL queries to interact with databases. In this activity, you’ll practice sorting data by using SQL queries with ORDER BY and WHERE clauses.
By the time you complete this activity, you will be able to write queries that sort data depending on your needs. This will enable you to organize and use data more efficiently in your career as a data analyst.
Sorting with SQL
To practice sorting data with SQL, you’ll query the CDC Births Data Summary public dataset. The queries you write will help you obtain some answers about which counties in the United States have the most and least births in the years 2016-2018.
Load the dataset
- Log in to BigQuery Sandbox . If you have a free trial version of BigQuery, you can use that instead. On the BigQuery page, click the Go to BigQuery button.
- Note: BigQuery Sandbox frequently updates its user interface. The latest changes may not be reflected in the screenshots presented in this activity, but the principles remain the same. Adapting to changes in software updates is an essential skill for data analysts, and it’s helpful for you to practice troubleshooting. You can also reach out to your community of learners on the discussion forum for help.
- If you have never created a BigQuery project before, click CREATE PROJECT on the right side of the screen. If you have created a project before, you can use an existing one or create a new one by clicking the project dropdown in the blue header bar and selecting NEW PROJECT.
- Name your project something that will help you identify it later. You can give it a unique project ID or use an auto-generated one. Don’t worry about selecting an organization if you don’t know what to put.
- Now, you’ll see the Editor interface. In the middle of the screen is a window where you can type code, and to the left is the Explorer menu where you can search for datasets.
- Click + ADD DATA at the top of the Explorer menu, then Explore public datasets from the resulting dropdown.
- In the Search Marketplace bar, type sdoh_cdc_wonder_natality.
- Click the CDC Births Data Summary.
- Click View Dataset. This will bring you back to the BigQuery Sandbox interface in a new tab.
- Note: This may pin the bigquery-public-data dropdown to the Explorer menu. You can use this to browse datasets and tables.
- Click back to the Editor tab. This is where you’ll use SQL during this activity.
- Copy, paste, and run the following query to display the first 1,000 rows of the county_natality table:
SELECT
*
FROM
`bigquery-public-data.sdoh_cdc_wonder_natality.county_natality`
LIMIT
1000
After the query has run, your results should appear like this:
Use the ORDER BY clause
Examine the dataset you just loaded. Take a moment to familiarize yourself with the columns and get a feel for what each can tell you.
Now, imagine you were asked by your manager to figure out which 10 counties had the lowest birth count for 2016-2018. You could accomplish this by modifying your query to use the ORDER BY clause.
Copy, paste, and run the following query:
SELECT
*
FROM
`bigquery-public-data.sdoh_cdc_wonder_natality.county_natality`
ORDER BY
Births
LIMIT
10
The results of your query should appear like this:
You may have noticed that the query did not specify whether it should be sorted ASC (ascending) or DESC (descending). When this is not specified, SQL defaults to sorting by ascending order. You can run another query to confirm this.
Copy, paste, and run the following query that includes ASC:
SELECT
*
FROM
`bigquery-public-data.sdoh_cdc_wonder_natality.county_natality`
ORDER BY
Births
ASC
LIMIT
10
You’ll find that the results did not change. Notice that Tompkins County, NY, had just 735 births in 2018—the lowest birth count of any county in the US between 2016-2018.
Use DESC to reverse sorting order
Now, modify the query to sort in the other direction, returning the top 10 counties with the highest yearly birth counts between 2016-2018.
Copy, paste, and run the following query:
SELECT
*
FROM
`bigquery-public-data.sdoh_cdc_wonder_natality.county_natality`
ORDER BY
Births
DESC
LIMIT
10
Your results should appear like this:
Now, the query returns the 10 rows with the largest values in the Birth column. Los Angeles County takes up the top three spots.
Combine ORDER BY with WHERE clauses
Next, modify the query so that it returns the top 10 counties with the highest birth counts for 2018 only. To do this, add a WHERE clause to the query that specifies only rows that have a Year value equal to 2018-01-01. Note how the ORDER BY clause comes after the WHERE clause.
Copy, paste, and run the following query:
SELECT
*
FROM
`bigquery-public-data.sdoh_cdc_wonder_natality.county_natality`
WHERE
Year = '2018-01-01'
ORDER BY
Births
DESC
LIMIT
10
Your results should appear something like this:
The query worked! You successfully used both ORDER BY and WHERE clauses in the same query.
Confirmation and reflection
The last query you ran returned the top 10 counties with the highest birth counts for 2018 only. Remove the LIMIT statement and run the query again. What is the county with the 11th highest birth count?
A. Orange County, CA
B. Dallas County, TX
C. Unidentified Counties, KY
D. Miami-Dade County, FL
The correct answer is A. Orange County, CA. Explain: The county with the 11th highest birth count in 2018 is Orange County, CA. To find this answer, you ran a query with an ORDER BY clause and a WHERE clause. Going forward, you can use this knowledge of SQL to better organize and structure your data.
Question 2
In this activity, you practiced sorting data using SQL queries with ORDER BY and WHERE clauses. In the text box below, write 2-3 sentences (40-60 words) in response to each of the following questions:
- How can the ORDER BY clause help you organize and structure your data?
- Why is it helpful to use the ORDER BY and WHERE clauses together when sorting and filtering data?
- Can you think of a business question that you could answer using this method?
Explain: Congratulations on completing this hands-on activity! A good response would include that sorting the data you return in your queries is a crucial tool for analyzing and understanding data.
You can also answer business questions by sorting the dataset according to a given metric. For instance, a store may want to know which products they sell the most or least. Sorting helps you answer business questions that involve phrases such as “how much,” “how many,” “best,” or “worst”—which will be a valuable skill in your career as a data analyst.
Hands-on Activity: Analyze weather data in BigQuery
Question 1
Activity overview
Previously, you learned how to use BigQuery to clean data and prepare it for analysis. Now you will query a dataset and save the results into a new table. This is a useful skill when the original data source changes continuously and you need to preserve a specific dataset for continued analysis. It’s also valuable when you are dealing with a large dataset and know you’ll be doing more than one analysis using the same subset of data.
In this scenario, you’re a data analyst at a local news station. You have been tasked with answering questions for meteorologists about the weather. You will work with public data from the National Oceanic and Atmospheric Administration (NOAA), which has data for the entire United States. This is why you will need to save a subset of the data in a separate table.
By the time you complete this activity, you will be able to use SQL queries to create new tables when dealing with complex datasets. This will greatly simplify your analysis in the future.
Access the public dataset
For this activity you will need the NOAA weather data from BigQuery’s public datasets.
- Click on the + ADD DATA button in the Explorer menu pane and select Explore public datasets. This will open a new menu where you can search public datasets that are already available through Google Cloud. If you have already loaded the BigQuery public datasets into your console, you can just search noaa_gsod in your Explorer menu and skip these steps.
- Type noaa_gsod into the search bar. You’ll find the GSOD of Global Surface Summary of the Day Weather Data.
- Click the GSOD dataset to open it. This will provide you with more detailed information about the dataset if you’re interested. Click VIEW DATASET to open this dataset in your console.
- Search noaa_gsod in your Explorer menu pane to find the dataset. Click the dropdown menu to explore the tables in this dataset. Scroll down to gsod2020 and open the table menu by clicking the three vertical dots.
- Check the table’s schema and preview it to get familiar with the data. Once you’re ready, you can click COMPOSE NEW QUERY to start querying the dataset.
Querying the data
The meteorologists who you’re working with have asked you to get the temperature, wind speed, and precipitation for stations La Guardia and JFK, for every day in 2020, in descending order by date, and ascending order by Station ID. Use the following query to request this information:
SELECT
stn,
date,
-- Use the IF function to replace 9999.9 values, which the dataset description explains is the default value when temperature is missing, with NULLs instead.
IF(
temp=9999.9,
NULL,
temp) AS temperature,
-- Use the IF function to replace 999.9 values, which the dataset description explains is the default value when wind speed is missing, with NULLs instead.
IF(
wdsp="999.9",
NULL,
CAST(wdsp AS Float64)) AS wind_speed,
-- Use the IF function to replace 99.99 values, which the dataset description explains is the default value when precipitation is missing, with NULLs instead.
IF(
prcp=99.99,
0,
prcp) AS precipitation
FROM
`bigquery-public-data.noaa_gsod.gsod2020`
WHERE
stn="725030" -- La Guardia
OR stn="744860" -- JFK
ORDER BY
date DESC,
stn ASC
The meteorologists also asked you a couple questions while they were preparing for the nightly news: They want the average temperature in June 2020 and the average wind_speed in December 2020.
Instead of rewriting similar, but slightly different, queries over and over again, there is an easier approach: Save the results from the original query as a table for future queries.
Save a new table
In order to make this subset of data easier to query from, you can save the table from the weather data into a new dataset.
- From your Explorer pane, click the three vertical dots next to your project and select Create dataset. You can name this dataset demos and leave the rest of the default options. Click CREATE DATASET.
- Open your new dataset and select COMPOSE NEW QUERY. Input the following query to get the average temperature, wind speed, visibility, wind gust, precipitation, and snow depth La Guardia and JFK stations for every day in 2020, in descending order by date, and ascending order by Station ID:
SELECT
stn,
date,
-- Use the IF function to replace 9999.9 values, which the dataset description explains is the default value when temperature is missing, with NULLs instead.
IF(
temp=9999.9,
NULL,
temp) AS temperature,
-- Use the IF function to replace 999.9 values, which the dataset description explains is the default value when wind speed is missing, with NULLs instead.
IF(
wdsp="999.9",
NULL,
CAST(wdsp AS Float64)) AS wind_speed,
-- Use the IF function to replace 99.99 values, which the dataset description explains is the default value when precipitation is missing, with NULLs instead.
IF(
prcp=99.99,
0,
prcp) AS precipitation
FROM
`bigquery-public-data.noaa_gsod.gsod2020`
WHERE
stn="725030" -- La Guardia
OR stn="744860" -- JFK
ORDER BY
date DESC,
stn ASC
- Before you run the query, select the MORE menu from the Query Editor and open the Query Settings menu. In the Query Settings menu, select Set a destination table for query results. Set the dataset option to demos and name the table nyc_weather.
- Run the query from earlier; now it will save as a new table in your demos dataset.
- Return to the Query settings menu by using the MORE dropdown menu. Reset the settings to Save query results in a temporary table. This will prevent you from accidentally adding every query as a table to your new dataset.
Query your new table
Now that you have the subset of this data saved in a new table, you can query it more easily. Use the following query to find the average temperature from the meteorologists first question:
SELECT
AVG(temperature)
FROM
`airy-shuttle-315515.demos.nyc_weather` --remember to change the project name to your project before running this query
WHERE
date BETWEEN '2020-06-01' AND '2020-06-30'
You can also use this syntax to find the average wind_speed or any other information from this subset of data you’re interested in. Try constructing a few more queries to answer the meteorologists’ questions!
The ability to save your results into a new table is a helpful trick when you know you're only interested in a subset of a larger complex dataset that you plan on querying multiple times, such as the weather data for just La Guardia and JFK. This also helps minimize errors during your analysis.
Confirmation and reflection
What was the average temperature at JFK and La Guardia stations between June 1, 2020 and June 30, 2020?
A. 74.909
B. 87.671
C. 72.883
D. 92.099
The correct answer is C. 72.883. Explain: The average was 72.883. To find out the average temperature during this time period, you successfully created a new table using a query and ran another query against that table. Going forward, you will be able to use this skill to create tables with specific subsets of your data to query. This will help you draw insights from multiple data sources in the future.
Question 2
In the text box below, write 2-3 sentences (40-60 words) in response to each of the following questions:
- How can creating tables from queries help you perform data analysis in the future?
- Why is being able to view specific subsets of a dataset important?
Explain: Congratulations on completing this hands-on activity! In this activity, you explored two public datasets and created a new table using a query. A good response might include that creating tables using your queries allows you to work with a subset of data without changing the original.
For instance, now you can query weather data from just the weather stations in New York that you need. This is important for finding trends within a subset of data. In upcoming activities, you will continue analyzing data like this.
Test your knowledge on sorting in SQL
Question 1
A data analyst wants to sort a list of greenhouse shrubs by price from least expensive to most expensive. Which statement should they use?
A. ORDER BY shrub_price
B. WHERE shrub_price
C. ORDER BY shrub_price DESC
D. WHERE shrub_price ASC
The correct answer is A. ORDER BY shrub_price. Explain: To sort a list of greenhouse shrubs by price from least expensive to most expensive, they should use ORDER BY shrub_price.
Question 2
You are working with a database table that contains data about music genres. You want to sort the genres by name in ascending order. The genres are listed in the genre_name column.
You write the SQL query below. Add an ORDER BY clause that will sort the genres by name in ascending order.
SELECT
*
FROM
genre
ORDER BY genre_name ASC;
Results:
+----------+--------------------+
| genre_id | genre_name |
+----------+--------------------+
| 23 | Alternative |
| 4 | Alternative & Punk |
| 6 | Blues |
| 11 | Bossa Nova |
| 24 | Classical |
| 22 | Comedy |
| 21 | Drama |
| 12 | Easy Listening |
| 15 | Electronica/Dance |
| 13 | Heavy Metal |
| 17 | Hip Hop/Rap |
| 2 | Jazz |
| 7 | Latin |
| 3 | Metal |
| 25 | Opera |
| 9 | Pop |
| 14 | R&B/Soul |
| 8 | Reggae |
| 1 | Rock |
| 5 | Rock And Roll |
| 20 | Sci Fi & Fantasy |
| 18 | Science Fiction |
| 10 | Soundtrack |
| 19 | TV Shows |
| 16 | World |
+----------+--------------------+
What genre appears in row 3 of your query result?
A. Alternative
B. Classical
C. Easy Listening
D. Blues
The correct answer is D. Blues. Explain: The clause ORDER BY genre_name
will sort the genres by name in ascending order. The complete query is SELECT * FROM genre ORDER BY genre_name
. The ORDER BY clause tells the database how to organize the data it returns. The ORDER BY clause sorts data in ascending order by default.
The Blues genre appears in row 3 of your query result.
Question 3
You are working with a database table that contains employee data. You want to sort the employees by hire date in descending order. The hire dates are listed in the hire_date column.
You write the SQL query below. Add an ORDER BY clause that will sort the employees by hire date in descending order.
SELECT
*
FROM
employee
ORDER BY hire_date DESC;
Results:
+-------------+-----------+------------+---------------------+------------+---------------------+---------------------+-----------------------------+------------+-------+---------+-------------+-------------------+-------------------+--------------------------+
| employee_id | last_name | first_name | title | reports_to | birth_date | hire_date | address | city | state | country | postal_code | phone | fax | email |
+-------------+-----------+------------+---------------------+------------+---------------------+---------------------+-----------------------------+------------+-------+---------+-------------+-------------------+-------------------+--------------------------+
| 8 | Callahan | Laura | IT Staff | 6 | 1968-01-09 00:00:00 | 2004-03-04 00:00:00 | 923 7 ST NW | Lethbridge | AB | Canada | T1H 1Y8 | +1 (403) 467-3351 | +1 (403) 467-8772 | [email protected] |
| 7 | King | Robert | IT Staff | 6 | 1970-05-29 00:00:00 | 2004-01-02 00:00:00 | 590 Columbia Boulevard West | Lethbridge | AB | Canada | T1K 5N8 | +1 (403) 456-9986 | +1 (403) 456-8485 | [email protected] |
| 5 | Johnson | Steve | Sales Support Agent | 2 | 1965-03-03 00:00:00 | 2003-10-17 00:00:00 | 7727B 41 Ave | Calgary | AB | Canada | T3B 1Y7 | 1 (780) 836-9987 | 1 (780) 836-9543 | [email protected] |
| 6 | Mitchell | Michael | IT Manager | 1 | 1973-07-01 00:00:00 | 2003-10-17 00:00:00 | 5827 Bowness Road NW | Calgary | AB | Canada | T3B 0C5 | +1 (403) 246-9887 | +1 (403) 246-9899 | [email protected] |
| 4 | Park | Margaret | Sales Support Agent | 2 | 1947-09-19 00:00:00 | 2003-05-03 00:00:00 | 683 10 Street SW | Calgary | AB | Canada | T2P 5G3 | +1 (403) 263-4423 | +1 (403) 263-4289 | [email protected] |
| 1 | Adams | Andrew | General Manager | None | 1962-02-18 00:00:00 | 2002-08-14 00:00:00 | 11120 Jasper Ave NW | Edmonton | AB | Canada | T5K 2N1 | +1 (780) 428-9482 | +1 (780) 428-3457 | [email protected] |
| 2 | Edwards | Nancy | Sales Manager | 1 | 1958-12-08 00:00:00 | 2002-05-01 00:00:00 | 825 8 Ave SW | Calgary | AB | Canada | T2P 2T3 | +1 (403) 262-3443 | +1 (403) 262-3322 | [email protected] |
| 3 | Peacock | Jane | Sales Support Agent | 2 | 1973-08-29 00:00:00 | 2002-04-01 00:00:00 | 1111 6 Ave SW | Calgary | AB | Canada | T2P 5M5 | +1 (403) 262-3443 | +1 (403) 262-6712 | [email protected] |
+-------------+-----------+------------+---------------------+------------+---------------------+---------------------+-----------------------------+------------+-------+---------+-------------+-------------------+-------------------+--------------------------+
What employee appears in row 1 of your query result?
A. Margaret Park
B. Laura Callahan
C. Robert King
D. Nancy Edwards
The correct answer is B. Laura Callahan. Explain: The clause ORDER BY hire_date DESC
will sort the employees by hire date in descending order. The complete query is SELECT * FROM employee ORDER BY hire_date DESC
. The ORDER BY clause tells the database how to organize the data it returns. The ORDER BY clause sorts data in ascending order by default. The DESC command is used to sort data in descending order.
The employee Laura Callahan appears in row 1 of your query result.