4.3.2.Learn basic SQL queries - quanganh2001/Google-Data-Analytics-Professional-Certificate-Coursera GitHub Wiki

Optional: Upload the customer dataset to BigQuery

In the next video, the instructor uses a specific dataset. The instructions in this reading are provided for you to upload the same dataset in your BigQuery console.

You must have a BigQuery account to follow along. If you have hopped around courses, Using BigQuery in the Prepare Data for Exploration course covers how to set up a BigQuery account.

Prepare for the next video

First, download the CSV file from the attachment below.

Customer Table - Sheet1

Next, complete the following steps in your BigQuery console to upload the Customer Table dataset.

Step 1: Open your BigQuery console and click on the project you want to upload the data to.

Step 2: In the Explorer on the left, click the Actions icon (three vertical dots) next to your project name and select Create dataset.

pYj9p7XFS-uI_ae1xYvr6w_05852f3ecf20415fba255e844eff49f1_actions_icon_create_dataset

Step 3: In the upcoming video, the name "customer_data" will be used for the dataset. If you plan to follow along with the video, enter customer_data for the Dataset ID.

rM90PCsNS7qPdDwrDfu6pA_3b05e749468e48eda2bfa98ba908f3f1_create-dataset

Step 4: Click CREATE DATASET (blue button) to add the dataset to your project.

Step 5: In the Explorer on the left, click to expand your project, and then click the customer_data dataset you just created.

Step 6: Click the Actions icon (three vertical dots) next to customer_data and select Open.

Step 7: Click the blue + icon at the middle to open the Create table window.

9t8_PESeT8CfPzxEnh_AAw_d840be8cb250448688ed017b0f6634f1_Screenshot-2022-07-20-11 00 33-AM

Step 8: Under Source, for the Create table from selection, choose where the data will be coming from.

Select Upload.
Click Browse to select the Customer Table CSV file you downloaded.
Choose CSV from the file format drop-down.

Step 9: For Table name, enter customer_address if you plan to follow along with the video.

Step 10: For Schema, click the Auto detect check box.

Step 11: Click Create table (blue button). You will now see the customer_address table under your customer_data dataset in your project.

Step 12: Click customer_address and then select the Preview tab. Confirm that you see the data shown below.

s1bbE_SqTi6W2xP0qm4utQ_160621e5a99c4106b0b8c51ea4c102f1_customer_address

And now you have everything you need to follow along with the next video. This is also a great table to use to practice querying data on your own. Plus, you can use these steps to upload any other data you want to work with.

Hands-On Activity: Clean data using SQL

Question 1

Activity overview

3UWxlyqNRxaFsZcqjZcWgQ_3fd91ae61ce04caaa755c3477a337947_line-y

In previous lessons, you learned about the importance of being able to clean your data where it lives. When it comes to data stored in databases, that means using SQL queries. In this activity, you will create a custom dataset and table, import a CSV file, and use SQL queries to clean automobile data.

In this scenario, you are a data analyst working with a used car dealership startup venture. The investors want you to find out which cars are most popular with customers so they can make sure to stock accordingly.

By the time you complete this activity, you will be able to clean data using SQL. This will enable you to process and analyze data in databases, which is a common task for data analysts.

What you will need

To get started, download the automobile_data CSV file. This is data from an external source that contains historical sales data on car prices and their features.

Click the link to the automobile_data file to download it. Or you may download the CSV file directly from the attachments below.

Link to data: automobile_data

Download download data:

automobile_data CSV

Upload your data

_9-wgkqASFufsIJKgChb1Q_2064b578d20944e1bc638d7b64c7bfe3_line-y

Similarly to a previous BigQuery activity, you will need to create a dataset and a custom table to house your data. Then, you’ll be able to use SQL queries to explore and clean it. Once you’ve downloaded the automobile_data file, you can create your dataset.

Step 1: Create a dataset

Go to the Explorer pane in your workspace and click the three dots next to your pinned project to open the menu. From here, select Create dataset.

SdOA0kTKQUeTgNJEyrFHSQ_fb57f9b6a1dc45ec89ad53e955eb26f1_Screenshot-2021-06-08-11 45 45-AM-1-

From the Create dataset menu, fill out some information about the dataset. Input the Dataset ID as cars; you can leave the Data location as Default. Then click CREATE DATASET.

A_6FQJ1iQtC-hUCdYmLQmg_fe3dbdb96f2c46acabfa450e787291f1_unnamed-7-

The cars dataset should appear under your project in the Explorer pane as shown below. Click on the three dots next to the cars dataset to open it.

HeZEtDfgTE2mRLQ34CxN_g_33a25368776b44ff91406479ec9a55f1_Screenshot-2021-06-09-3 57 02-PM-1-

Step 2: Create table

After you open your newly created dataset, you will be able to add a custom table for your data.

From the cars dataset, click CREATE TABLE.

D8CMXWXPTXmAjF1lz815hw_8877de9f0c384e18a9fffe5ee34b5df1_Screenshot-2021-06-09-4 02 42-PM-1-

Under Source, upload the automobile_data CSV. Under Destination, make sure you are uploading into your cars dataset and name your table car_info. You can set the schema to Auto-detect. Then, click Create table.

sA_6OHXISzSP-jh1yLs0uw_da4cc271973e4993ba486dcb8c3d44f1_unnamed-8-

After creating your table, it will appear in your Explorer pane. You can click on the table to explore the schema and preview your data. Once you have gotten familiar with your data, you can start querying it.

Cleaning your data

_9-wgkqASFufsIJKgChb1Q_2064b578d20944e1bc638d7b64c7bfe3_line-y

Your new dataset contains historical sales data, including details such as car features and prices. You can use this data to find the top 10 most popular cars and trims. But before you can perform your analysis, you’ll need to make sure your data is clean. If you analyze dirty data, you could end up presenting the wrong list of cars to the investors. That may cause them to lose money on their car inventory investment.

Step 1: Inspect the fuel_type column

The first thing you want to do is inspect the data in your table so you can find out if there is any specific cleaning that needs to be done. According to the data’s description, the fuel_type column should only have two unique string values: diesel and gas. To check and make sure that’s true, run the following query:

SELECT DISTINCT fuel_type
FROM cars.car_info;

This returns the follwing results:

2PvkmVqCR6q75Jlaghequw_572b7e9c7f6a435fa142ecbb57c090f1_Screenshot-2021-06-10-11 44 02-AM-1-

This confirms that the fuel_type column doesn’t have any unexpected values.

Step 2: Inspect the length column

Next, you will inspect a column with numerical data. The length column should contain numeric measurements of the cars. So you will check that the minimum and maximum lengths in the dataset align with the data description , which states that the lengths in this column should range from 141.1 to 208.1. Run this query to confirm

SELECT
  MIN(length) AS min_length,
  MAX(length) AS max_length
FROM
  cars.car_info;

Your results should confirm that 141.1 and 208.1 are the minimum and maximum values respectively in this column.

QkcK0ZKESoKHCtGShJqCig_957228fafc6a41bd888315b31193e8f1_Screenshot-2021-06-10-2 40 29-PM-1-

Step 3: Fill in missing data

Missing values can create errors or skew your results during analysis. You’re going to want to check your data for null or missing values. These values might appear as a blank cell or the word null in BigQuery.

You can check to see if the num_of_doors column contains null values using this query:

SELECT * FROM cars.car_info
WHERE num_of_doors IS NULL;

This will select any rows with missing data for the num_of_doors column and return them in your results table. You should get two results, one Mazda and one Dodge:

FPn6Bs_GRwi5-gbPxocIrQ_84308aef483b49fb81d4670aa8cc84f1_unnamed-9-

In order to fill in these missing values, you check with the sales manager, who states that all Dodge gas sedans and all Mazda diesel sedans sold had four doors. If you are using the BigQuery free trial, you can use this query to update your table so that all Dodge gas sedans have four doors:

UPDATE
  cars.car_info
SET
  num_of_doors = "four"
WHERE
  make = "dodge"
  AND fuel_type = "gas"
  AND body_style = "sedan";

You should get a message telling you that three rows were modified in this table. To make sure, you can run the previous query again:

SELECT
  *
FROM
  cars.car_info 
WHERE 
  num_of_doors IS NULL;

Now, you only have one row with a NULL value for num_of_doors. Repeat this process to replace the null value for the Mazda.

If you are using the BigQuery Sandbox, you can skip these UPDATE queries; they will not affect your ability to complete this activity.

Step 4: Identify potential errors

Once you have finished ensuring that there aren’t any missing values in your data, you’ll want to check for other potential errors. You can use SELECT DISTINCT to check what values exist in a column. You can run this query to check the num_of_cylinders column:

SELECT DISTINCT num_of_cylinders
FROM cars.car_info;

After running this, you notice that there are one too many rows. There are two entries for two cylinders: rows 6 and 7. But the two in row 7 is misspelled.

VHPfXt8kRnSz317fJAZ0bQ_9257fefd395046c290c809d29c2e2ef1_Screenshot-2021-06-10-3 44 19-PM-1-

To correct the misspelling for all rows, you can run this query if you have the BigQuery free trial:

UPDATE
  cars.car_info
SET
  num_of_cylinders = "two"
WHERE
  num_of_cylinders = "tow";

You will get a message alerting you that one row was modified after running this statement. To check that it worked, you can run the previous query again:

SELECT
  DISTINCT num_of_cylinders
FROM
  cars.car_info;

Next, you can check the compression_ratio column. According to the data description , the compression_ratio column values should range from 7 to 23. Just like when you checked the length values , you can use MIN and MAX to check if that’s correct:

SELECT
  MIN(compression_ratio) AS min_compression_ratio,
  MAX(compression_ratio) AS max_compression_ratio
FROM
  cars.car_info;

Notice that this returns a maximum of 70. But you know this is an error because the maximum value in this column should be 23, not 70. So the 70 is most likely a 7.0. Run the above query again without the row with 70 to make sure that the rest of the values fall within the expected range of 7 to 23.

SELECT
  MIN(compression_ratio) AS min_compression_ratio,
  MAX(compression_ratio) AS max_compression_ratio
FROM
  cars.car_info
WHERE
  compression_ratio <> 70;

Now the highest value is 23, which aligns with the data description. So you’ll want to correct the 70 value. You check with the sales manager again, who says that this row was made in error and should be removed. Before you delete anything, you should check to see how many rows contain this erroneous value as a precaution so that you don’t end up deleting 50% of your data. If there are too many (for instance, 20% of your rows have the incorrect 70 value), then you would want to check back in with the sales manager to inquire if these should be deleted or if the 70 should be updated to another value. Use the query below to count how many rows you would be deleting:

SELECT
  COUNT(*) AS num_of_rows_to_delete
FROM
  cars.car_info
WHERE
  compression_ratio = 70;

Turns out there is only one row with the erroneous 70 value. So you can delete that row using this query:

DELETE cars.car_info
WHERE compression_ratio = 70;

If you are using the BigQuery sandbox, you can replace DELETE with SELECT to see which row would be deleted.

Step 5: Ensure consistency

Finally, you want to check your data for any inconsistencies that might cause errors. These inconsistencies can be tricky to spot — sometimes even something as simple as an extra space can cause a problem.

Check the drive_wheels column for inconsistencies by running a query with a SELECT DISTINCT statement:

SELECT DISTINCT drive_wheels
FROM cars.car_info;

It appears that 4wd appears twice in results. However, because you used a SELECT DISTINCT statement to return unique values, this probably means there’s an extra space in one of the 4wd entries that makes it different from the other 4wd.

G9XjLN_ITf6V4yzfyB3-sg_1e7dda7421dc4fe089fd53a4a7d09af1_unnamed-10-

To check if this is the case, you can use a LENGTH statement to determine the length of how long each of these string variables:

SELECT
  DISTINCT drive_wheels,
  LENGTH(drive_wheels) AS string_length
FROM
  cars.car_info;

According to these results, some instances of the 4wd string have four characters instead of the expected three (4wd has 3 characters). In that case, you can use the TRIM function to remove all extra spaces in the drive_wheels column if you are using the BigQuery free trial:

UPDATE
  cars.car_info
SET
  drive_wheels = TRIM(drive_wheels)
WHERE TRUE;

Then, you run the SELECT DISTINCT statement again to ensure that there are only three distinct values in the drive_wheels column:

SELECT
  DISTINCT drive_wheels
FROM
  cars.car_info;

And now there should only be three unique values in this column! Which means your data is clean, consistent, and ready for analysis!

Confirmation and reflection

3UWxlyqNRxaFsZcqjZcWgQ_3fd91ae61ce04caaa755c3477a337947_line-y

What is the maximum value in the price column of the car_info table?

A. 45,400

B. 16,430

C. 12,978

D. 5,1180

The correct answer is A. 45,400. Explain: To ensure that the values in the price column fell within the expected range, you used the MIN and MAX functions to determine that the maximum price was 45, 400. Knowing this, you were able to clean this column and prepare for analysis. Going forward, you will continue to check columns with numeric data in BigQuery to make sure your data is clean. This will help you quickly identify issues with your data that might cause errors during analysis.

Question 2

In the text box below, write 2-3 sentences (40-60 words) in response to each of the following questions:

Why is cleaning data before your analysis important?
Which of these cleaning techniques do you think will be most useful for you in the future?

Explain: Congratulations on completing this hands-on activity! In this activity you checked your data for errors and fixed any inconsistencies. A good response would include that cleaning data is an important step of the analysis process that will save you time and help ensure accuracy in the future.

Cleaning data where it lives is incredibly important for analysts. For instance, you were able to use SQL to complete multiple cleaning tasks, which allows you to clean data stored in databases. In upcoming activities, you will use your cleaning skills to prepare for analysis!

Test your knowledge on SQL queries

Question 1

Which of the following SQL functions can data analysts use to clean string variables? Select all that apply.

LENGTH
SUBSTR
COUNTIF
TRIM

Explain: Data analysts can use the SUBSTR and TRIM functions to clean string variables.

Question 2

You are working with a database table that contains data about playlists for different types of digital media. The table includes columns for playlist_id and name. You want to remove duplicate entries for playlist names and sort the results by playlist ID.

You write the SQL query below. Add a DISTINCT clause that will remove duplicate entries from the name column.

NOTE: The three dots (...) indicate where to add the clause.

SELECT playlist_id, name
FROM
playlist
ORDER BY
playlist_id

Outputs:

+-------------+----------------------------+
| playlist_id | name                       |
+-------------+----------------------------+
|           1 | Music                      |
|           2 | Movies                     |
|           3 | TV Shows                   |
|           4 | Audiobooks                 |
|           5 | 90’s Music                 |
|           6 | Audiobooks                 |
|           7 | Movies                     |
|           8 | Music                      |
|           9 | Music Videos               |
|          10 | TV Shows                   |
|          11 | Brazilian Music            |
|          12 | Classical                  |
|          13 | Classical 101 - Deep Cuts  |
|          14 | Classical 101 - Next Steps |
|          15 | Classical 101 - The Basics |
|          16 | Grunge                     |
|          17 | Heavy Metal Classic        |
|          18 | On-The-Go 1                |
+-------------+----------------------------+

What playlist name appears in row 6 of your query result?

A. Movies

B. Music Videos

C. TV Shows

D. Audiobooks

The correct answer is B. Music Videos. Explain: The clause DISTINCT name will remove duplicate entries from the name column. The complete query is SELECT DISTINCT name FROM playlist ORDER BY playlist_id. The DISTINCT clause removes duplicate entries from your query result. The playlist name Music Videos appears in row 6 of your query result.

Question 3

You are working with a database table that contains data about music albums. The table includes columns for album_id, title, and artist_id. You want to check for album titles that are less than 4 characters long.

You write the SQL query below. Add a LENGTH function that will return any album titles that are less than 4 characters long.

SELECT 
*
FROM
album
WHERE LENGTH(title) < 4

Output:

+----------+-------+-----------+
| album_id | title | artist_id |
+----------+-------+-----------+
|      131 | IV    |        22 |
|      181 | Ten   |       118 |
|      182 | Vs.   |       118 |
|      236 | Pop   |       150 |
|      239 | War   |       150 |
+----------+-------+-----------+

What album ID number appears in row 3 of your query result?

A. 182

B. 131

C. 236

D. 239

The correct answer is A. 182. Explain: The function LENGTH(title) < 4 will return any album names that are less than 4 characters long. The complete query is SELECT * FROM album WHERE LENGTH(title) < 4. The LENGTH function counts the number of characters a string contains. The album ID number 182 appears in row 3 of your query result.

Question 4

You are working with a database table that contains customer data. The table includes columns about customer location such as city, state, and country. You want to retrieve the first 3 letters of each country name. You decide to use the SUBSTR function to retrieve the first 3 letters of each country name, and use the AS command to store the result in a new column called new_country.

You write the SQL query below. Add a statement to your SQL query that will retrieve the first 3 letters of each country name and store the result in a new column as new_country.

NOTE: The three dots (...) indicate where to add the statement.

SELECT 
customer_id,
SUBSTR ('country', 1,3) AS new_country
FROM
customer
ORDER BY
country

Output:

+-------------+-------------+
| customer_id | new_country |
+-------------+-------------+
|          56 | cou         |
|          55 | cou         |
|           7 | cou         |
|           8 | cou         |
|           1 | cou         |
|          10 | cou         |
|          11 | cou         |
|          12 | cou         |
|          13 | cou         |
|           3 | cou         |
|          14 | cou         |
|          15 | cou         |
|          29 | cou         |
|          30 | cou         |
|          31 | cou         |
|          32 | cou         |
|          33 | cou         |
|          57 | cou         |
|           5 | cou         |
|           6 | cou         |
|           9 | cou         |
|          44 | cou         |
|          39 | cou         |
|          40 | cou         |
|          41 | cou         |
+-------------+-------------+
(Output limit exceeded, 25 of 59 total rows shown)

What customer ID number appears in row 2 of your query result?

A. 55

B. 47

C. 3

D. 28

The correct answer is A. 55. Explain: The statement SUBSTR(country, 1, 3) AS new_country will retrieve the first 3 letters of each state name and store the result in a new column as new_country. The complete query is SELECT customer_id, SUBSTR(country, 1, 3) AS new_country FROM customer ORDER BY country. The SUBSTR function extracts a substring from a string. This function instructs the database to return 3 characters of each country, starting with the first character. The customer ID number 55 appears in row 2 of your query result.