4.3.2. Learn basic SQL queries - sj50179/Google-Data-Analytics-Professional-Certificate GitHub Wiki

Hands-On Activity: Clean data using SQL

Link to data: automobile_data

Cleaning your data

Your new dataset contains historical sales data, including details such as car features and prices. You can use this data to find the top 10 most popular cars and trims. But before you can perform your analysis, you’ll need to make sure your data is clean. If you analyze dirty data, you could end up presenting the wrong list of cars to the investors. That may cause them to lose money on their car inventory investment.

Some of the queries in this section use DML statements like UPDATE and DELETE, which cannot be executed in the BigQuery sandbox.

Step 1: Inspect the fuel_type column

The first thing you want to do is inspect the data in your table so you can find out if there is any specific cleaning that needs to be done. According to the data’s description, the fuel_type column should only have two unique string values: diesel and gas. To check and make sure that’s true, run the following query:

SELECT
  DISTINCT fuel_type
FROM
  cars.car_info;

This confirms that the fuel_type column doesn’t have any unexpected values.
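If you would rather not compare the results by eye, the same check can be scripted. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a local stand-in for BigQuery, drops the cars dataset prefix, and the three sample rows are made up.

```python
import sqlite3

# Hypothetical sample rows; the real data lives in cars.car_info in BigQuery.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE car_info (make TEXT, fuel_type TEXT)")
conn.executemany(
    "INSERT INTO car_info VALUES (?, ?)",
    [("audi", "gas"), ("mazda", "diesel"), ("dodge", "gas")],
)

EXPECTED_FUEL_TYPES = {"diesel", "gas"}  # taken from the data description

# Same SELECT DISTINCT as in this step, minus the BigQuery dataset prefix.
found = {row[0] for row in conn.execute("SELECT DISTINCT fuel_type FROM car_info")}

# Anything left over is a value the data description does not allow.
unexpected = found - EXPECTED_FUEL_TYPES
print("unexpected fuel types:", sorted(unexpected))
```

An empty list means the column holds only the documented values.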

Step 2: Inspect the length column

Next, you will inspect a column with numerical data. The length column should contain numeric measurements of the cars. Check that the minimum and maximum lengths in the dataset align with the data description, which states that the lengths in this column should range from 141.1 to 208.1. Run this query to confirm:

SELECT
  MIN(length) AS min_length,
  MAX(length) AS max_length
FROM
  cars.car_info;

Your results should confirm that 141.1 and 208.1 are the minimum and maximum values respectively in this column.
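MIN and MAX only tell you about the two extremes. A possible follow-up is to count how many rows fall outside the documented range. This is a sketch under assumptions: SQLite (through Python's sqlite3 module) stands in for BigQuery, and the makes and lengths are invented sample data.

```python
import sqlite3

# Hypothetical sample rows, not the real automobile_data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE car_info (make TEXT, length REAL)")
conn.executemany(
    "INSERT INTO car_info VALUES (?, ?)",
    [("honda", 141.1), ("audi", 176.6), ("mercedes-benz", 208.1)],
)

LOW, HIGH = 141.1, 208.1  # documented range for the length column

min_length, max_length = conn.execute(
    "SELECT MIN(length), MAX(length) FROM car_info"
).fetchone()

# BETWEEN is inclusive, so NOT BETWEEN catches only values outside the range.
(out_of_range,) = conn.execute(
    "SELECT COUNT(*) FROM car_info WHERE length NOT BETWEEN ? AND ?",
    (LOW, HIGH),
).fetchone()

print(min_length, max_length, out_of_range)
```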

Step 3: Fill in missing data

Missing values can create errors or skew your results during analysis. You’re going to want to check your data for null or missing values. These values might appear as a blank cell or the word null in BigQuery.

You can check to see if the num_of_doors column contains null values using this query:

SELECT
  *
FROM
  cars.car_info
WHERE
  num_of_doors IS NULL;

This will select any rows with missing data for the num_of_doors column and return them in your results table. You should get two results, one Mazda and one Dodge.
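When a table is large, you may want a count of the missing values before you pull the rows themselves. One way to get it relies on the fact that COUNT(column) skips NULLs while COUNT(*) does not. The example below is a local sketch using Python's sqlite3 module with made-up rows, not the BigQuery table itself.

```python
import sqlite3

# Hypothetical sample rows: two of the four are missing num_of_doors.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE car_info (make TEXT, num_of_doors TEXT)")
conn.executemany(
    "INSERT INTO car_info VALUES (?, ?)",
    [("audi", "four"), ("mazda", None), ("dodge", None), ("bmw", "two")],
)

# COUNT(*) counts every row; COUNT(num_of_doors) ignores rows where it is NULL.
total_rows, non_null_doors = conn.execute(
    "SELECT COUNT(*), COUNT(num_of_doors) FROM car_info"
).fetchone()

missing_doors = total_rows - non_null_doors
print(f"{missing_doors} of {total_rows} rows are missing num_of_doors")
```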

In order to fill in these missing values, you check with the sales manager, who states that all Dodge gas sedans and all Mazda diesel sedans sold had four doors. First, you use this query to update your table so that all Dodge gas sedans have four doors:

UPDATE
  cars.car_info
SET
  num_of_doors = "four"
WHERE
  make = "dodge"
  AND fuel_type = "gas"
  AND body_style = "sedan";

You should get a message telling you that three rows were modified in this table. To make sure, you can run the previous query again:

SELECT
  *
FROM
  cars.car_info 
WHERE
  num_of_doors IS NULL;

Now, you only have one row with a NULL value for num_of_doors. Repeat this process to replace the null value for the Mazda.
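Because UPDATE statements cannot run in the BigQuery sandbox, here is one way to rehearse the Mazda fix locally. Treat it as a sketch: SQLite (via Python's sqlite3 module) replaces BigQuery, the table has no cars prefix, string literals use single quotes, and the three rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE car_info "
    "(make TEXT, fuel_type TEXT, body_style TEXT, num_of_doors TEXT)"
)
# Hypothetical rows: one Mazda diesel sedan is missing its door count.
conn.executemany(
    "INSERT INTO car_info VALUES (?, ?, ?, ?)",
    [
        ("mazda", "diesel", "sedan", None),
        ("mazda", "diesel", "sedan", "four"),
        ("mazda", "gas", "hatchback", "two"),
    ],
)

# Same pattern as the Dodge update, with the Mazda conditions from the sales manager.
cur = conn.execute(
    "UPDATE car_info SET num_of_doors = 'four' "
    "WHERE make = 'mazda' AND fuel_type = 'diesel' AND body_style = 'sedan'"
)
rows_modified = cur.rowcount  # counts every matched row, including ones already 'four'
conn.commit()

(remaining_nulls,) = conn.execute(
    "SELECT COUNT(*) FROM car_info WHERE num_of_doors IS NULL"
).fetchone()
print(rows_modified, remaining_nulls)
```

As with the Dodge update, the modified-row count can be higher than the number of NULLs, because the WHERE clause matches every Mazda diesel sedan, not only the one with missing data.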

Step 4: Identify potential errors

Once you have finished ensuring that there aren’t any missing values in your data, you’ll want to check for other potential errors. You can use SELECT DISTINCT to check what values exist in a column. You can run this query to check the num_of_cylinders column:

SELECT
  DISTINCT num_of_cylinders
FROM
  cars.car_info;

After running this, you notice that the results contain one more row than expected. There are two entries for two cylinders, in rows 6 and 7, but the entry in row 7 is misspelled as 'tow'.

To correct the misspelling for all rows, you can run this query:

UPDATE
  cars.car_info
SET
  num_of_cylinders = "two"
WHERE
  num_of_cylinders = "tow";

You will get a message alerting you that one row was modified after running this statement. To check that it worked, you can run the previous query again:

SELECT
  DISTINCT num_of_cylinders
FROM
  cars.car_info;
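A variation on the SELECT DISTINCT check used in this step is to count the rows behind each value. A value that appears only once or twice is often, though not always, a typo. The sketch below assumes a local SQLite table built with Python's sqlite3 module and invented rows that still contain the 'tow' misspelling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE car_info (num_of_cylinders TEXT)")
# Hypothetical rows, including one misspelled value.
conn.executemany(
    "INSERT INTO car_info VALUES (?)",
    [("four",), ("four",), ("four",), ("six",), ("six",), ("two",), ("tow",)],
)

# Rarest values first, so likely misspellings rise to the top.
value_counts = conn.execute(
    "SELECT num_of_cylinders, COUNT(*) AS n "
    "FROM car_info "
    "GROUP BY num_of_cylinders "
    "ORDER BY n, num_of_cylinders"
).fetchall()

for value, n in value_counts:
    print(f"{value}: {n}")
```

A low count is only a hint. Here 'two' is just as rare as 'tow', so you still need the data description to decide which one is valid.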

Next, you can check the compression_ratio column. According to the data description, the compression_ratio column values should range from 7 to 23. Just like when you checked the length values, you can use MIN and MAX to check if that’s correct:

SELECT
  MIN(compression_ratio) AS min_compression_ratio,
  MAX(compression_ratio) AS max_compression_ratio
FROM
  cars.car_info;

Notice that this returns a maximum of 70. But you know this is an error because the maximum value in this column should be 23, not 70. So the 70 is most likely a 7.0. Run the above query again without the row with 70 to make sure that the rest of the values fall within the expected range of 7 to 23.

SELECT
  MIN(compression_ratio) AS min_compression_ratio,
  MAX(compression_ratio) AS max_compression_ratio
FROM
  cars.car_info
WHERE
  compression_ratio <> 70;

Now the highest value is 23, which aligns with the data description. So you’ll want to correct the 70 value. You check with the sales manager again, who says that this row was made in error and should be removed. Before you delete anything, you should check to see how many rows contain this erroneous value as a precaution so that you don’t end up deleting 50% of your data. If there are too many (for instance, 20% of your rows have the incorrect 70 value), then you would want to check back in with the sales manager to inquire if these should be deleted or if the 70 should be updated to another value. Use the query below to count how many rows you would be deleting:

SELECT
  COUNT(*) AS num_of_rows_to_delete
FROM
  cars.car_info
WHERE
  compression_ratio = 70;

Turns out there is only one row with the erroneous 70 value. So you can delete that row using this query:

DELETE FROM
  cars.car_info
WHERE
  compression_ratio = 70;
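The count-then-delete precaution can also be enforced in code by running the DELETE inside a transaction and rolling back if it touches more rows than you agreed on. This is a minimal sketch, not BigQuery syntax: it uses Python's sqlite3 module, invented rows, and a hypothetical MAX_ROWS_TO_DELETE threshold.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE car_info (make TEXT, compression_ratio REAL)")
# Hypothetical rows: one carries the erroneous value of 70.
conn.executemany(
    "INSERT INTO car_info VALUES (?, ?)",
    [("audi", 8.5), ("mazda", 23.0), ("dodge", 7.0), ("volvo", 70.0)],
)
conn.commit()

MAX_ROWS_TO_DELETE = 1  # hypothetical limit agreed on before deleting anything

# sqlite3 opens a transaction before the DELETE, so it can still be undone.
cur = conn.execute("DELETE FROM car_info WHERE compression_ratio = 70")
deleted = cur.rowcount

if deleted > MAX_ROWS_TO_DELETE:
    conn.rollback()  # too many rows affected: undo and check with the sales manager
else:
    conn.commit()

remaining, max_ratio = conn.execute(
    "SELECT COUNT(*), MAX(compression_ratio) FROM car_info"
).fetchone()
print(deleted, remaining, max_ratio)
```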

Step 5: Ensure consistency

Finally, you want to check your data for any inconsistencies that might cause errors. These inconsistencies can be tricky to spot — sometimes even something as simple as an extra space can cause a problem.

Check the drive_wheels column for inconsistencies by running a query with a SELECT DISTINCT statement:

SELECT
  DISTINCT drive_wheels
FROM
  cars.car_info;

It looks like 4wd appears twice in the results. However, because you used a SELECT DISTINCT statement to return unique values, this probably means there’s an extra space in one of the 4wd entries that makes it different from the other 4wd.

To check if this is the case, you can use the LENGTH function to determine how long each of these string values is:

SELECT
  DISTINCT drive_wheels,
  LENGTH(drive_wheels) AS string_length
FROM
  cars.car_info;

According to these results, some instances of the 4wd string have four characters instead of the expected three. In that case, you can use the TRIM function to remove leading and trailing spaces from every value in the drive_wheels column:

UPDATE 
  cars.car_info
SET
  drive_wheels = TRIM(drive_wheels);

Then, you run the SELECT DISTINCT statement again to ensure that there are only three distinct values in the drive_wheels column:

SELECT
  DISTINCT drive_wheels
FROM
  cars.car_info;

Now there should be only three unique values in this column, which means your data is clean, consistent, and ready for analysis!
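To see the whole Step 5 sequence in one place, the following sketch reproduces it on a tiny local table. The assumptions: SQLite through Python's sqlite3 module instead of BigQuery, four invented rows, and one value with a trailing space. SQLite's TRIM strips spaces only, which is enough for this case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE car_info (drive_wheels TEXT)")
# Hypothetical rows; the second value has a trailing space.
conn.executemany(
    "INSERT INTO car_info VALUES (?)",
    [("4wd",), ("4wd ",), ("fwd",), ("rwd",)],
)

# Before: the padded value shows up as its own row with length 4.
before = conn.execute(
    "SELECT DISTINCT drive_wheels, LENGTH(drive_wheels) "
    "FROM car_info ORDER BY drive_wheels"
).fetchall()

# TRIM removes leading and trailing spaces; spaces inside a value are kept.
conn.execute("UPDATE car_info SET drive_wheels = TRIM(drive_wheels)")
conn.commit()

after = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT drive_wheels FROM car_info ORDER BY drive_wheels"
    )
]
print(before)
print(after)
```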

In this activity you checked your data for errors and fixed any inconsistencies. Cleaning data is an important step of the analysis process that will save you time and help ensure accuracy in the future.

Cleaning data where it lives is incredibly important for analysts. For instance, you were able to use SQL to complete multiple cleaning tasks, which allows you to clean data stored in databases. In upcoming activities, you will use your cleaning skills to prepare for analysis!

Test your knowledge on SQL queries

TOTAL POINTS 3

Question 1

Which of the following SQL functions can data analysts use to clean string variables? Select all that apply.

  • LENGTH
  • COUNTIF
  • TRIM
  • SUBSTR

Correct. Data analysts can use the SUBSTR and TRIM functions to clean string variables.

Question 2

You are working with a database of information about middle school students. The student_data table contains the name and eight-digit identification (ID) number for each student. The first four digits of each ID number correspond to the student's graduation year. For example, 20267482 indicates the student will graduate in 2026.

The identification number is stored as a string in the id_number column. How do you complete this query to return the name of all students who will graduate in 2026?

SELECT name
FROM student_data
WHERE
  • SUBSTR(id_number, 1, 4) = '2026'
  • SUBSTR = '2026' (id_number, 4, 1)
  • SUBSTR(id_number, 4, 1) = '2026'
  • SUBSTR = '2026' (id_number, 1, 4)

Correct. The SUBSTR() statement is SUBSTR(id_number, 1, 4) = '2026'. This function instructs the database to return four characters of each student ID, starting with the first character. It will only retrieve data about students who will graduate in 2026.
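You can try the completed query on a few rows to see how the start position and length arguments behave. This is a sketch with assumptions: the student names and the second and third ID numbers are invented, and SQLite (via Python's sqlite3 module) stands in for the course database. SUBSTR positions start at 1 in both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student_data (name TEXT, id_number TEXT)")
# Hypothetical students; only the first ID comes from the question text.
conn.executemany(
    "INSERT INTO student_data VALUES (?, ?)",
    [("Ana", "20267482"), ("Ben", "20251234"), ("Cy", "20260001")],
)

# SUBSTR(id_number, 1, 4): start at character 1 and take 4 characters.
class_of_2026 = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM student_data "
        "WHERE SUBSTR(id_number, 1, 4) = '2026' "
        "ORDER BY name"
    )
]
print(class_of_2026)  # ['Ana', 'Cy']
```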

Question 3

A data analyst wants to confirm that all of the text strings in a table are the correct length. How would they complete the following query to return any routes greater than 10 characters long?

SELECT
  route
FROM
  US_roads_data
WHERE
  • LENGTH = (route) < 10
  • LENGTH = (route) > 10
  • LENGTH(route) < 10
  • LENGTH(route) > 10

Correct. The LENGTH statement is LENGTH(route) > 10. This function instructs the database to return any routes that are greater than 10 characters long.
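The boundary is worth testing, since > 10 excludes a route that is exactly 10 characters long. The sketch below uses made-up route names and a local SQLite table created with Python's sqlite3 module. Only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE US_roads_data (route TEXT)")
# Hypothetical routes with lengths 4, 10, 11, and 11.
conn.executemany(
    "INSERT INTO US_roads_data VALUES (?)",
    [("I-95",), ("Route 1234",), ("US Route 66",), ("Highway 101",)],
)

# Strictly greater than 10, so the 10-character route is not returned.
long_routes = [
    row[0]
    for row in conn.execute(
        "SELECT route FROM US_roads_data WHERE LENGTH(route) > 10 ORDER BY route"
    )
]
print(long_routes)  # ['Highway 101', 'US Route 66']
```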