5.4.4. The data validation process - quanganh2001/Google-Data-Analytics-Professional-Certificate-Coursera GitHub Wiki

Types of data validation

This reading describes the purpose, examples, and limitations of six types of data validation. The first five are validation types associated with the data (type, range, constraint, consistency, and structure) and the sixth type focuses on the validation of application code used to accept data from user input.

As a junior data analyst, you might not perform all of these validations. But you could ask if and how the data was validated before you begin working with a dataset. Data validation helps to ensure the integrity of data. It also gives you confidence that the data you are using is clean. The following list outlines six types of data validation and the purpose of each, and includes examples and limitations.

Data type validation

  • Purpose: Check that the data matches the data type defined for a field.
  • Example: Data values for school grades 1-12 must be a numeric data type.
  • Limitations: The data value 13 would pass the data type validation but would be an unacceptable value. For this case, data range validation is also needed.

Data range validation

  • Purpose: Check that the data falls within an acceptable range of values defined for the field.
  • Example: Data values for school grades should be between 1 and 12.
  • Limitations: The data value 11.5 would be in the data range and would also pass as a numeric data type. But it would be unacceptable because there aren't half grades. For this case, data constraint validation is also needed.

Data constraint validation

  • Purpose: Check that the data meets certain conditions or criteria for a field. This includes the type of data entered as well as other attributes of the field, such as the number of characters.
  • Example: Content constraint: data values for school grades 1-12 must be whole numbers.
  • Limitations: The data value 13 is a whole number and would pass the content constraint validation. But it would be unacceptable because 13 isn't a recognized school grade. For this case, data range validation is also needed.

Data consistency validation

  • Purpose: Check that the data makes sense in the context of other related data.
  • Example: Data values for product shipping dates can't be earlier than product production dates.
  • Limitations: Data might be consistent but still incorrect or inaccurate. A shipping date could be later than a production date and still be wrong.

Data structure validation

  • Purpose: Check that the data follows or conforms to a set structure.
  • Example: Web pages must follow a prescribed structure to be displayed properly.
  • Limitations: A data structure might be correct while the data itself is incorrect or inaccurate. Content on a web page could be displayed properly and still contain the wrong information.

Code validation

  • Purpose: Check that the application code systematically performs any of the previously mentioned validations during user data input.
  • Example: Common problems discovered during code validation include more than one data type allowed, data range checking not done, or the endings of text strings not well defined.
  • Limitations: Code validation might not validate all possible variations of data input.
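To make the first four checks concrete, here is a minimal Python sketch using the school-grade and shipping-date examples from the reading. The function names and rules are illustrative only, not part of any real validation library.

```python
# Minimal sketches of data type, range, constraint, and consistency checks,
# using the illustrative rules from the reading (not a real validation library).
from datetime import date

def validate_grade(value):
    """Apply type, range, and constraint checks to a school grade."""
    if not isinstance(value, (int, float)):   # data type: must be numeric
        return False
    if not 1 <= value <= 12:                  # data range: grades 1-12 only
        return False
    if value != int(value):                   # data constraint: whole numbers only
        return False
    return True

def validate_dates(produced, shipped):
    """Data consistency check: shipping can't precede production."""
    return shipped >= produced
```

Note how each check alone is insufficient: 13 passes the type and constraint checks but fails the range check, while 11.5 passes the type and range checks but fails the constraint check.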

Hands-On Activity: From spreadsheets to BigQuery

Question 1

Activity overview


By now, you have worked with data using both spreadsheets and SQL. These tools operate very differently: In spreadsheets, you are able to observe and interact with data directly; with SQL, you interact with data through queries to the database. In this activity, you will use spreadsheets to clean your data before importing it into SQL for analysis.

In this scenario, you have been working for a national store chain as a data analyst. Management is interested in the amount of inventory being kept in storage at regional sites. Your supervisor has asked you to perform an analysis on inventory and sales data to make recommendations for changes to inventory management practices. You have been provided with three datasets containing information about inventory, products, and sales.

By the time you complete this activity, you will be able to combine tools to successfully analyze data. Switching between spreadsheets and SQL can be challenging because they’re so different, but as you become more comfortable with both tools, moving between them will get easier. This is important for tackling larger and more complex projects in your career as a data analyst.

To get started, first download the three store data CSV files: inventory, products, and sales.

Click the link to each CSV file to create a copy. If you don’t have a Google account, you may download the data directly from the attachments below.

Link to data: inventory, sales and products

OR

Download data:

  1. Inventory
  2. Sales
  3. Products

Cleaning the data


Before you upload these files to SQL, you can import them into a spreadsheet in Sheets to get comfortable with the data before you start analyzing it in BigQuery. This might not always be possible with larger datasets you encounter in the future, but you should explore as much as possible within this exercise! You can also use this step to perform some data-cleaning tasks.

Step 1: Import the data

If you’re using Google Sheets, you’ll first need to import the data files into your spreadsheet. Open Sheets and navigate to the File menu, then select Import from the dropdown list.


Select the first file and upload it to the spreadsheet. Choose Replace spreadsheet to insert it into the current sheet.


Then return to the Import menu under the File menu and upload the next file. This time, however, select Insert new sheet(s) to create new worksheet tabs with this file.


Repeat these steps until you have all three files added to your spreadsheet.

Step 2: Inspect the data

Applying filters in spreadsheets is a good way to identify any data that needs to be cleaned. You’ll inspect the Inventory sheet now.

Navigate to the Inventory sheet and click any cell in the spreadsheet. Open the Data dropdown menu and select Create a filter.


Now you can click the filter icons for each column to inspect the values. Start with the StoreID column. As you scroll through, you’ll notice that there do not appear to be any blanks or incorrectly entered values. However, if you inspect the StoreName column, you’ll find a blank.

Deselect all of the values except for the blank.


This should return one row with a missing entry under the StoreName column.


You might be able to find the missing value and input it correctly using the filters. Clear the StoreName filter and use the StoreID column filter to find other stores with the ID 21791.


It appears that the other stores with this ID are all Dollar Tree, so it’s probably safe to input that as the StoreName value in the blank cell.

Inspect the other columns in this sheet, then return to the Data menu to turn off the filters. Next, navigate to the Products sheet.

As with the last sheet, you can repeat this process to inspect the Products data. Go to the Data menu and select Create a filter.

Check the ProductID column. You'll find an NA value in this column, even though this column should contain only numeric values. In this case, you’ve checked in with the dataset owner, who said you can delete this row because it was input by mistake and does not belong in this dataset. Turn off the filter and move on to the next step.
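In the activity you perform these two cleaning steps with filters in Sheets, but the same logic can be scripted. Here is a hypothetical pandas sketch using the column names from the activity (StoreID, StoreName, ProductID) on tiny made-up data:

```python
# Hypothetical sketch of the two cleaning steps: fill a blank StoreName
# from other rows with the same StoreID, and drop the mistaken "NA" row.
# The data below is made up for illustration.
import pandas as pd

inventory = pd.DataFrame({
    "StoreID":   [21791, 21791, 21791],
    "StoreName": ["Dollar Tree", None, "Dollar Tree"],
})
products = pd.DataFrame({"ProductID": ["101", "102", "NA"]})

# Fill a blank StoreName from other rows that share the same StoreID.
inventory["StoreName"] = (
    inventory.groupby("StoreID")["StoreName"]
             .transform(lambda s: s.ffill().bfill())
)

# Drop the mistakenly entered row whose ProductID is the text "NA".
products = products[products["ProductID"] != "NA"]
```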

From spreadsheets to BigQuery


Now that you have checked out your data in a tool that lets you observe and interact with your data directly, it’s time to transition to using SQL. With SQL, you can only observe the results of your query, which requires a different mindset than spreadsheets — but SQL is very powerful when you’re working with databases and larger datasets!

Step 1: Create a dataset and custom table

Similar to previous activities, you will need to create a dataset and custom table to house this data before you can inspect it in BigQuery.

  1. From the Explorer pane in your BigQuery console, click the three vertical dots next to your project space and select Create dataset.


  2. Name the new dataset sales and leave the other settings as their defaults. Make sure the Location type is set to Multi-region (US) and, in the Advanced section, that the default Encryption is set to Google-managed encryption key.


  3. Once you have checked all the settings, click the blue CREATE DATASET button. The new dataset should appear in your Explorer pane.
  4. Open the new dataset info window by clicking the sales item under your project name. On the right-hand side of the window, you will see a row of tab commands.


  5. Click the first command tab, titled CREATE TABLE. This will open a Create table pop-up window. Select the Create table from > UPLOAD option and import your sales data. Name the table sales_info, select Auto detect under Schema, and leave the rest of the options as their defaults. Once you have checked all the settings, click the blue CREATE TABLE button.


  6. Click the new table sales_info in the Explorer pane. A data info window will open; click the SCHEMA and PREVIEW tabs to get an overview of the metadata and attributes of your data.


Step 2: Inspect the data

Next, you will need to inspect the data to determine how much of it will be useful for your final analysis.

  • NOTE: In the FROM clause of each query below, the Table ID must begin with your own project name, followed by a period, the dataset name, another period, and finally the table name. The project name is unique to each learner. You can also locate and copy the full Table ID by clicking the DETAILS tab in your sales_info table window; once copied, paste it after the FROM clause before running the query.
  1. Ensure that the import was successful by running this query:

SELECT
  *
FROM
  `personal-project-name.sales.sales_info`
LIMIT 10;

Your results should appear like this:


  2. Next, inspect the data to find out how many years of sales data it includes. You can use the MIN and MAX functions to get the oldest and newest dates:

SELECT
  MIN(Date) AS min_date,
  MAX(Date) AS max_date
FROM
  `personal-project-name.sales.sales_info`;

Now you know what years this data covers. In this case, you’ll want to group the data by month because management wants to see year-over-year changes to inventory by month.

  3. Click COMPOSE NEW QUERY and run the following query, which will return the total quantity sold for each ProductId, grouped by the month and year it was sold:
SELECT
  EXTRACT(YEAR FROM date) AS Year,
  EXTRACT(MONTH FROM date) AS Month,
  ProductId,
  ROUND(MAX(UnitPrice),2) AS UnitPrice,
  SUM(Quantity) AS UnitsSold
FROM
  `personal-project-name.sales.sales_info`
GROUP BY
  Year,
  Month,
  ProductId
ORDER BY
  Year,
  Month,
  ProductId;
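If you ever need to reproduce this aggregation outside BigQuery, a rough pandas equivalent of the query above might look like this. The DataFrame here is tiny made-up sample data, assuming the same column names (Date, ProductId, UnitPrice, Quantity):

```python
# Rough pandas equivalent of the GROUP BY query above, on made-up sample data.
import pandas as pd

sales = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-05", "2017-01-20", "2018-02-03"]),
    "ProductId": [1, 1, 1],
    "UnitPrice": [2.499, 2.501, 2.75],
    "Quantity": [10, 5, 7],
})

monthly = (
    sales.assign(Year=sales["Date"].dt.year, Month=sales["Date"].dt.month)
         .groupby(["Year", "Month", "ProductId"], as_index=False)
         .agg(UnitPrice=("UnitPrice", "max"), UnitsSold=("Quantity", "sum"))
         .round({"UnitPrice": 2})  # mirrors ROUND(MAX(UnitPrice), 2)
)
```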

Step 3: Export results to spreadsheet

The subset of data you queried contains fewer than 50,000 rows. This means it can easily be exported to a spreadsheet if your stakeholder requests the data in this form, or used as the source for a visualization. First, however, you’ll need to save your results.

  1. After running the query, click SAVE RESULTS. A pop-up menu will appear with options for the export file type. Select CSV (Google Drive). Once it is saved, open the new CSV file in Drive.


There should be about 47,000 rows. Right-click on the sheet tab and rename the sheet Sales.

  2. Next, if you’re using Sheets, you can open these results by selecting the File menu and clicking Import.

This will open a pop-up menu. Click Upload and select the inventory CSV file.


Select Insert new sheet(s) to add this data as a worksheet to your spreadsheet and choose Comma for Separator type.


  3. Repeat these steps for the products CSV file.

Confirmation and reflection


What is the earliest year included in this dataset?

A. 2017

B. 2018

C. 2019

D. 2020

The correct answer is A. 2017. Explain: 2017 is the earliest year included in this dataset. To find the date range of this dataset, you used MIN and MAX functions in SQL to determine the earliest and latest years. You were able to pull this observation without actually scrolling through all of the data manually, which is a key skill when working with larger datasets.

Question 2

In the text box below, write 2-3 sentences (40-60 words) in response to each of the following questions:

  • Why is being able to make use of multiple analysis tools useful for some projects?
  • How is working with data in spreadsheets and with SQL different? How are they similar?

Explain: Congratulations on completing this hands-on activity! In this activity, you previewed data in BigQuery to find a useful subset to analyze, imported it to spreadsheets, and analyzed your data! A good response would include that using multiple tools allows you to be more flexible.

Being able to use SQL to create a subset of data to work with in spreadsheets like you did today gives you more options for how you approach your analysis. In upcoming activities, you will have more opportunities to analyze data from beginning to end!

Data validation

Ice cream club members get free ice cream for their birthday. After running a query to see how many members got their free ice cream last year, Ivan noticed some null values.

Review the data problem and identify the best solution by selecting the type of data validation from the list.

  1. The head of marketing explained to Ivan that members can visit as many locations as they want on their birthday. Ivan realized this meant the field should not be defined as a yes/no binary. Instead, it is better represented as a numeric value, up to the actual number of locations visited.


  2. To track all sales, Ivan assigned a primary key to every item on the menu from 1 to 60. After running a query to find the most popular menu items, he wondered why he couldn’t find any beverage sales.

After reviewing the table, he realized he forgot to assign values to the five beverages on the menu, so he added those to the menu table, bringing the range to 1 to 65.


  3. Ivan ran a query to discover the top neighborhoods where ice cream club members live.

After running the query, he noticed that several zip code results contained letters—but in his country, zip codes can only contain numeric values.


  4. Ivan wanted to measure how long ice cream pint inventory sits in company freezers. He ran a query to pull pint production dates and shipping dates. Next, he noticed that some production dates came after their shipping date. This told Ivan there must be an error in the data.


  5. After working with an advertising agency, the ice cream shop received five different jingle options to use in their new commercial. Ivan stored these files in his database as MP3 files.

When he ran a query for the jingles, he got null values. He went back into the database to see why the MP3 files wouldn’t show up.


  6. After clearly defining that zip code values must have only numeric values, Ivan ran his query again to find out where club members live. But he noticed some results had six or more digits.

Zip codes in his country should only have five numeric digits. So he went back to verify that the endings of strings are better defined within the table.

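The zip code rules in the scenarios above (numeric characters only, with a well-defined string ending at exactly five digits) can be captured in a single regular expression. A minimal Python sketch, assuming the five-digit rule stated in the exercise:

```python
# Minimal sketch of the zip code checks: numeric only, exactly five digits.
# The five-digit rule comes from the exercise; adapt it for other countries.
import re

ZIP_PATTERN = re.compile(r"^\d{5}$")  # the $ anchor defines where the string ends

def is_valid_zip(zip_code: str) -> bool:
    return bool(ZIP_PATTERN.match(zip_code))
```

Without the `$` anchor the string ending would be undefined, and a six-digit value like "902101" would slip through.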

Learning Log: Finish your data analysis checklist

Overview


In a previous learning log, you started creating a high-level data analysis checklist. Now, you’ll complete that checklist with more detailed steps. By the time you complete this activity, you will have an in-depth data analysis checklist to help keep you organized during analysis projects. As a data analyst, staying organized will ensure that you don’t make mistakes or miss any steps, saving you time and effort!

Complete Your Checklist


The high-level outline of your checklist was based on the phases of the data analysis process: Ask, Prepare, Process, Analyze, Share, and Act.

Now, you can fill in each step of your high-level checklist with detailed substeps. For example, for the Process phase, you probably have a step for cleaning data. You might break this step down to include detailed substeps such as checking that each variable is one column and that each observation is one row.

You could create a step in your checklist like this:

Process Phase

Step X: Cleaning data

  • Is each variable one column?
  • Is each observation one row?

This example takes a high-level data analysis task like cleaning data and breaks it down into more specific steps. Ultimately, you get to decide how detailed you want your checklist to be and which specific steps to include. This is meant to be a useful tool for you, so you can customize it however you want!
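If you wanted to, substeps like the two above could even be automated. Here is a hypothetical pandas sketch of those tidy-data checks on made-up inventory-style data; in practice, checklist items like these are usually verified by hand.

```python
# Hypothetical automated versions of the two checklist substeps above,
# run against a small made-up DataFrame for illustration.
import pandas as pd

def check_tidy(df: pd.DataFrame) -> bool:
    """Rough checks: each variable is one column, each observation is one row."""
    one_column_per_variable = df.columns.is_unique       # no repeated column names
    one_row_per_observation = not df.duplicated().any()  # no repeated rows
    return one_column_per_variable and one_row_per_observation

tidy = pd.DataFrame({"StoreID": [100, 200], "UnitsSold": [5, 7]})
messy = pd.DataFrame([[100, 5], [100, 5]], columns=["StoreID", "UnitsSold"])
```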

You might not know how to break down every phase of the data analysis process. Here are a few questions that you can consider as you think about it:

  • What are the high-level steps you need to take? How can the phases of analysis help you organize the whole process?
  • What specific details are necessary to complete these high-level steps?
  • How can you simplify each step as you go? Are there any shortcuts that can help you?
  • Are there any steps you might be most at risk of forgetting? If so, how can your checklist remind you to complete these steps?

You can always adapt and add to your checklist as you have new ideas or as your project needs change.

In the learning log template linked below, copy and paste your original checklist and complete it with the specific steps and tasks you want to add.

Access your learning log

To use the template for this course item, click the link below and select “Use Template.”

Link to learning log template: Finish your data analysis checklist

OR

If you don’t have a Google account, you can download the template directly from the attachment below.

Learning Log Template_ Finish your data analysis checklist

Reflection

When you’re finished, reflect on the process of creating your checklist and its possible uses. Please write 3-5 sentences (60-100 words) in response to each question.

  • Review your checklist and compare it to the tasks and activities related to this course. How is your checklist similar to or different from the organization of the course?
  • How does creating a checklist help you develop your data analysis skills?
  • How will your checklist help you analyze your own data?
  • What else could you use your checklist for?

When you’ve finished your entry in the learning log template, make sure to save the document so your response is somewhere accessible. This will help you continue applying data analysis to your everyday life. You will also be able to track your progress and growth as a data analyst.

Test your knowledge on data validation

Question 1

What are the goals of checking and rechecking the quality of your data during data validation? Select all that apply.

  • Data is sorted and filtered
  • Data is complete and accurate
  • Data is secure
  • Data is consistent

Explain: Checking and rechecking the quality of your data during the data validation process helps ensure the data is complete, accurate, secure, and consistent.

Question 2

You’re analyzing patient data for a health care company. During the data-validation process, you notice that the first date of service for some of the patients is later than the most recent date of service. Which type of data-validation check are you completing?

A. Data range

B. Data type

C. Data consistency

D. Data structure

The correct answer is C. Data consistency. Explain: This is a check for data consistency. During a data consistency check, you confirm that the data makes sense in the context of other related data.

Question 3

During analysis, you complete a data-validation check for errors in customer identification (ID) numbers. Customer IDs must be eight characters and can contain numbers only. Which of the following customer ID errors will a data-type check help you identify?

A. IDs that are repeated

B. IDs with more than eight characters

C. IDs in the wrong column

D. IDs with text

The correct answer is D. IDs with text. Explain: Completing a data-type check will help you identify customer IDs that contain text. The data type for IDs should be numeric only.