5.4.3.The data validation process - sj50179/Google-Data-Analytics-Professional-Certificate GitHub Wiki

Data validation process

  • Checking and rechecking the quality of your data so that it is complete, accurate, secure and consistent

Question

Fill in the blank: The data validation process is a form of data _____.

  • formatting
  • transformation
  • reporting
  • cleaning

Correct. The data validation process is a form of data cleaning. During this process, data analysts check the quality of their data to make sure it’s complete, accurate, secure, and consistent.

Types of data validation

The following table describes the purpose, examples, and limitations of six types of data validation. The first five are validation types associated with the data (type, range, constraint, consistency, and structure) and the sixth type focuses on the validation of application code used to accept data from user input.

As a junior data analyst, you might not perform all of these validations. But you could ask if and how the data was validated before you begin working with a dataset. Data validation helps to ensure the integrity of data. It also gives you confidence that the data you are using is clean. The following list outlines six types of data validation and the purpose of each, and includes examples and limitations.

  1. Data Type

    • Purpose: Check that the data matches the data type defined for a field.
    • Example: Data values for school grades 1-12 must be a numeric data type.
    • Limitations: The data value 13 would pass the data type validation but would be an unacceptable value. For this case, data range validation is also needed.
  2. Data Range

    • Purpose: Check that the data falls within an acceptable range of values defined for the field.
    • Example: Data values for school grades should be values between 1 and 12.
    • Limitations: The data value 11.5 would be in the data range and would also pass as a numeric data type. But, it would be unacceptable because there aren't half grades. For this case, data constraint validation is also needed.
  3. Data Constraints

    • Purpose: Check that the data meets certain conditions or criteria for a field. This includes the type of data entered as well as other attributes of the field, such as number of characters.
    • Example: Content constraint: Data values for school grades 1-12 must be whole numbers.
    • Limitations: The data value 13 is a whole number and would pass the content constraint validation. But, it would be unacceptable since 13 isn’t a recognized school grade. For this case, data range validation is also needed.
  4. Data Consistency

    • Purpose: Check that the data makes sense in the context of other related data.
    • Example: Data values for product shipping dates can’t be earlier than product production dates.
    • Limitations: Data might be consistent but still incorrect or inaccurate. A shipping date could be later than a production date and still be wrong.
  5. Data Structure

    • Purpose: Check that the data follows or conforms to a set structure.
    • Example: Web pages must follow a prescribed structure to be displayed properly.
    • Limitations: A data structure might be correct with the data still incorrect or inaccurate. Content on a web page could be displayed properly and still contain the wrong information.
  6. Code Validation

    • Purpose: Check that the application code systematically performs any of the previously mentioned validations during user data input.
    • Example: Common problems discovered during code validation include: more than one data type allowed, data range checking not done, or ending of text strings not well defined.
    • Limitations: Code validation might not validate all possible variations with data input.

    Process Phase

    Step X: Cleaning data

    • Is each variable one column?
    • Is each observation one row?

    This example takes a high-level data analysis task like cleaning data and breaks it down into more specific steps. Ultimately, you get to decide how detailed you want your checklist to be and which specific steps to include. This is meant to be a useful tool for you, so you can customize it however you want!

    You might not know how to break down every phase of the data analysis process. Here are a few questions that you can consider as you think about it:

    • What are the high-level steps you need to take? How can the phases of analysis help you organize the whole process?
    • What specific details are necessary to complete these high-level steps?
    • How can you simplify each step as you go? Are there any shortcuts that can help you?
    • Are there any steps you might be most at risk of forgetting? If so, how can your checklist remind you to complete these steps?

    Test your knowledge on data validation

    TOTAL POINTS 3

    Question 1

    What are the goals of checking and rechecking the quality of your data during data validation? Select all that apply.

    • Data is complete and accurate
    • Data is sorted and filtered
    • Data is secure
    • Data is consistent

    Correct. Checking and rechecking the quality of your data during data validation process helps ensure the data is complete, accurate, secure and consistent.

    Question 2

    You’re analyzing patient data for a health care company. During the data-validation process, you notice that the first date of service for some of the patients is later than the most recent date of service. Which type of data-validation check are you completing?

    • Data range
    • Data structure
    • Data type
    • Data consistency

    Correct. This is a check for data consistency. During a data consistency check, you confirm that the data makes sense in the context of other related data.

    Question 3

    During analysis, you complete a data-validation check for errors in customer identification (ID) numbers. Customer IDs must be eight characters and can contain numbers only. Which of the following customer ID errors will a data-type check help you identify?

    • IDs with more than eight characters
    • IDs that are repeated
    • IDs with text
    • IDs in the wrong column

    Correct. Completing a data-type check will help you identify customer IDs that contain text. The data type for IDs should be numeric only.

    Using SQL with temporary tables

    Temporary table

    • A database table that is created and exists temporarily on a database server

    What are temp tables?

    As data calculations become more complicated, there are many components to keep track of. This is similar to keeping track of tasks in daily life. Some people use sticky notes while others use checklists. In data science, a temporary table is just like a sticky note.

    Temporary tables, or temp tables, store subsets of data from standard data tables for a certain period of time. When you end your SQL database session, they are automatically deleted. Temp tables allow you to run calculations in temporary data tables without needing to make modifications to the primary tables in your database.

    Question

    A data analyst has a large number of sales records in a table. They want to perform calculations on a small subset of the table. Rather than filtering the data over and over, what should they do?

    • Use an alternate table
    • Use a copy of the table
    • Use a temporary table
    • Use a backup table

    Correct. They should use a temporary table. A temporary table is created and exists for a short time on a database server.

    The WITH clause is a type of temporary table that you can query from multiple times

    When you use temporary tables, you make your own work more efficient. Naming and using temp tables can help you deal with a lot of data in a more streamlined way, so you don't get lost repeating query after query with the same code that you could just include in a temp table. And here's another bonus to using temp tables: they can help your fellow team members too. With temp tables your code is usually less complicated and easier to read and understand.

    How to create temporary tables:

    • WITH clauses
    • SELECT INTO statements
    • CREATE TABLE statements
    • CREATE TEMP TABLE statements

    Working with temporary tables

    Temporary tables are exactly what they sound like—temporary tables in a SQL database that aren’t stored permanently. In this reading, you will learn the methods to create temporary tables using SQL commands. You will also learn a few best practices to follow when working with temporary tables.

    A quick refresher on temporary tables

    • They are automatically deleted from the database when you end your SQL session.
    • They can be used as a holding area for storing values if you are making a series of calculations. This is sometimes referred to as pre-processing of the data.
    • They can collect the results of multiple, separate queries. This is sometimes referred to as data staging. Staging is useful if you need to perform a query on the collected data or merge the collected data.
    • They can store a filtered subset of the database. You don’t need to select and filter the data each time you work with it. In addition, using fewer SQL commands helps to keep your data clean.

    It is important to point out that each database has its own unique set of commands to create and manage temporary tables. We have been working with BigQuery, so we will focus on the commands that work well in that environment. The rest of this reading will go over the ways to create temporary tables, primarily in BigQuery.

    Temporary table creation in BigQuery

    Temporary tables can be created using different clauses. In BigQuery, the WITH clause can be used to create a temporary table. The general syntax for this method is as follows:

    WITH new_table_data AS 
        (
        SELECT *
        FROM Existing_table
        WHERE Tripduration >= 60
        )
    

    Breaking down this query a bit, notice the following:

    • The statement begins with the WITH clause followed by the name of the new temporary table you want to create
    • The AS clause appears after the name of the new table. This clause instructs the database to put all of the data identified in the next part of the statement into the new table.
    • The opening parenthesis after the AS clause creates the subquery that filters the data from an existing table. The subquery is a regular SELECT statement along with a WHERE clause to specify the data to be filtered.
    • The closing parenthesis ends the subquery created by the AS clause.

    When the database executes this query, it will first complete the subquery and assign the values that result from that subquery to “new_table_data,” which is the temporary table. You can then run multiple queries on this filtered data without having to filter the data every time.

    Temporary table creation in other databases (not supported in BigQuery)

    The following method isn’t supported in BigQuery, but most other versions of SQL databases support it, including SQL Server and mySQL. Using SELECT and INTO, you can create a temporary table based on conditions defined by a WHERE clause to locate the information you need for the temporary table. The general syntax for this method is as follows:

    SELECT *
    INTO AfricaSales
    FROM GlobalSales
    WHERE Region = "Africa"
    

    This SELECT statement uses the standard clauses like FROM and WHERE, but the INTO clause tells the database to store the data that is being requested in a new temporary table named, in this case, “AfricaSales.”

    User-managed temporary table creation

    So far, we have explored ways of creating temporary tables that the database is responsible for managing. But, you can also create temporary tables that you can manage as a user. As an analyst, you might decide to create a temporary table for your analysis that you can manage yourself. You would use the CREATE TABLE statement to create this kind of temporary table. After you have finished working with the table, you would then delete or drop it from the database at the end of your session.

    Note: BigQuery uses CREATE TEMP TABLE instead of CREATE TABLE, but the general syntax is the same.

    CREATE TABLE table_name (
        column1 datatype,
        column2 datatype,
        column3 datatype,
        ....
    )
    

    After you have completed working with your temporary table, you can remove the table from the database using the DROP TABLE clause. The general syntax is as follows:

    DROP TABLE table_name
    

    Best practices when working with temporary tables

    • Global vs. local temporary tables: Global temporary tables are made available to all database users and are deleted when all connections that use them have closed. Local temporary tables are made available only to the user whose query or connection established the temporary table. You will most likely be working with local temporary tables. If you have created a local temporary table and are the only person using it, you can drop the temporary table after you are done using it.
    • Dropping temporary tables after use: Dropping a temporary table is a little different from deleting a temporary table. Dropping a temporary table not only removes the information contained in the rows of the table, but removes the table variable definitions (columns) themselves. Deleting a temporary table removes the rows of the table but leaves the table definition and columns ready to be used again. Although local temporary tables are dropped after you end your SQL session, it may not happen immediately. If a lot of processing is happening in the database, dropping your temporary tables after using them is a good practice to keep the database running smoothly.

    For more information

    Test your knowledge on using SQL with temporary tables

    TOTAL POINTS 3

    Question 1

    When are temporary tables automatically deleted?

    • After running a report from the table
    • After ending the session in a SQL database
    • After running a query in your SQL database
    • After completing all calculations in the table

    Correct. Temporary tables are automatically deleted after ending the session in a SQL database.

    Question 2

    The following SQL query contains information about bike trips:

    WITH 1_hr_trips AS (
        SELECT *
        FROM bigquery-public-data.new_york.citibike_trips
        WHERE tripduration = 60
        )
    

    What data will appear in the temporary table created through this query?

    • Bike trips equal to or more than one hour
    • A random subset of bike trips
    • The total number of bike trips
    • Bike trips that lasted exactly 60 minutes

    Correct. This temporary table will show bike trips that lasted exactly 60 minutes. The name of the table is “1_hr_trips” and the query includes the condition that trips in the table equal one hour.

    Question 3

    What benefit does a CREATE TABLE statement add to a temporary table?

    • Access for anyone to use the table
    • Metadata about the data in the table
    • Automated calculations
    • Specific naming conventions

    Correct. A CREATE TABLE statement provides access for anyone to use the temporary table. The SELECT INTO statement is better suited for one person.