4.2.1 Data cleaning is a must & Begin cleaning data

Data cleaning is a must

What is dirty data?

Earlier, we discussed that dirty data is data that is incomplete, incorrect, or irrelevant to the problem you are trying to solve. This reading summarizes:

  • Types of dirty data you may encounter
  • What may have caused the data to become dirty
  • How dirty data is harmful to businesses

Types of dirty data

Duplicate data

| Description | Possible causes | Potential harm to businesses |
| --- | --- | --- |
| Any data record that shows up more than once | Manual data entry, batch data imports, or data migration | Skewed metrics or analyses, inflated or inaccurate counts or predictions, or confusion during data retrieval |
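
To make this concrete, here is a minimal pandas sketch (the table and column names are hypothetical, not from the course materials) showing how duplicate records can be flagged and removed:

```python
import pandas as pd

# Hypothetical customer records; the second row repeats the first exactly.
df = pd.DataFrame({
    "customer_id": [101, 101, 102],
    "name": ["Ada", "Ada", "Grace"],
})

print(df.duplicated())           # True for the second row (an exact repeat)
deduped = df.drop_duplicates()   # keeps the first occurrence of each record
print(deduped)
```

Spreadsheets offer the same capability through the built-in remove duplicates tool mentioned later in this section.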

Outdated data

| Description | Possible causes | Potential harm to businesses |
| --- | --- | --- |
| Any data that is old and should be replaced with newer, more accurate information | People changing roles or companies, or software and systems becoming obsolete | Inaccurate insights, decision-making, and analytics |

Incomplete data

| Description | Possible causes | Potential harm to businesses |
| --- | --- | --- |
| Any data that is missing important fields | Improper data collection or incorrect data entry | Decreased productivity, inaccurate insights, or inability to complete essential services |
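
Incomplete records are easy to surface programmatically. A minimal pandas sketch, again with made-up column names and values:

```python
import pandas as pd

# Hypothetical order records with a missing email and a missing amount.
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "email": ["a@example.com", None, "c@example.com"],
    "amount": [9.99, 14.50, None],
})

print(df.isna().sum())            # count of missing values per column
print(df[df.isna().any(axis=1)])  # rows missing at least one field
```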

Incorrect/inaccurate data

| Description | Possible causes | Potential harm to businesses |
| --- | --- | --- |
| Any data that is complete but inaccurate | Human error during data input, fake information, or mock data | Inaccurate insights or decision-making based on bad information, resulting in revenue loss |
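
Inaccurate values are harder to detect automatically than duplicates or gaps, but simple plausibility checks can flag candidates for review. A minimal sketch, assuming a hypothetical age column:

```python
import pandas as pd

# Hypothetical survey data; an age of 212 is almost certainly a typo.
df = pd.DataFrame({"respondent": ["r1", "r2", "r3"], "age": [34, 212, 27]})

# Flag values outside a plausible range for manual review.
print(df[~df["age"].between(0, 120)])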

Inconsistent data

| Description | Possible causes | Potential harm to businesses |
| --- | --- | --- |
| Any data that uses different formats to represent the same thing | Data stored incorrectly or errors inserted during data transfer | Contradictory data points leading to confusion or inability to classify or segment customers |
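
One common fix for inconsistent data is normalizing every value to a single representation before counting or segmenting. A minimal sketch with hypothetical state values:

```python
import pandas as pd

# The same state recorded in three different formats.
df = pd.DataFrame({"state": ["NY", "ny", "New York"]})

# Normalize each value to one canonical representation.
df["state"] = df["state"].str.strip().str.lower().map(
    {"ny": "NY", "new york": "NY"}
)
print(df["state"].value_counts())   # NY    3
```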

Business impact of dirty data

For further reading on the business impact of dirty data, search for “dirty data” in your preferred search engine; numerous articles cover the topic. Here are a few impacts cited for specific industries in a previous search:

  • Banking: Inaccuracies cost companies between 15% and 25% of revenue (source).
  • Digital commerce: Up to 25% of B2B database contacts contain inaccuracies (source).
  • Marketing and sales: 8 out of 10 companies have said that dirty data hinders sales campaigns (source).
  • Healthcare: Duplicate records can make up 10%, and in some cases up to 20%, of a hospital’s electronic health records (source).

Test your knowledge on clean versus dirty data

TOTAL POINTS 4

Question 1

Describe the difference between a null and a zero in a dataset.

  • A null represents a value of zero. A zero represents an empty cell.
  • A null indicates that a value does not exist. A zero is a numerical response.
  • A null signifies invalid data. A zero is missing data.
  • A null represents a number with no significance. A zero represents the number zero.

Correct. A null indicates that a value does not exist. A zero is a numerical response.
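
The distinction matters for calculations. A small pandas illustration (the values are made up): nulls are excluded from counts and aggregates, while zeros participate in them:

```python
import pandas as pd

# A zero is a real measurement; a null means no value was recorded.
sales = pd.Series([100, 0, None], name="daily_sales")

print(sales.count())   # 2     -> the null is excluded from the count
print(sales.sum())     # 100.0 -> the zero contributes; the null does not
print(sales.mean())    # 50.0  -> averaged over the two recorded values
```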

Question 2

What are the most common processes and procedures handled by data engineers? Select all that apply.

  • Transforming data into a useful format for analysis
  • Verifying results of data analysis
  • Giving data a reliable infrastructure
  • Developing, maintaining, and testing databases and related systems

Correct. Data engineers transform data into a useful format for analysis; give it a reliable infrastructure; and develop, maintain, and test databases and related systems.

Question 3

What are the most common processes and procedures handled by data warehousing specialists? Select all that apply.

  • Ensuring data is backed up to prevent loss
  • Ensuring data is secure
  • Ensuring data is available
  • Ensuring data is properly cleaned

Correct. Data warehousing specialists are responsible for ensuring data is available, secure, and backed up to prevent loss.

Question 4

A data analyst is cleaning a dataset. They want to confirm that users entered five-digit zip codes correctly by checking the data in a certain spreadsheet column. What would be most helpful as the next step?

  • Formatting the cells in the column as number
  • Using the field length tool to specify the number of characters in each cell in the column
  • Using the MAX function to determine the maximum value in the cells in the column
  • Changing the column width to fit only five digits

Correct. Using the field length tool to specify the number of characters in each cell in the column would be the most helpful.
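
Outside of spreadsheets, the same field-length check can be expressed in pandas. A minimal sketch with hypothetical zip code values, stored as text so leading zeros are preserved:

```python
import pandas as pd

# Hypothetical zip codes; one entry is too short, one too long.
zips = pd.Series(["30309", "3030", "02139", "941035"], dtype="string")

# The spreadsheet field-length check translates to str.len() here.
print(zips[zips.str.len() != 5])   # flags "3030" and "941035"
```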

Begin cleaning data

Common data-cleaning pitfalls

In this reading, you will learn the importance of data cleaning and how to identify common mistakes. Some of the errors you might come across while cleaning your data could include:

Common mistakes to avoid

  • Not checking for spelling errors: Misspellings can be as simple as typing or input errors. Most of the time, misspellings and common grammatical errors can be detected, but it gets harder with things like names or addresses. For example, if you are working with a spreadsheet table of customer data, you might come across a customer named “John” whose name has been input incorrectly as “Jon” in some places. The spreadsheet’s spellcheck probably won’t flag this, so if you don’t double-check for spelling errors and catch this, your analysis will have mistakes in it. (A short pandas sketch after this list shows one way to surface near-duplicates like this.)
  • Forgetting to document errors: Documenting your errors can be a big time saver, as it helps you avoid those errors in the future by showing you how you resolved them. For example, you might find an error in a formula in your spreadsheet. You discover that some of the dates in one of your columns haven’t been formatted correctly. If you make a note of this fix, you can reference it the next time your formula is broken, and get a head start on troubleshooting. Documenting your errors also helps you keep track of changes in your work, so that you can backtrack if a fix didn’t work.
  • Not checking for misfielded values: A misfielded value happens when a value is entered into the wrong field. Such values might still be formatted correctly, which makes them harder to catch if you aren’t careful. For example, you might have a dataset with columns for cities and countries. These are the same type of data, so they are easy to mix up. But if you were trying to find all of the instances of Spain in the country column, and Spain had mistakenly been entered into the city column, you would miss key data points (see the sketch after this list). Making sure your data has been entered correctly is key to accurate, complete analysis.
  • Overlooking missing values: Missing values in your dataset can create errors and give you inaccurate conclusions. For example, if you were trying to get the total number of sales from the last three months, but a week of transactions was missing, your calculations would be inaccurate. As a best practice, try to keep your data as clean as possible by maintaining completeness and consistency (see the sketch after this list).
  • Only looking at a subset of the data: It is important to think about all of the relevant data when you are cleaning. This helps make sure you understand the whole story the data is telling, and that you are paying attention to all possible errors. For example, if you are working with data about bird migration patterns from different sources, but you only clean one source, you might not realize that some of the data is being repeated. This will cause problems in your analysis later on. If you want to avoid common errors like duplicates, each field of your data requires equal attention.
  • Losing track of business objectives: When you are cleaning data, you might make new and interesting discoveries about your dataset, but you don’t want those discoveries to distract you from the task at hand. For example, if you were working with weather data to find the average number of rainy days in your city, you might notice some interesting patterns about snowfall, too. That is really interesting, but it isn’t related to the question you are trying to answer right now. Being curious is great! But try not to let it distract you from the task at hand.
  • Not fixing the source of the error: Fixing the error itself is important. But if that error is actually part of a bigger problem, you need to find the source of the issue. Otherwise, you will have to keep fixing that same error over and over again. For example, imagine you have a team spreadsheet that tracks everyone’s progress. The table keeps breaking because different people are entering different values. You can keep fixing all of these problems one by one, or you can set up your table to streamline data entry so everyone is on the same page. Addressing the source of the errors in your data will save you a lot of time in the long run.
  • Not analyzing the system prior to data cleaning: If you want to clean your data and avoid future errors, you need to understand the root cause of your dirty data. Imagine you are an auto mechanic. You would find the cause of the problem before you started fixing the car, right? The same goes for data. First, figure out where the errors come from. Maybe they come from data entry mistakes, a missing spell check, inconsistent formats, or duplicates. Then, once you understand where the bad data comes from, you can control it and keep your data clean.
  • Not backing up your data prior to data cleaning: It is always good to be proactive and create a backup before you start your data clean-up. If your program crashes, or if your changes cause a problem in your dataset, you can always go back to the saved version and restore it. The simple procedure of backing up your data can save you hours of work and, most importantly, a headache.
  • Not accounting for data cleaning in your deadlines/process: All good things take time, and that includes data cleaning. It is important to keep that in mind when going through your process and looking at your deadlines. When you set aside time for data cleaning, you can give stakeholders a more accurate ETA and know when to request an adjusted one.
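
Several of the checks above can be scripted. The following pandas sketch (all names and values are hypothetical) illustrates three of the pitfalls: spotting near-duplicate spellings such as “John” versus “Jon”, catching misfielded values like a country in the city column, and counting missing values:

```python
import pandas as pd

# Hypothetical customer table with a misspelled name ("Jon"), a country
# entered in the city column, and a missing value.
df = pd.DataFrame({
    "name":    ["John", "Jon", "Maria"],
    "city":    ["Madrid", "Spain", None],
    "country": ["Spain", "Spain", "Spain"],
})

# 1. Spelling: sorting unique values places near-duplicates like
#    "John" and "Jon" next to each other for manual review.
print(sorted(df["name"].dropna().unique()))

# 2. Misfielded values: country names should not appear in the city column.
known_countries = {"Spain", "France", "Germany"}
print(df[df["city"].isin(known_countries)])

# 3. Missing values: count the gaps per column before analyzing.
print(df.isna().sum())
```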

Additional resources

Refer to the “top ten” lists for data cleaning in Microsoft Excel and Google Sheets to help you avoid the most common mistakes.

Test your knowledge on data-cleaning techniques

TOTAL POINTS 3

Question 1

What is the process of combining two or more datasets into a single dataset?

  • Data transferring
  • Data validation
  • Data merging
  • Data composition

Correct. Data merging is the process of combining two or more datasets into a single dataset.
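
For instance, in pandas two datasets with matching columns can be merged by stacking them. A minimal sketch with hypothetical monthly exports:

```python
import pandas as pd

# Two hypothetical monthly exports with identical columns.
jan = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 20.0]})
feb = pd.DataFrame({"order_id": [3], "amount": [15.0]})

# Combine the two datasets into a single dataset.
merged = pd.concat([jan, feb], ignore_index=True)
print(merged)
```

When two datasets share a key column rather than identical layouts, pd.merge joins them on that key instead.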

Question 2

Fill in the blank: In data analytics, _____ describes how well two or more datasets are able to work together.

  • alignment
  • agreement
  • suitability
  • compatibility

Correct. Compatibility describes how well two or more datasets are able to work together.

Question 3

Data analysts use spreadsheet tools such as conditional formatting, remove duplicates, and split text to columns for data analysis. These tools are not applicable to the data-cleaning process.

  • True
  • False

Correct. Spreadsheet tools such as conditional formatting, remove duplicates, and split text to columns are very helpful when cleaning data.