4.2.1. Data cleaning is a must - quanganh2001/Google-Data-Analytics-Professional-Certificate-Coursera GitHub Wiki

What is dirty data?

Earlier, we discussed that dirty data is data that is incomplete, incorrect, or irrelevant to the problem you are trying to solve. This reading summarizes:

  • Types of dirty data you may encounter
  • What may have caused the data to become dirty
  • How dirty data is harmful to businesses

Types of dirty data

Duplicate data

  • Description: Any data record that shows up more than once
  • Possible causes: Manual data entry, batch data imports, or data migration
  • Potential harm to business: Skewed metrics or analyses, inflated or inaccurate counts or predictions, or confusion during data retrieval
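
As a rough illustration of how duplicates inflate counts, the following Python sketch finds and removes repeated records. The sample records, the field names, and the `find_duplicates` and `deduplicate` helpers are hypothetical and exist only for this example. Real datasets often need fuzzier matching than an exact comparison.

```python
from collections import Counter

# Hypothetical customer records; the second and fourth rows are the same person.
records = [
    {"name": "Ana Lee", "email": "ana@example.com"},
    {"name": "Raj Patel", "email": "raj@example.com"},
    {"name": "Mia Chen", "email": "mia@example.com"},
    {"name": "Raj Patel", "email": "raj@example.com"},
]

def find_duplicates(rows, key_fields):
    """Return the key tuples that appear more than once in rows."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]

def deduplicate(rows, key_fields):
    """Keep only the first occurrence of each key, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

duplicates = find_duplicates(records, ["name", "email"])
clean = deduplicate(records, ["name", "email"])
print(len(records), len(clean), duplicates)
# prints: 4 3 [('Raj Patel', 'raj@example.com')]
```

The raw count of four customers overstates the true count of three, which is the kind of inflated metric described above.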

Outdated data

  • Description: Any data that is old and should be replaced with newer and more accurate information
  • Possible causes: People changing roles or companies, or software and systems becoming obsolete
  • Potential harm to business: Inaccurate insights, decision-making, and analytics

Incomplete data

  • Description: Any data that is missing important fields
  • Possible causes: Improper data collection or incorrect data entry
  • Potential harm to business: Decreased productivity, inaccurate insights, or inability to complete essential services
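
One possible way to spot incomplete records is to check each one against a list of required fields. In this minimal Python sketch, the `REQUIRED_FIELDS` list, the sample contacts, and the `missing_fields` helper are all assumptions made for illustration. Which fields count as "important" depends on the problem being solved.

```python
# Hypothetical required fields for a contact record (an assumption for this sketch).
REQUIRED_FIELDS = ["name", "email", "phone"]

contacts = [
    {"name": "Ana Lee", "email": "ana@example.com", "phone": "555-0100"},
    {"name": "Raj Patel", "email": "", "phone": "555-0101"},
    {"name": "Mia Chen", "email": "mia@example.com", "phone": None},
]

def missing_fields(row, required):
    """List required fields that are absent, None, or blank in row."""
    return [
        f for f in required
        if row.get(f) is None or str(row.get(f)).strip() == ""
    ]

# Map each incomplete contact to the fields it is missing.
incomplete = {}
for row in contacts:
    gaps = missing_fields(row, REQUIRED_FIELDS)
    if gaps:
        incomplete[row["name"]] = gaps

print(incomplete)
# prints: {'Raj Patel': ['email'], 'Mia Chen': ['phone']}
```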

Incorrect/inaccurate data

  • Description: Any data that is complete but inaccurate
  • Possible causes: Human error inserted during data input, fake information, or mock data
  • Potential harm to business: Inaccurate insights or decision-making based on bad information resulting in revenue loss

Inconsistent data

  • Description: Any data that uses different formats to represent the same thing
  • Possible causes: Data stored incorrectly or errors inserted during data transfer
  • Potential harm to business: Contradictory data points leading to confusion or inability to classify or segment customers
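
Inconsistent formats can often be reconciled by converting every value to one standard form. The Python sketch below does this for dates. The `KNOWN_FORMATS` list, the sample values, and the `normalize_date` helper are hypothetical. It also assumes that slash-separated dates are month/day/year, which would need to be confirmed against the real data source.

```python
from datetime import datetime

# Date formats assumed to occur in the hypothetical source data.
# "03/04/2021" is assumed to mean month/day/year.
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y"]

def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # Try the next format.
    return None

raw_dates = ["2021-03-04", "03/04/2021", "04.03.2021", "next Tuesday"]
normalized = [normalize_date(d) for d in raw_dates]
print(normalized)
# prints: ['2021-03-04', '2021-03-04', '2021-03-04', None]
```

Values that match no known format come back as `None` so they can be reviewed by hand rather than silently guessed.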

Business impact of dirty data

For further reading on the business impact of dirty data, enter the term “dirty data” into your preferred browser’s search bar to bring up numerous articles on the topic. Here are a few impacts cited for certain industries from a previous search:

  • Banking: Inaccuracies cost companies between 15% and 25% of revenue (source).
  • Digital commerce: Up to 25% of B2B database contacts contain inaccuracies (source).
  • Marketing and sales: 99% of companies are actively tackling data quality in some way (source).
  • Healthcare: Duplicate records can make up 10%, and even up to 20%, of a hospital’s electronic health records (source).

Data Integrity/Clean and Dirty

What’s the problem with dirty data?

Match each definition or example with the correct data integrity principle. First, select a definition or example. Then, select a principle to find out if it's a match.

image

Test your knowledge on clean versus dirty data

Question 1

Describe the difference between a null and a zero in a dataset.

A. A null represents a number with no significance. A zero represents the number zero.

B. A null indicates that a value does not exist. A zero is a numerical response.

C. A null represents a value of zero. A zero represents an empty cell.

D. A null signifies invalid data. A zero is missing data.

The correct answer is B. Explain: A null indicates that a value does not exist, whereas a zero is a numerical response.
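
The difference between a null and a zero matters in calculations. In this small Python sketch, `None` stands in for a null, and the survey answers are made-up values used only for illustration.

```python
# Hypothetical survey answers: number of purchases last month.
# None means the respondent did not answer (a null); 0 means "no purchases".
answers = [3, 0, None, 5, None, 0]

answered = [a for a in answers if a is not None]
null_count = len(answers) - len(answered)
zero_count = answered.count(0)

# Average over the real responses only.
mean_ignoring_nulls = sum(answered) / len(answered)

# Treating nulls as zeros silently lowers the average.
mean_nulls_as_zero = sum(a or 0 for a in answers) / len(answers)

print(null_count, zero_count)                 # prints: 2 2
print(mean_ignoring_nulls)                    # prints: 2.0
print(mean_nulls_as_zero < mean_ignoring_nulls)  # prints: True
```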

Question 2

What are the most common processes and procedures handled by data engineers? Select all that apply.

  • Giving data a reliable infrastructure
  • Verifying results of data analysis
  • Transforming data into a useful format for analysis
  • Developing, maintaining, and testing databases and related systems

Explain: Data engineers transform data into a useful format for analysis; give it a reliable infrastructure; and develop, maintain, and test databases and related systems.

Question 3

What are the most common processes and procedures handled by data warehousing specialists? Select all that apply.

  • Ensuring data is backed up to prevent loss
  • Ensuring data is secure
  • Ensuring data is properly cleaned
  • Ensuring data is available

Explain: Data warehousing specialists are responsible for ensuring data is available, secure, and backed up to prevent loss.

Question 4

A data analyst is cleaning a dataset. They want to confirm that users entered five-digit zip codes correctly by checking the data in a certain spreadsheet column. What would be most helpful as the next step?

A. Formatting the cells in the column as number

B. Using the field length tool to specify the number of characters in each cell in the column

C. Changing the column width to fit only five digits

D. Using the MAX function to determine the maximum value in the cells in the column

The correct answer is B. Explain: Using the field length tool to specify the number of characters in each cell in the column would be the most helpful, because any entry that is not exactly five characters long then stands out.
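
The same idea as a field length check can be sketched outside a spreadsheet. The Python example below is an analogy, not the spreadsheet tool itself. The sample column and the `is_five_digit_zip` helper are hypothetical. Zip codes are kept as text here because storing them as numbers would drop leading zeros.

```python
# Hypothetical zip code column, stored as text so leading zeros survive.
zip_codes = ["30301", "02134", "9021", "100011", "1234A"]

def is_five_digit_zip(value):
    """True when value is exactly five characters long and all ASCII digits."""
    return len(value) == 5 and value.isascii() and value.isdigit()

# Collect the entries that fail the length-and-digits check.
invalid = [z for z in zip_codes if not is_five_digit_zip(z)]
print(invalid)
# prints: ['9021', '100011', '1234A']
```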