4.4.1.Manually cleaning data - quanganh2001/Google-Data-Analytics-Professional-Certificate-Coursera GitHub Wiki
Data-cleaning verification: A checklist
This reading will give you a checklist of common problems you can refer to when doing your data cleaning verification, no matter what tool you are using. When it comes to data cleaning verification, there is no one-size-fits-all approach or a single checklist that can be universally applied to all projects. Each project has its own organization and data requirements that lead to a unique list of things to run through for verification.
Keep in mind, as you receive more data or a better understanding of the project goal(s), you might want to revisit some or all of these steps.
Correct the most common problems
Make sure you identified the most common problems and corrected them, including:
- Sources of errors: Did you use the right tools and functions to find the source of the errors in your dataset?
- Null data: Did you search for NULLs using conditional formatting and filters?
- Misspelled words: Did you locate all misspellings?
- Mistyped numbers: Did you double-check that your numeric data has been entered correctly?
- Extra spaces and characters: Did you remove any extra spaces or characters using the TRIM function?
- Duplicates: Did you remove duplicates in spreadsheets using the Remove Duplicates function or DISTINCT in SQL?
- Mismatched data types: Did you check that numeric, date, and string data are typecast correctly?
- Messy (inconsistent) strings: Did you make sure that all of your strings are consistent and meaningful?
- Messy (inconsistent) date formats: Did you format the dates consistently throughout your dataset?
- Misleading variable labels (columns): Did you name your columns meaningfully?
- Truncated data: Did you check for truncated or missing data that needs correction?
- Business Logic: Did you check that the data makes sense given your knowledge of the business?
Review the goal of your project
Once you have finished these data cleaning tasks, it is a good idea to review the goal of your project and confirm that your data is still aligned with that goal. This is a continuous process that you will do throughout your project-- but here are three steps you can keep in mind while thinking about this:
- Confirm the business problem
- Confirm the goal of the project
- Verify that data can solve the problem and is aligned to the goal
Test your knowledge on manual data cleaning
Question 1
Making sure data is properly verified is an important part of the data-cleaning process. Which of the following tasks are involved in this verification? Select all that apply.
- Rechecking the data-cleaning effort
- Asking stakeholders to check and confirm the data is clean
- Considering whether the data is credible and appropriate for the project
- Manually fixing any errors found in the data
Explain: The verification process confirms that data cleaning was well executed and the resulting data is accurate and reliable. To verify data, analysts recheck the data-cleaning effort, manually fix errors in the data, and consider whether the data is credible and appropriate for the project.
Question 2
Fill in the blank: To count the total number of spreadsheet values within a specified range, a data analyst uses the _____ function.
A. WHOLE
B. SUM
C. TOTAL
D. COUNTA
To count the total number of spreadsheet values within a specified range, a data analyst uses the COUNTA function.
Question 3
A data analyst is cleaning a dataset with inconsistent formats and repeated cases. They use the TRIM function to remove extra spaces from string variables. What other tools can they use for data cleaning? Select all that apply.
- Protect sheet
- Import data
- Find and replace
- Remove duplicates
Explain: The analyst can use TRIM, remove duplicates, and find and replace for data cleaning.
Question 4
To correct a typo in a database column, where should you insert a CASE statement in a query?
A. As an ORDER BY clause
B. As a GROUP BY clause
C. As a FROM clause
D. As a SELECT clause
The correct answer is D. As a SELECT clause. Explain: You should add a CASE statement as a SELECT clause. A CASE statement goes through one or more conditions and returns a value as soon as a condition is met. The typo would be a condition and the correction would be the returned value for the condition.