4.2.2.Begin cleaning data - quanganh2001/Google-Data-Analytics-Professional-Certificate-Coursera GitHub Wiki

Common data-cleaning pitfalls

In this reading, you will learn the importance of data cleaning and how to identify common mistakes. Some of the errors you might come across while cleaning your data could include:

xKinzzFmTkKop88xZs5C2Q_5da922d4cbd349e7aeb60ace3e18393a_Screen-Shot-2021-01-18-at-8 03 48-PM

Common mistakes to avoid

  • Not checking for spelling errors: Misspellings can be as simple as typing or input errors. Most of the time the wrong spelling or common grammatical errors can be detected, but it gets harder with things like names or addresses. For example, if you are working with a spreadsheet table of customer data, you might come across a customer named “John” whose name has been input incorrectly as “Jon” in some places. The spreadsheet’s spellcheck probably won’t flag this, so if you don’t double-check for spelling errors and catch this, your analysis will have mistakes in it.
  • Forgetting to document errors: Documenting your errors can be a big time saver, as it helps you avoid those errors in the future by showing you how you resolved them. For example, you might find an error in a formula in your spreadsheet. You discover that some of the dates in one of your columns haven’t been formatted correctly. If you make a note of this fix, you can reference it the next time your formula is broken, and get a head start on troubleshooting. Documenting your errors also helps you keep track of changes in your work, so that you can backtrack if a fix didn’t work.
  • Not checking for misfielded values: A misfielded value happens when the values are entered into the wrong field. These values might still be formatted correctly, which makes them harder to catch if you aren’t careful. For example, you might have a dataset with columns for cities and countries. These are the same type of data, so they are easy to mix up. But if you were trying to find all of the instances of Spain in the country column, and Spain had mistakenly been entered into the city column, you would miss key data points. Making sure your data has been entered correctly is key to accurate, complete analysis.
  • Overlooking missing values: Missing values in your dataset can create errors and give you inaccurate conclusions. For example, if you were trying to get the total number of sales from the last three months, but a week of transactions were missing, your calculations would be inaccurate. As a best practice, try to keep your data as clean as possible by maintaining completeness and consistency.
  • Only looking at a subset of the data: It is important to think about all of the relevant data when you are cleaning. This helps make sure you understand the whole story the data is telling, and that you are paying attention to all possible errors. For example, if you are working with data about bird migration patterns from different sources, but you only clean one source, you might not realize that some of the data is being repeated. This will cause problems in your analysis later on. If you want to avoid common errors like duplicates, each field of your data requires equal attention.
  • Losing track of business objectives: When you are cleaning data, you might make new and interesting discoveries about your dataset-- but you don’t want those discoveries to distract you from the task at hand. For example, if you were working with weather data to find the average number of rainy days in your city, you might notice some interesting patterns about snowfall, too. That is really interesting, but it isn’t related to the question you are trying to answer right now. Being curious is great! But try not to let it distract you from the task at hand.
  • Not fixing the source of the error: Fixing the error itself is important. But if that error is actually part of a bigger problem, you need to find the source of the issue. Otherwise, you will have to keep fixing that same error over and over again. For example, imagine you have a team spreadsheet that tracks everyone’s progress. The table keeps breaking because different people are entering different values. You can keep fixing all of these problems one by one, or you can set up your table to streamline data entry so everyone is on the same page. Addressing the source of the errors in your data will save you a lot of time in the long run.
  • Not analyzing the system prior to data cleaning: If we want to clean our data and avoid future errors, we need to understand the root cause of your dirty data. Imagine you are an auto mechanic. You would find the cause of the problem before you started fixing the car, right? The same goes for data. First, you figure out where the errors come from. Maybe it is from a data entry error, not setting up a spell check, lack of formats, or from duplicates. Then, once you understand where bad data comes from, you can control it and keep your data clean.
  • Not backing up your data prior to data cleaning: It is always good to be proactive and create your data backup before you start your data clean-up. If your program crashes, or if your changes cause a problem in your dataset, you can always go back to the saved version and restore it. The simple procedure of backing up your data can save you hours of work-- and most importantly, a headache.
  • Not accounting for data cleaning in your deadlines/process: All good things take time, and that includes data cleaning. It is important to keep that in mind when going through your process and looking at your deadlines. When you set aside time for data cleaning, it helps you get a more accurate estimate for ETAs for stakeholders, and can help you know when to request an adjusted ETA.

Additional resources

Refer to these "top ten" lists for data cleaning in Microsoft Excel and Google Sheets to help you avoid the most common mistakes:

Hands-On Activity: Cleaning data with spreadsheets

Question 1

Activity overview

3UWxlyqNRxaFsZcqjZcWgQ_3fd91ae61ce04caaa755c3477a337947_line-y

You’ve learned about cleaning data and its importance in meeting good data science standards. In this activity, you’ll do some data cleaning with spreadsheets, then transpose the data.

By the time you complete this activity, you will be able to perform some basic cleaning methods in spreadsheets. This will enable you to clean and transpose data, which is important for making data more specific and accurate in your career as a data analyst.

What you will need

To get started, first access the data spreadsheet.

To use the spreadsheet for this course item, click the link below and select “Use Template.”

Link to data spreadsheet: Cleaning with spreadsheets

OR

If you don’t have a Google account, you can download the template directly from the attachment below.

Select and remove blank cells

3UWxlyqNRxaFsZcqjZcWgQ_3fd91ae61ce04caaa755c3477a337947_line-y

The first technique we’ll use is to select and eliminate rows containing blank cells by using filters. To eliminate rows with blank cells:

  1. Highlight all cells in the spreadsheet. You can highlight Columns A-H by clicking on the header of Column A, holding Shift, and clicking on the header of Column H.
  2. Click on the Data tab and pick the Create a filter option. In Microsoft Excel, this is called Filter.

O9NNCnlDSnWTTQp5Q1p1XA_25d78d3867c8402da3bafff30662affe_DAC4M2L3HO1-ss1---Hands-on-Activity---Cleaning-with-Spreadsheets

Excel:

rbruHkUVR2e67h5FFVdnIA_46544fd0693e4d54b0b48738dd7adef1_clean1

  1. Every column now shows a green triangle in the first row next to the column title. Click the green triangle in Column B to access a new menu.
  2. On that new menu, click Filter by condition and open the dropdown menu to select Is empty. Click OK.

OnScbk09Sya0nG5NPSsmnw_3bd0ebe89bb54e78b77e287c57d84469_DAC4M2L3HO1-ss2---Hands-on-Activity---Cleaning-with-Spreadsheets

In Excel, click the dropdown, then Filter... then make sure only (Blanks) is checked. Click OK.

Excel:

Uhm2BhPySXGZtgYT8nlxrw_665641702fca449eaa1d0d76780d93f1_clean2

fRE5xga-RnSROcYGvvZ0uw_899befc39c474834a28b47e8173acff1_image-1-

You can then review a list of all the rows with blank cells in that column.

  1. Select all these cells and delete the rows except the row of column headers.
  2. Return to the Filter by condition and return it to None. In Excel, click Clear Filter from ‘Column’.
  • Note: You will now notice that any row that had an empty cell in Column A will be removed (including the extra empty rows after the data).
  1. Repeat this for Columns B-H.
  2. Note: If you simply deleted the data from the row by tapping the backspace button, you will need to go a step further and delete the empty row entirely by left-clicking the row number located on the furthest left side of the screen.

tw7qKgnVSL-qreDMxqqNVg_4c27bec8fb094e09b8ca6fe4852352f1_RowSelect

  1. Next, right click on the highlighted row to call up the drop down window, and select the Delete row option.

iAJUmDmQRDiRHQ69peqoiw_4f399ce0c6f1496c948c70141335bef1_DeleteRow

  1. Continue to do this same operation for the remaining empty rows in the data set.

All the rows that had blank cells are now removed from the spreadsheet.

Transpose the data

3UWxlyqNRxaFsZcqjZcWgQ_3fd91ae61ce04caaa755c3477a337947_line-y

The second technique you will practice will help you convert the data from the current long format (more rows than columns) to the wide format (more columns than rows). This action is called transposing. To transpose your data:

  1. Highlight and copy the data that you want to transpose including the column labels. You can do this by highlighting Columns A-H. In Excel, highlight only the relevant cells (A1-H45) instead of the headers.
  2. Right-click on cell I1. This is where you want the transposed data to start.
  3. Hover over Paste Special from the right-click menu. Select the Transposed option. In Excel, select the Transpose icon under the paste options.

MTwQTUO9Sr28EE1Dvcq9TA_72cc515a56e74c5087f79f9f0a5c76f1_Screenshot-2021-08-01-11 09 05-PM

Excel:

cZZMekh0R--WTHpIdIfvyQ_d00c1398f40f40c793f389815300b4f1_clean3

You should now find the data transformed into the new wide format. At this point, you should remove the original long data from the spreadsheet.

  1. Delete the previous long data. The easiest way to do this is to click on Column A, so the entire column is highlighted. Then, hold down the Shift key and click on Column H. You should find these columns highlighted. Right-click on the highlighted area and select Delete Columns A - H.

Your screen should now appear like this:

Get rid of extra spaces in cells with string data

3UWxlyqNRxaFsZcqjZcWgQ_3fd91ae61ce04caaa755c3477a337947_line-y

Now that you have transposed the data, eliminate the extra spaces in the values of the cells.

  1. Highlight the data in the spreadsheet.
  2. Click on the Data tab, then hover over Data cleanup and select Trim whitespace.

pq0CjzcvSMOtAo83L7jD1A_cc5e405f6d2a41098cce951e3eacc8f1_Screenshot-2021-08-02-12 08 23-AM

In Excel, you can use the TRIM command to get rid of white spaces. In any space beneath your data (such as cell A10), type =TRIM(A1). Then, drag the bottom right corner of the cell to the bottom right to call the data without the white spaces.

Now all the extra spaces in the cells have been removed.

Change Text Lower/Uppercase/Proper Case

3UWxlyqNRxaFsZcqjZcWgQ_3fd91ae61ce04caaa755c3477a337947_line-y

Next, you’ll process string data. The easiest way to clean up string data will depend on the spreadsheet program you are using. If you are using Excel, you’ll use a simple formula. If you are using Google Sheets, you can use an Add-On to do this with a few clicks. Follow the steps in the relevant section below.

Microsoft Excel

If you are using Microsoft Excel, this documentation explains how to use a formula to change the case of a text string. Follow these instructions to clean the string text and then move on to the confirmation and reflection section of this activity.

Google sheets

If you’re completing this exercise using Google Sheets, you’ll need to install an add-in that will give you the functionality needed to easily clean string data and change cases.

Google Sheets Add-on Instructions:

  1. Click on the Add-Ons option at the top of Google Sheets.
  2. Click on Get add-ons.
  3. Search for ChangeCase. It should appear like this:

RsIWudCiRKiCFrnQomSo4Q_a833e87b50d94bacaab8d47e49adef7b_DAC4M2L3SR1---image7

  1. Click on Install to install the add-on. It may ask you to login or verify the installation permissions.

Once you have installed the add-on successfully, you can access it by clicking on the Add-ons menu again.

Now, you can change the case of text data that shows up. To change the text in Column C to all uppercase:

  1. Click on Column C. Be sure to deselect the column header, unless you want to change the case of that as well (which you don't).
  2. Click on the Add-Ons tab and select ChangeCase. Select the option All uppercase. Notice the other options that you could have chosen if needed.

Delete all formating

3UWxlyqNRxaFsZcqjZcWgQ_3fd91ae61ce04caaa755c3477a337947_line-y

If you want to clear the formatting for any or all cells, you can find the command in the Format tab. To clear formatting:

  1. Select the data for which you want to delete the formatting. In this case, highlight all the data in the spreadsheet by clicking and dragging over Rows 1-8.
  2. Click the Format tab and select the Clear Formatting option.

In Excel, go to the Home tab, then hover over Clear and select Clear Formats.

You will notice that all the cells have had their formatting removed.

Confirmation and reflection

3UWxlyqNRxaFsZcqjZcWgQ_3fd91ae61ce04caaa755c3477a337947_line-y

Review the final product of the spreadsheet you cleaned during this activity. Which of the following is the rightmost column?

A. Column Z

B. Column AZ

C. Column AA

D. Column AS

The correct answer is D. Column AS. Explain: In the final product of this activity, the rightmost column is Column AS. You are able to find this information after you properly transpose the data. Going forward, you can apply what you learned about data cleaning and transposing to work with data in the future.

Question 2

In this activity, you practiced cleaning and transposing data. In the text box below, write 2-3 sentences (40-60 words) in response to each of the following questions:

  • What was the most challenging part of cleaning the data?
  • Why is cleaning and transposing data important for data analysis?
  • If you had to clean this data again, what would you do differently? Why?

Explain: Congratulations on completing this hands-on activity! In this activity, you cleaned and transposed data on a spreadsheet. A good response would include that cleaning is a fundamental step in data science as it greatly increases the integrity of the data.

Good data science results rely heavily on the reliability of the data. Data analysts clean data to make it more accurate and reliable. This is important for making sure that the projects you will work on as a data analyst are completed properly.

Test your knowledge on data-cleaning techniques

Question 1

Fill in the blank: Every database has its own formatting, which can cause the data to seem inconsistent. Data analysts use the _____ tool to create a clean and consistent visual appearance for their spreadsheets.

A. clear formats

B. autocorrect

C. spellcheck

D. conditional formatting

Explain: Data analysts use the clear formats tool to create a clean and consistent visual appearance for their spreadsheets.

Question 2

What is the process of combining two or more datasets into a single dataset?

A. Data transferring

B. Data merging

C. Data composition

D. Data validation

The correct answer is B. Data merging. Explain: Data merging is the process of combining two or more datasets into a single dataset.

Question 3

Fill in the blank: In data analytics, _____ describes how well two or more datasets are able to work together.

A. alignment

B. suitability

C. agreement

D. compatibility

The correct answer is D. compatibility. Explain: Compatibility describes how well two or more datasets are able to work together.