4.2.3.Cleaning data in spreadsheets - quanganh2001/Google-Data-Analytics-Professional-Certificate-Coursera GitHub Wiki
Workflow automation
In this reading, you will learn about workflow automation and how it can help you work faster and more efficiently. Basically, workflow automation is the process of automating parts of your work. That could mean creating an event trigger that sends a notification when a system is updated. Or it could mean automating parts of the data cleaning process. As you can probably imagine, automating different parts of your work can save you tons of time, increase productivity, and give you more bandwidth to focus on other important aspects of the job.
What can be automated?
Automation sounds amazing, doesn’t it? But as convenient as it is, there are still some parts of the job that can’t be automated. Let's take a look at some things we can automate and some things that we can’t.
Task | Can it be automated? | Why? |
---|---|---|
Communicating with your team and stakeholders | No | Communication is key to understanding the needs of your team and stakeholders as you complete the tasks you are working on. There is no replacement for person-to-person communications. |
Presenting your findings | No | Presenting your data is a big part of your job as a data analyst. Making data accessible and understandable to stakeholders and creating data visualizations can't be automated for the same reasons that communications can't be automated. |
Preparing and cleaning data | Partially | Some tasks in data preparation and cleaning can be automated by setting up specific processes, like using a programming script to automatically detect missing values. |
Data exploration | Partially | Sometimes the best way to understand data is to see it. Luckily, there are plenty of tools available that can help automate the process of visualizing data. These tools can speed up the process of visualizing and understanding the data, but the exploration itself still needs to be done by a data analyst. |
Modeling the data | Yes | Data modeling is a difficult process that involves lots of different factors; luckily there are tools that can completely automate the different stages. |
More about automating data cleaning
One of the most important ways you can streamline your data cleaning is to clean data where it lives. This will benefit your whole team, and it also means you don’t have to repeat the process over and over. For example, you could create a programming script that counted the number of words in each spreadsheet file stored in a specific folder. Using tools that can be used where your data is stored means that you don’t have to repeat your cleaning steps, saving you and your team time and energy.
More resources
There are a lot of tools out there that can help automate your processes, and those tools are improving all the time. Here are a few articles or blogs you can check out if you want to learn more about workflow automation and the different tools out there for you to use:
-
Towards Data Science’s Automating Scientific Data Analysis
-
MIT News’ Automating Big-Data Analysis
-
TechnologyAdvice’s 10 of the Best Options for Workflow Automation Software
As a data analyst, automation can save you a lot of time and energy, and free you up to focus more on other parts of your project. The more analysis you do, the more ways you will find to make your processes simpler and more streamlined.
Hands-On Activity: Clean data with spreadsheet functions
Question 1
Activity overview
Imagine you are a data analyst working for a marketing agency based in San Francisco. The marketing agency wants to contact local boba tea shops to inquire about a potential collaboration for a new marketing campaign. The agency plans to visit the top-rated shops within a 10-mile radius of the center of their target area. To assist with planning, the agency asks your team to review external data related to ratings and locations of boba tea shops in San Francisco. One of your teammates has created a spreadsheet from an online source. However, the data is not in the greatest shape.
Your assignment is to identify the dirty elements in the dataset and clean them up.
By the time you complete this activity, you will be able to identify dirty elements in a dataset, remove duplicate data, and use the COUNTIF and SPLIT functions to help clean data.
What you will need
The dataset includes the following column headers:
Column Header | Description |
---|---|
id | a unique identifier for each boba shop |
name | name of boba shop |
rating | Yelp rating (0 to 5 stars) |
address | street address |
city | city |
lat-long | latitude and longitude |
To get started, access the spreadsheet that contains the data. Click the link and make a copy of the spreadsheet .
Or, if you don’t have a Google account, you may download the dataset directly from the attachment below: San Francisco Boba Tea Shop Location Info
Identify the dirty elements in your data
As a data analyst, your job is to present data that is readable, accurate, and visually appealing. Cleaning your data helps you achieve this goal. The first step is to identify the dirty elements in your data.
- Rename your spreadsheet. Click Untitled Spreadsheet and enter a new name. You can use the name sf_boba_tea_shop_data or a similar name that describes the data your spreadsheet contains.
- If you want to get a better view of your data, you can make the columns wider by dragging the right boundary of the column heading. This may apply to the name (B), address (D), and lat-long (F) columns.
- Now, review your data and consider any problems you may need to address. The following are examples of errors that you can quickly identify and fix. This is not a comprehensive list of every potential problem, but is a great starting point for data cleaning.
- First, there is at least one duplicate line (rows 20 and 21) in your dataset.
- Second, all Yelp ratings should fall between 0 and 5. However, at least one rating (in cell C8) falls outside of that range.
- Finally, the data for latitude and longitude is contained in a single column (F). In order for someone to be able to use this data for analysis, the two values should be in separate columns.
- Now you know what issues to focus your attention on during the cleaning process.
Clean your data
Your goal is to fix these errors and help create a clean dataset for analysis. You can address each issue in turn.
Remove duplicates
The first step is to eliminate any duplicate entries from your dataset. As a best practice, duplicates should be removed even if they are not readily apparent.
- To start, select columns A through F.
- Then, in the menu bar, choose Data, then Data Cleanup, and select Remove duplicates.
- In the pop-up window, click Data has header row. You want to remove duplicate boba shop id's and boba shop names. In the Columns to analyze section, make sure the relevant columns (id, name) are selected.
- Once everything has been selected, click Remove duplicates.
- If done correctly, 3 duplicate rows will be found and removed and 604 rows will remain.
Correct the ratings data
Next, clean up any data that does not make sense. Yelp ratings should be less than 5 and greater than 0. Now, you will determine how many entries are inaccurate and correct them. You can use the COUNTIF function to perform this task.
- The COUNTIF function quickly counts how many items in a range of cells meet a given criterion. In cell I2, type =COUNTIF(C:C,">5"). The first entry (C:C) refers to the range where you are counting the data. In this case, the range is the entire rating column (C), which contains the Yelp ratings. The second entry refers to the criterion (>5), and tells the function to count all the values greater than 5.
- Press Enter. You’ll notice that the function returns a value of 9. This tells you that your dataset contains 9 entries that have a rating greater than 5.
As a data analyst, it's your job to decide what to do with incorrect values or to ask the dataset owner for advice if you’re unsure. In this case, one effective approach would be to search on Yelp for the actual ratings. For this activity, you can just replace the incorrect ratings with the number 5. An efficient way to replace the ratings is to sort the data numerically from largest to smallest rating.
- Select columns A through F.
- Then, from the menu bar, choose Data, then Sort range, and select Advanced range sorting options.
- In the pop-up window, check the box next to Data has header row. Sort by rating from Z→A. This way, the highest ratings will be listed first.
- Click Sort. Check out your spreadsheet. At the start of the rating column, you should now find the 9 rows that have incorrect values (rating > 5).
- Next, select the range of cells C2:C10. Press delete to delete the values that are greater than 5.
- Replace all the values with the number 5. In cell C2, type 5. Then, drag the fill handle down to cell C10 to fill the remaining cells with 5.
- After replacing the incorrect ratings with the number 5, you may notice that the new value in cell I2 is 0. The output of the COUNTIF function now reflects the changes in your dataset. This confirms that the rating column no longer contains any values greater than 5.
- FInally, delete the formula from cell I2 since you don’t need this information anymore.
Clean up the latitude and longitude data
Next, clean up the latitude and longitude data by placing each value in a separate column. You can use the SPLIT function to accomplish this task.
- The SPLIT function divides text around a specified character or string, and puts each fragment of text into a separate cell in the row. The SPLIT function will split the single lat-long column into two separate columns, one for latitude and the other for longitude. In cell G2, type =SPLIT(F2,"-"). The first entry (F2) refers to the cell where the text is located. The second entry (“-”) refers to the fact that you are dividing the text based on the minus sign.
- Press Enter. The result shows each fragment of text in a different cell.
- Select cell G2 again. In cell G2, double-click on the fill handle to split all the remaining lat-long entries.
- Now add column headers to the two new columns (G and H). In cell G1, type lat. In cell H1, type long.
- Next, replace the original lat-long data in column F with the new split entries in columns G and H. Select columns G and H, right-click, and choose Copy.
- Then, select Column F, right-click, and choose Paste special and Paste values only.
- Now the new lat column is column F, and the new long column is column G. Adjust the width of the lat column (F) to fit the data by dragging the right boundary of the column heading.
- Next, select column H, right-click, and choose Delete column.
- Finally, the longitude values should be negative so that they are accurate coordinates for mapping. To make the values in the long column negative, multiply them by -1. In cell H2, type =G2-1*. The asterisk is the operator for multiplication. Press Enter.
- Still in cell H2, double-click on the fill handle to fill in the rest of the values.
- Next, add a column header. In cell H1, type: long.
- Now, replace the longitude data in column G with the new data in column H. Select column H, right-click, and choose Copy.
- Select Column G, right-click, and choose Paste special and Paste values only.
- Then, select column H, right-click, and choose Delete column.
Columns F and G should look like this:
Now your data is cleaner, clearer, and easier to use.
Confirmation and reflection
Which of the following functions divides text around a specified character or string and puts each fragment of text into a separate cell in the row?
A. The CONCATENATE function
B. The SPLIT function
C. The COUNTIF function
D. The TRIM function
The correct answer is B. The SPLIT function. Explain: The SPLIT function divides text around a specified character or string, and puts each fragment of text into a separate cell in the row. Spreadsheet functions are useful tools for data cleaning, and knowing how to use functions effectively is a key part of every data analyst’s skill set.
Question 2
In the text box below, write 2-3 sentences (40-60 words) in response to each of the following questions:
- As a data analyst, why is it important for the data you present to be readable, accurate, and visually appealing?
- How can spreadsheet functions help you clean data more efficiently and effectively?
Explain: Congratulations on completing this hands-on activity! In this activity, you used spreadsheet functions to clean “dirty” data in a spreadsheet.
Cleaning data is an important part of the data analysis process. If data analysis is based on bad or “dirty” data, it may be biased, erroneous, and uninformed. Knowing how to effectively use spreadsheet functions to work with data is an essential skill for every data analyst. In upcoming activities, you will continue to learn more about spreadsheet functions and how they can help you analyze your data.
Learning Log: Develop your approach to cleaning data
Overview
By this point, you have started working with real data. And you may have noticed that data is often messy-- you can expect raw, primary data to be imperfect. In this learning log, you will develop an approach to cleaning data by creating a cleaning checklist, considering your preferred methods for data cleaning, and deciding on a data cleaning motto. By the time you complete this entry, you will have a stronger understanding of how to approach the data cleaning process methodically. This will help you save time cleaning data in the future and ensure that your data is clean and usable.
Fill out the Data Cleaning Approach Table
The problem with data cleaning is that it usually requires a lot of time, energy, and attention from a junior data analyst. One of the best ways to lessen the negative impacts of data cleaning is to have a plan of action or a specific approach to cleaning the data.
In order to help you develop your own approach, you’ll use the instructions from this learning log to fill out a Data Cleaning Approach Table in your learning log template. The table will appear like this in the template:
Once you have completed your Data Cleaning Approach Table, you will spend some time reflecting on the data cleaning process and your own approach.
Access your learning log To use the learning log for this course item, click the link below and select “Use Template.”
Link to learning log template: Develop your approach to data cleaning
OR
If you don’t have a Google account, you can download the template directly from the attachment below.: Learning Log Template: Develop your approach to cleaning data
Step 1: Create your checklist
You can start developing your personal approach to cleaning data by creating a standard checklist to use before your data cleaning process. Think of this checklist as your default "what to search for" list.
With a good checklist, you can efficiently and, hopefully, swiftly identify all the problem spots without getting sidetracked. You can also use the checklist to identify the scale and scope of the dataset itself.
Some things you might include in your checklist:
- Size of the data set
- Number of categories or labels
- Missing data
- Unformatted data
- The different data types
You can use your own experiences so far to help you decide what else you want to include in your checklist!
Step 2: List your preferred cleaning methods
After you have compiled your personal checklist, you can create a list of activities you like to perform when cleaning data. This list is a collection of procedures that you will implement when you encounter specific issues present in the data related to your checklist or every time you clean a new dataset.
For example, suppose that you have a dataset with missing data, how would you handle it? Moreover, if the data set is very large, what would you do to check for missing data? Outlining some of your preferred methods for cleaning data can help save you time and energy.
Step 3: Choose a data cleaning motto
Now that you have a personal checklist and your preferred data cleaning methods, you can create a data cleaning motto to help guide and explain your process. The motto is a short one or two sentence summary of your philosophy towards cleaning data. For example, here are a few data cleaning mottos from other data analysts:
- "Not all data is the same, so don't treat it all the same."
- "Be prepared for things to not go as planned. Have a backup plan.”
- "Avoid applying complicated solutions to simple problems."
The data you encounter as an analyst won’t always conform to your checklist or activities list regardless of how comprehensive they are. Data cleaning can be an involved and complicated process, but surprisingly most data has similar problems. A solid personal motto and explanation can make the more common data cleaning tasks easy to understand and complete.
Reflection
Now that you have completed your Data Cleaning Approach Table, take a moment to reflect on the decisions you made about your data cleaning approach. Write 1-2 sentences (20-40 words) answering each of the following questions:
- What items did you add to your data cleaning checklist? Why did you decide these were important to check for?
- How have your own experiences with data cleaning affected your preferred cleaning methods? Can you think of an example where you needed to perform one of these cleaning tasks?
- How did you decide on your data cleaning motto?
Test your knowledge on cleaning data in spreadsheets
Question 1
Describe the relationship between a text string and a substring.
A. A text string is a column of data within a table. A substring is one cell within that column.
B. A text string is a group of characters within a cell. A substring is a smaller subset of that text string.
C. A text string is the list of attributes at the top of columns within a table. A substring is a single attribute within that list.
D. A text string is a row of data within a table. A substring is one cell within that row.
The correct answer is B. A text string is a group of characters within a cell. A substring is a smaller subset of that text string.
Question 2
A data analyst uses the COUNTIF function to count the number of times a value less than 5 occurs between spreadsheet cells A2 through A100. What is the correct syntax?
A. =COUNTIF(A2:A100,">5")
B. =COUNTIF(A2:A100,<5)
C. =COUNTIF(A2:A100,"<5")
D. =COUNTIF(A2:A100,>5)
The correct answer is C. =COUNTIF(A2:A100,"<5"). Explain: The correct syntax is =COUNTIF(A2:A100,"<5"). COUNTIF will return the number of cells that match a value. A2:A100 is the range. And “<5” is the specified value.
Question 3
Fill in the blank: To remove leading, trailing, and repeated spaces in data, analysts use the ____ function.
A. TRIM
B. LEFT
C. MID
D. RIGHT
The correct answer is A. TRIM. Explain: TRIM is a function that removes leading, trailing, and repeated spaces in data.