Course 4-4
Google Data Analytics Professional
Process Data from Dirty to Clean
WEEK4 - Verify and report on your cleaning results
Cleaning your data is an essential step in the data analysis process. Verifying and reporting on your cleaning shows that your data is ready for the next step. In this part of the course, you'll learn about the processes involved in verifying and reporting on data cleaning, as well as their benefits.
Learning Objectives
- Describe the process involved in verifying the results of cleaning data
- Describe what is involved in manually cleaning data
- Discuss the elements and importance of data-cleaning reports
- Describe the benefits of documenting the data-cleaning process
Manually cleaning data
Verifying and reporting results
Hi there, great to have you back. You've been learning a lot about the importance of clean data and explored some tools and strategies to help you throughout the cleaning process. In these videos, we'll be covering the next step in the process: verifying and reporting on the integrity of your clean data.
Verification is a process to confirm that a data cleaning effort was well-executed and the resulting data is accurate and reliable. It involves rechecking your clean dataset, doing some manual clean ups if needed, and taking a moment to sit back and really think about the original purpose of the project. That way, you can be confident that the data you collected is credible and appropriate for your purposes. Making sure your data is properly verified is so important because it allows you to double-check that the work you did to clean up your data was thorough and accurate. For example, you might have referenced an incorrect cellphone number or accidentally keyed in a typo. Verification lets you catch mistakes before you begin analysis. Without it, any insights you gain from analysis can't be trusted for decision-making. You might even risk misrepresenting populations or damaging the outcome of a product that you're actually trying to improve.
I remember working on a project where I thought the data I had was sparkling clean because I'd used all the right tools and processes, but when I went through the steps to verify the data's integrity, I discovered a semicolon that I had forgotten to remove. Sounds like a really tiny error, I know, but if I hadn't caught the semicolon during verification and removed it, it would have led to some big changes in my results. That, of course, could have led to different business decisions. That's an example of why verification is so crucial.
But that's not all. The other big part of the verification process is reporting on your efforts. Open communication is a lifeline for any data analytics project. Reports are a super effective way to show your team that you're being 100 percent transparent about your data cleaning. Reporting is also a great opportunity to show stakeholders that you're accountable, build trust with your team, and make sure you're all on the same page of important project details.
Coming up, you'll learn different strategies for reporting, like creating data-cleaning reports, documenting your cleaning process, and using something called the changelog. A changelog is a file containing a chronologically ordered list of modifications made to a project. It's usually organized by version and includes the date followed by a list of added, improved, and removed features. Changelogs are very useful for keeping track of how a dataset evolved over the course of a project. They're also another great way to communicate and report on data to others.
Along the way, you'll also see some examples of how verification and reporting can help you avoid repeating mistakes and save you and your team time. Ready to get started? Let's go!
Cleaning and your data expectations
Here, we'll discuss how to begin the process of verifying your data-cleaning efforts.
Verification is a critical part of any analysis project. Without it, you have no way of knowing that your insights can be relied on for data-driven decision-making. Think of verification as a stamp of approval.
To refresh your memory, verification is a process to confirm that a data-cleaning effort was well-executed and the resulting data is accurate and reliable. It also involves manually cleaning data to compare your expectations with what's actually present. The first step in the verification process is going back to your original unclean data set and comparing it to what you have now.
Review the dirty data and try to identify any common problems. For example, maybe you had a lot of nulls. In that case, you check your clean data to ensure no nulls are present. To do that, you could search through the data manually or use tools like conditional formatting or filters.
Or maybe there was a common misspelling like someone keying in the name of a product incorrectly over and over again. In that case, you'd run a FIND in your clean data to make sure no instances of the misspelled word occur.
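For instance, two quick spreadsheet checks along these lines would confirm the cleanup; the range B2:B101 and the misspelling "Chokolate" are hypothetical. The first formula counts empty cells (expect 0 after cleaning), and the second counts remaining instances of a known misspelling (also expect 0):

```
=COUNTBLANK(B2:B101)
=COUNTIF(B2:B101, "Chokolate")
```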
Another key part of verification involves taking a big-picture view of your project. This is an opportunity to confirm you're actually focusing on the business problem that you need to solve and the overall project goals and to make sure that your data is actually capable of solving that problem and achieving those goals.
It's important to take the time to reset and focus on the big picture because projects can sometimes evolve or transform over time without us even realizing it. Maybe an e-commerce company decides to survey 1000 customers to get information that would be used to improve a product. But as responses begin coming in, the analysts notice a lot of comments about how unhappy customers are with the e-commerce website platform altogether. So the analysts start to focus on that. While the customer buying experience is of course important for any e-commerce business, it wasn't the original objective of the project. The analysts in this case need to take a moment to pause, refocus, and get back to solving the original problem.
Taking a big-picture view of your project involves doing three things.
First, consider the business problem you're trying to solve with the data. If you've lost sight of the problem, you have no way of knowing what data belongs in your analysis. Taking a problem-first approach to analytics is essential at all stages of any project. You need to be certain that your data will actually make it possible to solve your business problem.
Second, you need to consider the goal of the project. It's not enough just to know that your company wants to analyze customer feedback about a product. What you really need to know is that the goal of getting this feedback is to make improvements to that product. On top of that, you also need to know whether the data you've collected and cleaned will actually help your company achieve that goal.
And third, you need to consider whether your data is capable of solving the problem and meeting the project objectives. That means thinking about where the data came from and testing your data collection and cleaning processes. Sometimes data analysts can be too familiar with their own data, which makes it easier to miss something or make assumptions. Asking a teammate to review your data from a fresh perspective and getting feedback from others is very valuable in this stage.
This is also the time to notice if anything sticks out to you as suspicious or potentially problematic in your data. Again, step back, take a big-picture view, and ask yourself, do the numbers make sense?
Let's go back to our e-commerce company example. Imagine an analyst is reviewing the cleaned-up data from the customer satisfaction survey. The survey was originally sent to 1,000 customers, but what if the analyst discovers that there are more than a thousand responses in the data? This could mean that one customer figured out a way to take the survey more than once. Or it could mean that something went wrong in the data-cleaning process and a field was duplicated. Either way, this is a signal that it's time to go back to the data-cleaning process and correct the problem.
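In SQL, for example, two quick checks like these would surface the problem (the table and column names are hypothetical):

```sql
-- How many responses are there? We expect exactly 1,000.
SELECT COUNT(*) AS total_responses
FROM customer_data.survey_responses;

-- Which customers appear more than once? Any rows returned here point
-- to repeat submissions or a field duplicated during cleaning.
SELECT
  customer_id,
  COUNT(*) AS times_submitted
FROM customer_data.survey_responses
GROUP BY customer_id
HAVING COUNT(*) > 1;
```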
Verifying your data ensures that the insights you gain from analysis can be trusted. It's an essential part of data-cleaning that helps companies avoid big mistakes. This is another place where data analysts can save the day.
Coming up, we'll go through the next steps in the data-cleaning process. See you there.
The final step in data cleaning
Hey there. In this video, we'll continue building on the verification process.
As a quick reminder, the goal is to ensure that our data-cleaning work was done properly and the results can be counted on. You want your data to be verified so you know it's 100 percent ready to go. It's like car companies running tons of tests to make sure a car is safe before it hits the road.
You learned that the first step in verification is returning to your original, unclean dataset and comparing it to what you have now. This is an opportunity to search for common problems. After that, you clean up the problems manually, for example, by eliminating extra spaces or removing an unwanted quotation mark. But there are also some great tools for fixing common errors automatically, such as TRIM and remove duplicates.
Earlier, you learned that TRIM is a function that removes leading, trailing, and repeated spaces in data.
Remove duplicates is a tool that automatically searches for and eliminates duplicate entries from a spreadsheet.
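As a quick sketch of what these look like in Google Sheets (the cell references are hypothetical): the first formula cleans a single cell, and the second applies TRIM to a whole column at once. Remove duplicates, by contrast, is a menu tool; at the time of writing it lives under Data > Data cleanup > Remove duplicates.

```
=TRIM(A2)
=ARRAYFORMULA(TRIM(A2:A100))
```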
Now, sometimes you'll have an error that shows up repeatedly and can't be resolved with a quick manual edit or a tool that fixes the problem automatically. In these cases, it's helpful to create a pivot table. A pivot table is a data summarization tool that is used in data processing. Pivot tables sort, reorganize, group, count, total, or average data stored in a database.
We'll practice that now using the spreadsheet from a party supply store.
To follow along with the instructor using the same spreadsheet, use the template linked here: Jeff's Party Planet - Data for Cleaning.
Let's say this company was interested in learning which of its four suppliers is most cost-effective.
So an analyst pulled this data on the products the business sells, how many were purchased, which supplier provides them, the cost of the products, and the ultimate revenue. The data has been cleaned.
But during verification, we noticed that one of the suppliers' names was keyed in incorrectly. We could just correct the word to "plus," but that might not solve the problem, because we don't know whether this was a one-time occurrence or whether the problem is repeated throughout the spreadsheet.
There are two ways to answer that question. The first is using Find and replace. Find and replace is a tool that looks for a specified search term in a spreadsheet and allows you to replace it with something else. We'll choose Edit. Then Find and replace. We're trying to find P-L-O-S, the misspelling of "plus" in the supplier's name. In some cases, you might not want to replace the data. You just want to find something. No problem. Just type the search term, leave the rest of the options as default and click "Done." But right now we do want to replace it with P-L-U-S. We'll type that in here. Then click "Replace all" and "Done." There we go. Our misspelling has been corrected. That was of course the goal.
But for now let's undo our Find and replace so we can practice another way to determine if errors are repeated throughout a dataset, like with the pivot table. We'll begin by selecting the data we want to use. Choose column C.
Select "Data." Then "Pivot Table." Choose "New Sheet" and "Create."
(Note: The menu option has changed slightly. To insert a pivot table, select Insert, then Pivot table.)
We know this company has four suppliers. If we count the suppliers and the number doesn't equal four, we know there's a problem. So first, add a row for suppliers. Next, we'll add a value for our suppliers and summarize by COUNTA.
COUNTA counts the total number of values within a specified range. Here we're counting the number of times a supplier's name appears in column C.
Note that there's also a function called COUNT, which only counts the numerical values within a specified range.
If we used it here, the result would be zero, which is not what we have in mind. But in other situations, where the values in a range are numbers, COUNT would give us the information we want.
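To make the difference concrete, assuming the supplier names live in C2:C101: COUNTA counts every non-empty cell, while COUNT only counts cells holding numbers, so it returns 0 for a column of names.

```
=COUNTA(C2:C101)
=COUNT(C2:C101)
```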
As you continue learning more about formulas and functions, you'll discover more interesting options. If you want to keep learning, search online for spreadsheet formulas and functions. There's a lot of great information out there.
Our pivot table has counted the number of misspellings, and it clearly shows that the error occurs just once. Otherwise, our four suppliers are accurately accounted for in our data.
Now we can correct the spelling and verify that the rest of the supplier data is clean.
This is also useful practice when querying a database. If you're working in SQL, you can address misspellings using a CASE statement. The CASE statement goes through one or more conditions and returns a value as soon as a condition is met.
Note: This section of the video does not use the Jeff's Party Planet dataset you downloaded earlier. The example that the instructor uses is not from a public dataset and is just intended to demonstrate how a CASE statement works.
Let's discuss how this works in real life using our customer_name table. Check out how our customer, Tony Magnolia, shows up as Tony and Tnoy. Tony's name was misspelled. Let's say we want a list of our customer IDs and the customers' first names so we can write personalized notes thanking each customer for their purchase. We don't want Tony's note to be addressed incorrectly to "Tnoy." Here's where we can use the CASE statement.
We'll start our query with the basic SQL structure: SELECT, FROM, and WHERE. We know that the data comes from the customer_name table in the customer_data dataset, so we can add customer underscore data dot customer underscore name after FROM. Next, we tell SQL what data to pull in the SELECT clause. We want customer_id and first_name. We can go ahead and add customer underscore ID after SELECT. But for our customers' first names, we know that Tony was misspelled, so we'll correct that using CASE. We'll add CASE, then WHEN, and type first underscore name equals "Tnoy." Next we'll use the THEN command and type "Tony," followed by the ELSE command. Here we will type first underscore name, followed by END AS, and then we'll type cleaned_name. Finally, we're not filtering our data, so we can eliminate the WHERE clause. A CASE statement can cover multiple cases; if we wanted to search for a few more misspelled names, our statement would look similar, with some additional names, like this.
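Written out, the query described above looks like this (dataset, table, and column names as given in the transcript):

```sql
SELECT
  customer_id,
  CASE
    WHEN first_name = 'Tnoy' THEN 'Tony'
    -- more WHEN clauses could be added here to correct other misspellings
    ELSE first_name
  END AS cleaned_name
FROM customer_data.customer_name
```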
There you go.
Now that you've learned how you can use spreadsheets and SQL to fix errors automatically, we'll explore how to keep track of our changes next.
Data-cleaning verification: A checklist
Correct the most common problems
Review the goal of your project
Test your knowledge on manual data cleaning
Documenting results and the cleaning process
Capturing cleaning changes
Hi again. Now that you've learned how to make your data squeaky clean, it's time to address all the dirt you've left behind. When you clean your data, all the incorrect or outdated information is gone, leaving you with the highest-quality content. But all those changes you made to the data are valuable too. In this video, we'll discuss why keeping track of changes is important to every data project and how to document all your cleaning changes to make sure everyone stays informed.
This involves documentation, which is the process of tracking changes, additions, deletions, and errors involved in your data-cleaning effort.
You can think of it like a crime TV show. Crime evidence is found at the scene and passed on to the forensics team. They analyze every inch of the scene and document every step, so they can tell a story with the evidence. A lot of times, the forensic scientist is called to court to testify about that evidence, and they have a detailed report to refer to. The same thing applies to data cleaning.
Data errors are the crime, data cleaning is gathering evidence, and documentation is detailing exactly what happened for peer review or court.
Having a record of how a data set evolved does three very important things.
First, it lets us recover from data-cleaning errors. Instead of scratching our heads, trying to remember what we might have done three months ago, we have a cheat sheet to rely on if we come across the same errors again later. It's also a good idea to create a clean table rather than overwriting your existing table. This way, you still have the original data in case you need to redo the cleaning.
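For example, in SQL you might write the cleaned rows to a new table instead of overwriting the original; this is just a sketch, and the table names are hypothetical:

```sql
-- Keep customer_name intact; put the cleaned rows in a new table.
CREATE TABLE customer_data.customer_name_cleaned AS
SELECT
  customer_id,
  TRIM(first_name) AS first_name   -- one example cleaning step
FROM customer_data.customer_name;
```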
Second, documentation gives you a way to inform other users of changes you've made. If you ever go on vacation or get promoted, the analyst who takes over for you will have a reference sheet to check in with.
Third, documentation helps you to determine the quality of the data to be used in analysis. The first two benefits assume the errors aren't fixable. But if they are, a record gives the data engineer more information to refer to. It's also a great warning for ourselves that the data set is full of errors and should be avoided in the future. If the errors were time-consuming to fix, it might be better to check out alternative data sets that we can use instead.
Data analysts usually use a changelog to access this information. As a reminder, a changelog is a file containing a chronologically ordered list of modifications made to a project. You can use and view a changelog in spreadsheets and SQL to achieve similar results.
Let's start with the spreadsheet. We can use Sheet's version history, which provides a real-time tracker of all the changes and who made them from individual cells to the entire worksheet. To find this feature, click the File tab, and then select Version history. In the right panel, choose an earlier version. We can find who edited the file and the changes they made in the column next to their name. To return to the current version, go to the top left and click "Back." If you want to check out changes in a specific cell, we can right-click and select Show Edit History. That's it. Also, if you want others to be able to browse a sheet's version history, you'll need to assign permission.
Now let's switch gears and talk about SQL. The way you create and view a changelog with SQL depends on the software program you're using. Some companies even have their own separate software that keeps track of changelogs and important SQL queries. This gets pretty advanced. But essentially, all you have to do is specify exactly what you did and why when you commit a query to the repository as a new and improved query. This allows the company to revert to a previous version if something you've done crashes the system, which has happened to me before.
Another option is to just add comments as you go while you're cleaning data in SQL. This will help you construct your changelog after the fact.
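A sketch of what that might look like; the table name and values are hypothetical, echoing the supplier misspelling from earlier:

```sql
-- 02-23-2019: Some rows had the supplier keyed in as 'Plos Foods'.
-- Standardizing to 'Plus Foods' so supplier counts come out right.
UPDATE supplier_data.orders
SET supplier = 'Plus Foods'
WHERE supplier = 'Plos Foods';
```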
For now, we'll check out query history, which tracks all the queries you've run. You can click on any of them to revert to a previous version of your query or to bring up an older version to find what you've changed. Here's what we've got: I'm in the Query history tab. Listed on the bottom right are all the queries you've run, by date and time. You can click the icon to the right of each individual query to bring it up in the Query editor.
Changelogs like these are a great way to keep yourself on track. They also let your team get real-time updates when they want them. But there's another way to keep the communication flowing, and that's reporting.
Stick around, and you'll learn some easy ways to share your documentation and maybe impress your stakeholders in the process. See you in the next video.
Embrace changelogs
Automated version control takes you most of the way
Changelogs take you down the last mile
What also happens IRL (in real life)
Self-Reflection: Creating a changelog
# Changelog
This file contains the notable changes to the project.

Version 1.0.0 (02-23-2019)

## New
- Added column classifiers (Date, Time, PerUnitCost, TotalCost, etc.)
- Added column "AveCost" to track average item cost

## Changes
- Changed date format to MM-DD-YYYY
- Removed whitespace (cosmetic)

## Fixes
- Fixed misalignment in column "TotalCost" where some rows did not match the correct dates
- Fixed SUM to run over the entire column instead of a partial range
What to record in a changelog
Reflection
Why documentation is important
Great, you're back. Let's set the stage. The crime is dirty data. We've gathered the evidence. It's been cleaned, verified, and cleaned again.
Now it's time to present our evidence. We'll retrace the steps and present our case to our peers. As we discussed earlier, data cleaning, verifying, and reporting is a lot like crime drama. Now it's our day in court. Just like a forensic scientist testifies on the stand about the evidence, data analysts are counted on to present their findings after a data cleaning effort.
Earlier, we learned how to document and track every step of the data-cleaning process, which means we have solid information to pull from. As a quick refresher, documentation is the process of tracking changes, additions, deletions, and errors involved in a data-cleaning effort. Changelogs are a good example of this: because a changelog is organized chronologically, it provides a real-time account of every modification.
Documenting will be a huge time saver for you as a future data analyst. It's basically a cheat sheet you can refer to if you're working with a similar dataset or need to address similar errors. While your team can view changelogs directly, stakeholders can't, and they have to rely on your report to know what you did.
Let's check out how we might document our data-cleaning process using an example we worked with earlier. In that example, we found that an association had two instances of the same $500 membership in its database. We decided to fix this manually by deleting the duplicate entry.
There are plenty of ways we could go about documenting what we did. One common way is to create a doc listing the steps we took and the impact they had. For example, first on your list would be that you removed the duplicate instance, which decreased the number of rows from 33 to 32 and lowered the membership total by $500.
If we were working with SQL, we could include a comment in the statement describing the reason for a change without affecting the execution of the statement. That's something a bit more advanced, which we'll talk about later. Regardless of how we capture and share our changelogs, we're setting ourselves up for success by being 100 percent transparent about our data cleaning. This keeps everyone on the same page and shows project stakeholders that we are accountable for effective processes. In other words, this helps build our credibility as witnesses who can be trusted to present all the evidence accurately during testimony. For dirty data, it's an open and shut case.
Feedback and cleaning
Welcome back. By now it's safe to say that verifying, documenting, and reporting are valuable steps in the data-cleaning process. You have proof to give stakeholders that your data is accurate and reliable, and that the effort to attain it was well-executed and documented. The next step is getting feedback about the evidence and using it for good, which we'll cover in this video.
Clean data is important to the task at hand. But the data-cleaning process itself can reveal insights that are helpful to a business. The feedback we get when we report on our cleaning can transform data collection processes, and ultimately business development.
For example, one of the biggest challenges of working with data is dealing with errors. Some of the most common errors involve human mistakes like mistyping or misspelling, flawed processes like poor design of a survey form, and system issues where older systems integrate data incorrectly. Whatever the reason, data-cleaning can shine a light on the nature and severity of error-generating processes.
With consistent documentation and reporting, we can uncover error patterns in data collection and entry procedures and use the feedback we get to make sure common errors aren't repeated. Maybe we need to reprogram the way the data is collected or change specific questions on the survey form.
In more extreme cases, the feedback we get can even send us back to the drawing board to rethink expectations and possibly update quality control procedures. For example, sometimes it's useful to schedule a meeting with a data engineer or data owner to make sure the data is brought in properly and doesn't require constant cleaning.
Once errors have been identified and addressed, stakeholders have data they can trust for decision-making. And by reducing errors and inefficiencies in data collection, the company just might discover big increases to its bottom line.
Congratulations! You now have the foundation you need to successfully verify and report on your cleaning results. Stay tuned to keep building on your new skills.
Advanced functions for speedy data-cleaning
Keeping data clean and in sync with a source

The IMPORTRANGE function in Google Sheets and the Paste Link feature (a Paste Special option in Microsoft Excel) both allow you to insert data from one sheet to another. Using these on a large amount of data is more efficient than manual copying and pasting. They also reduce the chance of errors being introduced by copying and pasting the wrong data. They are also helpful for data cleaning because you can "cherry pick" the data you want to analyze and leave behind the data that isn't relevant to your project. Basically, it is like canceling noise from your data so you can focus on what is most important to solve your problem.

This functionality is also useful for day-to-day data monitoring; with it, you can build a tracking spreadsheet to share the relevant data with others. The data is synced with the data source, so when the data is updated in the source file, the tracked data is also refreshed.
In Google Sheets, you can use the IMPORTRANGE function. It enables you to specify a range of cells in the other spreadsheet to duplicate in the spreadsheet you are working in. You must allow access to the spreadsheet containing the data the first time you import the data.
The URL shown below is for syntax purposes only. Don't enter it in your own spreadsheet. Replace it with a URL to a spreadsheet you have created so you can control access to it by clicking the Allow access button.
Refer to the Google support page for IMPORTRANGE for sample usage and syntax.
Example of using IMPORTRANGE

An analyst monitoring a fundraiser needs to track and ensure that matching funds are distributed. They use IMPORTRANGE to pull all the matching transactions into a spreadsheet containing all of the individual donations. This enables them to determine which donations eligible for matching funds still need to be processed. Because the total number of matching transactions increases daily, they simply need to change the range used by the function to import the most up-to-date data.
On Tuesday, they use the following to import the donor names and matched amounts:

=IMPORTRANGE("https://docs.google.com/spreadsheets/d/abcd123abcd123", "Matched Funds!A1:B4001")

On Wednesday, another 500 transactions were processed. They increase the range used by 500 to include the latest transactions when importing the data to the individual donor spreadsheet:

=IMPORTRANGE("https://docs.google.com/spreadsheets/d/abcd123abcd123", "Matched Funds!A1:B4501")
Note: The above examples are for illustrative purposes only. Don't copy and paste them into your spreadsheet. To try it out yourself, you will need to substitute your own URL (and sheet name if you have multiple tabs) along with the range of cells in the spreadsheet that you have populated with data.
Pulling data from other data sources

The QUERY function is also useful when you want to pull data from another spreadsheet. The QUERY function's SQL-like ability can extract specific data within a spreadsheet. For a large amount of data, using the QUERY function is faster than filtering data manually. This is especially true when repeated filtering is required. For example, you could generate a list of all customers who bought your company's products in a particular month using manual filtering. But if you also want to figure out customer growth month over month, you have to copy the filtered data to a new spreadsheet, filter the data for sales during the following month, and then copy those results for the analysis. With the QUERY function, you can get all the data for both months without a need to change your original dataset or copy results.
The QUERY function syntax is similar to IMPORTRANGE. You enter the sheet by name and the range of data that you want to query from, and then use the SQL SELECT command to select the specific columns. You can also add specific criteria after the SELECT statement by including a WHERE statement. But remember, all of the SQL code you use has to be placed between the quotes!
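For instance, pulling two months of sales for the month-over-month comparison described above might look like this (the sheet name, range, and columns are hypothetical, and the third argument indicates the range includes one header row):

```
=QUERY(Sales!A1:D500, "SELECT A, B WHERE D = 'June'", 1)
=QUERY(Sales!A1:D500, "SELECT A, B WHERE D = 'July'", 1)
```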
Google Sheets runs the Google Visualization API Query Language across the data. Excel spreadsheets use a query wizard to guide you through the steps to connect to a data source and select the tables. In either case, you can be sure that the imported data is verified and clean, based on the criteria in the query.
Examples of using QUERY
Check out the Google support page for the QUERY function with sample usage, syntax, and examples you can download in a Google sheet.
Link to make a copy of the sheet: QUERY examples
The solution

Analysts can use SQL to pull a specific dataset into a spreadsheet. They can then use the QUERY function to create multiple tabs (views) of that dataset. For example, one tab could contain all the sales data for a particular month and another tab could contain all the sales data from a specific region. This solution illustrates how SQL and spreadsheets are used well together.
Filtering data to get what you want

The FILTER function is fully internal to a spreadsheet and doesn't require the use of a query language. The FILTER function lets you view only the rows (or columns) in the source data that meet your specified conditions. It makes it possible to pre-filter data before you analyze it.
The FILTER function might run faster than the QUERY function. But keep in mind, the QUERY function can be combined with other functions for more complex calculations. For example, the QUERY function can be used with other functions like SUM and COUNT to summarize data, but the FILTER function can't.
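For comparison, here is a minimal FILTER sketch (the range and condition are hypothetical) that returns only the rows where column C is greater than 100:

```
=FILTER(A2:C100, C2:C100 > 100)
```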
Example of using FILTER

Check out the Google support page for the FILTER function with sample usage, syntax, and examples you can download in a Google sheet.
Link to make a copy of the sheet: FILTER examples