Course 4-2

Google Data Analytics Professional

Process Data from Dirty to Clean

Week 2 - Sparkling-clean data

Every data analyst wants clean data to work with when performing an analysis. In this part of the course, you’ll learn the difference between clean and dirty data. You’ll also explore data-cleaning techniques using spreadsheets and other tools.

Learning Objectives

  • Differentiate between clean and dirty data
  • Explain the characteristics of dirty data
  • Describe data cleaning techniques with reference to identifying errors, redundancy, compatibility and continuous monitoring
  • Identify common pitfalls when cleaning data
  • Demonstrate an understanding of the use of spreadsheets to clean data

Data cleaning is a must

Clean it up!

Can you guess what inaccurate or bad data costs businesses every year? Thousands of dollars? Millions? Billions? Well, according to IBM, the yearly cost of poor-quality data is $3.1 trillion in the US alone. That's a lot of zeros.

Now, can you guess the number one cause of poor-quality data? It's not a new system implementation or a computer technical glitch. The most common factor is actually human error.

Here's a spreadsheet from a law office. It shows customers, the legal services they bought, the service order number, how much they paid, and the payment method.

Dirty data can be the result of someone typing in a piece of data incorrectly; inconsistent formatting; blank fields; or the same piece of data being entered more than once, which creates duplicates.

Dirty data is data that's incomplete, incorrect, or irrelevant to the problem you're trying to solve.

When you work with dirty data, you can't be sure that your results are correct. In fact, you can pretty much bet they won't be. Earlier, you learned that data integrity is critical to reliable data analytics results, and clean data helps you achieve data integrity.

Clean data is data that's complete, correct, and relevant to the problem you're trying to solve. When you work with clean data, you'll find that your projects go much more smoothly.

I remember the first time I witnessed firsthand how important clean data really is. I had just started using SQL, and I thought it worked like magic. I could have the computer sum up millions of numbers, saving me tons of time and effort. But I quickly discovered that only works when the data is clean. If there was even one accidental letter in a column that should only have numbers, the computer wouldn't know what to do. So, it would throw an error, and suddenly, I'm stuck. And there's no way I could add up millions of numbers by myself. So, I had to clean out that data to make it work. The good news is that there are plenty of effective processes and tools to help you do that.
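Here's a minimal sketch of that situation in Python with pandas (the values are made up for illustration): coercing the column to numeric exposes the stray letter instead of halting the whole calculation.

```python
import pandas as pd

# Hypothetical column that should contain only numbers; "87O" has a letter O.
amounts = pd.Series(["1200", "340", "87O", "5600"])

# errors="coerce" turns anything that isn't a valid number into NaN,
# so the dirty entry can be found and fixed instead of breaking the sum.
numeric = pd.to_numeric(amounts, errors="coerce")
print(amounts[numeric.isna()])  # reveals the dirty value: "87O"
print(numeric.sum())            # 7140.0 -- sums only the clean values
```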

Coming up, you'll gain the skills and knowledge you need to make sure the data you work with is always clean. Along the way, we'll dig deeper into the difference between clean and dirty data, and why clean data is so important. We'll also talk about different ways to clean your data and common problems to look for during the process.

Ready to start? Let's do it.

Why data cleaning is important

Clean data is incredibly important for effective analysis. If a piece of data is entered into a spreadsheet or database incorrectly, or if it's repeated, or if a field is left blank, or if data formats are inconsistent, the result is dirty data. Small mistakes can lead to big consequences in the long run.

I'll be completely honest with you, data cleaning is like brushing your teeth. It's something you should do and do properly because otherwise, it can cause serious problems. For teeth, that might be cavities or gum disease. For data, that might be costing your company money, or an angry boss. But here's the good news. If you keep brushing twice a day, every day, it becomes a habit. Soon, you don't even have to think about it. It's the same with data. Trust me, it will make you look great when you take the time to clean up that dirty data.

As a quick refresher, dirty data is incomplete, incorrect, or irrelevant to the problem you're trying to solve. It can't be used in a meaningful way, which makes analysis very difficult, if not impossible.

On the other hand, clean data is complete, correct, and relevant to the problem you're trying to solve. This allows you to understand and analyze information and identify important patterns, connect related information, and draw useful conclusions. Then you can apply what you learn to make effective decisions.

In some cases, you won't have to do a lot of work to clean data. For example, when you use internal data that's been verified and cared for by your company's data engineers and data warehouse team, it's more likely to be clean.

Let's talk about some people you'll work with as a data analyst. Data engineers transform data into a useful format for analysis and give it a reliable infrastructure. This means they develop, maintain, and test databases, data processors and related systems.

Data warehousing specialists develop processes and procedures to effectively store and organize data. They make sure that data is available, secure, and backed up to prevent loss.

When you become a data analyst, you can learn a lot by working with the person who maintains your databases to learn about their systems. If data passes through the hands of a data engineer or a data warehousing specialist first, you know you're off to a good start on your project.

There are lots of great career opportunities as a data engineer or a data warehousing specialist. If this kind of work sounds interesting to you, maybe your career path will involve helping organizations save lots of time, effort, and money by making sure their data is sparkling clean. But even if your data analytics career goes in a different direction and you have the advantage of working with data engineers and warehousing specialists, you're still likely to have to clean your own data. It's important to remember: no dataset is perfect. It's always a good idea to examine and clean data before beginning analysis.

Here's an example. Let's say you're working on a project where you need to figure out how many people use your company's software program. You have a spreadsheet that was created internally and verified by a data engineer and a data warehousing specialist. Check out the column labeled "Username." It might seem logical that you can just scroll down and count the rows to figure out how many users you have. But that won't work because one person sometimes has more than one username. Maybe they registered from different email addresses, or maybe they have a work and personal account. In situations like this, you would need to clean the data by eliminating any rows that are duplicates. Once you've done that, there won't be any more duplicate entries. Then your spreadsheet is ready to be put to work.
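A quick sketch of that cleanup, assuming a pandas DataFrame with hypothetical column names: counting rows overstates the user count until duplicate usernames are removed.

```python
import pandas as pd

# Hypothetical user table; column names are assumptions for illustration.
users = pd.DataFrame({
    "username": ["kai_w", "kai_w", "ana_p", "jrodriguez"],
    "email": ["kai@work.com", "kai@home.com", "ana@work.com", "jr@work.com"],
})

print(len(users))                   # 4 rows, but...
print(users["username"].nunique())  # ...only 3 distinct users

# Keep one row per username before counting users.
deduped = users.drop_duplicates(subset="username")
```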

So far we've discussed working with internal data. But data cleaning becomes even more important when working with external data, especially if it comes from multiple sources.

Let's say the software company from our example surveyed its customers to learn how satisfied they are with its software product. But when you review the survey data, you find that you have several nulls.

A null is an indication that a value does not exist in a data set. Note that it's not the same as a zero. In the case of a survey, a null would mean the customers skipped that question. A zero would mean they provided zero as their response.

To do your analysis, you would first need to clean this data. Step one would be to decide what to do with those nulls. You could either filter them out and communicate that you now have a smaller sample size, or you can keep them in and learn from the fact that the customers did not provide responses. There are lots of reasons why this could have happened. Maybe your survey questions weren't written as well as they could be. Maybe they were confusing or biased, something we learned about earlier. We've touched on the basics of cleaning internal and external data, but there's lots more to come.
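Here's a small sketch of both options, using a hypothetical survey-score column; note how nulls and zeros are counted separately.

```python
import pandas as pd
import numpy as np

# Hypothetical survey responses: NaN means the customer skipped the question,
# while 0 is an actual answer of zero -- they are not the same thing.
scores = pd.Series([8, np.nan, 0, 9, np.nan, 7])

print(scores.isna().sum())  # 2 nulls (skipped questions)
print((scores == 0).sum())  # 1 genuine zero response

answered = scores.dropna()                # option 1: smaller sample, report it
non_response_rate = scores.isna().mean()  # option 2: learn from the nulls
```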

Soon, we'll learn about the common errors to be aware of to ensure your data is complete, correct, and relevant. See you soon!

Angie: Why I love cleaning data

I am Angie. I'm a program manager of engineering at Google. I truly believe that cleaning data is the heart and soul of data. It's how you get to know your data: its quirks, its flaws, its mysteries. I love a good mystery. I remember one time I found somebody had purchased, I think it was one million dollars worth of chicken sandwiches in one transaction. This mystery drove me nuts. I had all these questions. Could this have really happened? Or maybe it was a really big birthday party? How did they make a million dollars worth of chicken sandwiches? I was cleaning my data and trying to figure out where things had gone wrong. We ended up finding out that we'd been squaring and multiplying all of our transactions for a very specific case. It took us about three days to figure this out. I will never forget the moment when it was like, aha! We got to the bottom of it. The result is our data was cleaned, and we had this great dataset that we could use for analysis. But what I loved was just the mystery of it and getting to know all these weird intricacies about my dataset. It felt like a superpower almost, like I was a detective, and I had gone in there and I had really solved something. I love cleaning data!

What is dirty data?

Earlier, we discussed that dirty data is data that is incomplete, incorrect, or irrelevant to the problem you are trying to solve. This reading summarizes:

  • Types of dirty data you may encounter
  • What may have caused the data to become dirty
  • How dirty data is harmful to businesses

| Type of dirty data | Description | Possible causes | Potential harm to businesses |
|---|---|---|---|
| Duplicate data | Any data record that shows up more than once | Manual data entry, batch data imports, or data migration | Skewed metrics or analyses, inflated or inaccurate counts or predictions, or confusion during data retrieval |
| Outdated data | Any data that is old and should be replaced with newer, more accurate information | People changing roles or companies, or software and systems becoming obsolete | Inaccurate insights, decision-making, and analytics |
| Incomplete data | Any data that is missing important fields | Improper data collection or incorrect data entry | Decreased productivity, inaccurate insights, or inability to complete essential services |
| Incorrect/inaccurate data | Any data that is complete but inaccurate | Human error inserted during data input, fake information, or mock data | Inaccurate insights or insights based on bad information that result in faulty decision-making |
| Inconsistent data | Any data that uses different formats to represent the same thing | Data stored incorrectly or errors inserted during data transfer | Contradictory data points leading to confusion or inability to classify or segment customers |

Business impact of dirty data

For further reading on the business impact of dirty data, enter the term “dirty data” into your preferred browser’s search bar to bring up numerous articles on the topic. Here are a few impacts cited for certain industries from a previous search:

  • Banking: Inaccuracies cost companies between 15% and 25% of revenue (source).
  • Digital commerce: Up to 25% of B2B database contacts contain inaccuracies (source).
  • Marketing and sales: 99% of companies are actively tackling data quality in some way (source).
  • Healthcare: Duplicate records can make up 10%, and even up to 20%, of a hospital’s electronic health records (source).

Recognize and remedy dirty data

Hey, there. Here, we'll focus on common issues associated with dirty data. These include spelling and other text errors, inconsistent labels, formats, and field lengths, missing data, and duplicates. This will help you recognize problems more quickly and give you the information you need to fix them when you encounter something similar during your own analysis. This is incredibly important in data analytics.

Let's go back to our law office spreadsheet. As a quick refresher, we'll start by checking out the different types of dirty data it shows.

Sometimes, someone might key in a piece of data incorrectly. Other times, they might not keep data formats consistent. It's also common to leave a field blank. That's also called a null, which we learned about earlier. If someone adds the same piece of data more than once, that creates a duplicate.

Let's break that down. Then we'll learn about a few other types of dirty data and strategies for cleaning it. Misspellings, spelling variations, mixed-up letters, inconsistent punctuation, and typos in general, happen when someone types in a piece of data incorrectly. As a data analyst, you'll also deal with different currencies. For example, one dataset could be in US dollars and another in euros, and you don't want to get them mixed up. We want to find these types of errors and fix them like this.

You'll learn more about this soon.

Clean data depends largely on the data integrity rules that an organization follows, such as spelling and punctuation guidelines.

For example, a beverage company might ask everyone working in its database to enter data about volume in fluid ounces instead of cups. It's great when an organization has rules like this in place.

It really helps minimize the amount of data cleaning required, but it can't eliminate it completely. Like we discussed earlier, there's always the possibility of human error.

The next type of dirty data our spreadsheet shows is inconsistent formatting. In this example, something that should be formatted as currency is shown as a percentage. Until this error is fixed, like this, the law office will have no idea how much money this customer paid for its services. We'll learn about different ways to solve this and many other problems soon.

We discussed nulls previously, but as a reminder, nulls are empty fields. This kind of dirty data requires a little more work than just fixing a spelling error or changing a format. In this example, the data analysts would need to research which customer had a consultation on July 4th, 2020. Then when they find the correct information, they'd have to add it to the spreadsheet.

Another common type of dirty data is duplicate data. Maybe two different people added this appointment on August 13th, not realizing that someone else had already done it, or maybe the person entering the data hit copy and paste by accident. Whatever the reason, it's the data analyst's job to identify this error and correct it by deleting one of the duplicates.

Now, let's continue on to some other types of dirty data. The first has to do with labeling.

To understand labeling, imagine trying to get a computer to correctly identify panda bears among images of all different kinds of animals. You need to show the computer thousands of images of panda bears. They're all labeled as panda bears. Any incorrectly labeled picture, like the one here that's just bear, will cause a problem.

The next type of dirty data is having an inconsistent field length. You learned earlier that a field is a single piece of information from a row or column of a spreadsheet. Field length is a tool for determining how many characters can be keyed into a field. Assigning a certain length to the fields in your spreadsheet is a great way to avoid errors. For instance, if you have a column for someone's birth year, you know the field length is four because all years are four digits long. Some spreadsheet applications have a simple way to specify field lengths and make sure users can only enter a certain number of characters into a field.

This is part of data validation. Data validation is a tool for checking the accuracy and quality of data before adding or importing it. Data validation is a form of data cleansing, which you'll learn more about soon. But first, you'll get familiar with more techniques for cleaning data. This is a very important part of the data analyst job. I look forward to sharing these data-cleaning strategies with you.
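As a rough illustration of the same check outside a spreadsheet, here's a sketch that validates a hypothetical birth-year column against the expected four-digit field length before the data is imported.

```python
import pandas as pd

# Hypothetical birth years entered as text; every year should be 4 digits.
birth_years = pd.Series(["1984", "1999", "86", "2001"])

# Flag anything that isn't exactly four digits -- the same rule a
# spreadsheet's data validation would enforce at entry time.
valid = birth_years.str.fullmatch(r"\d{4}")
print(birth_years[~valid])  # "86" fails the check
```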

Data integrity/Clean and Dirty

Principles of data integrity

Practice Quiz: clean versus dirty data

Begin cleaning data

Data-cleaning tools and techniques

Hi. Now that you're familiar with some of the most common types of dirty data, it's time to clean them up. As you've learned, clean data is essential to data integrity and reliable solutions and decisions. The good news is that spreadsheets have all kinds of tools you can use to get your data ready for analysis. The techniques for data cleaning will be different depending on the specific data set you're working with. So we won't cover everything you might run into, but this will give you a great starting point for fixing the types of dirty data analysts find most often. Think of everything that's coming up as a teaser trailer of data cleaning tools. I'm going to give you a basic overview of some common tools and techniques, and then we'll practice them again later on.

Here, we'll discuss how to remove unwanted data, clean up text to remove extra spaces and blanks, fix typos, and make formatting consistent. However, before removing unwanted data, it's always a good practice to make a copy of the data set. That way, if you remove something that you end up needing in the future, you can easily access it and put it back in the data set. Once that's done, then you can move on to getting rid of the duplicates or data that isn't relevant to the problem you're trying to solve. Typically, duplicates appear when you're combining data sets from more than one source or using data from multiple departments within the same business. You've already learned a bit about duplicates, but let's practice removing them once more now using this spreadsheet, which lists members of a professional logistics association. Duplicates can be a big problem for data analysts. So it's really important that you can find and remove them before any analysis starts.

Here's an example of what I'm talking about. Let's say this association has duplicates of one person's $500 membership in its database. When the data is summarized, the analyst would think there was $1,000 being paid by this member and would make decisions based on that incorrect data. But in reality, this member only paid $500. These problems can be fixed manually, but most spreadsheet applications also offer lots of tools to help you find and remove duplicates.
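A minimal sketch of that exact problem, with hypothetical column names: the duplicate $500 row inflates the total until it's removed.

```python
import pandas as pd

# Hypothetical dues table with one $500 membership entered twice.
dues = pd.DataFrame({
    "member_id": [101, 102, 102, 103],
    "dues_paid": [250, 500, 500, 100],
})

print(dues["dues_paid"].sum())   # 1350 -- inflated by the duplicate row

clean = dues.drop_duplicates()   # remove exact duplicate rows first
print(clean["dues_paid"].sum())  # 850 -- the true total
```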

Now, irrelevant data, which is data that doesn't fit the specific problem that you're trying to solve, also needs to be removed. Going back to our association membership list example, let's say a data analyst was working on a project that focused only on current members. They wouldn't want to include information on people who are no longer members, or who never joined in the first place.

Removing irrelevant data takes a little more time and effort because you have to figure out the difference between the data you need and the data you don't. But believe me, making those decisions will save you a ton of effort down the road.

The next step is removing extra spaces and blanks. Extra spaces can cause unexpected results when you sort, filter, or search through your data. And because these characters are easy to miss, they can lead to unexpected and confusing results.

For example, if there's an extra space in a member ID number, when you sort the column from lowest to highest, this row will be out of place. To remove these unwanted spaces or blank cells, you can delete them yourself. Or again, you can rely on your spreadsheets, which offer lots of great functions for removing spaces or blanks automatically.

The next data cleaning step involves fixing misspellings, inconsistent capitalization, incorrect punctuation, and other typos. These types of errors can lead to some big problems.

Let's say you have a database of emails that you use to keep in touch with your customers. If some emails have misspellings, a period in the wrong place, or any other kind of typo, not only do you run the risk of sending an email to the wrong people, you also run the risk of spamming random people. Think about our association membership example again. Misspelling might cause the data analyst to miscount the number of professional members if they sorted this membership type and then counted the number of rows. Like the other problems you've come across, you can also fix these problems manually. Or you can use spreadsheet tools, such as spellcheck, autocorrect, and conditional formatting to make your life easier. There are also easy ways to convert text to lowercase, uppercase, or proper case, which is one of the things we'll check out again later.

All right, we're getting there. The next step is removing formatting. This is particularly important when you get data from lots of different sources. Every database has its own formatting, which can cause the data to seem inconsistent. Creating a clean and consistent visual appearance for your spreadsheets will help make it a valuable tool for you and your team when making key decisions. Most spreadsheet applications also have a "clear formats" tool, which is a great time saver.

Cleaning data is an essential step in increasing the quality of your data. Now you know lots of different ways to do that. In the next video, you'll take that knowledge even further and learn how to clean up data that's come from more than one source.

Cleaning data from multiple sources

Welcome back. So far you've learned a lot about dirty data and how to clean up the most common errors in a dataset. Now we're going to take that a step further and talk about cleaning up multiple datasets.

Cleaning data that comes from two or more sources is very common for data analysts, but it does come with some interesting challenges.

A good example is a merger, which is an agreement that unites two organizations into a single new one.

In the logistics field, there's been lots of big changes recently, mostly because of the e-commerce boom. With so many people shopping online, it makes sense that the companies responsible for delivering those products to their homes are in the middle of a big shake-up. When big things happen in an industry, it's common for two organizations to team up and become stronger through a merger.

You can review the spreadsheet data along with the instructor as she makes important points about merging data from different organizations. To use the spreadsheet templates, click the links below.

Let's talk about how that will affect our logistics association. As a quick reminder, this spreadsheet lists association member ID numbers, first and last names, addresses, how much each member pays in dues, when the membership expires, and the membership types.

Now, let's think about what would happen if the International Logistics Association decided to get together with the Global Logistics Association in order to help their members handle the incredible demands of e-commerce.

First, all the data from each organization would need to be combined using data merging. Data merging is the process of combining two or more datasets into a single dataset. This presents a unique challenge because when two totally different datasets are combined, the information is almost guaranteed to be inconsistent and misaligned. For example, the Global Logistics Association's spreadsheet has a separate column for a person's suite, apartment, or unit number, but the International Logistics Association combines that information with their street address. This needs to be corrected to make the number of address columns consistent.
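To make the idea concrete, here's a sketch of merging two tiny, made-up extracts whose address schemas disagree: the two-column address is collapsed into one before the datasets are stacked.

```python
import pandas as pd

# Hypothetical extracts from the two associations (names are illustrative).
global_assoc = pd.DataFrame({
    "name": ["R. Ortiz"],
    "street_address": ["25 Dyas Rd"],
    "unit": ["Ste. 101"],
})
intl_assoc = pd.DataFrame({
    "name": ["T. Chen"],
    "street_address": ["9 Pier Ave Apt 4"],  # unit already in the address
})

# Make the schemas consistent, then combine the datasets into one.
global_assoc["street_address"] = (
    global_assoc["street_address"] + " " + global_assoc["unit"].fillna("")
).str.strip()
global_assoc = global_assoc.drop(columns="unit")

merged = pd.concat([global_assoc, intl_assoc], ignore_index=True)
print(merged)
```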

Next, check out how the Global Logistics Association uses people's email addresses as their member ID, while the International Logistics Association uses numbers.

This is a big problem because people in a certain industry, such as logistics, typically join multiple professional associations. There's a very good chance that these datasets include membership information on the exact same person, just recorded in different ways. It's super important to remove those duplicates. Also, the Global Logistics Association has many more member types than the other organization. On top of that, it uses the term "Young Professional" instead of "Student Associate," but both describe members who are still in school or just starting their careers. If you were merging these two datasets, you'd need to work with your team to fix the fact that the two associations describe memberships very differently.

Now you understand why the merging of organizations also requires the merging of data, and that can be tricky. But there's lots of other reasons why data analysts merge datasets.

For example, in one of my past jobs, I merged a lot of data from multiple sources to get insights about our customers' purchases. The kinds of insights I gained helped me identify customer buying patterns. When merging datasets, I always begin by asking myself some key questions to help me avoid redundancy and to confirm that the datasets are compatible.

In data analytics, compatibility describes how well two or more datasets are able to work together.

The first question I would ask is, do I have all the data I need? To gather customer purchase insights, I wanted to make sure I had data on customers, their purchases, and where they shopped.

Next I would ask, does the data I need exist within these datasets? As you learned earlier in this program, this involves considering the entire dataset analytically. Looking through the data before I start using it lets me get a feel for what it's all about, what the schema looks like, if it's relevant to my customer purchase insights, and if it's clean data. That brings me to the next question.

Do the datasets need to be cleaned, or are they ready for me to use?

Because I'm working with more than one source, I will also ask myself, are the datasets cleaned to the same standard? For example, what fields are regularly repeated? How are missing values handled? How recently was the data updated? Finding the answers to these questions and understanding if I need to fix any problems at the start of a project is a very important step in data merging.

In both of the examples we explored here, data analysts could use either spreadsheet tools or SQL queries to clean up, merge, and prepare the datasets for analysis. Depending on the tool you decide to use, the cleanup process can be simple or very complex. Soon, you'll learn how to make the best choice for your situation. As a final note, programming languages like R are also very useful for cleaning data. You'll learn more about how to use R and other concepts we covered soon.

Common data-cleaning pitfalls

Common mistakes to avoid

  • Not checking for spelling errors: Misspellings can be as simple as typing or input errors. Most of the time the wrong spelling or common grammatical errors can be detected, but it gets harder with things like names or addresses. For example, if you are working with a spreadsheet table of customer data, you might come across a customer named “John” whose name has been input incorrectly as “Jon” in some places. The spreadsheet’s spellcheck probably won’t flag this, so if you don’t double-check for spelling errors and catch this, your analysis will have mistakes in it.

  • Forgetting to document errors: Documenting your errors can be a big time saver, as it helps you avoid those errors in the future by showing you how you resolved them. For example, you might find an error in a formula in your spreadsheet. You discover that some of the dates in one of your columns haven’t been formatted correctly. If you make a note of this fix, you can reference it the next time your formula is broken, and get a head start on troubleshooting. Documenting your errors also helps you keep track of changes in your work, so that you can backtrack if a fix didn’t work.

  • Not checking for misfielded values: A misfielded value happens when the values are entered into the wrong field. These values might still be formatted correctly, which makes them harder to catch if you aren’t careful. For example, you might have a dataset with columns for cities and countries. These are the same type of data, so they are easy to mix up. But if you were trying to find all of the instances of Spain in the country column, and Spain had mistakenly been entered into the city column, you would miss key data points. Making sure your data has been entered correctly is key to accurate, complete analysis.

  • Overlooking missing values: Missing values in your dataset can create errors and give you inaccurate conclusions. For example, if you were trying to get the total number of sales from the last three months, but a week of transactions was missing, your calculations would be inaccurate. As a best practice, try to keep your data as clean as possible by maintaining completeness and consistency.

  • Only looking at a subset of the data: It is important to think about all of the relevant data when you are cleaning. This helps make sure you understand the whole story the data is telling, and that you are paying attention to all possible errors. For example, if you are working with data about bird migration patterns from different sources, but you only clean one source, you might not realize that some of the data is being repeated. This will cause problems in your analysis later on. If you want to avoid common errors like duplicates, each field of your data requires equal attention.

  • Losing track of business objectives: When you are cleaning data, you might make new and interesting discoveries about your dataset-- but you don’t want those discoveries to distract you from the task at hand. For example, if you were working with weather data to find the average number of rainy days in your city, you might notice some interesting patterns about snowfall, too. That is really interesting, but it isn’t related to the question you are trying to answer right now. Being curious is great! But try not to let it distract you from the task at hand.

  • Not fixing the source of the error: Fixing the error itself is important. But if that error is actually part of a bigger problem, you need to find the source of the issue. Otherwise, you will have to keep fixing that same error over and over again. For example, imagine you have a team spreadsheet that tracks everyone’s progress. The table keeps breaking because different people are entering different values. You can keep fixing all of these problems one by one, or you can set up your table to streamline data entry so everyone is on the same page. Addressing the source of the errors in your data will save you a lot of time in the long run.

  • Not analyzing the system prior to data cleaning: If you want to clean your data and avoid future errors, you need to understand the root cause of your dirty data. Imagine you are an auto mechanic. You would find the cause of the problem before you started fixing the car, right? The same goes for data. First, you figure out where the errors come from. Maybe they come from data entry errors, a missing spell check, a lack of consistent formats, or duplicates. Then, once you understand where bad data comes from, you can control it and keep your data clean.

  • Not backing up your data prior to data cleaning: It is always good to be proactive and create your data backup before you start your data clean-up. If your program crashes, or if your changes cause a problem in your dataset, you can always go back to the saved version and restore it. The simple procedure of backing up your data can save you hours of work-- and most importantly, a headache.

  • Not accounting for data cleaning in your deadlines/process: All good things take time, and that includes data cleaning. It is important to keep that in mind when going through your process and looking at your deadlines. When you set aside time for data cleaning, it helps you give stakeholders a more accurate ETA, and can help you know when to request an adjusted ETA.

Key takeaways

Data cleaning is essential for accurate analysis and decision-making. Common mistakes to avoid when cleaning data include spelling errors, misfielded values, missing values, only looking at a subset of the data, losing track of business objectives, not fixing the source of the error, not analyzing the system prior to data cleaning, not backing up your data prior to data cleaning, and not accounting for data cleaning in your deadlines/process. By avoiding these mistakes, you can ensure that your data is clean and accurate, leading to better outcomes for your business.

Additional resources

Refer to these "top ten" lists for data cleaning in Microsoft Excel and Google Sheets to help you avoid the most common mistakes:

Hands-On Activity: Cleaning data with spreadsheets

Link to data spreadsheet: Cleaning with spreadsheets

my practice template

Practice Quiz: data-cleaning techniques

Cleaning data in spreadsheets

Data-cleaning features in spreadsheets

Hi again. As you learned earlier, there's a lot of different ways to clean up data. I've shown you some examples of how you can clean data manually, such as searching for and fixing misspellings or removing empty spaces and duplicates. We also learned that lots of spreadsheet applications have tools that help simplify and speed up the data cleaning process. There's a lot of great efficiency tools that data analysts use all the time, such as conditional formatting, removing duplicates, formatting dates, fixing text strings and substrings, and splitting text into columns.

We'll explore those in more detail now. The first is something called conditional formatting. Conditional formatting is a spreadsheet tool that changes how cells appear when values meet specific conditions. Likewise, it can let you know when a cell does not meet the conditions you've set. Visual cues like this are very useful for data analysts, especially when we're working on a large spreadsheet with lots of data. Making certain data points stand out makes the information easier to understand and analyze.

For cleaning data, knowing when the data doesn't follow the condition is very helpful.

Would you like to follow along with the instructor? To use the template for the spreadsheet, click the link below: International Logistics Association Memberships - Data for Cleaning

Let's return to the logistics association spreadsheet to check out conditional formatting in action. We'll use conditional formatting to highlight blank cells. That way, we know where there's missing information so we can add it to the spreadsheet. To do this, we'll start by selecting the range we want to search. For this example, we're not focused on Address 3 and Address 5, so the selection will include all the columns in our spreadsheet except for F and H.

Next, we'll go to Format and choose Conditional formatting.

Great. Our range is automatically indicated in the field. The format rule will be to format cells if the cell is empty. Finally, we'll choose the formatting style. I'm going to pick a shade of bright pink, so my blanks really stand out.

You can select more than one range (A:E, G:G, and I:L) by clicking "Add another range" to produce the same result as in the video.

Then click "Done," and the blank cells are instantly highlighted.

The next spreadsheet tool removes duplicates. As you've learned before, it's always smart to make a copy of the data set before removing anything. Let's do that now.

The instructor will right-click the bottom tab for the sheet and select Duplicate.

Great, now we can continue. You might remember that our example spreadsheet has one association member listed twice. To fix that, go to Data and select "Remove duplicates."

"Remove duplicates" is a tool that automatically searches for and eliminates duplicate entries from a spreadsheet.

Choose "Data has header row" because our spreadsheet has a row at the very top that describes the contents of each column. Next, select "All" because we want to inspect our entire spreadsheet. Finally, "Remove duplicates." You'll notice the duplicate row was found and immediately removed.

Another useful spreadsheet tool enables you to make formats consistent. For example, some of the dates in this spreadsheet are in a standard date format. This could be confusing if you wanted to analyze when association members joined, how often they renewed their memberships, or how long they've been with the association. To make all of our dates consistent, first select column J, then go to "Format," select "Number," then "Date." Now all of our dates have a consistent format.
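Outside a spreadsheet, the same consistency fix might look like this sketch (pandas 2.x; the column values are hypothetical): parse every entry into a real date, then render one format.

```python
import pandas as pd

# Hypothetical join dates keyed in with mixed formats.
joined = pd.Series(["2020-07-04", "07/10/2020", "July 12, 2020"])

# format="mixed" (pandas 2.0+) parses each entry individually, giving
# the whole column one consistent date type and display format.
dates = pd.to_datetime(joined, format="mixed")
print(dates.dt.strftime("%Y-%m-%d").tolist())
```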

Before we go over the next tool, I want to explain what a text string is. In data analytics, a text string is a group of characters within a cell, most often composed of letters. An important characteristic of a text string is its length, which is the number of characters in it. You'll learn more about that soon. For now, it's also useful to know that a substring is a smaller subset of a text string.

Now let's talk about Split. Split is a tool that divides a text string around the specified character and puts each fragment into a new and separate cell. Split is helpful when you have more than one piece of data in a cell and you want to separate them out. This might be a person's first and last name listed together, or it could be a cell that contains someone's city, state, country, and zip code, but you actually want each of those in its own column.

Let's say this association wanted to analyze all of the different professional certifications its members have earned. To do this, you want each certification separated out into its own column. Right now, the certifications are separated by a comma. That's the specified text separating each item, also called the delimiter.

Let's get them separated. Highlight the column, then select "Data," and "Split text to columns." This spreadsheet application automatically knew that the comma was a delimiter and separated each certification. But sometimes you might need to specify what the delimiter should be. You can do that here.
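Here's the same idea as a sketch, splitting a hypothetical comma-delimited certifications column so each fragment lands in its own column.

```python
import pandas as pd

# Hypothetical certifications stored together in one cell per member.
certs = pd.Series(["CSCP, CPIM", "CLTD", "CSCP, CLTD, CPIM"])

# expand=True puts each fragment in a separate column, with the comma
# as the delimiter -- the same idea as "Split text to columns."
split_out = certs.str.split(", ", expand=True)
print(split_out)
```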

Splitting text to columns is also helpful for fixing instances of numbers stored as text. Sometimes values in your spreadsheet will seem like numbers, but they're formatted as text. This can happen when copying and pasting from one place to another or if the formatting is wrong.

To continue following along with the instructor, click the link to the spreadsheet template below: Cosmetics Inc. - Data for Cleaning

For this example, let's check out our new spreadsheet from a cosmetics maker. If a data analyst wanted to determine total profits, they could add up everything in column F. But there's a problem; one of the cells has an error. If you check into it, you learn that the "707" in this cell is text and can't be changed into a number. When the spreadsheet tries to multiply the cost of the product by the number of units sold, it's unable to make the calculation. But if we select the orders column and choose "Split text to columns," the error is resolved because now it can be treated as a number.
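A small sketch of the numbers-stored-as-text problem, using a made-up orders column: the column behaves like text until it's converted.

```python
import pandas as pd

# Hypothetical orders column where "707" was stored as text.
orders = pd.Series([512, "707", 283], dtype="object")

print(pd.api.types.is_numeric_dtype(orders))  # False -- math will misbehave

orders = pd.to_numeric(orders)  # re-interpret the text as a number
print(orders.sum())             # 1502
```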

Coming up, you'll learn about a tool that does just the opposite. CONCATENATE is a function that joins multiple text strings into a single string.

Spreadsheets are a very important part of data analytics. They save data analysts time and effort and help us eliminate errors each and every day. Here, you've learned about some of the most common tools that we use. But there's a lot more to come. Next, we'll learn even more about data cleaning with spreadsheet tools. Bye for now!

Optimize the data-cleaning process

Welcome back. You've learned about some very useful data-cleaning tools that are built right into spreadsheet applications. Now we'll explore how functions can optimize your efforts to ensure data integrity.

As a reminder, a function is a set of instructions that performs a specific calculation using the data in a spreadsheet. The first function we'll discuss is called COUNTIF. COUNTIF is a function that returns the number of cells that match a specified value. Basically, it counts the number of times a value appears in a range of cells.

Let's go back to our professional association spreadsheet.

Would you like to follow along with the instructor using the same spreadsheet? To use the spreadsheet template, click the link below. This is the same spreadsheet used in another video, so you don't need to copy the template or download the spreadsheet again unless you want a fresh copy of the file: International Logistics Association Memberships - Data for Cleaning

In this example, we want to make sure the association membership prices are listed accurately. We'll use COUNTIF to check for some common problems, like negative numbers or a value that's much less or much greater than expected.

To start, let's find the least expensive membership: $100 for student associates. That'll be the lowest number that exists in this column. If any cell has a value that's less than 100, COUNTIF will alert us.

We'll add a few more rows at the bottom of our spreadsheet, then beneath column H, type "member dues less than $100."

Next, type the function in the cell next to it. Every function has a certain syntax that needs to be followed for it to work. Syntax is a predetermined structure that includes all required information and its proper placement.

The syntax of a COUNTIF function should be like this: equals COUNTIF, open parenthesis, the range, a comma, the specified value in quotation marks, and a closed parenthesis. It will show up like this: =COUNTIF(I2:I72, "<100").

Where I2 through I72 is the range, and the value is less than 100. This tells the function to go through column I and return a count of all cells that contain a number less than 100. Turns out there is one! Scrolling through our data, we find that one piece of data was mistakenly keyed in as a negative number. Let's fix that now.

Now we'll use COUNTIF to search for any values that are more than we would expect. The most expensive membership type is $500 for corporate members. Type the function in the cell. This time it will appear as =COUNTIF(I2:I72, ">500"): I2 through I72 is still the range, but the value is greater than 500. There's one here too. Check it out. This entry has an extra zero. It should be $100.
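For comparison, here's a sketch of the same two checks with a boolean mask over a hypothetical dues column, which counts matches the way COUNTIF does.

```python
import pandas as pd

# Hypothetical dues values, standing in for the I2:I72 range.
dues = pd.Series([100, 250, -50, 500, 1000, 350])

print((dues < 100).sum())  # 1 -- the value mistakenly keyed in as negative
print((dues > 500).sum())  # 1 -- the entry with an extra zero
```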

The next function we'll discuss is called LEN. LEN is a function that tells you the length of the text string by counting the number of characters it contains.

This is useful when cleaning data if you have a certain piece of information in your spreadsheet that you know must be a certain length. For example, this association uses six-digit member identification codes. If we'd just imported this data and wanted to be sure our codes are all the correct number of digits, we'd use LEN. The syntax of LEN is equals LEN, open parenthesis, the range, and a closed parenthesis: =LEN(A2). We'll insert a new column after Member ID. Then type an equals sign and LEN. Add an open parenthesis. The range is the first Member ID number in A2. Finish the function by closing the parenthesis. It tells us that there are six characters in cell A2. Let's continue the function through the entire column (double-click) and find out if any results are not six.

But instead of manually going through our spreadsheet to search for these instances, we'll use conditional formatting. We talked about conditional formatting earlier. It's a spreadsheet tool that changes how cells appear when values meet specific conditions. Let's practice that now. Select all of column B except for the header. Then go to Format and choose Conditional formatting. The format rule is to format cells if not equal to six. Click "Done." The cell with the seven inside is highlighted.
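The LEN-plus-conditional-formatting combination translates to a short sketch like this one, with hypothetical member IDs: compute each length, then flag anything that isn't six.

```python
import pandas as pd

# Hypothetical member IDs that must all be six characters long.
member_ids = pd.Series(["384570", "194832", "0238914", "557201"])

lengths = member_ids.str.len()   # the spreadsheet LEN equivalent
print(member_ids[lengths != 6])  # flags the seven-character ID
```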

Now we're going to talk about LEFT and RIGHT. LEFT is a function that gives you a set number of characters from the left side of a text string. RIGHT is a function that gives you a set number of characters from the right side of a text string. As a quick reminder, a text string is a group of characters within a cell, commonly composed of letters, numbers, or both.

To see these functions in action, let's go back to the spreadsheet from the cosmetics maker from earlier. Would you like to follow along with the instructor using the same spreadsheet? To use the spreadsheet template, click the link below: Cosmetics Inc. - Data for Optimizing

This spreadsheet contains product codes. Each has a five-digit numeric code and then a four-character text identifier.

But let's say we only want to work with one side or the other. You can use LEFT or RIGHT to give you the specific set of characters or numbers you need. We'll practice cleaning up our data using the LEFT function first. The syntax of LEFT is equals LEFT, open parenthesis, the range, a comma, and the number of characters from the left side of the text string we want. Then, we finish it with a closed parenthesis.

Here, our project requires just the five-digit numeric codes. In a separate column, type equals LEFT, open parenthesis, then the range. Our range is A2. Then, add a comma, and then the number 5 for our five-digit product code. Finally, finish the function with a closed parenthesis. Our function should show up like this: =LEFT(A2, 5). Press "Enter." And now, we have a substring, which is the number part of the product code only. Click and drag this function through the entire column to separate out the rest of the product codes by number only.

Now, let's say our project only needs the four-character text identifier. For that, we'll use the RIGHT function, and we'll begin the function in the next column. The syntax is equals RIGHT, open parenthesis, the range, a comma, and the number of characters we want. Then, we finish with a closed parenthesis.

Let's key that in now. Equals RIGHT, open parenthesis, and the range is still A2. Add a comma. This time, we'll tell it that we want the first four characters from the right. Close the parenthesis and press "Enter." The function appears as =RIGHT(A2, 4). Then, drag the function through the entire column. Now, we can analyze the products in our spreadsheet based on either substring: the five-digit numeric code or the four-character text identifier. Hopefully, that makes it clear how you can use LEFT and RIGHT to extract substrings from the left and right sides of a string.
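In Python, LEFT and RIGHT are just string slices; here's a sketch with hypothetical product codes matching the five-digit-plus-four-letter pattern.

```python
import pandas as pd

# Hypothetical product codes: five digits followed by four letters.
codes = pd.Series(["15143EXFO", "32729MASC"])

numeric_part = codes.str[:5]   # like LEFT(A2, 5)
text_part = codes.str[-4:]     # like RIGHT(A2, 4)
print(numeric_part.tolist(), text_part.tolist())
```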

Now, let's learn how you can extract something in between. Here's where we'll use something called MID. MID is a function that gives you a segment from the middle of a text string.

This cosmetics company lists all of its clients using a client code. It's composed of the first three letters of the city where the client is located, its state abbreviation, and then a three-digit identifier. But let's say a data analyst needs to work with just the states in the middle. The syntax for MID is equals MID, open parenthesis, the range, then a comma. When using MID, you always need to supply a reference point. In other words, you need to set where the function should start. After that, place another comma, and how many middle characters you want. In this case, our range is D2.

Let's start the function in a new column. Type equals MID, open parenthesis, D2. The first three characters represent a city name, so that means the starting point is the fourth: add a comma and 4. We also need to tell the function how many middle characters we want: add one more comma, and 2, because the state abbreviations are two characters long. The finished function is =MID(D2, 4, 2). Press "Enter" and bam, we get just the state abbreviation. Continue the MID function through the rest of the column.
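The MID extraction maps to a middle slice; here's a sketch with made-up client codes (three city letters, two state letters, three digits).

```python
import pandas as pd

# Hypothetical client codes in the city-state-identifier layout.
clients = pd.Series(["BOSMA001", "AUSTX042"])

# MID(D2, 4, 2) starts at the 4th character and takes 2. Python slices
# are zero-indexed, so characters 4-5 are positions 3:5.
states = clients.str[3:5]
print(states.tolist())  # ['MA', 'TX']
```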

We've learned about a few functions that help separate out specific text strings. But what if we want to combine them instead? For that, we'll use CONCATENATE, which is a function that joins together two or more text strings. The syntax is equals CONCATENATE, then an open parenthesis; inside it, indicate each text string you want to join, separated by commas. Then finish the function with a closed parenthesis.

Just for practice, let's say we needed to rejoin the left and right text strings back into complete product codes. In a new column, let's begin our function. Type equals CONCATENATE, then an open parenthesis. The first text string we want to join is in H2. Then add a comma. The second part is in I2. Add a closed parenthesis, so the function reads =CONCATENATE(H2, I2), and press "Enter." Drag it down through the entire column, and just like that, all of our product codes are back together.
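Rejoining the substrings is plain string concatenation in code; a sketch continuing the hypothetical product codes from above:

```python
import pandas as pd

numeric_part = pd.Series(["15143", "32729"])
text_part = pd.Series(["EXFO", "MASC"])

# Like CONCATENATE(H2, I2): string addition joins the two substrings.
full_codes = numeric_part + text_part
print(full_codes.tolist())  # ['15143EXFO', '32729MASC']
```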

The last function we'll learn about here is TRIM. TRIM is a function that removes leading, trailing, and repeated spaces in data.

Sometimes when you import data, your cells have extra spaces, which can get in the way of your analysis. For example, if this cosmetics maker wanted to look up a specific client name, it won't show up in the search if it has extra spaces. You can use TRIM to fix that problem. The syntax for TRIM is equals TRIM, open parenthesis, your range, and closed parenthesis. In a separate column, type equals TRIM and an open parenthesis. The range is C2, as you want to check out the client names. Close the parenthesis and press "Enter". Finally, continue the function down the column. TRIM fixed the extra spaces.
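One nuance worth noting: spreadsheet TRIM removes leading, trailing, and repeated interior spaces, so a faithful sketch needs both a strip and a collapse (the client names below are made up).

```python
import pandas as pd

# Hypothetical client names imported with stray spaces.
clients = pd.Series(["  Ava Chen ", "Noor Ali", " Liam   OBrien "])

# str.strip() trims the ends; the regex collapses repeated interior spaces,
# matching what spreadsheet TRIM does in one step.
cleaned = clients.str.strip().str.replace(r"\s+", " ", regex=True)
print(cleaned.tolist())  # ['Ava Chen', 'Noor Ali', 'Liam OBrien']
```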

Now we know some very useful functions that can make your data cleaning even more successful. This was a lot of information. As always, feel free to go back and review the video and then practice on your own. We'll continue building on these tools soon, and you'll also have a chance to practice. Pretty soon, these data-cleaning steps will become second nature, like brushing your teeth.

Workflow automation

What can be automated?

  • Communicating with your team and stakeholders
  • Presenting your findings
  • Preparing and cleaning data
  • Data exploration
  • Modeling the data

More about automating data cleaning

One of the most important ways you can streamline your data cleaning is to clean data where it lives. This will benefit your whole team, and it also means you don’t have to repeat the process over and over. For example, you could create a programming script that counted the number of words in each spreadsheet file stored in a specific folder. Using tools that can be used where your data is stored means that you don’t have to repeat your cleaning steps, saving you and your team time and energy.
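As a rough sketch of the word-counting script described above (the folder name and .csv format are assumptions):

```python
import csv
from pathlib import Path

# Count the words in every spreadsheet file (saved as .csv)
# stored in one specific, hypothetical folder.
folder = Path("reports")

for file in sorted(folder.glob("*.csv")):
    with file.open(newline="") as f:
        word_count = sum(
            len(cell.split()) for row in csv.reader(f) for cell in row
        )
    print(f"{file.name}: {word_count} words")
```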

More resources

There are a lot of tools out there that can help automate your processes, and those tools are improving all the time. Here are a few articles or blogs you can check out if you want to learn more about workflow automation and the different tools out there for you to use:

Different data perspectives

Hi, let's get into it. Motivational speaker Wayne Dyer once said, "If you change the way you look at things, the things you look at change." This is so true in data analytics. No two analytics projects are ever exactly the same. So it only makes sense that different projects require us to focus on different information in different ways.

Here, we'll explore different methods that data analysts use to look at data differently and how that leads to more efficient and effective data cleaning.

Some of these methods include sorting and filtering, pivot tables, a function called VLOOKUP, and plotting to find outliers.

Let's start with sorting and filtering. As you learned earlier, sorting and filtering data helps data analysts customize and organize the information the way they need for a particular project. But these tools are also very useful for data cleaning.

You might remember that sorting involves arranging data into a meaningful order to make it easier to understand, analyze, and visualize. For data cleaning, you can use sorting to put things in alphabetical or numerical order, so you can easily find a piece of data. Sorting can also bring duplicate entries closer together for faster identification.

Filters, on the other hand, are very useful in data cleaning when you want to find a particular piece of information. You learned earlier that filtering means showing only the data that meets a specific criteria while hiding the rest. This lets you view only the information you need. When cleaning data, you might use a filter to only find values above a certain number, or just even or odd values. Again, this helps you find what you need quickly and separates out the information you want from the rest. That way you can be more efficient when cleaning your data.
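Both tools have one-line equivalents in code; here's a sketch over a made-up member table.

```python
import pandas as pd

members = pd.DataFrame({
    "member_id": [301, 105, 105, 220],
    "dues_paid": [250, 500, 500, 100],
})

# Sorting brings duplicate entries next to each other for quick spotting.
print(members.sort_values("member_id"))

# Filtering shows only rows that meet a condition and hides the rest.
print(members[members["dues_paid"] > 200])
```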

Another way to change the way you view data is by using pivot tables. You've learned that a pivot table is a data summarization tool that is used in data processing. Pivot tables sort, reorganize, group, count, total, or average data stored in the database. In data cleaning, pivot tables are used to give you a quick, clutter-free view of your data. You can choose to look at the specific parts of the data set that you need to get a visual in the form of a pivot table.

Let's create one now using our cosmetic makers spreadsheet again.

To follow along with the instructor using the same spreadsheet, click the link below: Cosmetics, Inc. Data for Pivot Table and VLOOKUP

Note: the menu option has changed slightly; in newer versions of Google Sheets, you can also select Insert and then Pivot table. To start, select the data we want to use. Here, we'll choose the entire spreadsheet. Select "Data" and then "Pivot table." Choose "New sheet" and "Create."

Let's say we're working on a project that requires us to look at only the most profitable products. Items that earn the cosmetics maker at least $10,000 in orders. So the row we'll include is "Total" for total profits. We'll sort in descending order to put the most profitable items at the top. And we'll show totals. Next, we'll add another row for products so that we know what those numbers are about. We can clearly determine that the most profitable products have the product codes 15143 E-X-F-O and 32729 M-A-S-C. We can ignore the rest for this particular project because they fall below $10,000 in orders.
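The same summary can be sketched with a pivot table in pandas (the product codes and totals below are illustrative): group by product, total the profits, sort descending, and keep only products at or above $10,000.

```python
import pandas as pd

# Hypothetical orders from the cosmetics spreadsheet.
orders = pd.DataFrame({
    "product_code": ["15143 EXFO", "32729 MASC", "30001 LIPS", "15143 EXFO"],
    "total": [9000, 11000, 4000, 6000],
})

summary = orders.pivot_table(index="product_code", values="total", aggfunc="sum")
summary = summary.sort_values("total", ascending=False)
print(summary[summary["total"] >= 10000])  # the most profitable products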

Now, we might be able to use context clues to assume we're talking about exfoliants and mascaras. But we don't know which ones, or if that assumption is even correct. So we need to confirm what the product codes correspond to. And this brings us to the next tool. It's called VLOOKUP. VLOOKUP stands for vertical lookup. It's a function that searches for a certain value in a column to return a corresponding piece of information. When data analysts look up information for a project, it's rare for all of the data they need to be in the same place. Usually, you'll have to search across multiple sheets or even different databases.

The syntax of VLOOKUP is equals VLOOKUP, open parenthesis, then the data you want to look up. Next is a comma and where you want to look for that data. In our example, this will be the name of a spreadsheet followed by an exclamation point. The exclamation point indicates that we're referencing a cell in a different sheet from the one we're currently working in. Again, that's very common in data analytics. Okay, next is the range in the place where you're looking for data, indicated using the first and last cell separated by a colon. After one more comma is the column in the range containing the value to return. Next, another comma and the word "false," which means that an exact match is what we're looking for. Finally, complete your function by closing the parentheses. To put it simply, VLOOKUP searches for the value in the first argument in the leftmost column of the specified location. Then the value of the third argument tells VLOOKUP to return the value in the same row from the specified column. The "false" tells VLOOKUP that we want an exact match. Soon you'll learn the difference between exact and approximate matches. But for now, just know that VLOOKUP takes the value in one cell and searches for a match in another place.

Let's begin. We'll type equals VLOOKUP, then add the data we are looking for, which is the product code (A2). The dollar sign makes sure that the corresponding part of the reference remains unchanged; you can lock just the column, just the row, or both at the same time. Next, we'll tell it to look at Sheet 2, in both columns, and we add 2 to represent the second column, the one containing the value we want returned. The last term, "false," says we want an exact match. With this information, we can now analyze the data for only the most profitable products. Going back to the two most profitable products, we can search for 15143 E-X-F-O and 32729 M-A-S-C. Go to Edit and then Find. Type in the product codes and search for them. Now we can learn which products we'll be using for this particular project.
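A left merge plays the VLOOKUP role in code; here's a sketch with a hypothetical lookup sheet (the product names are invented for illustration).

```python
import pandas as pd

# Hypothetical lookup table, standing in for Sheet 2.
lookup = pd.DataFrame({
    "product_code": ["15143 EXFO", "32729 MASC"],
    "product_name": ["Exfoliant A", "Mascara B"],  # invented names
})
orders = pd.DataFrame({"product_code": ["32729 MASC", "15143 EXFO"]})

# how="left" with exact key matches mirrors VLOOKUP(..., 2, FALSE):
# it pulls the second column's value into the current table.
result = orders.merge(lookup, on="product_code", how="left")
print(result)
```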

The final tool we'll talk about is something called plotting. When you plot data, you put it in a graph, chart, table, or other visual to help you quickly see what it looks like.

Plotting is very useful when trying to identify any skewed data or outliers. For example, if we want to make sure the price of each product is correct, we could create a chart. This would give us a visual aid that helps us quickly figure out if anything looks like an error. So let's select the column with our prices. Then we'll go to Insert and choose Chart. Pick a column chart as the type. One of these prices looks extremely low.

While the chart editor is open, it helps to hover over the smallest data bar to get a better understanding of the data surrounding the outlier. The data pop-up shows that the smallest bar has a value of $0.73. You can use this information to locate the outlier in the Price column, which turns out to be the Greenest Mascara product.

If we look into it, we discover that this item has a decimal point in the wrong place. It should be $7.30, not 73 cents.

That would have a big impact on our total profits. So it's a good thing we caught that during data cleaning. Looking at data in new and creative ways helps data analysts identify all kinds of dirty data.
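Here's a sketch of spotting and fixing that same outlier programmatically (the prices and product names are made up; plotting assumes matplotlib is installed).

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical price column with one misplaced decimal point.
prices = pd.Series(
    [12.50, 9.99, 0.73, 14.25],
    index=["Cleanser", "Toner", "Greenest Mascara", "Serum"],
)

prices.plot(kind="bar")  # the $0.73 bar stands out visually
plt.show()

print(prices.nsmallest(1))         # Greenest Mascara    0.73
prices["Greenest Mascara"] = 7.30  # fix the misplaced decimal point
```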

Coming up, you'll continue practicing these new concepts so you can get more comfortable with them. You'll also learn additional strategies for ensuring your data is clean, and we'll provide you with effective insights. Great work so far.

Even more data-cleaning techniques

Hello. So far you've learned about a lot of different tools and functions that analysts use to clean up data for analysis. Now we'll take a step back and talk about some of the really big-picture aspects of clean data.

Knowing how to fix specific problems, either manually with spreadsheet tools, or with functions, is extremely valuable. But it's also important to think about how your data has moved between systems and how it's evolved along its journey to your data analysis project. To do this, data analysts use something called data mapping. Data mapping is the process of matching fields from one database to another. This is very important to the success of data migration, data integration, and lots of other data management activities. As you learned earlier, different systems store data in different ways. For example, the state field in one spreadsheet might show Maryland spelled out. But another spreadsheet might store it as MD.

Data mapping helps us note these kinds of differences so we know when data is moved and combined it will be compatible. As a quick reminder, compatibility describes how well two or more data sets are able to work together. So the first step to data mapping is identifying what data needs to be moved. This includes the tables and the fields within them. We also need to define the desired format for the data once it reaches its destination.

To figure out how this works let's go back to the merger between our two logistics associations.

Would you like to follow along with the instructor using the same spreadsheets? To use the spreadsheet templates, click the links below. NOTE: The Logistics Association Merger sheet is the final merged result in this video. You can examine it to spot-check the merged data. You will not need to merge it yourself.

Starting with the first data field, we'll identify that we need to move both sets of member IDs. To define the desired format, we'll choose whether to use numbers like this spreadsheet, or email addresses like the other spreadsheet.

Next comes mapping the data. Depending on the schema and number of primary and foreign keys in a data source, data mapping can be simple or very complex. As a reminder, a schema is a way of describing how something is organized. A primary key references a column in which each value is unique and a foreign key is a field within a table that is a primary key in another table.

For more challenging projects there are all kinds of data mapping software programs you can use. These data mapping tools will analyze field by field how to move data from one place to another then they automatically clean, match, inspect, and validate the data. They also create consistent naming conventions, ensuring compatibility when the data is transferred from one source to another. When selecting a software program to map your data, you want to be sure that it supports the file types you're working with, such as Excel, SQL, Tableau, and others. Later on, you'll learn more about selecting the right tool for a particular task. For now, let's practice mapping data manually.

First, we need to determine the content of each section to make sure the data ends up in the right place. For example, the data on when memberships expire would be consolidated into a single column. This step makes sure that each piece of information ends up in the most appropriate place in the merged data source. Now, you might remember that some of the data was inconsistent between the two organizations, like the fact that one uses a separate column for suite, apartment, or unit number but the other doesn't. This brings us to the next step: transforming the data into a consistent format. This is a great time to use CONCATENATE. As you learned before, CONCATENATE is a function that joins together two or more text strings, which is what we did earlier with our cosmetics company example. So we'll insert a new column.

Pro Tip: Use CONCATENATE to help you format the data as it is merged. Coming up, if you enter =CONCATENATE(D2, E2) as demonstrated by the instructor, the result will appear like this:

25 Dyas RdSte. 101

You could manually clean the data later to add a space between "Rd" and "Ste.", but why not let CONCATENATE do the work for you? Because CONCATENATE merges strings, you can enter an additional string to insert a space. Enter =CONCATENATE(D2, " ", E2) instead and you will have an address formatted like this:

25 Dyas Rd Ste. 101

Much better!

Then we'll type equals CONCATENATE with the two text strings we want to make one, and drag that through the entire column. Now we have consistency in the new merged association's list of member addresses.

Now that everything's compatible, it's time to transfer the data to its destination. There's a lot of different ways to move data from one place to another, including querying, import wizards, and even simple drag and drop. Here's our merged spreadsheet.

It looks good, but we still want to make sure everything was transferred properly. We'll go into the testing phase of data mapping. For this, you inspect a sample piece of data to confirm that it's clean and properly formatted. It's also a smart practice to do spot checks on things such as the number of nulls.

For the test, you can use a lot of the data-cleaning tools we discussed previously, such as data validation, conditional formatting, COUNTIF, sorting, and filtering. Finally, once you've determined that the data is clean and compatible, you can start using it for analysis.
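Those spot checks translate directly to a few lines of code; a sketch, assuming the merged list was exported to a hypothetical merged_members.csv:

```python
import pandas as pd

merged = pd.read_csv("merged_members.csv")  # assumed file name

print(merged.isna().sum())        # number of nulls per column
print(merged.duplicated().sum())  # any leftover duplicate rows
print(merged.dtypes)              # confirm each column's expected type
print(merged.sample(5))           # inspect a sample piece of the data
```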

Data mapping is so important because even one mistake when merging data can ripple throughout an organization, causing the same error to appear again and again. This leads to poor results. On the other hand, data mapping can save the day by giving you a clear road map you can follow to make sure your data arrives safely at its destination. That's why you're learning how to do it.

Hands-On Activity: Clean data with spreadsheet functions

spreadsheet (sf_boba_tea_shop_location_info)

Learning Log: Develop your approach to cleaning data

template: Develop your approach to data cleaning

Practice Quiz: cleaning data in spreadsheets

Module 2 challenge


Course 4 Module 2 Glossary