Course 4-3 - Forestreee/Data-Analytics GitHub Wiki

Google Data Analytics Professional

Process Data from Dirty to Clean

Week 3 - Cleaning data with SQL

Knowing a variety of ways to clean data can make an analyst’s job much easier. In this part of the course, you’ll check out how to clean your data using SQL. You’ll explore queries and functions that you can use in SQL to clean and transform your data to get it ready for analysis.

Learning Objectives

  • Describe how SQL can be used to clean large datasets
  • Compare spreadsheet data-cleaning functions to those associated with SQL in databases
  • Develop basic SQL queries for use with databases
  • Apply basic SQL functions for use in cleaning string variables in a database
  • Apply basic SQL functions for transforming data variables

Using SQL to clean data

Welcome back and great job on that last weekly challenge. Now that we know the difference between dirty and clean data and some general data-cleaning techniques, let's focus on data cleaning using SQL.

Coming up, we'll learn about the different data-cleaning functions in spreadsheets and SQL and how SQL can be used to clean large datasets. I'll also show you how to develop some basic search queries for databases and how to apply basic SQL functions for transforming data and cleaning strings. Cleaning your data is the last step before you can move on to the actual analysis, and SQL has a lot of great tools that can help you do that.

But before we start cleaning databases, we'll take a closer look at SQL and when to use it. I'll see you there.

Sally: For the love of SQL

Advertising agencies get money from their clients to advertise their brand. These agencies use our products, use certain Google advertising platforms, and I help them with how to best use those platforms and the different strategies they can use to be best in class. A lot of the folks at the advertising agencies have reports that they have to send out to their clients. These reports take a lot of time to create and visualize, so what I do is help the practitioners and the analytics teams use a particular product that enables them to create those reports much faster and much easier.

If you're going to start off as a data analyst, it opens tons of doors, because everybody is tracking data, is using data, needs to use data, regardless of industry. Anywhere from health care, to advertising, to e-commerce, to entertainment, anything and everything, everybody uses data, so everybody needs you as a data analyst.

SQL makes our lives easier when we're analyzing lots of different data. It's only somewhat recently that the SQL programs we use now can give us instant results when analyzing millions or billions of rows of data. Years ago, maybe about five years ago or so, even though we could still analyze those millions of rows, we would end up having to wait fifteen or thirty minutes for the queries to run. But now it's instantaneous, and that's really exciting; we can do so much more with that power.

SQL has helped a lot in my career because it's one of those fundamental things you have to know as a data analyst. Back in the day, not everyone knew SQL, so knowing SQL was definitely a competitive advantage. Nowadays, I would say most people know it. It's a core skill and highly sought after by everybody. So knowing SQL and becoming a data analyst makes you quite popular with recruiters, and I think that's really fun. I taught myself SQL, so my knowledge of SQL is something I hold near and dear to my heart, since it's something I've almost made for myself, and I feel so much satisfaction from it. That's why I really like SQL.

One of the fun things about SQL, and another reason I really enjoy using it, is that when you type something in that query and hit Control, Shift, Enter, or once you've run the query, you get the results almost instantly, depending on the platform you use. It's fascinating to think conceptually about how much analysis the computer is doing for you based on that little bit of code you wrote; it's just so powerful when you consider what's happening behind the scenes. So I think that's fun to look at.

We live in a world of big data, and it keeps getting bigger. Computing power is also increasing exponentially. The more data we can track, the more we need data analysts. Our career prospects are basically skyrocketing. I'm Sally, I'm a measurement and analytical lead at Google.

Understanding SQL capabilities

Hello, again. So before we go over all the ways data analysts use SQL to clean data, I want to formally introduce you to SQL. We've talked about SQL a lot already. You've seen some databases and some basic functions in SQL, and you've even seen how SQL can be used to process data.

But now let's actually define SQL. SQL is a structured query language that analysts use to work with databases. Data analysts usually use SQL to deal with large datasets because it can handle huge amounts of data. And I mean trillions of rows. That's a lot of rows to wrap your head around. So let me give you an idea about how much data that really is.

Imagine a data set that contains the names of all 8 billion people in the world. It would take the average person 101 years to read all 8 billion names. SQL can process this in seconds. Personally, I think that's pretty cool. Other tools like spreadsheets might take a really long time to process that much data, which is one of the main reasons data analysts choose to use SQL, when dealing with big datasets. Let me give you a short history on SQL. Development on SQL actually began in the early 70s.

In 1970, Edgar F. Codd developed the theory of relational databases. You might remember learning about relational databases a while back: a database that contains a series of tables that can be connected to form relationships. At the time, IBM was using a relational database management system called System R. IBM computer scientists were trying to figure out a way to manipulate and retrieve data from System R. Their first query language was hard to use, so they quickly moved on to the next version, SQL. In 1979, after extensive testing, SQL, now just spelled S-Q-L, was released publicly. By 1986, SQL had become the standard language for relational database communication, and it still is. This is another reason why data analysts choose SQL: it's a well-known standard within the community.

The first time I used SQL to pull data from a real database was for my first job as a data analyst. I didn't have any background knowledge about SQL before that. I only found out about it because it was a requirement for that job. The recruiter for that position gave me a week to learn it. So I went online and researched it and ended up teaching myself SQL. They actually gave me a written test as part of the job application process. I had to write SQL queries and functions on a whiteboard. But I've been using SQL ever since. And I really like it. And just like I learned SQL on my own, I wanted to remind you that you can figure things out yourself too. There are tons of great online resources for learning. So don't let one job requirement stand in your way without doing some research first.

Now that we know a little more about why analysts choose to work with SQL when they're handling a lot of data and a little bit about the history of SQL, we'll move on and learn some practical applications for it. Coming up next, we'll check out some of the tools we learned in spreadsheets and figure out if any of those apply to working in SQL. Spoiler alert, they do. See you soon.

Using SQL as a junior data analyst

The business task and context

Spreadsheet functions and formulas or SQL queries?

Spreadsheets and SQL both have their advantages and disadvantages:

| Features of Spreadsheets | Features of SQL Databases |
| --- | --- |
| Smaller datasets | Larger datasets |
| Enter data manually | Access tables across a database |
| Create graphs and visualizations in the same program | Prepare data for further analysis in another program |
| Built-in spell check and other useful functions | Fast and powerful functionality |
| Best when working solo on a project | Great for collaborative work and tracking queries run by all users |

When it comes down to it, where the data lives will decide which tool you use. If you are working with data that is already in a spreadsheet, that is most likely where you will perform your analysis. And if you are working with data stored in a database, SQL will be the best tool for you to use for your analysis. You will learn more about SQL coming up, so that you will be ready to tackle any business problem with the best tool possible.

Spreadsheets versus SQL

Hey there. So far we've learned about both spreadsheets and SQL. While there's lots of differences between spreadsheets and SQL, you'll find some similarities too. Let's check out what spreadsheets and SQL have in common and how they're different.

Spreadsheets and SQL actually have a lot in common. Specifically, there are tools you can use in both spreadsheets and SQL to achieve similar results. We've already learned about some tools for cleaning data in spreadsheets, which means you already know some tools that you can use in SQL. For example, you can still perform arithmetic, use formulas, and join data when you're using SQL, so we'll build on the skills we've learned in spreadsheets and use them to do even more complex work in SQL.

Here's an example of what I mean by more complex work. If we were working with health data for a hospital, we'd need to be able to access and process a lot of data. We might need demographic data, like patients' names, birthdays, and addresses, information about their insurance or past visits, public health data or even user-generated data to add to their patient records. All of this data is being stored in different places, maybe even in different formats, and each location might have millions of rows and hundreds of related tables. This is way too much data to input manually, even for just one hospital. That's where SQL comes in handy. Instead of having to look at each individual data source and record it in our spreadsheet, we can use SQL to pull all this information from different locations in our database.

Now, let's say we want to find something specific in all this data, like how many patients with a certain diagnosis came in today.

In a spreadsheet, we can use the COUNTIF function to find that out; in SQL, we can combine the COUNT function with a WHERE clause to find out how many rows match our search criteria. This will give us similar results but works with a much larger and more complex set of data.
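To make that comparison concrete, here is a minimal sketch of the SQL side; the hospital_data.patient_visits table and its columns are hypothetical, invented just for this illustration:

```sql
-- A sketch: count today's rows matching a diagnosis.
-- The table and column names below are hypothetical.
SELECT
  COUNT(*) AS patient_count
FROM
  hospital_data.patient_visits
WHERE
  diagnosis = 'asthma'
  AND visit_date = CURRENT_DATE()
```

The spreadsheet equivalent would be something like =COUNTIF(B2:B1000, "asthma") over a diagnosis column, limited to however many rows the sheet can hold.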

Next, let's talk about how spreadsheets and SQL are different. First, it's important to understand that spreadsheets and SQL are different things.

Spreadsheets are generated with a program like Excel or Google Sheets. These programs are designed to execute certain built-in functions. SQL, on the other hand, is a language that can be used to interact with database programs, like Oracle, MySQL, or Microsoft SQL Server. The differences between the two are mostly in how they're used. If a data analyst is given data in the form of a spreadsheet, they'll probably do their data cleaning and analysis within that spreadsheet. But if they're working with a large dataset with more than a million rows, or multiple files within a database, it's easier, faster, and more repeatable to use SQL.

SQL can access and use a lot more data because it can pull information from different sources in the database automatically, unlike spreadsheets which only have access to the data you input. This also means that data is stored in multiple places.

A data analyst might use spreadsheets stored locally on their hard drive or their personal cloud when they're working alone, but if they're on a larger team with multiple analysts who need to access and use data stored across a database, SQL might be a more useful tool.

Because of these differences, spreadsheets and SQL are used for different things. As you already know, spreadsheets are good for smaller data sets and when you're working independently.

Plus, spreadsheets have built-in functionalities, like spell check, that can be really handy. SQL is great for working with larger datasets, even trillions of rows of data. Because SQL has been the standard language for communicating with databases for so long, it can be adapted and used across multiple database programs. SQL also keeps a record of the queries that have been run, which makes it easy to track changes across your team if you're working collaboratively.

Next, we'll learn more queries and functions in SQL that will give you some new tools to work with. You might even learn how to use spreadsheet tools in brand new ways. See you next time.

SQL dialects and their uses

More information

You may not need to know every SQL dialect, but it is useful to know that these different dialects exist. If you are interested in learning more about SQL dialects and when they are used, you can check out these resources for more information:

Hands-On Activity: Processing time with SQL

Practice Quiz: SQL

Learn basic SQL queries

Optional: Upload the customer dataset to BigQuery

Using BigQuery -> see Course 3, Module 3: Working with large datasets in SQL, Using BigQuery

Widely used SQL queries

Hey, welcome back! So far we've learned that SQL has some of the same tools as spreadsheets, but on a much larger scale. Here, we'll learn some of the most widely used SQL queries that you can start using for your own data cleaning and eventual analysis. Let's get started.

We've talked about queries as requests you put into the database to ask it to do things for you. Queries are a big part of using SQL. It's a structured query language, after all. Queries can help you do a lot of things, but there are some common ones that data analysts use all the time. So let's start there.

First, I'll show you how to use the SELECT query. I've called this one out before but now I'll add some new things for us to try out. Right now the table viewer is blank because we haven't pulled anything from the database yet.

For this example, the store we're working with is hosting a giveaway for customers in certain cities. We have a database containing customer information that we can use to narrow down which customers are eligible for the giveaway. So let's do that now.

We can use SELECT to specify exactly what data we want to interact with in a table, and if we combine SELECT with FROM, we can pull data from any table in this database, as long as we know what the columns and rows are named.

We might want to pull the data about customer names and cities from one of the tables. To do that, we can write SELECT name, city FROM customer_data.customer_address to get this information from the customer_address table, which lives in the customer_data dataset.
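Written out as a query, that looks like the following (in BigQuery, you would also prefix your own project name, as the note below explains):

```sql
-- Pull customer names and cities from the customer_address table.
SELECT
  name,
  city
FROM
  customer_data.customer_address
```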

If you are following along with your own BigQuery dataset, you may have noticed there is some information missing before the customer_data.customer_address query line in the video. You can find the full Table ID in the top line of the DETAILS tab. Anytime an uploaded local dataset name is typed in the FROM section of a BigQuery query, the dataset and table file path will be preceded by the project name. Here is a template of the file path structure: personal_project_name.customer_data.customer_address, where your personally customized project name is inserted before the dataset name. If you don't insert your personal project name, it may cause an error in your query. Keep this technique in mind for future lessons anytime you are typing a local file path name in the FROM section of a query.

So SELECT and FROM help specify what data we want to extract from the database and use.

We can also insert new data into a database or update existing data. For example, maybe we have a new customer that we want to insert into this table. We can use the INSERT INTO query to put that information in.

To set up a free trial account, follow the step-by-step instructions referenced in the Using BigQuery reading.

So let's start with where we're trying to insert this data: the customer_address table. We also want to specify which columns we're adding this data to by typing their names in the parentheses. That way, SQL can tell the database exactly where we're inputting new information. Then we'll tell it what values we're putting in.
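As a sketch, the query could look like this; the column names and values here are illustrative stand-ins rather than the exact ones shown on screen:

```sql
-- Insert one new customer row; the values are hypothetical examples.
INSERT INTO customer_data.customer_address
  (customer_id, address, city, state, zipcode, country)
VALUES
  (2645, '333 SQL Road', 'Jackson', 'MI', 49202, 'US')
```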

Run the query, and just like that, the new customer has been added to our table.

Now, let's say we just need to change the address of a customer. Well, we can tell the database to update it for us. To do that, we need to tell it we're trying to update the customer_address table. Then we need to let it know what value we're trying to change. But we also need to tell it where we're making that change specifically, so that it doesn't change every address in the table. There, now this one customer's address has been updated.
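A sketch of that UPDATE, with a hypothetical customer ID and new address:

```sql
-- Change one customer's address; the WHERE clause keeps the update
-- from touching every row in the table.
UPDATE customer_data.customer_address
SET address = '123 New Address'
WHERE customer_id = 2645
```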

If we want to create a new table for this database, we can use the CREATE TABLE IF NOT EXISTS statement. Keep in mind, just running a SQL query doesn't actually create a table for the data we extract; it just stores the results in our local memory. To save them, we'll need to download them as a spreadsheet or save the results into a new table.
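A minimal sketch of that pattern, assuming BigQuery syntax and a hypothetical table name:

```sql
-- Save query results into a new table, creating it only if it
-- doesn't exist yet. The customer_summary name is hypothetical.
CREATE TABLE IF NOT EXISTS customer_data.customer_summary AS
SELECT
  customer_id,
  city
FROM
  customer_data.customer_address
```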

As a data analyst, there are a few situations where you might need to do just that. It really depends on what kind of data you're pulling and how often. If you're only using a total number of customers, you probably don't need a CSV file or new table in your database. If you're using the total number of customers per day to do something like track a weekend promotion in a store, you might download that data as a CSV file so you can visualize it in a spreadsheet, but if you're being asked to pull this trend on a regular basis, you can create a table that will automatically refresh with the query you've written. That way, you can directly download the results whenever you need them for a report.

Another good thing to keep in mind: if you're creating lots of tables within a database, you'll want to use the DROP TABLE IF EXISTS statement to clean up after yourself. It's good housekeeping.
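For example, to remove the hypothetical table created above:

```sql
-- Drop a table you made yourself, but only if it exists.
DROP TABLE IF EXISTS customer_data.customer_summary
```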

You probably won't be deleting existing tables very often. After all, that's the company's data, and you don't want to delete important data from their database. But you can make sure you're cleaning up the tables you've personally made so that there aren't old or unused tables with redundant information cluttering the database. There, now you've seen some of the most widely used SQL queries in action. There are definitely more query keywords for you to learn and unique combinations that will help you work within databases, but this is a great place to start.

Coming up, we'll learn even more about queries in SQL and how to use them to clean our data. See you next time.

Evan: Having fun with SQL

Hi, I'm Evan. I'm a learning portfolio manager here at Google. I don't think of myself as a computer science or super engineering type, but I really, really like working with numbers, so I went into accounting. After about two years of accounting, I said, "Wow, I really don't want to do all this by hand," so I took my first information systems class, where they taught me the language SQL, or S-Q-L, and it completely opened up my mind. Between a working knowledge of spreadsheets, where you change one cell and the whole spreadsheet changes because of those amazing calculated fields, and SQL, where I can query billions of rows of data in a matter of seconds, I was completely sold on my love for data. I've dedicated my life and my career to communicating that passion and getting folks excited about the things they can do with their data.

Why is SQL such an amazing first language to pick up? Well, there are so many things you can do with it. I will first caveat and say, I am not a computer science major. I don't know Java and Python deep down, and I was a little bit apprehensive about learning a computer language. It's like a pseudo-programming language, but in reality, you can write your first SQL statement, as you're going to find out here, in just five minutes or less. SQL, honestly, is one of those languages that's easy to learn and even more fun to master. I've been learning SQL for 15 years, and I've been teaching it for 10.

As you're going to see in some of these hands-on labs you'll be working through, it's very easy to return data from within a database or a dataset. Just select whatever columns from whichever database you're pulling from, and immediately you get the data back. Now, the really fun part is teasing it apart and saying: I wonder what happens if I change my query, add more columns, filter this dataset a different way, share it with my colleagues. It's meant to be an interactive querying language, and "query" means "asking a question."

If I can challenge you with one thing, it's that the syntax of SQL, much like the rules of a chess game, is very easy to pick up. The hard part, much like with any programming language, is not the syntax but deciding what question you want to ask of your data. What I would encourage you to do is be super curious about whatever dataset you're given. Spend a lot of time, even before you touch your keyboard, thinking about what insights you can get from your data. And then start having fun. There are many different ways to write the same correct SQL statement, so try one out, share it with your friends, and then start returning that data back for insights. Good luck.

Cleaning string variables using SQL

It's so great to have you back. Now that we know some basic SQL queries and spent some time working in a database, let's apply that knowledge to something else we've been talking about: preparing and cleaning data.

You already know that cleaning and completing your data before you analyze it is an important step. Here, I'll show you some ways SQL can help you do just that, including how to remove duplicates, as well as four functions to help you clean string variables.

Earlier, we covered how to remove duplicates in spreadsheets using the Remove Duplicates tool. In SQL, we can do the same thing by including DISTINCT in our SELECT statement.

For example, let's say the company we work for has a special promotion for customers in Ohio. We want to get the customer IDs of customers who live in Ohio. But some customer information has been entered multiple times. We can get these customer IDs by writing SELECT customer_id FROM customer_data.customer_address.

This query will give us duplicates if they exist in the table. If customer ID 9080 shows up three times in our table, our results will have three of that customer ID. But we don't want that. We want a list of all unique customer IDs. To do that, we add DISTINCT to our SELECT statement by writing, SELECT DISTINCT customer_id FROM customer_data.customer_address. Now, the customer ID 9080 will show up only once in our results.
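As a fenced-out version of that query:

```sql
-- DISTINCT keeps each customer_id from appearing more than once.
SELECT DISTINCT
  customer_id
FROM
  customer_data.customer_address
```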

You might remember we've talked before about text strings as a group of characters within a cell, commonly composed of letters, numbers, or both. These text strings need to be cleaned sometimes. Maybe they've been entered differently in different places across your database, and now they don't match.

In those cases, you'll need to clean them before you can analyze them. So here are some functions you can use in SQL to handle string variables. You might recognize some of these functions from when we talked about spreadsheets. Now it's time to see them work in a new way. Pull up the data set we shared right before this video. And you can follow along step-by-step with me during the rest of this video.

The first function I want to show you is LENGTH, which we've encountered before. If we already know the length our string variables are supposed to be, we can use LENGTH to double-check that they are consistent. In some databases, this function is named LEN, but it does the same thing.

Let's say we're working with the customer_address table from our earlier example. We can make sure that all country codes have the same length by using LENGTH on each of these strings. So to write our SQL query, let's first start with SELECT and FROM. We know our data comes from the customer_address table within the customer_data dataset, so we add customer_data.customer_address after the FROM clause. Then under SELECT, we'll write LENGTH, and then the column we want to check, country. To remind ourselves what this is, we can label this column in our results as letters_in_country. So we add AS letters_in_country after LENGTH(country). The result we get is a list of the number of letters in each country listed for each of our customers. It seems like almost all of them are 2s, which means the country field contains only two letters. But we notice a couple that have 3. That's not good. We want our data to be consistent.
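That query looks like this:

```sql
-- Count the letters in each customer's country value.
SELECT
  LENGTH(country) AS letters_in_country
FROM
  customer_data.customer_address
```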

So let's check out which countries were incorrectly listed in our table. We can do that by putting the LENGTH(country) function that we created into the WHERE clause, because we're telling SQL to filter the data to show only customers whose country contains more than two letters. So now we'll write SELECT country FROM customer_data.customer_address WHERE LENGTH(country) > 2. When we run this query, we get the two countries where the number of letters is greater than the 2 we expect to find. The incorrectly listed countries show up as USA instead of US. If we had created this table, then we could update it so that these entries show up as US instead of USA. But in this case, we didn't create this table, so we shouldn't update it.
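Here is that filter written out:

```sql
-- Show only the country values with more than two letters.
SELECT
  country
FROM
  customer_data.customer_address
WHERE
  LENGTH(country) > 2
```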

We still need to fix this problem so we can pull a list of all the customers in the US, including the two that have USA instead of US. The good news is that we can account for this error in our results by using the substring function in our SQL query.

To write our SQL query, let's start with the basic structure: SELECT, FROM, WHERE. We know our data is coming from the customer_address table in the customer_data dataset, so we type customer_data.customer_address after FROM. Next, we tell SQL what data we want it to give us. We want all the customers in the US by their IDs, so we type customer_id after SELECT. Finally, we want SQL to filter for American customers only, so we use the substring function in the WHERE clause. We're going to use the substring function to pull the first two letters of each country so that all of them are consistent and contain only two letters. To use the substring function, we first need to tell SQL the column where we found this error: country. Then we specify which letter to start with. We want SQL to pull the first two letters, so we start with the first letter and type 1. Then we need to tell SQL how many letters, including the first, to pull. Since we want the first two letters, we type 2. This gives us the first two letters of each country. We want US only, so we set this function equal to 'US'.

When we run this query, we get a list of all customer IDs of customers whose country is the US, including the customers that had USA instead of US. Going through our results, it seems like we have a couple duplicates where the customer ID is shown multiple times.

Remember how we get rid of duplicates? We add DISTINCT before customer_id. So now when we run this query, we have our final list of customer IDs of the customers who live in the US.
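Putting it all together (BigQuery names the substring function SUBSTR; some other dialects spell it SUBSTRING):

```sql
-- Match on the first two letters of country so 'US' and 'USA'
-- are both treated as US, and de-duplicate the IDs.
SELECT DISTINCT
  customer_id
FROM
  customer_data.customer_address
WHERE
  SUBSTR(country, 1, 2) = 'US'
```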

Finally, let's check out the TRIM function, which you've come across before. This is really useful if you find entries with extra spaces and need to eliminate those extra spaces for consistency.

For example, let's check out the state column in our customer_address table. Just like we did for the country column, we want to make sure the state column has a consistent number of letters. So let's use the LENGTH function again to learn whether we have any state with more than two letters, which is what we would expect to find in our data table. We start writing our SQL query by typing the basic SQL structure of SELECT, FROM, WHERE. We're working with the customer_address table in the customer_data dataset, so we type customer_data.customer_address after FROM. Next, we tell SQL what we want it to pull. We want it to give us any state that has more than two letters, so we type state after SELECT. Finally, we want SQL to filter for states that have more than two letters. This condition is written in the WHERE clause, so we type LENGTH(state) > 2, because we want the states that have more than two letters. We want to figure out what the incorrectly listed states look like, if we have any. When we run this query, we get one result: one state that has more than two letters. But hold on, how can this state, which seems to have just two letters, O and H for Ohio, have more than two letters? We know there are more than two characters because we used the LENGTH(state) > 2 statement in the WHERE clause when filtering our results. So the extra character that SQL is counting must be a space. There must be a space after the H.
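The diagnostic query looks like this:

```sql
-- Find state values that are longer than the expected two letters.
SELECT
  state
FROM
  customer_data.customer_address
WHERE
  LENGTH(state) > 2
```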

This is where we would use the TRIM function.

The TRIM function removes leading and trailing spaces. So let's write a SQL query that accounts for this error. Let's say we want a list of all customer IDs of the customers who live in "OH" for Ohio. We start with the basic SQL structure: SELECT, FROM, WHERE. We know the data comes from the customer_address table in the customer_data dataset, so we type customer_data.customer_address after FROM. Next, we tell SQL what data we want. We want SQL to give us the customer IDs of customers who live in Ohio, so we type customer_id after SELECT. Since we know we have some duplicate customer entries, we'll go ahead and type DISTINCT before customer_id to keep any duplicate customer IDs from appearing in our results. Finally, we want SQL to filter for customers who live in Ohio. We're asking SQL to filter the data, so this belongs in the WHERE clause. Here's where we use the TRIM function: we tell SQL the column we want to remove spaces from, which is state in our case, and we want only Ohio customers, so we set it equal to 'OH'. That's it. We have all customer IDs of the customers who live in Ohio, including the customer with the extra space after the H.
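The finished query:

```sql
-- TRIM strips the stray space so 'OH ' matches 'OH'.
SELECT DISTINCT
  customer_id
FROM
  customer_data.customer_address
WHERE
  TRIM(state) = 'OH'
```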

Making sure that your string variables are complete and consistent will save you a lot of time later by avoiding errors or miscalculations. That's why we clean data in the first place. Hopefully functions like LENGTH, SUBSTR, and TRIM will give you the tools you need to start working with string variables in your own datasets.

Next up, we'll check out some other ways you can work with strings and more advanced cleaning functions. Then you'll be ready to start working in SQL on your own. See you soon.

Hands-On Activity: Clean data using SQL

link to data: data

data’s description

Practice Quiz: SQL queries

Transforming data

Optional: Upload the store transactions dataset to BigQuery

Advanced data-cleaning functions, part 1

Hi there and welcome back. So far we've gone over some basic SQL queries and functions that can help you clean your data. We've also checked out some ways you can deal with string variables in SQL to make your job easier. Get ready to learn more functions for dealing with strings in SQL. Trust me, these functions will be really helpful in your work as a data analyst.

Here, we'll check out strings again and learn how to use the cast function to correctly format data. When you import data that doesn't already exist in your SQL tables, the data types from the new dataset might not have been imported correctly. This is where the CAST function comes in handy. Basically, CAST can be used to convert anything from one data type to another.

Let's check out an example. Imagine we're working with Lauren's Furniture Store. The owner has been collecting transaction data for the past year, but she just discovered that they can't actually organize their data because it hadn't been formatted correctly. So we'll help her by converting her data to make it useful again. For example, let's say we want to sort all purchases by purchase_price in descending order. That means we want the most expensive purchase to show up first in our results.

To write the SQL query, we start with the basic SQL structure: SELECT, FROM, WHERE. We know the data is stored in the customer_purchase table in the customer_data dataset, so we write customer_data.customer_purchase after FROM.

Next, we tell SQL what data to give us in the SELECT clause.

We want to see the purchase_price data, so we type purchase_price after SELECT.

Next is the WHERE clause. We are not filtering out any data, since we want all purchase prices shown, so we can take out the WHERE clause.

Finally, to sort the purchase_price in descending order, we type ORDER BY purchase_price DESC at the end of our query. Let's run this query.
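The query at this point:

```sql
-- Sort purchase prices from highest to lowest (or so we expect).
SELECT
  purchase_price
FROM
  customer_data.customer_purchase
ORDER BY
  purchase_price DESC
```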

We see that 89.85 shows up at the top with 799.99 below it, but we know that 799.99 is a bigger number than 89.85. The database doesn't recognize that these are numbers, so it didn't sort them that way.

If we go back to the customer_purchase table and take a look at its schema, we can see what data type the database thinks purchase_price is. It says here the database thinks purchase_price is a string, when in fact it is a float, which is a number that contains a decimal. That is why 89.85 shows up before 799.99. When we sort letters, we start from the first letter before moving on to the second letter. So if we want to sort the words apple and orange in descending order, we start with the first letters a and o. Since o comes after a, orange will show up first, then apple. The database did the same with 89.85 and 799.99. It started with the first letter, which in this case was 8 and 7 respectively. Since 8 is bigger than 7, the database sorted 89.85 first and then 799.99 because the database treated these as text strings. The database doesn't recognize these strings as floats because they haven't been typecast to match that data type yet. Typecasting means converting data from one type to another, which is what we'll do with the CAST function.

We use the CAST function to replace purchase_price with a new purchase_price that the database recognizes as a float instead of a string. We start by replacing purchase_price with CAST. Then we tell SQL the field we want to change, which is the purchase_price field. Next is the data type we want to change purchase_price to, which is the FLOAT data type. BigQuery stores numbers in a 64-bit system, so the FLOAT data type is referenced as FLOAT64 in our query. This might be slightly different in other SQL platforms, but basically FLOAT64 just indicates that we're casting numbers in the 64-bit system as floats. We also need to sort this new field, so we change purchase_price after ORDER BY to CAST(purchase_price AS FLOAT64). This is how we use the CAST function to let SQL recognize the purchase_price column as floats instead of text strings. Now we can sort our purchases by purchase_price. And just like that, Lauren's Furniture Store has data that can actually be used for analysis.
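The corrected query, assuming BigQuery's FLOAT64 type (the column alias is added here just for readability):

```sql
-- Cast the string column to a float so it sorts numerically.
SELECT
  CAST(purchase_price AS FLOAT64) AS purchase_price
FROM
  customer_data.customer_purchase
ORDER BY
  CAST(purchase_price AS FLOAT64) DESC
```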

As a data analyst, you'll be asked to locate and organize data a lot, which is why you want to make sure you convert between data types early on. Businesses like our Furniture Store are interested in timely sales data, and you need to be able to account for that in your analysis. The CAST function can be used to change strings into other data types too, like date and time. As a data analyst, you might find yourself using data from various sources. Part of your job is making sure the data from those sources is recognizable and usable in your database so that you won't run into any issues with your analysis. And now you know how to do that. The CAST function is one great tool you can use when you're cleaning data.

And coming up, we'll cover some other advanced functions that you can add to your toolbox. See you soon.

Advanced data-cleaning functions, part 2

Hey there. Great to see you again. So far, we've seen some SQL functions in action.

Here, we'll go over more uses for CAST, and then learn about CONCAT and COALESCE. Let's get started. Earlier we talked about the CAST function, which let us typecast text strings into floats. I called out that the CAST function can be used to change into other data types too.

Let's check out another example of how you can use CAST in your own data work. We've got the transaction data we were working with from our Lauren's Furniture Store example. But now, we'll check out the purchase date field.

The furniture store owner has asked us to look at purchases that occurred during their sales promotion period in December. Let's write a SQL query that will pull date and purchase_price for all purchases that occurred between December 1st, 2020, and December 31st, 2020.

We start by writing the basic SQL structure: SELECT, FROM, and WHERE.

We know the data comes from the customer_purchase table in the customer_data dataset, so we write customer_data.customer_purchase after FROM.

Next, we tell SQL what data to pull. Since we want date and purchase_price, we add them into the SELECT statement.

Finally, we want SQL to filter for purchases that occurred in December only. We type date BETWEEN '2020-12-01' AND '2020-12-31' in the WHERE clause.
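The December query so far:

```sql
-- Pull December 2020 purchases with their dates and prices.
SELECT
  date,
  purchase_price
FROM
  customer_data.customer_purchase
WHERE
  date BETWEEN '2020-12-01' AND '2020-12-31'
```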

Let's run the query. Four purchases occurred in December, but the date field looks odd. That's because the database recognizes this date field as datetime, which consists of the date and time. Our SQL query still works correctly, even if the date field is datetime instead of date. But we can tell SQL to convert the date field into the date data type so we see just the day and not the time. To do that, we use the CAST() function again.
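A sketch of the same query with the datetime cast down to a date (the date_only alias is added here just for readability):

```sql
-- CAST trims the time portion so only the day is shown.
SELECT
  CAST(date AS date) AS date_only,
  purchase_price
FROM
  customer_data.customer_purchase
WHERE
  date BETWEEN '2020-12-01' AND '2020-12-31'
```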

We'll use the CAST() function to replace the date field in our SELECT statement with a new date field that shows the date and not the time. We can do that by typing CAST() and adding date as the field we want to change. Then we tell SQL the data type we want instead, which is the date data type. There. Now we have cleaner results for purchases that occurred during the December sales period. CAST is a super useful function for cleaning and sorting data, which is why I wanted you to see it in action one more time.

Next up, let's check out the CONCAT function. CONCAT lets you add strings together to create new text strings that can be used as unique keys.

Going back to our customer_purchase table, we see that the furniture store sells different colors of the same product. The owner wants to know if customers prefer certain colors, so the owner can manage store inventory accordingly. The problem is, the product_code is the same, regardless of the product color. We need to find another way to separate products by color, so we can tell if customers prefer one color over the others. So we'll use CONCAT to produce a unique key that'll help us tell the products apart by color and count them more easily.

Let's write our SQL query by starting with the basic structure: SELECT, FROM, and WHERE.

We know our data comes from the customer_purchase table in the customer_data dataset. We type customer_data.customer_purchase after FROM.

Next, we tell SQL what data to pull. We use the CONCAT() function here to get that unique key of product and color. So we type CONCAT(), the first column we want, product_code, and the other column we want, product_color.

Finally, let's say we want to look at couches, so we filter for couches by typing product = 'couch' in the WHERE clause.
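The full query (the new_product_code alias is a hypothetical label for the concatenated key):

```sql
-- Glue product_code and product_color into one unique key,
-- looking only at couches.
SELECT
  CONCAT(product_code, product_color) AS new_product_code
FROM
  customer_data.customer_purchase
WHERE
  product = 'couch'
```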

Now we can count how many times each couch was purchased and figure out if customers preferred one color over the others.

With CONCAT, the furniture store can find out which color couches are the most popular and order more.

I've got one last advanced function to show you, COALESCE. COALESCE can be used to return non-null values in a list. Null values are missing values. If you have a field that's optional in your table, it'll have null in that field for rows that don't have appropriate values to put there.

Let's open the customer_purchase table so I can show you what I mean. In the customer_purchase table, we can see a couple rows where product information is missing. That is why we see nulls there. But for the rows where product name is null, we see that there is product_code data that we can use instead. We'd prefer SQL to show us the product name, like bed or couch, because it's easier for us to read. But if the product name doesn't exist, we can tell SQL to give us the product_code instead. That is where the COALESCE function comes into play.

Let's say we wanted a list of all products that were sold. We want to use the product column to understand what kind of product was sold.

We write our SQL query with the basic SQL structure: SELECT, FROM, and WHERE.

We know our data comes from the customer_purchase table in the customer_data dataset. We type customer_data.customer_purchase after FROM.

Next, we tell SQL the data we want. We want a list of product names, but if a name isn't available, then give us the product code. Here is where we type COALESCE. Then we tell SQL which column to check first, product, and which column to check second if the first column is null, product_code. We'll name this new field product_info.

Finally, we are not filtering out any data, so we can take out the WHERE clause.
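The finished query:

```sql
-- COALESCE returns product when it exists, otherwise product_code.
SELECT
  COALESCE(product, product_code) AS product_info
FROM
  customer_data.customer_purchase
```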

This gives us product information for each purchase. Now we have a list of all products that were sold for the owner to review. COALESCE can save you time when you're making calculations too by skipping any null values and keeping your math correct. Those were just some of the advanced functions you can use to clean your data and get it ready for the next step in the analysis process. You'll discover more as you continue working in SQL. But that's the end of this video and this module. Great work. We've covered a lot of ground.

You learned the different data-cleaning functions in spreadsheets and SQL and the benefits of using SQL to deal with large datasets. We also added some SQL formulas and functions to your toolkit, and most importantly, we got to experience some of the ways that SQL can help you get data ready for your analysis. After this, you'll get to spend some time learning how to verify and report your cleaning results so that your data is squeaky clean and your stakeholders know it.

But before that, you've got another weekly challenge to tackle. You've got this. Some of these concepts might seem challenging at first, but they'll become second nature to you as you progress in your career. It just takes time and practice. Speaking of practice, feel free to go back to any of these videos and rewatch or even try some of these commands on your own. Good luck. I'll see you again when you're ready.

Debugging SQL code

You’ve been learning how to use SQL to query databases and clean your data. When using a programming language, it’s common to make coding errors. For example, if you use improper syntax, the database won’t know what you’re trying to communicate. Everyone makes coding errors, from experienced programmers to new learners. So it’s helpful to know some strategies for resolving errors.

Think about some of the errors you have encountered when working with SQL. Then, explain why you chose these errors. Next, write down your top three tips for resolving them. Feel free to find more information online before you start writing. Stack Overflow is a great place to start, or you can search online to find other helpful resources.

Please submit two or more paragraphs (150-200 words) in your written response. Then, go to the discussion forum to read what other learners have written, and choose at least two posts to comment on and discuss.

Participation is optional

Data cleaning with SQL

Self-Reflection: Challenges with SQL

Module 3 challenge


Course 4 Module 3 Glossary