4.3.1. Using SQL to clean data - quanganh2001/Google-Data-Analytics-Professional-Certificate-Coursera GitHub Wiki

Using SQL as a junior data analyst

In this reading, you will learn more about how to decide when to use SQL, or Structured Query Language. As a data analyst, you will be tasked with handling a lot of data, and SQL is one of the tools that can help make your work a lot easier. SQL is the primary way data analysts extract data from databases. As a data analyst, you will work with databases all the time, which is why SQL is such a key skill. Let’s follow along as a junior data analyst uses SQL to solve a business task.

The business task and context

The junior data analyst in this example works for a social media company. A new business model was implemented on February 15, 2020, and the company wants to understand how its user growth compares to the previous year. Specifically, the data analyst was asked to find out how many users have joined since February 15, 2020.

Spreadsheet functions and formulas or SQL queries?

Before they can address this question, this data analyst needs to choose what tool to use. First, they have to think about where the data lives. If it is stored in a database, then SQL is the best tool for the job. But if it is stored in a spreadsheet, then they will have to perform their analysis in that spreadsheet. In that scenario, they could create a pivot table of the data and then apply specific formulas and filters to their data until they were given the number of users that joined after February 15th. It isn’t a really complicated process, but it would involve a lot of steps.

In this case, the data is stored in a database, so they will have to work with SQL. And this data analyst knows they could get the same results with a single SQL query:

SELECT
    COUNT(DISTINCT user_id) AS count_of_unique_users
FROM
    table
WHERE
    join_date >= '2020-02-15'
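
To see this query in action without a production database, you can reproduce it locally with Python's built-in sqlite3 module. The table name `users`, its columns, and the sample rows below are all made up for illustration (the query above uses the placeholder name `table`):

```python
import sqlite3

# Build a tiny in-memory database with a hypothetical "users" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, join_date TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [
        (1, "2020-01-30"),  # joined before the new business model launched
        (2, "2020-02-15"),  # joined on the cutoff date, so it counts
        (3, "2020-03-01"),
        (3, "2020-03-01"),  # duplicate row; DISTINCT counts user 3 once
    ],
)

# The same query as above, with the placeholder `table` replaced by `users`.
row = conn.execute(
    """
    SELECT COUNT(DISTINCT user_id) AS count_of_unique_users
    FROM users
    WHERE join_date >= '2020-02-15'
    """
).fetchone()
print(row[0])  # 2: only users 2 and 3 joined on or after 2020-02-15
```

Note that `COUNT(DISTINCT user_id)` deduplicates, so a user who appears in multiple rows is only counted once.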

Spreadsheets and SQL both have their advantages and disadvantages:

| Features of Spreadsheets | Features of SQL Databases |
| --- | --- |
| Smaller datasets | Larger datasets |
| Enter data manually | Access tables across a database |
| Create graphs and visualizations in the same program | Prepare data for further analysis in another program |
| Built-in spell check and other useful functions | Fast and powerful functionality |
| Best when working solo on a project | Great for collaborative work and tracking queries run by all users |

When it comes down to it, where the data lives will decide which tool you use. If you are working with data that is already in a spreadsheet, that is most likely where you will perform your analysis. And if you are working with data stored in a database, SQL will be the best tool for you to use for your analysis. You will learn more about SQL coming up, so that you will be ready to tackle any business problem with the best tool possible.

SQL dialects and their uses

In this reading, you will learn more about SQL dialects and some of their different uses. As a quick refresher, Structured Query Language, or SQL, is a language used to talk to databases. Learning SQL can be a lot like learning a new language — including the fact that languages usually have different dialects within them. Some database products have their own variant of SQL, and these different varieties of SQL dialects are what help you communicate with each database product.

These dialects will be different from company to company and might change over time if the company moves to another database system. So, a lot of analysts start with Standard SQL and then adjust the dialect they use based on what database they are working with. Standard SQL works with a majority of databases and requires a small number of syntax changes to adapt to other dialects.

As a junior data analyst, it is important to know that there are slight differences between dialects. But by mastering Standard SQL, which is the dialect you will be working with in this program, you will be prepared to use SQL in any database.

More information

You may not need to know every SQL dialect, but it is useful to know that these different dialects exist. If you are interested in learning more about SQL dialects and when they are used, you can check out these resources for more information:

Note that there is an error in this blog article. The comparison table incorrectly states that SQLite uses subqueries instead of window functions. Refer to the SQLite Window Functions documentation for clarification.

Hands-On Activity: Processing time with SQL

Question 1

Activity overview


In previous activities, you learned about and practiced SQL. In this activity, you’ll work with SQL queries of different sizes.

By the time you complete this activity, you’ll be familiar with the different sizes used to measure data storage. This will help you understand how data size affects the amount of time queries take to run and how valuable tools like SQL can be to data analysts.

Understand how data is measured


Data is measured by the number of bits it takes to represent it. All information in a computer can be represented as a binary number consisting solely of 0’s and 1’s. Each 0 or 1 in a number is a bit. A bit is the smallest unit of storage in computers. Since computers work in binary (Base 2), this means that all the important numbers that differentiate between different data sizes will be powers of 2.

A byte is a collection of 8 bits. Take a moment to examine the table below to get a feel for the difference between data measurements and their relative sizes to one another.

| Unit | Equivalent to | Abbreviation | Real-World Example |
| --- | --- | --- | --- |
| Byte | 8 bits | B | 1 character in a string |
| Kilobyte | 1024 bytes | KB | A page of text (~4 kilobytes) |
| Megabyte | 1024 kilobytes | MB | 1 song in MP3 format (~2-3 megabytes) |
| Gigabyte | 1024 megabytes | GB | ~300 songs in MP3 format |
| Terabyte | 1024 gigabytes | TB | ~500 hours of HD video |
| Petabyte | 1024 terabytes | PB | 10 billion Facebook photos |
| Exabyte | 1024 petabytes | EB | ~500 million hours of HD video |
| Zettabyte | 1024 exabytes | ZB | All the data on the internet in 2019 (~4.5 ZB) |
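
Because each unit in the table is 1024 (2^10) times the previous one, converting a raw byte count into a readable size is just repeated division. A minimal sketch (the function name `human_size` is made up for this example):

```python
# Each unit is 1024 (2**10) times the previous one, so dividing by 1024
# repeatedly walks up the table above, from bytes toward zettabytes.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def human_size(num_bytes: float) -> str:
    """Express a byte count in the largest unit that keeps the value >= 1."""
    for unit in UNITS:
        if num_bytes < 1024 or unit == UNITS[-1]:
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024

print(human_size(8))            # 8.0 B
print(human_size(3 * 1024**2))  # 3.0 MB (about one MP3 song)
print(human_size(1024**4))      # 1.0 TB
```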

The amount of data in the world is exploding and growing at an incredible pace every year. This growth is largely the result of the over 4.6 billion people around the world connected to the Internet. Now that smartphones and other Internet-connected devices have become common, they generate a staggering amount of new data. Many experts believe that the size of all the data on the Internet will swell to 175 ZB by the end of 2025!

The size of the dataset you’re working with usually determines which tool, spreadsheets or SQL, is best suited for the task. Spreadsheets often start to have performance issues as dataset sizes increase beyond a few megabytes. SQL databases are much better at working with larger datasets that have billions of rows with sizes measured in gigabytes. The dataset’s size still matters here: larger datasets will take longer for queries to complete, depending on the query’s content and the number of rows SQL has to process to complete the query.

Query a large dataset


You’ll now discover for yourself how these runtimes change with dataset size by running some queries on a huge dataset—Wikipedia!

  1. Log in to BigQuery Sandbox. If you have a free trial version of BigQuery, you can use that instead. On the BigQuery page, click the Go to BigQuery button.
  • Note: BigQuery Sandbox frequently updates its user interface. The latest changes may not be reflected in the screenshots presented in this activity, but the principles remain the same. Adapting to changes in software updates is an essential skill for data analysts, and it’s helpful for you to practice troubleshooting. You can also reach out to your community of learners on the discussion forum for help.
  2. If you have never created a BigQuery project before, click CREATE PROJECT on the right side of the screen. If you have created a project before, you can use an existing one or create a new one by clicking the project dropdown in the blue header bar and selecting NEW PROJECT.
  3. Name your project something that will help you identify it later. You can give it a unique project ID or use an auto-generated one. Don’t worry about selecting an organization if you don’t know what to put.
  4. Now, you’ll see the Editor interface. In the middle of the screen is a window where you can type code, and to the left is the Explorer menu where you can search for datasets.
  5. Copy and paste the following query into the editor and run it. The formatting is just cosmetic, so don’t worry if it changes when copied over. The query should take 10-15 seconds to run:
SELECT
  language,
  title,
  SUM(views) AS views
FROM
  `bigquery-samples.wikipedia_benchmark.Wiki10B`
WHERE
  title LIKE '%Google%'
GROUP BY
  language,
  title
ORDER BY
  views DESC;

Note: This query sorts and filters a dataset. You don't need to understand each detail yet. Coming up, you will learn what each part of this query means and how to use its functions in your own work.

After the query finishes, the results will appear below the editor.

This query returns a table that displays the total number of times each Wikipedia page with “Google” in the title has been viewed in each language. Note the information that BigQuery provides on the query you just ran. As you can infer from the dataset’s title in the query, this dataset is a sample consisting of 10 billion rows from the Wikipedia public dataset.

You’ll find that the query processes over 415 gigabytes of data when run—pretty impressive for 15 seconds! Note that if you run the query again, the runtime will be almost instant (as long as you haven’t changed the default caching settings). This is because BigQuery caches the query results to avoid extra work if the query needs to be rerun.
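
If you want to experiment with the same filter-group-sort pattern outside BigQuery, you can reproduce it locally with Python's sqlite3 module. The table `pageviews` and its rows below are invented stand-ins for the Wikipedia sample, not real data:

```python
import sqlite3

# A tiny stand-in for the Wikipedia page-views table; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageviews (language TEXT, title TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO pageviews VALUES (?, ?, ?)",
    [
        ("en", "Google", 900),
        ("en", "Google", 100),       # rows for the same page get summed
        ("en", "Google_Maps", 400),
        ("de", "Google", 300),
        ("en", "Wikipedia", 800),    # no 'Google' in the title; filtered out
    ],
)

# Same shape as the BigQuery query: filter with LIKE, aggregate with
# SUM ... GROUP BY, then sort the grouped results with ORDER BY.
rows = conn.execute(
    """
    SELECT language, title, SUM(views) AS views
    FROM pageviews
    WHERE title LIKE '%Google%'
    GROUP BY language, title
    ORDER BY views DESC
    """
).fetchall()
for row in rows:
    print(row)
# ('en', 'Google', 1000)
# ('en', 'Google_Maps', 400)
# ('de', 'Google', 300)
```

The query text is standard SQL, so the same statement runs in both SQLite and BigQuery; only the dataset size differs.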

Confirmation and reflection


In your last query, you processed 415.8 GB of data. How many rows were returned by the query?

A. 305,710

B. 214,710

C. 225,038

D. 198,768

The correct answer is B. 214,710. Explain: The last query you ran returned 214,710 rows of data. At the bottom of the data preview, you can see how many rows were returned. Going forward, you can apply this knowledge of data size measurements to better understand how much data you will work with and which tool is best suited to each data analysis project.

Question 2

In this activity, you compared the amount of time it takes to process different sizes of queries in SQL. In the text box below, write 2-3 sentences (40-60 words) in response to each of the following questions:

  • How did working with SQL help you query a larger dataset?
  • How long do you think it would take a team to query a dataset like this manually?
  • How does the ability to query large datasets in reasonable amounts of time affect data analysts?

Explain: Congratulations on completing this hands-on activity! A good response would include how querying a dataset with billions of items isn’t feasible without tools such as relational databases and SQL.

Performing large queries by hand would take years and years of manual work. The ability to query large datasets is an extremely helpful tool for data analysts. You can gain insights from massive amounts of data to discover trends and opportunities that wouldn’t be possible to find without tools like SQL.

Test your knowledge on SQL

Question 1

Which of the following are benefits of using SQL? Select all that apply.

  • SQL can be adapted and used with multiple database programs.
  • SQL offers powerful tools for cleaning data.
  • SQL can be used to program microprocessors on database servers.
  • SQL can handle huge amounts of data.

Explain: SQL can handle huge amounts of data, can be adapted and used with multiple database programs, and offers powerful tools for cleaning data.

Question 2

Which of the following tasks can data analysts do using both spreadsheets and SQL? Select all that apply.

  • Process huge amounts of data efficiently
  • Use formulas
  • Join data
  • Perform arithmetic

Explain: Analysts can use SQL and spreadsheets to perform arithmetic, use formulas, and join data.

Question 3

SQL is a language used to communicate with databases. Like most languages, SQL has dialects. What are the advantages of learning and using standard SQL? Select all that apply.

  • Standard SQL requires a small number of syntax changes to adapt to other dialects.
  • Standard SQL works with a majority of databases.
  • Standard SQL is much easier to learn than other dialects.
  • Standard SQL is automatically translated by databases to other dialects.

Explain: Standard SQL works with a majority of databases and requires only a small number of syntax changes to adapt to other dialects.