4.3.1.Using SQL to clean data - sj50179/Google-Data-Analytics-Professional-Certificate GitHub Wiki

Spreadsheets and SQL both have their advantages and disadvantages:

Features of Spreadsheets Features of SQL Databases
Smaller data sets Larger datasets
Enter data manually Access tables across a database
Create graphs and visualizations in the same program Prepare data for further analysis in another software
Built-in spell check and other useful functions Fast and powerful functionality
Best when working solo on a project Great for collaborative work and tracking queries run by all users

When it comes down to it, where the data lives will decide which tool you use. If you are working with data that is already in a spreadsheet, that is most likely where you will perform your analysis. And if you are working with data stored in a database, SQL will be the best tool for you to use for your analysis. You will learn more about SQL coming up, so that you will be ready to tackle any business problem with the best tool possible.

Spreadsheets versus SQL

Spreadsheets SQL
Generated with a program A language used to interact with database programs
Aceess to the data you input Can pull information from different sources in the database
Stored locally Stored across a database
Small datasets Larger datasets
Working independently Tracks changes across team
Built-in functionalities Useful across multiple programs

Question

A team of analysts is working on a data analytics project. How could data in a SQL database be more useful to the team than data in spreadsheets? Select all that apply.

  • They can use SQL to make working with smaller datasets easier.
  • They can use SQL to pull information from the database at the same time.
  • They can track changes to SQL queries across the team.
  • They can use SQL to interact with the database program.

Correct. Data stored in a SQL database is useful to a project with multiple team members because they can access the data at the same time, use SQL to interact with the database program, and track changes to SQL queries across the team.

SQL dialects and their uses

You may not need to know every SQL dialect, but it is useful to know that these different dialects exist. If you are interested in learning more about SQL dialects and when they are used, you can check out these resources for more information:

Hands-On Activity: Processing time with SQL

Activity overview

In previous activities, you learned about and practiced SQL. In this activity, you’ll work with SQL queries of different sizes.

By the time you complete this activity, you’ll be familiar with the different sizes used to measure data storage. This will help you understand how data size affects the amount of time queries take to run and how valuable tools like SQL can be to data analysts.

Understand how data is measured

Data is measured by the number of bits it takes to represent it. All information in a computer can be represented as a binary number consisting solely of 0’s and 1’s. Each 0 or 1 in a number is a bit. A bit is the smallest unit of storage in computers. Since computers work in binary (Base 2), this means that all the important numbers that differentiate between different data sizes will be powers of 2.

A byte is a collection of 8 bits. Take a moment to examine the table below to get a feel for the difference between data measurements and their relative sizes to one another.

Unit Equivalent to Abbreviation Real-World Example
Byte 8 bits B 1 character in a string
Kilobyte 1024 bytes KB A page of text (~4 kilobytes)
Megabyte 1024 Kilobytes MB 1 song in MP3 format (~2-3 megabytes)
Gigabyte 1024 Megabytes GB ~300 songs in MP3 format
Terabyte 1024 Gigabytes TB ~500 hours of HD video
Petabyte 1024 Terabytes PB 10 billion Facebook photos
Exabyte 1024 Petabytes EB ~500 million hours of HD video
Zettabyte 1024 Exabytes ZB All the data on the internet in 2019 (~4.5 ZB)

The amount of data in the world is exploding and growing at an incredible pace every year. This growth is largely the result of the over 4.6 billion people around the world connected to the Internet. Now that smartphones and other Internet-connected devices have become common, they generate a staggering amount of new data. Many experts believe that the size of all the data on the Internet will swell to 175 ZB by the end of 2025!

The size of the dataset you’re working with usually determines which tool, spreadsheets or SQL, is best suited for the task. Spreadsheets often start to have performance issues as dataset sizes increase beyond a few megabytes. SQL databases are much better at working with larger datasets that have billions of rows with sizes measured in gigabytes. The dataset’s size still matters here--larger datasets will take longer for queries to complete, depending on the query’s content and the number of rows SQL has to process to complete the query.

Query a large dataset

You’ll now discover for yourself how these runtimes change with dataset size by running some queries on a huge dataset—Wikipedia!

  1. Log in to BigQuery Sandbox. If you have a free trial version of BigQuery, you can use that instead. On the BigQuery page, click the Go to BigQuery button.
  • Note: BigQuery Sandbox frequently updates its user interface. The latest changes may not be reflected in the screenshots presented in this activity, but the principles remain the same. Adapting to changes in software updates is an essential skill for data analysts, and it’s helpful for you to practice troubleshooting. You can also reach out to your community of learners on the discussion forum for help.
  1. If you have never created a BigQuery project before, click CREATE PROJECT on the right side of the screen. If you have created a project before, you can use an existing one or create a new one by clicking the project dropdown in the blue header bar and selecting NEW PROJECT.

  2. Name your project something that will help you identify it later. You can give it a unique project ID or use an auto-generated one. Don’t worry about selecting an organization if you don’t know what to put.

  3. Now, you’ll see the Editor interface. In the middle of the screen is a window where you can type code, and to the left is the Explorer menu where you can search for datasets.

  4. Copy and paste the following query into the editor and run it. The formatting is just cosmetic, so don’t worry if it changes when copied over. The query should take 10-15 seconds to run:

SELECT
  language,
  title,
  SUM(views) AS views
FROM
  `bigquery-samples.wikipedia_benchmark.Wiki10B`
WHERE
  title LIKE '%Google%'
GROUP BY
  language,
  title
ORDER BY
  views DESC;

After the query finishes, your screen should appear like this:

This query returns a table that displays the total number of times each Wikipedia page with “Google” in the title has been viewed in each language. Note the information that BigQuery provides on the query you just ran. As you can infer from the dataset’s title in the query, this dataset is a sample consisting of 10 billion rows from the Wikipedia public dataset.

You’ll find that the query processes over 415 gigabytes of data when run—pretty impressive for 15 seconds. Note that if you run the query again, the runtime will be almost instant (as long as you haven’t changed the default caching settings). This is because BigQuery caches the query results to avoid extra work if the query needs to be rerun.

Test your knowledge on SQL

TOTAL POINTS 3

Question 1

Which of the following are benefits of using SQL? Select all that apply.

  • SQL can be used to program microprocessors on database servers.
  • QL offers powerful tools for cleaning data.
  • SQL can handle huge amounts of data.
  • SQL can be adapted and used with multiple database programs.

Correct. SQL can handle huge amounts of data, can be adapted and used with multiple database programs, and offers powerful tools for cleaning data.

Question 2

Which of the following tasks can data analysts do using both spreadsheets and SQL? Select all that apply.

  • Process huge amounts of data efficiently
  • Join data
  • Use formulas
  • Perform arithmetic

Correct. Analysts can use SQL and spreadsheets to perform arithmetic, use formulas, and join data.

Question 3

SQL is a language used to communicate with databases. Like most languages, SQL has dialects. What are the advantages of learning and using standard SQL? Select all that apply.

  • Standard SQL is automatically translated by databases to other dialects.
  • Standard SQL is much easier to learn than other dialects.
  • Standard SQL works with a majority of databases.
  • Standard SQL requires a small number of syntax changes to adapt to other dialects.

Correct. Standard SQL works with a majority of databases and requires a small number of syntax changes to adapt to other dialects.