5.4.5.Using SQL with temporary tables - quanganh2001/Google-Data-Analytics-Professional-Certificate-Coursera GitHub Wiki

Hands-On Activity: Create temporary tables

Question 1

Activity overview

G-KePoi0R2Sinj6ItBdkMg_2d69ab4b929f40f2b472a78fdd5ed880_line-y

In previous lessons, you learned about temporary tables. In this activity, you will practice making one and using it to run a query.

By the time you complete this activity, you will be able to use temporary tables to work with data without changing the original data. This will help you complete more complicated analytical tasks in your career as a data analyst.

What are temp tables?

G-KePoi0R2Sinj6ItBdkMg_2d69ab4b929f40f2b472a78fdd5ed880_line-y

As data calculations become more complicated, there are many components to keep track of. This is similar to keeping track of tasks in daily life. Some people use sticky notes while others use checklists. In data science, a temporary table is just like a sticky note.

Temporary tables, or temp tables, store subsets of data from standard data tables for a certain period of time. When you end your SQL database session, they are automatically deleted. Temp tables allow you to run calculations in temporary data tables without needing to make modifications to the primary tables in your database.

Now, you will practice creating a temp table.

Importing data

G-KePoi0R2Sinj6ItBdkMg_2d69ab4b929f40f2b472a78fdd5ed880_line-y

To begin, import your data. You will use a dataset on bikesharing in Austin, Texas. Specifically, you will work with a table that gives details about each public bike ride’s duration, starting station, and ending station.

To load your data, follow these steps:

  1. Log in to and open the BigQuery Console. If you have a free trial version of BigQuery, you can use that instead. On the BigQuery page, click the Go to BigQuery button.
  • Note: BigQuery frequently updates its user interface. The latest changes may not be reflected in the screenshots presented in this activity, but the principles remain the same. Adapting to changes in software updates is an essential skill for data analysts, and it’s helpful for you to practice troubleshooting. You can also reach out to your community of learners on the discussion forum for help.
  1. If you have never created a BigQuery project before, click CREATE PROJECT on the right side of the screen. If you have created a project before, you can use an existing one or create a new one by clicking the project dropdown in the blue header bar and selecting NEW PROJECT.
  2. Name your project something that will help you identify it later. You can give it a unique project ID or use an auto-generated one. Don’t worry about selecting an organization if you don’t know what to put.
  3. Now, you’ll see the Editor interface. In the middle of the screen is a window where you can type code, and to the left is the Explorer menu where you can search for datasets.

8yLiSG39QRiRBBBOsR3pkQ_43093cc07ea649628f5f95202d0cccf1_1 ExplorerTab

  • In order to find the austin_bikeshare dataset, you will want to make sure you have the bigquery-public-data pinned in your explorer menu. Follow these steps to pin the dataset:
  1. Navigate to the Explorer menu in BigQuery.
  2. Type the word public in the search box and enter.
  3. Click "Broaden search to all projects"
  4. Find the bigquery-public-data and highlight the star to pin it.

To examine the dataset, follow these steps:

  1. Make sure that bigquery-public-data star is highlighted and pinned to the Explorer menu of your SQL Workspace.
  2. Open the bigquery-public-data list by clicking the drop-down arrow in the Explorer menu and scroll until you find austin_bikeshare.

i2b8JEr9SoG2SYVKeuxGug_58586f36ddf14c118f1b974f4ddb3df1_1 bikeshare_list

  1. Open the drop-down arrow and click bikeshare_trips to open and examine the dataset.
  2. Click on the Preview tab in the viewer on the right, then examine the dataset.

FE2aNBmVQwa-8Q_TUhThIw_c49af7f64deb41808376b0ceab7853f1_2 citibike_trips

You’ll need to use the WITH clause to create a temporary table so you can find at which station the longest (duration) bike ride started.

Create a temporary table

G-KePoi0R2Sinj6ItBdkMg_2d69ab4b929f40f2b472a78fdd5ed880_line-y

Consider the following scenario: A bikeshare company has reached a recent milestone, and their marketing team wants to write a blog post that “congratulates” their most-used bike on being so popular. They want to include the name of the station that the bike is most likely to be found.

They task you with figuring out the station from which the bike begins a trip most frequently.

In order to do this, you will need to create a temp table to find the ID number of the bike that has taken the longest total trips (in minutes). You will take a sum of the minutes of each trip for each bike, then sort by descending order to find the bike that has spent the most minutes being used.

To do that, complete the following steps:

  1. Return to your Editor tab or click Compose new query.
  2. Begin your query with WITH to set up a temp table. Then on a new indented line, create the name of your temp table. Make sure your table name is in the proper snake case (with underscores between each word). Name it longest_used_bike. Then add a space.
  3. Type AS and an open parenthesis (, then press Enter (Windows) or Return (Mac) to create a new indented line.
  4. Type SELECT, then press Enter or Return and Tab to create a new indented line.
  5. Type bikeid and a comma. Then press Enter or Return to create a new line and type SUM(duration_minutes) AS trip_duration. This creates a column in the temp table that contains the sum of the total minutes a bike has been used. Press Enter or Return again, then press Backspace to align the cursor with SELECT.
  6. Type FROM, then press Enter or Return and Tab to create a new indented line.
  7. Specify the dataset you’ll be using. To do this, type bigquery-public-data.austin_bikeshare.bikeshare_trips. Press Enter or Return, then press Backspace to align the cursor with SELECT.

Your text should appear like this:

BDEjfGxPTxixI3xsT88Y_w_670ad4fb7cee482db17cf74be57b98f1_midpt1tt

  1. Type GROUP BY, then press Enter or Return and Tab to create a new indented line.
  2. Type bikeid to group the data by the column bikeid. Press Enter or Return again, then press Backspace to align the cursor with SELECT.
  3. Type ORDER BY, then press Enter or Return and Tab to create a new indented line.
  4. Type trip_duration DESC to sort the data in descending order by the column trip_duration. Press Enter or Return again, then press Backspace to align the cursor with SELECT.
  5. Type LIMIT 1.
  6. Make sure there is a closed parenthesis ) on the next line. If it is not there, add it.

This sets up your temporary table. This section identifies the specific bike (bikeid) with the longest trip duration.

7URWkye-TW2EVpMnvq1tUg_0bd38c1d7958478dbe08dbe1979b09f1_Screenshot-2021-05-08-7 46 53-PM

If you run it now, it’ll return an error because you haven’t written any queries yet. Now, it’s time to write a query that identifies the start station that this bike came out of.

Write your query

G-KePoi0R2Sinj6ItBdkMg_2d69ab4b929f40f2b472a78fdd5ed880_line-y

Now that you have found the ID of the bike that has been used the longest, you will write a query to find the station from which this bike leaves most frequently. To do this, you will join your temp table (containing just the bike’s ID) with the original table and return the station ID with the highest number of trips started.

To find this station ID, follow these steps:

  1. On a new line, type two # signs to begin a comment.
  2. Describe the purpose of your query. This will help you remember the purpose of your query as you’re writing it. It can also help you share your work with others. In this case, type find station at which the longest-used bike leaves most often or something similar. Then press Enter or Return to make a new line.
  3. Begin the query with SELECT, then press Enter or Return and Tab to create a new indented line.
  4. Type trips.start_station_id, and a comma. This line contains the start_station_id column from the trips table, which you will define with an alias later in this query. Then press Enter or Return to create a new line.
  5. Type COUNT(*) AS trip_ct. This line will help you count how many times the bike has left each station. Press Enter or Return again, then press Backspace to align the cursor with SELECT.
  6. Type FROM, then press Enter or Return and Tab to create a new indented line.
  7. Type longest_used_bike AS longest to rename your temp table with an alias. Then press Backspace to align the cursor with SELECT.

Your text should appear like this:

uoZ-0J00SW-GftCdNClvAQ_75dafb01f01249f1a0903bf7220508f1_Screenshot-2021-05-18-4 13 00-PM

Now, it's time to write an INNER JOIN, which you will use to pick out the station ID that corresponds to the bike you identified in the temporary table.

  1. Type INNER JOIN. Press Enter or Return and Tab to create a new indented line.
  2. Type bigquery-public-data.austin_bikeshare.bikeshare_trips AS trips. Press Enter or Return again, then press Backspace to align the cursor with SELECT.
  3. Type ON longest.bikeid = trips.bikeid. This specifies that the JOIN is on the bikeid column in the temp table you created and the original dataset. Then press Enter or Return to create a new line.
  4. Type GROUP BY, then press Enter or Return and Tab to create a new indented line.
  5. Type trips.start_station_id to group by the start_station_id column in the original dataset. Press Enter or Return again, then press Backspace to align the cursor with SELECT.
  6. Type ORDER BY, then press Enter or Return and Tab to create a new indented line. 14, Type trip_ct DESC to sort by the trip_ct column in descending order. Press Enter or Return again, then press Backspace to align the cursor with SELECT.
  7. Type LIMIT 1.
  8. Finally, click Run. The query might take a few seconds before showing you the count. If your query returns 2575 in the start_station_id column and 90 in the trip_ct column, you’ve written it correctly. Your text should appear like this:

Wyz7FfU6RCKs-xX1OjQiWA_c99280ddedc14c3d9543e6084e3057f1_Screenshot-2021-05-08-7 48 12-PM

You’ve now created a temporary table and executed a query with it. This will be helpful when you are performing several calculations at once.

Other types of temp tables

There are also other ways to create a temp table. Instead of using the WITH clause, you can use the SELECT INTO or the CREATE TABLE clauses.

The SELECT INTO clause copies data from one table into a new table, but doesn’t add the new table to the database. It’s useful if you want to make a copy of a table with a specific condition.

The CREATE TABLE clause is a good option when several people need to access the same temp table. This statement adds the table into the database.

Which clause you use depends on your preference and the project’s demands. Different clauses have their own strengths, so understanding how each of them work is helpful for using them effectively.

Confirmation and reflection

G-KePoi0R2Sinj6ItBdkMg_2d69ab4b929f40f2b472a78fdd5ed880_line-y

In a past activity, you learned about the importance of using the right type of join. In this activity, you wrote a query with an INNER JOIN to join your temporary table with the original bikeshare_trips table. Which station ID would your query return if you used a FULL JOIN instead of an INNER JOIN?

A. 3798

B. 2758

C. 2575

D. 3575

The correct answer is A. 3798. Explain: If you used a FULL JOIN instead of an INNER JOIN to join your temporary table to the bikeshare_trips table, you would return the station ID 3798. This is because the temp table you create has to be joined to the original in a specific way in order to return the correct answer. Going forward, you can use temp tables to perform more complex queries like this, which will help you analyze complex data throughout your career.

Question 2

In this activity, you created a temporary table to run calculations without needing to make modifications to the primary tables in your database. In the text box below, write 2-3 sentences (40-60 words) in response to each of the following questions:

  • Why was the JOIN statement necessary to use in this activity?
  • What is the benefit of executing a query in a temporary table rather than a primary table in a database?

Explain: In this activity, you used a temporary table to write a query. A good response would include how temporary tables are extremely helpful for complex calculations and queries.

By using a temporary table, you were able to answer a more complex question and not make any changes to a primary table in the database. You used a JOIN to find something specific from the original dataset. Going forward, you can be creative with your use of temp tables to find complicated answers to real-world questions.

Working with temporary tables

Temporary tables are exactly what they sound like—temporary tables in a SQL database that aren’t stored permanently. In this reading, you will learn the methods to create temporary tables using SQL commands. You will also learn a few best practices to follow when working with temporary tables.

A quick refresher on what you have already learned about temporary tables

  • They are automatically deleted from the database when you end your SQL session.
  • They can be used as a holding area for storing values if you are making a series of calculations. This is sometimes referred to as pre-processing of the data.
  • They can collect the results of multiple, separate queries. This is sometimes referred to as data staging. Staging is useful if you need to perform a query on the collected data or merge the collected data.
  • They can store a filtered subset of the database. You don’t need to select and filter the data each time you work with it. In addition, using fewer SQL commands helps to keep your data clean.

It is important to point out that each database has its own unique set of commands to create and manage temporary tables. We have been working with BigQuery, so we will focus on the commands that work well in that environment. The rest of this reading will go over the ways to create temporary tables, primarily in BigQuery.

Temporary table creation in BigQuery

Temporary tables can be created using different clauses. In BigQuery, the WITH clause can be used to create a temporary table. The general syntax for this method is as follows:

WITH
new_table_data AS (

SELECT *

FROM
Existing_table

WHERE
Tripduration >= 60
)

Breaking down this query a bit, notice the following:

  • The statement begins with the WITH clause followed by the name of the new temporary table you want to create
  • The AS clause appears after the name of the new table. This clause instructs the database to put all of the data identified in the next part of the statement into the new table.
  • The opening parenthesis after the AS clause creates the subquery that filters the data from an existing table. The subquery is a regular SELECT statement along with a WHERE clause to specify the data to be filtered.
  • The closing parenthesis ends the subquery created by the AS clause.

When the database executes this query, it will first complete the subquery and assign the values that result from that subquery to “new_table_data,” which is the temporary table. You can then run multiple queries on this filtered data without having to filter the data every time.

Temporary table creation in other databases (not supported in BigQuery)

The following method isn’t supported in BigQuery, but most other versions of SQL databases support it, including SQL Server and mySQL. Using SELECT and INTO, you can create a temporary table based on conditions defined by a WHERE clause to locate the information you need for the temporary table. The general syntax for this method is as follows:

SELECT 
*
INTO
AfricaSales
FROM
GlobalSales
WHERE
Region = "Africa"

This SELECT statement uses the standard clauses like FROM and WHERE, but the INTO clause tells the database to store the data that is being requested in a new temporary table named, in this case, “AfricaSales.”

User-managed temporary table creation

So far, we have explored ways of creating temporary tables that the database is responsible for managing. But, you can also create temporary tables that you can manage as a user. As an analyst, you might decide to create a temporary table for your analysis that you can manage yourself. You would use the CREATE TABLE statement to create this kind of temporary table. After you have finished working with the table, you would then delete or drop it from the database at the end of your session.

Note: BigQuery uses CREATE TEMP TABLE instead of CREATE TABLE, but the general syntax is the same.

CREATE TABLE table_name (
    column1 datatype,
    column2 datatype,
    column3 datatype,
  ....
)

After you have completed working with your temporary table, you can remove the table from the database using the DROP TABLE clause. The general syntax is as follows:

DROP TABLE table_name

Best practices when working with temporary tables

  • Global vs. local temporary tables: Global temporary tables are made available to all database users and are deleted when all connections that use them have closed. Local temporary tables are made available only to the user whose query or connection established the temporary table. You will most likely be working with local temporary tables. If you have created a local temporary table and are the only person using it, you can drop the temporary table after you are done using it.
  • Dropping temporary tables after use: Dropping a temporary table is a little different from deleting a temporary table. Dropping a temporary table not only removes the information contained in the rows of the table, but removes the table variable definitions (columns) themselves. Deleting a temporary table removes the rows of the table but leaves the table definition and columns ready to be used again. Although local temporary tables are dropped after you end your SQL session, it may not happen immediately. If a lot of processing is happening in the database, dropping your temporary tables after using them is a good practice to keep the database running smoothly.

For more information

Your intermediate guide to SQL

As you get more comfortable with SQL, you'll be able to take on even more advanced queries.

This in-depth guide will give you a more detailed introduction to some of the SQL functions you have already learned, and give you some new tools to work with. Be sure to keep it handy as you progress in your SQL learning journey:

Your intermediate guide to SQL

Using Connected Sheets with BigQuery

In this reading, you will learn about Connected Sheets, a tool that allows data professionals to use basic spreadsheet functions to analyze large datasets housed in BigQuery. With Connected Sheets users don’t need to know SQL. Instead, anyone, not just data professionals, can generate insights with basic spreadsheet operations such as formulas, charts, and pivot tables.

What is Connected Sheets?

oxN7Zm9DQGGjrYd5-3sWkg_e50ca8713b8f408489ceef8a15dc2ff1_oF-N6izP0wLZbug9Jol_fmcrUvE8PXFrNNlAWtMc5RvYAbNcT__u8lo5anTJe6oaUpee-7cQ3CZrc261axCaYVVOMYFzA2kpiXK0s5ltT_QwCbkhCs5O7cy_Lc5XBbWTa__sCvmXqzaG1QH_X0cZ2IM

Recall that BigQuery allows users to analyze petabytes (a million gigabytes) of data using complex queries. A benefit of BigQuery is that it reduces the time needed to develop insights from large datasets.

Google Sheets, on the other hand, is a spreadsheet tool that is easy to use and shareable with a familiar interface. It also allows simple and flexible analysis with tools like pivot tables, charts, and formulas.

Connected Sheets integrates both BigQuery and Google Sheets, allowing the user to analyze billions of rows of data in Sheets without any need for specialized knowledge, such as SQL.

Additionally, Connected Sheets is built to handle big data. Users won’t experience the same limitations or performance issues they’ve had in the past (such as data loss) when working with large data sets in spreadsheets.

Why would a data analytics professional use Connected Sheets? As a data analytics professional, Connected Sheets can help with several tasks, such as:

  • Collaborating with partners, analysts, or other stakeholders in a familiar spreadsheet interface;
  • Ensuring a single source of truth for data analysis without additional .csv exports;
  • Defining variables so that all users are working with the same data;
  • Sharing insights with your team in a secure environment; and
  • Streamlining your reporting and dashboard workflows.

Many teams and industries benefit from Connected Sheets such as finance, marketing, and operations teams.

A few example use cases of Connected Sheets include:

  • Business planning: A user can build and prepare datasets, and then find insights from the data. For example, a data analyst can analyze sales data to determine which products sell better in different locations.
  • Customer service: A user can find out which stores have the most complaints per 10,000 customers.
  • Sales: A user can create internal finance and sales reports. After completing, they can share revenue reports with sales reps.
  • Logistics, fulfillment, and delivery: A user can run real-time inventory management and intelligent analytics tools.

Connected Sheets benefits

Collaborate with teammates and stakeholders

Since Connects Sheets lives in Google Workspace, you can easily collaborate with other teammates and stakeholders in your company. If you’d like to limit access, you also control permissions for who can view, edit, or share the data.

Do more with familiar tools

With Connected Sheets, you can access billions of rows of BigQuery data directly in Sheets. This direct access makes it easier for all employees to track, forecast, and analyze their data to get to better decisions faster.

Easily visualize data

You can unlock insights from your BigQuery datasets using features you’re already familiar with in Sheets, such as pivot tables, charts, and formulas. These features help visualize large datasets more easily than using a more advanced language such as SQL. However, if you know SQL, you may prefer to use it in certain situations.

Up to date data

With Connected Sheets, data professionals can ensure they are making decisions based on a single source of truth by setting up automatic refreshes of BigQuery data in Sheets.

Less data integrity and security risk

While users can access big data with Connected Sheets, they won’t be able to accidentally manipulate or jeopardize the integrity of the data. There’s less security risk because data isn’t stored on individual workstations, it’s stored in the cloud.

Connected Sheets shortcomings

Limited free pricing tier

A shortcoming of Connected Sheets is that for the free pricing tier, users only receive 1 terabyte (TB) of processed query data each month. To process more data, you will need to move to a paid tier.

Data must be housed in BigQuery

Another shortcoming is that you will need access to your data set in BigQuery. Without access to BigQuery, you won’t be able to analyze data in Connected Sheets.

Query will fail with large results

A third shortcoming is that the Connected Sheets query will fail if the results are too large. Your query will fail if your pivot table has a significant amount of results, which could be anywhere from 30,000 to 50,000. To reduce your results, you can use filters or limit the number of rows per breakout.

Key takeaways

Connected Sheets provides a tremendous opportunity to analyze large data sets without specialized skills like SQL. Use familiar spreadsheet skills such as pivot tables, charts, and formulas to analyze the data. For junior data analysts in particular, Connected Sheets can help them perform key tasks within BigQuery and increase their marketable skills.

Resources for more information

Test your knowledge on using SQL with temporary tables

Question 1

When are temporary tables automatically deleted?

A. After running a report from the table

B. After completing all calculations in the table

C. After running a query in your SQL database

_D. After ending the session in a SQL database_

The correct answer is D. After ending the session in a SQL database. Explain: Temporary tables are automatically deleted after ending the session in a SQL database.

Question 2

The following SQL query contains information about bike trips:

WITH 1_hr_trips AS (
  SELECT *
  FROM bigquery-public-data.new_york.citibike_trips
  WHERE tripduration = 60
)

What data will appear in the temporary table created through this query?

A. The total number of bike trips

B. Bike trips equal to or more than one hour

C. A random subset of bike trips

D. Bike trips that lasted exactly 60 minutes

The correct answer is D. Bike trips that lasted exactly 60 minutes. Explain: This temporary table will show bike trips that lasted exactly 60 minutes. The name of the table is “1_hr_trips” and the query includes the condition that trips in the table equal one hour.

Question 3

What benefit does a CREATE TABLE statement add to a temporary table?

A. Automated calculations

B. Metadata about the data in the table

C. Specific naming conventions

D. Access for anyone to use the table

The correct answer is D. Access for anyone to use the table. Explain: A CREATE TABLE statement provides access for anyone to use the temporary table. The SELECT INTO statement is better suited for one person.