3.3.5. Working with large datasets in SQL

Using BigQuery

BigQuery is a data warehouse on Google Cloud that data analysts can use to query, filter large datasets, aggregate results, and perform complex operations.

An upcoming activity is performed in BigQuery. This reading provides instructions to create your own BigQuery account, select public datasets, and upload CSV files. At the end of this reading, you can confirm your access to the BigQuery console before you move on to the activity.

Note: Additional getting started resources for a few other SQL database platforms are also provided at the end of this reading if you choose to work with them instead of BigQuery.

Types of BigQuery accounts

There are two different types of accounts: sandbox and free trial. A sandbox account allows you to practice queries and explore public datasets for free, but has additional restrictions on top of the standard quotas and limits. If you prefer to use BigQuery with the standard limits, you can set up a free trial account instead. More details:

  • A free sandbox account doesn’t ask for a method of payment. It does, however, limit you to 12 projects. It also doesn't allow you to insert new records into a database or update the field values of existing records. These data manipulation language (DML) operations aren't supported in the sandbox.
  • A free trial account requires a method of payment to establish a billable account, but offers full functionality during the trial period.

With either type of account, you can upgrade to a paid account at any time and retain all of your existing projects. If you set up a free trial account but choose not to upgrade to a paid account when your trial period ends, you can still set up a free sandbox account at that time. However, projects from your trial account won't transfer to your sandbox account. It would be like starting from scratch again.

Set up a free sandbox account for use in this program

Set up a free trial account instead (if you prefer)

If you prefer not to have the sandbox limitations in BigQuery, you can set up a free trial account for use in this program.

  • Follow these step-by-step instructions or watch the video, Setting up BigQuery, including sandbox and billing options. The free trial offers $300 in credit over the next 90 days. You won’t get anywhere near that spending limit if you just use the BigQuery console to practice SQL queries. After you spend the $300 credit (or after 90 days), your free trial will expire and you will need to explicitly choose to upgrade to a paid account to keep using Google Cloud Platform services, including BigQuery. Your method of payment will never be automatically charged after your free trial ends. If you choose to upgrade your account, you will begin to be billed for charges.
  • After you set up your account, you will see My First Project in the banner and the status of your account above the banner: your credit balance and the number of days remaining in your trial period.

How to get to the BigQuery console

In your browser, go to console.cloud.google.com/bigquery.

Note: Going to console.cloud.google.com in your browser takes you to the main dashboard for the Google Cloud Platform. To navigate to BigQuery from the dashboard, do the following:

  • Click the Navigation menu icon (Hamburger icon) in the banner.
  • Scroll down to the BIG DATA section.
  • Click BigQuery and select SQL workspace.

Watch the How to use BigQuery video for an introduction to each part of the BigQuery SQL workspace.

(Optional) Explore a BigQuery public dataset

You will be exploring a public dataset in an upcoming activity, so you can perform these steps later if you prefer.

(Optional) Upload a CSV file to BigQuery

These steps are provided so you can work with a dataset on your own at this time. You will upload CSV files to BigQuery later in the program.

Getting started with other databases (if not using BigQuery)

It is easier to follow along with the course activities if you use BigQuery, but if you are connecting to and practicing SQL queries on other database platforms instead of BigQuery, here are similar getting started resources:

Hands-On Activity: Introduction to BigQuery

Question 1

Activity overview


By now, you have been introduced to BigQuery, a data warehouse on Google Cloud that data analysts can use to query, filter large datasets, aggregate results, and perform complex operations. In this activity, you will explore the BigQuery interface, access public data in your console, and write some simple SQL queries using SELECT, FROM, and WHERE.

By the time you complete this activity, you will be more familiar with writing queries in the BigQuery interface. This will enable you to practice SQL, which is important for working with databases in your career as a data analyst.

Explore BigQuery


For this activity, you will need a BigQuery account. If you haven’t made one already, you can follow the instructions from the Using BigQuery reading. Once you have your account, you can start exploring!

Open your console

  1. Log in to BigQuery.
  2. Then, click the Go to console button on the BigQuery homepage. This will open a new tab with your console.


  3. Take a moment to explore your console. On the left side, you will find the Explorer menu; this includes a search bar you can use to find resources, pinned projects, and the + ADD DATA button. On the right side, you will find the Query Editor. This is where you will input queries and view datasets. You can also find your Job History, Query History, and Saved Queries here.

Access public data in BigQuery

In order to actually start writing queries, you will need some data to work with. Once you’re familiar with the BigQuery interface, you can access a public dataset directly from your console.

  1. Click on the + ADD DATA button in the Explorer menu and select Explore public datasets. This will open a new menu where you can search public datasets that are already available through Google Cloud.


  2. In the dataset menu you just opened, type london bicycle in the search box at the top; this will return the London Bicycle Hires dataset from the Greater London Authority. Click the dataset for more details.


  3. From the dataset information page, click the blue VIEW DATASET button. This will open your console in a new tab with this dataset loaded.


You'll notice that bigquery-public-data is now pinned in your Explorer menu. You can now explore and query these public datasets.

Follow these steps to find and pin the bigquery-public-data project if you do not have it pinned.

  1. Navigate to the Explorer menu in BigQuery.

  2. Type the word public in the search box and press Enter.

  3. Click "Broaden search to all projects".

  4. Find bigquery-public-data and pin it.

  5. Click on the arrow next to bigquery-public-data and scroll down the list of public datasets until you find the london_bicycles data. When you click on the dataset, it will list two tables. Click on cycle_hire.


This will pull up a new tab in your Query Editor with information about the table schema.

  6. After checking out the table schema, you can take a peek into what data the cycle_hire table contains by clicking on the Preview tab. This will give you a better idea of what kind of data you’ll be working with.


Once you have finished previewing the data, you can write a query!

Query your data

So far, you’ve learned three basic parts of a query: SELECT, FROM, and WHERE. As a refresher, here is what those basic parts represent in a query:

  • SELECT is the section of a query that indicates what data you want SQL to return to you.
  • FROM is the section of a query that indicates which table the desired data comes from.
  • WHERE is the section of a query that indicates any filters you’d like to apply to your dataset.
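
Put together, the three parts follow a single pattern. Here is a sketch that previews the style of query you will write below against the cycle_hire table:

SELECT 
    end_station_name -- what data you want returned
FROM 
    `bigquery-public-data.london_bicycles.cycle_hire` -- which table it comes from
WHERE 
    duration >= 1200; -- an optional filter on the rows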

Write a basic query

Now, construct a simple command using the basic parts of a query you have already learned! For example, you can select a specific column from the cycle_hire table, such as the end_station_name column.

  1. Start your query with a SELECT clause and indicate which column you want to select from the table; in this case, you’ll input end_station_name.
  2. After you have indicated which column you are selecting, write your FROM clause. You will need to specify the table you are querying by inputting its location, wrapped in backticks: `bigquery-public-data.london_bicycles.cycle_hire`

The completed query should appear like this:

SELECT 
   end_station_name 
FROM 
   `bigquery-public-data.london_bicycles.cycle_hire`;
  3. Run your completed query by clicking on the blue RUN button.

This query may take a few seconds to execute. Once it has finished, you will find the list of station names you requested under the Query Results console pane.

Write a query to answer a question

After running the first basic query, try answering a specific question about the data. For example, how many bike trips lasted for 20 minutes or longer?

  1. Click COMPOSE NEW QUERY to start a new query. Start with your SELECT statement again. This time, instead of returning individual rows, you will count them. You can use COUNT with an asterisk, which stands for all rows, like this: SELECT COUNT(*) AS num_of_trips
  2. Then you will add your FROM statement. You will be using the same dataset as the previous query: FROM `bigquery-public-data.london_bicycles.cycle_hire`.
  3. Finally, you’ll add a WHERE statement to specify that you want to filter for only bike rides 20 minutes or longer. If you check the preview of this data, you might notice that the duration is recorded in seconds, so you’ll specify 1200 seconds in your query. You can write that as WHERE duration >= 1200;

Your completed query should be written like this:

SELECT 
    COUNT(*) AS num_of_trips 
FROM 
    `bigquery-public-data.london_bicycles.cycle_hire` 
WHERE 
    duration >= 1200;
  4. Run your completed query by clicking on the blue RUN button.


This query may take a few seconds to execute. Once it has finished, you will find the count of rides that fit your criteria under the Query Results pane. More than 7 million bike trips lasted 20 minutes or longer!

Up for a challenge?

If you’re comfortable using queries to answer questions, try creating and running queries to answer any of the questions below:

  1. What are the names of the stations that bike_id 1710 started from?
SELECT
  DISTINCT start_station_name
FROM
  `bigquery-public-data.london_bicycles.cycle_hire`
WHERE
  bike_id = 1710;
  2. How many bike_ids have ended at "Moor Street, Soho"?
SELECT
  COUNT(DISTINCT bike_id) AS num_of_bikes
FROM
  `bigquery-public-data.london_bicycles.cycle_hire`
WHERE
  end_station_name = 'Moor Street, Soho';

There are 24,998 distinct bike_ids that ended at "Moor Street, Soho".

  3. What is the station_id for "Canton Street, Poplar"?
SELECT
  DISTINCT start_station_id --can also use end_station_id
FROM
  `bigquery-public-data.london_bicycles.cycle_hire`
WHERE
  start_station_name = 'Canton Street, Poplar'; --can also use end_station_name

Running this query returns the station_id recorded for Canton Street, Poplar.

  4. What is the name of the station whose ID is 111?
SELECT
  DISTINCT start_station_name --can also use end_station_name
FROM
  `bigquery-public-data.london_bicycles.cycle_hire`
WHERE
  start_station_id = 111; --can also use end_station_id

The name of the station whose ID is 111 is Park Lane , Hyde Park (the extra space around the comma appears in the dataset itself).

  5. How many distinct bike_ids had trip durations greater than 2400 seconds (or 40 minutes)?
SELECT
  COUNT(DISTINCT bike_id) AS num_of_bike_trips
FROM
  `bigquery-public-data.london_bicycles.cycle_hire`
WHERE
  duration > 2400; -- the schema indicates that trip duration is recorded in seconds

There are 31,636 distinct bike_ids with trip durations greater than 2400 seconds (or 40 minutes).

You can use the solutions doc to check your work: Intro to BigQuery Solutions


Confirmation and reflection

Run another query on your table:

SELECT
     end_station_name
FROM
    `bigquery-public-data.london_bicycles.cycle_hire` 
WHERE
     rental_id = 57635395;

At what station did the bike trip with rental_id 57635395 end?

A. Tower Gardens, Tower

B. East Village, Queen Elizabeth Olympic Park

C. Notting Hill Gate Station, Notting Hill

D. Southwark Street, Bankside

The correct answer is B. East Village, Queen Elizabeth Olympic Park. Explanation: The address listed under the end_station_name column for Row 1 of your results table was East Village, Queen Elizabeth Olympic Park. To find this, you successfully ran a query. Going forward, you will continue using SELECT, FROM, and WHERE statements in your queries to interact with databases using SQL. This will help you build more complicated SQL queries when you are analyzing data in the future.

Question 2

In this activity, you had an opportunity to get more familiar with BigQuery and writing SQL queries. In the text box below, write 2-3 sentences (40-60 words) in response to each of the following questions:

  • How do you think you can use public datasets on BigQuery to help develop your data analysis skills?
  • How do you think understanding basic query syntax will help you write more complicated queries in the future?

Explanation: Congratulations on completing this hands-on activity! You explored BigQuery, accessed public data in your console, and constructed some queries. A good response would include that BigQuery public datasets can help you practice writing SQL.

Being able to construct SQL queries is an important skill for data analysts, because they frequently need to work with databases. In upcoming activities, you will continue working with databases and writing queries with SQL, an essential tool in every data analyst’s toolkit.

Hands-On Activity: Create a custom table in BigQuery

Question 1

Activity overview


Recently, you’ve been thinking about identifying good data sources that would be useful for analysis. You also spent some time in a previous activity exploring a public dataset in BigQuery and writing some basic SQL queries. In addition to using public data on BigQuery, you will need to be able to import data from other sources. In this activity, you will create a custom dataset, load your own data into a new table, and query it.

By the time you complete this activity, you will be able to load your own data into BigQuery for analysis. This will enable you to import your own data sources into BigQuery, which is a skill you will need in order to analyze data from different sources.

What you will need

To get started, download the baby names data zip file. This file contains about 7 MB of data about popular baby names from the US Social Security Administration website.

Link to baby names data: names.zip

Create a custom table


Once you have the zip file downloaded, you can import it into BigQuery to query and analyze. In order to do that, you will need to create a new dataset and a custom table.

Step 1: Unzip the file

You will need to unzip the file you downloaded to your computer in order to access it on BigQuery. Once you have unzipped the file, you will find a .pdf file titled NationalReadMe that contains more information about the dataset. This dataset tracks the popularity of baby names for each year; you can find text files labelled by the year they contain. Open yob2014.txt to preview the data. You will notice that it contains comma-separated values in three columns. Remember where you saved this folder so you can reference it later.

Step 2: Create a dataset

Before you can upload your txt file and create a table to query, you will need to create a dataset to upload your data into and store your tables.

  1. Go to the Explorer pane in your workspace and click the three dots next to your pinned project to open a menu. From here, select Create dataset.


  2. This will open the Create dataset menu on the right side of your console. This is where you will fill out some information about the dataset. You will input the Dataset ID as babynames and set the Data location to United States (US). Once you have finished filling out this information, you can click the blue CREATE DATASET button at the bottom of the menu.


Step 3: Create table

Now that you have a custom dataset stored in your project space, you can add a table to it.

  1. From the babynames dataset, click the CREATE TABLE button. This will open another menu on the right side of your console.


  2. In the Source section, select the Upload option under Create table from. Then click the Browse button to open your files. Find and open the yob2014.txt file. Set the file format to CSV. In the Destination section, name your table names_2014. Under Schema, select Edit as text and input the following: name:string,gender:string,count:integer. This will establish the data types of the three columns in the table. Leave the other parameters as they are, and select Create table.


  3. Once you have created your table, it will appear in your Explorer pane under the dataset you created earlier.


Click on the table to open it in your workspace. Here, you can check the table schema. Then, go to the Preview tab to explore your data. The table should have three columns: name, gender, and count.
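
To confirm the upload worked, you can run a quick row count against the new table. This is a minimal sketch (total_names is just an illustrative alias):

SELECT
    COUNT(*) AS total_names
FROM
    `babynames.names_2014`;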


Query your custom table


Now that your table is set up, you’re ready to start writing queries and answering questions about this data. For example, let’s say you were interested in the top five baby names for boys in the United States in 2014.

Click COMPOSE NEW QUERY to start a new query for this table. Then copy and paste this code:

SELECT name, count FROM `babynames.names_2014`
WHERE gender = 'M' ORDER BY count DESC
LIMIT 5

The results list the five most popular boys’ names for 2014 along with their counts.

This query SELECTs the name and count columns from the names_2014 table. Using the WHERE clause, you are filtering for a specific gender for your results. Then, you’re sorting how you want your results to appear with ORDER BY. Because you are ordering by the count in descending order, you will get names and the corresponding count from largest to smallest. And finally, LIMIT tells SQL to only return the top five most popular names and their counts.

Once you have input this in your console, select RUN to get your query results.

Up for a challenge?

If you are comfortable creating your own custom tables, try uploading more files from the baby names dataset into tables you can query. For example, you could upload each of the files from 2015 to 2019 to find the top baby names for those years.
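
For instance, if you uploaded yob2015.txt into a table named names_2015 (a hypothetical name that follows the same convention as names_2014), the query would be identical apart from the table reference:

SELECT name, count FROM `babynames.names_2015`
WHERE gender = 'M' ORDER BY count DESC
LIMIT 5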

Confirmation and reflection


After running the query on your new table, what was the third most popular baby name for boys in 2014?

A. Mason

B. William

C. Noah

D. Jacob

The correct answer is A. Mason. Explanation: To find that Mason was the third most popular baby name for boys in 2014, you queried your custom table and checked the results. Going forward, you’ll be able to upload your own data sources into BigQuery for future analysis projects. This will allow you to practice writing SQL queries for more data sources, which will be a key skill as a data analyst.

Question 2

In this activity, you uploaded your own data to BigQuery and used it to create a custom table. In the text box below, write 2-3 sentences (40-60 words) in response to each of the following questions:

  • Why is being able to use data from different sources useful as a data analyst?
  • How can you use BigQuery custom tables and datasets in your future analysis projects?

Explanation: Congratulations! In this activity, you created a new dataset within your project, uploaded a .csv file to create a new table, and ran a SQL query. A good response would include that being able to evaluate and use different data sources allows you to access more data.

As a data analyst, being able to evaluate data sources and use the appropriate tool to analyze them is important. For instance, you were able to use SQL to analyze a dataset that was previously stored on your computer as a .csv file.

In-depth guide: SQL best practices

You can save this reading for future reference.

These best practices include guidelines for writing SQL queries, developing documentation, and examples that demonstrate these practices. This is a great resource to have handy when you are using SQL yourself; you can just go straight to the relevant section to review these practices. Think of it like a SQL field guide!

Capitalization and case sensitivity

With SQL, capitalization usually doesn’t matter. You could write SELECT or select or SeLeCT. They all work! But if you use capitalization as part of a consistent style, your queries will look more professional.

To write SQL queries like a pro, it is always a good idea to use all caps for clause starters (e.g., SELECT, FROM, WHERE, etc.). Functions should also be in all caps (e.g., SUM()). Column names should be all lowercase (refer to the section on snake_case later in this guide). Table names should be in CamelCase (refer to the section on CamelCase later in this guide). This helps keep your queries consistent and easier to read while not impacting the data that will be pulled when you run them. The only time that capitalization does matter is when it is inside quotes (more on quotes below).

Vendors of SQL databases may use slightly different variations of SQL. These variations are called SQL dialects. Some SQL dialects are case sensitive. BigQuery is one of them. Vertica is another. But most, like MySQL, PostgreSQL, and SQL Server, aren’t case sensitive. This means that if you search for country_code = 'us', the query will return all entries that have 'us', 'uS', 'Us', and 'US'. This isn’t the case with BigQuery. BigQuery is case sensitive, so that same search would only return entries where the country_code is exactly 'us'. If the country_code is 'US', BigQuery wouldn’t return those entries as part of your result.
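
If you need a case-insensitive match in a case-sensitive dialect like BigQuery, one common workaround is to normalize the column with the LOWER function before comparing. A minimal sketch, assuming a hypothetical Customers table with a country_code column:

SELECT
    *
FROM
    Customers -- hypothetical table
WHERE
    LOWER(country_code) = 'us'; -- matches 'us', 'US', 'uS', and 'Us'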

Single or double quotes: '' or " "

For the most part, it also doesn’t matter if you use single quotes ' ' or double quotes " " when referring to strings. For example, SELECT is a clause starter. If you put SELECT in quotes like 'SELECT' or "SELECT", then SQL will treat it as a text string. Your query will return an error because your query needs a SELECT clause.

But there are two situations where it does matter what kind of quotes you use:

  1. When you want strings to be identifiable in any SQL dialect
  2. When your string contains an apostrophe or quotation marks

Within each SQL dialect there are rules for what is accepted and what isn’t. But a general rule across almost all SQL dialects is to use single quotes for strings. This helps get rid of a lot of confusion. So if we want to reference the country US in a WHERE clause (e.g., country_code = 'US'), then use single quotes around the string 'US'.

The second situation is when your string has quotes inside it. Suppose a table called FavoriteFoods has one column listing each friend and another column listing their favorite food.

| Friend | Favorite_food |
| --- | --- |
| Rachel DeSantos | Shepherd's pie |
| Sujin Lee | Tacos |
| Najil Okoro | Spanish paella |

You might notice how Rachel's favorite food contains an apostrophe. If you were to use single quotes in a WHERE clause to find the friend who has this favorite food, it would look like this:

SELECT Friend FROM FavoriteFoods WHERE Favorite_food = 'Shepherd's pie'

This won’t work. If you run this query, you will get an error in return. This is because SQL recognizes a text string as something that starts with a quote (') and ends with another quote ('). So in the bad query above, SQL thinks that the Favorite_food you are looking for is 'Shepherd', because the apostrophe in Shepherd's ends the string.

Generally speaking, this should be the only time you would use double quotes instead of single quotes. So your query would look like this instead:

SELECT Friend FROM FavoriteFoods WHERE Favorite_food = "Shepherd's pie"

SQL understands text strings as either starting with a single quote (') or a double quote ("). Since this string starts with double quotes, SQL will expect another double quote to signal the end of the string. This keeps the apostrophe safe, so it will return "Shepherd's pie" and not 'Shepherd'.
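
Escaping the apostrophe is another option, although the mechanism varies by dialect, so verify what yours supports. In BigQuery, you can escape with a backslash; in dialects such as MySQL and PostgreSQL, you can double the quote instead:

-- BigQuery: escape the apostrophe with a backslash
SELECT Friend FROM FavoriteFoods WHERE Favorite_food = 'Shepherd\'s pie'

-- MySQL or PostgreSQL: double the single quote
SELECT Friend FROM FavoriteFoods WHERE Favorite_food = 'Shepherd''s pie'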

Comments as reminders

As you get more comfortable with SQL, you will be able to read and understand queries at a glance. But it never hurts to have comments in the query to remind yourself of what you are trying to do. And if you share your query, it also helps others understand it.

For example:

--This is an important query used later to join with the accounts table
SELECT
      rowkey, --key used to join with account_id
      Info.date, --date is in string format YYYY-MM-DD HH:MM:SS
      Info.code --e.g., 'pub-###'

FROM
    Publishers

You can use # in place of the two dashes, --, in the above query, but keep in mind that # isn’t recognized in all SQL dialects (for example, PostgreSQL and SQL Server don’t recognize #). So it is best to use -- and be consistent with it. When you add a comment to a query using --, the database query engine will ignore everything in the same line after --. It will continue to process the query starting on the next line.

Snake_case names for columns

It is important to always make sure that the output of your query has easy-to-understand names. If you create a new column (say from a calculation or from concatenating new fields), the new column will receive a generic default name (e.g., f0). For example:

SELECT
      SUM(tickets),
      COUNT(tickets),
      SUM(tickets) AS total_tickets,
      COUNT(tickets) AS number_of_purchases
FROM
      purchases

Results are:

| f0 | f1 | total_tickets | number_of_purchases |
| --- | --- | --- | --- |
| 8 | 4 | 8 | 4 |

The first two columns are named f0 and f1 because they weren’t named in the above query. SQL defaults to f0, f1, f2, f3, and so on. We named the last two columns total_tickets and number_of_purchases so these column names show up in the query results. This is why it is always good to give your columns useful names, especially when using functions. After running your query, you want to be able to quickly understand your results, like the last two columns we described in the example.

On top of that, you might notice how the column names have an underscore between the words. Names should never have spaces in them. If 'total_tickets' had a space and looked like 'total tickets', SQL would stop reading the name at 'total' and wouldn’t understand what you mean by 'tickets'; most dialects return an error. So, spaces are bad in SQL names. Never use spaces.

The best practice is to use snake_case. This means that 'total tickets', which has a space between the two words, should be written as 'total_tickets' with an underscore instead of a space.

CamelCase names for tables

You can also use CamelCase capitalization when naming your table. CamelCase capitalization means that you capitalize the start of each word, like a two-humped (Bactrian) camel. So the table TicketsByOccasion uses CamelCase capitalization. Please note that the capitalization of the first word in CamelCase is optional; camelCase is also used. Some people differentiate between the two styles by calling CamelCase, PascalCase, and reserving camelCase for when the first word isn't capitalized, like a one-humped (Dromedary) camel; for example, ticketsByOccasion.

At the end of the day, CamelCase is a style choice. There are other ways you can name your tables, including:

  • All lower or upper case, like ticketsbyoccasion or TICKETSBYOCCASION
  • With snake_case, like tickets_by_occasion

Keep in mind, the option with all lowercase or uppercase letters can make it difficult to read your table name, so it isn’t recommended for professional use.

The second option, snake_case, is technically okay. With words separated by underscores, your table name is easy to read, but it can get very long because you are adding the underscores. It also takes more time to write. If you use this table a lot, it can become a chore.

In summary, it is up to you to use snake_case or CamelCase when creating table names. Just make sure your table name is easy to read and consistent. Also be sure to find out if your company has a preferred way of naming their tables. If they do, always go with their naming convention for consistency.

Indentation

As a general rule, you want to keep the length of each line in a query <= 100 characters. This makes your queries easy to read. For example, check out this query, which has a line longer than 100 characters:

SELECT
CASE WHEN genre = 'horror' THEN 'Will not watch' WHEN genre = 'documentary'
THEN 'Will watch alone' ELSE 'Watch with others' END AS
Watch_category, COUNT(movie_title) AS number_of_movies
  FROM
       MovieTheater
  GROUP BY
       1

This query is hard to read and just as hard to troubleshoot or edit. Now, here is a query where we stick to the <= 100 character rule:

SELECT
	CASE
		WHEN genre = 'horror' THEN 'Will not watch'
		WHEN genre = 'documentary' THEN 'Will watch alone'
		ELSE 'Watch with others'
		END AS watch_category, COUNT(movie_title) AS number_of_movies
FROM
	MovieTheater
GROUP BY
	1

Now it is much easier to understand what you are trying to do in the SELECT clause. Sure, both queries will run without a problem because indentation doesn’t matter in SQL. But proper indentation is still important to keep lines short. And it will be valued by anyone reading your query, including yourself!

Multi-line comments

If you make comments that take up multiple lines, you can use -- for each line. Or, if you have more than two lines of comments, it might be cleaner and easier to use /* to start the comment and */ to close the comment. For example, you can use the -- method like below:

-- Date: September 15, 2020
-- Analyst: Jazmin Cisneros
-- Goal: Count the number of rows in the table
SELECT
	COUNT(*) AS number_of_rows -- the * stands for all, so this counts all rows
FROM
	table

Or, you can use the /* */ method like below:

/*
Date: September 15, 2020
Analyst: Jazmin Cisneros
Goal: Count the number of rows in the table
*/
SELECT
	COUNT(*) AS number_of_rows -- the * stands for all, so this counts all rows
FROM
	table

In SQL, it doesn’t matter which method you use. SQL ignores comments regardless of what you use: #, --, or /* and */. So it is up to you and your personal preference. The /* and */ method for multi-line comments usually looks cleaner and helps separate the comments from the query. But there isn’t one right or wrong method.

SQL text editors

When you join a company, you can expect each company to use their own SQL platform and SQL dialect. The SQL platform they use (e.g., BigQuery, MySQL, or SQL Server) is where you will write and run your SQL queries. But keep in mind that not all SQL platforms provide native script editors to write SQL code. SQL text editors give you an interface where you can write your SQL queries in an easier and color-coded way. In fact, all of the code we have been working with so far was written with an SQL text editor!

Example with Sublime Text

If your SQL platform doesn’t have color coding, you might want to think about using a text editor like Sublime Text or Atom. This section describes how SQL is displayed in Sublime Text.

With Sublime Text, you can also do advanced editing like deleting indents across multiple lines at the same time. For example, suppose your query somehow had indents in the wrong places.

This is really hard to read, so you will want to eliminate those indents and start over. In a regular SQL platform, you would have to go into each line and press BACKSPACE to delete each indent per line. But in Sublime, you can get rid of all the indents at the same time by selecting all lines and pressing Command (or CTRL in Windows) + [. This eliminates indents from every line. Then you can select the lines that you want to indent (i.e., lines 2, 4, and 6) by pressing the Command key (or the CTRL key in Windows) and selecting those lines. Then while still holding down the Command key (or the CTRL key in Windows), press ] to indent lines 2, 4, and 6 at the same time. This will clean up your query.

Sublime Text also supports regular expressions. Regular expressions (or regex) can be used to search for and replace string patterns in queries. We won’t cover regular expressions here, but you might want to learn more about them on your own because they are a very powerful tool.

You can begin with these resources:

Hands-On Activity: Applying SQL

Question 1

Activity overview


In previous lessons, you learned how to apply formulas in spreadsheets. In this activity, we will practice using formulas with SQL queries.

By the time you complete this activity, you will be able to use SQL to write queries for datasets. This will enable you to explore public datasets in BigQuery, which is important for writing queries in your career as a data analyst.

Set up your data


  1. Log in to BigQuery Sandbox. If you have a free trial version of BigQuery, you can use that instead. On the BigQuery page, click the Go to BigQuery button.
  • Note: BigQuery Sandbox frequently updates its user interface. The latest changes may not be reflected in the screenshots presented in this activity, but the principles remain the same. Adapting to changes in software updates is an essential skill for data analysts, and it’s helpful for you to practice troubleshooting. You can also reach out to your community of learners on the discussion forum for help.
  2. If you have never created a BigQuery project before, click CREATE PROJECT on the right side of the screen. If you have created a project before, you can use an existing one or create a new one by clicking the project dropdown in the blue header bar and selecting NEW PROJECT.
  3. Name your project something that will help you identify it later. You can give it a unique project ID or use an auto-generated one. Don’t worry about selecting an organization if you don’t know what to put.
  4. Now, you’ll see the Editor interface. In the middle of the screen is a window where you can type code, and to the left is the Explorer menu where you can search for datasets.

Pick a dataset

Follow these steps to find and pick a dataset for this activity:

  1. Locate the Explorer menu on the left side of your screen. Click on + ADD DATA and then Explore public datasets.


  2. In the Search Marketplace bar, search for “New York City Trees.” Click the result, then click View Dataset. This will bring you back to the BigQuery Sandbox interface in a new tab.
  • Note: This may pin the bigquery-public-data dropdown to the Explorer menu. You can use this to browse datasets and tables.


  3. In BigQuery, you’ll find information on the dataset you selected. Review the description of the dataset.


For instance, you can locate the Dataset ID. You will need this in order to write an SQL query, so that you can tell what database, dataset, and table you’re targeting. In this case, the database connection is “bigquery-public-data” and the Dataset ID is “new_york_trees”. You'll still need to identify what table you want to query, so begin with a close review of the dataset.
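
In a FROM clause, those parts combine into a single dot-separated path wrapped in backticks. As a small preview of the table you will choose below (LIMIT keeps the result set small):

SELECT *
FROM `bigquery-public-data.new_york_trees.tree_census_2005`
LIMIT 10;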

Choose a table

  1. Enter the Dataset ID, “new_york_trees,” in the search bar of the Explorer menu.

If this doesn’t pull anything up, you can find it manually by deleting the text from the search bar, clicking on the arrow next to bigquery-public-data, and scrolling until you find the right dataset.

  2. Once you’ve found the “new_york_trees” dataset, click on the arrow next to it to expand the dataset and examine the tables it contains.


These are all tables contained in the dataset. You can check out the data for all trees cataloged in New York City for three specific years.

  3. Click on the tree_census_2005 table. This will bring up information for the table.
  4. On the right side of the information window, click Query and then select "In new tab".


This will populate the query window with a query. Notice that the query doesn’t contain anything in between “SELECT” and “FROM”.


Write a query

Query the data

You’ll still need to complete the query by specifying what data you want to select.

  1. Insert an asterisk * after SELECT, so that your query reads SELECT * FROM followed by your table location.

  2. Run the query. In the example provided, your result should be something like this:


This query returns all columns for the first 1,000 rows from the table.

  3. Write a query to find out the average diameter of all NYC trees in 2005. Instead of an asterisk, type AVG(tree_dbh).
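
Put together, the completed query uses the same table path as before:

SELECT AVG(tree_dbh)
FROM `bigquery-public-data.new_york_trees.tree_census_2005`;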


This returns your answer, 12.833 (which means the average diameter of NYC trees in 2005 was 12.833 inches).

Write your own queries

Now, come up with some questions and answer them with your own SQL queries. For instance, in the same dataset as the example, try exploring the same data for 1995 and 2015.
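
For example, assuming the 2015 census table is named tree_census_2015 and also records diameter in a tree_dbh column (check the table's schema to confirm), the 2015 version of the earlier query would be:

SELECT AVG(tree_dbh)
FROM `bigquery-public-data.new_york_trees.tree_census_2015`;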

You are also free to choose another publicly available dataset in BigQuery and write your own queries for extra practice—there are a lot of interesting choices!

Confirmation and reflection


According to the dataset you used in this activity, what was the average diameter of NYC trees in 2015?

A. 12.981

B. 12.334

C. 11.439

D. 11.279

The correct answer is D. 11.279. Explanation: The average diameter of NYC trees in 2015 was 11.279 inches. Going forward, you can write other SQL queries to return data from datasets and databases. This will help you find the data you need for future projects.

Question 2

During this activity, you practiced writing SQL queries to return information from datasets. In the text box below, write 2-3 sentences (40-60 words) in response to each of the following questions:

  • What do you think might happen if you wrote each component of a query correctly, but rearranged the order?
  • How can you use SQL queries to grow as a data analyst?

Explanation: Congratulations on completing this hands-on activity! A strong response would include how querying public datasets is a great way to practice SQL. Beyond that, consider the following:

Data analysts use SQL to interact with databases and view data they need to analyze. This is important knowledge that will prepare you for future courses and many aspects of your career as a data analyst. In upcoming activities, you will learn and practice writing more advanced queries, which will help you master SQL—an essential tool in every data analyst’s toolkit.

Test your knowledge on using SQL with large datasets

Question 1

In MySQL, what is acceptable syntax for the SELECT keyword? Select all that apply.

  • 'SELECT'
  • "SELECT"
  • SELECT
  • select

Explanation: The correct answers are SELECT and select. In MySQL, keyword capitalization doesn’t matter, but quoted versions like 'SELECT' or "SELECT" are treated as text strings rather than keywords.

Question 2

A database table is named blueFlowers. What type of case is this?

A. Lowercase

B. Camel case

C. Snake case

D. Sentence case

The correct answer is B. Camel case. Explanation: blueFlowers is in camel case, with each word after the first capitalized.

Question 3

In BigQuery, what optional syntax can be removed from the following FROM clause without stopping the query from running?

FROM `bigquery-public-data.sunroof_solar.solar_potential_by_postal_code`

A. Underscores

B. Backticks

C. Dots

D. Dashes

The correct answer is B. Backticks. Explanation: The table path is shown between two backticks to help people read the query more easily. If you remove the backticks, the query will still run.

Question 4

In the following FROM clause, what is the table name in the SQL query?

FROM `bigquery-public-data.sunroof_solar.solar_potential_by_postal_code`

A. sunroof_solar

B. solar.solar

C. solar_potential_by_postal_code

D. public-data.sunroof

The correct answer is C. solar_potential_by_postal_code. Explanation: The table name in the SQL query is solar_potential_by_postal_code. This table is in the sunroof_solar dataset, a public dataset in BigQuery.