1.4.2.Learn about Structured Query Language(SQL) - sj50179/Google-Data-Analytics-Professional-Certificate GitHub Wiki

Q. Fill in the blank: A data analyst uses a SQL query to retrieve information from a database. They add a WHERE statement to _____ the data based on certain conditions.

  • filter

  • copy

  • sort

  • categorize

Correct. They add a WHERE statement to filter the data based on certain conditions.

More about SQL in action

Just as humans use different languages to communicate with others, so do computers. Structured Query Language (or SQL, sometimes pronounced as “sequel”) lets data analysts talk to their databases. SQL is one of the most useful data analyst tools, especially when you are working with large datasets in tables. It can help you investigate huge databases, track down text (referred to as strings) and numbers, and filter for the exact kind of data you need—much faster than a spreadsheet can. In this reading, we will go over the basics of using SQL and explore an example query to demonstrate how SQL works in action.

Basic syntax for SQL

Every programming language, including SQL, has to follow a unique set of guidelines known as syntax. As soon as you enter your search criteria using the correct syntax, the query should start working to pull the information you have requested from the target database.

What is a query?

A query is basically a request for data or information from a database. For example, ''Tell me how many comedy movies were made in 1985” or ''How many people live in Puerto Rico?'' When we query databases, we use SQL to communicate our question or request. Both the user and the database can always exchange information as long as you “speak” the same language.

The foundation of every SQL query is the same:

  • Use SELECT to choose the columns you want to return.
  • Use FROM to choose the tables where the columns you want are located.
  • Use WHERE to filter for certain information.

Basic components of a query (and a few useful tips)

Tip 1: You can write your SQL queries all in lowercase and don’t have to worry about proper spacing between words. With that said, however, using capitalization and spacing can help you read the information more easily. Keep your queries neat, and they will be easier to review or troubleshoot if you need to check them later on.

In the SQL syntax shown above, the SELECT statement identifies the column you want to pull data from by name, field1, and the FROM statement identifies the table in which the column is located by name, table. Finally, the WHERE statement narrows your query so that the database returns only the data with an exact value match or the data that matches a certain condition that you want.

For example, if you are looking for a specific customer with the last name Chavez, the WHERE statement would be: WHERE field1 = 'Chavez';

However, if you are looking for all customers with a last name that begins with the letters “Ch”, the WHERE statement would be: WHERE field1 LIKE 'Ch%';

You can see that the LIKE statement is very powerful because it allows you to tell the database to look for a certain pattern! The percent sign (%) is used as a wildcard to match one or more characters. In our example, both Chavez and Chen would be returned. Note that in some databases the asterisk (*) is used as the wildcard instead of the percent sign (%).

Can you use SELECT * ?

In our example, if you replace SELECT field1 with **SELECT ***  you would be selecting all the columns in the table. From a syntax point of view, it is a correct SQL statement, but you should use it sparingly and with caution because depending on how many columns a table has, you could be selecting a tremendous amount of data.

Finally, you will notice that we have shown the SQL statement with a semicolon at the end. The semicolon is a statement terminator and is part of the American National Standards Institute (ANSI) SQL-92 standard which is a recommended common syntax for adoption by all SQL databases. However, not all SQL databases have adopted or enforce the semicolon, so it’s possible you may come across some SQL statements that aren’t terminated with a semicolon. If a statement works without a semicolon, it’s fine.

Tip 2: Some tables aren’t designed with descriptive enough naming conventions. In our previous example, field1 was the column for a customer’s last name, but you wouldn’t know it by the name. A better name would have been something like last_name. In these cases, you can place comments alongside your SQL statements to help you remember what the name represents. Comments are text placed between certain characters, /* and /, or after two dashes (--*) as shown below.

Tip 3: You can also make it easier on yourself by assigning a new name or alias to the column or table names to make them easier to work with (and avoid the need for comments). This is done with a SQL AS statement. In the example below, you are changing field1 to last_name and table to customers for the duration of the query only. It doesn’t change the names in the actual table used in the database.

Putting SQL to work (what you might do as a data analyst)

Imagine you are a data analyst for a small business and your manager asks you for some employee data. You decide to write a query with SQL to get what you need from the database.

Let’s say you want to pull all the columns: empID, firstName, lastName, jobCode, and salary. Because you know the database isn’t that big, instead of creating a SELECT statement for each column, you use SELECT *.  This will select all the columns from the Employee table in the FROM statement.

Now, let’s get more specific about the data we want from the Employee table. If you want all the data about employees working in the SFI job code, you can use a WHERE statement to filter out the data based on this additional requirement.

Here, you use:

The portion of the resulting data returned from the SQL query might look like this:

Suppose you notice a large salary range for the SFI job code, so you would like to flag all employees in all departments with lower salaries for your manager. Because interns are also included in the table and they have salaries less than $30,000, you want to make sure your results give you only the full time employees with salaries that are $30,000 or less. In other words, you want to exclude interns with the INT job code who earn less than $30,000. A SQL AND statement will enable you to find this information.

You create a SQL query similar to below, where <> means "does not equal":

The resulting data from the SQL query might look like the following (interns with the job code INT aren't returned):

With quick access to this kind of data in SQL, you can provide your manager with tons of different insights about employee data, including whether employee salaries across the business are equitable. Fortunately, the query shows only an additional two employees might need a salary adjustment and you share the results with your manager.

Pulling the data, analyzing it, and implementing a solution might ultimately help improve employee satisfaction and loyalty–making SQL a pretty powerful tool.

Resources to learn more

  • SQL Cheat Sheet This starter guide for standard SQL syntax used in PostgreSQL offers videos, activities, and readings on SQL. By the time you are finished, you will get to know a lot more about SQL and be prepared to use it for business analysis and other tasks.

  • W3Schools SQL Tutorial: If you would like to explore a detailed tutorial of SQL, this is the perfect place to start. This tutorial includes interactive examples you can edit, test, and recreate. Use it as a reference or complete the whole tutorial to practice using SQL. Click the green Start learning SQL now button or the Next button ****to begin the tutorial.

SQL Guide: Getting started

This reading covers some of the best practices for formatting queries so they are easy to read and understand. You will be introduced to conventions that help the purpose and function of a query to stand out. This involves how you work with commands in queries as well as how you work with multiple fields. The reading closes with an introduction to how you add comments to your queries to help explain your thinking and the expected outcome of a query.

Capitalization and indentation

Shown below is a recommended format for writing queries. Capitalize SELECT, FROM, and WHERE. Make sure to add a new line and indent when adding the fields.

Here’s an example of what this could look like in BigQuery.

The query uses three commands to locate customers with the first name Tony.

  1. SELECT the column named ‘first_name’
  2. FROM a table named ‘customer_name’ (in a database named ‘customer_data’)
  3. but only the data WHERE ‘first_name’ is ‘Tony’

Tip: This is like filling in a template. Always start queries by writing the SELECT, FROM, and WHERE statements in this format. Enter your table name after the FROM command, enter the field(s) you want after the SELECT command, and then finally, enter the conditions you want to place on your query after the WHERE command. You’ll find this makes it easier to write SQL queries and it might be more naturally intuitive to proceed from large to small by entering the large details (table) before the small details (fields and conditions).

Multiple fields

Using the indentation previously described also makes it easier for you to group multiple fields together that are affected by the same command.

If you are requesting data from a table that has multiple fields, you need to include these fields in your SELECT command:

Here is an example of what this could look like in BigQuery:

The query uses three commands to locate customers with the first name Tony.

  1. SELECT the columns named ‘customer_id’, ‘first_name’, and ‘last_name’
  2. FROM a table named ‘customer_name’ (in a database named ‘customer_data’)
  3. but only the data WHERE ‘first_name’ is ‘Tony’

This query is no different than the previous query except that more data columns were selected. In general, it’s a better use of resources to select columns that you’ll make use of in your query.

It makes sense to select more columns if you use the additional fields in your WHERE statement. If you have multiple conditions in your WHERE statement they may be written like this:

Notice that unlike the SELECT command that uses a comma to separate fields/variables/parameters, the WHERE command uses the AND statement to connect conditions. This is important to remember because as you become a more advanced writer of queries you will make use of other connectors/operators such as OR and NOT to connect your conditions.

Here is an example of what this could look like in BigQuery:

The query uses three commands to locate customers with a valid (greater than 0) customer ID whose first name is Tony and last name is Magnolia.

  1. SELECT the columns named ‘customer_id’, ‘‘first_name’, and ‘last_name’
  2. FROM a table named ‘customer_name’ (in a database named ‘customer_data’)
  3. but only the data WHERE ‘customer_id’ is greater than zero, ‘first_name’ is ‘Tony’, and ‘last_name’ is ‘Magnolia’.

Note that one of the conditions is a logical condition where we are checking to see if customer_id is greater than zero.

Comments as reminders

The more comfortable you get with SQL, the easier it will be to read and understand queries at a glance. Still, it never hurts to have comments in a query to remind yourself of what you’re trying to do. This also makes it easier for others to understand your query if your query is shared. As your queries become more and more complex, this practice will save you a lot of time and energy to understand complex queries you wrote months or years ago.

Over time, your style of query will change and comments will help you remember what your thinking was at the time. Use the “--” symbols to make comments in your query. It tells SQL to ignore whatever comes after the “--” within the same line. For example:

Notice that comments can be added outside of a statement as well as within a statement. You can use this flexibility to provide an overall description of what you are going to do,  step-by-step notes about how you achieve it, and why you set different parameters/conditions.

Here is an example of how comments could be written in BigQuery:

In the example , we provide a comment next to each of the column names and give a description of the column and its uses. Two dashes (--) are generally supported. So it’s best to use -- and be consistent with it. You can use # in place of -- in the above query but # is not recognized in all SQL versions; for example, MySQL doesn’t recognize #.  You can also use /* before and */ after a comment if the database you’re using supports it.

As you develop your skills professionally, depending on the SQL database you use, you can pick the appropriate comment delimiting symbols you prefer and stick with those as a consistent style.

Test your knowledge on SQL

TOTAL POINTS 5

Question 1

SELECT *
FROM employee
WHERE jobCode = 'FTE'
              AND LastName = 'James'

What does the asterisk (*) after SELECT tell the database to do in this query?

  • Select all data that meets the criteria as stated in the query

  • Select all data that meets the criteria as stated in the query, then multiply it

  • Select all data from the employee table

  • Select all data from the LastName column

Correct. The asterisk tells the database to select all data that meets the criteria as stated in the query.

Question 2

SELECT *
FROM employee
WHERE jobCode = 'FTE'
      AND LastName = 'James'

In this query, the data analyst wants to retrieve data from which table?

  • LastName

  • James

  • jobCode

  • employee

Correct. The data analyst wants to retrieve data from the employee table.

Question 3

SELECT *
FROM employee
WHERE jobCode = 'FTE'
      AND LastName = 'James'

In this query, what will be retrieved from the database?

  • All data from the employee table, where the jobCode is FTE and the last name is James.

  • All data from the employee table, where the jobCode is FTE and the employee has any last name other than James.

  • All data from the jobCode table, where the jobCode is FTE and the employee has any last name other than James.

  • All data from the FTE table, where the employee's LastName is James.

Correct. This query will select all data from the employee table, where the jobCode is FTE and the last name is James.

Question 4

You begin writing the following query:

SELECT *

You want to instruct the database to retrieve data from the warehouse_inventory table. What is the SQL clause that will complete your query?

  • FROM = warehouse_inventory

  • FROM warehouse_inventory

  • WHERE warehouse_inventory

  • WHERE = warehouse_inventory

Correct. The clause that will complete your query is FROM warehouse_inventory. The statement is:

SELECT *
FROM warehouse_inventory

FROM indicates where the selected data comes from. And warehouse_inventory is the name of the table.

Question 5

You begin writing the following query:

SELECT *
FROM cook_book_table

You want to instruct the database to extract records from the recipe_number column, where the number is 59. What is the SQL clause that will complete your query?

  • WHERE recipe_number is 59

  • WHERE recipe_number = 59

  • FROM recipe_number 59

  • FROM recipe_number_59

Correct. The clause that will complete your query is WHERE recipe_number = 59. The statement is:

SELECT *
FROM cook_book_table
WHERE recipe_number = 59

A WHERE clause is used to extract only those records that meet a specified criteria, typically from a column in a table. Because 59 is a number, it does not require quotation marks.