Course 3-3

Google Data Analytics Professional

Prepare Data for Exploration

WEEK 3 - Databases: Where data lives

When you’re analyzing data, you’ll access much of the data from a database. It’s where data lives. In this part of the course, you’ll learn all about databases, including how to access them and extract, filter, and sort the data they contain. You’ll also check out metadata to discover the different types and how analysts use them.

Learning Objectives

  • Describe databases with references to their functions and components.
  • Explain metadata as it relates to databases.
  • Discuss the importance of metadata and how it relates to the work of a data analyst.
  • Recall the issues and steps involved in accessing data from multiple sources.
  • Explain the use of filters and sorting functionality in spreadsheets.
  • Demonstrate how to use spreadsheet functionality to import and inspect a given set of data.
  • Demonstrate how to use SQL functions to extract data from a given database.

Working with databases

All about databases

So far, you've seen how data can be gathered and analyzed to solve all kinds of problems. Next, we're going to learn all about databases.

As a refresher, a database is a collection of data stored in a computer system, but storage is just the beginning.

You'll discover how databases make it possible to find the exact piece of information you need for your analysis. You'll also learn how to sort data in order to zoom in on what you need to generate insightful reports and much more.

Then we'll go even deeper, and I mean really, really deep. I'm talking about metadata. You've probably heard someone say, wow that's so meta. Usually they're talking about something referencing back to itself or being completely self-aware.

For example, if a character in a book knows she's in a book, that's meta. If you make a documentary about making documentaries, that's also meta. And here at Google, I constantly analyze how I analyze data. That's definitely meta.

I do that to give my work a quality check to make sure my methods are fair. And to be certain that I'm paying attention to any biases that might affect the outcome. As an analyst, you should do this too. Sometimes we get a little too close to our data. So stepping back and asking ourselves if our processes make sense is key.

But let's back up just a bit and define metadata. Metadata is data about data. Like I said: deep.

Metadata is extremely important when working with databases. Think of it like a reference guide. Without the guide all you have is a bunch of data with no context explaining what it means. Metadata tells you where the data comes from, when and how it was created, and what it's all about.

Up next, you'll learn how to take data from a database or another source and bring it into a spreadsheet. You'll do this either by importing it directly or by using SQL to generate the request. And once you have data in a spreadsheet, the possibilities are endless. Everything we're about to cover is a very important part of the prepare phase of the data analysis process. It's how data analysts figure out which kind of data is going to be most helpful to them.

If you have the right data, you're much more likely to be able to solve your business problems successfully.

Database features

Databases are essential tools for data analysts. I use them constantly. Just about all of the data I access is stored within databases. Databases store and organize data, making it much easier for data analysts to manage and access information. They help us get insights faster, make data-driven decisions, and solve problems.

You've already heard a bit about what databases are and how they're used by data analysts. Now let's learn more about database features and components. Here's a simple database structure.

It contains tables with information from a car manufacturer. The top level includes car dealerships, product details, and repair parts. Then if you drill down to the next level by selecting one of those tables, you'll find more specific details about each item.

This is called a relational database. A relational database is a database that contains a series of related tables that can be connected via their relationships.

For two tables to have a relationship, one or more of the same fields must exist inside both tables. For example, branch ID exists in both the dealership table and the product details table. If a field exists within both tables, we can use it to connect them. The branch ID field is the key to connecting these tables.

There are two types of keys. A primary key is an identifier that references a column in which each value is unique. You can think of it as a unique identifier for each row in a table. For our dealership table with information about the different dealership branches, branch ID is the primary key. Similarly, for the product details table about each car, VIN is our primary key. As an analyst, you may need to create tables. If you do decide to include a primary key, it should be unique, meaning no two rows can have the same primary key. Also, it cannot be null or blank.

There are also foreign keys. A foreign key is a field within a table that's a primary key in another table. In other words, a foreign key is how one table can be connected to another. Because our repair parts table contains information about each car part, the primary key is part ID. Each row in our repair parts table represents one unique part.

All the other keys in this table, such as the VIN, are the foreign keys that allow the repair parts table to be connected to the other tables.

As you can see, a table can only have one primary key but it can have multiple foreign keys.
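To make this concrete, here is a minimal SQL sketch of the car manufacturer schema described above. The table and column names follow the narration, but the exact types and layout are assumptions for illustration, not the course's own code:

```sql
-- Minimal sketch of the dealership/product/parts schema (illustrative names and types).
CREATE TABLE dealership (
  branch_id   INT PRIMARY KEY,   -- primary key: unique, never null
  branch_name VARCHAR(100)
);

CREATE TABLE product_details (
  vin       VARCHAR(17) PRIMARY KEY,               -- one primary key per table
  branch_id INT REFERENCES dealership (branch_id)  -- foreign key to dealership
);

CREATE TABLE repair_parts (
  part_id INT PRIMARY KEY,
  vin     VARCHAR(17) REFERENCES product_details (vin)  -- foreign key to product_details
);
```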

Understanding primary and foreign keys can be tricky, so you'll have more opportunities to practice coming up. But as a general summary, a primary key is used to ensure data in a specific column is unique. It uniquely identifies a record in a relational database table. Only one primary key is allowed in a table and they cannot contain null or blank values.

And a foreign key is a column or group of columns in a relational database table that provides a link between the data in two tables. It refers to a field in a table that is the primary key of another table. Lastly, it's important to note that more than one foreign key is allowed to exist in a table.

And coming up, you'll begin practicing how to access and analyze data from actual databases. That will be a great opportunity to improve your understanding of primary and foreign keys, database organization and how you might use databases in your future analytics career.

Databases in data analytics

Databases enable analysts to manipulate, store, and process data. This helps them search through data a lot more efficiently to get the best insights.

Relational databases

A relational database is a database that contains a series of tables that can be connected to show relationships. Basically, they allow data analysts to organize and link data based on what the data has in common.

In a non-relational table, you will find all of the possible variables you might be interested in analyzing grouped together. This can make it really hard to sort through. This is one reason why relational databases are so common in data analysis: they simplify a lot of analysis processes and make data easier to find and use across an entire database.

  • Database normalization: Normalization is a process of organizing data in a relational database, for example, creating tables and establishing relationships between those tables. It is applied to eliminate data redundancy, increase data integrity, and reduce complexity in a database.

The key to relational databases

Tables in a relational database are connected by the fields they have in common. You might remember learning about primary and foreign keys before. As a quick refresher, a primary key is an identifier that references a column in which each value is unique. In other words, it's a column of a table that is used to uniquely identify each record within that table. The value assigned to the primary key in a particular row must be unique within the entire table. For example, if customer_id is the primary key for the customer table, no two customers will ever have the same customer_id.

By contrast, a foreign key is a field within a table that is a primary key in another table. A table can have only one primary key, but it can have multiple foreign keys. These keys are what create the relationships between tables in a relational database, which helps organize and connect data across multiple tables in the database.

Some tables don't require a primary key. For example, a revenue table can have multiple foreign keys and not have a primary key. A primary key may also be constructed using multiple columns of a table. This type of primary key is called a composite key. For example, if customer_id and location_id are two columns of a composite key for a customer table, the values assigned to those fields in any given row must be unique within the entire table.
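A short SQL sketch of that composite key, reusing the customer_id and location_id fields from the example; the remaining columns and types are assumptions:

```sql
-- Composite primary key: the (customer_id, location_id) pair must be unique per row.
CREATE TABLE customer (
  customer_id   INT,
  location_id   INT,
  customer_name VARCHAR(100),
  PRIMARY KEY (customer_id, location_id)
);
```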

SQL? You're speaking my language

Databases use a special language to communicate called a query language. Structured Query Language (SQL) is a type of query language that lets data analysts communicate with a database. So, a data analyst will use SQL to create a query to view the specific data that they want from within the larger set. In a relational database, data analysts can write queries to get data from the related tables. SQL is a powerful tool for working with databases, which is why you are going to learn more about it coming up!
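As a taste of what a query looks like, here is a hypothetical SELECT that pulls data from two related tables, reusing the illustrative dealership schema sketched earlier:

```sql
-- Join two related tables on their shared key and return matching rows.
SELECT d.branch_name,
       p.vin
FROM dealership AS d
JOIN product_details AS p
  ON p.branch_id = d.branch_id;  -- foreign key in product_details, primary key in dealership
```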

Primary and Foreign Keys

A quick recap, matching each description to the key it identifies:

  • A table can only have one: the primary key. A primary key is unique and can only exist once within a given table. It cannot contain null or blank values.
  • A table can have many: foreign keys. A foreign key is a field that is a primary key in another table, and tables can contain multiple foreign keys.
  • This cannot have null or blank values: the primary key.
  • This is a field in a table that is a primary key in another table: a foreign key.
  • This is used to connect tables in relational databases: both. Primary and foreign keys connect tables in relational databases; tables can have only one primary key but multiple foreign keys.
  • This is a column with unique values for each row within a table: the primary key.

Inspecting a dataset: A guided, hands-on tour

As a data analyst, you'll use data to answer questions and solve problems. When you analyze data and draw conclusions, you are generating insights that can influence business decisions, drive positive change, and help your stakeholders meet their goals.

Before you begin an analysis, it’s important to inspect your data to determine if it contains the specific information you need to answer your stakeholders’ questions. In any given dataset, it may be the case that:

  • The data is not there (you have sandwich data, but you need pizza data)
  • The data is insufficient (you have pizza data for June 1-7, but you need data for the entire month of June)
  • The data is incorrect (your pizza data lists the cost of a slice as $250, which makes you question the validity of the dataset)

Inspecting your dataset will help you pinpoint which questions are answerable and what data is still missing. You may be able to recover this data from an external source or at least recommend to your stakeholders that another data source be used.

In this reading, imagine you’re a data analyst inspecting spreadsheet data to determine if it’s possible to answer your stakeholders’ questions.

The scenario

You are a data analyst working for an ice cream company. Management is interested in improving the company's ice cream sales.

The company has been collecting data about its sales—but not a lot. The available data is from an internal data source and is based on sales for 2019. You’ve been asked to review the data and provide some insight into the company’s ice cream sales. Ideally, management would like answers to the following questions:

  1. What is the most popular flavor of ice cream?
  2. How does temperature affect sales?
  3. How do weekends and holidays affect sales?
  4. How does profitability differ for new versus returning customers?

Download the data

You can download the data to follow along with this reading. To use the template for the sales data, click the link below and select "Use Template."

Link to template: Ice Cream Sales

Inspect the data

Question 1: What is the most popular flavor of ice cream?

To discover the most popular flavor, you first need to define what is meant by "popular." Is the most popular flavor the one that generated the most revenue in 2019? Or is it the flavor that had the largest number of units sold in 2019? Sometimes your measurement choices are limited by what data you have—you can review your spreadsheet to find out if either of these definitions of "popular" makes sense based on the available data.

Click the flavors tab on your spreadsheet to view the relevant data. The flavors sheet has three columns and 209 rows of data. The column headers are week, units sold, and flavor. This dataset did not come with a data description, so you have to figure out the significance of the columns on your own. Based on the data, you deduce that these columns provide information about the number of units sold for each ice cream flavor, by week, in 2019.

In this case, you can discover what the most popular flavor is by using units sold as your measure. In particular, you can use the units sold column to calculate the total number of units sold during the year for each flavor. Unfortunately, the dataset does not provide the annual sales amount by flavor. In this case, your next step would be to ask your stakeholders if the annual sales per flavor data is available from another source. If not, you can add a statement about the current data’s limitations to your analysis.
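For instance, assuming the flavors sheet lays out week, units sold, and flavor in columns A, B, and C, a SUMIF like this (with an illustrative flavor name) would total the year's units sold for one flavor:

```
=SUMIF(flavors!C:C, "Chocolate", flavors!B:B)
```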

Question 2: How does temperature affect sales?

To explore your second question, you click the temperatures tab and check out the data. The temperatures sheet has two columns and 366 rows of data. The column headers are temperature and sales. The data may show total 2019 sales per temperature (for instance, the first entry might sum up $39.69 in sales for three separate days that each had a high of 60 degrees). Or, the data may show a snapshot of sales and temperature for each day in 2019 (for instance, the first entry might refer to a single day with a high of 60 degrees and $39.69 in sales).

So, which is it? It’s probably a daily snapshot because there are 365 entries for temperature, and multiple rows with the same temperature and different sales values. This implies that each entry is for a single day and not a summary of multiple days. However, without more information, you can’t be certain. Plus, you don’t know if the current data is listed in consecutive order by date or in a different order. Your next step would be to contact the owner of the dataset for clarification.

If it turns out that temperature does affect sales, you’ll be able to offer your stakeholders an insight such as the following: “When daily highs are above X degrees, average ice cream sales increase by Y amount. So the business should plan on increasing inventory during these times to maximize sales.”

Question 3: How do weekends and holidays affect sales?

Next, you click on the sales tab to view the data about dates of sale. The sales sheet has two columns and 366 rows of data. The column headers are date and sales. This data is most likely total daily sales in 2019, as sales are recorded for each date in 2019.

You can use this data to determine whether a specific date falls on a weekend or holiday and add a column to your sheet that reflects this information. Then, you can find out whether sales on the weekends and holidays are greater than sales on other days. This will be useful to know for inventory planning and marketing purposes.
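A minimal sketch of that helper column, assuming the dates sit in column A starting at row 2. WEEKDAY returns 1 for Sunday and 7 for Saturday by default; flagging holidays would additionally need a lookup against a list of holiday dates, which this omits:

```
=IF(OR(WEEKDAY(A2)=1, WEEKDAY(A2)=7), "Weekend", "Weekday")
```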

Question 4: How does profitability differ for new customers versus returning customers?

Your dataset does not contain sales data related to new customers. Without this data, you won't be able to answer your final question. However, it may be the case that the company collects customer data and stores it in a different data table.

If so, your next step would be to find out how to access the company’s customer data. You can then join the revenue sales data to the customer data table to categorize each sale as from a new or returning customer and analyze the difference in profitability between the two sets of customers. This information will help your stakeholders develop marketing campaigns for specific types of customers to increase brand loyalty and overall profitability.
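In SQL, that join and comparison might look roughly like the sketch below. The table and column names are assumptions, since the actual customer table isn't described in this reading:

```sql
-- Hypothetical sketch: label each sale as from a new or returning customer,
-- then compare average profit between the two groups (BigQuery-style SQL).
SELECT
  CASE
    WHEN c.first_purchase_date = s.sale_date THEN 'new'
    ELSE 'returning'
  END AS customer_type,
  AVG(s.profit) AS avg_profit
FROM sales AS s
JOIN customers AS c
  ON s.customer_id = c.customer_id
GROUP BY customer_type;
```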

Conclusion

When working on analytics projects, you won't always have all the necessary or relevant data at your disposal. In many of these cases, you can turn to other data sources to fill in the gaps.

Despite the limitations of your dataset, it’s still possible to offer your stakeholders some valuable insights. For next steps, your best plan of action will be to take the initiative to ask questions, identify other relevant datasets, or do some research on your own. No matter what data you’re working with, carefully inspecting your data makes a big impact on the overall quality of your analysis.

Practice Quiz: database

Managing data with metadata

Exploring metadata

Now that you understand the different ways to organize data in a database, let's talk about how you can describe that data.

We'll start exploring metadata, which is a very important aspect of database management. Metadata is an abstract concept, though. Let's kick things off with a simple, everyday example. Did you know that every time a photo is taken with a smartphone, data is automatically collected and stored within that photo? Take a look. Here's a cute shot of my friend's dogs, Rudy and Matilda. You can try this with any photo on your computer: right-click it and select "Get Info" or "Properties." This will give you the photo's metadata, which may tell you the type of file it is; the date and time it was taken; the geolocation, or where it was taken; what kind of device was used to take the photo; and much more.

Pretty amazing, right? Here's another example. Every time you send or receive an email, metadata is sent right along with that message. You can find it by clicking on "Show Original" or "View Message Details." An email message's metadata includes its subject, who it's from, who it's to, and the date and time it was sent. The metadata even knows how quickly it was delivered after the sender pressed, "Send."

Metadata is information that's used to describe the data that's contained in something, like a photo or an email. Keep in mind that metadata is not the data itself. Instead, it's data about the data.

In data analytics, metadata helps data analysts interpret the contents of the data within a database. That's why metadata is so important when working with databases. It tells an analyst what the data is all about. That makes it possible to put the data to work solving problems and making data-driven decisions.

As a data analyst, there are three common types of metadata that you'll come across: descriptive, structural, and administrative.

Descriptive metadata is metadata that describes a piece of data and can be used to identify it at a later point in time. For instance, the descriptive metadata of a book in a library would include the code you see on its spine, known as a unique International Standard Book Number, also called the ISBN. It would also include the book's author and title.

Next is structural metadata, which is metadata that indicates how a piece of data is organized and whether it's part of one or more than one data collection.

Let's head back to the library. An example of structural metadata would be how the pages of a book are put together to create different chapters.

It's important to note that structural metadata also keeps track of the relationship between two things. For example, it can show us that the digital document of a book manuscript was actually the original version of a now printed book. Finally, we have administrative metadata.

Administrative metadata is metadata that indicates the technical source of a digital asset. When we looked at the metadata inside the photo, that was administrative metadata. It shows you the type of file it was, the date and time it was taken, and much more.

Here's one final thought to help you understand metadata. If you're on your way to the library to pick out a book, you could research a book's title, author, length, and number of chapters. That's all metadata, and it can tell you a lot about the book, but you have to actually read the book to know what it's all about.

Likewise, you can read about data analytics, but you have to take this course to earn the Google Data Analytics certificate. Keep moving forward to gain that new perspective.

Metadata is as important as the data itself

Data analytics, by design, is a field that thrives on collecting and organizing data. In this reading, you are going to learn about how to analyze and thoroughly understand every aspect of your data.

Take a look at any data you find. What is it? Where did it come from? Is it useful? How do you know? This is where metadata comes in to provide a deeper understanding of the data. To put it simply, metadata is data about data. In database management, it provides information about other data and helps data analysts interpret the contents of the data within a database.

Regardless of whether you are working with a large or small quantity of data, metadata is the mark of a knowledgeable analytics team, helping to communicate about data across the business and making it easier to reuse data. In essence, metadata tells the who, what, when, where, which, how, and why of data.

Elements of metadata

Before looking at metadata examples, it is important to understand what type of information metadata typically provides.

  • Title and description: What is the name of the file or website you are examining? What type of content does it contain?
  • Tags and categories: What is the general overview of the data that you have? Is the data indexed or described in a specific way?
  • Who created it and when: Where did the data come from, and when was it created? Is it recent, or has it existed for a long time?
  • Who last modified it and when: Were any changes made to the data? If yes, were the modifications recent?
  • Who can access or update it: Is this dataset public? Are special permissions needed to customize or modify the dataset?

Examples of metadata

In today's digital world, metadata is everywhere, and it is becoming a more common practice to provide metadata on a lot of the media and information you interact with. Here are some real-world examples of where to find metadata:

  • Photos: Whenever a photo is captured with a camera, metadata such as camera filename, date, time, and geolocation are gathered and saved with it.
  • Emails: When an email is sent or received, there is lots of visible metadata such as the subject line, sender, recipient, and the date and time sent. There is also hidden metadata that includes server names, IP addresses, HTML format, and software details.
  • Spreadsheets and documents: Spreadsheets and documents are already filled with a considerable amount of data, so it is no surprise that metadata accompanies them as well. Titles, author, creation date, number of pages, and user comments, as well as names of tabs, tables, and columns, are all metadata found in spreadsheets and documents.
  • Websites: Every web page has a number of standard metadata fields, such as tags and categories, the site creator's name, the web page title and description, time of creation, and any iconography.
  • Digital files: Usually, if you right-click on any computer file, you will see its metadata. This could consist of file name, file size, date of creation and modification, and type of file.
  • Books: Metadata is not only digital. Every book has a number of standard metadata items on the covers and inside that will inform you of its title, author's name, table of contents, publisher information, copyright description, index, and a brief description of the book's contents.

Data as you know it

Knowing the content and context of your data, as well as how it is structured, is very valuable in your career as a data analyst. When analyzing data, it is important to always understand the full picture. It is not just about the data you are viewing, but how that data comes together. Metadata ensures that you are able to find, use, preserve, and reuse data in the future. Remember, it will be your responsibility to manage and make use of data in its entirety; metadata is as important as the data itself.

Using metadata as an analyst

Now that you know what metadata is, it's time to explore why data analysts use it. You already know that data needs to be identified and described before it can help you solve a problem or make an effective business decision.

Putting data into context is probably the most valuable thing that metadata does, but there are still many more benefits of using metadata. Here's one. Metadata creates a single source of truth by keeping things consistent and uniform. We data analysts love consistency. We always aim for this kind of uniformity in our data and our databases. After all, data that's uniform can be organized, classified, stored, accessed, and used effectively. Plus, when a database is consistent, it's so much easier to discover relationships between the data inside it and the data elsewhere.

Metadata also makes data more reliable by making sure it's accurate, precise, relevant, and timely. This also makes it easier for data analysts to identify the root causes of any problems that might pop up. The bottom line is, when the data we work with is high quality, it makes things easier and improves our results.

One of the ways data analysts make sure their data is consistent and reliable is by using something called a metadata repository. A metadata repository is a database specifically created to store metadata. Metadata repositories can be stored in a physical location, or they can be virtual, like data that exists in the cloud. These repositories describe where metadata came from, keep it in an accessible form so it can be used quickly and easily, and keep it in a common structure for everyone who may need to use it.

Metadata repositories make it easier and faster to bring together multiple sources for data analysis. They do this by describing the state and location of the metadata, the structure of the tables inside, and how data flows through the repository. They even keep track of who accesses the metadata and when.

Here's a real-world example. As a healthcare analyst at Google, I use second- and third-party data. As you learned, second-party data is data that's collected by a group directly from its audience and then sold. Third-party data comes from outside sources, which are not the original collectors of that data. They get it from websites or programs that pull the data from the various platforms where it was originally generated. It's a bit complex, but the main thing to remember is that third-party data doesn't come from inside your own business.

If my team needs to work with data that wasn't created at Google, that means we sometimes don't know very much about its quality and credibility, but we need to be certain that our data can be trusted and was collected responsibly. After all, if the data is unreliable, our results can be unreliable too. That's why understanding the metadata of the external database is so important. It lets us confirm that the data is clean, accurate, relevant, and timely. This is particularly important if the data comes from another organization. One other important step when working with external data is confirming that we're allowed to use it. We'll often reach out to the owner to make sure we can access or purchase it.

To sum up, metadata repositories are useful for all these reasons. Plus, they help ensure that my team is pulling the right content for the particular project and using it appropriately. We can confirm this because the metadata clearly describes how and when the data was collected, how it's organized, and much more.

Soon you'll learn even more about using metadata in data analytics, and if you're finding metadata particularly fascinating, you'll discover some really exciting career choices that focus on metadata. Stay tuned.

Metadata management

Metadata and metadata repositories are very powerful tools in the data analyst toolbox. As we discussed previously, data analysts use them to create a single source of truth, keep data consistent and uniform, and ensure that the data we work with is accurate, precise, relevant, and timely. These tools also make it easier to access and use data by standardizing our processes.

We'll explore more components of metadata and learn how metadata analysts work to keep things organized. We know that the amount of data out there continues to grow, but lots of businesses just aren't using their data. Sometimes, they don't know what they have, sometimes they can't find it or sometimes a business just doesn't trust it. Especially in bigger companies, data can span numerous different processes and systems. And pulling together data from so many places can be a big challenge.

For example, let's say a company starts out with a traditional data storage system in its offices.

But then, as the amount of data it owns continues to expand, cloud storage is needed too.

Plus, this company could also be accessing and using second or third party data from a partner organization. Each of these systems has its own rules and requirements, so each organizes the data in a completely different way, adding even more complexity. It's no wonder so many organizations struggle to find the right data at the right moment.


On the other hand, metadata is stored in a single, central location and it gives the company standardized information about all of its data.

This is done in two ways.

First, metadata includes information about where each system is located and where the data sets are located within those systems. Second, the metadata describes how all of the data is connected between the various systems.

Another important aspect of metadata is something called data governance. Data governance is a process to ensure the formal management of a company's data assets. This gives an organization better control of its data and helps a company manage issues related to data security and privacy, integrity, usability, and internal and external data flows. It's important to note that data governance is about more than just standardizing terminology and procedures. It's about the roles and responsibilities of the people who work with the metadata every day.

These are metadata specialists, and they organize and maintain company data, ensuring that it's of the highest possible quality. These people create basic metadata identification and discovery information, describe the way different data sets work together, and explain the many different types of data resources.

Metadata specialists also create very important standards that everyone follows and the models used to organize the data. There's one thing they all have in common. Whether they work at a tech company, a nonprofit association, or a financial institution, metadata analysts are great team players. They're passionate about making data accessible by sharing with colleagues and other stakeholders.

If you're looking for a role that encourages you to explore all the data that the digital world has to offer, following the path to becoming a metadata analyst may be the right choice for you. But either way, businesses of all kinds face market trends and competition, and they need to understand why one process works while another doesn't. Data analytics allows them to answer key questions and keep improving.

Megan: Fun with metadata

"My name is Megan, and I am an agency measurement lead here at Google. Basically, I help to demystify measurement and analytics for advertising agencies. So people that are tasked with executing media plans for advertisers but also people that are interested in measuring the impact that media is having for their clients. So I've been doing this for about 17 years now and have seen a lot of evolution in the space from data availability, from different modeling techniques becoming more advanced but also more accessible, and it's just been a really cool journey to see how it's evolved, how analytics has become more mainstream, and how people are getting more excited about it. Metadata is basically the key to your larger data set. It helps describe what's in the rows and columns of the data that you'll be working with. Metadata is kind of a shorthand or a CliffsNotes version of a much more complex set of information. It can be helpful in just kind of helping you get a handle on what's in a single data set that you may have access to.

It's an important part of the discovery process of any analytics project as you're working with either a client or a vendor to understand the resources that you'll have to address a problem and what might be missing. It just gives you the keys to unlock that data in a really simple and straightforward way and is a great communication tool.

When I was working for an advertiser, one of the things that we were trying to do was build something called a data lake. So essentially, this is bringing together all of the sources of data that you might want to use in an analysis into one place, which can be really, really tricky. One of the benefits of metadata was figuring out where we had sources that may overlap, where we had data sources that had things in common. And what the unique pieces of information were that we were getting from each of those data sets. So as we thought about tackling this really huge and important project, we were able to use metadata to quickly and easily get to the basic constructs that we were trying to tackle. When you're working with people who maybe don't have analytics as their day job, getting that "aha" moment, helping them understand how measurement and analytics are tools that can help them achieve their goals, is really important. And just getting to that idea of you made something that was previously inaccessible a little bit more accessible for that team and something they feel comfortable putting into practice is really important and really kind of a great way to come out of a partnership."

Practice Quiz: metadata

Accessing different data

Working with more data sources

We'll discuss the different places data analysts go to connect with data. There's all kinds of data out there, and it's important to know how to access it.

Earlier, you learned that there are two basic types of data used by data analysts: internal and external. Internal data is data that lives within a company's own systems. It's typically also generated from within the company. You may also hear internal data described as primary data.

External data is data that lives and is generated outside an organization. It can come from a variety of places, including other businesses, government sources, the media, professional associations, schools, and more. External data is sometimes called secondary data.

Gathering internal data can be complicated. Depending on your data analytics project, you might need data from lots of different sources and departments, including sales, marketing, customer relationship management, finance, human resources, and even data archives.

But the effort is worth it. Internal data has plenty of advantages for a business. It provides information that's relevant to problems you're trying to solve, and it's free to access because the company already owns it. With internal data, analysts can work on all data projects without ever looking beyond their own walls. But sometimes internal data doesn't give you the full picture. In those cases, data analysts can turn to external data and apply that information to their analysis. For instance, as healthcare analysts, we often partner with other healthcare organizations or nonprofits and use their data to create deeper analyses and add some more industry-level perspective.

Earlier, you learned that openness has created a lot of data for analysts to use, largely through open data initiatives. As a reminder, openness or open data refers to the free access, usage and sharing of data.

For example, the United States government makes hundreds of thousands of data sets available to the public on Data.gov. These data sets contain information on weather patterns, educational progress, crime rates, transportation, and much more. There are lots of reasons for these open data initiatives. One is to make government activities more transparent, like letting the public see where money is spent. It also helps educate citizens about voting and local issues. Open data also improves public service by giving people ways to be a part of public planning or provide feedback to the government. Finally, open data leads to innovation and economic growth by helping people and companies better understand their markets.

Google actually hosts lots of public databases with information on science, transportation, economics, climate, and more. As an example, a bike sharing company could use traffic data from within our public transportation database to see where the roads are busiest, then choose those locations for their bikes in order to reduce cars on the road and give people another transportation option.

Now you're familiar with internal and external data and how you can access both. Coming up, we'll learn how to import all the data you collect from different sources into a spreadsheet.

From external source to a spreadsheet

When you work with spreadsheets, there are a few different ways to import data. This reading covers how you can import data from external sources, specifically:

  • Other spreadsheets
  • CSV files
  • HTML tables (in web pages)

Importing data from other spreadsheets

In a lot of cases, you might have an existing spreadsheet open and need to add additional data from another spreadsheet.

Google Sheets

In Google Sheets, you can use the IMPORTRANGE function. It enables you to specify a range of cells in the other spreadsheet to duplicate in the spreadsheet you are working in. You must allow access to the spreadsheet containing the data the first time you import the data.

The URL shown below is for syntax purposes only. Don't enter it in your own spreadsheet. Replace it with the URL of a spreadsheet you have created, so that you can control access to it by clicking the Allow access button.
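A representative call, with a placeholder URL and range; IMPORTRANGE takes the source spreadsheet's URL and a range string:

```
=IMPORTRANGE("https://docs.google.com/spreadsheets/d/abcd1234", "Sheet1!A1:C10")
```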

Refer to the Google Help Center's IMPORTRANGE page for more information about the syntax. There is also an example of its use later in the program in Advanced functions for speedy data cleaning.

Microsoft Excel

To import data from another spreadsheet, do the following:

Step 1: Select Data from the main menu.
Step 2: Click Get Data, and then select From File within the toolbar. In the drop-down, choose From Excel Workbook.
Step 3: Browse for and select the spreadsheet file and then click Import.
Step 4: In the Navigator, select which worksheet to import.
Step 5: Click Load to import all the data in the worksheet; or click Transform Data to open the Power Query Editor to adjust the columns and rows of data you want to import.
Step 6: If you clicked Transform Data, click Close & Load and then select one of the two options:

  • Close & Load - import the data to a new worksheet
  • Close & Load to... - import the data to an existing worksheet

If these directions do not work for your version of Excel, visit the free online training center, Microsoft Excel for Windows Training, where you will find everything you need to know in one place. If you are using Numbers, search the Numbers User Guide for directions.

Importing data from CSV files

Google Sheets

Step 1: Open the File menu in your spreadsheet and select Import to open the Import file window.

Step 2: Select Upload and then select the CSV file you want to import.

Step 3: From here, you will have a few options. For Import location, you can choose to replace the current spreadsheet, create a new spreadsheet, insert the CSV data as a new sheet, add the data to the current spreadsheet, or replace the data in a specific cell. The data will be inserted as plain text only if you uncheck the Convert text to numbers, dates, and formulas checkbox, which is checked by default. Sometimes a CSV file uses a separator like a semicolon or even a blank space instead of a comma. For Separator type, you can select Tab or Comma, or select Custom to enter another character that is being used as the separator.

Step 4: Select Import data. The data in the CSV file will be loaded into your sheet, and you can begin using it!

Note: You can also use the IMPORTDATA function in a spreadsheet cell to import data using the URL to a CSV file. Refer to Google Help Center's IMPORTDATA page for more information and the syntax.
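For example, a call like this (with a placeholder URL) loads a CSV published at a web address directly into the sheet:

```
=IMPORTDATA("https://example.com/data.csv")
```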

Microsoft Excel

Step 1: Open a new or existing spreadsheet.

Step 2: Click Data in the main menu and select the From Text/CSV option.

Step 3: Browse for and select the CSV file and then click Import.

Step 4: From here, you will have a few options. You can change the delimiter from a comma to another character such as a semicolon. You can also turn automatic data type detection on or off. And, finally, you can transform your data by clicking Transform Data to open the Power Query Editor.

Step 5: In most cases, accept the default settings in the previous step and click Load to load the data in the CSV file to the spreadsheet. The data in the CSV file will be loaded into the spreadsheet, and you can begin working with the data.

If these directions do not work for your version of Excel, visit the free online training center, Microsoft Excel for Windows Training, where you will find everything you need to know in one place.

If you are using Numbers, search the Numbers User Guide for directions.

Importing data from HTML tables

Google Sheets

In Google Sheets, you can use the IMPORTHTML function. It enables you to import the data from an HTML table (or list) on a web page. Refer to the Google Help Center's IMPORTHTML page for more information about the syntax. If you are importing a list, replace "table" with "list" in the example below. The number 4 is the index that refers to the order of the tables on a web page. It is like a pointer indicating which table on the page you want to import the data from.
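A representative call, using the Wikipedia article referenced below as the source:

```
=IMPORTHTML("https://en.wikipedia.org/wiki/Demographics_of_India", "table", 4)
```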

You can try this yourself! In blank worksheets, copy and paste each of the following IMPORTHTML functions into cell A1 and watch what happens. You will actually be importing the data from four different HTML tables in a Wikipedia article: Demographics of India. You can compare your imported data with the tables in the article.
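The four calls, reconstructed by index (the table order on the live article may have shifted since the course was written, so the tables you get may differ):

```
=IMPORTHTML("https://en.wikipedia.org/wiki/Demographics_of_India", "table", 1)
=IMPORTHTML("https://en.wikipedia.org/wiki/Demographics_of_India", "table", 2)
=IMPORTHTML("https://en.wikipedia.org/wiki/Demographics_of_India", "table", 3)
=IMPORTHTML("https://en.wikipedia.org/wiki/Demographics_of_India", "table", 4)
```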

Microsoft Excel

You can import data from web pages using the From Web option:

Step 1: Open a new or existing spreadsheet.

Step 2: Click Data in the main menu and select the From Web option.

Step 3: Enter the URL and click OK.

Step 4: In the Navigator, select which table to import.

Step 5: Click Load to load the data from the table into your spreadsheet.

If these directions do not work for your version of Excel, visit the free online training center, Microsoft Excel for Windows Training, where you will find everything you need to know in one place.

If you are using Numbers, search the Numbers User Guide for directions.

Importing data from spreadsheets and databases

At this point, you've learned all about internal and external data, and how to prepare it for use. Now, we'll go through the process of actually importing data from different sources. Sometimes you want to upload a spreadsheet from your files, such as a CSV file.

CSV stands for comma-separated values. A CSV file saves data in a table format. Now, let's bring that file into a fresh spreadsheet. We'll start by selecting File, then Import. Then we'll choose to upload a file, navigate to it, open it, and insert it as a new sheet. CSV files use plain text, with fields separated by a delimiter character, so each column or field is clearly distinct from another when importing. As you learned, CSVs are comma-separated, and usually the spreadsheet app will auto-detect those separations.

But sometimes, you might need to indicate that the separator is another character or a space by selecting the different options in this window. Also, if you are planning to work with the data set, you would usually convert to text, numbers or other options here. But plain text is okay for reporting purposes. So we can leave those fields alone. Finally, select Import data.

The WHO data repository may change from time to time. Go to https://www.who.int/data/gho for the latest information and data.

This is a place where anybody can access open-source data.
As you can see, there's tons of data available. You can search by theme, category, indicator and country.
You can also access World Health Organization metadata if you want to learn more about the data in the repository.

For our example, we'll look at medical doctors by country and year. This information would be useful for a data analysis project looking into how many doctors are available to treat patients within a certain population compared to other populations.

To get this data, we'll start on the webpage that contains the data set we want, then download the data as a CSV file. Next, open a new spreadsheet and import the file by selecting File, Import. Then upload your file and select Import data. After reviewing the data to make sure it looks clean, we can title it and begin our work.

I know this is a lot of information to take in, but you'll get much more comfortable with this the more you practice. Coming up, we'll learn how to sort and filter your data to focus on the information relevant to you.

CSV files use plain text and are delimited by characters, such as a comma. A delimiter indicates a boundary or separation between two things.

Exploring public datasets

Open data helps create a lot of public datasets that you can access to make data-driven decisions. Here are some resources you can use to start searching for public datasets on your own:

  • The Google Cloud Public Datasets allow data analysts access to high-demand public datasets, and make it easy to uncover insights in the cloud.
  • Google's Dataset Search can help you find available datasets online with keyword searches.
  • Kaggle has an Open Data search function that can help you find datasets to practice with.
  • Finally, BigQuery hosts 150+ public datasets you can access and use.

Public health datasets

  1. Global Health Observatory data: You can search for datasets from this page or explore featured data collections from the World Health Organization.
  2. The Cancer Imaging Archive (TCIA) dataset: Just like the earlier dataset, this data is hosted by the Google Cloud Public Datasets and can be uploaded to BigQuery.
  3. 1000 Genomes: This is another dataset from the Google Cloud Public resources that can be uploaded to BigQuery.

Public climate datasets

  1. National Climatic Data Center: The NCDC Quick Links page has a selection of datasets you can explore.
  2. NOAA Public Dataset Gallery: The NOAA Public Dataset Gallery contains a searchable collection of public datasets.

Public social-political datasets

  1. UNICEF State of the World's Children: This dataset from UNICEF includes a collection of tables that can be downloaded.
  2. CPS Labor Force Statistics: This page contains links to several available datasets that you can explore.
  3. The Stanford Open Policing Project: This dataset can be downloaded as a .CSV file for your own use.

Practice Quiz: accessing data sources

Sorting and filtering

Sorting and filtering

You've learned about both internal and external data. Now I'll show you how to focus on only the data that's relevant to the problem you're trying to solve.

This is useful if you're working with a very large, complex spreadsheet, which data analysts encounter all the time. Having lots of data can make it difficult to quickly find and analyze the information you need. No two analytics projects are the same. Often, data analysts process, view, and use data very differently, even if it comes from the exact same source.

Here's an example. Check out this spreadsheet that shows a company's sales reps and where they work (template: Sales Rep Cities, States, and Parts).

Different data analysts might want different information from the spreadsheet, and that's where sorting and filtering come in. Sorting and filtering the data in a spreadsheet help us customize the way data is presented. They can also organize data so analysts can zoom in on the pieces that matter. Think of them like a magnifying glass for our data.

Let's begin with sorting. Sorting involves arranging data into a meaningful order to make it easier to understand, analyze, and visualize. Data can be sorted in ascending or descending order, and alphabetically or numerically. Sorting can be done across all of a spreadsheet or just in a single column or table. You can also sort by multiple variables. For instance, if our data set contains both city and state fields, we can sort first by city and then by state.

Anytime you're sorting data, it's always a good idea to freeze the header row first. To do this, we'll highlight the row. Then from the view menu, choose freeze and one row. This locks the row in place. Now when we scroll down the sheet, the header row stays visible so we know the category of each column.

Looks good to me. Now let's sort the entire spreadsheet, by city first. To do this, select the city column, then choose the option to sort the sheet by that column, A to Z. This sorts all the rows, with the selected column as the primary sort criterion. The cities are now sorted alphabetically, and they're still grouped with the corresponding states, sales reps, and auto parts. The details across each row are automatically kept together when sorting, as you can see here.

Multiple criteria sorting is another very useful data analysis tool. For instance, let's say we want to see a list of sales reps by the cities and states in which they work. First, we select the entire data set, then choose Data and Sort range. (The menu option has changed slightly in newer versions of Sheets: select Data, then Sort range, then Advanced range sorting options to open the dialog box.)

In the dialog box, make sure "Data has header row" is checked. That way the header row (city, state, sales rep, and auto parts) won't be part of the sort. Then, in the sort by drop-down menu, select state and the sort order A to Z. Now add another sort column: in the "then by" drop-down, select city and the sort order A to Z. Finally, select Sort.

Now we can search the data to easily find a sales rep who works in a particular state and city. Sorting is useful when you want to look at everything in a spreadsheet in alphabetical or numerical order. But sometimes data analysts want to isolate a particular piece of information. To do this, they use a filter. Filtering means showing only the data that meets specific criteria while hiding the rest. A filter simplifies a spreadsheet by only showing us the information we need.

For example, we could add a filter to see only the sales reps who worked with a particular product. To do this, we first select Data and Create a filter. Filter buttons will appear in the corner of each column header. Choose the column with the data we need; in this case, Auto Parts. To filter our spreadsheet by auto part, click the button in the Auto Parts header. In this example, let's say we want to only see sales reps who worked with rims. Remove the check marks from the categories we don't want to see, which is everything except for rims, then select OK. The filter temporarily hides anything that doesn't meet the condition, but note that even though those rows aren't visible, they're still there. When it's time to view the entire spreadsheet again, simply turn off the filter.

Sorting and filtering are very important tools in the data analyst's toolbox. In the next video, you'll discover even more ways to narrow in on the exact information you need for any data analysis project.

Clean data in spreadsheets with sorting and filtering

What you will need

To get started, access the spreadsheet that contains the data.

Clean your data

Removing incorrect data

Filling in missing data

Converting data

Considering databases and spreadsheets for sorting and filtering
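For comparison, here is a minimal SQL sketch of the same sort and filter performed above in the spreadsheet; the sales_reps table and its column names are hypothetical, purely for illustration:

```sql
-- Filter to reps who worked with rims, then sort by state and city,
-- mirroring the spreadsheet steps in this section (hypothetical table).
SELECT sales_rep, city, state, auto_part
FROM sales_reps
WHERE auto_part = 'rims'        -- filter: keep only rows meeting the condition
ORDER BY state ASC, city ASC;   -- sort: state first, then city, both A to Z
```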

Practice Quiz: sorting and filtering

Working with large datasets in SQL

Setting up BigQuery, including sandbox and billing options

Throughout this course, you've seen how BigQuery can be used to view and analyze data from tons of sources. Now we're going to explore the different account tiers that BigQuery offers, so you know how to choose the right one for your needs, and how you can access them. BigQuery is offered to you at no charge. There are paid options available, but we won't need them for the activities in this course. Instead, we're going to talk about two account types: sandbox and free trial.

A Sandbox account is available at no charge and anyone with a Google account can log in and use it. There are a couple of limitations to this account type.

For example, you get a maximum of 12 projects at a time. This means that if you want to make a 13th project, you'll have to delete one of your original 12. It also doesn't allow you to insert new records to a database or update the field values of existing records. These Data Manipulation Language or DML operations aren't supported in the sandbox. However, you won't need to do this in course activities. You can read more about the limitations of a sandbox account in the BigQuery documentation. This is the account type we'll use for most of our activities. It's simple to set up. So, later, we'll walk through the steps required to create an account.

Before that though, we should talk about the other way to use BigQuery without charges. The Google Cloud free trial. The free trial gives you access to more of what BigQuery has to offer with fewer overall limitations.

The free trial offers $300 in credit for use in Google Cloud during the first 90 days. You won't get anywhere near that spending limit if you just use the BigQuery console to practice SQL queries. After you spend the $300 credit, or after 90 days, your free trial will expire, and you will need to personally select to upgrade to a paid account to keep working in Google Cloud. Your method of payment will not be automatically charged after your free trial ends.

The free trial does require that you set up a payment option with Google Cloud, but unless you choose to opt in for an account upgrade, it won't charge you. Still, it does require you to enter a payment type, so we understand if you don't feel comfortable with this option. This is one reason the BigQuery sandbox account exists: you don't have to enter any payment information.

With either type of account, you can upgrade to a paid account at any time and retain all of your existing projects. If you set up a free trial account but choose not to upgrade to a paid account when your trial period ends, you can set up a free sandbox account at that time. However, projects from your trial won't transfer to your sandbox. It would be like starting from scratch. Just something to keep in mind.

Now we're going to set up your sandbox account, which you can change into a free trial or upgrade to a paid account if you choose. First, we'll go to the BigQuery sandbox documentation page. Then go to the upper right corner and log in to whichever Google account you want to use for the BigQuery sandbox account.

Then we'll select the "Go to BigQuery" button on the documentation page. This gives us a drop-down to select a country and to read the terms of service agreement.

This will bring us to the SQL workspace, which we'll be using in upcoming activities. Choose "Create Project," then name the project and give it an ID. Choose "Create," and then "Done." There we have it. Next, we'll explore what each part of the SQL workspace does and how we'll use it in future activities. See you there.

Using BigQuery

BigQuery is a data warehouse on Google Cloud that data analysts can use to query and filter large datasets, aggregate results, and perform complex operations.
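As a quick illustration of what that looks like in practice, here is a small query that filters and aggregates in one pass. It assumes the usa_names public dataset, which you are not required to use in this course:

```sql
-- Total occurrences of each baby name since 2000, largest first.
-- Assumes the bigquery-public-data.usa_names.usa_1910_2013 public dataset.
SELECT
  name,
  SUM(number) AS total_babies
FROM
  `bigquery-public-data.usa_names.usa_1910_2013`
WHERE
  year >= 2000
GROUP BY
  name
ORDER BY
  total_babies DESC
LIMIT 10;
```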

An upcoming activity is performed in BigQuery. This reading provides instructions to create your own BigQuery account, select public datasets, and upload CSV files. At the end of this reading, you can confirm your access to the BigQuery console before you move on to the activity.

Note: Additional getting started resources for a few other SQL database platforms are also provided at the end of this reading if you choose to work with them instead of BigQuery.

Types of BigQuery accounts

There are two different types of accounts: sandbox and free trial. A sandbox account allows you to practice queries and explore public datasets for free, but has additional restrictions on top of the standard quotas and limits. If you prefer to use BigQuery with the standard limits, you can set up a free trial account instead. More details:

  • A free sandbox account doesn’t ask for a method of payment. It does, however, limit you to 12 projects. It also doesn't allow you to insert new records to a database or update the field values of existing records. These data manipulation language (DML) operations aren't supported in the sandbox.
  • A free trial account requires a method of payment to establish a billable account, but offers full functionality during the trial period. With either type of account, you can upgrade to a paid account at any time and retain all of your existing projects. If you set up a free trial account but choose not to upgrade to a paid account when your trial period ends, you can still set up a free sandbox account at that time. However, projects from your trial account won't transfer to your sandbox account. It would be like starting from scratch again.

Set up a free sandbox account for use in this program

Set up a free trial account instead (if you prefer)

If you prefer not to have the sandbox limitations in BigQuery, you can set up a free trial account for use in this program.

  • Follow these step-by-step instructions or watch the video, Setting up BigQuery, including sandbox and billing options. The free trial offers $300 in credit over the next 90 days. You won’t get anywhere near that spending limit if you just use the BigQuery console to practice SQL queries. After you spend the $300 credit (or after 90 days) your free trial will expire and you will need to personally select to upgrade to a paid account to keep using Google Cloud Platform services, including BigQuery. Your method of payment will never be automatically charged after your free trial ends. If you select to upgrade your account, you will begin to be billed for charges.
  • After you set up your account, you will see My First Project in the banner and the status of your account above the banner: your credit balance and the number of days remaining in your trial period.

How to get to the BigQuery console

In your browser, go to console.cloud.google.com/bigquery.

Note: Going to console.cloud.google.com in your browser takes you to the main dashboard for the Google Cloud Platform. To navigate to BigQuery from the dashboard, do the following:

  • Click the Navigation menu icon (Hamburger icon) in the banner.
  • Scroll down to the BIG DATA section.
  • Click BigQuery and select SQL workspace.

Watch the How to use BigQuery video for an introduction to each part of the BigQuery SQL workspace.

(Optional) Explore a BigQuery public dataset

You will be exploring a public dataset in an upcoming activity, so you can perform these steps later if you prefer.

(Optional) Upload a CSV file to BigQuery

These steps are provided so you can work with a dataset on your own at this time. You will upload CSV files to BigQuery later in the program.

Getting started with other databases (if not using BigQuery)

It is easier to follow along with the course activities if you use BigQuery, but if you are connecting to and practicing SQL queries on other database platforms instead of BigQuery, here are similar getting started resources:

How to use BigQuery

We're going to learn about each part of the BigQuery SQL workspace so you can use it during this course and throughout your career as a data analyst. It's an extremely valuable and widely used tool, so understanding how it works is super helpful. Feel free to follow along on your screen as we explore BigQuery. You may notice that my screen appears a little different from yours, since BigQuery is constantly updating its interface. Don't worry if this happens; minor differences won't stop you from understanding the basics.

To begin, go to the BigQuery landing page, then log in to the account you created earlier.

To navigate to the SQL workspace, select the menu on the left side of the screen and scroll down to the Big Data header. Then hover over the BigQuery label and click "SQL workspace" from the drop-down. Now that we're in the SQL workspace, we're going to search for public datasets, select a dataset through the Data Explorer, run a query, and upload our own data for querying.

Follow these steps to find and pin bigquery-public-data.

1. Navigate to the Explorer menu in BigQuery.
2. Type the word public in the search box and press Enter.
3. Click "Broaden search to all projects."
4. Find bigquery-public-data and pin it.

First, we'll search for a public dataset to use. To select a public dataset, navigate to the Explorer menu on the left side of the screen. Click the "Add Data" button in the upper right of the menu. Then in the drop-down menu, select "Explore public datasets." This will open the marketplace and show you available public datasets.

Note: The NOAA Lightning Strikes dataset is no longer available to all users. You will not need this specific dataset for future activities or quizzes. However, the same process for finding and using a public dataset still applies.

Double-check that you have pinned bigquery-public-data to your Explorer pane. If it isn't listed, search for it by name.

Let's go to the Marketplace search bar and search for noaa_lightning, a dataset we'll use in an upcoming activity. Click on the "Cloud-to-Ground Lightning Strikes" dataset. This will give us a description and preview of the dataset, which captures observations about lightning activity and weather patterns in the United States.

Click "View dataset." This will bring you back to the SQL workspace and create a tab for the dataset.

We can then move back to the Editor tab we have opened, or click "Compose new query" to begin writing with SQL. On the left, notice that the BigQuery public data drop-down list is in the Explorer menu.

We can click the arrow to expand the BigQuery data list and pick out a new dataset. Let's select the first dataset in the drop-down list, austin_311. When we do, it expands to show the table within the dataset. We can open the dataset for a preview. The Schema tab contains the names of each column in the dataset. The Details tab contains additional metadata, such as the creation date of the dataset. The Preview tab contains the first rows from the dataset.

On this page, we can click "Query" to automatically create a new editor window with the template for a query already populated. From here, put an asterisk after SELECT, where the cursor appears, then run the query. Congratulations, you ran a SQL query in BigQuery. The query you ran returned rows from the dataset, which populate a window beneath the editor interface. Results from any query you run will also display here. Now let's say you have the results of a survey that you want to upload to BigQuery and analyze using SQL.
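As text, the query from this walkthrough looks something like the sketch below. The table name is one of the tables in the austin_311 public dataset, and the exact table on your screen may differ; the LIMIT clause is just an optional guard to keep the preview small:

```sql
-- Return all columns from the table; LIMIT keeps the result set manageable.
SELECT *
FROM `bigquery-public-data.austin_311.311_service_requests`
LIMIT 100;
```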

To add your own data to BigQuery, choose the ID of the project you want to add to. Select the three vertical dots icon to open options for the project then choose "Create dataset." Name the dataset something that will help you identify it later, such as upload_test_dataset. Then click "Create dataset."

Next, go to the Explorer menu and choose the three vertical dots next to the dataset under the Projects drop-down.

Now we'll select the icon for create table, which opens a pop-up window. Under Source, for "Create table from," select "Upload" or whichever method you prefer to upload your data. Here, we can upload any data file, such as a CSV file. Let's give our table a helpful name, such as test_table. Make sure that the schema is set to "Auto detect" and select "Create table."
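Once the table is created, a quick way to confirm the upload worked is to query it. This sketch uses the example names from above, upload_test_dataset and test_table:

```sql
-- Preview the first rows of the newly uploaded table.
SELECT *
FROM upload_test_dataset.test_table
LIMIT 10;
```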

There's more to come with BigQuery. Feel free to re-watch this video anytime and keep practicing. See you soon.

BigQuery in action

You've learned how sorting and filtering data in spreadsheets helps data analysts customize the information. Customizing data makes it more meaningful and easier to understand, analyze, and visualize.

This is also true for databases. Sometimes a data set is too large to download, or it won't fit in a spreadsheet. So a data analyst will use SQL to create a query to view the specific data that they want from within the larger set.

We've learned that a database is a collection of data stored in a computer system. And that SQL stands for Structured Query Language. Data analysts use query languages to communicate with the database.

Earlier, you also learned that a relational database contains a series of tables that can be connected to form relationships. These relationships are represented by primary and foreign keys. Data analysts write queries in order to get data from these tables. Let's see how this works. We'll start with our table viewer.

Here we can see what public data sets are available. We'll scroll through the data before we start using it to get a feel for what it's all about, and to make sure it's clean.

Some table viewers let you preview a few rows before even writing a query. This is helpful if you want to take a quick look to be sure the data set will be right for your project.

The words you see before the dot represent the database name.

Let's select and copy the data set name now because we'll need it in a second.

Now we'll click on the plus sign to compose a new query. Most queries begin with the word SELECT. Then we add a space. Because we want to see the entire data set, we'll put an asterisk next. The asterisk says we want to include all columns. This is a great shortcut because without it, we'd have to type in every single field name. Next we'll press return and type FROM. FROM does just what it sounds like. It indicates where the data is coming from. After that, we'll add another space. Now, we paste in the name of the data set that we copied earlier. And finally, run the query. Now, you can carefully inspect the data set before we begin working with it. One important thing to keep in mind: SQL queries can be written in a lot of different ways, but still provide the same results.

For example, we could have written this query as one long line of instructions like this, and we'd still get the same results. The additional lines and spaces don't impact the query's outcome, but they keep your query organized and easier to read for yourself and others.
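For instance, both versions below are the same query to SQL. This sketch assumes the solar potential public dataset shown in the video, so the exact name on your screen may differ:

```sql
-- One long line: valid, but harder to read.
SELECT * FROM `bigquery-public-data.sunroof_solar.solar_potential_by_postal_code`;

-- The same query with line breaks and spacing: identical results, easier to scan.
SELECT
  *
FROM
  `bigquery-public-data.sunroof_solar.solar_potential_by_postal_code`;
```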

Now, if the project doesn't require all of these fields, we can use SQL to view a particular piece or pieces of data. To do this, we specify a certain column name in the query. For example, maybe we only want to see data from Pennsylvania. So we'll begin our query the same way we just learned: SELECT, space, add an asterisk. Then FROM our solar potential database. But this time we'll add WHERE. WHERE also does exactly what it sounds like. It tells the database where to look for information. In this case, the state_name column. So add a space and state_name, the name of the column. Now because we only want to see data from Pennsylvania, we add an equal sign and the word Pennsylvania with single quotes around it. In SQL, single quotes indicate the beginning and ending of a string. Finally, we run the query.
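Putting it all together, the filtered query looks roughly like this, again assuming the sunroof_solar public dataset; swap in whichever dataset name you copied:

```sql
-- Return only rows where the state_name column equals 'Pennsylvania'.
SELECT *
FROM `bigquery-public-data.sunroof_solar.solar_potential_by_postal_code`
WHERE state_name = 'Pennsylvania';
```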

Now we can review the data on solar potential for only Pennsylvania.

Now we've got the data we want and we're ready to start putting it to work, which we'll cover later on. But for now, let's celebrate finishing another module. You've covered a lot of complex and highly technical information. As you keep practicing though, things will start to feel a lot more natural.

For now, take a moment to sit back and think about all you've learned. You discovered metadata and how it keeps data organized by describing what that data is all about.

You've seen how internal and external data are accessed and how data analysts use them to find compelling insights to solve business problems. And you can sort and filter your data to really pinpoint the information you need. Finally, you just learned about queries and you even practiced writing some.

Coming up, you'll have a few readings and then a weekly challenge to test your knowledge. This will help you confirm that you've understood what we've worked on in these videos. And as always, if you're ever unsure about a question, I highly encourage you to review the videos and readings to find the answer. You're the data detective now, so use those skills. Keep up the great work and I'll see you after the weekly challenge.

Hands-on Guided Project: Select a New City Using BigQuery

About Guided Projects

In this Guided Project, you will be assigned a cloud desktop that has the required software pre-installed. This will allow you to follow along with the instructor to complete the project's tasks.

You will work side-by-side with an industry expert to help a fictional client select a new city for their offices. To do this, you will apply your knowledge of SQL to query a database and filter a large dataset.

After completing this project, you can reference a summary of the key takeaways from this document.

Please refer to the Coursera Help Center for more information.

Overview and Import Data into BigQuery

Identify Cities That Match the Temperature Requirements

Click the cities tab and select Schema to see the column names.

cities is the table in the city_data dataset.
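If you'd like to orient yourself before filtering, a minimal preview query might look like the sketch below; your-project-id is a placeholder for your own project's ID:

```sql
-- Preview the cities table from the guided project's city_data dataset.
-- Replace your-project-id with your own project ID.
SELECT *
FROM `your-project-id.city_data.cities`
LIMIT 10;
```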

Narrow Down Cities Based on Commute Times

Narrow Down Results Using Happiness Rankings

Analyze Alternative Scenarios

Hands-On Activity: Introduction to BigQuery

Explore BigQuery

For this activity, you will need a BigQuery account. If you haven't made one already, you can follow the instructions from the Using BigQuery reading. Once you have your account, you can start exploring!

Open your console

  1. Log in to BigQuery.

  2. Then, click the Go to console button on the BigQuery homepage. This will open a new tab with your console.

  3. Take a moment to explore your console. On the left side, you will find the Explorer menu; this includes a search bar you can use to find resources, pinned projects, and the + ADD button. On the right side, you will find the Query Editor. This is where you will input queries and view datasets. You can also find your Job History, Query History, and Saved Queries here.

Access public data in BigQuery

In order to actually start writing queries, you will need some data to work with. Once you're familiar with the BigQuery interface, you can access a public dataset directly from your console.

  1. In the upper left corner of the screen, locate the Explorer tab.

  2. Click on the + ADD button in the Explorer menu, navigate down the list in the Add window, and select Public Datasets. This will open a new menu where you can search public datasets that are already available through Google Cloud.

  3. In the Marketplace menu you just opened, type london bicycle in the search box at the top; this will return the London Bicycle Hires dataset from the Greater London Authority. Click the dataset for more details.

  4. From the dataset information page, click the blue VIEW DATASET button. This will open your console in a new tab with this dataset loaded.

You'll notice that bigquery-public-data is now starred in your Explorer menu. You can now explore and query these public datasets.

Follow these steps to find and star bigquery-public-data if you do not have it starred.

  1. Navigate to the Explorer menu in BigQuery.

  2. Type the word public in the search box and press Enter.

  3. Click "Broaden search to all projects".

  4. Find bigquery-public-data and click on the star to pin it.

  5. Click on the drop-down arrow to the left of bigquery-public-data and scroll down the list of public datasets until you find the london_bicycles data (you may also type london_bicycles in the Explorer search bar to quickly locate the dataset). Once located, click on the drop-down arrow next to the dataset; it will list two tables. Click on cycle_hire. This will pull up a new tab in your Query Editor with information about the table schema.

  6. After checking out the table schema, you can take a peek into what data the cycle_hire table contains by clicking on the Preview tab. This will give you a better idea of what kind of data you'll be working with. Once you have finished previewing the data, you can write a query!

Query your data

So far, you've learned three basic parts of a query: SELECT, FROM, and WHERE. As a refresher, here are what those basic parts represent in a query (see the sketch after this list):

  • SELECT is the section of a query that indicates what data you want SQL to return to you.
  • FROM is the section of a query that indicates which table the desired data comes from.
  • WHERE is the section of a query that indicates any filters you'd like to apply to your dataset.
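Here's the sketch mentioned above, combining all three parts against the cycle_hire table. It assumes the table's duration column stores ride length in seconds, which is what the public dataset's schema describes:

```sql
-- Return rides longer than 20 minutes (duration is in seconds).
SELECT *
FROM `bigquery-public-data.london_bicycles.cycle_hire`
WHERE duration >= 1200
LIMIT 100;
```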

Write a basic query

Write a query to answer a question

Intro to BigQuery Solutions

Hands-On Activity: Create a custom table in BigQuery

Link to baby names data: names.zip

In-depth guide: SQL best practices

You can save this reading for future reference. Feel free to download a PDF version of this reading below: DAC3 In-depth guide_SQL best practices.pdf

These best practices include guidelines for writing SQL queries, developing documentation, and examples that demonstrate these practices. This is a great resource to have handy when you are using SQL yourself; you can just go straight to the relevant section to review these practices. Think of it like a SQL field guide!

Capitalization and case sensitivity

Single or double quotes: '' or " "

SQL understands text strings as starting with either a single quote (') or a double quote ("). Since this string starts with a double quote, SQL will expect another double quote to signal the end of the string. This keeps the apostrophe safe, so it will return "Shepherd's pie" and not 'Shepherd'.
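For example, a minimal sketch using a hypothetical table and column:

```sql
-- Double quotes let the apostrophe in Shepherd's sit safely inside the string.
-- menu_items and item_name are hypothetical names for illustration.
SELECT item_name
FROM menu_items
WHERE item_name = "Shepherd's pie";
```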

Comments as reminders

Snake_case names for columns

CamelCase names for tables

Indentation

Multi-line comments

SQL text editors

Examples with Sublime Text

You can begin with these resources:

Applying SQL

Choose a table

Write a query

Test your knowledge on using SQL with large datasets

Weekly Challenge


Course 3 Module 3 Glossary_DA terms and definitions