3.3.1.Working with databases - quanganh2001/Google-Data-Analytics-Professional-Certificate-Coursera GitHub Wiki
Databases in data analytics
Databases enable analysts to manipulate, store, and process data. This helps them search through data a lot more efficiently to get the best insights.
Relational databases
A relational database is a database that contains a series of tables that can be connected to show relationships. Basically, they allow data analysts to organize and link data based on what the data has in common.
In a non-relational table, you will find all of the possible variables you might be interested in analyzing all grouped together. This can make it really hard to sort through. This is one reason why relational databases are so common in data analysis: they simplify a lot of analysis processes and make data easier to find and use across an entire database.
Database Normalization
Normalization is a process of organizing data in a relational database. For example, creating tables and establishing relationships between those tables. It is applied to eliminate data redundancy, increase data integrity, and reduce complexity in a database.
The key to relational databases
Tables in a relational database are connected by the fields they have in common. You might remember learning about primary and foreign keys before. As a quick refresher, a primary key is an identifier that references a column in which each value is unique. In other words, it's a column of a table that is used to uniquely identify each record within that table. The value assigned to the primary key in a particular row must be unique within the entire table. For example, if customer_id is the primary key for the customer table, no two customers will ever have the same customer_id.
By contrast, a foreign key is a field within a table that is a primary key in another table. A table can have only one primary key, but it can have multiple foreign keys. These keys are what create the relationships between tables in a relational database, which helps organize and connect data across multiple tables in the database.
Some tables don't require a primary key. For example, a revenue table can have multiple foreign keys and not have a primary key. A primary key may also be constructed using multiple columns of a table. This type of primary key is called a composite key. For example, if customer_id and location_id are two columns of a composite key for a customer table, the values assigned to those fields in any given row must be unique within the entire table.
SQL? You’re speaking my language
Databases use a special language to communicate called a query language. Structured Query Language (SQL) is a type of query language that lets data analysts communicate with a database. So, a data analyst will use SQL to create a query to view the specific data that they want from within the larger set. In a relational database, data analysts can write queries to get data from the related tables. SQL is a powerful tool for working with databases — which is why you are going to learn more about it coming up!
Inspecting a dataset: A guided, hands-on tour
As a data analyst, you'll use data to answer questions and solve problems. When you analyze data and draw conclusions, you are generating insights that can influence business decisions, drive positive change, and help your stakeholders meet their goals.
Before you begin an analysis, it’s important to inspect your data to determine if it contains the specific information you need to answer your stakeholders’ questions. In any given dataset, it may be the case that:
- The data is not there (you have sandwich data, but you need pizza data)
- The data is insufficient (you have pizza data for June 1-7, but you need data for the entire month of June)
- The data is incorrect (your pizza data lists the cost of a slice as $250, which makes you question the validity of the dataset)
Inspecting your dataset will help you pinpoint what questions are answerable and what data is still missing. You may be able to recover this data from an external source or at least recommend to your stakeholders that another data source be used.
In this reading, imagine you’re a data analyst inspecting spreadsheet data to determine if it’s possible to answer your stakeholders’ questions.
The scenario
You are a data analyst working for an ice cream company. Management is interested in improving the company's ice cream sales.
The company has been collecting data about its sales—but not a lot. The available data is from an internal data source and is based on sales for 2019. You’ve been asked to review the data and provide some insight into the company’s ice cream sales. Ideally, management would like answers to the following questions:
- What is the most popular flavor of ice cream?
- How does temperature affect sales?
- How do weekends and holidays affect sales?
- How does profitability differ for new versus returning customers?
Download the data
You can download the data to follow along with this reading. To use the template for the sales data, click the link below and select “Use Template.”
Link to template: Ice Cream Sales
OR
If you don’t have a Google account, you can download the spreadsheets directly from the attachments below:
Inspect the data
Question 1: What is the most popular flavor of ice cream?
To discover the most popular flavor, you first need to define what is meant by "popular." Is the most popular flavor the one that generated the most revenue in 2019? Or is it the flavor that had the largest number of units sold in 2019? Sometimes your measurement choices are limited by what data you have—you can review your spreadsheet to find out if either of these definitions of “popular” make sense based on the available data.
Click the flavors tab on your spreadsheet to view the relevant data. The flavors sheet has three columns and 209 rows of data. The column headers are week, units sold, and flavor. This dataset did not come with a data description, so you have to figure out the significance of the columns on your own. Based on the data, you deduce that these columns provide information about the number of units sold for each ice cream flavor, by week, in 2019
In this case, you can discover what the most popular flavor is by using units sold as your measure. In particular, you can use the units sold column to calculate the total number of units sold during the year for each flavor. Unfortunately, the dataset does not provide the annual sales amount by flavor. In this case, your next step would be to ask your stakeholders if the annual sales per flavor data is available from another source. If not, you can add a statement about the current data’s limitations to your analysis.
Question 2: How does temperature affect sales?
To explore your second question, you click the temperatures tab and check out the data. The temperatures sheet has two columns and 366 rows of data. The column headers are temperature and sales. The data may show total 2019 sales per temperature (for instance, the first entry might sum up $39.69 in sales for three separate days that each had a high of 60 degrees). Or, the data may show a snapshot of sales and temperature for each day in 2019 (for instance, the first entry might refer to a single day with a high of 60 degrees and $39.69 in sales).
So, which is it? It’s probably a daily snapshot because there are 365 entries for temperature, and multiple rows with the same temperature and different sales values. This implies that each entry is for a single day and not a summary of multiple days. However, without more information, you can’t be certain. Plus, you don’t know if the current data is listed in consecutive order by date or in a different order. Your next step would be to contact the owner of the dataset for clarification.
If it turns out that temperature does affect sales, you’ll be able to offer your stakeholders an insight such as the following: “When daily highs are above X degrees, average ice cream sales increase by Y amount. So the business should plan on increasing inventory during these times to maximize sales.”
Question 3: How do weekends and holidays affect sales?
Next, you click on the sales tab to view the data about dates of sale. The sales sheet has two columns and 366 rows of data. The column headers are date and sales. This data is most likely total daily sales in 2019, as sales are recorded for each date in 2019.
You can use this data to determine whether a specific date falls on a weekend or holiday and add a column to your sheet that reflects this information. Then, you can find out whether sales on the weekends and holidays are greater than sales on other days. This will be useful to know for inventory planning and marketing purposes.
Question 4: How does profitability differ for new customers versus returning customers?
Your dataset does not contain sales data related to new customers. Without this data, you won’t be able to answer your final question. However, it may be the case that the company collects customer data and stores it in a different data table.
If so, your next step would be to find out how to access the company’s customer data. You can then join the revenue sales data to the customer data table to categorize each sale as from a new or returning customer and analyze the difference in profitability between the two sets of customers. This information will help your stakeholders develop marketing campaigns for specific types of customers to increase brand loyalty and overall profitability.
Conclusion
When working on analytics projects, you won’t always have all the necessary or relevant data at your disposal. In many of these cases, you can turn to other data sources to fill in the gaps.
Despite the limitations of your dataset, it’s still possible to offer your stakeholders some valuable insights. For next steps, your best plan of action will be to take the initiative to ask questions, identify other relevant datasets, or do some research on your own. No matter what data you’re working with, carefully inspecting your data makes a big impact on the overall quality of your analysis.
Test your knowledge on working with databases
Question 1
Fill in the blank: A _____ is an identifier that references a database column in which each value is unique.
A. primary key
B. relation
C. field
D. foreign key
The correct answer is A. primary key. Explain: A primary key is an identifier that references a column in which each value is unique. A foreign key is a field within a table that’s a primary key in the original table.
Question 2
Fill in the blank: A relational database contains a series of _____ that can be connected to form relationships.
A. cells
B. tables
C. fields
D. spreadsheets
The correct answer is B. tables. Explain: A relational database contains a series of tables that can be connected to form relationships.
Question 3
A key benefit of working with normalized databases is that they help lower data redundancy. Which of the following is an example of redundancy?
A. Team members in different office locations working with the same data
B. A database containing two foreign keys
C. A database that forms two or more relationships
D. The same piece of data being stored in two different places
The correct answer is D. The same piece of data being stored in two different places. Explain: The same piece of data being stored in two different places is an example of redundancy.