5.1.3.Organize data for analysis - quanganh2001/Google-Data-Analytics-Professional-Certificate-Coursera GitHub Wiki
Keeping data organized with sorting and filters
You have learned about four phases of analysis:
- Organize data
- Format and adjust data
- Get input from others
- Transform data
The organization of datasets is really important for data analysts. Most of the datasets you will use will be organized as tables. Tables are helpful because they let you manipulate your data and categorize it. Having distinct categories and classifications lets you focus on, and differentiate between, your data quickly and easily.
Data analysts also need to format and adjust data when performing an analysis. Sorting and filtering are two ways you can keep things organized when you format and adjust data to work with it. For example, a filter can help you find errors or outliers so you can fix or flag them before your analysis. Outliers are data points that are very different from similarly collected data and might not be reliable values. The benefit of filtering the data is that after you fix errors or identify outliers, you can remove the filter and return the data to its original organization.
In this reading, you will learn the difference between sorting and filtering. You will also be introduced to how a particular form of sorting is done in a pivot table.
Sorting versus filtering
Sorting is when you arrange data into a meaningful order to make it easier to understand, analyze, and visualize. It ranks your data based on a specific metric you choose. You can sort data in spreadsheets, SQL databases (when your dataset is too large for spreadsheets), and tables in documents.
For example, if you need to rank things or create chronological lists, you can sort by ascending or descending order. If you are interested in figuring out a group’s favorite movies, you might sort by movie title to figure it out. Sorting will arrange the data in a meaningful way and give you immediate insights. Sorting also helps you to group similar data together by a classification. For movies, you could sort by genre -- like action, drama, sci-fi, or romance.
Filtering is used when you are only interested in seeing data that meets a specific criteria, and hiding the rest. Filtering is really useful when you have lots of data. You can save time by zeroing in on the data that is really important or the data that has bugs or errors. Most spreadsheets and SQL databases allow you to filter your data in a variety of ways. Filtering gives you the ability to find what you are looking for without too much effort.
For example, if you are only interested in finding out who watched movies in October, you could use a filter on the dates so only the records for movies watched in October are displayed. Then, you could check out the names of the people to figure out who watched movies in October.
To recap, the easiest way to remember the difference between sorting and filtering is that you can use sort to quickly order the data, and filter to display only the data that meets the criteria that you have chosen. Use filtering when you need to reduce the amount of data that is displayed.
It is important to point out that, after you filter data, you can sort the filtered data, too. If you revisit the example of finding out who watched movies in October, after you have filtered for the movies seen in October, you can then sort the names of the people who watched those movies in alphabetical order.
Sorting in a pivot table
Items in the row and column areas of a pivot table are sorted in ascending order by any custom list first. For example, if your list contains days of the week, the pivot table allows weekday and month names to sort like this: Monday, Tuesday, Wednesday, etc. rather than alphabetically like this: Friday, Monday, Saturday, etc.
If the items aren’t in a custom list, they will be sorted in ascending order by default. But, if you sort in descending order, you are setting up a rule that controls how the field is sorted even after new data fields are added.
Optional: Upload the movie dataset to BigQuery
The next video demonstrates how to use SQL to filter data in a large dataset in BigQuery.
If you would like to follow along with the instructor, you will need to log in to your BigQuery account and upload the movie dataset provided as a CSV file. If you have hopped around courses, Using BigQuery in the Prepare Data for Exploration course covers how to set up a BigQuery account.
Prepare for the next video
- First, download the CSV file from the attachment below:
- Next, complete the following steps in your BigQuery console to upload the movie dataset.
Step 1: Open your BigQuery console and click on the project you want to upload the data to.
Step 2: In the Explorer on the left, click the Actions icon (three vertical dots) next to your project name and select Create dataset.
Step 3: In the upcoming video, the name "movie_data" will be used for the dataset. If you plan to follow along with the video, enter movie_data for the Dataset ID.
Step 4: Click CREATE DATASET (blue button) to add the dataset to your project.
Step 5: In the Explorer on the left, click to expand your project and then click the movie_data dataset you just created.
Step 6: Click the Actions icon (three vertical dots) next to movie_data and select Open.
Step 7: Click the blue + icon at the top right to open the Create table window.
Step 8: Under Source, for the Create table from selection, choose where the data will be coming from.
- Select Upload.
- Click Browse to select the Movie Data CSV file you downloaded.
- Choose CSV from the file format drop-down.
Step 9: Under Destination, for Table name enter movies to match the table in the video.
Step 10: For Schema, click the Auto detect check box.
Step 11: Click Create table (blue button). You will now see the movies table under your movie_data dataset in your project.
Step 12: Click movies and then select the Preview tab. Confirm that you see the data shown below.
Congratulations, you are now ready to follow along with the next video.
Test your knowledge on organizing data
Question 1
Fill in the blank: A data analyst uses _____ to decide which data is relevant to their analysis and which data types and variables are appropriate.
A. database normalization
B. database relationships
C. database references
D. database organization
The correct answer is D. database organization. Explain: Database organization enables analysts to make decisions about which data is relevant to pull for a specific analysis. It also helps them decide which data types and variables are appropriate.
Question 2
A data analyst wants to organize a database to show only the 100 most recent real estate sales in Stamford, Connecticut. How can they do that?
A. The data analyst should add a filter to return only sales in Stamford, Connecticut, then sort the most recent sales at the top of their list.
B. The data analyst should filter out sale amounts in Stamford, Connecticut, then sort the highest sales at the top of their list.
C. The data analyst should add a filter to return only sales in Stamford, Connecticut, then sort the least recent sales at the top of their list.
D. The data analyst should filter out sales in Stamford, Connecticut, then sort the least recent sales at the top of their list.
The correct answer is A. The data analyst should add a filter to return only sales in Stamford, Connecticut, then sort the most recent sales at the top of their list. Explain: The data analyst should add a filter for only sales in Stamford, Connecticut, then sort the most recent sales at the top of their list.
Question 3
You are working with a database table that contains customer data. The country column designates the country where each customer is located. You want to find out which customers are located in Brazil.
You write the SQL query below. Add a WHERE clause that will return only customers located in Brazil.
How many customers are located in Brazil?
A. 7
B. 5
C. 9
D. 3
Explain: The clause WHERE country = “Brazil”
will return only customers located in Brazil. The complete query is SELECT * FROM customer WHERE country = “Brazil”
. The WHERE clause filters results that meet certain conditions. The WHERE clause includes the name of the column, an equals sign, and the value(s) in the column to include.
There are 5 customers located in Brazil.
+-------------+------------+-----------+--------------------------------------------------+---------------------------------+---------------------+-------+---------+-------------+--------------------+--------------------+-------------------------------+----------------+
| customer_id | first_name | last_name | company | address | city | state | country | postal_code | phone | fax | email | support_rep_id |
+-------------+------------+-----------+--------------------------------------------------+---------------------------------+---------------------+-------+---------+-------------+--------------------+--------------------+-------------------------------+----------------+
| 1 | Luís | Gonçalves | Embraer - Empresa Brasileira de Aeronáutica S.A. | Av. Brigadeiro Faria Lima, 2170 | São José dos Campos | SP | Brazil | 12227-000 | +55 (12) 3923-5555 | +55 (12) 3923-5566 | [email protected] | 3 |
| 10 | Eduardo | Martins | Woodstock Discos | Rua Dr. Falcão Filho, 155 | São Paulo | SP | Brazil | 01007-010 | +55 (11) 3033-5446 | +55 (11) 3033-4564 | [email protected] | 4 |
| 11 | Alexandre | Rocha | Banco do Brasil S.A. | Av. Paulista, 2022 | São Paulo | SP | Brazil | 01310-200 | +55 (11) 3055-3278 | +55 (11) 3055-8131 | [email protected] | 5 |
| 12 | Roberto | Almeida | Riotur | Praça Pio X, 119 | Rio de Janeiro | RJ | Brazil | 20040-020 | +55 (21) 2271-7000 | +55 (21) 2271-7070 | [email protected] | 3 |
| 13 | Fernanda | Ramos | None | Qe 7 Bloco G | Brasília | DF | Brazil | 71020-677 | +55 (61) 3363-5547 | +55 (61) 3363-7855 | [email protected] | 4 |
+-------------+------------+-----------+--------------------------------------------------+---------------------------------+---------------------+-------+---------+-------------+--------------------+--------------------+-------------------------------+----------------+