5.1.1.Data analysis basics & Organize data for analysis - sj50179/Google-Data-Analytics-Professional-Certificate GitHub Wiki

The goal of analysis is to identify trends and relationships within the data so that you can accurately answer the question you're asking.

The 4 phases of analysis

  1. Organize data
  2. Format and adjust data: You can adjust the data in a way that makes it easy to digest by filtering and sorting your data.
  3. Get input from others: Gaining input from others is important because it gives you a viewpoint you might not understand or have access to. On top of gaining input from other people, it's also important to seek out others' perspectives early.
  4. Transform data by observing relationships: Transforming data means identifying relationships and patterns between the data, and making calculations based on the data you have.

Test your knowledge on understanding data analysis

TOTAL POINTS 3

Question 1

You ask volunteers at a theater production which tasks they have already completed and add that data to a spreadsheet containing all required tasks. You will use the information provided by the volunteers to figure out which tasks still need to be done. This is an example of which phase of analysis?

  • Format and adjust data
  • Transform data
  • Organize data (into a dataset)
  • Get input from others

Correct. This is an example of getting input from others. Getting input involves soliciting information from other sources to inform your decisions.

Question 2

You are working with three datasets about voter turnout in your county. First, you identify relationships and patterns between the datasets. Then, you use formulas and functions to make calculations based on your data. This is an example of which phase of analysis?

  • Transform data
  • Get input from others
  • Format and adjust data
  • Organize data (into a dataset)

Correct. This is an example of transforming data, which involves identifying relationships and patterns in the data, and making calculations.

Question 3

You are working with a dataset from a local community college. You sort the students alphabetically by last name. This is an example of which phase of analysis?

  • Organize data (into a dataset)
  • Transform data
  • Get input from others
  • Format and adjust data

Correct. Sorting a list of students alphabetically is an example of formatting and adjusting data. This is a step analysts take to rearrange the data to make it easier to work with.


Keeping data organized with sorting and filters

You have learned about four phases of analysis:

  • Organize data
  • Format and adjust data
  • Get input from others
  • Transform data

The organization of datasets is really important for data analysts. Most of the datasets you will use will be organized as tables. Tables are helpful because they let you manipulate your data and categorize it. Having distinct categories and classifications lets you focus on, and differentiate between, your data quickly and easily.

Data analysts also need to format and adjust data when performing an analysis. Sorting and filtering are two ways you can keep things organized when you format and adjust data to work with it. For example, a filter can help you find errors or outliers so you can fix or flag them before your analysis. Outliers are data points that are very different from similarly collected data and might not be reliable values. The benefit of filtering the data is that after you fix errors or identify outliers, you can remove the filter and return the data to its original organization.

Sorting versus filtering

Sorting is when you arrange data into a meaningful order to make it easier to understand, analyze, and visualize. It ranks your data based on a specific metric you choose. You can sort data in spreadsheets, SQL databases (when your dataset is too large for spreadsheets), and tables in documents.

For example, if you need to rank things or create chronological lists, you can sort by ascending or descending order. If you are interested in figuring out a group’s favorite movies, you might sort by movie title to figure it out. Sorting will arrange the data in a meaningful way and give you immediate insights. Sorting also helps you to group similar data together by a classification. For movies, you could sort by genre -- like action, drama, sci-fi, or romance.

Filtering is used when you are only interested in seeing data that meets a specific criteria, and hiding the rest. Filtering is really useful when you have lots of data. You can save time by zeroing in on the data that is really important or the data that has bugs or errors. Most spreadsheets and SQL databases allow you to filter your data in a variety of ways. Filtering gives you the ability to find what you are looking for without too much effort.

For example, if you are only interested in finding out who watched movies in October, you could use a filter on the dates so only the records for movies watched in October are displayed. Then, you could check out the names of the people to figure out who watched movies in October.

To recap, the easiest way to remember the difference between sorting and filtering is that you can use sort to quickly order the data, and filter to display only the data that meets the criteria that you have chosen. Use filtering when you need to reduce the amount of data that is displayed.

It is important to point out that, after you filter data, you can sort the filtered data, too. If you revisit the example of finding out who watched movies in October, after you have filtered for the movies seen in October, you can then sort the names of the people who watched those movies in alphabetical order.

Sorting in a pivot table

Items in the row and column areas of a pivot table are sorted in ascending order by any custom list first. For example, if your list contains days of the week, the pivot table allows weekday and month names to sort like this: Monday, Tuesday, Wednesday, etc. rather than alphabetically like this: Friday, Monday, Saturday, etc.

If the items aren’t in a custom list, they will be sorted in ascending order by default. But, if you sort in descending order, you are setting up a rule that controls how the field is sorted even after new data fields are added.


Test your knowledge on organizing data

TOTAL POINTS 3

Question 1

Fill in the blank: A data analyst uses _____ to decide which data is relevant to their analysis and which data types and variables are appropriate.

  • database references
  • database organization
  • database relationships
  • database normalization

Correct. Database organization enables analysts to make decisions about which data is relevant to pull for a specific analysis. It also helps them decide which data types and variables are appropriate.

Question 2

A data analyst wants to organize a database to show only the 100 most recent real estate sales in Stamford, Connecticut. How can they do that?

  • The data analyst should filter out sales in Stamford, Connecticut, then sort the least recent sales at the top of their list.
  • The data analyst should filter out sales in Stamford, Connecticut, then sort the most recent sales at the top of their list.
  • The data analyst should add a filter to return only sales in Stamford, Connecticut, then sort the least recent sales at the top of their list.
  • The data analyst should add a filter to return only sales in Stamford, Connecticut, then sort the most recent sales at the top of their list.

Correct. The data analyst should add a filter for only sales in Stamford, Connecticut, then sort the most recent sales at the top of their list.

Question 3

You are working with a database table that contains customer data. The country column designates the country where each customer is located. You want to find out which customers are located in Brazil.

You write the SQL query below. Add a WHERE clause that will return only customers located in Brazil.

SELECT
	*
FROM
	customer
WHERE
	country = 'Brazil'
+-------------+------------+-----------+--------------------------------------------------+---------------------------------+---------------------+-------+---------+-------------+--------------------+--------------------+-------------------------------+----------------+
| customer_id | first_name | last_name | company                                          | address                         | city                | state | country | postal_code | phone              | fax                | email                         | support_rep_id |
+-------------+------------+-----------+--------------------------------------------------+---------------------------------+---------------------+-------+---------+-------------+--------------------+--------------------+-------------------------------+----------------+
|           1 | Luís       | Gonçalves | Embraer - Empresa Brasileira de Aeronáutica S.A. | Av. Brigadeiro Faria Lima, 2170 | São José dos Campos | SP    | Brazil  | 12227-000   | +55 (12) 3923-5555 | +55 (12) 3923-5566 | [email protected]          |              3 |
|          10 | Eduardo    | Martins   | Woodstock Discos                                 | Rua Dr. Falcão Filho, 155       | São Paulo           | SP    | Brazil  | 01007-010   | +55 (11) 3033-5446 | +55 (11) 3033-4564 | [email protected]      |              4 |
|          11 | Alexandre  | Rocha     | Banco do Brasil S.A.                             | Av. Paulista, 2022              | São Paulo           | SP    | Brazil  | 01310-200   | +55 (11) 3055-3278 | +55 (11) 3055-8131 | [email protected]              |              5 |
|          12 | Roberto    | Almeida   | Riotur                                           | Praça Pio X, 119                | Rio de Janeiro      | RJ    | Brazil  | 20040-020   | +55 (21) 2271-7000 | +55 (21) 2271-7070 | [email protected] |              3 |
|          13 | Fernanda   | Ramos     | None                                             | Qe 7 Bloco G                    | Brasília            | DF    | Brazil  | 71020-677   | +55 (61) 3363-5547 | +55 (61) 3363-7855 | [email protected]      |              4 |
+-------------+------------+-----------+--------------------------------------------------+---------------------------------+---------------------+-------+---------+-------------+--------------------+--------------------+-------------------------------+----------------+

How many customers are located in Brazil?

  • 9
  • 7
  • 3
  • 5

Correct. The clause WHERE country = “Brazil” will return only customers located in Brazil. The complete query is SELECT * FROM customer WHERE country = “Brazil”*. The WHERE clause filters results that meet certain conditions. The WHERE clause includes the name of the column, an equals sign, and the value(s) in the column to include. There are 5 customers located in Brazil.*