7.3.1. Explore data and R

Hands-On Activity: Create your own data frame

Activity overview

Earlier, you learned about data frames. In this activity, you will create and use data frames in R.

As a refresher, a data frame is a collection of columns containing data, similar to a spreadsheet or SQL table. Data frames are one of the basic tools you will use to work with data in R. You can also create data frames from different data sources.

By the time you complete this activity, you will be able to create data frames with the data.frame() function and use data frames to complete tasks in R. This will enable you to summarize and organize data in R, which will help give your R analyses more structure as you complete more advanced data analysis tasks.
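As a minimal sketch of what data.frame() does, the following builds a small data frame from two vectors. The variable names and values (person_names, ages, people) are made up for illustration and are not taken from the activity's Rmd file.

```r
# Hypothetical example data: two vectors of equal length
person_names <- c("Ana", "Ben", "Chloe")
ages <- c(28, 35, 41)

# Combine the vectors into a data frame; each argument becomes a named column
people <- data.frame(name = person_names, age = ages)

# Inspect the result
print(people)
nrow(people)      # number of rows
ncol(people)      # number of columns
colnames(people)  # column names
```

Each argument name (name, age) becomes a column name, which is why it helps to pick descriptive ones.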

Working in RStudio Cloud

To start, log into your RStudio (Posit) Cloud account. Open the project you will work on in the activity with this link, which opens in a new tab. If you haven't gone through this process already, at the top right portion of the screen you will see a "red stamp" indicating this project as a Temporary Copy. Click on the adjacent button, Save a Permanent Copy, and the project will be saved in your main dashboard for use with future lessons. Once that is completed, navigate to the file explorer in the bottom right and click on the following: Course 7 -> Week 3 -> Lesson2_Dataframe.Rmd.

If you have trouble finding the correct activity, check out this step-by-step guide on how to navigate in RStudio Cloud. Make sure to select the correct R markdown (Rmd) file. The other Rmd files will be used in different activities.

If you are using RStudio (Posit) Desktop, you can download the Rmd file here:

Lesson2_Dataframe

You can also find the Rmd file with the solutions for this activity here:

Lesson2_Dataframe_Solutions

Carefully read the instructions in the comments of the Rmd file and complete each step. Some steps may be as simple as running pre-written code, while others may require you to write your own functions. After you finish the steps in the Rmd file, return here to confirm that your work is complete.

Confirmation

Which summary functions can you use to preview data frames in R? Select all that apply.

  • str()
  • glimpse()
  • head()
  • mutate()

Explain: The head(), glimpse(), and str() summary functions allow you to preview data frames in R. The head() function returns the columns and the first several rows of data (six by default). The mutate() function lets you change the data frame, not preview it. Going forward, you can use summary functions to inspect the data frames you create in your career as a data analyst.
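To compare the three preview functions side by side, here is a short sketch that uses the built-in mtcars dataset. It assumes the dplyr package is installed, since glimpse() is not part of base R.

```r
library(dplyr)  # provides glimpse()

data(mtcars)

# head(): the column names plus the first six rows by default
first_rows <- head(mtcars)
print(first_rows)

# str(): a compact structural summary, one line per column with its type
str(mtcars)

# glimpse(): similar to str(), laid out to fit the width of the console
glimpse(mtcars)
```

All three functions only display information. None of them modifies mtcars.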

More about tibbles

In this reading, you will learn about tibbles, which are a super useful tool for organizing data in R. You will get a review of what tibbles are, how they differ from standard data frames, and how to create them in R.

Tibbles

Tibbles are a little different from standard data frames. A data frame is a collection of columns, like a spreadsheet or a SQL table. Tibbles are like streamlined data frames: when printed, they show only the first 10 rows of a dataset and only as many columns as fit on the screen. This is really useful when you’re working with large sets of data. Unlike data frames, tibbles never change the names of your variables or the data types of your inputs. Overall, you can make more changes to data frames, but tibbles are easier to use. The tibble package is part of the core tidyverse. So, if you’ve already installed the tidyverse, you have what you need to start working with tibbles.
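The point about variable names can be illustrated with a small sketch. The column name `total sales` and its values are hypothetical. The example relies on the default behavior of data.frame(), which rewrites names that are not syntactically valid R names.

```r
library(tibble)

# data.frame() rewrites non-syntactic column names by default
df <- data.frame(`total sales` = c(10, 20))
colnames(df)   # "total.sales"

# tibble() keeps the name exactly as written
tb <- tibble(`total sales` = c(10, 20))
colnames(tb)   # "total sales"
```

In versions of R before 4.0, data.frame() also converted character columns to factors by default, which is one example of the data type changes that tibbles avoid.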

Creating tibbles

Now, let’s go through an example of how to create a tibble in R. You can use the pre-loaded diamonds dataset that you’re familiar with from earlier videos. As a reminder, the diamonds dataset includes information about different diamond qualities, like carat, cut, color, clarity, and more.

You can load the dataset with the data() function using the following code:

library(tidyverse)

data(diamonds)

Then, let’s add the data frame to our data viewer in RStudio with the View() function.

View(diamonds)

The dataset has 10 columns and thousands of rows. This image displays part of the data frame:

[Image: part of the diamonds data frame in the RStudio data viewer]

Now let’s create a tibble from the same dataset. You can create a tibble from existing data with the as_tibble() function. Indicate what data you’d like to use in the parentheses of the function. In this case, you will put the word “diamonds.”

as_tibble(diamonds)

Results

When you run the function, you get a tibble of the diamonds dataset.

While RStudio’s built-in data frame tool returns thousands of rows in the diamonds dataset, the tibble only returns the first 10 rows in a neatly organized table. That makes it easier to view and print.
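The same conversion works for other data frames. As a sketch, the example below converts mtcars, a dataset that ships with base R, and assumes the tibble package is installed. Unlike diamonds, mtcars stores the car names as row names, which as_tibble() drops unless you ask it to keep them. The column name "model" is our own choice.

```r
library(tibble)

data(mtcars)

# rownames = "model" moves the row names into a new first column
cars_tbl <- as_tibble(mtcars, rownames = "model")

class(cars_tbl)   # a tibble is still a data frame underneath
print(cars_tbl)   # prints only the first 10 of the 32 rows
```

Printing cars_tbl also reports how many rows were left out, so you still know the full size of the data.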

Additional resources

For more information on tibbles, check out the following resources:

  • The entry for Tibble in the tidyverse documentation summarizes what a tibble is and how it works in R code. If you want a quick overview of the essentials, this is the place to go.
  • The Tidy chapter in "A Tidyverse Cookbook" is a great resource if you want to learn more about how to work with tibbles using R code. The chapter explores a variety of R functions that can help you create and transform tibbles to organize and tidy your data.

Data import basics

You can save this reading for future reference. Feel free to download a PDF version of this reading below:

Data import basics.pdf

The data() function

The default installation of R comes with a number of preloaded datasets that you can practice with. This is a great way to develop your R skills and learn about some important data analysis functions. Plus, many online resources and tutorials use these sample datasets to teach coding concepts in R.

You can use the data() function to load these datasets in R. If you run the data function without an argument, R will display a list of the available datasets.

data()

This includes the list of preloaded datasets from the datasets package.

[Image: the list of preloaded datasets displayed by data()]

If you want to load a specific dataset, just enter its name in the parentheses of the data() function. For example, let’s load the mtcars dataset, which has information about cars that have been featured in past issues of Motor Trend magazine.

data(mtcars)

When you run the function, R will load the dataset. The dataset will also appear in the Environment pane of your RStudio. The Environment pane displays the names of the data objects, such as data frames and variables, that you have in your current workspace. In this image, mtcars appears in the fifth row of the pane. R tells us that it contains 32 observations and 11 variables.

[Image: the RStudio Environment pane listing mtcars]

Now that the dataset is loaded, you can get a preview of it in the R console pane. Just type its name...

mtcars

...and then press Ctrl (or Cmd on a Mac) and Enter.

You can also display the dataset by clicking directly on the name of the dataset in the Environment pane. So, if you click on mtcars in the Environment pane, R automatically runs the View() function and displays the dataset in the RStudio data viewer.

[Image: the mtcars dataset in the RStudio data viewer]

Try experimenting with other datasets in the list if you want some more practice.
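Outside the RStudio panes, you can do the same inspection in code. The following is a small sketch that uses only base R. The variable name available is our own.

```r
# Load the dataset into the workspace
data(mtcars)

dim(mtcars)           # 32 rows (observations) and 11 columns (variables)
head(mtcars, n = 3)   # first three rows
summary(mtcars$mpg)   # summary statistics for a single column

# List the names of the datasets that ship with the datasets package
available <- data(package = "datasets")$results[, "Item"]
head(available)
```

The dimensions match what the Environment pane reports for mtcars: 32 observations of 11 variables.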

The readr package

In addition to using R’s built-in datasets, it is also helpful to import data from other sources to use for practice or analysis. The readr package in R is a great tool for reading rectangular data. Rectangular data is data that fits nicely inside a rectangle of rows and columns, with each column referring to a single variable and each row referring to a single observation.

Here are some examples of file types that store rectangular data:

  • .csv (comma separated values): a .csv file is a plain text file that contains a list of data. They mostly use commas to separate (or delimit) data, but sometimes they use other characters, like semicolons.
  • .tsv (tab separated values): a .tsv file stores a data table in which the columns of data are separated by tabs. It is often used to export a database table or spreadsheet data.
  • .fwf (fixed width files): a .fwf file stores text data in which each column has a fixed character width, so fields are identified by their position in the line rather than by a delimiter.
  • .log: a .log file is a computer-generated file that records events from operating systems and other software programs.

Base R also has functions for reading files, but the equivalent functions in readr are typically much faster. They also produce tibbles, which are easy to use and read.

The readr package is part of the core tidyverse. So, if you’ve already installed the tidyverse, you have what you need to start working with readr. If not, you can install the tidyverse now.

readr functions

The goal of readr is to provide a fast and friendly way to read rectangular data. readr supports several read_ functions. Each function refers to a specific file format.

  • read_csv(): comma-separated values (.csv) files
  • read_tsv(): tab-separated values files
  • read_delim(): general delimited files
  • read_fwf(): fixed-width files
  • read_table(): tabular files where columns are separated by white-space
  • read_log(): web log files

These functions all have similar syntax, so once you learn how to use one of them, you can apply your knowledge to the others. This reading will focus on the read_csv() function, since .csv files are one of the most common forms of data storage and you will work with them frequently.

In most cases, these functions will work automatically: you supply the path to a file, run the function, and you get a tibble that displays the data in the file. Behind the scenes, readr parses the overall file and specifies how each column should be converted from a character vector to the most appropriate data type.
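To see the automatic type parsing in action, here is a minimal sketch that writes two small files to a temporary location and reads them back. The data (city, visits, rating) is made up. The show_col_types argument is assumed to be available, which requires readr 2.0 or later.

```r
library(readr)

# Write a small comma-separated file to a temporary location (made-up data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("city,visits,rating",
             "Lima,120,4.5",
             "Oslo,85,3.9"), csv_path)

# read_csv() guesses a type for each column from its contents
visits <- read_csv(csv_path, show_col_types = FALSE)
sapply(visits, class)   # city is character; visits and rating are numeric

# The same data separated by semicolons: use read_delim() with an explicit delimiter
semi_path <- tempfile(fileext = ".txt")
writeLines(c("city;visits;rating",
             "Lima;120;4.5",
             "Oslo;85;3.9"), semi_path)
visits_semi <- read_delim(semi_path, delim = ";", show_col_types = FALSE)
print(visits_semi)
```

Both calls return a tibble with the same shape, which reflects how similar the syntax of the read_ functions is.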

Reading a .csv file with readr

The readr package comes with some sample files from built-in datasets that you can use for example code. To list the sample files, you can run the readr_example() function with no arguments.

readr_example()

[1] "challenge.csv" "epa78.txt" "example.log"
[4] "fwf-sample.txt" "massey-rating.txt" "mtcars.csv"
[7] "mtcars.csv.bz2" "mtcars.csv.zip"

The “mtcars.csv” file refers to the mtcars dataset that was mentioned earlier. Let’s use the read_csv() function to read the “mtcars.csv” file, as an example. In the parentheses, you need to supply the path to the file. In this case, it’s readr_example("mtcars.csv").

read_csv(readr_example("mtcars.csv"))

When you run the function, R prints out a column specification that gives the name and type of each column.

R also prints a tibble.
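If you assign the result to a variable, you can inspect the column specification later and override the guessed types. This is a sketch under a few assumptions: the variable names (path, cars, cars_int) are our own, show_col_types requires readr 2.0 or later, and the cyl column in the sample file is assumed to contain only whole numbers.

```r
library(readr)

path <- readr_example("mtcars.csv")

# Store the tibble instead of only printing it
cars <- read_csv(path, show_col_types = FALSE)

# spec() retrieves the column specification that readr guessed
spec(cars)

# col_types overrides the guess for selected columns; here cyl is read as integer
cars_int <- read_csv(path, col_types = cols(cyl = col_integer()),
                     show_col_types = FALSE)

class(cars$cyl)       # guessed type
class(cars_int$cyl)   # overridden type
```

Columns that are not named in cols() are still guessed automatically.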

Optional: the readxl package

To import spreadsheet data into R, you can use the readxl package. The readxl package makes it easy to transfer data from Excel into R. It supports both the legacy .xls file format and the modern XML-based .xlsx file format.

The readxl package is part of the tidyverse but is not a core tidyverse package, so you need to load readxl in R by using the library() function.

library(readxl)

Reading an .xlsx file with readxl

Like the readr package, readxl comes with some sample files from built-in datasets that you can use for practice. You can run the code readxl_example() to see the list.

You can use the read_excel() function to read a spreadsheet file just like you used the read_csv() function to read a .csv file. The code for reading the example file “type-me.xlsx” includes the path to the file in the parentheses of the function.

read_excel(readxl_example("type-me.xlsx"))

You can use the excel_sheets() function to list the names of the individual sheets.

excel_sheets(readxl_example("type-me.xlsx"))

[1] "logical_coercion" "numeric_coercion" "date_coercion" "text_coercion"

You can also specify a sheet by name or number. Just type “sheet =” followed by the name or number of the sheet. For example, you can use the sheet named “numeric_coercion” from the list above.

read_excel(readxl_example("type-me.xlsx"), sheet = "numeric_coercion")

When you run the function, R returns a tibble of the sheet.
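Selecting a sheet by number works the same way. As a sketch, assuming the sheet order shown in the excel_sheets() output above, sheet 2 is the "numeric_coercion" sheet. The variable names are our own.

```r
library(readxl)

path <- readxl_example("type-me.xlsx")

# Store the sheet names so they can be reused
sheets <- excel_sheets(path)

# Select the same sheet by position and by name
by_number <- read_excel(path, sheet = 2)
by_name <- read_excel(path, sheet = "numeric_coercion")

# Both calls should return a tibble with the same dimensions
dim(by_number)
dim(by_name)
```

Selecting by name is usually the safer choice, because it keeps working if someone reorders the sheets in the workbook.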

Additional resources

  • If you want to learn how to use readr functions to work with more complex files, check out the Data Import chapter of the R for Data Science book. It explores some of the common issues you might encounter when reading files, and how to use readr to manage those issues.

  • The readxl entry in the tidyverse documentation gives a good overview of the basic functions in readxl, provides a detailed explanation of how the package operates and the coding concepts behind them, and offers links to other useful resources.

  • The R "datasets" package contains lots of useful preloaded datasets. Check out The R Datasets Package for a list. The list includes links to detailed descriptions of each dataset.

Hands-On Activity: Importing and working with data

Activity overview

By now, you have some experience manually entering data in R to create a data frame. In this activity you will import data from outside of R using the read_csv() function, then use R functions to manipulate and interact with that data.

Upon completing this activity, you will be able to import data into RStudio so you can analyze it. This will enable you to bring your own .csv files into RStudio and use this environment for personal projects, which will help you hone your data skills. As a data analyst, it will also be common for you to import data from external files into your R console and use it to create a data frame to analyze it.

Work in RStudio Cloud

To start, log into your RStudio (Posit) Cloud account. Open the project you will work on in the activity with this link, which opens in a new tab. If you haven't gone through this process already, at the top right portion of the screen you will see a "red stamp" indicating this project as a Temporary Copy. Click on the adjacent button, Save a Permanent Copy, and the project will be saved in your main dashboard for use with future lessons. Once that is completed, navigate to the file explorer in the bottom right and click on the following: Course 7 -> Week 3 -> Lesson2_Import.Rmd.

The .csv file you will need, hotel_bookings.csv, is also located in this folder.

If you have trouble finding the correct activity, check out this step-by-step guide on how to navigate in RStudio (Posit) Cloud. Make sure to select the correct R markdown (Rmd) file. The other Rmd files will be used in different activities.

If you are using RStudio Desktop, you can download the Rmd file and the data for this activity directly here:

Lesson2_Import

hotel_bookings

You can also find the Rmd file with the solutions for this activity here:

Lesson2_Import_Solutions

Carefully read the instructions in the comments of the Rmd file and complete each step. Some steps may be as simple as running pre-written code, while others may require you to write your own functions. After you finish the steps in the Rmd file, return here to confirm that your work is complete.

If you have trouble completing the exercise or don't know how to proceed, navigate to Course 7 -> Week 3 -> Solutions -> Lesson2_Import_Solutions.Rmd in the exercise files.

Confirmation

Which syntax would you use to import a dataset called quarter_earnings.csv into RStudio?

A. earnings_df <- read_csv("quarter_earnings")

B. earnings_df <- read_csv(quarter_earnings.csv)

C. earnings_df <- read_csv(quarter_earnings)

D. earnings_df <- read_csv("quarter_earnings.csv")

The correct answer is D. earnings_df <- read_csv("quarter_earnings.csv"). Explain: The proper syntax for importing the “quarter_earnings.csv” dataset is earnings_df <- read_csv("quarter_earnings.csv"). The file name, including its .csv extension, must be passed as a quoted string. When you run the function, R prints the column specification of the data frame it creates. Going forward, you can import data into RStudio with read_csv() for projects throughout your career as a data analyst.
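To see why the quotes matter, here is a small sketch. It creates a stand-in file named quarter_earnings.csv in a temporary folder, so the contents and the path are hypothetical, and it assumes readr 2.0 or later for show_col_types.

```r
library(readr)

# Create a small stand-in file in a temporary folder (made-up contents)
csv_path <- file.path(tempdir(), "quarter_earnings.csv")
writeLines(c("quarter,earnings", "Q1,1500", "Q2,1725"), csv_path)

# The pattern from option D: the file path is a quoted string
earnings_df <- read_csv(csv_path, show_col_types = FALSE)
print(earnings_df)

# The pattern from options B and C: an unquoted name is treated as an R object.
# No object with that name exists, so the call fails with an error.
result <- tryCatch(
  read_csv(quarter_earnings),
  error = function(e) "error"
)
result
```

Option A fails for a different reason: the string is quoted, but without the .csv extension it does not match the name of the file.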

Data in R versus SQL

As you’ve been learning, R is a programming language frequently used for statistical analysis, visualization, and other data analysis. R is a little different from the other data analytics tools you have discovered so far.

What are your thoughts about the way R manages datasets compared to SQL or spreadsheets? What are the advantages and disadvantages to each of these tools? Please submit a written response of two or more paragraphs (100-150 words total) responding to this question. Then, visit the discussion forum to review what others have written, and respond to at least two posts with your own thoughts.

Test your knowledge on R data frames

Question 1

Which of the following are best practices for creating data frames? Select all that apply.

  • Rows should be named
  • Each column should contain the same number of data items
  • Columns should be named
  • All data stored should be the same type

Explain: When creating data frames, columns should be named and each column should contain the same number of data items.
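These two rules can be checked with a short sketch in base R. The data (student, score, passed) is made up for illustration.

```r
# Columns can hold different types, as long as every column is named
# and has the same number of items
scores <- data.frame(
  student = c("A", "B", "C"),      # text column
  score = c(90, 85, 77),           # numeric column
  passed = c(TRUE, TRUE, FALSE)    # logical column
)
sapply(scores, class)

# Columns with 3 and 2 items cannot be combined, so data.frame() raises an error.
# tryCatch() captures the error message instead of stopping the script.
bad <- tryCatch(
  data.frame(student = c("A", "B", "C"), score = c(90, 85)),
  error = function(e) conditionMessage(e)
)
bad
```

This also shows why "All data stored should be the same type" is not a requirement: scores mixes text, numbers, and logical values across its columns.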

Question 2

Why are tibbles a useful variation of data frames?

A. Tibbles make printing easier

B. Tibbles make changing the names of variables easier.

C. Tibbles can change the data type of inputs

D. Tibbles can create row names

The correct answer is A. Tibbles make printing easier. Explain: Tibbles can make printing easier. They also help you avoid overloading your console when working with large datasets. Tibbles are automatically set to only return the first ten rows of a dataset and as many columns as fit on the screen.

Question 3

Tidy data is a way of standardizing the organization of data within R.

A. True

B. False

The correct answer is A. True. Explain: Tidy data refers to the principles that make data structures meaningful and easy to understand. It’s a way of standardizing the organization of data within R.

Question 4

Which R function can be used to make changes to a data frame?

A. colnames()

B. head()

C. str()

D. mutate()

The correct answer is D. mutate(). Explain: The mutate() function can be used to make changes to a data frame.
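As a brief sketch of how mutate() changes a data frame, the example below adds a computed column. It assumes the dplyr package is installed, and the data (item, price, quantity) is made up.

```r
library(dplyr)

# Hypothetical order data
orders <- data.frame(
  item = c("pen", "book"),
  price = c(2, 12),
  quantity = c(10, 3)
)

# mutate() adds a column computed from existing columns and returns a new data frame
orders_total <- mutate(orders, total = price * quantity)
print(orders_total)

# The original data frame is unchanged unless you assign the result back to it
colnames(orders)
colnames(orders_total)
```

By contrast, head() and str() only preview a data frame and leave its contents as they are.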