7.3.1. Explore data and R - sj50179/Google-Data-Analytics-Professional-Certificate GitHub Wiki

In the tidyverse, tibbles are like streamlined data frames.

Tibbles

  • Never change the data types of the inputs
  • Never change the names of variables
  • Never create row names
  • Make printing easier

Tidy data (R)

  • A way of standardizing the organization of data within R

Tidy data standards

  • Variables are organized into columns
  • Observations are organized into rows
  • Each value must have its own cell
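As a rough illustration of these standards, the sketch below reshapes a small table so that each variable gets its own column. The data is hypothetical, and the use of pivot_longer() from the tidyr package is an assumption made for this example rather than something these notes cover.

```r
library(tibble)
library(tidyr)

# Hypothetical data: the variable "year" is spread across two column names
wide <- tibble(
  country = c("A", "B"),
  `2020` = c(10, 20),
  `2021` = c(12, 25)
)

# Reshape so that country, year, and sales are each a column
# and every row is a single observation
long <- pivot_longer(
  wide,
  cols = c(`2020`, `2021`),
  names_to = "year",
  values_to = "sales"
)

long
```

After reshaping, each of the four country-year combinations is one row, and each value sits in its own cell.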

Question

Which summary functions can you use to preview data frames in R? Select all that apply.

  • glimpse()
  • head()
  • mutate()
  • str()

Correct. The head(), glimpse(), and str() summary functions allow you to preview data frames in R. The head() function returns the columns and the first several rows of data. The mutate() function lets you change the data frame, not preview it. Going forward, you can use summary functions to inspect the data frames you create in your career as a data analyst.
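A minimal sketch of the three preview functions, using the built-in mtcars dataset as an example. It assumes the dplyr package is installed, since that is where glimpse() is loaded from here; head() and str() come with base R.

```r
library(dplyr)  # provides glimpse(); head() and str() ship with base R

data(mtcars)

# head() returns the first six rows by default, with all columns
first_rows <- head(mtcars)
first_rows

# str() prints a compact description of the structure, one line per column
str(mtcars)

# glimpse() prints a transposed preview: columns run down the page
glimpse(mtcars)
```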

More about tibbles

In this reading, you will learn about tibbles, which are a super useful tool for organizing data in R. You will get a review of what tibbles are, how they differ from standard data frames, and how to create them in R.

Tibbles

Tibbles are a little different from standard data frames. A data frame is a collection of columns, like a spreadsheet or a SQL table. Tibbles are like streamlined data frames: by default, printing one shows only the first 10 rows of a dataset and only as many columns as fit on the screen. This is really useful when you’re working with large sets of data. Unlike data frames, tibbles never change the names of your variables or the data types of your inputs. Overall, data frames let you make more changes, but tibbles are easier to use. The tibble package is part of the core tidyverse, so if you’ve already installed the tidyverse, you have what you need to start working with tibbles.

  • When a tibble is called directly, R will display enough information to give a quick sense of the contents of the tibble. This includes:
    1. the dimensions of the tibble
    2. the column names and types
    3. as many cells of the tibble as will fit comfortably in the console window
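The difference in how names are handled can be seen in a small sketch. The column names and values below are made up for illustration; the comparison relies on the default behavior of data.frame(), which rewrites names that are not syntactically valid in R.

```r
library(tibble)

# data.frame() rewrites names that are not valid R names by default
df <- data.frame(`unit price` = c(1.5, 2.0), item = c("a", "b"))
names(df)   # "unit.price" "item"

# tibble() keeps the names exactly as written and adds no row names
tb <- tibble(`unit price` = c(1.5, 2.0), item = c("a", "b"))
names(tb)   # "unit price" "item"

# Calling the tibble directly prints its dimensions,
# the column names and types, and the cells that fit
tb
```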

Creating tibbles

Now, let’s go through an example of how to create a tibble in R. You can use the pre-loaded diamonds dataset that you’re familiar with from earlier videos. As a reminder, the diamonds dataset includes information about different diamond qualities, like carat, cut, color, clarity, and more.

You can load the dataset with the data() function using the following code:

library(tidyverse)

data(diamonds)

Then, let’s add the data frame to our data viewer in RStudio with the View() function.

View(diamonds)

The dataset has 10 columns and 53,940 rows. This image displays part of the data frame:

Now let’s create a tibble from the same dataset. You can create a tibble from existing data with the as_tibble() function. Indicate what data you’d like to use in the parentheses of the function. In this case, you will put the word “diamonds.”

as_tibble(diamonds)

Results

When you run the function, you get a tibble of the diamonds dataset.

While RStudio’s built-in data frame tool returns thousands of rows in the diamonds dataset, the tibble only returns the first 10 rows in a neatly organized table. That makes it easier to view and print.
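The same printing behavior can be tried on a smaller dataset. The sketch below uses the built-in mtcars data frame as a stand-in for diamonds, and shows one way to ask for more than the default 10 rows by passing n to print().

```r
library(tibble)

# Convert a built-in data frame to a tibble
cars_tbl <- as_tibble(mtcars)

# Printing a tibble shows only the first 10 rows by default
cars_tbl

# Ask for more rows explicitly when you need them
print(cars_tbl, n = 15)
```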

Additional resources

For more information on tibbles, check out the following resources:

  • The entry for Tibble in the tidyverse documentation summarizes what a tibble is and how it works in R code. If you want a quick overview of the essentials, this is the place to go.
  • The Tidy chapter in "A Tidyverse Cookbook" is a great resource if you want to learn more about how to work with tibbles using R code. The chapter explores a variety of R functions that can help you create and transform tibbles to organize and tidy your data.

Data-import basics

The data() function

The default installation of R comes with a number of preloaded datasets that you can practice with. This is a great way to develop your R skills and learn about some important data analysis functions. Plus, many online resources and tutorials use these sample datasets to teach coding concepts in R.

You can use the data() function to load these datasets in R. If you run the data() function without an argument, R will display a list of the available datasets.

data()

This includes the list of preloaded datasets from the datasets package.

If you want to load a specific dataset, just enter its name in the parentheses of the data() function. For example, let’s load the mtcars dataset, which has information about cars that have been featured in past issues of Motor Trend magazine.

data(mtcars)

When you run the function, R will load the dataset. The dataset will also appear in the Environment pane of your RStudio. The Environment pane displays the names of the data objects, such as data frames and variables, that you have in your current workspace. In this image, mtcars appears in the fifth row of the pane. R tells us that it contains 32 observations and 11 variables.

Now that the dataset is loaded, you can get a preview of it in the R console pane. Just type its name...

mtcars

...and then press Ctrl (or Cmd) and Enter.

You can also display the dataset by clicking directly on the name of the dataset in the Environment pane. So, if you click on mtcars in the Environment pane, R automatically runs the View() function and displays the dataset in the RStudio data viewer.

Try experimenting with other datasets in the list if you want some more practice.
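The counts that the Environment pane reports for mtcars can also be checked from the console. This short sketch uses only base R functions.

```r
data(mtcars)

# Number of observations (rows) and variables (columns)
dim(mtcars)    # 32 11
nrow(mtcars)   # 32
ncol(mtcars)   # 11

# Names of the variables
names(mtcars)
```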

The readr package

In addition to using R’s built-in datasets, it is also helpful to import data from other sources to use for practice or analysis. The readr package in R is a great tool for reading rectangular data. Rectangular data is data that fits nicely inside a rectangle of rows and columns, with each column referring to a single variable and each row referring to a single observation.

Here are some examples of file types that store rectangular data:

  • .csv (comma separated values): a .csv file is a plain text file that contains a list of data. They mostly use commas to separate (or delimit) data, but sometimes they use other characters, like semicolons.
  • .tsv (tab separated values): a .tsv file stores a data table in which the columns of data are separated by tabs. It is often used to exchange data from a database table or a spreadsheet.
  • .fwf (fixed width files): a .fwf file stores each field in a fixed number of characters, so columns are identified by their position in the line rather than by a delimiter.
  • .log: a .log file is a computer-generated file that records events from operating systems and other software programs.

Base R also has functions for reading files, but the equivalent functions in readr are typically much faster. They also produce tibbles, which are easy to use and read.

The readr package is part of the core tidyverse. So, if you’ve already installed the tidyverse, you have what you need to start working with readr. If not, you can install the tidyverse now.

readr functions

The goal of readr is to provide a fast and friendly way to read rectangular data. readr supports several read_ functions. Each function refers to a specific file format.

  • read_csv(): comma-separated values (.csv) files
  • read_tsv(): tab-separated values files
  • read_delim(): general delimited files
  • read_fwf(): fixed-width files
  • read_table(): tabular files where columns are separated by white-space
  • read_log(): web log files

These functions all have similar syntax, so once you learn how to use one of them, you can apply your knowledge to the others. This reading will focus on the read_csv() function, since .csv files are one of the most common forms of data storage and you will work with them frequently.
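To illustrate how the syntax carries over, here is a small sketch that reads a semicolon-delimited file with read_delim(). The file contents and column names are hypothetical, written to a temporary file so the example is self-contained.

```r
library(readr)

# Write a small hypothetical semicolon-delimited file to a temporary path
path <- tempfile(fileext = ".txt")
writeLines(c("id;city", "1;Lagos", "2;Lima"), path)

# Same pattern as read_csv(): supply the path, plus the delimiter to split on
cities <- read_delim(path, delim = ";")

cities
```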

In most cases, these functions will work automatically: you supply the path to a file, run the function, and you get a tibble that displays the data in the file. Behind the scenes, readr parses the overall file and specifies how each column should be converted from a character vector to the most appropriate data type.
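The type guessing described above can be observed with a small file. This is a sketch under stated assumptions: the file contents are made up, and the second call uses readr's cols() and col_character() helpers to show one way of overriding a guess.

```r
library(readr)

# Write a small hypothetical CSV to a temporary file
path <- tempfile(fileext = ".csv")
writeLines(c("name,score,passed", "Ana,91.5,TRUE", "Ben,78,FALSE"), path)

# read_csv() guesses a type for each column from the text in the file
scores <- read_csv(path)

# Inspect the type chosen for each column
sapply(scores, class)

# Override the guess for one column; the others are still guessed
as_text <- read_csv(path, col_types = cols(score = col_character()))
class(as_text$score)
```

Here the name column stays as text, score becomes numeric, and passed becomes logical, unless you specify otherwise.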

Reading a .csv file with readr

The readr package comes with some sample files from built-in datasets that you can use for example code. To list the sample files, you can run the readr_example() function with no arguments.

readr_example()

[1] "challenge.csv"     "epa78.txt"         "example.log"
[4] "fwf-sample.txt"    "massey-rating.txt" "mtcars.csv"
[7] "mtcars.csv.bz2"    "mtcars.csv.zip"

The “mtcars.csv” file refers to the mtcars dataset that was mentioned earlier. Let’s use the read_csv() function to read the “mtcars.csv” file, as an example. In the parentheses, you need to supply the path to the file. In this case, it’s readr_example("mtcars.csv").

read_csv(readr_example("mtcars.csv"))

When you run the function, R prints out a column specification that gives the name and type of each column.

R also prints a tibble.
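If you want to keep working with the result rather than only print it, one option is to assign it to a name. The object name cars below is an arbitrary choice for this sketch; spec() is the readr function that returns the column specification again.

```r
library(readr)

# Store the tibble instead of only printing it
cars <- read_csv(readr_example("mtcars.csv"))

# Show the column specification that readr used
spec(cars)

# Confirm the size of the result
dim(cars)
```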


Optional: the readxl package

To import spreadsheet data into R, you can use the readxl package. The readxl package makes it easy to transfer data from Excel into R. It supports both the legacy .xls file format and the modern XML-based .xlsx file format.

The readxl package is part of the tidyverse but is not a core tidyverse package, so you need to load readxl in R by using the library() function.

library(readxl)

Reading an Excel file with readxl

Like the readr package, readxl comes with some sample files from built-in datasets that you can use for practice. You can run the code readxl_example() to see the list.

You can use the read_excel() function to read a spreadsheet file, just like you used the read_csv() function to read a .csv file. The code for reading the example file “type-me.xlsx” includes the path to the file in the parentheses of the function.

read_excel(readxl_example("type-me.xlsx"))

You can use the excel_sheets() function to list the names of the individual sheets.

excel_sheets(readxl_example("type-me.xlsx"))

[1] "logical_coercion" "numeric_coercion" "date_coercion" "text_coercion"

You can also specify a sheet by name or number. Just type “sheet =” followed by the name or number of the sheet. For example, you can use the sheet named “numeric_coercion” from the list above.

read_excel(readxl_example("type-me.xlsx"), sheet = "numeric_coercion")

When you run the function, R returns a tibble of the sheet.
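Selecting a sheet by number works the same way. In the sketch below, the position 2 is taken from the excel_sheets() output shown above, where “numeric_coercion” is the second sheet; the object names are arbitrary.

```r
library(readxl)

path <- readxl_example("type-me.xlsx")

# List the sheet names to find the position of the one you want
sheets <- excel_sheets(path)
sheets

# Read the same sheet by name and by position
by_name <- read_excel(path, sheet = "numeric_coercion")
by_number <- read_excel(path, sheet = 2)

# Both calls should return the same data
all.equal(by_name, by_number)
```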

Additional resources

  • If you want to learn how to use readr functions to work with more complex files, check out the Data Import chapter of the R for Data Science book. It explores some of the common issues you might encounter when reading files, and how to use readr to manage those issues.
  • The readxl entry in the tidyverse documentation gives a good overview of the basic functions in readxl, provides a detailed explanation of how the package operates and the coding concepts behind it, and offers links to other useful resources.
  • The R "datasets" package contains lots of useful preloaded datasets. Check out The R Datasets Package for a list. The list includes links to detailed descriptions of each dataset.

Test your knowledge on R data frames

TOTAL POINTS 4

Question 1

Which of the following are best practices for creating data frames? Select all that apply.

  • Rows should be named
  • All data stored should be the same type
  • Columns should be named
  • Each column should contain the same number of data items

Correct. When creating data frames, columns should be named and each column should contain the same number of data items.
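These two practices can be sketched with base R. The data is hypothetical; the second call deliberately supplies columns of different lengths to show that data.frame() refuses them, with tryCatch() used only to capture the error message.

```r
# Named columns, each with the same number of items
orders <- data.frame(
  id = 1:3,
  item = c("pen", "ink", "pad"),
  price = c(1.2, 4.5, 2.0)
)
orders

# Columns may hold different types, but their lengths must match.
# Here one column has 3 items and the other has 2, so R raises an error.
result <- tryCatch(
  data.frame(id = 1:3, item = c("pen", "ink")),
  error = function(e) conditionMessage(e)
)
result
```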

Question 2

Why are tibbles a useful variation of data frames?

  • Tibbles make printing easier
  • Tibbles can create row names
  • Tibbles can change the data type of inputs
  • Tibbles make changing the names of variables easier

Correct. Tibbles can make printing easier. They also help you avoid overloading your console when working with large datasets. Tibbles are automatically set to only return the first ten rows of a dataset and as many columns as can fit on the screen.

Question 3

Tidy data is a way of standardizing the organization of data within R.

  • True
  • False

Correct. Tidy data refers to the principles that make data structures meaningful and easy to understand. It’s a way of standardizing the organization of data within R.

Question 4

Which R function can be used to make changes to a data frame?

  • head()
  • str()
  • colnames()
  • mutate()

Correct. The mutate() function can be used to make changes to a data frame.
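A minimal sketch of mutate(), assuming the dplyr package is installed. The new column name kpl and the conversion to kilometers per liter are choices made for this example.

```r
library(dplyr)

data(mtcars)

# Add a new column that converts miles per gallon to kilometers per liter
# (1 US mpg is roughly 0.4251 km/L)
cars_km <- mutate(mtcars, kpl = mpg * 0.4251)

head(cars_km)

# mutate() returns a new data frame; the original is left unchanged
names(mtcars)
```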