Course 7‐3 - Forestreee/Data-Analytics GitHub Wiki

Google Data Analytics Professional

[Data Analysis with R Programming]

WEEK 3 - Working with data in R

The R programming language was designed to work with data at all stages of the data analysis process. In this part of the course, you’ll examine how R can help you structure, organize, and clean your data using functions and other processes. You’ll learn about data frames and how to work with them in R. You’ll also revisit the issue of data bias and how R can help.

Learning Objectives

  • Discuss how R functions may be used to address issues of bias and the relationships between data variables
  • Describe R functions that may be used to clean and organize data
  • Describe functions used to work with data frames including read_csv(), data(), and datapasta()
  • Discuss the difference between tibbles and tribbles
  • Compare and contrast data cleaning with different tools
  • Create and work with data in R

Explore data and R

Data in R

Hey, it's great to have you back. Now that we've been introduced to R and programmed with it, let's learn about even more ways you can use R during our analysis process. We'll start by learning more about data frames and how to use them, and then explore how to work with our data in different ways using tidyverse packages. After that, we'll cover how to check for bias in R. R's community has really helped me grow as a data analyst, especially when it comes to processes like data cleaning. R helps me clean more efficiently and I can turn to a community of folks to learn how they have cleaned similar data. Sharing knowledge of R and being able to code review has improved my work a ton. I'm so excited to show you new ways to work with R and get more out of your data. Earlier, I mentioned that learning R was going to be fun. Here's where we get to take what we've learned so far and put it in action. When you're ready, you can get started in the next video. See you soon.

R data frames

Hey, welcome back. Before we can start cleaning and organizing our data or even check it for bias, we need to get our data into a usable format. This is where data frames come in. You might remember we talked a little bit about data frames before.

In this video, we'll learn more about what data frames are and how you can use them.

Let's get started.

First, let's talk about what a data frame is. A data frame is a collection of columns. It's a lot like a spreadsheet or a SQL table. Here's an example of a data frame in R. It's a lot like other tables we've worked with throughout this program. There are column names, rows, and cells with data. Each column contains one variable, and each row holds a set of values that match the columns. We use data frames for a lot of the same reasons as tables, too. They help summarize data and put it into a format that's easy to read and use. There are some things to know about data frames before working with them. We'll learn more about data frames throughout this program, but this is a great starting point. First, columns should be named. Using empty column names can create problems with your results later on.

Let's think back to our example.

Each of the columns are named based on the variable they represent. There's carat, cut, color, clarity, depth. All of these columns represent data about the diamonds.

Next, it's important to know that the data stored in your data frame can be many different types, like numeric, factor, or character. Often data frames contain dates, time stamps and logical vectors.

Finally, each column should contain the same number of data items, even if some of those data items are missing. Data frames are foundational.
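As a minimal sketch of those characteristics (the names and values here are made up for illustration), a data frame with named columns of equal length can be built like this:

# A small hand-built data frame: named columns, all the same length
musicians <- data.frame(
  name = c("Ada", "Ben", "Cara"),
  instrument = c("piano", "violin", "cello"),
  years_played = c(5, 3, 8)
)

print(musicians)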

Now let's talk about tibbles. In the tidyverse, tibbles are like streamlined data frames. They make working with data easier, but they're a little different from standard data frames.

First, tibbles never change the data types of the inputs. They won't change your strings to factors or anything else. You can make more changes to base data frames, but tibbles are easier to use. This saves time because you won't have to do as much cleaning or changing data types in tibbles.

Tibbles also never change the names of your variables,

and they never create row names.

Finally, tibbles make printing in R easier. They won't accidentally overload your console because they're automatically set to pull up only the first 10 rows and as many columns as fit on screen. Super useful when you're working with large sets of data. Data frames and tibbles are the building blocks for analysis in R so having set standards for how they're built and dealt with is pretty important. If we all have the same understanding of what a data frame is, we can communicate more effectively. It's like we're all speaking the same language. It's also just a lot more practical. We need to be able to do things like define columns and review code easily in R. These characteristics make it easier to share your data and reproduce your analysis.
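As a quick hedged sketch of that printing behavior, converting a base R data frame to a tibble with as_tibble() shows the difference:

library(tidyverse)

# mtcars is a base R data frame with row names; as a tibble it loses
# the row names and prints only the first 10 rows by default
cars_tibble <- as_tibble(mtcars)
print(cars_tibble)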

Consistent data structures like data frames make it easier to operate on an entire dataset. Tidy data refers to the principles that make data structures meaningful and easy to understand. It's a way of standardizing the organization of data within R.

These standards are pretty straightforward.

Variables are organized into columns.

Observations are organized into rows

and each value must have its own cell.

Now that you know a little more about data frames, let's start using them.

Coming up, I'll teach you how to create data frames, add data to them and expand them. Bye for now.

Question:

A data frame is a collection of columns. It is similar to a table in spreadsheets or SQL.

Question 2:

Working with data frames

Hey there. Earlier, we learned about data frames and their key characteristics. Now we'll actually start working with them. As a data analyst, a lot of your work will depend on data frames. If you don't create a data frame, your ability to work with your data will be limited. Think about spreadsheets. That basic structure of columns and rows carries over to R. Data frames are basically the data analyst's default way to interact with data. That's why knowing how to create and work with data frames is so important. So let's check out an example. Here we'll use R's built-in data frames. One of the great things about R and R packages is that there's a lot of interesting, easy-to-access datasets built in. These datasets let you practice some of the tools we've been learning.

Let's open RStudio and get started.

We'll use a preloaded dataset with information about diamonds. This data set is part of the ggplot2 package in the tidyverse. So make sure you first load ggplot2. We'll learn how to load our own datasets later too. But diamonds is a good dataset to practice with.

We can load this data now by using the data() function, typed as data with open and closed parentheses. You might notice that when we start to type diamonds, RStudio gives us the option to select it from its drop-down menu. That's because this dataset already exists in our library. Okay, now let's add this data frame to our data viewer.

There are ten columns and 100 rows in this data frame, but we might not want to see all of it.

Correction: if you look closely at the bottom of the diamonds data set, you will see there are actually 53,940 entries (or observation rows) in total and not 100. In order to get a much shorter and simpler overview of the data observations, we will use the head() function introduced next.

We can use the head function to give us just the first six rows. This is a nice preview of the entire dataset. Accidentally printing the full data frame to the console can be annoying and can take a long time to compute. You can avoid printing the full data frame by using functions like head to get a quick preview.
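Putting those steps together, a minimal sketch of this walkthrough might look like:

library(ggplot2)

# Load the built-in diamonds dataset and open it in the data viewer
data(diamonds)
View(diamonds)

# Preview just the first six rows instead of printing all 53,940
head(diamonds)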

We can also get the structure of the data frame using functions like str() and colnames(). These are just two functions you can use to check out your data.

We'll explore other functions like glimpse later on.

For example, we could use the structure function to highlight the structure of this data frame. This gives us some high-level info like the column names and the type of data contained in those columns.

But if we just want to know the column names we can use colnames instead. Here we have carat, cut, color, clarity, depth, all of the columns included in this data set.
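In code, those two checks look like this:

# High-level structure: dimensions, column names, and column types
str(diamonds)

# Just the column names
colnames(diamonds)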

We can also use the mutate function to make changes to our data frame.

The mutate function is part of the dplyr package which is in the tidyverse. So you'll need to load the tidyverse library before you test out mutate.

Let's add a new column first. All we have to do is input mutate and then tell R we want to add a new column to the diamonds data frame. We'll first call mutate followed by the name of the data frame we want to change. Then we'll add a column and the name of the new column we want to create. Then we want to calculate this new column. In this case, to make it easier to read the carat column we'll multiply it by 100 to create a new column carat_2.

And when we run this, presto, our data frame has a new column.

You won't lose any columns when you create the new one. The rest of the data frame will still be the same.
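A sketch of that mutate call (assuming the tidyverse is loaded):

library(tidyverse)

# Add carat_2 (carat scaled by 100); all original columns are kept
mutate(diamonds, carat_2 = carat * 100)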

Data frames are usually the starting point for analyzing data in R. So it's important to understand the characteristics of data frames and how to create them. Great job, and I'll see you next time.

Question:

The head() function provides a preview of the first six rows of a data frame. This is useful if you want to quickly check out the data, but don’t want to print the entire data frame.

Hands-On Activity: Create your own data frame (Practice Quiz)

More about tibbles (Reading)

Data-import basics (Reading)

Hands-On Activity: Importing and working with data (Reading)

Data in R versus SQL (Discussion Prompt)

Test your knowledge on R data frames (Practice Quiz)

Cleaning data

Cleaning up with the basics

Hi again. Now that we've got a little more experience with data frames, we can start doing some interesting things like clean, standardize, manipulate, and visualize data. We'll go through some common tasks that you'll perform as a data analyst. But we're just scratching the surface of what you might want to do in R. We'll start with the basics and learn how to clean up our columns. There will be a reading with a handy list you can refer to afterwards, too.

Let's install the here, skimr, and janitor packages now. We'll go ahead and open our console.

First, we'll add the Here package. This package makes referencing files easier. To install it, we'll just write install.packages. Then in the parentheses, we'll put Here and RStudio will install it. After we install it, we'll also need to load it using library.

Next, we'll install Skimr and Janitor. As a quick reminder, these packages simplify data cleaning tasks. They're both really useful and do slightly different things. The skimr package makes summarizing data really easy and lets you skim through it more quickly. We'll install it now.

The Janitor package has functions for cleaning data. After it's done installing, we'll still need to load it.

Finally, we want to make sure the dplyr package is loaded since we are going to be using some of its features. There, now we've got all the packages we need for basic data cleaning.
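A sketch of those install-and-load steps:

install.packages("here")
install.packages("skimr")
install.packages("janitor")

library(here)
library(skimr)
library(janitor)
library(dplyr)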

Now, let's load some data in.

Later, when you're practicing with your own data, you can use the read functions to grab a file. For example, if you had a CSV you wanted to load, you could write read_csv() and input the file name in the parentheses. This is where the here package comes in handy. Be sure to install and load the here package before trying to reference your CSV files.
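For example, a hedged sketch (the folder and file name here are hypothetical):

library(readr)
library(here)

# here() builds a path relative to the project root;
# "data/hotel_bookings.csv" is a made-up example file
bookings <- read_csv(here("data", "hotel_bookings.csv"))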

For now, we'll load a really fun package to practice with: the palmerpenguins package. This is a dataset we've used before, but just as a quick reminder, the palmerpenguins data has lots of information about three penguin species in the Palmer Archipelago, including size measurements, clutch sizes, and blood isotope ratios. Who doesn't love penguins?

First, we'll install the package. We'll type install.packages and input palmerpenguins. Then remember to load it by using the library function.
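In code:

install.packages("palmerpenguins")
library(palmerpenguins)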

Now that we've got this data loaded into our library, we can try some cleaning functions on our columns. There are a few different functions that we can use to get summaries of our data frame: skim_without_charts, glimpse, head, and select.

The skim_without_charts function gives us a pretty comprehensive summary of a dataset. Let's try it out. When we run this, we get a lot of info back. First, it gives us a summary with the name of the dataset and the number of rows and columns. It also gives us the column types and a summary of the different data types contained in the data frame.

Or we could use Glimpse to get a really quick idea of what's in this dataset. When we run this command, it'll show us a summary of the data. There's 344 rows and eight columns. We have species, island, measurements for bills, which are basically beaks and flippers, the penguins' body mass in grams, the sex, and finally, the year the data was recorded.

We can also use Head to get a preview of the column names and the first few rows of this data set. Having the column names summarized like this will make it easier to clean them up. We can use select to specify certain columns or to exclude columns we don't need right now.
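A sketch of the three summary calls described above (assuming skimr, the tidyverse, and palmerpenguins are loaded):

# Comprehensive summary: row/column counts, column types, per-column stats
skim_without_charts(penguins)

# Quick structural overview: 344 rows, 8 columns
glimpse(penguins)

# Column names plus the first six rows
head(penguins)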

Let's say we only need to check the species column. We can input penguins, then a pipe to indicate we'll add another command, and our select. We'll jump back into an R script because it will be easier to see. Now we have just the species column, or maybe we want everything except the species column. We'll put minus species instead of species, and now we have every column but species.

The select statement is useful for pulling just a subset of variables from a large dataset. This lets you focus on specific groups of variables. There's a lot of other select functions that build on this that we'll cover later on.
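In code, the two select calls look like:

# Keep only the species column
penguins %>%
  select(species)

# Everything except the species column
penguins %>%
  select(-species)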

Now that we know our column names, we've got a better idea of what we might want to change.

The rename function makes it easy to change column names. Starting with the penguin data, we'll type rename and change the name of our island column to island underscore new. Now, looking at our column names, we can see the column name has changed.

Or let's say we want to change our columns so that they're spelled and formatted correctly. In spreadsheet programs, as long as our column names are meaningful, they're fine. But since we have to type the column names over and over in R, we need them to be consistent. Similar to the rename function, the rename_with() function can change column names to be more consistent. For example, maybe we want all of our column names to be in uppercase. We can use the rename_with() function with toupper to do that. This will automatically make our column names uppercase. But since variable names are usually lowercase, we'll use tolower to change them back.
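A sketch of both renaming approaches described above:

# Rename one column: new_name = old_name
penguins %>%
  rename(island_new = island)

# Make every column name uppercase, then lowercase again
rename_with(penguins, toupper)
rename_with(penguins, tolower)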

The clean names function in the Janitor package will automatically make sure that the column names are unique and consistent. Let's try the clean names function on our penguins data. This ensures that there's only characters, numbers, and underscores in the names.
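In code:

# Ensure names are unique and contain only characters, numbers, and underscores
clean_names(penguins)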

Now you know some functions for cleaning columns in your datasets.

Try practicing them on your own with the palmerpenguins data. Once you're comfortable with these functions, we'll learn even more about data cleaning in R. See you soon.

Question:

The skim_without_charts() and glimpse() functions both return a summary of the data frame, including the number of columns and rows.

Question 2:

glimpse()

File-naming conventions (Reading)

More on R operators (Reading)

Organize your data

Hey, great to have you back. We've learned how to create data frames and perform some basic cleaning functions. Now it's time to start getting organized in R. Coming up I'll teach you some functions that will help you organize and filter your data. These functions look a little different in R than in the other tools we've used so far. But the reason we use them is still the same. If we don't organize our data we can't turn information into knowledge. Organizing our data and comparing different metrics from that data helps us find new insights. In other words it makes our data useful.

To help us do this, we'll use the arrange, group_by, and filter functions. Let's start by sorting our data.

We'll keep working with the palmerpenguins data from earlier. In case you don't remember it, refer to the link below.

If you haven't already installed the palmerpenguins package in RStudio, refer to the palmerpenguins package installation instructions.

We'll also need to load the right packages. All the packages we'll need are part of the core tidyverse, so let's load the core tidyverse now. We can use the arrange function to choose which variable we want to sort by. For example, let's say we want to sort our penguin data by bill length. We'll type in arrange and our column name. When we execute this command, it will return a tibble with data sorted by bill length, currently in ascending order. If we want to sort it in descending order, we just add a minus sign before the column name. And now the longest penguin bill is first. It's important to remember this output is just in our console. To save it as a data frame, we'll start by naming it. Then we'll input the function we used to arrange the previous version of the penguins data. When we execute this, it'll save a new data frame, and we can use View(penguins2) to add it to our data viewer. This lets you save cleaned data without losing information from the original dataset.
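A sketch of that sorting-and-saving workflow:

# Sort by bill length, ascending
penguins %>% arrange(bill_length_mm)

# Add a minus sign for descending order
penguins %>% arrange(-bill_length_mm)

# Save the sorted result as a new data frame, leaving the original intact
penguins2 <- penguins %>% arrange(-bill_length_mm)
View(penguins2)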

You can also group data using the group_by function. The group_by function is usually combined with other functions.

For example, we might want to group by a certain column and then perform an operation on those groups. With our penguin data, we can group by island and then use the summarize function to get the mean bill length. We checked out the summarize function when we introduced piping. Basically the summarize function lets us get high-level information about our penguin data. So let's build our group by statement first.

We're not interested in NA values, so we can leave those out using the drop_na() function. This addresses any missing values in our dataset. It's important to be careful when using drop_na(): it's useful for a group-level summary statistic like this, but it removes rows from the data.

Now let's use summarize. We'll title the summary column mean bill length millimeters. And then we'll build the mean statement.

And when we run this we get a data frame with the three islands and the mean bill length of the penguins living there.

We can get other summaries too, for example, if we want to know the maximum bill length, we can write a similar function and replace mean with max. And now we know that the penguin with the longest bill lived on Biscoe island.
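A sketch of those two island-level summaries (assuming the tidyverse and palmerpenguins are loaded):

# Mean bill length per island, dropping rows with missing values
penguins %>%
  group_by(island) %>%
  drop_na() %>%
  summarize(mean_bill_length_mm = mean(bill_length_mm))

# Swap mean for max to find the longest bill per island
penguins %>%
  group_by(island) %>%
  drop_na() %>%
  summarize(max_bill_length_mm = max(bill_length_mm))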

Both group by and summarize can perform multiple tasks. For example, we could group by island and species and then summarize to calculate both the mean and max. To do that, we can write a similar command. We'll put species and island in our group by and drop any missing values. And then we can add a summarize statement with a max and mean calculation. And when we run this, we have both groupings and the max and mean. Thanks to piping we can combine all of these cleaning and transforming tasks into one code chunk.

data %>%
  group_by(category_column) %>%
  summarize(mean_value = mean(numeric_column), max_value = max(numeric_column))
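With the penguins data, that combined chunk might look like:

penguins %>%
  group_by(species, island) %>%
  drop_na() %>%
  summarize(max_bl = max(bill_length_mm), mean_bl = mean(bill_length_mm))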

Finally we can filter results using the filter function. Let's say we only want data on Adelie penguins. We'll start with the dataset we're using and then add the filter. You might notice that we're using two equal signs here; that's on purpose. The double equal sign means exactly equal to in R. And now we have a data frame that only contains data on Adelie penguins. This lets us narrow down our analysis if we need to.
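In code:

# == means "exactly equal to" in R
penguins %>%
  filter(species == "Adelie")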

Being able to clean and organize data is a key step in the data analysis process, and knowing the right tool for the job is an important skill for a data analyst. R makes wrangling data easier and gives you a lot of functionality across the different stages of the data analysis process. Now that we've cleaned our data, we can get ready to transform it.

Coming up, we'll learn how to use the separate, unite and mutate functions and how to use them to transform our data in R. See you next time.

Hands-On Activity: Cleaning data in R (Practice Quiz)

Optional: Manually create a data frame (Reading)

Transforming data

Welcome back. So far, we've started cleaning and working with data in R. Now, let's talk about how to transform data. Sometimes you need to be able to break up a variable across multiple columns, combine existing columns, or even add new values to your data frame.

Coming up, we'll use the separate, unite and mutate functions to transform our data in R. Luckily, the packages already downloaded into our library have some tools we can use to do just that. Let's open RStudio Cloud and check them out.

To start, we'll create a data frame from scratch. For this example, we'll create a standard data frame, so that we can test out other functions. But you could also make a tribble here since we're manually inputting the data. You'll learn more about tribbles in a reading.

For our dataset, we are going to copy and paste some data to create our own data frame. If you want to use the same data to follow along, check out the earlier reading. Our data contains employee information, including names and job titles. You can just copy it in.

We can then name the data frame employee, indicate the column names as id, name and job title, and print the whole data frame. Right now, the first and last names are combined into one column.

We can use the separate function to split these into separate columns. We'll start with separate, and then the data frame we want to work with and the column we'd like to separate. Then we'll add what we'd like to split the name column into. We'll just name these new columns, first name and last name. And finally, we'll tell R to separate the name column at the first blank space. When we run this, it will build us new columns for the first and last names.
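A sketch of that separate call (assuming the employee data frame from the earlier step, with a combined name column):

# Split name into first_name and last_name at the blank space
separate(employee, name, into = c("first_name", "last_name"), sep = " ")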

The separate function has a partner, unite. The unite function allows us to merge columns together. Basically, it does the opposite of separate. Let's say we're working with the version of this data frame with two name columns, and we want to combine them. We'll copy in this data as well.

Note:

id <- 1:10  # id values assumed for this sketch; the earlier reading defines its own

first_name <- c("John", "Rob", "Rachel", "Christy", "Johnson", "Candace", "Carlson", "Pansy", "Darius", "Claudia")

last_name <- c("Mendes", "Stewart", "Abrahamson", "Hickman", "Harper", "Miller", "Landy", "Jordan", "Berry", "Garcia")

job_title <- c("Professional", "Programmer", "Management", "Clerical", "Developer", "Programmer", "Management", "Clerical", "Developer", "Programmer")

employee <- data.frame(id, first_name, last_name, job_title)

print(employee)

Our unite statement's a lot like our separate. We'll start with unite and indicate the data frame we're referring to. Then, we'll name the column we're combining first name and last name in. And then we'll say which columns we're combining. No quotation marks needed here. And finally, we can include a space that separates them. And when we run that, those two columns are combined.
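A sketch of that unite call:

# Merge first_name and last_name into one name column, separated by a space
unite(employee, "name", first_name, last_name, sep = " ")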

In addition to separating and merging columns, we can also create new variables in our data frame using the mutate function. We worked with mutate a little bit before to clean and organize our data. But mutate can also be used to add columns with calculations.

Let's go back to our penguin dataset. Right now, the body mass column is measured in grams. Maybe we want to add a column with kilograms. To do that, we'll use mutate to perform the conversion and add a new column. And it will return a tibble with our new column.

You can make calculations on multiple new variables by adding a comma. Let's add a column converting the flipper length too.
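A sketch of both conversions in one mutate call:

penguins %>%
  mutate(body_mass_kg = body_mass_g / 1000,
         flipper_length_m = flipper_length_mm / 1000)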

Now, we've learned how to transform existing data in our tables and how to create new variables. Separate, unite and mutate are some basic functions that we'll keep building on, and you might discover new ways to use them while you're practicing too.

Coming up, we'll talk more about summarizing data frames and how to address bias. I'll see you soon.

Wide to long with tidyr (Reading)

Clean, organize, and transform data with R (Ungraded Plugin)

Test your knowledge on cleaning data (Practice Quiz)

Take a closer look at the data

Same data, different outcome

Great to have you back. Earlier, we talked about summarizing data in R. We even used the summarize function to calculate the mean for one of our penguin data variables.

Now, we'll work with a very famous data example: Anscombe's quartet. Anscombe's quartet has four datasets that have nearly identical summary statistics. But those summary statistics might be misleading. Data visualizations, especially for datasets like these, are so important. They help us discover things in our data that would otherwise remain hidden. Plus, you'll discover some of the ways R can create awesome visualizations.

Let's install the packages we need; the quartet data frame used here ships with the Tmisc package. This may take a few minutes to load.

Let's load the Anscombe's quartet data now. When we view this data, we notice that there are four sets of x and y variables in the data frame. That's the quartet.

Data can be summarized by different statistical measures. We'll get a summary of each set with the mean, standard deviation, and correlation for each of these datasets. We'll start by indicating that we want to group our data by set. Then we'll input our summarize function. When we run this, we'll get a summary of these statistical measures. In our summary table, we can check the mean. The mean for x in each dataset is nine, and the mean for y is 7.5. The standard deviation can help us understand the spread of values in a dataset and show us how far each value is from the mean. The standard deviation for x and y in every set in the quartet is the same, 3.32 and 2.03. Finally, we've got our correlation, which shows us how strong the relationship between two variables is. Here, it looks like the correlation between x and y in all four sets is around 0.816. Based on the summaries we created with our statistical measures, these datasets are identical, but sometimes just looking at the summarized data can be misleading.
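A sketch of that summary (assuming the quartet data frame from the Tmisc package and a loaded tidyverse):

install.packages("Tmisc")
library(Tmisc)
library(tidyverse)

data(quartet)

# Mean, standard deviation, and correlation for each of the four sets
quartet %>%
  group_by(set) %>%
  summarize(mean(x), sd(x), mean(y), sd(y), cor(x, y))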

Let's put together some simple graphs to help us visualize this data and check if the datasets are actually identical. You'll learn more about plotting data in R later. But for now, we'll just get a quick idea of how this data appears.

ggplot(quartet, aes(x, y)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  facet_wrap(~set)

Check it out. These four datasets appear quite different when we visualize them. If we'd just gone with the statistical summaries, we never would have known that this data is actually really different.

aes() is a function in the ggplot2 package that maps variables in your data to the visual attributes (aesthetics) of the graph.

library(ggplot2)

# Generate sample data
data <- data.frame(
  x = rnorm(100),
  y = rnorm(100),
  category = rep(c("A", "B"), each = 50)
)

# Create the ggplot scatterplot
ggplot(data, aes(x = x, y = y, color = category)) +
  geom_point()

When your data already comes from external files, such as .xlsx or .csv, the usage looks like this:

# Install the readxl package
# install.packages("readxl")

library(readxl)

# Import an .xlsx file
data <- read_excel("path/filename.xlsx")

# Check the imported data
head(data)

# Import a .csv file
data <- read.csv("path/filename.csv")

# Check the imported data
head(data)

I want to show you one more really cool thing: the datasauRus package. The datasauRus package generates datasets that, like Anscombe's quartet, share nearly identical summary statistics but take on very different shapes when plotted. But let's run it to see that for ourselves. First, you'll start off with installing and loading the package. Then we'll create a chart. It's okay if these commands seem complicated. You'll be creating your own plot soon. This is just a sneak peek at how R can help you create data visualizations.

Note:

install.packages('datasauRus')
library('datasauRus')

ggplot(datasaurus_dozen, aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  theme_void() +
  theme(legend.position = "none") +
  facet_wrap(~dataset, ncol = 3)

When we run this, it shows us several different plots. There's the famous dinosaur, a bull's eye, a star. R is a pretty powerful visualization tool. You could use the relationships between data points to create many other shapes. As you can see, you can do a lot of things with R. Data visualizations like the ones we just explored help you discover so much more about the data you're working with. It's important to explore your data in multiple ways to learn a little more about its story.

Next, we'll learn how to use R functions to check for bias.

The bias function

Hey, welcome back. By now, you've already learned the importance of fair, unbiased data in data analysis.

In R, we can actually quantify bias by comparing the actual outcome of our data with the predicted outcome. There's a pretty complicated statistical explanation behind this. But with the bias function in R, we don't have to perform this calculation by hand. Basically, the bias function finds the average amount that the actual outcome is greater than the predicted outcome. It's included in the SimDesign package, so it's helpful to install that and practice on your own. If the model is unbiased, the result should be pretty close to zero. A high result means that your data might be biased. A good thing to know before you analyze it.

Let's say we're working with a local weather channel to determine if their weather predictions are biased.

First we need to install and load a package called SimDesign.

We'll use the bias function to compare forecasted temperatures with actual temperatures. For this example, we'll just take a small sample of our weather data and input it here. We'll label this actual_temp. Then we'll put in the predictions. And then the bias function. When we run this, we find out that the result is 0.71.
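A sketch of that comparison; the temperature values are sample data chosen to reproduce the 0.71 result described here:

install.packages("SimDesign")
library(SimDesign)

# Sample data for illustration
actual_temp <- c(68.3, 70, 72.4, 71, 67, 70)
predicted_temp <- c(67.9, 69, 71.5, 70, 67, 69)

bias(actual_temp, predicted_temp)  # roughly 0.71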

That's pretty close to zero, but the predictions seem biased toward lower temperatures, which means they aren't as accurate as they could be. And now that the local weather channel knows about this, they can find the problem in their system that's causing biased predictions. This doesn't mean that their predictions will be perfect all the time, but they'll be more accurate overall.

Let's try another example. In this scenario, we're working for a game store. The store has been keeping a record of how many copies of new games they order for release day. They want to compare those numbers to their actual sales to find out whether they're ordering stock according to their actual needs.

Just like the previous example, we'll start by inputting our sales data. We'll label it actual_sales and add the data points. Next, we'll input the amount of stock they ordered as predicted_sales and then input those data points. And now we have our data ready to go.

As you learned in the first example, the bias function compares the actual outcome and the predicted outcome of the data to determine the average amount the actual outcome is greater than the predicted outcome. An unbiased model should be close to zero.

Let's run the bias function on our sales data now.

Like before, we'll just type bias to start the function, then actual_sales and predicted_sales in the parentheses. When we press enter... wow, the result is negative 35. That's pretty far from zero. The predicted outcome is larger than the actual outcome, which means they may be ordering too much stock for release days. Now that they've used the bias function to compare these data points, they can reevaluate their stocking practices to avoid buying more stock than they need at once.
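A sketch of the full example; the sales figures are sample data chosen to reproduce the -35 result described here:

# Sample data for illustration (assuming SimDesign is loaded)
actual_sales <- c(150, 203, 137, 247, 116, 287)
predicted_sales <- c(200, 300, 150, 250, 150, 300)

bias(actual_sales, predicted_sales)  # -35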

And that's it for now. We've covered a lot together. We learned how to create data frames. We tried out some basic data cleaning functions. We got a little preview of how data viz in R can help us better understand our data. And finally we learned how to use the bias function.

I've still got a lot more I want to tell you about R, and if the data visualizations we created in this module were exciting for you, I've got great news.

Coming up, we'll learn all about data viz in R, but first you've got a weekly challenge to tackle. I know you're going to do great. And if you want to review any of the material we've covered in these videos, feel free. This might be the first time you've encountered R, so it's a great opportunity to practice something new. Your code might throw some errors at first. That's just part of writing code. Learning from our mistakes is how we grow. I'll see you afterwards for our next adventure in R.

Working with biased data (Reading)

Hands-On Activity: Changing your data (Practice Quiz)

Compare data cleaning on different platforms (Discussion Prompt)

Test your knowledge on R functions (Practice Quiz)

Module 3 challenge


Course 7 Module 3 Glossary