7.4.1. Create data visualization in R
Hands-On Activity: Visualizing data with ggplot2
Activity overview
Earlier in this course, you encountered ggplot2, an R package for data visualization. In this activity, you’ll learn about the basic logic of data visualization in ggplot2 and how to create a plot using R code.
By the time you complete this activity, you’ll be able to write R functions that create data visualizations. This will enable you to create basic visualizations to demonstrate and share findings with your data and code.
The basics of ggplot2
The ggplot2 package lets you make high quality, customizable plots of your data. As a refresher, ggplot2 is based on the grammar of graphics, which is a system for describing and building data visualizations. The essential idea behind the grammar of graphics is that you can build any plot from the same basic components, like building blocks.
These building blocks include:
- A dataset
- A set of geoms: A geom refers to the geometric object used to represent your data. For example, you can use points to create a scatterplot, bars to create a bar chart, lines to create a line diagram, etc.
- A set of aesthetic attributes: An aesthetic is a visual property of an object in your plot. You can think of an aesthetic as a connection, or mapping, between a visual feature in your plot and a variable in your data. For example, in a scatterplot, aesthetics include things like the size, shape, color, or location (x-axis, y-axis) of your data points.
To create a plot with ggplot2, you first choose a dataset. Then, you determine how to visually organize your data on a coordinate system by choosing a geom to represent your data points and aesthetics to map your variables.
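As a rough illustration of how the three building blocks fit together, the sketch below uses the mpg dataset that ships with ggplot2. The choice of mpg and of the columns displ, hwy, and class is an assumption made for this example only; it is not part of the activity.

```r
library(ggplot2)

# Dataset: mpg (bundled with ggplot2; chosen here only for illustration).
# Geom: points, which gives a scatterplot.
# Aesthetics: x position, y position, and color, each mapped to a column.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

print(p)
```

Swapping any one block (a different dataset, geom, or aesthetic mapping) changes the plot while the overall structure of the code stays the same.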
Prepare your data
The ggplot2 package lets you use R code to specify the dataset, geom, and aesthetics of your plot.
To do this, first choose a dataset to work with. For this activity, you will use the Palmer Penguins data that you’re already familiar with from earlier videos. However, you can also use another dataset instead.
Once you decide on your dataset, open RStudio and follow these steps:
- If you have not done so before, use the install.packages() function to install both ggplot2 and the Palmer Penguins dataset. Type install.packages("ggplot2") and install.packages("palmerpenguins"), then click Run. Use straight quotation marks in your code, because R does not accept curly ones.
- Load ggplot2 and the dataset using the library() function. Type library(ggplot2) and library(palmerpenguins).
- Now, examine the data frame for the penguins data. To do this, use the data() and View() functions. Use a capital “V” for the View() function since functions in R are case sensitive. Type data(penguins) and View(penguins), then click Run.
The first 10 rows of the data frame should appear like this:
The penguins dataset contains size measurements for three penguin species (Adelie, Chinstrap, and Gentoo) that live on the Palmer Archipelago in Antarctica. The columns include information such as body mass, flipper length, and bill length.
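The setup steps above can be collected into one short script. This is a sketch rather than the activity's official code: the requireNamespace() check, head(), and str() are additions that skip reinstalling packages and print to the console instead of opening the View() tab.

```r
# Install each package only if it is not already available.
for (pkg in c("ggplot2", "palmerpenguins")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

library(ggplot2)
library(palmerpenguins)

# Load the data frame, naming the package explicitly to avoid ambiguity.
data(penguins, package = "palmerpenguins")

head(penguins, 10)  # first 10 rows, printed in the console
str(penguins)       # column names and types
```

In RStudio, View(penguins) shows the same data frame in a spreadsheet-style tab.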
Create a plot in ggplot2
Suppose you want to plot the relationship between body mass and flipper length in the three penguin species. You can choose a specific geom that fits the type of data you have. Points show the relationship between two quantitative variables. A scatterplot of points would be an effective way to display the relationship between the two variables. You can put flipper length on the x-axis and body mass on the y-axis.
Type the following code to create the plot. But before you run it, review the code piece by piece:
ggplot(data = penguins) + geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g))
ggplot(data = penguins): In ggplot2, you begin a plot with the ggplot() function. The ggplot() function creates a coordinate system that you can add layers to. The first argument of the ggplot() function is the dataset to use in the plot. In this case, it’s “penguins.”
+: Then, you add a “+” symbol to add a new layer to your plot. You complete your plot by adding one or more layers to ggplot().
geom_point(): Next, you choose a geom by adding a geom function. The geom_point() function uses points to create scatterplots, the geom_bar() function uses bars to create bar charts, and so on. In this case, choose the geom_point() function to create a scatterplot. The ggplot2 package comes with many different geom functions. You’ll learn more about geoms later in this course.
(mapping = aes(x = flipper_length_mm, y = body_mass_g)): Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with the aes() function. The x and y arguments of the aes() function specify which variables to map to the x-axis and the y-axis of the coordinate system. In this case, you want to map the variable “flipper_length_mm” to the x-axis, and the variable “body_mass_g” to the y-axis.
Now go ahead and run the code. When you do, you get the following plot:
The plot shows a positive relationship between the two variables. In other words, penguins with longer flippers tend to have greater body mass.
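One way to check the direction of a relationship numerically is a correlation coefficient. The sketch below is an addition to the activity and assumes the palmerpenguins package is installed. It uses base R's cor() function, and the use = "complete.obs" argument skips rows with missing measurements.

```r
library(palmerpenguins)

# Pearson correlation between flipper length and body mass.
# A value above 0 means the two variables tend to increase together.
r <- cor(penguins$flipper_length_mm, penguins$body_mass_g,
         use = "complete.obs")
r
```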
Create your own plot
To create your own plot using code, follow these three steps:
- Start with the ggplot() function and choose a dataset to work with.
- Add a geom_ function to display your data.
- Map the variables you want to plot in the arguments of the aes() function.
Try plotting with different datasets using different geoms and mapping arguments. Coming up in this course, you’ll learn even more about the process of creating a plot. You’ll also get a chance to work with the Penguins dataset to create lots of different plots in ggplot2.
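As one possible variation on the three steps, the sketch below swaps in a different dataset and geom. The diamonds dataset ships with ggplot2; choosing it and its cut column is an assumption made for illustration.

```r
library(ggplot2)

# Step 1: start with ggplot() and choose a dataset.
# Step 2: add a geom function (bars instead of points).
# Step 3: map a variable inside aes(); geom_bar() counts the rows per category.
bar_plot <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

print(bar_plot)
```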
Pro-Tip: You can write the same section of code above using a different syntax with the mapping argument inside the ggplot() call: ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g)) + geom_point()
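A practical reason to place the mapping in ggplot() is that every layer you add afterward inherits it. The sketch below illustrates this by adding a second layer with geom_smooth(), a ggplot2 function that this activity does not cover; the trend line is included only to show the shared mapping.

```r
library(ggplot2)
library(palmerpenguins)

# The mapping is written once and reused by both layers.
p <- ggplot(data = penguins,
            mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm")  # straight-line trend, added for illustration

print(p)
```

R may warn that rows with missing values were removed; the plot is still drawn.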
The ggplot2 cheat sheet
This is just the beginning of what you can do with ggplot2. If you want to find out more about ggplot2, RStudio has a useful reference guide called the “Data Visualization with ggplot2 Cheat Sheet.” You can use the Cheat Sheet as a quick reference while you work to learn about the main functions and features of ggplot2.
Click the link to check it out: Cheat Sheet
Confirmation
In this activity, you created a scatterplot to show the relationship between flipper length and body mass in three penguin species. Which part of your code refers to the geometric object used to represent your data?
A. geom_point()
B. +
C. (mapping = aes(x = flipper_length_mm, y = body_mass_g))
D. ggplot(data = penguins)
The correct answer is A. geom_point(). Explanation: A geom is the geometric object used to represent your data. In this case, the function geom_point() tells R to represent your data with points.
Common problems when visualizing in R
Coding errors are an inevitable part of writing code, especially when you are first beginning to learn a new programming language. In this reading, you will learn how to recognize common coding errors when creating visualizations using ggplot2. You will also find links to some resources that you can use to help address any coding problems you might encounter moving forward.
Common coding errors in ggplot2
When working with R code in ggplot2, a lot of the most common coding errors involve issues with syntax, like misplaced characters. That is why paying attention to details is such an important part of writing code. When there is an error in your code that R is able to detect, it will generate an error message. Error messages can help point you in the right direction, but they won’t always help you figure out the precise problem.
Let’s explore a few of the most common coding errors you might encounter in ggplot2.
Case sensitivity
R code is case sensitive. If you accidentally capitalize the first letter of a function name, R treats it as a different name and cannot find the function. Here is an example:
Glimpse(penguins)
The error message lets you know that R cannot find a function named “Glimpse”:
Error in Glimpse(penguins) : could not find function "Glimpse"
But you know that the function glimpse (lowercase “g”) does exist. Notice that the error message doesn’t explain exactly what is wrong but does point you in a general direction.
Based on that, you can figure out that this is the correct code:
glimpse(penguins)
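To see the difference for yourself, you can ask R whether each name exists. This sketch assumes that glimpse() comes from the dplyr package and that no other loaded package defines a function called Glimpse. The tryCatch() call captures the error message so the script keeps running.

```r
library(dplyr)           # assumption: glimpse() is provided by dplyr
library(palmerpenguins)

exists("Glimpse")  # is there anything named with a capital "G"?
exists("glimpse")  # is there anything named with a lowercase "g"?

# Capture the error message instead of stopping the script.
msg <- tryCatch(Glimpse(penguins), error = function(e) conditionMessage(e))
msg

glimpse(penguins)  # the correctly spelled call works
```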
Balance parentheses and quotation marks
Another common R coding error involves parentheses and quotation marks. In R, you need to make sure that every opening parenthesis in your function has a closing parenthesis, and every opening quotation mark has a closing quotation mark. For example, if you run the following code, nothing happens. R does not create the plot. That is because the second line of code is missing two closing parentheses:
ggplot(data = penguins) +
geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g
RStudio does alert you to the problem. To the left of the line of code in your RStudio source editor, you might notice a red circle with a white “X” in the center. If you hover over the circle with your cursor, this message appears:
RStudio lets you know that you have an unmatched opening bracket. So, to correct the code, you know that you need to add a closing bracket to match each opening bracket.
Here is the correct code:
ggplot(data = penguins) +
geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g))
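Outside the RStudio editor, you can also ask R to check whether a piece of code is syntactically complete. The sketch below uses base R's parse() function, which reads code without running it. Storing the code as text in the variables unbalanced and balanced is a device for this example, not part of the activity.

```r
# The code is stored as text so that R does not try to run it.
unbalanced <- "ggplot(data = penguins) +
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g"

# Adding the two missing closing parentheses balances the code.
balanced <- paste0(unbalanced, "))")

# parse() checks syntax only; it does not evaluate the code.
result <- tryCatch(
  {
    parse(text = unbalanced)
    "ok"
  },
  error = function(e) conditionMessage(e)
)
result  # the message reports an unexpected end of input

n_expressions <- length(parse(text = balanced))
n_expressions  # the balanced version parses as one complete expression
```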
Using the plus sign to add layers
In ggplot2, you need to add a plus sign (“+”) to your code when you add a new layer to your plot. Putting the plus sign in the wrong place is a common mistake. The plus sign should always be placed at the end of a line of code, and not at the beginning of a line.
Here’s an example of code that includes incorrect placement of the plus sign:
ggplot(data = penguins)
+ geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g))
In this case, R’s error message identifies the problem, and prompts you to correct it:
Error: Cannot use `+.gg()` with a single argument. Did you accidentally put + on a new line?
Here is the correct code:
ggplot(data = penguins) +
geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g))
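The reason placement matters is that R treats a line as finished as soon as it forms a complete statement. The sketch below uses base R's parse() function to count how many statements R reads in each version. The variable names leading and trailing are made up for this example, and the code is stored as text so that nothing is plotted.

```r
# Plus sign at the start of the second line.
leading <- "ggplot(data = penguins)\n+ geom_point()"

# Plus sign at the end of the first line.
trailing <- "ggplot(data = penguins) +\ngeom_point()"

n_leading <- length(parse(text = leading))
n_trailing <- length(parse(text = trailing))

n_leading   # 2: R reads two separate statements
n_trailing  # 1: the trailing + tells R the statement continues
```

In the first version, the second statement begins with a plus sign that has nothing on its left, which is what triggers the error message.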
You also might accidentally use a pipe instead of a plus sign to add a new layer to your plot, like this:
ggplot(data = penguins)%>%
geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g))
You then get the following error message:
Error: `data` must be a data frame, or other object coercible by `fortify()`, not an S3 object with class gg/ggplot
Here is the correct code:
ggplot(data = penguins) +
geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g))
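The pipe and the plus sign can appear in the same block of code, as long as each does its own job: the pipe passes data from one function to the next, and the plus sign adds layers to a plot. The sketch below assumes the dplyr package is installed; the filter() step that keeps only Gentoo penguins is an arbitrary choice for illustration.

```r
library(ggplot2)
library(dplyr)
library(palmerpenguins)

p <- penguins %>%
  filter(species == "Gentoo") %>%  # pipe: the data flows into the next function
  ggplot(mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()                     # plus sign: a layer is added to the plot

print(p)
```

Once ggplot() has been called, every following step is a layer, so it is joined with a plus sign rather than a pipe.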
Keeping these issues in mind and paying attention to details when you write code will help you reduce errors and save time, so you can stay focused on your analysis.
Help resources
Everyone makes mistakes when writing code; it is just part of the learning process. Fortunately, there are lots of helpful resources available in RStudio and online.
R documentation
R has built-in documentation for all functions and packages. To learn more about any R function, just run the code ?function_name. For example, if you want to learn more about the geom_bar function, type:
?geom_bar
When you run the code, an entry on “geom_bar” appears in the Help viewer in the lower-right pane of your RStudio workspace. The entry begins with a “Description” section that discusses bar charts:
The RDocumentation website contains much of the same content in a slightly different format, with additional examples and links.
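Besides the question mark shortcut, base R offers a few related ways to reach the same documentation. The sketch below shows the help() and help.search() functions from R's built-in utils package; the search term is just an example.

```r
library(ggplot2)

# Open the help page for a function from a loaded package.
?geom_bar

# The same page, with the package named explicitly.
help("geom_bar", package = "ggplot2")

# Search help pages when you only remember part of a name.
hits <- help.search("geom_bar", package = "ggplot2")
hits
```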
ggplot2 documentation
The ggplot2 page, which is part of the official tidyverse documentation, is a great resource for all things related to ggplot2. It includes entries on key topics, useful examples of code, and links to other helpful resources.
Online search
Doing an online search for the error message you are encountering (and including “R” and the function or package name in your search terms) is another option. There is a good chance someone else has already encountered the same error and posted about it online.
The R community
If the other resources don’t help, you can try reaching out to the R community online. There are lots of useful online forums and websites where people ask for and receive help, including:
Hands-On Activity: Using ggplot
Activity overview
In the last activity, you got an introduction to visualizing data in ggplot2. In this activity, you’ll dive deeper with ggplot2 to quickly create data visualizations that allow you to explore your data and gain new insights.
By the time you complete this activity, you will have strengthened your understanding of ggplot2 and visualizing data in R. You will be able to use basic ggplot2 syntax and troubleshoot some common errors you might encounter. This will enable you to easily demonstrate and share your insights throughout your career as a data analyst.
Working in RStudio Cloud
To start, log into your RStudio (Posit) Cloud account. Open the project you will work on in the activity with this link, which opens in a new tab. If you haven't gone through this process already, at the top right portion of the screen you will see a "red stamp" indicating this project as a Temporary Copy. Click on the adjacent button, Save a Permanent Copy, and the project will be saved in your main dashboard for use with future lessons. Once that is completed, navigate to the file explorer in the bottom right and click on the following: Course 7 -> Week 4 -> Lesson2_GGPlot.Rmd.
The .csv file you will need, hotel_bookings.csv, is also located in this folder.
If you have trouble finding the correct activity, check out this step-by-step guide on how to navigate in RStudio (Posit) Cloud. Make sure to select the correct R markdown (Rmd) file. The other Rmd files will be used in different activities.
If you are using RStudio Desktop, you can download the Rmd file and the data for this activity directly here:
You can also find the Rmd file with the solutions for this activity here:
Carefully read the instructions in the comments of the Rmd file and complete each step. Some steps may be as simple as running pre-written code, while others may require you to write your own functions. After you finish the steps in the Rmd file, return here to confirm that your work is complete.
Confirmation
In Step 5 of this activity, you mapped columns to the x and y axes of a scatter plot. What syntax did you use to do this?
A. aes(x = stays_in_weekend_nights, y = children)
B. aes(x = 'stays_in_weekend_nights', y = 'children')
C. aes(x = children, y = stays_in_weekend_nights)
D. aes(x = 'children', y = 'stays_in_weekend_nights')
The correct answer is A. aes(x = stays_in_weekend_nights, y = children). Explanation: The correct syntax for mapping columns to axes in this activity is aes(x = stays_in_weekend_nights, y = children). Going forward, you can use the knowledge of mapping and the ggplot2 package to create many kinds of visualizations in RStudio.
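To see why the quoted options do not work, compare what ggplot2 does with a column name and with a quoted string. The sketch below uses a tiny made-up data frame as a stand-in for hotel_bookings.csv; the four rows are hypothetical values, not real bookings. The layer_data() function from ggplot2 returns the values that a layer actually plots.

```r
library(ggplot2)

# Hypothetical stand-in rows; the real activity reads hotel_bookings.csv.
hotel_bookings <- data.frame(
  stays_in_weekend_nights = c(0, 1, 2, 2),
  children = c(0, 0, 1, 2)
)

# Unquoted names refer to columns in the data frame.
by_column <- ggplot(data = hotel_bookings) +
  geom_point(mapping = aes(x = stays_in_weekend_nights, y = children))

# Quoted names are plain text, so every row gets the same x and y value.
by_string <- ggplot(data = hotel_bookings) +
  geom_point(mapping = aes(x = "stays_in_weekend_nights", y = "children"))

# Count the distinct x positions that each version plots.
n_x_column <- length(unique(layer_data(by_column)$x))
n_x_string <- length(unique(layer_data(by_string)$x))
n_x_column
n_x_string
```

With quoted names, all points land on a single position, so the plot no longer reflects the data.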
Visualizations in Tableau versus R
If you’ve taken the previous course on sharing data through storytelling, you may know how to use Tableau to create effective data visualizations. In this course, you’ll discover how to use R code in ggplot2 to create a variety of plots to visualize your data.
Please write one to two paragraphs (150-200 words total) describing your initial thoughts on the difference between Tableau and ggplot2 when it comes to data visualization. Reflect on the following questions:
- What are the strengths and limitations of Tableau when it comes to data visualization? What are your favorite features of Tableau?
- If you’re new to ggplot2, what features do you think will be the most useful for visualizing data?
- How do the visualization tools in Tableau differ from the tools in ggplot2?
Then, visit the discussion forum to read what other learners have written, and engage with two or more posts to share your feedback.
Test your knowledge on data visualizations in R
Question 1
In ggplot2, you can use the _____ function to specify the data frame to use for your plot.
A. ggplot()
B. aes()
C. labs()
D. geom_point()
The correct answer is A. ggplot(). Explanation: In ggplot2, you can use the ggplot() function to specify the data frame to use for your plot.
Question 2
In ggplot2, you use the plus sign (+) to add a layer to your plot.
A. True
B. False
The correct answer is A. True. Explanation: In ggplot2, you use the plus sign (+) to add a layer to your plot.
Question 3
In ggplot2, what function do you use to map variables in your data to visual features of your plot?
A. The ggplot() function
B. The geom_bar() function
C. The geom_point() function
D. The aes() function
The correct answer is D. The aes() function. Explanation: In ggplot2, you use the aes() function to map variables in your data to visual features of your plot. These features are known as aesthetics.
Question 4
What type of plot will the following code create?
ggplot(data = penguins) +
geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g))
A. Bar chart
B. Line diagram
C. Scatterplot
D. Boxplot
The correct answer is C. Scatterplot. Explanation: The code will create a scatterplot. The function geom_point() uses points to create a scatterplot.