R V: Visualization - BDC-training/VT25 GitHub Wiki
Course: VT25 R programming (SC00035)
- In this exercise we will work mainly with the R package ggplot2. Start by loading the
tidyverse
collection of packages which containsggplot2
library(tidyverse)
- During this exercise we will work with the data set
PimaIndiansDiabetes2
. This data can be downloaded from our github page with the code below. Spend a few minutes to get familiar with the data set using thehead()
function and thestr()
function
PimaIndiansDiabetes2 <- read_tsv("https://raw.githubusercontent.com/bcfgothenburg/VT24/main/PimaIndiansDiabetes2.txt")
head(PimaIndiansDiabetes2)
- Type the below line:
PimaIndiansDiabetes2 %>% ggplot
what is created by this line of code?
- Now add aesthetics for
glucose
forx
andpressure
fory
, what changed?
Solution
PimaIndiansDiabetes2 %>%
ggplot(aes(x = glucose, fill = diabetes))
# a plotting area appears without any data points
- The
PimaIndians2
data set has nine columns, Use theggplot()
function with only the columnglucose
as input and make a histogram. Whichgeom
layer do you need to make a histogram?
Solution
PimaIndiansDiabetes2 %>%
ggplot(aes(x = glucose)) +
geom_histogram()
# geom_histogram() is the geom layer used for displaying histograms
- make another histogram where you color the bars by diabetes status (using the aes
fill
). Within thegeom
for histogram, specifyalpha=0.3
, what happens when you do this?
Solution
PimaIndiansDiabetes2 %>%
ggplot(aes(x = glucose, fill = diabetes)) +
geom_histogram(aes(alpha = 0.3))
- Next, instead of a histogram, make a density plot using
geom_density
, use the same setting foralpha
.
Solution
PimaIndiansDiabetes2 %>%
ggplot(aes(x = glucose, fill = diabetes)) +
geom_density(alpha = 0.3)
- Now make a scatterplot with
ggplot
, give theaes()
function two vectors as argument instead:glucose
on the x-axis andpressure
on the y-axis.
Solution
PimaIndiansDiabetes2 %>%
ggplot(aes(x = glucose, y = pressure)) +
geom_point()
- The plot function uses the column names from the data frame as labels for the axis. Usually you want to write something that is more informative here. Add the layers
xlab
andylab
to your plot code, to set the labels to "Plasma Glucose Concentration" and "Diastolic blood pressure(mm Hg)" instead.
Solution
PimaIndiansDiabetes2 %>%
ggplot(aes(x = glucose, y = pressure)) +
geom_point() +
xlab("Plasma Glucose Concentration") +
ylab("Diastolic blood pressure(mm Hg)")
- You can also add a title to the plot using the
ggtitle
layer. Add the layerggtitle()
to your plot code, to give your plot the header: "Scatter plot of Pima Indians data".
Solution
PimaIndiansDiabetes2 %>%
ggplot(aes(x = glucose, y = pressure)) +
geom_point() +
xlab("Plasma Glucose Concentration") +
ylab("Diastolic blood pressure(mm Hg)") +
ggtitle("Scatter plot of Pima Indians data")
- Next, color the points by if the subjects have diabetes or not.
Solution
PimaIndiansDiabetes2 %>%
ggplot(aes(x = glucose, y = pressure, color = diabetes)) +
geom_point() +
xlab("Plasma Glucose Concentration") +
ylab("Diastolic blood pressure(mm Hg)") +
ggtitle("Scatter plot of Pima Indians data")
- Now instead of color by diabetes status, make the plot with facets
Solution
PimaIndiansDiabetes2 %>%
ggplot(aes(x = glucose, y = pressure, color = diabetes)) +
geom_point() +
xlab("Plasma Glucose Concentration") +
ylab("Diastolic blood pressure(mm Hg)") +
ggtitle("Scatter plot of Pima Indians data") +
facet_grid(~ diabetes)
A barplot is a staples diagram that shows the number of elements in different groups. We will create a barplot that shows how many individuals that have a blood pressure larger than 90, and how many are below.
- Use the package
tidyverse
and the functionmutate()
to calculate which of the subjects have high vs low blood pressure as mentioned above, create a new variable namedHighBP
. Then create a barplot of this variable with ggplot.
Solution
PimaIndiansDiabetes2 %>%
mutate(HighBP = pressure > 90) %>%
filter(!is.na(HighBP)) %>%
ggplot(aes(x = HighBP)) +
geom_bar()
- Make a boxplot of glucose, grouped by diabetes status, color the boxes by diabetes status.
Solution
PimaIndiansDiabetes2 %>%
ggplot(aes(x = diabetes, y = glucose, fill = diabetes)) +
geom_boxplot()
- Install and load the R package
ggsignif
. Next add the geomgeom_signif()
to add a p-value for comparing the glucose level of the diabetics to the glucose level of the non-diabetics, by using a Mann-Whitney or Wilcoxon test. Check the help ofgeom_signif()
for details. Is there a significant difference between the groups?
Solution
install.packages("ggsignif")
library(ggsignif)
PimaIndiansDiabetes2 %>%
ggplot(aes(x = diabetes, y = glucose, fill = diabetes)) +
geom_boxplot() +
geom_signif(comparisons = list(c("neg", "pos")),
map_signif_level = FALSE,
test="wilcox.test")
- In this exercise we will use another dataset containing normalized read counts from a metagenomic investigation of microbial diversity. The data is available at https://raw.githubusercontent.com/bcfgothenburg/VT24/main/mine-data.csv. Read the data into R using a suitable function. Check the properties of the datasets using e.g.
str()
.
Solution
mine_data <- read_csv("https://raw.githubusercontent.com/bcfgothenburg/VT24/main/mine-data.csv")
str(mine_data)
- Next, we will use the function
pivot_longer()
, included in thetidyr
package (and hence also thetidyverse
package), to change the format of the dataset from a wide to a long format. Instead of one column for each bacteria, we want one column with bacteria name and one column with the values of each measurement. Check the R help or online resources for information on how this function is used.
Solution
mine_long <- mine_data %>%
pivot_longer(cols = Actinobacteria:Gammaproteobacteria,
names_to = "Class",
values_to = "Abundance")
- Now we will make a heatmap using
ggplot
. For this we use the geomgeom_tile
, put sample names on the x-axis and bacteria class on the y-axis. Fill the tiles based on the abundance values.
Solution
mine_long %>%
ggplot(aes(x = Sample.name, y = Class, fill = Abundance)) +
geom_tile() +
xlab(label = "Sample")
- The variable
Depth
has 3 different values. Now make the heatmap into a faceted usingfacet_grid
Solution
mine_long %>%
ggplot(aes(x = Sample.name, y = Class, fill = Abundance)) +
geom_tile() +
xlab(label = "Sample") +
facet_grid(~ Depth)
- In the faceted heatmap you see that you have some empty columns, because those don't have values in those groups. One way of only showing the columns that contain values is to add the parameters
scales = "free", space="free"
to thefacet_grid()
layer.
Solution
mine_long %>%
ggplot(aes(x = Sample.name, y = Class, fill = Abundance)) +
geom_tile() +
xlab(label = "Sample") +
facet_grid(~ Depth, scales = "free", space="free")
- Make a transformation of the Abundance values by taking the square root, using the function
sqrt()
. Repeat the heatmap above by using the transformed values instead.
Solution
mine_long %>%
mutate(Sqrt.abundance = sqrt(Abundance)) %>%
ggplot(aes(x = Sample.name, y = Class, fill = Sqrt.abundance)) +
geom_tile() +
xlab(label = "Sample") +
facet_grid(~ Depth, scales = "free", space="free")
- Now also add the color scale in the heatmap using
scale_fill_gradient()
, define the parameterslow = "#FFFFFF", high = "#012345"
so the low values will be white and the high values will be dark blue.
Solution
mine_long %>%
mutate(Sqrt.abundance = sqrt(Abundance)) %>%
ggplot(aes(x = Sample.name, y = Class, fill = Sqrt.abundance)) +
geom_tile() +
xlab(label = "Sample") +
facet_grid(~ Depth, scales = "free", space="free") +
scale_fill_gradient(name = "Sqrt(Abundance)",
low = "#FFFFFF",
high = "#012345")
- As often in R instead of doing a lot of manual steps, there is a package available to make life easier. Now instead of using
ggplot
we will use the packagepheatmap
as was previously shown in the demo. First install and load thepheatmap
package. Extract the columns with bacteria abundance from the original dataset, transpose the new matrix and transform the values with asqrt()
.
Solution
install.packages("pheatmap")
library(pheatmap)
tmp <- mine_data %>% select(Actinobacteria:Gammaproteobacteria) %>% t %>% sqrt
- The new matrix has samples as columns and bacterias as rows. Now use the column
Sample.name
in the original data.frame to set the column names in the new matrix.
Solution
colnames(tmp) <- mine_data$Sample.name
- Create a heatmap of this new matrix using the function
pheatmap()
.
Solution
pheatmap(tmp)
- We might want to add some annotation to the plot. Create a data.frame with one column named
Depth
containing values from the columnDepth
in the original data.frame. Also set the rownames of your annotation data.frame to the values in the columnSample.name
Solution
annocol <- data.frame(Depth = factor(mine_data$Depth))
rownames(annocol) <- mine_data$Sample.name
- Now make a new heatmap adding the column annotation using the
annotation_col
argument inpheatmap()
.
Solution
pheatmap(tmp, annotation_col=annocol)
Now we will do a PCA plot of the same data. To do that we need to have bacterias as columns and samples as rows, and still have the sqrt
-transformation so you can create the same tmp object as in the pheatmap exercise above but without the transpose step.
- To start we need to calculate the principal components. This is done in a function called
prcomp()
. This function takes a numerical data frame as input. Calculate the prinicipal components for the numerical columns in the data using theprcomp()
function. Store them in a variable calledpc
.
Solution
tmp <- mine_data[,-(1:3)] %>% sqrt
pc <- prcomp(tmp)
- Plot your
pc
variable with the plot function. Settype="b"
. This plots the variances of the principal components. You can see that the variance is large for the first PC and decreases for every PC.
Solution
plot(pc, type="b")
-
Your
pc
variable contains a lot of information. It is neither a vector, a data frame, a matrix or any of the other classes we have seen so far.-
Write
class(pc)
in your console to find out which classpc
belongs to. -
Write
names(pc)
in your console to find out whatpc
contains.
-
-
The prinicpal components are found in the matrix
pc$x
. We can plot them in a score plot.- Take a look at the matrix
pc$x
usinghead()
to find out how the principal components are stored. - Next, convert the matrix
pc$x
into adata.frame
. - Then use this
data.frame
to create a scatter plot, withPC1
on the x-axis andPC2
on the y-axis, usingggplot()
- To be able to distinguish the different depths, add the col
aestetic with the
mine.data$Depth
vector as input, you might need to convert the vector into a factor.
- Take a look at the matrix
Solution
pc$x %>%
as.data.frame() %>%
ggplot(aes(x=PC1, y=PC2, color=as.factor(mine_data$Depth)))+
geom_point()
- Change the theme of the PCA plot to
theme_classic
by adding the layertheme_classic()
Solution
pc$x %>%
as.data.frame() %>%
ggplot(aes(x=PC1, y=PC2, color=as.factor(mine_data$Depth)))+
geom_point() +
theme_classic()
Home: R programming
Developed by Maria Nethander, 2019. Modified by Malin Östensson, 2020. Modified by Björn Andersson and Jari Martikainen, 2024