3B R Basics - NU-CPGME/sl_workshop_2024 GitHub Wiki
May 2024
Egon A. Ozer, MD PhD ([email protected])
Ramon Lorenzo-Redondo, Ph.D. ([email protected])
We will test data representation with the mpg data frame found in ggplot2 (aka ggplot2::mpg). A data frame is a rectangular collection of variables (in the columns) and observations (in the rows).
mpg contains observations collected by the US Environmental Protection Agency on 38 models of car.
Warning
Make sure you have installed all the R packages first. See the Software Installation page in the pre-workshop material and follow the instructions under Step 4 to use the terminal to install the required packages in R.

library(tidyverse)
library(ggsci)
mpg
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 224 more rows
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
colnames(mpg)
## [1] "manufacturer" "model" "displ" "year" "cyl"
## [6] "trans" "drv" "cty" "hwy" "fl"
## [11] "class"
As datasets normally do not come in the format we need for analysis, R allows us to modify, transform, add new data among many other options. Here we will explore a few
Using dplyr we can generate pipelines with %>%
that allow us to perform various consecutive tasks with one command
mpg_new <- mpg %>%
filter(year>2006)
dim(mpg_new)
## [1] 117 11
summary(mpg_new$year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2008 2008 2008 2008 2008 2008
mpg_1 <- mpg %>%
dplyr::select(model,cty) %>%
summarise(Mean_cty = dplyr::n(), .by = c(model)) %>%
pivot_wider(names_from = model, values_from = Mean_cty)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
We can represent class by color:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
We can improve colors by using libraries such as ggpubr:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
scale_color_aaas()
We can also represent class by size:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
Or by shade:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
Or by shape:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
We can also set the aesthetic properties of our geom manually:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
We can split using facets:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
A geom is the geometrical object that a plot uses to represent data.
People often describe plots by the type of geom that the plot uses.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()
Add statistical transformations:
ggplot(data = mpg) +
geom_bar(mapping = aes(x = class))
ggplot(data = mpg) +
geom_bar(mapping = aes(x = class, y = after_stat(prop), group = 1))
We can also change position:
ggplot(data = mpg) +
geom_bar(mapping = aes(x = class, fill = trans), position = "dodge")
We will analyze the spatio-temporal distribution of our sequences.
library(tidyverse)
library(treeio)
library(ggtree)
library(ape)
library(ggtreeExtra)
library(RColorBrewer)
library(lubridate)
library(ape)
library(TreeTools)
library(RColorBrewer)
library(ggpubr)
Read the Metadata:
Metadata <- read.csv("~/sl_workshop_2024/demo_data/phylo_data/SARSCoV2_BA4.metadata.tsv", sep = "\t")
Transform collection date into a date format with lubridate
Metadata$Date <- ymd(Metadata$date)
Group dates in weeks
Metadata$Week<-cut(Metadata$Date,breaks ="week")
Summarize country data by week
MetadataCountries <- Metadata %>%
drop_na(Week)
MetadataCountries <- MetadataCountries %>%
dplyr::count(country, Week) %>%
group_by(Week) %>%
as.data.frame()
MetadataCountries <- MetadataCountries %>%
tidyr::spread(key = country, value = n, fill = 0) %>%
tidyr::gather(key = country, value = n, - Week) %>%
arrange(Week, country) %>%
group_by(Week) %>%
mutate(freq = n / sum(n)) %>%
as.data.frame()
MetadataCountries$Week <- ymd(MetadataCountries$Week)
Generate colors for all countries
colourCount <- length(unique(MetadataCountries$country))
getPalette <- colorRampPalette(brewer.pal(12, "Paired"))
Plot weekly sequence frequency by country (y = freq
)
MetadataCountries %>%
ggplot(aes(fill = country, y = freq, x = Week, order = dplyr::desc(n))) +
geom_bar(position = "stack", stat = "identity") +
theme_pubr(legend = "right") +
scale_fill_manual(values = getPalette(colourCount)) +
scale_x_date(date_labels = "%b-%Y") +
ylab("Frequency") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
ggtitle("Countries per time")
Plot weekly number of sequences per country (y = n
)
MetadataCountries %>%
ggplot(aes(fill = country, y = n, x = Week, order = dplyr::desc(n))) +
geom_bar(position = "stack", stat = "identity") +
theme_pubr(legend = "right") +
scale_fill_manual(values = getPalette(colourCount)) +
scale_x_date(date_labels = "%b-%Y") +
ylab("Number of Sequences") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
ggtitle("Countries per time")
Now, we will visualize our temporal tree.
Read the tree
time_tree <- read.nexus("~/sl_workshop_2024/demo_data/phylo_data/timetree_out/timetree.nexus")
TreeData <- as.data.frame(as_tibble(time_tree))
names(TreeData)[4] <- "strain"
Add the Metadata
Metadata_Tree <- inner_join(TreeData, Metadata, by = "strain")
rownames(Metadata_Tree) <- Metadata_Tree$name
colourCount = length(unique(Metadata_Tree$country))
getPalette = colorRampPalette(brewer.pal(12, "Paired"))
Plot a rectangular tree showing country and region
p <- ggtree(time_tree, options(ignore.negative.edge=TRUE)) %<+% Metadata_Tree +
geom_tippoint(aes(color=country), show.legend = T) +
scale_color_manual(values = getPalette(colourCount))
p + geom_fruit(
geom = geom_tile,
mapping = aes(fill = region),
width = 0.01,
offset = 0.1) +
scale_fill_lancet()
Plot a circular tree showing country and region
p1 <- ggtree(time_tree,
layout = "circular",
options(ignore.negative.edge = TRUE)
) %<+% Metadata_Tree +
geom_tippoint(aes(color = country),
show.legend = T) +
scale_color_manual(values = getPalette(colourCount))
p1 + geom_fruit(
geom = geom_tile,
mapping = aes(fill = region),
width = 0.02,
offset = 0.1) +
scale_fill_lancet()
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.