3B R Basics - NU-CPGME/sl_workshop_2024 GitHub Wiki

May 2024

Egon A. Ozer, MD PhD ([email protected])
Ramon Lorenzo-Redondo, Ph.D. ([email protected])

R Basics

We will test data representation with the mpg data frame found in ggplot2 (aka ggplot2::mpg). A data frame is a rectangular collection of variables (in the columns) and observations (in the rows).

mpg contains observations collected by the US Environmental Protection Agency on 38 models of car.

Warning

Make sure you have installed all the R packages first. See the Software Installation page in the pre-workshop material and follow the instructions under Step 4 to use the terminal to install the required packages in R.

Start RStudio

Load the libraries

library(tidyverse)
library(ggsci)

View mpg data frame

mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows
summary(mpg)
##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00
colnames(mpg)
##  [1] "manufacturer" "model"        "displ"        "year"         "cyl"         
##  [6] "trans"        "drv"          "cty"          "hwy"          "fl"          
## [11] "class"

Basic Data transformation

As datasets normally do not come in the format we need for analysis, R allows us to modify, transform, add new data among many other options. Here we will explore a few

Using dplyr we can generate pipelines with %>% that allow us to perform various consecutive tasks with one command

Filtering

mpg_new <- mpg %>% 
	filter(year>2006)

dim(mpg_new)
## [1] 117  11
summary(mpg_new$year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2008    2008    2008    2008    2008    2008

Select columns, summarize and reshape

mpg_1 <- mpg %>% 
  dplyr::select(model,cty) %>%
  summarise(Mean_cty = dplyr::n(), .by = c(model)) %>%
  pivot_wider(names_from = model, values_from = Mean_cty)

Basic Data representation

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

We can represent class by color:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

We can improve colors by using libraries such as ggpubr:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  scale_color_aaas()

We can also represent class by size:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = class))

Or by shade:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))

Or by shape:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))

We can also set the aesthetic properties of our geom manually:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

We can split using facets:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

Geometric objects

A geom is the geometrical object that a plot uses to represent data.

People often describe plots by the type of geom that the plot uses.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth()

Add statistical transformations:

ggplot(data = mpg) + 
  geom_bar(mapping = aes(x = class))

ggplot(data = mpg) + 
  geom_bar(mapping = aes(x = class, y = after_stat(prop), group = 1))

We can also change position:

ggplot(data = mpg) + 
  geom_bar(mapping = aes(x = class, fill = trans), position = "dodge")

Now that we have been introduced to R, let’s work with our SARS-CoV-2 metadata

We will analyze the spatio-temporal distribution of our sequences.

library(tidyverse)
library(treeio)
library(ggtree)
library(ape)
library(ggtreeExtra)
library(RColorBrewer)
library(lubridate)
library(ape)
library(TreeTools)
library(RColorBrewer)
library(ggpubr)

Read the Metadata:

Metadata <- read.csv("~/sl_workshop_2024/demo_data/phylo_data/SARSCoV2_BA4.metadata.tsv", sep = "\t")

Transform collection date into a date format with lubridate

Metadata$Date <- ymd(Metadata$date)

Group dates in weeks

Metadata$Week<-cut(Metadata$Date,breaks ="week")

Summarize country data by week

MetadataCountries <- Metadata %>% 
    drop_na(Week)
    
MetadataCountries <- MetadataCountries %>% 
    dplyr::count(country, Week) %>%
    group_by(Week) %>%
    as.data.frame()
    
MetadataCountries <- MetadataCountries %>%
  tidyr::spread(key = country, value = n, fill = 0) %>%
  tidyr::gather(key = country, value = n, - Week) %>%
  arrange(Week, country) %>% 
  group_by(Week) %>%
  mutate(freq = n / sum(n)) %>% 
  as.data.frame()
  
MetadataCountries$Week <- ymd(MetadataCountries$Week)

Generate colors for all countries

colourCount <- length(unique(MetadataCountries$country))

getPalette <- colorRampPalette(brewer.pal(12, "Paired"))

Plot weekly sequence frequency by country (y = freq)

MetadataCountries %>%
  ggplot(aes(fill = country, y = freq, x = Week, order = dplyr::desc(n))) + 
  geom_bar(position = "stack", stat = "identity") +
  theme_pubr(legend = "right") +
  scale_fill_manual(values = getPalette(colourCount)) +
  scale_x_date(date_labels = "%b-%Y") +
  ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) + 
  ggtitle("Countries per time")

Plot weekly number of sequences per country (y = n)

MetadataCountries %>%
  ggplot(aes(fill = country, y = n, x = Week, order = dplyr::desc(n))) + 
  geom_bar(position = "stack", stat = "identity") +
  theme_pubr(legend = "right") +
  scale_fill_manual(values = getPalette(colourCount)) +
  scale_x_date(date_labels = "%b-%Y") +
  ylab("Number of Sequences") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) + 
  ggtitle("Countries per time")

Tree Visualization

Now, we will visualize our temporal tree.

Read the tree

time_tree <- read.nexus("~/sl_workshop_2024/demo_data/phylo_data/timetree_out/timetree.nexus")

TreeData <- as.data.frame(as_tibble(time_tree))

names(TreeData)[4] <- "strain"

Add the Metadata

Metadata_Tree <- inner_join(TreeData, Metadata, by = "strain")

rownames(Metadata_Tree) <- Metadata_Tree$name

colourCount = length(unique(Metadata_Tree$country))

getPalette = colorRampPalette(brewer.pal(12, "Paired"))

Plot a rectangular tree showing country and region

p <- ggtree(time_tree, options(ignore.negative.edge=TRUE)) %<+% Metadata_Tree +
  geom_tippoint(aes(color=country), show.legend = T) +
  scale_color_manual(values = getPalette(colourCount))

p + geom_fruit(
    geom = geom_tile,
    mapping = aes(fill = region),
    width = 0.01, 
    offset = 0.1) +
  scale_fill_lancet()

Plot a circular tree showing country and region

p1 <- ggtree(time_tree,
             layout = "circular",
             options(ignore.negative.edge = TRUE)
             ) %<+% Metadata_Tree +
  geom_tippoint(aes(color = country),
                show.legend = T) + 
  scale_color_manual(values = getPalette(colourCount))

p1 + geom_fruit(
    geom = geom_tile,
    mapping = aes(fill = region),
    width = 0.02, 
    offset = 0.1) +
  scale_fill_lancet()



Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

⚠️ **GitHub.com Fallback** ⚠️