Introduction to R - biobakery/biobakery GitHub Wiki

Introduction to R

R is a language and environment for statistical computing and graphics. It implements a wide variety of statistical and graphical techniques (e.g., linear and nonlinear modeling, statistical tests, time-series analysis, classification, and clustering), and it is highly extensible. It also has a strong community and a rich ecosystem of packages.

Why use R for Microbiome research?

R is not just a programming language; it’s a research environment tailored for data-rich biological sciences. For microbiome researchers, R offers a uniquely powerful ecosystem:

  • Vast capabilities in base R, including a wide range of statistical and graphical techniques, but also easy to expand by writing new functions and packages
  • Excellent community support: mailing lists, blogs, tutorials
  • Comprehensive statistical and graphical capabilities built-in Dedicated packages for microbiome analysis, including:
    • phyloseq for taxonomic and functional profiling workflows
    • vegan for ecological ordination and diversity analyses
    • MMUPHin and Maaslin3, for batch normalization, differential abundance, and association testing
  • The Bioconductor project, a cornerstone of modern bioinformatics, with over 2,000 curated packages for genomics, transcriptomics, metagenomics, and more
  • Seamless integration of statistical modeling, data wrangling, and plotting in a reproducible pipeline
  • Reproducibility-first culture: support for literate programming via R Markdown, Quarto, and integration with version control (Git)

Contents


1. Lab Setup and basics in R

1.1. Environment Options

Option 1. Use the R console inside RStudio (free and open-source integrated development environment for R)

  • Start the RStudio program
  • Open a new document and save the file.
  • The window in the upper-left is your R script. This is where you will write instructions for R to carry out.
  • The window in the lower-left is the R console. This is where results will be displayed.

Option 2. Use the console directly on a terminal. This is what I am going to be using today:

  • Open the terminal and start an R session:
R
  • You will be greeted by the splash screen stating the R version, version name and platform, alongside other instructions:
R version 4.4.2 (2024-10-31) -- "Pile of Leaves"
Copyright (C) 2025 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> █
  • End R session:
q()

1.2. Typing commands in the console

Let's start by getting comfortable in the R environment:

The user interacts with R by inputting commands at the prompt (> █). The interpreter reads, evaluates the command and, usually, prints out different types of messages depending on the result. We can, for example, command R to do basic calculations for us:

1 + 1
# Expected output:
# [1] 2

The console read the 1 + 1 command, evaluated the addition, and printed out the result (2).

Additional operators include -, *, /, and ^ (exponentiation). As an example, the following command calculates the (approximate) area of a circle with a radius of 2:

3.14 * 2^2
# Expected output:
# [1] 12.56

1.3. Installing and Loading Packages

Although the base R experience is quite powerful and already has basic graphics and statistical capabilities built in, users will want to expand on that by incorporating functionalities available on additional packages.

Let's install and load the ggplot2 package, which houses functionality to produce publication-quality charts and graphs.

install.packages("ggplot2")
library(ggplot2)

Do not forget to load the package with library() at the beginning of each code notebook! From experience, if you receive the error command not found - make sure you have loaded the library! (I made this mistake one too many times).

Alternative: In Rstudio, go to the "Packages" tab and click the "Install" button. Search in the pop-up window and click "Install".

1.4. Using R help

package development in R follows rigis standards, which include the documentation for each function and data structure. If you are stuck trying to get a function to work, it is usually a good idea to inspect the help page for this function, which can be done via the commands:

help(help)
help(sqrt)
?sqrt

1.5. Variables

You can create variables in R - individual units with names to store values in. These units can then be called on later on, to examine or use their stored values.

Let's create a variable named r, and assign the value 2 to it (using the <- assignment operator):

r <- 2
ℹ️ About '<-' vs. '=' for assignments in R

R provides several ways to assign values to objects. The most common is <-, which visually "points" to the object receiving the value:

x <- 5

You can also use = for assignment, and it's often seen in interactive use on a console:

x = 5

However, <- is preferred in scripts and packages because = is also used to name function arguments and their values:

print(r <- 5)    # Option 1 WORKS
print(r = 5)     # Option 2 DOES NOT WORK

The print function only expects one argument, labeled x - the opbject to be printed . You can check that using help(print)

The first option assigns the value 5 to r and passes the value of r along as whatever is the first argument to print (which, as we know, is x). The second option tells the function print to use the value 5 explicitly as the named argument r. However, since print does not expect an argument labeled r, this throws an error.

Hence, both operators are not equivalent or interchangeable:

  • They have different precedences:
    • <- has lower precedence than =;
    • That’s why a <- b = 2 works but a = b <- 2 does not
  • The functionality of = changes at different levels of a complex code hierarchy (blocks inside of blocks inside of blocks of code...)

When in doubt: use <- for assignments, and = for function arguments.

Read more: Official R Docs

Note that the above command didn't prompt R to generate any output messages; the operation here is implicit. However, I can now call on r to check its stored value:

r
# Expected output:
# [1] 2

I can use stored variables for future operations:

3.14 * r^2
# Expected output:
# [1] 12.56

R has some built-in variables that we can directly make use of. For example, the pi variable stores a more accurate version of the constant $\pi$ than our 3.14:

pi
# Expected output:
# [1] 3.141593

Now, can you make sense of the following operations (notice how I can change the value stored in r with a new assignment operation):

r <- 3
area <- pi * r^2
area
# Expected output:
# [1] 28.27433

Lastly, R can use and handle other "classes" of values than just numbers. For example, character strings:

circle = "cake"
circle
# Expected output:
# [1] "cake"

Question: try the following command in R: circle = cake. Does it run successfully? What is the problem?


1.6. R Functions

Functions are conveniently enclosed operations, that take zero or more input and generate the desired outcome. We use a couple of examples to illustrate the concept of R functions. The first one, the very basic c() function, combines values into a vector:

c(1, 2, 3)
# Expected output:
# [1] 1 2 3

Notice that you call functions by providing parameters (values in the parentheses) as input. They then (most times) return values as output.

You can, of course, use variables as input, or assign the returned value to new variables. Imagine two researchers individually collected age measurements (in months) of different subjects from the same infant population, and now would like to combine their data. They can do so with:

samples_ages_1 = c(3, 4, 2, 7, 5, 5, 6)
samples_ages_2 = c(3, 2, 3, 3, 4)
samples_ages_all = c(samples_ages_1, samples_ages_2)
samples_ages_all
# Expected output:
# [1] 3 4 2 7 5 5 6 3 2 3 3 4

Since we have two numerical vectors with sample ages from the same population, this is an excellent opportunity to try a function that performs a statistical test on them. For example, the function t.test() does exactly what its name suggests: it performs a t-test between two vectors, to see if the difference in their means is statistically significant. This function expects to receive as arguments two numerical vectors:

t.test(samples_ages_1, samples_ages_2)

# Expected output:
#         Welch Two Sample t-test
#
# data:  samples_ages_1 and samples_ages_2
# t = 2.1755, df = 8.4684, p-value = 0.05945
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  -0.07837949  3.22123663
# sample estimates:
# mean of x mean of y
#  4.571429  3.000000

Certain function parameters have names, and you can explicitly invoke them during function calls. For example, here you will notice that the test performed is a two-sided test. What if we wanted to perform a one-sided test, to see if the average of samples_ages_1 is significantly higher than that of samples_ages_2? For this, we can invoke the alternative parameter in t.test(), which lets us select one of the options ("two.sided", "less", or "greater"), depending on the alternative hypothesis we are interested in.

t.test(x = samples_ages_1, y = samples_ages_2, alternative = "greater")

# Expected output:
#         Welch Two Sample t-test
# 
# data:  samples_ages_1 and samples_ages_2
# t = 2.1755, df = 8.4684, p-value = 0.02972
# alternative hypothesis: true difference in means is greater than 0
# 95 percent confidence interval:
#  0.237779      Inf
# sample estimates:
# mean of x mean of y
#  4.571429  3.000000

Again, you can check the full list of parameters for functions in R with the command ? + function name. For example ?t.test gives you the full documentation for the function t.test.

1.7. Writing R functions

The functions we used so far are built-in. Just like variables, we can also create our own functions, by invoking the function keyword.

area_circle = function(r) {
     return(pi * r^2)
  }
area_circle(r = 3)
# Expected output:
# [1] 28.27433
Questions

Study the following two functions, aimed at calculating the overall mean of samples collected by two separate researchers.

  • What happened in each function?
  • What are their differences?
  • Which one is better?
overall_mean1 = function(samples1, samples2) {
      samples_all = c(samples1, samples2)
      return(mean(samples_all))
  }
overall_mean2 = function(samples1, samples2) {
      mean1 = mean(samples1)
      mean2 = mean(samples2)
      return((mean1 + mean2) / 2)
  }
  • Hint: imagine the following scenarios:
    • If the first researcher collected a lot more samples than the second one, which way is better?

1.8. Parsing, loading and inspecting data

We will use an example project of the most popular baby names in the United States and the United Kingdom. A cleaned and merged version of the data file is available at http://tutorials.iq.harvard.edu/R/Rintro/dataSets/babyNames.csv.

In order to read data from a file, you have to know what kind of file it is. In our case, it is a CSV file, which is also one of the most popular formats for storing generic, simple tabular data. Because it is so ubiquitous, R has a base method to read CSV files. We can learn more about this method by invoking the help vignette:

?read.csv

Question: What would you use for other file types? Do you think that the files you usually deal with can be parsed by functions in base, or would you need a special package to load data from them?


Read in the file and assign the result to the name baby.names, and use the class function to check that the object is of class data.frame

baby.names = read.csv(file="https://www.dropbox.com/s/kr76pha2p82snj4/babyNames.csv?dl=1")
class(baby.names)
# Expected output:
# [1] "data.frame"

We can get a high-level data type description of the data.frame using the function str:

str(baby.names)
# Expected output:
# 'data.frame':   400000 obs. of  6 variables:
#  $ Name       : chr  "Truth" "Jalyssa" "Carolina" "Quinnton" ...
#  $ Sex        : chr  "Male" "Female" "Female" "Male" ...
#  $ Count      : int  30 39 199 11 5 6 54 17 8 6 ...
#  $ Year       : int  2007 2013 1974 1993 2017 2008 2012 2011 1987 1998 ...
#  $ Location   : chr  "CA" "NM" "England and Wales" "TN" ...
#  $ Name.length: int  5 7 8 8 6 12 5 5 7 8 ...

There are some commands that will help us inspect or scan tabular data in R without having to print or access the entire content. For exemple, the head function allows us to look at the first rows of a table (6 by default):

head(baby.names)
# Expected output:
#           Name    Sex Count Year          Location Name.length
# 1        Truth   Male    30 2007                CA           5
# 2      Jalyssa Female    39 2013                NM           7
# 3     Carolina Female   199 1974 England and Wales           8
# 4     Quinnton   Male    11 1993                TN           8
# 5       Samiul   Male     5 2017                IL           6
# 6 Arinzechukwu   Male     6 2008                SC          12

The summary function will give us some useful stats that vary depending on the data type of each column:

summary(baby.names)
# Expected output:
#      Name               Sex                Count              Year
#  Length:400000      Length:400000      Min.   :    5.0   Min.   :1960
#  Class :character   Class :character   1st Qu.:    7.0   1st Qu.:1982
#  Mode  :character   Mode  :character   Median :   11.0   Median :1996
#                                        Mean   :  157.6   Mean   :1994
#                                        3rd Qu.:   30.0   3rd Qu.:2007
#                                        Max.   :86922.0   Max.   :2017
#    Location          Name.length
#  Length:400000      Min.   : 2.000
#  Class :character   1st Qu.: 5.000
#  Mode  :character   Median : 6.000
#                     Mean   : 6.232
#                     3rd Qu.: 7.000
#                     Max.   :15.000

2. Working with Data in R

Here we will work further with the baby.names file that you loaded in above.

2.1. Working with data.frames in R

Usually, tabular data read into R will be stored as a data.frame.

  • A data.frame is a list of vectors of equal length
  • Each vector in the list forms a column
  • Each column can be a different type of vector
  • Typically columns are variables and the rows are observations

A data.frame has two dimensions corresponding to the number of rows and the number of columns (in that order).

Check the dimensions of the data.frame then extract the first three rows of the data.frame.

dim(baby.names)
# Expected output:
# [1] 400000      6
baby.names[1:3,]
# Expected output:
#       Name    Sex Count Year          Location Name.length
# 1    Truth   Male    30 2007                CA           5
# 2  Jalyssa Female    39 2013                NM           7
# 3 Carolina Female   199 1974 England and Wales           8

Extract the first three columns of the data.frame.

baby.names[,1:3]
# Expected output:
#                  Name    Sex Count
# 1               Truth   Male    30
# 2             Jalyssa Female    39
# 3            Carolina Female   199
# 4            Quinnton   Male    11
# 5              Samiul   Male     5
#
#                   ...
# 
# 33331            Enis   Male    14
# 33332          Aashna Female    36
# 33333          Garcia   Male     5
#  [ reached 'max' / getOption("max.print") -- omitted 366667 rows ]

Output a specific columns of the data.frame.

baby.names$Name
# Try also baby.names[["Name"]] or baby.names[, "Name"]

Output only the unique values within a column.

unique(baby.names$Name)

2.2. Aggregate data stats in a data.frame

Using the base function sum(). Have many babies were called "jill"?

sum(baby.names$Name == "Jill")
# Expected output:
# [1] 25

Extract rows where Name == "Jill". This can also be done with the subset command in R.

baby.names[baby.names$Name == "Jill",]
subset(baby.names, Name == "Jill")

Relational and logical operators

Operator Meaning
! Not
== Equal to
!= Not equal to
> Greater than
>= Greater than or equal
< Less than
<= Less than or equal
%in% Contained in set
& And
| Or

Exercise # 1:

E1-1. How many female babies are listed in the table?

E1-2. How many babies were born after 2003? Save the subset in a new dataframe.


2.3 Adding columns

The original table has a column that stores the Location where the baby was born. Let's first chack that this is true:

head(baby.names)
# Expected output:
#           Name    Sex Count Year          Location Name.length
# 1        Truth   Male    30 2007                CA           5
# 2      Jalyssa Female    39 2013                NM           7
# 3     Carolina Female   199 1974 England and Wales           8
# 4     Quinnton   Male    11 1993                TN           8
# 5       Samiul   Male     5 2017                IL           6
# 6 Arinzechukwu   Male     6 2008                SC          12

Indeed, the Location column stores location data. Let's ask R to build a quick table of counts per unique value in Location:

table(baby.names$Location)
# Expected output:
#                AK                AL                AR                AZ 
#              8685             31652             23279             42775 
#                CA                CO                CT                DC 
#            133257             35181             21526             11074 
#                DE England and Wales                FL                GA 
#              8625            227449             77218             58715 
#                HI                IA                ID                IL 
#             12072             22748             16330             66576 
#                IN                KS                KY                LA 
#             40200             24548             27041             35370 
#                MA                MD                ME                MI 
#             33190             35279             10030             51601 
#                MN                MO                MS                MT 
#             32946             37078             24520              9759 
#                NC                ND                NE                NH 
#             51874              8470             17549              9806 
#                NJ                NM                NV                NY 
#             47219             18673             21894             89115 
#                OH                OK                OR                PA 
#             55633             29857             27524             53943 
#                RI                SC                SD                TN 
#              9020             29738              9687             39714 
#                TX                UT                VA                VT 
#            113754             29828             44859              5519 
#                WA                WI                WV                WY 
#             41231             32858             13726              5786 

Apparently, the Location column contains several 2-letter acronyms for US states, and a separate value for England and Wales. Knowing that, let's add a new column specifying the country.

baby.names$Country = ifelse(baby.names$Location == "England and Wales", "UK", "US")
head(baby.names)
# Expected output:
#           Name    Sex Count Year          Location Name.length Country
# 1        Truth   Male    30 2007                CA           5      US
# 2      Jalyssa Female    39 2013                NM           7      US
# 3     Carolina Female   199 1974 England and Wales           8      UK
# 4     Quinnton   Male    11 1993                TN           8      US
# 5       Samiul   Male     5 2017                IL           6      US
# 6 Arinzechukwu   Male     6 2008                SC          12      US
table(baby.names$Country)
# Expected output:
#     UK     US
#  46516 353484

2.4 Replacing data entries

Especially when it comes to metadata - some cleaning may be needed before you run statistics. Let's take a look at the Sex column.

table(baby.names$Sex)
# Expected output:
# Female      M   Male
# 241203   7709 151088

Do you notice any discrepancy in the output?

If we ran statistics on this column it would be confused by the classification of Males here. One way to fix this column is to perform a text replacement on the occurrences of M (by itself) with the word Male

baby.names$Sex = gsub("M$", "Male", baby.names$Sex)
# NOT RUN: another possibility would be to do
# baby.names$Sex <- ifelse(baby.names$Sex == "M", "Male", baby.names$Sex)

Question: Why do we need the $ sign? What happens if we omit it?


Check the Sex summary table again.

table(baby.names$Sex)
# Expected output:
# Female   Male
# 241203 158797

3. Exporting data

Now that we have made some changes to our data set, we might want to save those changes to a file taht we can store and share. We can save it as a generic/agnostic CSV table to be parsed by other systems, or as an R Image file (.RData), to be loaded by R.

3.1 Save the output as a csv file

When interacting wth the file system, you want to be careful about referencing file and folder paths. You should aim to write code that creates file paths procedurally, agnostic of filesystem or username or R version. In order to do that, R includes two libraries that can help in this process: fs to interact with the filesystem, and here to organize and locate R project roots and subfolders. Let's use a path builder to create a fodler in our user Desktop that will work across systems.

library(fs)
# Load the filesystem package
workdir_path <- path_home("Desktop", "R_Tutorial")
# Build a system-specific path from user home to Desktop and then R_Tutorial
dir.create(workdir_path)
# run a system command to create the R_Tutorial folder according to the path we built
write.csv(baby.names, file=path(workdir_path, "babyNames_v2.csv"))
# write the CSV file to the workdir under the name "babyNames_v2.csv"

Question: How would you save other file formats?

Locate and open the file outside of R. Were your changes saved with the file?


3.2 Save the output as an R Image file

save(list = "baby.names", file=path(workdir_path, "babyNames.Rdata"))

Question: How do you load an R object? Use ?load to help you figure it out.


4. Basic statistics

Descriptive statistics of single variables are straightforward:

Find the mean of baby name lengths:

mean(baby.names$Name.length)
# Expected output:
# [1] 6.231587

Find the median of baby name lengths:

median(baby.names$Name.length)
# Expected output:
# [1] 6

Find the standard deviation of baby name lengths:

sd(baby.names$Name.length)
# Expected output:
# [1] 1.472727

Summarize the baby name lengths:

summary(baby.names$Name.length)
# Expected output:
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   2.000   5.000   6.000   6.232   7.000  15.000

Exercise # 2:

E2-1. Which are the longest names?

E2-2. Which are the shortest names?


5. Simple graphs

5.1 Boxplots

Compare the length of baby names for boys and girls using a boxplot.

p = ggplot(data = baby.names, aes(x = Sex, y = Name.length)) +
        geom_boxplot()

ggsave(plot = p,
       filename = path(workdir_path, "basic_box_introR.png"),
       width = 7, height = 6)

print(p)

Adding color to the boxplot:

p2 = ggplot(baby.names, aes(x = Sex, y = Name.length, fill = Sex)) + 
         geom_boxplot() + 
         theme_bw()

ggsave(plot = p2,
       filename = path(workdir_path, "fill_box_introR.png"), 
       width = 7, height = 6)

p3 = ggplot(baby.names, aes(x = Sex, y = Name.length, color = Sex)) + 
         geom_boxplot(lwd = 2) + 
         theme_bw()

ggsave(plot = p3,
       filename = path(workdir_path, "color_box_introR.png"), 
       width = 7, height = 6)

print(p2)
print(p3)

Exercise # 3:

Change the layout of the plot.

E3-1. Add a plot title.

E3-2. Add a title to the y-axis.

E3-3. Choose custom colors for the boxplots. Good places to look up color names are:


5.2 Histograms

How many names were recorded for each year?

  • Check the timeframe that is included in the table.
  • How many records were obtained in total?
p4 = ggplot(baby.names, aes(x = Name.length)) + 
         geom_histogram()

ggsave(plot = p4,
       filename = path(workdir_path, "basic_histogram_introR.png"), 
       width = 7, height = 6)

print(p4)

Exercise # 4:

E4-1. Take a look at ?geom_histogram and customize the layout of the plot like you did for E3-1.

  • If you google geom_histogram you will find the ggplot2 documentation with every aesthetical mapping and option for this function.
  • You can also use other resources like help, Discourses or StackOverflow to find answers to specific use-cases

Introduction to tidyverse

The "tidyverse" is

  • a dialect of R
  • a collection of R packages with a shared design philosophy focused on manipulating data frames

All packages of the tidyverse share an underlying design philosophy, grammar, and data structures. It simplifies complex data manipulation and visualization tasks.

1. Core Packages

  • dplyr: data manipulation
  • ggplot2: data visualization
  • tidyr: reshaping and tidying data
  • readr: reading rectangular data (CSV, TSV, etc.)
  • tibble: enhanced data frames
  • purrr: functional programming tools
  • stringr: string operations
  • forcats: categorical variable handling

You can load them all at once using library(tidyverse)

2. Working with tidyverse

One prominent feature of the tidyverse is the pipe: %>%. Pipes take input from their left and pass it to the function on their right as the first positional argument. So these two lines of code will produce the same result:

library(dplyr)

f <- function(x,y) { x + y }

x <- 5
y <- 3

f(x,y)
# Expected output:
# [1] 8 

x %>% f(y)
# Expected output:
# [1] 8

This makes code more readable when chaining multiple operations performed on an input data frame. As an example, we can re-load the original baby.names dataset and re-do the manual processing we did, but using dplyr alternatives. The following block of code will create the Country column, fix the Sex column and assign the processing result to its own variable name. Then, it will display a summary of name lengths (mean and SD) for each Country-Sex combination:

baby.names = read.csv(file="https://www.dropbox.com/s/kr76pha2p82snj4/babyNames.csv?dl=1")

processed_baby.names <- baby.names %>%
  mutate(
    Country = case_when(
      Location == "England and Wales" ~ "UK",
      .default = "US"
    )
  ) %>%
  mutate(
    Sex = recode(Sex, "M" = "Male", "F" = "Female")
  )

processed_baby.names %>%
  group_by(Country, Sex) %>%
  summarise(mean_length = mean(Name.length), sd_length = sd(Name.length))

The example below takes the baby.names data, filters out the "England and Wales" observations, groups by Year and Sex, then computes the average name length by group, and arranges the result in descending order. This is done using several functions from the dplyr package from the tidyverse family.

processed_baby.names %>%
  filter(Location != "England and Wales") %>% 
  group_by(Year, Sex) %>%
  summarise(mean_length = mean(Name.length)) %>%
  arrange(-mean_length)

You can see that the command ends up looking similar to the English sentence describing what it does. The final result shows that Females in 1989 had the longest names on average.

Note: The base R language recently introduced its own pipe in version 4.1.0 that looks like this: |> . There are some subtle differences in behavior but for the most part they are interchangeable.

Additional Resources

⚠️ **GitHub.com Fallback** ⚠️