Introduction to R - biobakery/biobakery GitHub Wiki
R is a language and environment for statistical computing and graphics. It implements a wide variety of statistical and graphical techniques (e.g., linear and nonlinear modeling, statistical tests, time-series analysis, classification, and clustering), and it is highly extensible. It also has a strong community and a rich ecosystem of packages.
R is not just a programming language; it’s a research environment tailored for data-rich biological sciences. For microbiome researchers, R offers a uniquely powerful ecosystem:
- Vast capabilities in base R, including a wide range of statistical and graphical techniques, but also easy to expand by writing new functions and packages
- Excellent community support: mailing lists, blogs, tutorials
- Comprehensive statistical and graphical capabilities built-in
- Dedicated packages for microbiome analysis, including:
  - `phyloseq` for taxonomic and functional profiling workflows
  - `vegan` for ecological ordination and diversity analyses
  - `MMUPHin` and `Maaslin3` for batch normalization, differential abundance, and association testing
- The Bioconductor project, a cornerstone of modern bioinformatics, with over 2,000 curated packages for genomics, transcriptomics, metagenomics, and more
- Seamless integration of statistical modeling, data wrangling, and plotting in a reproducible pipeline
- Reproducibility-first culture: support for literate programming via R Markdown, Quarto, and integration with version control (Git)
- Introduction to R
- Introduction to tidyverse
- Additional Resources
Option 1. Use the R console inside RStudio (a free and open-source integrated development environment for R)
- Start the RStudio program
- Open a new document and save the file.
- The window in the upper-left is your R script. This is where you will write instructions for R to carry out.
- The window in the lower-left is the R console. This is where results will be displayed.
Option 2. Use the R console directly in a terminal. This is what I am going to be using today:
- Open the terminal and start an R session:
R
- You will be greeted by the splash screen stating the R version, version name and platform, alongside other instructions:
R version 4.4.2 (2024-10-31) -- "Pile of Leaves"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> █
- End R session:
q()
Let's start by getting comfortable in the R environment:
The user interacts with R by typing commands at the prompt (`>`). The interpreter reads and evaluates each command and, usually, prints out different types of messages depending on the result. We can, for example, command R to do basic calculations for us:
1 + 1
# Expected output:
# [1] 2
The console read the `1 + 1` command, evaluated the addition, and printed out the result (2).
Additional operators include `-`, `*`, `/`, and `^` (exponentiation). As an example, the following command calculates the (approximate) area of a circle with a radius of 2:
3.14 * 2^2
# Expected output:
# [1] 12.56
Although the base R experience is quite powerful and already has basic graphics and statistical capabilities built in, users will often want to expand on that by incorporating functionality available in additional packages.
Let's install and load the `ggplot2` package, which houses functionality to produce publication-quality charts and graphs.
install.packages("ggplot2")
library(ggplot2)
Do not forget to load the package with `library()` at the beginning of each code notebook! From experience, if you receive the error `could not find function` - make sure you have loaded the library! (I made this mistake one too many times).
Alternative: In RStudio, go to the "Packages" tab and click the "Install" button. Search for the package in the pop-up window and click "Install".
Package development in R follows rigorous standards, which include documentation for every function and data structure. If you are stuck trying to get a function to work, it is usually a good idea to inspect the help page for that function, which can be done via the commands:
help(help)
help(sqrt)
?sqrt
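If you do not know the exact name of the function you need, base R can also search the installed documentation by topic. Both forms below are standard base R helpers; the search term is just an example:

help.search("standard deviation")
??"standard deviation"
# Both list the help pages whose titles or keywords match the search term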
You can create variables in R - individual units with names to store values in. These units can then be called on later, to examine or use their stored values.
Let's create a variable named `r`, and assign the value `2` to it (using the `<-` assignment operator):
r <- 2
ℹ️ About '<-' vs. '=' for assignments in R

R provides several ways to assign values to objects. The most common is `<-`, which visually "points" to the object receiving the value:

x <- 5

You can also use `=` for assignment, and it is often seen in interactive use on a console:

x = 5

However, `<-` is preferred in scripts and packages because `=` is also used to name function arguments and their values:

print(r <- 5) # Option 1 WORKS
print(r = 5)  # Option 2 DOES NOT WORK

The first argument of `print()` is named `x` - the object to be printed. You can check that using `help(print)`.

The first option assigns the value 5 to `r` and passes that value along as the first positional argument (`x`) of `print()`. The second option tells `print()` to use the value `5` explicitly as the named argument `r`. However, since `print()` has no argument named `r`, this throws an error. Hence, the two operators are not equivalent or interchangeable:

- They have different precedences: `=` has lower precedence than `<-`; that's why `a = b <- 2` works but `a <- b = 2` does not.
- The meaning of `=` changes at different levels of a complex code hierarchy (blocks inside of blocks inside of blocks of code...).

When in doubt: use `<-` for assignments, and `=` for function arguments.

Read more: Official R Docs
Note that the above command didn't prompt R to generate any output messages; assignments happen silently.
However, I can now call on `r` to check its stored value:
r
# Expected output:
# [1] 2
I can use stored variables for future operations:
3.14 * r^2
# Expected output:
# [1] 12.56
R has some built-in variables that we can directly make use of. For example, the `pi` variable stores a more accurate version of the constant `3.14`:
pi
# Expected output:
# [1] 3.141593
Now, can you make sense of the following operations? (Notice how I can change the value stored in `r` with a new assignment operation.)
r <- 3
area <- pi * r^2
area
# Expected output:
# [1] 28.27433
Lastly, R can use and handle other "classes" of values than just numbers. For example, character strings:
circle = "cake"
circle
# Expected output:
# [1] "cake"
Question: try the following command in R: `circle = cake`. Does it run successfully? What is the problem?
Functions are conveniently packaged operations that take zero or more inputs and generate a desired output. We will use a couple of examples to illustrate the concept of R functions. The first one, the very basic `c()` function, combines values into a vector:
c(1, 2, 3)
# Expected output:
# [1] 1 2 3
Notice that you call functions by providing arguments (values in the parentheses) as input. They then (most of the time) return values as output.
You can, of course, use variables as input, or assign the returned value to new variables. Imagine two researchers individually collected age measurements (in months) of different subjects from the same infant population, and now would like to combine their data. They can do so with:
samples_ages_1 = c(3, 4, 2, 7, 5, 5, 6)
samples_ages_2 = c(3, 2, 3, 3, 4)
samples_ages_all = c(samples_ages_1, samples_ages_2)
samples_ages_all
# Expected output:
# [1] 3 4 2 7 5 5 6 3 2 3 3 4
Since we have two numerical vectors with sample ages from the same population, this is an excellent opportunity to try a function that performs a statistical test on them. For example, the function `t.test()` does exactly what its name suggests: it performs a t-test between two vectors, to see if the difference in their means is statistically significant. This function expects to receive two numerical vectors as arguments:
t.test(samples_ages_1, samples_ages_2)
# Expected output:
# Welch Two Sample t-test
#
# data: samples_ages_1 and samples_ages_2
# t = 2.1755, df = 8.4684, p-value = 0.05945
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -0.07837949 3.22123663
# sample estimates:
# mean of x mean of y
# 4.571429 3.000000
Certain function parameters have names, and you can explicitly invoke them during function calls. For example, here you will notice that the test performed is a two-sided test. What if we wanted to perform a one-sided test, to see if the average of `samples_ages_1` is significantly higher than that of `samples_ages_2`? For this, we can invoke the `alternative` parameter in `t.test()`, which lets us select one of the options (`"two.sided"`, `"less"`, or `"greater"`), depending on the alternative hypothesis we are interested in.
t.test(x = samples_ages_1, y = samples_ages_2, alternative = "greater")
# Expected output:
# Welch Two Sample t-test
#
# data: samples_ages_1 and samples_ages_2
# t = 2.1755, df = 8.4684, p-value = 0.02972
# alternative hypothesis: true difference in means is greater than 0
# 95 percent confidence interval:
# 0.237779 Inf
# sample estimates:
# mean of x mean of y
# 4.571429 3.000000
Again, you can check the full list of parameters for functions in R with the command `?` + function name. For example, `?t.test` gives you the full documentation for the function `t.test`.
The functions we used so far are built-in. Just like variables, we can also create our own functions, by invoking the `function` keyword.
area_circle = function(r) {
return(pi * r^2)
}
area_circle(r = 3)
# Expected output:
# [1] 28.27433
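As a quick extension of this example (a hypothetical variant, not part of the tutorial's own function), parameters of user-defined functions can be given default values, which are used whenever the caller omits that argument:

area_circle_default = function(r = 1) {
  return(pi * r^2)
}
area_circle_default()       # uses the default r = 1
# [1] 3.141593
area_circle_default(r = 3)  # overrides the default
# [1] 28.27433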
Questions
Study the following two functions, aimed at calculating the overall mean of samples collected by two separate researchers.
- What happened in each function?
- What are their differences?
- Which one is better?
overall_mean1 = function(samples1, samples2) {
  samples_all = c(samples1, samples2)
  return(mean(samples_all))
}

overall_mean2 = function(samples1, samples2) {
  mean1 = mean(samples1)
  mean2 = mean(samples2)
  return((mean1 + mean2) / 2)
}
- Hint: imagine the following scenarios:
- If the first researcher collected a lot more samples than the second one, which way is better?
We will use an example project of the most popular baby names in the United States and the United Kingdom. A cleaned and merged version of the data file is available at http://tutorials.iq.harvard.edu/R/Rintro/dataSets/babyNames.csv.
In order to read data from a file, you have to know what kind of file it is. In our case, it is a CSV file, one of the most popular formats for storing simple tabular data. Because it is so ubiquitous, R has a base function to read CSV files. We can learn more about this function by opening its help page:
?read.csv
Question: What would you use for other file types? Do you think that the files you usually deal with can be parsed by functions in `base`, or would you need a special package to load data from them?
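As one illustration (not an exhaustive answer to the question above), base R also ships `read.delim()` for tab-separated files, while formats such as Excel spreadsheets need an extra package like `readxl`. The file names below are hypothetical placeholders:

# NOT RUN:
# my_table <- read.delim("my_table.tsv")            # tab-separated values, base R
# my_sheet <- readxl::read_excel("my_table.xlsx")   # Excel file, requires the readxl package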
Read in the file and assign the result to the name `baby.names`, and use the `class` function to check that the object is of class `data.frame`:
baby.names = read.csv(file="https://www.dropbox.com/s/kr76pha2p82snj4/babyNames.csv?dl=1")
class(baby.names)
# Expected output:
# [1] "data.frame"
We can get a high-level data type description of the `data.frame` using the function `str`:
str(baby.names)
# Expected output:
# 'data.frame': 400000 obs. of 6 variables:
# $ Name : chr "Truth" "Jalyssa" "Carolina" "Quinnton" ...
# $ Sex : chr "Male" "Female" "Female" "Male" ...
# $ Count : int 30 39 199 11 5 6 54 17 8 6 ...
# $ Year : int 2007 2013 1974 1993 2017 2008 2012 2011 1987 1998 ...
# $ Location : chr "CA" "NM" "England and Wales" "TN" ...
# $ Name.length: int 5 7 8 8 6 12 5 5 7 8 ...
There are some commands that will help us inspect or scan tabular data in R without having to print or access the entire content. For example, the `head` function allows us to look at the first rows of a table (6 by default):
head(baby.names)
# Expected output:
# Name Sex Count Year Location Name.length
# 1 Truth Male 30 2007 CA 5
# 2 Jalyssa Female 39 2013 NM 7
# 3 Carolina Female 199 1974 England and Wales 8
# 4 Quinnton Male 11 1993 TN 8
# 5 Samiul Male 5 2017 IL 6
# 6 Arinzechukwu Male 6 2008 SC 12
The `summary` function will give us some useful stats that vary depending on the data type of each column:
summary(baby.names)
# Expected output:
# Name Sex Count Year
# Length:400000 Length:400000 Min. : 5.0 Min. :1960
# Class :character Class :character 1st Qu.: 7.0 1st Qu.:1982
# Mode :character Mode :character Median : 11.0 Median :1996
# Mean : 157.6 Mean :1994
# 3rd Qu.: 30.0 3rd Qu.:2007
# Max. :86922.0 Max. :2017
# Location Name.length
# Length:400000 Min. : 2.000
# Class :character 1st Qu.: 5.000
# Mode :character Median : 6.000
# Mean : 6.232
# 3rd Qu.: 7.000
# Max. :15.000
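A closely related helper, mentioned here only as an aside, is `tail()`, which shows the last rows of the table instead of the first:

tail(baby.names, 3)
# Prints the last 3 rows of the data.frame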
Here we will work further with the `baby.names` data that you loaded above.
Usually, tabular data read into R will be stored as a data.frame.
- A data.frame is a list of vectors of equal length
- Each vector in the list forms a column
- Each column can be a different type of vector
- Typically columns are variables and the rows are observations
A data.frame has two dimensions corresponding to the number of rows and the number of columns (in that order).
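To make the "list of equal-length vectors" idea concrete, here is a tiny hand-built data.frame (toy values, unrelated to the baby names data):

toy = data.frame(Name = c("Ada", "Bo", "Cam"),
                 Count = c(10, 25, 7))
toy
# Expected output:
#   Name Count
# 1  Ada    10
# 2   Bo    25
# 3  Cam     7
dim(toy)
# Expected output:
# [1] 3 2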
Check the dimensions of the data.frame, then extract its first three rows.
dim(baby.names)
# Expected output:
# [1] 400000 6
baby.names[1:3,]
# Expected output:
# Name Sex Count Year Location Name.length
# 1 Truth Male 30 2007 CA 5
# 2 Jalyssa Female 39 2013 NM 7
# 3 Carolina Female 199 1974 England and Wales 8
Extract the first three columns of the data.frame.
baby.names[,1:3]
# Expected output:
# Name Sex Count
# 1 Truth Male 30
# 2 Jalyssa Female 39
# 3 Carolina Female 199
# 4 Quinnton Male 11
# 5 Samiul Male 5
#
# ...
#
# 33331 Enis Male 14
# 33332 Aashna Female 36
# 33333 Garcia Male 5
# [ reached 'max' / getOption("max.print") -- omitted 366667 rows ]
Output a specific column of the data.frame.
baby.names$Name
# Try also baby.names[["Name"]] or baby.names[, "Name"]
Output only the unique values within a column.
unique(baby.names$Name)
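If you also want to know how many distinct names there are, a common idiom is to wrap this in `length()` (shown without its output, which depends on the data):

length(unique(baby.names$Name))
# Returns the number of distinct values in the Name column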
Using the base function `sum()`: how many babies were called "Jill"?
sum(baby.names$Name == "Jill")
# Expected output:
# [1] 25
Extract rows where `Name == "Jill"`. This can also be done with the `subset` command in R.
baby.names[baby.names$Name == "Jill",]
subset(baby.names, Name == "Jill")
Operator | Meaning |
---|---|
`!` | Not |
`==` | Equal to |
`!=` | Not equal to |
`>` | Greater than |
`>=` | Greater than or equal |
`<` | Less than |
`<=` | Less than or equal |
`%in%` | Contained in set |
`&` | And |
`\|` | Or |
Exercise # 1:
E1-1. How many female babies are listed in the table?
E1-2. How many babies were born after 2003? Save the subset in a new dataframe.
The original table has a column that stores the Location where the baby was born. Let's first check that this is true:
head(baby.names)
# Expected output:
# Name Sex Count Year Location Name.length
# 1 Truth Male 30 2007 CA 5
# 2 Jalyssa Female 39 2013 NM 7
# 3 Carolina Female 199 1974 England and Wales 8
# 4 Quinnton Male 11 1993 TN 8
# 5 Samiul Male 5 2017 IL 6
# 6 Arinzechukwu Male 6 2008 SC 12
Indeed, the `Location` column stores location data. Let's ask R to build a quick table of counts per unique value in `Location`:
table(baby.names$Location)
# Expected output:
# AK AL AR AZ
# 8685 31652 23279 42775
# CA CO CT DC
# 133257 35181 21526 11074
# DE England and Wales FL GA
# 8625 227449 77218 58715
# HI IA ID IL
# 12072 22748 16330 66576
# IN KS KY LA
# 40200 24548 27041 35370
# MA MD ME MI
# 33190 35279 10030 51601
# MN MO MS MT
# 32946 37078 24520 9759
# NC ND NE NH
# 51874 8470 17549 9806
# NJ NM NV NY
# 47219 18673 21894 89115
# OH OK OR PA
# 55633 29857 27524 53943
# RI SC SD TN
# 9020 29738 9687 39714
# TX UT VA VT
# 113754 29828 44859 5519
# WA WI WV WY
# 41231 32858 13726 5786
Apparently, the `Location` column contains 2-letter abbreviations for US states, plus a separate value for England and Wales. Knowing that, let's add a new column specifying the country.
baby.names$Country = ifelse(baby.names$Location == "England and Wales", "UK", "US")
head(baby.names)
# Expected output:
# Name Sex Count Year Location Name.length Country
# 1 Truth Male 30 2007 CA 5 US
# 2 Jalyssa Female 39 2013 NM 7 US
# 3 Carolina Female 199 1974 England and Wales 8 UK
# 4 Quinnton Male 11 1993 TN 8 US
# 5 Samiul Male 5 2017 IL 6 US
# 6 Arinzechukwu Male 6 2008 SC 12 US
table(baby.names$Country)
# Expected output:
# UK US
# 46516 353484
Especially when it comes to metadata, some cleaning may be needed before you run statistics. Let's take a look at the `Sex` column.
table(baby.names$Sex)
# Expected output:
# Female M Male
# 241203 7709 151088
Do you notice any discrepancy in the output?
If we ran statistics on this column, they would be confused by the inconsistent coding of males here. One way to fix this column is to perform a text replacement, substituting occurrences of `M` (by itself) with the word `Male`:
baby.names$Sex = gsub("M$", "Male", baby.names$Sex)
# NOT RUN: another possibility would be to do
# baby.names$Sex <- ifelse(baby.names$Sex == "M", "Male", baby.names$Sex)
Question: Why do we need the `$` sign? What happens if we omit it?
Check the Sex summary table again.
table(baby.names$Sex)
# Expected output:
# Female Male
# 241203 158797
Now that we have made some changes to our data set, we might want to save those changes to a file that we can store and share. We can save it as a generic CSV table to be parsed by other systems, or as an R image file (`.RData`) to be loaded by R.
When interacting with the file system, you want to be careful about referencing file and folder paths. You should aim to write code that creates file paths procedurally, agnostic of filesystem, username, or R version. Two packages can help with this: `fs`, to interact with the filesystem, and `here`, to organize and locate R project roots and subfolders. Let's use a path builder to create a folder in our user `Desktop` that will work across systems.
library(fs)
# Load the filesystem package
workdir_path <- path_home("Desktop", "R_Tutorial")
# Build a system-specific path from user home to Desktop and then R_Tutorial
dir.create(workdir_path)
# run a system command to create the R_Tutorial folder according to the path we built
write.csv(baby.names, file=path(workdir_path, "babyNames_v2.csv"))
# write the CSV file to the workdir under the name "babyNames_v2.csv"
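The `here` package mentioned above works in a similar spirit, but anchors paths at the root of an R project rather than at the user's home directory. A minimal sketch, assuming you are working inside a project folder (e.g., one containing an `.Rproj` file):

# NOT RUN:
# library(here)
# here("output", "babyNames_v2.csv")
# # builds a path like <project root>/output/babyNames_v2.csv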
Question: How would you save other file formats?
Locate and open the file outside of R. Were your changes saved with the file?
save(list = "baby.names", file=path(workdir_path, "babyNames.Rdata"))
Question: How do you load an R object? Use `?load` to help you figure it out.
Descriptive statistics of single variables are straightforward:
Find the mean of baby name lengths:
mean(baby.names$Name.length)
# Expected output:
# [1] 6.231587
Find the median of baby name lengths:
median(baby.names$Name.length)
# Expected output:
# [1] 6
Find the standard deviation of baby name lengths:
sd(baby.names$Name.length)
# Expected output:
# [1] 1.472727
Summarize the baby name lengths:
summary(baby.names$Name.length)
# Expected output:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 2.000 5.000 6.000 6.232 7.000 15.000
Exercise # 2:
E2-1. Which are the longest names?
E2-2. Which are the shortest names?
Compare the length of baby names for boys and girls using a boxplot.
p = ggplot(data = baby.names, aes(x = Sex, y = Name.length)) +
geom_boxplot()
ggsave(plot = p,
filename = path(workdir_path, "basic_box_introR.png"),
width = 7, height = 6)
print(p)

Adding color to the boxplot:
p2 = ggplot(baby.names, aes(x = Sex, y = Name.length, fill = Sex)) +
geom_boxplot() +
theme_bw()
ggsave(plot = p2,
filename = path(workdir_path, "fill_box_introR.png"),
width = 7, height = 6)
p3 = ggplot(baby.names, aes(x = Sex, y = Name.length, color = Sex)) +
geom_boxplot(lwd = 2) +
theme_bw()
ggsave(plot = p3,
filename = path(workdir_path, "color_box_introR.png"),
width = 7, height = 6)
print(p2)

print(p3)

Exercise # 3:
Change the layout of the plot.
E3-1. Add a plot title.
E3-2. Add a title to the y-axis.
E3-3. Choose custom colors for the boxplots. Good places to look up color names are:
- http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
- http://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3
How many names were recorded for each year?
- Check the timeframe that is included in the table.
- How many records were obtained in total?
p4 = ggplot(baby.names, aes(x = Name.length)) +
geom_histogram()
ggsave(plot = p4,
filename = path(workdir_path, "basic_histogram_introR.png"),
width = 7, height = 6)
print(p4)

Exercise # 4:
E4-1. Take a look at `?geom_histogram` and customize the layout of the plot like you did for E3-1.
- If you google `geom_histogram` you will find the `ggplot2` documentation with every aesthetic mapping and option for this function.
- You can also use other resources like `help`, Discourse forums, or StackOverflow to find answers to specific use-cases.
The "tidyverse" is
- a dialect of R
- a collection of R packages with a shared design philosophy focused on manipulating data frames
All packages of the `tidyverse` share an underlying design philosophy, grammar, and data structures. This shared design simplifies complex data manipulation and visualization tasks. Core packages include:
- `dplyr`: data manipulation
- `ggplot2`: data visualization
- `tidyr`: reshaping and tidying data
- `readr`: reading rectangular data (CSV, TSV, etc.)
- `tibble`: enhanced data frames
- `purrr`: functional programming tools
- `stringr`: string operations
- `forcats`: categorical variable handling
You can load them all at once using `library(tidyverse)`.
One prominent feature of the tidyverse is the pipe: `%>%`. Pipes take input from their left and pass it to the function on their right as the first positional argument. So these two lines of code will produce the same result:
library(dplyr)
f <- function(x,y) { x + y }
x <- 5
y <- 3
f(x,y)
# Expected output:
# [1] 8
x %>% f(y)
# Expected output:
# [1] 8
This makes code more readable when chaining multiple operations performed on an input data frame. As an example, we can re-load the original baby.names dataset and re-do the manual processing we did, but using `dplyr` alternatives. The following block of code will create the `Country` column, fix the `Sex` column, and assign the processing result to its own variable name. Then, it will display a summary of name lengths (mean and SD) for each Country-Sex combination:
baby.names = read.csv(file="https://www.dropbox.com/s/kr76pha2p82snj4/babyNames.csv?dl=1")
processed_baby.names <- baby.names %>%
mutate(
Country = case_when(
Location == "England and Wales" ~ "UK",
.default = "US"
)
) %>%
mutate(
Sex = recode(Sex, "M" = "Male", "F" = "Female")
)
processed_baby.names %>%
group_by(Country, Sex) %>%
summarise(mean_length = mean(Name.length), sd_length = sd(Name.length))
The example below takes the baby.names data, filters out the "England and Wales" observations, groups by `Year` and `Sex`, then computes the average name length by group, and arranges the result in descending order. This is done using several functions from the `dplyr` package from the tidyverse family.
processed_baby.names %>%
filter(Location != "England and Wales") %>%
group_by(Year, Sex) %>%
summarise(mean_length = mean(Name.length)) %>%
arrange(-mean_length)
You can see that the command ends up looking similar to the English sentence describing what it does. The final result shows that Females in 1989 had the longest names on average.
Note: The base R language recently introduced its own pipe in version 4.1.0, which looks like this: `|>`. There are some subtle differences in behavior, but for the most part they are interchangeable.
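For instance, the last `dplyr` chain above could equally be written with the base pipe (shown purely as an illustration; with R >= 4.1.0 the result should be identical):

processed_baby.names |>
  filter(Location != "England and Wales") |>
  group_by(Year, Sex) |>
  summarise(mean_length = mean(Name.length)) |>
  arrange(-mean_length)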
- RStudio Cheat Sheets
- ggplot2 Tutorial
- Data Carpentry: R for Ecology + tidyverse tutorial
- Tidyverse Book
- Bioconductor for Microbiome Analysis
- Other tutorials using the baby names data: rpub