Microbiome Helper 2 Analysis workflow in R - LangilleLab/microbiome_helper GitHub Wiki
Authors: Robyn Wright
Modifications by: NA
Based on initial versions by: Robyn Wright and Ben Fisher for previous workshops
Please note: We think that everything here should work, but we are still testing/developing this so use with caution :)
Here we are aiming to give some examples for a typical analysis workflow that we would carry out. Often, members of the Langille lab will use several different methods for analysing and plotting data, so this is not completely comprehensive, but should offer some examples of the types of analyses that can be performed and is hopefully a good starting point for many people.
The initial importing steps may differ depending on what kind of data you are looking at, but many of the analyses will be similar regardless.
You can choose to run RStudio locally on your own computer, or on something like RStudio Server. If you are using an AWS server then this is already installed, as it is on many other servers.
If you are using an AWS server then you will need to set a password: sudo passwd ubuntu
- enter your new password when prompted.
You can then go to http://##.##.###.##:8080/
- you'll need to switch out the # for your public IPv4 address.
We are using several different packages, which can be installed using the following commands:
BiocManager::install(version = "3.21")
BiocManager::install("phyloseq")
library("devtools")
install_github("biobakery/maaslin3")
BiocManager::install("microbiome")
BiocManager::install("ALDEx2")
install.packages("tidyverse")
install.packages('vegan',
repos = c('https://vegandevs.r-universe.dev','https://cloud.r-project.org'))
install_github("statdivlab/radEmu")
install.packages("randomcoloR")
You could choose to run these things in the RStudio Console, but what I like to do is use an RStudio notebook - this allows me to keep track of all of the code that I have run, and produce an output at the end. If you'd like to use an RStudio notebook, then simply go to File > New File > R Notebook. Read through the brief instructions and try running it!
After you've familiarised yourself a little, you can choose a new title at the top, and then delete the rest. I would then make a new "Chunk" for each of the code chunks that I have below (click on the green C and choose "R"). You can make notes outside of these chunks if you like.
If you use the R notebook and make chunks, you'll also need to run the chunks! You can do that by clicking the green play button in the top right hand corner of each chunk, or by pressing Command+Shift+Enter. You can run a single line by pressing Command+Enter.
If you are running the code straight in the R console then you will only need to press Enter to run it, the same as you will have done in the Terminal.
I've added some drop downs titled Tutorial results - if you're using the tutorial data (assuming that you've started with the PacBio data), then you can see what the output is expected to look like when it's run in RStudio.
Load the packages:
library(phyloseq)
library(maaslin3)
library(ALDEx2)
library(ggplot2)
library(tidyr)
library(stringr)
library(vegan)
library(stats)
library(radEmu)
library(randomcoloR)
You'll notice that in each of these, we use the library() command, and inside the brackets we give the name of the package. We've already installed these packages for you, but in most cases they could be installed with e.g., install.packages("tidyr"). In some cases, they might be installed using another package, BiocManager, e.g., BiocManager::install("ALDEx2"). If you're unsure, you can always google how to install a package and you can usually find the code to copy and paste, e.g., "r install aldex2". You could also try typing ??aldex2 into the console and see what comes up - it should give you some information about that package.
Set the name of the folder where our results are saved:
folder = '~/robyn/arctic_ocean_pacbio/'
Here, we have defined a variable called folder - we always need to be careful that we don't save over these variables by calling something else the same thing, but you can always check these in the "Environment" area in the top right, where you should see that folder now exists. We could have called this anything, but it's always helpful to name our variables something that is simple and explains what they are, in case someone else is using our code. This is a string, as it is inside the quotation marks ' ' or " ", which are used in programming to show that we are representing text rather than numbers. These variables can be really useful to avoid having to type in the same information multiple times, which always increases the chances that we'll make a mistake!
You can see how these would be different by running the following:
random_string = "12345"
random_number = 12345
You should see how these are different by how they show in the "Environment" area in the top right.
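If you'd like to check this in code as well as in the "Environment" area, a quick optional check is to ask R what type each object is:
class(random_string) #this should say "character", i.e. text
class(random_number) #this should say "numeric", i.e. a number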
Now that we've explained the basics, we can read in some of our data.
First, we'll read in the feature table. Note that we're calling it asv_table:
asv_table <- read.csv(paste(folder, "denoised_output_exported/feature-table_w_tax.txt", sep=""), sep='\t', skip=1, header=T)
Here we're combining a few different things. We're using the read.csv function - functions are modules of code that accomplish a specific task. We can write them ourselves, they are what is contained in the packages that we imported at the start, or there are several that are built in to R. If you want more information about what a function does, as well as the input that it expects and the output that it gives, you can type in ?read.csv.
We're also using the paste function to build the full file path of the feature table. We're giving it the folder variable that we already defined, as well as the additional folder and file name (both as positional arguments), and finally the sep="" named argument - this tells the paste function that there should be no spaces or other punctuation between the parts that are being pasted together. Try running just this part and see what the result is: paste(folder, "denoised_output_exported/feature-table_w_tax.txt", sep="") - note that as we haven't given this a name, it is not saved as a variable.
Next, we've given the sep='\t' argument - this time, we're telling the read.csv function that the file we're importing is tab-delimited. Then we've told it to skip the first line of the file with skip=1, and that there is a header with header=T (where T is the same as TRUE). We've skipped the first line because it just reads "# Constructed from biom file" and is not actually data.
Just take a look at asv_table using View(asv_table) before continuing, so you can see what the data looks like.

We often find in programming that things are not quite in the format expected by the packages that we're using, so we'll make a few modifications to asv_table:
asv_table_num = data.matrix(asv_table[,2:19]) #convert the ASV table to a numeric matrix. Note that if you didn't use the tutorial data, you might need to change this "19" to make sure that you don't cut off any samples. It should be the number of samples + 1. Have a look at the asv_table object in the upper right corner!
rownames(asv_table_num) = asv_table[,1] #give the matrix row names
ASV = otu_table(asv_table_num, taxa_are_rows = TRUE) #convert asv_table_num to an otu_table
You can see here that I've added # after the code and have written what each line does. You can use # to make comments throughout your code documents, to explain what each line is doing. Here you should also notice that we didn't save over the previous asv_table, but created a new one called asv_table_num. We've told R that it should be expecting numeric data within this table, then we gave it the same row names as our previous asv_table object, and at the end we converted it to what Phyloseq calls an otu_table (it doesn't matter if we have OTUs, ASVs or something else in this table).
The top right hand corner should probably look something like this:

And if you use View(ASV) to look at the otu_table, then you should have something like this:
Again, we're just showing a small section of the table!
You may have noticed that the last column in the initial asv_table was our taxonomy information. We'll get that first:
taxonomy <- data.frame(asv_table[,"taxonomy"])
rownames(taxonomy) <- asv_table[,1]
colnames(taxonomy) = c("taxonomy")
If you print this out by typing taxonomy or by clicking on it in the "Environment" area, you'll see that we currently only have a single column that contains all of the taxonomy information.
We want to split this so that we have columns for each of Domain, Phylum, Class, Order, Family, Genus and Species. Luckily, there is already a function that we can use for this:
taxonomy_split <- separate(data = taxonomy, col = taxonomy, into = c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = "\\; ") #separate the taxonomy table so each phylogenetic level is its own column. Note that if you are using 18S then you might need to separate this differently!
TAX = tax_table(taxonomy_split) #convert taxonomy_split to a tax_table
taxa_names(TAX) <- rownames(taxonomy_split) #add names to TAX/the tax_table
You'll see that we tell the separate function that the data it should use is in the taxonomy table, and that the col (column) to split is called "taxonomy". The "into" named argument takes a vector of new column names, which is defined in R with c(), and the taxonomy column is split into a new column each time the "; " separator is found. You'll see a warning after running this line, but it still ran so that's fine - it's just telling us that there weren't values for all of Domain/Phylum/Class/Order/Family/Genus/Species for every ASV, which makes sense as not all of them will have a species-level classification.
Afterwards, we convert this into a Phyloseq tax_table.
Again, you can take a look at the results (or any of the intermediates) by using something like View(TAX).
Now we'll read in the metadata. I won't go through each stage step-by-step as hopefully you're getting the idea by now, but I've still added comments to each of the rows:
metadata <- read.csv(paste(folder, "arctic_study_metadata.txt", sep=""), sep='\t')
samples = data.frame(metadata[,c(3,4,5,6,7)], stringsAsFactors = FALSE) #convert this to a data frame - again, if you didn't use the tutorial data, check which columns of your metadata you need to keep here; you may have more or fewer than these five
rownames(samples) = metadata[,1] #add the sample names as row names
samples[, c(2,4,5)] <- sapply(samples[, c(2,4,5)], as.numeric) #convert numeric columns to numeric
samples[, 'Sequencing'] = 'PacBio' #add a column to the metadata table that says which kind of sequencing we've used - only necessary if you're using the tutorial data!
SAMPLE = sample_data(samples) #convert samples to a sample_data
Now we'll read in the phylogenetic tree as phy_tree:
phy_tree <- read_tree(paste(folder, "denoised_output_exported/tree.nwk", sep=""))
Phyloseq is a really useful package that contains many functions for analysing microbiome data. While it can be a bit fiddly to get our data into the format that it is expecting (most good packages should give you examples of how data should be formatted, although it does take a lot of practice to get good at quickly working out how this differs from what you have!), once it is in this format, the analyses are very easy to perform.
physeq = phyloseq(ASV, TAX, phy_tree, SAMPLE) #combine these all to make a phyloseq object
physeq #print this out to see what a phyloseq object looks like
Now we have combined all of these different parts into one phyloseq object. You can also get each of the individual objects back after performing manipulations, e.g., otu_table(physeq).
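As an optional check, you can pull each of the components back out of the combined object - these are all standard phyloseq accessor functions:
head(otu_table(physeq)) #the first few rows of the feature (ASV) table
head(tax_table(physeq)) #the first few rows of the taxonomy table
head(sample_data(physeq)) #the first few rows of the sample metadata
ntaxa(physeq) #the number of taxa in the object
nsamples(physeq) #the number of samples in the object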
Now we can take a look at how many reads have been classified within our samples:
sample_sums(physeq)
print(c("Minimum sample size:",min(sample_sums(physeq)), "Maximum sample size:", max(sample_sums(physeq))))
It's also always good to have a look at a rarefaction curve. This tells us about whether we've sequenced our samples sufficiently to capture all of the diversity within them:
rarecurve(as.data.frame(t(otu_table(physeq))), step=50,cex=0.5,label=TRUE)
If you're not familiar with rarefaction curves, or species accumulation curves, then each line represents a single sample. You should see that the maximum x value is the number of reads in that sample, and the maximum y value is the number of ASVs. Each point along the curve represents the number of ASVs that are found if we randomly sub-sampled the sample to only have X number of reads. Ideally, these will all plateau, which would suggest that our sequencing depth is sufficient and we wouldn't expect to identify any more ASVs even if we'd sequenced more.
Note that if you'd like to zoom in on a particular area of the graph then you can simply set x/y limits using e.g. xlim=c(1,5000):
rarecurve(as.data.frame(t(otu_table(physeq))), step=50,cex=0.5,label=TRUE, xlim=c(1,5000))
Let's also take a look at the number of reads assigned to each taxon within our samples:
length(taxa_sums(physeq))
print(c("Minimum reads:",min(taxa_sums(physeq)), "Maximum reads:", max(taxa_sums(physeq))))
taxa_sums(physeq)
Note that this is only the first part of the print-out! You will be able to see the sums for all 817 of the ASVs.
We're usually going to want to remove some of these very low abundance taxa - we can't be sure whether they were sequencing errors, bleed-through from previous sequencing runs, or just very low abundance taxa. I've chosen a threshold of at least 100 total reads to keep a taxon, but you can choose something different depending on what makes sense to you.
physeq <- prune_taxa(taxa_sums(physeq)>=100,physeq)
physeq
If you look at the physeq object again, you should see that the number of taxa has been reduced.
Take another look at the rarefaction curve:
rarecurve(as.data.frame(t(otu_table(physeq))), step=50,cex=0.5,label=TRUE)
Now we're going to want to choose a number of reads where it looks like all samples have plateaued - note that any samples with fewer reads than this will be removed from further analyses. For me, this was around 1000.
So first, we'll remove samples with below 1000 reads:
physeq <- prune_samples(sample_sums(physeq)>=1000, physeq)
Take a look at the rarefaction curves again to double-check that you're OK with this:
rarecurve(as.data.frame(t(otu_table(physeq))), step=50,cex=0.5,label=TRUE, xlim=c(0,5000))
And then rarefy all samples to the minimum remaining read depth (you could choose the same value for this - 1000 in my case - but ideally you want this to be as high as possible so that you aren't unnecessarily removing reads):
set.seed(711)
rarefied_physeq<-rarefy_even_depth(physeq, rngseed = FALSE, sample.size = min(sample_sums(physeq)), trimOTUs = TRUE)
See that the samples all have the same number of reads now:
sample_sums(rarefied_physeq)
And then we'll also normalise by converting to relative abundance:
relabun_physeq <- transform_sample_counts(physeq, function(x) x*100 / sum(x))
Note that we are only going to use this rarefied object for the alpha diversity analyses, but we have excluded the low read depth samples from all further analyses!
Alpha diversity refers to within sample diversity, and there are three main components of it:
- Richness: the number of different taxa present
- Evenness: whether the taxa are present in similar (even) abundances
- Phylogenetic diversity: we are also often interested in whether the taxa in our samples are closely related or not, and Phylogenetic Diversity is used as a representation of this. It is calculated by summing the branch lengths of a phylogenetic tree that connects all taxa within a sample, so a high value means that the taxa are phylogenetically more diverse
Several of the most commonly used Alpha diversity metrics are:
- Observed taxa: simply a count of the number of different taxa in each sample
- Chao1 Richness: this is an estimate of the total number of taxa within your community based on your sample
- Shannon Index (or Shannon-Wiener Index): considers both richness and evenness and is calculated by summing the proportion of each species multiplied by the natural logarithm of that proportion, and taking the negative of that sum
- Simpson's Index: a measure of the probability that two randomly selected reads from a community will belong to the same taxon. With only one taxon, the probability that both will be the same is 1. The complement of this (1 minus the probability) is what is most frequently used, where 0 represents low diversity and 1 represents high diversity (see the short sketch after this list)
- Faith's Phylogenetic Diversity: as described above
- Fisher's Alpha: this also considers both richness and evenness and is often used when the taxa abundance follows a log-series distribution
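To make the Shannon and Simpson definitions above a little more concrete, here is a toy calculation on some made-up counts (not the tutorial data), compared against vegan's diversity() function:
counts <- c(50, 25, 15, 7, 3) #made-up counts for five taxa in one sample
p <- counts / sum(counts) #convert the counts to proportions
shannon_manual <- -sum(p * log(p)) #Shannon: the negative sum of p multiplied by ln(p)
simpson_manual <- 1 - sum(p^2) #Simpson (1 - D): one minus the probability that two random reads are the same taxon
shannon_manual
diversity(counts, index = "shannon") #the same value calculated by vegan
simpson_manual
diversity(counts, index = "simpson") #the same value calculated by vegan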
Beta diversity refers to between sample diversity and there are again three main components of it:
- Whether the taxa in the samples are the same
- Whether the taxa are present at similar abundances in the different samples
- Whether the taxa are closely related to one another
There are a lot of different beta diversity metrics used, but some of the most widely used for microbiome analyses are:
- Bray-Curtis Dissimilarity Index: takes into account the similarity of the taxa between two samples as well as the abundances of the taxa within these samples. It ranges between 0 and 1, where 0 would indicate that the two samples are identical and 1 that there are no shared taxa between them
- Weighted/Unweighted UniFrac distance: measures the difference between communities by considering the branch lengths of a phylogenetic tree that are unique to each community. Weighted UniFrac additionally considers the abundance of the taxa within the two communities/samples, where Unweighted UniFrac doesn't
- Aitchison distance: a method for compositional data analysis that is the Euclidean distance calculated on Centered Log-Ratio (CLR) transformed data. Converting microbiome data to CLR requires the addition of a pseudocount (usually 1) so that logs can be taken; the robust version of this is also often used, which removes zeroes prior to the log-ratio calculations and adds them back afterwards (see the short sketch below)
- Phylo-RPCA (Robust Principal Component Analysis): another method for compositional data analysis that also incorporates phylogeny
Many beta diversity metrics account for the compositionality of microbiome data. Data compositionality refers to the fact that the abundance of all taxa are dependent on the abundances of all other taxa within that sample because the number of reads (our sequencing effort) for each sample is finite. You can read more on this concept here.
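As a small illustration of the CLR/Aitchison idea described above, here is a toy example on some made-up counts (not the tutorial data) using only base R - this is a sketch of the standard (non-robust) version with a pseudocount of 1:
toy_counts <- matrix(c(10, 0, 5, 20,
                       8, 2, 0, 30,
                       15, 1, 3, 25),
                     nrow = 3, byrow = TRUE) #3 samples (rows) x 4 taxa (columns)
toy_pseudo <- toy_counts + 1 #add a pseudocount of 1 so that we can take logs of the zeroes
clr_toy <- t(apply(toy_pseudo, 1, function(x) log(x) - mean(log(x)))) #CLR: the log of each value minus the mean log of its sample
dist(clr_toy, method = "euclidean") #the Aitchison distance is the Euclidean distance on the CLR-transformed data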
Now that you hopefully understand alpha and beta diversity... we'll get to looking at them.
First, we'll use a phyloseq plot function to look at alpha diversity within our samples:
plot_richness(physeq = rarefied_physeq, color = "Depth", measures = c("Observed", "Shannon", "Fisher", "Simpson")) + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
Note that if it is difficult to see your sample names, try clicking "Show in New Window" (the left of the three icons that show above your plot!):
You'll see that I've coloured them by Depth, which is one of the categories in the metadata that I'm using. You should switch this to whatever is of interest to you. You could also add a second variable using the shape = "" option.
Now we'll calculate the Bray-Curtis dissimilarity between our samples and perform a PERMANOVA (ADONIS2) test to see whether they're significantly different from one another based on any of our metadata categories:
distance <- phyloseq::distance(relabun_physeq, method="bray", weighted=F)
ads = adonis2(distance ~ sample_data(relabun_physeq)$Depth + sample_data(relabun_physeq)$geographic_location_.latitude. + sample_data(relabun_physeq)$geographic_location_.longitude., by="margin", parallel=2)
ads
I've used a few different metadata categories here and have used + to look at them all together. The two key values we're interested in here are the R2 and the p-value (Pr(>F)). The R2 tells us how much of the variation between our samples is accounted for by that category, and p tells us whether it's statistically significant or not. The residual R2 value tells you how much variation within your samples is unaccounted for by the metadata categories that you're looking at. You could use * instead of + if you wanted to look at the interaction between any of your variables.
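For example, a minimal sketch of what an interaction model could look like, using two of the tutorial metadata columns (swap in whichever variables make sense for your data - note that I've used the default sequential testing here rather than by="margin"):
ads_interaction = adonis2(distance ~ sample_data(relabun_physeq)$Depth * sample_data(relabun_physeq)$geographic_location_.latitude., parallel=2)
ads_interaction #the row containing ":" is the interaction term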
This might take a while if you have a lot of samples!
Now we'll do the same but with Weighted UniFrac distance:
distance <- phyloseq::distance(relabun_physeq, method="unifrac", weighted=T)
ads = adonis2(distance ~ sample_data(relabun_physeq)$Depth + sample_data(relabun_physeq)$geographic_location_.latitude. + sample_data(relabun_physeq)$geographic_location_.longitude., by="margin", parallel=2)
ads
If you compare these to the Bray-Curtis results, you should see that the R2 value for Latitude is about the same for both (indicating that Latitude accounts for ~11% of the variation between samples), the R2 value is higher for UniFrac than Bray-Curtis for Depth, and lower for UniFrac than Bray-Curtis for Longitude. This would suggest to me that some of the Longitude differences that Bray-Curtis picks up on are from organisms that are relatively closely related phylogenetically, but some of the Depth differences that Bray-Curtis shows are from less closely related organisms.
Now we'll plot the samples on PCoA plots. PCoA plots attempt to represent the variation between our samples in fewer dimensions; points that are closer together indicate samples that are more similar, and points that are further apart indicate samples that are less similar. We'll also often add an ellipse to show each group. The percentages on the axes represent the amount of variation between our samples that is represented on that axis.
Plot Bray-Curtis dissimilarity:
ordination<-ordinate(relabun_physeq, "PCoA", "bray")
plot_ordination(relabun_physeq, ordination, color = "geographic_location_.longitude.") +
geom_point(size=10, alpha=0.1) + geom_point(size=5) + stat_ellipse(type = "t", linetype = 2) + theme(text = element_text(size = 20)) + ggtitle("Beta Diversity", subtitle = "Bray-Curtis dissimilarity")
You'll notice that I decided to colour my points by longitude here, as that was what the PERMANOVA/ADONIS2 test showed had the largest impact on variation between my samples, but you can choose any of your metadata categories here, and again, you could also use the shape = "" parameter if you have more than one variable of interest.
Plot Weighted UniFrac distance:
ordination<-ordinate(relabun_physeq, "PCoA", "wunifrac")
plot_ordination(relabun_physeq, ordination, color = "geographic_location_.longitude.") +
geom_point(size=10, alpha=0.1) + geom_point(size=5) + stat_ellipse(type = "t", linetype = 2) + theme(text = element_text(size = 20)) + ggtitle("Beta Diversity", subtitle = "Weighted UniFrac distance")

A useful way to visualise the composition of our samples is through bar charts of taxonomic relative abundances. Because each bar will be a separate colour, it's not usually very useful to try to show too many taxa in one plot, so we'll often take just the top X most abundant taxa (in this case the top 20), or only those with >Y% relative abundance.
Plot the top 20 most abundant ASVs:
relabun_physeq_top<-prune_taxa(names(sort(taxa_sums(relabun_physeq),decreasing=TRUE)[1:20]), relabun_physeq)
plot_bar(relabun_physeq_top, fill="ta7") + facet_wrap(c(~geographic_location_.longitude., ~Depth), scales="free_x", nrow=1) + theme(legend.text=element_text(size=5), legend.key.size = unit(0.3, "cm")) + guides(fill=guide_legend(ncol=1))
You should see here that we're first of all creating a new phyloseq object with only the top 20 most abundant taxa, and then we're plotting them. We're also grouping our samples by both longitude and depth (again, change these to whatever makes sense for you!). Hopefully you'll start to see some patterns here...

Now we'll try doing this at the phylum level instead:
#if you take a look at the taxonomy table within our phyloseq object, you should see that the phylum level is now called "ta2"
tax_table(relabun_physeq)
#so ta2 is what we'll use here to group everything at that rank
ps.phylum = tax_glom(relabun_physeq, taxrank="ta2", NArm=FALSE)
#have a look at this
print(ps.phylum)
#you can see that now there's only ten taxa, so we'll get a colour palette that has 10 unique colours
palette = distinctColorPalette(10)
#now we'll make the bar plot
plot_bar(ps.phylum, fill="ta2") + facet_wrap(c(~geographic_location_.longitude., ~Depth), scales="free_x", nrow=1) + theme(legend.text=element_text(size=5), legend.key.size = unit(0.3, "cm")) + guides(fill=guide_legend(ncol=1)) + scale_fill_manual(values=palette)
#you can see here that first we're plotting the barplot, and then we're adding some other options - we're telling it how to make the legend, and also which colours to use for plotting
Hopefully you can follow the comments on each line to see what they're doing!

And now we'll do the same, but only grouping the samples by longitude:
plot_bar(ps.phylum, fill="ta2") + facet_wrap(c(~geographic_location_.longitude.), scales="free_x", nrow=1) + theme(legend.text=element_text(size=5), legend.key.size = unit(0.3, "cm")) + guides(fill=guide_legend(ncol=1)) + scale_fill_manual(values=palette)
Now let's plot genus:
ps.genus = tax_glom(relabun_physeq, taxrank="ta6", NArm=FALSE)
ps.genus_top<-prune_taxa(names(sort(taxa_sums(ps.genus),decreasing=TRUE)[1:20]), ps.genus)
#this time we're keeping the top 20 genera, so we'll get a colour palette that has 20 unique colours
palette = distinctColorPalette(20)
#now we'll make the bar plot
plot_bar(ps.genus_top, fill="ta6") + facet_wrap(c(~geographic_location_.longitude., ~Depth), scales="free_x", nrow=1) + theme(legend.text=element_text(size=5), legend.key.size = unit(0.3, "cm")) + guides(fill=guide_legend(ncol=1)) + scale_fill_manual(values=palette)
#you can see here that first we're plotting the barplot, and then we're adding some other options - we're telling it how to make the legend, and also which colours to use for plotting

And genus only grouped by longitude:
plot_bar(ps.genus_top, fill="ta6") + facet_wrap(c(~geographic_location_.longitude.), scales="free_x", nrow=1) + theme(legend.text=element_text(size=5), legend.key.size = unit(0.3, "cm")) + guides(fill=guide_legend(ncol=1)) + scale_fill_manual(values=palette)
And genus only grouped by Depth:
plot_bar(ps.genus_top, fill="ta6") + facet_wrap(c(~Depth), scales="free_x", nrow=1) + theme(legend.text=element_text(size=5), legend.key.size = unit(0.3, "cm")) + guides(fill=guide_legend(ncol=1)) + scale_fill_manual(values=palette)
Obviously this isn't a comprehensive list of everything that you could do, but hopefully it's becoming clear to you how you could change these for yourself to get a plot of what you're most interested in.
While stacked bar plots can be great for getting an overview of our microbial communities, they're not very quantitative; it's often really difficult to determine whether a taxon is more abundant in one sample vs another, especially if it's not one of the most abundant. So heatmaps can be really helpful here!
We'll use the same phyloseq object as we created above to create a plot of the top 20 genera in our samples:
plot_heatmap(ps.genus_top, method = "PCoA", distance = "bray", low = "white", high = "red", na.value = "grey", sample.label="geographic_location_.longitude.", taxa.label="ta6")
Feel free to change the colours for low and high, or the sample labels. Note that we've used PCoA and "bray" for grouping the samples, so they're being sorted according to this grouping. This is often really useful for seeing patterns within groups of samples!
Next, we'll use a different heatmap plotting function to show us the dendrogram showing sample similarity:
heatmap(otu_table(ps.genus_top), labRow = tax_table(ps.genus_top)[,"ta6"])
You'll see that this also adds a dendrogram to our taxa, but it's important to note that this is picking up on similarities in taxon abundance and we haven't added the phylogenetic tree to this plot!
The final type of plot we'll go through is a phylogenetic tree plot with the abundance of taxa within our samples.
Again, we're plotting the top 20 genera:
plot_tree(ps.genus_top, nodelabf=nodeplotblank, color="geographic_location_.longitude.", label.tips = "ta6", ladderize="left", size="Abundance")
You can see from the legend that the abundance within samples is represented by the size of the point, and we've chosen to colour them based on longitude.
These are probably quite difficult to see, though.
So what we often want to do is save a figure:
tree_fig <- plot_tree(ps.genus_top, nodelabf=nodeplotblank, color="geographic_location_.longitude.", label.tips = "ta6", ladderize="left", size="Abundance") #assign the plot to an object
png(filename=paste(folder, "tree_plot.png", sep=""), width=40, height=20, units="cm", res=300) #open a figure and tell it how big this should be
plot(tree_fig) #plot our tree object
dev.off() #and tell the plotting that we're finished
Let's do the same for species now, but with the top 50 most abundant:
ps.species = tax_glom(relabun_physeq, taxrank="ta7", NArm=FALSE)
rank.sum = tapply(taxa_sums(ps.species), tax_table(ps.species)[, "ta7"], sum, na.rm=TRUE)
top50 = names(sort(rank.sum, TRUE))[1:50]
#and then we'll reduce the phyloseq object to only have these top50 taxa
ps.species = prune_taxa((tax_table(ps.species)[, "ta7"] %in% top50), ps.species)
tree_fig <- plot_tree(ps.species, nodelabf=nodeplotblank, color="geographic_location_.longitude.", label.tips = "ta7", ladderize="left", size="Abundance")
png(filename=paste(folder, "tree_plot_species.png", sep=""), width=40, height=40, units="cm", res=300)
plot(tree_fig)
dev.off()
#you'll see a few blank labels here, so you can always change "ta7" to "ta6" so that you plot the genus name for each of the species, rather than the species name, to see what you have
You can modify this same code to save any of the other plots that we've made!
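As an aside, most of the plots we've made (apart from the base-R heatmap) are ggplot objects, so ggplot2's ggsave() is an alternative way to save them - a minimal sketch, reusing the tree figure from above (the file name here is just an example):
ggsave(filename=paste(folder, "tree_plot_species_ggsave.png", sep=""), plot=tree_fig, width=40, height=40, units="cm", dpi=300) #saves the plot object directly, without needing png()/dev.off()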
Tens (if not more) of tools have been used for differential abundance testing of microbiome samples - identifying taxa that differ in abundance between sample groups (for example, treatment versus control) or with variables of interest (for example, correlations with pH or blood metabolite measurements). While these tools/methods have often been used interchangeably in the literature, there are large differences in the performance of different tools even on the same sample group. There have now been several large scale comparisons of different tools (e.g., Calgaro et al., Thorsen et al.), and our lab also carried one out (published here). While these studies have typically found that different methods lead to different biases in the results, as there is no one single best method for determining differentially abundant taxa, the best practice that we have for overcoming these issues currently is to use multiple differential abundance tests, report these clearly, and focus on the taxa that are identified by multiple tests.
We'll be using the most recent versions of several different tools here: ALDEx2, MaAsLin3 and radEmu.
If you are using the tutorial data, we're going to do the comparisons between Illumina and PacBio. So first we'll read in the other data and then look at a basic overview before continuing on with the others.
We used PacBio above, so we'll use Illumina here, but if you did the opposite then you'll need to switch some of these commands to reflect that.
Set the new folder name:
folder_2 = '~/robyn/arctic_ocean_illumina/'
Read in the ASV table:
asv_table_2 <- read.csv(paste(folder_2, "denoised_output_exported/feature-table_w_tax.txt", sep=""), sep='\t', skip=1, header=T)
asv_table_num_2 = data.matrix(asv_table_2[,2:40]) #convert the ASV table to a numeric matrix. Note that if you didn't use the tutorial data, you might need to change this "40" to make sure that you don't cut off any samples. It should be the number of samples + 1. Have a look at the asv_table object in the upper right corner!
rownames(asv_table_num_2) = asv_table_2[,1] #give the matrix row names
ASV_2 = otu_table(asv_table_num_2, taxa_are_rows = TRUE) #convert asv_table_num to an otu_table
Take a look at it with View(ASV_2):
You should notice that the sample names now say "short" instead of "full"!
Get the taxonomy information:
taxonomy_2 <- data.frame(asv_table_2[,"taxonomy"])
rownames(taxonomy_2) <- asv_table_2[,1]
colnames(taxonomy_2) = c("taxonomy")
taxonomy_split_2 <- separate(data = taxonomy_2, col = taxonomy, into = c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = "\\; ") #separate the taxonomy table so each phylogenetic level is its own column. Note that if you are using 18S then you might need to separate this differently!
TAX_2 = tax_table(taxonomy_split_2) #convert taxonomy_split to a tax_table
taxa_names(TAX_2) <- rownames(taxonomy_split_2) #add names to TAX/the tax_table
Get the metadata:
metadata_2 <- read.csv(paste(folder_2, "arctic_study_metadata.txt", sep=""), sep='\t')
samples_2 = data.frame(metadata_2[,c(3,4,5,6,7)], stringsAsFactors = FALSE) #convert this to a data frame - again, if you didn't use the tutorial data, check which columns of your metadata you need to keep here; you may have more or fewer than these five
rownames(samples_2) = metadata_2[,1] #add the sample names as row names
samples_2[, c(2,4,5)] <- sapply(samples_2[, c(2,4,5)], as.numeric) #convert numeric columns to numeric
samples_2[, 'Sequencing'] = 'Illumina'
SAMPLE_2 = sample_data(samples_2) #convert samples to a sample_data
Get the phylogenetic tree:
phy_tree_2 <- read_tree(paste(folder_2, "denoised_output_exported/tree.nwk", sep=""))
Combine this into a Phyloseq object:
physeq_2 = phyloseq(ASV_2, TAX_2, phy_tree_2, SAMPLE_2) #combine these all to make a phyloseq object
physeq_2 #print this out to see what the phyloseq object looks like
Note that if you are looking at this for the first time, it would be good to run the same checks on this as you did above (looking at the rarefaction curves, looking at the number of taxa and how many reads there are assigned to each, etc.), but here I'll just filter the Phyloseq object.
Prune taxa:
physeq_2 <- prune_taxa(taxa_sums(physeq_2)>=200,physeq_2)
physeq_2
Prune samples:
physeq_2 <- prune_samples(sample_sums(physeq_2)>=2000, physeq_2)
And now rarefy:
set.seed(711)
rarefied_physeq_2<-rarefy_even_depth(physeq_2, rngseed = FALSE, sample.size = min(sample_sums(physeq_2)), trimOTUs = TRUE)
Convert to relative abundance:
relabun_physeq_2 <- transform_sample_counts(physeq_2, function(x) x*100 / sum(x))
Now that we've got these normalised Phyloseq objects, you can perform the same analyses (alpha/beta diversity, bar plots, etc.) as we did above if you like. Just below we will show these analyses for both together, though.
Note that the trees don't combine so easily and we don't need them for the differential abundance analyses, so we're not adding them here. We also would not expect the two different sequencing types to match ASV names because they were from different length sequences, so we will group both at the species level and rename all of the taxa to be these species names.
physeq_pacbio = phyloseq(otu_table(physeq), sample_data(physeq), tax_table(physeq))
physeq_illum = phyloseq(otu_table(physeq_2), sample_data(physeq_2), tax_table(physeq_2))
physeq_species_pacbio = tax_glom(physeq_pacbio, taxrank="ta7")
all_tax_pacbio = paste(tax_table(physeq_species_pacbio)[,2], tax_table(physeq_species_pacbio)[,3], tax_table(physeq_species_pacbio)[,4], tax_table(physeq_species_pacbio)[,5], tax_table(physeq_species_pacbio)[,6], tax_table(physeq_species_pacbio)[,7], sep=';')
taxa_names(physeq_species_pacbio) = all_tax_pacbio
physeq_species_illum = tax_glom(physeq_illum, taxrank="ta7")
all_tax_illum = paste(tax_table(physeq_species_illum)[,2], tax_table(physeq_species_illum)[,3], tax_table(physeq_species_illum)[,4], tax_table(physeq_species_illum)[,5], tax_table(physeq_species_illum)[,6], tax_table(physeq_species_illum)[,7], sep=';')
taxa_names(physeq_species_illum) = all_tax_illum
merged_physeq = merge_phyloseq(physeq_species_illum, physeq_species_pacbio) #merge the two species-level objects (with their renamed taxa) into one
Rarefy together:
set.seed(711)
rarefied_merged_physeq<-rarefy_even_depth(merged_physeq, rngseed = FALSE, sample.size = min(sample_sums(merged_physeq)), trimOTUs = TRUE)
Convert to relative abundance together:
relabun_merged_physeq <- transform_sample_counts(merged_physeq, function(x) x*100 / sum(x))
Note that these next parts are optional and are for exploring the data together, but are not required for differential abundance analyses.
plot_richness(physeq = rarefied_merged_physeq, x = "Sequencing", color = "Sequencing", measures = c("Observed", "Shannon", "Fisher", "Simpson")) + geom_boxplot() + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
Note that here we added in the + geom_boxplot() part of this plot. This means that we now get a boxplot for each unique value of the x-axis (which we told to be the sequencing method with x = "Sequencing"!).
So you can see that even though we have rarefied all of our samples to the same depth, we still have more ASVs in our Illumina samples than in the PacBio samples.
Because Beta diversity is the between sample diversity and we don't have a phylogenetic tree here, it doesn't really make sense to do this at the ASV level because we wouldn't expect to have any shared taxa between the two different datasets (where our sequences are different lengths!). To compare them, we'll instead group the taxa at the species and then at the genus levels.
PERMANOVA test:
distance <- phyloseq::distance(relabun_merged_physeq, method="bray", weighted=F)
ads = adonis2(distance ~ sample_data(relabun_merged_physeq)$Sequencing + sample_data(relabun_merged_physeq)$Depth + sample_data(relabun_merged_physeq)$geographic_location_.latitude. + sample_data(relabun_merged_physeq)$geographic_location_.longitude., by="margin", parallel=2)
ads

So sequencing is still having the largest impact on the variation between samples, even though we grouped the taxa at the species level. This is likely because we have only 382/750 species-level classifications for the Illumina ASVs while we have 458/818 species-level classifications for the PacBio ASVs. This is not so bad at the genus level, where we have 584 genus-level classifications for Illumina and 691 for PacBio. Note that all of these numbers are before filtering out low abundance, but it's likely that we'd still have similar proportions!
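If you'd like to check these proportions for your own data, something along these lines should work - note that physeq and physeq_2 have already had low-abundance taxa filtered out here, so the counts won't exactly match the pre-filtering numbers above, and this assumes that missing ranks came through as NA from the separate() step:
sum(!is.na(tax_table(physeq_2)[, "ta7"])) #the number of Illumina taxa with a species-level classification
ntaxa(physeq_2) #the total number of Illumina taxa
sum(!is.na(tax_table(physeq)[, "ta7"])) #the number of PacBio taxa with a species-level classification
ntaxa(physeq) #the total number of PacBio taxa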
Plot PCoA:
ordination<-ordinate(relabun_merged_physeq, "PCoA", "bray")
plot_ordination(relabun_merged_physeq, ordination, color = "Sequencing") +
geom_point(size=10, alpha=0.1) + geom_point(size=5) + stat_ellipse(type = "t", linetype = 2) + theme(text = element_text(size = 20)) + ggtitle("Beta Diversity", subtitle = "Bray-Curtis dissimilarity")

Given how different the samples are based on the Sequencing variable, it's not very surprising that all of the samples group separately!
Now let's try this at the genus level.
Group at the genus level:
ps.genus = tax_glom(relabun_merged_physeq, taxrank="ta6", NArm=FALSE)
PERMANOVA test:
distance <- phyloseq::distance(ps.genus, method="bray", weighted=F)
ads = adonis2(distance ~ sample_data(ps.genus)$Sequencing + sample_data(ps.genus)$Depth + sample_data(ps.genus)$geographic_location_.latitude. + sample_data(ps.genus)$geographic_location_.longitude., by="margin", parallel=2)
ads
Interestingly, it actually seems to make very little difference here that we've grouped at the genus rather than the species level!
Plot PCoA:
ordination<-ordinate(ps.genus, "PCoA", "bray")
plot_ordination(ps.genus, ordination, color = "Sequencing") +
geom_point(size=10, alpha=0.1) + geom_point(size=5) + stat_ellipse(type = "t", linetype = 2) + theme(text = element_text(size = 20)) + ggtitle("Beta Diversity", subtitle = "Bray-Curtis dissimilarity")
First, we'll group the Phyloseq objects at a few different taxonomic ranks.
Genus level:
ps.genus = tax_glom(relabun_merged_physeq, taxrank="ta6", NArm=FALSE)
ps.genus_top<-prune_taxa(names(sort(taxa_sums(ps.genus),decreasing=TRUE)[1:50]), ps.genus)
Family level:
ps.family = tax_glom(relabun_merged_physeq, taxrank="ta5", NArm=FALSE)
ps.family_top<-prune_taxa(names(sort(taxa_sums(ps.family),decreasing=TRUE)[1:50]), ps.family)
Phylum level:
ps.phylum = tax_glom(relabun_merged_physeq, taxrank="ta2", NArm=FALSE)
ps.phylum_top<-prune_taxa(names(sort(taxa_sums(ps.phylum),decreasing=TRUE)[1:50]), ps.phylum)
And make the heatmaps at each of these levels.
Genus:
plot_heatmap(ps.genus_top, method = "PCoA", distance = "bray", low = "white", high = "red", na.value = "grey", sample.label="Sequencing", taxa.label="ta6")
Family:
plot_heatmap(ps.family_top, method = "PCoA", distance = "bray", low = "white", high = "red", na.value = "grey", sample.label="Sequencing", taxa.label="ta5")
Phylum:
plot_heatmap(ps.phylum_top, method = "PCoA", distance = "bray", low = "white", high = "red", na.value = "grey", sample.label="Sequencing", taxa.label="ta2")
Note that if you merged the two Phyloseq objects then you may want to rename the merged object to physeq to match the commands below, or you can change the commands.
First, we'll get a table at the genus level and add names to these that include all taxonomic ranks:
physeq_genus = tax_glom(physeq, taxrank="ta6")
all_tax = paste(tax_table(physeq_genus)[,2], tax_table(physeq_genus)[,3], tax_table(physeq_genus)[,4], tax_table(physeq_genus)[,5], tax_table(physeq_genus)[,6], sep=';')
taxa_names(physeq_genus) = all_tax
otu_table(physeq_genus)
Here, we are really just checking that our ASVs (or species) have been renamed at the genus level:
Add the sample sums to the metadata (for MaAsLin3 - this is a variable that it takes into account):
sample_data(physeq_genus)$sample_sums <- sample_sums(physeq_genus)
Then, I wanted to make some new sample groups here when I was using only the PacBio tutorial data - I had a lot of different values within my categories, so I wanted to group these together, but this part is entirely optional:
sample_data(physeq_genus)$depth_group <- c('below_100', 'below_100', 'below_100', 'below_100', 'below_100', 'below_100', 'below_100', 'above_100', 'below_100', 'below_100', 'below_100', 'below_100', 'above_100', 'above_100')
sample_data(physeq_genus)$longitude_group <- c('above_10', 'below_10', 'above_10', 'above_10', 'above_10', 'above_10', 'above_10', 'above_10', 'below_10', 'below_10', 'below_10', 'below_10', 'below_10', 'above_10')
If you do decide to do something like this, it's important to think about what is biologically relevant, and whether your groupings make sense. For example, with my depth grouping, it's likely that there's some light above 100m but little light below.
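If your grouping is based on a numeric column, you can also build it programmatically rather than typing the values out by hand - a minimal sketch, assuming that Depth is numeric and in metres (check the resulting groups against your metadata!):
sample_data(physeq_genus)$depth_group <- ifelse(sample_data(physeq_genus)$Depth > 100, 'below_100', 'above_100') #samples deeper than 100 m go into 'below_100', the rest into 'above_100'
table(sample_data(physeq_genus)$depth_group) #check how many samples ended up in each group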
Now we'll create a rarefied genus table:
set.seed(345)
physeq_rare = rarefy_even_depth(physeq_genus, sample.size = min(sample_sums(physeq_genus)), replace = TRUE, trimOTUs = TRUE, verbose = TRUE)
#note that we did already do this, but we did it at the ASV level!
And we'll save the feature table and metadata table (some of the DA tools require these not inside the phyloseq object):
feat_table = data.frame(t(otu_table(physeq_rare)), check.rows=T, check.names=F, stringsAsFactors=T)
metadata = data.frame(sample_data(physeq_rare), stringsAsFactors = F)
Next, we'll run ALDEx2. Note that here the first step is to normalise the samples using a CLR normalisation, so you can see how both of the methods require similar steps - normalisation and then testing - but this is done in a slightly different way for each. We then use a Kruskal-Wallis test and a GLM ANOVA on the data, and we can do this test with multiple cores (useMC=2):
x <- aldex.clr(otu_table(physeq_genus), sample_data(physeq_genus)$Sequencing, mc.samples = 128, verbose=F, denom="all")
kw.test <- aldex.kw(x, useMC=2, verbose=FALSE)
write.csv(kw.test, paste(folder, "ALDEx2.csv", sep=""))
We could repeat this with multiple variables if we wanted, but it can be a little difficult to integrate multiple variables into the same test with ALDEx2.
If you take a look at these results files then you'll see that we have 4 columns: kw.ep, kw.eBH, glm.ep, glm.eBH - the p-values and Benjamini-Hochberg (BH)-corrected p-values for the Kruskal-Wallis and GLM ANOVA tests between the sample groups. There are several other options for testing within the ALDEx2 R package (including aldex.ttest and aldex.glm), but in this case, we can use the GLM ANOVA results for parametric tests and the Kruskal-Wallis results for non-parametric tests. We could also use aldex.effect to see the magnitude of the differences between groups.
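If you do want to look at those effect sizes, a minimal sketch is below - for a two-group comparison like Sequencing, aldex.effect() can be run on the same CLR object that we made above:
effect <- aldex.effect(x, verbose=FALSE) #estimate effect sizes from the CLR object
head(effect) #diff.btw is the median between-group difference and effect is the standardised effect size
write.csv(effect, paste(folder, "ALDEx2_effect.csv", sep=""))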
Note: parametric tests make assumptions about the distribution of the population that the samples are taken from (typically that it is normally distributed), while non-parametric tests are distribution free and can be used for non-normal variables. Microbiome data are not typically normally distributed, but we can test this. You don't need to run this next part, but you can do so if you like (if you have the microbiome R package installed - the AWS image does not, because microbiome needs a different R version than some of the other packages!):
shapiro.test(as.data.frame(otu_table(physeq_genus))$"M19_1_full")
physeq_genus_clr = microbiome::transform(physeq_genus, "clr")
shapiro.test(as.data.frame(otu_table(physeq_genus_clr))$"M19_1_full")
Transformations/normalisations of microbiome data are often used to try to make the usually non-normal microbiome data normal; however, in this case, even the transformed data are not normal (the significant p-value means that the null hypothesis that the data are normally distributed can be rejected). This means that we should use the Kruskal-Wallis test results.
Run MaAsLin3:
fit_out <- maaslin3(input_data = feat_table,
input_metadata = metadata,
output = 'maaslin3',
formula = '~ Sequencing + sample_sums',
normalization = 'TSS',
transform = 'LOG',
augment = TRUE,
standardize = TRUE,
max_significance = 0.1,
median_comparison_abundance = TRUE,
median_comparison_prevalence = FALSE,
max_pngs = 100,
cores = 1,
save_models = TRUE)
If you have used another dataset, you can run this step more than once with each of the variables! Or you can run both together by including them in the formula as ~ variable1 + variable2. If you combined the Phyloseq objects then you may like to use Sequencing here!
Take a look at ?maaslin3 to see what each of the inputs to this function is!
MaAsLin will save the results into a folder, so you can take a look in this folder at the all_results.tsv and significant_results.tsv files. You'll see that significant_results.tsv is just a subsection of all_results.tsv - have a look at one of these files. It also makes some nice plots of the significant results for you - have a look through these and see if you can work out what they're showing you.
There are a lot of columns in the output; some of the ones we're interested in are:
- feature: the genus
- metadata: the metadata variable being tested
- value: for categorical features, the specific feature level for which the coefficient and significance of association is being reported
- coef: the model effect size/coefficient value
- stderr: the standard error from the model
- pval_individual, qval_individual, pval_joint, qval_joint, error, model, N and N_not_zero
Now we can run radEmu:
physeq_genus_no_zero <- prune_taxa(taxa_sums(physeq_genus)>0,physeq_genus) #this first step just removes any genera with zero abundance across all samples that would cause issues for radEmu!
radEmu_fit <- emuFit(formula = ~ Sequencing,
Y = physeq_genus_no_zero, run_score_tests=FALSE)
write.csv(radEmu_fit$coef, paste(folder, "radEmu.csv", sep=""))
If you are mainly interested in a single variable, then you can set the run_score_tests argument to TRUE, but it might take quite a long time to run:
radEmu_fit <- emuFit(formula = ~ Sequencing,
Y = physeq_genus_no_zero, run_score_tests=TRUE)
In order to compare the results that we've got from the three differential abundance tools, we'll want to first import the results and do some manipulations on these tables.
MaAsLin3:
maaslin = read.csv(paste(folder, "maaslin3/significant_results.tsv", sep=""), sep="\t")
maaslin = maaslin[maaslin$metadata == 'Sequencing', ] #we only want to look at Sequencing for now, as that is what we ran with the other tests
maaslin = maaslin[maaslin$model == 'abundance', ] #Maaslin3 compares both abundance and prevalence, but as it is the only one to do prevalence, we will just look at abundance for now
rownames(maaslin) = maaslin[,1]
Now we'll do the same for ALDEx2:
aldex = read.csv(paste(folder, "ALDEx2.csv", sep=""))
aldex = aldex[aldex$kw.eBH <= 0.1, ] #keep only taxa with Benjamini-Hochberg (BH)-adjusted p<=0.1 (the same as MaAsLin3 default)
rownames(aldex) = aldex[,1]
And for radEmu:
rademu = read.csv(paste(folder, "radEmu.csv", sep=""))
rademu = rademu[rademu$pval <= 0.1, ] #keep only taxa with p<=0.1 (the same as MaAsLin3 default)
rownames(rademu) = rademu[,3]
We often consider taxa that are found to be significantly differentially abundant by at least two tests as actually being differentially abundant. With this dataset, a lot of taxa are found to be differentially abundant, so I'm actually going to take only the taxa that are found by all three tests to narrow this down as much as possible.
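If you'd rather apply the "at least two tests" rule to your own data, here is a minimal sketch that counts how many of the three significant-results tables each taxon appears in (this assumes the taxon names match across the three tables, as the intersections below do):
all_hits <- c(rownames(maaslin), rownames(aldex), rownames(rademu)) #pool the significant taxa from all three tools
hit_counts <- table(all_hits) #count how many tools found each taxon
taxa_two_or_more <- names(hit_counts[hit_counts >= 2]) #keep taxa found by at least two tools
length(taxa_two_or_more)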
Get the taxa found in all tables of significant results:
taxa = intersect(rownames(maaslin), rownames(aldex))
taxa = intersect(taxa, rownames(rademu))
We still have a lot here! This isn't necessary, but we'll actually narrow this down even further and only look at those within the 20 most abundant genera.
Get just the top 20:
physeq_genus_relabun = transform_sample_counts(physeq_genus, function(x) (x / sum(x))*100 )
physeq_genus_relabun_top20<-prune_taxa(names(sort(taxa_sums(physeq_genus_relabun),decreasing=TRUE)[1:20]), physeq_genus_relabun)
Get the DA taxa that are within the top 20:
taxa = intersect(taxa, rownames(otu_table(physeq_genus_relabun_top20)))
length(taxa)
physeq_genus_relabun_da<-prune_taxa(taxa, physeq_genus_relabun_top20)
And rename them so that the names will display a little better (add line breaks):
new_names = c()
for (i in 1:length(rownames(otu_table(physeq_genus_relabun_da)))) {
new_name = str_replace(rownames(otu_table(physeq_genus_relabun_da))[i], ';o__', '\no__')
new_names = c(new_names, str_replace(new_name, ';g__', '\ng__'))
}
taxa_names(physeq_genus_relabun_da) = new_names
Finally, we'll plot them!
psmelt(physeq_genus_relabun_da) %>%
ggplot(data = ., aes(x = Sequencing, y = Abundance)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(aes(color = OTU), height = 0, width = .2) +
labs(x = "", y = "Abundance\n") +
facet_wrap(~ OTU, scales = "free") +
theme(legend.position = "none")
Note: you'll probably want to open this up in a bigger window so that you can see all of the taxa properly!
The really nice thing about R notebooks is that you can save them as an HTML document so that anyone can see your figures as well as the code that you've used to generate them. When you'd like to do this, you can simply click on the dropdown menu next to the Preview button and click "Knit to HTML". This will take a little while because it needs to re-run all of the code, but it should pop up with an HTML document when it's done!
- Metagenomics taxonomic data
- Functional data (PICRUSt2 predictions or metagenomic annotations)
- Stratified data
- Fundamentals of adding plots together (e.g. ordering by taxa so that a heatmap will appear in the same order as the tree and can be linked?)
- Taxonomic contributions to function of interest
- Longitudinal analysis
- Combining taxonomic/functional visualisations with JarrVis
- ANCOM?