Microbiome Helper Analysis workflow in R ‐ for experiential learning - LangilleLab/microbiome_helper GitHub Wiki

Authors: Robyn Wright
Modifications by: NA
Based on initial versions by: Robyn Wright and Ben Fisher for previous workshops

Please note: We think that everything here should work, but we are still testing/developing this so use with caution :)

This page is mainly a copy of this page, but designed to work with an older version of R and some data that is generated in the lab for the 2026 experiential learning students.

Introduction

Here we are aiming to give some examples for a typical analysis workflow that we would carry out. Often, members of the Langille lab will use several different methods for analysing and plotting data, so this is not completely comprehensive, but should offer some examples of the types of analyses that can be performed and is hopefully a good starting point for many people.

The initial importing steps may differ depending on what kind of data you are looking at, but many of the analyses will be similar regardless.

We've chosen to use R/RStudio for this as it's what we think there are the most available packages for, and these tend to work quite smoothly once you have your data imported. Below is a fairly full workflow, but we'll be adding pages to separate this out into its different components soon.

Getting started in RStudio

You can choose to run RStudio locally on your own computer, or on something like RStudio server. If you are using an AWS server then this is already installed, and it is already installed on many servers.

Kronos/Langille lab server

If you are using the Langille lab Kronos server then you can log in to RStudio on there by opening up a new terminal window and typing in:

ssh -N -L 8787:localhost:8787 [email protected]

Type in your password when prompted, but you won't see it log into Kronos and come up with the command prompt like normal. Next go to this url in your browser: http://localhost:8787/

Type in the username and password that you have for kronos (i.e. your name and the same password that you just used to log in from the terminal. It might take a little while to load for the first time, but then you should see RStudio come up.

The first thing that you should do is run these two lines of code:

.libPaths(c("/usr/lib/R/site-library", .libPaths()))
options(bitmapType='cairo')

This tells R the location of the packages that are already installed and how it is going to make plots.

Required packages

At some point we will add a page that details how to install each of these packages, but for now, you can usually find this information by googling "install PACKAGE_NAME R".

You could choose to run these things in the RStudio Console, but what I like to do is use an RStudio notebook - this allows me to keep track of all of the code that I have run, and produce an output at the end. If you'd like to use an RStudio notebook, then simply go to File > New File > R Notebook. Read through the brief instructions and try running it! We have some more information on R notebooks here.

After you've familiarised yourself a little, you can choose a new title at the top, and then delete the rest. I would then make a new "Chunk" for each of the code chunks that I have below (click on the green C and choose "R"). You can make notes outside of these chunks if you like.

If you use the R notebook and make chunks, you'll also need to run the chunks! You can do that by clicking the green play button on the top right hand corner of each chunk, or by holding Command+Shift+Enter. You can run a single line by pressing Command+Enter.

If you are running the code straight in the R console then you will only need to press Enter to run it, the same as you will have done in the Terminal.

I've added some drop downs titled Tutorial results - if you're using the tutorial data (assuming that you've started with the PacBio data), then you can see what the output is expected to look like when it's run in RStudio.

Load required packages and setup

Load the packages:

library(phyloseq)
library(Maaslin2)
library(ALDEx2)
library(ggplot2)
library(tidyr)
library(stringr)
library(vegan)
library(stats)
library(randomcoloR)

You'll notice that in each of these, we use the library() command, and inside the brackets we give the name of the package. We've already installed these packages for you, but in most cases these could be installed by e.g., install.packages("tidyr"). In some cases, they might be installed using another package, BiocManager, e.g., BiocManager::install("ALDEx2"). If you're unsure, you can always google how to install a package and you can usually find the code to copy and paste, e.g., "r install aldex2". You could also try typing ??aldex2 into the console and see what comes up. It should give you some information about that package.

Set the name of the folder where our results are saved:

folder = '/home/robyn/experiential_poopmobile/denoised_output_exported/'

Explaining variables, strings, etc.

Here, we have defined a variable called folder - we always need to be careful that we don't save over these variables by calling something else the same thing, but you can always check these in the "Environment" area in the top right where you should see that folder now exists. We could have called this anything, but it's always helpful to name our variables something that is simple and explains what they are, incase someone else is using our code. This is a string, as it is inside the punctuation marks ' ' or " ", and this is used in programming to show that we are representing text rather than numbers. These variables can be really useful to avoid having to type in the same information multiple times, which always increases the chances that we'll make a mistake!

You can see how these would be different by running the following:

random_string = "12345"
random_number = 12345

You should see how these are different by how they show in the "Environment" area in the top right.

Now that we've explained the basics, we can read in some of our data.

Importing data

First, we'll read in the feature table. Note that we're calling it asv_table:

asv_table <- read.csv(paste(folder, "feature-table_w_tax.txt", sep=""), sep='\t', skip=1, header=T, check.names = F)

Here we're combining a few different things. We're using the read.csv function - functions are modules of code that accomplish a specific task. We can write them ourselves, they are what is contained in the packages that we imported at the start, or there are several that are built in to R. If you want more information about what a function does as well as the input that it expects and the output that it gives, you can type in ?read.csv.

We're also using the paste function to add together the full file path of the feature table. We're giving it the folder variable that we already defined, as well as the additional folder and file name (both as positional arguments), and finally the sep="" named argument - this tells the paste function that there should be no spaces or other punctuation between the two parts that are being pasted together. Try running just this part and see what the result is: paste(folder, "exports_filtered/feature-table.txt", sep="") - note that as we haven't given this a name, it is not saved as a variable.

Next, we've given the sep='\t' variable - this time, we're telling the read.csv function that the file that we're importing is tab-delimited, then we've told it to skip the first line of the file skip=1, and that there is a header header=T (where T is the same as TRUE). We've skipped the first line because this just reads "# Constructed from biom file" and is not actually data. The check_names=F ensures that R won't do things like removing spaces and replacing them with . in your sample names.

Just take a look at asv_table using View(asv_table) before continuing so you can see what the data looks like.

Tutorial results

Note that this is just a small part of the table - you should be able to scroll and see all of the samples.

We often find in programming that things are not quite in the format that is expected by the packages that we're using, so we'll make a few modifications to asv_table:

asv_table_num = data.matrix(asv_table[,2:28]) #convert the ASV table to a numeric matrix and take the columns with samples (not the taxonomy column)
rownames(asv_table_num) = asv_table[,1] #give the matrix row names
ASV = otu_table(asv_table_num, taxa_are_rows = TRUE) #convert asv_table_num to an otu_table

You can see here that I've added # after the code and have written what each line does. You can use # to make comments through your code documents, to explain what each line is doing. Here you should also notice that we didn't save over the previous asv_table, but created a new one called asv_table_num. Now, we've just told R that it should be expecting numberic data within this table, and then we gave it the same row names as our previous asv_table object and at the end, we converted it to what Phyloseq calls an otu_table (it doesn't matter if we have OTUs, ASVs or something else in this table).

Tutorial results
The top right hand corner should probably look something like this:

And if you use View(ASV) to look at the otu_table, then you should have something like this:


Again, we're just showing a small section of the table!

Get the taxonomy information

You may have noticed that the last column in the initial asv_table was our taxonomy information. We'll get that first:

taxonomy <- data.frame(asv_table[,"taxonomy"])
rownames(taxonomy) <- asv_table[,1]
colnames(taxonomy) = c("taxonomy")

If you print this out by typing taxonomy or by clicking on this in the "Environment" area, you'll see that we currently only have a single column that contains all of the taxonomy information.

Tutorial results

We want to split this so that we have columns for each of Domain, Phylum, Class, Order, Family, Genus and Species. Luckily, there is already a function that we can use for this:

taxonomy_split <- separate(data = taxonomy, col = taxonomy, into = c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = "\\; ") #separate the taxonomy table so each phylogenetic level is its own column. Note that if you are using 18S then you might need to separate this differently!
TAX = tax_table(taxonomy_split) #convert taxonomy_split to a tax_table
taxa_names(TAX) <- rownames(taxonomy_split) #add names to TAX/the tax_table

This will print out some things in red, but it is fine!

You'll see that we tell the separate function that the data it should use is in the taxonomy table, the col (column) is called "taxonomy". The "into" named argument takes a list as input, which is defined in R with the c(), and the taxonomy column should be split into a new column each time the ; symbol is found. You'll see a warning after running this line, but it still ran so that's fine - it's just telling us that there weren't values for all of Domain/Phylum/Class/Order/Family/Genus/Species for every ASV, which makes sense as not all of them will have a species-level classification.

And then afterwards, we convert this into a Phyloseq tax_table.

Again, you can take a look at the results (or any of the intermediates) by using something like View(TAX).

Tutorial results

Read in the metadata

Now we'll read in the metadata. I won't go through each stage step-by-step as hopefully you're getting the idea by now, but I've still added comments to each of the rows:

metadata <- read.csv(paste(folder, "metadata.txt", sep=""), sep='\t')
samples = data.frame(metadata, stringsAsFactors = FALSE) #convert this to a data frame
rownames(samples) = metadata[,1] #add the sample names as row names
SAMPLE = sample_data(samples) #convert samples to a sample_data
Tutorial results

View(SAMPLE):

Read in the phylogenetic tree

Read it in as phy_tree:

phy_tree <- read_tree(paste(folder, "tree.nwk", sep=""))

Combine these into a phyloseq object

Phyloseq is a really useful package that contains many useful functions for analysing microbiome data. While it can be a bit fiddly to get our data into the format that it is expecting (most good packages should give you examples of how data should be formatted to go into them, although it does take a lot of practice to get good at quickly working out how this is different from what you have!), once that it is in this format, the analyses are then very easy to perform.

physeq = phyloseq(ASV, TAX, phy_tree, SAMPLE) #combine these all to make a phyloseq object
physeq #print this out to see what a phyloseq object looks like
Tutorial results

Now, we have combined all of these different parts into one phyloseq object. You can also get each of the individual objects back after performing manipulations, e.g., otu_table(physeq).

This can be a bit fiddly getting everything into Phyloseq, but the nice thing is that once it is in, there are lots of things that we can run directly on our Phyloseq object and we at least know that all of our sample names, taxa names, etc match up with eachother!

Look at rarefaction curves and remove low abundance taxa and samples

Now we can take a look at how many reads have been classified within our samples:

sample_sums(physeq)
print(c("Minimum sample size:",min(sample_sums(physeq)), "Maximum sample size:", max(sample_sums(physeq))))
Tutorial results

It's also always good to have a look at a rarefaction curve. This tells us about whether we've sequenced our samples sufficiently to capture all of the diversity within them:

rarecurve(as.data.frame(t(otu_table(physeq))), step=50,cex=0.5,label=TRUE)
Tutorial results

If you're not familiar with rarefaction curves, or species accumulation curves, then each line represents a single sample. You should see that the maximum x value is the number of reads in that sample, and the maximum y value is the number of ASVs. Each point along the curve represents the number of ASVs that are found if we randomly sub-sampled the sample to only have X number of reads. Ideally, these will all plateau, which would suggest that our sequencing depth is sufficient and we wouldn't expect to identify any more ASVs even if we'd sequenced more.

Note that if you'd like to zoom in on a particular area of the graph then you can simply set x/y limits using e.g. xlim=c(1,5000):

rarecurve(as.data.frame(t(otu_table(physeq))), step=50,cex=0.5,label=TRUE, xlim=c(1,5000))
Tutorial results

Let's also take a look at the number of reads assigned to each taxon within our samples:

length(taxa_sums(physeq))
print(c("Minimum reads:",min(taxa_sums(physeq)), "Maximum reads:", max(taxa_sums(physeq))))
taxa_sums(physeq)
Tutorial results


Note that this is only the first part of the print-out! You will be able to see the sums for all 817 of the ASVs.

We're usually going to want to remove some of these very low abundance taxa - we can't be sure whether they were sequencing errors, bleed-through from previous sequencing runs, or just very low abundance taxa. I've chose a threshold of above 100 total reads to keep a taxon, but you can choose something different depending on what makes sense to you.

physeq <- prune_taxa(taxa_sums(physeq)>=100,physeq)
physeq

If you look at the physeq object again, you should see that the number of taxa has been reduced.

Tutorial results

Take another look at the rarefaction curve:

rarecurve(as.data.frame(t(otu_table(physeq))), step=50,cex=0.5,label=TRUE)
Tutorial results

Now we're going to want to choose a number of reads where it looks like all samples have plateaued - note that any samples with fewer reads than this will be removed from further analyses. For me, this was around 1000.

So first, we'll remove samples with below 1000 reads:

physeq <- prune_samples(sample_sums(physeq)>=1000, physeq)

Note that we shouldn't have any samples with <1000 reads, so nothing should change here!

Take a look at the rarefaction curves again to double-check that you're OK with this:

rarecurve(as.data.frame(t(otu_table(physeq))), step=50,cex=0.5,label=TRUE, xlim=c(0,5000))
Tutorial results

And then rarefy all samples to the minimum remaining read depth (you could choose the same value for this - 1000 in my case - but ideally you want this to be as high as possible so that you aren't unnecessarily removing reads):

set.seed(711)
rarefied_physeq<-rarefy_even_depth(physeq, rngseed = FALSE, sample.size = min(sample_sums(physeq)), trimOTUs = TRUE)

More red writing that is fine :)

See that the samples are all have the same number of reads now:

sample_sums(rarefied_physeq)
Tutorial results

And then we'll also normalise by converting to relative abundance.:

relabun_physeq <- transform_sample_counts(physeq, function(x) x*100 / sum(x))

Note that we are only going to use this rarefied object for the alpha diversity analyses, but we have excluded the low read depth samples from all further analyses!

Alpha and beta diversity

Alpha diversity refers to within sample diversity, and there are three main components of it:

  • Richness: the number of different taxa present
  • Evenness: whether the taxa are present in similar (even) abundances
  • Phylogenetic diversity: we are also often interested in whether the taxa in our samples are closely related or not, and Phylogenetic Diversity is used as a representation of this. It is calculated by summing the branch lengths of a phylogenetic tree that connects all taxa within a sample, so a high value means that the taxa are phylogenetically more diverse

Several of the most commonly used Alpha diversity metrics are:

  • Observed taxa: simply a count of the number of different taxa in each sample
  • Chao1 Richness: this is an estimate of the total number of taxa within your community based on your sample
  • Shannon Index (or Shannon-Weiner Index): considers both richness and evenness and is calculated by summing the proportion of each species multiplied by the natural logarithm of that proportion
  • Simpson's Index: this is a measure of the probability that two randomly selected taxa from a community will be the same. With only one taxa, the probability that both will be the same is 1. The inverse of this is what is most frequently used, where 0 represents low diversity and 1 represents high diversity
  • Faith's Phylogenetic Diversity: as described above
  • Fisher's Alpha: this also considers both richness and evenness and is often used when the taxa abundance follows a log-series distribution

Beta diversity refers to between sample diversity and there are again three main components of it:

  • Whether the taxa in the samples are the same
  • Whether the taxa are present at similar abundances in the different samples
  • Whether the taxa are closely related to one another

There are a lot of different beta diversity metrics used, but some of the most widely used for microbiome analyses are:

  • Bray-Curtis Dissimilarity Index: takes into account the similarity of the taxa between two samples as well as the abundances of the taxa within these samples. It ranges between 0 and 1, where 0 would indicate that the two samples are identical and 1 that there are no shared taxa between them
  • Weighted/Unweighted UniFrac distance: measures the difference between communities by considering the branch lengths of a phylogenetic tree that are unique to each community. Weighted UniFrac additionally considers the abundance of the taxa within the two communities/samples, where Unweighted UniFrac doesn't
  • Aitchison distance: a method for compositional data analysis that is the Euclidean distance calculated on Centered-Log Ratio (CLR) transformed data. Converting microbiome data to CLR required the addition of a pseudocount (usually 1) so that logs can be taken; the robust version of this is also often used and this removes zeroes prior to log-ratio calculations and adds them back afterwards
  • Phylo-RPCA (Robust Principal Component Analysis): another method for compositional data analysis that also incorporates phylogeny

Many beta diversity metrics account for the compositionality of microbiome data. Data compositionality refers to the fact that the abundance of all taxa are dependent on the abundances of all other taxa within that sample because the number of reads (our sequencing effort) for each sample is finite. You can read more on this concept here.

Now that you hopefully understand alpha and beta diversity... we'll get to looking at them.

First, we'll use a phyloseq plot function to look at alpha diversity within our samples:

plot_richness(physeq = rarefied_physeq, color = "Sample_type", measures = c("Observed", "Shannon", "Simpson")) + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

This probably doesn't look very nice because we have too many samples for the x axis, so let's make a boxplot instead:

plot_richness(rarefied_physeq, x = "Sample_type", color = "Sample_type", measures = c("Observed", "Shannon", "Simpson")) + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) + geom_boxplot()

Note that if this is difficult to see, try clicking "Show in New Window" (the left of these three icons that will show above your plot!):

Tutorial results

You'll see that I've coloured them by Sample_type, which is our main metadata category. If you'd like to add something else, try using the shape = "" option.

Now we'll calculate the Bray-Curtis dissimilarity between our samples and perform a PERMANOVA (ADONIS2) test to see whether they're significantly different from one another based on any of our metadata categories:

distance <- phyloseq::distance(relabun_physeq, method="bray", weighted=F)
ads = adonis2(distance ~ sample_data(relabun_physeq)$Sample_type + sample_data(relabun_physeq)$Age + sample_data(relabun_physeq)$sex_at_birth, by="margin", parallel=2)
ads

I've used a few different metadata categories here and have used the + to look at them all together. The two key values we're interested in here are the R2 and the p-value (PR>F). The R2 tells us how much of the variation between our samples is accounted for by this category, and p tells us whether it's statistically significant or not. The residual R2 value tells you how much variation within your samples is unaccounted for by the metadata categories that you're looking at. You could use * instead of + if you wanted to look at the interaction between any of your variables.

Tutorial results

This might take a while if you have a lot of samples!

Now we'll do the same but with Weighted UniFrac distance:

distance <- phyloseq::distance(relabun_physeq, method="unifrac", weighted=T)
ads = adonis2(distance ~ sample_data(relabun_physeq)$Sample_type + sample_data(relabun_physeq)$Age + sample_data(relabun_physeq)$sex_at_birth, by="margin", parallel=2)
ads
Tutorial results

If you compare these to the Bray-Curtis results, you should see that there aren't significant differences for Age or sex_at_birth with either, and that the R2 value for Sample_type is much higher for the Weighted UniFrac than for Bray-Curtis. This suggests that some of the abundance differences we're seeing are from less closely related related organisms, which is why the R2 is higher when we're taking phylogeny into account.

There might not be many differences with age and sex because we have quite a small sample size, or not much diversity in these variables here!

You can also see what happens with Unweighted UniFrac by changing weighted=T to weighted=F in the UniFrac command.

Now we'll plot the samples on PCoA plots. PCoA plots attempt to represent the variation between our samples in fewer dimensions, and we can interpret them as that points that are closer together indicate that samples are similar, and points that are further apart indicate that samples are less similar. We'll also often add an ellipse to show each group. The percentages on the axes represent the amount of variation between our samples that is represented on that axes.

Plot Bray-Curtis dissimilarity:

ordination<-ordinate(relabun_physeq, "PCoA", "bray")
plot_ordination(relabun_physeq, ordination, color = "Sample_type") +
  geom_point(size=10, alpha=0.1) + geom_point(size=5) + stat_ellipse(type = "t", linetype = 2) + theme(text = element_text(size = 20)) +  ggtitle("Beta Diversity", subtitle = "Bray-Curtis dissimilarity")

You'll notice that I decided to colour my points by Sample_type here, as that was what the PERMANOVA/ADONIS2 test showed had the largest impact on variation between my samples, but you can choose any of your metadata categories here, and again, you could also used the shape = "" parameter if you have more than one variable of interest.

Tutorial results
Screenshot 2026-03-17 at 10 37 01 am

Plot Weighted UniFrac distance:

ordination<-ordinate(relabun_physeq, "PCoA", "wunifrac")
plot_ordination(relabun_physeq, ordination, color = "Sample_type") +
  geom_point(size=10, alpha=0.1) + geom_point(size=5) + stat_ellipse(type = "t", linetype = 2) + theme(text = element_text(size = 20)) +  ggtitle("Beta Diversity", subtitle = "Weighted UniFrac distance")
Tutorial results
Screenshot 2026-03-17 at 10 37 43 am

You can see some good separation between samples according to Sample_type with both of the beta diversity metrics, but more variation is accounted for by the axes in the Weighted UniFrac than Bray-Curtis.

Stacked bar plots

A useful way to visualise the composition of our samples is through bar charts of taxonomic relative abundances. Because each bar will be a separate colour, it's not usually very useful to try to show too many taxa in one plot, so we'll often take just the top X most abundant taxa (in this case the top 20), or only those with >Y% relative abundance.

Plot the top 20 most abundant ASVs:

relabun_physeq_top<-prune_taxa(names(sort(taxa_sums(relabun_physeq),decreasing=TRUE)[1:20]), relabun_physeq)

plot_bar(relabun_physeq_top, fill="ta7") + facet_wrap(c(~Sample_type), scales="free_x", nrow=1) + theme(legend.text=element_text(size=5), legend.key.size = unit(0.3, "cm")) + guides(fill=guide_legend(ncol=1))

You should see here that we're first of all creating a new phyloseq object with only the top 20 most abundant taxa, and then we're plotting them. We're also grouping our samples by both longitude and depth (again, change these to whatever makes sense for you!). Hopefully you'll start to see some patterns here...

Tutorial results

Note that you'll see some of the bars being the same colours - we're plotting ASVs but colouring them by their species name, so this just means that multiple ASVs are the same species!

Now we'll try doing this at the phylum level instead:

#if you take a look at the taxonomy table within our phyloseq object, you should see that the phylum level is now called "ta2"
tax_table(relabun_physeq)

#so ta2 is what we'll use here to group everything at that rank
ps.phylum = tax_glom(relabun_physeq, taxrank="ta2", NArm=FALSE)

#have a look at this
print(ps.phylum)

#you can see that now there's only ten taxa, so we'll get a colour palette that has 10 unique colours
palette = distinctColorPalette(12)

#now we'll make the bar plot
plot_bar(ps.phylum, fill="ta2") + facet_wrap(c(~Sample_type), scales="free_x", nrow=1) + theme(legend.text=element_text(size=5), legend.key.size = unit(0.3, "cm")) + guides(fill=guide_legend(ncol=1)) + scale_fill_manual(values=palette)
#you can see here that first we're plotting the barplot, and then we're adding some other options - we're telling it how to make the legend, and also which colours to use for plotting

Hopefully you can follow the comments on each line to see what they're doing!

If you don't like the colours, just re-run the last two lines of code again and you'll get a different random set of colours.

Tutorial results

Now you should be able to start seeing some differences more easily! You should see that we have a lot more Bacteroidota in the gut, with more Firmicutes_D in the oral samples.

Now let's plot genus:

ps.genus = tax_glom(relabun_physeq, taxrank="ta6", NArm=FALSE)
ps.genus_top<-prune_taxa(names(sort(taxa_sums(ps.genus),decreasing=TRUE)[1:20]), ps.genus)

#you can see that now there's only ten taxa, so we'll get a colour palette that has 10 unique colours
palette = distinctColorPalette(20)

#now we'll make the bar plot
plot_bar(ps.genus_top, fill="ta6") + facet_wrap(c(~Sample_type), scales="free_x", nrow=1) + theme(legend.text=element_text(size=5), legend.key.size = unit(0.3, "cm")) + guides(fill=guide_legend(ncol=1)) + scale_fill_manual(values=palette)
#you can see here that first we're plotting the barplot, and then we're adding some other options - we're telling it how to make the legend, and also which colours to use for plotting
Tutorial results
Screenshot 2026-03-17 at 10 42 47 am

You should see that we have a lot of Phocaeicola in the gut samples and Streptococcus and Veillonella in the oral samples.

Obviously this isn't a comprehensive list of everything that you could do, but hopefully it's becoming clear to you how you could change these for yourself to get a plot of what you're most interested in.

Heatmaps

While stacked bar plots can be great for getting an overview of our microbial communities, they're not very quantitative; it's often really difficult to determine whether a taxon is more abundant in one sample vs another, especially if it's not one of the most abundant. So heatmaps can be really helpful here!

We'll use the same phyloseq object as we created above to create a plot of the top 20 genera in our samples:

plot_heatmap(ps.genus_top, method = "PCoA", distance = "bray", low = "white", high = "red", na.value = "grey", sample.label="Sample_type", taxa.label="ta6")
Tutorial results

Feel free to change the colours for low and high, or the sample labels. Note that we're used PCoA and "bray" for grouping the samples, so they're being sorted according to this grouping. This is often really useful for seeing patterns within groups of samples!

Next, we'll use a different heatmap plotting function to show us the dendrogram showing sample similarity:

heatmap(otu_table(ps.genus_top), labRow = tax_table(ps.genus_top)[,"ta6"])

You'll see that this also adds a dendrogram to our taxa, but it's important to note that this is picking up on similarities in taxon abundance and we haven't added the phylogenetic tree to this plot!

Tutorial results

Phylogenetic tree plots

The final type of plot we'll go through is a phylogenetic tree plot with the abundance of taxa within our samples.

Again, we're plotting the top 20 genera:

plot_tree(ps.genus_top, nodelabf=nodeplotblank, color="Sample_type", label.tips = "ta6", ladderize="left", size="Abundance")

You can see from the legend that the abundance within samples is represented by the size of the point, and we've chosen to colour them based on longitude.

Tutorial results

These are probably quite difficult to see, though.

So what we often want to do is save a figure:

tree_fig <- plot_tree(ps.genus_top, nodelabf=nodeplotblank, color="Sample_type", label.tips = "ta6", ladderize="left", size="Abundance") #assign the plot to an object

png(filename=paste(folder, "tree_plot.png", sep=""), width=40, height=20, units="cm", res=300) #open a figure and tell it how big this should be
plot(tree_fig) #plot our tree object
dev.off() #and tell the plotting that we're finished
Tutorial results

This will be the output from the chunk: Screenshot 2026-03-17 at 10 46 34 am

And now if you go to find this in the Files explorer in the bottom right (tree_plot.png) and click on it to open it, it should look nicer!

Let's do the same for species now, but with the top 50 most abundant:

ps.species = tax_glom(relabun_physeq, taxrank="ta7", NArm=FALSE)
rank.sum = tapply(taxa_sums(ps.species), tax_table(ps.species)[, "ta7"], sum, na.rm=TRUE)
top50 = names(sort(rank.sum, TRUE))[1:50]

#and then we'll reduce the phyloseq object to only have these top50 taxa
ps.species = prune_taxa((tax_table(ps.species)[, "ta7"] %in% top50), ps.species)

tree_fig <- plot_tree(ps.species, nodelabf=nodeplotblank, color="geographic_location_.longitude.", label.tips = "ta7", ladderize="left", size="Abundance")

png(filename=paste(folder, "tree_plot_species.png", sep=""), width=40, height=40, units="cm", res=300)
plot(tree_fig)
dev.off()
#you'll see a few blank labels here, so you can always change "ta7" to "ta6" so that you plot the genus name for each of the species, rather than the species name, to see what you have
Tutorial results

You can modify this same code to save any of the other plots that we've made!

Differential abundance

Tens (if not more) of tools have been used for differential abundance testing of microbiome samples - identifying taxa that differ in abundance between sample groups (for example, treatment versus control) or with variables of interest (for example, correlations with pH or blood metabolite measurements). While these tools/methods have often been used interchangeably in the literature, there are large differences in the performance of different tools even on the same sample group. There have now been several large scale comparisons of different tools (e.g., Calgaro et al., Thorsen et al.), and our lab also carried one out (published here). While these studies have typically found that different methods lead to different biases in the results, as there is no one single best method for determining differentially abundant taxa, the best practice that we have for overcoming these issues currently is to use multiple differential abundance tests, report these clearly, and focus on the taxa that are identified by multiple tests.

We'll be using two different tools here: ALDEx and MaAsLin.

If you are using the tutorial data, we're going to do the comparisons between Gut and Oral samples.

Preprocessing

First, we'll get a table at the genus level and add names to these that include all taxonomic ranks:

physeq_genus = tax_glom(physeq, taxrank="ta6")
all_tax = paste(tax_table(physeq_genus)[,2], tax_table(physeq_genus)[,3], tax_table(physeq_genus)[,4], tax_table(physeq_genus)[,5], tax_table(physeq_genus)[,6], sep=';')
taxa_names(physeq_genus) = all_tax
otu_table(physeq_genus)

Now we'll create a rarefied genus table:

set.seed(345)
physeq_rare = rarefy_even_depth(physeq_genus, sample.size = min(sample_sums(physeq_genus)), replace = TRUE, trimOTUs = TRUE, verbose = TRUE)
#note that we did already do this, but we did it at the ASV level!

And we'll save the feature table and metadata table (some of the DA tools require these not inside the phyloseq object):

feat_table = data.frame(t(otu_table(physeq_rare)), check.rows=T, check.names=F, stringsAsFactors=T)
metadata = data.frame(sample_data(physeq_rare), stringsAsFactors = F)

ALDEx

Next, we'll run ALDEx2. Note that here the first step is to normalise the samples using a CLR normalisation, so you can see how both of the methods require similar steps - normalisation and then testing - but this is done in a slightly different way for each. We then use a Kruskal-Wallis test and a GLM ANOVA on the data, and we can do this test with multiple cores (useMC=2):

x <- aldex.clr(otu_table(physeq_genus), sample_data(physeq_genus)$Sample_type, mc.samples = 128, verbose=F, denom="all")
kw.test <- aldex.kw(x, useMC=2, verbose=FALSE)
write.csv(kw.test, paste(folder, "ALDEx2.csv", sep=""))

We could repeat this with multiple variables if we wanted, but it can be a little difficult to integrate multiple variables into the same test with ALDEx2.

If you take a look at these results files then you'll see that we have 4 columns: kw.ep, kw.eBH, glm.ep, glm.eBH - the p-values and Benjamini-Hochberg (BH)-corrected p-values for the Kruskal-Wallis and GLM ANOVA tests between the sample groups. There are several other options for testing within the ALDEx2 R package (including aldex.ttest and aldex.glm), but in this case, we can use the GLM ANOVA results for parametric tests and the Kruskal-Wallis results for non-parametric tests. We could also use aldex.effect to see the magnitude of the differences between groups.

Note: parametric tests make assumptions about the distribution of the population that the samples are taken from (typically that it is normally distributed), while non-parametric tests are distribution free and can be used for non-normal variables. Microbiome data are not typically normally distributed, but we can test this. You don't need to run this next part, but can do if you like (if you have the microbiome R package installed.

shapiro.test(as.data.frame(otu_table(physeq_genus))$"M19_1_full")
physeq_genus_clr = microbiome::transform(physeq_genus, "clr")
shapiro.test(as.data.frame(otu_table(physeq_genus_clr))$"M19_1_full")

Transformations/normalisations of microbiome data are often used to make the usually non-normal microbiome data normal, however, in this case, it is not normal (significant p-value so the null hypothesis that the data are normal can be rejected - they are not normal). This means that we should use the Kruskal-Wallis test results.

MaAsLin

Run MaAsLin2:

fit_out = Maaslin2(feat_table,
         metadata,
         paste(folder, "maaslin2_out", sep=""),
         fixed_effects = c("Sample_type"))

If you have used another dataset, you can run this step more than once with each of the variables!

Take a look at ?Maaslin2 to see what each of the inputs to this function are!

MaAsLin will save the results into a folder (maaslin2_out), so you can take a look in these folders and look at the all_results.tsv as well as significant_results.tsv files. You'll see that significant_results.tsv is just a subsection of all_results.tsv - have a look at one of these files. It also makes some nice plots for you of the significant results - have a look through these and see if you can work out what they're showing you.

In the output you will see several columns: (1) the genus ("feature"), (2) the metadata variable being tested ("metadata") (if you include multiple variables in your tests by using the fixed/random effects, then you will see several in this column), (3) for categorical features, the specific feature level for which the coefficient and significance of association is being reported ("value"), (4) the model effect size/coefficient value ("coef"), (5) the standard error from the model ("stderr"), (6) N (number of samples), (7) N.not.0 (number of samples where this taxon has non-zero abundance), (8) pval (9) qval (False Discovery Rate adjusted p-value)).

Combine the results

In order to compare the results that we've got from the three differential abundance tools, we'll want to first import the results and do some manipulations on these tables.

MaAsLin:

maaslin = read.csv(paste(folder, "maaslin2_out/significant_results.tsv", sep=""), sep="\t", header=1, check.names=F)
maaslin = maaslin[maaslin$qval <= 0.1, ]
rownames(maaslin) = maaslin$feature
rownames(maaslin) <- gsub(".c__", ";c__", rownames(maaslin))
rownames(maaslin) <- gsub(".o__", ";o__", rownames(maaslin))
rownames(maaslin) <- gsub(".f__", ";f__", rownames(maaslin))
rownames(maaslin) <- gsub(".g__", ";g__", rownames(maaslin))

You'll see that first we are filtering to keep only the taxa that were significantly differentially abundant (q-value <= 0.1), and then we're just changing the names of the genera to match what we had previously (it's a little annoying that Maaslin likes to change the names to remove the ; and replaces these with ..

Now we'll do the same for ALDEx2:

aldex = read.csv(paste(folder, "ALDEx2.csv", sep=""))
aldex = aldex[aldex$kw.eBH <= 0.1, ] #keep only taxa with Benjamini-Hochberg (BH)-adjusted p<=0.1 (the same as MaAsLin)
rownames(aldex) = aldex[,1]

We often consider those taxa found by >=2 tests to be significantly differentially abundant as actually being differentially abundant. With this dataset, a lot of taxa are found to be differentially abundant, so I'm actually going to take only the taxa that are found by both tests to narrow this down as much as possible.

Get the taxa found in all tables of significant results:

taxa = intersect(rownames(maaslin), rownames(aldex))

Try looking at how many taxa this is with length(taxa).

We still have a lot here! This isn't necessary, but we'll actually narrow this down even further and only look at those within the 20 most abundant genera.

Get just the top 20:

physeq_genus_relabun = transform_sample_counts(physeq_genus, function(x) (x / sum(x))*100 )
physeq_genus_relabun_top20<-prune_taxa(names(sort(taxa_sums(physeq_genus_relabun),decreasing=TRUE)[1:20]), physeq_genus_relabun)

Get the DA taxa that are within the top 20:

taxa = intersect(taxa, rownames(otu_table(physeq_genus_relabun_top20)))
length(taxa)
physeq_genus_relabun_da<-prune_taxa(taxa, physeq_genus_relabun_top20)

So now we should just have 14 of them, which is more reasonable to look at.

And rename them so that the names will display a little better (add line breaks):

new_names = c()
for (i in 1:length(rownames(otu_table(physeq_genus_relabun_da)))) {
  new_name = str_replace(rownames(otu_table(physeq_genus_relabun_da))[i], ';o__', '\no__')
  new_names = c(new_names, str_replace(new_name, ';g__', '\ng__'))
}
taxa_names(physeq_genus_relabun_da) = new_names

Plot the results

Finally, we'll plot them!

psmelt(physeq_genus_relabun_da) %>%
ggplot(data = ., aes(x = Sample_type, y = Abundance)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(aes(color = OTU), height = 0, width = .2) +
labs(x = "", y = "Abundance\n") +
facet_wrap(~ OTU, scales = "free") +
theme(legend.position = "none")

Note: you'll probably want to open this up in a bigger window so that you can see all of the taxa properly! And remember that you can modify the code from above to save this as a larger figure if you like.

Tutorial results
da_plot <- psmelt(physeq_genus_relabun_da) %>%
ggplot(data = ., aes(x = Sample_type, y = Abundance)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(aes(color = OTU), height = 0, width = .2) +
labs(x = "", y = "Abundance\n") +
facet_wrap(~ OTU, scales = "free") +
theme(legend.position = "none")

png(filename=paste(folder, "da_plot_all.png", sep=""), width=40, height=40, units="cm", res=300)
plot(da_plot)
dev.off()

Finally, if we want to then we can make individual plots for each of these taxa:

for(i in 1:length(taxa_names(physeq_genus_relabun_da))) {
  print(microbiome::boxplot_abundance(physeq_genus_relabun_da, x='Sample_type', y=taxa_names(physeq_genus_relabun_da)[i], line = NULL, violin = FALSE, na.rm = FALSE, show.points = TRUE) + ggtitle(taxa_names(physeq_genus_relabun_da)[i]) + ylab('Relative abundance (%)'))
}

You can click through each of them to see them!

⚠️ **GitHub.com Fallback** ⚠️