Microbiome Helper 2 Analysis workflow in R - LangilleLab/microbiome_helper GitHub Wiki
Authors: Robyn Wright
Modifications by: NA
Based on initial versions by: Robyn Wright and Ben Fisher for previous workshops
Please note: We think that everything here should work, but we are still testing/developing this so use with caution :)
Here we are aiming to give some examples for a typical analysis workflow that we would carry out. Often, members of the Langille lab will use several different methods for analysing and plotting data, so this is not completely comprehensive, but should offer some examples of the types of analyses that can be performed and is hopefully a good starting point for many people.
The initial importing steps may differ depending on what kind of data you are looking at, but many of the analyses will be similar regardless.
You can choose to run RStudio locally on your own computer, or on something like RStudio Server. If you are using an AWS server then this is already installed, as it is on many other servers.
If you are using an AWS server then you will need to set a password: sudo passwd ubuntu
- enter your new password when prompted.
You can then go to http://##.##.###.##:8080/
- you'll need to switch out the # for your public IPv4 address.
We are using several different packages, which can be installed using the following commands:
BiocManager::install(version = "3.21")
BiocManager::install("phyloseq")
library("devtools")
install_github("biobakery/maaslin3")
BiocManager::install("microbiome")
BiocManager::install("ALDEx2")
install.packages("tidyverse")
install.packages('vegan',
repos = c('https://vegandevs.r-universe.dev','https://cloud.r-project.org'))
install_github("statdivlab/radEmu")
install.packages("randomcoloR")
You could choose to run these things in the RStudio Console, but what I like to do is use an RStudio notebook - this allows me to keep track of all of the code that I have run, and produce an output at the end. If you'd like to use an RStudio notebook, then simply go to File > New File > R Notebook. Read through the brief instructions and try running it!
After you've familiarised yourself a little, you can choose a new title at the top, and then delete the rest. I would then make a new "Chunk" for each of the code chunks that I have below (click on the green C and choose "R"). You can make notes outside of these chunks if you like.
If you use the R notebook and make chunks, you'll also need to run the chunks! You can do that by clicking the green play button in the top right hand corner of each chunk, or by pressing Command+Shift+Enter. You can run a single line by pressing Command+Enter.
If you are running the code straight in the R console then you will only need to press Enter to run it, the same as you will have done in the Terminal.
I've added some drop downs titled Tutorial results - if you're using the tutorial data (assuming that you've started with the PacBio data), then you can see what the output is expected to look like when it's run in RStudio.
Load the packages:
library(phyloseq)
library(maaslin3)
library(ALDEx2)
library(ggplot2)
library(tidyr)
library(stringr)
library(vegan)
library(stats)
library(radEmu)
library(randomcoloR)
You'll notice that in each of these, we use the library() command, and inside the brackets we give the name of the package. We've already installed these packages for you, but in most cases they could be installed with e.g., install.packages("tidyr"). In some cases, they might be installed using another package, BiocManager, e.g., BiocManager::install("ALDEx2"). If you're unsure, you can always google how to install a package and you can usually find the code to copy and paste, e.g., "r install aldex2". You could also try typing ??aldex2 into the console and see what comes up - it should give you some information about that package.
Set the name of the folder where our results are saved:
folder = '~/robyn/arctic_ocean_pacbio/'
Here, we have defined a variable called folder - we always need to be careful that we don't save over these variables by calling something else the same thing, but you can always check these in the "Environment" area in the top right, where you should see that folder now exists. We could have called this anything, but it's always helpful to name our variables something that is simple and explains what they are, in case someone else is using our code. This is a string, as it is inside the quotation marks ' ' or " ", which are used in programming to show that we are representing text rather than numbers. These variables can be really useful to avoid having to type in the same information multiple times, which always increases the chances that we'll make a mistake!
You can see how these would be different by running the following:
random_string = "12345"
random_number = 12345
You should see how these are different by how they show in the "Environment" area in the top right.
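If you'd like to check this in code as well as in the "Environment" area, a quick optional check is to ask R what type each object is:
class(random_string) #this should say "character", i.e. text
class(random_number) #this should say "numeric", i.e. a number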
Now that we've explained the basics, we can read in some of our data.
First, we'll read in the feature table. Note that we're calling it asv_table:
asv_table <- read.csv(paste(folder, "denoised_output_exported/feature-table_w_tax.txt", sep=""), sep='\t', skip=1, header=T)
Here we're combining a few different things. We're using the read.csv function - functions are modules of code that accomplish a specific task. We can write them ourselves, they are what is contained in the packages that we imported at the start, or there are several that are built in to R. If you want more information about what a function does, as well as the input that it expects and the output that it gives, you can type in ?read.csv.
We're also using the paste function to build the full file path of the feature table. We're giving it the folder variable that we already defined, as well as the additional folder and file name (both as positional arguments), and finally the sep="" named argument - this tells the paste function that there should be no spaces or other punctuation between the parts that are being pasted together. Try running just this part and see what the result is: paste(folder, "denoised_output_exported/feature-table_w_tax.txt", sep="") - note that as we haven't given this a name, it is not saved as a variable.
Next, we've given the sep='\t' argument - this time, we're telling the read.csv function that the file we're importing is tab-delimited. Then we've told it to skip the first line of the file with skip=1, and that there is a header with header=T (where T is the same as TRUE). We've skipped the first line because it just reads "# Constructed from biom file" and is not actually data.
Just take a look at asv_table using View(asv_table) before continuing, so you can see what the data looks like.

We often find in programming that things are not quite in the format expected by the packages that we're using, so we'll make a few modifications to asv_table:
asv_table_num = data.matrix(asv_table[,2:19]) #convert the ASV table to a numeric matrix. Note that if you didn't use the tutorial data, you might need to change this "19" to make sure that you don't cut off any samples. It should be the number of samples + 1. Have a look at the asv_table object in the upper right corner!
rownames(asv_table_num) = asv_table[,1] #give the matrix row names
ASV = otu_table(asv_table_num, taxa_are_rows = TRUE) #convert asv_table_num to an otu_table
You can see here that I've added # after the code and have written what each line does. You can use # to make comments throughout your code documents, to explain what each line is doing. Here you should also notice that we didn't save over the previous asv_table, but created a new one called asv_table_num. We've told R that it should be expecting numeric data within this table, then we gave it the same row names as our previous asv_table object, and at the end we converted it to what Phyloseq calls an otu_table (it doesn't matter if we have OTUs, ASVs or something else in this table).
The top right hand corner should probably look something like this:

And if you use View(ASV) to look at the otu_table, then you should have something like this:
Again, we're just showing a small section of the table!
You may have noticed that the last column in the initial asv_table was our taxonomy information. We'll get that first:
taxonomy <- data.frame(asv_table[,"taxonomy"])
rownames(taxonomy) <- asv_table[,1]
colnames(taxonomy) = c("taxonomy")
If you print this out by typing taxonomy or by clicking on it in the "Environment" area, you'll see that we currently only have a single column that contains all of the taxonomy information.
We want to split this so that we have columns for each of Domain, Phylum, Class, Order, Family, Genus and Species. Luckily, there is already a function that we can use for this:
taxonomy_split <- separate(data = taxonomy, col = taxonomy, into = c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = "\\; ") #separate the taxonomy table so each phylogenetic level is its own column. Note that if you are using 18S then you might need to separate this differently!
TAX = tax_table(taxonomy_split) #convert taxonomy_split to a tax_table
taxa_names(TAX) <- rownames(taxonomy_split) #add names to TAX/the tax_table
You'll see that we tell the separate function that the data it should use is in the taxonomy table, and that the col (column) to split is called "taxonomy". The "into" named argument takes a vector of new column names, which is defined in R with c(), and the taxonomy column is split into a new column each time the "; " separator is found. You'll see a warning after running this line, but it still ran so that's fine - it's just telling us that there weren't values for all of Domain/Phylum/Class/Order/Family/Genus/Species for every ASV, which makes sense as not all of them will have a species-level classification.
Afterwards, we convert this into a Phyloseq tax_table.
Again, you can take a look at the results (or any of the intermediates) by using something like View(TAX).
Now we'll read in the metadata. I won't go through each stage step-by-step as hopefully you're getting the idea by now, but I've still added comments to each of the rows:
metadata <- read.csv(paste(folder, "arctic_study_metadata.txt", sep=""), sep='\t')
samples = data.frame(metadata[,c(3,4,5,6,7)], stringsAsFactors = FALSE) #convert this to a data frame - again, if you didn't use the tutorial data, check which columns of your metadata you need to keep here; you may have more or fewer than these five
rownames(samples) = metadata[,1] #add the sample names as row names
samples[, c(2,4,5)] <- sapply(samples[, c(2,4,5)], as.numeric) #convert numeric columns to numeric
samples[, 'Sequencing'] = 'PacBio' #add a column to the metadata table that says which kind of sequencing we've used - only necessary if you're using the tutorial data!
SAMPLE = sample_data(samples) #convert samples to a sample_data
Now we'll read in the phylogenetic tree as phy_tree:
phy_tree <- read_tree(paste(folder, "denoised_output_exported/tree.nwk", sep=""))
Phyloseq is a really useful package that contains many functions for analysing microbiome data. While it can be a bit fiddly to get our data into the format that it is expecting (most good packages should give you examples of how data should be formatted, although it does take a lot of practice to get good at quickly working out how this differs from what you have!), once it is in this format, the analyses are very easy to perform.
physeq = phyloseq(ASV, TAX, phy_tree, SAMPLE) #combine these all to make a phyloseq object
physeq #print this out to see what a phyloseq object looks like
Now we have combined all of these different parts into one phyloseq object. You can also get each of the individual objects back after performing manipulations, e.g., otu_table(physeq).
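As an optional check, you can pull each of the components back out of the combined object - these are all standard phyloseq accessor functions:
head(otu_table(physeq)) #the first few rows of the feature (ASV) table
head(tax_table(physeq)) #the first few rows of the taxonomy table
head(sample_data(physeq)) #the first few rows of the sample metadata
ntaxa(physeq) #the number of taxa in the object
nsamples(physeq) #the number of samples in the object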
Now we can take a look at how many reads have been classified within our samples:
sample_sums(physeq)
print(c("Minimum sample size:",min(sample_sums(physeq)), "Maximum sample size:", max(sample_sums(physeq))))
It's also always good to have a look at a rarefaction curve. This tells us about whether we've sequenced our samples sufficiently to capture all of the diversity within them:
rarecurve(as.data.frame(t(otu_table(physeq))), step=50,cex=0.5,label=TRUE)
If you're not familiar with rarefaction curves, or species accumulation curves, then each line represents a single sample. You should see that the maximum x value is the number of reads in that sample, and the maximum y value is the number of ASVs. Each point along the curve represents the number of ASVs that are found if we randomly sub-sampled the sample to only have X number of reads. Ideally, these will all plateau, which would suggest that our sequencing depth is sufficient and we wouldn't expect to identify any more ASVs even if we'd sequenced more.
Note that if you'd like to zoom in on a particular area of the graph then you can simply set x/y limits using e.g. xlim=c(1,5000):
rarecurve(as.data.frame(t(otu_table(physeq))), step=50,cex=0.5,label=TRUE, xlim=c(1,5000))
Let's also take a look at the number of reads assigned to each taxon within our samples:
length(taxa_sums(physeq))
print(c("Minimum reads:",min(taxa_sums(physeq)), "Maximum reads:", max(taxa_sums(physeq))))
taxa_sums(physeq)
Note that this is only the first part of the print-out! You will be able to see the sums for all 817 of the ASVs.
We're usually going to want to remove some of these very low abundance taxa - we can't be sure whether they were sequencing errors, bleed-through from previous sequencing runs, or just very low abundance taxa. I've chosen a threshold of at least 100 total reads to keep a taxon, but you can choose something different depending on what makes sense to you.
physeq <- prune_taxa(taxa_sums(physeq)>=100,physeq)
physeq
If you look at the physeq object again, you should see that the number of taxa has been reduced.
Take another look at the rarefaction curve:
rarecurve(as.data.frame(t(otu_table(physeq))), step=50,cex=0.5,label=TRUE)
Now we're going to want to choose a number of reads where it looks like all samples have plateaued - note that any samples with fewer reads than this will be removed from further analyses. For me, this was around 1000.
So first, we'll remove samples with below 1000 reads:
physeq <- prune_samples(sample_sums(physeq)>=1000, physeq)
Take a look at the rarefaction curves again to double-check that you're OK with this:
rarecurve(as.data.frame(t(otu_table(physeq))), step=50,cex=0.5,label=TRUE, xlim=c(0,5000))
And then rarefy all samples to the minimum remaining read depth (you could choose the same value for this - 1000 in my case - but ideally you want this to be as high as possible so that you aren't unnecessarily removing reads):
set.seed(711)
rarefied_physeq<-rarefy_even_depth(physeq, rngseed = FALSE, sample.size = min(sample_sums(physeq)), trimOTUs = TRUE)
See that the samples all have the same number of reads now:
sample_sums(rarefied_physeq)
And then we'll also normalise by converting to relative abundance:
relabun_physeq <- transform_sample_counts(physeq, function(x) x*100 / sum(x))
Note that we are only going to use this rarefied object for the alpha diversity analyses, but we have excluded the low read depth samples from all further analyses!
Alpha diversity refers to within sample diversity, and there are three main components of it:
- Richness: the number of different taxa present
- Evenness: whether the taxa are present in similar (even) abundances
- Phylogenetic diversity: we are also often interested in whether the taxa in our samples are closely related or not, and Phylogenetic Diversity is used as a representation of this. It is calculated by summing the branch lengths of a phylogenetic tree that connects all taxa within a sample, so a high value means that the taxa are phylogenetically more diverse
Several of the most commonly used Alpha diversity metrics are:
- Observed taxa: simply a count of the number of different taxa in each sample
- Chao1 Richness: this is an estimate of the total number of taxa within your community based on your sample
- Shannon Index (or Shannon-Wiener Index): considers both richness and evenness and is calculated by summing the proportion of each species multiplied by the natural logarithm of that proportion, and taking the negative of that sum
- Simpson's Index: a measure of the probability that two randomly selected reads from a community will belong to the same taxon. With only one taxon, the probability that both will be the same is 1. The complement of this (1 minus the probability) is what is most frequently used, where 0 represents low diversity and 1 represents high diversity (see the short sketch after this list)
- Faith's Phylogenetic Diversity: as described above
- Fisher's Alpha: this also considers both richness and evenness and is often used when the taxa abundance follows a log-series distribution
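To make the Shannon and Simpson definitions above a little more concrete, here is a toy calculation on some made-up counts (not the tutorial data), compared against vegan's diversity() function:
counts <- c(50, 25, 15, 7, 3) #made-up counts for five taxa in one sample
p <- counts / sum(counts) #convert the counts to proportions
shannon_manual <- -sum(p * log(p)) #Shannon: the negative sum of p multiplied by ln(p)
simpson_manual <- 1 - sum(p^2) #Simpson (1 - D): one minus the probability that two random reads are the same taxon
shannon_manual
diversity(counts, index = "shannon") #the same value calculated by vegan
simpson_manual
diversity(counts, index = "simpson") #the same value calculated by vegan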
Beta diversity refers to between sample diversity and there are again three main components of it:
- Whether the taxa in the samples are the same
- Whether the taxa are present at similar abundances in the different samples
- Whether the taxa are closely related to one another
There are a lot of different beta diversity metrics used, but some of the most widely used for microbiome analyses are:
- Bray-Curtis Dissimilarity Index: takes into account the similarity of the taxa between two samples as well as the abundances of the taxa within these samples. It ranges between 0 and 1, where 0 would indicate that the two samples are identical and 1 that there are no shared taxa between them
- Weighted/Unweighted UniFrac distance: measures the difference between communities by considering the branch lengths of a phylogenetic tree that are unique to each community. Weighted UniFrac additionally considers the abundance of the taxa within the two communities/samples, where Unweighted UniFrac doesn't
- Aitchison distance: a method for compositional data analysis that is the Euclidean distance calculated on Centered Log-Ratio (CLR) transformed data. Converting microbiome data to CLR requires the addition of a pseudocount (usually 1) so that logs can be taken; the robust version of this is also often used, which removes zeroes prior to the log-ratio calculations and adds them back afterwards (see the short sketch below)
- Phylo-RPCA (Robust Principal Component Analysis): another method for compositional data analysis that also incorporates phylogeny
Many beta diversity metrics account for the compositionality of microbiome data. Data compositionality refers to the fact that the abundance of all taxa are dependent on the abundances of all other taxa within that sample because the number of reads (our sequencing effort) for each sample is finite. You can read more on this concept here.
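As a small illustration of the CLR/Aitchison idea described above, here is a toy example on some made-up counts (not the tutorial data) using only base R - this is a sketch of the standard (non-robust) version with a pseudocount of 1:
toy_counts <- matrix(c(10, 0, 5, 20,
                       8, 2, 0, 30,
                       15, 1, 3, 25),
                     nrow = 3, byrow = TRUE) #3 samples (rows) x 4 taxa (columns)
toy_pseudo <- toy_counts + 1 #add a pseudocount of 1 so that we can take logs of the zeroes
clr_toy <- t(apply(toy_pseudo, 1, function(x) log(x) - mean(log(x)))) #CLR: the log of each value minus the mean log of its sample
dist(clr_toy, method = "euclidean") #the Aitchison distance is the Euclidean distance on the CLR-transformed data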
Now that you hopefully understand alpha and beta diversity... we'll get to looking at them.
First, we'll use a phyloseq plot function to look at alpha diversity within our samples:
plot_richness(physeq = rarefied_physeq, color = "Depth", measures = c("Observed", "Shannon", "Fisher", "Simpson")) + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
Note that if it is difficult to see your sample names, try clicking "Show in New Window" (the left of the three icons that show above your plot!):
You'll see that I've coloured them by Depth, which is one of the categories in the metadata that I'm using. You should switch this to whatever is of interest to you. You could also add a second variable using the shape = "" option.
Now we'll calculate the Bray-Curtis dissimilarity between our samples and perform a PERMANOVA (ADONIS2) test to see whether they're significantly different from one another based on any of our metadata categories:
distance <- phyloseq::distance(relabun_physeq, method="bray", weighted=F)
ads = adonis2(distance ~ sample_data(relabun_physeq)$Depth + sample_data(relabun_physeq)$geographic_location_.latitude. + sample_data(relabun_physeq)$geographic_location_.longitude., by="margin", parallel=2)
ads
I've used a few different metadata categories here and have used + to look at them all together. The two key values we're interested in here are the R2 and the p-value (Pr(>F)). The R2 tells us how much of the variation between our samples is accounted for by that category, and p tells us whether it's statistically significant or not. The residual R2 value tells you how much variation within your samples is unaccounted for by the metadata categories that you're looking at. You could use * instead of + if you wanted to look at the interaction between any of your variables.
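For example, a minimal sketch of what an interaction model could look like, using two of the tutorial metadata columns (swap in whichever variables make sense for your data - note that I've used the default sequential testing here rather than by="margin"):
ads_interaction = adonis2(distance ~ sample_data(relabun_physeq)$Depth * sample_data(relabun_physeq)$geographic_location_.latitude., parallel=2)
ads_interaction #the row containing ":" is the interaction term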
This might take a while if you have a lot of samples!
Now we'll do the same but with Weighted UniFrac distance:
distance <- phyloseq::distance(relabun_physeq, method="unifrac", weighted=T)
ads = adonis2(distance ~ sample_data(relabun_physeq)$Depth + sample_data(relabun_physeq)$geographic_location_.latitude. + sample_data(relabun_physeq)$geographic_location_.longitude., by="margin", parallel=2)
ads
If you compare these to the Bray-Curtis results, you should see that the R2 value for Latitude is about the same for both (indicating that Latitude accounts for ~11% of the variation between samples), the R2 value is higher for UniFrac than Bray-Curtis for Depth, and lower for UniFrac than Bray-Curtis for Longitude. This would suggest to me that some of the Longitude differences that Bray-Curtis picks up on are from organisms that are relatively closely related phylogenetically, but some of the Depth differences that Bray-Curtis shows are from less closely related organisms.
Now we'll plot the samples on PCoA plots. PCoA plots attempt to represent the variation between our samples in fewer dimensions; points that are closer together indicate samples that are more similar, and points that are further apart indicate samples that are less similar. We'll also often add an ellipse to show each group. The percentages on the axes represent the amount of variation between our samples that is represented on that axis.
Plot Bray-Curtis dissimilarity:
ordination<-ordinate(relabun_physeq, "PCoA", "bray")
plot_ordination(relabun_physeq, ordination, color = "geographic_location_.longitude.") +
geom_point(size=10, alpha=0.1) + geom_point(size=5) + stat_ellipse(type = "t", linetype = 2) + theme(text = element_text(size = 20)) + ggtitle("Beta Diversity", subtitle = "Bray-Curtis dissimilarity")
You'll notice that I decided to colour my points by longitude here, as that was what the PERMANOVA/ADONIS2 test showed had the largest impact on variation between my samples, but you can choose any of your metadata categories here, and again, you could also use the shape = "" parameter if you have more than one variable of interest.
Plot Weighted UniFrac distance:
ordination<-ordinate(relabun_physeq, "PCoA", "wunifrac")
plot_ordination(relabun_physeq, ordination, color = "geographic_location_.longitude.") +
geom_point(size=10, alpha=0.1) + geom_point(size=5) + stat_ellipse(type = "t", linetype = 2) + theme(text = element_text(size = 20)) + ggtitle("Beta Diversity", subtitle = "Weighted UniFrac distance")

A useful way to visualise the composition of our samples is through bar charts of taxonomic relative abundances. Because each bar will be a separate colour, it's not usually very useful to try to show too many taxa in one plot, so we'll often take just the top X most abundant taxa (in this case the top 20), or only those with >Y% relative abundance.
Plot the top 20 most abundant ASVs:
relabun_physeq_top<-prune_taxa(names(sort(taxa_sums(relabun_physeq),decreasing=TRUE)[1:20]), relabun_physeq)
plot_bar(relabun_physeq_top, fill="ta7") + facet_wrap(c(~geographic_location_.longitude., ~Depth), scales="free_x", nrow=1) + theme(legend.text=element_text(size=5), legend.key.size = unit(0.3, "cm")) + guides(fill=guide_legend(ncol=1))
You should see here that we're first of all creating a new phyloseq object with only the top 20 most abundant taxa, and then we're plotting them. We're also grouping our samples by both longitude and depth (again, change these to whatever makes sense for you!). Hopefully you'll start to see some patterns here...

Now we'll try doing this at the phylum level instead:
#if you take a look at the taxonomy table within our phyloseq object, you should see that the phylum level is now called "ta2"
tax_table(relabun_physeq)
#so ta2 is what we'll use here to group everything at that rank
ps.phylum = tax_glom(relabun_physeq, taxrank="ta2", NArm=FALSE)
#have a look at this
print(ps.phylum)
#you can see that now there's only ten taxa, so we'll get a colour palette that has 10 unique colours
palette = distinctColorPalette(10)
#now we'll make the bar plot
plot_bar(ps.phylum, fill="ta2") + facet_wrap(c(~geographic_location_.longitude., ~Depth), scales="free_x", nrow=1) + theme(legend.text=element_text(size=5), legend.key.size = unit(0.3, "cm")) + guides(fill=guide_legend(ncol=1)) + scale_fill_manual(values=palette)
#you can see here that first we're plotting the barplot, and then we're adding some other options - we're telling it how to make the legend, and also which colours to use for plotting
Hopefully you can follow the comments on each line to see what they're doing!

And now we'll do the same, but only grouping the samples by longitude:
plot_bar(ps.phylum, fill="ta2") + facet_wrap(c(~geographic_location_.longitude.), scales="free_x", nrow=1) + theme(legend.text=element_text(size=5), legend.key.size = unit(0.3, "cm")) + guides(fill=guide_legend(ncol=1)) + scale_fill_manual(values=palette)
Now let's plot genus:
ps.genus = tax_glom(relabun_physeq, taxrank="ta6", NArm=FALSE)
ps.genus_top<-prune_taxa(names(sort(taxa_sums(ps.genus),decreasing=TRUE)[1:20]), ps.genus)
#this time we're keeping the top 20 genera, so we'll get a colour palette that has 20 unique colours
palette = distinctColorPalette(20)
#now we'll make the bar plot
plot_bar(ps.genus_top, fill="ta6") + facet_wrap(c(~geographic_location_.longitude., ~Depth), scales="free_x", nrow=1) + theme(legend.text=element_text(size=5), legend.key.size = unit(0.3, "cm")) + guides(fill=guide_legend(ncol=1)) + scale_fill_manual(values=palette)
#you can see here that first we're plotting the barplot, and then we're adding some other options - we're telling it how to make the legend, and also which colours to use for plotting

And genus only grouped by longitude:
plot_bar(ps.genus_top, fill="ta6") + facet_wrap(c(~geographic_location_.longitude.), scales="free_x", nrow=1) + theme(legend.text=element_text(size=5), legend.key.size = unit(0.3, "cm")) + guides(fill=guide_legend(ncol=1)) + scale_fill_manual(values=palette)
And genus only grouped by Depth:
plot_bar(ps.genus_top, fill="ta6") + facet_wrap(c(~Depth), scales="free_x", nrow=1) + theme(legend.text=element_text(size=5), legend.key.size = unit(0.3, "cm")) + guides(fill=guide_legend(ncol=1)) + scale_fill_manual(values=palette)
Obviously this isn't a comprehensive list of everything that you could do, but hopefully it's becoming clear to you how you could change these for yourself to get a plot of what you're most interested in.
While stacked bar plots can be great for getting an overview of our microbial communities, they're not very quantitative; it's often really difficult to determine whether a taxon is more abundant in one sample vs another, especially if it's not one of the most abundant. So heatmaps can be really helpful here!
We'll use the same phyloseq object as we created above to create a plot of the top 20 genera in our samples:
plot_heatmap(ps.genus_top, method = "PCoA", distance = "bray", low = "white", high = "red", na.value = "grey", sample.label="geographic_location_.longitude.", taxa.label="ta6")
Feel free to change the colours for low and high, or the sample labels. Note that we've used PCoA and "bray" for grouping the samples, so they're being sorted according to this grouping. This is often really useful for seeing patterns within groups of samples!
Next, we'll use a different heatmap plotting function to show us the dendrogram showing sample similarity:
heatmap(otu_table(ps.genus_top), labRow = tax_table(ps.genus_top)[,"ta6"])
You'll see that this also adds a dendrogram to our taxa, but it's important to note that this is picking up on similarities in taxon abundance and we haven't added the phylogenetic tree to this plot!
The final type of plot we'll go through is a phylogenetic tree plot with the abundance of taxa within our samples.
Again, we're plotting the top 20 genera:
plot_tree(ps.genus_top, nodelabf=nodeplotblank, color="geographic_location_.longitude.", label.tips = "ta6", ladderize="left", size="Abundance")
You can see from the legend that the abundance within samples is represented by the size of the point, and we've chosen to colour them based on longitude.
These are probably quite difficult to see, though.
So what we often want to do is save a figure:
tree_fig <- plot_tree(ps.genus_top, nodelabf=nodeplotblank, color="geographic_location_.longitude.", label.tips = "ta6", ladderize="left", size="Abundance") #assign the plot to an object
png(filename=paste(folder, "tree_plot.png", sep=""), width=40, height=20, units="cm", res=300) #open a figure and tell it how big this should be
plot(tree_fig) #plot our tree object
dev.off() #and tell the plotting that we're finished
Let's do the same for species now, but with the top 50 most abundant:
ps.species = tax_glom(relabun_physeq, taxrank="ta7", NArm=FALSE)
rank.sum = tapply(taxa_sums(ps.species), tax_table(ps.species)[, "ta7"], sum, na.rm=TRUE)
top50 = names(sort(rank.sum, TRUE))[1:50]
#and then we'll reduce the phyloseq object to only have these top50 taxa
ps.species = prune_taxa((tax_table(ps.species)[, "ta7"] %in% top50), ps.species)
tree_fig <- plot_tree(ps.species, nodelabf=nodeplotblank, color="geographic_location_.longitude.", label.tips = "ta7", ladderize="left", size="Abundance")
png(filename=paste(folder, "tree_plot_species.png", sep=""), width=40, height=40, units="cm", res=300)
plot(tree_fig)
dev.off()
#you'll see a few blank labels here, so you can always change "ta7" to "ta6" so that you plot the genus name for each of the species, rather than the species name, to see what you have
You can modify this same code to save any of the other plots that we've made!
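As an aside, most of the plots we've made (apart from the base-R heatmap) are ggplot objects, so ggplot2's ggsave() is an alternative way to save them - a minimal sketch, reusing the tree figure from above (the file name here is just an example):
ggsave(filename=paste(folder, "tree_plot_species_ggsave.png", sep=""), plot=tree_fig, width=40, height=40, units="cm", dpi=300) #saves the plot object directly, without needing png()/dev.off()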
Tens (if not more) of tools have been used for differential abundance testing of microbiome samples - identifying taxa that differ in abundance between sample groups (for example, treatment versus control) or with variables of interest (for example, correlations with pH or blood metabolite measurements). While these tools/methods have often been used interchangeably in the literature, there are large differences in the performance of different tools even on the same sample group. There have now been several large scale comparisons of different tools (e.g., Calgaro et al., Thorsen et al.), and our lab also carried one out (published here). While these studies have typically found that different methods lead to different biases in the results, as there is no one single best method for determining differentially abundant taxa, the best practice that we have for overcoming these issues currently is to use multiple differential abundance tests, report these clearly, and focus on the taxa that are identified by multiple tests.
We'll be using the most recent versions of several different tools here: ALDEx2, MaAsLin3 and radEmu.
If you are using the tutorial data, we're going to do the comparisons between Illumina and PacBio. So first we'll read in the other data and then look at a basic overview before continuing on with the others.
We used PacBio above, so we'll use Illumina here, but if you did the opposite then you'll need to switch some of these commands to reflect that.
Set the new folder name:
folder_2 = '~/robyn/arctic_ocean_illumina/'
Read in the ASV table:
asv_table_2 <- read.csv(paste(folder_2, "denoised_output_exported/feature-table_w_tax.txt", sep=""), sep='\t', skip=1, header=T)
asv_table_num_2 = data.matrix(asv_table_2[,2:40]) #convert the ASV table to a numeric matrix. Note that if you didn't use the tutorial data, you might need to change this "40" to make sure that you don't cut off any samples. It should be the number of samples + 1. Have a look at the asv_table object in the upper right corner!
rownames(asv_table_num_2) = asv_table_2[,1] #give the matrix row names
ASV_2 = otu_table(asv_table_num_2, taxa_are_rows = TRUE) #convert asv_table_num to an otu_table
Take a look at it with View(ASV_2):
You should notice that the sample names now say "short" instead of "full"!
Get the taxonomy information:
taxonomy_2 <- data.frame(asv_table_2[,"taxonomy"])
rownames(taxonomy_2) <- asv_table_2[,1]
colnames(taxonomy_2) = c("taxonomy")
taxonomy_split_2 <- separate(data = taxonomy_2, col = taxonomy, into = c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = "\\; ") #separate the taxonomy table so each phylogenetic level is its own column. Note that if you are using 18S then you might need to separate this differently!
TAX_2 = tax_table(taxonomy_split_2) #convert taxonomy_split to a tax_table
taxa_names(TAX_2) <- rownames(taxonomy_split_2) #add names to TAX/the tax_table
Get the metadata:
metadata_2 <- read.csv(paste(folder_2, "arctic_study_metadata.txt", sep=""), sep='\t')
samples_2 = data.frame(metadata_2[,c(3,4,5,6,7)], stringsAsFactors = FALSE) #convert this to a data frame - again, if you didn't use the tutorial data, check which columns of your metadata you need to keep here; you may have more or fewer than these five
rownames(samples_2) = metadata_2[,1] #add the sample names as row names
samples_2[, c(2,4,5)] <- sapply(samples_2[, c(2,4,5)], as.numeric) #convert numeric columns to numeric
samples_2[, 'Sequencing'] = 'Illumina'
SAMPLE_2 = sample_data(samples_2) #convert samples to a sample_data
Get the phylogenetic tree:
phy_tree_2 <- read_tree(paste(folder_2, "denoised_output_exported/tree.nwk", sep=""))
Combine this into a Phyloseq object:
physeq_2 = phyloseq(ASV_2, TAX_2, phy_tree_2, SAMPLE_2) #combine these all to make a phyloseq object
physeq_2 #print this out to see what the phyloseq object looks like
Note that if you are looking at this for the first time, it would be good to run the same checks on this as you did above (looking at the rarefaction curves, looking at the number of taxa and how many reads there are assigned to each, etc.), but here I'll just filter the Phyloseq object.
Prune taxa:
physeq_2 <- prune_taxa(taxa_sums(physeq_2)>=200,physeq_2)
physeq_2
Prune samples:
physeq_2 <- prune_samples(sample_sums(physeq_2)>=2000, physeq_2)
And now rarefy:
set.seed(711)
rarefied_physeq_2<-rarefy_even_depth(physeq_2, rngseed = FALSE, sample.size = min(sample_sums(physeq_2)), trimOTUs = TRUE)
Convert to relative abundance:
relabun_physeq_2 <- transform_sample_counts(physeq_2, function(x) x*100 / sum(x))
Now that we've got these normalised Phyloseq objects, you can perform the same analyses (alpha/beta diversity, bar plots, etc.) as we did above if you like. Just below we will show these analyses for both together, though.
Note that the trees don't combine so easily and we don't need them for the differential abundance analyses, so we're not adding them here. We also would not expect the two different sequencing types to match ASV names because they were from different length sequences, so we will group both at the species level and rename all of the taxa to be these species names.
physeq_pacbio = phyloseq(otu_table(physeq), sample_data(physeq), tax_table(physeq))
physeq_illum = phyloseq(otu_table(physeq_2), sample_data(physeq_2), tax_table(physeq_2))
physeq_species_pacbio = tax_glom(physeq_pacbio, taxrank="ta7")
all_tax_pacbio = paste(tax_table(physeq_species_pacbio)[,2], tax_table(physeq_species_pacbio)[,3], tax_table(physeq_species_pacbio)[,4], tax_table(physeq_species_pacbio)[,5], tax_table(physeq_species_pacbio)[,6], tax_table(physeq_species_pacbio)[,7], sep=';')
taxa_names(physeq_species_pacbio) = all_tax_pacbio
physeq_species_illum = tax_glom(physeq_illum, taxrank="ta7")
all_tax_illum = paste(tax_table(physeq_species_illum)[,2], tax_table(physeq_species_illum)[,3], tax_table(physeq_species_illum)[,4], tax_table(physeq_species_illum)[,5], tax_table(physeq_species_illum)[,6], tax_table(physeq_species_illum)[,7], sep=';')
taxa_names(physeq_species_illum) = all_tax_illum
merged_physeq = merge_phyloseq(physeq_species_illum, physeq_species_pacbio) #merge the two species-level objects (with their renamed taxa) into one
Rarefy together:
set.seed(711)
rarefied_merged_physeq<-rarefy_even_depth(merged_physeq, rngseed = FALSE, sample.size = min(sample_sums(merged_physeq)), trimOTUs = TRUE)
Convert to relative abundance together:
relabun_merged_physeq <- transform_sample_counts(merged_physeq, function(x) x*100 / sum(x))
Note that these next parts are optional and are for exploring the data together, but are not required for differential abundance analyses.
plot_richness(physeq = rarefied_merged_physeq, x = "Sequencing", color = "Sequencing", measures = c("Observed", "Shannon", "Fisher", "Simpson")) + geom_boxplot() + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
Note that here we added in the + geom_boxplot() part of this plot. This means that we now get a boxplot for each unique value of the x-axis (which we told to be the sequencing method with x = "Sequencing"!).
So you can see that even though we have rarefied all of our samples to the same depth, we still have more ASVs in our Illumina samples than in the PacBio samples.
Because Beta diversity is the between sample diversity and we don't have a phylogenetic tree here, it doesn't really make sense to do this at the ASV level because we wouldn't expect to have any shared taxa between the two different datasets (where our sequences are different lengths!). To compare them, we'll instead group the taxa at the species and then at the genus levels.
PERMANOVA test:
distance <- phyloseq::distance(relabun_merged_physeq, method="bray", weighted=F)
ads = adonis2(distance ~ sample_data(relabun_merged_physeq)$Sequencing + sample_data(relabun_merged_physeq)$Depth + sample_data(relabun_merged_physeq)$geographic_location_.latitude. + sample_data(relabun_merged_physeq)$geographic_location_.longitude., by="margin", parallel=2)
ads

So sequencing is still having the largest impact on the variation between samples, even though we grouped the taxa at the species level. This is likely because we have only 382/750 species-level classifications for the Illumina ASVs while we have 458/818 species-level classifications for the PacBio ASVs. This is not so bad at the genus level, where we have 584 genus-level classifications for Illumina and 691 for PacBio. Note that all of these numbers are before filtering out low abundance, but it's likely that we'd still have similar proportions!
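If you'd like to check these proportions for your own data, something along these lines should work - note that physeq and physeq_2 have already had low-abundance taxa filtered out here, so the counts won't exactly match the pre-filtering numbers above, and this assumes that missing ranks came through as NA from the separate() step:
sum(!is.na(tax_table(physeq_2)[, "ta7"])) #the number of Illumina taxa with a species-level classification
ntaxa(physeq_2) #the total number of Illumina taxa
sum(!is.na(tax_table(physeq)[, "ta7"])) #the number of PacBio taxa with a species-level classification
ntaxa(physeq) #the total number of PacBio taxa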
Plot PCoA:
ordination<-ordinate(relabun_merged_physeq, "PCoA", "bray")
plot_ordination(relabun_merged_physeq, ordination, color = "Sequencing") +
geom_point(size=10, alpha=0.1) + geom_point(size=5) + stat_ellipse(type = "t", linetype = 2) + theme(text = element_text(size = 20)) + ggtitle("Beta Diversity", subtitle = "Bray-Curtis dissimilarity")

Given how different the samples are based on the Sequencing variable, it's not very surprising that all of the samples group separately!
Now let's try this at the genus level.
Group at the genus level:
ps.genus = tax_glom(relabun_merged_physeq, taxrank="ta6", NArm=FALSE)
PERMANOVA test:
distance <- phyloseq::distance(ps.genus, method="bray", weighted=F)
ads = adonis2(distance ~ sample_data(ps.genus)$Sequencing + sample_data(ps.genus)$Depth + sample_data(ps.genus)$geographic_location_.latitude. + sample_data(ps.genus)$geographic_location_.longitude., by="margin", parallel=2)
ads
Interestingly, it actually seems to make very little difference here that we've grouped at the genus rather than the species level!
Plot PCoA:
ordination<-ordinate(ps.genus, "PCoA", "bray")
plot_ordination(ps.genus, ordination, color = "Sequencing") +
geom_point(size=10, alpha=0.1) + geom_point(size=5) + stat_ellipse(type = "t", linetype = 2) + theme(text = element_text(size = 20)) + ggtitle("Beta Diversity", subtitle = "Bray-Curtis dissimilarity")
First, we'll group the Phyloseq objects at a few different taxonomic ranks.
Genus level:
ps.genus = tax_glom(relabun_merged_physeq, taxrank="ta6", NArm=FALSE)
ps.genus_top<-prune_taxa(names(sort(taxa_sums(ps.genus),decreasing=TRUE)[1:50]), ps.genus)
Family level:
ps.family = tax_glom(relabun_merged_physeq, taxrank="ta5", NArm=FALSE)
ps.family_top<-prune_taxa(names(sort(taxa_sums(ps.family),decreasing=TRUE)[1:50]), ps.family)
Phylum level:
ps.phylum = tax_glom(relabun_merged_physeq, taxrank="ta2", NArm=FALSE)
ps.phylum_top<-prune_taxa(names(sort(taxa_sums(ps.phylum),decreasing=TRUE)[1:50]), ps.phylum)
And make the heatmaps at each of these levels.
Genus:
plot_heatmap(ps.genus_top, method = "PCoA", distance = "bray", low = "white", high = "red", na.value = "grey", sample.label="Sequencing", taxa.label="ta6")
Family:
plot_heatmap(ps.family_top, method = "PCoA", distance = "bray", low = "white", high = "red", na.value = "grey", sample.label="Sequencing", taxa.label="ta5")
Phylum:
plot_heatmap(ps.phylum_top, method = "PCoA", distance = "bray", low = "white", high = "red", na.value = "grey", sample.label="Sequencing", taxa.label="ta2")
Note that if you merged the two Phyloseq objects then you may want to rename the merged object to physeq to match the commands below, or you can change the commands.
First, we'll get a table at the genus level and add names to these that include all taxonomic ranks:
physeq_genus = tax_glom(physeq, taxrank="ta6")
all_tax = paste(tax_table(physeq_genus)[,2], tax_table(physeq_genus)[,3], tax_table(physeq_genus)[,4], tax_table(physeq_genus)[,5], tax_table(physeq_genus)[,6], sep=';')
taxa_names(physeq_genus) = all_tax
otu_table(physeq_genus)
Here, we are really just checking that our ASVs (or species) have been renamed at the genus level:
Add the sample sums to the metadata (for MaAsLin3 - this is a variable that it takes into account):
sample_data(physeq_genus)$sample_sums <- sample_sums(physeq_genus)
Then, I wanted to make some new sample groups here when I was using only the PacBio tutorial data - I had a lot of different values within my categories, so I wanted to group these together, but this part is entirely optional:
sample_data(physeq_genus)$depth_group <- c('below_100', 'below_100', 'below_100', 'below_100', 'below_100', 'below_100', 'below_100', 'above_100', 'below_100', 'below_100', 'below_100', 'below_100', 'above_100', 'above_100')
sample_data(physeq_genus)$longitude_group <- c('above_10', 'below_10', 'above_10', 'above_10', 'above_10', 'above_10', 'above_10', 'above_10', 'below_10', 'below_10', 'below_10', 'below_10', 'below_10', 'above_10')
If you do decide to do something like this, it's important to think about what is biologically relevant, and whether your groupings make sense. For example, with my depth grouping, it's likely that there's some light above 100m but little light below.
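If your grouping is based on a numeric column, you can also build it programmatically rather than typing the values out by hand - a minimal sketch, assuming that Depth is numeric and in metres (check the resulting groups against your metadata!):
sample_data(physeq_genus)$depth_group <- ifelse(sample_data(physeq_genus)$Depth > 100, 'below_100', 'above_100') #samples deeper than 100 m go into 'below_100', the rest into 'above_100'
table(sample_data(physeq_genus)$depth_group) #check how many samples ended up in each group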
Now we'll create a rarefied genus table:
set.seed(345)
physeq_rare = rarefy_even_depth(physeq_genus, sample.size = min(sample_sums(physeq_genus)), replace = TRUE, trimOTUs = TRUE, verbose = TRUE)
#note that we did already do this, but we did it at the ASV level!
And we'll save the feature table and metadata table (some of the DA tools require these not inside the phyloseq object):
feat_table = data.frame(t(otu_table(physeq_rare)), check.rows=T, check.names=F, stringsAsFactors=T)
metadata = data.frame(sample_data(physeq_rare), stringsAsFactors = F)
Next, we'll run ALDEx2. Note that here the first step is to normalise the samples using a CLR normalisation, so you can see how both of the methods require similar steps - normalisation and then testing - but this is done in a slightly different way for each. We then use a Kruskal-Wallis test and a GLM ANOVA on the data, and we can do this test with multiple cores (useMC=2):
x <- aldex.clr(otu_table(physeq_genus), sample_data(physeq_genus)$Sequencing, mc.samples = 128, verbose=F, denom="all")
kw.test <- aldex.kw(x, useMC=2, verbose=FALSE)
write.csv(kw.test, paste(folder, "ALDEx2.csv", sep=""))
We could repeat this with multiple variables if we wanted, but it can be a little difficult to integrate multiple variables into the same test with ALDEx2.
If you take a look at these results files then you'll see that we have 4 columns: kw.ep, kw.eBH, glm.ep, glm.eBH - the p-values and Benjamini-Hochberg (BH)-corrected p-values for the Kruskal-Wallis and GLM ANOVA tests between the sample groups. There are several other options for testing within the ALDEx2 R package (including aldex.ttest and aldex.glm), but in this case, we can use the GLM ANOVA results for parametric tests and the Kruskal-Wallis results for non-parametric tests. We could also use aldex.effect to see the magnitude of the differences between groups.
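If you do want to look at those effect sizes, a minimal sketch is below - for a two-group comparison like Sequencing, aldex.effect() can be run on the same CLR object that we made above:
effect <- aldex.effect(x, verbose=FALSE) #estimate effect sizes from the CLR object
head(effect) #diff.btw is the median between-group difference and effect is the standardised effect size
write.csv(effect, paste(folder, "ALDEx2_effect.csv", sep=""))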
Note: parametric tests make assumptions about the distribution of the population that the samples are taken from (typically that it is normally distributed), while non-parametric tests are distribution free and can be used for non-normal variables. Microbiome data are not typically normally distributed, but we can test this. You don't need to run this next part, but you can do so if you like (if you have the microbiome R package installed - the AWS image does not, because microbiome needs a different R version than some of the other packages!):
shapiro.test(as.data.frame(otu_table(physeq_genus))$"M19_1_full")
physeq_genus_clr = microbiome::transform(physeq_genus, "clr")
shapiro.test(as.data.frame(otu_table(physeq_genus_clr))$"M19_1_full")
Transformations/normalisations of microbiome data are often used to try to make the usually non-normal microbiome data normal; however, in this case, even the transformed data are not normal (the significant p-value means that the null hypothesis that the data are normally distributed can be rejected). This means that we should use the Kruskal-Wallis test results.
Run MaAsLin3:
fit_out <- maaslin3(input_data = feat_table,
input_metadata = metadata,
output = 'maaslin3',
formula = '~ Sequencing + sample_sums',
normalization = 'TSS',
transform = 'LOG',
augment = TRUE,
standardize = TRUE,
max_significance = 0.1,
median_comparison_abundance = TRUE,
median_comparison_prevalence = FALSE,
max_pngs = 100,
cores = 1,
save_models = TRUE)
If you have used another dataset, you can run this step more than once with each of the variables! Or you can run both together by including them in the formula as ~ variable1 + variable2. If you combined the Phyloseq objects then you may like to use Sequencing here!
Take a look at ?maaslin3 to see what each of the inputs to this function is!
MaAsLin will save the results into a folder, so you can take a look in this folder at the all_results.tsv and significant_results.tsv files. You'll see that significant_results.tsv is just a subsection of all_results.tsv - have a look at one of these files. It also makes some nice plots of the significant results for you - have a look through these and see if you can work out what they're showing you.
There are a lot of columns in the output; some of the ones we're interested in are:
- feature: the genus
- metadata: the metadata variable being tested
- value: for categorical features, the specific feature level for which the coefficient and significance of association is being reported
- coef: the model effect size/coefficient value
- stderr: the standard error from the model
- pval_individual, qval_individual, pval_joint, qval_joint, error, model, N and N_not_zero
Now we can run radEmu:
physeq_genus_no_zero <- prune_taxa(taxa_sums(physeq_genus)>0,physeq_genus) #this first step just removes any genera with zero abundance across all samples that would cause issues for radEmu!
radEmu_fit <- emuFit(formula = ~ Sequencing,
Y = physeq_genus_no_zero, run_score_tests=FALSE)
write.csv(radEmu_fit$coef, paste(folder, "radEmu.csv", sep=""))
If you are mainly interested in a single variable, then you can set the run_score_tests argument to TRUE, but it might take quite a long time to run:
radEmu_fit <- emuFit(formula = ~ Sequencing,
Y = physeq_genus_no_zero, run_score_tests=TRUE)
In order to compare the results that we've got from the three differential abundance tools, we'll want to first import the results and do some manipulations on these tables.
MaAsLin3:
maaslin = read.csv(paste(folder, "maaslin3/significant_results.tsv", sep=""), sep="\t")
maaslin = maaslin[maaslin$metadata == 'Sequencing', ] #we only want to look at Sequencing for now, as that is what we ran with the other tests
maaslin = maaslin[maaslin$model == 'abundance', ] #Maaslin3 compares both abundance and prevalence, but as it is the only one to do prevalence, we will just look at abundance for now
rownames(maaslin) = maaslin[,1]
Now we'll do the same for ALDEx2:
aldex = read.csv(paste(folder, "ALDEx2.csv", sep=""))
aldex = aldex[aldex$kw.eBH <= 0.1, ] #keep only taxa with Benjamini-Hochberg (BH)-adjusted p<=0.1 (the same as MaAsLin3 default)
rownames(aldex) = aldex[,1]
And for radEmu:
rademu = read.csv(paste(folder, "radEmu.csv", sep=""))
rademu = rademu[rademu$pval <= 0.1, ] #keep only taxa with p<=0.1 (the same as MaAsLin3 default)
rownames(rademu) = rademu[,3]
We often consider taxa that are found to be significantly differentially abundant by at least two tests as actually being differentially abundant. With this dataset, a lot of taxa are found to be differentially abundant, so I'm actually going to take only the taxa that are found by all three tests to narrow this down as much as possible.
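If you'd rather apply the "at least two tests" rule to your own data, here is a minimal sketch that counts how many of the three significant-results tables each taxon appears in (this assumes the taxon names match across the three tables, as the intersections below do):
all_hits <- c(rownames(maaslin), rownames(aldex), rownames(rademu)) #pool the significant taxa from all three tools
hit_counts <- table(all_hits) #count how many tools found each taxon
taxa_two_or_more <- names(hit_counts[hit_counts >= 2]) #keep taxa found by at least two tools
length(taxa_two_or_more)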
Get the taxa found in all tables of significant results:
taxa = intersect(rownames(maaslin), rownames(aldex))
taxa = intersect(taxa, rownames(rademu))
We still have a lot here! This isn't necessary, but we'll actually narrow this down even further and only look at those within the 20 most abundant genera.
Get just the top 20:
physeq_genus_relabun = transform_sample_counts(physeq_genus, function(x) (x / sum(x))*100 )
physeq_genus_relabun_top20<-prune_taxa(names(sort(taxa_sums(physeq_genus_relabun),decreasing=TRUE)[1:20]), physeq_genus_relabun)
Get the DA taxa that are within the top 20:
taxa = intersect(taxa, rownames(otu_table(physeq_genus_relabun_top20)))
length(taxa)
physeq_genus_relabun_da<-prune_taxa(taxa, physeq_genus_relabun_top20)
And rename them so that the names will display a little better (add line breaks):
new_names = c()
for (i in 1:length(rownames(otu_table(physeq_genus_relabun_da)))) {
new_name = str_replace(rownames(otu_table(physeq_genus_relabun_da))[i], ';o__', '\no__')
new_names = c(new_names, str_replace(new_name, ';g__', '\ng__'))
}
taxa_names(physeq_genus_relabun_da) = new_names
Finally, we'll plot them!
psmelt(physeq_genus_relabun_da) %>%
ggplot(data = ., aes(x = Sequencing, y = Abundance)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(aes(color = OTU), height = 0, width = .2) +
labs(x = "", y = "Abundance\n") +
facet_wrap(~ OTU, scales = "free") +
theme(legend.position = "none")
Note: you'll probably want to open this up in a bigger window so that you can see all of the taxa properly!
The really nice thing about R notebooks is that you can save them as an HTML document so that anyone can see your figures as well as the code that you've used to generate them. When you'd like to do this, you can simply click on the dropdown menu next to the Preview button and click "Knit to HTML". This will take a little while because it needs to re-run all of the code, but it should pop up with an HTML document when it's done!
- Metagenomics taxonomic data
- Functional data (PICRUSt2 predictions or metagenomic annotations)
- Stratified data
- Fundamentals of adding plots together (e.g. ordering by taxa so that a heatmap will appear in the same order as the tree and can be linked?)
- Taxonomic contributions to function of interest
- Longitudinal analysis
- Combining taxonomic/functional visualisations with JarrVis
- ANCOM?