Module 4: Functional prediction and additional analyses

This tutorial is part of the 2025 CBW Beginner Microbiome Analysis (held in Vancouver, BC, May 26-27).

Author: Robyn Wright

Introduction

In this module, we'll be taking a look at three of the major ways in which we use amplicon sequencing data to learn about microbial communities; (1) identifying taxa that are significantly differentially abundant between different sample groups, (2) predicting the functional capabilities of these communities based on the taxa that are present, and (3) constructing and visualising a co-occurrence network.

There is likely more in this module than you can get through in the time provided, so we suggest that you choose which of these is of most interest to you, and start there. You can always come back and do some of the others if you are finished in time. PICRUSt2 is really only appropriate for 16S data, so if you used one of the other datasets then you can either copy across the 16S output or do one of the other sections of this lab. Ask one of us for help if you need it! You can read some information on each of these sections below before deciding which to run.

Throughout this module, there are some questions aimed to help your understanding of some of the key concepts. As with the previous modules, you'll find the answers at the bottom of this page, but no one will be marking them.

Differential abundance testing with MaAsLin3 and ALDEx2

Tens (if not more) of tools have been used for differential abundance testing of microbiome samples - identifying taxa that differ in abundance between sample groups (for example, treatment versus control) or with variables of interest (for example, correlations with pH or blood metabolite measurements). While these tools/methods have often been used interchangeably in the literature, there are large differences in the performance of different tools even on the same sample group. There have now been several large scale comparisons of different tools (e.g., Calgaro et al., Thorsen et al.), and our lab also carried one out (published here). While these studies have typically found that different methods lead to different biases in the results, as there is no one single best method for determining differentially abundant taxa, the best practice that we have for overcoming these issues currently is to use multiple differential abundance tests, report these clearly, and focus on the taxa that are identified by multiple tests.

For these purposes, we have chosen to use two tools today, MaAsLin3 (preprint/wiki) and ALDEx2 (paper/vignette). In our comparison of different tools, we found that ALDEx2 was one of the tools that controlled the false discovery rate well, but this came at the cost of reduced sensitivity, while MaAsLin2 had much better sensitivity than ALDEx2, but this came at the cost of reduced specificity. It is worth noting, however, that MaAsLin3 is a new version and Jacob Nearing - a previous PhD student in the Langille lab who was one of the first authors on our DA tool comparison - was heavily involved in its development. In some of the recent studies from our lab, we have then chosen to focus on only the taxa that are identified by at least two of these tools (e.g., this paper or this paper).

In this tutorial, we'll be running both of these tools on the output from Module 2, and then we'll combine the results and look at the taxa that are identified. We can then apply this to the PICRUSt2 output, too.

Functional prediction with PICRUSt2

PICRUSt2 is a tool that predicts the functional capacity of a microbial community based on the taxa that are present in amplicon sequencing data. The first iteration of PICRUSt (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States) was developed by Morgan (and many other collaborators) during his postdoc. This was expanded and improved upon to make PICRUSt2 by a previous PhD student in the lab, Gavin Douglas (currently a postdoc at North Carolina State University), and Morgan and I (Robyn) have recently updated the database used by PICRUSt2 to improve the functional predictions.

There are several other tools that have also been developed for this purpose, e.g., Tax4Fun2, Piphillin and PanFP. I don't personally have experience including these, so we're going to be focusing on PICRUSt2, however, it's always important to understand the strengths and weaknesses of different bioinformatic tools before choosing which one to use for your own research.

You can see full information on PICRUSt2 here, but it includes several key steps (the new database changes these slightly!):

Placement of sequences into reference phylogenies for bacteria and archaea
Hidden-state prediction to get 16S copy numbers and Nearest Sequenced Taxon Indices (NSTI) per genome
Determine the best domain for each sequence (whether they fit best in the bacterial or archaeal reference phylogeny)
Hidden-state prediction to get trait abundances (usually KOs or EC numbers) per predicted genome
Combine bacterial and archaeal files from hidden-state prediction
Predict metagenomes for each sample for each trait
Infer MetaCyc pathway abundances and coverages (using EC numbers)

These steps can be run separately, but are all wrapped together into a single command for what we'll be running today.

Constructing and visualising a co-occurrence network using `XYZ`

4.1 Differential abundance testing with MaAsLin3 and ALDEx2
4.2 Functional prediction with PICRUSt2
4.3 Constructing and visualising a co-occurrence network
Answers

4.1 Differential abundance analysis using MaAsLin3 and ALDEx2

You can use whichever amplicon you like for this! We've given instructions for 16S, but you can hopefully work out how to modify these commands for 18S or ITS :).

4.1.1 Prepare data from the end of Module 2

First, we'll create a new directory for whichever dataset we want to use and we'll copy the output from the end of Module 2 into this.

16S:

cd ~/workspace
mkdir bmb_module4 bmb_module4/16S_analysis
cd bmb_module4/16S_analysis
ln -s ~/CourseData/MIC_data/bmb_module3/output/16S_analysis/deblur_output_exported/ .
ln -s ~/CourseData/MIC_data/bmb_module2/output/16S_analysis/Blueberry_16S_metadata.tsv .

18S:

cd ~/workspace
mkdir bmb_module4 bmb_module4/18S_analysis
cd bmb_module4/18S_analysis
ln -s ~/CourseData/MIC_data/bmb_module3/output/18S_analysis/deblur_output_exported/ .
ln -s ~/CourseData/MIC_data/bmb_module2/output/18S_analysis/Plastisphere_18S_metadata.tsv .

ITS:

cd ~/workspace
mkdir bmb_module4 bmb_module4/ITS_analysis
cd bmb_module4/ITS_analysis
ln -s ~/CourseData/MIC_data/bmb_module3/output/ITS_analysis/deblur_output_exported/ .
ln -s ~/CourseData/MIC_data/bmb_module2/output/ITS_analysis/pregnancy_ITS_metadata.tsv .

4.1.2 Read output into R/Phyloseq

Next, we'll go to RStudio on the server. Go to your browser and navigate to:

http://##.uhn-hpc.ca:8080/

(where ## is the student number that you have been given).

Enter "ubuntu" as your username and the password that have been given. It takes a little while to load up the first time but should be quicker on subsequent times!

Everything that you're entering in here should be entered into the "Console" part of the page. This is like the Terminal that we've been using on the Server, but it uses the R programming language. Most statistical analyses in bioinformatics will use either R or Python - many bioinformaticians have a preference for one or the other, but most use both. In my opinion, each has different strengths and weaknesses - R has many built in programs and functions, while Python is much more customisable but perhaps has a steeper learning curve. The syntax varies slightly for both (and both are different from the Terminal that we have been using, which is the bash programming language), but there are also many similarities. RStudio is an Integrated Developing Environment (IDE) for R and Python - it is an application that provides further functionality than the R programming language itself.

Importing packages

First of all, we'll import the packages that we'll be using today:

library(phyloseq)
library(maaslin3)
library(microbiome)
library(ALDEx2)
library(ggplot2)
library(tidyr)
library(stringr)
library(vegan)
library(stats)
library(SpiecEasi)
library(indicspecies)
library(igraph)

You'll notice that in each of these, we use the library() command, and inside the brackets we give the name of the package. We've already installed these packages for you, but in most cases these could be installed by e.g., install.packages("tidyr"). In some cases, they might be installed using another package, BiocManager, e.g., BiocManager::install("ALDEx2"). If you're unsure, you can always google how to install a package and you can usually find the code to copy and paste, e.g., "r install aldex2". You could also try typing ??aldex2 into the console and see what comes up. It should give you some information about that package.

If you've used the 18S or the ITS datasets, I'm assuming that you'd like to challenge yourself a little more and I've made this code and explanation for 16S, but you'll need to figure out some of the changes to make for 18S/ITS. If you're struggling, you can find the full working code in the Answers, but without any of the explanation that we have up here.

Setting variables

Next, we'll set the name of the folder where our results are saved:

folder = '/home/ubuntu/workspace/bmb_module4/16S_analysis/'

Note that if you used 18S or ITS then you'll need to modify this!

Here, we have defined a variable called "folder" - we always need to be careful that we don't save over these variables by calling something else the same thing, but you can always check these in the "Environment" area in the top right where you should see that "folder" now exists. We could have called this anything, but it's always helpful to name our variables something that is simple and explains what they are, incase someone else is using our code. This is a string, as it is inside the punctuation marks ' ' or " ", and this is used in programming to show that we are representing text rather than numbers. These variables can be really useful to avoid having to type in the same information multiple times, which always increases the chances that we'll make a mistake!

You can see how these would be different by running the following:

random_string = "12345"
random_number = 12345

You should see how these are different by how they show in the "Environment" area in the top right.

Now that we've explained the basics, we can read in some of our data.

Read in the feature table

First, we'll read in the feature table. Note that we're calling it asv_table:

asv_table <- read.csv(paste(folder, "deblur_output_exported/feature-table_w_tax.txt", sep=""), sep='\t', skip=1, header=T)

Here we're combining a few different things. We're using the read.csv function - functions are modules of code that accomplish a specific task. We can write them ourselves, they are what is contained in the packages that we imported at the start, or there are several that are built in to R. If you want more information about what a function does as well as the input that it expects and the output that it gives, you can type in ?read.csv.

We're also using the paste function to add together the full file path of the feature table. We're giving it the folder variable that we already defined, as well as the additional folder and file name (both as positional arguments), and finally the sep="" named argument - this tells the paste function that there should be no spaces or other punctuation between the two parts that are being pasted together. Try running just this part and see what the result is: paste(folder, "exports_filtered/feature-table.txt", sep="") - note that as we haven't given this a name, it is not saved as a variable.

Next, we've given the sep='\t' variable - this time, we're telling the read.csv function that the file that we're importing is tab-delimited, then we've told it to skip the first line of the file skip=1, and that there is a header header=T (where T is the same as TRUE). We've skipped the first line because this just reads "# Constructed from biom file" and is not actually data.

Question 1: Type in asv_table or look at it in the "Environment" area in the top right. How many rows are there? Does this make sense to you?

Manipulate the feature table

We often find in programming that things are not quite in the format that is expected by the packages that we're using, so we'll make a few modifications to asv_table:

asv_table_num = data.matrix(asv_table[,2:11]) #convert the ASV table to a numeric matrix. Note that if you didn't use 16S, you might need to change this "11" to make sure that you don't cut off any samples. It should be the number of samples + 1. Have a look at the asv_table object in the upper right corner!
rownames(asv_table_num) = asv_table[,1] #give the matrix row names

You can see here that I've added # after the code and have written what each line does. You can use # to make comments through your code documents, to explain what each line is doing. Here you should also notice that we didn't save over the previous asv_table, but created a new one called asv_table_num. Now, we've just told R that it should be expecting numberic data within this table, and then we gave it the same row names as our previous asv_table object.

Next, take a quick look at your sample names within asv_table_num.

Question 2: Are they the same as you remember - any different punctuation?

This is kind of annoying because it means that the sample names won't match the metadata table, so we'll change those now.

colnames(asv_table_num) = gsub("B.", "B-", colnames(asv_table_num))

Get the taxonomy information

You may have noticed that the last column in the initial asv_table was our taxonomy information. We'll get that first:

taxonomy <- data.frame(asv_table[,"taxonomy"])
rownames(taxonomy) <- asv_table[,1]
colnames(taxonomy) = c("taxonomy")

If you print this out by typing taxonomy or by clicking on this in the "Environment" area, you'll see that we currently only have a single column that contains all of the taxonomy information.

We want to split this so that we have columns for each of Domain, Phylum, Class, Order, Family, Genus and Species. Luckily, there is already a function that we can use for this:

taxonomy_split <- separate(data = taxonomy, col = taxonomy, into = c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = "\\; ") #separate the taxonomy table so each phylogenetic level is its own column. Note that if you are using 18S then you might need to separate this differently!

You'll see that we tell the separate function that the data it should use is in the taxonomy table, the col (column) is called "taxonomy". The "into" named argument takes a list as input, which is defined in R with the c(), and the taxonomy column should be split into a new column each time the ; symbol is found. You'll see a warning after running this line, but it still ran so that's fine - it's just telling us that there weren't values for all of Domain/Phylum/Class/Order/Family/Genus/Species for every ASV, which makes sense as not all of them will have a species-level classification.

Again, you can take a look at the results in the "Environment" area.

Read in the metadata

Now we'll read in the metadata. I won't go through each stage step-by-step as hopefully you're getting the idea by now, but I've still added comments to each of the rows:

metadata <- read.csv(paste(folder, "Blueberry_16S_metadata.tsv", sep=""), sep='\t')
samples = data.frame(metadata[,2], stringsAsFactors = FALSE) #convert this to a data frame - again, if you didn't use the 16S data, you have some extra variables here that you will need to include, rather than only column 2, you may need another, too. 
rownames(samples) = metadata[,1] #add the sample names as row names
colnames(samples) = c('treatment') #and add a column name that makes sense. If you have extra variables, you'll need to add them here

Combine these into a phyloseq object

Phyloseq is a really useful package that contains many useful functions for analysing microbiome data. While it can be a bit fiddly to get our data into the format that it is expecting (most good packages should give you examples of how data should be formatted to go into them, although it does take a lot of practice to get good at quickly working out how this is different from what you have!), once that it is in this format, the analyses are then very easy to perform.

ASV = otu_table(asv_table_num, taxa_are_rows = TRUE) #convert asv_table_num to an otu_table
TAX = tax_table(taxonomy_split) #convert taxonomy_split to a tax_table
taxa_names(TAX) <- rownames(taxonomy_split) #add names to TAX/the tax_table
SAMPLE = sample_data(samples) #convert samples to a sample_data
physeq = phyloseq(ASV, TAX, SAMPLE) #combine these all to make a phyloseq object
physeq #print this out to see what a phyloseq object looks like

Now, we have combined all of these different parts into one phyloseq object. Note that you can also import a phylogenetic tree to phyloseq, but we won't be using that for this part of the workshop. You can also get each of the individual objects back after performing manipulations, e.g., otu_table(physeq).

As we typically find that ASVs are not always shared across many samples (and neither would we expect them to be, if they potentially represent species or strain level differences between taxa), we'll collapse the phyloseq object to the genus level. If you're not sure what the different taxonomy levels are called within the phyloseq object, you can always look at tax_table(physeq).

So we're collapsing at the genus level, which is "ta6":

physeq_genus = tax_glom(physeq, taxrank="ta6")

If you take a look at this taxonomy table (tax_table(physeq_genus)), you'll notice a lot of missing information (because many environmental ASVs don't have similar taxa within reference databases!), so we can add the full taxonomy information in to the ASV names, so that this is hopefully a little more informative:

all_tax = paste(tax_table(physeq_genus)[,2], tax_table(physeq_genus)[,3], tax_table(physeq_genus)[,4], tax_table(physeq_genus)[,5], tax_table(physeq_genus)[,6], sep=';')
taxa_names(physeq_genus) = all_tax
otu_table(physeq_genus)

Take a look at each of these if you want to see what the differences were in each step!

Question 3: How does the number of genera compare with the number of ASVs?

4.1.3 Run MaAsLin3

Now that we've prepared the phyloseq objects, it's time to run our first differential abundance test. Each of the differential abundance tools expects data to be normalised in a different way (or some of them normalise for you), and we're going to run MaAsLin with rarefied data.

So to do that, we'll need to rarefy it:

set.seed(345)
physeq_rare = rarefy_even_depth(physeq_genus, sample.size = min(sample_sums(physeq_genus)), replace = TRUE, trimOTUs = TRUE, verbose = TRUE)

Note that if you aren't using the 16S data, you might want to take a look at min(sample_sums(physeq_genus)) before you run the rarefy command. Does that seem like a sensible number of sequences to rarefy to? Take a look at all of the sample sums using sample_sums(physeq_genus) to see what might be sensible. Setting this to be higher than some samples will remove any samples with below this number of reads.

Remember that you can always use ?rarefy_even_depth if you want to see what a function is doing! Here we've rarefied all of the samples to the lowest number of reads, ASVs are replaced after sampling (replace = TRUE), those that are no longer present are trimmed (trimOTUs = TRUE) and the function will tell us about what it is doing (verbose = TRUE).

Next, we'll get the feature table and metadata that we're using:

feat_table = data.frame(t(otu_table(physeq_rare)), check.rows=T, check.names=T, stringsAsFactors=T)
metadata = data.frame(sample_data(physeq_rare), stringsAsFactors = F)

You'll notice that these will be formatted as tables again - we could have just read in some tables to be used with MaAsLin2, but it's useful to perform manipulations in Phyloseq first, and then to take these tables out again to ensure that all samples match up!

Now we'll run MaAsLin3:

fit_out <- maaslin3(input_data = feat_table,
                    input_metadata = metadata,
                    output = paste(folder, 'maaslin3_treatment', sep=""),
                    formula = '~ treatment',
                    normalization = 'TSS',
                    transform = 'LOG',
                    augment = TRUE,
                    standardize = TRUE,
                    max_significance = 0.2,
                    median_comparison_abundance = TRUE,
                    median_comparison_prevalence = FALSE,
                    max_pngs = 100,
                    cores = 1,
                    save_models = TRUE)

If you have used another amplicon, you can run this step more than once with each of the variables! Or you can run both together by including them in the formula as ~ variable1 + variable2

Take a look at ?maaslin3 to see what each of the inputs to this function are! Note that we'd usually set the max significance to 0.1, but for the sake of this workshop we're using a higher level.

MaAsLin will save the results into a folder, so you can take a look in these folders and look at the all_results.tsv as well as significant_results.tsv files. You'll see that significant_results.tsv is just a subsection of all_results.tsv - have a look at one of these files. It also makes some nice plots for you of the significant results - have a look through these and see if you can work out what they're showing you.

There are a lot of columns in the output; you'll see some of the ones we're interested in are: (1) the genus ("feature"), (2) the metadata variable being tested ("metadata"), (3) for categorical features, the specific feature level for which the coefficient and significance of association is being reported ("value"), (4) the model effect size/coefficient value ("coef"), (5) the standard error from the model ("stderr"), (6) pval_individual (7) qval_individual (8) pval_joint (9) qval_joint (10) error (11) model (12) N (13) N_not_zero

Question 4: How many taxa does MaAsLin3 say are significantly differentially abundant?

4.1.4 Run ALDEx2

Next, we'll run ALDEx2. Note that here the first step is to normalise the samples using a CLR normalisation, so you can see how both of the methods require similar steps - normalisation and then testing - but this is done in a slightly different way for each. We then use a Kruskal-Wallis test and a GLM ANOVA on the data, and we can do this test with multiple cores (useMC=2). You can see that we're again running this separately for each metadata variable, and then saving the outputs as .csv files:

x <- aldex.clr(otu_table(physeq_genus), sample_data(physeq_genus)$treatment, mc.samples = 128, verbose=F, denom="all")
kw.test <- aldex.kw(x, useMC=2, verbose=FALSE)
write.csv(kw.test, paste(folder, "ALDEx2_taxa.csv", sep=""))

If you take a look at these results files then you'll see that we have 4 columns: kw.ep, kw.eBH, glm.ep, glm.eBH - the p-values and Benjamini-Hochberg (BH)-corrected p-values for the Kruskal-Wallis and GLM ANOVA tests between the sample groups. There are several other options for testing within the ALDEx2 R package (including aldex.ttest and aldex.glm), but in this case, we can use the GLM ANOVA results for parametric tests and the Kruskal-Wallis results for non-parametric tests. We could also use aldex.effect to see the magnitude of the differences between groups.

Note: parametric tests make assumptions about the distribution of the population that the samples are taken from (typically that it is normally distributed), while non-parametric tests are distribution free and can be used for non-normal variables. Microbiome data are not typically normally distributed, but we can test this. You don't need to run this next part, but can do if you like.

shapiro.test(as.data.frame(otu_table(physeq_genus))$"B-Rt96")
physeq_genus_clr = microbiome::transform(physeq_genus, "clr")
shapiro.test(as.data.frame(otu_table(physeq_genus_clr))$"B-Rt96")

Transformations/normalisations of microbiome data are often used to make the usually non-normal microbiome data normal, however, in this case, it is not normal (significant p-value so the null hypothesis that the data are normal can be rejected - they are not normal). This means that we should use the Kruskal-Wallis test results.

Question 5: How many taxa does ALDEx2 find to be significantly differentially abundant?

4.1.5 Combine and plot differential abundance results from MaAsLin3 and ALDEx2

In order to compare the results that we've got from the three differential abundance tools, we'll want to first import the results and do some manipulations on these tables.

For MaAsLin2, we'll read in the table and then we'll rename some of the taxa names because some R formats don't like punctuation other than "_" or ".", so we'll convert these back to how they started (so they match up with the other tables):

maaslin = read.csv(paste(folder, "maaslin3_treatment/significant_results.tsv", sep=""), sep="\t")
maaslin$feature = gsub("\\g__Palsa.187", "g__Palsa-187", maaslin$feature) #replace anything matching g__Palsa.187 in the feature column with g__Palsa-187
maaslin$feature = gsub("\\.", ";", maaslin$feature) #replace any remaining . with ;

Maaslin3 compares both abundance and prevalence, but as it is the only one to do prevalence, we will just look at abundance for now so we'll filter the table:

maaslin = maaslin[maaslin$model == 'abundance', ]

Question 6: Are there any taxa left? Is that what you expected?

Now we'll do the same for ALDEx2 (although here we don't need to do the renaming):

aldex = read.csv(paste(folder, "ALDEx2_taxa.csv", sep=""))
aldex = aldex[aldex$kw.eBH <= 0.2, ]

You should see that there are a few taxa that ALDEx2 found to be significantly differentially abundant.

I mentioned previously that often we will consider those taxa found by >=2 tests to be significantly differentially abundant as actually being differentially abundant.

To do that, we'll create a list of the taxa that are found to be significantly differentially abundant by both tests:

overlap = intersect(maaslin$feature, aldex$X)

If we take a look at this, we should see that there are 7 taxa that were found by both tests, although a few of them are unclassified at the genus level.

And now we'll convert the physeq_genus phyloseq object to relative abundance using the transform_sample_counts function:

physeq_relabun  = transform_sample_counts(physeq_genus, function(x) (x / sum(x))*100 )

Plot DA single taxon

Next we can look at plotting the abundance of a single taxon. Have a look at the list of taxa (overlap). Replace the empty quotation marks below with one of these:

single_taxon = ""

And now we'll make a boxplot of it's abundance based on the categories in treatment:

microbiome::boxplot_abundance(physeq_relabun, x='treatment', y=single_taxon, line = NULL, violin = FALSE, na.rm = FALSE, show.points = TRUE) + ggtitle(str_replace(single_taxon, ';f__', '\nf__')) + ylab('Relative abundance (%)')

Note that if you didn't replace the quotation marks with the name of a taxon above, then you will get an error message! Otherwise, you should see the plot pop up in the "Plots" section on the bottom right of the window.

To explain what we have just done here a little, we're using a function from the microbiome R package - sometimes there can be functions with the same name from multiple R packages, so adding microbiome:: at the start tells R which one we are wanting to use. The first positional argument is the phyloseq object that we are using, and then everything else consists of named arguments: The metadata variable that we'd like to use for the x axis (and making the boxplots), what we'd like on the y axis (this needs to be a taxon that is in the phyloseq object), whether we'd like to map a variable onto the lines, whether we'd like this to be a violin plot rather than a boxplot, whether NAs should be removed, and whether points should be shown in addition to the boxes.

This boxplot_abundance function builds upon the R package ggplot2, and we can therefore use additional arguments as we would with ggplot2. These are added with the + and are fairly self explanatory; we have added a title (ggtitle) and a y label (ylab). For the title, we've replaced part of the text with the "\n" which means that a line break has been added in to make this a little more readable.

Question 7: Did it look like there are differences between the two treatments for the taxon that you plotted?

Plot DA all taxa

We can also make the same plots within a for loop:

for(i in 1:length(overlap)) {
  print(microbiome::boxplot_abundance(physeq_relabun, x='treatment', y=overlap[i]) + ggtitle(str_replace(overlap[i], ';f__', '\nf__')) + ylab('Relative abundance (%)'))
}

Now, if you press the forward and back buttons in the "Plots" pane, you can scroll through all of the taxa that we've identified as differentially abundant by one of the tests. You'll probably be able to see that a lot of these taxa are really only found in one of the treatments and not the other, so it's fairly clear why they're coming up as being significantly different.

For loops

For loops are an incredibly useful thing that we do in programming - to my knowledge, they exist in every programming language, and they allow us to repeat a section of code many times, with a different input each time. To explain what we are doing above, we can go through this step-by-step. If you already know about for loops, feel free to skip ahead to section 2.8.

It is easiest to explain by showing you, so try running this:

print(overlap)
print(length(overlap))

You'll see that overlap is a list containing all of the taxa that we found to be significantly differentially abundant, and length(overlap) tells you how long it is, or how many taxa are in it.

Now try running this:

for(i in 1:length(overlap)) {
  print(i)
  print(overlap[i])
}

You should see that now, a number is printed out, and then that number item from the list is also printed out after it. The for loop is telling us that for every value of i between 1 and the length(overlap), print out i and then print out the i'th item in the list. You can use this for accessing particular items in lists, too, e.g., overlap[2].

This is a really simple example, but you can see how in the loop above, we just replaced single_taxon, from when we just plotted one taxon, with overlap[i] within the loop, so that it would change each time.

For loops can get incredibly complicated, can do lots of things, and can be nested inside other for loops, but we won't be doing anything more on them for now!

Back to the results

Question 8: Are there any taxa that you're surprised to see are significantly differentially abundant? Why?

4.2 Functional prediction with PICRUSt2

4.2.1 Filter input for PICRUSt2

As PICRUSt2 can take quite a long time to run, we want to do some filtering on the files so that we'll have fewer ASVs. Usually we might investigate several different cutoffs for the minimum number of times an ASV should occur to be included, or the minimum prevalence of ASVs, but for the purposes of this workshop, we'll just use 50 for the minimum frequency of an ASV and 2 for the minimum number of samples that it must occur in.

First, we'll copy over the output:

cd ~/workspace
mkdir bmb_module4 bmb_module4/16S_analysis
cd bmb_module4/16S_analysis
ln -s ~/CourseData/MIC_data/bmb_module2/output/16S_analysis/deblur_output/ .
ln -s ~/CourseData/MIC_data/bmb_module2/output/16S_analysis/taxa/ .
ln -s ~/CourseData/MIC_data/bmb_module2/output/16S_analysis/Blueberry_16S_metadata.tsv .

First, we need to reactivate the QIIME2 environment:

conda activate qiime2-amplicon-2025.4

Then we'll filter the ASVs based on the parameters I mentioned above:

qiime feature-table filter-features \
  --i-table deblur_output/deblur_table_final.qza \
  --p-min-frequency 50 \
  --p-min-samples 2 \
  --o-filtered-table final_table_filtered.qza

Now we'll filter the sequences to only include those that are in our new filtered table (final_table_filtered.qza):

qiime feature-table filter-seqs \
  --i-data deblur_output/representative_sequences.qza \
  --i-table final_table_filtered.qza \
  --o-filtered-data  representative_sequences_final_filtered.qza

Then we can export these files - first the feature table, which we'll export and then convert from .biom format to .txt format:

qiime tools export \
  --input-path final_table_filtered.qza \
  --output-path exports_filtered
  
biom convert \
  -i exports_filtered/feature-table.biom \
  -o exports_filtered/feature-table.txt \
  --to-tsv

Then we can export the sequences file:

qiime tools export \
  --input-path representative_sequences_final_filtered.qza \
  --output-path exports_filtered

This will be exported as a .fasta file, which you can look at in the exports_filtered folder, if you like.

Finally, we'll deactivate the conda environment:

conda deactivate

4.2.2 Start running PICRUSt2

PICRUSt2 can take quite a long time to run - for PICRUSt2, as well as other programs that may take a while, there are several tools that are pre-installed on most Linux systems that we can use to make sure that our program carries on running even if we get disconnected from the server. One of the most frequently used ones is called tmux. To activate it, just type in tmux and press enter. It should take a second to start up, and then load up with a similar looking command prompt to previously, but with a coloured bar at the bottom of the screen.

To get out of this window again, press ctrl+b at the same time, and then d. You should see your original command prompt and something like

[detached (from session 0)]

We can actually use tmux to have multiple sessions, so to see a list of the active sessions, use:

tmux ls

We can rename the tmux session that we just created with this:

tmux rename-session -t 0 picrust

Note that we know it was session 0 because it said that we detached from session 0 when we exited it.

If we want to re-enter this window, we use: tmux attach-session -t picrust

Now, we can run PICRUSt2 inside this tmux session. First, we'll activate the conda environment:

conda activate picrust2-v2.6.2

Now, as we mentioned above, we'll just set PICRUSt2 to run for now, and we'll come back to see the output later:

picrust2_pipeline.py -s exports_filtered/dna-sequences.fasta -i exports_filtered/feature-table.biom -o picrust2_out_pipeline_filtered -p 4 -t sepp --in_traits EC --verbose

You can see that here the options we're setting are:

-s: the fasta file of DNA sequences
-i: the feature table in .biom format
-o: the folder to save the PICRUSt2 output to
-p: the number of threads to use
-t: the method to use for placement of ASVs into the phylogenetic tree - note that the default is to use EPA-NG, but this takes much more memory, and as we are limited on these servers we are using SEPP
--in_traits EC: we don't need to add this option as the default is to run both EC and KO, but as this can take quite a long time and we don't have very much memory, we'll just run EC numbers as this is what we can get pathway predictions from
--verbose: this tells PICRUSt2 to print out each of the steps that it's running so that we know what's going on.

You might want to move on to something else while PICRUSt2 runs, as it's likely to take a little while (~20 mins).

4.2.3 Read PICRUSt2 output into R/Phyloseq

Once it's run, we can come back to the output.

There are a few files that we'll take a look at now, so to prepare for that, we'll just unzip them:

gunzip picrust2_out_pipeline_filtered/*.gz

There is a key file that we can take a look at to see how well PICRUSt2 is likely to work for our data, and that is the combined_marker_predicted_and_nsti.tsv file. Download this and open it with Excel (or similar). You'll see that it has five columns - the first contains the ASV name, the second the number of 16S rRNA gene copies that this ASV is predicted to have, the third the NSTI, the fourth contains the domain that this ASV matched best with (our 16S dataset used the bacteria-specific primers, so it shouldn't be surprising that we only have bacteria here!), and the fifth, the closest reference genome within the database used for predictions. The NSTI is the Nearest Sequenced Taxon Index and refers to the distance within the phylogenetic tree of that ASV to the closest relative that it has in the reference database. A value of 0 would indicate that the database that PICRUSt2 uses contains a genome with an identical sequence. By default, PICRUSt2 excludes all ASVs with a value of above 2, but the lower these NSTI values are, the better the predictions are likely to be. In our case, the median NSTI is ~0.09. This is not 0, but it is at least a long way off 2. If you look in the EC_metagenome_out folder then you'll also see a weighted_nsti.tsv.gz - in this one, the NSTI has been calculated on a per-sample basis but it's been weighted by the abundance of the ASVs within each sample. You'll see that the weighted NSTI's are actually quite similar to our median, although they are typically slightly higher, indicating that the abundant ASVs within our dataset are less similar to the reference genomes. You can see in our recent paper that these values vary - these samples are actually a subset of the Blueberry ones shown, so it makes sense that we get a similar median NSTI value - with the NSTI of less well characterised environments (e.g. ocean samples) typically being higher than in better characterised environments (like HMP samples).

In this folder, you'll also see EC_predicted.tsv - this shows the number of copies of Enzyme Commission (EC) numbers predicted to be within each ASV. You'll also see out.tre files which contains a tree of the reference as well as study 16S sequences.

The intermediate folder shows us some of the intermediate files that were produced/used, but we don't really need those for now.

Finally, the files that we are likely most interested in are picrust2_out_pipeline_filtered/EC_metagenome_out/pred_metagenome_unstrat.tsv.gz and picrust2_out_pipeline_filtered/pathways_out/path_abun_unstrat.tsv.gz.

We're going to focus on the pathways file today, because it groups the other functions into functional categories, but this could be repeated in the same way with the EC results.

Note that if we were interested in seeing which ASVs contribute to the functions within each sample then we would include the --stratified option when running PICRUSt2, although this increases the time taken to run. We also often predict the KOs (KEGG orthologs), but we left these for now for the sake of time.

So now try to do the same for the PICRUSt2 results as you did for the taxa results above in RStudio. I've again copied that code below here to avoid you needing to scroll up and down, and the code needed is in the answers if you are struggling.

If you didn't do 4.1, just go through the first step of importing packages. If you already did 4.1 and would like to challenge yourself, try just modifying the commands that we gave above for reading in the files and performing differential abundance testing with the PICRUSt2 pathways file!

Read in the pathways file as a feature table:

folder = '/home/ubuntu/workspace/bmb_module4/16S_analysis/'
pwy_table <- read.csv(paste(folder, "picrust2_out_pipeline_filtered/pathways_out/path_abun_unstrat.tsv", sep=""), sep='\t', skip=1, header=T)

Hint: You'll need to unzip the picrust2_out_pipeline_filtered/pathways_out/path_abun_unstrat.tsv.gz file before you can start.

Take a look at pwy_table. Does that look right? Compare it with path_abun_unstrat.tsv. We just copied this code from when we were reading in our taxa file above.

Question 9: What do you think might be wrong with it?

OK, so modify this line of code to fix it and let's try that again:

pwy_table <- read.csv(paste(folder, "picrust2_out_pipeline_filtered/pathways_out/path_abun_unstrat.tsv", sep=""), sep='\t', skip=1, header=T)

Manipulate the feature table:

pwy_table_num = data.matrix(pwy_table[,2:11]) #convert the ASV table to a numeric matrix
rownames(pwy_table_num) = pwy_table[,1] #give the matrix row names

Change the sample names:

colnames(pwy_table_num) = gsub("B.", "B-", colnames(pwy_table_num))

We don't have taxonomy information this time, so that part isn't necessary, but next we'll want to read in the metadata. This will actually be exactly the same as before, but it won't hurt if you do it again.

metadata <- read.csv(paste(folder, "Blueberry_16S_metadata.tsv", sep=""), sep='\t')
samples = data.frame(metadata[,2], stringsAsFactors = FALSE) #convert this to a data frame
rownames(samples) = metadata[,1] #add the sample names as row names
colnames(samples) = c('treatment') #and add a column name that makes sense

Now combine everything into the phyloseq object:

PWY = otu_table(pwy_table_num, taxa_are_rows = TRUE) #convert asv_table_num to an otu_table
SAMPLE = sample_data(samples) #convert samples to a sample_data
physeq_pwy = phyloseq(PWY, SAMPLE) #combine these all to make a phyloseq object
physeq_pwy #print this out to see what a phyloseq object looks like

We also aren't collapsing this at the genus level because there is no taxonomy information/higher levels to collapse on.

Now, we can convert the pathways to relative abundance, take the top 100 pathways for differential abundance testing, and convert the numbers into integers (whole numbers) for some of the alpha diversity calculations.

physeq_pwy_relabun = transform_sample_counts(physeq_pwy, function(x) (x / sum(x))*100 )
top_100_abun <- names(sort(taxa_sums(physeq_pwy_relabun), TRUE)[1:100]) #get most abundant pathways
physeq_pwy_top_100 = prune_taxa(top_100_abun, physeq_pwy) #now filter the table to have only the most abundant pathways
otu_table(physeq_pwy_top_100) = round(otu_table(physeq_pwy_top_100)) #round the abundance table so that everything is whole numbers
mode(pwy_table_num) = "integer" #convert the table to integers (whole numbers)
PWY_int = otu_table(pwy_table_num, taxa_are_rows = TRUE)
physeq_pwy_int = phyloseq(PWY_int, SAMPLE) #make a new phyloseq object with this

4.2.3 Look at PICRUSt2 alpha and beta diversity

Now we will just take a quick look at these to see how they compare with the analyses that we did in Module 3 based on the taxonomy data.

First we will plot the Alpha diversity:

plot_richness(physeq_pwy_int, x="treatment", measures=c("Observed", "Simpson", "Shannon")) + geom_boxplot()

Luckily, phyloseq has some built-in functions for these that make it very easy. You can see more information about these functions with ?plot_richness.

Then have a look at the Beta diversity:

ps.ord <- ordinate(physeq_pwy_relabun, "PCoA", "bray")
plot_ordination(physeq_pwy_relabun, ps.ord, type="samples", color="treatment", shape="treatment") + geom_point(size=10, alpha=0.1) + geom_point(size=5) + stat_ellipse(type = "t", linetype = 2) + theme(text = element_text(size = 20)) +  ggtitle("Bray-Curtis dissimilarity")

Are you getting an error here? With most errors that we get, a quick Google search will give us some clues on what might be going wrong. When I google the error message "No available covariate data to map on the points for this plot type", one of the first things that comes up is this page, and one of the users on there suggests that this is due to only having one column in our metadata table. Sometimes there are weird bugs in things and this seems to be one of those times! We'll just add a fake second metadata column to get around this.

sample_data(physeq_pwy_relabun)$second_column = 'fake_data'

Now try running the above plot_ordination function again. Hopefully you'll see that it works now! It's annoying when we have to find work-arounds like this, but it happens reasonably often.

You should notice that we've chosen to colour and shape the samples by treatment. If you had more than one sample variable here, you could choose colour for one and shape for the other.

Question 10: Can you see differences between the samples based on the metadata variables?
Question 11: What are these differences like compared with those seen in the taxonomy data?

4.2.4 Run PICRUSt2 differential abundance

Now we'll try running the differential abundance tests.

First, rarefy the pathway data:

set.seed(345)
physeq_rare_pwy = rarefy_even_depth(physeq_pwy, sample.size = min(sample_sums(physeq_pwy)), replace = TRUE, trimOTUs = TRUE, verbose = TRUE)
physeq_rare_pwy_top_100 = prune_taxa(top_100_abun, physeq_rare_pwy)

Get the feature tables and metadata:

feat_table = data.frame(t(otu_table(physeq_rare_pwy)), check.rows=T, check.names=T, stringsAsFactors=T)
metadata = data.frame(sample_data(physeq_rare_pwy), stringsAsFactors = F)

Now we'll run MaAsLin3, first with our treatment grouping:

fit_out <- maaslin3(input_data = feat_table,
                    input_metadata = metadata,
                    output = paste(folder, 'maaslin3_pathway_treatment', sep=""),
                    formula = '~ treatment',
                    normalization = 'TSS',
                    transform = 'LOG',
                    augment = TRUE,
                    standardize = TRUE,
                    max_significance = 0.2,
                    median_comparison_abundance = TRUE,
                    median_comparison_prevalence = FALSE,
                    max_pngs = 100,
                    cores = 1,
                    save_models = TRUE)

Run ALDEx2:

x <- aldex.clr(otu_table(physeq_pwy_top_100), sample_data(physeq_pwy_top_100)$treatment, mc.samples = 128, verbose=F, denom="all")
kw.test <- aldex.kw(x, useMC=2, verbose=FALSE)
write.csv(kw.test, paste(folder, "ALDEx2_pathway.csv", sep=""))

4.2.5 Combine and plot PICRUSt2 differential abundance results from MaAsLin3 and ALDEx2

Import the MaAsLin results:

maaslin = read.csv(paste(folder, "maaslin3_pathway_treatment/significant_results.tsv", sep=""), sep="\t")
maaslin$feature = gsub("\\.", "-", maaslin$feature) #replace any . in the pathway names with -

Again, we're just going to be looking at the abundance models, although MaAsLin3 does carry out differential abundance testing based on both prevalence and abundance.

maaslin = maaslin[maaslin$model == 'abundance', ]

Now we'll do the same for ALDEx2:

aldex = read.csv(paste(folder, "ALDEx2_pathway.csv", sep=""))
aldex = aldex[aldex$kw.eBH <= 0.2, ]

ALDEx2 doesn't find anything to be differentially abundant. This means that there isn't anything that we would consider to be significantly differentially abundant (DA) with 2 tests, but we can still plot out some of the pathways that were identified by MaAsLin3 as DA.

pathways = maaslin$feature

Plot DA single pathway

Next we can look at plotting the abundance of a single pathways. Have a look at the list of pathways (pathways). Replace the empty quotation marks below with one of these:

single_pathway = ""

Now make the boxplot:

microbiome::boxplot_abundance(physeq_pwy_relabun, x='treatment', y=single_pathway, line = NULL, violin = FALSE, na.rm = FALSE, show.points = TRUE) + ggtitle(single_pathway) + ylab('Relative abundance (%)')

Plot DA all pathways

Now make all of the boxplots:

for(i in 1:length(pathways)) {
  print(microbiome::boxplot_abundance(physeq_pwy_relabun, x='treatment', y=pathways[i]) + ggtitle(pathways[i]) + ylab('Relative abundance (%)'))
}

You'll see that these are all quite low abundance (<1% abundance), but they do all seem to have clear differences between the two treatments.

4.3 Constructing and visualising a co-occurrence network

text will go here :)

Here are the commands for spiec-easi:

# === Step 1: Indicator species analysis ===
cat("Running indicator species analysis...\n")

# Transpose OTU table to samples x taxa
otu_mat <- as.data.frame(t(otu_table(physeq)))

# Extract treatment vector
group_vector <- sample_data(physeq)$treatment

# Run multipatt
set.seed(123)
indval_res <- multipatt(otu_mat, group_vector, func = "IndVal.g", duleg = TRUE, control = how(nperm = 999))

cat("Indicator species analysis complete.\n")
print(summary(indval_res))

# Extract significant ASVs
sig_hits <- indval_res$sign
sig_asvs <- rownames(sig_hits)[which(sig_hits$p.value <= 0.05)]

# Assign indicator group based on specificity columns
asv_indicator_group <- apply(sig_hits[sig_asvs, c("s.Forest", "s.Managed")], 1, function(x) {
    if (x["s.Forest"] == 1 && x["s.Managed"] == 0) return("Forest")
    if (x["s.Forest"] == 0 && x["s.Managed"] == 1) return("Managed")
    return("Both")
})

# Named vector of ASV → group
indicator_vector <- setNames(asv_indicator_group, sig_asvs)

# === Step 2: Filter by prevalence and abundance ===
cat("Filtering phyloseq object by prevalence and abundance...\n")

min_prev <- 0.05
min_abun <- 0.001

# Prevalence
prev_df <- apply(otu_table(physeq), 1, function(x) sum(x > 0) / length(x))

# Relative abundance
rel_abun <- transform_sample_counts(physeq, function(x) x / sum(x))
mean_abun <- taxa_sums(rel_abun) / nsamples(physeq)

# Logical filtering
keep_taxa <- (prev_df >= min_prev) & (mean_abun >= min_abun)

# Subset
physeq_filtered <- prune_taxa(keep_taxa, physeq)
cat("Remaining taxa after filtering:", ntaxa(physeq_filtered), "\n")

# === Step 3: Run SpiecEasi ===
cat("Running SpiecEasi...\n")

se.mb.16s <- spiec.easi(physeq_filtered, method = 'mb',
                        lambda.min.ratio = 1e-2, nlambda = 20,
                        pulsar.params = list(rep.num = 50, ncores = 3))

cat("SpiecEasi complete.\n")

# === Step 4: Extract and rename graph ===
ig2.mb <- adj2igraph(getRefit(se.mb.16s))

# Replace numeric node names with actual ASV IDs
correct_taxa_names <- taxa_names(physeq_filtered)[as.integer(igraph::V(ig2.mb)$name)]
igraph::V(ig2.mb)$name <- correct_taxa_names

# Prune phyloseq object to match taxa in network
physeq_net <- prune_taxa(correct_taxa_names, physeq_filtered)

# Add indicator status as a tax_table column (NA if not an indicator)
indicator_status <- indicator_vector[taxa_names(physeq_net)]
tax_table(physeq_net) <- cbind(tax_table(physeq_net), Indicator = indicator_status)

# === Step 5: Plot ===
cat("Plotting network...\n")

# === Shared setup ===
set.seed(42)
layout_fr <- layout_with_fr(ig2.mb)

# --- Indicator plot ---
cat("Saving Indicator network...\n")
pdf(paste(folder, "network_indicator.pdf", sep=""), width = 7, height = 6)

indicator_vals <- tax_table(physeq_net)[igraph::V(ig2.mb)$name, "Indicator"]
indicator_vals[is.na(indicator_vals)] <- "Not significant"
indicator_factor <- factor(indicator_vals)
colors_ind <- scales::hue_pal()(length(levels(indicator_factor)))
node_colors_ind <- colors_ind[as.integer(indicator_factor)]

par(mar = c(4, 4, 4, 8))
par(xpd = TRUE)
plot(ig2.mb,
     layout = layout_fr,
     vertex.label = NA,
     vertex.color = node_colors_ind,
     vertex.size = 5,
     edge.color = "gray80",
     main = "Indicator Status")
legend("topright", inset = c(-0.35, 0), legend = levels(indicator_factor),
       col = colors_ind, pch = 19, pt.cex = 1.5, bty = "n", title = "Indicator")

dev.off()

# --- ta3 plot ---
cat("Saving ta3 network...\n")
pdf(paste(folder, "network_ta3.pdf", sep=""), width = 9, height = 6)

ta3_vals <- tax_table(physeq_net)[igraph::V(ig2.mb)$name, "ta3"]
ta3_vals[is.na(ta3_vals)] <- "Unclassified"
ta3_factor <- factor(ta3_vals)
colors_ta3 <- scales::hue_pal()(length(levels(ta3_factor)))
node_colors_ta3 <- colors_ta3[as.integer(ta3_factor)]

par(mar = c(4, 4, 4, 8))
par(xpd = TRUE)
plot(ig2.mb,
     layout = layout_fr,
     vertex.label = NA,
     vertex.color = node_colors_ta3,
     vertex.size = 5,
     edge.color = "gray80",
     main = "ta3 Taxonomy Rank")
legend("topright", inset = c(-0.18, 0), legend = levels(ta3_factor),
       col = colors_ta3, pch = 19, pt.cex = 1.5, bty = "n", title = "ta3")

dev.off()

# --- Panel figure ---
cat("Saving combined panel figure...\n")
pdf(paste(folder, "network_panel.pdf", sep=""), width = 18, height = 6)
par(mfrow = c(1, 2), mar = c(4, 4, 4, 8), xpd = TRUE)

# Left panel: Indicator
plot(ig2.mb,
     layout = layout_fr,
     vertex.label = NA,
     vertex.color = node_colors_ind,
     vertex.size = 5,
     edge.color = "gray80",
     main = "Indicator Status")
legend("topright", inset = c(-0.05, 0), legend = levels(indicator_factor),
       col = colors_ind, pch = 19, pt.cex = 1.2, bty = "n", title = "Indicator")

# Right panel: ta3
plot(ig2.mb,
     layout = layout_fr,
     vertex.label = NA,
     vertex.color = node_colors_ta3,
     vertex.size = 5,
     edge.color = "gray80",
     main = "ta3 Taxonomy Rank")
legend("topright", inset = c(-0.18, 0), legend = levels(ta3_factor),
       col = colors_ta3, pch = 19, pt.cex = 1.2, bty = "n", title = "ta3")

dev.off()
cat("All plots saved.\n")

Answers

Answers 16S differential abundance and PICRUSt2

Question 1: Type in asv_table or look at it in the "Environment" area in the top right. How many rows are there? Does this make sense to you? There should be the same number of rows as there were in the feature_table.txt, but without the # Constructed from biom file line.

Question 2: Are they the same as you remember - any different punctuation? You should notice that where there were hyphens (-) before, these have now been replaced with . - this is an annoying thing that R does when importing data. There are options to turn this behaviour off, but we'll just rename the columns for now.

Question 3: How does the number of genera compare with the number of ASVs? There are a lot fewer genera than ASVs.

Question 4: How many taxa does MaAsLin3 say are significantly differentially abundant? These numbers might vary slightly depending on how you have rarefied your data and whether you have kept the random seed the same, but there are 16 rows in the significant_results.tsv file, eight of which are unique (they are each shown twice - once for the abundance DA model and once for the prevalence DA model).

Question 5: How many taxa does ALDEx2 find to be significantly differentially abundant? Looking at the Kruskal-Wallis false discovery rate corrected p-value (kw.eBH), there are 26 taxa with p <= 0.2.

Question 6: Are there any taxa left? Is that what you expected? There are seven genera that are found to be DA by both MaAsLin3 and ALDEx2! This means that almost all of the genera identified by MaAsLin3 are also DA with ALDEx2, which is relatively surprising given the different models that they each use.

Question 7: Did it look like there are differences between the two treatments for the taxon that you plotted? Because you chose the taxon, I can't give the answer here!! Feel free to talk to one of us if you aren't sure on how to interpret these results.

Question 8: Are there any taxa that you're surprised to see are significantly differentially abundant? Why? Again, chat to one of us if you aren't sure on how to interpret these results!

Question 9: What do you think might be wrong with it? For the ASVs, we were needing to skip the first line of the file that didn't contain a useful header. Now we have the column names in our header line, so the skip=1 part of this is making the second line of the file be read in as the header, and this contains pathway abundance information. We should remove the skip=1 part before running this again.

Question 10: Can you see differences between the samples based on the metadata variables? Yes - Shannon and Simpson diversity seems to be slightly higher in Managed than Forest samples, although Observed richness is about the same in both. The samples don't group entirely separately on the PCoA plot, although there does seem to be some more variation among the Managed than the Forest samples.

Question 11: What are these differences like compared with those seen in the taxonomy data?

Answers 18S differential abundance

See below for the answers to questions!

Code needed

Importing packages:

library(phyloseq)
library(maaslin3)
library(microbiome)
library(ALDEx2)
library(ggplot2)
library(tidyr)
library(stringr)
library(vegan)
library(stats)
library(SpiecEasi)

Setting variables:

folder = '/home/ubuntu/workspace/bmb_module4/18S_analysis/'

Read in ASV table:

asv_table <- read.csv(paste(folder, "deblur_output_exported/feature-table_w_tax.txt", sep=""), sep='\t', skip=1, header=T)

Manipulate the feature table:

asv_table_num = data.matrix(asv_table[,2:13]) 
rownames(asv_table_num) = asv_table[,1] #give the matrix row names

Get the taxonomy information:

taxonomy <- data.frame(asv_table[,"taxonomy"])
rownames(taxonomy) <- asv_table[,1]
colnames(taxonomy) = c("taxonomy")

We want to split this so that we have columns for each of Domain, Supergroup, Division, Subdivision, Class, Order, Family, Genus and Species:

taxonomy_split <- separate(data = taxonomy, col = taxonomy, into = c("Domain", "Supergroup", "Division", "Subdivision", "Class", "Order", "Family", "Genus", "Species"), sep = "\\; ")

Read in the metadata:

metadata <- read.csv(paste(folder, "Plastisphere_18S_metadata.tsv", sep=""), sep='\t')
samples = data.frame(metadata[,2:3], stringsAsFactors = FALSE)
rownames(samples) = metadata[,1] #add the sample names as row names

Combine these into a phyloseq object:

ASV = otu_table(asv_table_num, taxa_are_rows = TRUE) #convert asv_table_num to an otu_table
TAX = tax_table(taxonomy_split) #convert taxonomy_split to a tax_table
taxa_names(TAX) <- rownames(taxonomy_split) #add names to TAX/the tax_table
SAMPLE = sample_data(samples) #convert samples to a sample_data
physeq = phyloseq(ASV, TAX, SAMPLE) #combine these all to make a phyloseq object
physeq #print this out to see what a phyloseq object looks like

Collapse at the genus level, which is "ta8" for 18S:

physeq_genus = tax_glom(physeq, taxrank="ta8")

Add names:

all_tax = paste(tax_table(physeq_genus)[,4], tax_table(physeq_genus)[,5], tax_table(physeq_genus)[,6], tax_table(physeq_genus)[,7], tax_table(physeq_genus)[,8], sep=';')
taxa_names(physeq_genus) = all_tax
otu_table(physeq_genus)

Rarefy:

set.seed(345)
sample_sums(physeq_genus)
physeq_rare = rarefy_even_depth(physeq_genus, sample.size = 1000, replace = TRUE, trimOTUs = TRUE, verbose = TRUE)

Note that we've changed the minimum sample size here because we had some samples with very low numbers of reads. I've printed out those sums to see how many samples we'll lose. This isn't ideal, but we don't really want to go below 1000 sequences in our samples. You'll see that 4 samples get removed, almost all of which are cave samples.

If this were a "real" analysis, we might find this reduced sample size a bit problematic, but we'll just carry on for now.

Get the feature table and metadata that we're using:

feat_table = data.frame(t(otu_table(physeq_rare)), check.rows=T, check.names=T, stringsAsFactors=T)
metadata = data.frame(sample_data(physeq_rare), stringsAsFactors = F)

Run MaAsLin3:

fit_out <- maaslin3(input_data = feat_table,
                    input_metadata = metadata,
                    output = paste(folder, 'maaslin3_treatment', sep=""),
                    formula = '~ light + time',
                    normalization = 'TSS',
                    transform = 'LOG',
                    augment = TRUE,
                    standardize = TRUE,
                    max_significance = 0.2,
                    median_comparison_abundance = TRUE,
                    median_comparison_prevalence = FALSE,
                    max_pngs = 100,
                    cores = 1,
                    save_models = TRUE)

Probably unsurprisingly given the small numbers of samples in each group, none of these were significant :(

We can also see what happens if we run each of these separately:

fit_out <- maaslin3(input_data = feat_table,
                    input_metadata = metadata,
                    output = paste(folder, 'maaslin3_light', sep=""),
                    formula = '~ light',
                    normalization = 'TSS',
                    transform = 'LOG',
                    augment = TRUE,
                    standardize = TRUE,
                    max_significance = 0.2,
                    median_comparison_abundance = TRUE,
                    median_comparison_prevalence = FALSE,
                    max_pngs = 100,
                    cores = 1,
                    save_models = TRUE)

fit_out <- maaslin3(input_data = feat_table,
                    input_metadata = metadata,
                    output = paste(folder, 'maaslin3_time', sep=""),
                    formula = '~ time',
                    normalization = 'TSS',
                    transform = 'LOG',
                    augment = TRUE,
                    standardize = TRUE,
                    max_significance = 0.2,
                    median_comparison_abundance = TRUE,
                    median_comparison_prevalence = FALSE,
                    max_pngs = 100,
                    cores = 1,
                    save_models = TRUE)

But none of these are significant either :(

Run ALDEx2 separately for light and time:

x <- aldex.clr(otu_table(physeq_genus), sample_data(physeq_genus)$light, mc.samples = 128, verbose=F, denom="all")
kw.test <- aldex.kw(x, useMC=2, verbose=FALSE)
write.csv(kw.test, paste(folder, "ALDEx2_light.csv", sep=""))

x <- aldex.clr(otu_table(physeq_genus), sample_data(physeq_genus)$time, mc.samples = 128, verbose=F, denom="all")
kw.test <- aldex.kw(x, useMC=2, verbose=FALSE)
write.csv(kw.test, paste(folder, "ALDEx2_time.csv", sep=""))

Get ALDEx2 results:

aldex_light = read.csv(paste(folder, "ALDEx2_light.csv", sep=""))
aldex_light = aldex_light[aldex_light$kw.eBH <= 0.2, ]

aldex_time = read.csv(paste(folder, "ALDEx2_time.csv", sep=""))
aldex_time = aldex_time[aldex_time$kw.eBH <= 0.2, ]

None of these are significant either :(

18S Answers to questions

Question 2: Are they the same as you remember - any different punctuation? Yes! Unlike with the 16S data, there was no punctuation in our sample names to begin with so these have remained correct.

Question 3: How does the number of genera compare with the number of ASVs? There are a lot fewer genera than ASVs.

Question 4: How many taxa does MaAsLin3 say are significantly differentially abundant? None :(

Question 5: How many taxa does ALDEx2 find to be significantly differentially abundant? None :(

Question 6: Are there any taxa left? Is that what you expected? There are none found with either MaAsLin3 or ALDEx2, and not with either (or both) metadata variables, so there is no overlap here. If you didn't run the 16S data already but want to see the plots, try running through the 16S data instead.

Question 7: Did it look like there are differences between the two treatments for the taxon that you plotted? Not applicable.

Question 8: Are there any taxa that you're surprised to see are significantly differentially abundant? Why? Not applicable.

Answers ITS differential abundance

Code needed

Importing packages:

library(phyloseq)
library(maaslin3)
library(microbiome)
library(ALDEx2)
library(ggplot2)
library(tidyr)
library(stringr)
library(vegan)
library(stats)
library(SpiecEasi)

Setting variables:

folder = '/home/ubuntu/workspace/bmb_module4/ITS_analysis/'

Read in ASV table:

asv_table <- read.csv(paste(folder, "deblur_output_exported/feature-table_w_tax.txt", sep=""), sep='\t', skip=1, header=T)

Manipulate the feature table:

asv_table_num = data.matrix(asv_table[,2:21]) 
rownames(asv_table_num) = asv_table[,1] #give the matrix row names

Get the taxonomy information:

taxonomy <- data.frame(asv_table[,"taxonomy"])
rownames(taxonomy) <- asv_table[,1]
colnames(taxonomy) = c("taxonomy")

We want to split this so that we have columns for each of Kingdom, Phylum, Class, Order, Family, Genus, Species and Subspecies:

taxonomy_split <- separate(data = taxonomy, col = taxonomy, into = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species", "Subspecies"), sep = "\\; ")

Read in the metadata:

metadata <- read.csv(paste(folder, "pregnancy_ITS_metadata.tsv", sep=""), sep='\t')
samples = data.frame(metadata[,2:3], stringsAsFactors = FALSE)
rownames(samples) = metadata[,1] #add the sample names as row names

Combine these into a phyloseq object:

ASV = otu_table(asv_table_num, taxa_are_rows = TRUE) #convert asv_table_num to an otu_table
TAX = tax_table(taxonomy_split) #convert taxonomy_split to a tax_table
taxa_names(TAX) <- rownames(taxonomy_split) #add names to TAX/the tax_table
SAMPLE = sample_data(samples) #convert samples to a sample_data
physeq = phyloseq(ASV, TAX, SAMPLE) #combine these all to make a phyloseq object
physeq #print this out to see what a phyloseq object looks like

Collapse at the genus level, which is "ta6" for ITS:

physeq_genus = tax_glom(physeq, taxrank="ta6")

Add names:

all_tax = paste(tax_table(physeq_genus)[,2], tax_table(physeq_genus)[,3], tax_table(physeq_genus)[,4], tax_table(physeq_genus)[,5], tax_table(physeq_genus)[,6], sep=';')
taxa_names(physeq_genus) = all_tax
otu_table(physeq_genus)

Rarefy:

set.seed(345)
sample_sums(physeq_genus)
physeq_rare = rarefy_even_depth(physeq_genus, sample.size = 1000, replace = TRUE, trimOTUs = TRUE, verbose = TRUE)

Note that we've changed the minimum sample size here because we had one samples with a very low number of reads. I've printed out those sums to see how many samples we'll lose. This isn't ideal, but we don't really want to go below 1000 sequences in our samples. You'll see that 1 sample gets removed - if we wanted to be even more robust then we might set this to say 10,000 sequences, which would remove 3 samples.

Get the feature table and metadata that we're using:

feat_table = data.frame(t(otu_table(physeq_rare)), check.rows=T, check.names=T, stringsAsFactors=T)
metadata = data.frame(sample_data(physeq_rare), stringsAsFactors = F)

Run MaAsLin3:

fit_out <- maaslin3(input_data = feat_table,
                    input_metadata = metadata,
                    output = paste(folder, 'maaslin3_treatment', sep=""),
                    formula = '~ exact_age + category',
                    normalization = 'TSS',
                    transform = 'LOG',
                    augment = TRUE,
                    standardize = TRUE,
                    max_significance = 0.2,
                    median_comparison_abundance = TRUE,
                    median_comparison_prevalence = FALSE,
                    max_pngs = 100,
                    cores = 1,
                    save_models = TRUE)

None of these were significant :(

We can also see what happens if we run each of these separately:

fit_out <- maaslin3(input_data = feat_table,
                    input_metadata = metadata,
                    output = paste(folder, 'maaslin3_exact_age', sep=""),
                    formula = '~ exact_age',
                    normalization = 'TSS',
                    transform = 'LOG',
                    augment = TRUE,
                    standardize = TRUE,
                    max_significance = 0.2,
                    median_comparison_abundance = TRUE,
                    median_comparison_prevalence = FALSE,
                    max_pngs = 100,
                    cores = 1,
                    save_models = TRUE)

fit_out <- maaslin3(input_data = feat_table,
                    input_metadata = metadata,
                    output = paste(folder, 'maaslin3_category', sep=""),
                    formula = '~ category',
                    normalization = 'TSS',
                    transform = 'LOG',
                    augment = TRUE,
                    standardize = TRUE,
                    max_significance = 0.2,
                    median_comparison_abundance = TRUE,
                    median_comparison_prevalence = FALSE,
                    max_pngs = 100,
                    cores = 1,
                    save_models = TRUE)

But none of these are significant either :(

Run ALDEx2 separately for exact_age and category:

x <- aldex.clr(otu_table(physeq_genus), sample_data(physeq_genus)$exact_age, mc.samples = 128, verbose=F, denom="all")
kw.test <- aldex.kw(x, useMC=2, verbose=FALSE)
write.csv(kw.test, paste(folder, "ALDEx2_exact_age.csv", sep=""))

x <- aldex.clr(otu_table(physeq_genus), sample_data(physeq_genus)$category, mc.samples = 128, verbose=F, denom="all")
kw.test <- aldex.kw(x, useMC=2, verbose=FALSE)
write.csv(kw.test, paste(folder, "ALDEx2_category.csv", sep=""))

Get ALDEx2 results:

aldex_exact_age = read.csv(paste(folder, "ALDEx2_exact_age.csv", sep=""))
aldex_exact_age = aldex_exact_age[aldex_exact_age$kw.eBH <= 0.2, ]

aldex_category = read.csv(paste(folder, "ALDEx2_category.csv", sep=""))
aldex_category = aldex_category[aldex_category$kw.eBH <= 0.2, ]

None of these are significant either :(

ITS Answers to questions

Question 3: How does the number of genera compare with the number of ASVs? There are a lot fewer genera than ASVs.

Question 4: How many taxa does MaAsLin3 say are significantly differentially abundant? None :(

Question 5: How many taxa does ALDEx2 find to be significantly differentially abundant? None :(

Question 7: Did it look like there are differences between the two treatments for the taxon that you plotted? Not applicable.

Question 8: Are there any taxa that you're surprised to see are significantly differentially abundant? Why? Not applicable.

CBW‐2025‐BMB‐Module4 - LangilleLab/microbiome_helper GitHub Wiki

Module 4: Functional prediction and additional analyses

Introduction

Differential abundance testing with MaAsLin3 and ALDEx2

Functional prediction with PICRUSt2

Constructing and visualising a co-occurrence network using XYZ

Table of Contents

4.1 Differential abundance analysis using MaAsLin3 and ALDEx2

4.1.1 Prepare data from the end of Module 2

16S:

18S:

ITS:

4.1.2 Read output into R/Phyloseq

Importing packages

Setting variables

Read in the feature table

Manipulate the feature table

Get the taxonomy information

Read in the metadata

Combine these into a phyloseq object

4.1.3 Run MaAsLin3

4.1.4 Run ALDEx2

4.1.5 Combine and plot differential abundance results from MaAsLin3 and ALDEx2

Plot DA single taxon

Plot DA all taxa

For loops

Back to the results

4.2 Functional prediction with PICRUSt2

4.2.1 Filter input for PICRUSt2

4.2.2 Start running PICRUSt2

4.2.3 Read PICRUSt2 output into R/Phyloseq

4.2.3 Look at PICRUSt2 alpha and beta diversity

4.2.4 Run PICRUSt2 differential abundance

4.2.5 Combine and plot PICRUSt2 differential abundance results from MaAsLin3 and ALDEx2

Plot DA single pathway

Plot DA all pathways

4.3 Constructing and visualising a co-occurrence network

Answers

Answers 16S differential abundance and PICRUSt2

Answers 18S differential abundance

Code needed

18S Answers to questions

Answers ITS differential abundance

Code needed

ITS Answers to questions

Contact

⚠️ **GitHub.com Fallback** ⚠️

Constructing and visualising a co-occurrence network using `XYZ`

⚠️ GitHub.com Fallback ⚠️