
Module 1: Introduction to metagenomics and read‐based profiling

This tutorial is part of the 2025 CBW Advanced Microbiome Analysis (held in Vancouver, BC, May 29-30). It is based on a version written for the 2024 CBW Advanced Microbiome Analysis workshop by Ben Fisher.

Author: Robyn Wright

Overview

The goal of this tutorial is to familiarise students with the processes by which we analyse metagenomic data. Shotgun Metagenomic Sequencing, sometimes called MGS or WGS, is capable of capturing any DNA extracted from a given sample (however, this does not necessarily mean it captures ALL of the DNA). With MGS reads, we must consider that there could be significant host contamination, so - depending on our sample type - we must filter against host sequences in our pipeline. There are many tools that can produce similar results, and there is no one-size-fits-all solution. This tutorial is instead a foray into popular tools and processes, and how to use them appropriately for our analyses.

Throughout this module, there are some questions aimed at helping your understanding of some of the key concepts. You'll find the answers at the bottom of this page, but no one will be marking them. You may wish to keep the answers open in another tab.

Table of Contents

1. Initial setup and pre-processing of metagenomic reads
2. Generating Taxonomic Profiles
3. Visualisation of results in R

1. Initial setup and pre-processing of metagenomic reads

Log into your instance

If you weren't in the beginner microbiome analysis workshop and need a refresher, check the notes in Beginner Module 1.

Working on the Command Line

Make a new folder in your workspace to work from:

cd workspace
mkdir amb_module1
cd amb_module1

If you weren't in the beginner workshop, feel free to take a look around the instance using the ls, pwd and cd commands, but make sure you come back to ~/workspace/amb_module1 before the next step.

Link the raw data into your working directory, and copy across the metadata file:

ln -s ~/CourseData/MIC_data/AMB_data/raw_data_subsampled_subset/ raw_data
cp ~/CourseData/MIC_data/AMB_data/metagenome_metadata_subset.txt metagenome_metadata.txt
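
You can check that the link worked by listing the files it points to:

ls raw_data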

About the samples

These samples are from a study that investigated the functional capacity of the gut microbiome in chicken ceca. You can read more in the original study here:

Or in the additional analysis done by John's lab here:

A Crash Course in GNU Parallel

First, activate our conda environment for this tutorial:

conda activate amb_module1

Sometimes in bioinformatics, the number of tasks you have to complete can get VERY large (e.g. when we have thousands of samples). Fortunately, there are several tools that can help us with this. One such tool is GNU Parallel. This tool can simplify the way in which we approach large tasks, and as the name suggests, it can iterate through many tasks in parallel, i.e. at the same time. If you aren't familiar with using top or htop to look at the resources that you have available to you, take a look at this section of the beginner workshop.

We can use a simple command to demonstrate how to use parallel:

parallel 'echo {}' ::: a b c

With the command above, the program contained within the quotation marks ' ' is echo. This program is run 3 times, as there are 3 inputs listed after the ::: characters. What happens if there are multiple lists of inputs? Try the following:

parallel 'echo {}' ::: a b c ::: 1 2 3

Here, we have demonstrated how parallel treats multiple inputs. It uses all combinations of one of each from a b c and 1 2 3. But, what if we wanted to use 2 inputs that were sorted in a specific order? This is where the --link flag becomes particularly useful. Try the following:

parallel --link 'echo {}' ::: a b c ::: 1 2 3

In this case, the inputs are "linked", such that only one of each is used. If the lists are different lengths, parallel will go back to the beginning of the shortest list and continue to use it until the longest list is completed. You do not have to run the following command, as the output is provided to demonstrate this.

parallel --link 'echo {}' ::: light dark ::: red blue green
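
light red
dark blue
light green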

Notice how light appears a second time (on the third line of the output) to satisfy the length of the second list.

Another useful feature is specifying which of the input lists each substitution should come from. This can be done intuitively by using multiple brackets { } containing numbers corresponding to the list we are interested in. Again, you do not have to run the following command, as the output is provided to demonstrate this.

parallel --link 'echo {1} {3}; echo {2} {3}' ::: one red ::: two blue ::: fish
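
one fish
two fish
red fish
blue fish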

Finally, a handy feature is that parallel accepts files as inputs. This is done slightly differently than before, as we need to use four colon characters :::: instead of three. Parallel will then read each line of the file and treat its contents as a list. You can also mix this with the three-colon character lists ::: you are already familiar with. Using the following code, create a test file and use parallel to run the echo program:

echo -e "A\nB\nC" > test.txt
parallel --link 'echo {2} {1}' :::: test.txt ::: 1 2 3
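
If everything worked, you should see each line of test.txt paired with the linked list:

1 A
2 B
3 C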

Take a look inside test.txt with the less command if you like. Remember that you can use q to exit the file again.

And with that, you're ready to use parallel for all of your bioinformatic needs! We will continue to use it throughout this tutorial and show some additional features along the way. There is also a cheat-sheet here for quick reference.

Quality Control

Quality control is an important step in any pipeline. For this tutorial, we will use FastQC to inspect the quality of our samples.

Visualization with FastQC

First, make your desired output directory fastqc_out. Then, run FastQC as follows:

fastqc -t 4 raw_data/*fastq.gz -o fastqc_out

Run MultiQC on the FastQC results:

multiqc fastqc_out --filename multiqc.html

Remember that you can go to http://##.uhn-hpc.ca/ (where ## is your personal number) to find all of the files that we're creating.

Copy across the multiqc.html file to your computer.

Click on the .html files to view the results for each sample. Let's look at the Per base sequence quality tab. This shows a boxplot representing the quality of the base call for each position of the read. In general, due to the inherent degradation of quality with increased sequence length, the quality scores will trend downward as the read gets longer. However, you may notice that for our samples this is not the case! This is because for the purpose of this tutorial, your raw data has already been trimmed.

More often, per base sequence quality will degrade towards the end of the reads. The FastQC documentation provides examples of "good" and "bad" data that show what this typically looks like.

Now, you may have also noticed that most of the samples fail the "Per Base Sequence Content" module of FastQC.

This specific module plots the proportion of each of the four bases at each position in the reads of a file, and raises a warning/error if there are large discrepancies in base proportions. In a given sequence, the lines for each base should run in parallel, indicating that the base calling represents proper nucleotide pairing. Additionally, the A and T lines may appear separate from the G and C lines, which is a consequence of the GC content of the sample. The relative amount of each base should reflect the overall amount of these bases in the genome, and should not be imbalanced across positions. When this is not the case, the sequence composition is biased. A common cause of this is the use of primers, which throws off our sequence content at the beginning of the read. Fortunately, although the module error stems from this bias, the FastQC documentation for this module notes that it will not largely impact our downstream analysis.

The other modules are explained in depth in the FastQC Module Help Documentation

Filtering with KneadData

KneadData is a tool which "wraps" several programs to create a pipeline that can be executed with one command. For this tutorial, we will use KneadData to filter our reads for contaminant sequences against a host (chicken) database. KneadData will:

  • Run Trimmomatic to remove adapter sequences
  • Run Bowtie2 to remove reads mapping to the reference genome and the PhiX genome (commonly used as a sequencing control)

We don't have paired-end data here, but if we did, it would also:

  • Check whether both pairs of a read exist
  • Check whether they both map to the reference genome

Bowtie2 needs a reference genome/index file for its mapping step. There are some pre-made indexes on this page. You can now choose to use the database that we have already made (quicker option) or you can see the steps involved in building this for yourself (slower option).

Quicker option - use pre-made database

ln -s ~/CourseData/MIC_data/amb_module1/bowtie2_db/ .

Slower option - build the database yourself

First, download the NCBI RefSeq assembly summary:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt

Have a look in this file. You'll see that it contains information on all of the genomes in NCBI RefSeq - we often need reference genomes for constructing various kinds of databases, so it's useful to get familiar with what is available on NCBI. You'll see that each assembly/accession has information on the NCBI taxonomy ID, whether it's a complete genome, whether it's a reference or representative genome, and a link to where we can find the associated files. Because the samples we are using are from chickens, we're going to look for the files that are associated with the chicken Gallus gallus:

grep "Gallus gallus" assembly_summary_refseq.txt

The entry with "reference genome" in the refseq_category column (GCF_016699485.2) is the one we'll use. Copy the link that starts https://ftp.ncbi..... into a new browser window. You will see a bunch of different file types, so find the one that ends _genomic.fna.gz, right click, and click "Copy Link Address". Now use the wget command to download this to your instance.
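
If you'd rather grab the link from the command line, here's a minimal sketch (assuming the standard assembly_summary_refseq.txt layout, where refseq_category is column 5, organism_name is column 8 and ftp_path is column 20); it should fetch the same _genomic.fna.gz file as the manual steps above:

url=$(awk -F '\t' '$8 ~ /^Gallus gallus/ && $5 == "reference genome" {print $20}' assembly_summary_refseq.txt) #pull the ftp_path for the chicken reference genome
wget "${url}/$(basename ${url})_genomic.fna.gz" #the file name is the final path component plus _genomic.fna.gz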

Now we're going to do the same with PhiX. PhiX is added as a sequencing control and is often removed before we're given back our samples, but it's good to make sure.

  • Use grep to search for "phiX174"
  • Go to the link that you find
  • Download the _genomic.fna.gz file using wget

Question 1: This one isn't really a question, but you can see the code needed for these steps in the answers section if you need to!

Now we'll combine these two files into one:

cat GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_genomic.fna.gz GCF_000819615.1_ViralProj14015_genomic.fna.gz > chicken_Gallusgallus_phiX174.fna.gz

And finally, we'll build our Bowtie2 database:

mkdir bowtie2_db
bowtie2-build chicken_Gallusgallus_phiX174.fna.gz bowtie2_db/chicken_Gallusgallus_phiX174 --threads 4

This will take about ten minutes to run. If this is taking too long, feel free to stop this command using ctrl+c and link to the one that we already made:

ln -s ~/CourseData/MIC_data/amb_module1/bowtie2_db/ .

Back to KneadData

Now we are ready to run KneadData using parallel:

parallel -j 1 --eta 'kneaddata -un {1} -o kneaddata_out -db bowtie2_db/chicken_Gallusgallus_phiX174 --remove-intermediate-output -t 4' ::: raw_data/*

You'll see that after the first sample has finished, you'll get an estimate of the amount of time remaining.

While KneadData is running, consider the following: if we had paired reads to start with, we'd also have some files containing reads that were unmatched between the forward and reverse files.

We'll also run the following to get counts of how many reads we had in each file after each step:

kneaddata_read_count_table --input kneaddata_out --output kneaddata_read_counts.txt

Question 2: Take a look at this file. Are there any surprises? Which of the output files in kneaddata_out will you use for analysis?

Question 3: How many reads are in each sample before and after KneadData?

We'll move the other output files that we're not interested in to a new folder:

mkdir kneaddata_out/contam_seq
mv kneaddata_out/*_contam*.fastq kneaddata_out/contam_seq

2. Generating Taxonomic Profiles

Annotation with Kraken2/Bracken

Now that we have our reads of interest, we want to understand what these reads are. To accomplish this, we use tools which annotate the reads based on different methods and databases. There are many tools which are capable of this, with varying degrees of speed and precision. For this tutorial, we will be using Kraken2 for fast exact k-mer matching against a database.

We have also investigated which parameters impact tool performance in this Microbial Genomics paper. One of the most important factors is the contents of the database, which should include as many taxa as possible to avoid reads being assigned an incorrect taxonomic label. Generally, the bigger and more inclusive the database, the better. However, due to the constraints of our AWS cloud instances, we will be using a "PlusPF 8GB" index provided by the Kraken2 developers.

First of all, we want to download and extract the Kraken database:

wget https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_08gb_20250402.tar.gz
mkdir k2_pluspf_08gb
tar -xvf k2_pluspf_08gb_20250402.tar.gz -C k2_pluspf_08gb

Then we can clean-up by removing the original file:

rm k2_pluspf_08gb_20250402.tar.gz

First, you must create the appropriate output directories, or Kraken2 will not write any files. Use the mkdir command to make directories matching what we're using below. Using parallel, we will then run Kraken2 on our quality-controlled reads.
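
The two output directories used in the command below can be created like this:

mkdir kraken2_outraw kraken2_kreport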

Note that it's always a good idea to first try out a --dry-run of parallel before you run any long jobs. If you're satisfied with what it's going to be running, remove the --dry-run flag.

parallel -j 1 --eta --dry-run 'kraken2 --db k2_pluspf_08gb --output kraken2_outraw/{/.}.kraken --report kraken2_kreport/{/.}.kreport --confidence 0 {} --threads 4' ::: kneaddata_out/*.fastq

This process can take some time. While this runs, let's learn about what our command is doing!

  • We first specify our options for parallel, where:
    • the -j 1 option specifies that we want to run one job at a time;
    • the --eta option will count down the jobs as they are completed;
    • after the program contained in quotation marks, we specify our input files with :::, and use a shell wildcard to match all of the unzipped .fastq files output by KneadData.
  • We then describe how we want kraken to run:
    • by first specifying the location of the database with the --db option;
    • then specifying the output directory for the raw kraken annotated reads;
      • notice that we use a special form of the brackets here, {/.}. This is a special replacement string of parallel that removes both the file path and the extension when substituting the input into our kraken command. This is useful when files are going into different directories, and when we want to change the extension (see the example just after this list).
    • similarly, we also specify the output of our "report" files with the --report option;
    • the --confidence option allows us to filter annotations below a certain threshold (a float between 0 and 1) into the unclassified node. We are using 0 because our samples are already subset, however this should generally be higher. See our paper for more information;
    • and finally, we use the empty brackets {} for parallel to tell kraken what our desired input file is.
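
To see exactly what {/.} does, here's a quick standalone example (the file name is just an illustration and doesn't need to exist, since echo only prints its arguments):

parallel 'echo {} becomes {/.}' ::: kneaddata_out/sampleA_kneaddata.fastq

This prints kneaddata_out/sampleA_kneaddata.fastq becomes sampleA_kneaddata, showing that both the path and the .fastq extension are stripped.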

As Kraken runs, you should see it printing out a summary of the number of reads within each sample that it was able to classify.

With Kraken2, we have annotated the reads in our sample with taxonomy information. If we want to use this to investigate diversity metrics, we need to find the abundances of taxa in our samples. This is done with Kraken2's companion tool, Bracken (Bayesian Reestimation of Abundance with KrakEN).

Let's run Bracken on our Kraken2 outputs! First, make the expected output directory, then run the following:

parallel -j 2 --eta 'bracken -d k2_pluspf_08gb -i {} -o bracken_out/{/.}.species.bracken -r 150 -l S -t 1' ::: kraken2_kreport/*.kreport

Some notes about this command:

  • -d specifies the database we want to use. It should be the same database we used when we ran Kraken2;
  • -i is our input file(s);
  • -o is where and what we want the output files to be;
  • -r is the read length to get all classifications for; the default is 100, but our reads are 150 bp;
  • -l is the taxonomic level at which we want to estimate abundances;
  • -t is the number of reads required prior to abundance estimation to perform re-estimation

Finally, let's merge our bracken outputs:

combine_bracken_outputs.py --files bracken_out/*.species.bracken -o merged_output.species.bracken
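
If you like, take a quick look at the merged table - each sample should contribute a read-count column ending in _num and a relative-abundance column ending in _frac, and we'll rely on the _num columns when we load this table into R later:

head -n 5 merged_output.species.bracken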

Annotation with MetaPhlAn

Another tool that is commonly used for taxonomic annotation of metagenomic sequences is MetaPhlAn. This tool is different from Kraken2 in that it uses a database of marker genes, instead of a collection of genomes, and it identifies only these marker genes within our reads, rather than trying to classify all reads. It then attempts to estimate the abundance of the taxa it identified within our whole samples, but it's important to remember that this is an estimation, and not the actual number of reads classified. We will use MetaPhlAn 3 for this tutorial due to the constraints of the AWS instances, but a newer version (MetaPhlAn4) utilizing a larger database is available.

Usually, we would build this database ourselves so that we make sure to have the most recent version. This can take quite a while, so we will instead create a link to a copy that is already on the instance:

ln -s ~/CourseData/MIC_data/ref_dbs/CHOCOPhlAn_201901 .

And then activate the environment:

conda activate metaphlan3

Now, let's make an output folder metaphlan_out. Then, we can run the following:

parallel -j 1 --eta 'metaphlan --bowtie2db CHOCOPhlAn_201901 --input_type fastq -o metaphlan_out/{/.}.mpa {} --nproc 2' ::: kneaddata_out/*.fastq

This might take a while to run, so move on to the next step while this runs and then come back to this after.

3. Visualisation of results in R

Now we'll go to RStudio server: http://##.uhn-hpc.ca:8080 (remember to replace ## with your own instance number). If you weren't in the beginner microbiome workshop, you may be prompted to enter your username and password. It will also take a little while to log in the first time, but will be quicker afterwards.

Create the R Notebook

Using the menus, click File > New File > R Notebook, which will open an Untitled R markdown (Rmd) document. R notebooks are helpful in that you can run several lines, or chunks, of code at the same time, and the results will appear within the document itself (in the whitespace following the chunk).

The default R Notebook contains a header and some information you may find helpful. Try running the chunk containing plot(cars) to see what happens!

You do not need to preserve most of the information in the new, untitled document. Select all of its contents by click+dragging your cursor or entering the ctrl+a (Windows) / cmd+a (Mac) shortcut, and press backspace or delete to clear the document.

The chunks are distinguished by the grey shading. Everything between the opening ```{r} and the closing ``` belongs to the chunk. Anything written in the white space surrounding the chunk is meant to be annotation. Although you can run lines of code outside of the chunks, the chunks are useful for running multiple lines in series with one click.

Adding new chunks

Add a new chunk to your R notebook by navigating to Code > Insert Chunk from the toolbar, or by clicking on the little green C and selecting R.

And then save the document - it doesn't really matter what you call it, but something like amb_module1.rmd would be sensible.

Setup, Importing and Formatting Data

Now we'll start building our R markdown notebook. Paste the following into a chunk, and then click the little green "play" button on the top right of the chunk to run it.

library(phyloseq)
library(vegan)
library(ggplot2)
library(gridExtra)

setwd("~/workspace/amb_module1/")

In this chunk, we've imported the libraries/packages that we're going to use (phyloseq, vegan, ggplot2 and gridExtra) and we've told R which directory it should be working from.

Now we're going to read in the Bracken output, and do some formatting of it:

bracken <- read.csv("merged_output.species.bracken", sep="\t") #read in the file as a dataframe
rownames(bracken) = bracken[,1] #set the row names to those in the first column
cols = c() #make an empty list/vector
for (colname in colnames(bracken)) { #for each of the columns in bracken
  if (grepl("bracken_num", colname, fixed = TRUE)) { #if "bracken_num" is in the column name
    cols <- c(cols, colname) #add it to the columns list - we're going to use this to filter/subset the dataframe
  }
}
bracken = bracken[,cols] #subset the dataframe to only the columns in the cols list (i.e. only those with bracken_num in the column name)
colnames(bracken) = sapply(strsplit(colnames(bracken),"_"), `[`, 1) #rename the columns by splitting each name on underscores and keeping only the first part, which is the sample name

If you're not familiar with R or Python, anything after the # is a comment and won't be executed, so you can see what I've written about each step.

It's good practice to take a look at what we're changing each time! In the "Environment" section of RStudio, you should be able to see all of these objects. Click on them to have a look. I won't always tell you to click on them, but it's a good idea to either click on them or print them out so that you can understand what is changing each time. There are some other ways to look at this too:

View(bracken)
print(bracken)
bracken

Some of these options are more or less appropriate depending on what you're doing, but regularly printing things out is the easiest way to troubleshoot a script that isn't working. If you want to print something out within a loop, you'll need to use the print() function.
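
For example, a quick sketch using the bracken dataframe we already have:

for (colname in colnames(bracken)[1:3]) { #loop over the first three column names
  colname #on its own inside a loop, this prints nothing
  print(colname) #print() is needed to actually show the value
}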

Now we're going to get the genus name of each of the species in our dataframe:

all_genus = c() #make an empty list/vector
for (species in rownames(bracken)) { #for each of the rows (species) in the bracken dataframe
  this_species = species
  if (grepl("uncultured", this_species, fixed = TRUE)) { #if "uncultured" is in the name
     this_species = gsub("uncultured ", "", this_species) #rename the species by removing the "uncultured" and replacing it with nothing (the empty quotation marks)
  }
  this_genus = strsplit(this_species, " ")[[1]][1] #get the first word of the species name (the genus name)
  all_genus = c(all_genus, this_genus) #add this genus name to our list
}
tax_table = data.frame(genus=all_genus, species=rownames(bracken)) #make a new table with the genus and species names
rownames(tax_table) = rownames(bracken) #and add the rownames to it

Next, we'll read in our metadata and reformat it:

metadata = read.csv("metagenome_metadata.txt", sep="\t") #read in the file as a dataframe
samples = data.frame(metadata[,2], stringsAsFactors = FALSE) #get only the second column
rownames(samples) = metadata[,1] #add the sample names
colnames(samples) = c('group') #change the column name
samples['classifier'] = 'kraken' #add a new column that shows us that these samples were classified by Kraken

And now we'll combine these objects into a phyloseq object:

BRACKEN = otu_table(bracken, taxa_are_rows = TRUE) #convert bracken to an otu_table
SAMPLE = sample_data(samples) #convert the samples dataframe to a phyloseq sample_data format
TAX = tax_table(tax_table) #convert the tax_table dataframe to a phyloseq tax_table format
taxa_names(TAX) <- rownames(bracken) #get the taxa names from the bracken row names
physeq_bracken = phyloseq(BRACKEN, TAX, SAMPLE) #make the phyloseq object

Print out physeq_bracken to see if it has all of the taxa and samples that you are expecting.

And collapse the table at the genus level:

physeq_bracken_genus = tax_glom(physeq_bracken) #collapse at the genus level (the default is the first level in the tax_table, which for us is the genus)
taxa_names(physeq_bracken_genus) = data.frame(tax_table(physeq_bracken_genus))$ta1 #add the genus names to this

Have a look at it:

View(physeq_bracken_genus)

Filtering and rarefying

Now we can take a look at how many reads have been classified within our samples:

sample_sums(physeq_bracken_genus)
print(c("Minimum sample size:",min(sample_sums(physeq_bracken_genus)), "Maximum sample size:", max(sample_sums(physeq_bracken_genus))))

Question 4: Roughly what proportion of the reads you have in the samples after KneadData have been classified? Is this surprising to you? Remember that you can look in the kneaddata_read_counts.txt file to see how many reads the files all ended up with!

It's also always good to have a look at a rarefaction curve. This tells us about whether we've sequenced our samples sufficiently to capture all of the diversity within them:

rarecurve(as.data.frame(t(otu_table(physeq_bracken_genus))), step=50,cex=0.5,label=TRUE, ylim = c(1,800))

Question 5: How do these look to you? If these were your samples, would you be happy with these?

Let's also take a look at the number of reads assigned to each taxon within our samples:

length(taxa_sums(physeq_bracken_genus))
print(c("Minimum reads:",min(taxa_sums(physeq_bracken_genus)), "Maximum reads:", max(taxa_sums(physeq_bracken_genus))))
taxa_sums(physeq_bracken_genus)

You'll see that we have a minimum of 1 read assigned to a genus, and a maximum of >14M reads. Usually, we'll remove some of the very low abundance taxa in our samples.

physeq_bracken_genus<-prune_taxa(taxa_sums(physeq_bracken_genus)>=10,physeq_bracken_genus)

If you take a look at this, you'll see that we've already gone from >1500 genera to only 490. But we can reduce this a little further:

physeq_bracken_genus<-prune_taxa(taxa_sums(physeq_bracken_genus)>=100,physeq_bracken_genus)

Now we only have 112 genera, and with a minimum of a sum of 100 reads (average 10 per sample), this seems fairly reasonable for filtering.

Let's take a look at the rarefaction curves again:

rarecurve(as.data.frame(t(otu_table(physeq_bracken_genus))), step=50,cex=0.5,label=TRUE, ylim = c(1,150))

Question 6: How do you feel about the samples now?

Next we'll rarefy the samples prior to our diversity calculations - the number of classified reads varies a lot between our different samples!

set.seed(711)
rarefied_bracken_genus<-rarefy_even_depth(physeq_bracken_genus, rngseed = FALSE, sample.size = min(sample_sums(physeq_bracken_genus)), trimOTUs = TRUE)

See that the samples are all the same size now:

sample_sums(rarefied_bracken_genus)

Note that we are only going to use this rarefied object for the alpha diversity analyses!

Alpha and beta diversity analyses

We're not going to go into explaining all of these again as we did that in the beginner workshop, but we'll take a look at the samples with a few different methods.

First, we'll look at the alpha diversity:

plot_richness(physeq = rarefied_bracken_genus, x="group", color = "group", measures = c("Observed", "Shannon", "Fisher", "Simpson")) + geom_boxplot() +theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

Question 7: What do you notice about the alpha diversity differences between the samples?

Now we'll convert the non-rarefied phyloseq object to relative abundance:

relabun_bracken_genus <- transform_sample_counts(physeq_bracken_genus, function(x) x*100 / sum(x))

And then we'll use this for a Bray-Curtis PCoA plot:

ordination<-ordinate(relabun_bracken_genus, "PCoA", "bray")
plot_ordination(relabun_bracken_genus, ordination, color = "group") +
  geom_point(size=10, alpha=0.1) + geom_point(size=5) + stat_ellipse(type = "t", linetype = 2) + theme(text = element_text(size = 20)) +  ggtitle("Beta Diversity", subtitle = "Bray-Curtis dissimilarity")

Question 8: Do the samples group how you expected them to based on their alpha diversity?

Next we'll make a bar plot. Plotting all 112 of the taxa on a barplot is a bit useless, so first we'll filter to include only the 20 most abundant genera:

relabun_bracken_genus_top<-prune_taxa(names(sort(taxa_sums(relabun_bracken_genus),decreasing=TRUE)[1:20]), relabun_bracken_genus)

plot_bar(relabun_bracken_genus_top, fill="ta1") + facet_wrap(c(~group), scales="free_x", nrow=1) + theme(legend.text=element_text(size=5), legend.key.size = unit(0.3, "cm")) + guides(fill=guide_legend(ncol=1))

Question 9: What do you notice about the composition of these samples? What are the samples from the Non-Bacteroides group dominated by?

Often heatmaps are a little easier to look at than a stacked barplot, and it's easier to see differences between samples:

plot_heatmap(relabun_bracken_genus_top, method = "PCoA", distance = "bray", low = "white", high = "red", na.value = "grey", sample.label="group")

This function is grouping our samples by their Bray-Curtis distance from one another, although note that this may not look quite the same as it would in the PCoA plot because we're only looking at the 20 most abundant genera.

That was the plotting inside phyloseq, but sometimes we also want a dendrogram to show how our samples cluster together, so let's also try something else:

heatmap(otu_table(relabun_bracken_genus_top))

Note that this has just clustered our taxa together, and the tree that we see for the genera is not a phylogenetic tree!

Question 10: What can you notice about our samples now? Particularly the taxonomic composition of the Non-Bacteroides samples?

Now let's do the same with the MetaPhlAn3 output

MetaPhlAn 3 should have finished running by now! So go back to your terminal window.

You may see a warning come out, but as long as it ran, that doesn't matter!

Similar to Kraken2, MetaPhlAn will output individual files for each sample. We can use a utility script from MetaPhlAn to merge our outputs into one table.

merge_metaphlan_tables.py metaphlan_out/*.mpa > metaphlan_merged_out.txt
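
Take a quick look at the first few lines of the merged table:

head -n 5 metaphlan_merged_out.txt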

If we view this new combined table, we will see three key things:

  1. First, the output format is different to that of Kraken2, where the full taxonomic lineages are expanded line by line.
  2. Second, MetaPhlAn only outputs relative abundances for each taxonomic node, whereas Kraken2 (before re-analysis with Bracken) will output absolute numbers of reads assigned to each taxonomic node.
  3. Third, the number of taxa that MetaPhlAn finds is much smaller than Kraken2. This is partially due to us using a low confidence threshold with Kraken2, but this discrepancy between the two tools tends to hold true at higher confidence thresholds as well. See our paper for more info about how these tools perform compared to each other.

We're going to perform some similar steps on this MetaPhlAn output as we did on the Bracken output, but we'll be able to just keep the lines of the file that are at the genus level, rather than grouping by genus. I haven't explained things much here as you can see more explanation above, so see if you can work out what each of the lines of code is doing as you go.

Read in the file and filter to only the relevant lines:

metaphlan <- read.csv("metaphlan_merged_out.txt", sep="\t", skip=1)
rownames(metaphlan) = metaphlan[,1]

keeping = c()
genus = c()
for (row in rownames(metaphlan)) {
  if (grepl("g__", row, fixed = TRUE)) {
    if (grepl("s__", row, fixed = TRUE)) {
      skip_this = TRUE
    } else {
      keeping = c(keeping, row)
      genus = c(genus, strsplit(row,"__")[[1]][7])
    }
  }
}

metaphlan = metaphlan[keeping,]
metaphlan = metaphlan[,c(-1,-2)]
rownames(metaphlan) = genus
colnames(metaphlan) = sapply(strsplit(colnames(metaphlan),"_"), `[`, 1)

Read in the metadata again:

metadata = read.csv("metagenome_metadata.txt", sep="\t")
samples = data.frame(metadata[,2], stringsAsFactors = FALSE)
rownames(samples) = metadata[,1]
colnames(samples) = c('group')
samples['classifier'] = 'metaphlan'

Make a taxonomy table:

tax_table = data.frame(genus=rownames(metaphlan))
rownames(tax_table) = rownames(metaphlan)

Combine these into a phyloseq object:

METAPHLAN = otu_table(metaphlan, taxa_are_rows = TRUE) #convert metaphlan to an otu_table
SAMPLE = sample_data(samples)
TAX = tax_table(tax_table)
taxa_names(TAX) <- rownames(metaphlan)
physeq_metaphlan_genus = phyloseq(METAPHLAN, TAX, SAMPLE)

We're going to skip some of the first steps looking at read depths and the number of reads assigned to different species: MetaPhlAn3 identifies marker genes rather than classifying every read, and its output is relative abundance, not a count.

There are only two alpha diversity metrics that we can look at without counts:

plot_richness(physeq = physeq_metaphlan_genus, x="group", color = "group", measures = c("Shannon", "Simpson")) + geom_boxplot() +theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

Make a PCoA plot with Bray-Curtis distance:

ordination<-ordinate(physeq_metaphlan_genus, "PCoA", "bray")
plot_ordination(physeq_metaphlan_genus, ordination, color = "group") +
  geom_point(size=10, alpha=0.1) + geom_point(size=5) + stat_ellipse(type = "t", linetype = 2) + theme(text = element_text(size = 20)) +  ggtitle("Beta Diversity", subtitle = "Bray-Curtis dissimilarity")

Question 11: Are these results for alpha and beta diversity as you expect them to be from the Kraken/Bracken results?

We don't have very many taxa in our MetaPhlAn3 output, so we won't bother filtering this before plotting the stacked barplot:

plot_bar(physeq_metaphlan_genus, fill="ta1") + facet_wrap(c(~group), scales="free_x", nrow=1) + theme(legend.text=element_text(size=5), legend.key.size = unit(0.3, "cm")) + guides(fill=guide_legend(ncol=1))

Then make the first of the heatmaps again:

plot_heatmap(physeq_metaphlan_genus, method = "PCoA", distance = "bray", low = "white", high = "red", na.value = "grey", sample.label="group")

You'll probably notice a lot of grey spots here - that means that genus wasn't found in the sample at all.

And the second one:

heatmap(otu_table(physeq_metaphlan_genus))

Question 12: What do you notice here compared with the Kraken/Bracken results?

Combine Kraken and MetaPhlAn output

First we'll rename both of the Bracken and MetaPhlAn phyloseq objects so that we don't have duplicated sample names in them:

sample_names(relabun_bracken_genus) = paste(sample_names(relabun_bracken_genus), "_bracken", sep="")
sample_names(physeq_metaphlan_genus) = paste(sample_names(physeq_metaphlan_genus), "_metaphlan", sep="")

Hopefully you can see how we're just pasting together each of the sample names with either "_bracken" or "_metaphlan".

Now join the two phyloseq objects:

physeq_combined = merge_phyloseq(relabun_bracken_genus, physeq_metaphlan_genus)

Make two PCoA plots, one with Jaccard distance (looking at presence/absence only) and one with Bray-Curtis dissimilarity (which also accounts for the abundance of each genus):

ordination<-ordinate(physeq_combined, "PCoA", "binary")
plot1 <- plot_ordination(physeq_combined, ordination, color = "group", shape = "classifier") +
  geom_point(size=10, alpha=0.1) + geom_point(size=5) + stat_ellipse(type = "t", linetype = 2) + theme(text = element_text(size = 20), legend.position = "none") +  ggtitle("Beta Diversity", subtitle = "Jaccard distance")
ordination<-ordinate(physeq_combined, "PCoA", "bray")
plot2 <- plot_ordination(physeq_combined, ordination, color = "group", shape = "classifier") +
  geom_point(size=10, alpha=0.1) + geom_point(size=5) + stat_ellipse(type = "t", linetype = 2) + theme(text = element_text(size = 20)) +  ggtitle("Beta Diversity", subtitle = "Bray-Curtis dissimilarity")
grid.arrange(plot1, plot2, ncol=2)

Does this look weird? Click the little button on the right between the code and the plot to "Show in new window".

Question 13: Anything surprising here? What do you notice about the two plots?

We had included all of the taxa previously, but Bracken identifies a lot more genera than MetaPhlAn. So we'll filter the Bracken output to make the two a little more similar:

relabun_bracken_genus_top<-prune_taxa(names(sort(taxa_sums(relabun_bracken_genus),decreasing=TRUE)[1:100]), relabun_bracken_genus)
physeq_combined = merge_phyloseq(relabun_bracken_genus_top, physeq_metaphlan_genus)

Now our PCoA plot looks a little nicer:

ordination<-ordinate(physeq_combined, "PCoA", "bray")
plot_ordination(physeq_combined, ordination, color = "group", shape = "classifier") +
  geom_point(size=10, alpha=0.1) + geom_point(size=5) + stat_ellipse(type = "t", linetype = 2) + theme(text = element_text(size = 20)) +  ggtitle("Beta Diversity", subtitle = "Bray-Curtis dissimilarity")

Let's have a look at the taxa in the samples:

plot_bar(physeq_combined, fill="ta1") + facet_wrap(c("group", "classifier"), scales="free_x", nrow=1) + theme(legend.text=element_text(size=5), legend.key.size = unit(0.3, "cm")) + guides(fill=guide_legend(ncol=1))

Can you see all of the taxa? Try filtering to only the top 20, modifying the code that we used above.

And look at the heatmap:

heatmap(otu_table(physeq_combined))

Question 14: What differences do you notice between the two classifiers?

Answers

Question 1: This one isn't really a question, but you can see the code needed for these steps in the answers section if you need to!

grep "phiX174" assembly_summary_refseq.txt 
#go to https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/819/615/GCF_000819615.1_ViralProj14015/
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/819/615/GCF_000819615.1_ViralProj14015/GCF_000819615.1_ViralProj14015_genomic.fna.gz

Question 2: Take a look at this file. Are there any surprises? Which of the output files in kneaddata_out will you use for analysis? Nope! There are no samples where we seem to have lost a really large number of reads at any step - we don't lose many reads that couldn't be trimmed (the difference between raw single and trimmed single), and not too many of the reads are host-associated (the difference between trimmed single and decontaminated chicken_Gallusgallus_phiX174 single). We'll use the sample_kneaddata.fastq files for analysis, and not the sample_kneaddata_chicken_Gallusgallus_phiX174_bowtie2_contam.fastq files.

Question 3: How many reads are in each sample before and after KneadData? The original number of reads per sample ranges between 931,686 and 1,017,130 (average 964,187) and after KneadData ranges between 912,712 and 998,264 (average 950,470).

Question 4: Roughly what proportion of the reads you have in the samples after KneadData have been classified? Is this surprising to you? We have between 52,405 (min(sample_sums(physeq_bracken_genus))) and 446,368 (max(sample_sums(physeq_bracken_genus))) - average 221,440 (mean(sample_sums(physeq_bracken_genus))). 221,440/950,470 = ~23% of reads classified. This does seem relatively low, although it's not uncommon for only a small proportion of the reads in samples not from humans (i.e. from less well-characterised environments) to be classified. In this case, it's likely mainly due to the really small database that we've used for taxonomic classification. We've used an 8GB database due to the constraints of these AWS instances, but we recommend using the largest database that you possibly can - the ones that we typically use in our lab are 800GB-1.1TB in size.

Question 5: How do these look to you? If these were your samples, would you be happy with these? These curves don't look great - they don't saturate at all.

Question 6: How do you feel about the samples now? This is much better - and we can see that our sample rarefaction curves not being saturated is down to these very low abundance taxa. While some of them could be important, anything present at abundances this low could be due to false-positive noise, and we didn't do any filtering with Bracken.

Question 7: What do you notice about the alpha diversity differences between the samples? There are huge differences between our Bacteroides and Non-Bacteroides sample groups. The samples in the Bacteroides group have much lower diversity than the Non-Bacteroides samples.

Question 8: Do the samples group how you expected them to based on their alpha diversity? Yes. All of the Bacteroides samples group away from the Non-Bacteroides samples, which isn't surprising based on how different their alpha diversity is.

Question 9: What do you notice about the composition of these samples? What are the samples from the Non-Bacteroides group dominated by? The Bacteroides samples have very low abundance of anything other than Bacteroides, while the Non-Bacteroides samples have many other genera also present. There isn't really any one genus that dominates the Non-Bacteroides samples, although a couple do seem to have high Escherichia.

Question 10: What can you notice about our samples now? Particularly the taxonomic composition of the Non-Bacteroides samples? There is very little of anything but Bacteroides in the Bacteroides samples. The Non-Bacteroides samples are not particularly similar to one another, with a range of taxa present in mid-to-high abundances in each of them.

Question 11: Are these results for alpha and beta diversity as you expect them to be from the Kraken/Bracken results? Yes - the Bacteroides and Non-Bacteroides samples are still separating from one another and the alpha diversity is much lower in Bacteroides than Non-Bacteroides samples.

Question 12: What do you notice here compared with the Kraken/Bracken results? These samples are very sparse! Lots of the taxa are only identified in one or a few samples. This is one of the drawbacks of the marker-gene approach used by MetaPhlAn - we can typically be very confident in the taxa that it identifies as present, but it is likely to miss lots of taxa, too (low recall, high precision).

Question 13: Anything surprising here? What do you notice about the two plots? Jaccard distance only takes into account presence/absence of taxa, while Bray-Curtis also accounts for abundance. It's surprising how similar the two classifiers look with Bray-Curtis when they are so different with Jaccard, but this means that the most abundant taxa must be similar between the two different classification methods.

Question 14: What differences do you notice between the two classifiers? The Bacteroides samples look very similar between the two methods, but the Non-Bacteroides samples seem to be more different. Escherichia seems to dominate three MetaPhlAn samples but only two Kraken samples. There are lots of other differences too! See if you can come up with some explanations for these differences.
