Metagenome sequence data analysis with Kraken2 and Phyloseq - LangilleLab/microbiome_helper GitHub Wiki

This tutorial is for the 2023 ICG bioinformatics workshop running from May 24th-25th with access to Acenet servers.

Authors: Robyn Wright and Morgan Langille

Table of Contents

  1. Initial setup

  2. Running Kraken2 using batch job submission

  3. Running Bracken

  4. Visualising the results using Phyloseq in R

  5. Optional additional material

Introduction

The main goal of this tutorial is to introduce students to the taxonomic profiling of metagenomic data from microbiome samples. We want to emphasize that there is not a one-size-fits-all pipeline for analyzing MGS data. For example some individuals may opt to examine their data using a marker-based approach such as MetaPhlAn3 while others may opt to use a kmer based strategy such as Kraken2+Bracken (that we will be using here).

Throughout this tutorial there will be questions are aimed at helping students understand the various steps in reference based metagenomic profiling. The answers are found on this page.

This tutorial is based on our Metagenomics SOP and we are also working on a list of resources and papers that may be useful for beginners (or those that are not beginners!) in the microbiome field here.

Initial setup

Usually the first steps in metagenome analysis would involve: (1) Quality control - checking that all of the reads are of sufficient quality to use in our analysis - and primer trimming - removal of any of the primers/adapters used in sequencing; (2) removal of unwanted reads - we usually check that there are no PhiX reads left over from the sequencing (usually added into Illumina sequencing runs for internal quality control) and may also remove any host-associated reads, for example, in human studies we would usually remove reads that map to the human genome; and (3) joining of the forward and reverse reads. Both of the first two steps can be performed using Kneaddata. The third step could be with a program that checks for matches between the reads, but as metagenome reads often don't have a large - or any - overlap, we may just simply concatenate the reads together. We don't have time for these steps in todays workshop, but you can read/work through a previous workshop that we ran for a more complete overview of metagenome sequencing analysis.

First go to Jupyter Lab and open up a terminal session (go to the launcher and click 'Terminal').

The first thing that we will do is give ourselves access to the data, which is currently in a shared area of the servers by running this:

ln -s ~/projects/def-sponsor00/SHARED/microbiome_metagenome/* .

If you now list everything that is in this directory (using ls) you should see a number of files and folders showing up in a turquoise colour. We haven't created a copy of these files, but we've created a link to them so that we can see them without typing in the full path to them (like creating a link to a folder on our Desktop).

If you take a look at the files in the subsampled_reads directory, you'll see 10 different samples in there that match the 16S samples you were looking at earlier. If you have a look at one of the files using the head command (you can exit by typing q), you'll see the format is the same as for the 16S files with lines for the sequence name, sequence and quality information for each sequence. These files have also been sub-sampled to have only 100,000 sequences per file. This is not typical for metagenomic reads but helps us be able to process them within the short amount of time we have in this workshop; these samples had 10,613,908-43,508,088 reads initially.

Running Kraken2 using batch job submission

After we have performed quality control on all of our reads the next step is to use Kraken2 to classify them with taxonomic labels. Kraken2 is one of many popular tools for classifying metagenomic reads into taxonomic profiles. In this case Kraken2 uses a kmer approach to assign reads to specific taxonomic lineages (look back at the lecture slides to remind yourself how kmers are used to classify reads).

There are several choices that we can make when classifying our samples that can have a very significant impact on both the proportion of reads in samples that we are able to classify as well as the taxa that are classified (if a taxon doesn't exist in our database, then we won't be able to classify it in our samples). The biggest choices here are the reference database that we use as well as the confidence threshold. We have recently published a paper that addresses this in some detail, but we'll summarise some of the most important points below.

As it will take a few minutes for Kraken2 to run, we will just start running the commands and then there are a few things that we should understand about what we're doing.

First of all, we're going to make some directories to hold our results:

mkdir kraken2_kreport
mkdir kraken2_outraw

Typically when we run Kraken2 there are two major steps computationally: (1) loading the database into memory, and (2) searching each of the sequences in our file against this database. As Kraken2 is very efficient, the most time-consuming step in this is usually loading the database into memory, and then it usually takes a few minutes for each sample to be run against the database. On the Acenet servers, we share the computational resources with other users. As a few minutes per sample multiplied across all of you in this workshop is much more time than we have, we'll just practice with how to run a single sample each against the database and then we'll provide you with the results for all samples to visualise in R. For this, we'll use the batch job submission that you were shown yesterday.

First we'll create a file for this batch job submission using the vi text editor that is pre-installed on most Linux systems:

vi kraken.job

This creates a text file called kraken.job and should open up an editor. Type in i to enter "insert" mode.

Now copy the following into this:

#!/bin/bash
#SBATCH --job-name=kraken_userYYY
#SBATCH --output=/home/userYYY/kraken.out
#SBATCH --error=/home/userYYY/kraken.err
#SBATCH --mem=10G
#SBATCH --time=0-00:30
#SBATCH --cpus-per-task=4
source /home/userYYY/.bashrc
~/projects/def-sponsor00/SHARED/anaconda3/envs/kraken/bin/kraken2 --db  k2_pluspf_08gb_20230314/ --threads 4 --output /home/userYYY/kraken2_outraw/BBXXX_0.0_minikraken.kraken.txt --report /home/userYYY/kraken2_kreport/BBXXX_0.0_minikraken.kreport --use-names --confidence 0.0 /home/userYYY/subsampled_reads/BBXXX.fastq

You should edit all of the instances of "userYYY" to be your user number and you should change the sample name BBXXX to a sample of your choosing. Once you are finished you can type esc to enter insert mode and then :x followed by enter to exit this text document while saving the changes you have made.

In this file we've given a job name, specified file names for the output of the job and for any errors encountered to be written to, set the maximum amount of memory allowed, the maximum time allowed, the maximum number of CPUs or threads and then we've set the location that the server can find the installed programs to use (the source lines) and finally, the kraken2 command to be run.

Now we can submit this job file:

sbatch kraken.job

This give us something like Submitted batch job 77; we can check on the status of this job with seff 77 (replacing 77 with whichever number job yours is) or on all jobs with squeue.

While you wait for this job to run you can read a bit more about the kraken2 command below, or carry on to the visualizing the results using Phyloseq section. Once the command has finished you can run Bracken on the output.

About the Kraken2 command

  • --db - This option points to the location of the database you want Kraken2 to use for classification. We have already downloaded the minikraken database to the shared area (downloaded from here).
  • --threads 4 - This option indicates that we want to use four processors for this Kraken2 job
  • --output - This argument points to the file name we want to output our classifications to.
  • --report - This argument points to the file name we want to output a more detailed classifcation report for each sample. These more in-depth classification reports are required to correct abundance estimates using bracken.
  • --use-names - This indicates we want the classifier to output the taxon names rather than their NCBI IDs.
  • --confidence - This allows us to set the confidence threshold so as reads with a low proportion of their k-mers classified as the taxon will be unclassified in the output (see below).

Databases

Unlike metagenomic assembly, we will be classifying our sequenced reads based on how similar they are to reads we have already classified in a reference database. In the Kraken2 manual, the Kraken2 developers recommend building a "standard" database that is comprised of the bacterial, archaeal and viral domains as well as the human genome and a collection of known vectors (UniVec_Core). There are commands given in the manual for doing this, and a relatively up-to-date version of this database is also available for download via the Langmead lab here. This database requires approximately 100 Gb RAM/memory to build, and 40 Gb RAM to run. The Langmead Lab also provides some secondary databases that are capped at different sizes, and also that contain other domains in addition to the bacteria, archaea and viruses. In our preprint, we found that the best database to use with a range of simulated and mock samples (where the real composition of the sample was known) was one that contained all genomes from the NCBI Reference Sequence Database (NCBI RefSeq), and can be downloaded from here. However, this database is approximately 1.2 Tb and will therefore be too large for many researchers to use. We found that bigger databases weren't always better, and without access to a server with very large amounts of RAM, the best options were smaller, curated databases - we recommend that everyone thinks carefully about this choice prior to use on their own samples, and provided some guidance on this in the preprint.

Because we only have a limited amount of RAM/memory on our instances here, we will be using the Standard-8 database (standard database capped at 8 Gb, which we'll call the minikraken database). The smaller size means that we can run it on workstations with a lower amount of RAM, but this comes at the cost of reduced performance. Also note that by default, kraken uses NCBI taxonomy (and requires a taxonomy structure similar to that of the NCBI taxonomy, where all taxa at all ranks are assigned a taxonomic ID, and the "parent" taxon is known for all), however, there are a number of other databases that users have made using different styles of taxonomy such as the Genome Taxonomy Database.

Confidence thresholds

As mentioned above, Kraken2, uses exact alignment of k-mers (sub-sequences of length k) to a reference database for query read classification. Because k-mers within a read could map to multiple sequences within the database when using Kraken2, a lowest common ancestor (LCA) approach is used to classify the sequence, and the number of k-mers mapping to a given taxon are calculated as a proportion of the number of k-mers within that sequence that do not have ambiguous nucleotides. This proportion of k-mers mapping to a taxon is called the “confidence”, and there is a confidence threshold that can be set within the program (0 by default), where a taxonomic classification is only taken if it is above this pre-defined threshold. For a short example of this, have a look at (this figure in the Kraken2 paper).

Here, we can see that the query sequence has been broken up into 27 k-mers. 11 of these k-mers are unclassified (shown in white), while 10 have been classified as the blue taxon, 4 as the orange taxon, 1 as the purple taxon and 1 as the red taxon. In this example, the purple taxon is the ancestor of all taxa and the blue is the ancestor of the orange. The sequence will be classified to the leaf node, or the lowest rank possible, and we can use the number of k-mers that were classified to calculate how confident we are in that taxonomic assignment. Our confidence that this sequence belongs to the orange taxon if 4/27, or 0.15. Our confidence that it belongs to the blue taxon is (10+4)/27 = 14/27 = 0.52, and our confidence that it belongs to the purple taxon is the sum of all classified k-mers, so 16/27 = 0.59. By default, the confidence threshold that Kraken uses is 0.00, which would mean that the sequence would be classified as the orange taxon. If we increased the confidence threshold to 0.20 then the sequence would be classified as the blue taxon, and if we increased it to 0.60 or higher then the sequence would be unclassified.

Again, you can see our paper if you want to see how this impacts the classifications given in more detail (and for more guidance on this), but as the confidence threshold is increased, fewer reads will be classified, but we can be more certain that the reads that are classified are classified correctly. To some extent, the "best" confidence threshold to use will depend on your research question and whether you are more concerned with ensuring that you identify all taxa that you possibly can (no false negatives), or that all of the taxa that you do identify are actually in the sample (no false positives). If you are running your own samples, then it may be worth taking a small (representative) subset and running them with a few different thresholds to see how the confidence threshold impacts this.

Back to the Kraken2 output

Now we will take a look at some of the output files from Kraken2. First take a look at a file in the kraken2_kreport_refseq folder with head -30 kraken2_kreport_mini/BB190_0.0_minikraken.kreport. The first few lines should look something like this:

92.79	4603672	4603672	U	0	unclassified
  7.21	357516	251	R	1	root
  7.20	357015	2589	R1	131567	  cellular organisms
  7.03	348734	29028	D	2	    Bacteria
  3.50	173651	13906	P	1224	      Pseudomonadota
  2.26	112004	9162	C	28211	        Alphaproteobacteria
  1.70	84372	9284	O	356	          Hyphomicrobiales
  1.06	52698	4984	F	41294	            Nitrobacteraceae
  0.90	44453	15558	G	374	              Bradyrhizobium
  0.20	10170	10170	S	1437360	                Bradyrhizobium erythrophlei
  0.16	7763	1740	G1	2631580	                unclassified Bradyrhizobium
  0.01	593	593	S	2782654	                  Bradyrhizobium sp. 186
  0.01	480	480	S	2782665	                  Bradyrhizobium sp. 200
  0.01	475	475	S	2782641	                  Bradyrhizobium sp. 170

The columns here are (from the Kraken2 manual):

Column Number Data Type
1 The percentage of sequences covered by the clade rooted at this taxon
2 The number of sequences covered by the clade rooted at this taxon
3 The number of sequences assigned directly to this taxon
4 A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies
5 NCBI taxonomic ID number
6 Indented scientific name

You should notice first that we have a very large proportion of unclassified reads. In the example output above, 92.79% of sequences were unclassified while only 7.21% of sequences were classified. The output is hierarchical, so going down, we can see that 7.20% of sequences were from cellular organisms (so almost all of the classified sequences), and 7.03% were Bacterial sequences. If we look at the second and third columns for the Bacteria, we can see that 29,028 of the 348,734 sequences that were classified as Bacteria weren't classified to any rank lower than Bacteria (Domain rank).

The reason that we have so few reads classified here is because we've used the smallest Kraken2 database. We typically find that as the size of the database increases, so does the proportion of reads that can be classified because there's a higher chance that something similar to each of your reads exists in the database. Unfortunately the full RefSeq database is far too big to use in this workshop (it's ~1.2 TB), but that's what we've used to classify the samples for the data that you're using in the Phyloseq part of the workshop, below. Now take a look at the same sample that's been classified using the full database: head -30 kraken2_kreport_refseq/BB190_0.0.kreport

Question 1: How many more reads are classified using the full database than with the mini database?

Now take a look at a sample that was classified using a different confidence threshold with the full database, e.g.: head -30 kraken2_kreport_refseq/BB190_0.2.kreport

Question 2: What do you notice about the number, or proportion, of reads that have been classified?

Now look at the same sample but classified with the mini database and a higher confidence threshold, e.g.: head -30 kraken2_kreport_mini/BB190_0.2_minikraken.kreport

Question 3: What do you notice now about the number of reads that have been classified? Does this seem like an appropriate confidence threshold to use?

Running Bracken

The raw output files created by Kraken2 should be corrected using Bracken to get more accurate estimations on the abundance of different taxa at various taxonomic levels. While Kraken2 classifies each of our reads to their best locations within a taxonomic tree, in some cases this will result in a read being placed higher up in the tree due to two species or sometimes even genera (or higher ranks) sharing the exact same sequence (k-mer) to which that read matched. Bracken can be used to solve this issue by taking these reads and mapping them back to the genomes they might belong to. By doing this it generates new abundance estimates that take into account the problem laid out above.

Once Kraken2 has finished running, we can run Bracken on these samples. Bracken doesn't require much memory at all so we can each run it at the same time without running into problems.

First we will make a directory to output the results to:

mkdir bracken_out

Now we'll run Bracken:

parallel -j 1 '~/projects/def-sponsor00/SHARED/anaconda3/envs/bracken/bin/bracken -d k2_pluspf_08gb_20230314/ -i {1} -o bracken_out/{1/.}.{2}.bracken -r 150 -l {2} -t 1' \
    ::: kraken2_kreport/*.kreport ::: S G P

In this example, we're using a program called parallel to run bracken on any .kreport files that we might have as well as at three different ranks; S (species), G (genus) and P (phylum). Parallel is a really useful program that can be used to run a program on multiple files at the same time, or, for example, just to loop through and run a program on all of the files in a folder or at different taxonomic ranks.

Here, we're giving Bracken several options:

  • -d - The database folder that contains the Bracken files in addition to the Kraken2 database (the e.g. database100mers.kmer_distrib files)
  • -i - The input kreport file from kraken2, denoted for parallel by {1} because it's the first option outside of the quoted bracken command after the :::
  • -r - The length of the reads used for mapping
  • -l - The level at which to estimate taxon abundance (S indicates species), denoted for parallel by {2} because it's the second option outside of the quoted bracken command
  • -t - The number of reads required for a taxon to be used in the abundance estimate

Note generally you would increase the -t argument to be somewhere between 0.1-1% of the total average read depth, however, since we are working with files that have already been subset we will not set this option. The reason for this is that k-mer based strategies such as kraken2 are notorious for making a large number of low abundance false positives. Check-out this paper or our paper for more information.

Usually if we had bracken results for multiple samples we would join all of these together (like we did in this other workshop, but with only one sample this isn't necessary. If you want to look at what this looks like though, take a look at the merged_output_0.1.species.bracken file with the head command. You will notice that the file has a number of different columns corresponding to different information about a single taxon.

Column Name Data Type
name Taxonomic ID
taxonomy_id NCBI taxonomic ID
taxonomy lvl Character representing the level of taxonomy
SAMPLE.species.bracken_num Number of reads that aligned to this classifcation for this SAMPLE
SAMPLE.species.bracken_frac Proportion of reads that aligned to this classification for this SAMPLE
SAMPLE2.species.bracken_num Number of reads that aligned to this classifcation for this SAMPLE2
SAMPLE2.species.bracken_frac Proportion of reads that aligned to this classification for this SAMPLE2
etc.

Visualizing the results using Phyloseq in R

Now we're going to explore the results a bit in RStudio. We'll mainly be using Phyloseq, which is a popular package for microbiome data analysis as once you've got your data into the correct format for it, it has lots of really useful functions that you can use for simple visualisation and exploration of your data. The initial steps of getting your data into the correct format for it can be a bit fiddly, but it's really useful and easy to use once you've performed these.

Open up JupyterLab and go to loaded modules (the hexagon) and search Rstudio if it doesn't already appear as an option. Click load and it should appear in the main window. Now launch it.

First of all, we'll import the packages that we'll be using:

.libPaths("/project/def-sponsor00/R/lib")
library(phyloseq)
library(vegan)
library(ggplot2)
library(randomcoloR)
library(gridExtra)

Note that because we've preinstalled these, the first line just tells R where it can find them.

Next we'll read in the file that contains our metadata and we'll also set the row names to be the SampleID and then remove the first column (so we don't have the row names duplicated):

metadata = read.csv('Blueberry_metadata_reduced.tsv', sep='\t')
rownames(metadata) = metadata$SampleID
metadata = metadata[,-1]

Next we'll read in the taxonomy information about the species in our samples and we'll again change the row names and then remove the first column:

tax = read.csv('tax_file_species_reduced_for_bracken.csv')
rownames(tax) = tax[,1]
tax <- tax[,-1]

Note that this file isn't usually an output from Kraken2/Bracken - I've put it together based on the NCBI taxonomy information (which can be found here for a browsable version or here for a downloadable version. It's not necessary for analysis of our samples, but I find it useful to have information on which e.g. family a given species belongs to and be able to collapse the species at higher ranks.

Next we'll import the Bracken output and we'll perform a couple of things with it that you can see described in this chunk of code (note that using # ahead of some text means that it won't be interpreted as code by R so we can use it to make notes about what we're doing):

#import the bracken output at confidence = 0.1 and drop the other columns (we are only keeping the number of reads assigned to each species, not the relative abundance or taxonomy ID and level information)
bracken = read.csv('merged_output_0.1.species.bracken', sep='\t')
row.names(bracken) = bracken$name
drop = c('name', 'taxonomy_id', 'taxonomy_lvl')
for (i in 1:length(colnames(bracken))) {
  if (grepl("bracken_frac", colnames(bracken[i]), fixed = TRUE)) {
    drop = c(drop, (colnames(bracken)[i]))
  }
}
bracken_red = bracken[,!(names(bracken) %in% drop)]

#rename the columns from what bracken has them by default to match the sample names in the metadata file
new_column_names = c()
for (i in 1:length(colnames(bracken_red))) {
  new_column_names = c(new_column_names, strsplit(colnames(bracken_red)[i], "_")[1](/LangilleLab/microbiome_helper/wiki/1)[1])
} 
colnames(bracken_red) = new_column_names

Now we'll convert the taxonomy information, metadata and bracken output to a phyloseq object:

#filter the taxonomy table to include only the species that are in the bracken table
tax_red = subset(tax, rownames(tax) %in% rownames(bracken_red))
#convert the taxonomy table to a phyloseq object and set the row names
taxonomy = tax_table(tax_red)
taxa_names(taxonomy) <- rownames(tax_red)
#import the bracken table, metadata and taxonomy information to a single phyloseq object
physeq = phyloseq(otu_table(bracken_red, taxa_are_rows = TRUE), sample_data(metadata), taxonomy)

Now we have our data imported to the phyloseq object (named "physeq") and from here we'll just be using this object.

Now we'll perform the normalisations that we will want to use:

physeq_rare <- rarefy_even_depth(physeq, sample.size = min(sample_sums(physeq)), replace = TRUE, trimOTUs = TRUE, verbose = TRUE) #rarefy to the lowest sample depth
physeq_relabun  <- transform_sample_counts(physeq, function(x) (x / sum(x))*100) #convert to relative abundance

There are different opinions in the microbiome community about how samples should be normalised to account for biases in sequencing, but for simplicity we're not getting into that today. If you're interested and want to know more, you can find several papers linked in the microbiome for beginners page.

Stacked bar plots

We will often use something like a stacked bar plot to get a quick overview of the data. Here we'll use the relative abundance phyloseq object to get a quick overview of how the community is composed.

Domain

First we'll look at the domain (Bacteria/Archaea/Eukaryotes) level:

palette = distinctColorPalette(30)
ps.domain = tax_glom(physeq_relabun, taxrank="ta1", NArm=FALSE)
plot_bar(ps.domain, fill="ta1") + facet_wrap(c(~Description_1, ~Description_3), scales="free_x", nrow=1) + theme(legend.text=element_text(size=5), legend.key.size = unit(0.3, "cm")) + guides(fill=guide_legend(ncol=1)) + scale_fill_manual(values=palette)

The first line here defines the colour palette that we want to use (and that we want it to have 30 distinct colours), the second line groups our phyloseq object to the domain level (called "ta1" here) and then the third line is used for plotting the stacked barplot. You can go ahead and run just parts of this line to see how each part of the code changes the plotted graph, e.g. plot_bar(ps.domain, fill="ta1"), plot_bar(ps.domain, fill="ta1") + facet_wrap(c(~Description_1, ~Description_3), scales="free_x", nrow=1), etc.

Question 4: Here you should see that there are also Eukaryotes in your samples. What do you notice about the distribution of the Eukaryotes in different samples?

Phylum

Now we'll look at the phylum level:

ps.phylum = tax_glom(physeq_relabun, taxrank="ta3", NArm=FALSE)
rank.sum = tapply(taxa_sums(ps.phylum), tax_table(ps.phylum)[, "ta3"], sum, na.rm=TRUE)
top30 = names(sort(rank.sum, TRUE))[1:30]
ps.phylum = prune_taxa((tax_table(ps.phylum)[, "ta3"] %in% top30), ps.phylum)
plot_bar(ps.phylum, fill="ta3") + facet_wrap(c(~Description_1, ~Description_3), scales="free_x", nrow=1) + theme(legend.text=element_text(size=5), legend.key.size = unit(0.3, "cm")) + guides(fill=guide_legend(ncol=1)) + scale_fill_manual(values=palette)

Here we're doing the same as above, but we're doing this at the rank of "ta3" rather than "ta1", and because we now have more than just three taxa at this rank, we're first filtering our table to include only the 30 most abundant taxa. So the line starting with rank.sum is first calculating the sum of each phylum across all samples, the top30 line is sorting these sums to find out which the top 30 most abundant phyla are and then the ps.phylum = prune_taxa line is filtering the ps.phylum object to include only those top 30 most abundant taxa before we plot them.

Question 5: Can you modify this phylum level code to work at a lower rank? Hint: You will need to modify both the ps.phylum and ta3 parts

Alpha diversity

Now we'll have a look at some measures of alpha diversity in our samples. We're going to be looking at Richness (observed taxa), Chao1 richness (estimated richness) and Simpson's diversity (a measure of diversity which takes into account the number of species present, as well as the relative abundance of each species). The diversity measures can't be calculated on numbers that aren't integers, so we can't calculate these for the relative abundance data. We also don't have a tree for the metagenomic data so we can't use Faith's phylogenetic diversity.

Question 6: Why do we not usually have a phylogenetic tree for read-based metagenomic analyses?

First we'll plot alpha diversity individually for each sample:

plot_richness(physeq_rare, measures=c("Observed", "Chao1", "Simpson"))

Here there are other alpha diversity metrics that we could look at, but we're just focusing on these 3 that give a good overview of what our samples are like.

And now we'll take a look at the differences in alpha diversity between the groups.

Per group (bulk vs rhizosphere):

plot_richness(physeq_rare, x="Description_1", measures=c("Observed", "Chao1", "Simpson")) + geom_boxplot()

Per group (forest vs managed):

plot_richness(physeq_rare, x="Description_3", measures=c("Observed", "Chao1", "Simpson")) + geom_boxplot()

Question 7: What patterns do you notice with the alpha diversity between groups?

Beta diversity ordination plots

As you heard earlier, there are a lot of different beta diversity metrics that we could use, but we'll just take a look at two of them; Bray-Curtis dissimilarity (which takes into account the relative abundance of taxa) and Jaccard distance (which accounts only for the presence/absence of taxa). Here we're also using the adonis2 function to perform PERMANOVA tests. These tell us about any significant differences between groups.

Bray-Curtis dissimilarity:

ps = physeq_rare
ps.ord <- ordinate(ps, "PCoA", "bray")
plot_ordination(ps, ps.ord, type="samples", color="Description_1", shape="Description_3")
distance <- phyloseq::distance(ps, method="bray", weighted=F)
adonis2(distance ~ sample_data(ps)$Description_1*sample_data(ps)$Description_3)

Here the lines are: 1) creating a copy of the physeq_rare object, 2) creating an ordination of this object, telling the function that we want to use the PCoA ordination method with the bray (short for Bray-Curtis dissimilarity) method, 3) plotting this ordination, using Description_1 to colour them and Description_3 for the shape of the points, 4) calculating a distance matrix with the bray method, and 5) performing a statistical test using the adonis2 function (PERMANOVA test) using Description_1 and Description_3 in the metadata.

Question 8: What can you tell about the samples from how they group on the PCoA plots? Does this agree with the results of the PERMANOVA tests?

Jaccard distance:

ps = physeq_rare
ps.ord <- ordinate(ps, "PCoA", "jaccard")
plot_ordination(ps, ps.ord, type="samples", color="Description_1", shape="Description_3")
distance <- phyloseq::distance(ps, method="jaccard", weighted=F)
adonis2(distance ~ sample_data(ps)$Description_1*sample_data(ps)$Description_3)

Here we're doing exactly the same as above, but with Jaccard distance rather than Bray-Curtis dissimilarity.

Question 9: How do these results compare with those for Bray-Curtis dissimilarity? What does this tell you about the abundances of different taxa?

Optional additional material

If you've finished everything and want more to do, try redoing the above analyses with a different Bracken file.

Right now, we've looked at the results for the full RefSeq database run with a confidence threshold of 0.1 (bracken_out_merged_species/merged_output_0.1.species.bracken). You could have a look at:

  • How changing the confidence threshold changes the results (e.g. using a much higher confidence threshold: bracken_out_merged_species/merged_output_0.5.species.bracken, or no confidence threshold at all: bracken_out_merged_species/merged_output_0.0.species.bracken
  • How changing the database changes the results (using the Mini database with either no confidence threshold: bracken_out_merged_mini/merged_output_0.0.species.bracken, or a very low one: bracken_out_merged_mini/merged_output_0.1.species.bracken (remember that from what we saw above, there were very few reads classified when we try using a higher confidence threshold with the Mini database)

You will need to change the line that reads: bracken = read.csv('merged_output_0.1.species.bracken', sep='\t'), but all of the other subsequent lines should work.