Phyloseq Tutorial for Metagenomics - LangilleLab/microbiome_helper GitHub Wiki

Author: Ben Fisher

This is a draft of the tutorial.

Table of Contents

Introduction

The purpose of this tutorial is to familiarize you with basic functionality of the R programming language, the RStudio integrated development environment, and some tools commonly used for bioinformatic analyses, including Phyloseq. Phyloseq is powerful R package that facilitates the interactive and reproducible analysis of microbial community data. This tutorial is designed to be run in RStudio, either via the server or local desktop release. This workflow picks up following the last steps of the Metagenomics (taxonomic annotation) tutorial.

For more advanced information on visualizing phyloseq data in R, see this tutorial: Phyloseq and Data Visualization

Teaching Objectives

  • Learn to use basic functions in R
  • Learn to format data for Phyloseq
  • Learn about Phyloseq objects and how to use them
  • Evaluate alpha and beta diversity with Phyloseq
  • Compute summary statistics for diversity metrics

Bioinformatic Tool Citations

* Indicates that this is an R package.

Setup

Downloading and Configuring RStudio

First download the latest version of R for your operating system, then download RStudio. The Instructions to do so are found here on the Posit website. If you would like a crash course, check out this resource.

Once you have obtained RStudio, run it and paste the following code into the console: Note that installing tidyverse can take some time.

install.packages("devtools","BiocManager","vegan","tidyverse")
devtools::install_github("pmartinezarbizu/pairwiseAdonis/pairwiseAdonis")
BiocManager::install("phyloseq")

Now your R packages are installed! Notice that the last two commands don't use the install.packages() function of base R. This is because these specific packages come from different repositories and require additional packages (devtools, BiocManager) to install - that's why we installed those first.

Downloading the tutorial data

Run the following commands to obtain the tutorial data. This is the same data that is used in our previous Metagenomics (taxonomic annotation; IMPACTT 2022) tutorial, so you may already have it! Note that the metagenome_tutorial.tar.gz archive is approximately 31GB in size

wget -O metagenome_tutorial.tar.gz https://www.dropbox.com/sh/cn06suaq77di9sf/AADpERJGxme7xfON-rLoOE2Oa/metagenome_tutorial.tar.gz?dl=1
tar -xvf metagenome_tutorial.tar.gz
rm metagenome_tutorial.tar.gz

This code does the following:

  1. Downloads the .tar.gz archive from the source (dropbox) using wget, with the -O flag specifying the output directory.
  2. Extracts the files contained within the .tar.gz archive. This will create a separate ~/metagenome_tutorial folder in the directory where the archive was downloaded. This is similar to "unzipping" a file.
  3. Removes the .tar.gz file after its contents have been extracted. This is because we now have the ~/metagenome_tutorial folder and all of its contents, so the .tar.gz archive is no longer necessary!

Setting up your conda environment

If you followed the previous tutorial:

You should already have miniconda installed and an environment called taxonomic. After ensuring that you are not already within an active conda environment, activate the taxonomic environment using:

conda activate taxonomic
If you did not follow the previous tutorial: (click to expand)
If you do not have miniconda installed, you must first so by following the instructions for your operating system here

With conda installed, run the following commands:

conda config --add channels bioconda
conda config --add channels conda-forge
conda create --name taxonomic

Now you can activate the conda environment with:

conda activate taxonomic

Then, install the packages necessary for this tutorial from the bioconda channel.

conda install -c bioconda kraken2
conda install -c bioconda bracken

Install kraken-biom

We must now add a new package, kraken-biom to the taxonomic environment. Inside the activated conda environment, install the package with:

pip install kraken-biom

(see the kraken-biom GitHub for other installation options).

Preparing data for Phyloseq with kraken-biom

If you are using your own data, you need to have run Kraken2 and Bracken as detailed in our Metagenomics (taxonomic annotation) tutorial. The output files should be in the .kreport format.

kraken-biom creates BIOM-format tables from Kraken/Bracken output. By default, the Kraken/Bracken .kreport file format is not supported by Phyloseq's import functions. kraken-biom makes it easy to translate your experiment's data into a format that Phyloseq can readily use.

Identify which files you will import to Phyloseq

First, navigate to the folder containing the Bracken-corrected .kreport files:

cd metagenome_tutorial/
cd kraken2_kreport_RefSeqCompleteV205/

If you use the ls command, you will see that there are 40 files in this folder, which look like:

CSM79HR8_0.0_RefSeqCompleteV205_bracken.kreport
CSM79HR8_0.0_RefSeqCompleteV205.kreport
CSM79HR8_0.1_RefSeqCompleteV205_bracken.kreport
CSM79HR8_0.1_RefSeqCompleteV205.kreport
CSM7KOMH_0.0_RefSeqCompleteV205_bracken.kreport
CSM7KOMH_0.0_RefSeqCompleteV205.kreport
CSM7KOMH_0.1_RefSeqCompleteV205_bracken.kreport
CSM7KOMH_0.1_RefSeqCompleteV205.kreport
...
PSM7J18I_0.1_RefSeqCompleteV205.kreport

From the previous tutorial, the files are distinguished by two things:

  1. The confidence threshold that was used in Kraken
  2. The suffix indicating if the .kreport was re-estimated with Bracken.

We want to use these two features to properly select our files. We should only use files that are annotated at the same confidence threshold (i.e. 0.1), and that are processed the same (i.e. with Bracken). We do not want to mix files with different confidence thresholds or files that have been re-estimated by Bracken with files that have not.

Given these criteria, let's say that our files of interest are:

  1. Files processed in Kraken with a confidence threshold of 0.1
  2. Files that have been re-estimated by Bracken

To choose only these files, we can use a Regular Expression (sometimes called regex). The following regex will match the relevant files based on the filenames; we can test the expression using ls:

ls *_0.1_*_bracken.kreport

which returns a list of 10 files:

CSM79HR8_0.1_RefSeqCompleteV205_bracken.kreport
CSM7KOMH_0.1_RefSeqCompleteV205_bracken.kreport
HSM6XRQY_0.1_RefSeqCompleteV205_bracken.kreport
HSM7J4QT_0.1_RefSeqCompleteV205_bracken.kreport
HSMA33J3_0.1_RefSeqCompleteV205_bracken.kreport
HSMA33KE_0.1_RefSeqCompleteV205_bracken.kreport
MSM79HA3_0.1_RefSeqCompleteV205_bracken.kreport
MSM9VZHR_0.1_RefSeqCompleteV205_bracken.kreport
MSMB4LXW_0.1_RefSeqCompleteV205_bracken.kreport
PSM7J18I_0.1_RefSeqCompleteV205_bracken.kreport

This regex works by using a combination of *, which is a wildcard that matches any number of any character, and the specific strings _0.1_ and _bracken.kreport that are shared only by the files we want. Including the * allows there to be different characters in between the ones we want to be constant, so we can successfully match multiple file names with one expression.

Specifying the input for kraken-biom

kraken-biom $(ls *_0.1_*_bracken.kreport) -o taxa.biom --fmt json
A few notes about this command:
  • we use the regex discussed in the section above to efficiently pass all of the file names to kraken-biom. The ls command with the regex is inside parenthesis preceded by a dollar sign $. Technically, this means that the command is executed within a subshell, and its value, which is our list of 10 files, is returned and used by kraken-biom. You can think of it as being similar to the order of operations in mathematics: the subshell(s) will be evaluated first.
  • Other subshell commands, such as cat with a file containing file names can be used if desired.
  • The -o flag specifies the output file. This will be written to the current directory, but you could specify an absolute path for the output file. Without the output flag, kraken-biom will write a file to the current directory called table.biom
  • The --fmt flag specifies the BIOM format. The program supports 3 output types, but we want to specify json as our desired format because it is not the default.
  • If we did not use the regex, the equivalent expression would be:
kraken-biom CSM79HR8_0.1_RefSeqCompleteV205_bracken.kreport CSM7KOMH_0.1_RefSeqCompleteV205_bracken.kreport HSM6XRQY_0.1_RefSeqCompleteV205_bracken.kreport HSM7J4QT_0.1_RefSeqCompleteV205_bracken.kreport HSMA33J3_0.1_RefSeqCompleteV205_bracken.kreport HSMA33KE_0.1_RefSeqCompleteV205_bracken.kreport MSM79HA3_0.1_RefSeqCompleteV205_bracken.kreport MSM9VZHR_0.1_RefSeqCompleteV205_bracken.kreport MSMB4LXW_0.1_RefSeqCompleteV205_bracken.kreport PSM7J18I_0.1_RefSeqCompleteV205_bracken.kreport -o taxa.biom --fmt json

This hopefully conveys that clever regex's can often make your life easier!

Congrats! Our data is now ready for Phyloseq. You should have a file called taxa.biom. This file is only a single line, so if you check it out with head or cat, your screen will be covered in one long string.

Importing data into R

Create the R Project File

In RStudio, create a new project by doing File > New Project ..., select New Directory, New Project, and name your project "Phyloseq". The directory can be stored as a subdirectory in its default location, or wherever you would like to store it. Leave the "Create a git repository" (if visible) and "Use renv with this project" options unchecked.

Create the R Notebook

Using the menus, click File > New File > R Notebook, which will open an Untitled R markdown (Rmd) document. R notebooks are helpful in that you can run several lines, or chunks of code at the same time, and the results will appear within the document itself (in the whitespace following the chunk).

The default R Notebook contains a header and some information you may find helpful. Try running the chunk containing plot(cars) to see what happens!

You do not need to preserve most of the information in the new, untitled document. Select all of its contents by click+dragging your cursor or entering the ctrl+a shortcut, and press backspace or delete to clear the document. Then, copy the following block of code and paste it (ctrl+v) into the blank notebook:

---
title: "Phyloseq Notebook"
output: html_notebook
---

This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code. 

```{r}

```

Your document should now look like this. The chunks are distinguished by the grey shading. Everything between the first ```{r} and subsequent``` belongs to the chunk. Anything written in the white space surrounding the chunk is meant to be annotation. Screenshot 2023-12-01 154040 Although you can run lines of code outside of the chunks, the chunks are useful for running multiple lines in series with one click.

Adding new chunks

To add a new chunk into your R notebook, either:

  1. Navigate to Code > Insert Chunk from the toolbar, or,
  2. Use the shortcut ctrl+alt+I

Loading your Packages

When you open RStudio, although you have downloaded the packages required for this analysis, they may not be loaded into your environment. To do this, paste the following commands into the first chunk:

library(tidyverse)
library(phyloseq)
library(vegan)
library(pairwiseAdonis)

Then, run the chunk by pressing the green play button, which will execute all of the commands contained within. This will load our desired packages.

image

Getting your data into R

To import our data, we first need to set the working directory. It will be easiest to use the location where the taxa.biom file is already located. There are a couple of ways to do this, pick one of the options:

1. Execute the following command in the RStudio console:

setwd("~/metagenome_tutorial/kraken2_kreport_RefSeqCompleteV205/")
Some notes about this command:
  • Generally, commands in R follow the structure command(parameter1, parameter2, ... ), where the information we want to pass to a given command is contained within the parentheses that immediately follow it.
    • In the above case, we are giving the path to our new working directory as the parameter for the setwd() command.
  • It is important that the path within the parenthesis is surrounded by quotation marks. Otherwise, you will see this error: Error: unexpected '/' in "setwd(~/"
  • This could also be written and executed within the first chunk in the R notebook. However, be sure you actually run the command by either executing the chunk, or with the specific line selected, click Run > Run Selected Line(s) or use the ctrl + enter shortcut.

2. Or, navigate the "Session" menu to interactively change the working directory.

image

Be sure to set the directory to the kraken2_kreport_RefSeqCompleteV205/ folder!

Importing and formatting Phyloseq data

Now with our working directory set, we're ready to tackle Phyloseq! We're going to first import the data then format it before continuing our analysis.

Let's import our data!

In order to do this, we need to tell R which file we want to import. First, write a new chunk (as seen above) into the notebook below the first one wherein we load our packages. Then, paste and execute the following command to import our taxa.biom file:

#import the .biom data
data<-import_biom("taxa.biom")
Some notes about this command:
  • Our command is preceded by a comment. Comments are notated by a hashtag # preceding them. Anything after a # in a given line will not be executed. Comments are useful for describing what your code is doing!
  • We are using the import_biom() command that is part of the phyloseq package. If we forgot to load phyloseq with the library("Phyloseq") command previously, this command will not work.
  • With the arrow <-, we are creating an R Object. The import_biom("taxa.biom") command reads our file, however to store it in R and do our analysis, we need to save it as an object. In this case we called it data, which is done with the <- arrow.
  • If something is wrong, R will tell you! Look at the red error text for clues. Common errors are getting the file path or file name wrong, or forgetting the "quotation" marks.

If you succeed, you should see the data object pop up in the "Environment" pane!

image

Note that the data type is listed as Formal class phyloseq. This is because the import.biom() command is from the Phyloseq package, and it has created a phyloseq object with our data. Congrats, you just executed your first Phyloseq command!

Formatting the sample data

A useful characteristic of Phyloseq objects is that they can contain Sample Data, which is a component that can store metadata about samples in a dataset. Metadata is very useful for separating groups of samples in our analyses. However as of Phyloseq v1.44.0, after importing our taxa.biom file, we are left without our sam_data (shown as NULL)

image

How can we fix this?

The following 2 commands use the sample names stored in data to create the sam_data and append it to our Phyloseq object.

#create a new phyloseq object containing only sample data
sampledata<-sample_data(data.frame(Id=colnames(data@[email protected]), row.names=colnames(data@[email protected]), stringsAsFactors=FALSE))
#append sample data to the original data with merge
data<-merge_phyloseq(data, sampledata)
Some notes about this command:

For these complex R commands, it is sometimes best to work inside-out to understand what is going on:

  • We are constructing a data frame (a type of R object) with the data.frame() command.
    • Inside this command, we are calling the sample names Id, which is made up of the column names contained within the data object's otu_table.
    • We are also assigning row names with row.names, that are the same as the entries in the Id column. This is because Phyloseq needs to match the rownames of sam_data to the sample names (stored as column names) in the otu_table.
    • We add stringsAsFactors=FALSE inside the data.frame() command so our character vectors stay as character vectors.
    • We then use the Phyloseq command sample_data() to build our sample data from the data frame we created.
    • Finally, we create a new Phyloseq object called sampledata.
  • Then, we merge our two Phyloseq objects. One is our original data with no sam_data, and the other is our new sampledata that contains only the sam_data we constructed from the original data.
  • Finally, we are left with our data object that contains otu_table, tax_table, and sam_data

image

adding the metadata

If we want to clean up the names stored in Id in our sam_data, we can use a modified version of the above command. Replace the last two commands you added (sample_data() and merge_phyloseq()) with these:

#create a new phyloseq object containing only sample data
sampledata<-sample_data(data.frame(Id=substr(colnames(data@[email protected]),0,8),row.names=colnames(data@[email protected]), stringsAsFactors=FALSE))
#append sample data to the original data with merge
data<-merge_phyloseq(data, sampledata)
  • This uses the base R command substr (substring) to further modify the Id before constructing the data frame. For all of our samples, we know that the names only vary by the first 8 characters. So, we can use substr to grab the first 8 characters of the longer names.
  • The second command, merging data and sampledata, stays the same, but we need to execute it again to administer our changes to the data object.

Cleaning up the data

The taxonomy table, or tax_table, is a component of our phyloseq object data that stores information about all of the taxa in our dataset. If we enter the following View() command in our console to look at the data that is stored inside the tax_table, we see a data frame like the following:

View(data@[email protected])

image

From this, it is evident that we need to modify our data to be more user-friendly. We will do this by:

  1. Renaming the columns to be proper taxonomic levels instead of "Rank", and
  2. Removing the 3 characters preceding the taxonomic names, i.e. k__, p__, etc.

To do this, we can add the following commands to the chunk in our document:

#assign the column names to be proper taxonomic ranks
colnames(data@[email protected])<-c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")
#remove the characters preceding the taxonomic names
data@[email protected]<-substring(data@[email protected], 4)
Some notes about these commands:
  • Instead of creating an R object with <-, we are directly modifying the column names of the taxonomy table with colnames(data@[email protected]).
    • We do this by creating a character vector with c(), wherein the strings that we want to become the column names are given in quotations and separated by commas. Note that there are 7 "Rank" columns, and our character vector has 7 entries (i.e. is of length 7). If the length of colnames() and c() did not match, an error would occur.
  • Similarly, we are directly modifying the vectors within the data frame with the substr() command. The parameters of this command tell R we only want the strings starting at the 4th character, which skips the letter and two underscores (i.e. `k__).

Now, if we View() the data again, we should see something like:

View(data@[email protected])

image That looks much better!

Subsetting the sample data

In our analysis, we are only interested in the bacterial components of our samples. So, we must subset our dataset with the subset_taxa() command. We will create a new phyloseq object called bacteria which is a subset of our original data object.

bacteria<-subset_taxa(data, Kingdom == "Bacteria")
  • The subset_taxa() command looks inside of the taxonomy table stored in data for all taxa that belong to the Bacteria kingdom.

A quick way to check if this worked is to execute the following command in your console. This looks for unique values in the "Kingdom" row of the taxonomy table, after coercing it to a more friendly data frame:

unique(as.data.frame(bacteria@[email protected])$Kingdom)

Which should only return 1 unique string:

[1] "Bacteria"

Rarefaction

An important step in analyzing sequencing data is rarefaction. Rarefaction involves randomly subsetting samples so that all samples have even sequencing depth

A quick note on rarefaction: (click to expand)
It is worth knowing that the practice of rarefaction has been called into question in the past. However, the current thinking is that "rarefaction is the most robust approach to control for uneven sequencing effort when considered across a variety of alpha and beta diversity metrics."

Generally, it is a good idea to start by manually removing rare taxa. It is common to remove taxa that have less than 20 reads across all samples in our dataset. To do this, first create a new chunk, and use the prune_taxa() command from phyloseq.

#remove taxa that have less than 20 reads across all samples
bacteria_pruned<-prune_taxa(taxa_sums(bacteria)>=20,bacteria)
Some notes about this command:
  • The first argument we give to prune_taxa() is the condition we want met for the taxa to 'pass' this filtering.
  • We are looking for the sum of the taxa (found by taxa_sums) in our phyloseq object bacteria to be greater than 20.
  • The second argument is the phyloseq object we are pruning, which will still be bacteria.
  • The result of this command is stored in a new R object, bacteria_pruned.

Generating the Rarefaction Curve

To rarefy our dataset, we must first visualize the rarefaction curve of our samples using the vegan package. To do this, we need to create a matrix that vegan can work with from our Phyloseq object bacteria_pruned. Fortunately, Phyloseq's otu_table() function can help us with that.

#transpose the matrix given by otu_table, create object
otu_matrix<-t(otu_table(bacteria_pruned))
#force the class of otu_matrix to matrix, warning expected
class(otu_matrix)<-"matrix"
#generate the rarefaction curve
rarecurve(otu_matrix,step=50,label=FALSE)
  • Assigning the class of otu_matrix to matrix is not necessarily elegant, but a quick way to get what we need. The warning message is expected in the output.

From this, we should see our rarefaction curve:

image

Rarefying the Data

Generally, we rarefy after visualizing rarefaction curves to determine at what read depth the richness of the samples plateaus. We choose a cut-off as close to this plateau as possible while retaining a sufficient sample size for the analyses. With our samples, we see that the two bottom lines plateau earlier than the rest. This means that as the number of sample reads increases, the species richness increases at a much lesser rate. Thus, at higher sequence depths, more rare taxa are being included.

Sometimes, with overlapping lines it can be difficult to see where the minimum read depth is. We can double-check this by executing the following command in our console:

min(colSums(t(otu_matrix)))

which returns

[1] 42702

So, we know that at a minimum we must rarefy to this number. However, considering some samples plateau much before this number, we should choose a smaller cutoff. There is not a strict method to choosing a cutoff, but we should remember we want to include abundant taxa and exclude rare taxa. With this in mind, it is acceptable to rarefy our samples to a size of 20000 reads. Use the following lines of code to achieve this:

#set the 'seed' for randomness to be used by the rarefy_even_depth() command
set.seed(711)
#rarefy the samples to our given cutoff
rarefied<-rarefy_even_depth(bacteria_pruned, rngseed = FALSE, sample.size = 20000, trimOTUs = TRUE)
Some notes about these commands:
  • We first set the seed to some number, in this case 711. This is for reproducibility, since otherwise, rarefy_even_depth() would use a random seed by default. This means that if you ran the same code twice without setting the seed, you would get two different rarefied subsets.
  • The rarefy_even_depth() command needs to know to not use a random seed (rngseed = FALSE), that our cutoff is 20000 (sample.size = 20000), and that we are rarefying the bacteria_pruned phyloseq object.
  • The trimOTUs = TRUE parameter of rarefy_even_depth() means that if a taxa is subsampled to an abundance of 0 across all samples, that taxa is removed from the OTU table. Having taxa with 0 reads can mess things up later in the analysis.

Excellent! Now that our data is imported, formatted, and rarefied, we can finally look at some diversity metrics!

Alpha Diversity

Alpha diversity is a metric that evaluates the different types of taxa in a given sample. Different alpha diversity methods use calculations based on different components of the sample, which are:

  • Richness: the number of taxa in a sample.
  • Evenness: the distribution of abundances of the taxa in a sample (i.e. similarities/differences in read quantity per taxa).

Some methods use only one of these components, and some will use a combination of both. There are many indices to choose from, but some common ones are:

  • Observed taxa: The number of different taxa (richness)
  • Chao1 index: another measure of richness, more sensitive to rare taxa
  • Shannon index: a combination of richness and evenness, with more weight to the richness component.
  • Simpson index: a combination of richness and evenness, with more weight to the evenness component.

Calculating Alpha Diversity

For more information on visualizing phyloseq data in R, see this tutorial: Phyloseq and Data Visualization

⚠️ **GitHub.com Fallback** ⚠️