Metagenomics (functional annotation; IMPACTT 2022) - LangilleLab/microbiome_helper Wiki

This tutorial was created for the 2022 IMPACTT bioinformatics workshop, which ran from May 30th - June 1st. It has since been adapted to work for others who don't have access to the AWS server instance that was used for the workshop.

Authors: Robyn Wright and Morgan Langille

This tutorial has been adapted from the previous 2021 version written for the 2021 Canadian Bioinformatics Workshop by Jacob Nearing and Morgan Langille.

Table of Contents

1. Determining functional profiles using MMseqs2

2. Adding descriptions and calculating pathway abundances using PICRUSt2

3. Visualising with JARRVIS


The main goal of this tutorial is to introduce students to functional profiling of metagenomic reads using MMseqs2 and to visualise the results of both the taxonomic and functional annotations. It assumes that you have already been through the IMPACTT taxonomic composition tutorial. As in the taxonomic composition workshop, we want to emphasize that there is not a one-size-fits-all pipeline for analyzing MGS data; another commonly used functional profiling tool is HUMAnN 3.

Throughout this tutorial there will be questions aimed at helping students understand the various steps in reference-based metagenomic profiling. The answers are found on this page.

This tutorial is based on our Metagenomics SOP and we are also working on a list of resources and papers that may be useful for beginners (or those that are not beginners!) in the microbiome field here.

Teaching Objectives

  1. Learn how to generate functional profiles using MMseqs2 in combination with Kraken2.
  2. Visualize the resulting taxonomic and functional profiles using JARRVIS.

Bioinformatic tool citations

Properly citing bioinformatic tools is important to ensure that developers are getting the proper recognition. If you use any of the commands in this tutorial for your own work be sure to cite the relevant tools listed below!

1. Determining functional profiles using MMseqs2

Now that we have an idea of the taxonomic profiles of our samples, we can also look at their functional potential. To do this we will be using MMseqs2 along with a few scripts developed by our lab to connect the functional annotations to the taxonomic annotations.

MMseqs2 works by taking sequenced reads, translating them into protein and then mapping them against a protein database. In this case we will be using UniRef90, a large protein database clustered at 90% identity. Our lab has developed a set of scripts to run this tool and use the information to assign both a taxonomic ID and a functional ID to each read within our samples.

Activate conda environment

As previously, we've installed all of the tools that we'll need for this part of the tutorial into an Anaconda environment, which we can activate with the following command:

conda activate functional

Job file generation

One way of running multiple commands in the terminal, one after another, is to create a job file. A job file contains all of the commands of interest along with the parameters needed to run them. We can use the following command to create a job file for each sample that we would like to process with MMseqs2:

ln -s ~/CourseData/Module4/Data/Functional_Helper_Scripts/ .
/usr/bin/perl Functional_Helper_Scripts/ --mapmethod mmseqs \
--db ~/CourseData/Module4/Data/MMSeqs2_db/mmseqsUniref90DB -p 4 -o mmseqs_U90_out cat_reads/*fastq

If you are not in the IMPACTT workshop, run this command instead:

/usr/bin/perl Functional_Helper_Scripts/ --mapmethod mmseqs \
--db MMSeqs2_db/mmseqsUniref90DB -p 4 -o mmseqs_U90_out cat_reads/*fastq

This will give a warning about not having given a MEGAN path, but that's fine because we don't want to run MEGAN!

This should create ten separate job files, one for each sample. While you don't need to look in these files, it's always a good idea to understand what your bioinformatic programs are doing! Let's take a look at one of the job files using the less command. You should see three lines, each containing a separate command:

mmseqs createdb cat_reads/CSM7KOMH.fastq mmseqs_U90_out/mmseqs-CSM7KOMHqueryDB

This command creates an mmseqs database from the input fastq file. The creation of this database is necessary for mmseqs as it vastly increases the speed at which translated DNA sequences can be mapped against a protein database.

mmseqs search mmseqs_U90_out/mmseqs-CSM7KOMHqueryDB MMSeqs2_db/mmseqsUniref90DB mmseqs_U90_out/mmseqs-CSM7KOMHresultDB tmp --db-load-mode 3 --threads 4 --max-seqs 25 -s 1 -a -e 1e-5 > /dev/null 2>&1

This command is the real meat of the job file and searches the freshly created sample database against the provided UniRef90 protein database. There are a number of parameters in this command.

  • --db-load-mode 3 - This parameter tells MMseqs2 how to deal with loading the database into memory. For more information you can check out this page. Setting this parameter to 3 helps when running MMseqs2 in a cluster environment.
  • --threads 4 - The number of processors we want MMseqs2 to use during the search.
  • --max-seqs 25 - This indicates that we want MMseqs2 to output at most 25 hits for each sequence.
  • -s 1 - This sets the sensitivity that we want MMseqs2 to run at. Increasing this number will lower the speed at which MMseqs2 runs but will increase its sensitivity. For well-explored environments such as the human gut, a setting of 1 should suffice.
  • -a - This indicates that we want our results to include backtraces for each sequence match. These are needed to convert the resulting MMseqs2 file into a usable file format.
  • -e 1e-5 - This indicates that we only want to keep matches with an E-value below 1e-5 (E-values are a measure of how well two sequences match one another; the closer they are to zero, the better the match).
  • > /dev/null 2>&1 - This final part of the command redirects both standard output and standard error to /dev/null so that the command runs without printing too much text to our screen.

The final command allows us to convert the resulting file from the MMSeqs2 format into one that is more usable:

mmseqs convertalis mmseqs_U90_out/mmseqs-CSM7KOMHqueryDB MMSeqs2_db/mmseqsUniref90DB mmseqs_U90_out/mmseqs-CSM7KOMHresultDB mmseqs_U90_out/mmseqs-CSM7KOMH-s1.m8 --db-load-mode 2 > /dev/null 2>&1

This command takes as input the query database we made with our first command, the UniRef90 database we searched against, and the result file from our search command. It will output the file mmseqs_U90_out/mmseqs-CSM7KOMH-s1.m8.

We would run these job files with the following commands:


The * here is saying that we want to run all of the files that start with mmseqs- and end with .sh.
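The exact job-runner command used in the workshop isn't reproduced above, but one simple way to run each generated job file in turn is a shell loop over the mmseqs-*.sh files (a sketch only - your cluster may have its own submission command):

```shell
# Run every MMseqs2 job script in the current directory, one after another.
for job in mmseqs-*.sh; do
    echo "Running $job"
    bash "$job"
done
```

Each script runs to completion before the next one starts; on a machine with enough memory and cores you could instead submit them to a scheduler to run in parallel.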

Unfortunately, the AWS instances we are using do not have enough memory to run the above three commands in a timely manner (they take over an hour on this AWS configuration). As such, we have already provided the expected output from the above commands. You can copy it to your working directory using the following command:

cp -r ~/CourseData/Module4/Intermediate_Results/mmseqs_U90_out .

If you are not in the IMPACTT workshop, you should already have a folder called mmseqs_U90_out in your current directory, so you don't need to run the above command.

(If you had already started running the MMseqs2 commands, you can stop them by pressing Ctrl+C.)

All of these commands made a lot of files that we are actually not interested in! We will go ahead and remove all of the files except for those that end in .m8. This will help us save significantly on the amount of hard drive space that is being taken up by these files:

mkdir mmseqs_m8_files
mv mmseqs_U90_out/*.m8 mmseqs_m8_files/
rm -r mmseqs_U90_out

Let's take a quick look at one of the files we just moved into the directory mmseqs_m8_files using the less command.

less mmseqs_m8_files/mmseqs-CSM7KOMH-s1.m8

What you will see is a file in BLAST tabular format:

Column  Data type
0       Query sequence ID
1       Subject (database) sequence ID
2       Percent identity
3       Alignment length
4       Number of gaps
5       Number of mismatches
6       Start on the query sequence
7       End on the query sequence
8       Start on the database sequence
9       End on the database sequence
10      E-value - the expectation that this alignment is random given the length of the sequence and the length of the database
11      Bit score - the score of the alignment itself
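A quick command-line way to find the best hit for a given read is to sort the m8 file by bit score (column 12) and keep the first line per query. The two alignment lines below are invented for illustration (the UniRef90 IDs are placeholders; in practice you would pipe in a real file such as mmseqs_m8_files/mmseqs-CSM7KOMH-s1.m8):

```shell
# Sort by query ID, then by bit score descending, and keep
# only the first (i.e. best-scoring) line per query.
printf 'read1\tUniRef90_A\t95.0\t100\t0\t5\t1\t100\t1\t100\t1e-30\t200\nread1\tUniRef90_B\t80.0\t100\t0\t20\t1\t100\t1\t100\t1e-10\t120\n' \
    | sort -k1,1 -k12,12nr \
    | awk -F'\t' '!seen[$1]++ {print $1, $2, $12}'
# -> read1 UniRef90_A 200
```

This mirrors what the lab's top-hit helper script does for every read in every sample.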

Question 1: How many protein sequences did the sequence CB1APANXX170426:1:1101:2664:65700/2 align with in the sample CSM7KOMH? Which alignment/alignments have the lowest E-value/highest bitscore?

The next step we need to take is to get the name of the protein sequence that had the best alignment for each sequence read in our samples. We can achieve this by running the command:

mkdir mmseqs_U90_out_tophit
python Functional_Helper_Scripts/ --unirefm8Dir mmseqs_m8_files --output_path mmseqs_U90_out_tophit

Now that we have the best protein match for each of the sequences in our samples, we can begin creating our final data table containing the stratified abundance of each function in our samples. Because we (in the Langille lab) are continually updating some of these scripts, and the ones we are using here were only recently developed, there are a few different steps we'll take to get the files into the format we want. As bioinformatic protocols mature, we usually consolidate the individual steps into a single script; we have just about reached that point here, but haven't consolidated things yet, so we'll run these few steps.

First we need to generate an input file that contains information about where all of the results we have generated are located. We can generate this table using the following commands:

  • The first step is to generate the sample tags; we can use either the concatenated fastq files or the kraken2 output to extract the sample tag IDs:

ls -1 kraken2_outraw/*_0.1_RefSeqCompleteV205.kraken.txt | sed -e 's/.*\///g' | sed -e 's/\_0.1_RefSeqCompleteV205\.kraken\.txt//g' > sample-tags.txt

  • Next we need the paths to the UniRef functional files (parsed to get only the top hit) and the filetype (refseq, uniref or COG):

ls -1 mmseqs_U90_out_tophit/* | sed -e 's/$/\tuniref/g' > uniref-hits-list.txt

  • Similarly, we need the paths to the kraken2 classification files and the corresponding filetypes (note that here we're taking the kraken2 output from the RefSeqCompleteV205 database run with a confidence threshold of 0.1):

ls -1 kraken2_outraw/*_0.1_RefSeqCompleteV205.kraken.txt | sed -e 's/$/\tkraken2/g' > kraken2-results-list.txt

  • Finally, we need the list of the m8 output files for each sample:

ls -1 mmseqs_m8_files/* > m8-list.txt

  • The last step is to paste all of these files together:

paste sample-tags.txt kraken2-results-list.txt uniref-hits-list.txt m8-list.txt > multi-sample-outfiles-w-m8.txt

It is always good practice to check the files that you've just made - if something went wrong then you want to pick up on this as soon as possible rather than trying to work out where something went wrong later on!
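For example, one quick sanity check is to confirm that every row of the pasted file has the same number of tab-separated fields. The snippet below uses toy stand-ins for two of the list files; the same check works on the real multi-sample-outfiles-w-m8.txt:

```shell
# Toy stand-ins for sample-tags.txt and kraken2-results-list.txt.
tmp=$(mktemp -d)
printf 'CSM79HR8\nCSM7KOMH\n' > "$tmp/sample-tags.txt"
printf 'a.kraken.txt\tkraken2\nb.kraken.txt\tkraken2\n' > "$tmp/kraken2-results-list.txt"
paste "$tmp/sample-tags.txt" "$tmp/kraken2-results-list.txt" > "$tmp/combined.txt"
# Every row should report the same field count (here, 3).
awk -F'\t' '{print NF}' "$tmp/combined.txt" | sort -u
rm -r "$tmp"
```

If more than one number is printed, one of the list files has a missing or extra line and the paste has gone out of register.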

Now that we have this master file we can pass this information into the helper script to add all of it together for our samples:

ln -s ~/CourseData/Module4/Data/MMSeqs2_db/*.pbz2 .
python Functional_Helper_Scripts/ --multisample multi-sample-outfiles-w-m8.txt --outputf Workshop_strat_matrix_RPKM.txt --stratified Y --map2EC Y

If you are not in the IMPACTT workshop then you should instead run the following:

ln -s MMSeqs2_db/*.pbz2 .
python Functional_Helper_Scripts/ --multisample multi-sample-outfiles-w-m8.txt --outputf Workshop_strat_matrix_RPKM.txt --stratified Y --map2EC Y

The first link command will grab all the databases that contain information about the length of each gene in our UniRef protein database. This will be important to normalize the abundance of each functional classification.

The second command will generate a final stratified data matrix that shows the abundance of each EC number stratified by the different taxonomic classifications within the sample. This script also normalizes the abundances of each of these ECs into reads per kilobase per million (RPKM). This abundance metric is the number of reads that mapped to that EC number per million mapped reads within the sample, divided by the gene length (in kilobases) of that EC. It's important that the abundances are normalized by gene length, or there would be an unfair bias toward longer genes due to their higher chance of being sequenced. We can also run the same command as above without the --stratified Y option to generate functional profiles that are not broken down by the contributing taxa.
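As a worked example of the RPKM calculation (the numbers here are made up): 150 reads mapping to an EC whose gene is 1,200 bp long, in a sample with 2,000,000 mapped reads in total, gives 150 / (1.2 kb x 2 million reads) = 62.5 RPKM:

```shell
awk 'BEGIN {
    reads = 150            # reads mapped to this EC
    gene_len_bp = 1200     # gene length in bp
    total_reads = 2000000  # total mapped reads in the sample
    rpkm = reads / ((gene_len_bp / 1000) * (total_reads / 1000000))
    print rpkm             # -> 62.5
}'
```

Note how halving the gene length would double the RPKM: without this correction, longer genes would always look more abundant.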

Wait a minute, my command ended with the output Killed!

Do not panic: this was expected, as the AWS instance we are running on does not have enough memory to load the database that contains the lengths of each gene in our UniRef90 database. However, we have already generated the expected output from this command, and we can copy it over to our working directory:

cp ~/CourseData/Module4/Intermediate_Results/Workshop_strat_matrix_RPKM.txt .

If you are not in the IMPACTT workshop, the server you are using may well have been able to run this command. If you already have a file called Workshop_strat_matrix_RPKM.txt in your current directory, then continue to the next command. If not, run this command first:

cp other_files/Workshop_strat_matrix_RPKM.txt .

Let's take a look at this file with the less command:

less Workshop_strat_matrix_RPKM.txt

Question 2: What is the RPKM contributed to the sample CSM79HR8 for the EC: contributed by Bacteroides intestinalis?

2. Adding descriptions and calculating pathway abundances using PICRUSt2

Using scripts from PICRUSt2 we can add descriptions (i.e. the names of each EC number) or generate stratified pathway abundances. Before we can do this we need to make some minor formatting adjustments:

sed -e "s/|/\t/g" Workshop_strat_matrix_RPKM.txt > Workshop_strat_matrix_RPKM_fixed.txt
sed -e "s/function/function\tsequence/g" Workshop_strat_matrix_RPKM_fixed.txt > Workshop_strat_matrix_RPKM_fixed_final.txt
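To see what these two sed commands do, here is a toy two-line version of the matrix (the EC number and taxon in the data row are invented for illustration): the first substitution splits the pipe-delimited function|taxon column into two tab-separated columns, and the second adds a "sequence" header for the newly created column:

```shell
# Header row plus one invented data row, run through both substitutions.
printf 'function\tCSM79HR8\nEC:|Bacteroides_intestinalis\t12.3\n' \
    | sed -e 's/|/\t/g' -e 's/function/function\tsequence/g'
```

This relies on sed interpreting \t as a tab in the replacement (as GNU sed does); the tutorial's own commands above make the same assumption.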

Now we will activate the PICRUSt2 conda environment:

conda activate picrust2

We can add descriptions using PICRUSt2's add_descriptions.py script:

add_descriptions.py -i Workshop_strat_matrix_RPKM_fixed_final.txt -m EC -o Workshop_strat_matrix_RPKM_fixed_des.txt

Question 3: What is the name of the enzyme with the EC number EC:

Next we will generate stratified pathway abundances using PICRUSt2's pathway_pipeline.py script:

pathway_pipeline.py -i Workshop_strat_matrix_RPKM_fixed_final.txt -p 4 -o pathways_out --wide_table

The results of this command will be in the pathways_out folder.

Question 4: How many total pathways were identified?

Question 5: How many taxa contribute to the pathway PANTO-PWY?

3. Visualising with JARRVIS

Now we'll visualise the differences between our samples using a new tool, JARRVIS, that is currently being developed in the Langille lab (it's currently in beta testing!). We will just need to run a few commands to get our data tables into the format that the tool is expecting.

In some of the above steps, we generated output tables that are stratified by taxa as well as by function. Now, we want to further stratify these so that they are separated by sample with only a single abundance column. We also want to change the taxonomy that we have so that instead of each just being listed as a species name, it will have all levels of the taxonomy.

First, we'll add together two of the files that we've already made to make an input file, and then remove the last column of this file:

paste sample-tags.txt kraken2-results-list.txt > input_for_taxonomy_abundance_script_int.txt
cut -d$'\t' -f1-2 input_for_taxonomy_abundance_script_int.txt > input_for_taxonomy_abundance_script.txt
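Tab is cut's default delimiter, so -f1-2 keeps the sample tag and the kraken2 file path while dropping the trailing "kraken2" filetype column. A toy example (the path is invented):

```shell
# Keep only the first two tab-separated columns.
printf 'CSM79HR8\tkraken2_outraw/sampleA.kraken.txt\tkraken2\n' | cut -f1-2
```

Only the sample tag and the file path are printed; the third column is discarded.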

Now we can run the script to get all of the taxonomy levels for our output:

conda activate functional
python Functional_Helper_Scripts/ --taxafilelist input_for_taxonomy_abundance_script.txt --outfile kraken2_abundance_matrix-6level-tax.txt

This should print out some information about what has been created. We can take a look at the kraken2_abundance_matrix-6level-tax.txt file with the less command. Most of the information in this file isn't needed for what we're doing (but, as mentioned above, this isn't yet a very mature pipeline!).

Next, we'll convert our abundance file into the format that is needed for the next script. We'll use the stratified pathway abundance file here as, with fewer pathways than EC numbers, it will be simpler to view, although this kind of visualisation can also be done with the EC numbers file (Workshop_strat_matrix_RPKM_fixed_final.txt):

gunzip pathways_out/path_abun_strat.tsv.gz
python Functional_Helper_Scripts/ --filename pathways_out/path_abun_strat.tsv --outfilename stratified-file-formatted.tsv

Finally, we'll run another script to convert this table to the right format for visualisation:

python Functional_Helper_Scripts/ --StratFileName stratified-file-formatted.tsv --taxaAbundFile kraken2_abundance_matrix-6level-tax.txt --outfile pathways-stratified-SankeyFormat.txt

If you take a look at this file with the less command, then it should look something like this:

Sample  Genus   Gene    Contribution
CSM79HR8        p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__NA;g__NA;s__NA      PANTO-PWY       0.0
CSM79HR8        p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__NA PANTO-PWY       0.0
CSM79HR8        p__NA;c__NA;o__NA;f__NA;g__NA;s__NA     PANTO-PWY       0.0
CSM79HR8        p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides caccae PWY-6700        0.0
CSM79HR8        p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides uniformis      PWY-6700

We'll just copy across the metadata file to our directory:

cp ~/CourseData/Module4/Data/mgs_metadata.txt .

If you are not in the IMPACTT workshop then this file should already be in your current directory. You will either need to download these files to your local computer and follow the instructions below for installing RStudio, or have RStudio installed on your server. Even if RStudio is installed on your server, you will still need to download the files to your computer.

You can now download both the mgs_metadata.txt and the pathways-stratified-SankeyFormat.txt files to your computer. You can right click them and select "Save link as..." (or similar) to download them.

Open RStudio on the server and run:


If you are not in the IMPACTT workshop then you will need RStudio for this part. You can download it from here (the free version) and you can run the same code as above in the console within RStudio, but you will first need to run the following in the console:


A window should pop up and you can continue with the tutorial.

You can now select the pathways-stratified-SankeyFormat.txt file for the first box and the mgs_metadata.txt file for the second box. On the "Metadata categories" dropdown you should choose "disease_state", and after a few seconds, it should automatically come up with a plot. If it initially looks like you have an error message, don't panic - you just need to choose the disease_state category before the tool can visualise the data.

You should now see "CD" (Crohn's disease) and "nonIBD" on the left hand side, taxa names in the middle and pathway names on the right. Hovering over the lines will highlight them and give you information on what they connect.

Note that there is an RPKM threshold filter of 10 currently applied. It is usually a good idea to filter out very low abundance pathways/taxa, as these can add a lot of clutter to visualisations, but as we are working with small, sub-sampled samples here, it's not really necessary to apply this filter - everything fits fine as it is.

Question 6: Which samples is the Streptococcus genus found in?

Question 7: Are there any pathways that are only found in one sample and not the other?