CBW IMPACTT Metagenome sequence data analysis with MMSeqs2 and JarrVis - LangilleLab/microbiome_helper GitHub Wiki

This tutorial is for the 2023 CBW-IMPACTT bioinformatics workshop running in Calgary from July 5-7th. This tutorial has been adapted from the 2022 IMPACTT bioinformatics workshops, but it now uses different samples that match those in the amplicon sequencing part of this workshop. This assumes that you have first run through the 2023 taxonomic annotation workshop.

Authors: Robyn Wright and Morgan Langille

Introduction

The main goal of this tutorial is to introduce students to functional profiling of taxonomic reads using MMSeqs and to visualise the results of both the taxonomic and functional annotations. This assumes that you have already been through the IMPACTT taxonomic composition tutorial. As in the taxonomic composition workshop, we want to emphasize that there is not a one-size-fits-all pipeline for analyzing MGS data, and another commonly used functional profiling tool is HUMAnN 3.

Throughout this tutorial there will be questions are aimed at helping students understand the various steps in reference based metagenomic profiling. The answers are found on this page.

This tutorial is based on our Metagenomics SOP and we are also working on a list of resources and papers that may be useful for beginners (or those that are not beginners!) in the microbiome field here.

Bioinformatic tool citations

Properly citing bioinformatic tools is important to ensure that developers are getting the proper recognition. If you use any of the commands in this tutorial for your own work be sure to cite the relevant tools listed below!

GNU Parallel (website, paper)
Microbiome Helper (website, paper)
MMseqs2 (website, paper)
PICRUSt2 (website, paper)
JARRVIS (website)

1. Determining functional profiles using MMseqs2

Now that we have an idea of the taxonomic profiles of our samples we can also look at their functional potential. To do this we will being using MMseqs2 along with a few scripts developed by our lab to connect the functional annotations to the taxonomic annotations.

MMseqs2 works by taking sequenced reads, translating them into protein and then mapping them against a protein database. In this case we will be using UniRef90, a large protein database clustered at 90% identity. Our lab has developed a set of scripts to run this tool and use the information to assign both a taxonomic ID and a functional ID to each read within our samples.

As for the taxonomic composition, we would usually perform some initial steps of quality control, removal of unwanted reads and joining of forward and reverse reads, but for now, we've done this for you.

We already created the directory that we'll be using in the taxonomic workshop, so we'll change into that and as previously, we've installed all of the tools that we'll need for this part of the tutorial into an Anaconda environment, which we can activate with the following commands:

cd workspace/metagenome_workshop
conda activate functional

Now we'll create a folder for our MMSeqs output to go into:

mkdir mmseqs_U90_out

The first step will be running a command creates an MMSeqs database from the the input fastq files. The creation of this database is necessary for MMSeqs as it vastly increases the speed at which translated DNA sequences can be mapped against a protein database:

parallel -j 4 --progress 'mmseqs createdb {} mmseqs_U90_out/mmseqs-{/.}queryDB' ::: cat_reads/*

Notice that we're using Parallel for this again, and this time we're running 4 jobs at once and we've used the --progress flag to see which job we're on and how many have already been run. There aren't many options in the command, but you can see that after createdb this is where the reads file is taken in, and then we modify this name slightly for the output that will go into the mmseqs_U90_out folder.

The next commands we unfortunately won't be able to run as they require more memory than we have, and would take a very long time to run (I would usually assume >24 hours to run this for a set of samples), but we'll go through what they're doing still.

This command is the real meat of the job file and runs the freshly created sample database against the provided UniRef90 protien database:

parallel -j 1 'mmseqs search mmseqs_U90_out/mmseqs-{/.}queryDB ~/CourseData/MIC_data/metagenome_data/MMSeqs2_db/mmseqsUniref90DB mmseqs_U90_out/mmseqs-{/.}resultDB tmp --db-load-mode 3 --threads 4 --max-seqs 25 -s 1 -a -e 1e-5 > /dev/null 2>&1' ::: cat_reads/*

Note: If you ran this command by accident, you can stop it again by pressing ctrl+c (at the same time)

There are a number of parameters in this command:

--db-load-mode 3 - This parameter tells MMSeqs how to deal with loading the database into memory. For more information you can check out this page. However, setting this parameter to 3 helps when running MMSeqs on a cluster environment.
--threads - The number of processors we want MMSeqs to use during the search
--max-seqs 25 - This indicates that we want MMSeqs to output at maximum 25 hits for each sequence
-s 1 - This indicates the sensitivity that we want MMSeqs to run at. Increasing this number will lower the speed at which mmseqs runs but will increase its sensitivity. For well-explored environments such as the human gut, a setting of 1 should suffice.
-a - This indicates that we want our results to output backtraces for each sequence match. These are needed to convert the resulting MMSeqs file into a usable file format.
-e 1e-5 - This indicates that we only want to keep matches that are below an E-value of 1e-5 (E-values are a measure of how well two sequences match one another, and the closer they are to zero, the better the match is).
> /dev/null 2>&1 This final part of the command allows us to run the command without having too much text printed to our screen.

The final command allows us to convert the resulting file from the MMSeqs2 format into one that is more usable:

#parallel -j 1 'mmseqs convertalis mmseqs_U90_out/mmseqs-{/.}queryDB MMSeqs2_db/mmseqsUniref90DB mmseqs_U90_out/mmseqs-{/.}resultDB mmseqs_U90_out/mmseqs-{/.}-s1.m8 --db-load-mode 2 > /dev/null 2>&1' ::: cat_reads/*

This command is similar and takes as input the query database we made from our first command, the UniRef90 database we searched against and the resulting file from our search command. It will output the file mmseqs_U90_out/*.m8.

We'll create a new folder for the .m8 files and create a link to them:

mkdir mmseqs_m8_files
ln -s ~/CourseData/MIC_data/metagenome_data/mmseqs_U90_out/*.m8 mmseqs_m8_files/

The mmseqs search also made a lot of files that we're not interested in! We didn't copy them across here, but you can see them by running ls ~/CourseData/MIC_data/metagenome_data/mmseqs_U90_out/ If you were running this for yourself, after you have the .m8 files you could remove the others to save significantly on the amount of hard drive space that is being taken up by these files.

Lets take a quick look at one of the files we just moved into the directory mmseqs_m8_files using the less command.

less mmseqs_m8_files/mmseqs-BB209-s1.m8

We you will see is a file in BLAST tabular format.

Column Number	Data Type
0	query sequence ID
1	Subject (database) sequence ID
2	Percent Identity
3	Alignment Length
4	Number of gaps
5	Number of mismatches
6	Start on the query sequence
7	End on the query sequence
8	Start on the database sequence
9	End on the database sequence
10	E value - the expectation that this alignment is random given the length of the sequence and length of the database
11	bit score - the score of the alignment itself

Question 1: How many protein sequences did the sequence CB1APANXX170426:1:1101:2664:65700/2 align with in the sample CSM7KOMH? Which alignment/alignments have the lowest E-value/highest bitscore?

The next step we need to take is to get the name of the protein sequence that had the best alignment for each sequence read in our samples. We can achieve this by running the command:

mkdir mmseqs_U90_out_tophit
cp -r ~/CourseData/MIC_data/metagenome_data/Functional_Helper_Scripts/ .
python Functional_Helper_Scripts/pick_uniref_top_hit.py --unirefm8Dir mmseqs_m8_files --output_path mmseqs_U90_out_tophit

Now that we have the best protein sequence that matches best with each of the sequences in our samples we can begin creating our final data table containing the stratified abundance of each function in our samples. The next steps are using various databases to create files that contain information on the taxonomic and functional annotation that we've obtained for each read in our initial sequence files as well as collating that information so that we know about the abundance of different taxa/function combinations within samples.

First we'll copy across a file that contains mapping between the raw kraken2 output files (that contain information on a read-by-read basis) and the MMSeqs2 output files for each sample:

cp ~/CourseData/MIC_data/metagenome_data/multi-sample-outfiles-w-m8.txt .

Have a look at it with the head or less commands if you like.

Now we'll create a link to some of the database files within out current directory:

ln -s ~/CourseData/MIC_data/metagenome_data/MMSeqs2_db/* .

These databases contain information about the length of each gene in our UniRef protein database, which is important to normalise the abundance of each functional classification.

The next step would be to run a script that will take in the taxonomy and function for each read and convert this to a more usable format:

python Functional_Helper_Scripts/parse_TaxonomyFunction.py --multisample multi-sample-outfiles-w-m8.txt --outputf Workshop_strat_matrix_RPKM.txt --stratified Y --map2EC Y

You can run this command if you like, but it takes a really long time to run, so for the sake of time it is probably easier to just grab the file that we've already made:

ln -s ~/CourseData/MIC_data/metagenome_data/Workshop_strat_matrix_RPKM.txt .

This command would have generated this final stratified data matrix that shows the abundance of each EC (Enzyme Commission) number stratified by the different taxonomic classifications within the sample. This script also normalises the abundances of each of these ECs into reads per kilobase per million (RPKM). This abundance metric is the number of reads that mapped to that EC number per million reads mapped within the sample, divided by the gene length of that EC. It's important that the abundances are normalised by gene length or there would be an unfair bias toward longer genes due to the higher chance of them being sequenced. We can also run the same command as above without the --stratified Y option to generate functional profiles that are not broken down by the contributing taxa.

Lets take a look at this file with the less command:

less Workshop_strat_matrix_RPKM.txt

Question 2: What is the RPKM contributed to the sample CSM79HR8 for the EC:6.3.5.2 contributed by Bacteroides intestinalis?

2. Adding descriptions and calculating pathway abundances

Next, we want to add some descriptions to the EC numbers file and also convert this to pathway abundances. For this, we'll use some of the files included within PICRUSt2, but we've modified the scripts for this workshop.

First, we'll add the descriptions to our file:

ln -s ~/CourseData/MIC_data/metagenome_data/add_descriptions.py .
python add_descriptions.py

After running this, you should see a new file called Workshop_strat_matrix_RPKM_des.txt. Take a look at it with the head command. You should see that each EC number now has a description associated.

Question 3: What is the name of the enzyme with the EC number EC:6.3.5.2?

Next we will generate stratified pathway abundances using the following command:

pathway_pipeline.py -i Workshop_strat_matrix_RPKM_des.txt -p 4 -o pathways_out --wide_table

The results of this command will be in the pathways_out folder.

Question 4: How many total pathways were identified?

Question 5: how many taxa contribute to the pathway PANTO-PWY?

3. Visualising with JARRVIS

Now we'll visualise the differences between our samples using a new tool, JARRVIS, that is currently being developed in the Langille lab (it's currently in beta testing!). We will just need to run a few commands to get our data tables into the format that the tool is expecting.

In some of the above steps, we generated output tables that are stratified by taxa as well as by function. Now, we want to further stratify these so that they are separated by sample with only a single abundance column. We also want to change the taxonomy that we have so that instead of each just being listed as a species name, it will have all levels of the taxonomy.

First, we'll add together two of the files that we've already made to make an input file, and then remove the last column of this file:

paste sample-tags.txt kraken2-results-list.txt > input_for_taxonomy_abundance_script_int.txt
cut -d$'\t' -f1-2 input_for_taxonomy_abundance_script_int.txt > input_for_taxonomy_abundance_script.txt

Now we can run the script to get all of the taxonomy levels for our output:

conda activate functional
python Functional_Helper_Scripts/add_6levelsTaxonomy_to_Kraken2Results.py --taxafilelist input_for_taxonomy_abundance_script.txt --outfile kraken2_abundance_matrix-6level-tax.txt

This should print out some information on what has been created to the screen. And we can take a look at the kraken2_abundance_matrix-6level-tax.txt file with the less command. Most of the information in this file isn't needed for what we're doing (but as I mentioned above, this isn't yet a very mature pipeline!).

Next, we'll convert our abundance file into the format that is needed for the next script. We'll use the stratified pathway abundance file here as, with fewer pathways than EC numbers, this will be simpler to view, although this kind of visulisation can also be done with the EC numbers file (Workshop_strat_matrix_RPKM_fixed_final.txt):

gunzip pathways_out/path_abun_strat.tsv.gz
python Functional_Helper_Scripts/convert_picrust2_to_stratified_output_format.py --filename pathways_out/path_abun_strat.tsv --outfilename stratified-file-formatted.tsv

Finally, we'll run another script to convert this table to the right format for visualisation:

python Functional_Helper_Scripts/convert_stratifiedRpkm_to_SankeyFormat.py --StratFileName stratified-file-formatted.tsv --taxaAbundFile kraken2_abundance_matrix-6level-tax.txt --outfile pathways-stratified-SankeyFormat.txt

If you take a look at this file with the less command, then it should look something like this:

Sample  Genus   Gene    Contribution
CSM79HR8        p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__NA;g__NA;s__NA      PANTO-PWY       0.0
CSM79HR8        p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__NA PANTO-PWY       0.0
CSM79HR8        p__NA;c__NA;o__NA;f__NA;g__NA;s__NA     PANTO-PWY       0.0
CSM79HR8        p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides caccae PWY-6700        0.0
CSM79HR8        p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides uniformis      PWY-6700
        0.0

We'll just copy across the metadata file to our directory:

cp ~/CourseData/Module4/Data/mgs_metadata.txt .

You can now download both the mgs_metadata.txt and the pathways-stratified-SankeyFormat.txt files to your computer. You can right click them and select "Save link as..." (or similar) to download them.

Open RStudio on the server and run:

library(shiny)
runGist("943ff5fdbd94815cc27f302d9f56ff0b")

A window should pop up and you can continue with the tutorial

You can now select the pathways-stratified-SankeyFormat.txt file for the first box and the mgs_metadata.txt file for the second box. On the "Metadata categories" dropdown you should choose "disease_state", and after a few seconds, it should automatically come up with a plot. If it initially looks like you have an error message, don't panic - you just need to choose the disease_state category before the tool can visualise the data.

You should now see "CD" (Crohn's disease) and "nonIBD" on the left hand side, taxa names in the middle and pathway names on the right. Hovering over the lines will highlight them and give you information on what they connect.

Note that there is an RPKM threshold filter of 10 currently applied - usually it is a good idea to filter out very low abundance pathways/taxa as these can add a lot of confusion to visualisations, but as we are working with small, sub-sampled samples here, it's not really necessary to apply this as everything fits on fine as it is.

Question 6: Which samples is the Streptococcus genus found in?

Question 7: Are there any pathways that are only found in one sample and not the other?