Microbiome Helper 2 Annotation of reads with MMSeqs2 - LangilleLab/microbiome_helper GitHub Wiki
Authors: Robyn Wright
Modifications by: NA
Based on initial versions by: André Comeau, Dhwani Desai and Gina Filloramo (Metagenomics Standard Operating Procedure v3)
Please note: We are still testing/developing this. If you are concerned by this, please use section 4 from our previous SOP.
We will first use the MMseqs2 program to map reads to the UniRef90 protein database (other options are, or will eventually be, RefSeqComplete and COG). Note that you can see how we installed this and downloaded/constructed the UniRef90 database here. The MMseqs2 user guide has details for UniRef90 as well as others like RefSeqComplete, COG, and GTDB.
This is accomplished in multiple steps below, in which jobfiles are generated to run all samples together and top hits are selected so that taxonomy and function can be combined later.
The scripts that we use in this workflow can be found here.
Running this script generates a jobfile corresponding to each sample included in the input folder, with set parameters for running MMseqs2. It assumes the default processing pipeline above, in which the forward and reverse reads were concatenated together. If you instead assembled your reads above, you need to change cat_reads/*.fastq to kneaddata_out/*.fastq. Note that the -p 4 option below sets the number of processors that will be used when running each job/sample, so be conscious of your total available resources and of how many jobfiles you want to run at the same time (see discussion/options below).
/usr/bin/perl /home/shared/MH2/MMSeqs2_functional_annotation/run_TaxonomyFunctionSearchMegan.pl \
--mapmethod mmseqs --db /bigpool/shared/mmseqs2_db/UniRef90_2026-01/UniRef90 -p 4 \
-o mmseqs_U90_out cat_reads/*.fastq
Here you have two options, depending on the amount of computing resources available to you (processors+RAM) and the total number of samples to be run.
Option 1: Run each jobfile individually, so that one sample is processed at a time using a high number of CPUs per sample. In this case, you would increase the CPU flag in the jobfile-generation step above to something approaching your resource limit (such as -p 40 on a 48-core machine). Each sample is then classified as fast as possible. This is the option I prefer: it allows quick testing to make sure everything is working, and it gives you a good idea of how long processing your entire project will take given the size of the functional database you are using. If using this option, simply run a jobfile as follows (changing the sample ID and date information for each file):
./mmseqs-SAMPLEID-YEAR-MM-DD-HH-MM-SS-jobfile.sh
Option 2: Run all jobfiles simultaneously, so that all samples are processed at the same time using a low number of CPUs per sample. In this case, you would keep the CPU flag in the jobfile-generation step low enough that, multiplied by the total number of samples, it stays within your resource limits (such as -p 4 with 10 samples on a 48-core machine). If using this option, first make a text file listing all the jobfiles generated in the step above, then execute the file-of-files using another script:
ls -1 *YEAR-MM-DD*jobfile.sh > jobfiles.list
bash /home/shared/MH2/MMSeqs2_functional_annotation/run-mmseqs-jobfiles-mod.sh /bigpool/shared/mmseqs2_db/UniRef90_2026-01/UniRef90
Note that if you are using a database other than this one, you should change the database path in this command.
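If you prefer to stay in Python, the resource-capping logic of Option 2 can be sketched as below. This is only an illustration of the idea (the provided run-mmseqs-jobfiles-mod.sh script is the supported route); run_jobfiles and the throwaway demo jobfile are hypothetical:

```python
import os
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor

def run_jobfiles(jobfiles, max_workers):
    """Run each jobfile as its own shell process, with at most
    max_workers jobs in flight at once; return their exit codes."""
    def run_one(path):
        return subprocess.run(["bash", path]).returncode
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, jobfiles))

# Throwaway demo jobfile; in practice you would read names from jobfiles.list
demo = tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False)
demo.write("echo demo sample done\n")
demo.close()
print(run_jobfiles([demo.name], max_workers=1))  # -> [0]
os.unlink(demo.name)
```

With 4 CPUs per jobfile, max_workers=10 on a 48-core machine reproduces the Option 2 arithmetic above.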
Using either option, I prefer to then clean up the root directory by moving the now used jobfiles to a separate folder:
mkdir mmseqs_jobfiles
mv *jobfile.sh mmseqs_jobfiles/
Finally, clean up all the intermediate files in the mmseqs_U90_out folder to keep only the m8 results files or move all the m8 files to a new folder:
mkdir mmseqs_m8_files
mv mmseqs_U90_out/*.m8 mmseqs_m8_files/
One of the parameters set in the jobfiles above, --max-seqs, caps the number of hits per read at 25 in order to be comparable to default DIAMOND output. The multiple hits are also useful if you plan to use MEGAN or some other "consensus method" for assigning taxonomy to the reads instead of the Kraken2 approach we used here. For our purposes, we keep only the top functional hit/assignment for each read, using the following script:
mkdir mmseqs_U90_out_tophit
python /home/shared/MH2/MMSeqs2_functional_annotation/pick_uniref_top_hit.py --unirefm8Dir mmseqs_m8_files/ --output_path mmseqs_U90_out_tophit
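For reference, the core of top-hit selection is simple: for each query read, keep the m8 row with the highest bit score (the 12th column of BLAST tab format). A minimal sketch of that logic (pick_top_hits is a hypothetical stand-in; the actual script may break ties differently):

```python
import csv

def pick_top_hits(m8_path):
    """For each query read in a BLAST-tab (m8) file, keep the single
    hit row with the highest bit score (column 12)."""
    best = {}
    with open(m8_path) as handle:
        for row in csv.reader(handle, delimiter="\t"):
            query, bitscore = row[0], float(row[11])
            if query not in best or bitscore > best[query][1]:
                best[query] = (row, bitscore)
    return {query: row for query, (row, _) in best.items()}
```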
Now that we have both the taxonomic AND functional assignments for the reads, we can combine them and convert the counts into final normalized RPKM (Reads Per Kilobase per Million mapped reads) tables using the various ways of running the parse_TaxonomyFunction.py script, after we have prepared a master file-of-files.
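As a reminder of what RPKM normalization does: each read count is divided by the gene length in kilobases and the total mapped reads in millions, so genes of different lengths and samples of different depths become comparable. A minimal sketch of the standard formula (the actual script obtains gene lengths from the database files):

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads:
    count / (gene length in kb x total mapped reads in millions)."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# e.g. 200 reads hitting a 1,000 bp gene in a sample with 10 million mapped reads
print(rpkm(200, 1000, 10_000_000))  # -> 20.0
```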
This section assumes that you have already run Kraken 2 to generate taxonomic profiles for all of your reads (it would probably also work with a different program that gives a read-by-read output, but we haven't tested it with anything else).
Prepare a tab-delimited input file specifying the corresponding taxonomy assignment file, functional assignment file, etc. for each sample ID. This file should have the following format:
Sample1 kraken2_outraw/Sample1.kraken.txt kraken2 mmseqs_U90_out_tophit/mmseqs-Sample1-s1.m8-parsed.txt uniref mmseqs_m8_files/mmseqs-Sample1-s1.m8
Sample2 kraken2_outraw/Sample2.kraken.txt kraken2 mmseqs_U90_out_tophit/mmseqs-Sample2-s1.m8-parsed.txt uniref mmseqs_m8_files/mmseqs-Sample2-s1.m8
Here, column 1 is the sample ID, column 2 is the path to the Kraken2 classification file, column 3 is the type of taxonomic input run (kraken2 or MEGAN), column 4 is the path to the functional annotation file, column 5 is the type of functional database used (refseq, uniref or COG), and the last column is the raw output of the MMseqs2 mapping (m8 BLAST-format file). Let's call this file multi-sample-outfiles-w-m8.txt.
Open a Python console from the command line by typing in python and pressing enter. Then paste in the following code, ensuring that you change the folder names as necessary, and make sure that the file extensions for e.g. the raw kraken results match yours:
import pandas as pd

md = pd.read_csv('metadata.csv', index_col=0, header=0)
kreport_dir = 'kraken2_outraw/'
mmseqs_dir = 'mmseqs_U90_out_tophit/'
mmseqs_m8_dir = 'mmseqs_m8_files_U90/'
new_file = []
for sample in md.index.values:
    this_sample = sample+'\t'
    this_sample += kreport_dir+sample+'.kraken\t'
    this_sample += 'kraken2\t'
    this_sample += mmseqs_dir+'mmseqs-'+sample+'-s1.m8-parsed.txt\t'
    this_sample += 'uniref\t'
    this_sample += mmseqs_m8_dir+'mmseqs-'+sample+'-s1.m8\n'
    new_file.append(this_sample)
new_file[-1] = new_file[-1].replace('\n', '')
with open('multi-sample-outfiles-w-m8-Feb2026.txt', 'w') as f:
    for row in new_file:
        quiet = f.write(row)
Note that this assumes that your metadata file has your sample names in the first column. Please check the resulting file before proceeding to ensure that the file names match what is in your folders.
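One way to automate that check is a short sketch that confirms every path named in the file-of-files actually exists on disk (missing_paths is a hypothetical helper; adjust the filename to the one you wrote):

```python
import os

def missing_paths(fof_path):
    """Return every file path in the file-of-files that does not exist
    on disk. Columns 2, 4 and 6 (0-based 1, 3, 5) of each row are paths."""
    missing = []
    with open(fof_path) as handle:
        for line in handle:
            fields = line.rstrip('\n').split('\t')
            for i in (1, 3, 5):
                if not os.path.isfile(fields[i]):
                    missing.append(fields[i])
    return missing

# e.g.: missing_paths('multi-sample-outfiles-w-m8-Feb2026.txt') should return []
```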
Type quit() and press enter to exit from the Python console.
With this above file as input, we can now generate all the different versions of the final normalized tables, depending on the level of detail required.
There are a couple of options here. You can either generate only the files of interest to you using the parse_TaxonomyFunction.py script, which expects .pbz2 files (containing gene lengths and a mapping of UniRef90 IDs to EC numbers) to be in your current directory, or you can use parse_TaxonomyFunction_single.py, which generates all four of the output files shown here (stratified/unstratified, with/without EC number mapping) and requires these .pbz2 files to be in the same directory as your MMSeqs2 database, whose location you will provide. You can see some details on making these files on our page describing setting up environments for analysis, where we describe downloading the UniRef90 MMSeqs2 database and making these files.
To generate the unstratified RPKM table (i.e. no taxonomy):
python /home/shared/MH2/MMSeqs2_functional_annotation/parse_TaxonomyFunction.py --multisample multi-sample-outfiles-w-m8.txt --outputf PROJECT-unstrat-matrix-RPKM.txt --unstratified Y
To generate the fully stratified RPKM table (i.e. functions with a breakdown of each taxon within them):
python /home/shared/MH2/MMSeqs2_functional_annotation/parse_TaxonomyFunction.py --multisample multi-sample-outfiles-w-m8.txt --outputf PROJECT-strat-matrix-RPKM.txt --stratified Y
To generate the unstratified RPKM table with functions mapped to EC numbers:
python /home/shared/MH2/MMSeqs2_functional_annotation/parse_TaxonomyFunction.py --multisample multi-sample-outfiles-w-m8.txt --outputf PROJECT-unstrat-matrix-RPKM-withEC.txt --unstratified Y --map2EC Y
To generate the stratified RPKM table with functions mapped to EC numbers:
python /home/shared/MH2/MMSeqs2_functional_annotation/parse_TaxonomyFunction.py --multisample multi-sample-outfiles-w-m8.txt --outputf PROJECT-strat-matrix-RPKM-withEC.txt --stratified Y --map2EC Y
The main advantage of parse_TaxonomyFunction_single.py is that it is quicker than running the script four separate times, as above, because it only needs to gather the information once and then generates each output file separately.
python parse_TaxonomyFunction_single.py --multisample multi-sample-outfiles-w-m8.txt --outputf PROJECT --database /bigpool/shared/mmseqs2_db/UniRef90_2026-01/
We can use one of the existing PICRUSt2.0 scripts to append the EC descriptions of the functions to the tables:
add_descriptions.py -i PROJECT-unstrat-matrix-RPKM-withEC.txt -m EC -o PROJECT-unstrat-matrix-RPKM-withEC-withlabels.txt
In order for this script to work on the stratified table, you need to first split the first column of the stratified matrix (e.g. "EC:6.3.2.6|Terriglobus roseus DSM 18391 (taxid 926566)") into two separate function+taxonomy columns (for example in Excel using the | delimiter). Rename the resulting new split table PROJECT-strat-matrix-RPKM-withEC-split-for-picrust2.txt, then run:
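If you would rather do this split programmatically than in Excel, a pandas sketch along these lines should work (split_strat_column is a hypothetical helper; adjust the file names to yours):

```python
import pandas as pd

def split_strat_column(in_path, out_path):
    """Split the combined "function|taxonomy" first column of the
    stratified table into two columns and write a tab-separated file
    ready for add_descriptions.py."""
    strat = pd.read_csv(in_path, sep='\t')
    first = strat.columns[0]
    parts = strat[first].map(lambda value: value.split('|', 1))
    strat.insert(0, 'taxonomy', parts.str[1])
    strat.insert(0, 'function', parts.str[0])
    strat.drop(columns=first).to_csv(out_path, sep='\t', index=False)

# split_strat_column('PROJECT-strat-matrix-RPKM-withEC.txt',
#                    'PROJECT-strat-matrix-RPKM-withEC-split-for-picrust2.txt')
```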
add_descriptions.py -i PROJECT-strat-matrix-RPKM-withEC-split-for-picrust2.txt -m EC -o PROJECT-strat-matrix-RPKM-withEC-withlabels.txt
We can use a second PICRUSt2.0 script to calculate the MetaCyc/BioCyc pathway abundances and coverages from the unstratified table containing the EC numbers:
pathway_pipeline.py -i PROJECT-unstrat-matrix-RPKM-withEC.txt -o pathways_out --intermediate minpath_working -p 4
Under development! We are currently working on various analysis/visualization workflows based upon BURRITO, Metacoder and our own new visualization tool currently under development (due to the lack of a comprehensive tool able to handle stratified output tables).
We are currently developing an R shiny app for visualizing the stratified output. The beta version of the app is available as a github gist at https://gist.github.com/dhwanidesai/943ff5fdbd94815cc27f302d9f56ff0b
The following section will describe the steps required to process the stratified RPKM output generated in the above steps to convert it into input for the Visualizer app:
Following is a sample of the 4-column input file required:
Sample Genus Gene Contribution
31D p__Chordata;c__Mammalia;o__NA;f__NA;g__NA;s__NA EC:5.6.1.7 0.8095988800249774
31D p__Chordata;c__Mammalia;o__NA;f__NA;g__NA;s__NA EC:2.4.1.131 1.2837072456944525
31D p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__NA EC:4.1.1.31 16.230874609024266
31D p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus cristatus EC:4.1.1.31 5.8013837528095165
31D p__Actinobacteria;c__Actinobacteria;o__Corynebacteriales;f__Corynebacteriaceae;g__Corynebacterium;s__Corynebacterium matruchotii EC:4.1.1.31 7.240397583735735
31D p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus sp. 2_1_36FAA EC:4.1.1.31 6.115064692694273
31D p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus sanguinis EC:4.1.1.31 9.928303115746663
31D p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus gordonii EC:4.1.1.31 0.3053339527728727
31D p__Actinobacteria;c__Actinobacteria;o__Corynebacteriales;f__Corynebacteriaceae;g__Corynebacterium;s__Corynebacterium durum EC:4.1.1.31 0.4697140517147269
Following is a sample of the metadata file format required as input by the Visualizer.
NOTE: you can have multiple columns or categories (tab-delimited) for the samples in the metadata file; however, the first column needs to be labeled "sample_id" (REQUIRED!)
sample_id type
31D A
32Da A
34D A
35Da A
36Da A
C16 B
C18 B
C19 B
C22 B
- Convert stratified output to 4-column format

For this, we first need to convert the Kraken2 output files for each of the metagenomes into an abundance (counts) matrix. This file will also be useful for the Metacoder visualization (https://rpubs.com/dhwanikdesai/736732).
First, we need to get a list of the sample tags and the corresponding Kraken2 output files (such as in the example below):
31D kraken2_outraw_0.05/31D.kraken.txt
32Da kraken2_outraw_0.05/32Da.kraken.txt
34D kraken2_outraw_0.05/34D.kraken.txt
35Da kraken2_outraw_0.05/35Da.kraken.txt
36Da kraken2_outraw_0.05/36Da.kraken.txt
C16 kraken2_outraw_0.05/C16.kraken.txt
C18 kraken2_outraw_0.05/C18.kraken.txt
C19 kraken2_outraw_0.05/C19.kraken.txt
C22 kraken2_outraw_0.05/C22.kraken.txt
C23 kraken2_outraw_0.05/C23.kraken.txt
This can be done by cutting the sample_id column from the metadata file (if your metadata file has a header row, drop it, e.g. by piping through tail -n +2, so that the tags line up with the file list). Next, make a list of the Kraken2 output files for each sample (generated in step 3.1), and then paste the two together:
cut -f 1 <project>_metadata.txt > sample_tags.txt
ls -1 kraken2_outraw/* > kraken2_results.txt
paste sample_tags.txt kraken2_results.txt > input_for_taxonomy_abundance_script.txt
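Because the two lists are pasted together positionally, it is worth confirming that each tag landed next to its own Kraken2 file before proceeding. A quick heuristic sketch (check_pairing is a hypothetical helper, and a tag merely appearing in the file name is not a perfect guarantee):

```python
import os

def check_pairing(pairs_path):
    """Flag rows where the sample tag does not appear in its Kraken2
    file name, a common sign that the two pasted lists are out of sync."""
    bad = []
    with open(pairs_path) as handle:
        for line in handle:
            tag, path = line.rstrip('\n').split('\t')
            if tag not in os.path.basename(path):
                bad.append((tag, path))
    return bad

# e.g.: check_pairing('input_for_taxonomy_abundance_script.txt') should return []
```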
Once this input file is ready, run the add_6levelsTaxonomy_to_Kraken2Results.py Python script (you will need the ete3 package installed):
python /home/shared/MH2/MMSeqs2_functional_annotation/add_6levelsTaxonomy_to_Kraken2Results.py --taxafilelist input_for_taxonomy_abundance_script.txt --outfile PROJECT-kraken2_abundance_matrix-6level-tax.txt
Finally, run the following script to convert the stratified RPKM output file generated in step 5.2 into the 4-column format required by the Visualizer app:
python /home/shared/MH2/MMSeqs2_functional_annotation/convert_stratifiedRpkm_to_SankeyFormat.py --StratFileName PROJECT-strat-matrix-RPKM.txt --taxaAbundFile PROJECT-kraken2_abundance_matrix-6level-tax.txt --outfile PROJECT-rpkm-stratified-SankeyFormat.txt
- Run the visualizer app from GitHub Gist

The visualizer app Jarrvis (Just Another stRatified Rpkm VISualizer) is available to run locally from the GitHub gist at https://gist.github.com/dhwanidesai/943ff5fdbd94815cc27f302d9f56ff0b.
To run this app you will need a local install of RStudio (R version 4) with the shiny package. Once RStudio is installed, open it and run the following commands:
library(shiny)
runGist("943ff5fdbd94815cc27f302d9f56ff0b")
This should bring up the app interface. Here you can upload the stratified RPKM file (in the 4-column format that we generated above) and the metadata file.
The app has options for collapsing the RPKM table at the desired level of taxonomy and the desired RPKM threshold for visualization.
Clicking on the Sample group nodes will highlight the linkages to the taxonomy nodes and the corresponding function nodes. Hovering over nodes or links will display the underlying RPKM value.
You can also select the format to download the resulting plot (png, pdf or jpeg).