Microbiome Helper 2 Annotation of reads with MMSeqs2 - LangilleLab/microbiome_helper GitHub Wiki
Authors: Robyn Wright
Modifications by: NA
Based on initial versions by: André Comeau, Dhwani Desai and Gina Filloramo (Metagenomics Standard Operating Procedure v3)
Please note: We are still testing/developing this. If you are concerned by this, please use section 4 from our previous SOP.
We will first use the MMseqs2 program to map reads to the UniRef90 protein database (other options are, or will eventually be, RefSeqComplete and COG). Note that you can see how we installed this and downloaded/constructed the UniRef90 database here. The MMseqs2 user guide has details for UniRef90 as well as others like RefSeqComplete, COG, and GTDB.
This is accomplished in multiple steps below, in which jobfiles are generated to run all samples together and top hits are selected so that taxonomy and function can be combined later.
The scripts that we use in this workflow can be found here.
Running this script generates a jobfile corresponding to each sample included in the input folder, with set parameters for running MMseqs2. It assumes the default processing pipeline above, in which the forward and reverse reads were concatenated together. If you instead assembled your reads above, you need to change cat_reads/*.fastq to kneaddata_out/*.fastq. Note that the -p 4 option below sets the number of processors that will be used when running each job/sample, so be conscious of your total available resources and of how many jobfiles you want to run at the same time (see discussion/options below).
/usr/bin/perl /home/shared/MH2/MMSeqs2_functional_annotation/run_TaxonomyFunctionSearchMegan.pl \
--mapmethod mmseqs --db /bigpool/shared/mmseqs2_db/UniRef90_2026-01/UniRef90 -p 4 \
-o mmseqs_U90_out cat_reads/*.fastq
Here you have two options, depending on the amount of computing resources available to you (processors+RAM) and the total number of samples to be run.
Option 1: Run each jobfile individually, so that one sample is processed at a time using a high number of CPUs per sample. In this case, you would increase the CPU flag in the jobfile-generation step above to something approaching your resource limit (such as -p 40 on a 48-core machine). Each sample is then classified as fast as possible. This is the option I prefer: it allows quick testing to make sure everything is working, and it gives you a good idea of how long processing your entire project will take given the size of the functional database you are using. If using this option, simply run a jobfile as follows (changing the sample ID and date information for each file):
./mmseqs-SAMPLEID-YEAR-MM-DD-HH-MM-SS-jobfile.sh
Option 2: Run all jobfiles simultaneously, so that all samples are processed at the same time using a low number of CPUs per sample. In this case, you would keep the CPU flag in the jobfile-generation step low enough that, multiplied by the total number of samples, it stays within your resource limits (such as -p 4 with 10 samples on a 48-core machine). If using this option, first make a text file listing all the jobfiles generated in the step above, then execute the file-of-files using another script:
ls -1 *YEAR-MM-DD*jobfile.sh > jobfiles.list
bash /home/shared/MH2/MMSeqs2_functional_annotation/run-mmseqs-jobfiles-mod.sh /bigpool/shared/mmseqs2_db/UniRef90_2026-01/UniRef90
Note that if you are using a database other than this one, you should change the database path in this command.
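If you prefer to stay in Python, the resource-capping logic of Option 2 can be sketched as below. This is only an illustration of the idea (the provided run-mmseqs-jobfiles-mod.sh script is the supported route); run_jobfiles and the throwaway demo jobfile are hypothetical:

```python
import os
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor

def run_jobfiles(jobfiles, max_workers):
    """Run each jobfile as its own shell process, with at most
    max_workers jobs in flight at once; return their exit codes."""
    def run_one(path):
        return subprocess.run(["bash", path]).returncode
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, jobfiles))

# Throwaway demo jobfile; in practice you would read names from jobfiles.list
demo = tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False)
demo.write("echo demo sample done\n")
demo.close()
print(run_jobfiles([demo.name], max_workers=1))  # -> [0]
os.unlink(demo.name)
```

With 4 CPUs per jobfile, max_workers=10 on a 48-core machine reproduces the Option 2 arithmetic above.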
Using either option, I prefer to then clean up the root directory by moving the now used jobfiles to a separate folder:
mkdir mmseqs_jobfiles
mv *jobfile.sh mmseqs_jobfiles/
Finally, clean up all the intermediate files in the mmseqs_U90_out folder to keep only the m8 results files or move all the m8 files to a new folder:
mkdir mmseqs_m8_files
mv mmseqs_U90_out/*.m8 mmseqs_m8_files/
One of the parameters set in the jobfiles above, --max-seqs, caps the number of hits per read at 25 in order to be comparable to default DIAMOND output. The multiple hits are also useful if you plan to use MEGAN or some other "consensus method" for assigning taxonomy to the reads instead of the Kraken2 approach we used here. For our purposes, we keep only the top functional hit/assignment for each read, using the following script:
mkdir mmseqs_U90_out_tophit
python /home/shared/MH2/MMSeqs2_functional_annotation/pick_uniref_top_hit.py --unirefm8Dir mmseqs_m8_files/ --output_path mmseqs_U90_out_tophit
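For reference, the core of top-hit selection is simple: for each query read, keep the m8 row with the highest bit score (the 12th column of BLAST tab format). A minimal sketch of that logic (pick_top_hits is a hypothetical stand-in; the actual script may break ties differently):

```python
import csv

def pick_top_hits(m8_path):
    """For each query read in a BLAST-tab (m8) file, keep the single
    hit row with the highest bit score (column 12)."""
    best = {}
    with open(m8_path) as handle:
        for row in csv.reader(handle, delimiter="\t"):
            query, bitscore = row[0], float(row[11])
            if query not in best or bitscore > best[query][1]:
                best[query] = (row, bitscore)
    return {query: row for query, (row, _) in best.items()}
```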
Now that we have both the taxonomic AND functional assignments for the reads, we can combine them and convert the counts into final normalized RPKM (Reads Per Kilobase per Million mapped reads) tables using the various ways of running the parse_TaxonomyFunction.py script, after we have prepared a master file-of-files.
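As a reminder of what RPKM normalization does: each read count is divided by the gene length in kilobases and the total mapped reads in millions, so genes of different lengths and samples of different depths become comparable. A minimal sketch of the standard formula (the actual script obtains gene lengths from the database files):

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads:
    count / (gene length in kb x total mapped reads in millions)."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# e.g. 200 reads hitting a 1,000 bp gene in a sample with 10 million mapped reads
print(rpkm(200, 1000, 10_000_000))  # -> 20.0
```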
This section assumes that you have already run Kraken 2 to generate taxonomic profiles for all of your reads (it would probably also work with a different program that gives a read-by-read output, but we haven't tested it with anything else).
Prepare a tab-delimited input file specifying the corresponding taxonomy assignment file, functional assignment file, etc. for each sample ID. This file should have the following format:
Sample1 kraken2_outraw/Sample1.kraken.txt kraken2 mmseqs_U90_out_tophit/mmseqs-Sample1-s1.m8-parsed.txt uniref mmseqs_m8_files/mmseqs-Sample1-s1.m8
Sample2 kraken2_outraw/Sample2.kraken.txt kraken2 mmseqs_U90_out_tophit/mmseqs-Sample2-s1.m8-parsed.txt uniref mmseqs_m8_files/mmseqs-Sample2-s1.m8
Here, column 1 is the sample ID, column 2 is the path to the Kraken2 classification file, column 3 is the type of taxonomic input run (kraken2 or MEGAN), column 4 is the path to the functional annotation file, column 5 is the type of functional database used (refseq, uniref or COG), and the last column is the raw output of the MMseqs2 mapping (m8 BLAST-format file). Let's call this file multi-sample-outfiles-w-m8.txt.
Open a Python console from the command line by typing in python and pressing enter. Then paste in the following code, ensuring that you change the folder names as necessary, and make sure that the file extensions for e.g. the raw kraken results match yours:
import pandas as pd

md = pd.read_csv('metadata.csv', index_col=0, header=0)
kreport_dir = 'kraken2_outraw/'
mmseqs_dir = 'mmseqs_U90_out_tophit/'
mmseqs_m8_dir = 'mmseqs_m8_files_U90/'
new_file = []
for sample in md.index.values:
    this_sample = sample+'\t'
    this_sample += kreport_dir+sample+'.kraken\t'
    this_sample += 'kraken2\t'
    this_sample += mmseqs_dir+'mmseqs-'+sample+'-s1.m8-parsed.txt\t'
    this_sample += 'uniref\t'
    this_sample += mmseqs_m8_dir+'mmseqs-'+sample+'-s1.m8\n'
    new_file.append(this_sample)
new_file[-1] = new_file[-1].replace('\n', '')
with open('multi-sample-outfiles-w-m8-Feb2026.txt', 'w') as f:
    for row in new_file:
        quiet = f.write(row)
Note that this assumes that your metadata file has your sample names in the first column. Please check the resulting file before proceeding to ensure that the file names match what is in your folders.
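One way to automate that check is a short sketch that confirms every path named in the file-of-files actually exists on disk (missing_paths is a hypothetical helper; adjust the filename to the one you wrote):

```python
import os

def missing_paths(fof_path):
    """Return every file path in the file-of-files that does not exist
    on disk. Columns 2, 4 and 6 (0-based 1, 3, 5) of each row are paths."""
    missing = []
    with open(fof_path) as handle:
        for line in handle:
            fields = line.rstrip('\n').split('\t')
            for i in (1, 3, 5):
                if not os.path.isfile(fields[i]):
                    missing.append(fields[i])
    return missing

# e.g.: missing_paths('multi-sample-outfiles-w-m8-Feb2026.txt') should return []
```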
Type quit() and press enter to exit from the Python console.
With this above file as input, we can now generate all the different versions of the final normalized tables, depending on the level of detail required.
There are a couple of options here. You can either generate only the files of interest to you using the parse_TaxonomyFunction.py script, which expects .pbz2 files (containing gene lengths and a mapping of UniRef90 IDs to EC numbers) to be in your current directory, or you can use parse_TaxonomyFunction_single.py, which generates all four of the output files shown here (stratified/unstratified, with/without EC number mapping) and requires these .pbz2 files to be in the same directory as your MMSeqs2 database, whose location you will provide. You can see some details on making these files on our page describing setting up environments for analysis, where we describe downloading the UniRef90 MMSeqs2 database and making these files.
To generate the unstratified RPKM table (i.e. no taxonomy):
python /home/shared/MH2/MMSeqs2_functional_annotation/parse_TaxonomyFunction.py --multisample multi-sample-outfiles-w-m8.txt --outputf PROJECT-unstrat-matrix-RPKM.txt --unstratified Y
To generate the fully stratified RPKM table (i.e. functions with a breakdown of each taxon within them):
python /home/shared/MH2/MMSeqs2_functional_annotation/parse_TaxonomyFunction.py --multisample multi-sample-outfiles-w-m8.txt --outputf PROJECT-strat-matrix-RPKM.txt --stratified Y
To generate the unstratified RPKM table with functions mapped to EC numbers:
python /home/shared/MH2/MMSeqs2_functional_annotation/parse_TaxonomyFunction.py --multisample multi-sample-outfiles-w-m8.txt --outputf PROJECT-unstrat-matrix-RPKM-withEC.txt --unstratified Y --map2EC Y
To generate the stratified RPKM table with functions mapped to EC numbers:
python /home/shared/MH2/MMSeqs2_functional_annotation/parse_TaxonomyFunction.py --multisample multi-sample-outfiles-w-m8.txt --outputf PROJECT-strat-matrix-RPKM-withEC.txt --stratified Y --map2EC Y
The main advantage of parse_TaxonomyFunction_single.py is that it is quicker than running the script four separate times, as above, because it only needs to gather the information once and then generates each output file separately.
python parse_TaxonomyFunction_single.py --multisample multi-sample-outfiles-w-m8.txt --outputf PROJECT --database /bigpool/shared/mmseqs2_db/UniRef90_2026-01/
We can use one of the existing PICRUSt2.0 scripts to append the EC descriptions of the functions to the tables:
add_descriptions.py -i PROJECT-unstrat-matrix-RPKM-withEC.txt -m EC -o PROJECT-unstrat-matrix-RPKM-withEC-withlabels.txt
In order for this script to work on the stratified table, you need to first split the first column of the stratified matrix (e.g. "EC:6.3.2.6|Terriglobus roseus DSM 18391 (taxid 926566)") into two separate function+taxonomy columns (for example in Excel using the | delimiter). Rename the resulting new split table PROJECT-strat-matrix-RPKM-withEC-split-for-picrust2.txt, then run:
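If you would rather do this split programmatically than in Excel, a pandas sketch along these lines should work (split_strat_column is a hypothetical helper; adjust the file names to yours):

```python
import pandas as pd

def split_strat_column(in_path, out_path):
    """Split the combined "function|taxonomy" first column of the
    stratified table into two columns and write a tab-separated file
    ready for add_descriptions.py."""
    strat = pd.read_csv(in_path, sep='\t')
    first = strat.columns[0]
    parts = strat[first].map(lambda value: value.split('|', 1))
    strat.insert(0, 'taxonomy', parts.str[1])
    strat.insert(0, 'function', parts.str[0])
    strat.drop(columns=first).to_csv(out_path, sep='\t', index=False)

# split_strat_column('PROJECT-strat-matrix-RPKM-withEC.txt',
#                    'PROJECT-strat-matrix-RPKM-withEC-split-for-picrust2.txt')
```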
add_descriptions.py -i PROJECT-strat-matrix-RPKM-withEC-split-for-picrust2.txt -m EC -o PROJECT-strat-matrix-RPKM-withEC-withlabels.txt
We can use a second PICRUSt2.0 script to calculate the MetaCyc/BioCyc pathway abundances and coverages from the unstratified table containing the EC numbers:
pathway_pipeline.py -i PROJECT-unstrat-matrix-RPKM-withEC.txt -o pathways_out --intermediate minpath_working -p 4
Under development! We are currently working on various analysis/visualization workflows based upon BURRITO, Metacoder and our own new visualization tool currently under development (due to the lack of a comprehensive tool able to handle stratified output tables).
We are currently developing an R shiny app for visualizing the stratified output. The beta version of the app is available as a github gist at https://gist.github.com/dhwanidesai/943ff5fdbd94815cc27f302d9f56ff0b
The following section will describe the steps required to process the stratified RPKM output generated in the above steps to convert it into input for the Visualizer app:
Following is a sample of the 4-column input file required:
Sample Genus Gene Contribution
31D p__Chordata;c__Mammalia;o__NA;f__NA;g__NA;s__NA EC:5.6.1.7 0.8095988800249774
31D p__Chordata;c__Mammalia;o__NA;f__NA;g__NA;s__NA EC:2.4.1.131 1.2837072456944525
31D p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__NA EC:4.1.1.31 16.230874609024266
31D p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus cristatus EC:4.1.1.31 5.8013837528095165
31D p__Actinobacteria;c__Actinobacteria;o__Corynebacteriales;f__Corynebacteriaceae;g__Corynebacterium;s__Corynebacterium matruchotii EC:4.1.1.31 7.240397583735735
31D p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus sp. 2_1_36FAA EC:4.1.1.31 6.115064692694273
31D p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus sanguinis EC:4.1.1.31 9.928303115746663
31D p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus gordonii EC:4.1.1.31 0.3053339527728727
31D p__Actinobacteria;c__Actinobacteria;o__Corynebacteriales;f__Corynebacteriaceae;g__Corynebacterium;s__Corynebacterium durum EC:4.1.1.31 0.4697140517147269
Following is a sample of the metadata file format required as input by the Visualizer.
NOTE: you can have multiple columns or categories (tab-delimited) for the samples in the metadata file; however, the first column needs to be labeled "sample_id" (REQUIRED!)
sample_id type
31D A
32Da A
34D A
35Da A
36Da A
C16 B
C18 B
C19 B
C22 B
- Convert stratified output to 4-column format

For this, we first need to convert the Kraken2 output files for each of the metagenomes into an abundance (counts) matrix. This file will also be useful for the Metacoder visualization (https://rpubs.com/dhwanikdesai/736732).
First, we need to get a list of the sample tags and the corresponding Kraken2 output files (such as in the example below):
31D kraken2_outraw_0.05/31D.kraken.txt
32Da kraken2_outraw_0.05/32Da.kraken.txt
34D kraken2_outraw_0.05/34D.kraken.txt
35Da kraken2_outraw_0.05/35Da.kraken.txt
36Da kraken2_outraw_0.05/36Da.kraken.txt
C16 kraken2_outraw_0.05/C16.kraken.txt
C18 kraken2_outraw_0.05/C18.kraken.txt
C19 kraken2_outraw_0.05/C19.kraken.txt
C22 kraken2_outraw_0.05/C22.kraken.txt
C23 kraken2_outraw_0.05/C23.kraken.txt
This can be done by cutting the sample_id column from the metadata file (if your metadata file has a header row, drop it, e.g. by piping through tail -n +2, so that the tags line up with the file list). Next, make a list of the Kraken2 output files for each sample (generated in step 3.1), and then paste the two together:
cut -f 1 <project>_metadata.txt > sample_tags.txt
ls -1 kraken2_outraw/* > kraken2_results.txt
paste sample_tags.txt kraken2_results.txt > input_for_taxonomy_abundance_script.txt
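Because the two lists are pasted together positionally, it is worth confirming that each tag landed next to its own Kraken2 file before proceeding. A quick heuristic sketch (check_pairing is a hypothetical helper, and a tag merely appearing in the file name is not a perfect guarantee):

```python
import os

def check_pairing(pairs_path):
    """Flag rows where the sample tag does not appear in its Kraken2
    file name, a common sign that the two pasted lists are out of sync."""
    bad = []
    with open(pairs_path) as handle:
        for line in handle:
            tag, path = line.rstrip('\n').split('\t')
            if tag not in os.path.basename(path):
                bad.append((tag, path))
    return bad

# e.g.: check_pairing('input_for_taxonomy_abundance_script.txt') should return []
```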
Once this input file is ready, run the add_6levelsTaxonomy_to_Kraken2Results.py Python script (you will need the ete3 package installed):
python /home/shared/MH2/MMSeqs2_functional_annotation/add_6levelsTaxonomy_to_Kraken2Results.py --taxafilelist input_for_taxonomy_abundance_script.txt --outfile PROJECT-kraken2_abundance_matrix-6level-tax.txt
Finally, run the following script to convert the stratified RPKM output file generated in step 5.2 into the 4-column format required by the Visualizer app:
python /home/shared/MH2/MMSeqs2_functional_annotation/convert_stratifiedRpkm_to_SankeyFormat.py --StratFileName PROJECT-strat-matrix-RPKM.txt --taxaAbundFile PROJECT-kraken2_abundance_matrix-6level-tax.txt --outfile PROJECT-rpkm-stratified-SankeyFormat.txt
- Run the visualizer app from GitHub Gist

The visualizer app Jarrvis (Just Another stRatified Rpkm VISualizer) is available to run locally from the GitHub gist at https://gist.github.com/dhwanidesai/943ff5fdbd94815cc27f302d9f56ff0b.
To run this app you will need a local install of RStudio (R version 4) with the shiny package. Once RStudio is installed, open it and run the following commands:
library(shiny)
runGist("943ff5fdbd94815cc27f302d9f56ff0b")
This should bring up the app interface. Here you can upload the stratified RPKM file (in the 4-column format that we generated above) and the metadata file.
The app has options for collapsing the RPKM table at the desired level of taxonomy and the desired RPKM threshold for visualization.
Clicking on the Sample group nodes will highlight the linkages to the taxonomy nodes and the corresponding function nodes. Hovering over nodes or links will display the underlying RPKM value.
You can also select the format to download the resulting plot (png, pdf or jpeg).