MetaPhlAn 4.2 - biobakery/MetaPhlAn GitHub Wiki

MetaPhlAn is a computational tool for species-level microbial profiling (bacteria, archaea, eukaryotes, and viruses) from metagenomic shotgun sequencing data. StrainPhlAn (available within MetaPhlAn) allows strain-level microbial population genomics.

MetaPhlAn 4, as of the database vJan25, relies on ~11M unique clade-specific marker genes (the latest marker information file can be found here) identified from >1.6M microbial genomes (~219k isolate genomes and ~1.3M metagenome-assembled genomes) spanning 58,331 species-level genome bins (SGBs, http://segatalab.cibio.unitn.it/data/Pasolli_et_al.html) (the full list of the species included in the latest database can be found here).

MetaPhlAn 4.2 allows:

unambiguous taxonomic assignments;
accurate estimation of organismal relative abundance;
SGB-level resolution for bacteria, archaea and eukaryotes;
quantification of metagenomic reads matching viruses;
metagenomic strain-level population genomics and strain tracking (with StrainPhlAn 4.1);
initial support for profiling long-read sequencing data.

The documentation below provides an overview of MetaPhlAn 4.2 basic usage. More usage examples are available in the MetaPhlAn 4.1 tutorial page; be aware that some parameters have been renamed since 4.1, so you may need to update those commands accordingly. You can find the full list of changes in the release notes.

Important changes with respect to MetaPhlAn 4.1.*

Renamed parameters

As of MetaPhlAn 4.2 some parameters have been renamed to be more general in terms of the mapping tool that is used. MetaPhlAn now relies on minimap2 for long reads and on bowtie2 for short reads, therefore:

--bowtie2out has been renamed to --mapout
--input_type now accepts mapout instead of bowtie2out
--bowtie2db has been renamed to --db_dir

Unclassified estimation as default

Starting from version 4.2, MetaPhlAn estimates the unclassified fraction of the metagenome by default. Therefore the --unclassified_estimation parameter no longer exists. The unclassified estimation can be turned off with --skip_unclassified_estimation.
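
Existing 4.1 command lines can often be updated mechanically. The following is a minimal sketch and not part of MetaPhlAn: the helper name migrate_mpa_flags is hypothetical, and it only applies the renames and the removal described above, so review the rewritten command before running it.

```shell
# Hypothetical helper (not shipped with MetaPhlAn): rewrite a 4.1-style
# command line, given as one string, to the 4.2 parameter names.
migrate_mpa_flags() {
    printf '%s\n' "$1" | sed \
        -e 's/--input_type[ =]bowtie2out/--input_type mapout/g' \
        -e 's/--bowtie2out/--mapout/g' \
        -e 's/--bowtie2db/--db_dir/g' \
        -e 's/ *--unclassified_estimation//g'
}

# Print the updated form of an old command without executing it.
migrate_mpa_flags "metaphlan s.fastq --bowtie2out s.bz2 --bowtie2db /db --unclassified_estimation -o p.txt"
```

The function only prints the new command, which keeps the migration easy to inspect.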

Recommended installation

The best way to install MetaPhlAn is through conda via the Bioconda channel. If you have not configured your Anaconda installation to fetch packages from Bioconda, please follow these steps to set up the channels.

You can install the MetaPhlAn package by running

$ conda install -c bioconda metaphlan=4.2.2

However, we recommend creating a dedicated conda environment for MetaPhlAn:

$ conda create --name mpa -c bioconda metaphlan=4.2.2

Other ways to install

If you encounter an incompatibility error with the glibc package during the installation, we suggest adding the conda-forge channel to conda or running one of the following commands.

$ conda install -c conda-forge -c bioconda metaphlan

$ conda create --name mpa -c conda-forge -c bioconda python=3.10 metaphlan

This allows having the correct version of all the dependencies isolated from the system's python installation.

Before using MetaPhlAn, you should activate the mpa environment:

$ conda activate mpa

MetaPhlAn is also available on PyPI:

$ pip install metaphlan

Alternatively, you can manually download from GitHub or clone the repository using the following command

$ git clone https://github.com/biobakery/MetaPhlAn.git

and install MetaPhlAn by running

$ pip install .

If you install MetaPhlAn this way, you will need to install some dependencies manually. MetaPhlAn requires Python 3 or newer with the numpy and Biopython libraries; these Python libraries are installed automatically by pip. MetaPhlAn relies on BowTie2 (version 2.3 or higher) to map reads against marker genes, so check that bowtie2 is present in the system path with execute and read permissions. Minimap2 (version 2.26 or higher) is required for long-read sequencing data.
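
A quick way to confirm that the external mappers are reachable is to look them up on the PATH. This is a minimal sketch: the function name check_tools is hypothetical, and it only tests for presence, not for the minimum versions mentioned above.

```shell
# Report which of the named executables are missing from PATH.
# Returns 0 if all are found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# bowtie2 is needed for short reads, minimap2 for long reads.
check_tools bowtie2 minimap2 || echo "Install the missing tools before running MetaPhlAn."
```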

Database installation

MetaPhlAn needs the clade markers and the database to be downloaded locally. To obtain them:

$ metaphlan --install

Important! The MetaPhlAn 4 database is substantially larger than those of previous versions, so running MetaPhlAn 4 requires at least 30 GB of memory.

If you have installed MetaPhlAn using Anaconda, it is advised to install the database in a folder outside the Conda environment. To do this, run

$ metaphlan --install --db_dir <database folder>

If you install the database in a different location, remember to run MetaPhlAn using --db_dir <database folder>!

By default, the latest MetaPhlAn database is downloaded and built. You can download a specific version with the --index parameter

$ metaphlan --install --index mpa_vJan21_CHOCOPhlAnSGB_202103 --db_dir <database folder>

When --index is specified, MetaPhlAn skips the check for the latest database version and runs the analysis using the database version provided by --index located in --db_dir.

This option is recommended when MetaPhlAn is run on HPC clusters or in containers.

If you have issues in downloading the database, you can get it from:

Segatalab FTP

Just download the .tar, .md5, and the mpa_latest files and place them in the metaphlan_databases folder.
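
After a manual download it is worth checking the archive against its checksum. The sketch below makes two assumptions that are not stated in this documentation: that the .md5 file stores the hex digest as its first whitespace-separated field (the usual md5sum layout), and that the md5sum utility is available. The function name verify_md5 is hypothetical.

```shell
# verify_md5 <archive> <md5 file>
# Succeeds only if the MD5 digest of the archive equals the first field
# of the .md5 file (assumed md5sum-style layout: "<digest>  <filename>").
verify_md5() {
    expected=$(awk 'NR == 1 { print $1 }' "$2")
    actual=$(md5sum "$1" | awk '{ print $1 }')
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Example with placeholder file names; replace them with the downloaded files.
# verify_md5 metaphlan_databases/<index>.tar metaphlan_databases/<index>.md5 \
#     && echo "checksum OK"
```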

Basic Usage

$ metaphlan metagenome.fastq --input_type fastq -o profiled_metagenome.txt

We highly recommend saving the intermediate BowTie2 mapping output (--mapout), which makes re-running MetaPhlAn much faster, and using multiple CPUs (--nproc) if available:

$ metaphlan metagenome.fastq --mapout metagenome.bowtie2.bz2 --nproc 5 --input_type fastq -o profiled_metagenome.txt

If you have already mapped your metagenome against the marker DB (using a previous MetaPhlAn run), you can obtain the results in a few seconds by using the previously saved --mapout file and specifying the input type (--input_type mapout):

$ metaphlan metagenome.bowtie2.bz2 --nproc 5 --input_type mapout -o profiled_metagenome.txt
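
The two commands above can be combined into one decision: reuse the saved mapping output when it exists, and map from the FASTQ otherwise. The following is a minimal sketch under stated assumptions: the function name profile_cmd and the output naming scheme are hypothetical, and the function only prints the command so that it can be inspected before it is run.

```shell
# profile_cmd <sample.fastq>
# Print the MetaPhlAn command for one sample: reuse the saved mapping
# output if present, otherwise profile the FASTQ and save the mapping.
profile_cmd() {
    fastq=$1
    base=${fastq%.fastq}
    mapout="$base.bowtie2.bz2"
    if [ -f "$mapout" ]; then
        echo "metaphlan $mapout --input_type mapout --nproc 5 -o ${base}_profile.txt"
    else
        echo "metaphlan $fastq --input_type fastq --mapout $mapout --nproc 5 -o ${base}_profile.txt"
    fi
}

# Show which command would be used for this sample.
profile_cmd metagenome.fastq
```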

Mapping output files generated with MetaPhlAn versions below 3.0 are not compatible: starting from MetaPhlAn 3, the BowTie2 output also includes the size of the profiled metagenome.

You can also provide an externally BowTie2-mapped SAM if you specify this format with --input_type. Two steps here: first map your metagenome with BowTie2 and then feed MetaPhlAn with the obtained SAM:

$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x metaphlan_databases/mpa_vJan25_CHOCOPhlAnSGB_202503 -U metagenome.fastq
$ metaphlan metagenome.sam --input_type sam -o profiled_metagenome.txt

Starting from version 4.2, MetaPhlAn estimates the unclassified fraction of the metagenome by default. If you want to scale the relative abundance profile according only to the percentage of reads mapping to a clade in the database use --skip_unclassified_estimation.

$ metaphlan metagenome.fastq --mapout metagenome.bowtie2.bz2 --nproc 5 --input_type fastq --skip_unclassified_estimation -o profiled_metagenome.txt
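
To illustrate the arithmetic behind the two scalings, the sketch below rescales a toy profile so that the classified clades alone sum to 100%. Everything here is an assumption for illustration: the two-column table (name, percentage), the made-up numbers, and the function name renormalize_classified. Real MetaPhlAn output has more columns and taxonomic levels, and this is not how MetaPhlAn computes its values internally.

```shell
# Rescale a toy two-column profile (name<TAB>percent) so that the classified
# rows sum to 100, dropping the UNCLASSIFIED row.
renormalize_classified() {
    awk -F '\t' '
        $1 == "UNCLASSIFIED" { unclassified = $2; next }
        { name[++n] = $1; value[n] = $2 }
        END {
            classified = 100 - unclassified
            for (i = 1; i <= n; i++)
                printf "%s\t%.2f\n", name[i], (classified > 0 ? value[i] * 100 / classified : 0)
        }'
}

# Made-up example: 20% unclassified, 60% species_A, 20% species_B.
printf 'UNCLASSIFIED\t20\nspecies_A\t60\nspecies_B\t20\n' | renormalize_classified
```

With these toy numbers, species_A becomes 60 * 100 / 80 = 75.00 and species_B becomes 25.00.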

MetaPhlAn can also natively handle paired-end metagenomes (but does not use the paired-end information, except for when it performs subsampling on paired reads, see below), and, more generally, metagenomes stored in multiple files (but you need to specify the --mapout parameter):

$ metaphlan metagenome_1.fastq,metagenome_2.fastq --mapout metagenome.bowtie2.bz2 --nproc 5 --input_type fastq -o profiled_metagenome.txt

It is possible to subsample the reads before the MetaPhlAn run by passing the number of reads to use (which must be less than the total number of reads in the sample) to --subsampling. The following example subsamples to 10,000 reads:

$ metaphlan metagenome.fastq --input_type fastq --subsampling 10000 -o profiled_metagenome_subsampled_10000.txt
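
Because the requested number must be below the total read count, it can help to count the reads first. This is a minimal sketch assuming an uncompressed FASTQ with exactly four lines per read; the function names count_reads and check_subsampling are hypothetical.

```shell
# count_reads <fastq>: number of reads, assuming 4 lines per record.
count_reads() {
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# check_subsampling <fastq> <n>: succeed only if n is below the read count.
check_subsampling() {
    total=$(count_reads "$1")
    if [ "$2" -lt "$total" ]; then
        echo "ok: subsampling $2 of $total reads"
    else
        echo "error: --subsampling $2 is not below the $total reads in $1" >&2
        return 1
    fi
}

# Example (uncomment once metagenome.fastq exists):
# check_subsampling metagenome.fastq 10000
```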

Since MetaPhlAn 4.1.1, it is possible to use paired-end information during subsampling (above, paired-end reads would be treated as single-end, i.e., independent). For that, use --subsampling_paired instead:

$ metaphlan --subsampling_paired <N_PAIRED_READS> -1 <R1_FASTQ> -2 <R2_FASTQ> --input_type fastq --subsampling_out <SUBSAMPLED_READS_OUTPUT> -o <METAPHLAN_OUTPUT> --mapout <MAPOUT>

Starting from version 4.2, you can use MetaPhlAn on long-read data. In this case, MetaPhlAn relies on minimap2 for mapping reads against markers. This is the minimal command to profile a long-read sample from a fastq file:

$ metaphlan metagenome.fastq --long_reads --input_type fastq --mapout <MAPOUT> -o profiled_metagenome.txt

The same rules apply when profiling long-read data from a SAM file; however, you must make sure that the mapping was performed with Minimap2.

You can provide the specific database version with --index.

By default MetaPhlAn is run with --index latest: the latest version of the database is used; if it is not available, MetaPhlAn will try to download it.

When --index is specified, MetaPhlAn skips the check for the latest database version and runs the analysis using the database version provided by --index located in --db_dir.
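
On systems without network access it can be convenient to pin --index to whatever version is already installed. The sketch below assumes, without confirmation from this documentation, that the mpa_latest file in the database folder contains the index name on its first line; the function name pinned_index is hypothetical.

```shell
# pinned_index <database folder>
# Print the index name stored in <database folder>/mpa_latest
# (assumed to hold the name on its first line); fail if the file is absent.
pinned_index() {
    db_dir=$1
    if [ -s "$db_dir/mpa_latest" ]; then
        head -n 1 "$db_dir/mpa_latest" | tr -d '[:space:]'
    else
        echo "no mpa_latest file in $db_dir" >&2
        return 1
    fi
}

# Example (DB is a placeholder for your database folder):
# metaphlan metagenome.fastq --input_type fastq \
#     --index "$(pinned_index "$DB")" --db_dir "$DB" -o profiled_metagenome.txt
```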

For advanced options and other analysis types (such as strain tracking), please refer to the full list of command-line options (metaphlan --help).

Utility Scripts

MetaPhlAn's repository features a few utility scripts to aid in the manipulation of sample output and its visualization. These scripts can be found under the utils folder in the MetaPhlAn directory.

Merging Tables

The script merge_metaphlan_tables.py combines the MetaPhlAn output of several samples into a single table of bugs (rows) vs samples (columns), listing the normalized relative abundance of each bug in each sample.

To merge multiple output files, run the script as below:

$ merge_metaphlan_tables.py metaphlan_output1.txt metaphlan_output2.txt metaphlan_output3.txt > output/merged_abundance_table.txt

Wildcards can be used as needed:

$ merge_metaphlan_tables.py metaphlan_output*.txt > output/merged_abundance_table.txt

Output files can be merged only if the profiling was performed with the same version of the MetaPhlAn database.

There is no limit to how many files you can merge.
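
A common next step is to keep only the species-level rows of the merged table. The sketch below assumes a layout that is not described on this page: tab-separated columns, comment lines starting with "#", a header row beginning with clade_name, and lineages whose levels are joined by "|" with prefixes such as s__ (species) and t__ (SGB). The function name species_rows is hypothetical.

```shell
# species_rows [file...]
# Keep the header plus rows whose lineage ends at the species level (s__),
# dropping higher ranks and SGB-level (t__) rows.
species_rows() {
    awk -F '\t' '
        /^#/ { next }                         # skip comment lines
        $1 == "clade_name" { print; next }    # keep the header row
        $1 ~ /[|]s__[^|]*$/ { print }         # last lineage level is s__
    ' "$@"
}

# Example:
# species_rows output/merged_abundance_table.txt > output/merged_species.txt
```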

Converting SGB profiles to the GTDB taxonomy

The script sgb_to_gtdb_profile.py converts an SGB-based MetaPhlAn 4 output into a GTDB-taxonomy-based profile.

To do so, run the script as below

$ sgb_to_gtdb_profile.py -i metaphlan_output.txt -o metaphlan_output_gtdb.txt

Alpha and beta diversity calculation

The script calculate_diversity.R computes alpha and/or beta diversity, with a choice of metrics, starting from a merged MetaPhlAn table. Available alpha-diversity metrics are richness, shannon, simpson, and gini. Available beta-diversity distance functions are bray-curtis, jaccard, weighted-unifrac, unweighted-unifrac, centered log-ratio, and aitchison. For example, to generate a beta-diversity distance matrix with bray-curtis, run the script as below:

$ Rscript calculate_diversity.R -f merged_mpa4_profiles.tsv -d beta -m bray-curtis
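
For intuition about the bray-curtis metric, the sketch below computes the textbook Bray-Curtis dissimilarity between two samples, sum(|x - y|) / sum(x + y), on a toy table. This is an illustration only: the function name bray_curtis, the three-column layout (name, sample 1, sample 2) and the numbers are made up, and the script's result may differ if it normalizes or transforms the table first.

```shell
# bray_curtis: read name<TAB>x<TAB>y rows on stdin and print the
# Bray-Curtis dissimilarity between the two abundance columns.
bray_curtis() {
    awk -F '\t' '
        {
            diff = $2 - $3
            if (diff < 0) diff = -diff
            num += diff          # sum of |x - y|
            den += $2 + $3       # sum of x + y
        }
        END {
            if (den > 0) printf "%.4f\n", num / den
            else print "NA"
        }'
}

# Toy example: |10-30| + |20-20| + |70-50| = 40, total = 200, so 40/200.
printf 'sp_A\t10\t30\nsp_B\t20\t20\nsp_C\t70\t50\n' | bray_curtis
# prints 0.2000
```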

To compute UniFrac distances, the SGB tree in the Newick format (available here) must be provided.

For the full list of options, please run:

$ Rscript calculate_diversity.R