Microbiome Helper 2 Annotation with MetaPhlAn - LangilleLab/microbiome_helper GitHub Wiki
Authors: Robyn Wright, Modifications by: NA
Please note: We are still testing/developing this so use with caution :)
Introduction
Another tool that is commonly used for taxonomic annotation of metagenomic sequences is MetaPhlAn. It differs from Kraken2 in that it uses a database of marker genes rather than a collection of genomes, and it identifies only these marker genes within our reads rather than trying to classify every read. It then estimates the abundance of each taxon it identified within our whole samples, but it's important to remember that this is an estimate, and not the actual number of reads classified. MetaPhlAn 4 is the current version of this tool. It is fairly extensively documented on its own Wiki pages, so we will just give an overview of how we tend to run it here.
1. Get the database
Get the clade markers:
metaphlan --install --bowtie2db metaphlan4_db
Note that while it is not necessary to give the --bowtie2db option, we recommend that you do, as MetaPhlAn sometimes seems to struggle to find the database if it's in the conda folders somewhere.
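Before moving on, it can be worth confirming that the install step actually put something in the database folder. The snippet below is only a minimal sketch: db_looks_installed is a hypothetical helper name of ours, not part of MetaPhlAn, and it only checks that the folder exists and is not empty, not that the download is complete or valid.

```shell
# Hypothetical helper (not part of MetaPhlAn): succeeds only if the given
# directory exists and contains at least one entry.
db_looks_installed() {
    local db_dir="$1"
    [ -d "$db_dir" ] && [ -n "$(ls -A "$db_dir" 2>/dev/null)" ]
}

# Example use (not run here):
#   db_looks_installed metaphlan4_db || echo "metaphlan4_db is missing or empty" >&2
```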
2. Run MetaPhlAn
First, make an output folder:
mkdir metaphlan_out
And then run MetaPhlAn:
parallel -j 1 --eta 'metaphlan --bowtie2db metaphlan4_db --input_type fastq -o metaphlan_out/{/.}.mpa {} --nproc 2' ::: kneaddata_out/*.fastq
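In the command above, {} is replaced by each input file and {/.} by that file's name with the folder and the last extension removed, so kneaddata_out/sampleA.fastq is written to metaphlan_out/sampleA.mpa. If you want to preview this mapping before running anything, the sketch below mimics it in plain shell. The function name mpa_output_name and the file name sampleA are ours and purely illustrative.

```shell
# Mimic GNU parallel's {/.} replacement string: strip the directory,
# then strip the last extension, and build the MetaPhlAn output path.
mpa_output_name() {
    local base
    base=$(basename "$1")
    printf 'metaphlan_out/%s.mpa\n' "${base%.*}"
}

# Preview the input -> output mapping without running MetaPhlAn:
#   for f in kneaddata_out/*.fastq; do echo "$f -> $(mpa_output_name "$f")"; done
# GNU parallel can also print the commands it would run if you add --dry-run.
```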
Similar to Kraken2, MetaPhlAn will output individual files for each sample. We can use a utility script from MetaPhlAn to merge our outputs into one table.
merge_metaphlan_tables.py metaphlan_out/*.mpa > metaphlan_merged_out.txt
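Because the merge step simply combines whatever .mpa files are present, a sample that failed part-way through can go unnoticed. A rough sketch of a check you could run before merging is shown below. It assumes the folder names used on this page and that inputs end in .fastq; missing_outputs is a hypothetical helper name, not a MetaPhlAn utility.

```shell
# Hypothetical helper: print the name of every input sample that has no
# non-empty .mpa file in the output folder.
missing_outputs() {
    local in_dir="$1"
    local out_dir="$2"
    local f base
    for f in "$in_dir"/*.fastq; do
        [ -e "$f" ] || continue              # no inputs matched the pattern
        base=$(basename "$f" .fastq)
        [ -s "$out_dir/$base.mpa" ] || echo "$base"
    done
}

# Example use (not run here):
#   missing_outputs kneaddata_out metaphlan_out
```

If this prints nothing, every input file has a matching, non-empty output file.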
If we view this new combined table, we will see three key things:
- First, the output format is different from that of Kraken2: in the MetaPhlAn table, the full taxonomic lineage is written out on every line.
- Second, MetaPhlAn only outputs relative abundances for each taxonomic node, whereas Kraken2 (before re-analysis with Bracken) will output absolute numbers of reads assigned to each taxonomic node.
- Third, the number of taxa that MetaPhlAn finds is much smaller than the number Kraken2 finds. This is partially due to us using a low confidence threshold with Kraken2, but the discrepancy between the two tools tends to hold true at higher confidence thresholds as well. See our paper for more info about how these tools perform compared to each other.
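Because every line carries its full lineage, the merged table is easy to filter to a single rank with standard command-line tools. The sketch below pulls out and counts the species-level rows. It assumes species rows contain s__, that rows below species level contain t__, and that the header line starts with clade_name, so check these against your own table. The function names species_rows and count_species are ours.

```shell
# Keep the header plus species-level rows, drop rows below species level,
# and shorten each lineage to its last (species) label.
species_rows() {
    grep -E 'clade_name|s__' "$1" | grep -v 't__' | sed 's/^.*|//'
}

# Count species-level rows (lines with s__ but without t__).
count_species() {
    grep 's__' "$1" | grep -vc 't__'
}

# Example use (not run here):
#   species_rows metaphlan_merged_out.txt > metaphlan_species.txt
#   count_species metaphlan_merged_out.txt
```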