MetaPhlAn and HUMAnN output files - quadram-institute-bioscience/gmh-sops GitHub Wiki
MetaPhlAn
MetaPhlAn is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea and Eukaryotes) from metagenomic shotgun sequencing data (i.e. not 16S) with species-level resolution. For further information about installing and running the pipeline, see Metaphlan3.
Output Files
MetaPhlAn produces the following output files for the shotgun metagenomic reads of a sample:
- $SAMPLE.bowtie2out.txt: This is an intermediate file containing the mapping results to unique sequence markers. Alignments are listed one per line in tab-separated columns of read and reference marker.
- $SAMPLE_profile.txt or $SAMPLE_metaphlan_bugs_list.tsv: This is the main output file of interest. It contains the final computed organism abundances, listed one clade per line, with the clade name tab-separated from its percent abundance:
#mpa_v30_CHOCOPhlAn_201901
#/opt/software/humann/bin/metaphlan ~/tmp0i7t6gma -x mpa_v30_CHOCOPhlAn_201901 --bowtie2db ~/mpa/mpa_v30_CHOCOPhlAn_201901/ --o ~/$SAMPLE_metaphlan_bugs_list.tsv --input_type fastq --bowtie2out ~/SMAPLE/$SAMPLE_humann_temp/$SAMPLE_metaphlan_bowtie2.txt --nproc 16
#SampleID Metaphlan_Analysis
#SampleID $SAMPLE
#clade_name NCBI_tax_id relative_abundance additional_species
k__Bacteria 2 99.9472
k__Archaea 2157 0.0528
k__Bacteria|p__Firmicutes 2|1239 76.3641
k__Bacteria|p__Bacteroidetes 2|976 22.03415
k__Bacteria|p__Actinobacteria 2|201174 1.46012
k__Bacteria|p__Proteobacteria 2|1224 0.08884
k__Archaea|p__Euryarchaeota 2157|28890 0.0528
k__Bacteria|p__Firmicutes|c__Clostridia 2|1239|186801 69.53105
k__Bacteria|p__Bacteroidetes|c__Bacteroidia 2|976|200643 22.03415
k__Bacteria|p__Firmicutes|c__Bacilli 2|1239|91061 4.40339
k__Bacteria|p__Firmicutes|c__Negativicutes 2|1239|909932 2.3387
k__Bacteria|p__Actinobacteria|c__Actinobacteria 2|201174|1760 1.02661
k__Bacteria|p__Actinobacteria|c__Coriobacteriia 2|201174|84998 0.43351
clade_name: The clade names range from taxonomic kingdom (Bacteria, Archaea, etc.) through species. Each clade in the lineage is prefixed to indicate its taxonomic level: Kingdom: k__, Phylum: p__, Class: c__, Order: o__, Family: f__, Genus: g__, Species: s__.
For example: k__Bacteria|p__Actinobacteria|c__Coriobacteriia
NCBI_tax_id: The NCBI taxonomy ID of each clade in the lineage, one per taxonomic level, separated by "|" in the same order as in clade_name.
For example: k__Bacteria|p__Actinobacteria|c__Coriobacteriia 2|201174|84998
relative_abundance: The relative abundance (in percent) of the clade at the lowest taxonomic level named on the line.
For example: k__Bacteria|p__Actinobacteria|c__Coriobacteriia 2|201174|84998 0.43351
Here, the relative abundance of the class Coriobacteriia in the sample is 0.43351%.
Note: Each taxonomic level will sum to 100%; that is, the sum of all kingdom-level clades is 100%, the sum of all genus-level clades (including unclassified) is also 100%, and so forth.
For detailed explanation refer here.
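The column layout and the "each level sums to 100%" note above can be checked with a few lines of Python. This is a minimal sketch, not part of MetaPhlAn: the helper names (`parse_profile`, `rank_of`, `totals_by_rank`) are made up for illustration, and it assumes a tab-separated profile whose header lines start with `#`. The rows are copied from the example profile above.

```python
from collections import defaultdict

# Rows copied from the example profile above (tab-separated, as in the real file).
PROFILE = """\
#clade_name\tNCBI_tax_id\trelative_abundance\tadditional_species
k__Bacteria\t2\t99.9472
k__Archaea\t2157\t0.0528
k__Bacteria|p__Firmicutes\t2|1239\t76.3641
k__Bacteria|p__Bacteroidetes\t2|976\t22.03415
k__Bacteria|p__Actinobacteria\t2|201174\t1.46012
k__Bacteria|p__Proteobacteria\t2|1224\t0.08884
k__Archaea|p__Euryarchaeota\t2157|28890\t0.0528
"""


def parse_profile(text):
    """Return a list of (clade_name, tax_id, abundance) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and '#' header lines
        fields = line.split("\t")
        rows.append((fields[0], fields[1], float(fields[2])))
    return rows


def rank_of(clade_name):
    """Return the rank prefix of the lowest level in the lineage, e.g. 'p'."""
    return clade_name.split("|")[-1].split("__")[0]


def totals_by_rank(rows):
    """Sum the relative abundances of all clades reported at each rank."""
    totals = defaultdict(float)
    for clade, _tax_id, abundance in rows:
        totals[rank_of(clade)] += abundance
    return dict(totals)


rows = parse_profile(PROFILE)
totals = totals_by_rank(rows)
for rank, total in totals.items():
    print(rank, round(total, 3))
```

Both the kingdom-level and phylum-level totals come out at approximately 100, up to rounding in the reported values.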
Combining tables
The script merge_metaphlan_tables.py merges the MetaPhlAn output from several samples into one table of bugs (rows) vs samples (columns), listing the normalised relative abundance of each bug in each sample. https://github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-3.0#merging-tables
How to run
$ merge_metaphlan_tables.py Sample1_profile.txt Sample2_profile.txt Sample3_profile.txt Sample4_profile.txt > output/merged_abundance_table.txt
Output:
#mpa_v30_CHOCOPhlAn_201901
clade_name NCBI_tax_id Sample1_profile Sample2_profile Sample3_profile Sample4_profile
k__Archaea 2157 0 0 0.0225 0.0528
k__Archaea|p__Euryarchaeota 2157|28890 0 0 0.0225 0.0528
k__Archaea|p__Euryarchaeota|c__Methanobacteria 2157|28890|183925 0 0 0.0225 0.0528
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales 2157|28890|183925|2158 0 0 0.0225 0.0528
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae 2157|28890|183925|2158|2159 0 0 0.0225 0.0528
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter 2157|28890|183925|2158|2159|2172 0 0 0.0225 0.0528
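The merged table above is, conceptually, an outer join of the per-sample profiles in which a clade missing from a sample is reported as 0. The sketch below illustrates that idea only; it is not the implementation of merge_metaphlan_tables.py. The function name `merge_profiles` is hypothetical, and the k__Bacteria value is illustrative.

```python
def merge_profiles(profiles):
    """Merge {sample_name: {clade_name: abundance}} into (header, rows).

    Clades absent from a sample are filled with 0.0.
    """
    samples = list(profiles)
    clades = sorted({clade for profile in profiles.values() for clade in profile})
    header = ["clade_name"] + samples
    rows = [
        [clade] + [profiles[sample].get(clade, 0.0) for sample in samples]
        for clade in clades
    ]
    return header, rows


profiles = {
    "Sample3_profile": {
        "k__Archaea": 0.0225,
        "k__Archaea|p__Euryarchaeota": 0.0225,
    },
    "Sample4_profile": {
        "k__Archaea": 0.0528,
        "k__Archaea|p__Euryarchaeota": 0.0528,
        "k__Bacteria": 99.9472,  # illustrative value, present in one sample only
    },
}

header, rows = merge_profiles(profiles)
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```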
HUMAnN
HUMAnN is a pipeline for profiling the presence/absence and abundance of microbial pathways in a community from metagenomic or metatranscriptomic sequencing data (typically millions of short DNA/RNA reads). This process, referred to as functional profiling, aims to describe the metabolic potential of a microbial community and its members. For further information about the pipeline, please refer to the HUMAnN 3.0 User Manual.
HUMAnN does not use paired-end read information in its computation. That means paired-end reads are treated as single-end reads of a sample.
Installation, Databases and Usage
- For initial installation of HUMAnN 3.0: https://github.com/biobakery/humann#initial-installation
- For downloading and configuring the databases: https://github.com/biobakery/humann#5-download-the-databases
- For running the pipeline: https://github.com/biobakery/humann#how-to-run
- Input file formats: https://github.com/biobakery/humann#main-workflow
Note: The HUMAnN pipeline works with a single read file. Hence, prepare an interleaved FASTQ file from the paired-end reads of a sample and use it as input.
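One way to prepare the single input file mentioned in the note is sketched below. It assumes plain four-line FASTQ records (no wrapped sequence lines) and that both files list mates in the same order. The function names and the file names in the comment are hypothetical.

```python
from itertools import islice


def fastq_records(lines):
    """Yield FASTQ records as lists of 4 lines (header, sequence, '+', quality)."""
    iterator = iter(lines)
    while True:
        record = list(islice(iterator, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(r1_lines, r2_lines):
    """Yield lines alternating each R1 record with its mate from R2."""
    r1, r2 = fastq_records(r1_lines), fastq_records(r2_lines)
    for record1 in r1:
        record2 = next(r2, None)
        if record2 is None:
            raise ValueError("R2 has fewer records than R1")
        yield from record1
        yield from record2
    if next(r2, None) is not None:
        raise ValueError("R1 has fewer records than R2")


# Tiny in-memory demonstration with one read pair.
r1 = ["@read1/1\n", "ACGT\n", "+\n", "IIII\n"]
r2 = ["@read1/2\n", "TGCA\n", "+\n", "IIII\n"]
merged = list(interleave(r1, r2))
print("".join(merged), end="")

# With real files (hypothetical names):
# with open("sample_R1.fastq") as f1, open("sample_R2.fastq") as f2, \
#         open("sample_interleaved.fastq", "w") as out:
#     out.writelines(interleave(f1, f2))
```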
Output files
Successfully running the pipeline on the metagenomic reads of a sample produces three output files per sample.
1. Gene Families File:
# Gene Family sample_Abundance-RPKs
UNMAPPED 115152.0000000000
UniRef90_A0A0M0VMD0 2000.0000000000
UniRef90_A0A0M0VMD0|g__Bifidobacterium.s__Bifidobacterium_longum 2000.0000000000
UniRef90_C0EWA3 2000.0000000000
UniRef90_C0EWA3|g__Eubacterium.s__Eubacterium_hallii 2000.0000000000
UniRef90_I3BAJ8 2000.0000000000
UniRef90_I3BAJ8|g__Bifidobacterium.s__Bifidobacterium_longum 2000.0000000000
UniRef90_A7B6Q1 1557.7046827047
UniRef90_A7B6Q1|g__Blautia.s__Ruminococcus_gnavus 1557.7046827047
UniRef90_A7B4S7 1000.0000000000
UniRef90_A7B4S7|g__Blautia.s__Ruminococcus_gnavus 1000.0000000000
UniRef90_D4KNZ2 1000.0000000000
UniRef90_D4KNZ2|g__Eubacterium.s__Eubacterium_hallii 1000.0000000000
UNMAPPED
= the total number of reads that do not map to any gene family.
UniRef90_A0A0M0VMD0 2000.0000000000
Indicates the abundance (in RPK) of A0A0M0VMD0 (UniRef90 gene family) in the sample, i.e. 2000.
UniRef90_A0A0M0VMD0|g__Bifidobacterium.s__Bifidobacterium_longum 2000.0000000000
Indicates the abundance (in RPK) of A0A0M0VMD0 (UniRef90 gene family) contributed by g__Bifidobacterium.s__Bifidobacterium_longum.
- Gene families are groups of evolutionarily-related protein-coding sequences that often perform similar functions.
- The Gene Families File consists of the abundance of each gene family in the community, reported in RPK (reads per kilobase) units to normalize for gene length.
- RPK values can be further sum-normalized to adjust for differences in sequencing depth across samples using the utility script humann_renorm_table.
- HUMAnN 3.0 uses the MetaPhlAn 3 software along with the ChocoPhlAn database and a translated search database for this computation.
- The UniRef90_unknown values represent the total abundance of reads which map to ChocoPhlAn nucleotide sequences that do not have a UniRef90 annotation.
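The relationship between community-level and stratified rows, and the idea of sum-normalization, can be sketched as follows. This illustrates the arithmetic only and is not the implementation of humann_renorm_table. The function names are hypothetical, and including UNMAPPED in the normalization total is an assumption of this sketch. The values are taken from the example file above.

```python
def split_strata(rows):
    """Split (feature, value) rows into community totals and per-species strata."""
    community, strata = {}, {}
    for feature, value in rows:
        if "|" in feature:
            family, species = feature.split("|", 1)
            strata.setdefault(family, {})[species] = value
        else:
            community[feature] = value
    return community, strata


def to_cpm(community):
    """Sum-normalize community-level values to copies per million (CPM)."""
    total = sum(community.values())
    return {feature: value * 1e6 / total for feature, value in community.items()}


rows = [
    ("UNMAPPED", 115152.0),
    ("UniRef90_A0A0M0VMD0", 2000.0),
    ("UniRef90_A0A0M0VMD0|g__Bifidobacterium.s__Bifidobacterium_longum", 2000.0),
    ("UniRef90_A7B6Q1", 1557.7046827047),
    ("UniRef90_A7B6Q1|g__Blautia.s__Ruminococcus_gnavus", 1557.7046827047),
]

community, strata = split_strata(rows)
cpm = to_cpm(community)

# For gene families, the species-level strata add up to the community total.
for family, by_species in strata.items():
    print(family, community[family], sum(by_species.values()))
```

After normalization, the community-level values of a sample sum to one million, which makes samples of different sequencing depth comparable.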
2. Path Abundance File:
# Pathway sample_Abundance
UNMAPPED 40796.5803457015
UNINTEGRATED 61137.9031336611
UNINTEGRATED|unclassified 39258.2291570060
UNINTEGRATED|g__Eubacterium.s__Eubacterium_hallii 13718.0060457384
UNINTEGRATED|g__Blautia.s__Ruminococcus_gnavus 8180.8962902680
PWY-7238: sucrose biosynthesis II 119.2533495596
PWY-7238: sucrose biosynthesis II|unclassified 103.4461062902
PWY-7238: sucrose biosynthesis II|g__Eubacterium.s__Eubacterium_hallii 14.1151978910
PWY-5941: glycogen degradation II 115.3815881854
PWY-5941: glycogen degradation II|unclassified 98.2852118438
PWY-5941: glycogen degradation II|g__Eubacterium.s__Eubacterium_hallii 15.4308190184
GLYCOGENSYNTH-PWY: glycogen biosynthesis I (from ADP-D-Glucose) 109.7337991462
GLYCOGENSYNTH-PWY: glycogen biosynthesis I (from ADP-D-Glucose)|unclassified 89.1165043440
GLYCOGENSYNTH-PWY: glycogen biosynthesis I (from ADP-D-Glucose)|g__Eubacterium.s__Eubacterium_hallii 15.9779973229
GLYCOGENSYNTH-PWY: glycogen biosynthesis I (from ADP-D-Glucose)|g__Blautia.s__Ruminococcus_gnavus 6.8460717748
**UNMAPPED**
Represents the total number of reads not mapped to any gene family, converted into the equivalent pathway abundance (https://github.com/biobakery/humann#2-pathway-abundance-file).
**UNINTEGRATED**
Represents the gene families that do not belong to any known pathway. It is likewise converted into the equivalent pathway abundance (https://github.com/biobakery/humann#2-pathway-abundance-file).
- This file details the abundance of each pathway in the community as a function of the abundances of the pathway's component reactions, with each reaction's abundance computed as the sum over abundances of genes catalyzing the reaction.
- Pathways with zero abundance are not included in the file.
- Pathway abundance is computed once at the community level and again for each species (plus the "unclassified" stratum) using community- and species-level gene abundances along with the structure of the pathway.
Example of community-level pathway abundance:
PWY-7238: sucrose biosynthesis II 119.2533495596
Example of species-level pathway abundance:
PWY-7238: sucrose biosynthesis II|g__Eubacterium.s__Eubacterium_hallii 14.1151978910
Example of the "unclassified" stratum:
PWY-7238: sucrose biosynthesis II|unclassified 103.4461062902
- Pathway abundance is proportional to the number of complete "copies" of the pathway in the community. Thus, for a simple linear pathway RXN1→RXN2→RXN3→RXN4, if RXN1 is 10 times as abundant as RXNs 2-4, the pathway abundance will be driven by the abundances of RXNs 2-4.
- Unlike gene abundance, a pathway's community-level abundance is not necessarily the sum of its stratified abundance values. For example, continuing with the simple linear pathway example introduced above, if the abundances of RXNs 1-4 are [5, 5, 10, 10] in Species_A and [10, 10, 5, 5] in Species_B, HUMAnN 3.0 would report that Species_A and Species_B each contribute 5 complete copies of the pathway. However, at the community level, the reaction totals are [15, 15, 15, 15], and thus HUMAnN 3.0 would report 15 complete copies.
- For further discussion of the above explanation, see https://forum.biobakery.org/t/humann3-pathway-abundance-table-pathway-sum-and-species-sum-different/1471
- By default, HUMAnN 3.0 uses MetaCyc pathway definitions and MinPath to identify a parsimonious set of pathways which explain observed reactions in the community.
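The linear-pathway example above can be reproduced with a few lines of arithmetic. The sketch below treats the number of complete "copies" as the abundance of the least abundant reaction. This is a deliberate simplification for the RXN1→RXN2→RXN3→RXN4 example, and HUMAnN's actual scoring may differ. The function name `complete_copies` is hypothetical.

```python
def complete_copies(reaction_abundances):
    """Simplified pathway abundance: limited by the least abundant reaction."""
    return min(reaction_abundances)


# Abundances of RXN1..RXN4 in each species, as in the example above.
species_a = [5, 5, 10, 10]
species_b = [10, 10, 5, 5]

# Community-level reaction totals are the per-reaction sums over species.
community = [a + b for a, b in zip(species_a, species_b)]

per_species_sum = complete_copies(species_a) + complete_copies(species_b)
community_level = complete_copies(community)

print(per_species_sum, community_level)  # prints: 10 15
```

The stratified values add up to 10 while the community-level value is 15, which is why a pathway's community-level abundance need not equal the sum of its strata.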
3. Pathway coverage File:
# Pathway sample_Coverage
UNMAPPED 1.0000000000
UNINTEGRATED 1.0000000000
UNINTEGRATED|g__Blautia.s__Ruminococcus_gnavus 1.0000000000
UNINTEGRATED|g__Eubacterium.s__Eubacterium_hallii 1.0000000000
UNINTEGRATED|unclassified 1.0000000000
PWY-7238: sucrose biosynthesis II 0.9999774377
PWY-7238: sucrose biosynthesis II|unclassified 0.9999922305
PWY-7238: sucrose biosynthesis II|g__Eubacterium.s__Eubacterium_hallii 0.3191650122
PWY-5941: glycogen degradation II 0.9996921140
PWY-5941: glycogen degradation II|unclassified 0.9994686473
PWY-5941: glycogen degradation II|g__Eubacterium.s__Eubacterium_hallii 0.3929697139
GLYCOGENSYNTH-PWY: glycogen biosynthesis I (from ADP-D-Glucose) 0.9993528050
GLYCOGENSYNTH-PWY: glycogen biosynthesis I (from ADP-D-Glucose)|unclassified 0.9984230945
GLYCOGENSYNTH-PWY: glycogen biosynthesis I (from ADP-D-Glucose)|g__Eubacterium.s__Eubacterium_hallii 0.6477012866
GLYCOGENSYNTH-PWY: glycogen biosynthesis I (from ADP-D-Glucose)|g__Blautia.s__Ruminococcus_gnavus 0.4838470893
- Pathway coverage describes the presence or absence of pathways in a community, independent of their quantitative abundance. Values range from 0 (absent) to 1 (present).
- A pathway with coverage = 1 is confidently detected (independent of its abundance), which implies that all of its member reactions were also confidently detected.
- A pathway with coverage = 0 is less confidently detected (independent of its abundance), as this implies that some of its member reactions were not confidently detected.
- Like pathway abundance, pathway coverage is computed for the community as a whole, as well as for each detected species and the unclassified stratum.
- It is possible for a pathway to be confidently covered at the community level but never confidently detected from any single species.
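The last point can be checked directly on a coverage table. The sketch below lists pathways whose community-level coverage reaches a threshold while no named species stratum does. The function name, the threshold of 0.9, and the choice to ignore the "unclassified" stratum are assumptions made for illustration. The values come from the example file above.

```python
SPECIAL = {"UNMAPPED", "UNINTEGRATED"}  # bookkeeping rows, not real pathways


def community_only_pathways(rows, threshold=0.9):
    """Return pathways covered at community level but by no single named species."""
    community, best_species = {}, {}
    for feature, coverage in rows:
        pathway, _, stratum = feature.partition("|")
        if pathway in SPECIAL:
            continue
        if not stratum:
            community[pathway] = coverage
        elif stratum != "unclassified":
            best_species[pathway] = max(best_species.get(pathway, 0.0), coverage)
    return [
        pathway
        for pathway, coverage in community.items()
        if coverage >= threshold and best_species.get(pathway, 0.0) < threshold
    ]


rows = [
    ("UNMAPPED", 1.0),
    ("UNINTEGRATED", 1.0),
    ("UNINTEGRATED|g__Blautia.s__Ruminococcus_gnavus", 1.0),
    ("PWY-7238: sucrose biosynthesis II", 0.9999774377),
    ("PWY-7238: sucrose biosynthesis II|unclassified", 0.9999922305),
    ("PWY-7238: sucrose biosynthesis II|g__Eubacterium.s__Eubacterium_hallii", 0.3191650122),
]

result = community_only_pathways(rows)
print(result)
```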
Merging output files:
The output files per sample can be merged into one to produce a single report file with the utility script humann_join_tables.
Merged Gene families
# Gene Family Sample1_Abundance-RPKs Sample2_Abundance-RPKs Sample3_Abundance-RPKs Sample4_Abundance-RPKs
UNMAPPED 11959846.0000000000 5042866.0000000000 1970791.0000000000 7644644.0000000000
UniRef90_A0A014AUM0|unclassified 0 0 0 11.8890771065
UniRef90_A0A015NZ08 0 4.4444444444 0 0
UniRef90_A0A015NZ08|unclassified 0 4.4444444444 0 0
UniRef90_A0A015P063 52.5107244035 19.4737274069 0 0
UniRef90_A0A015P063|g__Bacteroides.s__Bacteroides_caccae 0.8748906387 0 0 0
UniRef90_A0A015P063|g__Bacteroides.s__Bacteroides_cellulosilyticus 8.3606005716 3.4715116306 0 0
UniRef90_A0A015P063|g__Bacteroides.s__Bacteroides_ovatus 4.8542525157 1.3966507690 0 0
UniRef90_A0A015P063|g__Bacteroides.s__Bacteroides_thetaiotaomicron 1.3981127189 0 0 0
UniRef90_A0A015P063|g__Bacteroides.s__Bacteroides_uniformis 0.6993006993 2.0974134182 0 0
UniRef90_A0A015P063|g__Bacteroides.s__Bacteroides_vulgatus 30.0327923616 9.0131134493 0 0
UniRef90_A0A015P063|g__Bacteroides.s__Bacteroides_xylanisolvens 6.2907748978 3.4950381398 0 0
Merged Path abundance
# Pathway Sample1_Abundance Sample2_Abundance Sample3_Abundance Sample4_Abundance
UNMAPPED 3926644.7372371694 1635421.4580625638 961801.8663788423 3750505.8833935312
UNINTEGRATED 6811345.7875916995 2566222.8705798956 7303169.8037830554 39497717.4358818978
UNINTEGRATED|g__Actinomyces.s__Actinomyces_sp_HMSC035G02 348.6439590084 0 0 0
UNINTEGRATED|g__Agathobaculum.s__Agathobaculum_butyriciproducens 9862.5403549987 0 0 0
1CMET2-PWY: folate transformations III (E. coli) 3992.1669990516 1483.5934089998 1709.9344713109 10437.1823240668
1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_onderdonkii 11.7503732515 0 0 0
1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_putredinis 57.5598246324 16.5831762838 0 0
1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_caccae 30.6833922570 7.3270018081 0 0
1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_coprocola 250.0124401648 102.0605714820 0 0
1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_ovatus 21.8408833188 11.4985955368 0 0
1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_vulgatus 150.3095933731 60.1847365926 0 0
1CMET2-PWY: folate transformations III (E. coli)|g__Blautia.s__Ruminococcus_torques 37.5717566806 10.4296704453 0 0
1CMET2-PWY: folate transformations III (E. coli)|g__Escherichia.s__Escherichia_coli 0 0 993.3832735153 3508.5961863869
1CMET2-PWY: folate transformations III (E. coli)|g__Parabacteroides.s__Parabacteroides_merdae 15.0576827454 6.9745217371 0 0
1CMET2-PWY: folate transformations III (E. coli)|g__Salmonella.s__Salmonella_enterica 0 0 0 3067.3629302126
1CMET2-PWY: folate transformations III (E. coli)|unclassified 83.9685039546 0 0 0
Merged Pathway coverage
# Pathway Sample1_Coverage Sample2_Coverage Sample3_Coverage Sample4_Coverage
UNMAPPED 1.0000000000 1.0000000000 1.0000000000 1.0000000000
UNINTEGRATED 1.0000000000 1.0000000000 1.0000000000 1.0000000000
1CMET2-PWY: folate transformations III (E. coli) 0.9940997287 0.2459453464 0.0000000000 0.9995904747
1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_onderdonkii 0.1758382328 0 0 0
1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_putredinis 0.2808858572 0.1571578373 0 0
1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_caccae 0.2072768548 0.1974704271 0 0
1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_coprocola 0.1088144066 0.4017180344 0 0
1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_ovatus 0.0000203717 0.0218396573 0 0
1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_vulgatus 0.0003710575 0.1800953856 0 0
1CMET2-PWY: folate transformations III (E. coli)|g__Blautia.s__Ruminococcus_torques 0.0000000000 0.0000000000 0 0
1CMET2-PWY: folate transformations III (E. coli)|g__Escherichia.s__Escherichia_coli 0 0 0.0000000000 0.0000000000
1CMET2-PWY: folate transformations III (E. coli)|g__Parabacteroides.s__Parabacteroides_merdae 0.0345641559 0.3923455430 0 0
1CMET2-PWY: folate transformations III (E. coli)|g__Salmonella.s__Salmonella_enterica 0 0 0 0.0000000000
1CMET2-PWY: folate transformations III (E. coli)|unclassified 0.0000000000 0 0 0
Note: Both MetaPhlAn and HUMAnN offer various utility scripts, which are described here:
- HUMAnN: https://github.com/biobakery/humann#guides-to-humann-30-utility-scripts
- MetaPhlAn: https://github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-3.0#utility-scripts https://github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn2#utility-scripts https://github.com/biobakery/biobakery/wiki/metaphlan3#metaphlan-30-tutorial