MetaPhlAn and HUMAnN output files - quadram-institute-bioscience/gmh-sops GitHub Wiki

MetaPhlAn

MetaPhlAn is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea and Eukaryotes) from metagenomic shotgun sequencing data (i.e. not 16S) with species-level resolution. For further information on installing and running the pipeline, see Metaphlan3.

Output Files

MetaPhlAn produces the following output files for the shotgun metagenomic reads of a sample:

  1. $SAMPLE.bowtie2out.txt:
    This is an intermediate file containing the mapping results to unique sequence markers. Alignments are listed one per line, in tab-separated columns of read and reference marker.

  2. $SAMPLE_profile.txt or $SAMPLE_metaphlan_bugs_list.tsv:
    This is the main output file of interest. It contains the final computed organism abundances, listed one clade per line, with the clade name tab-separated from its percent abundance:

#mpa_v30_CHOCOPhlAn_201901
#/opt/software/humann/bin/metaphlan ~/tmp0i7t6gma -x mpa_v30_CHOCOPhlAn_201901 --bowtie2db ~/mpa/mpa_v30_CHOCOPhlAn_201901/ --o ~/$SAMPLE_metaphlan_bugs_list.tsv --input_type fastq --bowtie2out ~/SMAPLE/$SAMPLE_humann_temp/$SAMPLE_metaphlan_bowtie2.txt --nproc 16
#SampleID	Metaphlan_Analysis
#SampleID	$SAMPLE
#clade_name	NCBI_tax_id	relative_abundance	additional_species
k__Bacteria	2	99.9472	
k__Archaea	2157	0.0528	
k__Bacteria|p__Firmicutes	2|1239	76.3641	
k__Bacteria|p__Bacteroidetes	2|976	22.03415	
k__Bacteria|p__Actinobacteria	2|201174	1.46012	
k__Bacteria|p__Proteobacteria	2|1224	0.08884	
k__Archaea|p__Euryarchaeota	2157|28890	0.0528	
k__Bacteria|p__Firmicutes|c__Clostridia	2|1239|186801	69.53105	
k__Bacteria|p__Bacteroidetes|c__Bacteroidia	2|976|200643	22.03415	
k__Bacteria|p__Firmicutes|c__Bacilli	2|1239|91061	4.40339	
k__Bacteria|p__Firmicutes|c__Negativicutes	2|1239|909932	2.3387	
k__Bacteria|p__Actinobacteria|c__Actinobacteria	2|201174|1760	1.02661	
k__Bacteria|p__Actinobacteria|c__Coriobacteriia	2|201174|84998	0.43351	

clade_name: The full lineage of the clade, from kingdom (Bacteria, Archaea, etc.) down to species. Each name in the lineage carries a prefix indicating its taxonomic level:
Kingdom: k__, Phylum: p__, Class: c__, Order: o__, Family: f__, Genus: g__, Species: s__.

For example: k__Bacteria|p__Actinobacteria|c__Coriobacteriia

NCBI_tax_id: The NCBI taxonomy ID of each clade in the lineage, one per taxonomic level, separated by "|" in the same order as clade_name.
For example: k__Bacteria|p__Actinobacteria|c__Coriobacteriia 2|201174|84998

relative_abundance: The relative abundance, in percent, of the clade at the lowest taxonomic level named on the line.
For example: k__Bacteria|p__Actinobacteria|c__Coriobacteriia 2|201174|84998 0.43351. The relative abundance of the class Coriobacteriia in the sample is 0.43351%.

Note: Each taxonomic level will sum to 100%; that is, the sum of all kingdom-level clades is 100%, the sum of all genus-level clades (including unclassified) is also 100%, and so forth.
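
The per-level sums described in the note can be checked with a few lines of Python. This is a minimal sketch, not part of MetaPhlAn: the function name abundance_by_level is made up for illustration, and the embedded rows are the kingdom and phylum lines copied from the example profile above.

```python
import csv
import io
from collections import defaultdict

# Kingdom- and phylum-level rows copied from the example profile (tab-separated).
PROFILE = (
    "#clade_name\tNCBI_tax_id\trelative_abundance\tadditional_species\n"
    "k__Bacteria\t2\t99.9472\t\n"
    "k__Archaea\t2157\t0.0528\t\n"
    "k__Bacteria|p__Firmicutes\t2|1239\t76.3641\t\n"
    "k__Bacteria|p__Bacteroidetes\t2|976\t22.03415\t\n"
    "k__Bacteria|p__Actinobacteria\t2|201174\t1.46012\t\n"
    "k__Bacteria|p__Proteobacteria\t2|1224\t0.08884\t\n"
    "k__Archaea|p__Euryarchaeota\t2157|28890\t0.0528\t\n"
)


def abundance_by_level(handle):
    """Sum relative abundance per taxonomic level (k, p, c, ...)."""
    totals = defaultdict(float)
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and the '#' header/comment lines.
        if not row or row[0].startswith("#"):
            continue
        clade_name, abundance = row[0], float(row[2])
        # The level is the prefix of the last name in the lineage,
        # e.g. "p" for "k__Bacteria|p__Firmicutes".
        level = clade_name.split("|")[-1].split("__")[0]
        totals[level] += abundance
    return dict(totals)


totals = abundance_by_level(io.StringIO(PROFILE))
for level, total in totals.items():
    print(level, round(total, 3))
```

To run it on a real profile, pass an open file handle instead of the io.StringIO object. Small deviations from 100 are expected because the reported abundances are rounded.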

For a detailed explanation, refer here.

Combining tables

The script merge_metaphlan_tables.py combines the MetaPhlAn output from several samples into one table of bugs (rows) vs. samples (columns), listing the relative normalised abundance of each bug in each sample. https://github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-3.0#merging-tables

How to run

$ merge_metaphlan_tables.py Sample1_profile.txt Sample2_profile.txt Sample3_profile.txt Sample4_profile.txt > output/merged_abundance_table.txt

Output:

#mpa_v30_CHOCOPhlAn_201901
clade_name	NCBI_tax_id	Sample1_profile	Sample2_profile	Sample3_profile	Sample4_profile
k__Archaea	2157	0	0	0.0225	0.0528
k__Archaea|p__Euryarchaeota	2157|28890	0	0	0.0225	0.0528
k__Archaea|p__Euryarchaeota|c__Methanobacteria	2157|28890|183925	0	0	0.0225	0.0528
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales	2157|28890|183925|2158	0	0	0.0225	0.0528
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae	2157|28890|183925|2158|2159	0	0	0.0225	0.0528
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter	2157|28890|183925|2158|2159|2172	0	0	0.0225	0.0528
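
Conceptually, the merged table is an outer join of the per-sample profiles on clade_name, with 0 filled in where a clade was not detected in a sample. The sketch below illustrates that idea only; it is not the implementation of merge_metaphlan_tables.py. The function name merge_profiles is hypothetical, the Sample4 values are taken from the examples above, and the Sample1 value is a made-up toy value.

```python
def merge_profiles(profiles):
    """Outer-join per-sample profiles on clade name.

    profiles maps sample name -> {clade_name: relative_abundance}.
    Returns (sample_names, {clade_name: [abundance per sample]}).
    """
    samples = list(profiles)
    # Union of all clades seen in any sample.
    clades = sorted({clade for table in profiles.values() for clade in table})
    merged = {
        # A clade missing from a sample gets abundance 0.
        clade: [profiles[sample].get(clade, 0.0) for sample in samples]
        for clade in clades
    }
    return samples, merged


profiles = {
    "Sample1_profile": {"k__Bacteria": 100.0},  # toy value, not from the source
    "Sample4_profile": {"k__Bacteria": 99.9472, "k__Archaea": 0.0528},
}

samples, merged = merge_profiles(profiles)
print("clade_name", *samples, sep="\t")
for clade, values in merged.items():
    print(clade, *values, sep="\t")
```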

HUMAnN

HUMAnN is a pipeline for profiling the presence/absence and abundance of microbial pathways in a community from metagenomic or metatranscriptomic sequencing data (typically millions of short DNA/RNA reads). This process, referred to as functional profiling, aims to describe the metabolic potential of a microbial community and its members. For further information about the pipeline, refer to the HUMAnN 3.0 User Manual.

HUMAnN does not use paired-end information in its computation: the paired-end reads of a sample are treated as single-end reads.

Installation, Databases and Usage

Note: The HUMAnN pipeline works with a single read file per sample. Hence, prepare an interleaved FASTQ file from the paired-end reads of a sample and use it as input.
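
The interleaving step in the note can be sketched in Python as follows. This is a minimal illustration, not a HUMAnN utility: the function names are made up, and it assumes plain four-line FASTQ records with R1 and R2 listing the reads in the same order.

```python
from itertools import islice, zip_longest


def read_records(lines):
    """Yield FASTQ records as lists of 4 lines (header, sequence, '+', qualities)."""
    iterator = iter(lines)
    while True:
        record = list(islice(iterator, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(r1_lines, r2_lines):
    """Yield lines alternating each R1 record with its mate from R2."""
    pairs = zip_longest(read_records(r1_lines), read_records(r2_lines))
    for rec1, rec2 in pairs:
        if rec1 is None or rec2 is None:
            raise ValueError("R1 and R2 contain different numbers of reads")
        yield from rec1
        yield from rec2


# Toy reads for illustration only.
r1 = ["@read1/1\n", "ACGT\n", "+\n", "IIII\n"]
r2 = ["@read1/2\n", "TGCA\n", "+\n", "IIII\n"]
interleaved = list(interleave(r1, r2))
print("".join(interleaved), end="")
```

For real data, pass two open file handles (for example from gzip.open(path, "rt") for compressed files) and write the yielded lines to the output file with writelines.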

Output files

Successfully running the pipeline on the metagenomic reads of a sample produces three output files per sample:

  1. Gene Families File
  2. Pathway Abundance File
  3. Pathway Coverage File

1. Gene Families File:

# Gene Family	sample_Abundance-RPKs
UNMAPPED	115152.0000000000
UniRef90_A0A0M0VMD0	2000.0000000000
UniRef90_A0A0M0VMD0|g__Bifidobacterium.s__Bifidobacterium_longum	2000.0000000000
UniRef90_C0EWA3	2000.0000000000
UniRef90_C0EWA3|g__Eubacterium.s__Eubacterium_hallii	2000.0000000000
UniRef90_I3BAJ8	2000.0000000000
UniRef90_I3BAJ8|g__Bifidobacterium.s__Bifidobacterium_longum	2000.0000000000
UniRef90_A7B6Q1	1557.7046827047
UniRef90_A7B6Q1|g__Blautia.s__Ruminococcus_gnavus	1557.7046827047
UniRef90_A7B4S7	1000.0000000000
UniRef90_A7B4S7|g__Blautia.s__Ruminococcus_gnavus	1000.0000000000
UniRef90_D4KNZ2	1000.0000000000
UniRef90_D4KNZ2|g__Eubacterium.s__Eubacterium_hallii	1000.0000000000

UNMAPPED: the total number of reads that do not map to any gene family.
UniRef90_A0A0M0VMD0 2000.0000000000: the community-level abundance of the UniRef90 gene family A0A0M0VMD0 in the sample is 2000 RPK.
UniRef90_A0A0M0VMD0|g__Bifidobacterium.s__Bifidobacterium_longum 2000.0000000000: the part of that abundance (here all 2000 RPK) contributed by g__Bifidobacterium.s__Bifidobacterium_longum.

  • Gene families are groups of evolutionarily-related protein-coding sequences that often perform similar functions.
  • The Gene Families File reports the abundance of each gene family in the community in RPK (reads per kilobase) units, which normalize for gene length.
  • RPK values can be further sum-normalized to adjust for differences in sequencing depth across samples using the utility script humann_renorm_table.
  • HUMAnN 3.0 uses MetaPhlAn 3 together with the ChocoPhlAn database and a translated search database for this computation.
  • The UniRef90_unknown values represent the total abundance of reads which map to ChocoPhlAn nucleotide sequences that do not have a UniRef90 annotation.
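
The sum-normalization mentioned above can be illustrated with a short sketch. It is a simplification, not the code of humann_renorm_table: the function name to_relative_abundance is hypothetical, the values are a subset of the example rows above, and the choice to include UNMAPPED in the total is an assumption that should be checked against the options you use with the real script.

```python
# A subset of rows from the example Gene Families File (values in RPK).
GENEFAMILIES = {
    "UNMAPPED": 115152.0,
    "UniRef90_A0A0M0VMD0": 2000.0,
    "UniRef90_A0A0M0VMD0|g__Bifidobacterium.s__Bifidobacterium_longum": 2000.0,
    "UniRef90_A7B6Q1": 1557.7046827047,
    "UniRef90_A7B6Q1|g__Blautia.s__Ruminococcus_gnavus": 1557.7046827047,
}


def to_relative_abundance(table):
    """Divide every value by the sum of the community-level rows.

    Community-level (unstratified) rows have no "|" in the feature name;
    stratified rows are scaled by the same total so they stay comparable.
    Assumption: UNMAPPED is counted in the total.
    """
    community_total = sum(v for name, v in table.items() if "|" not in name)
    return {name: value / community_total for name, value in table.items()}


relab = to_relative_abundance(GENEFAMILIES)
community_sum = sum(v for name, v in relab.items() if "|" not in name)
for name, value in relab.items():
    print(f"{name}\t{value:.6f}")
```

After normalization the community-level rows sum to 1, so values from samples with different sequencing depths can be compared directly.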

2. Pathway Abundance File:

# Pathway	sample_Abundance
UNMAPPED	40796.5803457015
UNINTEGRATED	61137.9031336611
UNINTEGRATED|unclassified	39258.2291570060
UNINTEGRATED|g__Eubacterium.s__Eubacterium_hallii	13718.0060457384
UNINTEGRATED|g__Blautia.s__Ruminococcus_gnavus	8180.8962902680
PWY-7238: sucrose biosynthesis II	119.2533495596
PWY-7238: sucrose biosynthesis II|unclassified	103.4461062902
PWY-7238: sucrose biosynthesis II|g__Eubacterium.s__Eubacterium_hallii	14.1151978910
PWY-5941: glycogen degradation II	115.3815881854
PWY-5941: glycogen degradation II|unclassified	98.2852118438
PWY-5941: glycogen degradation II|g__Eubacterium.s__Eubacterium_hallii	15.4308190184
GLYCOGENSYNTH-PWY: glycogen biosynthesis I (from ADP-D-Glucose)	109.7337991462
GLYCOGENSYNTH-PWY: glycogen biosynthesis I (from ADP-D-Glucose)|unclassified	89.1165043440
GLYCOGENSYNTH-PWY: glycogen biosynthesis I (from ADP-D-Glucose)|g__Eubacterium.s__Eubacterium_hallii	15.9779973229
GLYCOGENSYNTH-PWY: glycogen biosynthesis I (from ADP-D-Glucose)|g__Blautia.s__Ruminococcus_gnavus	6.8460717748

**UNMAPPED** represents the total number of reads not mapped to any gene family, converted into equivalent pathway abundance units (https://github.com/biobakery/humann#2-pathway-abundance-file).
**UNINTEGRATED** represents the abundance of gene families that do not belong to any known pathway, likewise converted into equivalent pathway abundance units (https://github.com/biobakery/humann#2-pathway-abundance-file).

  • This file details the abundance of each pathway in the community as a function of the abundances of the pathway's component reactions, with each reaction's abundance computed as the sum over abundances of genes catalyzing the reaction.
  • Pathways with zero abundance are not included in the file.
  • Pathway abundance is computed once at the community level and again for each species (plus the "unclassified" stratum) using community- and species-level gene abundances along with the structure of the pathway.

Example of community-level pathway abundance: PWY-7238: sucrose biosynthesis II 119.2533495596
Example of species-level pathway abundance: PWY-7238: sucrose biosynthesis II|g__Eubacterium.s__Eubacterium_hallii 14.1151978910
Example of species-level (plus "unclassified" stratum): PWY-7238: sucrose biosynthesis II|unclassified 103.4461062902

  • Pathway abundance is proportional to the number of complete "copies" of the pathway in the community. Thus, for a simple linear pathway RXN1→RXN2→RXN3→RXN4, if RXN1 is 10 times as abundant as RXNs 2-4, the pathway abundance will be driven by the abundances of RXNs 2-4.
  • Unlike gene abundance, a pathway's community-level abundance is not necessarily the sum of its stratified abundance values. For example, continuing with the simple linear pathway example introduced above, if the abundances of RXNs 1-4 are [5, 5, 10, 10] in Species_A and [10, 10, 5, 5] in Species_B, HUMAnN 3.0 would report that Species_A and Species_B each contribute 5 complete copies of the pathway. However, at the community level, the reaction totals are [15, 15, 15, 15], and thus HUMAnN 3.0 would report 15 complete copies.
  • For further discussion of this example, see https://forum.biobakery.org/t/humann3-pathway-abundance-table-pathway-sum-and-species-sum-different/1471
  • By default, HUMAnN 3.0 uses MetaCyc pathway definitions and MinPath to identify a parsimonious set of pathways which explain observed reactions in the community.
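
The linear-pathway example above can be worked through in a few lines. This sketch assumes, purely for illustration, that the number of complete copies of a simple linear pathway equals the abundance of its rarest reaction; HUMAnN's actual calculation is more involved, and the function name pathway_copies is made up.

```python
def pathway_copies(reaction_abundances):
    """Simplified model: a linear pathway is limited by its rarest reaction."""
    return min(reaction_abundances)


# Abundances of RXN1..RXN4 from the example in the text.
species_a = [5, 5, 10, 10]
species_b = [10, 10, 5, 5]

# Community-level reaction totals are the per-reaction sums over species.
community = [a + b for a, b in zip(species_a, species_b)]

copies_a = pathway_copies(species_a)          # 5
copies_b = pathway_copies(species_b)          # 5
copies_community = pathway_copies(community)  # 15

print("Species_A:", copies_a)
print("Species_B:", copies_b)
print("Community:", copies_community)
print("Sum of strata:", copies_a + copies_b)
```

The stratified values sum to 10, while the community-level value is 15, which is why a pathway's community-level abundance need not equal the sum of its stratified abundances.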

3. Pathway coverage File:

# Pathway	sample_Coverage
UNMAPPED	1.0000000000
UNINTEGRATED	1.0000000000
UNINTEGRATED|g__Blautia.s__Ruminococcus_gnavus	1.0000000000
UNINTEGRATED|g__Eubacterium.s__Eubacterium_hallii	1.0000000000
UNINTEGRATED|unclassified	1.0000000000
PWY-7238: sucrose biosynthesis II	0.9999774377
PWY-7238: sucrose biosynthesis II|unclassified	0.9999922305
PWY-7238: sucrose biosynthesis II|g__Eubacterium.s__Eubacterium_hallii	0.3191650122
PWY-5941: glycogen degradation II	0.9996921140
PWY-5941: glycogen degradation II|unclassified	0.9994686473
PWY-5941: glycogen degradation II|g__Eubacterium.s__Eubacterium_hallii	0.3929697139
GLYCOGENSYNTH-PWY: glycogen biosynthesis I (from ADP-D-Glucose)	0.9993528050
GLYCOGENSYNTH-PWY: glycogen biosynthesis I (from ADP-D-Glucose)|unclassified	0.9984230945
GLYCOGENSYNTH-PWY: glycogen biosynthesis I (from ADP-D-Glucose)|g__Eubacterium.s__Eubacterium_hallii	0.6477012866
GLYCOGENSYNTH-PWY: glycogen biosynthesis I (from ADP-D-Glucose)|g__Blautia.s__Ruminococcus_gnavus	0.4838470893

  • Pathway coverage describes the presence (1) or absence (0) of pathways in a community, independent of their quantitative abundance. Values range from 0 to 1.
  • A pathway with coverage = 1 is confidently detected (independent of its abundance), which implies that all of its member reactions were also confidently detected.
  • A pathway with coverage = 0 is less confidently detected (independent of its abundance), which implies that some of its member reactions were not confidently detected.
  • Like pathway abundance, pathway coverage is computed for the community as a whole, as well as for each detected species and the unclassified stratum.
  • It is possible for a pathway to be confidently covered at the community level but never confidently detected from any single species.
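
The last point can be seen in the example file above: both pathways below have community-level coverage close to 1, but much lower coverage in the only named species. The following sketch finds such pathways. The 0.9 cut-off and the function name community_only_pathways are assumptions chosen for illustration, not HUMAnN settings.

```python
# Rows copied from the example Pathway coverage File.
COVERAGE = {
    "PWY-7238: sucrose biosynthesis II": 0.9999774377,
    "PWY-7238: sucrose biosynthesis II|unclassified": 0.9999922305,
    "PWY-7238: sucrose biosynthesis II|g__Eubacterium.s__Eubacterium_hallii": 0.3191650122,
    "PWY-5941: glycogen degradation II": 0.9996921140,
    "PWY-5941: glycogen degradation II|unclassified": 0.9994686473,
    "PWY-5941: glycogen degradation II|g__Eubacterium.s__Eubacterium_hallii": 0.3929697139,
}

THRESHOLD = 0.9  # hypothetical cut-off for "confidently detected"


def community_only_pathways(table, threshold=THRESHOLD):
    """Pathways covered at community level but in no single named species."""
    found = []
    for feature, value in table.items():
        # Only consider community-level rows that pass the threshold.
        if "|" in feature or value < threshold:
            continue
        # Coverage of the same pathway in each named species
        # (the "unclassified" stratum is not a species, so it is skipped).
        species_values = [
            v for name, v in table.items()
            if name.startswith(feature + "|") and not name.endswith("|unclassified")
        ]
        if all(v < threshold for v in species_values):
            found.append(feature)
    return found


result = community_only_pathways(COVERAGE)
for pathway in result:
    print(pathway)
```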

Merging output files:

The output files per sample can be merged into one to produce a single report file with the utility script humann_join_tables.

Merged gene families

# Gene Family	Sample1_Abundance-RPKs	Sample2_Abundance-RPKs	Sample3_Abundance-RPKs	Sample4_Abundance-RPKs
UNMAPPED	11959846.0000000000	5042866.0000000000	1970791.0000000000	7644644.0000000000
UniRef90_A0A014AUM0|unclassified	0	0	0	11.8890771065
UniRef90_A0A015NZ08	0	4.4444444444	0	0
UniRef90_A0A015NZ08|unclassified	0	4.4444444444	0	0
UniRef90_A0A015P063	52.5107244035	19.4737274069	0	0
UniRef90_A0A015P063|g__Bacteroides.s__Bacteroides_caccae	0.8748906387	0	0	0
UniRef90_A0A015P063|g__Bacteroides.s__Bacteroides_cellulosilyticus	8.3606005716	3.4715116306	0	0
UniRef90_A0A015P063|g__Bacteroides.s__Bacteroides_ovatus	4.8542525157	1.3966507690	0	0
UniRef90_A0A015P063|g__Bacteroides.s__Bacteroides_thetaiotaomicron	1.3981127189	0	0	0
UniRef90_A0A015P063|g__Bacteroides.s__Bacteroides_uniformis	0.6993006993	2.0974134182	0	0
UniRef90_A0A015P063|g__Bacteroides.s__Bacteroides_vulgatus	30.0327923616	9.0131134493	0	0
UniRef90_A0A015P063|g__Bacteroides.s__Bacteroides_xylanisolvens	6.2907748978	3.4950381398	0	0

Merged pathway abundance

# Pathway	Sample1_Abundance	Sample2_Abundance	Sample3_Abundance	Sample4_Abundance
UNMAPPED	3926644.7372371694	1635421.4580625638	961801.8663788423	3750505.8833935312
UNINTEGRATED	6811345.7875916995	2566222.8705798956	7303169.8037830554	39497717.4358818978
UNINTEGRATED|g__Actinomyces.s__Actinomyces_sp_HMSC035G02	348.6439590084	0	0	0
UNINTEGRATED|g__Agathobaculum.s__Agathobaculum_butyriciproducens	9862.5403549987	0	0	0
1CMET2-PWY: folate transformations III (E. coli)	3992.1669990516	1483.5934089998	1709.9344713109	10437.1823240668
1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_onderdonkii	11.7503732515	0	0	0
1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_putredinis	57.5598246324	16.5831762838	0	0
1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_caccae	30.6833922570	7.3270018081	0	0
1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_coprocola	250.0124401648	102.0605714820	0	0
1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_ovatus	21.8408833188	11.4985955368	0	0
1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_vulgatus	150.3095933731	60.1847365926	0	0
1CMET2-PWY: folate transformations III (E. coli)|g__Blautia.s__Ruminococcus_torques	37.5717566806	10.4296704453	0	0
1CMET2-PWY: folate transformations III (E. coli)|g__Escherichia.s__Escherichia_coli	0	0	993.3832735153	3508.5961863869
1CMET2-PWY: folate transformations III (E. coli)|g__Parabacteroides.s__Parabacteroides_merdae	15.0576827454	6.9745217371	0	0
1CMET2-PWY: folate transformations III (E. coli)|g__Salmonella.s__Salmonella_enterica	0	0	0	3067.3629302126
1CMET2-PWY: folate transformations III (E. coli)|unclassified	83.9685039546	0	0	0

Merged pathway coverage

# Pathway	Sample1_Coverage	Sample2_Coverage	Sample3_Coverage	Sample4_Coverage
UNMAPPED	1.0000000000	1.0000000000	1.0000000000	1.0000000000
UNINTEGRATED	1.0000000000	1.0000000000	1.0000000000	1.0000000000
1CMET2-PWY: folate transformations III (E. coli)	0.9940997287	0.2459453464	0.0000000000	0.9995904747
1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_onderdonkii	0.1758382328	0	0	0
1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_putredinis	0.2808858572	0.1571578373	0	0
1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_caccae	0.2072768548	0.1974704271	0	0
1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_coprocola	0.1088144066	0.4017180344	0	0
1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_ovatus	0.0000203717	0.0218396573	0	0
1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_vulgatus	0.0003710575	0.1800953856	0	0
1CMET2-PWY: folate transformations III (E. coli)|g__Blautia.s__Ruminococcus_torques	0.0000000000	0.0000000000	0	0
1CMET2-PWY: folate transformations III (E. coli)|g__Escherichia.s__Escherichia_coli	0	0	0.0000000000	0.0000000000
1CMET2-PWY: folate transformations III (E. coli)|g__Parabacteroides.s__Parabacteroides_merdae	0.0345641559	0.3923455430	0	0
1CMET2-PWY: folate transformations III (E. coli)|g__Salmonella.s__Salmonella_enterica	0	0	0	0.0000000000
1CMET2-PWY: folate transformations III (E. coli)|unclassified	0.0000000000	0	0	0

Note: Both MetaPhlAn and HUMAnN offer additional utility scripts, details of which can be found here.