humann2 - biobakery/biobakery GitHub Wiki
Legacy HUMAnN 1.0 tutorial
Next Release HUMAnN 3.0 tutorial
HUMAnN 2.0 (the HMP Unified Metabolic Analysis Network) is a method for efficiently and accurately determining the presence, absence, and abundance of metabolic pathways in a microbial community from metagenomic or metatranscriptomic sequencing data. It is appropriate for any type of microbial community shotgun sequence profiling (i.e. not just the human microbiome; the name is a historical product of its origin in a Human Microbiome Project working group).
HUMAnN 2.0 can be installed from PyPI, as source code from the HUMAnN 2.0 Github repository, or using our quickstart documentation.
Support is available from the HUMAnN Forum. Feel free to join the group.
If you have already installed HUMAnN 2.0 and would like to upgrade your installed version to the latest version, please see the installation update section of the HUMAnN 2.0 User Manual.
Table of contents
- Overview
- 1. Setup
- 2. Metagenome functional profiling
- 3. Manipulating HUMAnN2 output tables
- 4. Advanced HUMAnN2 topics
- 5. Downstream analyses
- MetaPhlAn2 (version <= 2.6.0)
- Bowtie2 (version >= 2.1)
- Diamond (0.9.0 > version >= 0.8.22)
- MinPath
- Python (version >= 2.7)
NOTE: Bowtie2, Diamond, and MinPath are automatically installed when installing HUMAnN 2.0. If these dependencies do not appear to be installed after installing HUMAnN 2.0 with pip, it might be that your environment is setup to use wheels instead of installing from source. HUMAnN 2.0 must be installed from source for it to also be able to install dependencies. To force pip to install HUMAnN 2.0 from source add one of the following options to your install command, "--no-use-wheel" or "--no-binary :all:".
Please note, if you are using bioBakery (Vagrant VM or Google Cloud) you do not need to install HUMAnN 2.0 because the tool and its dependencies are already installed.
The easiest way to install HUMAnN 2.0 is with pip, Conda, or Docker. For other options, please refer to the HUMAnN 2.0 User Manual.
To install with pip:
$ pip install humann2
To install with conda (which will also install MetaPhlAn2 and utility dependencies):
$ conda install -c biobakery humann2
* Before installing with conda, make sure to configure your channels so biobakery is at the top of the list.
Note that by default, a pip
or other HUMAnN 2.0 install will NOT include
the full nucleotide or amino acid search databases needed for raw
metagenome or metatranscriptome profiling, as these are quite large and
should be installed based on project needs. A typical environment,
however, will require one of each, and we default to ChocoPhlAn for the
former and UniRef90 for the latter (note that for the purposes of this
tutorial alone, you will download demo databases to reduce run time):
$ humann2_databases --download chocophlan DEMO humann2_database_downloads
$ humann2_databases --download uniref DEMO_diamond humann2_database_downloads
The commands above download the databases to a folder named
humann2_database_downloads
. Please note you can download these
databases to any location on your machine. HUMAnN 2.0 will track where you
have placed the databases and update its configuration file.
After installation from pip
or another source, you may optionally test
your local HUMAnN 2.0 environment:
$ humann2_test
Which yields (abbreviated):
test_add_length_annotation (basic_tests_utilities.TestHumann2UtilitiesFunctions) ... ok
test_add_length_annotation_spaces (basic_tests_utilities.TestHumann2UtilitiesFunctions) ... ok
test_break_up_fasta_file (basic_tests_utilities.TestHumann2UtilitiesFunctions) ... ok
test_count_reads_fasta (basic_tests_utilities.TestHumann2UtilitiesFunctions) ... ok
...
HUMAnN 2.0 provides organism-specific community functional profiles, and to do so it first detects the organisms present in a community using MetaPhlAn2. As MetaPhlAn2 is thus a prerequisite for running the HUMAnN 2.0 software, please ensure that you've installed MetaPhlAn2 before running HUMAnN 2.0
A note on viewing files from the command line: During this tutorial
you'll inspect many tabular output files. For simplicity the tutorial
uses less -S file.tsv
for this task. To match the more Excel-like
columns seen in the tutorial document, you would first use the column
command, as in:
column -t -s $'\t' file.tsv | less -S
Each HUMAnN 2.0 run can process either a single community sample's raw reads, or (in more advanced usage) pre-computed gene data from multiple samples. Input files can thus be of the following types:
- Quality controlled shotgun metagenomes or metatranscriptomes
(
fastq
,fastq.gz
,fasta
, orfasta.gz
) - Mapping results (
sam
,bam
orblastm8
) - Gene abundance tables (
tsv
orbiom
)
This tutorial will use demo inputs distributed with the HUMAnN 2.0
installation under the examples
directory. If you installed from pip
or Conda
and don't have this handy, you can obtain a copy by
right-clicking these links and selecting "save link as":
The most basic HUMAnN 2.0 execution uses nucleotide mapping and translated
search to provide organism-specific gene and pathway abundance profiles
from a single metagenome. By default, gene families are annotated using
UniRef90 identifiers and pathways
using MetaCyc IDs. From the examples
folder (or
the location where you saved demo.fastq
above) execute the following
:
$ humann2 --input demo.fastq --output demo_fastq
Which yields (blank lines removed):
Creating output directory: demo_fastq
Output files will be written to: demo_fastq
Running metaphlan2.py ........
Found g__Bacteroides.s__Bacteroides_dorei : 59.18% of mapped reads
Found g__Bacteroides.s__Bacteroides_vulgatus : 40.82% of mapped reads
Total species selected from prescreen: 2
Selected species explain 100.00% of predicted community composition
Creating custom ChocoPhlAn database ........
Running bowtie2-build ........
Running bowtie2 ........
Total bugs from nucleotide alignment: 2
g__Bacteroides.s__Bacteroides_vulgatus: 7336 hits
g__Bacteroides.s__Bacteroides_dorei: 8820 hits
Total gene families from nucleotide alignment: 3685
Unaligned reads after nucleotide alignment: 23.0666666667 %
Running diamond ........
Aligning to reference database: uniref90_demo_prots.dmnd
Total bugs after translated alignment: 3
g__Bacteroides.s__Bacteroides_vulgatus: 7336 hits
unclassified: 974 hits
g__Bacteroides.s__Bacteroides_dorei: 8820 hits
Total gene families after translated alignment: 3728
Unaligned reads after translated alignment: 18.5571428571 %
Computing gene families ...
Computing pathways abundance and coverage ...
Output files created:
demo_fastq/demo_genefamilies.tsv
demo_fastq/demo_pathabundance.tsv
demo_fastq/demo_pathcoverage.tsv
In newer versions of HUMAnN 2.0 the flag --gap-fill
is set to on
by
default. If you are using a version prior to v0.9.7, you will want to
add --gap-fill on
to the commands. Gap-filling provides robustness to
missing (undersampled) reactions when quantifying metabolic pathways,
which occurs frequently in the demo datasets due to their low sequencing
depth.
Please note, the demo outputs in this tutorial were generated using HUMAnN 2.0 v0.11.1 with the demo databases. If you are running a different version of HUMAnN 2.0 or are using a different set of databases, the outputs might differ.
- Gap filling allows us to quantify a pathway even when a small number of its reactions are conspicuously absent. What might be some pros/cons of this approach in real data?
When HUMAnN 2.0 is run from any input type, three main output files will be created:
$OUTPUT_DIR/$SAMPLE_genefamilies.tsv
$OUTPUT_DIR/$SAMPLE_pathabundance.tsv
$OUTPUT_DIR/$SAMPLE_pathcoverage.tsv
For the demo.fastq
file analyzed above, these look like:
demo_fastq/demo_genefamilies.tsv
demo_fastq/demo_pathabundance.tsv
demo_fastq/demo_pathcoverage.tsv
Let's dig into the contents of the individual files:
This file lists the abundances of each gene family in the community
in RPK (Reads Per Kilobase) units. Each gene family's total abundance is
broken down into the contributions from individual organisms when
possible (delineated by the pipe character, e.g.
UniRef90_R6HHA8|g__Bacteroides.s__Bacteroides_dorei
), or
unclassified
when mappings are only found using translated search
(which is not organism-specific). We refer to this format as "stratified
output" and the individual per-organism and unclassified contributions
are referred to as "strata" (singular "stratum"). By default, gene families
are identified by their unique UniRef90 identifiers (e.g. UniRef90_R6HHA8 in the example above). While these are convenient for computing, they are not very easy to interpret by eye. We'll explore in a subsequent section how to attach human-readable names to UniRef90 (and other) identifiers.
The RPKs of reads that map to gene families that lack UniRef90 IDs are captured as
UniRef90_unknown
, and unmapped read abundances as UNMAPPED
. The
total RPK of each community is thus captured by summing either 1)
UNMAPPED
, _unknown
, and all UniRef IDs prior to organism-specific
decomposition, or 2) UNMAPPED
, _unknown|<organism>
, and all
UniRef|<organism>
IDs. The sum of either set of values can then be
normalized to CPMs (Copies Per Million) or relative
abundances.
To inspect the demo_genefamilies.tsv
file, use:
$ less -S demo_fastq/demo_genefamilies.tsv
Which yields:
# Gene Family demo_Abundance-RPKs
UNMAPPED 3897.0000000000
UniRef90_unknown 1633.3754216411
UniRef90_unknown|g__Bacteroides.s__Bacteroides_dorei 894.7763053061
UniRef90_unknown|g__Bacteroides.s__Bacteroides_vulgatus 738.5991163350
UniRef90_R6HHA8 333.3333333333
UniRef90_R6HHA8|g__Bacteroides.s__Bacteroides_dorei 333.3333333333
UniRef90_R7NYS9 166.6666666667
UniRef90_R7NYS9|g__Bacteroides.s__Bacteroides_vulgatus 166.6666666667
UniRef90_D1K9F5 66.6666666667
Let's find some of the "unclassified" stratifications in this file using
the grep
command:
$ grep 'unclassified' demo_fastq/demo_genefamilies.tsv | less -S
Which yields:
UniRef90_X6L320|unclassified 28.0905985144
UniRef90_U5FT06|unclassified 25.9926484436
UniRef90_W8YTG4|unclassified 25.2752668904
UniRef90_Q9ZUH4|unclassified 23.5421011503
UniRef90_Z5XVM9|unclassified 23.4598539432
UniRef90_Z9JRB3|unclassified 22.3949839456
UniRef90_Q8GT21|unclassified 21.6993097732
UniRef90_W7DVV5|unclassified 21.2467233744
UniRef90_W7T1N9|unclassified 20.6229542533
UniRef90_W1IYL7|unclassified 20.2552497738
- Out of the box, HUMAnN 2.0 can quantify protein families at 90 or 50% identity (via UniRef90 and UniRef50, respectively). What are the advantages of quantifying UniRef90 gene families? UniRef50 families?
This file lists the abundances of each pathway in the community, also in RPK units as described for gene families. For details on the calculation of pathway abundances from constituent gene family abundances, see the HUMAnN 2.0 User Manual. Like gene families, pathways are broken down into per-organism contributions when possible (again delineated by the pipe character), with the total read abundance consisting of pathways assigned to organisms, UNMAPPED
reads, and UNINTEGRATED
reads that are attributable to specific gene families (typically also within specific organisms), but for gene families that are not annotated to any pathway in the corresponding database (e.g. MetaCyc). As with gene families, RPKs can be normalized to copies per million or relative abundances by summing either 1) UNMAPPED
reads, UNINTEGRATED
organism-specific features, and organism-specific (i.e. pipe-containing) pathways, or 2) UNMAPPED
reads, total UNINTEGRATED
features, and pathways that are not broken down per organism.
To inspect the demo_pathabundance.tsv
file, use:
$ less -S demo_fastq/demo_pathabundance.tsv
Which yields:
# Pathway demo_Abundance
UNMAPPED 1225.9730368639
UNINTEGRATED 5908.3755167593
UNINTEGRATED|g__Bacteroides.s__Bacteroides_dorei 3123.1902040899
UNINTEGRATED|g__Bacteroides.s__Bacteroides_vulgatus 2753.6043067990
UNINTEGRATED|unclassified 72.8101423152
PWY-6305: putrescine biosynthesis IV 30.3913024756
PWY-6305: putrescine biosynthesis IV|unclassified 30.3913024756
PWY-4203: volatile benzenoid biosynthesis I (ester formation) 22.5319052245
PWY-4203: volatile benzenoid biosynthesis I (ester formation)|unclassified 22.5319052245
PWY490-3: nitrate reduction VI (assimilatory) 21.3761301200
PWY490-3: nitrate reduction VI (assimilatory)|unclassified 21.3761301200
PWY-1269: CMP-3-deoxy-D-manno-octulosonate biosynthesis I 19.6850995143
PWY-1269: CMP-3-deoxy-D-manno-octulosonate biosynthesis I|unclassified 18.6035804538
PWY-5173: superpathway of acetyl-CoA biosynthesis 17.0384635071
PWY-5173: superpathway of acetyl-CoA biosynthesis|unclassified 15.1040744758
PWY-3781: aerobic respiration I (cytochrome c) 15.9398653899
PWY-3781: aerobic respiration I (cytochrome c)|unclassified 15.9398653899
PWY-7279: aerobic respiration II (cytochrome c) (yeast) 15.9398653899
PWY-7279: aerobic respiration II (cytochrome c) (yeast)|unclassified 15.9398653899
- Unlike gene family abundance, stratified pathway abundance does not necessarily sum to yield the community total pathway abundance. Why not? (Hint: HUMAnN 2.0 estimates the number of complete pathways in the community or a given stratification.)
This file lists the coverage of each pathway in the community,
that is, the probability that it is provided as a unit to the metagenome
or metatranscriptome either 1) within one or more particular organisms
(again pipe-delimited) or 2) across one or more organisms in aggregate.
For details on the calculation of pathway coverage from constituent gene
family abundances, see the HUMAnN 2.0 User
Manual. Pathways' coverages
should be considered probabilistic presence/absence calls for each
pathway and NOT summed, and only cataloged pathways are individually
analyzed this way. To inspect the demo_pathcoverage.tsv
file, use:
$ less -S demo_fastq/demo_pathcoverage.tsv
Which yields:
# Pathway demo_Coverage
UNMAPPED 1.0000000000
UNINTEGRATED 1.0000000000
UNINTEGRATED|g__Bacteroides.s__Bacteroides_dorei 1.0000000000
UNINTEGRATED|g__Bacteroides.s__Bacteroides_vulgatus 1.0000000000
UNINTEGRATED|unclassified 1.0000000000
PWY-6305: putrescine biosynthesis IV 0.9999818077
PWY-6305: putrescine biosynthesis IV|unclassified 0.9401999667
PWY-4203: volatile benzenoid biosynthesis I (ester formation) 0.9992342795
PWY-4203: volatile benzenoid biosynthesis I (ester formation)|unclassified 0.6687522557
PWY490-3: nitrate reduction VI (assimilatory) 0.9990091413
PWY490-3: nitrate reduction VI (assimilatory)|unclassified 0.6313428688
PWY-1269: CMP-3-deoxy-D-manno-octulosonate biosynthesis I 0.9971350238
PWY-1269: CMP-3-deoxy-D-manno-octulosonate biosynthesis I|unclassified 0.4525574805
PWY-5173: superpathway of acetyl-CoA biosynthesis 0.9953786390
PWY-5173: superpathway of acetyl-CoA biosynthesis|unclassified 0.2231965843
PWY-3781: aerobic respiration I (cytochrome c) 0.9883574322
PWY-3781: aerobic respiration I (cytochrome c)|unclassified 0.2386440470
PWY-7279: aerobic respiration II (cytochrome c) (yeast) 0.9883574322
PWY-7279: aerobic respiration II (cytochrome c) (yeast)|unclassified 0.2386440470
The raw results of mapping metagenome/metatranscriptome reads against a
gene family nucleotide database can be saved as a SAM/BAM file and used
to run HUMAnN 2.0 after the mapping (e.g. bowtie2
) step. The only
requirement is that gene family names provided in such a file be
annotated using either 1) identifiers compatible with HUMAnN 2.0's default
mapping to UniRef, or 2) UniRef identifiers directly. Unmapped reads in
such an input file are searched against a non-organism-specific protein
database as are unmapped reads in a "full" HUMAnN 2.0 run. :
$ less -S demo.sam
which yields:
s__Bacteroides_dorei_read000001 16 gi|423242213|ref|NZ_JH724160.1|:185833-186759|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_R7NX76|UniRef50_R5GDB8|927 423 11 100M * 0 0 TTTCGTATTAGTGGAGTATGGCATTGAAAGTACGGATGATGACACGTTGCGCAGGATAAACCGGGGACATACTTTTGCCGTTTCTGCAGAGGCTGTTCGG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-12 XS:i:-30 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:3T38A57 YT:Z:UU
s__Bacteroides_dorei_read000002 0 gi|423245752|ref|NZ_JH724135.1|:381614-382081|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_A6L5K0|UniRef50_A6L5K0|468 88 25 100M * 0 0 TTTACTGTAGTCGCCCTATTGGGATTACAGGATTCAGACTCTAACATAACCATCGGAAACAACTTCGAGAGCAAAAAGCTAGGTAAAAAAGGAATTTTCA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-6 XS:i:-42 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:96A3 YT:Z:UU
s__Bacteroides_dorei_read000003 16 gi|423238261|ref|NZ_JH724153.1|:187959-189296|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_A6L5D4|UniRef50_C6X591|1338 1159 0 100M * 0 0 AATACACGACAGGAAGATTATTCCGGCAAGAATTTTTCGGAGTTTGAACTGAGCGCTATGGAAAAACGTCATATTGCTAAAGTTCTGCAACATACTAAAG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-24 XS:i:-36 XN:i:0 XM:i:4 XO:i:0 XG:i:0 NM:i:4 MD:Z:5C28A16G0C47 YT:Z:UU
s__Bacteroides_dorei_read000004 16 gi|294777006|ref|NZ_ADKO01000074.1|:c27580-26633|821|g__Bacteroides.s__Bacteroides_vulgatus|UniRef90_A6L7A5|UniRef50_R7NQK6|948 802 1 100M * 0 0 CAACACTCACGGGAAATCCCCTTCAATGAATTATGGGACAGAATGCTAAGAATCAGAAAAAACAATTATATACATCTTCCACTTTACCTAAAAATAAAGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-12 XS:i:-12 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:85C0T13 YT:Z:UU
s__Bacteroides_dorei_read000005 0 gi|423245752|ref|NZ_JH724135.1|:297336-298682|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_A6L5C5|UniRef50_R6HHA5|1347 637 23 100M * 0 0 ATAACATCCTTACATTCTGATGAAATTTGTTTCCGAGGTATGGAACATACTGGACCTCTTACATACGGTGAGGCCGAGAATTTTTTTGATAGCGGCATTG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-24 XN:i:0 XM:i:4 XO:i:0 XG:i:0 NM:i:4 MD:Z:51C2A16A12A15 YT:Z:UU
s__Bacteroides_dorei_read000006 16 gi|423231083|ref|NZ_JH724113.1|:c774054-771490|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_unknown|UniRef50_D5EYI2|2565 184 7 100M * 0 0 AATCCCGTTGCTTTATTGGAACGTTTGAGCTATGAAAAGCAGGAGGCATTAACCCAGGATAAGGTTATTGTAAAGCGCCTGAACGATGTATATGCTAAAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-18 XS:i:-30 XN:i:0 XM:i:3 XO:i:0 XG:i:0 NM:i:3 MD:Z:40T24G12A21 YT:Z:UU
s__Bacteroides_dorei_read000007 16 gi|423231083|ref|NZ_JH724113.1|:c74511-72820|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_R7P3C8|UniRef50_X5DCD9|1692 629 1 100M * 0 0 TGGCAGGAGAAGCCTATCTTCGGATGGGTATGCGTGATGCTTCCTAATTCAAGAAGGCGGAAGATGCGGTGACGCCTATTATACCAGGTGGCAAATATGA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-18 XS:i:-18 XN:i:0 XM:i:3 XO:i:0 XG:i:0 NM:i:3 MD:Z:46T36A7T8 YT:Z:UU
s__Bacteroides_dorei_read000009 0 gi|423227774|ref|NZ_JH724110.1|:782275-785145|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_unknown|UniRef50_R7DSH7|2871 804 2 100M * 0 0 AGGTACAGCCGGCTTTGCCTCCACCGAGAAAAAACCAGAAAAACCCTTTGCACAATGGATTCCCCGTATTCCCGAAACAGGAAAATATGCCGTCGATGTG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-24 XS:i:-30 XN:i:0 XM:i:4 XO:i:0 XG:i:0 NM:i:4 MD:Z:10C25G6G50T5 YT:Z:UU
s__Bacteroides_dorei_read000012 16 gi|423227774|ref|NZ_JH724110.1|:c1276847-1276389|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_R6HWU5|UniRef50_R5JSB1|459 11 40 100M * 0 0 TATCTTATAAAATGTCCTACTATGCTTTATATGTCTGCTTTGCCGTTATTTTGGTAGTATTGGGCATGTTCTTCTTGGTAGGTTATAATAATCCGGTGGG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-12 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:4A77A17 YT:Z:UU
s__Bacteroides_dorei_read000013 16 gi|423227774|ref|NZ_JH724110.1|:c157707-156019|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_R6AN66|UniRef50_P9WQK2|1689 1394 12 100M * 0 0 TGGCACTGAAGGAAGAAGGCAACGTGCTGCTGCTGGACGAGCCTACCAATGATATTGACGTGAACTCGCTGCGCGCGCTGGAGGAAGGTCTGGACGATTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-6 XS:i:-18 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:65A34 YT:Z:UU
Here, we can also demonstrate speeding up a HUMAnN 2.0 run by taking
advantage of multiple cores using the --threads
argument:
$ humann2 --input demo.sam --output demo_sam --threads 4
which yields (blank lines removed):
Creating output directory: demo_sam
Output files will be written to: demo_sam
Process the sam mapping results ...
Computing gene families ...
Computing pathways abundance and coverage ...
Output files created:
demo_sam/demo_genefamilies.tsv
demo_sam/demo_pathabundance.tsv
demo_sam/demo_pathcoverage.tsv
Note the creation of the three expected output files.
HUMAnN 2.0 typically searches metagenome/metatranscriptome reads against two databases: first a nucleotide database using rapid mapping, then a broader protein database using more computationally expensive translated search. One can execute HUMAnN 2.0 using either search alone, however; using nucleotides only is more computationally efficient but less sensitive, and using translated search alone can emulate HUMAnN 1.0 output or be carried out even for very novel or unusual microbial communities (i.e. those with few reference genomes available). If BLASTX tab-delimited M8 format search results have already been generated for HUMAnN 2.0-compatible protein IDs (UniRef90 by default), they can be run by providing the appropriate input format. Let's examine such a file:
$ less -S demo.m8
Which yields:
s__Bacteroides_vulgatus_read000635|100 UniRef90_X8JRH1|978 57.9 19 8 0 95 39 189 207 2.0e-01 21.6
unclassified_read000001|100 UniRef90_W8YTG4|1614 90.6 32 3 0 98 3 382 413 9.0e-15 65.9
unclassified_read000002|100 UniRef90_U5FT06|1224 90.9 33 3 0 100 2 127 159 2.1e-14 64.7
unclassified_read000003|100 UniRef90_V5V2L3|2169 90.6 32 3 0 3 98 388 419 4.5e-14 63.5
unclassified_read000004|100 UniRef90_U5FT06|1224 90.6 32 3 0 99 4 148 179 4.6e-14 63.5
unclassified_read000005|100 UniRef90_X8FHU5|1080 90.6 32 3 0 98 3 240 271 3.1e-15 67.4
unclassified_read000006|100 UniRef90_Z9JXD8|1428 90.6 32 3 0 98 3 22 53 2.6e-14 64.3
unclassified_read000007|100 UniRef90_Q6XMI3|1137 90.6 32 3 0 3 98 155 186 2.0e-14 64.7
unclassified_read000007|100 UniRef90_Q9SPV4|1077 65.6 32 11 0 3 98 129 160 1.5e-09 48.5
unclassified_read000008|100 UniRef90_Z9JUT2|1308 90.6 32 3 0 98 3 222 253 2.4e-15 67.8
To run HUMAnN 2.0 using this file as input, use:
$ humann2 --input demo.m8 --output demo_m8
This again produces the three expected outputs (blank lines removed):
Creating output directory: /demo_m8
Output files will be written to: /demo_m8
Process the blastm8 mapping results ...
Computing gene families ...
Computing pathways abundance and coverage ...
Output files created:
/demo_m8/demo_genefamilies.tsv
/demo_m8/demo_pathabundance.tsv
/demo_m8/demo_pathcoverage.tsv
Conversely, to run HUMAnN 2.0 without any translated protein search (i.e.
using only rapid nucleotide mapping alone), one can provide the
--bypass-translated-search
flag. This applies to FASTA
, FASTQ
,
SAM/BAM
, or other raw nucleotide inputs:
$ humann2 --input demo.fastq --output demo_fastq_noaa --threads 4 --bypass-translated-search
Note the lack of DIAMOND
execution in this example as compared to the
FASTQ
analysis above:
Creating output directory: demo_fastq_noaa
Output files will be written to: demo_fastq_noaa
Running metaphlan2.py ........
Found g__Bacteroides.s__Bacteroides_dorei : 59.18% of mapped reads
Found g__Bacteroides.s__Bacteroides_vulgatus : 40.82% of mapped reads
Total species selected from prescreen: 2
Selected species explain 100.00% of predicted community composition
Creating custom ChocoPhlAn database ........
Running bowtie2-build ........
Running bowtie2 ........
Total bugs from nucleotide alignment: 2
g__Bacteroides.s__Bacteroides_vulgatus: 7436 hits
g__Bacteroides.s__Bacteroides_dorei: 8820 hits
Total gene families from nucleotide alignment: 3685
Unaligned reads after nucleotide alignment: 23.0666666667 %
Bypass translated search
Computing gene families ...
Computing pathways abundance and coverage ...
Output files created:
demo_fastq_noaa/demo_genefamilies.tsv
demo_fastq_noaa/demo_pathabundance.tsv
demo_fastq_noaa/demo_pathcoverage.tsv
It is informative to compare the results of this demo file with and without translated search; even in a small, subsampled input, including amino acid search is more computationally costly but recovers additional gene families. To count unique gene families in the original file, execute:
$ cut -f1 demo_fastq/demo_genefamilies.tsv | tail -n +3 | grep -v "|" | sort -u | wc -l
(cut -f1
extracts the first column of the file, the gene headers;
tail -n +3
removes the first two lines of the file, the header and
UNMAPPED
; grep -v "|"
removes the stratified headers; sort -u
isolates the unique headers; and wc -l
counts those headers.) The
resulting answering is 3730
(unique gene families). The same command
executed on the file demo_fastq_noaa/demo_genefamilies.tsv
yields
3687
unique gene families. We infer that 3728 - 3685 = 43
additional
gene families were uncovered using translated search.
In general, any HUMAnN 2.0 intermediate file can be saved and used to resume processing from an intermediate point in the analysis pipeline, and several steps including taxonomic pre-screening, nucleotide search, or amino acid search can be independently skipped. For more information, see the relevant sections of the HUMAnN 2.0 User Manual.
- Why is nucleotide-level search generally faster than translated search? What makes nucleotide-level search extra fast in HUMAnN 2.0?
- Under what circumstances would skipping translated search be reasonable? Under what circumstances would skipping translated search be a terrible idea?
To reduce file size in HUMAnN 2.0 tables containing large numbers of
features, feature names are not included by default. To attach feature
names to a table containing only feature identifiers, use the
humann2_rename_table
script:
$ humann2_rename_table --input demo_fastq/demo_genefamilies.tsv --output demo_fastq/demo_genefamilies-names.tsv --names uniref90
Features without names will be named "NO_NAME". Inspect the named
features in the output file demo_fastq/demo_genefamilies-names.tsv
using less
:
# Gene Family demo_Abundance-RPKs
UniRef90_C3R7K2: Polysaccharide deacetylase 45.7650273224
UniRef90_C3R7K2: Polysaccharide deacetylase|g__Bacteroides.s__Bacteroides_dorei 45.7650273224
UniRef90_A6KZR7: Fumarate reductase iron-sulfur cluster protein subunit 16.7427701674
UniRef90_A6KZR7: Fumarate reductase iron-sulfur cluster protein subunit|g__Bacteroides.s__Bacteroides_dorei 9.1324200913
UniRef90_A6KZR7: Fumarate reductase iron-sulfur cluster protein subunit|g__Bacteroides.s__Bacteroides_vulgatus 7.6103500761
UniRef90_A6L5B2: Ferredoxin domain containing protein 16.2037037037
UniRef90_A6L5B2: Ferredoxin domain containing protein|g__Bacteroides.s__Bacteroides_vulgatus 9.2592592593
UniRef90_A6L5B2: Ferredoxin domain containing protein|g__Bacteroides.s__Bacteroides_dorei 6.9444444444
UniRef90_R6HM78: Hydrogenase maturation GTPase HydF 14.6520146520
To facilitate comparisons between samples with different sequencing
depths, it is beneficial to normalize gene family and pathway abundance
RPK values to compositional units (copies per million or relative
abundance). Pathway coverage does not need to be normalized in this
manner. We can normalize a HUMAnN 2.0 abundance file using the utility
script humann2_renorm_table
:
$ humann2_renorm_table --input demo_fastq/demo_genefamilies.tsv --output demo_fastq/demo_genefamilies-cpm.tsv --units cpm --update-snames
Inspect the output file demo_fastq/demo_genefamilies-cpm.tsv
using the
less
command as we did above:
# Gene Family demo_Abundance-CPM
UNMAPPED 167194
UniRef90_unknown 70077
UniRef90_unknown|g__Bacteroides.s__Bacteroides_dorei 38388.7
UniRef90_unknown|g__Bacteroides.s__Bacteroides_vulgatus 31688.2
UniRef90_R6HHA8 14301.1
UniRef90_R6HHA8|g__Bacteroides.s__Bacteroides_dorei 14301.1
UniRef90_R7NYS9 7150.53
UniRef90_R7NYS9|g__Bacteroides.s__Bacteroides_vulgatus 7150.53
UniRef90_D1K9F5 2860.21
Note the difference in scale: units tend to be larger than in the
original file as a result of being scaled to sum to 1 million. Also note
that the sample header has been updated from demo_Abundance-RPK to
demo_Abundance-CPM
courtesy of the --update-snames
flag. Let's
verify that the unstratified abundance values sum up to 1 million:
$ grep -v "#" demo_fastq/demo_genefamilies-cpm.tsv | grep -v "|" | cut -f2 | python -c "import sys; print sum(float(l) for l in sys.stdin)"
The command above uses grep
to exclude the header line (marked by #
)
and the stratified lines (containing |
). The cut
command then pulls
out the second column (CPM values). The python -c
command runs a
one-line python program to sum up the CPM values. This returns
1000000.46371
(i.e. ~1 million modulo a small amount of rounding
error).
- HUMAnN 2.0 natively normalizes gene family abundance by gene length. Why is this step important? (In other words, why are RPK units more useful than raw read counts?)
- Under what circumstances would it be useful to consider gene/pathway abundance before normalizing by sample read depth?
HUMAnN 2.0's default "units" of microbial function are gene families and
metabolic pathways. However, starting from gene family abundance, it is
possible to reconstruct the abundance of other functional categories in
a microbiome using the humann2_regroup_table
script. Inspect the
regrouping options using the humann2_regroup_table -h command
.
Let's regroup our CPM-normalized gene family abundance values to enzyme commission (EC) categories:
$ humann2_regroup_table --input demo_fastq/demo_genefamilies-cpm.tsv --output demo_fastq/level4ec-cpm.tsv --groups uniref90_level4ec
We are given the output message:
Original Feature Count: 3729; Grouped 1+ times: 283 (%7.6); Grouped 2+ times: 4 (%0.1)
Indicating that the original file contained 3,729 features (UniRef90
gene families). Of these, 283 were annotated to a level-4 EC, and 4 were
annotated to more than one level-4 EC. Inspect the output file
demo_fastq/level4ec-cpm.tsv
using the less
command:
# Gene Family demo_Abundance-CPM
UNMAPPED 167194.0
UNGROUPED 761276.764
UNGROUPED|g__Bacteroides.s__Bacteroides_dorei 433049.25
UNGROUPED|g__Bacteroides.s__Bacteroides_vulgatus 323309.361
UNGROUPED|unclassified 4918.036
1.1.1.169 52.771
1.1.1.169|g__Bacteroides.s__Bacteroides_dorei 52.771
1.1.1.205 124.628
1.1.1.205|g__Bacteroides.s__Bacteroides_vulgatus 124.628
We see that UniRef90 features have been regrouped as level-4 EC
features, such as 1.1.1.169. A new stratified feature also appears in
the output: UNGROUPED
. This feature reports the total abundance of
features that were not otherwise included in a group (similar to the
UNINTEGRATED
feature of the pathways output). If each feature was
regrouped into at most one new feature, the total community abundance of
features in the regrouped file (including UNGROUPED
and UNMAPPED
)
would still be 1 million. The actual sum (1,003,757.446) is slightly
higher than 1 million due to the three UniRef90 features whose abundance
was counted in two different level-4 EC features. Using the
humann2_rename_table
script, attach names to the unnamed level-4 EC
features using the option --names ec
.
1.1.1.169: 2-dehydropantoate 2-reductase 52.771
1.1.1.169: 2-dehydropantoate 2-reductase|g__Bacteroides.s__Bacteroides_dorei 52.771
1.1.1.205: IMP dehydrogenase 124.628
1.1.1.205: IMP dehydrogenase|g__Bacteroides.s__Bacteroides_vulgatus 124.628
- From a statistical testing perspective, what is one major advantage of regrouping raw gene abundance to a smaller list of gene groups?
- When regrouping genes to ECs (or KOs or MetaCyc reactions), we sum over the gene sequences belonging to a particular group. Under what circumstance would it make sense to average over the members of a group rather than sum? (Hint: the answer is related to the difference between the "is_a" and "part_of" relationships used in the Gene Ontology.)
It is in some cases useful to understand the collection of intermediate files generated by the individual steps of a typical HUMAnN 2.0 run:
$ ls -lh demo_fastq/demo_humann2_temp/
Which yields:
-rw-rw-r-- 1 11M Aug 2 11:22 demo_bowtie2_aligned.sam
-rw-rw-r-- 1 2.9M Aug 2 11:22 demo_bowtie2_aligned.tsv
-rw-rw-r-- 1 11M Aug 2 11:22 demo_bowtie2_index.1.bt2
-rw-rw-r-- 1 3.6M Aug 2 11:22 demo_bowtie2_index.2.bt2
-rw-rw-r-- 1 136K Aug 2 11:22 demo_bowtie2_index.3.bt2
-rw-rw-r-- 1 3.6M Aug 2 11:22 demo_bowtie2_index.4.bt2
-rw-rw-r-- 1 11M Aug 2 11:22 demo_bowtie2_index.rev.1.bt2
-rw-rw-r-- 1 3.6M Aug 2 11:22 demo_bowtie2_index.rev.2.bt2
-rw-rw-r-- 1 628K Aug 2 11:22 demo_bowtie2_unaligned.fa
-rw-rw-r-- 1 17M Aug 2 11:22 demo_custom_chocophlan_database.ffn
-rw-rw-r-- 1 86K Aug 2 11:22 demo_diamond_aligned.tsv
-rw-rw-r-- 1 521K Aug 2 11:22 demo_diamond_unaligned.fa
-rw-rw-r-- 1 18K Aug 2 11:23 demo.log
-rw-rw-r-- 1 55K Aug 2 11:21 demo_metaphlan_bowtie2.txt
-rw-rw-r-- 1 958 Aug 2 11:22 demo_metaphlan_bugs_list.tsv
These include SAM
files containing the alignment of input reads
against the default nucleotide database, which are first reduced by
counting up and tabulating the hits to each annotated gene family. It
can be useful to reprocess either the reads unaligned after nucleotide
mapping (bowtie2_unaligned.fa
) or, more likely, the reads still
unaligned after amino acid search. Finally, the MetaPhlAn2 bowtie2.txt
intermediate output is itself saved. For more
information, see the HUMAnN 2.0 User Manual's output file
section.
It is typical to execute HUMAnN 2.0 on multiple individual samples, then merge the resulting gene families or pathways into a single tab-delimited multi-sample table. When combining or comparing any HUMAnN 2.0 output, it is critical to normalize the default RPK abundance units (which will be sensitive to the sequencing depth of each sample) into RPKMs (which will not).
We can demonstrate this by downloading a collection of six subsampled input files from the Human Microbiome Project. The original files, and many others, can be downloaded from the HMP DACC.
- 763577454-SRS014459-Stool.fasta
- 763577454-SRS014464-Anterior_nares.fasta
- 763577454-SRS014470-Tongue_dorsum.fasta
- 763577454-SRS014472-Buccal_mucosa.fasta
- 763577454-SRS014476-Supragingival_plaque.fasta
- 763577454-SRS014494-Posterior_fornix.fasta
Click on each individual subsampled file to download. Open a terminal and change directories into the folder where you have downloaded the files.
You will see something like the following contents if you run
$ ls -lh
:
-rw-r--r-- 1 2.7M Jun 4 11:33 763577454-SRS014459-Stool.fasta
-rw-r--r-- 1 2.6M Jun 4 11:33 763577454-SRS014464-Anterior_nares.fasta
-rw-r--r-- 1 2.7M Jun 4 11:33 763577454-SRS014470-Tongue_dorsum.fasta
-rw-r--r-- 1 2.7M Jun 4 11:34 763577454-SRS014472-Buccal_mucosa.fasta
-rw-r--r-- 1 2.7M Jun 4 11:34 763577454-SRS014476-Supragingival_plaque.fasta
-rw-r--r-- 1 2.7M Jun 4 11:34 763577454-SRS014494-Posterior_fornix.fasta
Process each input file into a single output directory:
$ humann2 -i 763577454-SRS014459-Stool.fasta -o 763577454
$ humann2 -i 763577454-SRS014464-Anterior_nares.fasta -o 763577454
$ humann2 -i 763577454-SRS014470-Tongue_dorsum.fasta -o 763577454
$ humann2 -i 763577454-SRS014472-Buccal_mucosa.fasta -o 763577454
$ humann2 -i 763577454-SRS014476-Supragingival_plaque.fasta -o 763577454
$ humann2 -i 763577454-SRS014494-Posterior_fornix.fasta -o 763577454
Alternatively, if you're comfortable with shell scripting, you can use the following command to process all six at once:
$ for f in *.fasta; do humann2 -i $f -o 763577454; done
If you're feeling particularly impatient, add a & at the end (i.e.
763577454 &
), but use that form with caution.
Regardless of how you run HUMAnN 2.0, this will result in six output
directories each containing the three main analysis types (gene
families, pathway abundances, and pathway coverages) along with
intermediate files. Let's merge the 6 per-sample gene family abundance
outputs into a single table for this dataset using the script
humann2_join_tables
:
$ humann2_join_tables -i 763577454 -o 763577454_genefamilies.tsv --file_name genefamilies
Inspect this file using the less
command:
# Gene Family 763577454-SRS014459-Stool_Abundance-RPKs 763577454-SRS014464-Anterior_nares_Abundance-RPKs 763577454-SRS014470-Tongue_dorsum_Abundance-RPKs 763577454-SRS014472-Buccal_mucosa_Abundance-RPKs 763577454-SRS014476-Supragingival_plaque_Abundance-RPKs 763577454-SRS014494-Posterior_fornix_Abundance-RPKs
UNMAPPED 14210.0000000000 17696.0000000000 16623.00000000009766.0000000000 16605.0000000000 13612.0000000000
UniRef90_A0A015N407 2.3310023310 0 0 0 0 0
UniRef90_A0A015N407|g__Bacteroides.s__Bacteroides_cellulosilyticus 2.3310023310 0 0 0 0 0
UniRef90_A0A015N416 5.2910052910 0 0 0 0 0
UniRef90_A0A015N416|g__Bacteroides.s__Bacteroides_cellulosilyticus 5.2910052910 0 0 0 0 0
UniRef90_A0A015P3I5 3.7453183521 0 0 0 0 0
UniRef90_A0A015P3I5|g__Parabacteroides.s__Parabacteroides_distasonis 3.7453183521 0 0 0 0 0
UniRef90_A0A015QBU0 0.6485084306 0 0 0 0 0
UniRef90_A0A015QBU0|g__Bacteroides.s__Bacteroides_ovatus 0.6485084306 00 0 0 0
0
It is especially important to renormalize this merged file as we're now comparing samples with different sequencing depths:
$ humann2_renorm_table -i 763577454_genefamilies.tsv -o 763577454_genefamilies_cpm.tsv --units cpm
For more information, see the HUMAnN 2.0 User Manual's description of a standard analysis workflow.
There are many approaches for associating microbial measurements with sample metadata, including several options in the bioBakery:
- LEfSe (for simple study designs)
- MaAsLin (for complex study designs)
- HAllA (for high-dimensional metadata)
When associating metadata with HUMAnN 2.0 features, it is often beneficial
to associate with community totals and avoid testing each individual
feature stratification (to improve statistical power). This is the
approach used by the HUMAnN 2.0 utility script humann2_associate
, which
can compare feature totals across samples with 1) a single continuous
metadatum (via the Spearman Correlation) or 2) a single categorical
metadatum (via the Kruskal-Wallis H-test). Notably, this is a naive
approach to association, but it is useful for tutorial purposes.
For this and the next section, we'll operate on a table of pathway abundances from ~400 samples from the Human Microbiome Project:
Download this file to your working directory and examine it using
less
:
$ less -S hmp_pathabund.pcl
Which yields:
FEATURE \ SAMPLE SRS011084 SRS011086 SRS011090
STSite Stool Tongue_dorsum Buccal_mucosa
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis 0.000498359 0.000628096 0.000304951
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Acidovorax.s__Acidovorax_ebreus 0 0 0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Bacteroides.s__Bacteroides_finegoldii 0 0 0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Haemophilus.s__Haemophilus_influenzae 0 0 0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Klebsiella.s__Klebsiella_pneumoniae 0 0 0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Staphylococcus.s__Staphylococcus_epidermidis 0 0 0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Veillonella.s__Veillonella_sp_oral_taxon_158 0 0 0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|unclassified 7.89502e-06 1.60101e-05 0
We can see three samples from three different body sites (via the
STSite
metadatum). A single pathway, 1CMET2-PWY:
N10-formyl-tetrahydrofolate biosynthesis, and its stratifications are
visible. In these three samples, the pathway can only be assigned to the
unclassified stratum, and with very low abundance.
Let's associate pathway totals in this file with sample body site:
$ humann2_associate --input hmp_pathabund.pcl --last-metadatum STSite --focal-metadatum STSite --focal-type categorical --output stats.txt
This call will perform a Kruskal-Wallis H-test between each feature's community total and sample body site, summarizing by per-site means and ranking by statistical significance. Let's examine the first few lines of the file:
$ less -S stats.txt
Which yields:
# kruskal-wallis analysis of metadatum: STSite
# feature level means p-value q-value
METSYN-PWY: L-homoserine and L-methionine biosynthesis Supragingival_plaque:0.0004077|Stool:9.059e-05|Posterior_fornix:1.275e-05|Tongue_dorsum:0.0005423|Anterior_nares:2.585e-05|Buccal_mucosa:0.00112 6.477e-70 1.224e-67
PWY-5347: superpathway of L-methionine biosynthesis (transsulfuration) Supragingival_plaque:0.0003854|Stool:9.416e-05|Posterior_fornix:9.412e-06|Tongue_dorsum:0.0005187|Anterior_nares:1.71e-05|Buccal_mucosa:0.0007963 7.074e-70 1.224e-67
PWY-6386: UDP-N-acetylmuramoyl-pentapeptide biosynthesis II (lysine-containing) Supragingival_plaque:0.0007095|Stool:5.553e-06|Posterior_fornix:1.768e-06|Tongue_dorsum:0.0001989|Anterior_nares:0.0002186|Buccal_mucosa:0.001337 1.115e-69 1.286e-67
HOMOSER-METSYN-PWY: L-methionine biosynthesis I Supragingival_plaque:0.0002616|Stool:5.078e-05|Posterior_fornix:7.601e-06|Tongue_dorsum:0.0003655|Anterior_nares:1.856e-05|Buccal_mucosa:0.0009729 3.124e-69 2.702e-67
NAD-BIOSYNTHESIS-II: NAD salvage pathway II Supragingival_plaque:0.0001445|Stool:3.478e-06|Posterior_fornix:2.321e-07|Tongue_dorsum:0.0001515|Anterior_nares:6.176e-06|Buccal_mucosa:0.0005249 7.574e-68 5.241e-66
MET-SAM-PWY: superpathway of S-adenosyl-L-methionine biosynthesis Supragingival_plaque:0.000311|Stool:7.31e-05|Posterior_fornix:8.425e-06|Tongue_dorsum:0.0004838|Anterior_nares:5.945e-06|Buccal_mucosa:0.0005264 6.56e-67 3.783e-65
PWY-5188: tetrapyrrole biosynthesis I (from glutamate) Supragingival_plaque:0.0007416|Stool:1.759e-05|Posterior_fornix:3.483e-06|Tongue_dorsum:0.0006484|Anterior_nares:0.0002229|Buccal_mucosa:0.0003094 1.098e-66 5.426e-65
PWY-3001: superpathway of L-isoleucine biosynthesis I Supragingival_plaque:0.0007748|Stool:0.0002933|Posterior_fornix:1.806e-05|Tongue_dorsum:0.000893|Anterior_nares:5.512e-05|Buccal_mucosa:0.001016 2.768e-66 1.197e-64
We observe, for example, that the first pathway in the table
(METSYN-PWY: L-homoserine and L-methionine biosynthesis
) tends to be
more abundant on average at the three oral sites (buccal mucosa, tongue,
and plaque) than at the three non-oral sites (stool, posterior fornix,
and nares).
Now that we have identified this enrichment, use the grep
command to
explore the stratifications of this pathway from the hmp_pathabund.pcl
file.
$ grep 'METSYN-PWY' hmp_pathabund.pcl | less -S
- Do the major contributing species seem reasonable for an orally-enriched pathway?
HUMAnN 2.0 includes a utility humann2_barplot
to assist with visualizing
stratified outputs. Let's use this tool to make a plot of the
significantly differentially abundant pathway we identified above. (Use
see plot1.png
to open the resulting figure if working in a graphical
environment.) :
$ humann2_barplot --input hmp_pathabund.pcl --focal-feature METSYN-PWY --focal-metadatum STSite --last-metadatum STSite --output plot1.png
Not terribly informative... Let's try adding a "sort" on the summed stratified abundance:
$ humann2_barplot --sort sum --input hmp_pathabund.pcl --focal-feature METSYN-PWY --focal-metadatum STSite --last-metadatum STSite --output plot2.png
A pattern has started to emerge: we can clearly see that the oral body sites are enriched on the left (high) end of the plot, consistent with the summary statistics computed above. We can continue this line of analysis with an additional grouping by body site:
$ humann2_barplot --sort sum metadata --input hmp_pathabund.pcl --focal-feature METSYN-PWY --focal-metadatum STSite --last-metadatum STSite --output plot3.png
We see now that the pathway is most strongly enriched in the buccal mucosa, where it is contributed almost exclusively by Streptococcus species. The pathway's relative abundance in the other two body sites is generally lower, and is more often "unclassified" at the tongue body site.
- How might you find other pathways with a similar abundance distribution in this functional profile?
- What might be causing the "unclassified" contributors of homoserine/methionine synthesis in tongue dorsum communities?
- Is it surprising that three apparently closely related communities (tongue, cheek, and tooth surface) differ so strikingly in metagenomic functional potential?
Let's examine one additional pathway, COA-PWY: coenzyme A biosynthesis, which is more broadly conserved across body sites:
$ humann2_barplot --sort sum --input hmp_pathabund.pcl --focal-feature COA-PWY --focal-metadatum STSite --last-metadatum STSite --output plot4.png
Now sorting by total abundance is less informative. Let's try some additional options: sorting by ecological similarity, normalizing pathway contributions within-sample, and expanding the list of species highlighted:
$ humann2_barplot --sort similarity --top-strata 12 --scaling normalize --input hmp_pathabund.pcl --focal-feature COA-PWY --focal-metadatum STSite --last-metadatum STSite --output plot5.png
Notably, sorting by the similarity of pathway contributions has tended to group samples from the same body site even without the explicit metadata sort. This is because, for broadly distributed pathways, organismal contributions tend to follow overall sample ecology (which tends to be quite different from one body site to the next).
- Is it surprising to see coenzyme A biosynthesis present in diverse human body site microbiomes?