Microbial profiling with HUMAnN

Overview

The primary purpose of this document is to outline the process for performing microbial functional profiling using HUMAnN on the M3 MASSIVE cluster system using a Snakemake pipeline.

To make the process as user-friendly as possible for running in a cluster environment, we will utilise the tool in a Snakemake context. The original Snakemake code was provided as a tool called metannotate by the Sycuro Lab, and has been adapted to correct some issues that likely arose from changes in package versions.

HUMAnN is a pipeline for efficiently and accurately profiling the presence/absence and abundance of microbial pathways in a community from metagenomic or metatranscriptomic sequencing data (typically millions of short DNA/RNA reads). This process, referred to as functional profiling, aims to describe the metabolic potential of a microbial community and its members. More generally, functional profiling answers the question "What are the microbes in my community-of-interest doing (or capable of doing)?"

For more details, you can visit the HUMAnN GitHub page or the HUMAnN 3.0 tutorial.

Installation 🔨👷

Environment creation

To run HUMAnN, we will first create a new environment using mamba. We will use mamba over conda for its improved dependency solving speed and parallel package downloading capabilities.

You can either follow the installation instructions provided in the HUMAnN 3.0 tutorial, or you can use the provided yaml file to create your new environment.

mamba env create --file humann.yaml
mamba activate humann

Database installation

Downloading HUMAnN by itself does not install the databases required to perform the analyses; however, the package provides a utility function for this purpose. You can download these files anywhere, and HUMAnN will track their location. If you later move the files, you can update the location where HUMAnN looks for them.

For our purposes, we will require the ChocoPhlAn and UniRef90 databases, along with the utility mapping database that contains EggNOG and KO mapping information. It is preferable to store these downloaded files in a location separate from your current project so that they can be reused for other data analyses in the future. 💡

mkdir humann_dbs

humann_databases --download chocophlan full humann_dbs
humann_databases --download uniref uniref90_diamond humann_dbs
humann_databases --download utility_mapping full humann_dbs
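
If you later move these databases, you can point HUMAnN at their new location using the humann_config utility (the paths below are illustrative; nucleotide = ChocoPhlAn, protein = UniRef, utility_mapping = utility mapping):

humann_config --update database_folders nucleotide /new/path/humann_dbs/chocophlan
humann_config --update database_folders protein /new/path/humann_dbs/uniref
humann_config --update database_folders utility_mapping /new/path/humann_dbs/utility_mapping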

Testing the installation 🦺

After installation, you may optionally test your local HUMAnN environment (this takes ~1 minute).

humann_test

Metagenome functional profiling 🦠🧬

HUMAnN can use several different inputs as starting points for analysis; however, given we have already used the Sunbeam pipeline to process our raw sequencing reads and have the host-decontaminated read files ready to go, we will use these as our inputs.

Start by creating a new folder within your main project directory (N.B. your main project directory will be the one that contains the sunbeam_output folder).

mkdir humann; cd humann

Creating your sample list 📝

HUMAnN requires a list file containing the base names of the samples you want to profile. The example below shows the naming of the files within the sunbeam_output/qc/decontam folder, and the corresponding base name you should add to your sample list.

Forward                 Reverse                 Sample for list file
placebo_A_1.fastq.gz    placebo_A_2.fastq.gz    placebo_A
placebo_B_1.fastq.gz    placebo_B_2.fastq.gz    placebo_B
placebo_C_1.fastq.gz    placebo_C_2.fastq.gz    placebo_C
treatment_A_1.fastq.gz  treatment_A_2.fastq.gz  treatment_A
treatment_B_1.fastq.gz  treatment_B_2.fastq.gz  treatment_B
treatment_C_1.fastq.gz  treatment_C_2.fastq.gz  treatment_C

The list_files.txt document is a very simple .txt file with one sample name listed per line.
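
For the example samples above, list_files.txt would contain:

placebo_A
placebo_B
placebo_C
treatment_A
treatment_B
treatment_C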

NOTE: The next step assumes the read direction of your sequencing files is denoted simply by _1.fastq.gz and _2.fastq.gz. If this is not the case, rename your files (a sketch follows below) or alter the 01_merge_files.sh script before continuing.
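
For instance, if your files use an _R1/_R2 convention (an assumption for illustration; adjust the pattern to match your actual file names), a quick rename run from your main project directory could look like this:

# Rename *_R1.fastq.gz to *_1.fastq.gz, and likewise for the reverse reads
for f in sunbeam_output/qc/decontam/*_R1.fastq.gz; do
    mv "$f" "${f/_R1.fastq.gz/_1.fastq.gz}"
done
for f in sunbeam_output/qc/decontam/*_R2.fastq.gz; do
    mv "$f" "${f/_R2.fastq.gz/_2.fastq.gz}"
done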

Creating your Snakemake configuration file 🐍

Snakemake uses a config.yaml file to set the additional variables it requires to run the Snakefile.

Add the config.yaml to the humann folder and edit it accordingly. The most important step is to correctly set the file paths to the utility mapping files; without these, the pipeline will not work.
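
The exact variable names come from the template config.yaml provided with the pipeline, but it will look something like the following (all keys and paths shown here are illustrative only; use the names from the template):

# Illustrative sketch only -- key names must match the pipeline's template
list_files: "list_files.txt"    # sample base names, one per line
input_dir: "merged"             # folder containing the merged read files
# Paths to the utility mapping files downloaded earlier -- these must be correct
utility_mapping_dir: "/path/to/humann_dbs/utility_mapping"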

Merging your sequencing files

The Sunbeam pipeline outputs the processed data in the format it was provided: forward and reverse read files. For HUMAnN, however, we want a single file per sample containing both the forward and reverse reads.

Using the 01_merge_files.sh script, we will simultaneously concatenate our files and copy them into a new merged folder within our humann folder.

From within the humann folder, the script retrieves the relevant files from the sunbeam_output/qc/decontam folder and adds the concatenated files to the merged output folder.

mkdir merged
bash 01_merge_files.sh
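
At its core, the merge step is a loop along these lines (a simplified sketch of what 01_merge_files.sh does, assuming the _1/_2 naming convention; note that concatenating gzipped files like this produces a valid gzip file):

# Concatenate forward and reverse reads for each sample in the list
while read -r sample; do
    cat "../sunbeam_output/qc/decontam/${sample}_1.fastq.gz" \
        "../sunbeam_output/qc/decontam/${sample}_2.fastq.gz" \
        > "merged/${sample}.fastq.gz"
done < list_files.txt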

Run the HUMAnN pipeline 🏃

The Snakefile contains all of the steps necessary to run the HUMAnN pipeline, and will submit each step for each sample as a separate sbatch job to the M3 MASSIVE cluster.

Therefore, we don't require large amounts of memory or CPU capacity for our interactive session.

smux n --time=2-00:00:00 -J humann_main

Once the interactive session has started, we can activate our humann mamba environment inside the session and start the pipeline running. The 02_humann.sh script is configured for use with the genomics partition, so if you do not have access to this partition, simply remove the --partition and --qos flags at the end of the cluster command section.
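
For reference, the cluster command section of such a script typically resembles the following (a sketch only; the resource values and qos name here are illustrative, and the actual flags in 02_humann.sh may differ):

# Submit each rule for each sample as its own sbatch job
snakemake --snakefile Snakefile --configfile config.yaml --jobs 100 \
    --cluster "sbatch --time=4:00:00 --mem=64G --cpus-per-task=8 \
    --partition=genomics --qos=genomics"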

A main log file, humann-analysis.log.txt, will be written to a log folder created within the humann folder. It will be updated as the pipeline runs, allowing quick identification of any errors that arise (but let's hope they don't! 🤞).

mamba activate humann
bash 02_humann.sh
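
You can then follow the main log in a second terminal as the jobs run (assuming the folder is named log):

tail -f log/humann-analysis.log.txt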

Troubleshooting 🤔

Conda YAML and URL bug

There is a known bug in Snakemake (at least in the version used here; see pull request #1708) where, if the conda environment provided to conda.py is a URL path, self.is_named will be true, resulting in an error.

To fix this, navigate to the conda.py file. First, locate where your conda/miniconda/mamba environments are stored (i.e. the env location you provided in the config.yaml file). From there, conda.py should be located at:

  • humann/lib/python3.7/site-packages/snakemake/deployment/conda.py

At line 254, a property named address_argument is defined, which looks like this:

@property
def address_argument(self):
    if self.is_named:
        return "--name '{}'".format(self.address)
    else:
        return "--prefix '{}'".format(self.address)

To fix the problem, both cases should use the --prefix flag, so the property can simply return it directly:

@property
def address_argument(self):
    # Always use --prefix: self.address holds a path here even when
    # the environment was provided as a URL (and is_named is true)
    return "--prefix '{}'".format(self.address)

Save the file, and attempt to re-run the pipeline. The issue should now be corrected.
