Metagenomics standard operating procedure v2 - LangilleLab/microbiome_helper GitHub Wiki

Note that this workflow is continually being updated. If you want to use the below commands be sure to keep track of them locally.

Last updated: 9 Oct 2018 (see revisions above for earlier minor versions and Metagenomics SOPs on the SOP for earlier major versions)


Important points to keep in mind:

  • The below options are not necessarily best for your data; it is important to explore different options, especially at the read quality filtering stage.
  • If you are confused about a particular step be sure to check out the pages under Metagenomic Resources on the right side-bar.
  • You should check that only the FASTQs of interest are being specified at each step (e.g. with the ls command). If you are re-running commands multiple times or using optional steps the input filenames and/or folders could differ.
  • The tool parallel comes up several times in this workflow. Be sure to check out our tutorial on this tool here.
  • You can always run parallel with the --dry-run option to see what commands are being run to double-check they are doing what you want!

This workflow starts with raw paired-end MiSeq or NextSeq data in demultiplexed FASTQ format assumed to be located within a folder called raw_data.

1. First Steps

1.1 Join lanes together (all Illumina, except MiSeq)

Concatenate multiple lanes of sequencing together (e.g. for NextSeq data). If you do NOT do this step, remember to change cat_lanes to raw_data in the further commands below that have assumed you are working with the most common type of metagenome data.

concat_lanes.pl raw_data/* -o cat_lanes -p 4

1.2 Inspect read quality

Run fastqc to allow manual inspection of the quality of sequences.

mkdir fastqc_out
fastqc -t 4 cat_lanes/* -o fastqc_out/

1.3 (Optional) Stitch F+R reads

Stitch paired-end reads together (summary of stitching results are written to "pear_summary_log.txt"). Note: it is important to check the % of reads assembled. It may be better to concatenate the forward and reverse reads together if the assembly % is too low (see step 2.2).

run_pear.pl -p 4 -o stitched_reads cat_lanes/*

If you don't stitch your reads together at this step you will need to unzip your FASTQ files before continuing with the below commands.

2. Read Quality-Control and Contaminant Screens

2.1 Running KneadData

Use kneaddata to run pre-processing tools. First Trimmomatic is run to remove low quality sequences. Then Bowtie2 is run to screen out contaminant sequences. Below we are screening out reads that map to the human or PhiX genomes. Note KneadData is being run below on all unstitched FASTQ pairs with parallel, you can see our quick tutorial on this tool here. For a detailed breakdown of the options in the below command see this page. The forward and reverse reads will be specified by "_1" and "_2" in the output files, ignore the "R1" in each filename. Note that the \ characters at the end of each line are just to split the command over multiple lines to make it easier to read.

parallel -j 1 --link 'kneaddata -i {1} -i {2} -o kneaddata_out/ \
-db /home/shared/bowtiedb/GRCh38_PhiX --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
-t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
--bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output' \
 ::: cat_lanes/*_R1.fastq ::: cat_lanes/*_R2.fastq

Clean up the output directory (helps downstream commands) by moving the discarded sequences to a subfolder:

mkdir kneaddata_out/contam_seq

mv kneaddata_out/*_contam*.fastq kneaddata_out/contam_seq

You can produce a logfile summarizing the kneadData output with this command:

kneaddata_read_count_table --input kneaddata_out --output kneaddata_read_counts.txt

2.2 Concatenate unstitched output (Omit if data stitched above)

If you did not stitch your paired-end reads together with pear, then you can concatenate the forward and reverse FASTQs together now. Note this is done after quality filtering so that both reads in a pair are either discarded or retained. It's important to only specify the "paired" FASTQs outputted by kneaddata in the below command.

concat_paired_end.pl -p 4 --no_R_match -o cat_reads kneaddata_out/*_paired_*.fastq 

You should check over the commands that are printed to screen to make sure the correct FASTQs are being concatenated together.

3. Determine Functions with HUMAnN2

3.1 Run HUMAnN2

Run humann2 with parallel to calculate abundance of UniRef90 gene families and MetaCyc pathways. If you are processing environment data (e.g. soil samples) the vast majority of the reads may not map using this approach. Instead, you can try mapping against the UniRef50 database (which you can point to with the --protein-database option).

mkdir humann2_out

parallel -j 4 'humann2 --threads 1 --input {} --output humann2_out/{/.}' ::: cat_reads/*fastq

3.2 Merge individual sample data together

Join HUMAnN2 output per sample into one table.

mkdir humann2_final_out

humann2_join_tables -s --input humann2_out/ --file_name pathabundance --output humann2_final_out/humann2_pathabundance.tsv
humann2_join_tables -s --input humann2_out/ --file_name pathcoverage --output humann2_final_out/humann2_pathcoverage.tsv
humann2_join_tables -s --input humann2_out/ --file_name genefamilies --output humann2_final_out/humann2_genefamilies.tsv

3.3 Table output normalization

Re-normalize gene family and pathway abundances (so that all samples are in units of copies per million).

humann2_renorm_table --input humann2_final_out/humann2_pathabundance.tsv --units cpm --output humann2_final_out/humann2_pathabundance_cpm.tsv
humann2_renorm_table --input humann2_final_out/humann2_genefamilies.tsv --units cpm --output humann2_final_out/humann2_genefamilies_cpm.tsv

3.4 Separate out taxonomic contributions

Split HUMAnN2 output abundance tables in stratified and unstratified tables (stratified tables include the taxa associated with a functional profile).

humann2_split_stratified_table --input humann2_final_out/humann2_pathabundance_cpm.tsv --output humann2_final_out
humann2_split_stratified_table --input humann2_final_out/humann2_genefamilies_cpm.tsv --output humann2_final_out
humann2_split_stratified_table --input humann2_final_out/humann2_pathcoverage.tsv --output humann2_final_out

3.5 Format STAMP function file

Convert unstratified HUMAnN2 abundance tables to STAMP format by changing header-line. These commands remove the comment character and the spaces in the name of the first column. Trailing descriptions of the abundance datatype are also removed from each sample's column name.

sed 's/_Abundance-RPKs//g' humann2_final_out/humann2_genefamilies_cpm_unstratified.tsv | sed 's/# Gene Family/GeneFamily/' > humann2_final_out/humann2_genefamilies_cpm_unstratified.spf
sed 's/_Abundance//g' humann2_final_out/humann2_pathabundance_cpm_unstratified.tsv | sed 's/# Pathway/Pathway/' > humann2_final_out/humann2_pathabundance_cpm_unstratified.spf
sed 's/_Coverage//g' humann2_final_out/humann2_pathcoverage_unstratified.tsv | sed 's/# Pathway/Pathway/' > humann2_final_out/humann2_pathcoverage_unstratified.spf

3.6 Extract MetaPhlAn2 taxonomic compositions

Since HUMAnN2 also runs MetaPhlAn2 as an initial step, we can use the output tables already created to get the taxonomic composition of our samples. First we need to gather all the output MetaPhlAn2 results per sample into a single directory and then merge them into a single table using MetaPhlAn2's merge_metaphlan_tables.py command. After this file is created we can fix the header so that each column corresponds to a sample name without the trailing "_metaphlan_bugs_list" description. Note that MetaPhlAn2 works best for well-characterized environments, like the human gut, and has low sensitivity in other environments.

mkdir metaphlan2_out
cp humann2_out/*/*/*metaphlan_bugs_list.tsv metaphlan2_out/
/usr/local/metaphlan2/utils/merge_metaphlan_tables.py metaphlan2_out/*metaphlan_bugs_list.tsv > metaphlan2_merged.txt
sed -i 's/_metaphlan_bugs_list//g' metaphlan2_merged.txt

3.7 Format STAMP taxonomy file

Lastly we can convert this MetaPhlAn2 abundance table to STAMP format

metaphlan_to_stamp.pl metaphlan2_merged.txt > metaphlan2_merged.spf