Metagenomics Standard Operating Procedure v3 - LangilleLab/microbiome_helper GitHub Wiki

(Current draft for Langille Lab Vulcan server!)

Note that this workflow is continually being updated. If you want to use the commands below, be sure to keep a local copy of the version you ran.

Last updated: 7 Jun 2021 (see revisions above for earlier minor versions, and the earlier Metagenomics SOPs for earlier major versions)


Important points to keep in mind:

  • Make sure you have all the tools, scripts, databases and accessory files required for this pipeline installed - the list is here.
  • The below options are not necessarily best for your data; it is important to explore different options, especially at the read QC stage and those surrounding Kraken2/Bracken + MMseqs2 database choices.
  • If you are confused about a particular step be sure to check out the pages under Metagenomic Resources on the right side-bar.
  • You should check that only the FASTQs of interest are being specified at each step (e.g. with the ls command). If you are re-running commands multiple times or using optional steps the input filenames and/or folders could differ.
  • The tool parallel comes up several times in this workflow. Be sure to check out our tutorial on this tool here.
  • You can always run parallel with the --dry-run option to print the commands that would be run, to double-check they are doing what you want (see the sketch below)!
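For example, a minimal dry-run sketch (the gzip command and raw_data path are purely illustrative):

# Print each command parallel would run, without executing anything
parallel --dry-run 'gzip {}' ::: raw_data/*.fastq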

This workflow starts with raw paired-end MiSeq or NextSeq data in demultiplexed FASTQ format assumed to be located within a folder called raw_data.

1. First Steps

1.1 Join lanes together (only some Illumina)

Concatenate multiple lanes of sequencing together (e.g. for NextSeq550 data; NextSeq2000 data does not need this). If you do NOT run this step, remember to change cat_lanes to raw_data in the commands below, which assume you are working with the most common type of metagenome data.

concat_lanes.pl raw_data/* -o cat_lanes -p 4

1.2 Inspect read quality

Run FastQC to allow manual inspection of the quality of the sequences.

mkdir fastqc_out
fastqc -t 4 cat_lanes/* -o fastqc_out/

1.3 (Optional) Stitch F+R reads

Stitch paired-end reads together with PEAR (a summary of the stitching results is written to "pear_summary_log.txt"). Note: it is important to check the % of reads assembled. If the assembly % is too low, it may be better to concatenate the forward and reverse reads together instead (see step 2.2).

run_pear.pl -p 4 -o stitched_reads cat_lanes/*

If you don't stitch your reads together at this step you will need to unzip your FASTQ files before continuing with the below commands.
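A minimal way to do this, assuming your FASTQs are gzipped with a .fastq.gz extension:

# Decompress all FASTQs in place before the steps below
gunzip cat_lanes/*.fastq.gz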

2. Read Quality-Control and Contaminant Screens

2.1 Running KneadData

Use KneadData to run pre-processing tools. First, Trimmomatic is run to remove low-quality sequences, and then Bowtie2 is run to screen out contaminant sequences; below we are screening out reads that map to the human or PhiX genomes. Note that KneadData is run below on all unstitched FASTQ pairs with parallel (see our quick tutorial on this tool here; for a detailed breakdown of the options in the below command see this page). The forward and reverse reads will be specified by "_1" and "_2" in the output files - ignore the "R1" in each filename. Note that the \ characters at the end of each line simply split the command over multiple lines to make it easier to read.

parallel -j 1 --link 'kneaddata -i {1} -i {2} -o kneaddata_out/ \
    -db /home/shared/bowtiedb/GRCh38_PhiX --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
    -t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
    --bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output' \
    ::: cat_lanes/*_R1.fastq ::: cat_lanes/*_R2.fastq

Optional: If you stitched the F+R reads together above, then use the following version.

parallel -j 1 --link 'kneaddata -i {} -o kneaddata_out/ \
    -db /home/shared/bowtiedb/GRCh38_PhiX --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
    -t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
    --bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output' \
    ::: stitched_reads/*.assembled.fastq

Clean up the output directory (this helps downstream commands) by moving the discarded sequences into subfolders:

mkdir kneaddata_out/contam_seq
mkdir kneaddata_out/unmatched_seq
mv kneaddata_out/*_contam*.fastq kneaddata_out/contam_seq
mv kneaddata_out/*_unmatched*.fastq kneaddata_out/unmatched_seq

You can produce a logfile summarizing the KneadData output with this command:

kneaddata_read_count_table --input kneaddata_out --output kneaddata_read_counts.txt
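The resulting logfile is a simple tab-delimited table; one quick way to inspect it on the command line (a minimal sketch using standard coreutils):

# View the read-count table with aligned columns and no line wrapping
column -t kneaddata_read_counts.txt | less -S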

2.2 Concatenate unstitched output (omit if data stitched above)

If you did not stitch your paired-end reads together with PEAR, you can concatenate the forward and reverse FASTQs together now. Note this is done after quality filtering so that both reads in a pair are either discarded or retained. It's important to specify only the "paired" FASTQs output by KneadData in the below command.

concat_paired_end.pl -p 4 --no_R_match -o cat_reads kneaddata_out/*_paired_*.fastq 

You should check over the commands that are printed to screen to make sure the correct FASTQs are being concatenated together.
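A quick sanity check afterwards might look like this (a sketch; the exact filenames depend on your samples):

# Confirm there is one concatenated FASTQ per sample and compare file sizes/read counts
ls -lh cat_reads/
wc -l cat_reads/*.fastq   # divide each count by 4 to get the number of reads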

3. Determine Taxonomy with Kraken2+Bracken

3.1 Use Kraken2 to obtain "raw" taxonomy profiles

Use Kraken2 with memory-/time-saving parameters to create the taxonomic profiles, using the NCBI RefSeq Complete database for identification. Kraken2 assigns taxonomy to all reads (vs. marker-gene approaches such as MetaPhlAn), and we use that information both here and later when constructing the stratified functional output. Note that here we are using a very large (~1.2 TB) reference database that will be prohibitive for most users (since it requires >1.2 TB of RAM). It can be substituted with one of the smaller pre-compiled Kraken2 databases, or one of the other databases that we have provided on Dropbox here, or you can follow our instructions for creating your own custom database (link coming soon). You can read further discussion on choosing the best database you can given your computational constraints in our preprint.

First, copy the Kraken2 database to the ramdisk mounted in your scratch location (on the SSD drive) to place the database directly in RAM. This speeds up processing enormously, since the database otherwise has to be read from disk for each sample:

cp -r /home/shared/RefSeqV214_5Oct2022_AllBacArcVir_CompEuk_KrakOther/ /scratch/ramdisk/
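If you want to confirm there is enough room on the ramdisk before (or after) copying, you can compare the database size against the free space (a minimal check; adjust the paths if yours differ):

# Database size vs. free space on the ramdisk
du -sh /home/shared/RefSeqV214_5Oct2022_AllBacArcVir_CompEuk_KrakOther/
df -h /scratch/ramdisk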

Second, create the output directories:

mkdir kraken2_outraw
mkdir kraken2_kreport

Then run Kraken2 with the --memory-mapping mode, which looks for the database in memory instead of loading it from disk. Note that, aside from the underlying database used, the most important parameter determining the quality of your results is the confidence threshold; here we are using a value of 0.5. There are several considerations when choosing a confidence threshold, such as whether you are more concerned with not missing any taxa (no false negatives) or with not identifying taxa that weren't actually there (no false positives); the best threshold also depends on which database you are using. We provide guidance on choosing an appropriate threshold in our preprint. We recommend initially choosing a representative subset of your samples and running them with a few different confidence thresholds - this will show you at what point the proportion of reads classified really starts to drop off, as well as whether there is a point at which you no longer see taxa that you know to be in your samples.

parallel -j 1 'kraken2 --db /scratch/ramdisk/RefSeqV214_5Oct2022_AllBacArcVir_CompEuk_KrakOther/ \
    --threads 4 --output kraken2_outraw/{/.}.kraken.txt --report kraken2_kreport/{/.}.kreport \
    --use-names --confidence 0.5 --memory-mapping {}' \
    ::: cat_reads/*.fastq
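As suggested above, before committing to --confidence 0.5 you can run one representative sample at several thresholds and compare the proportion of reads classified (a hedged sketch; SampleX and the kraken2_conf_test folder are illustrative names):

mkdir kraken2_conf_test

# Run one representative sample at several confidence thresholds
for conf in 0 0.1 0.25 0.5 0.75; do
    kraken2 --db /scratch/ramdisk/RefSeqV214_5Oct2022_AllBacArcVir_CompEuk_KrakOther/ \
        --threads 4 --use-names --memory-mapping --confidence ${conf} \
        --output kraken2_conf_test/SampleX.conf${conf}.kraken.txt \
        --report kraken2_conf_test/SampleX.conf${conf}.kreport \
        cat_reads/SampleX.fastq
done

# The first two lines of each report give the unclassified and classified (root) percentages
head -n 2 kraken2_conf_test/*.kreport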

3.2 Correct "raw" Kraken2 taxonomy output with Bracken

The raw numbers output by Kraken2 should always be corrected afterwards with Bracken. Below are examples for running the genus- and phylum-level corrections, but you can go through all levels by changing the -l flag (remember to change the output file extension, such as ".genus.bracken", each time you change levels or you will overwrite previous files). Important: Bracken creates a temporary copy of each KReport in the same directory with a "Bracken" label appended (i.e. a *_bracken.kreport file for each sample). You must delete those files between levels, or the below commands will produce extra/incorrect collated Bracken files (since each run works off all .kreport files in the folder). Also note that we require at least 10 reads to have mapped to any specific taxon (-t 10) in order to filter out some of the false-positive noise (the Bracken-recommended default).

mkdir bracken_out

parallel -j 4 'bracken -d /scratch/ramdisk/RefSeqV214_5Oct2022_AllBacArcVir_CompEuk_KrakOther/ \
    -i {} -o bracken_out/{/.}.genus.bracken \
    -r 150 -l G -t 10' \
    ::: kraken2_kreport/*.kreport
rm kraken2_kreport/*_bracken.kreport

parallel -j 4 'bracken -d /scratch/ramdisk/RefSeqV214_5Oct2022_AllBacArcVir_CompEuk_KrakOther/ \
    -i {} -o bracken_out/{/.}.phylum.bracken \
    -r 150 -l P -t 10' \
    ::: kraken2_kreport/*.kreport
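Remember to delete the intermediate *_bracken.kreport files after this level as well (rm kraken2_kreport/*_bracken.kreport). If you want to run every level in one go, a hedged sketch of a loop is below (the level codes and output naming are illustrative; adapt them to the levels you actually need):

# Loop over taxonomic levels, deleting Bracken's intermediate *_bracken.kreport
# files between levels so the *.kreport glob stays correct
for lvl in D P C O F G S; do
    parallel -j 4 "bracken -d /scratch/ramdisk/RefSeqV214_5Oct2022_AllBacArcVir_CompEuk_KrakOther/ \
        -i {} -o bracken_out/{/.}.${lvl}.bracken \
        -r 150 -l ${lvl} -t 10" \
        ::: kraken2_kreport/*.kreport
    rm kraken2_kreport/*_bracken.kreport
done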

3.3 Collate the Bracken results

We can now use an additional Bracken script (found in the analysis_scripts folder of your Bracken install) to collate the result files (one per sample) into final summary files (one per level). Below are again examples for just the genus- and phylum-level files.

mkdir bracken_out_merged

combine_bracken_outputs.py --files bracken_out/*.genus.bracken -o bracken_out_merged/merged_output.genus.bracken
combine_bracken_outputs.py --files bracken_out/*.phylum.bracken -o bracken_out_merged/merged_output.phylum.bracken

Note that if you are using Python 3.x.x then you may get an error here like TypeError: 'dict_keys' object is not subscriptable. If so, you can simply edit two lines of combine_bracken_outputs.py as follows.

Line 113, change:

sys.exit("Taxonomy IDs not matching for species %s: (%s\t%s)" % (name, taxid, sample_counts[name].keys()[0]))

to:

sys.exit("Taxonomy IDs not matching for species %s: (%s\t%s)" % (name, taxid, list(sample_counts[name].keys())[0]))

Line 133, change:

taxid = sample_counts[name].keys()[0]

to:

taxid = list(sample_counts[name].keys())[0]

4. Determine Function with MMseqs2

We will now use the MMseqs2 program to map reads to the UniRef90 protein database (other options are, or will eventually be, RefSeqComplete and COG). This is accomplished in multiple steps below: jobfiles are generated to run all samples, and top hits are then selected so that taxa and function can later be combined.

4.1 Generate jobfiles (executables) containing parameters to run MMseqs2

Running this script generates a jobfile corresponding to each sample in the input folder, with set parameters for running MMseqs2. It assumes the default processing pipeline above, in which the forward and reverse reads were concatenated together. If you instead stitched (assembled) your reads above, you need to change cat_reads/*.fastq to kneaddata_out/*.fastq. Note that the -p 4 option below sets the number of processors that will be used when running each job/sample, so be conscious of your total available resources and how many jobfiles you want to run at the same time (see discussion/options below).

/usr/bin/perl /home/dhwani/MyGit/MH2_test/run_TaxonomyFunctionSearchMegan.pl \
    --mapmethod mmseqs --db /home/dhwani/databases/mmseqsUniref90DB -p 4 \
    -o mmseqs_U90_out cat_reads/*.fastq

4.2 Execute the jobfiles to generate the functional predictions

Here you have two options, depending on the amount of computing resources available to you (processors+RAM) and the total number of samples to be run.

Option 1: Run each jobfile individually, so that one sample is processed at a time using a high number of CPUs. In this case, you'd want to increase the CPU flag in the jobfile-generation step above to something approaching your resource limit (such as -p 40 on a 48-core machine). This way each sample is classified as quickly as possible, and it is the option I prefer - it allows for quick testing to make sure everything is working and will give you a good idea of how long your entire project will take to process given the size of the functional database you are using. If using this option, you would simply run a jobfile as follows (changing the SampleID and date information for each file):

./mmseqs-SAMPLEID-YEAR-MM-DD-HH-MM-SS-jobfile.sh

Option 2: Run all jobfiles simultaneously, so that all samples are processed at the same time using a low number of CPUs each. In this case, you'd keep the CPU flag in the jobfile-generation step quite low, so that when it is multiplied by the total number of samples you remain within your resource limits (such as -p 4 with 10 samples on a 48-core machine). If using this option, first make a text file containing the list of all the jobfiles generated in the above step, then execute the file-of-files using another script:

ls -1 *YEAR-MM-DD*jobfile.sh > jobfiles.list
bash /home/dhwani/MyGit/MH2_test/run-mmseqs-jobfiles.sh

Using either option, I prefer to then clean up the root directory by moving the now-used jobfiles to a separate folder:

mkdir mmseqs_jobfiles
mv *jobfile.sh mmseqs_jobfiles/

Finally, clean up the intermediate files in the mmseqs_U90_out folder so that only the m8 results files are kept; here we move all the m8 files to a new folder:

mkdir mmseqs_m8_files
mv mmseqs_U90_out/*.m8 mmseqs_m8_files/

4.3 Pick the top functional hit for each read

One of the parameters set in the jobfiles above caps the number of hits per read (--max-seqs) at 25, in order to be comparable to default DIAMOND output. The multiple hits are also useful if you plan to use MEGAN or some other "consensus method" for assigning taxonomy to the reads, instead of the Kraken2 assignments we used here. For our purposes, we keep only the top functional hit/assignment for each read, which we do with the following script:

mkdir mmseqs_U90_out_tophit
python /home/dhwani/MyGit/MH2_test/pick_uniref_top_hit.py --unirefm8Dir mmseqs_m8_files/ --output_path mmseqs_U90_out_tophit

5. Create Final Normalized Taxa+Function Tables

Now that we have the taxonomic AND functional assignments for the reads, we can combine them and convert the counts into final normalized RPKM (Reads Per Kilobase per Million) tables by running the parse_TaxonomyFunction.py script in several different ways, after preparing a master file-of-files.

5.1 Prepare a master input file

Prepare a tab-delimited input file specifying the corresponding taxonomy assignment file, functional assignment file, etc. for each sample ID. This file should have the following format:

Sample1  kraken2_outraw/Sample1.kraken.txt  kraken2  mmseqs_U90_out_tophit/mmseqs-Sample1-s1.m8-parsed.txt  uniref  mmseqs_m8_files/mmseqs-Sample1-s1.m8
Sample2  kraken2_outraw/Sample2.kraken.txt  kraken2  mmseqs_U90_out_tophit/mmseqs-Sample2-s1.m8-parsed.txt  uniref  mmseqs_m8_files/mmseqs-Sample2-s1.m8

Here, column 1 is the SampleID, column 2 is the path to the Kraken2 classification file, column 3 is the type of taxonomic run (kraken2 or MEGAN), column 4 is the path to the functional annotation file, column 5 is the type of functional database used (refseq, uniref or COG), and the last column is the raw output of the MMseqs2 mapping (m8 BLAST-format file). Let's call this file multi-sample-outfiles-w-m8.txt.
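Rather than writing this file by hand, you could generate it from a list of sample IDs (a hedged sketch; sample_ids.txt is an assumed file with one sample ID per line, and the filename patterns are taken from the example rows above - adjust them to match your actual outputs):

# Build multi-sample-outfiles-w-m8.txt from a list of sample IDs
while read -r s; do
    printf '%s\t%s\tkraken2\t%s\tuniref\t%s\n' \
        "$s" \
        "kraken2_outraw/${s}.kraken.txt" \
        "mmseqs_U90_out_tophit/mmseqs-${s}-s1.m8-parsed.txt" \
        "mmseqs_m8_files/mmseqs-${s}-s1.m8"
done < sample_ids.txt > multi-sample-outfiles-w-m8.txt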

5.2 Generate the final normalized tables (multiple options)

With this above file as input, we can now generate all the different versions of the final normalized tables, depending on the level of detail required.

To generate the unstratified RPKM table (ie: no taxonomy):

python /home/dhwani/MyGit/MH2_test/parse_TaxonomyFunction.py --multisample multi-sample-outfiles-w-m8.txt --outputf PROJECT-unstrat-matrix-RPKM.txt --unstratified Y

To generate the fully stratified RPKM table (ie: functions with breakdown of each taxa within them):

python /home/dhwani/MyGit/MH2_test/parse_TaxonomyFunction.py --multisample multi-sample-outfiles-w-m8.txt --outputf PROJECT-strat-matrix-RPKM.txt --stratified Y

To generate the unstratified RPKM table with functions mapped to EC numbers:

python /home/dhwani/MyGit/MH2_test/parse_TaxonomyFunction.py --multisample multi-sample-outfiles-w-m8.txt --outputf PROJECT-unstrat-matrix-RPKM-withEC.txt --unstratified Y --map2EC Y

To generate the stratified RPKM table with functions mapped to EC numbers:

python /home/dhwani/MyGit/MH2_test/parse_TaxonomyFunction.py --multisample multi-sample-outfiles-w-m8.txt --outputf PROJECT-strat-matrix-RPKM-withEC.txt --stratified Y --map2EC Y

5.3 Append functional descriptions

We can use one of the existing PICRUSt2.0 scripts to append the EC descriptions of the functions to the tables:

add_descriptions.py -i PROJECT-unstrat-matrix-RPKM-withEC.txt -m EC -o PROJECT-unstrat-matrix-RPKM-withEC-withlabels.txt

In order for this script to work on the stratified table, you first need to split the first column of the stratified matrix (e.g. "EC:6.3.2.6|Terriglobus roseus DSM 18391 (taxid 926566)") into two separate function and taxonomy columns (for example in Excel, using the | delimiter). Save the resulting split table as PROJECT-strat-matrix-RPKM-withEC-split-for-picrust2.txt, then run:

add_descriptions.py -i PROJECT-strat-matrix-RPKM-withEC-split-for-picrust2.txt -m EC -o PROJECT-strat-matrix-RPKM-withEC-withlabels.txt
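If you'd rather do the split on the command line than in Excel, something like the following awk one-liner should work (a sketch; it assumes every row's first column contains a single | separator, so check the header line afterwards):

# Split column 1 on "|" into separate function and taxonomy columns
awk 'BEGIN{FS=OFS="\t"} {split($1, a, "|"); $1 = a[1] OFS a[2]} 1' \
    PROJECT-strat-matrix-RPKM-withEC.txt > PROJECT-strat-matrix-RPKM-withEC-split-for-picrust2.txt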

5.4 Generate pathway information

We can use a second PICRUSt2.0 script to calculate the MetaCyc/BioCyc pathway abundances and coverages from the unstratified table containing the EC numbers:

pathway_pipeline.py -i PROJECT-unstrat-matrix-RPKM-withEC.txt -o pathways_out --intermediate minpath_working -p 4

5.5 Format taxa and/or functional files for use in downstream visualization

Under development! We are currently working on various analysis/visualization workflows based on BURRITO, Metacoder and our own new visualization tool, which is under development due to the lack of a comprehensive tool able to handle stratified output tables.

We are currently developing an R shiny app for visualizing the stratified output. The beta version of the app is available as a github gist at https://gist.github.com/dhwanidesai/943ff5fdbd94815cc27f302d9f56ff0b

The following section describes the steps required to convert the stratified RPKM output generated in the above steps into input for the Visualizer app:

Example: stratified EC number RPKM visualization


The following is a sample of the required 4-column input file:


Sample	Genus	Gene	Contribution
31D	p__Chordata;c__Mammalia;o__NA;f__NA;g__NA;s__NA	EC:5.6.1.7	0.8095988800249774
31D	p__Chordata;c__Mammalia;o__NA;f__NA;g__NA;s__NA	EC:2.4.1.131	1.2837072456944525
31D	p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__NA	EC:4.1.1.31	16.230874609024266
31D	p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus cristatus	EC:4.1.1.31	5.8013837528095165
31D	p__Actinobacteria;c__Actinobacteria;o__Corynebacteriales;f__Corynebacteriaceae;g__Corynebacterium;s__Corynebacterium matruchotii	EC:4.1.1.31	7.240397583735735
31D	p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus sp. 2_1_36FAA	EC:4.1.1.31	6.115064692694273
31D	p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus sanguinis	EC:4.1.1.31	9.928303115746663
31D	p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus gordonii	EC:4.1.1.31	0.3053339527728727
31D	p__Actinobacteria;c__Actinobacteria;o__Corynebacteriales;f__Corynebacteriaceae;g__Corynebacterium;s__Corynebacterium durum	EC:4.1.1.31	0.4697140517147269

The following is a sample of the metadata file format required as input by the Visualizer.

NOTE: you can have multiple columns or categories (tab-delimited) of the samples in the metadata file; however, the first column must be labeled "sample_id" (REQUIRED!).


sample_id	type
31D	A
32Da	A
34D	A
35Da	A
36Da	A
C16	B
C18	B
C19	B
C22	B

  • Convert stratified output to 4-column format

For this, we first need to convert the Kraken2 output files for each of the metagenomes into an abundance (counts) matrix. This file will also be useful for the Metacoder visualization (https://rpubs.com/dhwanikdesai/736732).

First, we need to get a list of the sample tags and the corresponding Kraken2 output files (such as in the example below):

31D	kraken2_outraw_0.05/31D.kraken.txt
32Da	kraken2_outraw_0.05/32Da.kraken.txt
34D	kraken2_outraw_0.05/34D.kraken.txt
35Da	kraken2_outraw_0.05/35Da.kraken.txt
36Da	kraken2_outraw_0.05/36Da.kraken.txt
C16	kraken2_outraw_0.05/C16.kraken.txt
C18	kraken2_outraw_0.05/C18.kraken.txt
C19	kraken2_outraw_0.05/C19.kraken.txt
C22	kraken2_outraw_0.05/C22.kraken.txt
C23	kraken2_outraw_0.05/C23.kraken.txt

This can be done by cutting the sample_id column from the metadata file.

Next, make a list of the Kraken2 output files for each sample (generated in step 3.1), and then paste the two together:

cut -f 1 <project>_metadata.txt > sample_tags.txt

ls -1 kraken2_outraw/* > kraken2_results.txt

paste sample_tags.txt kraken2_results.txt > input_for_taxonomy_abundance_script.txt
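One thing to watch: the metadata file includes a header line ("sample_id") while the ls output does not, so the two pasted columns can end up off by one. A hedged check/fix (assuming a single header line):

# Drop the header line before cutting the sample IDs, then eyeball the pairing
tail -n +2 <project>_metadata.txt | cut -f 1 > sample_tags.txt
paste sample_tags.txt kraken2_results.txt | head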

Once this input file is ready, run the add_6levelsTaxonomy_to_Kraken2Results.py script (you will need the ete3 package installed):

python /home/dhwani/MyGit/MH2_test/add_6levelsTaxonomy_to_Kraken2Results.py --taxafilelist input_for_taxonomy_abundance_script.txt --outfile PROJECT-kraken2_abundance_matrix-6level-tax.txt

Finally, run the following script to convert the stratified RPKM output file generated in step 5.2 to the 4-column format required by the Visualizer app:

python /home/dhwani/MyGit/MH2_test/convert_stratifiedRpkm_to_SankeyFormat.py --StratFileName PROJECT-strat-matrix-RPKM.txt --taxaAbundFile PROJECT-kraken2_abundance_matrix-6level-tax.txt --outfile PROJECT-rpkm-stratified-SankeyFormat.txt

To run the app you will need a local install of RStudio (R version 4) with the shiny package. Once RStudio is installed, open it and run the following commands:

library(shiny)
runGist("943ff5fdbd94815cc27f302d9f56ff0b")

This should bring up the app interface. Here you can upload the stratified RPKM file (in the 4-column format we generated above) and the metadata file.

The app has options for collapsing the RPKM table at the desired level of taxonomy and the desired RPKM threshold for visualization.

Clicking on the Sample group nodes will highlight the linkages to the taxonomy nodes and the corresponding function nodes. Hovering over nodes or links will display the underlying RPKM value.

You can also select the format to download the resulting plot (png, pdf or jpeg).
