Galaxy Read profiling - quadram-institute-bioscience/gmh-sops GitHub Wiki

Galaxy Read profiling module 2.0

Details

Description: Read_profiling_wf_2.0 is a Galaxy workflow that performs taxonomic profiling with MetaPhlAn2 and Kraken2 and functional profiling with HUMAnN2. It runs on paired-read collections in Galaxy, so the user must first import the reads into Galaxy manually; the simplest route is a direct import from IRIDA.

Contact person: Rebecca Ansorge ([email protected])

Source: Read_profiling_wf_2.0

Input

Paired collection of shotgun fastq metagenome reads

Output

Downloadable zip-folder containing:

kraken2-bracken-sampleX.tsv - re-estimated abundances per sample from kraken2 output
kraken2-report-sampleX.tsv - kraken2 taxonomic report per sample
kraken2-krona-sampleX.html - krona visualization of kraken2 taxonomic output per sample
metaphlan2-merged.tsv - metaphlan2 relative abundances of taxa, with one column per sample
genefamily_abundance_rpk_sample_X - humann2 output of reads-per-kilobase (rpk) of gene families per sample
pathway_abundance_rpk_sample_X - humann2 output of reads-per-kilobase (rpk) of pathways per sample
pathway_coverage_sample_X - humann2 output of pathway coverage per sample
genefamily_abundance_cpm - humann2 output of counts-per-million (compositional) of gene families of all samples (one sample per column)
pathway_abundance_cpm - humann2 output of counts-per-million (compositional) of pathways of all samples (one sample per column)
multiQC-report.html
1. General Statistics
2. Species composition: MetaPhlAn2 and kraken2-bracken species abundance profiles (bar charts)
3. Family composition: MetaPhlAn2 (bar charts)
4. Kingdom composition: MetaPhlAn2 (bar charts)
5. Host removal: number of reads matching to host reference
6. Top 20 most abundant pathways
7. fastp results (read qualities)
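
The relationship between the per-sample rpk tables and the cpm tables listed above can be illustrated with a minimal Python sketch. It assumes that cpm is obtained by rescaling each sample's rpk values so that they sum to one million; the feature names are hypothetical, and the actual HUMAnN2 renormalization step also handles stratified rows and special features, which this sketch ignores.

```python
def rpk_to_cpm(rpk_by_feature):
    """Rescale one sample's RPK values so that they sum to one million."""
    total = sum(rpk_by_feature.values())
    if total == 0:
        # Nothing was detected in this sample; avoid dividing by zero.
        return {feature: 0.0 for feature in rpk_by_feature}
    return {
        feature: value * 1_000_000 / total
        for feature, value in rpk_by_feature.items()
    }


# Hypothetical gene-family RPK values for a single sample.
example_rpk = {"UniRef90_A": 30.0, "UniRef90_B": 10.0, "UNMAPPED": 60.0}
example_cpm = rpk_to_cpm(example_rpk)

for feature, value in example_cpm.items():
    print(f"{feature}\t{value:.1f}")
```

Because every sample is scaled to the same total, cpm values are compositional: they can be compared between samples as proportions, but they no longer reflect differences in sequencing depth.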

Workflow

Workflow description:

Pipeline parameters can be modified by the user if needed. By default the host reference is 'human'; change it to the correct host reference where necessary (see below). In general, the workflow Read_profiling_wf_2.0 performs the following steps.
Read quality trimming using fastp: phred Q20, unqualified percent limit 40%, N base limit 5, minimum read length 15, complexity threshold 30%
Host read removal with Kraken2
1. default database is: human_20200311 (human) - adapt as needed (see below for instructions)
2. note: host reads are excluded from all downstream profiling steps
Taxonomic profiling of filtered reads with Kraken2: confidence threshold 0.5
1. database: kraken2_db_20190111 (other databases available)
2. Abundance re-estimation with Bracken (Bayesian Reestimation of Abundance with KrakEN)
3. Visualization of per-sample profiles as a Krona diagram (can be explored interactively)
Taxonomic profiling of filtered reads with MetaPhlAn2
1. database: ChocoPhlAn
2. Per-sample MetaPhlAn2 outputs are subsequently merged into a single overview table containing all samples, which simplifies downstream analyses
Functional profiling of filtered reads with HUMAnN2
1. nucleotide database: ChocoPhlAn
2. protein database: EC-filtered UniRef90
3. pathway database: MetaCyc
4. translated alignment with: DIAMOND
5. Abundance profiles of gene families and pathways in rpk (reads per kilobase) and cpm (counts per million)
Creating a multiQC report that summarizes the results
1. General Statistics
2. Species composition: MetaPhlAn2 and kraken2-bracken species abundance profiles (bar charts)
3. Family composition: MetaPhlAn2 (bar charts)
4. Kingdom composition: MetaPhlAn2 (bar charts)
5. Host removal: number of reads matching to host reference
6. Top 20 most abundant pathways
7. fastp results (read qualities)

Merge all outputs into downloadable zip file
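
To make the read-filtering thresholds of the first step more concrete, the following Python sketch applies the same limits (phred Q20, 40% unqualified bases, 5 N bases, minimum length 15, 30% complexity) to a single read. It is an approximation for illustration only, not fastp's implementation: the function name is hypothetical, qualities are assumed to be already decoded Phred scores, complexity is assumed to be the percentage of bases that differ from the next base, and trimming and adapter handling are not shown.

```python
def passes_filters(sequence, qualities,
                   qualified_phred=20,
                   unqualified_percent_limit=40,
                   n_base_limit=5,
                   min_length=15,
                   complexity_threshold=30):
    """Return True if a read stays within all of the thresholds listed above."""
    length = len(sequence)

    # Length filter: reads shorter than the minimum are discarded.
    if length < min_length:
        return False

    # N filter: too many ambiguous bases.
    if sequence.upper().count("N") > n_base_limit:
        return False

    # Quality filter: too large a share of bases below the qualified Phred score.
    unqualified = sum(1 for q in qualities if q < qualified_phred)
    if unqualified * 100 > unqualified_percent_limit * length:
        return False

    # Complexity filter: percentage of bases that differ from the next base.
    changes = sum(1 for a, b in zip(sequence, sequence[1:]) if a != b)
    complexity = changes * 100 / max(length - 1, 1)
    return complexity >= complexity_threshold


good_read = "ACGTACGTACGTACGTACGT"
print(passes_filters(good_read, [30] * 20))   # well within every threshold
print(passes_filters("A" * 20, [30] * 20))    # homopolymer, zero complexity
```

A read has to pass every check to be kept, so the order of the checks does not change the outcome, only which filter rejects the read first.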

Usage:

Get workflow
- Download workflow
- Import workflow into your own Galaxy environment

Import data
- from IRIDA: the reads are already imported in the correct format (paired collection)
- from another source of your choice
Modify the settings if needed
Modify host reference if needed

Run workflow on imported read collection
Explore results within Galaxy or download zip-folder containing all relevant output files
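
After downloading, the contents of the zip-folder can be sorted by output type before further analysis. The sketch below is one possible way to do this in Python; it assumes that the files inside the archive carry the name prefixes listed in the Output section, and the function names and group labels are hypothetical.

```python
import zipfile
from collections import defaultdict

# File-name prefixes taken from the Output section; the group labels are made up.
PREFIXES = {
    "kraken2-bracken-": "bracken",
    "kraken2-report-": "kraken2_report",
    "kraken2-krona-": "krona",
    "metaphlan2-merged": "metaphlan2",
    "genefamily_abundance_": "humann2_genefamilies",
    "pathway_abundance_": "humann2_pathways",
    "pathway_coverage_": "humann2_coverage",
    "multiQC-report": "multiqc",
}


def group_outputs(names):
    """Group archive member names by output type, based on their prefix."""
    groups = defaultdict(list)
    for name in names:
        base = name.rsplit("/", 1)[-1]
        if not base:
            continue  # skip directory entries such as "results/"
        for prefix, label in PREFIXES.items():
            if base.startswith(prefix):
                groups[label].append(name)
                break
        else:
            groups["other"].append(name)
    return dict(groups)


def group_zip(path):
    """Apply group_outputs to the members of a downloaded zip archive."""
    with zipfile.ZipFile(path) as archive:
        return group_outputs(archive.namelist())


# Example with member names as they might appear for one sample.
example = group_outputs([
    "kraken2-bracken-sampleA.tsv",
    "kraken2-report-sampleA.tsv",
    "metaphlan2-merged.tsv",
    "pathway_abundance_cpm",
    "multiQC-report.html",
])
for label, members in example.items():
    print(label, members)
```

For a real download, `group_zip` would be called with the path of the archive saved from Galaxy.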
Example of history when running workflow on a collection of 2 samples

Example of multiQC report

Tool citations

Galaxy: https://galaxyproject.org/citing-galaxy/

Abubucker, Sahar and Segata, Nicola and Goll, Johannes and Schubert, Alyxandria M. and Izard, Jacques and Cantarel, Brandi L. and Rodriguez-Mueller, Beltran and Zucker, Jeremy and Thiagarajan, Mathangi and Henrissat, Bernard et al. (2012). Metabolic Reconstruction for Metagenomic Data and Its Application to the Human Microbiome. In PLoS Computational Biology, 8 (6), pp. e1002358. [doi:10.1371/journal.pcbi.1002358]

Ondov, Brian D and Bergman, Nicholas H and Phillippy, Adam M (2011). Interactive metagenomic visualization in a Web browser. In BMC Bioinformatics, 12 (1). [doi:10.1186/1471-2105-12-385]

Chen, Shifu and Zhou, Yanqing and Chen, Yaru and Gu, Jia (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. [doi:10.1101/274100]

Ewels, Philip and Magnusson, Måns and Lundin, Sverker and Käller, Max (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. In Bioinformatics, 32 (19), pp. 3047–3048. [doi:10.1093/bioinformatics/btw354]

Truong, Duy Tin and Franzosa, Eric A and Tickle, Timothy L and Scholz, Matthias and Weingart, George and Pasolli, Edoardo and Tett, Adrian and Huttenhower, Curtis and Segata, Nicola (2015). MetaPhlAn2 for enhanced metagenomic taxonomic profiling. In Nature Methods, 12 (10), pp. 902–903. [doi:10.1038/nmeth.3589]

Lu, Jennifer and Breitwieser, Florian P. and Thielen, Peter and Salzberg, Steven L. (2017). Bracken: estimating species abundance in metagenomics data. In PeerJ Computer Science, 3, pp. e104. [doi:10.7717/peerj-cs.104]

Wood, Derrick E and Salzberg, Steven L (2014). Kraken: ultrafast metagenomic sequence classification using exact alignments. In Genome Biology, 15 (3), pp. R46. [doi:10.1186/gb-2014-15-3-r46]

Blankenberg, D. and Gordon, A. and Von Kuster, G. and Coraor, N. and Taylor, J. and Nekrutenko, A. (2010). Manipulation of FASTQ data with Galaxy. In Bioinformatics, 26 (14), pp. 1783–1785. [doi:10.1093/bioinformatics/btq281]

Tools and their versions

tool; galaxy tool version; tool version

fastp; 0.19.5+galaxy1; 0.19.5
Kraken2; 2.1.1+galaxy0; 2.1.1
MultiQC; 1.7.1; 1.7
MetaPhlAn2; 2.6.0.0; 2.6.0
Merge (MetaPhlAn2); 2.6.0.0; 2.6.0
HUMAnN2; 0.11.1.0; 0.11.2
Renormalize (HUMAnN2); 0.11.1.0; 0.11.1
Join (HUMAnN2); 0.11.1.0; 0.11.1
Bracken; 2.2; 2.2
krona; 2.7.1; 2.7.1
FASTQ interlacer; 1.2.0.1; galaxy_internal
Extract element identifiers; 0.0.2; galaxy_internal
Build List; 1.0.0; galaxy_internal
Flatten Collection; 1.0.0; galaxy_internal
Replace Text; 1.1.2; galaxy_internal
Relabel List Identifiers; 1.0.0; galaxy_internal
Merge Collections; 1.0.0; galaxy_internal
Bundle Collection; 1.0.2; galaxy_internal