Galaxy Read profiling - quadram-institute-bioscience/gmh-sops GitHub Wiki

Galaxy Read profiling module 2.0

Details

Description: Read_profiling_wf_2.0 is a Galaxy workflow that performs taxonomic profiling with MetaPhlAn2 and Kraken2 and functional profiling with HUMAnN2. It runs on paired-read collections in Galaxy, so the user first needs to import the reads into Galaxy manually; the easiest route is a direct import from IRIDA.

Contact person: Rebecca Ansorge ([email protected])

Source: Read_profiling_wf_2.0

Input

Paired collection of shotgun fastq metagenome reads

Output

Downloadable zip-folder containing:

  • kraken2-bracken-sampleX.tsv - re-estimated abundances per sample from kraken2 output
  • kraken2-report-sampleX.tsv - kraken2 taxonomic report per sample
  • kraken2-krona-sampleX.html - krona visualization of kraken2 taxonomic output per sample
  • metaphlan2-merged.tsv - metaphlan2 relative abundances of taxa containing one column per sample
  • genefamily_abundance_rpk_sample_X - humann2 output of reads-per-kilobase (rpk) of gene families per sample
  • pathway_abundance_rpk_sample_X - humann2 output of reads-per-kilobase (rpk) of pathways per sample
  • pathway_coverage_sample_X - humann2 output of coverage of pathway per sample
  • genefamily_abundance_cpm - humann2 output of counts-per-million (compositional) of gene families of all samples (one sample per column)
  • pathway_abundance_cpm - humann2 output of counts-per-million (compositional) of pathways of all samples (one sample per column)
  • multiQC-report.html
    1. General Statistics
    2. Species composition: MetaPhlAn2 and kraken2-bracken species abundance profiles (bar charts)
    3. Family composition: MetaPhlAn2 (bar charts)
    4. Kingdom composition: MetaPhlAn2 (bar charts)
    5. Host removal: number of reads matching to host reference
    6. Top 20 most abundant pathways
    7. Fastp results (read qualities)
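
The relationship between the rpk and cpm files listed above can be illustrated with a small sketch. It assumes the simplest form of the renormalization, in which every value of a sample is scaled so that the sample sums to one million. The feature names are made up for illustration, and the HUMAnN2 Renormalize tool has further options (for example how unmapped reads and stratified rows are treated) that are not modelled here.

```python
def rpk_to_cpm(rpk_by_feature):
    """Scale the RPK values of one sample so that they sum to one million."""
    total = sum(rpk_by_feature.values())
    if total == 0:
        # An empty or all-zero sample has no meaningful proportions.
        return {feature: 0.0 for feature in rpk_by_feature}
    return {
        feature: value * 1_000_000 / total
        for feature, value in rpk_by_feature.items()
    }


# Hypothetical RPK values for a single sample (names are illustrative only).
sample_rpk = {"UniRef90_A": 30.0, "UniRef90_B": 10.0, "UNMAPPED": 60.0}
sample_cpm = rpk_to_cpm(sample_rpk)

for feature, cpm in sample_cpm.items():
    print(f"{feature}\t{cpm:.1f}")
```

Because cpm values are proportions of the sample total, they are comparable between samples of different sequencing depth, whereas the rpk values are not.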

Workflow

Workflow description:

  1. Pipeline parameters can be modified by the user if needed. By default the host reference is 'human'; change this to the correct host reference where necessary (see below). In general, the workflow Read_profiling_wf_2.0 performs the following steps.

  2. Read quality trimming using fastp: phred Q20, unqualified percent limit 40%, N base limit 5, minimum read length 15, complexity threshold 30%

  3. Host read removal with Kraken2

    1. default database: human_20200311 (human); adapt as needed (see below for instructions)
    2. note: host reads are excluded from all downstream profiling
  4. Taxonomic profiling of filtered reads with Kraken2: confidence threshold 0.5

    1. database: kraken2_db_20190111 (other databases available)
    2. Bayesian re-estimation of abundance from the Kraken2 output with Bracken
    3. Visualization of per-sample profiles as a Krona diagram (can be explored interactively)
  5. Taxonomic profiling of filtered reads with MetaPhlAn2

    1. database: ChocoPhlAn
    2. Per-sample MetaPhlAn2 outputs are subsequently merged into a single overview table containing all samples, which simplifies downstream analyses
  6. Functional profiling of filtered reads with HUMAnN2

    1. nucleotide database: ChocoPhlAn
    2. protein database: EC-filtered UniRef90
    3. pathway database: MetaCyc
    4. translated alignment with: DIAMOND
    5. Abundance profiles of gene families and pathways in rpk (reads per kilobase) and cpm (counts per million)
  7. Creation of a MultiQC report that summarizes the results

    1. General Statistics
    2. Species composition: MetaPhlAn2 and kraken2-bracken species abundance profiles (bar charts)
    3. Family composition: MetaPhlAn2 (bar charts)
    4. Kingdom composition: MetaPhlAn2 (bar charts)
    5. Host removal: number of reads matching to host reference
    6. Top 20 most abundant pathways
    7. fastp results (read qualities)

  8. Merging of all outputs into a downloadable zip file
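
To make the fastp thresholds in the read quality step more concrete, the following minimal sketch applies the same five limits to a single read in plain Python. It is an illustration under stated assumptions, not the fastp implementation: the function and parameter names are invented here, complexity is taken to be the percentage of bases that differ from the next base, and adapter trimming and all other fastp processing are ignored.

```python
def passes_fastp_like_filters(
    sequence,
    phred_scores,
    min_phred=20,                # bases below this quality count as unqualified
    max_unqualified_percent=40,  # allowed percentage of unqualified bases
    max_n_bases=5,               # allowed number of N bases
    min_length=15,               # minimum read length
    min_complexity_percent=30,   # minimum percentage of base changes
):
    """Return True if a read passes all five thresholds (illustrative only)."""
    if len(sequence) != len(phred_scores):
        raise ValueError("sequence and quality scores differ in length")

    length = len(sequence)
    if length < min_length:
        return False
    if sequence.upper().count("N") > max_n_bases:
        return False

    unqualified = sum(1 for score in phred_scores if score < min_phred)
    if unqualified * 100 > max_unqualified_percent * length:
        return False

    # Assumed complexity definition: share of positions whose base differs
    # from the following base.
    if length < 2:
        complexity_percent = 0.0
    else:
        changes = sum(1 for a, b in zip(sequence, sequence[1:]) if a != b)
        complexity_percent = 100 * changes / (length - 1)
    return complexity_percent >= min_complexity_percent


# Hypothetical reads: a varied high-quality read and a homopolymer read.
good_read = passes_fastp_like_filters("ACGTACGTACGTACGTACGT", [30] * 20)
homopolymer = passes_fastp_like_filters("A" * 20, [30] * 20)
print(good_read, homopolymer)
```

In the workflow itself these limits are set in the fastp tool form in Galaxy; the sketch only shows how each threshold decides whether a read is kept.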

Usage:

  1. Get workflow
    • Download workflow
    • Import workflow into your own Galaxy environment

  2. Import data

    • from IRIDA: the reads are already imported in the correct format (paired collection)
    • from another source of your choice
  3. Modify the settings if needed

  4. Modify host reference if needed

  5. Run workflow on imported read collection

  6. Explore results within Galaxy or download zip-folder containing all relevant output files

  7. Example of history when running workflow on a collection of 2 samples

  8. Example of multiQC report
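
Once the zip-folder has been downloaded, its contents can be sorted by output type with a few lines of Python. The sketch below is only an illustration: the prefixes are taken from the file names in the Output section, while the folder layout inside the archive and the sample names are assumptions. For demonstration it builds a small archive in memory; a real download would be opened by passing its path to the same function.

```python
import io
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath

# File-name prefixes as listed in the Output section of this page.
OUTPUT_PREFIXES = [
    "kraken2-bracken-",
    "kraken2-report-",
    "kraken2-krona-",
    "metaphlan2-merged",
    "genefamily_abundance_rpk_",
    "pathway_abundance_rpk_",
    "pathway_coverage_",
    "genefamily_abundance_cpm",
    "pathway_abundance_cpm",
    "multiQC-report",
]


def group_zip_members(zip_source):
    """Group archive members by the output prefix their file name starts with."""
    groups = defaultdict(list)
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            base_name = PurePosixPath(name).name
            for prefix in OUTPUT_PREFIXES:
                if base_name.startswith(prefix):
                    groups[prefix].append(name)
                    break
            else:
                groups["other"].append(name)
    return dict(groups)


# Build a small in-memory archive with hypothetical member names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo_archive:
    for member in [
        "results/kraken2-bracken-sampleA.tsv",
        "results/kraken2-report-sampleA.tsv",
        "results/metaphlan2-merged.tsv",
        "results/pathway_abundance_cpm",
        "results/multiQC-report.html",
        "results/notes.txt",
    ]:
        demo_archive.writestr(member, "placeholder\n")
buffer.seek(0)

grouped = group_zip_members(buffer)
for output_type, members in sorted(grouped.items()):
    print(output_type, members)
```

Files that match none of the known prefixes end up under "other", which makes unexpected archive contents easy to spot.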

Tool citations

Galaxy: https://galaxyproject.org/citing-galaxy/

Abubucker, Sahar and Segata, Nicola and Goll, Johannes and Schubert, Alyxandria M. and Izard, Jacques and Cantarel, Brandi L. and Rodriguez-Mueller, Beltran and Zucker, Jeremy and Thiagarajan, Mathangi and Henrissat, Bernard et al. (2012). Metabolic Reconstruction for Metagenomic Data and Its Application to the Human Microbiome. In PLoS Computational Biology, 8 (6), pp. e1002358. [doi:10.1371/journal.pcbi.1002358]

Ondov, Brian D and Bergman, Nicholas H and Phillippy, Adam M (2011). Interactive metagenomic visualization in a Web browser. In BMC Bioinformatics, 12 (1). [doi:10.1186/1471-2105-12-385]

Chen, Shifu and Zhou, Yanqing and Chen, Yaru and Gu, Jia (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. [doi:10.1101/274100]

Ewels, Philip and Magnusson, Måns and Lundin, Sverker and Käller, Max (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. In Bioinformatics, 32 (19), pp. 3047–3048. [doi:10.1093/bioinformatics/btw354]

Truong, Duy Tin and Franzosa, Eric A and Tickle, Timothy L and Scholz, Matthias and Weingart, George and Pasolli, Edoardo and Tett, Adrian and Huttenhower, Curtis and Segata, Nicola (2015). MetaPhlAn2 for enhanced metagenomic taxonomic profiling. In Nature Methods, 12 (10), pp. 902–903. [doi:10.1038/nmeth.3589]

Lu, Jennifer and Breitwieser, Florian P. and Thielen, Peter and Salzberg, Steven L. (2017). Bracken: estimating species abundance in metagenomics data. In PeerJ Computer Science, 3, pp. e104. [doi:10.7717/peerj-cs.104]

Wood, Derrick E and Salzberg, Steven L (2014). Kraken: ultrafast metagenomic sequence classification using exact alignments. In Genome Biology, 15 (3), pp. R46. [doi:10.1186/gb-2014-15-3-r46]

Blankenberg, D. and Gordon, A. and Von Kuster, G. and Coraor, N. and Taylor, J. and Nekrutenko, A. (2010). Manipulation of FASTQ data with Galaxy. In Bioinformatics, 26 (14), pp. 1783–1785. [doi:10.1093/bioinformatics/btq281]

Tools and their versions

tool; galaxy tool version; tool version

  • fastp; 0.19.5+galaxy1; 0.19.5
  • Kraken2; 2.1.1+galaxy0; 2.1.1
  • MultiQC; 1.7.1; 1.7
  • MetaPhlAn2; 2.6.0.0; 2.6.0
  • Merge (MetaPhlAn2); 2.6.0.0; 2.6.0
  • HUMAnN2; 0.11.1.0; 0.11.2
  • Renormalize (HUMAnN2); 0.11.1.0; 0.11.1
  • Join (HUMAnN2); 0.11.1.0; 0.11.1
  • Bracken; 2.2; 2.2
  • krona; 2.7.1; 2.7.1
  • FASTQ interlacer; 1.2.0.1; galaxy_internal
  • Extract element identifiers; 0.0.2; galaxy_internal
  • Build List; 1.0.0; galaxy_internal
  • Flatten Collection; 1.0.0; galaxy_internal
  • Replace Text; 1.1.2; galaxy_internal
  • Relabel List Identifiers; 1.0.0; galaxy_internal
  • Merge Collections; 1.0.0; galaxy_internal
  • Bundle Collection; 1.0.2; galaxy_internal