Galaxy QC - quadram-institute-bioscience/gmh-sops GitHub Wiki

Galaxy QC module 2.0

Details

Description: QC_module_wf_2.0 is a Galaxy workflow that performs QC on shotgun metagenomics data. It performs host read removal and quality control with fastp, and it counts the reads that pass QC and the reads that can be assigned to Bacteria, Archaea or Viruses. The pipeline also creates a first taxonomic profile using Kraken2 and Bracken. A MultiQC HTML report gives an overview of all the data, and browsable Krona diagrams let the user explore the taxonomic composition of the datasets. To run the workflow, the user needs to import reads into Galaxy, which can be done via direct import from IRIDA.

Contact person: Rebecca Ansorge ([email protected])

Source: QC_module_wf_2.0

Input

Paired collection of shotgun fastq metagenome reads

Output

Downloadable zip-folder containing:

  • kraken2-bracken-sampleX.tsv - Bracken re-estimated abundances for sample X, derived from the kraken2 output
  • kraken2-report-sampleX.tsv - kraken2 taxonomic report for sample X
  • kraken2-krona-sampleX.html - Krona visualization of the kraken2 taxonomic output for sample X
  • multiQC-report.html - QC report of fastq reads and taxa abundances
    1. General Statistics
    2. Species composition: kraken2-bracken species abundance profiles (bar charts)
    3. Domain composition: read numbers of domains and unclassified fraction
    4. Host removal: number of reads matching to host reference
    5. Fastp results (read qualities)
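As an illustration of how the per-sample Bracken table could be inspected after download, here is a minimal Python sketch. It assumes the file uses the standard Bracken column layout (name, taxonomy_id, taxonomy_lvl, kraken_assigned_reads, added_reads, new_est_reads, fraction_total_reads); the header of the file produced by the Galaxy tool may differ. The function name top_taxa and the example values are made up for illustration.

```python
import csv
import io


def top_taxa(bracken_tsv, n=5):
    """Return the n most abundant taxa as (name, fraction) tuples.

    bracken_tsv is an open text file (or any iterable of lines) with a
    header row in the standard Bracken layout (assumption).
    """
    reader = csv.DictReader(bracken_tsv, delimiter="\t")
    rows = [(row["name"], float(row["fraction_total_reads"])) for row in reader]
    return sorted(rows, key=lambda r: r[1], reverse=True)[:n]


# Made-up example input, not real results.
example = io.StringIO(
    "name\ttaxonomy_id\ttaxonomy_lvl\tkraken_assigned_reads\tadded_reads\t"
    "new_est_reads\tfraction_total_reads\n"
    "Escherichia coli\t562\tS\t900\t100\t1000\t0.25\n"
    "Bacteroides fragilis\t817\tS\t2800\t200\t3000\t0.75\n"
)
print(top_taxa(example, n=1))  # [('Bacteroides fragilis', 0.75)]
```

For a real file, pass an open file handle instead, for example `with open("kraken2-bracken-sampleX.tsv") as fh: top_taxa(fh)`.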

The workflow also removes host reads; the decontaminated reads can be downloaded from the Galaxy object decontaminated-reads-interleaved.
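Many downstream tools expect separate forward and reverse files rather than interleaved reads. The sketch below shows one way to split an interleaved FASTQ stream. It assumes plain 4-line FASTQ records with strictly alternating R1/R2 order; whether the Galaxy download is compressed or laid out exactly this way is not stated here, so check the file first. The function name deinterleave is hypothetical.

```python
import io
from itertools import islice


def deinterleave(interleaved, out_r1, out_r2):
    """Split interleaved FASTQ (R1, R2, R1, R2, ...) into two streams.

    Assumes 4-line records; returns the number of read pairs written.
    """
    pairs = 0
    while True:
        record_r1 = list(islice(interleaved, 4))
        if not record_r1:
            break  # clean end of input
        record_r2 = list(islice(interleaved, 4))
        if len(record_r1) != 4 or len(record_r2) != 4:
            raise ValueError("truncated record or unpaired read at end of input")
        out_r1.writelines(record_r1)
        out_r2.writelines(record_r2)
        pairs += 1
    return pairs


# Toy input with a single read pair (made-up sequences).
toy = io.StringIO("@read1/1\nACGT\n+\nIIII\n@read1/2\nTGCA\n+\nIIII\n")
r1, r2 = io.StringIO(), io.StringIO()
print(deinterleave(toy, r1, r2))  # 1
```

With real files, open the input and the two outputs in text mode (using gzip.open(path, "rt") for compressed input) and pass the handles to the function.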

Workflow

Workflow description:

Pipeline parameters can be modified by the user if needed. By default the host reference is 'human'; please change this to the correct host reference if needed (see below).

The workflow QC_module_wf_2.0 performs the following steps:

  1. Read quality assessment and trimming using fastp: phred Q20, unqualified percent limit 40%, N base limit 5, minimum read length 15, complexity threshold 30%
  2. Host read removal with Kraken2
    1. default database is: human_20200311 (human) - adapt as needed (see below for instructions)
    2. output: decontaminated-reads-interleaved (reads where host was removed)
  3. Taxonomic profiling of filtered reads with Kraken2: confidence threshold 0.5
    1. database: kraken2_db_20190111 - adapt as needed
    2. Abundance re-estimation with Bracken (Bayesian Reestimation of Abundance with KrakEN)
    3. Visualization of per-sample profiles as a Krona diagram (can be explored interactively)
  4. Create a MultiQC report that summarizes:
    1. General Statistics
    2. Species composition: kraken2-bracken species abundance profiles (bar charts)
    3. Domain composition: read numbers of domains and unclassified fraction
    4. Host removal: number of reads matching to host reference
    5. Fastp results (read qualities)
  5. Merge all outputs into downloadable zip file (decontaminated reads not included - these need to be downloaded separately)
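For readers who want to relate the parameters above to the command-line tools, the following Python sketch builds roughly equivalent fastp, Kraken2 and Bracken calls as argument lists. This is an approximation under stated assumptions, not the workflow's actual invocation: the Galaxy tool wrappers may assemble their commands differently, the file and database names are placeholders, and the Bracken read length is not given in the workflow description, so it is left as a required argument.

```python
import shlex


def fastp_command(r1, r2, out1, out2):
    """fastp call using the thresholds listed in step 1."""
    return [
        "fastp",
        "-i", r1, "-I", r2,
        "-o", out1, "-O", out2,
        "--qualified_quality_phred", "20",
        "--unqualified_percent_limit", "40",
        "--n_base_limit", "5",
        "--length_required", "15",
        "--low_complexity_filter",
        "--complexity_threshold", "30",
    ]


def kraken2_profile_command(db, r1, r2, report, output):
    """Kraken2 profiling call with the confidence threshold from step 3."""
    return [
        "kraken2", "--db", db, "--paired",
        "--confidence", "0.5",
        "--report", report, "--output", output,
        r1, r2,
    ]


def bracken_command(db, report, output, read_length):
    """Species-level Bracken call; read_length must be supplied by the caller
    because the workflow description does not state it."""
    return ["bracken", "-d", db, "-i", report, "-o", output,
            "-r", str(read_length), "-l", "S"]


# Placeholder file names; the commands are only printed, not executed.
print(shlex.join(fastp_command("s_R1.fastq.gz", "s_R2.fastq.gz",
                               "s_R1.trim.fastq.gz", "s_R2.trim.fastq.gz")))
print(shlex.join(kraken2_profile_command("kraken2_db_20190111", "s_R1.fq", "s_R2.fq",
                                         "s.report.tsv", "s.kraken2.tsv")))
```

Building the calls as lists keeps them safe to hand to subprocess.run without shell quoting issues, should you want to execute them outside Galaxy.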

The results can be either explored within Galaxy or downloaded.
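If you download the results, the domain composition shown in the MultiQC report can be approximated directly from a kraken2 report file. The sketch below assumes the standard six-column, tab-separated Kraken2 report (percentage, clade reads, direct reads, rank code, taxid, indented name) and that domains carry rank code D and unclassified reads rank code U. The function name and the example numbers are made up for illustration.

```python
import io


def domain_read_counts(report):
    """Return {name: clade_reads} for unclassified (U) and domain (D) rows."""
    counts = {}
    for line in report:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        clade_reads, rank, name = int(fields[1]), fields[3], fields[5].strip()
        if rank in ("U", "D"):
            counts[name] = clade_reads
    return counts


# Made-up report excerpt, not real results.
example = io.StringIO(
    " 10.00\t100\t100\tU\t0\tunclassified\n"
    " 90.00\t900\t0\tR\t1\troot\n"
    " 80.00\t800\t0\tD\t2\t    Bacteria\n"
    "  5.00\t50\t0\tD\t2157\t    Archaea\n"
    "  5.00\t50\t0\tD\t10239\t  Viruses\n"
)
print(domain_read_counts(example))
```

Reports generated with additional columns (for example minimizer data) would need the column indices adjusted.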

Usage:

  1. Get workflow
    • Download workflow
    • Import workflow into your own Galaxy environment
  2. Import data
    • from IRIDA: the reads are already imported in the correct format (paired collection)
    • from another source of choice (see the instructions in Uploading data into Galaxy)
  3. Modify the settings if needed
  4. Modify host reference if needed
  5. Run workflow on imported read collection
  6. Explore results within Galaxy or download zip-folder containing all relevant output files
  7. Example of history when running workflow on a collection of 2 samples
  8. Example of krona diagram
  9. Example of multiQC report

Tool citations

Galaxy: https://galaxyproject.org/citing-galaxy/

Abubucker, Sahar and Segata, Nicola and Goll, Johannes and Schubert, Alyxandria M. and Izard, Jacques and Cantarel, Brandi L. and Rodriguez-Mueller, Beltran and Zucker, Jeremy and Thiagarajan, Mathangi and Henrissat, Bernard et al. (2012). Metabolic Reconstruction for Metagenomic Data and Its Application to the Human Microbiome. In PLoS Computational Biology, 8 (6), pp. e1002358. [doi:10.1371/journal.pcbi.1002358]

Ondov, Brian D and Bergman, Nicholas H and Phillippy, Adam M (2011). Interactive metagenomic visualization in a Web browser. In BMC Bioinformatics, 12 (1). [doi:10.1186/1471-2105-12-385]

Chen, Shifu and Zhou, Yanqing and Chen, Yaru and Gu, Jia (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. [doi:10.1101/274100]

Ewels, Philip and Magnusson, Måns and Lundin, Sverker and Käller, Max (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. In Bioinformatics, 32 (19), pp. 3047–3048. [doi:10.1093/bioinformatics/btw354]

Wood, Derrick E and Salzberg, Steven L (2014). Kraken: ultrafast metagenomic sequence classification using exact alignments. In Genome Biology, 15 (3), pp. R46. [doi:10.1186/gb-2014-15-3-r46]

Blankenberg, D. and Gordon, A. and Von Kuster, G. and Coraor, N. and Taylor, J. and Nekrutenko, A. (2010). Manipulation of FASTQ data with Galaxy. In Bioinformatics, 26 (14), pp. 1783–1785. [doi:10.1093/bioinformatics/btq281]

Tools and their versions

tool; galaxy tool version; tool version

  • fastp; 0.19.5+galaxy1; 0.19.5
  • Kraken2; 2.1.1+galaxy0; 2.1.1
  • MultiQC; 1.7.1; 1.7
  • Bracken; 2.2; 2.2
  • krona; 2.7.1; 2.7.1
  • Extract element identifiers; 0.0.2; galaxy_internal
  • Build List; 1.0.0; galaxy_internal
  • Flatten Collection; 1.0.0; galaxy_internal
  • Replace Text; 1.1.2; galaxy_internal
  • Relabel List Identifiers; 1.0.0; galaxy_internal
  • Merge Collections; 1.0.0; galaxy_internal
  • Bundle Collection; 1.0.2; galaxy_internal