Galaxy Assembly based workflow - quadram-institute-bioscience/gmh-sops GitHub Wiki

Assembly-based analyses of shotgun metagenomics

Details

Description: Assembly_wf_2.0 is a Galaxy workflow for assembly-based metagenomic analysis of shotgun sequencing data. It performs quality trimming, host read removal, assembly, binning, and annotation on paired-read collections in Galaxy. The user must first import the reads into Galaxy manually, for example by direct import from IRIDA.

Contact person: Rebecca Ansorge ([email protected])

Source: https://galaxy.quadram.ac.uk/galaxy/u/ansorge/w/assemblywf20

Input

Paired collection of shotgun fastq metagenome reads

Output

merged-results_zip (downloadable zip folder) contains:

  • metabat2-checkm-results - CheckM completeness and contamination assessment of the MetaBAT2 bins

  • multiQC-report.html - QC report of reads and assembly

  • bin-GTDBclassification - GTDB-tk classification of the bins that passed the quality threshold of >70% completeness

  • For each sample X:

    • metaspades-assembly-graph-sample_X - assembly graph (fastg format) produced by metaSPAdes for sample X
    • prokka-sampleX-annotations-tsv.tsv - Prokka annotations in tsv format for sample X
    • prokka-sampleX-features-tbl.tsv - Prokka feature table (tbl) for sample X
    • prokka-sampleX-gbk.genbank - Prokka annotations in GenBank format for sample X
    • prokka-sampleX-gff.gff - Prokka annotations in gff3 format for sample X
    • prokka-sampleX-genes-ffn.fasta - Prokka nucleotide gene sequences in fasta format for sample X
    • prokka-sampleX-proteins-faa.fasta - Prokka amino acid protein sequences in fasta format for sample X
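
The >70% completeness threshold mentioned above can be reproduced outside Galaxy with a few lines of Python. This is only a sketch: it assumes the CheckM results are available as a tab-separated table with `Bin Id` and `Completeness` columns (the layout of metabat2-checkm-results is not described here), and the bin names and values in the example are made up for illustration.

```python
import csv
import io


def bins_passing_completeness(checkm_tsv_text, min_completeness=70.0):
    """Return the IDs of bins whose completeness is strictly above the threshold.

    Assumes a tab-separated table with "Bin Id" and "Completeness" columns.
    """
    reader = csv.DictReader(io.StringIO(checkm_tsv_text), delimiter="\t")
    passing = []
    for row in reader:
        if float(row["Completeness"]) > min_completeness:
            passing.append(row["Bin Id"])
    return passing


# Made-up example table (values are illustrative only).
example = (
    "Bin Id\tCompleteness\tContamination\n"
    "sampleA_bin_1\t95.2\t1.1\n"
    "sampleA_bin_2\t70.0\t0.0\n"
    "sampleA_bin_3\t42.7\t3.4\n"
)
print(bins_passing_completeness(example))
```

Note that a bin with exactly 70.0% completeness is excluded here, because the threshold is read as "more than 70%".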

merged-assembly-bins_zip (downloadable zip folder) contains:

  • For each sample X:
    • metaspades-assembly-scaffolds-sample_X.fasta - metaSPAdes assembly for sample X in fasta format
    • metabat2-sampleX_bin_1.fasta, metabat2-sampleX_bin_2.fasta, ... - MetaBAT2 bins 1, 2, 3, 4, ... for sample X
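
After downloading the archive, the bins can be grouped by sample with the Python standard library. The sketch below assumes the bin files follow the `metabat2-<sample>_bin_<n>.fasta` naming shown above; the demo archive is built in memory with placeholder names and sequences, so nothing here reflects real workflow output.

```python
import io
import re
import zipfile
from collections import defaultdict

# Naming pattern assumed from the file names listed above.
BIN_NAME = re.compile(r"metabat2-(?P<sample>.+)_bin_(?P<bin>\d+)\.fasta$")


def bins_per_sample(zip_file):
    """Map each sample name to the sorted bin numbers found in the archive."""
    grouped = defaultdict(list)
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            match = BIN_NAME.search(name)
            if match:
                grouped[match.group("sample")].append(int(match.group("bin")))
    return {sample: sorted(bins) for sample, bins in grouped.items()}


# Demo: build a small placeholder archive in memory instead of a real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("metabat2-sampleA_bin_2.fasta", ">contig_1\nACGT\n")
    demo.writestr("metabat2-sampleA_bin_1.fasta", ">contig_2\nACGT\n")
    demo.writestr("metaspades-assembly-scaffolds-sample_A.fasta", ">scaffold_1\nACGT\n")
buffer.seek(0)
print(bins_per_sample(buffer))
```

For a real download, pass the path of the zip file to `bins_per_sample` instead of the in-memory buffer.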

Workflow

Users can modify the pipeline parameters if needed. By default, the workflow Assembly_wf_2.0 performs the following steps:

  1. Read quality trimming using fastp: Phred quality threshold Q20, unqualified percent limit 40%, N base limit 5, minimum read length 15, complexity threshold 30%
  2. Host read removal with Kraken2
    1. The default database is human_20200311 (human); adapt as needed (see below for instructions)
    2. NOTE: host reads are excluded from all downstream analyses
  3. Assembly using metaSPAdes and assembly QC using Quast
  4. Metagenomic binning using MetaBAT2
  5. Assessment of bin completeness and contamination using CheckM
  6. Classification of all bins with more than 70% CheckM completeness using GTDB-tk
  7. Annotation of the assembly using Prokka (in fast mode and with the --metagenome option, which improves gene prediction for highly fragmented genomes)
  8. Summary of read, assembly, and bin statistics using MultiQC
  9. Merging of all outputs into two downloadable zip files (one containing the fasta sequences and one containing all other output)
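
For readers who want to reproduce the trimming step outside Galaxy, the fastp settings listed above map onto fastp command-line options roughly as follows. This is a sketch under assumptions: the Galaxy tool sets these values through its form, so the exact command it runs may differ, and the input and output file names are hypothetical.

```python
import shlex


def fastp_command(in1, in2, out1, out2):
    """Build a fastp argument list mirroring the trimming settings above."""
    return [
        "fastp",
        "--in1", in1, "--in2", in2,
        "--out1", out1, "--out2", out2,
        "--qualified_quality_phred", "20",    # Phred Q20
        "--unqualified_percent_limit", "40",  # at most 40% unqualified bases
        "--n_base_limit", "5",                # at most 5 N bases per read
        "--length_required", "15",            # minimum read length
        "--low_complexity_filter",            # enable the complexity filter
        "--complexity_threshold", "30",       # complexity threshold in percent
    ]


# Hypothetical file names, used only to show the resulting command line.
cmd = fastp_command("sample_R1.fastq.gz", "sample_R2.fastq.gz",
                    "trimmed_R1.fastq.gz", "trimmed_R2.fastq.gz")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` on a machine where fastp is installed.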

Usage:

  1. Get workflow
    1. Download workflow from: -TBD-
    2. Import workflow into your own Galaxy environment
  2. Import data
    1. From IRIDA: the reads are already imported in the correct format (paired collection)
    2. From another source of choice
  3. Modify the settings if needed
  4. Modify the host reference if needed
  5. Run the workflow on the imported read collection
  6. Explore the results within Galaxy or download the zip folder(s) containing all relevant output files
  7. Example of a history when the workflow is run with 2 samples
  8. Example of a (partial) MultiQC report
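
The import and run steps can also be scripted through the Galaxy API. The following is a minimal sketch using the third-party BioBlend library, not part of this SOP: the workflow file path, API key, history ID, and collection ID are placeholders you would supply yourself, and the input step index `"0"` is an assumption that should be checked against the imported workflow.

```python
def collection_input(collection_id, input_step="0"):
    """Describe a paired read collection as workflow input.

    "hdca" marks the ID as a history dataset collection; the step index
    is an assumption and may differ in the actual workflow.
    """
    return {input_step: {"src": "hdca", "id": collection_id}}


def run_assembly_workflow(galaxy_url, api_key, workflow_file, history_id, collection_id):
    """Import a workflow file and invoke it on an existing read collection."""
    # Imported here so the helper above works without BioBlend installed.
    from bioblend.galaxy import GalaxyInstance

    gi = GalaxyInstance(url=galaxy_url, key=api_key)
    workflow = gi.workflows.import_workflow_from_local_path(workflow_file)
    return gi.workflows.invoke_workflow(
        workflow["id"],
        inputs=collection_input(collection_id),
        history_id=history_id,
    )


# Placeholder ID, shown only to illustrate the structure of the inputs mapping.
print(collection_input("COLLECTION_ID"))
```

Settings and the host reference would still need to be adjusted in the workflow itself (or through the `params` argument of `invoke_workflow`) before running.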

Tool citations

Galaxy: https://galaxyproject.org/citing-galaxy/

Wood, Derrick E and Salzberg, Steven L (2014). Kraken: ultrafast metagenomic sequence classification using exact alignments. In Genome Biology, 15 (3), pp. R46. doi:10.1186/gb-2014-15-3-r46

Wu, Yu-Wei and Simmons, Blake A. and Singer, Steven W. (2015). MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. In Bioinformatics, 32 (4), pp. 605–607. doi:10.1093/bioinformatics/btv638

Cuccuru, Gianmauro and Orsini, Massimiliano and Pinna, Andrea and Sbardellati, Andrea and Soranzo, Nicola and Travaglione, Antonella and Uva, Paolo and Zanetti, Gianluigi and Fotia, Giorgio (2014). Orione, a web-based framework for NGS analysis in microbiology. In Bioinformatics, 30 (13), pp. 1928–1929. doi:10.1093/bioinformatics/btu135

Bankevich, Anton and Nurk, Sergey and Antipov, Dmitry and Gurevich, Alexey A. and Dvorkin, Mikhail and Kulikov, Alexander S. and Lesin, Valery M. and Nikolenko, Sergey I. and Pham, Son and Prjibelski, Andrey D. and et al. (2012). SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. In Journal of Computational Biology, 19 (5), pp. 455–477. doi:10.1089/cmb.2012.0021

Seemann, T. (2014). Prokka: rapid prokaryotic genome annotation. In Bioinformatics, 30 (14), pp. 2068–2069. doi:10.1093/bioinformatics/btu153

Langmead, Ben and Trapnell, Cole and Pop, Mihai and Salzberg, Steven L (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. In Genome Biology, 10 (3), pp. R25. doi:10.1186/gb-2009-10-3-r25

Langmead, Ben and Salzberg, Steven L (2012). Fast gapped-read alignment with Bowtie 2. In Nature Methods, 9 (4), pp. 357–359. doi:10.1038/nmeth.1923

Ewels, Philip and Magnusson, Måns and Lundin, Sverker and Käller, Max (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. In Bioinformatics, 32 (19), pp. 3047–3048. doi:10.1093/bioinformatics/btw354

Gurevich, Alexey and Saveliev, Vladislav and Vyahhi, Nikolay and Tesler, Glenn (2013). QUAST: quality assessment tool for genome assemblies. In Bioinformatics, 29 (8), pp. 1072–1075.

Quast v4.1. http://bioinf.spbau.ru/quast. Released May 2016.

Chen, Shifu and Zhou, Yanqing and Chen, Yaru and Gu, Jia (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. doi:10.1101/274100

Parks, Donovan H. and Imelfort, Michael and Skennerton, Connor T. and Hugenholtz, Philip and Tyson, Gene W. (2015). CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. In Genome Research, 25 (7), pp. 1043–1055. doi:10.1101/gr.186072.114

Kang, Dongwan D. and Li, Feng and Kirton, Edward and Thomas, Ashleigh and Egan, Rob and An, Hong and Wang, Zhong (2019). MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. In PeerJ, 7, pp. e7359. doi:10.7717/peerj.7359

Tool versions

tool; galaxy tool version; tool version

  • fastp; 0.19.5; 0.19.5
  • metaSpades; 3.9.0; 3.9.0
  • Kraken2; 2.1.1+galaxy0; 2.1.1
  • Quast; 5.0.2; 5.0.2
  • Bowtie2; 2.3.4.3+galaxy0; 2.3.4.1
  • Prokka; 1.13; 1.13.3
  • checkm lineage_wf; 1.0.11; 1.0.11
  • GTDB-tk; 0.3.2; 0.3.2
  • coverm; 0.3.2; 0.3.2
  • metabat2; 2.14; 2.14
  • MultiQC; 1.7.1; 1.7
  • sort (samtools); 1.0.2; 1.9
  • Flatten Collection; 1.0.0; galaxy_internal
  • Build List; 1.0.0; galaxy_internal
  • Extract element identifiers; 0.0.2; galaxy_internal
  • Replace Text; 1.1.2; galaxy_internal
  • Relabel List Identifiers; 1.0.0; galaxy_internal
  • Merge Collections; 1.0.0; galaxy_internal
  • Bundle Collection; 1.0.2; galaxy_internal