Serratus Assembly - ababaian/serratus GitHub Wiki

Introduction

coronaSPAdes assemblies

Latest assemblies:

Category A: single-contig assemblies of length > 25 Kbp

https://serratus-public.s3.amazonaws.com/assemblies/analysis/catA-v3.txt : list of assemblies

https://serratus-public.s3.amazonaws.com/assemblies/analysis/catA-v3.fa : multiFASTA file (1 FASTA entry = 1 assembly)

Category B: multi-contig assemblies of total length > 25 Kbp

https://serratus-public.s3.amazonaws.com/assemblies/analysis/catB-v3.txt : list of assemblies

https://serratus-public.s3.amazonaws.com/assemblies/analysis/catB-v3.fa : multiFASTA file (1 FASTA entry = 1 contig, so each assembly spans multiple entries)

multiFASTA headers format:

>[accession name].coronaspades.[contig identifier given by coronaSPAdes]
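The header convention above can be split back into its parts with a few lines of Python. This is a minimal sketch, not part of the Serratus tooling: the function names `parse_header` and `assembly_lengths` are hypothetical, and the accession and contig identifier in the usage note are made-up examples. It splits on the first `.coronaspades.` only, on the assumption that the contig identifier may itself contain dots, and sums contig lengths per accession, which is how the total length of a multi-contig (Category B) assembly could be checked.

```python
MARKER = ".coronaspades."


def parse_header(header):
    """Split a multiFASTA header into (accession, contig_id)."""
    name = header.strip().lstrip(">")
    # Split on the first marker only: the contig ID may contain dots.
    accession, sep, contig_id = name.partition(MARKER)
    if not sep or not accession or not contig_id:
        raise ValueError(f"unexpected header format: {header!r}")
    return accession, contig_id


def assembly_lengths(fasta_lines):
    """Sum contig lengths per accession from an iterable of FASTA lines."""
    totals = {}
    accession = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            accession, _ = parse_header(line)
            totals.setdefault(accession, 0)
        elif accession is not None:
            # Sequence line: add its length to the current assembly.
            totals[accession] += len(line)
    return totals
```

For example, `parse_header(">SRR0000001.coronaspades.NODE_1_length_12_cov_3.5")` returns the accession `SRR0000001` and the contig identifier `NODE_1_length_12_cov_3.5`. Passing an open file handle for a downloaded multiFASTA file to `assembly_lengths` gives one total length per accession.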

Components

Read QC

fastp was used for read QC.

BBDuk was also considered and implemented, but was ultimately not used.

Pipelines

Ours: https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly

Broad's Viral NGS

V-pipe

List of Assemblers to Consider

Review article Data Transformation

  • This table lists 13 virus assemblers with links to code & papers.

Please update this list if you have ideas, corrections, or comments. If you don't have commit rights to this repository, add a comment to issue #71 instead. For each assembler, provide:

  1. Name
  2. Type (e.g. reference or de-novo)
  3. Link to code
  4. Link to paper
  5. Comments on pros or cons for the Serratus project

Use "??" as a placeholder if not known.

  1. Kollector

  2. ABySS

  3. Trans-ABySS

  4. RNA-Bloom

  5. SPAdes

    • Type: De-novo genomic / transcriptomic / metagenomic (several variants exist, e.g. rnaSPAdes and SPAdes meta)
    • Code: https://github.com/ablab/spades
    • Paper: https://doi.org/10.1089/cmb.2012.0021
    • Comments: Well-supported and generally robust assembler. SPAdes meta was highlighted in the review article at the top of the document ("Choice of assembly software has a critical impact on virome characterisation") as performing "consistently well".
  6. Megahit

  7. IDBA

  8. metaviralSPAdes

  9. SOAPdenovo-Trans

  10. SKESA

Output Format

We should harmonize the output format across assembly pipelines, so that their outputs can be compared directly and downstream components are easier to develop. Proposed requirements:

  • FASTA formatted assembly
  • BAM file of the reads that assembled, aligned against the assembly itself
  • CSV file with the following columns: contig ID, coverage, quality score (TBD)
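The CSV requirement above could be met with something like the following sketch. Everything beyond the three columns is an assumption: the snake_case column names, the two-decimal coverage format, the `NA` placeholder for the quality score (which is still TBD), and the function name `write_contig_table` are hypothetical choices, not an agreed specification.

```python
import csv
import io

# Assumed column names; the wiki only specifies the three fields.
COLUMNS = ["contig_id", "coverage", "quality_score"]


def write_contig_table(rows, handle):
    """Write one CSV row per contig to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "contig_id": row["contig_id"],
            "coverage": f"{row['coverage']:.2f}",
            # The quality score is not yet defined, so default to "NA".
            "quality_score": row.get("quality_score", "NA"),
        })


# Usage with a made-up contig record, written to an in-memory buffer.
buffer = io.StringIO()
write_contig_table([{"contig_id": "contig_1", "coverage": 12.5}], buffer)
```

Writing through a shared helper like this keeps the column order and number formatting identical across pipelines, which is the point of harmonizing the output.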

Validation

Candidate Samples for Discovery and Pipeline development

In this section of the Serratus Assembly wiki, please list samples that have been identified as likely containing coronavirus-related sequences, or samples that might serve as verified non-coronavirus sequences. Briefly describe each sample and explain why it would be useful to have it assembled as soon as possible:

  • SRR1234, Brief description, [High/low] priority
  • SRR1235, Brief description 2, [High/low] priority ...