Serratus Assembly - ababaian/serratus GitHub Wiki
Introduction
coronaSPAdes assemblies
Latest assemblies:
Category A: single-contig assemblies of length > 25 Kbp
https://serratus-public.s3.amazonaws.com/assemblies/analysis/catA-v3.txt : list of assemblies
https://serratus-public.s3.amazonaws.com/assemblies/analysis/catA-v3.fa : multiFASTA file (1 FASTA entry = 1 assembly)
Category B: multi-contig assemblies of total length > 25 Kbp
https://serratus-public.s3.amazonaws.com/assemblies/analysis/catB-v3.txt : list of assemblies
https://serratus-public.s3.amazonaws.com/assemblies/analysis/catB-v3.fa : multiFASTA file (1 FASTA entry = 1 contig, each assembly therefore is in multiple entries)
multiFASTA headers format:
>[accession name].coronaspades.[contig identifier given by coronaSPAdes]
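The header convention and the two categories above can be sketched in a few lines of Python. This is an illustration only, not part of the Serratus pipeline: the function names and the example header are hypothetical, and the code assumes that the literal tag `.coronaspades.` appears exactly once between the accession and the contig identifier. Because the contig identifier may itself contain dots, the header is split on the full tag rather than on every `.`.

```python
from collections import defaultdict

TAG = ".coronaspades."   # separator taken from the header format above
MIN_LENGTH = 25_000      # the "> 25 Kbp" threshold used for both categories


def parse_header(header):
    """Split '>ACCESSION.coronaspades.CONTIG_ID' into (accession, contig_id)."""
    accession, sep, contig_id = header.lstrip(">").strip().partition(TAG)
    if not sep or not accession or not contig_id:
        raise ValueError(f"unexpected header: {header!r}")
    return accession, contig_id


def contig_lengths(lines):
    """Return {accession: {contig_id: length}} from an iterable of multiFASTA lines."""
    lengths = defaultdict(dict)
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = parse_header(line)
            lengths[current[0]][current[1]] = 0
        elif current is not None:
            # Sequence line: add its length to the contig currently being read.
            lengths[current[0]][current[1]] += len(line)
    return dict(lengths)


def category(contigs):
    """Return 'A' (single contig) or 'B' (multi-contig) if the total length
    exceeds MIN_LENGTH, otherwise None."""
    if sum(contigs.values()) <= MIN_LENGTH:
        return None
    return "A" if len(contigs) == 1 else "B"


if __name__ == "__main__":
    # Hypothetical header, for illustration only.
    print(parse_header(">SRR1234.coronaspades.NODE_1_length_29000_cov_50.0"))
```

To group a downloaded file by assembly, pass an open file handle to `contig_lengths`, for example `contig_lengths(open("catB-v3.fa"))`.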
Components
Read QC
fastp was used for read QC.
BBDuk was also considered and implemented, but was ultimately not used.
Pipelines
Ours: https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly
List of Assemblers to Consider
Review article Data Transformation
- This table lists 13 virus assemblers with links to code & papers.
Please update this list if you have ideas, corrections, or comments. If you don't have commit rights to this repository, add a comment to issue #71. For each assembler, provide:
- Name
- Type (e.g. reference or de-novo)
- Link to code
- Link to paper
- Comments on pros or cons for the serratus project.
Use "??" as a placeholder if not known.
Kollector
- Type: Targeted De-novo
- Code: https://github.com/bcgsc/kollector
- Paper: https://doi.org/10.1093/bioinformatics/btx078
- Comments: Originally tested for genomic DNA assembly; may need refinement to work with transcriptome data.
ABySS
- Type: De-novo Genomic
- Code: https://github.com/bcgsc/abyss
- Paper: https://doi.org/10.1101/gr.214346.116
- Comments: ??
Trans-ABySS
- Type: De-novo Genomic for RNAseq
- Code: https://github.com/bcgsc/transabyss/releases/
- Paper: https://www.nature.com/articles/nmeth.1517?page=12
- Comments: ??
RNA-Bloom
- Type: De-novo Transcriptomic
- Code: https://github.com/bcgsc/RNA-Bloom
- Paper: https://doi.org/10.1101/701607
- Comments: Isoform assembly may be an unnecessary feature, but our datasets are expected to be transcriptomic.
SPAdes
- Type: De-novo genomic / transcriptomic / metagenomic (different varieties exist - rnaSPAdes, SPAdes meta etc.)
- Code: https://github.com/ablab/spades
- Paper: https://doi.org/10.1089/cmb.2012.0021
- Comments: Well-supported and generally robust assembler. SPAdes meta was highlighted in the review article at the top of the document ("Choice of assembly software has a critical impact on virome characterisation") as performing "consistently well".
Megahit
- Type: De-novo genomic / metagenomic
- Code: https://github.com/voutcn/megahit
- Paper: https://doi.org/10.1093/bioinformatics/btv033
- Comments: Very memory-efficient.
IDBA
- Type: De-novo metagenomic
- Code: https://github.com/loneknightpy/idba
- Paper: https://doi.org/10.1093/bioinformatics/bts174
- Comments: Anecdotally (i.e. in my own experience) works well for viral genome assembly. Also positively reviewed in the review paper above.
metaviralSPAdes
- Type: De-novo metagenomic
- Code:
- Paper: https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa490/5837667
- Comments: Recently published (May 15, 2020). Has some tools for viral contig classification/validation.
SOAPdenovo-Trans
- Type: RNA-Seq De novo Assembly
- Code: https://github.com/aquaskyline/SOAPdenovo-Trans
- Paper: https://academic.oup.com/bioinformatics/article/30/12/1660/380938
- Comments: This may be a little older, but it was used by the JGI for a while.
SKESA
- Type: De novo Assembly
- Code: https://github.com/ncbi/SKESA/releases
- Paper: https://link.springer.com/article/10.1186/s13059-018-1540-z
- Comments: ??
Output Format
It is important that we try to harmonize the output format from various assembly pipelines, so that we can better compare their outputs, and make it easier to develop downstream components. Below please find a proposed set of requirements:
- FASTA formatted assembly
- BAM file of the assembled reads, aligned back to the assembly itself
- CSV file with the following columns: contig ID, coverage, quality score (TBD)
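As a rough illustration of the proposed CSV, the sketch below writes one row per contig with the three columns listed above. The column names, the `write_contig_table` helper, and the `NA` placeholder are assumptions made for this example; the quality score is still TBD, so it is written as `NA` whenever no value is supplied.

```python
import csv
import io

# Hypothetical column names for the three proposed fields.
COLUMNS = ["contig_id", "coverage", "quality_score"]


def write_contig_table(rows, handle):
    """Write per-contig rows (dicts with 'contig_id', 'coverage' and an
    optional 'quality_score') to an open text handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        quality = row.get("quality_score")
        writer.writerow({
            "contig_id": row["contig_id"],
            "coverage": f"{row['coverage']:.2f}",
            # The quality score is not yet defined; use a placeholder if absent.
            "quality_score": "NA" if quality is None else quality,
        })


if __name__ == "__main__":
    buffer = io.StringIO()
    # Hypothetical contig record, for illustration only.
    write_contig_table([{"contig_id": "contig_1", "coverage": 12.5}], buffer)
    print(buffer.getvalue())
```

When writing to a real file, open it with `newline=""` as the `csv` module documentation recommends, so that line endings are not translated twice.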
Validation
Candidate Samples for Discovery and Pipeline development
In this section of the Serratus Assembly wiki, please list samples that have been identified as likely containing coronavirus-related sequence, or samples that might serve as verified non-coronavirus sequences. Briefly mention what it is and why it would be useful to have it assembled ASAP:
- SRR1234, Brief description, [High/low] priority
- SRR1235, Brief description 2, [High/low] priority ...