Assembly benchmark results for 8 coronavirus candidate datasets

See https://github.com/ababaian/serratus/issues/130 for the motivation and description of the datasets.

Viral contigs are available in s3://serratus-public/notebook/200526_assembly/RC/contigs.

The scripts used to produce this benchmark are in s3://serratus-public/notebook/200526_assembly/RC/scripts/.

Benchmark setup

Reads were given as-is to each assembler (no quality or adapter trimming). I ran the assemblers with default parameters and passed the contigs to CheckV. I then reported every contig whose CheckV hit is a genome in checkv_genbank.tsv that matches the regexp [Cc]orona.
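The filtering step can be sketched as follows. This is a minimal illustration, not the script used for the benchmark (those are in the scripts directory linked above). It assumes only that the input is tab-separated and keeps any row in which some field matches [Cc]orona. The function name `corona_rows` and the sample rows are hypothetical, and nothing is assumed about the real column layout of checkv_genbank.tsv.

```python
import csv
import io
import re

# Same regular expression as in the text.
CORONA = re.compile(r"[Cc]orona")


def corona_rows(handle):
    """Yield rows of a tab-separated file in which any field matches [Cc]orona."""
    for row in csv.reader(handle, delimiter="\t"):
        if any(CORONA.search(field) for field in row):
            yield row


# Hypothetical sample rows; the real file layout may differ.
sample = io.StringIO(
    "GCA_X\tViruses;Coronaviridae\n"
    "GCA_Y\tViruses;Flaviviridae\n"
    "GCA_Z\tbat coronavirus isolate\n"
)
hits = list(corona_rows(sample))
print([row[0] for row in hits])
```

On a real file, the `io.StringIO` object would be replaced by `open(path, newline="")`.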

MetaViralSPAdes did not detect anything in its regular output, so I used K*/before_chromosome_removal.fasta (the unfiltered final assembly) instead. I couldn't run it on the single-end datasets (SRR1168901.fastq.gz, SRR10951660.fastq.gz, SRR10951656.fastq.gz).

For coronaSPAdes, I ran CheckV on the gene_clusters.fasta file.

Results

SRR10829953 : ~180K reads to KT323979.1

| Method | Detected virus | Contig length (bp) | CheckV estimated completeness (%) | CheckV avg. AA identity (%) |
|---|---|---|---|---|
| Megahit | GCA_900205315.1 | 28530 | 101.4 | 97.55 |
| Megahit | GCA_000913415.1 | 583 | 2.0 | 62.7 |
| Minia-pipeline | GCA_900205315.1 | 24266 | 86.3 | 97.72 |
| Minia-pipeline | GCA_900205315.1 | 2650 | 9.5 | 96.26 |
| Minia k71 | GCA_900205315.1 | 21295 | 75.7 | 97.88 |
| Minia k71 | GCA_900205315.1 | 2553 | 9.1 | 98.2 |
| Minia k31 | GCA_900205315.1 | 24280 | 86.4 | 97.72 |
| Minia k31 | GCA_900205315.1 | 1040 | 3.7 | 98.4 |
| Minia k31 | GCA_900205315.1 | 1003 | 3.6 | 93.0 |
| MetaViralSPAdes | GCA_900205315.1 | 18984 | 67.5 | 97.51 |
| MetaViralSPAdes | GCA_900205315.1 | 8808 | 31.4 | 96.9 |
| coronaSPAdes | GCA_900205315.1 | 27973 | 99.5 | 97.5 |

SRR10829957 : ~195K reads to KP728470.1

| Method | Detected virus | Contig length (bp) | CheckV estimated completeness (%) | CheckV avg. AA identity (%) |
|---|---|---|---|---|
| Megahit | GCA_900205315.1 | 1548 | 5.5 | 95.4 |
| Megahit | GCA_900205315.1 | 26185 | 93.1 | 97.65 |
| Minia-pipeline | GCA_900205315.1 | 24272 | 86.3 | 97.72 |
| Minia-pipeline | GCA_900205315.1 | 3650 | 13.0 | 96.42 |
| Minia k71 | GCA_900205315.1 | 24243 | 86.2 | 97.72 |
| Minia k71 | GCA_900205315.1 | 2363 | 8.4 | 98.2 |
| Minia k71 | GCA_900205315.1 | 805 | 2.9 | 98.5 |
| Minia k31 | GCA_900205315.1 | 17109 | 60.9 | 98.19 |
| Minia k31 | GCA_900205315.1 | 5041 | 17.9 | 95.0 |
| Minia k31 | GCA_900205315.1 | 1451 | 5.2 | 98.1 |
| MetaViralSPAdes | GCA_900205315.1 | 28060 | 99.8 | 97.55 |
| coronaSPAdes | GCA_900205315.1 | 27994 | 99.6 | 97.55 |

SRR10951656 : ~4x coverage to MH878976.1

| Method | Detected virus | Contig length (bp) | CheckV estimated completeness (%) | CheckV avg. AA identity (%) |
|---|---|---|---|---|
| Megahit | GCA_000880055.1 | 460 | 1.7 | 93.8 |
| Megahit | GCA_000880055.1 | 569 | 2.1 | 95.2 |
| Megahit | GCA_000880055.1 | 790 | 2.9 | 87.4 |
| Minia-pipeline | GCA_000880055.1 | 206 | 0.7 | 84.4 |
| Minia k31 | GCA_000880055.1 | 203 | 0.7 | 92.2 |
| Minia k31 | GCA_000862965.1 | 343 | 1.2 | 96.5 |
| Minia k31 | GCA_000880055.1 | 255 | 0.9 | 85.0 |

SRR10951660 : ~1x coverage to MH878976.1

| Method | Detected virus | Contig length (bp) | CheckV estimated completeness (%) | CheckV avg. AA identity (%) |
|---|---|---|---|---|
| Megahit | GCA_000862965.1 | 346 | 1.3 | 96.5 |
| Megahit | GCA_000880055.1 | 634 | 2.3 | 95.7 |

SRR1194066 : ~16K read coverage to KF600647.1

| Method | Detected virus | Contig length (bp) | CheckV estimated completeness (%) | CheckV avg. AA identity (%) |
|---|---|---|---|---|
| Megahit | GCA_000901155.1 | 7601 | 25.2 | 99.18 |
| Minia-pipeline | GCA_000901155.1 | 6973 | 23.2 | 99.1 |
| Minia k71 | GCA_000901155.1 | 2502 | 8.3 | 100.0 |
| Minia k71 | GCA_000901155.1 | 760 | 2.5 | 100.0 |
| Minia k71 | GCA_000901155.1 | 285 | 1.0 | 100.0 |
| Minia k31 | GCA_000901155.1 | 5691 | 19.0 | 100.0 |
| MetaViralSPAdes | GCA_000901155.1 | 3180 | 10.6 | 100.0 |
| MetaViralSPAdes | GCA_000901155.1 | 2952 | 9.8 | 100.0 |
| coronaSPAdes | GCA_000901155.1 | 7601 | 25.3 | 99.18 |

ERR2756788 : Frank, ~8K mapped coverage, closest hit is fragment EU769558.1

| Method | Detected virus | Contig length (bp) | CheckV estimated completeness (%) | CheckV avg. AA identity (%) |
|---|---|---|---|---|
| Megahit | GCA_000872845.1 | 1998 | 6.3 | 32.2 |
| Megahit | GCA_003972065.1 | 29219 | 105.2 | 55.71 |
| Minia-pipeline | GCA_001503155.1 | 26412 | 95.1 | 56.18 |
| Minia-pipeline | GCA_003972065.1 | 2778 | 9.7 | 47.6 |
| Minia-pipeline | GCA_000872845.1 | 1801 | 5.6 | 32.2 |
| Minia k71 | GCA_000899495.1 | 7104 | 25.3 | 48.61 |
| Minia k71 | GCA_000899495.1 | 5044 | 17.7 | 80.3 |
| Minia k31 | GCA_003972065.1 | 28908 | 104.1 | 55.71 |
| Minia k31 | GCA_000872845.1 | 798 | 2.5 | 32.2 |
| MetaViralSPAdes | GCA_000899495.1 | 26406 | 95.1 | 55.75 |
| MetaViralSPAdes | GCA_003972065.1 | 1337 | 4.7 | 46.9 |
| MetaViralSPAdes | GCA_000872845.1 | 1185 | 3.7 | 32.2 |
| coronaSPAdes | GCA_003972065.1 | 29264 | 102.1 | 54.84 |

SRR7287110 : Ginger, ~46k mapped coverage to various feline CoV, closest hit is MN165107.1

| Method | Detected virus | Contig length (bp) | CheckV estimated completeness (%) | CheckV avg. AA identity (%) |
|---|---|---|---|---|
| Megahit | GCA_000856025.1 | 26905 | 95.1 | 85.3 |
| Megahit | GCA_000856025.1 | 1995 | 6.9 | 90.89 |
| Megahit | GCA_000856025.1 | 2598 | 9.0 | 91.14 |
| Megahit | GCA_000870985.1 | 3972 | 14.4 | 37.0 |
| Minia-pipeline | GCA_000856025.1 | 11269 | 39.5 | 79.61 |
| Minia-pipeline | GCA_000856025.1 | 6024 | 20.9 | 91.23 |
| Minia-pipeline | GCA_000856025.1 | 1058 | 3.7 | 68.1 |
| Minia-pipeline | GCA_000856025.1 | 2288 | 7.9 | 90.61 |
| Minia-pipeline | GCA_000856025.1 | 2291 | 7.9 | 91.14 |
| Minia k71 | GCA_000856025.1 | 10075 | 35.0 | 78.89 |
| Minia k71 | GCA_000856025.1 | 1000 | 3.4 | 89.4 |
| Minia k71 | GCA_000856025.1 | 843 | 3.0 | 90.8 |
| Minia k71 | GCA_000856025.1 | 846 | 3.0 | 90.8 |
| Minia k71 | GCA_000856025.1 | 666 | 2.3 | 97.8 |
| Minia k71 | GCA_000856025.1 | 666 | 2.3 | 94.6 |
| Minia k31 | GCA_000856025.1 | 4902 | 17.2 | 96.1 |
| Minia k31 | GCA_000856025.1 | 1891 | 6.5 | 92.4 |
| Minia k31 | GCA_000856025.1 | 1368 | 4.8 | 87.08 |
| Minia k31 | GCA_000856025.1 | 832 | 3.0 | 95.5 |
| MetaViralSPAdes | GCA_000856025.1 | 307 | 1.1 | 68.1 |
| MetaViralSPAdes | GCA_000856025.1 | 307 | 1.1 | 72.1 |
| MetaViralSPAdes | GCA_001504755.1 | 2879 | 10.4 | 36.0 |
| MetaViralSPAdes | GCA_000856025.1 | 1064 | 3.6 | 91.4 |
| MetaViralSPAdes | GCA_000856025.1 | 2201 | 7.6 | 91.8 |
| MetaViralSPAdes | GCA_000856025.1 | 1217 | 4.2 | 92.1 |
| MetaViralSPAdes | GCA_000856025.1 | 899 | 3.1 | 88.4 |
| MetaViralSPAdes | GCA_000856025.1 | 8785 | 30.5 | 95.9 |
| MetaViralSPAdes | GCA_000856025.1 | 858 | 3.0 | 95.5 |
| MetaViralSPAdes | GCA_000856025.1 | 9000 | 31.4 | 91.3 |
| coronaSPAdes | GCA_000856025.1 | 29277 | 103.6 | 85.58 |
| coronaSPAdes | GCA_000870985.1 | 3523 | 12.5 | 35.2 |

SRR1168901

Nothing was found, apparently.

Performance

On 4 threads, dataset SRR10829957 (4 GB compressed)

| Method | Time | Memory (GB) |
|---|---|---|
| Megahit | 2h59m | 6.8 |
| Minia-pipeline | 2h | 3.8 |
| Minia k61 | 24m | 1.7 |
| Minia k31 | 34m | 1.7 |
| MetaViralSPAdes | 18h21m | 21 |
| coronaSPAdes | 6h03m | 23 |
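
To compare the running times at a glance, the durations in the table above can be converted to minutes and expressed relative to the fastest run. This is a small sketch of that arithmetic only. The helper `to_minutes` is a hypothetical name, and the values are copied from the table.

```python
import re


def to_minutes(text):
    """Convert a duration such as '2h59m', '2h' or '24m' to minutes."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?", text)
    if match is None or not text:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return 60 * hours + minutes


# Wall-clock times copied from the performance table (SRR10829957, 4 threads).
times = {
    "Megahit": "2h59m",
    "Minia-pipeline": "2h",
    "Minia k61": "24m",
    "Minia k31": "34m",
    "MetaViralSPAdes": "18h21m",
    "coronaSPAdes": "6h03m",
}

minutes = {method: to_minutes(value) for method, value in times.items()}
fastest = min(minutes.values())

# Print methods from fastest to slowest with their slowdown factor.
for method, value in sorted(minutes.items(), key=lambda item: item[1]):
    print(f"{method:16s} {value:5d} min  {value / fastest:5.1f}x")
```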

Some thoughts

There are 3 types of datasets:

  1. (easy-medium instances) those on which the virus assembles into 1-3 contigs with any assembler, typically with one large contig covering >90% of the genome;
  2. (hard instances) those on which the coverage is too low, so reference-based assembly will be needed;
  3. (assembler-critical instances) those on which only some assemblers do well with default parameters.

Types 1 and 2 are "easy" in the sense that we can run any assembler and get pretty much the same result quality. I was wondering how many datasets are of type 3, where the choice of method matters. It seems only SRR7287110 is. Of course I am biased, but if all datasets were of type 1 or 2, Minia k31 would be an energy-saving way to triage datasets between those two types.
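
To see why SRR7287110 looks like a type 3 dataset, one can take the best CheckV completeness reached by each assembler in its results table. The sketch below does that, with values copied from that table. Applying the ">90%" cutoff to CheckV's completeness estimate is an assumption made here for illustration, not a rule used in the benchmark.

```python
# CheckV estimated completeness (%) of every reported contig for SRR7287110,
# grouped by assembler (values copied from the results table).
completeness = {
    "Megahit": [95.1, 6.9, 9.0, 14.4],
    "Minia-pipeline": [39.5, 20.9, 3.7, 7.9, 7.9],
    "Minia k71": [35.0, 3.4, 3.0, 3.0, 2.3, 2.3],
    "Minia k31": [17.2, 6.5, 4.8, 3.0],
    "MetaViralSPAdes": [1.1, 1.1, 10.4, 3.6, 7.6, 4.2, 3.1, 30.5, 3.0, 31.4],
    "coronaSPAdes": [103.6, 12.5],
}

# Assumption: ">90% of the genome" is read as CheckV completeness above 90.
THRESHOLD = 90.0

# Best single contig per assembler.
best = {method: max(values) for method, values in completeness.items()}

# Assemblers that produce one near-complete contig on this dataset.
passing = sorted(method for method, value in best.items() if value > THRESHOLD)

for method, value in best.items():
    print(f"{method:16s} best contig: {value:6.1f}%")
print("above threshold:", passing)
```

Only two of the six runs clear the cutoff here, which is the pattern the "assembler-critical" label describes.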