Assembly benchmark results for 8 coronavirus candidates datasets - ababaian/serratus GitHub Wiki
See https://github.com/ababaian/serratus/issues/130 for the motivation and description of the datasets.
Viral contigs are available in `s3://serratus-public/notebook/200526_assembly/RC/contigs`.
A collection of scripts that were used to produce this benchmark is in `s3://serratus-public/notebook/200526_assembly/RC/scripts/`.
Benchmark setup
Reads were given as-is to each assembler (no quality or adapter trimming). I ran the assemblers with default parameters and gave the contigs to CheckV. I then reported every contig whose CheckV hit is a genome in `checkv_genbank.tsv` that matches the regexp `[Cc]orona`.
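The filtering step can be sketched as follows. This is only an illustration, not the script used for the benchmark (those are in the scripts directory listed above). It assumes `checkv_genbank.tsv` is tab-separated with a header row and the genome accession in the first column, and it matches the regexp against every field because the column holding the taxonomy is not stated here. The function name `corona_genome_ids` and the sample rows are hypothetical.

```python
import csv
import io
import re

CORONA = re.compile(r"[Cc]orona")

def corona_genome_ids(tsv_handle):
    """Return first-column IDs of rows in which any field matches [Cc]orona."""
    reader = csv.reader(tsv_handle, delimiter="\t")
    next(reader, None)  # skip the header row (assumed to be present)
    ids = set()
    for row in reader:
        if row and any(CORONA.search(field) for field in row):
            ids.add(row[0])
    return ids

# Hypothetical rows, only to show the assumed shape of the input.
sample = io.StringIO(
    "genome_id\tlineage\n"
    "GCA_EXAMPLE_1\tViruses;Nidovirales;Coronaviridae\n"
    "GCA_EXAMPLE_2\tViruses;Caudovirales;Myoviridae\n"
    "GCA_EXAMPLE_3\tViruses;bat coronavirus\n"
)
print(sorted(corona_genome_ids(sample)))
```

The resulting set of IDs can then be used to keep only the CheckV hits that point to one of these genomes.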
MetaViralSPAdes did not detect anything in its regular output, so I used `K*/before_chromosome_removal.fasta` (the unfiltered final assembly) instead. I couldn't run it on the single-end datasets (`SRR1168901.fastq.gz`, `SRR10951660.fastq.gz`, `SRR10951656.fastq.gz`).
For coronaSPAdes, I ran CheckV on the `gene_clusters.fasta` file.
Results
`SRR10829953`: ~180K reads to KT323979.1
Method | Detected virus | Contig length (bp) | CheckV est. completeness (%) | CheckV avg. AA identity (%) |
---|---|---|---|---|
Megahit | GCA_900205315.1 | 28530 | 101.4 | 97.55 |
Megahit | GCA_000913415.1 | 583 | 2.0 | 62.7 |
Minia-pipeline | GCA_900205315.1 | 24266 | 86.3 | 97.72 |
Minia-pipeline | GCA_900205315.1 | 2650 | 9.5 | 96.26 |
Minia k71 | GCA_900205315.1 | 21295 | 75.7 | 97.88 |
Minia k71 | GCA_900205315.1 | 2553 | 9.1 | 98.2 |
Minia k31 | GCA_900205315.1 | 24280 | 86.4 | 97.72 |
Minia k31 | GCA_900205315.1 | 1040 | 3.7 | 98.4 |
Minia k31 | GCA_900205315.1 | 1003 | 3.6 | 93.0 |
MetaViralSPAdes | GCA_900205315.1 | 18984 | 67.5 | 97.51 |
MetaViralSPAdes | GCA_900205315.1 | 8808 | 31.4 | 96.9 |
coronaSPAdes | GCA_900205315.1 | 27973 | 99.5 | 97.5 |
`SRR10829957`: ~195K reads to KP728470.1
Method | Detected virus | Contig length (bp) | CheckV est. completeness (%) | CheckV avg. AA identity (%) |
---|---|---|---|---|
Megahit | GCA_900205315.1 | 1548 | 5.5 | 95.4 |
Megahit | GCA_900205315.1 | 26185 | 93.1 | 97.65 |
Minia-pipeline | GCA_900205315.1 | 24272 | 86.3 | 97.72 |
Minia-pipeline | GCA_900205315.1 | 3650 | 13.0 | 96.42 |
Minia k71 | GCA_900205315.1 | 24243 | 86.2 | 97.72 |
Minia k71 | GCA_900205315.1 | 2363 | 8.4 | 98.2 |
Minia k71 | GCA_900205315.1 | 805 | 2.9 | 98.5 |
Minia k31 | GCA_900205315.1 | 17109 | 60.9 | 98.19 |
Minia k31 | GCA_900205315.1 | 5041 | 17.9 | 95.0 |
Minia k31 | GCA_900205315.1 | 1451 | 5.2 | 98.1 |
MetaViralSPAdes | GCA_900205315.1 | 28060 | 99.8 | 97.55 |
coronaSPAdes | GCA_900205315.1 | 27994 | 99.6 | 97.55 |
`SRR10951656`: ~4x coverage to MH878976.1
Method | Detected virus | Contig length (bp) | CheckV est. completeness (%) | CheckV avg. AA identity (%) |
---|---|---|---|---|
Megahit | GCA_000880055.1 | 460 | 1.7 | 93.8 |
Megahit | GCA_000880055.1 | 569 | 2.1 | 95.2 |
Megahit | GCA_000880055.1 | 790 | 2.9 | 87.4 |
Minia-pipeline | GCA_000880055.1 | 206 | 0.7 | 84.4 |
Minia k31 | GCA_000880055.1 | 203 | 0.7 | 92.2 |
Minia k31 | GCA_000862965.1 | 343 | 1.2 | 96.5 |
Minia k31 | GCA_000880055.1 | 255 | 0.9 | 85.0 |
`SRR10951660`: ~1x coverage to MH878976.1
Method | Detected virus | Contig length (bp) | CheckV est. completeness (%) | CheckV avg. AA identity (%) |
---|---|---|---|---|
Megahit | GCA_000862965.1 | 346 | 1.3 | 96.5 |
Megahit | GCA_000880055.1 | 634 | 2.3 | 95.7 |
`SRR1194066`: ~16K read coverage to KF600647.1
Method | Detected virus | Contig length (bp) | CheckV est. completeness (%) | CheckV avg. AA identity (%) |
---|---|---|---|---|
Megahit | GCA_000901155.1 | 7601 | 25.2 | 99.18 |
Minia-pipeline | GCA_000901155.1 | 6973 | 23.2 | 99.1 |
Minia k71 | GCA_000901155.1 | 2502 | 8.3 | 100.0 |
Minia k71 | GCA_000901155.1 | 760 | 2.5 | 100.0 |
Minia k71 | GCA_000901155.1 | 285 | 1.0 | 100.0 |
Minia k31 | GCA_000901155.1 | 5691 | 19.0 | 100.0 |
MetaViralSPAdes | GCA_000901155.1 | 3180 | 10.6 | 100.0 |
MetaViralSPAdes | GCA_000901155.1 | 2952 | 9.8 | 100.0 |
coronaSPAdes | GCA_000901155.1 | 7601 | 25.3 | 99.18 |
`ERR2756788`: Frank, ~8K mapped coverage, closest hit is fragment EU769558.1
Method | Detected virus | Contig length (bp) | CheckV est. completeness (%) | CheckV avg. AA identity (%) |
---|---|---|---|---|
Megahit | GCA_000872845.1 | 1998 | 6.3 | 32.2 |
Megahit | GCA_003972065.1 | 29219 | 105.2 | 55.71 |
Minia-pipeline | GCA_001503155.1 | 26412 | 95.1 | 56.18 |
Minia-pipeline | GCA_003972065.1 | 2778 | 9.7 | 47.6 |
Minia-pipeline | GCA_000872845.1 | 1801 | 5.6 | 32.2 |
Minia k71 | GCA_000899495.1 | 7104 | 25.3 | 48.61 |
Minia k71 | GCA_000899495.1 | 5044 | 17.7 | 80.3 |
Minia k31 | GCA_003972065.1 | 28908 | 104.1 | 55.71 |
Minia k31 | GCA_000872845.1 | 798 | 2.5 | 32.2 |
MetaViralSPAdes | GCA_000899495.1 | 26406 | 95.1 | 55.75 |
MetaViralSPAdes | GCA_003972065.1 | 1337 | 4.7 | 46.9 |
MetaViralSPAdes | GCA_000872845.1 | 1185 | 3.7 | 32.2 |
coronaSPAdes | GCA_003972065.1 | 29264 | 102.1 | 54.84 |
`SRR7287110`: Ginger, ~46K mapped coverage to various feline CoV, closest hit is MN165107.1
Method | Detected virus | Contig length (bp) | CheckV est. completeness (%) | CheckV avg. AA identity (%) |
---|---|---|---|---|
Megahit | GCA_000856025.1 | 26905 | 95.1 | 85.3 |
Megahit | GCA_000856025.1 | 1995 | 6.9 | 90.89 |
Megahit | GCA_000856025.1 | 2598 | 9.0 | 91.14 |
Megahit | GCA_000870985.1 | 3972 | 14.4 | 37.0 |
Minia-pipeline | GCA_000856025.1 | 11269 | 39.5 | 79.61 |
Minia-pipeline | GCA_000856025.1 | 6024 | 20.9 | 91.23 |
Minia-pipeline | GCA_000856025.1 | 1058 | 3.7 | 68.1 |
Minia-pipeline | GCA_000856025.1 | 2288 | 7.9 | 90.61 |
Minia-pipeline | GCA_000856025.1 | 2291 | 7.9 | 91.14 |
Minia k71 | GCA_000856025.1 | 10075 | 35.0 | 78.89 |
Minia k71 | GCA_000856025.1 | 1000 | 3.4 | 89.4 |
Minia k71 | GCA_000856025.1 | 843 | 3.0 | 90.8 |
Minia k71 | GCA_000856025.1 | 846 | 3.0 | 90.8 |
Minia k71 | GCA_000856025.1 | 666 | 2.3 | 97.8 |
Minia k71 | GCA_000856025.1 | 666 | 2.3 | 94.6 |
Minia k31 | GCA_000856025.1 | 4902 | 17.2 | 96.1 |
Minia k31 | GCA_000856025.1 | 1891 | 6.5 | 92.4 |
Minia k31 | GCA_000856025.1 | 1368 | 4.8 | 87.08 |
Minia k31 | GCA_000856025.1 | 832 | 3.0 | 95.5 |
MetaViralSPAdes | GCA_000856025.1 | 307 | 1.1 | 68.1 |
MetaViralSPAdes | GCA_000856025.1 | 307 | 1.1 | 72.1 |
MetaViralSPAdes | GCA_001504755.1 | 2879 | 10.4 | 36.0 |
MetaViralSPAdes | GCA_000856025.1 | 1064 | 3.6 | 91.4 |
MetaViralSPAdes | GCA_000856025.1 | 2201 | 7.6 | 91.8 |
MetaViralSPAdes | GCA_000856025.1 | 1217 | 4.2 | 92.1 |
MetaViralSPAdes | GCA_000856025.1 | 899 | 3.1 | 88.4 |
MetaViralSPAdes | GCA_000856025.1 | 8785 | 30.5 | 95.9 |
MetaViralSPAdes | GCA_000856025.1 | 858 | 3.0 | 95.5 |
MetaViralSPAdes | GCA_000856025.1 | 9000 | 31.4 | 91.3 |
coronaSPAdes | GCA_000856025.1 | 29277 | 103.6 | 85.58 |
coronaSPAdes | GCA_000870985.1 | 3523 | 12.5 | 35.2 |
`SRR1168901`: nothing was found, apparently.
Performance
On 4 threads, dataset `SRR10829957` (4 GB compressed):
Method | Time | Memory (GB) |
---|---|---|
Megahit | 2h59m | 6.8 |
Minia-pipeline | 2h | 3.8 |
Minia k61 | 24m | 1.7 |
Minia k31 | 34m | 1.7 |
MetaViralSPAdes | 18h21m | 21 |
coronaSPAdes | 6h03m | 23 |
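To compare these running times at a glance, the short sketch below converts them to minutes and expresses each one relative to Minia k31. The helper `to_minutes` is a hypothetical name introduced for this example, and the figures are the ones from the table above.

```python
import re

def to_minutes(duration):
    """Convert a string such as '2h59m', '2h' or '24m' to minutes."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?", duration)
    if not match or not duration:
        raise ValueError(f"unrecognised duration: {duration!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return 60 * hours + minutes

# Running times taken from the table above.
times = {
    "Megahit": "2h59m",
    "Minia-pipeline": "2h",
    "Minia k61": "24m",
    "Minia k31": "34m",
    "MetaViralSPAdes": "18h21m",
    "coronaSPAdes": "6h03m",
}

baseline = to_minutes(times["Minia k31"])
for method, t in times.items():
    minutes = to_minutes(t)
    print(f"{method}: {minutes} min, {minutes / baseline:.1f}x Minia k31")
```

This is a single measurement on one dataset, so the ratios are only indicative.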
Some thoughts
There are 3 types of datasets:
- (easy-medium instances) those on which the virus assembles into 1-3 contigs with any assembler, typically one large contig covering >90% of the genome
- (hard instances) those where the coverage is too low and reference-based assembly will be needed
- (assembler-critical instances) those on which only some assemblers do well with default parameters
Clearly types 1 and 2 are "easy" in the sense that we can run any assembler and get pretty much the same result quality. I was wondering how many datasets were of type 3, which makes the choice of method harder. It seems only `SRR7287110` is. Of course I am biased, but if all datasets were of type 1 or 2, then Minia k31 would be an energy-saving way to triage datasets between those two types.
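One way to make the three types concrete is the rough rule sketched below. It is an assumption made for illustration, not the criterion used above: an assembler "does well" if the CheckV completeness of its best 3 contigs sums to at least 80%. Both cut-offs (`min_total`, `max_contigs`) and the function name `classify` are hypothetical, summing completeness can overcount overlapping contigs, and other cut-offs would move borderline datasets between types. The values are copied from the result tables; assemblers with no row for a dataset are entered as empty lists.

```python
def classify(completeness_by_assembler, min_total=80.0, max_contigs=3):
    """Classify one dataset as type 1, 2 or 3 (assumed thresholds).

    completeness_by_assembler maps an assembler name to the CheckV
    completeness values (%) of the contigs it produced.
    """
    def does_well(values):
        top = sorted(values, reverse=True)[:max_contigs]
        return sum(top) >= min_total

    verdicts = [does_well(v) for v in completeness_by_assembler.values()]
    if not any(verdicts):
        return 2  # no assembler recovers the genome: coverage likely too low
    if all(verdicts):
        return 1  # every assembler recovers most of the genome
    return 3      # the outcome depends on the assembler

srr10829957 = {
    "Megahit": [5.5, 93.1],
    "Minia-pipeline": [86.3, 13.0],
    "Minia k71": [86.2, 8.4, 2.9],
    "Minia k31": [60.9, 17.9, 5.2],
    "MetaViralSPAdes": [99.8],
    "coronaSPAdes": [99.6],
}
srr10951660 = {
    "Megahit": [1.3, 2.3],
    "Minia k31": [],  # no row in the table
}
srr7287110 = {
    "Megahit": [95.1, 6.9, 9.0, 14.4],
    "Minia-pipeline": [39.5, 20.9, 3.7, 7.9, 7.9],
    "Minia k71": [35.0, 3.4, 3.0, 3.0, 2.3, 2.3],
    "Minia k31": [17.2, 6.5, 4.8, 3.0],
    "MetaViralSPAdes": [1.1, 1.1, 10.4, 3.6, 7.6, 4.2, 3.1, 30.5, 3.0, 31.4],
    "coronaSPAdes": [103.6, 12.5],
}

for name, data in [("SRR10829957", srr10829957),
                   ("SRR10951660", srr10951660),
                   ("SRR7287110", srr7287110)]:
    print(name, "-> type", classify(data))
```

Under this rule a single cheap run cannot identify type 3 by itself, which is why the triage idea only holds if type 3 datasets are rare.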