Architecture and Pipeline - ababaian/serratus GitHub Wiki
Architecture

In February 2020, AWS mirrored the NCBI/NIH Sequence Read Archive (SRA) onto its S3 servers as an Open Dataset, which allows an unprecedented rate of access to petabytes of raw sequencing data.
To perform the ultra-high-throughput coronavirus (CoV) search, we used AWS cloud computing to create a 22,500 vCPU cluster (1,460 x R5.xlarge, 4,120 x C5.xlarge, and 90 x C5.large EC2 instances). This hyper-parallelized architecture bypasses the networking and disk I/O limitations that would stall a conventional cluster, which accelerates the rate of discovery.
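As a quick sanity check on the cluster size quoted above, the instance counts can be multiplied by the vCPUs per instance type. This is only an illustrative sketch: the per-instance vCPU figures (4 for the `xlarge` sizes, 2 for `c5.large`) are the standard AWS values for these types rather than something stated in this page, and the names `FLEET`, `VCPUS_PER_INSTANCE`, and `total_vcpus` are hypothetical.

```python
# Instance counts as quoted in the text.
FLEET = {"r5.xlarge": 1460, "c5.xlarge": 4120, "c5.large": 90}

# Assumption: standard AWS vCPU counts for these instance sizes.
VCPUS_PER_INSTANCE = {"r5.xlarge": 4, "c5.xlarge": 4, "c5.large": 2}


def total_vcpus(fleet, vcpus_per_instance):
    """Sum vCPUs over every instance type in the fleet."""
    return sum(count * vcpus_per_instance[name] for name, count in fleet.items())


print(total_vcpus(FLEET, VCPUS_PER_INSTANCE))  # 22500
```

The total matches the 22,500 vCPU figure: 5,840 from R5.xlarge, 16,480 from C5.xlarge, and 180 from C5.large.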
The Serratus architecture is aggressively cost-optimized for big data analysis, achieving a maximum alignment throughput of more than 1,000,000 sequencing libraries per 24 hours, at a cost of $0.004 to $0.006 per library.
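To put the per-library cost in context, the figures above imply a daily spend in the low thousands of dollars at peak throughput. The sketch below simply multiplies the quoted numbers. It assumes the per-library cost holds at maximum throughput, which the text does not state explicitly, and `daily_cost_range` is a hypothetical helper.

```python
def daily_cost_range(libraries_per_day, cost_low, cost_high):
    """Return the (low, high) daily cost in dollars for a given throughput."""
    return libraries_per_day * cost_low, libraries_per_day * cost_high


# Figures quoted in the text: 1,000,000 libraries per 24 hours,
# at $0.004 to $0.006 per library.
low, high = daily_cost_range(1_000_000, 0.004, 0.006)
print(f"${low:,.0f} to ${high:,.0f} per day")
```

Under that assumption, a full day at peak throughput costs roughly $4,000 to $6,000.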
The workhorse for Serratus is currently the C5.large EC2 instance on AWS. Each instance has 2 vCPU and 4 GB of memory, and runs a minimal Amazon Linux 2 image with Docker. All workflows are containerized to allow rapid and cheap scaling of the cluster.
Bioinformatics pipeline

To detect known and diverged viruses, we selected bowtie2 for nucleotide alignment and diamond2 for translated-nucleotide alignment. Each SRA accession is downloaded (prefetch), decompressed (fastq-dump), and split into equal-sized "blocks" of 1 million reads. This gives a highly uniform and predictable workload, and lets download instances and alignment instances scale independently for optimal resource usage.
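The block-splitting step can be sketched in a few lines. This is a minimal illustration, not the Serratus implementation: it assumes unwrapped FASTQ input with exactly four lines per read, and the function name `fastq_blocks` and its `reads_per_block` parameter are hypothetical.

```python
import io
from itertools import islice


def fastq_blocks(handle, reads_per_block=1_000_000):
    """Yield lists of FASTQ lines, each holding at most reads_per_block reads.

    Assumption: every read occupies exactly 4 lines (unwrapped FASTQ).
    """
    lines_per_block = 4 * reads_per_block
    while True:
        block = list(islice(handle, lines_per_block))
        if not block:
            return
        yield block


# Tiny demonstration: 5 reads split into blocks of 2 reads each.
demo = io.StringIO("".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(5)))
sizes = [len(block) // 4 for block in fastq_blocks(demo, reads_per_block=2)]
print(sizes)  # [2, 2, 1]
```

Because every block except the last holds the same number of reads, each alignment job does a similar amount of work, which is what makes the workload predictable regardless of the size of the original accession.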
Any input collection of nucleotide or protein sequences can be queried with Serratus. For more details about the searches we performed see Sequence-Resources.
Real-time Cluster Monitoring
We implemented a Grafana/Prometheus cluster monitoring system. Cluster performance can be monitored at high granularity in real time to ensure smooth operation and to avoid cost overruns.
