Serratus Budget Overview

We're often asked how we manage the budget for Serratus. Formally the project has no research grants and all researchers are volunteers. We have spent approx $370 out of pocket to make this project a reality, including a $250 Amazon gift certificate offered as an incentive for developing a proof of concept for a critical module. AWS/CIC has very kindly allowed us use of their computing systems and we have tracked how much it would have cost, but this comes with the caveat that we certainly would not be as brazen with our R&D if these were real dollar figures taken from a research grant.

March 2020

Serratus was initiated as a drawing on a napkin on March 2nd 2020.

This establishes the baseline costs of $2/day for S3 storage as part of previous work. We pay $40 for the (serratus.io) domain. During this time we are developing the core infrastructure behind Serratus using small test instances (t3.micro and t3.medium) to keep development costs minimal. Development is done predominantly on on-premises computers (our laptops).

Total: $120.59

April 2020

Work was initially being performed in the us-west-1 AWS region (local to us). The SRA buckets are in us-east-1, and moving data between regions dinged us $60 in two days. We migrate all of our work to us-east-1 to prevent cross-region data transfers.

We initiate the first pilot phases of Serratus, analyzing a blistering 800 libraries per day for $85 ($0.106 / library). This is approximately five times cheaper than our previous experiences of running alignment on AWS.
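The per-library figure quoted here is simply total spend divided by libraries processed. A minimal sketch of that arithmetic follows; the function name `cost_per_library` is our own label for illustration, not part of Serratus.

```python
def cost_per_library(total_cost_usd: float, n_libraries: int) -> float:
    """Average cost of processing one sequencing library, in US dollars."""
    if n_libraries <= 0:
        raise ValueError("n_libraries must be positive")
    return total_cost_usd / n_libraries


# Pilot-phase figures quoted above: $85 for 800 libraries in a day.
pilot = cost_per_library(85, 800)
print(f"${pilot:.3f} / library")  # $0.106 / library
```

The same helper can be reused to sanity-check the per-library figures reported for later runs.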

Total: $279.39

May 2020

We are in active R&D mode at this point. Early in the month we initiate the Zoonotic Reservoir for Coronaviruses search. This was done using an early (and not so great) query sequence of Coronaviruses. We attempt to analyze 70,966 libraries, which is completed for $870 over two days ($0.0126 / library).

Mid-month we incur $600 of charges using a c5d.18xlarge instance for assembly of something like 10 libraries. This was drastic overkill for the application, but we wanted the results "fast". Realistically, this demonstrates the single largest drawback of cloud computing services: you can get "dinged" with an unexpected charge if a user allocates more resources than are absolutely necessary. Lesson learned so you don't have to. A standard instance is called "on demand", and you pay full price for it. If you can build "fault tolerance" into your pipeline, you can use SPOT instances (half the cost, but they can be shut down if another user pays "on demand" prices for them). Serratus was built around the idea of fault tolerance in order to leverage SPOT instances.
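To make the "fault tolerance" idea concrete, here is a minimal sketch of one common pattern: a job whose worker is interrupted goes back on the queue instead of being lost. This is not the Serratus scheduler. The function `run_with_requeue`, the `flaky_worker` stand-in, and the use of `InterruptedError` to simulate a reclaimed SPOT instance are all assumptions made for illustration.

```python
from collections import deque


def run_with_requeue(jobs, worker, max_attempts=3):
    """Run every job; a job whose worker is interrupted is put back on the queue.

    Returns (results, failed), where failed lists jobs that were
    interrupted max_attempts times.
    """
    queue = deque((job, 1) for job in jobs)
    results, failed = [], []
    while queue:
        job, attempt = queue.popleft()
        try:
            results.append(worker(job))
        except InterruptedError:
            # Simulated SPOT reclaim: the work is lost but the job is not.
            if attempt < max_attempts:
                queue.append((job, attempt + 1))
            else:
                failed.append(job)
    return results, failed


# Hypothetical worker: library "b" is interrupted on its first attempt only.
seen = {}

def flaky_worker(job):
    seen[job] = seen.get(job, 0) + 1
    if job == "b" and seen[job] == 1:
        raise InterruptedError("instance reclaimed")
    return job.upper()


results, failed = run_with_requeue(["a", "b", "c"], flaky_worker)
print(results, failed)  # ['A', 'C', 'B'] []
```

The interrupted job completes on its second attempt, so the only cost of the interruption is the repeated work, which is the trade-off being made for the lower SPOT price.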

By the end of the month the primary biological R&D for Serratus is ready. We have a satisfactory pangenome to apply our search to, and begin to scale up to analyze a total of 1 million libraries. Unexpectedly, we begin to accumulate a large cost associated with "CloudWatch", which is essentially the text output from each instance sent to a log file. This allows us to debug individual components of Serratus and see where in the processing each instance is in essentially real time, but as we scale up the associated cost grows quickly. Soon after, we "quiet" the Serratus pipeline as much as possible and have it report only error messages, essentially eliminating this cost.
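"Quieting" a pipeline so that only errors are reported can be as simple as raising the log level. The sketch below uses Python's standard `logging` module to count how many messages a verbose and a quiet configuration would emit. The logger names, the `process` function, and the message text are hypothetical, and this is not how Serratus itself is configured.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted messages in memory so they can be counted."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def make_logger(name, level):
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # keep output out of the root logger
    handler = ListHandler()
    logger.addHandler(handler)
    return logger, handler


def process(logger, n_chunks):
    """Stand-in for a pipeline step that logs progress and one error."""
    for i in range(n_chunks):
        logger.info("aligned chunk %d", i)
    logger.error("chunk %d failed", 7)  # hypothetical failure


verbose_logger, verbose = make_logger("pipeline.verbose", logging.INFO)
quiet_logger, quiet = make_logger("pipeline.quiet", logging.ERROR)
process(verbose_logger, 100)
process(quiet_logger, 100)
print(len(verbose.messages), len(quiet.messages))  # 101 1
```

Multiplied across thousands of instances, the difference between the two counts is the volume of log data that no longer has to be shipped and stored.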

Total: $6,137.94

June 2020

The first phase of Serratus is in full swing at this point; we are processing our data at scale to get this analysis done fast! We begin to reach processing rates of 100,000+ sequencing libraries per day, often running into one roadblock or another that requires refactoring and optimizing code. Early in this month we switch from SQLite to PostgreSQL for the scheduler backend to keep up with the rapid processing of data. We establish a target goal of analyzing 1 million sequencing libraries in a 24-hour period for under 1 cent per library.

Early in the month I open an AWS support ticket and try to "expedite" the response by signing up for a service plan (these scale based on your total AWS bill). This cost $1,600, and in the end we got the response at about the same time as with our "free plan". If this were real money you could ask for a refund and would certainly receive it; our credits are imaginary anyway, so they tell us not to worry about it. Unless you are a proper enterprise and need rapid service responses, just wait the ~12 hours for a response with the free plan.

We test several different combinations of instance types for Serratus in this time: c5.xlarge, r5.xlarge, c5.8xlarge ... For the nucleotide search, the balance of efficiency and stability is reached with c5.2xlarge instances for alignment and r5.xlarge for downloading. Every run at scale becomes incrementally more efficient, both in total CPU efficiency and in processing rate.

Due to ongoing optimizations it's difficult to assign a dollar figure to this entire process, and any figure will include some R&D costs incurred at this time. Very roughly, over this month we complete 3.8 million sequence alignments for $50,000 ($0.0132 / library).

To measure this precisely for the paper, we "tag" the entire Serratus pipeline which allows us to extract costs associated with a single application. We perform a benchmark at the end of the month which is shown in the manuscript: "Serratus shows stable and linear performance to complete 1.29 million SRA accessions in a 24-hour period. Compute costs between modules are an approximate comparison of CPU requirements of each step. The total average cost per completed SRA accession was $0.0062 US dollars".
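Once resources carry tags, extracting the cost of a single application comes down to grouping billing line items by tag. The sketch below shows that grouping on a handful of invented rows. The row layout, the `project` tag key, and all dollar amounts are hypothetical and do not reflect the actual Serratus billing data or the AWS billing export format.

```python
from collections import defaultdict


def cost_by_tag(line_items, tag_key):
    """Sum cost_usd per value of tag_key; rows without the tag are 'untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        tag_value = item.get("tags", {}).get(tag_key, "untagged")
        totals[tag_value] += item["cost_usd"]
    return dict(totals)


# Hypothetical billing rows, invented purely for illustration.
line_items = [
    {"service": "EC2", "cost_usd": 40.0, "tags": {"project": "serratus"}},
    {"service": "S3", "cost_usd": 10.0, "tags": {"project": "serratus"}},
    {"service": "EC2", "cost_usd": 25.0, "tags": {"project": "side-project"}},
    {"service": "RDS", "cost_usd": 5.0, "tags": {}},
]

totals = cost_by_tag(line_items, "project")
grand_total = sum(totals.values())
for tag, cost in sorted(totals.items()):
    print(f"{tag:>12}: ${cost:6.2f} ({cost / grand_total:.0%} of total)")
```

The "untagged" bucket is worth watching: anything that lands there is spend that cannot be attributed to an application.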

We complete our first major data release and wrap up the nucleotide search phase of the project. At the end of the month we develop the diamond translated-nucleotide implementation of Serratus and begin performing pilot runs with that.

Total: $50,187.26

July 2020

At this point we are storing ~8 terabytes of data in our S3 buckets, daily costs are $70 to maintain this data.

After a brief translated-nucleotide pilot analysis of 360K libraries prior to June 13th ($8,200), we begin the assembly of Coronavirus data. We opt for a hyper-aggressive assembly strategy for Coronaviruses (55K+ libraries) to ensure we capture the globally available sequences and create a benchmark dataset for interpreting the alignment data we have generated thus far. We absolutely would not opt for such an aggressive strategy if these were real research dollars, but training data is valuable and it was possible for us to do it. The total cost of these assemblies reaches nearly $31,000 ($0.56 per assembly). This is likely a high-end estimate, since many assemblies had to be repeated due to the rapid rate at which we were processing them (over 3 days); if done more slowly, a pragmatic value would be $0.40 per assembly.

Due to the high memory requirements of assembly, it cannot be "broken down" into components as easily as alignment can, and the algorithms are simply much more computationally expensive. Barring a revolutionary new method for assembly, it's unlikely costs will break below $0.10 per library in the near future.

Near mid-month our first PostgreSQL database goes live. An oversight in how this was implemented (and downtime on the part of volunteers) led to this being a trickle of cost we will pay over the next few months (mistake!).

Total: $45,440.30

August 2020

The Serratus Preprint goes live!! There is a lull post submission as everyone rests. S3 storage costs and the RDS database now tick away at $120 per day ($3,600 / month). This took longer than I care to admit to realize/piece together.

We run some minor analyses in this time. At the end of the month we begin an updated translated-protein search and start digging into the data (on local machines), making sense of the massive amounts of data we collected.

Total cost: $9,613.57

September 2020

While waiting for our revisions to return, we initiate some side-project ideas using Serratus. These didn't really pan out in the way we hoped, but some minor improvements to the translated-protein search were added in this time. The mid-month spike was another 8,000 assemblies we performed for novel satellite viruses; the analysis was too inconsistent to make sense of at the time. Once the assemblies were done, an oversight allowed a few large r4.16xlarge instances to remain operational for days, costing $1,500. It's incredibly important to ensure all resources are shut down after an analysis completes, and not to assume the shutdown functions worked as intended.
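A post-run check for instances that are still running is easy to automate. The sketch below flags running instances older than a threshold. The input mirrors the nested `Reservations` / `Instances` layout that, to our understanding, boto3's `describe_instances` call returns, but here it is fed hand-written sample data so it runs without AWS credentials. The function name `find_stale_instances`, the instance IDs, and the 24-hour threshold are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def find_stale_instances(reservations, now, max_age_hours=24):
    """Return (instance_id, instance_type, age) for long-running instances."""
    stale = []
    for reservation in reservations:
        for inst in reservation.get("Instances", []):
            if inst["State"]["Name"] != "running":
                continue  # stopped or terminated instances are not flagged
            age = now - inst["LaunchTime"]
            if age > timedelta(hours=max_age_hours):
                stale.append((inst["InstanceId"], inst["InstanceType"], age))
    return stale


# Hypothetical sample data; the IDs and launch times are made up.
now = datetime(2020, 9, 20, 12, 0, tzinfo=timezone.utc)
reservations = [
    {"Instances": [
        {"InstanceId": "i-0123456789abcdef0", "InstanceType": "r4.16xlarge",
         "State": {"Name": "running"}, "LaunchTime": now - timedelta(days=3)},
        {"InstanceId": "i-0fedcba9876543210", "InstanceType": "t3.micro",
         "State": {"Name": "running"}, "LaunchTime": now - timedelta(hours=2)},
        {"InstanceId": "i-00000000000000000", "InstanceType": "c5.2xlarge",
         "State": {"Name": "terminated"}, "LaunchTime": now - timedelta(days=9)},
    ]},
]

stale = find_stale_instances(reservations, now)
for instance_id, instance_type, age in stale:
    print(f"{instance_id} ({instance_type}) has been running for {age.days} days")
```

Run on a schedule after each analysis, a check like this turns "assume the shutdown worked" into "verify the shutdown worked".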

At this point the RDS server expanded itself to a full 10 TB of block storage.

Total cost: $21,166.21

Oct/Nov 2020

At the beginning of the month I routinely check on the billing and catch the "hanging" EC2 instance. During this downtime we can see the 'background' costs associated with cloud computing, which tend to accumulate. These are small individually, but together they add up. The RDS server block storage was $60/day, and overallocation via an r5.xlarge instance (should have been a t2.medium or less) cost $12/day. We didn't clean up our S3 workspace immediately; this was ticking away at $25/day, which with more vigilance could have been reduced to $10/day by removing intermediate files. We were uncertain how long it would be until we heard back, so we opted to keep the data there "just another week".
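To see how these "small" daily charges accumulate, the sketch below projects the three background costs quoted above over a month, and shows what the S3 cleanup alone would have saved. The 30-day month and the dictionary labels are our assumptions; the daily dollar figures are the ones in the text.

```python
def project(daily_costs_usd, days):
    """Total cost of a set of fixed daily charges over a number of days."""
    return sum(daily_costs_usd.values()) * days


# Daily background charges quoted above (US dollars per day).
background = {
    "RDS block storage": 60,
    "r5.xlarge instance (over-allocated)": 12,
    "S3 workspace": 25,
}

# Same charges if intermediate files had been removed from S3 ($25 -> $10/day).
trimmed = dict(background)
trimmed["S3 workspace"] = 10

days = 30  # assumed month length
current = project(background, days)
saved = current - project(trimmed, days)
print(f"background: ${current:,} per {days} days")      # background: $2,910 per 30 days
print(f"S3 cleanup would save: ${saved:,} per {days} days")  # $450
```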

Total cost: $5,418.88

Dec 2020

Revisions return! But more than we expected. Background costs are $75/day. After some back and forth discussion we attempt to find "more viruses" by doing a targeted assembly/expansion of Quenyaviridae and Dicistroviridae. In a few days we identify ~600 novel viruses in these families but it's not a striking result.

At this point we get the "oh shoot" overview of how much the RDS database is costing us, and we begin to explore alternative solutions to keep the (www.serratus.io) site online and data "explorable".

December 26th we decide to do a search for all viral RNA dependent RNA polymerases and begin to compile that database.

Total: $8,807.95

Jan 2021

After two weeks of R&D to establish a search query of rdrp1 we are satisfied with, we are ready to pull the trigger. We initiate a wonderful collaboration with B. Buchfink who takes a deep dive into the diamond code to optimize it for a "small search query" such as rdrp1. For reference, applying diamond out of the box in our earlier work with the protref5 panproteome (6.6 Mb) was ~$0.03 per library for a translated-nucleotide search. Running the optimized version of diamond with rdrp1 (7.1 Mb) cost only $0.0042 per library, nearly an order of magnitude drop in price! This inspired us to "go big" as now SRA-scale protein search was within our grasp.

Starting late in the evening on January 11th and running Serratus casually for the next 11 days (non-continuous use, at ~80% of its maximum capacity to favour stability over performance), we complete a search of 5,686,715 sequencing libraries (10.2 petabases). The total cost of a full ground-up re-analysis was $23,980 or $0.0042 per library. This value reflects the current state-of-the-art for Serratus and, to the best of our knowledge, for any means of ultra-rapid access to petabases of sequencing data.
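The figures in this section can be checked with a few lines of arithmetic. The sketch below recomputes the per-library cost of the rdrp1 search and the fold reduction relative to the ~$0.03 per library quoted for out-of-the-box diamond. The final projection (what the same search would have cost at the earlier rate) is our own extrapolation, not a figure from the project.

```python
libraries = 5_686_715        # sequencing libraries searched
total_cost_usd = 23_980      # total cost of the rdrp1 search
baseline_per_library = 0.03  # approximate out-of-the-box diamond cost
optimized_per_library = 0.0042

per_library = total_cost_usd / libraries
fold_reduction = baseline_per_library / optimized_per_library

# Extrapolation (assumption): the same search at the earlier per-library rate.
projected_at_baseline = libraries * baseline_per_library

print(f"per library:    ${per_library:.4f}")              # $0.0042
print(f"fold reduction: {fold_reduction:.1f}x")           # 7.1x
print(f"at baseline:    ${projected_at_baseline:,.0f}")   # $170,601
```

A roughly sevenfold drop is what moves an SRA-scale protein search from out of reach to affordable.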

Over the next few days we fire up some on demand instances to analyze the data we generated and near the end of the month run assemblies to quality control the data generated in the rdrp1 search.

Total: $34,350.21

Lessons learned

Effective search cost: ~$74,000

Final total cost: ~$182,000
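The final total can be cross-checked against the monthly totals reported in each section above. The sketch below sums them with `Decimal` to avoid floating-point rounding on currency; the dictionary simply restates the monthly figures from this page.

```python
from decimal import Decimal

# Monthly totals as reported in each section of this page (US dollars).
monthly_totals = {
    "March 2020": "120.59",
    "April 2020": "279.39",
    "May 2020": "6137.94",
    "June 2020": "50187.26",
    "July 2020": "45440.30",
    "August 2020": "9613.57",
    "September 2020": "21166.21",
    "Oct/Nov 2020": "5418.88",
    "Dec 2020": "8807.95",
    "Jan 2021": "34350.21",
}

total = sum((Decimal(v) for v in monthly_totals.values()), Decimal("0"))
largest = max(monthly_totals, key=lambda month: Decimal(monthly_totals[month]))

print(f"total: ${total:,}")           # total: $181,522.30
print(f"largest month: {largest}")    # largest month: June 2020
```

The sum agrees with the rounded final total of ~$182,000.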

With great power comes great responsibility. Commercial cloud computing services offer great power to accelerate research to a pace hard to imagine just a year ago, but it is easy to allow small oversights or inefficiencies to escalate to a significant cost for limited research grant dollars. We are fortunate we could play fast and loose with these optimizations, allowing us to focus on our primary objective, Serratus and the ultra-fast discovery of novel coronaviruses. We offer some recommendations for best practices.

  • Use fully containerized workflows that can be deployed on any system reproducibly. This will allow the vast bulk of development to be performed on premises (i.e. your laptop).
  • Establish unit-tests (such as always using the same 1 million metagenome reads for alignment) for debugging and benchmarking optimizations. You can then debug the deployed system using minimal cloud resources.
  • Perform iterative "scale-ups". When deploying the system at scale, do not immediately jump from a test dataset of 10 libraries to 1,000,000 libraries. There are often unexpected hiccups along the way that will require debugging. We would usually scale up by a factor of 2-5 once we were confident the system was operating nominally, which would give us space to resolve issues before scaling up again.
  • Be vigilant with keeping track of "small" costs. This includes ensuring users are aware of how to allocate resources efficiently for the task at hand, and do not leave instances operational and not in use over a weekend.
  • Establish a "checkpoint" to review costs on a calendar basis and account for where resources are being spent. It may be an unexpected place such as Cloudwatch logs that incur 30% of your total costs. Identify these quickly and trim aggressively.
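The iterative scale-up recommendation above can be sketched as a simple schedule generator: start from a small test set and multiply by a fixed factor until the target is reached. The function name `scale_up_schedule` and the choice of a factor of 5 are ours for illustration; in practice each step would only be taken once the previous one ran cleanly.

```python
def scale_up_schedule(start, target, factor):
    """Batch sizes growing geometrically from start, ending exactly at target."""
    if start <= 0 or target < start or factor <= 1:
        raise ValueError("need 0 < start <= target and factor > 1")
    sizes = []
    size = start
    while size < target:
        sizes.append(size)
        size *= factor
    sizes.append(target)  # final run covers the full target
    return sizes


schedule = scale_up_schedule(start=10, target=1_000_000, factor=5)
print(schedule)
# [10, 50, 250, 1250, 6250, 31250, 156250, 781250, 1000000]
```

Nine runs instead of one may look slow, but the early runs are cheap, and a bug found at 250 libraries costs far less than the same bug found at 1,000,000.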

Have fun. It's exciting to be able to do cool science, and there are a lot of small details to learn in using any cloud service. Keep an eye on always improving code performance and track where inefficiencies are sneaking up; these are cumulative effects. The faster and more optimal your pipeline, the more you can deploy it.