pangenome_design - ababaian/serratus GitHub Wiki

Pangenome Design

Serratus is designed as a "Database vs. Database" type of search. It is possible to run a single-sequence BLAST-like query, but such a search is neither efficient nor sensitive.

We broadly refer to our query databases as "pangenome" or "panproteome".

Search type

Similar to blastn, use nucleotide search if you are interested in sequences similar to your search query (80-100% nucleotide identity).

Similar to blastx, use translated nucleotide search if you are interested in protein sequences diverged from your search query (40-100% amino acid identity).

Input sequences should be masked for simple repeats or low-information sequences. It is advisable to run "pilot" experiments prior to going to scale to ensure there are no sequences prone to false-positive alignments (plasmid sequences, regions of similarity to a common sequence like rRNA).

Clustering

Clustering thresholds are a topic of much debate. A recommended starting point is 90% nucleotide or 90% amino acid identity. The exact thresholds at which to cluster search sequences are chosen by weighing the following factors.

Final size of the pangenome (and thus search time)
Are you hoping to classify the retrieved sequences based on the alignment? What are the accepted classification values?
Tolerance for the Sources-of-Error in downstream applications

Assessing cost

As a guideline we would recommend budgeting $0.01 per library searched.

The final cost is highly dependent on the search query; empirical measurement is the only way to assess it. Some guidelines from our experience:

Applying bowtie2, the cov3ma nucleotide pangenome was 79.8 MB and cost $0.0062 per library.
Applying DIAMOND out of the box, the protref5 panproteome was 6.6 MB and $0.03 per library.
Applying memory-optimised DIAMOND, the rdrp1 collection was 7.1 MB and $0.0042 per library.
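As a rough budgeting aid, the per-library figures above can be scaled linearly to the number of libraries you plan to search. The sketch below is only an illustration: the function name `estimate_cost` and the library count of 100,000 are hypothetical, and the linear model is an assumption, since real cost depends on the search query and should be measured in a pilot run.

```python
# Per-library costs in USD, copied from the figures quoted above.
COST_PER_LIBRARY = {
    "guideline": 0.01,
    "cov3ma (bowtie2)": 0.0062,
    "protref5 (DIAMOND, out of the box)": 0.03,
    "rdrp1 (DIAMOND, memory-optimised)": 0.0042,
}


def estimate_cost(n_libraries: int, cost_per_library: float) -> float:
    """Linear estimate (an assumption): libraries searched x cost per library."""
    if n_libraries < 0 or cost_per_library < 0:
        raise ValueError("inputs must be non-negative")
    return n_libraries * cost_per_library


if __name__ == "__main__":
    n = 100_000  # hypothetical number of libraries to search
    for name, rate in COST_PER_LIBRARY.items():
        print(f"{name}: ${estimate_cost(n, rate):,.2f}")
```

Treat the result as an order-of-magnitude budget, not a quote.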

The dominating factor to consider in pangenome design is the rate at which sequences in the library will have a "seed match" to the query. Bowtie2 and DIAMOND are exceptionally efficient at discarding reads with no seed match, so those reads never undergo the computationally expensive "extension" step.

If your query sequences are abundant in the samples you are searching (say rRNA or ribosomal proteins), then even a small 100 KB query can take orders of magnitude longer than a large collection of vertebrate viruses. In addition, recent improvements to DIAMOND made specifically for Serratus have moved a hash table of seed sequences from RAM to L3 CPU-cache memory, decreasing Serratus alignment runtime from 305 seconds per million reads to 21 seconds per million reads. To make use of this optimisation, it is important to keep the protein search database within the L3 cache capacity of your processor. In practice, for a C5n.xlarge instance on AWS, a 7 MB protein database is near the upper limit of what is tolerated.
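To put the runtime figures in perspective, the arithmetic can be sketched as below. The two rates are taken from the text; the helper names `alignment_seconds` and `fits_cache_budget`, the read count, and the budget passed to the check are hypothetical. The usable budget depends on your processor and should be confirmed empirically.

```python
# Alignment rates in seconds per million reads, as quoted above.
SEC_PER_MILLION_RAM = 305.0  # seed hash table held in RAM
SEC_PER_MILLION_L3 = 21.0    # seed hash table held in L3 cache


def alignment_seconds(n_reads: int, sec_per_million: float) -> float:
    """Estimated alignment time, assuming runtime scales linearly with reads."""
    return n_reads / 1_000_000 * sec_per_million


def fits_cache_budget(db_size_mb: float, budget_mb: float) -> bool:
    """True if the database size is within a user-supplied size budget (MB)."""
    return db_size_mb <= budget_mb


if __name__ == "__main__":
    speedup = SEC_PER_MILLION_RAM / SEC_PER_MILLION_L3
    print(f"speed-up: {speedup:.1f}x")

    reads = 10_000_000  # hypothetical library size
    print(f"RAM: {alignment_seconds(reads, SEC_PER_MILLION_RAM):.0f} s")
    print(f"L3:  {alignment_seconds(reads, SEC_PER_MILLION_L3):.0f} s")

    # The 7.0 MB budget is illustrative; the text calls ~7 MB "near the upper limit".
    print(fits_cache_budget(6.6, budget_mb=7.0))
```

The budget is deliberately a parameter and not a constant, because the text gives only an approximate limit for one instance type.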

Future cost

Will Serratus be cheaper if I wait a year?

A year and a half ago, a naive implementation of cloud-based computing to analyse only 10,708 transcriptomes cost $0.40 per library (ref). It is difficult to predict what costs will be in a year as the technology is rapidly improving.

If we are allowed to speculate, it is not unreasonable to expect search costs to fall below $0.001 per library by 2022 and thus keep pace with the growing SRA.

If your scientific question is sufficiently interesting, the most important cost to consider is whether you are paying a larger "Opportunity Cost" by waiting.

If you're on a tight budget (we've all been there), we highly recommend checking out SearchSRA. It is a similar service that is free for academics, and you can align 100,000+ libraries with that platform.