Pangenome Design
Serratus is designed as a "Database vs. Database" type of search. It is possible to run a single-sequence BLAST-like query, but this is neither efficient nor sensitive.
We broadly refer to our query databases as "pangenome" or "panproteome".
Search type
Similar to blastn, use nucleotide search if you are interested in sequences similar to your search query (80-100% nucleotide identity).
Similar to blastx, use translated nucleotide search if you are interested in protein sequences diverged from your search query (40-100% amino acid identity).
Input sequences should be masked for simple repeats or low-information sequences. It is advisable to run "pilot" experiments prior to going to scale to ensure there are no sequences prone to false-positive alignments (plasmid sequences, regions of similarity to a common sequence like rRNA).
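As a rough illustration of what masking low-information sequence means, the sketch below replaces any window with low Shannon entropy by N. This is only a minimal stand-in for a dedicated masking tool; the function names, the 20-residue window and the 1.0-bit threshold are all assumptions chosen for the example, not values used by Serratus.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per symbol) of the characters in seq."""
    if not seq:
        return 0.0
    counts = Counter(seq.upper())
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_information(seq: str, window: int = 20,
                         min_entropy: float = 1.0,
                         mask_char: str = "N") -> str:
    """Mask every window whose entropy falls below min_entropy.

    window and min_entropy are illustrative defaults (assumptions),
    and would need tuning for real nucleotide or protein queries.
    """
    masked = list(seq)
    last_start = max(len(seq) - window + 1, 1)
    for start in range(last_start):
        chunk = seq[start:start + window]
        if shannon_entropy(chunk) < min_entropy:
            for i in range(start, min(start + window, len(seq))):
                masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    query = "ACGT" * 5 + "A" * 20 + "ACGT" * 5
    print(mask_low_information(query))
```

A homopolymer run such as the 20 A's in the usage example is masked, while the mixed flanking sequence is mostly left intact. Masking does not replace the pilot run, which is still needed to catch false-positive prone regions such as rRNA-like sequence.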
Clustering
A topic of much debate. Recommended starting place is 90% nucleotide or 90% amino acid identity. Exact thresholds at which to cluster search sequences are chosen by weighing the following factors:
- Final size of the pangenome (and thus search time)
- Are you hoping to classify the retrieved sequences based on the alignment? What are the accepted classification values?
- Tolerance for the Sources-of-Error in downstream applications
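To make the effect of the threshold concrete, here is a minimal sketch of greedy centroid clustering at a chosen identity cut-off. It assumes the sequences are already aligned to equal length so that identity is a simple per-position comparison; real queries would need a proper aligner or a dedicated clustering tool. The names identity and greedy_cluster are hypothetical, not part of Serratus.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two pre-aligned,
    equal-length sequences (an assumption of this sketch)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold=0.90):
    """Assign each sequence to the first centroid it matches at
    >= threshold identity; otherwise it becomes a new centroid."""
    clusters = []  # list of (centroid, members) pairs
    for seq in seqs:
        for centroid, members in clusters:
            if identity(centroid, seq) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters


if __name__ == "__main__":
    seqs = ["A" * 20, "A" * 19 + "T", "C" * 20]
    for cutoff in (0.90, 0.99):
        centroids = [c for c, _ in greedy_cluster(seqs, cutoff)]
        print(cutoff, len(centroids))
```

Raising the cut-off keeps more centroids, which gives a larger pangenome and a longer search; lowering it shrinks the query set at the cost of classification resolution.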
Assessing cost
As a guideline we would recommend budgeting $0.01 per library searched.
The final cost is highly dependent on the search query, so empirical measurement is the only way to assess it. Some guidelines from our experience:
- Applying bowtie2, the cov3ma nucleotide pangenome was 79.8 MB and cost $0.0062 per library.
- Applying DIAMOND out of the box, the protref5 panproteome was 6.6 MB and cost $0.03 per library.
- Applying memory-optimised DIAMOND, the rdrp1 collection was 7.1 MB and cost $0.0042 per library.
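The per-library figures above can be turned into a quick budget estimate. The sketch below is a back-of-the-envelope calculation only: the function names are hypothetical, and the pilot numbers in the usage example are made up for illustration.

```python
# Per-library costs (USD) as reported in the guidelines above.
COST_PER_LIBRARY = {
    "cov3ma (bowtie2)": 0.0062,
    "protref5 (DIAMOND, out of the box)": 0.03,
    "rdrp1 (DIAMOND, memory-optimised)": 0.0042,
}
GUIDELINE_COST = 0.01  # recommended budgeting figure, USD per library


def estimate_budget(n_libraries: int,
                    cost_per_library: float = GUIDELINE_COST) -> float:
    """Total cost in USD for searching n_libraries."""
    return n_libraries * cost_per_library


def scale_from_pilot(pilot_cost_usd: float, pilot_libraries: int,
                     target_libraries: int) -> float:
    """Extrapolate a measured pilot cost linearly to a full run."""
    per_library = pilot_cost_usd / pilot_libraries
    return per_library * target_libraries


if __name__ == "__main__":
    n = 1_000_000
    print(f"guideline: ${estimate_budget(n):,.0f}")
    for name, cost in COST_PER_LIBRARY.items():
        print(f"{name}: ${estimate_budget(n, cost):,.0f}")
    # Hypothetical pilot: 1,000 libraries searched for $5.00 in total.
    print(f"from pilot: ${scale_from_pilot(5.0, 1000, n):,.0f}")
```

Linear extrapolation assumes the pilot libraries are representative of the full set, which may not hold if the query is far more abundant in some sample types than others.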
The dominating factor to consider in pangenome design is the rate at which sequences in the library will have a "seed match" to the query. Bowtie2 and DIAMOND are exceptionally efficient at discarding reads with no seed match, so those sequences do not undergo computationally expensive "extensions".
If your query sequences are abundant in the samples you are searching (say rRNA or ribosomal proteins), then even a small 100 KB query can take orders of magnitude longer than a large collection of vertebrate viruses. In addition, recent improvements to DIAMOND made specifically for Serratus have moved a hash-table of seed sequences from RAM to L3 CPU-cache memory, decreasing Serratus alignment runtime from 305 seconds per million reads to 21 seconds per million reads. To make use of this optimisation it is important to keep the protein search database within the L3 memory capacity of your processor. In practice, for a C5n.xlarge instance on AWS, a 7 MB protein database is near the upper limit of what is tolerated.
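The numbers in this paragraph can be sanity-checked with a few lines of arithmetic. The sketch below is an illustration under stated assumptions: the 7 MB limit is the approximate figure quoted above for a C5n.xlarge rather than a measured cache size, 1 MB is taken as 10^6 bytes, and the helper names are hypothetical.

```python
import os

APPROX_LIMIT_MB = 7.0       # approximate ceiling quoted in the text (assumption)
SECONDS_PER_M_BEFORE = 305  # seconds per million reads, before the optimisation
SECONDS_PER_M_AFTER = 21    # seconds per million reads, after the optimisation


def file_size_mb(path: str) -> float:
    """Size of a database file in MB (taking 1 MB = 1e6 bytes)."""
    return os.path.getsize(path) / 1e6


def cache_pressure(db_size_mb: float,
                   limit_mb: float = APPROX_LIMIT_MB) -> float:
    """Ratio of database size to the approximate limit.

    Values near or above 1.0 suggest the database may no longer
    benefit from the cache optimisation.
    """
    return db_size_mb / limit_mb


def runtime_seconds(n_reads: int, seconds_per_million: float) -> float:
    """Alignment runtime for n_reads at a given per-million-read rate."""
    return n_reads / 1e6 * seconds_per_million


if __name__ == "__main__":
    speedup = SECONDS_PER_M_BEFORE / SECONDS_PER_M_AFTER
    print(f"speed-up: {speedup:.1f}x")
    print(f"rdrp1 pressure: {cache_pressure(7.1):.2f}")
    reads = 10_000_000
    print(runtime_seconds(reads, SECONDS_PER_M_BEFORE),
          runtime_seconds(reads, SECONDS_PER_M_AFTER))
```

The ratio is reported as a continuous value rather than a pass/fail flag because the text gives only an approximate limit, and the 7.1 MB rdrp1 collection already sits right at it.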
Future cost
Will Serratus be cheaper if I wait a year?
A year and a half ago, a naive implementation of cloud-based computing to analyse only 10,708 transcriptomes cost $0.40 per library (ref). It is difficult to predict what costs will be in a year, as the technology is improving rapidly.
If we are allowed to speculate, it is not unreasonable to expect search costs to reach sub-$0.001 per library by 2022 and thus keep pace with the growing SRA.
If your scientific question is sufficiently interesting, the most important cost to consider is whether you are paying a larger "Opportunity Cost" by waiting.
If you're on a tight budget (we've all been there), we highly recommend checking out SearchSRA. This is a similar service that is free for academics, and you can align 100,000+ libraries with that platform.