Project plan for big RdRp search - ababaian/serratus GitHub Wiki

Issues to be addressed

(1) Construct biggest possible RdRp query with known sequences ("rdrp-big") for SRA search (i.e., the reference database for diamond). There are several sub-tasks here; primariy collecting as many known RdRps as possible, then QC on the hits to make a high-quality query. The query should NOT be trimmed as described below; it should include as much of the ORF as possible to maximize sensitivity. For example, diamond may pick up a short fragment which may be completed by assembling the full dataset, and trimming may lose reads which align only partially due to the trimming boundary. Maximizing specificity is the job of downstream processing, not the initial search.

(2) Binary classifier for an SRA dataset: input is diamond tsv file, output is true/false prediction that dataset contains an assemblable novel virus with confidence metric (score or probability). This classifier must be validated, with a ROC curve for the paper.

(3) Procedure for assembling RdRp sequence(s) in SRA datasets classified as positives by classifier. It may be possible to assemble the diamond hits without going back to the full dataset; this is TBD.

(4) Binary classifier for assembled RdRp sequences: true/false contain RdRp gene, with confidence metric. This classifier must also be validated with a ROC curve for the paper.

(5) Procedure for trimming RdRp sequences to the same boundaries. This is necessary because RdRp often appears in a polyprotein, so capturing the full-length ORF may include non-RdRp genes. Also, even if we knew the boundaries, the gene structure is quite variable and is not globally alignable across all Riboviruses. This can be addressed by trimming RdRp to boundaries defined by domains which are well-conserved. This procedure must also be validated, otherwise putative novel RdRps may be known viruses where we failed to identify the RdRp and are reporting a fragment of some other gene in the same ORF, or a different region of RdRp.

(6) Optimize diamond for RdRp search on EC2 (nice but optional, since the container works well enough already).

Step (1) must be completed before firing off a big search.

Steps (2) and (3) must be completed before starting large-scale assembly.

Trimming RdRp is tricky because there is no robust method for finding domain boundaries, as opposed to conserved regions within a domain. Developing our own is too hard. We will address this by leveraging the "Wolf18" RdRp dataset from this paper: https://doi.org/10.1128/mbio.02329-18. These guys did a pretty comprehensive RdRp search and trimmed the results using psi-blast. We will use their trimmed sequences as an operational definition of the region we want to analyze in depth. Wolf18 is posted on S3 here: s3://serratus-public/notebook/201226_rdrp0/rev1/rdrp0_r1.fa. (AB -- can we harmonize terminology? this is complicated enough as it is without Wolf18=rdrp0_r1; I like Wolf18 better).

For trimming, we will align to Wolf18 and discard the overhangs. This gets difficult at low identities, so we must validate this step also. Providing we don't pick up a totally different region of the ORF, we can tolerate quite large errors in trimming.

For validation of RdRp identification and trimming, we should do hold-out cross-validation using high-quality sequences, i.e. RefSeqs. To assess how accuracy varies with identity, we will use cross-validation by identity (CVI, https://www.drive5.com/usearch/manual/cvi.html) which constructs test/training sets where the maximum identity (MaxID) of the top test-training hit is fixed to a known value or range of values. Here, I suggest five test/training pairs with MaxID = 90, 75, 50, 30, 15% to get a representative range.

The gold standard reference for the gene identification and trimming benchmark (GITB-gold) will be full-length RdRp-containing ORFs from RefSeqs plus coordinates of the trimmed RdRps. These coordinates will be obtained by aligning the ORFs to Wolf18. These alignments should be good (diamond may be good enough) because the top-hit identities should be at least 90%.

The decoy reference (GITB-decoy) will be all RefSeq ORFs which do NOT contain RdRp. Ideally, these will all be discarded by the protocol. In practice, there will be FPs due to local homology between RdRp and other genes; this surely arises often in viruses due to recombination events. Using the decoy reference will enable us to design better filters and measure the FP rate (ROC curve) of the protocol.

For the alignment step of the trimming protocol, we need a semi-global aligner because we want to enforce that all of the trimming reference sequence is aligned to a local segment of the query sequence. Overhangs in the query should not be penalized, while overhangs in the reference sequence should be penalized; this is the definition of semi-global. So far, we've been using PathRacer, but this off-label use of PR which is not ideally suited to this task and we're looking into alternatives.

Task assignments

AB: Construct GITB-gold and GITB-decoy sequences. Optional: generate coordinates by aligning to Wolf18, RE happy to do this. (?mostly/all done already).

AB: Construct rdrp-big.

RE: Implement (a) GITB, (b) gene identificaion method (probably just diamond E-value), and (c) trimming with a semi-global aligner. This is going to take a few days.

RE: Binary classifier for SRA datasets.

RE: Binary classifier for RdRp gene identification.

AK: Implement and test "micro-assembly", i.e. assembly by converting diamond tsv to FASTQ.

BB: Optimize diamond for RdRp search.