Finding transfer recombination events in the Spike protein - ababaian/serratus GitHub Wiki

Naive approach

The naive approach is to look for discrepancies between trees of two genes, I'll call them Spike and RdRp. Roughly 1/3 of our assemblies have RdRp, so I'm guessing same applies to Spike in which case 1/3 will have Spike and 1/6 will have both. One sixth of 11k assemblies = 2k assembles. Thus, the naive task is to make two trees with 2k sequences and look for discrepancies. I see a few problems with this approach.

The MSA will be large and highly diverged because we have assmblies across all genera in Cov. I doubt we will get a good alignment of Spike for this reason.
There will be many sequence errors in both Spike and RdRp sequences due to sequencing and assembly, especially when coverage is ~1 so sequencing and PCR errors propogate to the predicted gene. Even if the nt sequence is correct, the amino acid sequences may have errors due to incorrect HMM alignments and hence incorrect translations. Sequence errors will disrupt the tree branching order and induce false positive transfer events.
A tree with 2k leaves is quite hard to view and manipulate. Not impossibly large, but inconvenient.
How to define and detect "discrepancies" between a pair of trees which imply transfer events? One complication is that the trees will have many branching order errors even with correct sequences. Is there a good existing tool that can do this? If not, we will have to write some code to compare trees.

Alternative approach

I prefer clustering to trees because sequence errors disrupt at most two clusters (the one they should be in and the one they actually end up in), while they can disrupt many branches in a tree because with N sequences, N-1 of the distances are wrong, and this can cause arbitrarily many branching order errors (wrong topology, in the jargon).

A clustering approach can be implemented as follows. Cluster the RdRp genes at some threshold, say 99%. For each RdRp cluster (call it R), make an MSA of the Spike protein and calculate all-vs-all distances for the Spikes associated with R. In each R, the Spikes will usually be closely related because the RdRdps in R are closely related. This means that the Spike MSA will be much better than an MSA over all spikes, and we can use a nt instead of a.a. alignment which eliminates the problem with HMM alignment errors. I'm not sure how to get the nt sequences; translated blast of contigs against protein database of known genes should work, it's a relatively small compute.

The idea is to identify outliers -- if one spike has is an outlier relative to the rest of the cluster, then it is a candidate for further investigation to check if it is a sequence error or looks like a genuine event.

I would guess that a large majority of candidates will be false positive, but I think this method will be easier and more effective for generating candidates.