Taxonomy case study of misannotated Genbank entry - ababaian/serratus GitHub Wiki
Trees for EU420137, which is mis-classified as Minunacovirus in Genbank.
Minunacovirus is highlighted in green. Reference sequences are labeled GgSsss.dom, where Gg is the first two letters of the genus (Al=Alphacoronavirus, Be=Betacoronavirus, etc.), Ssss is the first four letters of the sub-genus (e.g. Minu=Minunacovirus) and dom is the domain (e.g. pol for polymerase).
Notice that all sub-genera are monophyletic in all trees. In the pol tree, which is the only tree to include more than one genus, Alpha- and Betacoronavirus are each monophyletic. These are strong validations that the procedure is consistent and robust.
Minunacovirus does not appear in the pol tree. In the other three trees, EU420137 joins the tree above the reference Minunacovirus sequences. These trees imply that EU420137 belongs to a novel sub-genus in the Alphacoronavirus genus.
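The label scheme and the monophyly check described above can be sketched in a few lines of Python. This is only an illustration, not the code used to build the trees: the capitalisation pattern in the regular expression, the nested-tuple tree representation, the "/n" suffix used to keep toy leaf names unique, and the toy topology itself are all assumptions, and the sub-genus prefixes other than Minu are illustrative.

```python
import re
from typing import NamedTuple


class RefLabel(NamedTuple):
    genus: str     # two-letter genus prefix, e.g. "Al"
    subgenus: str  # four-letter sub-genus prefix, e.g. "Minu"
    domain: str    # domain name, e.g. "pol"


# Assumed capitalisation: "Gg" + "Ssss" + "." + domain.
LABEL_RE = re.compile(r"([A-Z][a-z])([A-Z][a-z]{3})\.(\w+)")


def parse_label(label):
    """Split a GgSsss.dom label into its three parts."""
    m = LABEL_RE.fullmatch(label)
    if m is None:
        raise ValueError(f"not a GgSsss.dom label: {label!r}")
    return RefLabel(*m.groups())


def leaves(node):
    """Return the set of leaf names under a nested-tuple tree node."""
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(child) for child in node))


def is_monophyletic(tree, group):
    """True if some clade of the rooted tree contains exactly `group`."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


def subgenus_of(leaf):
    # Hypothetical "/n" suffix keeps toy leaf names unique.
    return parse_label(leaf.split("/")[0]).subgenus


# Toy rooted tree (made-up topology, not one of the trees on this page).
toy = ((("AlPeda.pol/1", "AlPeda.pol/2"), ("AlTega.pol/1", "AlTega.pol/2")),
       ("BeSarb.pol/1", "BeSarb.pol/2"))

for sub in sorted({subgenus_of(x) for x in leaves(toy)}):
    members = {x for x in leaves(toy) if subgenus_of(x) == sub}
    print(sub, is_monophyletic(toy, members))
```

The same `is_monophyletic` call works at genus rank by grouping leaves on `parse_label(...).genus` instead of the sub-genus prefix.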
This is a good illustration of the approach I'm trying to implement. The alignments and trees are generated automatically, so this procedure will work for any full-length genome, fragment or contig. The final step is to automate classification from the trees; I'm working on that now.
It is difficult to align proteins across all of Cov, even pol. To mitigate this problem, the method selects a subset of the reference sequences, with the goal of including the closest sequences plus some outgroup, going no further than necessary to capture a good outgroup.
In this case Minu fell out of the pol selection because Minu pols are not notably closer to EU420137 than the pols of the other sub-genera. They would not fall out if EU420137 really were a Minu.
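One possible reading of this selection rule is sketched below in Python. It is a guess at the general idea rather than the actual procedure: the function name `select_subset`, the thresholds `min_close` and `min_outgroup`, the definition of "outgroup" as any reference outside the nearest reference's group, and the toy distances are all assumptions made for illustration.

```python
def select_subset(refs, min_close=3, min_outgroup=2):
    """Pick references nearest-first, stopping once an outgroup is covered.

    refs: iterable of (ref_id, group, distance_to_query) tuples.
    The "outgroup" here is any reference whose group differs from the
    group of the single nearest reference (an assumed definition).
    """
    ranked = sorted(refs, key=lambda r: r[2])
    if not ranked:
        return []
    nearest_group = ranked[0][1]
    chosen, n_out = [], 0
    for ref in ranked:
        chosen.append(ref)
        if ref[1] != nearest_group:
            n_out += 1
        if len(chosen) >= min_close and n_out >= min_outgroup:
            break  # go no further than necessary
    return chosen


# Toy data: hypothetical groups A, B, C with made-up distances to a query.
toy_refs = [
    ("r1", "A", 0.30), ("r2", "A", 0.32),
    ("r3", "B", 0.35), ("r4", "C", 0.50), ("r5", "B", 0.36),
]
picked = select_subset(toy_refs)
print([r[0] for r in picked])  # ['r1', 'r2', 'r3', 'r5']
```

In the toy data, group C drops out because enough outgroup is found before its only member is reached, which is the same kind of effect as Minu falling out of the pol selection.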
Pol is the default choice of gene for doing taxonomy, but using multiple genes is better. Pol is good for classifying at genus rank, but less good at lower ranks because it diverges too slowly.
Using multiple genes gives us a check (a) that the method is robust and (b) that we don't have a chimera due to PCR artifacts or horizontal transfer.
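A minimal Python sketch of such a cross-gene check follows. It assumes each tree has already been reduced to a single taxon call per gene, which is the classification step that is not automated yet, and it uses a simple majority as the consensus. The function name `compare_gene_calls` and the toy gene and taxon names are hypothetical.

```python
from collections import Counter


def compare_gene_calls(calls):
    """Return (consensus taxon, genes whose call disagrees with it).

    calls: dict mapping gene/domain name -> taxon suggested by that tree.
    Consensus is a plain majority vote (an assumption); ties go to the
    taxon seen first.
    """
    if not calls:
        raise ValueError("no per-gene calls given")
    counts = Counter(calls.values())
    consensus, _ = counts.most_common(1)[0]
    discordant = sorted(g for g, t in calls.items() if t != consensus)
    return consensus, discordant


# Toy input: three hypothetical genes, one of which disagrees.
consensus, discordant = compare_gene_calls(
    {"gene1": "X", "gene2": "X", "gene3": "Y"}
)
print(consensus, discordant)  # X ['gene3']
```

An empty `discordant` list supports point (a). A non-empty one is a prompt to look at the sequence for a possible chimera, point (b), and is not proof of one.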