Hierarchical OTU classification of RdRPs - ababaian/serratus GitHub Wiki

Clustering not tree

Naively, a tree is the best approach for constructing a sequence-based classification system, and this is what referees will expect, but I think a pure clustering approach is better.

Problems with trees

RdRP is not alignable across all species because of re-arrangements, so there is the simple practical issue that we cannot build an MSA for all species. We could try to work around this by splitting the sequences into globally alignable subgroups, but that may be error-prone: it is not clear that we can reliably detect all known re-arrangements, and there may be re-arrangements that are still undetected. Re-arrangements occur between closely related species (same family), so the tree does not divide cleanly into globally alignable subsets. Any tree is therefore likely to be unreliable, and it is reproducible only by running the same software with the same command line. Many branches will have low support, which will show up as quite different trees when other, equally good choices of software and command line are used.

Clustering is better

Clustering avoids a requirement that the entire set is globally alignable. Closely related sequences will fall into the same cluster regardless of re-arrangements. A set of clusters generated by UCLUST, complete-linkage or single-linkage clustering can be checked independently by checking pair-wise distances; pair-wise alignments from other programs will mostly confirm linkage clustering, or confirm the UCLUST criteria are satisfied with a few marginal disagreements. There are no analogous methods for confirming branches in a tree or assignments to average-linkage clusters.
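The independent check described above can be sketched in a few lines. This is a minimal illustration, not the project's tooling: it assumes pair-wise identities (as fractions between 0 and 1) have already been computed by some aligner and stored in a dict, and the function names, sequence names and numbers are all hypothetical. The centroid check encodes the two UCLUST criteria: every member is at or above the threshold to its centroid, and every pair of centroids is below it.

```python
from itertools import combinations

def pair_identity(table, a, b):
    """Look up a symmetric pair-wise identity (0..1) keyed by the sorted name pair."""
    return table[tuple(sorted((a, b)))]

def centroid_violations(clusters, table, threshold):
    """Check centroid-based clusters given as {centroid: [other members]}.

    Returns (members_too_far, centroids_too_close):
    members below the threshold to their own centroid, and
    centroid pairs at or above the threshold to each other.
    """
    members_too_far = [
        (c, m)
        for c, members in clusters.items()
        for m in members
        if pair_identity(table, c, m) < threshold
    ]
    centroids_too_close = [
        (c1, c2)
        for c1, c2 in combinations(sorted(clusters), 2)
        if pair_identity(table, c1, c2) >= threshold
    ]
    return members_too_far, centroids_too_close

def complete_linkage_violations(clusters, table, threshold):
    """Check complete-linkage clusters given as lists of names.

    Returns every within-cluster pair whose identity is below the threshold.
    """
    return [
        (a, b)
        for members in clusters
        for a, b in combinations(members, 2)
        if pair_identity(table, a, b) < threshold
    ]

# Hypothetical identities for three sequences, e.g. from a second aligner.
table = {("a", "b"): 0.95, ("a", "c"): 0.40, ("b", "c"): 0.45}

print(centroid_violations({"a": ["b"], "c": []}, table, 0.90))
print(complete_linkage_violations([["a", "b"], ["c"]], table, 0.90))
```

An empty result means the second set of alignments agrees with the clustering; any pairs returned are the marginal disagreements to inspect by hand.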

Problem with single-linkage

With single-linkage clustering, adding a new sequence can merge two existing clusters. With UCLUST, this cannot happen: a new sequence either falls into an existing OTU or founds a new one. The OTUs need an update procedure so that the classification can expand as new viruses are discovered. UCLUST makes this easy, while most other clustering methods, single-linkage in particular, do not. We should therefore use UCLUST for clustering.
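The update property can be illustrated with a toy version of greedy centroid-based assignment. This is a simplified sketch, not UCLUST itself: the `identity` function here is a stand-in that counts matching positions in equal-length strings, whereas real identities come from alignments, and `add_sequence` is a hypothetical name.

```python
def identity(a, b):
    """Toy identity: fraction of matching positions (stand-in for an alignment)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def add_sequence(centroids, seq, threshold):
    """Assign seq to the closest centroid at or above threshold, else found a new cluster.

    centroids is modified in place only when a new cluster is founded.
    Returns the index of the cluster the sequence ends up in.
    """
    best_idx, best_id = None, -1.0
    for i, c in enumerate(centroids):
        ident = identity(seq, c)
        if ident >= threshold and ident > best_id:
            best_idx, best_id = i, ident
    if best_idx is None:
        centroids.append(seq)
        return len(centroids) - 1
    return best_idx

centroids = []
first = add_sequence(centroids, "AAAAAAAAAA", 0.90)   # founds cluster 0
second = add_sequence(centroids, "AAAAAAAAAC", 0.90)  # 90% to cluster 0, joins it
third = add_sequence(centroids, "CCCCCCCCCC", 0.90)   # founds cluster 1
# A sequence half-way between the two centroids founds its own cluster;
# the existing clusters are left untouched.
bridge = add_sequence(centroids, "AAAAACCCCC", 0.90)
print(first, second, third, bridge, len(centroids))
```

Because existing centroids are never changed or removed, an update can only add members or add clusters, which is what makes identifiers stable across releases.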

Clustering criteria and thresholds

Based on our experience so far, I think we should use UCLUST with identity thresholds of 50% (family-like), 75% (genus-like) and 90% (species-like).
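How the three levels are nested is not spelled out here. One possible construction, offered purely as an assumption, is bottom-up: cluster all sequences at 90%, then cluster the species-like centroids at 75%, then cluster the genus-like centroids at 50%. The sketch below uses a toy identity function and a toy greedy clusterer in place of UCLUST; all names and sequences are hypothetical.

```python
# Thresholds from the text; everything else in this sketch is an assumption.
THRESHOLDS = {"family": 0.50, "genus": 0.75, "species": 0.90}

def identity(a, b):
    """Toy identity: fraction of matching positions (stand-in for an alignment)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def greedy_cluster(seqs, threshold):
    """Greedy centroid clustering in input order.

    Returns (centroids, assignment), where assignment[seq] is its centroid.
    """
    centroids, assignment = [], {}
    for s in seqs:
        hits = [c for c in centroids if identity(s, c) >= threshold]
        if hits:
            assignment[s] = max(hits, key=lambda c: identity(s, c))
        else:
            centroids.append(s)
            assignment[s] = s
    return centroids, assignment

def three_tier(seqs):
    """Map each sequence to its (family, genus, species) centroids, built bottom-up."""
    sp_centroids, to_species = greedy_cluster(seqs, THRESHOLDS["species"])
    ge_centroids, to_genus = greedy_cluster(sp_centroids, THRESHOLDS["genus"])
    _, to_family = greedy_cluster(ge_centroids, THRESHOLDS["family"])
    result = {}
    for s in seqs:
        sp = to_species[s]
        ge = to_genus[sp]
        result[s] = (to_family[ge], ge, sp)
    return result

s1 = "A" * 20
s2 = "A" * 19 + "C"        # 95% to s1: same species-like cluster
s3 = "A" * 16 + "C" * 4    # 80% to s1: same genus-like, new species-like
s4 = "A" * 12 + "G" * 8    # 60% to s1: same family-like, new genus-like
s5 = "T" * 20              # unrelated: new family-like cluster

tiers = three_tier([s1, s2, s3, s4, s5])
for seq, levels in tiers.items():
    print(seq, levels)
```

Clustering centroids rather than all members guarantees strict nesting, at the cost that a member may fall slightly below the 75% or 50% threshold to its higher-level centroid. Whether that trade-off is acceptable is a design decision this sketch does not settle.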

No monkey business

It is natural to consider adjusting the OTUs to improve agreement with taxonomy, but I think this is a bad idea. We should implement a fully automated system because manual intervention is not reproducible and gets into a gray area of trying to call taxonomy from sequence, which would be asking for trouble -- we should make clear this is a purely sequence-based method which is not assigning taxonomy.

Cluster identifiers

Each cluster will be named Fnnn.Gnnn.Snnn, where each nnn is an integer for the family-like, genus-like and species-like level respectively. Gnnn and Snnn values will be globally unique, rather than unique only within their parent cluster, so that we can talk about species-like cluster S123 without having to give its F and G numbers.
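The naming scheme can be sketched as follows. The input format, the function name, numbering from 1 and the absence of zero-padding are all assumptions made for illustration; only the Fnnn.Gnnn.Snnn pattern and the global uniqueness of G and S numbers come from the text.

```python
def assign_identifiers(hierarchy):
    """Turn nested cluster keys into Fnnn.Gnnn.Snnn labels.

    hierarchy maps a sequence name to (family_key, genus_key, species_key).
    G and S numbers come from single global counters, so each one
    identifies its cluster without reference to the parent.
    """
    f_ids, g_ids, s_ids = {}, {}, {}
    labels = {}
    for seq, (fam, gen, sp) in hierarchy.items():
        f = f_ids.setdefault(fam, len(f_ids) + 1)
        g = g_ids.setdefault((fam, gen), len(g_ids) + 1)
        s = s_ids.setdefault((fam, gen, sp), len(s_ids) + 1)
        labels[seq] = f"F{f}.G{g}.S{s}"
    return labels

# Hypothetical nested clusters for four sequences.
hierarchy = {
    "seq1": ("fa", "ga", "sa"),
    "seq2": ("fa", "ga", "sb"),
    "seq3": ("fa", "gb", "sc"),
    "seq4": ("fb", "gc", "sd"),
}
labels = assign_identifiers(hierarchy)
for seq, label in labels.items():
    print(seq, label)
```

Note that the genus-like cluster under the second family is numbered G3 rather than G1, so "G3" or "S4" can be cited on its own.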

Connection to taxonomy is provided by files specifying which named taxa fall into each cluster. Taxon names are not included in the cluster identifiers.

Conclusions

Keep it simple and transparently not an attempt to make a phylogeny. Use UCLUST at 50%, 75% and 90% identity thresholds to define a three-tier hierarchy. OTU identifiers are three integers Fnnn.Gnnn.Snnn.