Background - zwickl/terraphy GitHub Wiki

Table of Contents

Background

Some definitions

In large multi-locus phylogenetic data sets, we can distinguish between phylogenomic studies, which generally have many more loci than taxa and phylodiversity studies, which tend to have many more taxa than loci. A hallmark of both kinds of studies is missing data—sequences not sampled for some taxa—often as much as 50-90% of the data. Let the taxon coverage matrix be a presence-absence matrix of taxa by loci, and its density be the fraction of the matrix entries present. Low density can occur for many reasons. Genes may be missing in a phylogenomic matrix of ESTs because of differential levels of gene expression, but missing in a phylodiversity study mined from GenBank because of sample biases in data collection and submission.

Partial taxon coverage, decisiveness and terraces

Although many factors can interact with patterns of missing data to affect tree reconstruction, Terraphy focuses on issues related to mathematical patterns in taxon coverage, which have several provable and general properties (Steel and Sanderson 2010; Sanderson et al. 2010, 2011; Steel 2016). Consider the figure below, a data set of two loci for 4 taxa and an outgroup. For any parent tree, T, (say, the best tree found in a RAxML search), partial taxon coverage induces a subtree of T for each locus, T1, and T2, with a taxon missing from each. A taxon coverage matrix is decisive if there is only one tree consistent with T1, and T2. Here there are two additional trees consistent with them: the matrix is not decisive. The optimality score of these trees can be identical if the score can be decomposed (as here) into a likelihood associated with each locus, in which case we call the collection a terrace. This is always true for the parsimony score, but is also true for the likelihood score if the models describing evolution at each locus have "unlinked" edge lengths. An unlinked (cf. "partitioned", or "fully partitioned") model has different sets of parameters for each locus. Terraces are reminiscent of tree islands, but every terrace is wholly contained within a tree island (Sanderson et al. 2011). Tree space can be thought of as a collection of terraces, the size and topological position of which are determined entirely by the taxon coverage matrix. Only the heights of the terraces depend on the data.

With respect to their optimality scores, trees on a terrace are like "exchangeable particles", adding considerable ambiguity and computational expense to phylogeny inference. However, we have identified several properties of terraces which make them tractable (Sanderson et al. 2011): trees on a terrace can be enumerated efficiently without further heuristic search; they can be summarized by consensus trees which can be constructed in polynomial time; they all belong to the same tree island; and testing if two trees are on the same terrace can be done efficiently without further heuristic search.

Conditions leading to terraces

Terraces can occur whenever phylogenetic trees are constructed from multiple blocks of characters by optimizing a score associated with each block and a candidate tree, T. A block, B, might be just a single character, or it could be all the characters for an entire locus, etc. Suppose there are N taxa for tree T. Two properties together are sufficient for terraces to occur (Steel 2016):

1. For a given block of characters, suppose data are missing for the entire block for exactly n taxa. Let the tree T* be the tree T with these n taxa removed. Then the score, S(B,T) must be the same as S(B,T*).

2. The total score across all m blocks of characters, B_i, must be the sum of individual scores for each block (i.e., linear), S(B,T)=S(B_1,T)+S(B_2,T)+...+S(B_m,T).

For maximum parsimony, condition 1 and 2 are true even if blocks are just individual characters. For maximum likelihood, much depends on the model of evolution and how it is set up within and between blocks. For example, if the model includes a rate parameter that is the same for all blocks (e.g., all genes evolve at the same rate), the second condition does not hold, because data in any block can affect the estimation of that rate parameter and therefore the total likelihood score cannot be linearly decomposed. On the other hand, if the substitution model is set up so that each locus is edge-unlinked, having different rate parameters across edges and blocks, then condition 2 does hold.

Extensions to gene tree settings

Terraces can also arise in very different species tree reconstruction approaches, such as those that infer a species tree, S from a collection of gene trees, G using the so-called gene tree reconciliation framework (Page 1994). In this setting, a "block" is a gene tree rather than a set of characters, but otherwise the definitions hold. The individual gene trees may have taxa missing relative to the entire set of taxa on S. Terraces can occur for at least two kinds of optimality scores in this setting.

1. Duplication score in gene family trees. Gene families with multiple paralogs due to gene duplication can be used to infer a species tree by computing the duplication score, D(S,G), for a gene tree imbedded in a candidate species tree, and then taking the sum across all gene trees. This score is linear and is unaffected by partial taxon coverage, so both necessary properties for terraces hold. However, scores involving losses of genes appear to be more complex and dependent on the details of the partial sampling.

2. Deep coalescence score in allele trees. Deep coalescences are allele lineages that persist deeper in the species tree than required by the coalescence of the species tree lineages (incomplete lineage sorting). The number of extra allele tree lineages in an edge of the species tree is the deep coalescence (DC) score. It too is linear across multiple allele trees, but how it treats subtree pruning (Property 1 above) is more nuanced than for the duplication score. If two adjacent edges of the species tree each have one extra imbedded allele tree edge, we generally count this as contributing +2 to the DC score, even if the node joining the two edges leads to a leaf taxon with missing data. Under this formulation, Property 1 does not generally hold. However, we might alternatively count this as just a single extra edge, in which case Property 1 does.

Less is known about how and if terraces extend to model-based species tree inference, but we expect that parameter independence is also relevant here. For example, in methods that treat incomplete lineage sorting, though individual gene trees might be reconstructed from different alignments, they might be integrated using a common set of demographic parameters on the species tree, which would lead to violation of the second condition described above.

Polytomies in the parent tree

Our implementations assume that the input parent tree is a binary rooted tree. The algorithm we implemented to count the number of parent trees on a terrace, for example, assumes this is true. Using our code, one could implement two approaches to study terraces for a non-binary parent tree. Suppose we have a non-binary parent tree, T0, then we could (i) infer the terrace for each possible binary resolution of T0, and then combine these terraces---call this set of trees U(T0); or alternatively, we could (ii) generate the subtrees of T0 using our code (which in fact, finds their rooted triplets), and then infer the terrace from that (some of these subtrees would be non-binary and therefore would have a smaller set of rooted triplets than if they were binary)---call this terrace Terr(T0).

We have derived one mathematical property relating these two collections of trees. It would be convenient computationally if they were equal, but they are not. In fact U(T0) is a subset of (or equal to) Terr(T0), and there are examples in which Terr(T0) contains trees that are not in U(T0). One example is the tree T0=(d,((a1,a2),(b1,b2),(c1,c2))), which has a trichotomy. The tree T* = (d,((a1,(b1,c1)),(a2,(b2,c2)))) is in Terr(T0) but it is not in U(T0) (Steel, unpub.). That is, T* is on the terrace that would normally be constructed by running Terraphy, but it is not a binary resolution of the original non-binary tree, T0.

Prevalence of terraces

Terraces are not found in all large-scale data sets, but they are characteristic of phylodiversity studies aimed at maximizing the number of taxa (Dobrin et al. 2018). Terrace size can scale exponentially with the number of taxa and can thus be huge in large studies. The optimal tree reported in one study of 7,000 bird taxa is on a terrace with 10^388 other trees. Generally, the sampling properties of multilocus data sets predict the scale of their terraces: Figure 3 in Dobrin et al. (2018) shows the observed correlation between percentage of missing taxon samples and terrace size, and Figure 4 shows how a sampling adequacy measure (a function of missing data percentage and taxon number) roughly predicts whether the terrace found in a data set will be large or small.

Impacts of terraces

Terraces have at least four known impacts.

1. Ambiguity in tree space

Terraces, like any factor that causes multiple trees to have identical or similar optimality scores, adds ambiguity to phylogenetic inference. Terraphy can characterize this ambiguity, by enumeration of all binary trees on a terrace; and construction of the BUILD and strict consensus trees of the terrace. The consensus trees are attractive because they can be constructed efficiently without enumerating any trees on the terrace at all, but each type conveys only some information about the terrace. The BUILD tree (equal to an Adams consensus of all the trees on a terrace) tends to be nearly binary and to give the false impression of a large number of clades that are actually not found in most trees on the terrace. The strict consensus, on the other hand, is often highly unresolved in such cases, but is a safer summary (perhaps too conservative).
2. Confidence assessment
Terraces affect estimates of the quality of trees. The frequency with which trees on a terrace are sampled by bootstrap or Bayesian MCMC analyses can vary significantly depending on several factors, despite all trees on a terrace having the same likelihood score. Sanderson et al. (2015) studied effects on the bootstrap and the Bayesian posterior probability distribution of trees. Assume very long sequences so that phylogenetic uncertainty is due only to terraces. In very simple cases of partial taxon coverage, an incorrect bipartition comes to dominate the set of bootstrap bipartitions and bias the bootstrap proportion estimate. Equally unexpected results also arise in Bayesian inference, in circumstances that are reminiscent of "long branch attraction". This arises because, for a single locus, a taxon with fully missing data has highest posterior probability of attaching to the longest edge in the tree. The posterior probabilities when there are two loci depend on this attraction occurring differentially for the two loci.
3. Heuristic search and Bayesian MCMC
In theory, terraces can impede heuristic searches for an optimal tree and alter the course of Markov Chain Monte Carlo (MCMC) runs. Almost all heuristic tree search algorithms for parsimony and likelihood-based inference or MCMC use some kind of topological rearrangement step in the search or proposal mechanism. The structure of terraces could affect this in several ways. There may be trees ("rabbit holes") within a terrace from which no rearrangement will produce a tree off the terrace. Any search trying simply to improve the likelihood score will terminate there. Fortunately, rabbit holes on terraces, even for NNIs, which are quite "local", appear to be relatively rare. In small examples with 8 taxa and 2 loci (density 62.5%) we found only 72 such trees (out of 10,395). However, it can still be true that a large fraction of rearrangements from a given tree on a terrace stay on the terrace (about 48% in these toy examples). Depending on details of implementation and data set this might limit the effectiveness of searches. Finally, it is not hard to construct examples in which terraces act as "moats" during tree search. This means that any sequence of NNIs must traverse several trees on a terrace before successfully getting to a higher terrace. Again, depending on termination criteria, a search algorithm might terminate early. These can also impede more powerful SPR rearrangement strategies if the attachment radius is limited, as in GARLI and RAxML.
4. Downstream comparative inferences
Terraces can affect inference of ancestral states or tests concerning species diversification rates, to take two examples, by generating large numbers of equally optimal trees having different implications for evolutionary scenarios. Most recent studies of evolution on trees appreciate the need to consider a confidence set of trees, by repeating analyses on a distribution of trees, using the trees as a prior in downstream analyses, or at a minimum by choosing (somehow) a canonical tree. All of these processes are potentially affected by the size and "shape" of the set of trees on terraces. Simple patterns of partial taxon coverage similar to that described above can lead to distributions of trees that are dominated by tree topologies that favor one ancestral state reconstruction or one outcome of a diversification rate test. In fact the outcome of these analyses can be predicted largely on the basis of seemingly irrelevant factors involving the size of the clades missing or the other locus in a partial coverage matrix like that of Figure 1.

References

Dobrin, B., D. Zwickl, and M. J. Sanderson. 2018. The prevalence of terraced treescapes in phylogenetic data sets. BMC Evol. Biol., 18:46.

Page R.D.M. 1994. Maps between trees and cladistic analysis of historical associations among genes, organisms and areas. Syst. Biol., 43:58-77.

Sanderson M.J., McMahon M.M., Steel M. 2010. Phylogenomics with incomplete taxon coverage: The limits to inference. BMC Evolutionary Biology 10.

Sanderson M.J., McMahon M.M., Steel M. 2011. Terraces in phylogenetic tree space. Science 333:448-450.

Sanderson, M. J., M. M. McMahon, A. Stamatakis, D. Zwickl and M. Steel. 2015. Impacts of terraces on phylogenetic inference. Syst. Biol. 64:709-726.

Semple C. 2003. Reconstructing minimal rooted trees. Discrete Applied Mathematics, 127:489-503.

Steel M. 2016. Phylogeny: discrete and random processes in evolution. SIAM, Philadelphia, USA.

Steel M., Sanderson M.J. 2010. Characterizing phylogenetically decisive taxon coverage. Applied Mathematics Letters 23:82-86.

⚠️ **GitHub.com Fallback** ⚠️