Tree statistics - arklumpus/TreeViewer GitHub Wiki
TreeViewer can be used to compute some statistics about the tree that is currently open. This feature is still somewhat experimental, so any feedback and/or suggestions about other statistics to implement are welcome.
The tree statistics can be displayed by clicking on the Stats
button at the bottom of the interface. When you click on this button, TreeViewer will analyse the tree that is currently open and display a "report" containing information about it. Three kinds of reports can be created:
- Reports about a single tree.
- Reports comparing two trees.
- Reports about multiple trees.
This page will describe the kinds of statistics contained in each report and show how to replicate the plots produced by TreeViewer using R.
The file Samples.trees contains 500 samples from a Bayesian phylogenetic analysis estimating a phylogenetic tree using MrBayes (on a simulated dataset). We will mainly use this file for the examples in this page.
Reports about a single tree
When you download the tree file mentioned above and open it in TreeViewer, the program will compute a consensus tree and display it. If you click on the Stats
button at the bottom of the interface, a report containing information about the consensus tree and a number of plots will be produced.
You can export this report by clicking on the buttons at the top of the interface. You can save the source of the report (including all the plots) in Markdown format (the same language used e.g. to write Readme documents in GitHub), or you can save a copy in PDF format. You can download a copy of this report in PDF format from here: SingleTreeReport.pdf
.
General information
First of all, the report will specify whether the tree is rooted or not, the number of tips it contains, the number of total nodes (including both tips and internal nodes), and whether the tree is fully bifurcating or not. If the tree is not fully bifurcating, the average number of descendants for each node is also displayed (the closer this is to 2, the closer the tree is to being fully bifurcating). Note that the report analyses the Final transformed tree, i.e. the produced by the Transformer module after it has been altered by any Further transformation modules.
Branch length information
If the tree has branch lengths, the total length of the tree (i.e., the sum of all branch lengths) is next displayed. If the tree is clock-like, the age of the root (distance from the root to the tips) is displayed; otherwise, both the maximum and minimum distance from the root to a tip are displayed. Some statistics about the branch lengths are next (range, mean, median, 89% highest-density interval and 95% HDI), followed by a combined histogram and box plot showing the distribution of the branch lengths.
If you move the mouse over the plot, you will notice four buttons appearing in the top-right corner:
- The
PDF
andSVG
buttons on the right can be used to export the plot in PDF or SVG format, respectively. - The
CSV
button can be used to save the data used to create the plot in CSV format. For this plot, the file created by this button consists of a first line readingBranch length
, and then one line for each branch in the tree, containing the branch length. - The button with arrows on the left enlarges the plot, temporarily hiding the rest of the report. In this enlarged view, the plot becomes interactive: if you move the mouse over the various bars of the histogram, you will see the range represented by each bar and the number of values within that range. You can also zoom in/out the plot using the mouse wheel or touchpad. Using the buttons in the
Plot
tab at the top of the interface, you can again export the plot data in CSV format or the plot as SVG, PDF or raster image. When you are done examining the plot, you can click on theClose plot
button to go back to the report.
The buttons described here will appear on all of the plots contained in tree statistics reports, with the same function. Naturally, the kind of data that is exported by the CSV
button will differ between different plots.
You can use the exported CSV file to recreate the histogram/boxplot in R (or in any other statistical analysis software). For example, assuming that the CSV file has been saved as branchLengths.csv
, you could do something like:
data <- read.table("branchLengths.csv", header=TRUE, sep="\t")
hist(data$Branch.length, breaks="FD")
boxplot(data$Branch.length)
graphics.off()
Naturally, you could spend some time customising the appearance of the box plot and the histogram, but that is beyond the scope of this document.
If at least one branch in the tree does not have an associated branch length, this section will be missing from the report.
Tree shape statistics
If the tree is rooted or has at least six tips, the next section of the report will contain some tree shape statistics.
Leaf depths and height
If the tree is rooted, the first statistic shown is leaf depth, i.e., the number of internal nodes between each tip of the tree and the root node. The average value, variance and distribution (as another histogram/boxplot combination) are shown; at least for small trees, the variance of the leaf depths gives a measure of the "balance" of the tree: the closer this is to 0, the more balanced the tree is (Sackin, 1972, Coronado et al., 2020).
If the tree is not clock-like, another plot is shown, displaying leaf heights. This is the path-length between each tip and the root node of the tree. For a clock-like tree (as the one we are looking at), all leaves have the same height, so this plot is not shown.
Like the plot showing branch lengths, you can replicate these plots by exporting the data as a CSV file, and then using the R commands above.
Sackin index
The next item included in the report is the Sackin index, which is the sum of the leaf depths for all the leaves in the tree (Shao & Sokal, 1990). This is another measure of "balance" of the tree - the more balanced the tree is, the lower the value of the Sackin index should be. Note that the Sackin indices of trees with different numbers of taxa are not directly comparable, as the range of values for this index changes depending on the number of taxa.
To address this, the Sackin index can be normalised by considering the expected mean value for the number of taxa in the tree; this will yield a value that is comparable between trees with different numbers of taxa. The problem, however, is that the mean for the Sackin index depends on the assumed null distribution of the topologies. Two null models are currently implemented in TreeViewer: the YHK model (Yule-Harding-Kingman, which is a pure-birth process) and the PDA model (Proportional to Distinguished Arrangements, under which all tree topologies are equally probable). You can find more information about these models in Creating a new tree.
The YHK and PDA normalisations for the Sackin index are, respectively (Blum et al, 2006):
$$ S_{YHK} = \frac{S}{n} - 2 \sum_{i=2}^{n - 1}\frac{1}{i} \qquad \qquad S_{PDA} = \frac{S}{n^{3/2}} $$
Where $S$ is the raw Sackin index and $n$ is the number of taxa in the tree. If the tree is fully bifurcating, the normalised values of the Sackin index are shown after the raw value. A statistical analysis is then performed to determine whether these values differ significantly from the expectation under each null hypothesis. This involves generating 2000 random trees under the YHK or PDA model, with the same number of tips as the current tree; then, the Sackin index for each tree is computed, and this distribution is compared with the value obtained for the actual tree being analysed. The difference is deemed significant if the observed value for the Sackin index is higher or lower than 95% of the values in the distribution.
For the tree we are looking at, we see that the Sackin index is not significantly different than expected under the YHK, and it is significantly more balanced than expected under the PDA model. This is reassuring, since the tree was actually sampled from a pure birth process. The plot shows the distribution of the Sackin indices under each null model; if you enlarge it, you can move your mouse over the curves to see the p values for both models.
If you export the data for this plot as a CSV file, you will get a file with four columns: the first two columns contain the sampled distribution under the null hypothesis for the YHK and PDA models, respectively, while the two final columns contain just the observed value for the Sackin index, normalised according to the YHK or PDA model. To replicate the plot in R, you can use the following commands, assuming the CSV file has been saved as sackin.csv
:
data <- read.table("sackin.csv", header=TRUE, sep="\t", fill=TRUE)
hist.YHK <- hist(data$Samples.from.YHK.model, breaks="FD")
abline(v=data[1, 3])
p.smaller.YHK <- sum(data$Samples.from.YHK.model < data[1, 3]) / length(data$Samples.from.YHK.model)
text(data[1, 3], max(hist.YHK$counts), p.smaller.YHK, pos=2)
text(data[1, 3], max(hist.YHK$counts), 1 - p.smaller.YHK, pos=4)
hist.PDA <- hist(data$Samples.from.PDA.model, breaks="FD")
abline(v=data[1, 4])
p.smaller.PDA <- sum(data$Samples.from.PDA.model < data[1, 4]) / length(data$Samples.from.PDA.model)
text(data[1, 4], max(hist.PDA$counts), p.smaller.PDA, pos=2)
text(data[1, 4], max(hist.PDA$counts), 1 - p.smaller.PDA, pos=4)
graphics.off()
Again, you can fiddle with the plot commands to get a nicer plot, if you want to.
Colless index
Assuming the tree is fully bifurcating, the next item is the Colless index, which is the sum, over all the nodes in the tree, of the absolute difference in the number of descendants on the left and right side of each node (Colless, 1982). Similarly to the Sackin index, both a raw value and normalised values according to the YHK and PDA null models are provided (Blum et al, 2006):
$$ C_{YHK} = \frac{C - \mathbb{E}\left[C_n \right]}{n} \qquad \qquad C_{PDA} = \frac{C}{n^{3/2}} $$
$$ \mathbb{E}\left[C_n \right] = n - 1 - 2t(n) +2(n + 1) \cdot k(n) $$
$$ t(n) = \left { \begin{matrix} \frac{n - 2}{4} & n \mathrm{\ even} \ \frac{\left (n - 1 \right )^2}{4n} & n \mathrm{\ odd} \end{matrix} \right . \qquad \qquad k(n) = \sum_{i=1}^{n-1}\frac{i - 1 - 2 t(i)}{\left(i + 1 \right)\left (i + 2 \right)} $$
Where $n$ is the number of taxa in the tree and $C$ is the raw Colless index value. The observed value of the Colless index is then analysed by comparing it to a sampled distribution, just like the Sackin index; the meaning of the plot and the p-values is identical. We can see that also in this case the observed value is in line with predictions under the YHK model, while significantly lower than the PDA model (i.e., the tree is more balanced than expected under the PDA model).
Number of cherries
The final statistics provided in this report is the number of cherries for the tree; this is computed for any tree (rooted or unrooted, binary of multifurcating) with at least 6 six tips. A cherry is a sub-terminal node with exactly two descendants. This statistics again measures the balance of the tree; more balanced trees will have a higher number of cherries. Just like the Sackin and Colless indices, the possible values for this statistic depend on the number of taxa in the tree, and both raw and nomalised values are provided (McKenzie & Steel, 2000):
$$ K_{YHK} = \frac{K-n/3}{\sqrt{2n/45}} \qquad \qquad K_{PDA} = \frac{K - \frac{n\left (n - 1\right)}{2\left(2n - 5\right)}}{\sqrt{\frac{n\left(n - 1\right)\left(n - 4 \right)\left(n - 5 \right)}{2\left(2n - 5 \right)^2\left(2n - 7 \right)}}} $$
Where $n$ is the number of taxa in the tree and $K$ is the raw number of cherries. Unlike the Sackin and Colless index, the limit distribution for the normalised number of cherries is a much more tractable normal distribution $\mathcal{N} \left (0, 1 \right)$; therefore, the statistical analysis to determine whether the observed value significantly differs from expectations is run both using the same Monte Carlo approach as for the other indices, as well as using the limit distribution (this is why two p-values are produced). Like the previous statistics, for the current tree there should not be a significant difference between the observed number of cherries and the expected value under the YHK model, while the difference with the PDA model should be significant.
Similarly, the plot shows, in addition to the Monte Carlo samples for the YHK and PDA models, the limit normal distribuion. You may notice that the distributions have a "weird" appearance; this is because the number of cherries is a discrete variable (e.g., you can have 20 or 21 cherries, but not 20.5); the more taxa the tree has, the more the sampled distribution should approach a continuous normal distribution.
To replicate this plot in R, you can add the normal distribution as an overlay over the sampled Monte Carlo distribution. Assuming you have saved the data file as cherries.csv
:
data <- read.table("cherries.csv", header=TRUE, sep="\t", fill=TRUE)
hist.YHK <- hist(data$Samples.from.YHK.model, breaks="FD")
abline(v=data[1, 3])
p.smaller.YHK <- sum(data$Samples.from.YHK.model < data[1, 3]) / length(data$Samples.from.YHK.model)
text(data[1, 3], max(hist.YHK$counts), p.smaller.YHK, pos=2)
text(data[1, 3], max(hist.YHK$counts), 1 - p.smaller.YHK, pos=4)
lines(hist.YHK$mids, dnorm(hist.YHK$mids)/max(dnorm(hist.YHK$mids)) * max(hist.YHK$counts))
hist.PDA <- hist(data$Samples.from.PDA.model, breaks="FD")
abline(v=data[1, 4])
p.smaller.PDA <- sum(data$Samples.from.PDA.model < data[1, 4]) / length(data$Samples.from.PDA.model)
text(data[1, 4], max(hist.PDA$counts), p.smaller.PDA, pos=2)
text(data[1, 4], max(hist.PDA$counts), 1 - p.smaller.PDA, pos=4)
lines(hist.PDA$mids, dnorm(hist.PDA$mids)/max(dnorm(hist.PDA$mids)) * max(hist.PDA$counts))
graphics.off()
Reports about multiple trees
If you click on the Loaded trees
tab in the Tree statistics window, all of the 500 trees in the file we have opened will be analysed, producing a different kind of report. You can download a copy of this report in PDF format from here: MultipleTreesReport.pdf
.
General information
The first item in this new report analyses the taxa contained in each tree. In the tree file we are looking at, all trees have the same taxa; however, if this were not the case, the number of taxa shared between all trees, as well as the total number of taxa would be shown.
If the trees have different taxa, the rest of the analyses can be either performed on the subset of taxa shared between all the trees (i.e., by pruning off all taxa that are not present in every tree), or by using pairwise comparisons (i.e., by pruning, for each comparison, the taxa that are not shared between both trees). You can determine which of these strategies are used by changing the value of the Tree comparisons
setting in the global preferences for TreeViewer (accessed from the File
menu).
Tree space
If you are using pairwise comparisons or there are at least three taxa that are shared between all the trees, the next plot shows a 2D representation of the tree space, where each tree is represented by a point. This is computed based on a distance metric between the trees; if the trees all have the same topology, the metric is the edge-length distance (i.e., $d_{EL}\left(T, T^\prime\right)=\sqrt{\sum_i \left (\nu_i - \nu^\prime_i \right)^2} $, where $\nu_i$ and $\nu^\prime_i$ are the lengths for branch $i$ in the two trees). If the trees have different topologies, the metric used is the Robinson-Foulds (RF) distance; two plots are produced in this case, one for the unweighted RF distance, and one for the weighted RF distance (weighted by the branch lengths).
For each of these plots, a clustering analysis is performed, to determine whether the data support the presence of multiple clusters of trees. This can be performed either using the edge-length or RF distances directly, or by using the 2D metric obtained after the multidimensional scaling analysis used to plot the points; you can determine the approach used by changing the value of the Tree clustering metric
setting in the global preferences for TreeViewer (accessed from the File
menu).
If the 2D metric is being used, a Duda-Hart test (Duda & Hart, 1973) is performed first of all, to determine whether there is significant evidence for clustering within the data. If a significant result is returned, then the clustering anslysis proceeds; otherwise it is skipped. If the distances are being used directly, the Duda-Hart test is not performed, and the clustering analysis proceeds directly.
A K-medoid clustering analysis is then performed, with up to 12 clusters. For each clustering, the average silhouette score and the cluster size variance are computed. Clusterings with a silhouette score greater than or equal to 95% of the highest observed silhouette score are considered, and, among these, the number of clusters that leads to the smallest cluster size variance is selected. If all clusterings have a silhouette score smaller than 0.5, no clustering is performed, regardless of the results of the Duda-Hart test.
The plots produced by this analysis will contain a scatter plot of the trees, where the position of each tree is determined using a multidimensional scaling analysis based on the inter-tree distances that have been computed. If the clustering has been successful, different clusters will be highlighted by different colours, and the medoid for each cluster will be shown as a star (instead of a circle). A shaded area showing the variance of each cluster is also shown. Finally, to the right of the scatter plot, the silhouette scores for each number of clusters up to 12 are shown, together with the cluster size variance.
If you expand the tree space plots, you can move your mouse over the points to see which tree each point represents. If clusters are shown, you can hover each cluster to see how many points belong to it, and if you click on a cluster, a new TreeViewer window will open. This new window will contain a tree file that has been created from the trees in the cluster.
For example, the tree file we are using shows evidence for two clusters according to the weighted RF distance:
By clicking in the blue area or in the orange area, you can open the 444 trees in the blue cluster or the 56 trees in the orange cluster, respectively; you can then e.g. compare the consensus trees for each of these clusters to identify key differences. Similarly,if you move your mouse over the stars, you can see that the medoid for the blue cluster is the 415th tree in the file, while the medoid for the orange cluster is the 139th tree; you could compare these trees to see what changes between them and between the overall consensus tree.
To replicate these plots, you can export the plot data in CSV format, this will produce a file containing a lower triangular matrix representation of the tree distances. If you save this as treeSpace.csv
, you can use the following code to perform the multidimensional scaling and clustering and plot the results:
library(cluster)
treeCount <- length(readLines("treeSpace.csv")) - 1
distances <- read.table("treeSpace.csv", skip=1, fill=TRUE, col.names=paste("V", 1:treeCount))
distances[upper.tri(distances)] <- t(distances)[upper.tri(distances)]
# Perform MDS
points = cmdscale(distances)
# Perform clustering analysis
# Duda-Hart test
clustering1 <- pam(points, k=1)
# Sum of the (squared) distances between each point and the overall medoid
je1 <- sum(sweep(points, 2, clustering1$medoids[1,]) ^ 2)
clustering2 <- pam(points, k=2)
# Sum of the (squared) distances between each point and the medoid of the corresponding cluster
je2 <- sum(sweep(points[clustering2$clustering == 1,], 2, clustering2$medoids[1,]) ^ 2) + sum(sweep(points[clustering2$clustering == 2,], 2, clustering2$medoids[2,]) ^ 2)
# Critical value for the Duda-Hart test with alpha = 0.001
critVal <- 1 - 1 / pi - qnorm(1 - 0.001) * sqrt((1 - 4 / pi^2) / treeCount)
# Normalised test statistic value (used to compute the p-value)
z <- (1 - 1 / pi - je2 / je1) / sqrt((1 - 4 / pi^2) / treeCount)
pValue <- 1 - pnorm(z)
cat("Duda-Hart test p-value:", pValue, "\n")
if (je2 / je1 > critVal) {
cat("Only one cluster detected according to the Duda-Hart test.\n")
finalClustering <- clustering1
} else {
# Compute clusterings for 3-12 clusters
clusterings = c(list(clustering2), lapply(3:12, function(k) { pam(points, k) }))
maxSilhouetteScore <- max(sapply(clusterings, function(x) { x$silinfo$avg.width }))
if (maxSilhouetteScore < 0.5) {
cat("No clustering with silhouette score >= 0.5.\n")
finalClustering <- clustering1
} else {
# Considering only clusterings with silhouette score >= 95 % of the maximum, take the one with the lowest variance
finalClustering <- clusterings[order(sapply(clusterings, function(x) { if (x$silinfo$avg.width >= 0.95 * maxSilhouetteScore) { var(x$clusinfo[,"size"]) } else { Inf } }))[[1](/arklumpus/TreeViewer/wiki/order(sapply(clusterings,-function(x)-{-if-(x$silinfo$avg.width->=-0.95-*-maxSilhouetteScore)-{-var(x$clusinfo[,"size"])-}-else-{-Inf-}-}))[[1)]]
cat("Using ", nrow(finalClustering$medoids), " clusters.\n")
}
}
# Plot
plot(points, pch=16, col=finalClustering$clustering)
points(finalClustering$medoids, pch=23, col="white", bg=1:nrow(finalClustering$medoids), cex=2)
if (nrow(finalClustering$medoids) > 1) {
# Silhouette score and cluster size variance plot
plot(2:12, sapply(clusterings, function(x) { x$silinfo$avg.width }), pch=16, cex=1 + scale(sapply(clusterings, function(x) { var(x$clusinfo[,"size"]) }), c=0) * 4 )
lines(2:12, sapply(clusterings, function(x) { x$silinfo$avg.width }))
abline(h=maxSilhouetteScore * 0.95)
}
graphics.off()
Note that the results of the R plot may be slightly different to those obtained with TreeViewer, due to differences in the algorithms used by R and TreeViewer, but the differences should not be significant.
Tree shape statistics
If the trees have different topologies, the observed values for the Sackin index, Colless index and number of cherries for each tree are displayed as histograms.
Reports about two trees
If you click on the Compare
tab in the Tree statistics window, you will be able to create a new report, including the trees you like. These could be the final transformed tree, some of the loaded trees, or trees from files on your computer. For this example, you can download the tree files Tree1.tre
and Tree2.tre
and use them to create a new report, by clicking on the Add tree file...
button and selecting them one at a time. Each of these files contains a time-calibrated tree of some cyanobacterial strains and plant chloroplasts, built using different proteins. You can download a copy in PDF format of this report from here: TwoTreesReport1.pdf
.
General information
A report comparing two trees will start by stating whether the trees are rooted or unrooted, the number of nodes and leaves that they contain, whether the tree has branch lengths, and whether the trees contain exactly the same leaves or not. In this case, the trees have most leaves in common, while 6 leaves are specific to each tree; this is further detailed in a table, showing the tips shared by both trees or present just in one tree.
When the trees contain different leaves, the other analyses in the report are performed on the subtree induced by the leaves that are in common between the two trees (i.e., leaves that are only present in one tree are pruned from each tree).
The next comparison looks at the tree topology: if the trees have different topologies, the (weighted and unweighted) Robinson-Foulds distance between them is computed, and the number of splits that are in common or present in only one of the trees is shown. If the trees have the same topology (as in this case), the edge-length distance is computed instead.
The first plot in the report shows the distribution of the split length difference for splits that are present in both trees (if any), which is shown as a histogram/box plot combination.
Tree shape statistics
The next section of the report contains information about the tree shape statistics. If the trees have the same topology (as in this case), the report shows the same information for the Sackin index, the Colless index and the number of cherries that would appear in a single-tree report. If the trees have different topologies, however, the differences in between the shape statistics of the two trees are shown and statistically analysed (using a two-tailed test) instead; this can be used to determine whether one tree is significantly more balanced than the other.
For an example of trees with different topologies, you can use Tree3.tre
and Tree4.tre
. You can download a copy in PDF format of this report from here: TwoTreesReport2.pdf
.
To avoid producing files that are too large, when you export the data for a plot showing the difference in a statistic (e.g., the Sackin index difference), the exported file contains the estimated density rather than all the ~2 million sampled values. To replicate the Sackin index difference and Colless index difference plots, you can use the following code (assuming the output file has been saved as sackinDifference.csv
):
data <- read.table("sackinDifference.csv", header=TRUE, sep="\t", fill=TRUE)
plot(data$x, data$Density..YHK.model., t="l")
abline(v=data[1, 4])
abline(v=-data[1, 4])
p.smaller.YHK <- sum(data$Density..YHK.model.[data$x > -abs(data[1, 4]) & data$x < abs(data[1, 4])] ) / sum(data$Density..YHK.model.)
text(abs(data[1, 4]), max(data$Density..YHK.model.), p.smaller.YHK, pos=2)
text(abs(data[1, 4]), max(data$Density..YHK.model.), 1 - p.smaller.YHK, pos=4)
plot(data$x, data$Density..PDA.model., t="l")
abline(v=data[1, 5])
abline(v=-data[1, 5])
p.smaller.PDA <- sum(data$Density..PDA.model.[data$x > -abs(data[1, 5]) & data$x < abs(data[1, 5])] ) / sum(data$Density..PDA.model.)
text(abs(data[1, 5]), max(data$Density..PDA.model.), p.smaller.PDA, pos=2)
text(abs(data[1, 5]), max(data$Density..PDA.model.), 1 - p.smaller.PDA, pos=4)
graphics.off()
To replicate the plot displaying the difference in the number of cherries, you can use the following (very similar) code, instead:
data <- read.table("cherryDifference.csv", header=TRUE, sep="\t", fill=TRUE)
plot(data$x, data$Density..YHK.model., t="l")
lines(data$x, dnorm(data$x, sd=sqrt(2))/max(dnorm(data$x, sd=sqrt(2))) * max(data$Density..YHK.model.))
abline(v=data[1, 4])
abline(v=-data[1, 4])
p.smaller.YHK <- sum(data$Density..YHK.model.[data$x > -abs(data[1, 4]) & data$x < abs(data[1, 4])] ) / sum(data$Density..YHK.model.)
text(abs(data[1, 4]), max(data$Density..YHK.model.), p.smaller.YHK, pos=2)
text(abs(data[1, 4]), max(data$Density..YHK.model.), 1 - p.smaller.YHK, pos=4)
plot(data$x, data$Density..PDA.model., t="l")
lines(data$x, dnorm(data$x, sd=sqrt(2))/max(dnorm(data$x, sd=sqrt(2))) * max(data$Density..PDA.model.))
abline(v=data[1, 5])
abline(v=-data[1, 5])
p.smaller.PDA <- sum(data$Density..PDA.model.[data$x > -abs(data[1, 5]) & data$x < abs(data[1, 5])] ) / sum(data$Density..PDA.model.)
text(abs(data[1, 5]), max(data$Density..PDA.model.), p.smaller.PDA, pos=2)
text(abs(data[1, 5]), max(data$Density..PDA.model.), 1 - p.smaller.PDA, pos=4)
graphics.off()
Note that the p-values reported in the R plots may be slightly different from those computed by TreeViewer, because TreeViewer uses the full distribution to compute them, rather than just the density that is exported to the CSV file.
References:
Sackin, M. J. (1972). “Good” and “Bad” Phenograms. Systematic Zoology, 21(2), 225. DOI: 10.2307/2412292
Duda, R. O., & Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley - Interscience.
Colless, D. H. (1982). Review of “Phylogenetics: the theory and practice of phylogenetic systematics.” Systematic Zoology, 31(1), 100. DOI: 10.2307/2413420
Shao, K. T., & Sokal, R. R. (1990). Tree Balance. Systematic Biology, 39(3), 266–276. DOI: 10.2307/2992186
McKenzie, A., & Steel, M. (2000). Distributions of cherries for two models of trees. Mathematical Biosciences, 164(1), 81–92. DOI: 10.1016/S0025-5564(99)00060-7
Blum, M. G. B., François, O., & Janson, S. (2006). The mean, variance and limiting distribution of two statistics sensitive to phylogenetic tree balance. Annals of Applied Probability, 16(4), 2195–2214. DOI: 10.1214/105051606000000547
Coronado, T. M., Mir, A., Rosselló, F., & Rotger, L. (2020). On Sackin’s original proposal: The variance of the leaves’ depths as a phylogenetic balance index. BMC Bioinformatics, 21(1), 1–17. DOI: 10.1186/S12859-020-3405-1