Phylogenetic tree statistics: A systematic overview using the new R package 'treestats'

Thijs Janzen*, Rampal S Etienne

*Corresponding author for this work

Research output: Academic, peer-reviewed


Abstract

Phylogenetic trees are believed to contain a wealth of information on diversification processes. However, comparing phylogenetic trees is not straightforward due to their high dimensionality. Researchers have therefore defined a wide range of low-dimensional summary statistics. Currently, it remains unexplored to what extent these summary statistics cover the same underlying information and which summary statistics best explain observed variation across phylogenies. Furthermore, a large subset of available summary statistics focuses on measuring the topological features of a phylogenetic tree, but these statistics are often only explored at the extreme edge cases of the fully balanced or fully imbalanced tree and not for trees of intermediate balance. Here, we introduce a new R package called 'treestats' that provides speed-optimized code to compute 70 summary statistics. We study correlations between summary statistics on empirical trees and on trees simulated using several diversification models. Furthermore, we introduce an algorithm to create intermediately balanced trees in a well-defined manner, in order to explore variation in summary statistics across a balance gradient. We find that almost all summary statistics are correlated with tree size, and that it is difficult, if not impossible, to correct for tree size unless the tree-generating model is known. Furthermore, we find that across empirical and simulated trees, at least three large clusters of correlated summary statistics can be found, where statistics group together based on the information used (topology or branching times). However, the finer-grained correlation structure appears to depend strongly on either the taxonomic group studied (in empirical studies) or the tree-generating model (in simulation studies). Amongst statistics describing the (im)balance of a tree, we find that almost all statistics vary non-linearly, and sometimes even non-monotonically, with our generated balance gradient. This indicates that balance is perhaps a more complex property of a tree than previously thought. Furthermore, using our new imbalancing algorithm, we devise a numerical test to identify balance statistics, and identify several statistics as balance statistics that were not previously considered as such. Lastly, our results lead to several recommendations on which statistics to select when analyzing and comparing phylogenetic trees.
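
To make the idea of a topological balance statistic concrete, the following is a minimal sketch of one widely used example, the Colless index, evaluated at the two edge cases mentioned above and on one tree in between. It is written in plain Python and does not use the 'treestats' package or its API. The nested-tuple tree encoding and the helper names (`n_tips`, `colless`, `caterpillar`, `balanced`) are assumptions made for illustration only, and the intermediate tree is hand-built, not produced by the authors' imbalancing algorithm.

```python
# Illustrative sketch only: trees are nested 2-tuples, and any non-tuple
# value is treated as a tip. This encoding is an assumption of the example.

def n_tips(tree):
    """Count the tips below (and including) this node."""
    if not isinstance(tree, tuple):
        return 1
    left, right = tree
    return n_tips(left) + n_tips(right)

def colless(tree):
    """Colless index: sum over internal nodes of |tips(left) - tips(right)|."""
    if not isinstance(tree, tuple):
        return 0
    left, right = tree
    return abs(n_tips(left) - n_tips(right)) + colless(left) + colless(right)

def caterpillar(n):
    """Fully imbalanced (ladder-like) tree with n tips."""
    tree = "t1"
    for i in range(2, n + 1):
        tree = (tree, f"t{i}")
    return tree

def balanced(depth):
    """Fully balanced tree with 2**depth tips."""
    if depth == 0:
        return "t"
    return (balanced(depth - 1), balanced(depth - 1))

# Three trees with 8 tips each: the two extremes and one hand-built
# intermediate shape (a 4-tip caterpillar joined to a 4-tip balanced clade).
fully_imbalanced = caterpillar(8)
fully_balanced = balanced(3)
intermediate = (caterpillar(4), balanced(2))

for name, tree in [("imbalanced", fully_imbalanced),
                   ("intermediate", intermediate),
                   ("balanced", fully_balanced)]:
    print(name, n_tips(tree), colless(tree))
```

For a caterpillar tree with n tips the index equals (n - 1)(n - 2) / 2, and for a fully balanced tree it is 0. This also shows why raw values depend on tree size, which makes statistics hard to compare across trees with different numbers of tips.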

Original language: English
Article number: 108168
Number of pages: 15
Journal: Molecular Phylogenetics and Evolution
Volume: 200
Early online date: 6 Aug 2024
Status: E-pub ahead of print - 6 Aug 2024
