TY - UNPB
T1 - Phylogenetic tree statistics
T2 - A systematic overview using the new R package 'treestats'
AU - Janzen, Thijs
AU - Etienne, Rampal S.
PY - 2024/1/29
Y1 - 2024/1/29
N2 - Phylogenetic trees are believed to contain a wealth of information on diversification processes. Comparing phylogenetic trees is not straightforward due to their high dimensionality. Researchers have therefore defined a wide range of one-dimensional summary statistics. However, it remains unexplored to what extent these summary statistics cover the same underlying information and which summary statistics best explain observed variation across phylogenies. Furthermore, a large subset of available summary statistics focuses on measuring the topological features of a phylogenetic tree, but these statistics are often explored only at the extreme cases of the fully balanced or fully unbalanced tree and not for trees of intermediate balance. Here, we introduce a new R package that provides speed-optimized code to compute 54 summary statistics. We study correlations between summary statistics on empirical trees and on trees simulated using several diversification models. Furthermore, we introduce an algorithm to create intermediately balanced trees in a well-defined manner, in order to explore variation in summary statistics across a balance gradient. We find that almost all summary statistics are correlated with tree size, and that it is difficult, if not impossible, to correct for tree size unless the tree-generating model is known. Furthermore, we find that across empirical and simulated trees, at least two large clusters of correlated summary statistics can be found, where statistics group together based on the information used (topology or branching times). However, the finer-grained correlation structure appears to depend strongly on either the taxonomic group studied (in empirical studies) or the diversification model (in simulation studies). Nevertheless, we can identify multiple groups of summary statistics that are strongly and consistently correlated, indicating that these statistics measure the same underlying property of a tree.
Lastly, we find that almost all topological summary statistics vary non-linearly and sometimes even non-monotonically with our intuitive balance gradient. Therefore, in order to avoid introducing biases and missing underlying information, we advocate for selecting as many summary statistics as possible in phylogenetic analyses. With the introduction of the treestats package, which provides fast and reliable calculations, such an approach is now routinely possible.
UR - https://doi.org/10.1101/2024.01.24.576848
U2 - 10.1101/2024.01.24.576848
DO - 10.1101/2024.01.24.576848
M3 - Preprint
BT - Phylogenetic tree statistics
PB - BioRxiv
ER -