1
Habib M, Roy K, Hasan S, Rahman AH, Bayzid MS. Terraces in species tree inference from gene trees. BMC Ecol Evol 2024; 24:135. [PMID: 39497030 PMCID: PMC11533290 DOI: 10.1186/s12862-024-02309-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2023] [Accepted: 09/16/2024] [Indexed: 11/06/2024] Open
Abstract
A terrace in a phylogenetic tree space is a region where all trees contain the same set of subtrees, due to certain patterns of missing data among the taxa sampled, resulting in an identical optimality score for a given data set. This was first investigated in the context of phylogenetic tree estimation from sequence alignments using maximum likelihood (ML) and maximum parsimony (MP). It was later extended to the species tree inference problem from a collection of gene trees, where a set of equally optimal species trees was referred to as a "pseudo" species tree terrace which does not consider the topological proximity of the trees in terms of the induced subtrees resulting from certain patterns of missing data. In this study, we mathematically characterize species tree terraces and investigate the mathematical properties and conditions that lead multiple species trees to induce/display an identical set of locus-specific subtrees owing to missing data. We report that species tree terraces are agnostic to gene tree heterogeneity. Therefore, we introduce and characterize a special type of gene tree topology-aware terrace which we call "peak terrace". Moreover, we empirically investigated various challenges and opportunities related to species tree terraces through extensive empirical studies using simulated and real biological data. We demonstrate the prevalence of species tree terraces and the resulting ambiguity created for tree search algorithms. Remarkably, our findings indicate that the identification of terraces could potentially lead to advances that enhance the accuracy of summary methods and provide reasonably accurate branch support.
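The terrace condition described in this abstract (several species trees displaying the same locus-specific subtrees because of missing data) can be illustrated with a minimal Python sketch. This is not the authors' method. The trees and loci are hypothetical, trees are written as nested tuples and treated as rooted for simplicity (terraces are usually studied on unrooted trees), and the helper names `restrict` and `same_induced_subtrees` are invented for this example.

```python
def restrict(tree, taxa):
    """Prune a nested-tuple tree to `taxa`, suppressing single-child nodes."""
    if isinstance(tree, str):
        return tree if tree in taxa else None
    kept = [sub for sub in (restrict(child, taxa) for child in tree) if sub is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]
    return frozenset(kept)  # unordered children, so child order is ignored


def same_induced_subtrees(tree_a, tree_b, loci):
    """True if both trees induce identical subtrees on every locus taxon set."""
    return all(restrict(tree_a, taxa) == restrict(tree_b, taxa) for taxa in loci)


# Hypothetical data: each locus is missing one taxon (E or D).
loci = [{"A", "B", "C", "D"}, {"A", "B", "C", "E"}]
t1 = (("A", "B"), ("C", ("D", "E")))
t2 = (("A", "B"), (("C", "D"), "E"))
t3 = (("A", "C"), ("B", ("D", "E")))

print(same_induced_subtrees(t1, t2, loci))  # True
print(same_induced_subtrees(t1, t3, loci))  # False
```

Here `t1` and `t2` differ only in the placement of D and E, which never co-occur in a locus, so no locus can distinguish them; `t3` changes a relationship that locus 1 does resolve.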
Affiliation(s)
- Mursalin Habib
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
- Kowshic Roy
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
- Saem Hasan
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
- Atif Hasan Rahman
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
- Md Shamsuzzoha Bayzid
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
2
Chernomor O, Elgert C, von Haeseler A. Gentrius: Generating Trees Compatible With a Set of Unrooted Subtrees and its Application to Phylogenetic Terraces. Mol Biol Evol 2024; 41:msae219. [PMID: 39431557 PMCID: PMC11536181 DOI: 10.1093/molbev/msae219] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2024] [Revised: 09/30/2024] [Accepted: 10/11/2024] [Indexed: 10/22/2024] Open
Abstract
Generating all binary unrooted trees compatible with a given set of binary unrooted subtrees, i.e. generating their stand, is one of the classical problems in phylogenetics. Here, we introduce Gentrius, an efficient algorithm to tackle this task. The algorithm has a direct application in practice. Namely, Gentrius generates phylogenetic terraces: topologically distinct, equally scoring trees due to missing data. Despite stand generation being computationally intractable, we showed on simulated and biological datasets that Gentrius generates stands with millions of trees in feasible time. We exemplify that depending on the distribution of missing data across species and loci and the inferred phylogeny, the number of equally optimal terrace trees varies tremendously. The strict consensus tree computed from them displays all the branches unaffected by the pattern of missing data. Thus, by solving the problem of stand generation, in practice Gentrius provides an important systematic assessment of phylogenetic trees inferred from incomplete data. Furthermore, Gentrius can aid theoretical research by fostering understanding of tree space structure imposed by missing data.
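The strict consensus mentioned in this abstract keeps only the groupings shared by every tree in a set. The following Python sketch shows that idea under loud assumptions: it is not the Gentrius algorithm, it does not generate a stand, it uses rooted nested-tuple trees instead of unrooted ones, and the two "terrace" trees and the helper names `clades` and `strict_consensus_clades` are hypothetical.

```python
def clades(tree):
    """Return the set of clades (leaf sets of internal nodes) of a nested-tuple tree."""
    found = set()

    def leaf_set(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(leaf_set(child) for child in node))
        found.add(leaves)
        return leaves

    leaf_set(tree)
    return found


def strict_consensus_clades(trees):
    """Clades present in every input tree (the content of a strict consensus)."""
    return set.intersection(*(clades(t) for t in trees))


# Hypothetical pair of equally scoring trees differing only around C, D, E.
terrace = [
    (("A", "B"), ("C", ("D", "E"))),
    (("A", "B"), (("C", "D"), "E")),
]
shared = strict_consensus_clades(terrace)
print(sorted("".join(sorted(c)) for c in shared))  # ['AB', 'ABCDE', 'CDE']
```

The clade {A, B} survives because both trees contain it, while the conflicting resolutions inside {C, D, E} collapse, which mirrors the abstract's point that the consensus shows the branches unaffected by the missing-data pattern.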
Affiliation(s)
- Olga Chernomor
  - Center for Integrative Bioinformatics Vienna (CIBIV), Max Perutz Laboratories, University of Vienna and Medical University of Vienna, Vienna Bio Center (VBC), Vienna, Austria
- Christiane Elgert
  - Center for Integrative Bioinformatics Vienna (CIBIV), Max Perutz Laboratories, University of Vienna and Medical University of Vienna, Vienna Bio Center (VBC), Vienna, Austria
- Arndt von Haeseler
  - Center for Integrative Bioinformatics Vienna (CIBIV), Max Perutz Laboratories, University of Vienna and Medical University of Vienna, Vienna Bio Center (VBC), Vienna, Austria
  - Department of Computer Science, University of Vienna, Vienna, Austria
  - Ludwig Boltzmann Institute for Network Medicine, University of Vienna, Vienna, Austria
3
Berling L, Collienne L, Gavryushkin A. Estimating the mean in the space of ranked phylogenetic trees. Bioinformatics 2024; 40:btae514. [PMID: 39177090 PMCID: PMC11364146 DOI: 10.1093/bioinformatics/btae514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2023] [Revised: 05/16/2024] [Accepted: 08/21/2024] [Indexed: 08/24/2024] Open
Abstract
MOTIVATION: Reconstructing evolutionary histories of biological entities, such as genes, cells, organisms, populations, and species, from phenotypic and molecular sequencing data is central to many biological, palaeontological, and biomedical disciplines. Typically, due to uncertainties and incompleteness in data, the true evolutionary history (phylogeny) is challenging to estimate. Statistical modelling approaches address this problem by introducing and studying probability distributions over all possible evolutionary histories, but can also introduce uncertainties due to misspecification. In practice, computational methods are deployed to learn those distributions typically by sampling them. This approach, however, is fundamentally challenging as it requires designing and implementing various statistical methods over a space of phylogenetic trees (or treespace). Although the problem of developing statistics over a treespace has received substantial attention in the literature and numerous breakthroughs have been made, it remains largely unsolved. The challenge of solving this problem is 2-fold: a treespace has nontrivial often counter-intuitive geometry implying that much of classical Euclidean statistics does not immediately apply; many parametrizations of treespace with promising statistical properties are computationally hard, so they cannot be used in data analyses. As a result, there is no single conventional method for estimating even the most fundamental statistics over any treespace, such as mean and variance, and various heuristics are used in practice. Despite the existence of numerous tree summary methods to approximate means of probability distributions over a treespace based on its geometry, and the theoretical promise of this idea, none of the attempts resulted in a practical method for summarizing tree samples.
RESULTS: In this paper, we present a tree summary method along with useful properties of our chosen treespace while focusing on its impact on phylogenetic analyses of real datasets. We perform an extensive benchmark study and demonstrate that our method outperforms currently most popular methods with respect to a number of important 'quality' statistics. Further, we apply our method to three empirical datasets ranging from cancer evolution to linguistics and find novel insights into corresponding evolutionary problems in all of them. We hence conclude that this treespace is a promising candidate to serve as a foundation for developing statistics over phylogenetic trees analytically, as well as new computational tools for evolutionary data analyses.
AVAILABILITY AND IMPLEMENTATION: An implementation is available at https://github.com/bioDS/Centroid-Code.
Affiliation(s)
- Lars Berling
  - Biological Data Science Lab, School of Mathematics and Statistics, University of Canterbury, Christchurch 8041, New Zealand
- Lena Collienne
  - Biological Data Science Lab, School of Mathematics and Statistics, University of Canterbury, Christchurch 8041, New Zealand
- Alex Gavryushkin
  - Biological Data Science Lab, School of Mathematics and Statistics, University of Canterbury, Christchurch 8041, New Zealand
4
Khurana MP, Scheidwasser-Clow N, Penn MJ, Bhatt S, Duchêne DA. The Limits of the Constant-rate Birth-Death Prior for Phylogenetic Tree Topology Inference. Syst Biol 2024; 73:235-246. [PMID: 38153910 PMCID: PMC11129600 DOI: 10.1093/sysbio/syad075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2023] [Revised: 12/20/2023] [Accepted: 12/27/2023] [Indexed: 12/30/2023] Open
Abstract
Birth-death models are stochastic processes describing speciation and extinction through time and across taxa and are widely used in biology for inference of evolutionary timescales. Previous research has highlighted how the expected trees under the constant-rate birth-death (crBD) model tend to differ from empirical trees, for example, with respect to the amount of phylogenetic imbalance. However, our understanding of how trees differ between the crBD model and the signal in empirical data remains incomplete. In this Point of View, we aim to expose the degree to which the crBD model differs from empirically inferred phylogenies and test the limits of the model in practice. Using a wide range of topology indices to compare crBD expectations against a comprehensive dataset of 1189 empirically estimated trees, we confirm that crBD model trees frequently differ topologically compared with empirical trees. To place this in the context of standard practice in the field, we conducted a meta-analysis for a subset of the empirical studies. When comparing studies that used Bayesian methods and crBD priors with those that used other non-crBD priors and non-Bayesian methods (i.e., maximum likelihood methods), we do not find any significant differences in tree topology inferences. To scrutinize this finding for the case of highly imbalanced trees, we selected the 100 trees with the greatest imbalance from our dataset, simulated sequence data for these tree topologies under various evolutionary rates, and re-inferred the trees under maximum likelihood and using the crBD model in a Bayesian setting. We find that when the substitution rate is low, the crBD prior results in overly balanced trees, but the tendency is negligible when substitution rates are sufficiently high. 
Overall, our findings demonstrate the general robustness of crBD priors across a broad range of phylogenetic inference scenarios but also highlight that empirically observed phylogenetic imbalance is highly improbable under the crBD model, leading to systematic bias in data sets with limited information content.
Affiliation(s)
- Mark P Khurana
  - Section of Epidemiology, Department of Public Health, University of Copenhagen, 1352 Copenhagen, Denmark
- Neil Scheidwasser-Clow
  - Section of Epidemiology, Department of Public Health, University of Copenhagen, 1352 Copenhagen, Denmark
- Matthew J Penn
  - Department of Statistics, University of Oxford, OX1 3LB, Oxford, UK
- Samir Bhatt
  - Section of Epidemiology, Department of Public Health, University of Copenhagen, 1352 Copenhagen, Denmark
  - MRC Centre for Global Infectious Disease Analysis, School of Public Health, Imperial College London, SW7 2AZ, London, UK
- David A Duchêne
  - Centre for Evolutionary Hologenomics, University of Copenhagen, 1352 Copenhagen, Denmark
5
Steenwyk JL, Li Y, Zhou X, Shen XX, Rokas A. Incongruence in the phylogenomics era. Nat Rev Genet 2023; 24:834-850. [PMID: 37369847 PMCID: PMC11499941 DOI: 10.1038/s41576-023-00620-x] [Citation(s) in RCA: 63] [Impact Index Per Article: 31.5] [Accepted: 05/19/2023] [Indexed: 06/29/2023]
Abstract
Genome-scale data and the development of novel statistical phylogenetic approaches have greatly aided the reconstruction of a broad sketch of the tree of life and resolved many of its branches. However, incongruence - the inference of conflicting evolutionary histories - remains pervasive in phylogenomic data, hampering our ability to reconstruct and interpret the tree of life. Biological factors, such as incomplete lineage sorting, horizontal gene transfer, hybridization, introgression, recombination and convergent molecular evolution, can lead to gene phylogenies that differ from the species tree. In addition, analytical factors, including stochastic, systematic and treatment errors, can drive incongruence. Here, we review these factors, discuss methodological advances to identify and handle incongruence, and highlight avenues for future research.
Affiliation(s)
- Jacob L Steenwyk
  - Howard Hughes Medical Institute and the Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA, USA
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA
  - Vanderbilt Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN, USA
- Yuanning Li
  - Institute of Marine Science and Technology, Shandong University, Qingdao, China
- Xiaofan Zhou
  - Guangdong Laboratory for Lingnan Modern Agriculture, Guangdong Province Key Laboratory of Microbial Signals and Disease Control, Integrative Microbiology Research Centre, South China Agricultural University, Guangzhou, China
- Xing-Xing Shen
  - Key Laboratory of Biology of Crop Pathogens and Insects of Zhejiang Province, Institute of Insect Sciences, Zhejiang University, Hangzhou, China
- Antonis Rokas
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA
  - Vanderbilt Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN, USA
  - Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
6
Dumm W, Barker M, Howard-Snyder W, DeWitt III WS, Matsen IV FA. Representing and extending ensembles of parsimonious evolutionary histories with a directed acyclic graph. J Math Biol 2023; 87:75. [PMID: 37878119 PMCID: PMC10600060 DOI: 10.1007/s00285-023-02006-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2023] [Revised: 09/12/2023] [Accepted: 09/26/2023] [Indexed: 10/26/2023]
Abstract
In many situations, it would be useful to know not just the best phylogenetic tree for a given data set, but the collection of high-quality trees. This goal is typically addressed using Bayesian techniques; however, current Bayesian methods do not scale to large data sets. Furthermore, for large data sets with relatively low signal one cannot even store every good tree individually, especially when the trees are required to be bifurcating. In this paper, we develop a novel object called the "history subpartition directed acyclic graph" (or "history sDAG" for short) that compactly represents an ensemble of trees with labels (e.g. ancestral sequences) mapped onto the internal nodes. The history sDAG can be built efficiently and can also be efficiently trimmed to only represent maximally parsimonious trees. We show that the history sDAG allows us to find many additional equally parsimonious trees, extending combinatorially beyond the ensemble used to construct it. We argue that this object could be useful as the "skeleton" of a more complete uncertainty quantification.
Affiliation(s)
- Will Dumm
  - Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
  - Howard Hughes Medical Institute, Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
- Mary Barker
  - Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
  - Howard Hughes Medical Institute, Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
- William Howard-Snyder
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington, USA
- William S DeWitt III
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California, USA
- Frederick A Matsen IV
  - Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
  - Howard Hughes Medical Institute, Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
  - Department of Genome Sciences, University of Washington, Seattle, Washington, USA
  - Department of Statistics, University of Washington, Seattle, Washington, USA
7
Khodaei M, Owen M, Beerli P. Geodesics to characterize the phylogenetic landscape. PLoS One 2023; 18:e0287350. [PMID: 37352194 PMCID: PMC10289362 DOI: 10.1371/journal.pone.0287350] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Accepted: 06/04/2023] [Indexed: 06/25/2023] Open
Abstract
Phylogenetic trees are fundamental for understanding evolutionary history. However, finding maximum likelihood trees is challenging due to the complexity of the likelihood landscape and the size of tree space. Based on the Billera-Holmes-Vogtmann (BHV) distance between trees, we describe a method to generate intermediate trees on the shortest path between two trees, called pathtrees. These pathtrees give a structured way to generate and visualize part of treespace. They allow investigating intermediate regions between trees of interest, exploring locally optimal trees in topological clusters of treespace, and potentially finding trees of high likelihood unexplored by tree search algorithms. We compared our approach against other tree search tools (Paup*, RAxML, and RevBayes) using the highest likelihood trees and number of new topologies found, and validated the accuracy of the generated treespace. We assess our method using two datasets. The first consists of 23 primate species (CytB, 1141 bp), leading to well-resolved relationships. The second is a dataset of 182 milksnakes (CytB, 1117 bp), containing many similar sequences and complex relationships among individuals. Our method visualizes the treespace using log likelihood as a fitness function. It finds similarly optimal trees as heuristic methods and presents the likelihood landscape at different scales. It found relevant trees that were not found with MCMC methods. The validation measures indicated that our method performed well mapping treespace into lower dimensions. Our method complements heuristic search analyses, and the visualization allows the inspection of likelihood terraces and exploration of treespace areas not visited by heuristic searches.
Affiliation(s)
- Marzieh Khodaei
  - Department of Scientific Computing, Florida State University, Tallahassee, FL, United States of America
- Megan Owen
  - Department of Mathematics, Lehman College and Graduate Center, CUNY, NY, NY, United States of America
- Peter Beerli
  - Department of Scientific Computing, Florida State University, Tallahassee, FL, United States of America
8
How challenging RADseq data turned out to favor coalescent-based species tree inference. A case study in Aichryson (Crassulaceae). Mol Phylogenet Evol 2021; 167:107342. [PMID: 34785384 DOI: 10.1016/j.ympev.2021.107342] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/03/2020] [Revised: 07/05/2021] [Accepted: 10/29/2021] [Indexed: 12/24/2022]
Abstract
Analysing multiple genomic regions while incorporating detection and qualification of discordance among regions has become standard for understanding phylogenetic relationships. In plants, which usually have comparatively large genomes, this is feasible by the combination of reduced-representation library (RRL) methods and high-throughput sequencing enabling the cost-effective acquisition of genomic data for thousands of loci from hundreds of samples. One popular RRL method is RADseq. A major disadvantage of established RADseq approaches is the rather short fragment and sequencing range, leading to loci of little individual phylogenetic information. This issue hampers the application of coalescent-based species tree inference. The modified RADseq protocol presented here, which targets ca. 5,000 loci of 300-600 nt length sequenced with the latest short-read-sequencing (SRS) technology, has the potential to overcome this drawback. To illustrate the advantages of this approach we use the study group Aichryson Webb & Berthelott (Crassulaceae), a plant genus that diversified on the Canary Islands. The data analysis approach used here aims at a careful quality control of the long loci dataset. It involves an informed selection of thresholds for accurate clustering, a thorough exploration of locus properties, such as locus length, coverage and variability, to identify potentially biased data and a comparative phylogenetic inference of filtered datasets, accompanied by an evaluation of resulting BS support, gene and site concordance factor values, to improve overall resolution of the resulting phylogenetic trees. The final dataset contains variable loci with an average length of 373 nt and facilitates species tree estimation using a coalescent-based summary approach. Additional improvements brought by the approach are critically discussed.
9
Silva AS, Wilkinson M. On Defining and Finding Islands of Trees and Mitigating Large Island Bias. Syst Biol 2021; 70:1282-1294. [PMID: 33749752 PMCID: PMC8513764 DOI: 10.1093/sysbio/syab015] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/03/2020] [Accepted: 02/24/2021] [Indexed: 11/12/2022] Open
Abstract
How best can we summarize sets of phylogenetic trees? Systematists have relied heavily on consensus methods, but if tree distributions can be partitioned into distinct subsets, it may be helpful to provide separate summaries of these rather than relying entirely upon a single consensus tree. How sets of trees can most helpfully be partitioned and represented leads to many open questions, but one natural partitioning is provided by the islands of trees found during tree searches. Islands that are of dissimilar size have been shown to yield majority-rule consensus trees dominated by the largest sets. We illustrate this large island bias and approaches that mitigate its impact by revisiting a recent analysis of phylogenetic relationships of living and fossil amphibians. We introduce a revised definition of tree islands based on any tree-to-tree pairwise distance metric that usefully extends the notion to any set or multiset of trees, as might be produced by, for example, Bayesian or bootstrap methods, and that facilitates finding tree islands a posteriori. We extract islands from a tree distribution obtained in a Bayesian analysis of the amphibian data to investigate their impact in that context, and we compare the partitioning produced by tree islands with those resulting from some alternative approaches. Distinct subsets of trees, such as tree islands, should be of interest because of what they may reveal about evolution and/or our attempts to understand it, and are an important, sometimes overlooked, consideration when building and interpreting consensus trees. [Amphibia; Bayesian inference; consensus; parsimony; partitions; phylogeny; Chinlestegophis.]
Affiliation(s)
- Ana Serra Silva
  - Department of Life Sciences, The Natural History Museum, London SW7 5BD, UK
  - School of Earth Sciences, University of Bristol, Bristol BS8 1RL, UK
- Mark Wilkinson
  - Department of Life Sciences, The Natural History Museum, London SW7 5BD, UK
10
Farah IT, Islam MM, Zinat KT, Rahman AH, Bayzid MS. Species tree estimation from gene trees by minimizing deep coalescence and maximizing quartet consistency: a comparative study and the presence of pseudo species tree terraces. Syst Biol 2021; 70:1213-1231. [PMID: 33844023 DOI: 10.1093/sysbio/syab026] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 03/31/2020] [Revised: 03/25/2021] [Accepted: 03/29/2021] [Indexed: 11/14/2022] Open
Abstract
Species tree estimation from multi-locus datasets is extremely challenging, especially in the presence of gene tree heterogeneity across the genome due to incomplete lineage sorting (ILS). Summary methods have been developed which estimate gene trees and then combine the gene trees to estimate a species tree by optimizing various optimization scores. In this study, we have extended and adapted the concept of phylogenetic terraces to species tree estimation by "summarizing" a set of gene trees, where multiple species trees with distinct topologies may have exactly the same optimality score (i.e., quartet score, extra lineage score, etc.). We particularly investigated the presence and impacts of equally optimal trees in species tree estimation from multi-locus data using summary methods by taking ILS into account. We analyzed two of the most popular ILS-aware optimization criteria: maximize quartet consistency (MQC) and minimize deep coalescence (MDC). Methods based on MQC are provably statistically consistent, whereas MDC is not a consistent criterion for species tree estimation. We present a comprehensive comparative study of these two optimality criteria. Our experiments, on a collection of datasets simulated under ILS, indicate that MDC may result in competitive or identical quartet consistency score as MQC, but could be significantly worse than MQC in terms of tree accuracy - demonstrating the presence and impacts of equally optimal species trees. This is the first known study that provides the conditions for the datasets to have equally optimal trees in the context of phylogenomic inference using summary methods.
Affiliation(s)
- Ishrat Tanzila Farah
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology Dhaka-1205, Bangladesh
- Md Muktadirul Islam
  - Applied Statistics and Data Science (ASDS), Department of Statistics Jahangirnagar University Dhaka-1342, Bangladesh
- Kazi Tasnim Zinat
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology Dhaka-1205, Bangladesh
  - Department of Computer Science University of Maryland, College Park, Maryland, USA
- Atif Hasan Rahman
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology Dhaka-1205, Bangladesh
- Md Shamsuzzoha Bayzid
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology Dhaka-1205, Bangladesh
11
Collapsing dubiously resolved gene-tree branches in phylogenomic coalescent analyses. Mol Phylogenet Evol 2021; 158:107092. [PMID: 33545272 DOI: 10.1016/j.ympev.2021.107092] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 07/10/2020] [Revised: 12/30/2020] [Accepted: 01/28/2021] [Indexed: 01/15/2023]
Abstract
In two-step coalescent analyses of phylogenomic data, gene-tree topologies are treated as fixed prior to species-tree inference. Although all gene-tree conflict is assumed to be caused by lineage sorting when applying these methods, in empirical datasets much of the conflict can be caused by estimation error. Weakly supported and even arbitrarily resolved clades are important sources of this estimation error for gene trees inferred from few informative characters relative to the number of sampled terminals, and the resulting extraneous conflict among gene trees can negatively impact species-tree inference. In this study, we quantified the relative severity of alternative methods for collapsing gene-tree branches for seven empirical datasets and quantified their effects on species-tree inference. The branch-collapsing methods that we employed were based on the strict consensus of optimal topologies, various bootstrap thresholds, and 0% approximate likelihood ratio test (SH-like aLRT) support. Up to 86% of internal gene-tree branches are dubiously or arbitrarily resolved in reanalyses of these published phylogenomic datasets, and collapsing these branches increased inferred species-tree coalescent branch lengths by up to 455%. For two datasets, the longer inferred branch lengths sometimes impacted inference of anomaly-zone conditions. Although branch-collapsing methods did not consistently affect the species-tree topology, they often increased branch support. The more severe and clearly justified gene-tree branch-collapsing methods, which we recommend be broadly applied for two-step coalescent analyses, are use of the strict consensus in parsimony analyses and the collapse of clades with 0% SH-like aLRT support in likelihood analyses. Collapsing dubiously or arbitrarily resolved branches in gene trees sometimes improved congruence between coalescent-based results and concatenation trees.
In such cases, we contend that the resolution provided by concatenation should be preferred and that incomplete lineage sorting is a poor explanation for the initial conflict between phylogenetic approaches.
12
Shen XX, Li Y, Hittinger CT, Chen XX, Rokas A. An investigation of irreproducibility in maximum likelihood phylogenetic inference. Nat Commun 2020; 11:6096. [PMID: 33257660 PMCID: PMC7705714 DOI: 10.1038/s41467-020-20005-6] [Citation(s) in RCA: 35] [Impact Index Per Article: 7.0] [Received: 06/24/2020] [Accepted: 11/05/2020] [Indexed: 01/09/2023] Open
Abstract
Phylogenetic trees are essential for studying biology, but their reproducibility under identical parameter settings remains unexplored. Here, we find that 3515 (18.11%) IQ-TREE-inferred and 1813 (9.34%) RAxML-NG-inferred maximum likelihood (ML) gene trees are topologically irreproducible when executing two replicates (Run1 and Run2) for each of 19,414 gene alignments in 15 animal, plant, and fungal phylogenomic datasets. Notably, coalescent-based ASTRAL species phylogenies inferred from Run1 and Run2 sets of individual gene trees are topologically irreproducible for 9/15 phylogenomic datasets, whereas concatenation-based phylogenies inferred twice from the same supermatrix are reproducible. Our simulations further show that irreproducible phylogenies are more likely to be incorrect than reproducible phylogenies. These results suggest that a considerable fraction of single-gene ML trees may be irreproducible. Increasing reproducibility in ML inference will benefit from providing analyses’ log files, which contain typically reported parameters (e.g., program, substitution model, number of tree searches) but also typically unreported ones (e.g., random starting seed number, number of threads, processor type). Replicate runs of maximum likelihood phylogenetic analyses can generate different tree topologies due to differences in parameters, such as random seeds. Here, Shen et al. demonstrate that replicate runs can generate substantially different tree topologies even with identical data and parameters.
Affiliation(s)
- Xing-Xing Shen
- State Key Laboratory of Rice Biology, Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, 310058, Hangzhou, China; Institute of Insect Sciences, Zhejiang University, 310058, Hangzhou, China.
| | - Yuanning Li
- Department of Biological Sciences, Vanderbilt University, Nashville, TN, 37235, USA
| | - Chris Todd Hittinger
- Laboratory of Genetics, J. F. Crow Institute for the Study of Evolution, Wisconsin Energy Institute, Center for Genomic Science Innovation, University of Wisconsin-Madison, Madison, WI, 53706, USA; DOE Great Lakes Bioenergy Research Center, University of Wisconsin-Madison, Madison, WI, 53706, USA
| | - Xue-Xin Chen
- State Key Laboratory of Rice Biology, Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, 310058, Hangzhou, China; Institute of Insect Sciences, Zhejiang University, 310058, Hangzhou, China
| | - Antonis Rokas
- Department of Biological Sciences, Vanderbilt University, Nashville, TN, 37235, USA.
|
13
|
Ji X, Zhang Z, Holbrook A, Nishimura A, Baele G, Rambaut A, Lemey P, Suchard MA. Gradients Do Grow on Trees: A Linear-Time O(N)-Dimensional Gradient for Statistical Phylogenetics. Mol Biol Evol 2020; 37:3047-3060. [PMID: 32458974 PMCID: PMC7530611 DOI: 10.1093/molbev/msaa130] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Indexed: 01/21/2023] Open
Abstract
Calculation of the log-likelihood stands as the computational bottleneck for many statistical phylogenetic algorithms. Even worse is its gradient evaluation, often used to target regions of high probability. Order O(N)-dimensional gradient calculations based on the standard pruning algorithm require O(N²) operations, where N is the number of sampled molecular sequences. With the advent of high-throughput sequencing, recent phylogenetic studies have analyzed hundreds to thousands of sequences, with an apparent trend toward even larger data sets as a result of advancing technology. Such large-scale analyses challenge phylogenetic reconstruction by requiring inference on larger sets of process parameters to model the increasing data heterogeneity. To make these analyses tractable, we present a linear-time algorithm for O(N)-dimensional gradient evaluation and apply it to general continuous-time Markov processes of sequence substitution on a phylogenetic tree without a need to assume either stationarity or reversibility. We apply this approach to learn the branch-specific evolutionary rates of three pathogenic viruses: West Nile virus, Dengue virus, and Lassa virus. Our proposed algorithm significantly improves inference efficiency with a 126- to 234-fold increase in maximum-likelihood optimization and a 16- to 33-fold computational performance increase in a Bayesian framework.
Affiliation(s)
- Xiang Ji
- Department of Biomathematics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA
- Department of Mathematics, School of Science & Engineering, Tulane University, New Orleans, LA
| | - Zhenyu Zhang
- Department of Biostatistics, Fielding School of Public Health, University of California Los Angeles, Los Angeles, CA
| | - Andrew Holbrook
- Department of Biostatistics, Fielding School of Public Health, University of California Los Angeles, Los Angeles, CA
| | - Akihiko Nishimura
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD
| | - Guy Baele
- Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
| | - Andrew Rambaut
- Institute of Evolutionary Biology, Centre for Immunology, Infection and Evolution, University of Edinburgh, Edinburgh, United Kingdom
| | - Philippe Lemey
- Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
| | - Marc A Suchard
- Department of Biomathematics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA
- Department of Biostatistics, Fielding School of Public Health, University of California Los Angeles, Los Angeles, CA
- Department of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA
|
14
|
Evidence of absence treated as absence of evidence: The effects of variation in the number and distribution of gaps treated as missing data on the results of standard maximum likelihood analysis. Mol Phylogenet Evol 2020; 154:106966. [PMID: 32971285 DOI: 10.1016/j.ympev.2020.106966] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 04/13/2020] [Revised: 08/15/2020] [Accepted: 09/15/2020] [Indexed: 11/23/2022]
Abstract
Although numerous studies have demonstrated the theoretical and empirical importance of treating gaps as insertion/deletion (indel) events in phylogenetic analyses, the standard approach to maximum likelihood (ML) analysis employed in the vast majority of empirical studies codes gaps as nucleotides of unknown identity ("missing data"). Therefore, it is imperative to understand the empirical consequences of different numbers and distributions of gaps treated as missing data. We evaluated the effects of variation in the number and distribution of gaps (i.e., no base, coded as IUPAC "." or "-") treated as missing data (i.e., any base, coded as "?" or IUPAC "N") in standard ML analysis. We obtained alignments with variable numbers and arrangements of gaps by aligning seven diverse empirical datasets under different gap opening costs using MAFFT. We selected the optimal substitution model for each alignment using the corrected Akaike Information Criterion in jModelTest2 and searched for optimal trees using GARLI. We also employed a Monte Carlo approach to randomly replace nucleotides with gaps (treated as missing data) in an empirical dataset to understand more precisely the effects of varying their number and distribution. To compare alignments, we developed four new indices and used several existing measures to quantify the number and distribution of gaps in all alignments. Our most important finding is that ML scores correlate negatively with gap opening costs and the amount of missing data. However, this negative relationship is not due to the increase in missing data per se (which increases ML scores) but instead to the effect of gaps on nucleotide homology. These variables also cause significant but largely unpredictable effects on tree topology.
|
15
|
Jaramillo AF, De La Riva I, Guayasamin JM, Chaparro JC, Gagliardi-Urrutia G, Gutiérrez RC, Brcko I, Vilà C, Castroviejo-Fisher S. Vastly underestimated species richness of Amazonian salamanders (Plethodontidae: Bolitoglossa) and implications about plethodontid diversification. Mol Phylogenet Evol 2020; 149:106841. [PMID: 32305511 DOI: 10.1016/j.ympev.2020.106841] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 10/07/2019] [Revised: 04/10/2020] [Accepted: 04/13/2020] [Indexed: 11/29/2022]
Abstract
We present data showing that the number of salamander species in Amazonia is vastly underestimated. We used DNA sequences of up to five genes (3 mitochondrial and 2 nuclear) of 366 specimens, 189 corresponding to 89 non-Amazonian nominal species and 177 Amazonian specimens, including types or topotypes, of eight of the nine recognized species in the region. By including representatives of all known species of Amazonian Bolitoglossa, except for one, and 73% of the currently 132 recognized species of the genus, our dataset represents the broadest sample of Bolitoglossa species, specimens, and geographic localities studied to date. We performed phylogenetic analyses using parsimony with tree-alignment and maximum likelihood (ML) with similarity alignment, with indels as binary characters. Our optimal topologies were used to delimit lineages that we assigned to nominal species and candidate new species following criteria that maximize the consilience of the current species taxonomy, monophyly, gaps in branch lengths, genetic distances, and geographic distribution. We contrasted the results of our species-delimitation protocol with those of Automated Barcode Gap Discovery (ABGD) and multi-rate Poisson Tree Processes (mPTP). Finally, we inferred the historical biogeography of South American salamanders by dating the trees and using dispersal-vicariance analysis (DIVA). Our results revealed a clade including almost all Amazonian salamanders, with a topology incompatible with just the currently recognized nine species. Following our species-delimitation criteria, we identified 44 putative species in Amazonia. Both ABGD and mPTP inferred more species than currently recognized, but their numbers (23-49) and limits vary. Our biogeographic analysis suggested a stepping-stone colonization of the Amazonian lowlands from Central America through the Chocó and the Andes, with several late dispersals from Amazonia back into the Andes. 
These biogeographic events are temporally concordant with an early land bridge between Central and South America (~10-15 MYA) and major landscape changes in Amazonia during the late Miocene and Pliocene, such as the drainage of the Pebas system, the establishment of the Amazon River, and the major orogeny of the northern Andes.
Affiliation(s)
- Andrés F Jaramillo
- Pos-Graduação em Ecologia e Evolução da Biodiversidade, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Brazil; Laboratorio de Sistemática de Vertebrados, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Brazil.
| | - Juan M Guayasamin
- Laboratorio de Biología Evolutiva, Instituto BIOSFERA-USFQ, Colegio de Ciencias Biológicas y Ambientales COCIBA, Universidad San Francisco de Quito (USFQ), Ecuador; University of North Carolina at Chapel Hill, Department of Biology, USA
| | - Juan C Chaparro
- Museo de Biodiversidad del Perú (MUBI), Peru; Museo de Historia Natural de la Universidad Nacional de San Antonio Abad del Cusco, Peru
| | - Giussepe Gagliardi-Urrutia
- Pos-Graduação em Ecologia e Evolução da Biodiversidade, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Brazil; Laboratorio de Sistemática de Vertebrados, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Brazil; Peruvian Center for Biodiversity and Conservation (PCB&C), Peru; Dirección de Investigación en Diversidad Biológica Terrestre Amazónica, Instituto de Investigaciones de la Amazonía Peruana (IIAP), Peru
| | - Roberto C Gutiérrez
- Museo de Historia Natural de la Universidad Nacional de San Agustín de Arequipa (MUSA), Peru
| | - Isabela Brcko
- Laboratório de Biologia Molecular, Instituto de Ciências Biológicas, Universidade Federal do Pará (UFPA), Brazil
| | - Carles Vilà
- Estación Biológica de Doñana (EBD-CSIC), Spain
| | - Santiago Castroviejo-Fisher
- Pos-Graduação em Ecologia e Evolução da Biodiversidade, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Brazil; Laboratorio de Sistemática de Vertebrados, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Brazil; Department of Herpetology, American Museum of Natural History, USA
|
16
|
Simmons MP, Kessenich J. Divergence and support among slightly suboptimal likelihood gene trees. Cladistics 2019; 36:322-340. [DOI: 10.1111/cla.12404] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Accepted: 09/11/2019] [Indexed: 12/18/2022] Open
Affiliation(s)
- Mark P. Simmons
- Department of Biology, Colorado State University, Fort Collins, CO 80523‐1878, USA
| | - John Kessenich
- 305 W. Magnolia Street, PMB 134, Fort Collins, CO 80521, USA
|
17
|
Biczok R, Bozsoky P, Eisenmann P, Ernst J, Ribizel T, Scholz F, Trefzer A, Weber F, Hamann M, Stamatakis A. Two C++ libraries for counting trees on a phylogenetic terrace. Bioinformatics 2019; 34:3399-3401. [PMID: 29746618 PMCID: PMC6157082 DOI: 10.1093/bioinformatics/bty384] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 11/06/2017] [Accepted: 05/03/2018] [Indexed: 11/12/2022] Open
Abstract
Motivation The presence of terraces in phylogenetic tree space, i.e. a potentially large number of distinct tree topologies that have exactly the same analytical likelihood score, was first described by Sanderson et al. However, popular software tools for maximum likelihood and Bayesian phylogenetic inference do not yet routinely report whether inferred phylogenies reside on a terrace. We believe this is due to the lack of an efficient library to (i) determine if a tree resides on a terrace, (ii) calculate how many trees reside on a terrace and (iii) enumerate all trees on a terrace. Results In our bioinformatics practical, which is set up as a programming contest, we developed two efficient and independent C++ implementations of the SUPERB algorithm by Constantinescu and Sankoff (1995) for counting and enumerating trees on a terrace. Both implementations yield exactly the same results, are more than one order of magnitude faster, and require one order of magnitude less memory than a previous third-party Python implementation. Availability and implementation The source codes are available under GNU GPL at https://github.com/terraphast. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- R Biczok
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | - P Bozsoky
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | - P Eisenmann
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | - J Ernst
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | - T Ribizel
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | - F Scholz
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | - A Trefzer
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | - F Weber
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | - M Hamann
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | - A Stamatakis
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany; Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
|
18
|
Abstract
It has long been appreciated that analyses of genomic data (e.g., whole genome sequencing or sequence capture) have the potential to reveal the tree of life, but it remains challenging to move from sequence data to a clear understanding of evolutionary history, in part due to the computational challenges of phylogenetic estimation using genome-scale data. Supertree methods solve that challenge because they facilitate a divide-and-conquer approach for large-scale phylogeny inference by integrating smaller subtrees in a computationally efficient manner. Here, we combined information from sequence capture and whole-genome phylogenies using supertree methods. However, the available phylogenomic trees had limited overlap so we used taxon-rich (but not phylogenomic) megaphylogenies to weave them together. This allowed us to construct a phylogenomic supertree, with support values, that included 707 bird species (~7% of avian species diversity). We estimated branch lengths using mitochondrial sequence data and we used these branch lengths to estimate divergence times. Our time-calibrated supertree supports radiation of all three major avian clades (Palaeognathae, Galloanseres, and Neoaves) near the Cretaceous-Paleogene (K-Pg) boundary. The approach we used will permit the continued addition of taxa to this supertree as new phylogenomic data are published, and it could be applied to other taxa as well.
|
19
|
Dobrin BH, Zwickl DJ, Sanderson MJ. The prevalence of terraced treescapes in analyses of phylogenetic data sets. BMC Evol Biol 2018; 18:46. [PMID: 29618314 PMCID: PMC5885316 DOI: 10.1186/s12862-018-1162-9] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 07/29/2017] [Accepted: 03/22/2018] [Indexed: 11/21/2022] Open
Abstract
BACKGROUND The pattern of data availability in a phylogenetic data set may lead to the formation of terraces, collections of equally optimal trees. Terraces can arise in tree space if trees are scored with parsimony or with partitioned, edge-unlinked maximum likelihood. Theory predicts that terraces can be large, but their prevalence in contemporary data sets has never been surveyed. We selected 26 data sets and phylogenetic trees reported in recent literature and investigated the terraces to which the trees would belong, under a common set of inference assumptions. We examined terrace size as a function of the sampling properties of the data sets, including taxon coverage density (the proportion of taxon-by-gene positions with any data present) and a measure of gene sampling "sufficiency". We evaluated each data set in relation to the theoretical minimum gene sampling depth needed to reduce terrace size to a single tree, and explored the impact of the terraces found in replicate trees in bootstrap methods. RESULTS Terraces were identified in nearly all data sets with taxon coverage densities < 0.90. They were not found, however, in high-coverage-density (i.e., ≥ 0.94) transcriptomic and genomic data sets. The terraces could be very large, and size varied inversely with taxon coverage density and with gene sampling sufficiency. Few data sets achieved a theoretical minimum gene sampling depth needed to reduce terrace size to a single tree. Terraces found during bootstrap resampling reduced overall support. CONCLUSIONS If certain inference assumptions apply, trees estimated from empirical data sets often belong to large terraces of equally optimal trees. Terrace size correlates to data set sampling properties. Data sets seldom include enough genes to reduce terrace size to one tree. When bootstrap replicate trees lie on a terrace, statistical support for phylogenetic hypotheses may be reduced. 
Although some of the published analyses surveyed were conducted with edge-linked inference models (which do not induce terraces), unlinked models have been used and advocated. The present study describes the potential impact of that inference assumption on phylogenetic inference in the context of the kinds of multigene data sets now widely assembled for large-scale tree construction.
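The abstract above defines taxon coverage density as the proportion of taxon-by-gene positions with any data present. A minimal sketch of that calculation follows; the function name, the dictionary layout, and the toy data set are hypothetical and are not taken from the cited study.

```python
from typing import Dict, Set


def taxon_coverage_density(presence: Dict[str, Set[str]], taxa: Set[str]) -> float:
    """Return the fraction of taxon-by-gene cells that contain any data.

    `presence` maps each gene name to the set of taxa sampled for that gene
    (a hypothetical layout chosen for this illustration).
    """
    if not presence or not taxa:
        raise ValueError("need at least one gene and one taxon")
    # Count filled cells, ignoring any taxon not in the reference taxon set.
    filled = sum(len(sampled & taxa) for sampled in presence.values())
    return filled / (len(presence) * len(taxa))


# Hypothetical toy data set: 4 taxa, 3 genes, 9 of 12 cells filled.
taxa = {"A", "B", "C", "D"}
presence = {
    "gene1": {"A", "B", "C", "D"},
    "gene2": {"A", "B", "C"},
    "gene3": {"A", "D"},
}
density = taxon_coverage_density(presence, taxa)
print(f"{density:.2f}")  # prints 0.75
```

Under the thresholds reported in the abstract, a toy matrix with density 0.75 falls in the range (< 0.90) where terraces were found in nearly all surveyed data sets; the sketch only computes the density and does not detect terraces.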
Affiliation(s)
- Barbara H. Dobrin
- Department of Ecology and Evolutionary Biology, University of Arizona, 1041 E. Lowell St, Tucson, AZ 85721 USA
| | - Derrick J. Zwickl
- Department of Ecology and Evolutionary Biology, University of Arizona, 1041 E. Lowell St, Tucson, AZ 85721 USA
| | - Michael J. Sanderson
- Department of Ecology and Evolutionary Biology, University of Arizona, 1041 E. Lowell St, Tucson, AZ 85721 USA
|
20
|
Eiserhardt WL, Antonelli A, Bennett DJ, Botigué LR, Burleigh JG, Dodsworth S, Enquist BJ, Forest F, Kim JT, Kozlov AM, Leitch IJ, Maitner BS, Mirarab S, Piel WH, Pérez-Escobar OA, Pokorny L, Rahbek C, Sandel B, Smith SA, Stamatakis A, Vos RA, Warnow T, Baker WJ. A roadmap for global synthesis of the plant tree of life. Am J Bot 2018; 105:614-622. [PMID: 29603138 DOI: 10.1002/ajb2.1041] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 10/13/2017] [Accepted: 11/08/2017] [Indexed: 06/08/2023]
Abstract
Providing science and society with an integrated, up-to-date, high quality, open, reproducible and sustainable plant tree of life would be a huge service that is now coming within reach. However, synthesizing the growing body of DNA sequence data in the public domain and disseminating the trees to a diverse audience are often not straightforward due to numerous informatics barriers. While big synthetic plant phylogenies are being built, they remain static and become quickly outdated as new data are published and tree-building methods improve. Moreover, the body of existing phylogenetic evidence is hard to navigate and access for non-experts. We propose that our community of botanists, tree builders, and informaticians should converge on a modular framework for data integration and phylogenetic analysis, allowing easy collaboration, updating, data sourcing and flexible analyses. With support from major institutions, this pipeline should be re-run at regular intervals, storing trees and their metadata long-term. Providing the trees to a diverse global audience through user-friendly front ends and application development interfaces should also be a priority. Interactive interfaces could be used to solicit user feedback and thus improve data quality and to coordinate the generation of new data. We conclude by outlining a number of steps that we suggest the scientific community should take to achieve global phylogenetic synthesis.
Affiliation(s)
- Wolf L Eiserhardt
- Royal Botanic Gardens, Kew, TW9 3AE, Richmond, Surrey, UK
- Department of Bioscience, Aarhus University, Ny Munkegade 116, 8000, Aarhus C, Denmark
| | - Alexandre Antonelli
- Gothenburg Global Biodiversity Centre, Box 461, 405 30, Gothenburg, Sweden
- Department of Biological and Environmental Sciences, University of Gothenburg, Box 461, 405 30, Gothenburg, Sweden
- Gothenburg Botanical Garden, Carl Skottsbergs Gata 22B, SE-413 19, Gothenburg, Sweden
| | - Dominic J Bennett
- Gothenburg Global Biodiversity Centre, Box 461, 405 30, Gothenburg, Sweden
- Department of Biological and Environmental Sciences, University of Gothenburg, Box 461, 405 30, Gothenburg, Sweden
- Gothenburg Botanical Garden, Carl Skottsbergs Gata 22B, SE-413 19, Gothenburg, Sweden
| | - Brian J Enquist
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, 85721, USA
- The Santa Fe Institute, Santa Fe, NM, 87501, USA
| | - Félix Forest
- Royal Botanic Gardens, Kew, TW9 3AE, Richmond, Surrey, UK
| | - Jan T Kim
- Royal Botanic Gardens, Kew, TW9 3AE, Richmond, Surrey, UK
| | - Alexey M Kozlov
- Scientific Computing Group, Heidelberg Institute for Theoretical Studies, 69118, Heidelberg, Germany
| | - Ilia J Leitch
- Royal Botanic Gardens, Kew, TW9 3AE, Richmond, Surrey, UK
| | - Brian S Maitner
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, 85721, USA
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California, San Diego, San Diego, CA, 92093, USA
| | - William H Piel
- Yale-NUS College, 16 College Avenue West, Singapore, 138527, Republic of Singapore
| | - Lisa Pokorny
- Royal Botanic Gardens, Kew, TW9 3AE, Richmond, Surrey, UK
| | - Carsten Rahbek
- Center for Macroecology, Evolution and Climate, University of Copenhagen, Universitetsparken 15, DK-2100, Copenhagen O, Denmark
- Imperial College London, Silwood Park, Buckhurst Road, Ascot, Berkshire, SL5 7PY, UK
| | - Brody Sandel
- Department of Biology, Santa Clara University, Santa Clara, CA, 95053, USA
| | - Stephen A Smith
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI, 48109, USA
| | - Alexandros Stamatakis
- Scientific Computing Group, Heidelberg Institute for Theoretical Studies, 69118, Heidelberg, Germany
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, 76128, Karlsruhe, Germany
| | - Rutger A Vos
- Naturalis Biodiversity Center, P.O. Box 9517, 2300RA, Leiden, The Netherlands
- Institute of Biology Leiden, P.O. Box 9505, 2300RA, Leiden, The Netherlands
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
|
21
|
Beaulieu JM, O'Meara BC. Can we build it? Yes we can, but should we use it? Assessing the quality and value of a very large phylogeny of campanulid angiosperms. Am J Bot 2018; 105:417-432. [PMID: 29746717 DOI: 10.1002/ajb2.1020] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Received: 07/19/2017] [Accepted: 10/31/2017] [Indexed: 06/08/2023]
Abstract
PREMISE OF THE STUDY The study of very large and very old clades holds the promise of greater insights into evolution across the tree of life. However, there has been a fair amount of criticism regarding the interpretations and quality of studies to date, with some suggesting that detailed studies carried out on smaller, tractable scales should be preferred over the increasingly grand syntheses of these data. METHODS We provided in detail our trials and tribulations of compiling a large, sparsely sampled matrix from GenBank data and inferring a well-supported, time-calibrated phylogeny of Campanulidae. We also used a simulation approach to assess tree quality and to study the value of using very large, comprehensive phylogenies in a comparative context. KEY RESULTS A robust and well-supported phylogeny can be produced as long as automated procedures are supplemented with some human intervention. In the case of campanulids, the overall topology may be driven not only by particular genes, but also by particular sequences for a gene. We also determined that estimates of divergence times should be fairly robust to issues related to clade-specific heterogeneity. Finally, we demonstrated how relying on results from smaller, younger clades is prone to produce biased interpretations of tropical to temperate evolution across campanulids as a whole. CONCLUSIONS While we were both surprised and encouraged by the robust and fairly well-resolved, comprehensive phylogeny of campanulids, challenges still remain. Nevertheless, large phylogenies are inherently valuable in a comparative context if only to attenuate the issue of ascertainment bias.
Affiliation(s)
- Jeremy M Beaulieu
- Department of Biological Sciences, University of Arkansas, Fayetteville, Arkansas, 72701, USA
| | - Brian C O'Meara
- Department of Ecology and Evolutionary Biology, University of Tennessee, Knoxville, Tennessee, 37996-1610, USA
|
22
|
Title PO, Rabosky DL. Do Macrophylogenies Yield Stable Macroevolutionary Inferences? An Example from Squamate Reptiles. Syst Biol 2018; 66:843-856. [PMID: 27821703 DOI: 10.1093/sysbio/syw102] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 12/30/2015] [Accepted: 10/27/2016] [Indexed: 01/03/2023] Open
Abstract
Advances in the generation, retrieval, and analysis of phylogenetic data have enabled researchers to create phylogenies that contain many thousands of taxa. These "macrophylogenies" (large trees that typically derive from megaphylogeny, supermatrix, or supertree approaches) provide researchers with an unprecedented ability to conduct evolutionary analyses across broad phylogenetic scales. Many studies have now used these phylogenies to explore the dynamics of speciation, extinction, and phenotypic evolution across large swaths of the tree of life. These trees are characterized by substantial phylogenetic uncertainty on multiple levels, and the stability of macroevolutionary inferences from these data sets has not been rigorously explored. As a case study, we tested whether five recently published phylogenies for squamate reptiles, each consisting of more than 4000 species, yield congruent inferences about the processes that underlie variation in species richness across replicate evolutionary radiations of Australian snakes and lizards. We find discordance across the five focal phylogenies with respect to clade age and several diversification rate metrics, and in the effects of clade age on species richness. We also find that crown clade ages reported in the literature on these Australian groups are in conflict with all of the large phylogenies examined. Macrophylogenies offer an unprecedented opportunity to address evolutionary and ecological questions at broad phylogenetic scales, but accurately representing the uncertainty that is inherent to such analyses remains a critical challenge to our field. [Australia; macroevolution; macrophylogeny; squamates; time calibration.]
Affiliation(s)
- Pascal O Title
- Department of Ecology and Evolutionary Biology and Museum of Zoology, University of Michigan, Ann Arbor, MI 48109, USA
| | - Daniel L Rabosky
- Department of Ecology and Evolutionary Biology and Museum of Zoology, University of Michigan, Ann Arbor, MI 48109, USA
|
23
|
Sanderson MJ, Nicolae M, McMahon MM. Homology-Aware Phylogenomics at Gigabase Scales. Syst Biol 2018; 66:590-603. [PMID: 28123115 PMCID: PMC5790135 DOI: 10.1093/sysbio/syw104] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 05/19/2016] [Accepted: 11/25/2016] [Indexed: 11/13/2022] Open
Abstract
Obstacles to inferring species trees from whole genome data sets range from algorithmic and data management challenges to the wholesale discordance in evolutionary history found in different parts of a genome. Recent work that builds trees directly from genomes by parsing them into sets of small $k$-mer strings holds promise to streamline and simplify these efforts, but existing approaches do not account well for gene tree discordance. We describe a "seed and extend" protocol that finds nearly exact matching sets of orthologous $k$-mers and extends them to construct data sets that can properly account for genomic heterogeneity. Exploiting an efficient suffix array data structure, sets of whole genomes can be parsed and converted into phylogenetic data matrices rapidly, with contiguous blocks of $k$-mers from the same chromosome, gene, or scaffold concatenated as needed. Phylogenetic trees constructed from highly curated rice genome data and a diverse set of six other eukaryotic whole genome, transcriptome, and organellar genome data sets recovered trees nearly identical to published phylogenomic analyses, in a small fraction of the time, and requiring many fewer parameter choices. Our method's ability to retain local homology information was demonstrated by using it to characterize gene tree discordance across the rice genome, and by its robustness to the high rate of interchromosomal gene transfer found in several rice species.
Affiliation(s)
- M J Sanderson
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721, USA
| | - Marius Nicolae
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA
| | - M M McMahon
- School of Plant Sciences, University of Arizona, Tucson, AZ 85721, USA
| |
Collapse
|
24
|
Schliep K, Potts AJ, Morrison DA, Grimm GW. Intertwining phylogenetic trees and networks. Methods Ecol Evol 2017. [DOI: 10.1111/2041-210x.12760] [Citation(s) in RCA: 119] [Impact Index Per Article: 14.9] [Indexed: 11/30/2022]
|
25
|
An integrative systematic framework helps to reconstruct skeletal evolution of glass sponges (Porifera, Hexactinellida). Front Zool 2017; 14:18. [PMID: 28331531 PMCID: PMC5359874 DOI: 10.1186/s12983-017-0191-3] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 05/19/2016] [Accepted: 01/20/2017] [Indexed: 11/24/2022]
Abstract
Background: Glass sponges (Class Hexactinellida) are important components of deep-sea ecosystems and are of interest from geological and materials science perspectives. The reconstruction of their phylogeny with molecular data has only recently begun and shows a better agreement with morphology-based systematics than is typical for other sponge groups, likely because of a greater number of informative morphological characters. However, inconsistencies remain that have far-reaching implications for hypotheses about the evolution of their major skeletal construction types (body plans). Furthermore, less than half of all described extant genera have been sampled for molecular systematics, and several taxa important for understanding skeletal evolution are still missing. Increased taxon sampling for molecular phylogenetics of this group is therefore urgently needed. However, due to their remote habitat and often poorly preserved museum material, sequencing all 126 currently recognized extant genera will be difficult to achieve. Utilizing morphological data to incorporate unsequenced taxa into an integrative systematics framework therefore holds great promise, but it is unclear which methodological approach best suits this task.
Results: Here, we increase the taxon sampling of four previously established molecular markers (18S, 28S, and 16S ribosomal DNA, as well as cytochrome oxidase subunit I) by 12 genera, for the first time including representatives of the order Aulocalycoida and the type genus of Dactylocalycidae, taxa that are key to understanding hexactinellid body plan evolution. Phylogenetic analyses suggest that Aulocalycoida is diphyletic and provide further support for the paraphyly of order Hexactinosida; hence these orders are abolished from the Linnean classification. We further assembled morphological character matrices to integrate so far unsequenced genera into phylogenetic analyses in maximum parsimony (MP), maximum likelihood (ML), Bayesian, and morphology-based binning frameworks. We find that of these four approaches, total-evidence analysis using MP gave the most plausible results concerning congruence with existing phylogenetic and taxonomic hypotheses, whereas the other methods, especially ML and binning, performed more poorly. We use our total-evidence phylogeny of all extant glass sponge genera for ancestral state reconstruction of morphological characters in MP and ML frameworks, gaining new insights into the evolution of major hexactinellid body plans and other characters such as different spicule types.
Conclusions: Our study demonstrates how a comprehensive, albeit in some parts provisional, phylogeny of a larger taxon can be achieved with an integrative approach utilizing molecular and morphological data, and how this can be used as a basis for understanding phenotypic evolution. The datasets and associated trees presented here are intended as a resource and starting point for future work on glass sponge evolution.
Electronic supplementary material: The online version of this article (doi:10.1186/s12983-017-0191-3) contains supplementary material, which is available to authorized users.
|
26
|
St. John K. Review Paper: The Shape of Phylogenetic Treespace. Syst Biol 2017; 66:e83-e94. [PMID: 28173538 PMCID: PMC5837343 DOI: 10.1093/sysbio/syw025] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 11/20/2015] [Revised: 12/16/2015] [Accepted: 03/22/2016] [Indexed: 11/23/2022]
Abstract
Trees are a canonical structure for representing evolutionary histories. Many popular criteria used to infer optimal trees are computationally hard, and the number of possible tree shapes grows super-exponentially in the number of taxa. The underlying structure of the spaces of trees yields rich insights that can improve the search for optimal trees, both in accuracy and in running time, and the analysis and visualization of results. We review the past work on analyzing and comparing trees by their shape as well as recent work that incorporates trees with weighted branch lengths.
Affiliation(s)
- Katherine St. John
  - Department of Mathematics and Computer Science, Lehman College, NY 10034, USA
|
27
|
Eaton DAR, Spriggs EL, Park B, Donoghue MJ. Misconceptions on Missing Data in RAD-seq Phylogenetics with a Deep-scale Example from Flowering Plants. Syst Biol 2016; 66:399-412. [DOI: 10.1093/sysbio/syw092] [Citation(s) in RCA: 72] [Impact Index Per Article: 8.0] [Received: 03/29/2016] [Accepted: 10/10/2016] [Indexed: 01/08/2023]
|
28
|
Chernomor O, von Haeseler A, Minh BQ. Terrace Aware Data Structure for Phylogenomic Inference from Supermatrices. Syst Biol 2016; 65:997-1008. [PMID: 27121966 PMCID: PMC5066062 DOI: 10.1093/sysbio/syw037] [Citation(s) in RCA: 1093] [Impact Index Per Article: 121.4] [Received: 12/09/2015] [Revised: 04/18/2016] [Accepted: 04/19/2016] [Indexed: 11/13/2022]
Abstract
In phylogenomics the analysis of concatenated gene alignments, the so-called supermatrix, is commonly accompanied by the assumption of partition models. Under such models each gene, or more generally partition, is allowed to evolve under its own evolutionary model. Although partition models provide a more comprehensive analysis of supermatrices, missing data may hamper the tree search algorithms due to the existence of phylogenetic (partial) terraces. Here, we introduce the phylogenetic terrace aware (PTA) data structure for the efficient analysis under partition models. In the presence of missing data PTA exploits (partial) terraces and induced partition trees to save computation time. We show that an implementation of PTA in IQ-TREE leads to a substantial speedup of up to 4.5 and 8 times compared with the standard IQ-TREE and RAxML implementations, respectively. PTA is generally applicable to all types of partition models and common topological rearrangements thus can be employed by all phylogenomic inference software.
Affiliation(s)
- Olga Chernomor
  - Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna, Medical University of Vienna, A-1030 Vienna, Austria
- Arndt von Haeseler
  - Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna, Medical University of Vienna, A-1030 Vienna, Austria; Bioinformatics and Computational Biology, Faculty of Computer Science, University of Vienna, A-1090 Vienna, Austria
- Bui Quang Minh
  - Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna, Medical University of Vienna, A-1030 Vienna, Austria
|
29
|
Simmons MP, Sloan DB, Gatesy J. The effects of subsampling gene trees on coalescent methods applied to ancient divergences. Mol Phylogenet Evol 2016; 97:76-89. [PMID: 26768112 DOI: 10.1016/j.ympev.2015.12.013] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 10/09/2015] [Revised: 12/03/2015] [Accepted: 12/20/2015] [Indexed: 10/22/2022]
Abstract
Gene-tree-estimation error is a major concern for coalescent methods of phylogenetic inference. We sampled eight empirical studies of ancient lineages with diverse numbers of taxa and genes for which the original authors applied one or more coalescent methods. We found that the average pairwise congruence among gene trees varied greatly both between studies and also often within a study. We recommend that presenting plots of pairwise congruence among gene trees in a dataset be treated as a standard practice for empirical coalescent studies so that readers can readily assess the extent and distribution of incongruence among gene trees. ASTRAL-based coalescent analyses generally outperformed MP-EST and STAR with respect to both internal consistency (congruence between analyses of subsamples of genes with the complete dataset of all genes) and congruence with the concatenation-based topology. We evaluated the approach of subsampling gene trees that are, on average, more congruent with other gene trees as a method to reduce artifacts caused by gene-tree-estimation errors on coalescent analyses. We suggest that this method is well suited to testing whether gene-tree-estimation error is a primary cause of incongruence between concatenation- and coalescent-based results, to reconciling conflicting phylogenetic results based on different coalescent methods, and to identifying genes affected by artifacts that may then be targeted for reciprocal illumination. We provide scripts that automate the process of calculating pairwise gene-tree incongruence and subsampling trees while accounting for differential taxon sampling among genes. Finally, we assert that multiple tree-search replicates should be implemented as a standard practice for empirical coalescent studies that apply MP-EST.
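Pairwise congruence under differential taxon sampling can be sketched as follows. This is not the authors' script: the measure used here (one minus a normalized Robinson-Foulds distance, computed only on the taxa two gene trees share) is an assumption, as are the nested-tuple tree encoding and all function names.

```python
from itertools import combinations


def clusters(tree):
    """Return (leaf set, leaf sets of all internal nodes) for a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clusters(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def splits_on(tree, taxa):
    """Nontrivial splits of the unrooted tree that `tree` induces on `taxa`."""
    anchor = min(taxa)  # each split is stored as the side without the anchor
    splits = set()
    for cluster in clusters(tree)[1]:
        side = cluster & taxa
        other = taxa - side
        if len(side) >= 2 and len(other) >= 2:
            splits.add(other if anchor in side else side)
    return splits


def congruence(tree_a, tree_b):
    """Similarity in [0, 1] on shared taxa; None if the pair is uninformative."""
    shared = clusters(tree_a)[0] & clusters(tree_b)[0]
    if len(shared) < 4:  # fewer than four shared taxa carry no unrooted signal
        return None
    a, b = splits_on(tree_a, shared), splits_on(tree_b, shared)
    if not a and not b:
        return None
    return 1 - len(a ^ b) / (len(a) + len(b))


def pairwise_congruence(gene_trees):
    """Congruence for every unordered pair of gene trees, keyed by index pair."""
    return {
        (i, j): congruence(gene_trees[i], gene_trees[j])
        for i, j in combinations(range(len(gene_trees)), 2)
    }


# Toy gene trees; the third one lacks taxon E.
gene_trees = [
    (("A", "B"), ("C", ("D", "E"))),
    (("A", "B"), (("C", "D"), "E")),
    (("A", "C"), ("B", "D")),
]
scores = pairwise_congruence(gene_trees)
```

Averaging each tree's scores against all others gives a per-gene value that could be plotted or used to rank genes for subsampling, in the spirit of the recommendation in the abstract.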
Affiliation(s)
- Mark P Simmons
  - Department of Biology, Colorado State University, Fort Collins, CO 80523, USA
- Daniel B Sloan
  - Department of Biology, Colorado State University, Fort Collins, CO 80523, USA
- John Gatesy
  - Department of Biology, University of California, Riverside, CA 92521, USA
|
30
|
Chernomor O, Minh BQ, von Haeseler A. Consequences of Common Topological Rearrangements for Partition Trees in Phylogenomic Inference. J Comput Biol 2015; 22:1129-42. [PMID: 26448206 PMCID: PMC4663649 DOI: 10.1089/cmb.2015.0146] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 11/12/2022]
Abstract
In phylogenomic analysis the collection of trees with identical score (maximum likelihood or parsimony score) may hamper tree search algorithms. Such collections are coined phylogenetic terraces. For sparse supermatrices with a lot of missing data, the number of terraces and the number of trees on the terraces can be very large. If terraces are not taken into account, a lot of computation time might be unnecessarily spent to evaluate many trees that in fact have identical score. To save computation time during the tree search, it is worthwhile to quickly identify such cases. The score of a species tree is the sum of scores for all the so-called induced partition trees. Therefore, if the topological rearrangement applied to a species tree does not change the induced partition trees, the score of these partition trees is unchanged. Here, we provide the conditions under which the three most widely used topological rearrangements (nearest neighbor interchange, subtree pruning and regrafting, and tree bisection and reconnection) change the topologies of induced partition trees. During the tree search, these conditions allow us to quickly identify whether we can save computation time on the evaluation of newly encountered trees. We also introduce the concept of partial terraces and demonstrate that they occur more frequently than the original "full" terrace. Hence, partial terrace is the more important factor of timesaving compared to full terrace. Therefore, taking into account the above conditions and the partial terrace concept will help to speed up the tree search in phylogenomic inference.
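The central observation here, that two species trees score identically when they induce the same partition trees, can be checked directly on a toy case. The sketch below is an illustration under stated assumptions, not the paper's algorithm: trees are nested tuples of taxon names, induced partition trees are compared as sets of unrooted splits, and the names `induced_splits` and `same_terrace` are hypothetical.

```python
def clusters(tree):
    """Return (leaf set, leaf sets of all internal nodes) for a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clusters(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def induced_splits(tree, taxa):
    """Nontrivial splits of the unrooted partition tree induced by `taxa`."""
    taxa = frozenset(taxa)
    anchor = min(taxa)  # each split is stored as the side without the anchor
    splits = set()
    for cluster in clusters(tree)[1]:
        side = cluster & taxa
        other = taxa - side
        if len(side) >= 2 and len(other) >= 2:
            splits.add(other if anchor in side else side)
    return frozenset(splits)


def same_terrace(tree_a, tree_b, partitions):
    """True if both trees induce identical partition trees for every partition."""
    return all(
        induced_splits(tree_a, taxa) == induced_splits(tree_b, taxa)
        for taxa in partitions
    )


# Taxon coverage per partition: D and E are never sampled together.
partitions = [{"A", "B", "C", "D"}, {"A", "B", "C", "E"}]

tree_1 = (("A", "B"), ("C", ("D", "E")))
tree_2 = (("A", "B"), (("C", "D"), "E"))  # differs only in where D and E attach
tree_3 = (("A", "C"), ("B", ("D", "E")))  # changes the A/B/C/D relationship

print(same_terrace(tree_1, tree_2, partitions))
print(same_terrace(tree_1, tree_3, partitions))
```

Because `tree_1` and `tree_2` differ only in the relative placement of D and E, which no partition samples together, a search that moves between them could reuse the partition scores instead of recomputing them.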
Affiliation(s)
- Olga Chernomor
  - Max F. Perutz Laboratories, Center for Integrative Bioinformatics Vienna, University of Vienna, Vienna, Austria; Bioinformatics and Computational Biology, Faculty of Computer Science, University of Vienna, Vienna, Austria
- Bui Quang Minh
  - Max F. Perutz Laboratories, Center for Integrative Bioinformatics Vienna, University of Vienna, Vienna, Austria
- Arndt von Haeseler
  - Max F. Perutz Laboratories, Center for Integrative Bioinformatics Vienna, University of Vienna, Vienna, Austria; Bioinformatics and Computational Biology, Faculty of Computer Science, University of Vienna, Vienna, Austria
|
31
|
Abstract
Inference of phylogenetic trees under the maximum likelihood (ML) criterion represents a routine task in biological data analysis. In this unit we describe how to plan analyses and use Randomized Accelerated Maximum Likelihood (RAxML) for phylogenetic inferences under ML, how to infer support values using the standard bootstrap procedure as well as other statistical measures, and how to conduct post-analyses on collections/sets of phylogenetic trees including statistical significance tests and consensus tree methods. We also discuss what measures can be taken and what further analyses can be conducted when relationships in the inferred tree exhibit "low" support.
Affiliation(s)
- Alexandros Stamatakis
  - Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany; Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
|