1. He J, Li M, Wu H, Cheng J, Xie L. Unraveling the Ancient Introgression History of Xanthoceras (Sapindaceae): Insights from Phylogenomic Analysis. Int J Mol Sci 2025; 26:1581. [PMID: 40004047 PMCID: PMC11855356 DOI: 10.3390/ijms26041581] [Received: 01/07/2025] [Revised: 02/07/2025] [Accepted: 02/10/2025] [Indexed: 02/27/2025]
Abstract
Ancient introgression is an infrequent evolutionary process often associated with conflicts between nuclear and organellar phylogenies. Determining whether such conflicts arise from introgression, incomplete lineage sorting (ILS), or other processes is essential to understanding plant diversification. Previous studies have reported phylogenetic discordance in the placement of Xanthoceras, but its causes remain unclear. Here, we analyzed transcriptome data from 41 Sapindaceae samples to reconstruct phylogenies and investigate this discordance. While nuclear phylogenies consistently placed Xanthoceras as sister to subfam. Hippocastanoideae, plastid data positioned it as the earliest-diverging lineage within Sapindaceae. Our coalescent simulations suggest that this cyto-nuclear discordance is unlikely to be explained by ILS alone. HyDe and PhyloNet analyses provided strong evidence that Xanthoceras experienced ancient introgression, incorporating approximately 16% of its genetic material from ancestral subfam. Sapindoideae lineages. Morphological traits further support this evolutionary history, reflecting characteristics of both contributing subfamilies. Likely occurring during the Paleogene, this introgression represents a rare instance of cross-subfamily gene flow shaping the evolutionary trajectory of a major plant lineage. Our findings clarify the evolutionary history of Xanthoceras and underscore the role of ancient introgression in driving phylogenetic conflicts, offering a rare example of introgression-driven diversification in angiosperms.
Affiliation(s)
- Jian He
- Correspondence: (J.H.); (L.X.)
- Lei Xie
- State Key Laboratory of Efficient Production of Forest Resources, Beijing Forestry University, Beijing 100083, China; (M.L.); (H.W.); (J.C.)
2. Piñeiro C, Pichel JC. Efficient phylogenetic tree inference for massive taxonomic datasets: harnessing the power of a server to analyze 1 million taxa. Gigascience 2024; 13:giae055. [PMID: 39115958 PMCID: PMC11308190 DOI: 10.1093/gigascience/giae055] [Received: 10/23/2023] [Revised: 04/17/2024] [Accepted: 07/11/2024] [Indexed: 08/10/2024]
Abstract
BACKGROUND: Phylogenies play a crucial role in biological research. Unfortunately, the search for the optimal phylogenetic tree incurs significant computational costs, and most of the existing state-of-the-art tools cannot deal with extremely large datasets in reasonable times.
RESULTS: In this work, we introduce the new VeryFastTree code (version 4.0), which is able to construct a tree on 1 server using single-precision arithmetic from a massive 1 million alignment dataset in only 36 hours, which is 3 times and 3.2 times faster than its previous version and FastTree-2, respectively. This new version further boosts performance by parallelizing all tree traversal operations during the tree construction process, including subtree pruning and regrafting moves. Additionally, it introduces significant new features such as support for new and compressed file formats, enhanced compatibility across a broader range of operating systems, and the integration of disk computing functionality. The latter feature is particularly advantageous for users without access to high-end servers, as it allows them to manage very large datasets, albeit with an increase in computing time.
CONCLUSIONS: Experimental results establish VeryFastTree as the fastest tool in the state-of-the-art for maximum likelihood phylogeny estimation. It is publicly available at https://github.com/citiususc/veryfasttree. In addition, VeryFastTree is included as a package in Bioconda, MacPorts, and all Debian-based Linux distributions.
Affiliation(s)
- César Piñeiro
- Information Retrieval Lab, CITIC, Universidade da Coruña, A Coruña 15008, Spain
- Juan C Pichel
- CiTIUS, Universidade de Santiago de Compostela, Santiago de Compostela 15782, Spain
3. Wang Z, Sun J, Gao Y, Xue Y, Zhang Y, Li K, Zhang W, Zhang C, Zu J, Zhang L. Fusang: a framework for phylogenetic tree inference via deep learning. Nucleic Acids Res 2023; 51:10909-10923. [PMID: 37819036 PMCID: PMC10639059 DOI: 10.1093/nar/gkad805] [Received: 01/10/2023] [Revised: 08/17/2023] [Accepted: 09/20/2023] [Indexed: 10/13/2023]
Abstract
Phylogenetic tree inference is a classic fundamental task in evolutionary biology that entails inferring the evolutionary relationship of targets based on multiple sequence alignment (MSA). Maximum likelihood (ML) and Bayesian inference (BI) methods have dominated phylogenetic tree inference for many years, but BI is too slow to handle a large number of sequences. Recently, deep learning (DL) has been successfully applied to quartet phylogenetic tree inference and tentatively extended into more sequences with the quartet puzzling algorithm. However, no DL-based tools are immediately available for practical real-world applications. In this paper, we propose Fusang (http://fusang.cibr.ac.cn), a DL-based framework that achieves comparable performance to that of ML-based tools with both simulated and real datasets. More importantly, with continuous optimization, e.g. through the use of customized training datasets for real-world scenarios, Fusang has great potential to outperform ML-based tools.
Affiliation(s)
- Zhicheng Wang
- Chinese Institute for Brain Research, Beijing 102206, China
- Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
- Jinnan Sun
- School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an 710049, China
- Yuan Gao
- Chinese Institute for Brain Research, Beijing 102206, China
- Yongwei Xue
- Chinese Institute for Brain Research, Beijing 102206, China
- Yubo Zhang
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
- Kuan Li
- Chinese Institute for Brain Research, Beijing 102206, China
- Wei Zhang
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
- State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Peking University, Beijing 100871, China
- Chi Zhang
- Key Laboratory of Vertebrate Evolution and Human Origins, Institute of Vertebrate Paleontology and Paleoanthropology, Center for Excellence in Life and Paleoenvironment, Chinese Academy of Sciences, Beijing 100044, China
- Jian Zu
- School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an 710049, China
- Li Zhang
- Chinese Institute for Brain Research, Beijing 102206, China
4. Aizenbud Y, Jaffe A, Wang M, Hu A, Amsel N, Nadler B, Chang JT, Kluger Y. Spectral top-down recovery of latent tree models. Information and Inference: A Journal of the IMA 2023; 12:iaad032. [PMID: 37593361 PMCID: PMC10431953 DOI: 10.1093/imaiai/iaad032] [Received: 12/10/2021] [Revised: 03/24/2023] [Accepted: 06/24/2023] [Indexed: 08/19/2023]
Abstract
Modeling the distribution of high-dimensional data by a latent tree graphical model is a prevalent approach in multiple scientific domains. A common task is to infer the underlying tree structure, given only observations of its terminal nodes. Many algorithms for tree recovery are computationally intensive, which limits their applicability to trees of moderate size. For large trees, a common approach, termed divide-and-conquer, is to recover the tree structure in two steps. First, separately recover the structure of multiple, possibly random subsets of the terminal nodes. Second, merge the resulting subtrees to form a full tree. Here, we develop spectral top-down recovery (STDR), a deterministic divide-and-conquer approach to infer large latent tree models. Unlike previous methods, STDR partitions the terminal nodes in a non-random way, based on the Fiedler vector of a suitable Laplacian matrix related to the observed nodes. We prove that under certain conditions, this partitioning is consistent with the tree structure. This, in turn, leads to a significantly simpler merging procedure for the small subtrees. We prove that STDR is statistically consistent and bound the number of samples required to accurately recover the tree with high probability. Using simulated data from several common tree models in phylogenetics, we demonstrate that STDR has a significant advantage in terms of runtime, with improved or similar accuracy.
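The partitioning step described in this abstract can be illustrated with a small sketch. It is not the authors' STDR implementation: the abstract does not say which Laplacian is used, so the code below assumes an unnormalized graph Laplacian built from a hypothetical pairwise similarity matrix between terminal nodes, approximates the Fiedler vector by shifted power iteration, and splits the nodes by sign. The function name `fiedler_partition` and the toy similarity values are illustrative assumptions.

```python
import math


def fiedler_partition(similarity, iterations=500):
    """Split node indices in two by the sign of an approximate Fiedler vector.

    `similarity` is a symmetric matrix of non-negative pairwise similarities
    (an assumed input; the paper's actual matrix may differ).
    """
    n = len(similarity)
    # Unnormalized graph Laplacian L = D - W, ignoring self-similarity.
    degree = [sum(row) - row[i] for i, row in enumerate(similarity)]
    lap = [[degree[i] if i == j else -similarity[i][j] for j in range(n)]
           for i in range(n)]
    # All eigenvalues of L lie in [0, 2 * max(degree)], so M = c*I - L is
    # positive definite and its top eigenvectors are L's bottom ones.
    c = 2.0 * max(degree) + 1.0
    # Deterministic, non-constant starting vector.
    v = [i - (n - 1) / 2.0 for i in range(n)]
    for _ in range(iterations):
        # Remove the constant component (eigenvalue 0 of L) so that the
        # iteration converges to the second-smallest eigenvector instead.
        mean = sum(v) / n
        v = [x - mean for x in v]
        w = [c * v[i] - sum(lap[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    left = [i for i, x in enumerate(v) if x < 0]
    right = [i for i, x in enumerate(v) if x >= 0]
    return left, right


# Toy example: nodes 0-2 and nodes 3-5 form two tightly related groups.
toy = [[0.9 if (i < 3) == (j < 3) else 0.1 for j in range(6)]
       for i in range(6)]
print(fiedler_partition(toy))
```

In a divide-and-conquer pipeline this split would be applied recursively until the subsets are small enough for a base tree-recovery method; that recursion and the merging step are omitted here.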
Affiliation(s)
- Yariv Aizenbud
- Program in Applied Mathematics, Yale University, New Haven, CT 06511, USA
- Ariel Jaffe
- Program in Applied Mathematics, Yale University, New Haven, CT 06511, USA
- Meng Wang
- Department of Pathology, Yale University, New Haven, CT 06511, USA
- Amber Hu
- Program in Applied Mathematics, Yale University, New Haven, CT 06511, USA
- Noah Amsel
- Program in Applied Mathematics, Yale University, New Haven, CT 06511, USA
- Boaz Nadler
- Department of Computer Science, Weizmann Institute of Science, Rehovot 76100, Israel
- Joseph T Chang
- Department of Statistics, Yale University, New Haven, CT 06520, USA
- Yuval Kluger
- Program in Applied Mathematics, Yale University, New Haven, CT 06511, USA
- Department of Pathology, Yale University, New Haven, CT 06511, USA
- Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511, USA
5. Zaharias P, Warnow T. Recent progress on methods for estimating and updating large phylogenies. Philos Trans R Soc Lond B Biol Sci 2022; 377:20210244. [PMID: 35989607 PMCID: PMC9393559 DOI: 10.1098/rstb.2021.0244] [Received: 10/12/2021] [Accepted: 01/07/2022] [Indexed: 12/20/2022]
Abstract
With the increased availability of sequence data and even of fully sequenced and assembled genomes, phylogeny estimation of very large trees (even of hundreds of thousands of sequences) is now a goal for some biologists. Yet, the construction of these phylogenies is a complex pipeline presenting analytical and computational challenges, especially when the number of sequences is very large. In the past few years, new methods have been developed that aim to enable highly accurate phylogeny estimations on these large datasets, including divide-and-conquer techniques for multiple sequence alignment and/or tree estimation, methods that can estimate species trees from multi-locus datasets while addressing heterogeneity due to biological processes (e.g. incomplete lineage sorting and gene duplication and loss), and methods to add sequences into large gene trees or species trees. Here we present some of these recent advances and discuss opportunities for future improvements. This article is part of a discussion meeting issue 'Genomic population structures of microbial pathogens'.
Affiliation(s)
- Paul Zaharias
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Tandy Warnow
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
6.
Abstract
The estimation of phylogenetic trees for individual genes or multi-locus datasets is a basic part of considerable biological research. In order to enable large trees to be computed, Disjoint Tree Mergers (DTMs) have been developed; these methods operate by dividing the input sequence dataset into disjoint sets, constructing trees on each subset, and then combining the subset trees (using auxiliary information) into a tree on the full dataset. DTMs have been used to advantage for multi-locus species tree estimation, enabling highly accurate species trees at reduced computational effort, compared to leading species tree estimation methods. Here, we evaluate the feasibility of using DTMs to improve the scalability of maximum likelihood (ML) gene tree estimation to large numbers of input sequences. Our study shows distinct differences between the three selected ML codes—RAxML-NG, IQ-TREE 2, and FastTree 2—and shows that good DTM pipeline design can provide advantages over these ML codes on large datasets.
7. Matsumoto H, Mimori T, Fukunaga T. Novel metric for hyperbolic phylogenetic tree embeddings. Biol Methods Protoc 2021; 6:bpab006. [PMID: 33928190 PMCID: PMC8058397 DOI: 10.1093/biomethods/bpab006] [Received: 11/11/2020] [Revised: 03/19/2021] [Accepted: 03/23/2021] [Indexed: 01/09/2023]
Abstract
Advances in experimental technologies, such as DNA sequencing, have opened up new avenues for the applications of phylogenetic methods to various fields beyond their traditional application in evolutionary investigations, extending to the fields of development, differentiation, cancer genomics, and immunogenomics. Thus, the importance of phylogenetic methods is increasingly being recognized, and the development of a novel phylogenetic approach can contribute to several areas of research. Recently, the use of hyperbolic geometry has attracted attention in artificial intelligence research. Hyperbolic space can better represent a hierarchical structure compared to Euclidean space, and can therefore be useful for describing and analyzing a phylogenetic tree. In this study, we developed a novel metric that considers the characteristics of a phylogenetic tree for representation in hyperbolic space. We compared the performance of the proposed hyperbolic embeddings, general hyperbolic embeddings, and Euclidean embeddings, and confirmed that our method could be used to more precisely reconstruct evolutionary distance. We also demonstrate that our approach is useful for predicting the nearest-neighbor node in a partial phylogenetic tree with missing nodes. Furthermore, we proposed a novel approach based on our metric to integrate multiple trees for analyzing tree nodes or imputing missing distances. This study highlights the utility of adopting a geometric approach for further advancing the applications of phylogenetic methods.
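For readers unfamiliar with hyperbolic embeddings, the sketch below computes the standard distance in the Poincaré ball model, which is the kind of geometry that general hyperbolic embeddings typically use. It is not the novel metric proposed in this paper, whose definition the abstract does not give; the function name `poincare_distance` and the sample points are illustrative assumptions.

```python
import math


def poincare_distance(u, v):
    """Standard Poincaré-ball distance between two points inside the unit ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# Distance from the origin to (0.5, 0) equals ln(3), about 1.0986.
print(poincare_distance((0.0, 0.0), (0.5, 0.0)))
# Two points near opposite edges of the ball are much farther apart
# than their Euclidean separation of 1.8 suggests.
print(poincare_distance((0.9, 0.0), (-0.9, 0.0)))
```

Distances grow rapidly toward the boundary of the ball, which is why tree-like hierarchies, whose node counts grow with depth, tend to fit more naturally in hyperbolic than in Euclidean space.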
Affiliation(s)
- Hirotaka Matsumoto
- School of Information and Data Sciences, Nagasaki University, Nagasaki, Japan
- Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research, Saitama, Japan
- Takahiro Mimori
- Medical Image Analysis Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Tsukasa Fukunaga
- Department of Computer Science, Graduate School of Information Science and Engineering, The University of Tokyo, Tokyo, Japan
8. Warnow T, Mirarab S. Multiple Sequence Alignment for Large Heterogeneous Datasets Using SATé, PASTA, and UPP. Methods Mol Biol 2021; 2231:99-119. [PMID: 33289889 DOI: 10.1007/978-1-0716-1036-7_7] [Indexed: 06/12/2023]
Abstract
The estimation of very large multiple sequence alignments is a challenging problem that requires special techniques in order to achieve high accuracy. Here we describe two software packages, PASTA and UPP, for constructing alignments on large and ultra-large datasets. Both methods have been able to produce highly accurate alignments on 1,000,000 sequences, and trees computed on these alignments are also highly accurate. PASTA provides the best tree accuracy when the input sequences are all full-length, but UPP provides improved accuracy compared to PASTA and other methods when the input contains a large number of fragmentary sequences. Both methods are available in open source form on GitHub.
Affiliation(s)
- Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Siavash Mirarab
- Electrical and Computer Engineering, University of California at San Diego, La Jolla, CA, USA
9. Le T, Sy A, Molloy EK, Zhang Q, Rao S, Warnow T. Using Constrained-INC for Large-Scale Gene Tree and Species Tree Estimation. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2-15. [PMID: 32750844 DOI: 10.1109/tcbb.2020.2990867] [Indexed: 06/11/2023]
Abstract
Incremental tree building (INC) is a new phylogeny estimation method that has been proven to be absolute fast converging under standard sequence evolution models. A variant of INC, called Constrained-INC, is designed for use in divide-and-conquer pipelines for phylogeny estimation where a set of species is divided into disjoint subsets, trees are computed on the subsets using a selected base method, and then the subset trees are combined together. We evaluate the accuracy of INC and Constrained-INC for gene tree and species tree estimation on simulated datasets, and compare them to similar pipelines using NJMerge (another method that merges disjoint trees). For gene tree estimation, we find that INC has very poor accuracy in comparison to standard methods, and even Constrained-INC (using maximum likelihood methods to compute constraint trees) does not match the accuracy of the better maximum likelihood methods. Results for species trees are somewhat different, with Constrained-INC coming close to the accuracy of the best species tree estimation methods, while being much faster; furthermore, using Constrained-INC allows species tree estimation methods to scale to large datasets within limited computational resources. Overall, this study exposes the benefits and limitations of divide-and-conquer strategies for large-scale phylogenetic tree estimation.
10. Smirnov V, Warnow T. Phylogeny Estimation Given Sequence Length Heterogeneity. Syst Biol 2020; 70:268-282. [PMID: 32692823 PMCID: PMC7875441 DOI: 10.1093/sysbio/syaa058] [Received: 12/10/2019] [Revised: 07/14/2020] [Accepted: 07/15/2020] [Indexed: 12/21/2022]
Abstract
Phylogeny estimation is a major step in many biological studies, and has many well known challenges. With the dropping cost of sequencing technologies, biologists now have increasingly large datasets available for use in phylogeny estimation. Here we address the challenge of estimating a tree given large datasets with a combination of full-length sequences and fragmentary sequences, which can arise due to a variety of reasons, including sample collection, sequencing technologies, and analytical pipelines. We compare two basic approaches: (1) computing an alignment on the full dataset and then computing a maximum likelihood tree on the alignment, or (2) constructing an alignment and tree on the full length sequences and then using phylogenetic placement to add the remaining sequences (which will generally be fragmentary) into the tree. We explore these two approaches on a range of simulated datasets, each with 1000 sequences and varying in rates of evolution, and two biological datasets. Our study shows some striking performance differences between methods, especially when there is substantial sequence length heterogeneity and high rates of evolution. We find in particular that using UPP to align sequences and RAxML to compute a tree on the alignment provides the best accuracy, substantially outperforming trees computed using phylogenetic placement methods. We also find that FastTree has poor accuracy on alignments containing fragmentary sequences. Overall, our study provides insights into the literature comparing different methods and pipelines for phylogenetic estimation, and suggests directions for future method development. [Phylogeny estimation, sequence length heterogeneity, phylogenetic placement.]
Affiliation(s)
- Vladimir Smirnov
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
11.
Abstract
Background: Phylogeny estimation is an important part of much biological research, but large-scale tree estimation is infeasible using standard methods due to computational issues. Recently, an approach to large-scale phylogeny has been proposed that divides a set of species into disjoint subsets, computes trees on the subsets, and then merges the trees together using a computed matrix of pairwise distances between the species. The novel component of these approaches is the last step: Disjoint Tree Merger (DTM) methods.
Results: We present GTM (Guide Tree Merger), a polynomial time DTM method that adds edges to connect the subset trees, so as to provably minimize the topological distance to a computed guide tree. Thus, GTM performs unblended mergers, unlike the previous DTM methods. Yet, despite the potential limitation, our study shows that GTM has excellent accuracy, generally matching or improving on two previous DTMs, and is much faster than both.
Conclusions: The proposed GTM approach to the DTM problem is a useful new tool for large-scale phylogenomic analysis, and shows the surprising potential for unblended DTM methods.
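The abstract refers to minimizing a "topological distance" to a guide tree without naming the measure. As an illustration only, the sketch below computes the Robinson-Foulds distance, one widely used topological distance, between two trees on the same leaf set. The choice of this measure is an assumption, the nested-tuple tree encoding and function names are hypothetical, and the code is not the GTM merging algorithm itself.

```python
def splits(tree):
    """Return the non-trivial bipartitions (splits) of the leaf set.

    A tree is a nested tuple whose leaves are strings, for example
    (("A", "B"), ("C", "D"), "E"). Each internal edge separates the leaves
    into two sides; only splits with at least two leaves per side are kept.
    """
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - clade
        if len(clade) >= 2 and len(other) >= 2:
            # Store the split as an unordered pair of sides, so the same
            # edge is recognized regardless of where the tree is rooted.
            found.add(frozenset([clade, other]))
        return clade

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


guide = (("A", "B"), ("C", "D"), "E")
candidate = (("A", "C"), ("B", "D"), "E")
print(rf_distance(guide, guide))      # identical topologies
print(rf_distance(guide, candidate))  # no shared internal splits
```

A merger that targets a guide tree would, under this assumption, prefer the way of joining the subset trees whose result has the smallest such distance to the guide tree.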
Affiliation(s)
- Vladimir Smirnov
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N Goodwin Ave, Urbana, 61801, IL, US
- Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N Goodwin Ave, Urbana, 61801, IL, US