1
Jin XJ, Yu Y, Lin HY, Liu FL, Wang HF, Ma Q, Chen Y, Zhang YH, Li P. Revisiting the backbone phylogeny and inferring the evolutionary trends in inflorescence of Elsholtzieae (Lamiaceae): new insights from orthologous nuclear genes. Cladistics 2025; 41:157-176. [PMID: 39966307 DOI: 10.1111/cla.12604] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2023] [Revised: 12/08/2024] [Accepted: 12/11/2024] [Indexed: 02/20/2025] Open
Abstract
The angiosperm tribe of Elsholtzieae (Lamiaceae) is characterized by complex inflorescences and has notable medicinal and economic significance. Relationships within Elsholtzieae, including the monophyly of Elsholtzia and Keiskea, and relationships among Mosla, Keiskea and Perilla, remain uncertain, hindering insights into inflorescence evolution within the tribe. Using hybridization capture sequencing and deep genome skimming data analysis, we reconstruct a phylogeny of Elsholtzieae using 279 orthologous nuclear loci from 56 species. We evaluated uncertainty among relationships using concatenation, coalescent and network approaches. Using a time-calibrated phylogeny, we reconstructed ancestral inflorescence traits to elucidate the patterns in their evolution within the tribe. Our analyses consistently support the paraphyly of the genus Elsholtzia. Phylogenetic network analyses, confirmed by PhyloNetworks and SplitsTree, showed reticulation events among the major lineages of Elsholtzieae. The unstable polyphyly of Keiskea observed in ASTRAL (accurate species tree algorithm), ML (maximum likelihood) and MP (maximum parsimony) analyses may be related to introgression from Perilla and Mosla. Based on the analyses of phylogenetic trees within Elsholtzieae, the evolutionary trajectory of inflorescences demonstrates a pattern of diversification, with specialization as one aspect of this process. Elsholtzieae support the hypothesis that compressed inflorescences evolved from larger and more complex ancestral forms through successive compressions of the inflorescence axis. Additionally, certain lineages within the tribe display a trend towards simplified inflorescences, characterized by a reduction in the number of florets. This highlights both the specialization and the diversity in the evolution of inflorescence structures within the tribe.
Affiliation(s)
- Xin-Jie Jin
  - College of Life and Environmental Science, Wenzhou University, Wenzhou, 325035, Zhejiang, China
  - Shanghai Key Laboratory of Plant Functional Genomics and Resources, Shanghai Chenshan Botanical Garden, Shanghai, 201602, China
  - Institute for Eco-environmental Research of Sanyang Wetland, Wenzhou University, Wenzhou, 325014, Zhejiang, China
- Yan Yu
  - Key Laboratory of Bio-Resources and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, 610065, Sichuan, China
- Han-Yang Lin
  - Zhejiang Provincial Key Laboratory of Plant Evolutionary Ecology and Conservation, School of Life Sciences, Taizhou University, Taizhou, 318000, Zhejiang, China
- Feng-Luan Liu
  - Shanghai Key Laboratory of Plant Functional Genomics and Resources, Shanghai Chenshan Botanical Garden, Shanghai, 201602, China
- Hai-Feng Wang
  - College of Life and Environmental Science, Wenzhou University, Wenzhou, 325035, Zhejiang, China
- Qing Ma
  - College of Biology and Environmental Engineering, Zhejiang Shuren University, Hangzhou, 310015, Zhejiang, China
- Yang Chen
  - Laboratory of Systematic and Evolutionary Botany and Biodiversity, College of Life Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
- Yong-Hua Zhang
  - College of Life and Environmental Science, Wenzhou University, Wenzhou, 325035, Zhejiang, China
  - Institute for Eco-environmental Research of Sanyang Wetland, Wenzhou University, Wenzhou, 325014, Zhejiang, China
- Pan Li
  - Laboratory of Systematic and Evolutionary Botany and Biodiversity, College of Life Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
2
Collienne L, Barker M, Suchard MA, Matsen FA. Phylogenetic Tree Instability After Taxon Addition: Empirical Frequency, Predictability, and Consequences For Online Inference. Syst Biol 2025; 74:101-111. [PMID: 39453463 PMCID: PMC11809580 DOI: 10.1093/sysbio/syae059] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2024] [Revised: 09/30/2024] [Accepted: 10/22/2024] [Indexed: 10/26/2024] Open
Abstract
Online phylogenetic inference methods add sequentially arriving sequences to an inferred phylogeny without the need to recompute the entire tree from scratch. Some online method implementations exist already, but there remains concern that additional sequences may change the topological relationship among the original set of taxa. We call such a change in tree topology a lack of stability for the inferred tree. In this article, we analyze the stability of single taxon addition in a Maximum Likelihood framework across 1000 empirical datasets. We find that instability occurs in almost 90% of our examples, although observed topological differences do not always reach significance under the approximately unbiased (AU) test. Changes in tree topology after addition of a taxon rarely occur close to its attachment location, and are more frequently observed in more distant tree locations carrying low bootstrap support. To investigate whether instability is predictable, we hypothesize sources of instability and design summary statistics addressing these hypotheses. Using these summary statistics as input features for machine learning under random forests, we are able to predict instability and can identify the most influential features. In summary, it does not appear that a strict insertion-only online inference method will deliver globally optimal trees, although relaxing insertion strictness by allowing for a small number of final tree rearrangements or accepting slightly suboptimal solutions appears feasible.
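The notion of stability described in this abstract can be illustrated with a minimal Python sketch. It assumes trees are given as nested tuples of taxon names (a hypothetical representation chosen for illustration, not the authors' data format) and treats a tree as stable when the bipartitions among the original taxa are unchanged after the new taxon is added. The function names `leaves`, `splits`, and `is_stable` are illustrative only.

```python
from itertools import chain


def leaves(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset(chain.from_iterable(leaves(child) for child in tree))


def splits(tree, keep):
    """Collect the non-trivial bipartitions of `keep` induced by the tree."""
    keep = frozenset(keep)
    found = set()

    def visit(node):
        # Return the taxa below this node, restricted to `keep`.
        if isinstance(node, str):
            return frozenset([node]) & keep
        side = frozenset()
        for child in node:
            side |= visit(child)
        other = keep - side
        # Splits with fewer than two taxa on a side carry no topology.
        if len(side) >= 2 and len(other) >= 2:
            found.add(frozenset([side, other]))
        return side

    visit(tree)
    return found


def is_stable(old_tree, new_tree):
    """True if adding taxa left the topology among the original taxa unchanged."""
    original = leaves(old_tree)
    return splits(old_tree, original) == splits(new_tree, original)


old = (("A", "B"), ("C", "D"), "E")
inserted = (("A", "B"), (("C", "X"), "D"), "E")   # X attached next to C
rearranged = (("A", "C"), ("B", "X"), ("D", "E"))  # original taxa regrouped
print(is_stable(old, inserted), is_stable(old, rearranged))
```

Comparing split sets only detects whether the topology changed; judging whether a change is significant would need a likelihood-based test such as the AU test mentioned in the abstract, which this sketch does not attempt.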
Affiliation(s)
- Lena Collienne
  - Computational Biology Program, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, WA 98109, USA
- Mary Barker
  - Computational Biology Program, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, WA 98109, USA
  - Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, WA 98109, USA
- Marc A Suchard
  - Department of Human Genetics, University of California, 885 Tiverton Drive, Los Angeles, CA 90095, USA
  - Department of Computational Medicine, University of California, 885 Tiverton Drive, Los Angeles, CA 90095, USA
  - Department of Biostatistics, University of California, 650 Charles E. Young Dr. South, Los Angeles, CA 90095, USA
- Frederick A Matsen
  - Computational Biology Program, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, WA 98109, USA
  - Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, WA 98109, USA
  - Department of Statistics, University of Washington, Padelford Hall, Northeast Stevens Way, Seattle, WA 98195, USA
  - Department of Genome Sciences, University of Washington, 3720 15th Ave NE, Seattle, WA 98195, USA
3
Ren H, Wong TKF, Minh BQ, Lanfear R. MixtureFinder: Estimating DNA Mixture Models for Phylogenetic Analyses. Mol Biol Evol 2025; 42:msae264. [PMID: 39715360 PMCID: PMC11704958 DOI: 10.1093/molbev/msae264] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2024] [Revised: 11/26/2024] [Accepted: 12/19/2024] [Indexed: 12/25/2024] Open
Abstract
In phylogenetic studies, both partitioned models and mixture models are used to account for heterogeneity in molecular evolution among the sites of DNA sequence alignments. Partitioned models require the user to specify the grouping of sites into subsets, and then assume that each subset of sites can be modeled by a single common process. Mixture models do not require users to prespecify subsets of sites, and instead calculate the likelihood of every site under every model, while co-estimating the model weights and parameters. While much research has gone into the optimization of partitioned models by merging user-specified subsets, there has been less attention paid to the optimization of mixture models for DNA sequence alignments. In this study, we first ask whether a key assumption of partitioned models, that each user-specified subset can be modeled by a single common process, is supported by the data. Having shown that this is not the case, we then design, implement, test, and apply an algorithm, MixtureFinder, to select the optimum number of classes for a mixture model of Q-matrices for the standard models of DNA sequence evolution. We show this algorithm performs well on simulated and empirical datasets and suggest that it may be useful for future empirical studies. MixtureFinder is available in IQ-TREE2, and a tutorial for using MixtureFinder can be found here: http://www.iqtree.org/doc/Complex-Models#mixture-models.
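The way a mixture model scores an alignment ("the likelihood of every site under every model", weighted by class) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not MixtureFinder or IQ-TREE code: the per-site, per-class likelihoods are taken as given inputs (in practice they would come from a tree likelihood calculation under each Q-matrix), and the function name `mixture_log_likelihood` is hypothetical.

```python
import math


def mixture_log_likelihood(site_likelihoods, weights):
    """Log-likelihood of an alignment under a mixture of model classes.

    site_likelihoods[i][k] is the likelihood of site i under class k
    (assumed to be precomputed); weights[k] is the weight of class k.
    """
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("class weights must sum to 1")
    total = 0.0
    for per_class in site_likelihoods:
        # Each site's likelihood is the weighted sum over all classes.
        site = sum(w * lik for w, lik in zip(weights, per_class))
        total += math.log(site)
    return total


# Two sites, two classes with equal weights (toy numbers).
print(mixture_log_likelihood([[0.2, 0.4], [0.1, 0.3]], [0.5, 0.5]))
```

Unlike a partitioned model, no site is assigned to a single class here: every site contributes under every class, and only the weights determine how much each class matters.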
Affiliation(s)
- Huaiyan Ren
  - School of Computing, College of Engineering, Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia
  - Ecology and Evolution, Research School of Biology, College of Science, Australian National University, Canberra, ACT 2600, Australia
- Thomas K F Wong
  - School of Computing, College of Engineering, Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia
  - Ecology and Evolution, Research School of Biology, College of Science, Australian National University, Canberra, ACT 2600, Australia
- Bui Quang Minh
  - School of Computing, College of Engineering, Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia
- Robert Lanfear
  - Ecology and Evolution, Research School of Biology, College of Science, Australian National University, Canberra, ACT 2600, Australia
4
Berling L, Bouckaert R, Gavryushkin A. An Automated Convergence Diagnostic for Phylogenetic MCMC Analyses. IEEE/ACM Trans Comput Biol Bioinform 2024; 21:2246-2257. [PMID: 39255085 DOI: 10.1109/tcbb.2024.3457875] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/12/2024]
Abstract
Assessing convergence of Markov chain Monte Carlo (MCMC) based analyses is crucial but challenging, especially so in high dimensional and complex spaces such as the space of phylogenetic trees (treespace). In practice, it is assumed that the target distribution is the unique stationary distribution of the MCMC and convergence is achieved when samples appear to be stationary. Here we leverage recent advances in computational geometry of the treespace and introduce a method that combines classical statistical techniques and algorithms with geometric properties of the treespace to automatically evaluate and assess practical convergence of phylogenetic MCMC analyses. Our method monitors convergence across multiple MCMC chains and achieves high accuracy in detecting both practical convergence and convergence issues within treespace. Furthermore, our approach is developed to allow for real-time evaluation during the MCMC algorithm run, eliminating any of the chain post-processing steps that are currently required. Our tool therefore improves reliability and efficiency of MCMC based phylogenetic inference methods and makes analyses easier to reproduce and compare. We demonstrate the efficacy of our diagnostic via a well-calibrated simulation study and provide examples of its performance on real data sets. Although our method performs well in practice, a significant part of the underlying treespace probability theory is still missing, which creates an excellent opportunity for future mathematical research in this area.
5
Liu C, Zhou X, Li Y, Hittinger CT, Pan R, Huang J, Chen XX, Rokas A, Chen Y, Shen XX. The Influence of the Number of Tree Searches on Maximum Likelihood Inference in Phylogenomics. Syst Biol 2024; 73:807-822. [PMID: 38940001 DOI: 10.1093/sysbio/syae031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2022] [Revised: 06/20/2024] [Accepted: 06/26/2024] [Indexed: 06/29/2024] Open
Abstract
Maximum likelihood (ML) phylogenetic inference is widely used in phylogenomics. As heuristic searches most likely find suboptimal trees, it is recommended to conduct multiple (e.g., 10) tree searches in phylogenetic analyses. However, beyond its positive role, how and to what extent multiple tree searches aid ML phylogenetic inference remains poorly explored. Here, we found that a random starting tree was not as effective as the BioNJ and parsimony starting trees in inferring the ML gene tree and that RAxML-NG and PhyML were less sensitive to different starting trees than IQ-TREE. We then examined the effect of the number of tree searches on ML tree inference with IQ-TREE and RAxML-NG, by running 100 tree searches on 19,414 gene alignments from 15 animal, plant, and fungal phylogenomic datasets. We found that the number of tree searches substantially impacted the recovery of the best-of-100 ML gene tree topology among 100 searches for a given ML program. In addition, all of the concatenation-based trees were topologically identical if the number of tree searches was ≥10. Quartet-based ASTRAL trees inferred from 1 to 80 tree searches differed topologically from those inferred from 100 tree searches for 6/15 phylogenomic datasets. Finally, our simulations showed that gene alignments with lower difficulty scores had a higher chance of finding the best-of-100 gene tree topology and were more likely to yield the correct trees.
Affiliation(s)
- Chao Liu
  - Department of Plant Protection, Key Laboratory of Biology of Crop Pathogens and Insects of Zhejiang Province, Zhejiang University, Hangzhou 310058, China
  - Centre for Evolutionary & Organismal Biology, Zhejiang University, Hangzhou 310058, China
- Xiaofan Zhou
  - Guangdong Province Key Laboratory of Microbial Signals and Disease Control, Integrative Microbiology Research Centre, South China Agricultural University, Guangzhou 510642, China
- Yuanning Li
  - Institute of Marine Science and Technology, Shandong University, Qingdao 266237, China
  - Department of Biological Sciences and Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37235, USA
- Chris Todd Hittinger
  - Laboratory of Genetics, Wisconsin Energy Institute, Center for Genomic Science Innovation, DOE Great Lakes Bioenergy Research Center, J. F. Crow Institute for the Study of Evolution, University of Wisconsin-Madison, Madison, WI 53706, USA
- Ronghui Pan
  - ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University, Hangzhou, 310027, China
- Jinyan Huang
  - Zhejiang Provincial Key Laboratory of Pancreatic Disease, Zhejiang University School of Medicine First Affiliated Hospital, Hangzhou 310003, China
- Xue-Xin Chen
  - Department of Plant Protection, Key Laboratory of Biology of Crop Pathogens and Insects of Zhejiang Province, Zhejiang University, Hangzhou 310058, China
- Antonis Rokas
  - Department of Biological Sciences and Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37235, USA
- Yun Chen
  - Department of Plant Protection, Key Laboratory of Biology of Crop Pathogens and Insects of Zhejiang Province, Zhejiang University, Hangzhou 310058, China
- Xing-Xing Shen
  - Department of Plant Protection, Key Laboratory of Biology of Crop Pathogens and Insects of Zhejiang Province, Zhejiang University, Hangzhou 310058, China
  - Centre for Evolutionary & Organismal Biology, Zhejiang University, Hangzhou 310058, China
6
Wiegert J, Höhler D, Haag J, Stamatakis A. Predicting Phylogenetic Bootstrap Values via Machine Learning. Mol Biol Evol 2024; 41:msae215. [PMID: 39418337 PMCID: PMC11523138 DOI: 10.1093/molbev/msae215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/23/2024] [Revised: 08/28/2024] [Accepted: 09/15/2024] [Indexed: 10/19/2024] Open
Abstract
Estimating the statistical robustness of the inferred tree(s) constitutes an integral part of most phylogenetic analyses. Commonly, one computes and assigns a branch support value to each inner branch of the inferred phylogeny. The still most widely used method for calculating branch support on trees inferred under maximum likelihood (ML) is the Standard, nonparametric Felsenstein bootstrap support (SBS). Due to the high computational cost of the SBS, a plethora of methods has been developed to approximate it, for instance, via the rapid bootstrap (RB) algorithm. There have also been attempts to devise faster, alternative support measures, such as the SH-aLRT (Shimodaira-Hasegawa-like approximate likelihood ratio test) or the UltraFast bootstrap 2 (UFBoot2) method. Those faster alternatives exhibit some limitations, such as the need to assess model violations (UFBoot2) or unstable behavior in the low support interval range (SH-aLRT). Here, we present the educated bootstrap guesser (EBG), a machine learning-based tool that predicts SBS branch support values for a given input phylogeny. EBG is on average 9.4 (σ=5.5) times faster than UFBoot2. EBG-based SBS estimates exhibit a median absolute error of 5 when predicting SBS values between 0 and 100. Furthermore, EBG also provides uncertainty measures for all per-branch SBS predictions and thereby allows for a more rigorous and careful interpretation. EBG can, for instance, predict SBS support values on a phylogeny comprising 1,654 SARS-CoV2 genome sequences within 3 h on a mid-class laptop. EBG is available under GNU GPL3.
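For readers unfamiliar with the standard Felsenstein bootstrap that EBG aims to predict, the resampling step and the resulting support value can be sketched as follows. This is a minimal illustration, not EBG or any published tool: the alignment is assumed to be a dict of equal-length sequences, tree inference on each replicate is left out entirely, and the names `bootstrap_alignment` and `branch_support` are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one bootstrap replicate by resampling columns with replacement."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


def branch_support(reference_splits, replicate_splits):
    """Percentage of replicate trees that contain each reference split.

    reference_splits: splits (any hashable objects) of the inferred tree.
    replicate_splits: one set of splits per bootstrap replicate tree,
    assumed to come from re-running tree inference on each replicate.
    """
    n = len(replicate_splits)
    return {
        split: 100.0 * sum(split in rep for rep in replicate_splits) / n
        for split in reference_splits
    }


alignment = {"A": "ACGT", "B": "ACGA", "C": "TCGA"}
replicate = bootstrap_alignment(alignment, random.Random(0))
print(replicate)
```

The expensive part of the procedure is the tree inference that must be repeated for every replicate, which is why approximations and predictors of the support values are attractive.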
Affiliation(s)
- Julius Wiegert
  - Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Dimitri Höhler
  - Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Julia Haag
  - Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Alexandros Stamatakis
  - Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
  - Biodiversity Computing Group, Institute of Computer Science, Foundation for Research and Technology - Hellas, Heraklion, Crete, Greece
  - Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
7
Charamis J, Balaska S, Ioannidis P, Dvořák V, Mavridis K, McDowell MA, Pavlidis P, Feyereisen R, Volf P, Vontas J. Comparative Genomics Uncovers the Evolutionary Dynamics of Detoxification and Insecticide Target Genes Across 11 Phlebotomine Sand Flies. Genome Biol Evol 2024; 16:evae186. [PMID: 39224065 PMCID: PMC11412322 DOI: 10.1093/gbe/evae186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/15/2024] [Revised: 08/09/2024] [Accepted: 08/19/2024] [Indexed: 09/04/2024] Open
Abstract
Sand flies infect more than 1 million people annually with Leishmania parasites and other bacterial and viral pathogens. Progress in understanding sand fly adaptations to xenobiotics has been hampered by the limited availability of genomic resources. To address this gap, we sequenced, assembled, and annotated the transcriptomes of 11 phlebotomine sand fly species. Subsequently, we leveraged these genomic resources to generate novel evolutionary insights pertaining to their adaptations to xenobiotics, including those contributing to insecticide resistance. Specifically, we annotated over 2,700 sand fly detoxification genes and conducted large-scale phylogenetic comparisons to uncover the evolutionary dynamics of the five major detoxification gene families: cytochrome P450s (CYPs), glutathione-S-transferases (GSTs), UDP-glycosyltransferases (UGTs), carboxyl/cholinesterases (CCEs), and ATP-binding cassette (ABC) transporters. Using this comparative approach, we show that sand flies have evolved diverse CYP and GST gene repertoires, with notable lineage-specific expansions in gene groups evolutionarily related to known xenobiotic metabolizers. Furthermore, we show that sand flies have conserved orthologs of (i) CYP4G genes involved in cuticular hydrocarbon biosynthesis, (ii) ABCB genes involved in xenobiotic toxicity, and (iii) two primary insecticide targets, acetylcholinesterase-1 (Ace1) and voltage gated sodium channel (VGSC). The biological insights and genomic resources produced in this study provide a foundation for generating and testing hypotheses regarding the molecular mechanisms underlying sand fly adaptations to xenobiotics.
Affiliation(s)
- Jason Charamis
  - Department of Biology, University of Crete, Heraklion 71409, Greece
  - Institute of Molecular Biology and Biotechnology, Foundation for Research and Technology-Hellas, Heraklion 70013, Greece
- Sofia Balaska
  - Department of Biology, University of Crete, Heraklion 71409, Greece
  - Institute of Molecular Biology and Biotechnology, Foundation for Research and Technology-Hellas, Heraklion 70013, Greece
- Panagiotis Ioannidis
  - Institute of Molecular Biology and Biotechnology, Foundation for Research and Technology-Hellas, Heraklion 70013, Greece
  - Institute of Computer Science, Foundation for Research and Technology-Hellas, Heraklion, Greece
- Vít Dvořák
  - Department of Parasitology, Faculty of Science, Charles University, Prague, Czech Republic
- Konstantinos Mavridis
  - Institute of Molecular Biology and Biotechnology, Foundation for Research and Technology-Hellas, Heraklion 70013, Greece
- Mary Ann McDowell
  - Eck Institute for Global Health, Department of Biological Sciences, University of Notre Dame, Notre Dame, IN, USA
- Pavlos Pavlidis
  - Department of Biology, University of Crete, Heraklion 71409, Greece
  - Institute of Computer Science, Foundation for Research and Technology-Hellas, Heraklion, Greece
- René Feyereisen
  - Laboratory of Agrozoology, Department of Plants and Crops, Faculty of Bioscience Engineering, Ghent University, Ghent 9000, Belgium
- Petr Volf
  - Department of Parasitology, Faculty of Science, Charles University, Prague, Czech Republic
- John Vontas
  - Institute of Molecular Biology and Biotechnology, Foundation for Research and Technology-Hellas, Heraklion 70013, Greece
  - Pesticide Science Laboratory, Department of Crop Science, Agricultural University of Athens, Athens 11855, Greece
8
Han Y, Xie Y, Hao Z, Mao J, Wang X, Chang Y, Tian Y. The Mitochondrial Genome of Ylistrum japonicum (Bivalvia, Pectinidae) and Its Phylogenetic Analysis. Int J Mol Sci 2024; 25:8755. [PMID: 39201441 PMCID: PMC11354973 DOI: 10.3390/ijms25168755] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2024] [Revised: 08/06/2024] [Accepted: 08/09/2024] [Indexed: 09/02/2024] Open
Abstract
Ylistrum japonicum is a commercially valuable scallop known for its long-distance swimming abilities. Despite its economic importance, genetic and genomic research on this species is limited. This study presents the first complete mitochondrial genome of Y. japonicum. The mitochondrial genome is 19,475 bp long and encompasses 13 protein-coding genes, three ribosomal RNA genes, and 23 transfer RNA genes. Two distinct phylogenetic analyses were used to explore the phylogenetic position of Y. japonicum within the family Pectinidae: one based on mitochondrial genomes from 15 Pectinidae species and additional outgroup taxa, and one based on the single gene 16S rRNA. The two resulting trees provide clearer insights into the evolutionary placement of Y. japonicum within Pectinidae. Our analysis reveals that Ylistrum is a basal lineage to the Pectininae clade, distinct from its previously assigned tribe, Amusiini. This study offers critical insights into the genetic makeup and evolutionary history of Y. japonicum, enhancing our knowledge of this economically vital species.
Affiliation(s)
- Yaqing Chang
  - Key Laboratory of Mariculture & Stock Enhancement in North China Sea, Ministry of Agriculture and Rural Affairs, Dalian Ocean University, Dalian 116023, China; (Y.H.); (Y.X.); (Z.H.); (J.M.); (X.W.)
- Ying Tian
  - Key Laboratory of Mariculture & Stock Enhancement in North China Sea, Ministry of Agriculture and Rural Affairs, Dalian Ocean University, Dalian 116023, China; (Y.H.); (Y.X.); (Z.H.); (J.M.); (X.W.)
9
Mo YK, Hahn MW, Smith ML. Applications of machine learning in phylogenetics. Mol Phylogenet Evol 2024; 196:108066. [PMID: 38565358 DOI: 10.1016/j.ympev.2024.108066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2023] [Revised: 02/16/2024] [Accepted: 03/21/2024] [Indexed: 04/04/2024]
Abstract
Machine learning has increasingly been applied to a wide range of questions in phylogenetic inference. Supervised machine learning approaches that rely on simulated training data have been used to infer tree topologies and branch lengths, to select substitution models, and to perform downstream inferences of introgression and diversification. Here, we review how researchers have used several promising machine learning approaches to make phylogenetic inferences. Despite the promise of these methods, several barriers prevent supervised machine learning from reaching its full potential in phylogenetics. We discuss these barriers and potential paths forward. In the future, we expect that the application of careful network designs and data encodings will allow supervised machine learning to accommodate the complex processes that continue to confound traditional phylogenetic methods.
Affiliation(s)
- Yu K Mo
  - Department of Computer Science, Indiana University, Bloomington, IN 47405, USA
- Matthew W Hahn
  - Department of Computer Science, Indiana University, Bloomington, IN 47405, USA; Department of Biology, Indiana University, Bloomington, IN 47405, USA
- Megan L Smith
  - Department of Biological Sciences, Mississippi State University, Starkville, MS 39762, USA
10
Ecker N, Huchon D, Mansour Y, Mayrose I, Pupko T. A machine-learning-based alternative to phylogenetic bootstrap. Bioinformatics 2024; 40:i208-i217. [PMID: 38940166 PMCID: PMC11211842 DOI: 10.1093/bioinformatics/btae255] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024] Open
Abstract
Motivation: Currently used methods for estimating branch support in phylogenetic analyses often rely on the classic Felsenstein's bootstrap, parametric tests, or their approximations. As these branch support scores are widely used in phylogenetic analyses, having accurate, fast, and interpretable scores is of high importance. Results: Here, we employed a data-driven approach to estimate branch support values with a probabilistic interpretation. To this end, we simulated thousands of realistic phylogenetic trees and the corresponding multiple sequence alignments. Each of the obtained alignments was used to infer the phylogeny using state-of-the-art phylogenetic inference software, which was then compared to the true tree. Using these extensive data, we trained machine-learning algorithms to estimate branch support values for each bipartition within the maximum-likelihood trees obtained by each software. Our results demonstrate that our model provides fast and more accurate probability-based branch support values than commonly used procedures. We demonstrate the applicability of our approach on empirical datasets. Availability and implementation: The data supporting this work are available in the Figshare repository at https://doi.org/10.6084/m9.figshare.25050554.v1, and the underlying code is accessible via GitHub at https://github.com/noaeker/bootstrap_repo.
Affiliation(s)
- Noa Ecker
  - The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
- Dorothée Huchon
  - School of Zoology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
  - The Steinhardt Museum of Natural History and National Research Center, Tel Aviv University, Tel Aviv 6997801, Israel
- Yishay Mansour
  - The Blavatnik School of Computer Science, Raymond & Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
- Itay Mayrose
  - School of Plant Sciences and Food Security, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
- Tal Pupko
  - The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
11
Azouri D, Granit O, Alburquerque M, Mansour Y, Pupko T, Mayrose I. The Tree Reconstruction Game: Phylogenetic Reconstruction Using Reinforcement Learning. Mol Biol Evol 2024; 41:msae105. [PMID: 38829798 PMCID: PMC11180600 DOI: 10.1093/molbev/msae105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2023] [Revised: 05/17/2024] [Accepted: 05/28/2024] [Indexed: 06/05/2024] Open
Abstract
The computational search for the maximum-likelihood phylogenetic tree is an NP-hard problem. As such, current tree search algorithms might result in a tree that is a local optimum, not the global one. Here, we introduce a paradigm shift for predicting the maximum-likelihood tree, by approximating long-term gains of likelihood rather than maximizing likelihood gain at each step of the search. Our proposed approach harnesses the power of reinforcement learning to learn an optimal search strategy, aiming at the global optimum of the search space. We show that when analyzing empirical data containing dozens of sequences, the log-likelihood improvement from the starting tree obtained by the reinforcement learning-based agent was 0.969 or higher compared to that achieved by current state-of-the-art techniques. Notably, this performance is attained without the need to perform costly likelihood optimizations apart from the training process, thus potentially allowing for an exponential decrease in runtime. We exemplify this for data sets containing 15 sequences of length 18,000 bp and demonstrate that the reinforcement learning-based method is roughly three times faster than the state-of-the-art software. This study illustrates the potential of reinforcement learning in addressing the challenges of phylogenetic tree reconstruction.
Collapse
Affiliation(s)
- Dana Azouri
- School of Plant Sciences and Food Security, Tel Aviv University, Ramat Aviv, Tel Aviv 69978, Israel
- The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Ramat Aviv, Tel Aviv 69978, Israel
| | - Oz Granit
- Blavatnik School of Computer Science, Tel Aviv University, Ramat Aviv, Tel Aviv 69978, Israel
| | - Michael Alburquerque
- The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Ramat Aviv, Tel Aviv 69978, Israel
| | - Yishay Mansour
- Blavatnik School of Computer Science, Tel Aviv University, Ramat Aviv, Tel Aviv 69978, Israel
| | - Tal Pupko
- The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Ramat Aviv, Tel Aviv 69978, Israel
| | - Itay Mayrose
- School of Plant Sciences and Food Security, Tel Aviv University, Ramat Aviv, Tel Aviv 69978, Israel
| |
|
12
|
Stiller J, Feng S, Chowdhury AA, Rivas-González I, Duchêne DA, Fang Q, Deng Y, Kozlov A, Stamatakis A, Claramunt S, Nguyen JMT, Ho SYW, Faircloth BC, Haag J, Houde P, Cracraft J, Balaban M, Mai U, Chen G, Gao R, Zhou C, Xie Y, Huang Z, Cao Z, Yan Z, Ogilvie HA, Nakhleh L, Lindow B, Morel B, Fjeldså J, Hosner PA, da Fonseca RR, Petersen B, Tobias JA, Székely T, Kennedy JD, Reeve AH, Liker A, Stervander M, Antunes A, Tietze DT, Bertelsen MF, Lei F, Rahbek C, Graves GR, Schierup MH, Warnow T, Braun EL, Gilbert MTP, Jarvis ED, Mirarab S, Zhang G. Complexity of avian evolution revealed by family-level genomes. Nature 2024; 629:851-860. [PMID: 38560995 PMCID: PMC11111414 DOI: 10.1038/s41586-024-07323-1] [Citation(s) in RCA: 63] [Impact Index Per Article: 63.0] [Received: 04/25/2023] [Accepted: 03/15/2024] [Indexed: 04/04/2024]
Abstract
Despite tremendous efforts in the past decades, relationships among main avian lineages remain heavily debated without a clear resolution. Discrepancies have been attributed to diversity of species sampled, phylogenetic method and the choice of genomic regions [1-3]. Here we address these issues by analysing the genomes of 363 bird species [4] (218 taxonomic families, 92% of total). Using intergenic regions and coalescent methods, we present a well-supported tree but also a marked degree of discordance. The tree confirms that Neoaves experienced rapid radiation at or near the Cretaceous-Palaeogene boundary. Sufficient loci rather than extensive taxon sampling were more effective in resolving difficult nodes. Remaining recalcitrant nodes involve species that are a challenge to model due to either extreme DNA composition, variable substitution rates, incomplete lineage sorting or complex evolutionary events such as ancient hybridization. Assessment of the effects of different genomic partitions showed high heterogeneity across the genome. We discovered sharp increases in effective population size, substitution rates and relative brain size following the Cretaceous-Palaeogene extinction event, supporting the hypothesis that emerging ecological opportunities catalysed the diversification of modern birds. The resulting phylogenetic estimate offers fresh insights into the rapid radiation of modern birds and provides a taxon-rich backbone tree for future comparative studies.
Affiliation(s)
- Josefin Stiller
- Section for Ecology and Evolution, Department of Biology, University of Copenhagen, Copenhagen, Denmark.
| | - Shaohong Feng
- Center for Evolutionary & Organismal Biology, Liangzhu Laboratory & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Department of General Surgery, Sir Run-Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Innovation Center of Yangtze River Delta, Zhejiang University, Jiashan, China
| | - Al-Aabid Chowdhury
- School of Life and Environmental Sciences, University of Sydney, Sydney, New South Wales, Australia
| | | | - David A Duchêne
- Center for Evolutionary Hologenomics, The Globe Institute, University of Copenhagen, Copenhagen, Denmark
| | - Qi Fang
- BGI Research, Shenzhen, China
| | - Yuan Deng
- BGI Research, Shenzhen, China
- BGI Research, Wuhan, China
| | - Alexey Kozlov
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
| | - Alexandros Stamatakis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Institute of Computer Science, Foundation for Research and Technology Hellas, Heraklion, Greece
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | - Santiago Claramunt
- Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Ontario, Canada
- Department of Natural History, Royal Ontario Museum, Toronto, Ontario, Canada
| | - Jacqueline M T Nguyen
- College of Science and Engineering, Flinders University, Adelaide, South Australia, Australia
- Australian Museum Research Institute, Sydney, New South Wales, Australia
| | - Simon Y W Ho
- School of Life and Environmental Sciences, University of Sydney, Sydney, New South Wales, Australia
| | - Brant C Faircloth
- Department of Biological Sciences and Museum of Natural Science, Louisiana State University, Baton Rouge, LA, USA
| | - Julia Haag
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
| | - Peter Houde
- Department of Biology, New Mexico State University, Las Cruces, NM, USA
| | - Joel Cracraft
- Department of Ornithology, American Museum of Natural History, New York, NY, USA
| | - Metin Balaban
- Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, USA
| | - Uyen Mai
- Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Guangji Chen
- BGI Research, Wuhan, China
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Rongsheng Gao
- BGI Research, Wuhan, China
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | | | - Yulong Xie
- Center for Evolutionary & Organismal Biology, Liangzhu Laboratory & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Zijian Huang
- Center for Evolutionary & Organismal Biology, Liangzhu Laboratory & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Zhen Cao
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Zhi Yan
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Huw A Ogilvie
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Luay Nakhleh
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Bent Lindow
- Natural History Museum Denmark, University of Copenhagen, Copenhagen, Denmark
| | - Benoit Morel
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Institute of Computer Science, Foundation for Research and Technology Hellas, Heraklion, Greece
| | - Jon Fjeldså
- Natural History Museum Denmark, University of Copenhagen, Copenhagen, Denmark
| | - Peter A Hosner
- Natural History Museum Denmark, University of Copenhagen, Copenhagen, Denmark
- Center for Global Mountain Biodiversity, Globe Institute, University of Copenhagen, Copenhagen, Denmark
| | - Rute R da Fonseca
- Center for Global Mountain Biodiversity, Globe Institute, University of Copenhagen, Copenhagen, Denmark
| | - Bent Petersen
- Center for Evolutionary Hologenomics, The Globe Institute, University of Copenhagen, Copenhagen, Denmark
- Centre of Excellence for Omics-Driven Computational Biodiscovery, Faculty of Applied Sciences, AIMST University, Bedong, Malaysia
| | - Joseph A Tobias
- Department of Life Sciences, Imperial College London, Silwood Park, Ascot, UK
| | - Tamás Székely
- Milner Centre for Evolution, University of Bath, Bath, UK
- ELKH-DE Reproductive Strategies Research Group, University of Debrecen, Debrecen, Hungary
| | - Jonathan David Kennedy
- Center for Macroecology, Evolution, and Climate, The Globe Institute, University of Copenhagen, Copenhagen, Denmark
| | - Andrew Hart Reeve
- Natural History Museum Denmark, University of Copenhagen, Copenhagen, Denmark
| | - Andras Liker
- HUN-REN-PE Evolutionary Ecology Research Group, University of Pannonia, Veszprém, Hungary
- Behavioural Ecology Research Group, Center for Natural Sciences, University of Pannonia, Veszprém, Hungary
| | | | - Agostinho Antunes
- CIIMAR/CIMAR, Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Porto, Portugal
- Department of Biology, Faculty of Sciences, University of Porto, Porto, Portugal
| | | | - Mads F Bertelsen
- Centre for Zoo and Wild Animal Health, Copenhagen Zoo, Frederiksberg, Denmark
| | - Fumin Lei
- Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- College of Life Science, University of Chinese Academy of Sciences, Beijing, China
| | - Carsten Rahbek
- Center for Global Mountain Biodiversity, Globe Institute, University of Copenhagen, Copenhagen, Denmark
- Center for Macroecology, Evolution, and Climate, The Globe Institute, University of Copenhagen, Copenhagen, Denmark
- Institute of Ecology, Peking University, Beijing, China
- Danish Institute for Advanced Study, University of Southern Denmark, Odense, Denmark
| | - Gary R Graves
- Center for Macroecology, Evolution, and Climate, The Globe Institute, University of Copenhagen, Copenhagen, Denmark
- Department of Vertebrate Zoology, National Museum of Natural History, Smithsonian Institution, Washington, DC, USA
| | | | - Tandy Warnow
- University of Illinois Urbana-Champaign, Champaign, IL, USA
| | - Edward L Braun
- Department of Biology, University of Florida, Gainesville, FL, USA
| | - M Thomas P Gilbert
- Center for Evolutionary Hologenomics, The Globe Institute, University of Copenhagen, Copenhagen, Denmark
- University Museum, NTNU, Trondheim, Norway
| | - Erich D Jarvis
- Vertebrate Genome Lab, The Rockefeller University, New York, NY, USA
- Howard Hughes Medical Institute, Durham, NC, USA
| | | | - Guojie Zhang
- Center for Evolutionary & Organismal Biology, Liangzhu Laboratory & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, China.
- Innovation Center of Yangtze River Delta, Zhejiang University, Jiashan, China.
- BGI Research, Wuhan, China.
- Villum Center for Biodiversity Genomics, Department of Biology, University of Copenhagen, Copenhagen, Denmark.
| |
|
13
|
Morel B, Williams TA, Stamatakis A, Szöllősi GJ. AleRax: a tool for gene and species tree co-estimation and reconciliation under a probabilistic model of gene duplication, transfer, and loss. Bioinformatics 2024; 40:btae162. [PMID: 38514421 PMCID: PMC10990685 DOI: 10.1093/bioinformatics/btae162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Revised: 01/30/2024] [Accepted: 03/19/2024] [Indexed: 03/23/2024] Open
Abstract
MOTIVATION: Genomes are a rich source of information on the pattern and process of evolution across biological scales. How best to make use of that information is an active area of research in phylogenetics. Ideally, phylogenetic methods should not only model substitutions along gene trees, which explain differences between homologous gene sequences, but also the processes that generate the gene trees themselves along a shared species tree. To conduct accurate inferences, one needs to account for uncertainty at both levels, that is, in gene trees estimated from inherently short sequences and in their diverse evolutionary histories along a shared species tree. RESULTS: We present AleRax, a software tool that can infer reconciled gene trees together with a shared species tree using a simple, yet powerful, probabilistic model of gene duplication, transfer, and loss. A key feature of AleRax is its ability to account for uncertainty in the gene tree and its reconciliation by using an efficient approximation to calculate the joint phylogenetic-reconciliation likelihood and sample reconciled gene trees accordingly. Simulations and analyses of empirical data show that AleRax is one order of magnitude faster than competing gene tree inference tools while attaining the same accuracy. It is consistently more robust than species tree inference methods such as SpeciesRax and ASTRAL-Pro 2 under gene tree uncertainty. Finally, AleRax can process multiple gene families in parallel, thereby allowing users to compare competing phylogenetic hypotheses and estimate model parameters, such as duplication, transfer, and loss probabilities, for genome-scale datasets with hundreds of taxa. AVAILABILITY AND IMPLEMENTATION: GNU GPL at https://github.com/BenoitMorel/AleRax and data are made available at https://cme.h-its.org/exelixis/material/alerax_data.tar.gz.
Affiliation(s)
- Benoit Morel
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg 69118, Germany
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe 76131, Germany
| | - Tom A Williams
- School of Biological Sciences, University of Bristol, Bristol BS8 1TQ, United Kingdom
| | - Alexandros Stamatakis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg 69118, Germany
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe 76131, Germany
- Institute of Computer Science, Biodiversity Computing Group, Heraklion GR-70013, Greece
| | - Gergely J Szöllősi
- ELTE-MTA “Lendület”, Evolutionary Genomics Research Group, Budapest H-1117, Hungary
- Institute of Evolution, HUN-REN Centre for Ecological Research, Budapest H-1121, Hungary
- Model-Based Evolutionary Genomics Unit, Okinawa Institute of Science and Technology Graduate University, Okinawa 904-0495, Japan
| |
|
14
|
Williams TA, Davin AA, Szánthó LL, Stamatakis A, Wahl NA, Woodcroft BJ, Soo RM, Eme L, Sheridan PO, Gubry-Rangin C, Spang A, Hugenholtz P, Szöllősi GJ. Phylogenetic reconciliation: making the most of genomes to understand microbial ecology and evolution. The ISME Journal 2024; 18:wrae129. [PMID: 39001714 PMCID: PMC11293204 DOI: 10.1093/ismejo/wrae129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2024] [Revised: 07/01/2024] [Accepted: 07/12/2024] [Indexed: 07/15/2024]
Abstract
In recent years, phylogenetic reconciliation has emerged as a promising approach for studying microbial ecology and evolution. The core idea is to model how gene trees evolve along a species tree and to explain differences between them via evolutionary events including gene duplications, transfers, and losses. Here, we describe how phylogenetic reconciliation provides a natural framework for studying genome evolution and highlight recent applications including ancestral gene content inference, the rooting of species trees, and the insights into metabolic evolution and ecological transitions they yield. Reconciliation analyses have elucidated the evolution of diverse microbial lineages, from Chlamydiae to Asgard archaea, shedding light on ecological adaptation, host-microbe interactions, and symbiotic relationships. However, there are many opportunities for broader application of the approach in microbiology. Continuing improvements to make reconciliation models more realistic and scalable, and integration of ecological metadata such as habitat, pH, temperature, and oxygen use offer enormous potential for understanding the rich tapestry of microbial life.
Affiliation(s)
- Tom A Williams
- School of Biological Sciences, University of Bristol, Bristol BS8 1TQ, United Kingdom
| | - Adrian A Davin
- Department of Biological Sciences, Graduate School of Science, The University of Tokyo, 113-0033 Tokyo, Japan
| | - Lénárd L Szánthó
- MTA-ELTE “Lendület” Evolutionary Genomics Research Group, Eötvös University, 1117 Budapest, Hungary
- Model-Based Evolutionary Genomics Unit, Okinawa Institute of Science and Technology Graduate University, 904-0495 Okinawa, Japan
| | - Alexandros Stamatakis
- Biodiversity Computing Group, Institute of Computer Science, Foundation for Research and Technology Hellas, 70013 Heraklion, Greece
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, 69118 Heidelberg, Germany
- Institute of Theoretical Informatics, Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany
| | - Noah A Wahl
- Biodiversity Computing Group, Institute of Computer Science, Foundation for Research and Technology Hellas, 70013 Heraklion, Greece
| | - Ben J Woodcroft
- Centre for Microbiome Research, School of Biomedical Sciences, Queensland University of Technology (QUT), Translational Research Institute, Woolloongabba, QLD 4102, Australia
| | - Rochelle M Soo
- Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD 4072, Australia
| | - Laura Eme
- Unité d’Ecologie, Systématique et Evolution, Université Paris-Saclay, 91190 Gif-sur-Yvette, France
| | - Paul O Sheridan
- School of Biological and Chemical Sciences, University of Galway, Galway H91 TK33, Ireland
| | - Cecile Gubry-Rangin
- School of Biological Sciences, University of Aberdeen, Aberdeen AB24 3FX, United Kingdom
| | - Anja Spang
- Department of Marine Microbiology and Biogeochemistry, NIOZ, Royal Netherlands Institute for Sea Research, PO Box 59, 1790 AB Den Burg, The Netherlands
- Department of Evolutionary & Population Biology, Institute for Biodiversity and Ecosystem Dynamics (IBED), University of Amsterdam, Amsterdam, The Netherlands
| | - Philip Hugenholtz
- Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD 4072, Australia
| | - Gergely J Szöllősi
- MTA-ELTE “Lendület” Evolutionary Genomics Research Group, Eötvös University, 1117 Budapest, Hungary
- Model-Based Evolutionary Genomics Unit, Okinawa Institute of Science and Technology Graduate University, 904-0495 Okinawa, Japan
- Institute of Evolution, HUN-REN Centre for Ecological Research, 1121 Budapest, Hungary
| |
|
15
|
Trost J, Haag J, Höhler D, Jacob L, Stamatakis A, Boussau B. Simulations of Sequence Evolution: How (Un)realistic They Are and Why. Mol Biol Evol 2024; 41:msad277. [PMID: 38124381 PMCID: PMC10768886 DOI: 10.1093/molbev/msad277] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2023] [Revised: 11/17/2023] [Accepted: 12/08/2023] [Indexed: 12/23/2023] Open
Abstract
MOTIVATION: Simulating multiple sequence alignments (MSAs) using probabilistic models of sequence evolution plays an important role in the evaluation of phylogenetic inference tools and is crucial to the development of novel learning-based approaches for phylogenetic reconstruction, for instance, neural networks. These models and the resulting simulated data need to be as realistic as possible to be indicative of the performance of the developed tools on empirical data and to ensure that neural networks trained on simulations perform well on empirical data. Over the years, numerous models of evolution have been published with the goal to represent as faithfully as possible the sequence evolution process and thus simulate empirical-like data. In this study, we simulated DNA and protein MSAs under increasingly complex models of evolution with and without insertion/deletion (indel) events using a state-of-the-art sequence simulator. We assessed their realism by quantifying how accurately supervised learning methods are able to predict whether a given MSA is simulated or empirical. RESULTS: Our results show that we can distinguish between empirical and simulated MSAs with high accuracy using two distinct and independently developed classification approaches across all tested models of sequence evolution. Our findings suggest that the current state-of-the-art models fail to accurately replicate several aspects of empirical MSAs, including site-wise rates as well as amino acid and nucleotide composition.
Affiliation(s)
- Johanna Trost
- Biometry and Evolutionary Biology Laboratory (LBBE), University Claude Bernard Lyon 1, Lyon, France
| | - Julia Haag
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
| | - Dimitri Höhler
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
| | - Laurent Jacob
- CNRS, IBPS, Laboratory of Computational and Quantitative Biology (LCQB), UMR 7238, Sorbonne Université, Paris 75005, France
| | - Alexandros Stamatakis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Biodiversity Computing Group, Institute of Computer Science, Foundation for Research and Technology - Hellas, Heraklion, Crete, Greece
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | - Bastien Boussau
- Biometry and Evolutionary Biology Laboratory (LBBE), University Claude Bernard Lyon 1, Lyon, France
| |
|
16
|
Steenwyk JL, Li Y, Zhou X, Shen XX, Rokas A. Incongruence in the phylogenomics era. Nat Rev Genet 2023; 24:834-850. [PMID: 37369847 PMCID: PMC11499941 DOI: 10.1038/s41576-023-00620-x] [Citation(s) in RCA: 63] [Impact Index Per Article: 31.5] [Accepted: 05/19/2023] [Indexed: 06/29/2023]
Abstract
Genome-scale data and the development of novel statistical phylogenetic approaches have greatly aided the reconstruction of a broad sketch of the tree of life and resolved many of its branches. However, incongruence - the inference of conflicting evolutionary histories - remains pervasive in phylogenomic data, hampering our ability to reconstruct and interpret the tree of life. Biological factors, such as incomplete lineage sorting, horizontal gene transfer, hybridization, introgression, recombination and convergent molecular evolution, can lead to gene phylogenies that differ from the species tree. In addition, analytical factors, including stochastic, systematic and treatment errors, can drive incongruence. Here, we review these factors, discuss methodological advances to identify and handle incongruence, and highlight avenues for future research.
Affiliation(s)
- Jacob L Steenwyk
- Howard Hughes Medical Institute and the Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA, USA
- Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA
- Vanderbilt Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN, USA
| | - Yuanning Li
- Institute of Marine Science and Technology, Shandong University, Qingdao, China
| | - Xiaofan Zhou
- Guangdong Laboratory for Lingnan Modern Agriculture, Guangdong Province Key Laboratory of Microbial Signals and Disease Control, Integrative Microbiology Research Centre, South China Agricultural University, Guangzhou, China
| | - Xing-Xing Shen
- Key Laboratory of Biology of Crop Pathogens and Insects of Zhejiang Province, Institute of Insect Sciences, Zhejiang University, Hangzhou, China
| | - Antonis Rokas
- Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA.
- Vanderbilt Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN, USA.
- Heidelberg Institute for Theoretical Studies, Heidelberg, Germany.
| |
|
17
|
Togkousidis A, Kozlov OM, Haag J, Höhler D, Stamatakis A. Adaptive RAxML-NG: Accelerating Phylogenetic Inference under Maximum Likelihood using Dataset Difficulty. Mol Biol Evol 2023; 40:msad227. [PMID: 37804116 PMCID: PMC10584362 DOI: 10.1093/molbev/msad227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2023] [Revised: 09/06/2023] [Accepted: 09/26/2023] [Indexed: 10/08/2023] Open
Abstract
Phylogenetic inferences under the maximum likelihood criterion deploy heuristic tree search strategies to explore the vast search space. Depending on the input dataset, searches from different starting trees might all converge to a single tree topology. Often, though, distinct searches infer multiple topologies with large log-likelihood score differences or yield topologically highly distinct, yet almost equally likely, trees. Recently, Haag et al. introduced an approach to quantify, and implemented machine learning methods to predict, the dataset difficulty with respect to phylogenetic inference. Easy multiple sequence alignments (MSAs) exhibit a single likelihood peak on their likelihood surface, associated with a single tree topology to which most, if not all, independent searches rapidly converge. As difficulty increases, multiple locally optimal likelihood peaks emerge, yet from highly distinct topologies. To make use of this information, we introduce and implement an adaptive tree search heuristic in RAxML-NG, which modifies the thoroughness of the tree search strategy as a function of the predicted difficulty. Our adaptive strategy is based upon three observations. First, on easy datasets, searches converge rapidly and can hence be terminated at an earlier stage. Second, overanalyzing difficult datasets is hopeless, and thus it suffices to quickly infer only one of the numerous almost equally likely topologies to reduce overall execution time. Third, more extensive searches are justified and required on datasets with intermediate difficulty. While the likelihood surface exhibits multiple locally optimal peaks in this case, a small proportion of them is significantly better. Our experimental results for the adaptive heuristic on 9,515 empirical and 5,000 simulated datasets with varying difficulty exhibit substantial speedups, especially on easy and difficult datasets (53% of total MSAs), where we observe average speedups of more than 10×. Further, approximately 94% of the inferred trees using the adaptive strategy are statistically indistinguishable from the trees inferred under the standard strategy (RAxML-NG).
Affiliation(s)
- Anastasis Togkousidis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, 69118 Heidelberg, Germany
| | - Oleksiy M Kozlov
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, 69118 Heidelberg, Germany
| | - Julia Haag
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, 69118 Heidelberg, Germany
| | - Dimitri Höhler
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, 69118 Heidelberg, Germany
| | - Alexandros Stamatakis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, 69118 Heidelberg, Germany
- Institute of Theoretical Informatics, Karlsruhe Institute of Technology, 76128 Karlsruhe, Germany
- Biodiversity Computing Group, Institute of Computer Science, Foundation for Research and Technology - Hellas, GR - 711 10 Heraklion, Crete, Greece
| |
|
18
|
Kumar S, Tao Q, Lamarca AP, Tamura K. Computational Reproducibility of Molecular Phylogenies. Mol Biol Evol 2023; 40:msad165. [PMID: 37467477 PMCID: PMC10370456 DOI: 10.1093/molbev/msad165] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2022] [Revised: 07/11/2023] [Accepted: 07/12/2023] [Indexed: 07/21/2023] Open
Abstract
Repeated runs of the same program can generate different molecular phylogenies from identical data sets under the same analytical conditions. This lack of reproducibility of inferred phylogenies casts a long shadow on downstream research employing these phylogenies in areas such as comparative genomics, systematics, and functional biology. We have assessed the relative accuracies and log-likelihoods of alternative phylogenies generated for computer-simulated and empirical data sets. Our findings indicate that these alternative phylogenies reconstruct evolutionary relationships with comparable accuracy. They also have similar log-likelihoods that are not inferior to the log-likelihoods of the true tree. We determined that the direct relationship between irreproducibility and inaccuracy is due to their common dependence on the amount of phylogenetic information in the data. While computational reproducibility can be enhanced through more extensive heuristic searches for the maximum likelihood tree, this does not lead to higher accuracy. We conclude that computational irreproducibility plays a minor role in molecular phylogenetics.
Affiliation(s)
- Sudhir Kumar
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
- Department of Biology, Temple University, Philadelphia, PA, USA
| | - Qiqing Tao
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
- Department of Biology, Temple University, Philadelphia, PA, USA
| | - Alessandra P Lamarca
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
- Department of Biology, Temple University, Philadelphia, PA, USA
- Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
| | - Koichiro Tamura
- Research Center for Genomics and Bioinformatics, Tokyo Metropolitan University, Hachioji, Tokyo, Japan
- Department of Biological Sciences, Tokyo Metropolitan University, Hachioji, Tokyo, Japan
| |
|
19
|
Cokelaer T, Cohen-Boulakia S, Lemoine F. Reprohackathons: promoting reproducibility in bioinformatics through training. Bioinformatics 2023; 39:i11-i20. [PMID: 37387150 DOI: 10.1093/bioinformatics/btad227] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION The reproducibility crisis has highlighted the importance of improving the way bioinformatics data analyses are implemented, executed, and shared. To address this, various tools such as content versioning systems, workflow management systems, and software environment management systems have been developed. While these tools are becoming more widely used, there is still much work to be done to increase their adoption. The most effective way to ensure reproducibility becomes a standard part of most bioinformatics data analysis projects is to integrate it into the curriculum of bioinformatics Master's programs. RESULTS In this article, we present the Reprohackathon, a Master's course that we have been running for the last 3 years at Université Paris-Saclay (France), and that has been attended by a total of 123 students. The course is divided into two parts. The first part includes lessons on the challenges related to reproducibility, content versioning systems, container management, and workflow systems. In the second part, students work on a data analysis project for 3-4 months, reanalyzing data from a previously published study. The Reprohackaton has taught us many valuable lessons, such as the fact that implementing reproducible analyses is a complex and challenging task that requires significant effort. However, providing in-depth teaching of the concepts and the tools during a Master's degree program greatly improves students' understanding and abilities in this area.
Affiliation(s)
- Thomas Cokelaer
- Institut Pasteur, Université Paris Cité, Plate-Forme Technologique Biomics, 75015 Paris, France
- Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, 75015 Paris, France
- Sarah Cohen-Boulakia
- Université Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique, 91405 Orsay, France
- Frédéric Lemoine
- Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, 75015 Paris, France
- Institut Pasteur, Université Paris Cité, G5 Evolutionary Genomics of RNA Viruses, 75015 Paris, France
| |
|
20
|
Černý D, Simonoff AL. Statistical evaluation of character support reveals the instability of higher-level dinosaur phylogeny. Sci Rep 2023; 13:9273. [PMID: 37286556 DOI: 10.1038/s41598-023-35784-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/17/2023] [Accepted: 05/23/2023] [Indexed: 06/09/2023] Open
Abstract
The interrelationships of the three major dinosaur clades (Theropoda, Sauropodomorpha, and Ornithischia) have come under increased scrutiny following the recovery of conflicting phylogenies by a large new character matrix and its extensively modified revision. Here, we use tools derived from recent phylogenomic studies to investigate the strength and causes of this conflict. Using maximum likelihood as an overarching framework, we examine the global support for alternative hypotheses as well as the distribution of phylogenetic signal among individual characters in both the original and rescored dataset. We find the three possible ways of resolving the relationships among the main dinosaur lineages (Saurischia, Ornithischiformes, and Ornithoscelida) to be statistically indistinguishable and supported by nearly equal numbers of characters in both matrices. While the changes made to the revised matrix increased the mean phylogenetic signal of individual characters, this amplified rather than reduced their conflict, resulting in greater sensitivity to character removal or coding changes and little overall improvement in the ability to discriminate between alternative topologies. We conclude that early dinosaur relationships are unlikely to be resolved without fundamental changes to both the quality of available datasets and the techniques used to analyze them.
Affiliation(s)
- David Černý
- Department of the Geophysical Sciences, University of Chicago, 5734 South Ellis Avenue, Chicago, IL, 60637, USA
- Ashley L Simonoff
- Department of the Geophysical Sciences, University of Chicago, 5734 South Ellis Avenue, Chicago, IL, 60637, USA
| |
|