1
Salles MMA, Domingos FMCB. Towards the next generation of species delimitation methods: an overview of machine learning applications. Mol Phylogenet Evol 2025; 210:108368. [PMID: 40348350 DOI: 10.1016/j.ympev.2025.108368] [Received: 12/21/2023] [Revised: 02/25/2025] [Accepted: 05/04/2025] [Indexed: 05/14/2025]
Abstract
Species delimitation is the process of distinguishing between populations of the same species and distinct species of a particular group of organisms. Various methods exist for inferring species limits, whether based on morphological, molecular, or other types of data. In the case of methods based on DNA sequences, most of them are rooted in the coalescent theory. However, coalescence-based models have limitations, for instance regarding complex evolutionary scenarios and large datasets. In this context, machine learning (ML) can be considered as a promising analytical tool, and provides an effective way to explore dataset structures when species-level divergences are hypothesized. In this review, we examine the use of ML in species delimitation and provide an overview and critical appraisal of existing workflows. We also provide simple explanations on how the main types of ML approaches operate, which should help uninitiated researchers and students interested in the field. Our review suggests that while current ML methods designed to infer species limits are analytically powerful, they also present specific limitations and should not be considered as definitive alternatives to coalescent methods for species delimitation. Future ML enterprises to delimit species should consider the constraints related to the use of simulated data, as in other model-based methods relying on simulations. Conversely, the flexibility of ML algorithms offers a significant advantage by enabling the analysis of diverse data types (e.g., genetic and phenotypic) and handling large datasets effectively. We also propose best practices for the use of ML methods in species delimitation, offering insights into potential future applications. We expect that the proposed guidelines will be useful for enhancing the accessibility, effectiveness, and objectivity of ML in species delimitation.
Affiliation(s)
- Matheus M A Salles
- Departamento de Zoologia, Universidade Federal do Paraná, Curitiba 81531-980, Brazil.
2
Zhu Y, Li Y, Li C, Shen XX, Zhou X. A critical evaluation of deep-learning based phylogenetic inference programs using simulated datasets. J Genet Genomics 2025; 52:714-717. [PMID: 39824436 DOI: 10.1016/j.jgg.2025.01.006] [Received: 11/02/2024] [Revised: 01/08/2025] [Accepted: 01/09/2025] [Indexed: 01/20/2025]
Affiliation(s)
- Yixiao Zhu
- College of Agriculture and Biotechnology and Centre for Evolutionary & Organismal Biology, Zhejiang University, Hangzhou, Zhejiang 310058, China
- Yonglin Li
- Guangdong Laboratory for Lingnan Modern Agriculture, Guangdong Province Key Laboratory of Microbial Signals and Disease Control, Integrative Microbiology Research Centre, South China Agricultural University, Guangzhou, Guangdong 510642, China
- Chuhao Li
- Guangdong Laboratory for Lingnan Modern Agriculture, Guangdong Province Key Laboratory of Microbial Signals and Disease Control, Integrative Microbiology Research Centre, South China Agricultural University, Guangzhou, Guangdong 510642, China
- Xing-Xing Shen
- College of Agriculture and Biotechnology and Centre for Evolutionary & Organismal Biology, Zhejiang University, Hangzhou, Zhejiang 310058, China.
- Xiaofan Zhou
- Guangdong Laboratory for Lingnan Modern Agriculture, Guangdong Province Key Laboratory of Microbial Signals and Disease Control, Integrative Microbiology Research Centre, South China Agricultural University, Guangzhou, Guangdong 510642, China.
3
Nesterenko L, Blassel L, Veber P, Boussau B, Jacob L. Phyloformer: Fast, Accurate, and Versatile Phylogenetic Reconstruction with Deep Neural Networks. Mol Biol Evol 2025; 42:msaf051. [PMID: 40066802 PMCID: PMC11965795 DOI: 10.1093/molbev/msaf051] [Received: 07/23/2024] [Revised: 01/16/2025] [Accepted: 01/27/2025] [Indexed: 04/04/2025]
Abstract
Phylogenetic inference aims at reconstructing the tree describing the evolution of a set of sequences descending from a common ancestor. The high computational cost of state-of-the-art maximum likelihood and Bayesian inference methods limits their usability under realistic evolutionary models. Harnessing recent advances in likelihood-free inference and geometric deep learning, we introduce Phyloformer, a fast and accurate method for evolutionary distance estimation and phylogenetic reconstruction. Sampling many trees and sequences under an evolutionary model, we train the network to learn a function that enables predicting a tree from a multiple sequence alignment. On simulated data, we compare Phyloformer to FastME (a distance method) and two maximum likelihood methods: FastTree and IQTree. Under a commonly used model of protein sequence evolution and exploiting graphics processing unit (GPU) acceleration, Phyloformer outpaces all other approaches and exceeds their accuracy in the Kuhner-Felsenstein metric that accounts for both the topology and branch lengths. In terms of topological accuracy alone, Phyloformer outperforms FastME, but falls behind maximum likelihood approaches, especially as the number of sequences increases. When a model of sequence evolution that includes dependencies between sites is used, Phyloformer outperforms all other methods across all metrics on alignments with fewer than 80 sequences. On 3,801 empirical gene alignments from five different datasets, Phyloformer matches the topological accuracy of the two maximum likelihood implementations. Our results pave the way for the adoption of sophisticated realistic models for phylogenetic inference.
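The Kuhner-Felsenstein metric named in this abstract compares two trees through the branch lengths attached to their bipartitions. The sketch below is illustrative and is not taken from the Phyloformer code. It assumes each tree has already been reduced to a dictionary mapping a canonical split (one side of the bipartition, as a `frozenset` of taxon names) to its branch length, and it uses the common square-root form of the distance. The function name and the input encoding are hypothetical.

```python
import math


def kuhner_felsenstein(splits_a, splits_b):
    """Branch-score distance between two trees given as {split: length} dicts.

    Assumption: the caller encodes every branch (internal and terminal) by the
    same canonical side of its bipartition, so equal splits compare equal.
    A split absent from one tree contributes a branch length of 0 there.
    """
    all_splits = set(splits_a) | set(splits_b)
    squared = sum(
        (splits_a.get(s, 0.0) - splits_b.get(s, 0.0)) ** 2 for s in all_splits
    )
    return math.sqrt(squared)


# Toy example: the two trees share one terminal branch and disagree on one
# internal split, so both topology and branch lengths affect the distance.
tree_a = {frozenset({"A", "B"}): 0.3, frozenset({"A"}): 0.1}
tree_b = {frozenset({"A", "C"}): 0.4, frozenset({"A"}): 0.1}
distance = kuhner_felsenstein(tree_a, tree_b)
```

Because a split missing from one tree is treated as having length zero, a topological disagreement and a branch-length disagreement are penalized on the same scale.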
Affiliation(s)
- Luca Nesterenko
- Laboratoire de Biométrie et Biologie Évolutive, Université Lyon 1, Villeurbanne, France
- Luc Blassel
- Laboratoire de Biométrie et Biologie Évolutive, Université Lyon 1, Villeurbanne, France
- Philippe Veber
- Laboratoire de Biométrie et Biologie Évolutive, Université Lyon 1, Villeurbanne, France
- Bastien Boussau
- Laboratoire de Biométrie et Biologie Évolutive, Université Lyon 1, Villeurbanne, France
- Laurent Jacob
- Laboratory of Computational and Quantitative Biology, Sorbonne Université, Paris, France
4
Landis MJ, Thompson A. phyddle: software for exploring phylogenetic models with deep learning. bioRxiv 2025:2024.08.06.606717. [PMID: 39149349 PMCID: PMC11326143 DOI: 10.1101/2024.08.06.606717] [Indexed: 08/17/2024]
Abstract
Phylogenies contain a wealth of information about the evolutionary history and process that gave rise to the diversity of life. This information can be extracted by fitting phylogenetic models to trees. However, many realistic phylogenetic models lack tractable likelihood functions, prohibiting their use with standard inference methods. We present phyddle, pipeline-based software for performing phylogenetic modeling tasks on trees using likelihood-free deep learning approaches. phyddle has a flexible command-line interface, making it easy to integrate deep learning approaches for phylogenetics into research workflows. phyddle coordinates modeling tasks through five pipeline analysis steps (Simulate, Format, Train, Estimate, and Plot) that transform raw phylogenetic datasets as input into numerical and visual model-based output. We conduct three experiments to compare the accuracy of likelihood-based inferences against deep learning-based inferences obtained through phyddle. Benchmarks show that phyddle accurately performs the inference tasks for which it was designed, such as estimating macroevolutionary parameters, selecting among continuous trait evolution models, and passing coverage tests for epidemiological models, even for models that lack tractable likelihoods. Learn more about phyddle at https://phyddle.org.
Affiliation(s)
- Michael J. Landis
- Department of Biology, Washington University, St. Louis, MO, 63110, USA
- Ammon Thompson
- Participant in an education program sponsored by U.S. Department of Defense (DOD)
5
Roa Lozano J, Duncan M, McKenna DD, Castoe TA, DeGiorgio M, Adams R. TraitTrainR: accelerating large-scale simulation under models of continuous trait evolution. Bioinformatics Advances 2024; 5:vbae196. [PMID: 39758830 PMCID: PMC11696700 DOI: 10.1093/bioadv/vbae196] [Received: 06/21/2024] [Revised: 11/08/2024] [Accepted: 12/05/2024] [Indexed: 01/07/2025]
Abstract
Motivation: The scale and scope of comparative trait data are expanding at unprecedented rates, and recent advances in evolutionary modeling and simulation sometimes struggle to match this pace. Well-organized and flexible applications for conducting large-scale simulations of evolution hold promise in this context for understanding models and, more so, our ability to confidently estimate them with real trait data sampled from nature. Results: We introduce TraitTrainR, an R package designed to facilitate efficient, large-scale simulations under complex models of continuous trait evolution. TraitTrainR employs several output formats, supports popular trait data transformations, accommodates multi-trait evolution, and exhibits flexibility in defining input parameter space and model stacking. Moreover, TraitTrainR permits measurement error, allowing for investigation of its potential impacts on evolutionary inference. We envision a wealth of applications of TraitTrainR, and we demonstrate one such example by examining the problem of evolutionary model selection in three empirical phylogenetic case studies. Collectively, these demonstrations of applying TraitTrainR to explore problems in model selection underscore its utility and broader promise for addressing key questions, including those related to experimental design and statistical power, in comparative biology. Availability and implementation: TraitTrainR is developed in R 4.4.0 and is freely available at https://github.com/radamsRHA/TraitTrainR/, which includes detailed documentation, quick-start guides, and a step-by-step tutorial.
Affiliation(s)
- Jenniffer Roa Lozano
- Center for Agricultural Data Analytics, University of Arkansas, Fayetteville, AR 72701, United States
- Department of Entomology and Plant Pathology, University of Arkansas, Fayetteville, AR 72701, United States
- Mataya Duncan
- Center for Agricultural Data Analytics, University of Arkansas, Fayetteville, AR 72701, United States
- Department of Entomology and Plant Pathology, University of Arkansas, Fayetteville, AR 72701, United States
- Duane D McKenna
- Department of Biological Sciences, University of Memphis, Memphis, TN 38152, United States
- Center for Biodiversity Research, University of Memphis, Memphis, TN 38152, United States
- Todd A Castoe
- Department of Biology, University of Texas at Arlington, Arlington, TX 76010, United States
- Michael DeGiorgio
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, United States
- Richard Adams
- Center for Agricultural Data Analytics, University of Arkansas, Fayetteville, AR 72701, United States
- Department of Entomology and Plant Pathology, University of Arkansas, Fayetteville, AR 72701, United States
6
Silvestro D, Latrille T, Salamin N. Toward a Semi-Supervised Learning Approach to Phylogenetic Estimation. Syst Biol 2024; 73:789-806. [PMID: 38916476 PMCID: PMC11639169 DOI: 10.1093/sysbio/syae029] [Received: 09/12/2023] [Revised: 05/21/2024] [Accepted: 06/24/2024] [Indexed: 06/26/2024]
Abstract
Models have always been central to inferring molecular evolution and to reconstructing phylogenetic trees. Their use typically involves the development of a mechanistic framework reflecting our understanding of the underlying biological processes, such as nucleotide substitutions, and the estimation of model parameters by maximum likelihood or Bayesian inference. However, deriving and optimizing the likelihood of the data is not always possible under complex evolutionary scenarios or even tractable for large datasets, often leading to unrealistic simplifying assumptions in the fitted models. To overcome this issue, we coupled stochastic simulations of genome evolution with a new supervised deep-learning model to infer key parameters of molecular evolution. Our model is designed to directly analyze multiple sequence alignments and estimate per-site evolutionary rates and divergence without requiring a known phylogenetic tree. The accuracy of our predictions matched that of likelihood-based phylogenetic inference when rate heterogeneity followed a simple gamma distribution, but it strongly exceeded it under more complex patterns of rate variation, such as codon models. Our approach is highly scalable and can be efficiently applied to genomic data, as we showed on a dataset of 26 million nucleotides from the clownfish clade. Our simulations also showed that the integration of per-site rates obtained by deep learning within a Bayesian framework led to significantly more accurate phylogenetic inference, particularly with respect to the estimated branch lengths. We thus propose that future advancements in phylogenetic analysis will benefit from a semi-supervised learning approach that combines deep-learning estimation of substitution rates, which allows for more flexible models of rate variation, and probabilistic inference of the phylogenetic tree, which guarantees interpretability and a rigorous assessment of statistical support.
Affiliation(s)
- Daniele Silvestro
- Department of Biology, University of Fribourg and Swiss Institute of Bioinformatics, 1700 Fribourg, Switzerland
- Department of Biological and Environmental Sciences, Gothenburg Global Biodiversity Centre, University of Gothenburg, 40530 Gothenburg, Sweden
- Thibault Latrille
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- Nicolas Salamin
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
7
Bernardini G, van Iersel L, Julien E, Stougie L. Inferring phylogenetic networks from multifurcating trees via cherry picking and machine learning. Mol Phylogenet Evol 2024; 199:108137. [PMID: 39029549 DOI: 10.1016/j.ympev.2024.108137] [Received: 11/18/2023] [Revised: 02/19/2024] [Accepted: 06/29/2024] [Indexed: 07/21/2024]
Abstract
The Hybridization problem asks to reconcile a set of conflicting phylogenetic trees into a single phylogenetic network with the smallest possible number of reticulation nodes. This problem is computationally hard and previous solutions are limited to small and/or severely restricted data sets, for example, a set of binary trees with the same taxon set or only two non-binary trees with non-equal taxon sets. Building on our previous work on binary trees, we present FHyNCH, the first algorithmic framework to heuristically solve the Hybridization problem for large sets of multifurcating trees whose sets of taxa may differ. Our heuristics combine the cherry-picking technique, recently proposed to solve the same problem for binary trees, with two carefully designed machine-learning models. We demonstrate that our methods are practical and produce qualitatively good solutions through experiments on both synthetic and real data sets.
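The cherry-picking idea mentioned in this abstract rests on two elementary tree operations: finding cherries (pairs of leaves that share a parent) and reducing a tree by deleting one leaf of a cherry. The sketch below illustrates only these two operations on multifurcating trees. It is not the FHyNCH heuristic and includes none of its machine-learning models. Trees are assumed to be nested tuples with string leaves, and both function names are hypothetical.

```python
from itertools import combinations


def cherries(tree):
    """Return the set of unordered leaf pairs that share a parent node."""
    found = set()
    if isinstance(tree, str):  # a single leaf has no cherries
        return found
    leaves = sorted(child for child in tree if isinstance(child, str))
    found.update(combinations(leaves, 2))  # every leaf pair under this node
    for child in tree:
        found |= cherries(child)
    return found


def remove_leaf(tree, leaf):
    """Delete a leaf and suppress any internal node left with one child."""
    if isinstance(tree, str):
        return None if tree == leaf else tree
    kept = [t for t in (remove_leaf(child, leaf) for child in tree) if t is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]  # collapse the now-unary node
    return tuple(kept)


# Two trees on different taxon sets; pairs that are cherries in both are
# natural candidates for a picking step.
tree_1 = (("a", "b"), ("c", ("d", "e")))
tree_2 = (("a", "b"), ("d", "e"))
shared = cherries(tree_1) & cherries(tree_2)
reduced = remove_leaf(tree_1, "a")
```

In a multifurcating node every pair of leaf children counts as a cherry under this definition, which is why `combinations` is applied to the leaf children of each node.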
Affiliation(s)
- Leo van Iersel
- Delft Institute of Applied Mathematics, Delft, The Netherlands
- Esther Julien
- Delft Institute of Applied Mathematics, Delft, The Netherlands.
- Leen Stougie
- CWI, Amsterdam, the Netherlands; Vrije Universiteit, Amsterdam, The Netherlands; INRIA-Erable, France
8
Li Z, Hu Y, Song Y, Li D, Yang X, Zhang L, Li T, Wang H. Diversity, Distribution and Structural Prediction of the Pathogenic Bacterial Effectors EspN and EspS. Genes (Basel) 2024; 15:1250. [PMID: 39457374 PMCID: PMC11507257 DOI: 10.3390/genes15101250] [Received: 08/09/2024] [Revised: 09/20/2024] [Accepted: 09/23/2024] [Indexed: 10/28/2024]
Abstract
BACKGROUND: Many Gram-negative enterobacteria translocate virulence proteins (effectors) into intestinal epithelial cells using a type III secretion system (T3SS) to subvert various cell functions. Many T3SS effectors have been extensively characterized, but there are still some effector proteins whose functional information is completely unknown. METHODS: In this study, two predicted effectors of unknown function, EspN and EspS (Escherichia coli secreted protein N and S), were selected for analysis of translocation, distribution and structure prediction. RESULTS: The TEM1 (β-lactamase) translocation assay was performed, which showed that EspN and EspS are translocated into host cells in a T3SS-dependent manner during bacterial infection. A phylogenetic tree analysis revealed that homologs of EspN and EspS are widely distributed in pathogenic bacteria. Multiple sequence alignment revealed that EspN and its homologs share a conserved C-terminal region (673-1133 a.a.). Furthermore, the structure of EspN (673-1133 a.a.) was also predicted and well-defined, which showed that it has three subdomains connected by a loop region. EspS and its homologs share a sequence-conserved C-terminal region (146-291 a.a.). The predicted structure of EspS (146-291 a.a.) is composed of a β-sheet consisting of four β-strands and several short helices, which has a TM score of 0.5014 with the structure of the Vibrio cholerae RTX cysteine protease domain (PDBID: 3eeb). CONCLUSIONS: These results suggest that EspN and EspS may represent two important classes of T3SS effectors associated with pathogen virulence, and our findings provide important clues to understanding the potential functions of EspN and EspS.
Affiliation(s)
- Zhan Li
- State Key Laboratory of Pathogens and Biosecurity, Academy of Military Medical Sciences, Beijing 100071, China; (Z.L.); (Y.H.); (Y.S.); (D.L.); (X.Y.); (L.Z.)
- Yuru Hu
- State Key Laboratory of Pathogens and Biosecurity, Academy of Military Medical Sciences, Beijing 100071, China; (Z.L.); (Y.H.); (Y.S.); (D.L.); (X.Y.); (L.Z.)
- Yuan Song
- State Key Laboratory of Pathogens and Biosecurity, Academy of Military Medical Sciences, Beijing 100071, China; (Z.L.); (Y.H.); (Y.S.); (D.L.); (X.Y.); (L.Z.)
- Deyu Li
- State Key Laboratory of Pathogens and Biosecurity, Academy of Military Medical Sciences, Beijing 100071, China; (Z.L.); (Y.H.); (Y.S.); (D.L.); (X.Y.); (L.Z.)
- Xiaolan Yang
- State Key Laboratory of Pathogens and Biosecurity, Academy of Military Medical Sciences, Beijing 100071, China; (Z.L.); (Y.H.); (Y.S.); (D.L.); (X.Y.); (L.Z.)
- Liangyan Zhang
- State Key Laboratory of Pathogens and Biosecurity, Academy of Military Medical Sciences, Beijing 100071, China; (Z.L.); (Y.H.); (Y.S.); (D.L.); (X.Y.); (L.Z.)
- Tao Li
- State Key Laboratory of Pathogens and Biosecurity, Academy of Military Medical Sciences, Beijing 100071, China; (Z.L.); (Y.H.); (Y.S.); (D.L.); (X.Y.); (L.Z.)
- Hui Wang
- State Key Laboratory of Pathogens and Biosecurity, Academy of Military Medical Sciences, Beijing 100071, China; (Z.L.); (Y.H.); (Y.S.); (D.L.); (X.Y.); (L.Z.)
- School of Basic Medical Science, Anhui Medical University, Hefei 230032, China
9
Li X, Zhou T, Feng X, Yau ST, Yau SST. Exploring geometry of genome space via Grassmann manifolds. Innovation (N Y) 2024; 5:100677. [PMID: 39206218 PMCID: PMC11350263 DOI: 10.1016/j.xinn.2024.100677] [Received: 02/06/2024] [Accepted: 07/18/2024] [Indexed: 09/04/2024]
Abstract
It is important to understand the geometry of genome space in biology. After transforming genome sequences into frequency matrices of the chaos game representation (FCGR), we regard a genome sequence as a point in a suitable Grassmann manifold by analyzing the column space of the corresponding FCGR. To assess the sequence similarity, we employ the generalized Grassmannian distance, an intrinsic geometric distance that differs from the traditional Euclidean distance used in the classical k-mer frequency-based methods. With this method, we constructed phylogenetic trees for various genome datasets, including influenza A virus hemagglutinin gene, Orthocoronavirinae genome, and SARS-CoV-2 complete genome sequences. Our comparative analysis with multiple sequence alignment and alignment-free methods for large-scale sequences revealed that our method, which employs the subspace distance between the column spaces of different FCGRs (FCGR-SD), outperformed its competitors in terms of both speed and accuracy. In addition, we used low-dimensional visualization of the SARS-CoV-2 genome sequences and spike protein nucleotide sequences with our methods, resulting in some intriguing findings. We not only propose a novel and efficient algorithm for comparing genome sequences but also demonstrate that genome data have some intrinsic manifold structures, providing a new geometric perspective for molecular biology studies.
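The pipeline outlined in this abstract (sequence to FCGR matrix, FCGR to column space, column spaces to a subspace distance) can be sketched with NumPy as follows. This is a simplified illustration, not the authors' FCGR-SD implementation. The nucleotide-to-corner assignment is one common chaos game convention and is an assumption here. The subspace dimension `rank` is a hypothetical parameter, and the distance shown is the standard geodesic distance between subspaces of equal dimension, not the generalized Grassmannian distance the paper uses.

```python
import numpy as np

# Assumed corner assignment for the chaos game representation: (row bit, column bit).
_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k=4):
    """Count k-mers into a 2^k x 2^k matrix and normalize to frequencies."""
    size = 2 ** k
    mat = np.zeros((size, size))
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in _BITS for base in kmer):
            continue  # skip k-mers containing ambiguous bases
        row = col = 0
        for base in kmer:
            row_bit, col_bit = _BITS[base]
            row = 2 * row + row_bit
            col = 2 * col + col_bit
        mat[row, col] += 1
    total = mat.sum()
    return mat / total if total else mat


def column_space_basis(mat, rank):
    """Orthonormal basis of the leading `rank`-dimensional column space."""
    left_vectors, _, _ = np.linalg.svd(mat, full_matrices=False)
    return left_vectors[:, :rank]


def grassmann_distance(basis_a, basis_b):
    """Geodesic distance computed from the principal angles of two subspaces."""
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    return float(np.sqrt(np.sum(angles ** 2)))
```

A pairwise matrix of such distances could then be passed to any distance-based tree-building method. The choice of `k` and `rank` is left open here because the abstract does not state the values used.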
Affiliation(s)
- Xiaoguang Li
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China
- Tao Zhou
- Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
- Xingdong Feng
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China
- Shing-Tung Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
- Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing 101408, China
- Stephen S.-T. Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
- Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing 101408, China
10
Mello B, Schrago CG. Modeling Substitution Rate Evolution across Lineages and Relaxing the Molecular Clock. Genome Biol Evol 2024; 16:evae199. [PMID: 39332907 PMCID: PMC11430275 DOI: 10.1093/gbe/evae199] [Accepted: 09/08/2024] [Indexed: 09/29/2024]
Abstract
Relaxing the molecular clock using models of how substitution rates change across lineages has become essential for addressing evolutionary problems. The diversity of rate evolution models and their implementations is substantial, and studies have demonstrated that their impact on divergence time estimates can be as significant as that of calibration information. In this review, we trace the development of rate evolution models from the proposal of the molecular clock concept to the development of sophisticated Bayesian and non-Bayesian methods that handle rate variation in phylogenies. We discuss the various approaches to modeling rate evolution, provide a comprehensive list of available software, and examine the challenges and advancements of the prevalent Bayesian framework, contrasting them with faster non-Bayesian methods. Lastly, we offer insights into potential advancements in the field in the era of big data.
Affiliation(s)
- Beatriz Mello
- Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, RJ 21941-617, Brazil
- Carlos G Schrago
- Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, RJ 21941-617, Brazil
11
Mo YK, Hahn MW, Smith ML. Applications of machine learning in phylogenetics. Mol Phylogenet Evol 2024; 196:108066. [PMID: 38565358 DOI: 10.1016/j.ympev.2024.108066] [Received: 10/12/2023] [Revised: 02/16/2024] [Accepted: 03/21/2024] [Indexed: 04/04/2024]
Abstract
Machine learning has increasingly been applied to a wide range of questions in phylogenetic inference. Supervised machine learning approaches that rely on simulated training data have been used to infer tree topologies and branch lengths, to select substitution models, and to perform downstream inferences of introgression and diversification. Here, we review how researchers have used several promising machine learning approaches to make phylogenetic inferences. Despite the promise of these methods, several barriers prevent supervised machine learning from reaching its full potential in phylogenetics. We discuss these barriers and potential paths forward. In the future, we expect that the application of careful network designs and data encodings will allow supervised machine learning to accommodate the complex processes that continue to confound traditional phylogenetic methods.
Affiliation(s)
- Yu K Mo
- Department of Computer Science, Indiana University, Bloomington, IN 47405, USA
- Matthew W Hahn
- Department of Computer Science, Indiana University, Bloomington, IN 47405, USA; Department of Biology, Indiana University, Bloomington, IN 47405, USA
- Megan L Smith
- Department of Biological Sciences, Mississippi State University, Starkville, MS 39762, USA.
12
Leuchtenberger AF, von Haeseler A. Learning From an Artificial Neural Network in Phylogenetics. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:278-288. [PMID: 38198267 DOI: 10.1109/tcbb.2024.3352268] [Indexed: 01/12/2024]
Abstract
We show that an iterative ansatz of deep learning and human intelligence guided simplification may lead to surprisingly simple solutions for a difficult problem in phylogenetics. Distinguishing Farris and Felsenstein trees is a longstanding problem in phylogenetic tree reconstruction. The Artificial Neural Network F-zoneNN solves this problem for 4-taxon alignments evolved under the Jukes-Cantor model. It distinguishes between Farris and Felsenstein trees, but owing to its complexity, lacks transparency in its mechanism of discernment. Based on the simplification of F-zoneNN and alignment properties we constructed the function FarFelDiscerner. In contrast to F-zoneNN, FarFelDiscerner's decision process is understandable. Moreover, FarFelDiscerner is significantly simpler than F-zoneNN. Despite its simplicity this function infers the tree-type almost perfectly on noise-free data, and also performs well on simulated noisy alignments of finite length. We applied FarFelDiscerner to the historical Holometabola alignments where it places Strepsiptera with beetles, concordant with the current scientific view.