1
|
Salles MMA, Domingos FMCB. Towards the next generation of species delimitation methods: an overview of machine learning applications. Mol Phylogenet Evol 2025; 210:108368. [PMID: 40348350 DOI: 10.1016/j.ympev.2025.108368] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2023] [Revised: 02/25/2025] [Accepted: 05/04/2025] [Indexed: 05/14/2025]
Abstract
Species delimitation is the process of distinguishing between populations of the same species and distinct species of a particular group of organisms. Various methods exist for inferring species limits, whether based on morphological, molecular, or other types of data. In the case of methods based on DNA sequences, most of them are rooted in the coalescent theory. However, coalescence-based models have limitations, for instance regarding complex evolutionary scenarios and large datasets. In this context, machine learning (ML) can be considered as a promising analytical tool, and provides an effective way to explore dataset structures when species-level divergences are hypothesized. In this review, we examine the use of ML in species delimitation and provide an overview and critical appraisal of existing workflows. We also provide simple explanations on how the main types of ML approaches operate, which should help uninitiated researchers and students interested in the field. Our review suggests that while current ML methods designed to infer species limits are analytically powerful, they also present specific limitations and should not be considered as definitive alternatives to coalescent methods for species delimitation. Future ML enterprises to delimit species should consider the constraints related to the use of simulated data, as in other model-based methods relying on simulations. Conversely, the flexibility of ML algorithms offers a significant advantage by enabling the analysis of diverse data types (e.g., genetic and phenotypic) and handling large datasets effectively. We also propose best practices for the use of ML methods in species delimitation, offering insights into potential future applications. We expect that the proposed guidelines will be useful for enhancing the accessibility, effectiveness, and objectivity of ML in species delimitation.
Collapse
Affiliation(s)
- Matheus M A Salles
- Departamento de Zoologia, Universidade Federal do Paraná, Curitiba 81531-980, Brazil.
| | | |
Collapse
|
2
|
Zhu Y, Li Y, Li C, Shen XX, Zhou X. A critical evaluation of deep-learning based phylogenetic inference programs using simulated datasets. J Genet Genomics 2025; 52:714-717. [PMID: 39824436 DOI: 10.1016/j.jgg.2025.01.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2024] [Revised: 01/08/2025] [Accepted: 01/09/2025] [Indexed: 01/20/2025]
Affiliation(s)
- Yixiao Zhu
- College of Agriculture and Biotechnology and Centre for Evolutionary & Organismal Biology, Zhejiang University, Hangzhou, Zhejiang 310058, China
| | - Yonglin Li
- Guangdong Laboratory for Lingnan Modern Agriculture, Guangdong Province Key Laboratory of Microbial Signals and Disease Control, Integrative Microbiology Research Centre, South China Agricultural University, Guangzhou, Guangdong 510642, China
| | - Chuhao Li
- Guangdong Laboratory for Lingnan Modern Agriculture, Guangdong Province Key Laboratory of Microbial Signals and Disease Control, Integrative Microbiology Research Centre, South China Agricultural University, Guangzhou, Guangdong 510642, China
| | - Xing-Xing Shen
- College of Agriculture and Biotechnology and Centre for Evolutionary & Organismal Biology, Zhejiang University, Hangzhou, Zhejiang 310058, China.
| | - Xiaofan Zhou
- Guangdong Laboratory for Lingnan Modern Agriculture, Guangdong Province Key Laboratory of Microbial Signals and Disease Control, Integrative Microbiology Research Centre, South China Agricultural University, Guangzhou, Guangdong 510642, China.
| |
Collapse
|
3
|
Nesterenko L, Blassel L, Veber P, Boussau B, Jacob L. Phyloformer: Fast, Accurate, and Versatile Phylogenetic Reconstruction with Deep Neural Networks. Mol Biol Evol 2025; 42:msaf051. [PMID: 40066802 PMCID: PMC11965795 DOI: 10.1093/molbev/msaf051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2024] [Revised: 01/16/2025] [Accepted: 01/27/2025] [Indexed: 04/04/2025] Open
Abstract
Phylogenetic inference aims at reconstructing the tree describing the evolution of a set of sequences descending from a common ancestor. The high computational cost of state-of-the-art maximum likelihood and Bayesian inference methods limits their usability under realistic evolutionary models. Harnessing recent advances in likelihood-free inference and geometric deep learning, we introduce Phyloformer, a fast and accurate method for evolutionary distance estimation and phylogenetic reconstruction. Sampling many trees and sequences under an evolutionary model, we train the network to learn a function that enables predicting a tree from a multiple sequence alignment. On simulated data, we compare Phyloformer to FastME-a distance method-and two maximum likelihood methods: FastTree and IQTree. Under a commonly used model of protein sequence evolution and exploiting graphics processing unit (GPU) acceleration, Phyloformer outpaces all other approaches and exceeds their accuracy in the Kuhner-Felsenstein metric that accounts for both the topology and branch lengths. In terms of topological accuracy alone, Phyloformer outperforms FastME, but falls behind maximum likelihood approaches, especially as the number of sequences increases. When a model of sequence evolution that includes dependencies between sites is used, Phyloformer outperforms all other methods across all metrics on alignments with fewer than 80 sequences. On 3,801 empirical gene alignments from five different datasets, Phyloformer matches the topological accuracy of the two maximum likelihood implementations. Our results pave the way for the adoption of sophisticated realistic models for phylogenetic inference.
Collapse
Affiliation(s)
- Luca Nesterenko
- Laboratoire de Biométrie et Biologie Évolutive, Université Lyon 1, Villeurbanne, France
| | - Luc Blassel
- Laboratoire de Biométrie et Biologie Évolutive, Université Lyon 1, Villeurbanne, France
| | - Philippe Veber
- Laboratoire de Biométrie et Biologie Évolutive, Université Lyon 1, Villeurbanne, France
| | - Bastien Boussau
- Laboratoire de Biométrie et Biologie Évolutive, Université Lyon 1, Villeurbanne, France
| | - Laurent Jacob
- Laboratory of Computational and Quantitative Biology, Sorbonne Université, Paris, France
| |
Collapse
|
4
|
Wijaya AJ, Anžel A, Richard H, Hattab G. Current state and future prospects of Horizontal Gene Transfer detection. NAR Genom Bioinform 2025; 7:lqaf005. [PMID: 39935761 PMCID: PMC11811736 DOI: 10.1093/nargab/lqaf005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2024] [Revised: 12/26/2024] [Accepted: 02/04/2025] [Indexed: 02/13/2025] Open
Abstract
Artificial intelligence (AI) has been shown to be beneficial in a wide range of bioinformatics applications. Horizontal Gene Transfer (HGT) is a driving force of evolutionary changes in prokaryotes. It is widely recognized that it contributes to the emergence of antimicrobial resistance (AMR), which poses a particularly serious threat to public health. Many computational approaches have been developed to study and detect HGT. However, the application of AI in this field has not been investigated. In this work, we conducted a review to provide information on the current trend of existing computational approaches for detecting HGT and to decipher the use of AI in this field. Here, we show a growing interest in HGT detection, characterized by a surge in the number of computational approaches, including AI-based approaches, in recent years. We organize existing computational approaches into a hierarchical structure of computational groups based on their computational methods and show how each computational group evolved. We make recommendations and discuss the challenges of HGT detection in general and the adoption of AI in particular. Moreover, we provide future directions for the field of HGT detection.
Collapse
Affiliation(s)
- Andre Jatmiko Wijaya
- Center for Artificial Intelligent in Public Health Research (ZKI-PH), Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany
- Department of Mathematics and Computer Science, Freie Universität, Arnimallee 14, 14195 Berlin, Germany
- Genome Competence Center (MF1), Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany
| | - Aleksandar Anžel
- Center for Artificial Intelligent in Public Health Research (ZKI-PH), Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany
| | - Hugues Richard
- Genome Competence Center (MF1), Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany
| | - Georges Hattab
- Center for Artificial Intelligent in Public Health Research (ZKI-PH), Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany
- Department of Mathematics and Computer Science, Freie Universität, Arnimallee 14, 14195 Berlin, Germany
| |
Collapse
|
5
|
Landis MJ, Thompson A. phyddle: software for exploring phylogenetic models with deep learning. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2024.08.06.606717. [PMID: 39149349 PMCID: PMC11326143 DOI: 10.1101/2024.08.06.606717] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 08/17/2024]
Abstract
Phylogenies contain a wealth of information about the evolutionary history and process that gave rise to the diversity of life. This information can be extracted by fitting phylogenetic models to trees. However, many realistic phylogenetic models lack tractable likelihood functions, prohibiting their use with standard inference methods. We present phyddle, pipeline-based software for performing phylogenetic modeling tasks on trees using likelihood-free deep learning approaches. phyddle has a flexible command-line interface, making it easy to integrate deep learning approaches for phylogenetics into research workflows. phyddle coordinates modeling tasks through five pipeline analysis steps (Simulate, Format, Train, Estimate, and Plot) that transform raw phylogenetic datasets as input into numerical and visual model-based output. We conduct three experiments to compare the accuracy of likelihood-based inferences against deep learning-based inferences obtained through phyddle. Benchmarks show that phyddle accurately performs the inference tasks for which it was designed, such as estimating macroevolutionary parameters, selecting among continuous trait evolution models, and passing coverage tests for epidemiological models, even for models that lack tractable likelihoods. Learn more about phyddle at https://phyddle.org.
Collapse
Affiliation(s)
- Michael J. Landis
- Department of Biology, Washington University, St. Louis, MO, 63110, USA
| | - Ammon Thompson
- Participant in an education program sponsored by U.S. Department of Defense (DOD)
| |
Collapse
|
6
|
Kulikov N, Derakhshandeh F, Mayer C. Machine learning can be as good as maximum likelihood when reconstructing phylogenetic trees and determining the best evolutionary model on four taxon alignments. Mol Phylogenet Evol 2024; 200:108181. [PMID: 39209046 DOI: 10.1016/j.ympev.2024.108181] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2023] [Revised: 07/15/2024] [Accepted: 08/26/2024] [Indexed: 09/04/2024]
Abstract
Phylogenetic tree reconstruction with molecular data is important in many fields of life science research. The gold standard in this discipline is the phylogenetic tree reconstruction based on the Maximum Likelihood method. In this study, we present neural networks to predict the best model of sequence evolution and the correct topology for four sequence alignments of nucleotide or amino acid sequence data. We trained neural networks with different architectures using simulated alignments for a wide range of evolutionary models, model parameters and branch lengths. By comparing the accuracy of model and topology prediction of the trained neural networks with Maximum Likelihood and Neighbour Joining methods, we show that for quartet trees, the neural network classifier outperforms the Neighbour Joining method and is in most cases as good as the Maximum Likelihood method to infer the best model of sequence evolution and the best tree topology. These results are consistent for nucleotide and amino acid sequence data. We also show that our method is superior for model selection than previously published methods based on convolutionary networks. Furthermore, we found that neural network classifiers are much faster than the IQ-TREE implementation of the Maximum Likelihood method. Our results show that neural networks could become a true competitor for the Maximum Likelihood method in phylogenetic reconstructions.
Collapse
Affiliation(s)
- Nikita Kulikov
- Molecular Evolutionary Biology, Department of Biology, Hamburg University, Germany; Leibniz Institute for the Analysis of Biodiversity Change (LIB), Germany.
| | - Fatemeh Derakhshandeh
- Leibniz Institute for the Analysis of Biodiversity Change (LIB), Germany; Medical Faculty, Heidelberg University, Germany
| | - Christoph Mayer
- Leibniz Institute for the Analysis of Biodiversity Change (LIB), Germany
| |
Collapse
|
7
|
Salinas NR, Eshel G, Coruzzi GM, DeSalle R, Tessler M, Little DP. BAD2matrix: Phylogenomic matrix concatenation, indel coding, and more. APPLICATIONS IN PLANT SCIENCES 2024; 12:e11604. [PMID: 39628543 PMCID: PMC11610412 DOI: 10.1002/aps3.11604] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Received: 12/01/2023] [Revised: 02/24/2024] [Accepted: 03/16/2024] [Indexed: 12/06/2024]
Abstract
Premise Common steps in phylogenomic matrix production include biological sequence concatenation, morphological data concatenation, insertion/deletion (indel) coding, gene content (presence/absence) coding, removing uninformative characters for parsimony analysis, recording with reduced amino acid alphabets, and occupancy filtering. Existing software does not accomplish these tasks on a phylogenomic scale using a single program. Methods and Results BAD2matrix is a Python script that performs the above-mentioned steps in phylogenomic matrix construction for DNA or amino acid sequences as well as morphological data. The script works in UNIX-like environments (e.g., LINUX, MacOS, Windows Subsystem for LINUX). Conclusions BAD2matrix helps simplify phylogenomic pipelines and can be downloaded from https://github.com/dpl10/BAD2matrix/tree/master under a GNU General Public License v2.
Collapse
Affiliation(s)
- Nelson R. Salinas
- Lewis B. and Dorothy Cullman Program for Molecular SystematicsThe New York Botanical Garden, BronxNew YorkUSA
| | - Gil Eshel
- Center for Genomics and Systems BiologyNew York UniversityNew YorkNew YorkUSA
| | - Gloria M. Coruzzi
- Center for Genomics and Systems BiologyNew York UniversityNew YorkNew YorkUSA
| | - Rob DeSalle
- Institute for Comparative GenomicsAmerican Museum of Natural HistoryNew YorkNew YorkUSA
| | - Michael Tessler
- Lewis B. and Dorothy Cullman Program for Molecular SystematicsThe New York Botanical Garden, BronxNew YorkUSA
- Institute for Comparative GenomicsAmerican Museum of Natural HistoryNew YorkNew YorkUSA
- Department of Biology, Medgar Evers CollegeCity University of New YorkBrooklynNew YorkUSA
| | - Damon P. Little
- Lewis B. and Dorothy Cullman Program for Molecular SystematicsThe New York Botanical Garden, BronxNew YorkUSA
| |
Collapse
|
8
|
Silvestro D, Latrille T, Salamin N. Toward a Semi-Supervised Learning Approach to Phylogenetic Estimation. Syst Biol 2024; 73:789-806. [PMID: 38916476 PMCID: PMC11639169 DOI: 10.1093/sysbio/syae029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2023] [Revised: 05/21/2024] [Accepted: 06/24/2024] [Indexed: 06/26/2024] Open
Abstract
Models have always been central to inferring molecular evolution and to reconstructing phylogenetic trees. Their use typically involves the development of a mechanistic framework reflecting our understanding of the underlying biological processes, such as nucleotide substitutions, and the estimation of model parameters by maximum likelihood or Bayesian inference. However, deriving and optimizing the likelihood of the data is not always possible under complex evolutionary scenarios or even tractable for large datasets, often leading to unrealistic simplifying assumptions in the fitted models. To overcome this issue, we coupled stochastic simulations of genome evolution with a new supervised deep-learning model to infer key parameters of molecular evolution. Our model is designed to directly analyze multiple sequence alignments and estimate per-site evolutionary rates and divergence without requiring a known phylogenetic tree. The accuracy of our predictions matched that of likelihood-based phylogenetic inference when rate heterogeneity followed a simple gamma distribution, but it strongly exceeded it under more complex patterns of rate variation, such as codon models. Our approach is highly scalable and can be efficiently applied to genomic data, as we showed on a dataset of 26 million nucleotides from the clownfish clade. Our simulations also showed that the integration of per-site rates obtained by deep learning within a Bayesian framework led to significantly more accurate phylogenetic inference, particularly with respect to the estimated branch lengths. We thus propose that future advancements in phylogenetic analysis will benefit from a semi-supervised learning approach that combines deep-learning estimation of substitution rates, which allows for more flexible models of rate variation, and probabilistic inference of the phylogenetic tree, which guarantees interpretability and a rigorous assessment of statistical support.
Collapse
Affiliation(s)
- Daniele Silvestro
- Department of Biology, University of Fribourg and Swiss Institute of Bioinformatics, 1700 Fribourg, Switzerland
- Department of Biological and Environmental Sciences, Gothenburg Global Biodiversity Centre, University of Gothenburg, 40530 Gothenburg, Sweden
| | - Thibault Latrille
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
| | - Nicolas Salamin
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
| |
Collapse
|
9
|
Redelings BD, Holmes I, Lunter G, Pupko T, Anisimova M. Insertions and Deletions: Computational Methods, Evolutionary Dynamics, and Biological Applications. Mol Biol Evol 2024; 41:msae177. [PMID: 39172750 PMCID: PMC11385596 DOI: 10.1093/molbev/msae177] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/10/2024] [Revised: 07/02/2024] [Accepted: 07/09/2024] [Indexed: 08/24/2024] Open
Abstract
Insertions and deletions constitute the second most important source of natural genomic variation. Insertions and deletions make up to 25% of genomic variants in humans and are involved in complex evolutionary processes including genomic rearrangements, adaptation, and speciation. Recent advances in long-read sequencing technologies allow detailed inference of insertions and deletion variation in species and populations. Yet, despite their importance, evolutionary studies have traditionally ignored or mishandled insertions and deletions due to a lack of comprehensive methodologies and statistical models of insertions and deletion dynamics. Here, we discuss methods for describing insertions and deletion variation and modeling insertions and deletions over evolutionary time. We provide practical advice for tackling insertions and deletions in genomic sequences and illustrate our discussion with examples of insertions and deletion-induced effects in human and other natural populations and their contribution to evolutionary processes. We outline promising directions for future developments in statistical methodologies that would allow researchers to analyze insertions and deletion variation and their effects in large genomic data sets and to incorporate insertions and deletions in evolutionary inference.
Collapse
Affiliation(s)
| | - Ian Holmes
- Department of Bioengineering, University of California, Berkeley, CA 94720, USA
- Calico Life Sciences LLC, South San Francisco, CA 94080, USA
| | - Gerton Lunter
- Department of Epidemiology, University Medical Center Groningen, University of Groningen, Groningen 9713 GZ, The Netherlands
| | - Tal Pupko
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
| | - Maria Anisimova
- Institute of Computational Life Sciences, Zurich University of Applied Sciences, Wädenswil, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
| |
Collapse
|
10
|
Yang B, Zhou X, Liu S. Tracing the genealogy origin of geographic populations based on genomic variation and deep learning. Mol Phylogenet Evol 2024; 198:108142. [PMID: 38964594 DOI: 10.1016/j.ympev.2024.108142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2023] [Revised: 05/30/2024] [Accepted: 07/01/2024] [Indexed: 07/06/2024]
Abstract
Assigning a query individual animal or plant to its derived population is a prime task in diverse applications related to organismal genealogy. Such endeavors have conventionally relied on short DNA sequences under a phylogenetic framework. These methods naturally show constraints when the inferred population sources are ambiguously phylogenetically structured, a scenario demanding substantially more informative genetic signals. Recent advances in cost-effective production of whole-genome sequences and artificial intelligence have created an unprecedented opportunity to trace the population origin for essentially any given individual, as long as the genome reference data are comprehensive and standardized. Here, we developed a convolutional neural network method to identify population origins using genomic SNPs. Three empirical datasets (an Asian honeybee, a red fire ant, and a chicken datasets) and two simulated populations are used for the proof of concepts. The performance tests indicate that our method can accurately identify the genealogy origin of query individuals, with success rates ranging from 93 % to 100 %. We further showed that the accuracy of the model can be significantly increased by refining the informative sites through FST filtering. Our method is robust to configurations related to batch sizes and epochs, whereas model learning benefits from the setting of a proper preset learning rate. Moreover, we explained the importance score of key sites for algorithm interpretability and credibility, which has been largely ignored. We anticipate that by coupling genomics and deep learning, our method will see broad potential in conservation and management applications that involve natural resources, invasive pests and weeds, and illegal trades of wildlife products.
Collapse
Affiliation(s)
- Bing Yang
- Department of Entomology, China Agricultural University, Beijing 100193, China
| | - Xin Zhou
- Department of Entomology, China Agricultural University, Beijing 100193, China.
| | - Shanlin Liu
- Department of Entomology, China Agricultural University, Beijing 100193, China; Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China.
| |
Collapse
|
11
|
Suvorov A, Schrider DR. Reliable estimation of tree branch lengths using deep neural networks. PLoS Comput Biol 2024; 20:e1012337. [PMID: 39102450 PMCID: PMC11326709 DOI: 10.1371/journal.pcbi.1012337] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2023] [Revised: 08/15/2024] [Accepted: 07/18/2024] [Indexed: 08/07/2024] Open
Abstract
A phylogenetic tree represents hypothesized evolutionary history for a set of taxa. Besides the branching patterns (i.e., tree topology), phylogenies contain information about the evolutionary distances (i.e. branch lengths) between all taxa in the tree, which include extant taxa (external nodes) and their last common ancestors (internal nodes). During phylogenetic tree inference, the branch lengths are typically co-estimated along with other phylogenetic parameters during tree topology space exploration. There are well-known regions of the branch length parameter space where accurate estimation of phylogenetic trees is especially difficult. Several novel studies have recently demonstrated that machine learning approaches have the potential to help solve phylogenetic problems with greater accuracy and computational efficiency. In this study, as a proof of concept, we sought to explore the possibility of machine learning models to predict branch lengths. To that end, we designed several deep learning frameworks to estimate branch lengths on fixed tree topologies from multiple sequence alignments or its representations. Our results show that deep learning methods can exhibit superior performance in some difficult regions of branch length parameter space. For example, in contrast to maximum likelihood inference, which is typically used for estimating branch lengths, deep learning methods are more efficient and accurate. In general, we find that our neural networks achieve similar accuracy to a Bayesian approach and are the best-performing methods when inferring long branches that are associated with distantly related taxa. Together, our findings represent a next step toward accurate, fast, and reliable phylogenetic inference with machine learning approaches.
Collapse
Affiliation(s)
- Anton Suvorov
- Department of Biological Sciences, Virginia Tech, Blacksburg, Virginia, United States of America
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
| | - Daniel R Schrider
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
| |
Collapse
|
12
|
Sharma S, Kumar S. Discovering Fragile Clades and Causal Sequences in Phylogenomics by Evolutionary Sparse Learning. Mol Biol Evol 2024; 41:msae131. [PMID: 38916040 PMCID: PMC11247346 DOI: 10.1093/molbev/msae131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/02/2024] [Revised: 05/30/2024] [Accepted: 06/20/2024] [Indexed: 06/26/2024] Open
Abstract
Phylogenomic analyses of long sequences, consisting of many genes and genomic segments, reconstruct organismal relationships with high statistical confidence. But, inferred relationships can be sensitive to excluding just a few sequences. Currently, there is no direct way to identify fragile relationships and the associated individual gene sequences in species. Here, we introduce novel metrics for gene-species sequence concordance and clade probability derived from evolutionary sparse learning models. We validated these metrics using fungi, plant, and animal phylogenomic datasets, highlighting the ability of the new metrics to pinpoint fragile clades and the sequences responsible. The new approach does not necessitate the investigation of alternative phylogenetic hypotheses, substitution models, or repeated data subset analyses. Our methodology offers a streamlined approach to evaluating major inferred clades and identifying sequences that may distort reconstructed phylogenies using large datasets.
Collapse
Affiliation(s)
- Sudip Sharma
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA 19122, USA
- Department of Biology, Temple University, Philadelphia, PA 19122, USA
| | - Sudhir Kumar
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA 19122, USA
- Department of Biology, Temple University, Philadelphia, PA 19122, USA
| |
Collapse
|
13
|
Mo YK, Hahn MW, Smith ML. Applications of machine learning in phylogenetics. Mol Phylogenet Evol 2024; 196:108066. [PMID: 38565358 DOI: 10.1016/j.ympev.2024.108066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2023] [Revised: 02/16/2024] [Accepted: 03/21/2024] [Indexed: 04/04/2024]
Abstract
Machine learning has increasingly been applied to a wide range of questions in phylogenetic inference. Supervised machine learning approaches that rely on simulated training data have been used to infer tree topologies and branch lengths, to select substitution models, and to perform downstream inferences of introgression and diversification. Here, we review how researchers have used several promising machine learning approaches to make phylogenetic inferences. Despite the promise of these methods, several barriers prevent supervised machine learning from reaching its full potential in phylogenetics. We discuss these barriers and potential paths forward. In the future, we expect that the application of careful network designs and data encodings will allow supervised machine learning to accommodate the complex processes that continue to confound traditional phylogenetic methods.
Collapse
Affiliation(s)
- Yu K Mo
- Department of Computer Science, Indiana University, Bloomington, IN 47405, USA
| | - Matthew W Hahn
- Department of Computer Science, Indiana University, Bloomington, IN 47405, USA; Department of Biology, Indiana University, Bloomington, IN 47405, USA
| | - Megan L Smith
- Department of Biological Sciences, Mississippi State University, Starkville, MS 39762, USA.
| |
Collapse
|
14
|
Ecker N, Huchon D, Mansour Y, Mayrose I, Pupko T. A machine-learning-based alternative to phylogenetic bootstrap. Bioinformatics 2024; 40:i208-i217. [PMID: 38940166 PMCID: PMC11211842 DOI: 10.1093/bioinformatics/btae255] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/29/2024] Open
Abstract
MOTIVATION Currently used methods for estimating branch support in phylogenetic analyses often rely on the classic Felsenstein's bootstrap, parametric tests, or their approximations. As these branch support scores are widely used in phylogenetic analyses, having accurate, fast, and interpretable scores is of high importance. RESULTS Here, we employed a data-driven approach to estimate branch support values with a probabilistic interpretation. To this end, we simulated thousands of realistic phylogenetic trees and the corresponding multiple sequence alignments. Each of the obtained alignments was used to infer the phylogeny using state-of-the-art phylogenetic inference software, which was then compared to the true tree. Using these extensive data, we trained machine-learning algorithms to estimate branch support values for each bipartition within the maximum-likelihood trees obtained by each software. Our results demonstrate that our model provides fast and more accurate probability-based branch support values than commonly used procedures. We demonstrate the applicability of our approach on empirical datasets. AVAILABILITY AND IMPLEMENTATION The data supporting this work are available in the Figshare repository at https://doi.org/10.6084/m9.figshare.25050554.v1, and the underlying code is accessible via GitHub at https://github.com/noaeker/bootstrap_repo.
Collapse
Affiliation(s)
- Noa Ecker
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
| | - Dorothée Huchon
- School of Zoology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
- The Steinhardt Museum of Natural History and National Research Center, Tel Aviv University, Tel Aviv 6997801, Israel
| | - Yishay Mansour
- The Blavatnik School of Computer Science, Raymond & Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
| | - Itay Mayrose
- School of Plant Sciences and Food Security, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
| | - Tal Pupko
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
| |
Collapse
|
15
|
Azouri D, Granit O, Alburquerque M, Mansour Y, Pupko T, Mayrose I. The Tree Reconstruction Game: Phylogenetic Reconstruction Using Reinforcement Learning. Mol Biol Evol 2024; 41:msae105. [PMID: 38829798 PMCID: PMC11180600 DOI: 10.1093/molbev/msae105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2023] [Revised: 05/17/2024] [Accepted: 05/28/2024] [Indexed: 06/05/2024] Open
Abstract
The computational search for the maximum-likelihood phylogenetic tree is an NP-hard problem. As such, current tree search algorithms might result in a tree that is the local optima, not the global one. Here, we introduce a paradigm shift for predicting the maximum-likelihood tree, by approximating long-term gains of likelihood rather than maximizing likelihood gain at each step of the search. Our proposed approach harnesses the power of reinforcement learning to learn an optimal search strategy, aiming at the global optimum of the search space. We show that when analyzing empirical data containing dozens of sequences, the log-likelihood improvement from the starting tree obtained by the reinforcement learning-based agent was 0.969 or higher compared to that achieved by current state-of-the-art techniques. Notably, this performance is attained without the need to perform costly likelihood optimizations apart from the training process, thus potentially allowing for an exponential increase in runtime. We exemplify this for data sets containing 15 sequences of length 18,000 bp and demonstrate that the reinforcement learning-based method is roughly three times faster than the state-of-the-art software. This study illustrates the potential of reinforcement learning in addressing the challenges of phylogenetic tree reconstruction.
Collapse
Affiliation(s)
- Dana Azouri
- School of Plant Sciences and Food Security, Tel Aviv University, Ramat Aviv, Tel Aviv 69978, Israel
- The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Ramat Aviv, Tel Aviv 69978, Israel
| | - Oz Granit
- Balvatnik School of Computer Science, Tel Aviv University, Ramat Aviv, Tel Aviv 69978, Israel
| | - Michael Alburquerque
- The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Ramat Aviv, Tel Aviv 69978, Israel
| | - Yishay Mansour
- Balvatnik School of Computer Science, Tel Aviv University, Ramat Aviv, Tel Aviv 69978, Israel
| | - Tal Pupko
- The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Ramat Aviv, Tel Aviv 69978, Israel
| | - Itay Mayrose
- School of Plant Sciences and Food Security, Tel Aviv University, Ramat Aviv, Tel Aviv 69978, Israel
| |
Collapse
|
16
|
Fonseca EM, Pope NS, Peterman WE, Werneck FP, Colli GR, Carstens BC. Genetic structure and landscape effects on gene flow in the Neotropical lizard Norops brasiliensis (Squamata: Dactyloidae). Heredity (Edinb) 2024; 132:284-295. [PMID: 38575800 PMCID: PMC11166928 DOI: 10.1038/s41437-024-00682-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2023] [Revised: 03/12/2024] [Accepted: 03/18/2024] [Indexed: 04/06/2024] Open
Abstract
One key research goal of evolutionary biology is to understand the origin and maintenance of genetic variation. In the Cerrado, the South American savanna located primarily in the Central Brazilian Plateau, many hypotheses have been proposed to explain how landscape features (e.g., geographic distance, river barriers, topographic compartmentalization, and historical climatic fluctuations) have promoted genetic structure by mediating gene flow. Here, we asked whether these landscape features have influenced the genetic structure and differentiation in the lizard species Norops brasiliensis (Squamata: Dactyloidae). To achieve our goal, we used a genetic clustering analysis and estimate an effective migration surface to assess genetic structure in the focal species. Optimized isolation-by-resistance models and a simulation-based approach combined with machine learning (convolutional neural network; CNN) were then used to infer current and historical effects on population genetic structure through 12 unique landscape models. We recovered five geographically distributed populations that are separated by regions of lower-than-expected gene flow. The results of the CNN showed that geographic distance is the sole predictor of genetic variation in N. brasiliensis, and that slope, rivers, and historical climate had no discernible influence on gene flow. Our novel CNN approach was accurate (89.5%) in differentiating each landscape model. CNN and other machine learning approaches are still largely unexplored in landscape genetics studies, representing promising avenues for future research with increasingly accessible genomic datasets.
Collapse
Affiliation(s)
- Emanuel M Fonseca
- Department of Evolution, Ecology and Organismal Biology, The Ohio State University, Columbus, OH, USA
| | - Nathaniel S Pope
- Institute of Ecology and Evolution, University of Oregon, Eugene, OR, 97403, USA
| | - William E Peterman
- School of Environment and Natural Resources, The Ohio State University, Columbus, OH, USA
| | - Fernanda P Werneck
- Coordenação de Biodiversidade, Programa de Coleções Científicas Biológicas, Instituto Nacional de Pesquisas da Amazônia (INPA), Manaus, Brazil
| | - Guarino R Colli
- Departamento de Zoologia, Universidade de Brasília, Brasília, Brazil
| | - Bryan C Carstens
- Department of Evolution, Ecology and Organismal Biology, The Ohio State University, Columbus, OH, USA.
| |
Collapse
|
17
|
Thompson A, Liebeskind BJ, Scully EJ, Landis MJ. Deep Learning and Likelihood Approaches for Viral Phylogeography Converge on the Same Answers Whether the Inference Model Is Right or Wrong. Syst Biol 2024; 73:183-206. [PMID: 38189575 PMCID: PMC11249978 DOI: 10.1093/sysbio/syad074] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2023] [Revised: 11/22/2023] [Accepted: 01/05/2024] [Indexed: 01/09/2024] Open
Abstract
Analysis of phylogenetic trees has become an essential tool in epidemiology. Likelihood-based methods fit models to phylogenies to draw inferences about the phylodynamics and history of viral transmission. However, these methods are often computationally expensive, which limits the complexity and realism of phylodynamic models and makes them ill-suited for informing policy decisions in real-time during rapidly developing outbreaks. Likelihood-free methods using deep learning are pushing the boundaries of inference beyond these constraints. In this paper, we extend, compare, and contrast a recently developed deep learning method for likelihood-free inference from trees. We trained multiple deep neural networks using phylogenies from simulated outbreaks that spread among 5 locations and found they achieve close to the same levels of accuracy as Bayesian inference under the true simulation model. We compared robustness to model misspecification of a trained neural network to that of a Bayesian method. We found that both models had comparable performance, converging on similar biases. We also implemented a method of uncertainty quantification called conformalized quantile regression that we demonstrate has similar patterns of sensitivity to model misspecification as Bayesian highest posterior density (HPD) and greatly overlap with HPDs, but have lower precision (more conservative). Finally, we trained and tested a neural network against phylogeographic data from a recent study of the SARS-Cov-2 pandemic in Europe and obtained similar estimates of region-specific epidemiological parameters and the location of the common ancestor in Europe. Along with being as accurate and robust as likelihood-based methods, our trained neural networks are on average over 3 orders of magnitude faster after training. Our results support the notion that neural networks can be trained with simulated data to accurately mimic the good and bad statistical properties of the likelihood functions of generative phylogenetic models.
Collapse
Affiliation(s)
- Ammon Thompson
- Participant in an Education Program Sponsored by U.S. Department of Defense (DOD) at the National Geospatial-Intelligence Agency, Springfield, VA 22150, USA
| | | | - Erik J Scully
- National Geospatial-Intelligence Agency, Springfield, VA 22150, USA
| | - Michael J Landis
- Department of Biology, Washington University in St. Louis, Rebstock Hall, St. Louis, MO 63130, USA
| |
Collapse
|
18
|
Leuchtenberger AF, von Haeseler A. Learning From an Artificial Neural Network in Phylogenetics. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:278-288. [PMID: 38198267 DOI: 10.1109/tcbb.2024.3352268] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/12/2024]
Abstract
We show that an iterative ansatz of deep learning and human intelligence guided simplification may lead to surprisingly simple solutions for a difficult problem in phylogenetics. Distinguishing Farris and Felsenstein trees is a longstanding problem in phylogenetic tree reconstruction. The Artificial Neural Network F-zoneNN solves this problem for 4-taxon alignments evolved under the Jukes-Cantor model. It distinguishes between Farris and Felsenstein trees, but owing to its complexity, lacks transparency in its mechanism of discernment. Based on the simplification of F-zoneNN and alignment properties we constructed the function FarFelDiscerner. In contrast to F-zoneNN, FarFelDiscerner's decision process is understandable. Moreover, FarFelDiscerner is significantly simpler than F-zoneNN. Despite its simplicity this function infers the tree-type almost perfectly on noise-free data, and also performs well on simulated noisy alignments of finite length. We applied FarFelDiscerner to the historical Holometabola alignments where it places Strepsiptera with beetles, concordant with the current scientific view.
Collapse
|
19
|
Tang X, Zepeda-Nuñez L, Yang S, Zhao Z, Solís-Lemus C. Novel symmetry-preserving neural network model for phylogenetic inference. BIOINFORMATICS ADVANCES 2024; 4:vbae022. [PMID: 38638281 PMCID: PMC11026143 DOI: 10.1093/bioadv/vbae022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 09/14/2023] [Revised: 01/29/2024] [Accepted: 02/17/2024] [Indexed: 04/20/2024]
Abstract
Motivation Scientists world-wide are putting together massive efforts to understand how the biodiversity that we see on Earth evolved from single-cell organisms at the origin of life and this diversification process is represented through the Tree of Life. Low sampling rates and high heterogeneity in the rate of evolution across sites and lineages produce a phenomenon denoted "long branch attraction" (LBA) in which long nonsister lineages are estimated to be sisters regardless of their true evolutionary relationship. LBA has been a pervasive problem in phylogenetic inference affecting different types of methodologies from distance-based to likelihood-based. Results Here, we present a novel neural network model that outperforms standard phylogenetic methods and other neural network implementations under LBA settings. Furthermore, unlike existing neural network models in phylogenetics, our model naturally accounts for the tree isomorphisms via permutation invariant functions which ultimately result in lower memory and allows the seamless extension to larger trees. Availability and implementation We implement our novel theory on an open-source publicly available GitHub repository: https://github.com/crsl4/nn-phylogenetics.
Collapse
Affiliation(s)
- Xudong Tang
- Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI 53706, United States
- Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, United States
| | - Leonardo Zepeda-Nuñez
- Department of Mathematics, University of Wisconsin-Madison, Madison, WI 53706, United States
| | - Shengwen Yang
- Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI 53706, United States
- Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, United States
| | - Zelin Zhao
- Department of Mathematics, University of Wisconsin-Madison, Madison, WI 53706, United States
| | - Claudia Solís-Lemus
- Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI 53706, United States
- Department of Plant Pathology, University of Wisconsin-Madison, Madison, WI 53706, United States
| |
Collapse
|
20
|
Trost J, Haag J, Höhler D, Jacob L, Stamatakis A, Boussau B. Simulations of Sequence Evolution: How (Un)realistic They Are and Why. Mol Biol Evol 2024; 41:msad277. [PMID: 38124381 PMCID: PMC10768886 DOI: 10.1093/molbev/msad277] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2023] [Revised: 11/17/2023] [Accepted: 12/08/2023] [Indexed: 12/23/2023] Open
Abstract
MOTIVATION Simulating multiple sequence alignments (MSAs) using probabilistic models of sequence evolution plays an important role in the evaluation of phylogenetic inference tools and is crucial to the development of novel learning-based approaches for phylogenetic reconstruction, for instance, neural networks. These models and the resulting simulated data need to be as realistic as possible to be indicative of the performance of the developed tools on empirical data and to ensure that neural networks trained on simulations perform well on empirical data. Over the years, numerous models of evolution have been published with the goal to represent as faithfully as possible the sequence evolution process and thus simulate empirical-like data. In this study, we simulated DNA and protein MSAs under increasingly complex models of evolution with and without insertion/deletion (indel) events using a state-of-the-art sequence simulator. We assessed their realism by quantifying how accurately supervised learning methods are able to predict whether a given MSA is simulated or empirical. RESULTS Our results show that we can distinguish between empirical and simulated MSAs with high accuracy using two distinct and independently developed classification approaches across all tested models of sequence evolution. Our findings suggest that the current state-of-the-art models fail to accurately replicate several aspects of empirical MSAs, including site-wise rates as well as amino acid and nucleotide composition.
Collapse
Affiliation(s)
- Johanna Trost
- Biometry and Evolutionary Biology Laboratory (LBBE), University Claude Bernard Lyon 1, Lyon, France
| | - Julia Haag
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
| | - Dimitri Höhler
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
| | - Laurent Jacob
- CNRS, IBPS, Laboratory of Computational and Quantitative Biology (LCQB), UMR 7238, Sorbonne Université, Paris 75005, France
| | - Alexandros Stamatakis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Biodiversity Computing Group, Institute of Computer Science, Foundation for Research and Technology - Hellas, Heraklion, Crete, Greece
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | - Bastien Boussau
- Biometry and Evolutionary Biology Laboratory (LBBE), University Claude Bernard Lyon 1, Lyon, France
| |
Collapse
|
21
|
Lambert S, Voznica J, Morlon H. Deep Learning from Phylogenies for Diversification Analyses. Syst Biol 2023; 72:1262-1279. [PMID: 37556735 DOI: 10.1093/sysbio/syad044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2022] [Revised: 06/20/2023] [Accepted: 08/08/2023] [Indexed: 08/11/2023] Open
Abstract
Birth-death (BD) models are widely used in combination with species phylogenies to study past diversification dynamics. Current inference approaches typically rely on likelihood-based methods. These methods are not generalizable, as a new likelihood formula must be established each time a new model is proposed; for some models, such a formula is not even tractable. Deep learning can bring solutions in such situations, as deep neural networks can be trained to learn the relation between simulations and parameter values as a regression problem. In this paper, we adapt a recently developed deep learning method from pathogen phylodynamics to the case of diversification inference, and we extend its applicability to the case of the inference of state-dependent diversification models from phylogenies associated with trait data. We demonstrate the accuracy and time efficiency of the approach for the time-constant homogeneous BD model and the Binary-State Speciation and Extinction model. Finally, we illustrate the use of the proposed inference machinery by reanalyzing a phylogeny of primates and their associated ecological role as seed dispersers. Deep learning inference provides at least the same accuracy as likelihood-based inference while being faster by several orders of magnitude, offering a promising new inference approach for the deployment of future models in the field.
Collapse
Affiliation(s)
- Sophia Lambert
- Institut de Biologie de l'École Normale Supérieure, École Normale Supérieure, CNRS, INSERM, Université Paris Sciences et Lettres, 46 Rue d'Ulm, 75005 Paris, France
- Institute of Ecology and Evolution, Department of Biology, 5289 University of Oregon, Eugene, OR 97403, USA
| | - Jakub Voznica
- Institut Pasteur, Université Paris Cité, Unité Bioinformatique Evolutive, 25-28 Rue du Dr Roux, 75015 Paris, France
- Unité de Biologie Computationnelle, USR 3756 CNRS, 25-28 Rue du Dr Roux, 75015 Paris, France
| | - Hélène Morlon
- Institut de Biologie de l'École Normale Supérieure, École Normale Supérieure, CNRS, INSERM, Université Paris Sciences et Lettres, 46 Rue d'Ulm, 75005 Paris, France
| |
Collapse
|
22
|
Steenwyk JL, Li Y, Zhou X, Shen XX, Rokas A. Incongruence in the phylogenomics era. Nat Rev Genet 2023; 24:834-850. [PMID: 37369847 PMCID: PMC11499941 DOI: 10.1038/s41576-023-00620-x] [Citation(s) in RCA: 60] [Impact Index Per Article: 30.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 05/19/2023] [Indexed: 06/29/2023]
Abstract
Genome-scale data and the development of novel statistical phylogenetic approaches have greatly aided the reconstruction of a broad sketch of the tree of life and resolved many of its branches. However, incongruence - the inference of conflicting evolutionary histories - remains pervasive in phylogenomic data, hampering our ability to reconstruct and interpret the tree of life. Biological factors, such as incomplete lineage sorting, horizontal gene transfer, hybridization, introgression, recombination and convergent molecular evolution, can lead to gene phylogenies that differ from the species tree. In addition, analytical factors, including stochastic, systematic and treatment errors, can drive incongruence. Here, we review these factors, discuss methodological advances to identify and handle incongruence, and highlight avenues for future research.
Collapse
Affiliation(s)
- Jacob L Steenwyk
- Howards Hughes Medical Institute and the Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA, USA
- Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA
- Vanderbilt Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN, USA
| | - Yuanning Li
- Institute of Marine Science and Technology, Shandong University, Qingdao, China
| | - Xiaofan Zhou
- Guangdong Laboratory for Lingnan Modern Agriculture, Guangdong Province Key Laboratory of Microbial Signals and Disease Control, Integrative Microbiology Research Centre, South China Agricultural University, Guangzhou, China
| | - Xing-Xing Shen
- Key Laboratory of Biology of Crop Pathogens and Insects of Zhejiang Province, Institute of Insect Sciences, Zhejiang University, Hangzhou, China
| | - Antonis Rokas
- Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA.
- Vanderbilt Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN, USA.
- Heidelberg Institute for Theoretical Studies, Heidelberg, Germany.
| |
Collapse
|
23
|
Wang Z, Sun J, Gao Y, Xue Y, Zhang Y, Li K, Zhang W, Zhang C, Zu J, Zhang L. Fusang: a framework for phylogenetic tree inference via deep learning. Nucleic Acids Res 2023; 51:10909-10923. [PMID: 37819036 PMCID: PMC10639059 DOI: 10.1093/nar/gkad805] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2023] [Revised: 08/17/2023] [Accepted: 09/20/2023] [Indexed: 10/13/2023] Open
Abstract
Phylogenetic tree inference is a classic fundamental task in evolutionary biology that entails inferring the evolutionary relationship of targets based on multiple sequence alignment (MSA). Maximum likelihood (ML) and Bayesian inference (BI) methods have dominated phylogenetic tree inference for many years, but BI is too slow to handle a large number of sequences. Recently, deep learning (DL) has been successfully applied to quartet phylogenetic tree inference and tentatively extended into more sequences with the quartet puzzling algorithm. However, no DL-based tools are immediately available for practical real-world applications. In this paper, we propose Fusang (http://fusang.cibr.ac.cn), a DL-based framework that achieves comparable performance to that of ML-based tools with both simulated and real datasets. More importantly, with continuous optimization, e.g. through the use of customized training datasets for real-world scenarios, Fusang has great potential to outperform ML-based tools.
Collapse
Affiliation(s)
- Zhicheng Wang
- Chinese Institute for Brain Research, Beijing 102206, China
- Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
| | - Jinnan Sun
- School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an 710049, China
| | - Yuan Gao
- Chinese Institute for Brain Research, Beijing 102206, China
| | - Yongwei Xue
- Chinese Institute for Brain Research, Beijing 102206, China
| | - Yubo Zhang
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
| | - Kuan Li
- Chinese Institute for Brain Research, Beijing 102206, China
| | - Wei Zhang
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
- State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Peking University, Beijing 100871, China
| | - Chi Zhang
- Key Laboratory of Vertebrate Evolution and Human Origins, Institute of Vertebrate Paleontology and Paleoanthropology, Center for Excellence in Life and Paleoenvironment, Chinese Academy of Sciences, Beijing 100044, China
| | - Jian Zu
- School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an 710049, China
| | - Li Zhang
- Chinese Institute for Brain Research, Beijing 102206, China
| |
Collapse
|
24
|
Zhang Y, Zhu Q, Shao Y, Jiang Y, Ouyang Y, Zhang L, Zhang W. Inferring Historical Introgression with Deep Learning. Syst Biol 2023; 72:1013-1038. [PMID: 37257491 DOI: 10.1093/sysbio/syad033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/01/2022] [Revised: 05/28/2023] [Accepted: 05/30/2023] [Indexed: 06/02/2023] Open
Abstract
Resolving phylogenetic relationships among taxa remains a challenge in the era of big data due to the presence of genetic admixture in a wide range of organisms. Rapidly developing sequencing technologies and statistical tests enable evolutionary relationships to be disentangled at a genome-wide level, yet many of these tests are computationally intensive and rely on phased genotypes, large sample sizes, restricted phylogenetic topologies, or hypothesis testing. To overcome these difficulties, we developed a deep learning-based approach, named ERICA, for inferring genome-wide evolutionary relationships and local introgressed regions from sequence data. ERICA accepts sequence alignments of both population genomic data and multiple genome assemblies, and efficiently identifies discordant genealogy patterns and exchanged regions across genomes when compared with other methods. We further tested ERICA using real population genomic data from Heliconius butterflies that have undergone adaptive radiation and frequent hybridization. Finally, we applied ERICA to characterize hybridization and introgression in wild and cultivated rice, revealing the important role of introgression in rice domestication and adaptation. Taken together, our findings demonstrate that ERICA provides an effective method for teasing apart evolutionary relationships using whole genome data, which can ultimately facilitate evolutionary studies on hybridization and introgression.
Collapse
Affiliation(s)
- Yubo Zhang
- State Key Laboratory of Protein and Plant Gene Research, Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
| | - Qingjie Zhu
- Chinese Institute for Brain Research, Beijing 102206, China
| | - Yi Shao
- Chinese Institute for Brain Research, Beijing 102206, China
| | - Yanchen Jiang
- State Key Laboratory of Protein and Plant Gene Research, Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
- State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Peking University, Beijing 100871, China
| | - Yidan Ouyang
- National Key Laboratory of Crop Genetic Improvement and National Centre of Plant Gene Research (Wuhan), Hubei Hongshan Laboratory, Huazhong Agricultural University, Wuhan 430070, China
| | - Li Zhang
- Chinese Institute for Brain Research, Beijing 102206, China
| | - Wei Zhang
- State Key Laboratory of Protein and Plant Gene Research, Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
- State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Peking University, Beijing 100871, China
| |
Collapse
|
25
|
Burgstaller-Muehlbacher S, Crotty SM, Schmidt HA, Reden F, Drucks T, von Haeseler A. ModelRevelator: Fast phylogenetic model estimation via deep learning. Mol Phylogenet Evol 2023; 188:107905. [PMID: 37595933 DOI: 10.1016/j.ympev.2023.107905] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2023] [Accepted: 08/12/2023] [Indexed: 08/20/2023]
Abstract
Selecting the best model of sequence evolution for a multiple-sequence-alignment (MSA) constitutes the first step of phylogenetic tree reconstruction. Common approaches for inferring nucleotide models typically apply maximum likelihood (ML) methods, with discrimination between models determined by one of several information criteria. This requires tree reconstruction and optimisation which can be computationally expensive. We demonstrate that neural networks can be used to perform model selection, without the need to reconstruct trees, optimise parameters, or calculate likelihoods. We introduce ModelRevelator, a model selection tool underpinned by two deep neural networks. The first neural network, NNmodelfind, recommends one of six commonly used models of sequence evolution, ranging in complexity from Jukes and Cantor to General Time Reversible. The second, NNalphafind, recommends whether or not a Γ-distributed rate heterogeneous model should be incorporated, and if so, provides an estimate of the shape parameter, ɑ. Users can simply input an MSA into ModelRevelator, and swiftly receive output recommending the evolutionary model, inclusive of the presence or absence of rate heterogeneity, and an estimate of ɑ. We show that ModelRevelator performs comparably with likelihood-based methods and the recently published machine learning method ModelTeller over a wide range of parameter settings, with significant potential savings in computational effort. Further, we show that this performance is not restricted to the alignments on which the networks were trained, but is maintained even on unseen empirical data. We expect that ModelRevelator will provide a valuable alternative for phylogeneticists, especially where traditional methods of model selection are computationally prohibitive.
Collapse
Affiliation(s)
- Sebastian Burgstaller-Muehlbacher
- Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna and Medical University of Vienna, Vienna BioCenter (VBC) 5, 1030 Vienna, Austria.
| | - Stephen M Crotty
- School of Mathematical Sciences, University of Adelaide, Adelaide, SA 5005, Australia; ARC Centre of Excellence for Mathematical and Statistical Frontiers, University of Adelaide, Adelaide, SA 5005, Australia
| | - Heiko A Schmidt
- Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna and Medical University of Vienna, Vienna BioCenter (VBC) 5, 1030 Vienna, Austria
| | - Franziska Reden
- Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna and Medical University of Vienna, Vienna BioCenter (VBC) 5, 1030 Vienna, Austria
| | - Tamara Drucks
- Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna and Medical University of Vienna, Vienna BioCenter (VBC) 5, 1030 Vienna, Austria; Research Unit Machine Learning, TU Wien, 1040 Vienna, Austria
| | - Arndt von Haeseler
- Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna and Medical University of Vienna, Vienna BioCenter (VBC) 5, 1030 Vienna, Austria; Bioinformatics and Computational Biology, Faculty of Computer Science, University of Vienna, Währinger Straße 29, 1090 Vienna, Austria
| |
Collapse
|
26
|
Köksal Z, Børsting C, Gusmão L, Pereira V. SNPtotree-Resolving the Phylogeny of SNPs on Non-Recombining DNA. Genes (Basel) 2023; 14:1837. [PMID: 37895186 PMCID: PMC10606150 DOI: 10.3390/genes14101837] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2023] [Revised: 09/18/2023] [Accepted: 09/21/2023] [Indexed: 10/29/2023] Open
Abstract
Genetic variants on non-recombining DNA and the hierarchical order in which they accumulate are commonly of interest. This variant hierarchy can be established and combined with information on the population and geographic origin of the individuals carrying the variants to find population structures and infer migration patterns. Further, individuals can be assigned to the characterized populations, which is relevant in forensic genetics, genetic genealogy, and epidemiologic studies. However, there is currently no straightforward method to obtain such a variant hierarchy. Here, we introduce the software SNPtotree v1.0, which uniquely determines the hierarchical order of variants on non-recombining DNA without error-prone manual sorting. The algorithm uses pairwise variant comparisons to infer their relationships and integrates the combined information into a phylogenetic tree. Variants that have contradictory pairwise relationships or ambiguous positions in the tree are removed by the software. When benchmarked using two human Y-chromosomal massively parallel sequencing datasets, SNPtotree outperforms traditional methods in the accuracy of phylogenetic trees for sequencing data with high amounts of missing information. The phylogenetic trees of variants created using SNPtotree can be used to establish and maintain publicly available phylogeny databases to further explore genetic epidemiology and genealogy, as well as population and forensic genetics.
Collapse
Affiliation(s)
- Zehra Köksal
- Section of Forensic Genetics, Department of Forensic Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, 2100 Copenhagen, Denmark; (Z.K.); (C.B.)
| | - Claus Børsting
- Section of Forensic Genetics, Department of Forensic Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, 2100 Copenhagen, Denmark; (Z.K.); (C.B.)
| | - Leonor Gusmão
- DNA Diagnostic Laboratory (LDD), State University of Rio de Janeiro (UERJ), Rio de Janeiro 20550-013, Brazil;
| | - Vania Pereira
- Section of Forensic Genetics, Department of Forensic Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, 2100 Copenhagen, Denmark; (Z.K.); (C.B.)
| |
Collapse
|
27
|
Ly-Trong N, Barca GMJ, Minh BQ. AliSim-HPC: parallel sequence simulator for phylogenetics. Bioinformatics 2023; 39:btad540. [PMID: 37656933 PMCID: PMC10534053 DOI: 10.1093/bioinformatics/btad540] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2022] [Revised: 08/16/2023] [Accepted: 08/31/2023] [Indexed: 09/03/2023] Open
Abstract
MOTIVATION Sequence simulation plays a vital role in phylogenetics with many applications, such as evaluating phylogenetic methods, testing hypotheses, and generating training data for machine-learning applications. We recently introduced a new simulator for multiple sequence alignments called AliSim, which outperformed existing tools. However, with the increasing demands of simulating large data sets, AliSim is still slow due to its sequential implementation; for example, to simulate millions of sequence alignments, AliSim took several days or weeks. Parallelization has been used for many phylogenetic inference methods but not yet for sequence simulation. RESULTS This paper introduces AliSim-HPC, which, for the first time, employs high-performance computing for phylogenetic simulations. AliSim-HPC parallelizes the simulation process at both multi-core and multi-CPU levels using the OpenMP and message passing interface (MPI) libraries, respectively. AliSim-HPC is highly efficient and scalable, which reduces the runtime to simulate 100 large gap-free alignments (30 000 sequences of one million sites) from over one day to 11 min using 256 CPU cores from a cluster with six computing nodes, a 153-fold speedup. While the OpenMP version can only simulate gap-free alignments, the MPI version supports insertion-deletion models like the sequential AliSim. AVAILABILITY AND IMPLEMENTATION AliSim-HPC is open-source and available as part of the new IQ-TREE version v2.2.3 at https://github.com/iqtree/iqtree2/releases with a user manual at http://www.iqtree.org/doc/AliSim.
Collapse
Affiliation(s)
- Nhan Ly-Trong
- School of Computing, College of Engineering, Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia
| | - Giuseppe M J Barca
- School of Computing, College of Engineering, Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia
| | - Bui Quang Minh
- School of Computing, College of Engineering, Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia
| |
Collapse
|
28
|
Smith ML, Hahn MW. Phylogenetic inference using generative adversarial networks. Bioinformatics 2023; 39:btad543. [PMID: 37669126 PMCID: PMC10500083 DOI: 10.1093/bioinformatics/btad543] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2023] [Revised: 08/25/2023] [Accepted: 09/04/2023] [Indexed: 09/07/2023] Open
Abstract
MOTIVATION The application of machine learning approaches in phylogenetics has been impeded by the vast model space associated with inference. Supervised machine learning approaches require data from across this space to train models. Because of this, previous approaches have typically been limited to inferring relationships among unrooted quartets of taxa, where there are only three possible topologies. Here, we explore the potential of generative adversarial networks (GANs) to address this limitation. GANs consist of a generator and a discriminator: at each step, the generator aims to create data that is similar to real data, while the discriminator attempts to distinguish generated and real data. By using an evolutionary model as the generator, we use GANs to make evolutionary inferences. Since a new model can be considered at each iteration, heuristic searches of complex model spaces are possible. Thus, GANs offer a potential solution to the challenges of applying machine learning in phylogenetics. RESULTS We developed phyloGAN, a GAN that infers phylogenetic relationships among species. phyloGAN takes as input a concatenated alignment, or a set of gene alignments, and infers a phylogenetic tree either considering or ignoring gene tree heterogeneity. We explored the performance of phyloGAN for up to 15 taxa in the concatenation case and 6 taxa when considering gene tree heterogeneity. Error rates are relatively low in these simple cases. However, run times are slow and performance metrics suggest issues during training. Future work should explore novel architectures that may result in more stable and efficient GANs for phylogenetics. AVAILABILITY AND IMPLEMENTATION phyloGAN is available on github: https://github.com/meganlsmith/phyloGAN/.
Collapse
Affiliation(s)
- Megan L Smith
- Department of Biology, Indiana University, 1001 E 3rd St, Bloomington, IN 47405, United States
| | - Matthew W Hahn
- Department of Biology, Indiana University, 1001 E 3rd St, Bloomington, IN 47405, United States
- Department of Computer Science, Indiana University, 700 N Woodlawn Avenue, Bloomington, IN 47408, United States
| |
Collapse
|
29
|
Raslan MA, Raslan SA, Shehata EM, Mahmoud AS, Sabri NA. Advances in the Applications of Bioinformatics and Chemoinformatics. Pharmaceuticals (Basel) 2023; 16:1050. [PMID: 37513961 PMCID: PMC10384252 DOI: 10.3390/ph16071050] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2023] [Revised: 07/19/2023] [Accepted: 07/20/2023] [Indexed: 07/30/2023] Open
Abstract
Chemoinformatics involves integrating the principles of physical chemistry with computer-based and information science methodologies, commonly referred to as "in silico techniques", in order to address a wide range of descriptive and prescriptive chemistry issues, including applications to biology, drug discovery, and related molecular areas. On the other hand, the incorporation of machine learning has been considered of high importance in the field of drug design, enabling the extraction of chemical data from enormous compound databases to develop drugs endowed with significant biological features. The present review discusses the field of cheminformatics and proposes the use of virtual chemical libraries in virtual screening methods to increase the probability of discovering novel hit chemicals. The virtual libraries address the need to increase the quality of the compounds as well as discover promising ones. On the other hand, various applications of bioinformatics in disease classification, diagnosis, and identification of multidrug-resistant organisms were discussed. The use of ensemble models and brute-force feature selection methodology has resulted in high accuracy rates for heart disease and COVID-19 diagnosis, along with the role of special formulations for targeting meningitis and Alzheimer's disease. Additionally, the correlation between genomic variations and disease states such as obesity and chronic progressive external ophthalmoplegia, the investigation of the antibacterial activity of pyrazole and benzimidazole-based compounds against resistant microorganisms, and its applications in chemoinformatics for the prediction of drug properties and toxicity-all the previously mentioned-were presented in the current review.
Collapse
Affiliation(s)
| | | | | | - Amr S Mahmoud
- Department of Obstetrics and Gynecology, Faculty of Medicine, Ain Shams University, Cairo P.O. Box 11566, Egypt
| | - Nagwa A Sabri
- Department of Clinical Pharmacy, Faculty of Pharmacy, Ain Shams University, Cairo P.O. Box 11566, Egypt
| |
Collapse
|
30
|
Kende J, Bonomi M, Temmam S, Regnault B, Pérot P, Eloit M, Bigot T. Virus Pop-Expanding Viral Databases by Protein Sequence Simulation. Viruses 2023; 15:1227. [PMID: 37376527 DOI: 10.3390/v15061227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2023] [Revised: 05/15/2023] [Accepted: 05/16/2023] [Indexed: 06/29/2023] Open
Abstract
The improvement of our knowledge of the virosphere, which includes unknown viruses, is a key area in virology. Metagenomics tools, which perform taxonomic assignation from high throughput sequencing datasets, are generally evaluated with datasets derived from biological samples or in silico spiked samples containing known viral sequences present in public databases, resulting in the inability to evaluate the capacity of these tools to detect novel or distant viruses. Simulating realistic evolutionary directions is therefore key to benchmark and improve these tools. Additionally, expanding current databases with realistic simulated sequences can improve the capacity of alignment-based searching strategies for finding distant viruses, which could lead to a better characterization of the "dark matter" of metagenomics data. Here, we present Virus Pop, a novel pipeline for simulating realistic protein sequences and adding new branches to a protein phylogenetic tree. The tool generates simulated sequences with substitution rate variations that are dependent on protein domains and inferred from the input dataset, allowing for a realistic representation of protein evolution. The pipeline also infers ancestral sequences corresponding to multiple internal nodes of the input data phylogenetic tree, enabling new sequences to be inserted at various points of interest in the group studied. We demonstrated that Virus Pop produces simulated sequences that closely match the structural and functional characteristics of real protein sequences, taking as an example the spike protein of sarbecoviruses. Virus Pop also succeeded at creating sequences that resemble real sequences not included in the databases, which facilitated the identification of a novel pathogenic human circovirus not included in the input database. In conclusion, Virus Pop is helpful for challenging taxonomic assignation tools and could help improve databases to better detect distant viruses.
Collapse
Affiliation(s)
- Julia Kende
- Bioinformatics and Biostatistics Hub, Institut Pasteur, Université Paris Cité, F-75015 Paris, France
| | - Massimiliano Bonomi
- Department of Structural Biology and Chemistry, Institut Pasteur, Université Paris Cité, CNRS UMR 3528, F-75015 Paris, France
| | - Sarah Temmam
- Pathogen Discovery Laboratory, Institut Pasteur, Université Paris Cité, F-75015 Paris, France
| | - Béatrice Regnault
- Pathogen Discovery Laboratory, Institut Pasteur, Université Paris Cité, F-75015 Paris, France
| | - Philippe Pérot
- Pathogen Discovery Laboratory, Institut Pasteur, Université Paris Cité, F-75015 Paris, France
| | - Marc Eloit
- Pathogen Discovery Laboratory, Institut Pasteur, Université Paris Cité, F-75015 Paris, France
- Ecole Nationale Vétérinaire d'Alfort, F-94700 Maisons-Alfort, France
| | - Thomas Bigot
- Bioinformatics and Biostatistics Hub, Institut Pasteur, Université Paris Cité, F-75015 Paris, France
- Pathogen Discovery Laboratory, Institut Pasteur, Université Paris Cité, F-75015 Paris, France
| |
Collapse
|
31
|
Korfmann K, Gaggiotti OE, Fumagalli M. Deep Learning in Population Genetics. Genome Biol Evol 2023; 15:evad008. [PMID: 36683406 PMCID: PMC9897193 DOI: 10.1093/gbe/evad008] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2022] [Revised: 12/19/2022] [Accepted: 01/16/2023] [Indexed: 01/24/2023] Open
Abstract
Population genetics is transitioning into a data-driven discipline thanks to the availability of large-scale genomic data and the need to study increasingly complex evolutionary scenarios. With likelihood and Bayesian approaches becoming either intractable or computationally unfeasible, machine learning, and in particular deep learning, algorithms are emerging as popular techniques for population genetic inferences. These approaches rely on algorithms that learn non-linear relationships between the input data and the model parameters being estimated through representation learning from training data sets. Deep learning algorithms currently employed in the field comprise discriminative and generative models with fully connected, convolutional, or recurrent layers. Additionally, a wide range of powerful simulators to generate training data under complex scenarios are now available. The application of deep learning to empirical data sets mostly replicates previous findings of demography reconstruction and signals of natural selection in model organisms. To showcase the feasibility of deep learning to tackle new challenges, we designed a branched architecture to detect signals of recent balancing selection from temporal haplotypic data, which exhibited good predictive performance on simulated data. Investigations on the interpretability of neural networks, their robustness to uncertain training data, and creative representation of population genetic data, will provide further opportunities for technological advancements in the field.
Collapse
Affiliation(s)
- Kevin Korfmann
- Professorship for Population Genetics, Department of Life Science Systems, Technical University of Munich, Germany
| | - Oscar E Gaggiotti
- Centre for Biological Diversity, Sir Harold Mitchell Building, University of St Andrews, Fife KY16 9TF, UK
| | - Matteo Fumagalli
- Department of Biological and Behavioural Sciences, Queen Mary University of London, UK
| |
Collapse
|
32
|
Zhang C, Mirarab S. Weighting by Gene Tree Uncertainty Improves Accuracy of Quartet-based Species Trees. Mol Biol Evol 2022; 39:6750035. [PMID: 36201617 PMCID: PMC9750496 DOI: 10.1093/molbev/msac215] [Citation(s) in RCA: 50] [Impact Index Per Article: 16.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2022] [Revised: 09/20/2022] [Accepted: 10/03/2022] [Indexed: 01/07/2023] Open
Abstract
Phylogenomic analyses routinely estimate species trees using methods that account for gene tree discordance. However, the most scalable species tree inference methods, which summarize independently inferred gene trees to obtain a species tree, are sensitive to hard-to-avoid errors introduced in the gene tree estimation step. This dilemma has created much debate on the merits of concatenation versus summary methods and practical obstacles to using summary methods more widely and to the exclusion of concatenation. The most successful attempt at making summary methods resilient to noisy gene trees has been contracting low support branches from the gene trees. Unfortunately, this approach requires arbitrary thresholds and poses new challenges. Here, we introduce threshold-free weighting schemes for the quartet-based species tree inference, the metric used in the popular method ASTRAL. By reducing the impact of quartets with low support or long terminal branches (or both), weighting provides stronger theoretical guarantees and better empirical performance than the unweighted ASTRAL. Our simulations show that weighting improves accuracy across many conditions and reduces the gap with concatenation in conditions with low gene tree discordance and high noise. On empirical data, weighting improves congruence with concatenation and increases support. Together, our results show that weighting, enabled by a new optimization algorithm we introduce, improves the utility of summary methods and can reduce the incongruence often observed across analytical pipelines.
Collapse
Affiliation(s)
- Chao Zhang
- Bioinformatics and Systems Biology, UC San Diego, La Jolla, CA, USA
| | | |
Collapse
|
33
|
Burger KE, Pfaffelhuber P, Baumdicker F. Neural networks for self-adjusting mutation rate estimation when the recombination rate is unknown. PLoS Comput Biol 2022; 18:e1010407. [PMID: 35921376 PMCID: PMC9377634 DOI: 10.1371/journal.pcbi.1010407] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2021] [Revised: 08/15/2022] [Accepted: 07/18/2022] [Indexed: 11/18/2022] Open
Abstract
Estimating the mutation rate, or equivalently effective population size, is a common task in population genetics. If recombination is low or high, optimal linear estimation methods are known and well understood. For intermediate recombination rates, the calculation of optimal estimators is more challenging. As an alternative to model-based estimation, neural networks and other machine learning tools could help to develop good estimators in these involved scenarios. However, if no benchmark is available it is difficult to assess how well suited these tools are for different applications in population genetics. Here we investigate feedforward neural networks for the estimation of the mutation rate based on the site frequency spectrum and compare their performance with model-based estimators. For this we use the model-based estimators introduced by Fu, Futschik et al., and Watterson that minimize the variance or mean squared error for no and free recombination. We find that neural networks reproduce these estimators if provided with the appropriate features and training sets. Remarkably, using the model-based estimators to adjust the weights of the training data, only one hidden layer is necessary to obtain a single estimator that performs almost as well as model-based estimators for low and high recombination rates, and at the same time provides a superior estimation method for intermediate recombination rates. We apply the method to simulated data based on the human chromosome 2 recombination map, highlighting its robustness in a realistic setting where local recombination rates vary and/or are unknown. single-layer feedforward neural networks learn the established model-based linear estimators for high and low recombination rates neural networks learn good estimators for intermediate recombination rates where computation of model-based optimal estimators is hardly possible a single neural network estimator can automatically adapt to a variable recombination rate and performs close to optimal this is advantageous when recombination rates vary along the chromosome according to a recombination map using the known estimators as a benchmark to adapt the training error function improves the estimates of the neural networks
Collapse
Affiliation(s)
- Klara Elisabeth Burger
- Cluster of Excellence “Machine Learning: New Perspectives for Science”, University of Tübingen, Tübingen, Germany
| | - Peter Pfaffelhuber
- Department of Mathematical Stochastics, University of Freiburg, Freiburg, Germany
| | - Franz Baumdicker
- Cluster of Excellence “Machine Learning: New Perspectives for Science”, University of Tübingen, Tübingen, Germany
- Cluster of Excellence “Controlling Microbes to Fight Infections”, Mathematical and Computational Population Genetics, University of Tübingen, Tübingen, Germany
- * E-mail:
| |
Collapse
|
34
|
Smith BT, Merwin J, Provost KL, Thom G, Brumfield RT, Ferreira M, Mauck Iii WM, Moyle RG, Wright T, Joseph L. Phylogenomic analysis of the parrots of the world distinguishes artifactual from biological sources of gene tree discordance. Syst Biol 2022; 72:228-241. [PMID: 35916751 DOI: 10.1093/sysbio/syac055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2021] [Revised: 02/22/2022] [Accepted: 07/22/2022] [Indexed: 11/14/2022] Open
Abstract
Gene tree discordance is expected in phylogenomic trees and biological processes are often invoked to explain it. However, heterogeneous levels of phylogenetic signal among individuals within datasets may cause artifactual sources of topological discordance. We examined how the information content in tips and subclades impacts topological discordance in the parrots (Order: Psittaciformes), a diverse and highly threatened clade of nearly 400 species. Using ultraconserved elements from 96% of the clade's species-level diversity, we estimated concatenated and species trees for 382 ingroup taxa. We found that discordance among tree topologies was most common at nodes dating between the late Miocene and Pliocene, and often at the taxonomic level of genus. Accordingly, we used two metrics to characterize information content in tips and assess the degree to which conflict between trees was being driven by lower quality samples. Most instances of topological conflict and non-monophyletic genera in the species tree could be objectively identified using these metrics. For subclades still discordant after tip-based filtering, we used a machine learning approach to determine whether phylogenetic signal or noise was the more important predictor of metrics supporting the alternative topologies. We found that when signal favored one of the topologies, noise was the most important variable in poorly performing models that favored the alternative topology. In sum, we show that artifactual sources of gene tree discordance, which are likely a common phenomenon in many datasets, can be distinguished from biological sources by quantifying the information content in each tip and modeling which factors support each topology.
Collapse
Affiliation(s)
- Brian Tilston Smith
- Department of Ornithology, American Museum of Natural History, Central Park West at 79th Street, New York, NY 10024, USA
| | - Jon Merwin
- Department of Ornithology, Academy of Natural Sciences of Drexel University, 1900 Benjamin Franklin Parkway, Philadelphia, PA 19103, USA.,Department of Biodiversity, Earth, and Environmental Science, Drexel University, Philadelphia, PA 19103, USA
| | - Kaiya L Provost
- Department of Evolution, Ecology, and Organismal Biology, The Ohio State University, 318 W. 12th Avenue, Columbus, OH 43210, USA
| | - Gregory Thom
- Museum of Natural Science and Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA
| | - Robb T Brumfield
- Museum of Natural Science and Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA
| | - Mateus Ferreira
- Centro de Estudos da Biodiversidade, Universidade Federal de Roraima, Av. Cap. Ene Garcez, 2413, Boa Vista, RR, Brazil
| | - William M Mauck Iii
- Department of Ornithology, American Museum of Natural History, Central Park West at 79th Street, New York, NY 10024, USA
| | - Robert G Moyle
- Department of Ecology and Evolutionary Biology and Biodiversity Institute, University of Kansas, 1345 Jayhawk Blvd., Lawrence, KS 66045, USA
| | - Timothy Wright
- Department of Biology, New Mexico State University, Las Cruces, NM, 88003, USA
| | - Leo Joseph
- Australian National Wildlife Collection, National Research Collections Australia, CSIRO, GPO Box 1700, Canberra, ACT, 2601, Australia
| |
Collapse
|
35
|
Borowiec ML, Dikow RB, Frandsen PB, McKeeken A, Valentini G, White AE. Deep learning as a tool for ecology and evolution. Methods Ecol Evol 2022. [DOI: 10.1111/2041-210x.13901] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Affiliation(s)
- Marek L. Borowiec
- Entomology, Plant Pathology and Nematology University of Idaho Moscow ID USA
- Institute for Bioinformatics and Evolutionary Studies (IBEST) University of Idaho Moscow ID USA
| | - Rebecca B. Dikow
- Data Science Lab, Office of the Chief Information Officer Smithsonian Institution Washington DC USA
| | - Paul B. Frandsen
- Data Science Lab, Office of the Chief Information Officer Smithsonian Institution Washington DC USA
- Department of Plant and Wildlife Sciences Brigham Young University Provo UT USA
| | - Alexander McKeeken
- Entomology, Plant Pathology and Nematology University of Idaho Moscow ID USA
| | | | - Alexander E. White
- Data Science Lab, Office of the Chief Information Officer Smithsonian Institution Washington DC USA
- Department of Botany, National Museum of Natural History Smithsonian Institution Washington DC USA
| |
Collapse
|
36
|
Ly-Trong N, Naser-Khdour S, Lanfear R, Minh BQ. AliSim: a fast and versatile phylogenetic sequence simulator for the genomic era. Mol Biol Evol 2022; 39:6577219. [PMID: 35511713 PMCID: PMC9113491 DOI: 10.1093/molbev/msac092] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Abstract
Sequence simulators play an important role in phylogenetics. Simulated data has many applications, such as evaluating the performance of different methods, hypothesis testing with parametric bootstraps, and, more recently, generating data for training machine-learning applications. Many sequence simulation programs exist, but the most feature-rich programs tend to be rather slow, and the fastest programs tend to be feature-poor. Here, we introduce AliSim, a new tool that can efficiently simulate biologically realistic alignments under a large range of complex evolutionary models. To achieve high performance across a wide range of simulation conditions, AliSim implements an adaptive approach that combines the commonly-used rate matrix and probability matrix approaches. AliSim takes 1.4 hours and 1.3 GB RAM to simulate alignments with one million sequences or sites, while popular software Seq-Gen, Dawg, and INDELible require two to five hours and 50 to 500 GB of RAM. We provide AliSim as an extension of the IQ-TREE software version 2.2, freely available at www.iqtree.org, and a comprehensive user tutorial at http://www.iqtree.org/doc/AliSim.
Collapse
Affiliation(s)
- Nhan Ly-Trong
- School of Computing, College of Engineering and Computer Science, Australian National University, Canberra, ACT 2600, Australia
| | - Suha Naser-Khdour
- Ecology and Evolution, Research School of Biology, College of Science, Australian National University, Canberra, ACT 2600, Australia
| | - Robert Lanfear
- Ecology and Evolution, Research School of Biology, College of Science, Australian National University, Canberra, ACT 2600, Australia
| | - Bui Quang Minh
- School of Computing, College of Engineering and Computer Science, Australian National University, Canberra, ACT 2600, Australia
| |
Collapse
|
37
|
Jiang Y, Balaban M, Zhu Q, Mirarab S. DEPP: Deep Learning Enables Extending Species Trees using Single Genes. Syst Biol 2022; 72:17-34. [PMID: 35485976 DOI: 10.1093/sysbio/syac031] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2021] [Revised: 04/13/2022] [Accepted: 04/22/2022] [Indexed: 11/13/2022] Open
Abstract
Placing new sequences onto reference phylogenies is increasingly used for analyzing environmental samples, especially microbiomes. Existing placement methods assume that query sequences have evolved under specific models directly on the reference phylogeny. For example, they assume single-gene data (e.g., 16S rRNA amplicons) have evolved under the GTR model on a gene tree. Placement, however, often has a more ambitious goal: extending a (genome-wide) species tree given data from individual genes without knowing the evolutionary model. Addressing this challenging problem requires new directions. Here, we introduce Deep-learning Enabled Phylogenetic Placement (DEPP), an algorithm that learns to extend species trees using single genes without pre-specified models. In simulations and on real data, we show that DEPP can match the accuracy of model-based methods without any prior knowledge of the model. We also show that DEPP can update the multi-locus microbial tree-of-life with single genes with high accuracy. We further demonstrate that DEPP can combine 16S and metagenomic data onto a single tree, enabling community structure analyses that take advantage of both sources of data.
Collapse
Affiliation(s)
- Yueyu Jiang
- Department of Electrical and Computer Engineering, UC San Diego, CA 92093, USA
| | - Metin Balaban
- Bioinformatics and Systems Biology Graduate Program, UC San Diego, CA 92093, USA
| | - Qiyun Zhu
- Center for Fundamental and Applied Microbiomics, Arizona State University, Tempe, AZ 85281, USA
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, UC San Diego, CA 92093, USA
| |
Collapse
|
38
|
Sapoval N, Aghazadeh A, Nute MG, Antunes DA, Balaji A, Baraniuk R, Barberan CJ, Dannenfelser R, Dun C, Edrisi M, Elworth RAL, Kille B, Kyrillidis A, Nakhleh L, Wolfe CR, Yan Z, Yao V, Treangen TJ. Current progress and open challenges for applying deep learning across the biosciences. Nat Commun 2022; 13:1728. [PMID: 35365602 PMCID: PMC8976012 DOI: 10.1038/s41467-022-29268-7] [Citation(s) in RCA: 105] [Impact Index Per Article: 35.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2021] [Accepted: 03/09/2022] [Indexed: 11/19/2022] Open
Abstract
Deep Learning (DL) has recently enabled unprecedented advances in one of the grand challenges in computational biology: the half-century-old problem of protein structure prediction. In this paper we discuss recent advances, limitations, and future perspectives of DL on five broad areas: protein structure prediction, protein function prediction, genome engineering, systems biology and data integration, and phylogenetic inference. We discuss each application area and cover the main bottlenecks of DL approaches, such as training data, problem scope, and the ability to leverage existing DL architectures in new contexts. To conclude, we provide a summary of the subject-specific and general challenges for DL across the biosciences.
Collapse
Affiliation(s)
- Nicolae Sapoval
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Amirali Aghazadeh
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
| | - Michael G Nute
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Dinler A Antunes
- Department of Biology and Biochemistry, University of Houston, Houston, TX, USA
| | - Advait Balaji
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Richard Baraniuk
- Department of Electrical and Computer Engineering, Rice University, Houston, TX, USA
| | - C J Barberan
- Department of Electrical and Computer Engineering, Rice University, Houston, TX, USA
| | | | - Chen Dun
- Department of Computer Science, Rice University, Houston, TX, USA
| | | | - R A Leo Elworth
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Bryce Kille
- Department of Computer Science, Rice University, Houston, TX, USA
| | | | - Luay Nakhleh
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Cameron R Wolfe
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Zhi Yan
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Vicky Yao
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Todd J Treangen
- Department of Computer Science, Rice University, Houston, TX, USA.
- Department of Bioengineering, Rice University, Houston, TX, USA.
| |
Collapse
|
39
|
De Maio N, Boulton W, Weilguny L, Walker CR, Turakhia Y, Corbett-Detig R, Goldman N. phastSim: Efficient simulation of sequence evolution for pandemic-scale datasets. PLoS Comput Biol 2022; 18:e1010056. [PMID: 35486906 PMCID: PMC9094560 DOI: 10.1371/journal.pcbi.1010056] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2021] [Revised: 05/11/2022] [Accepted: 03/25/2022] [Indexed: 11/26/2022] Open
Abstract
Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, and are an essential component of some inference methods. The ongoing surge in available sequence data is however testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here, we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. > 100, 000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach, and it implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate at each branch of the considered phylogeny. Our open source software allows easy integration with other Python packages as well as a variety of evolutionary models, including indel models and new hypermutability models that we developed to more realistically represent SARS-CoV-2 genome evolution.
Collapse
Affiliation(s)
- Nicola De Maio
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
| | - William Boulton
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
| | - Lukas Weilguny
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
| | - Conor R. Walker
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
- Department of Genetics, University of Cambridge, Cambridge, United Kingdom
| | - Yatish Turakhia
- Department of Electrical and Computer Engineering, University of California San Diego, San Diego, California, United States of America
| | - Russell Corbett-Detig
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America
- Genomics Institute, University of California Santa Cruz, Santa Cruz, California, United States of America
| | - Nick Goldman
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
| |
Collapse
|
40
|
Zaharias P, Grosshauser M, Warnow T. Re-evaluating Deep Neural Networks for Phylogeny Estimation: The Issue of Taxon Sampling. J Comput Biol 2022; 29:74-89. [PMID: 34986031 DOI: 10.1089/cmb.2021.0383] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022] Open
Abstract
Deep neural networks (DNNs) have been recently proposed for quartet tree phylogeny estimation. Here, we present a study evaluating recently trained DNNs in comparison to a collection of standard phylogeny estimation methods on a heterogeneous collection of datasets simulated under the same models that were used to train the DNNs, and also under similar conditions but with higher rates of evolution. Our study shows that using DNNs with quartet amalgamation is less accurate than several standard phylogeny estimation methods we explore (e.g., maximum likelihood and maximum parsimony). We further find that simple standard phylogeny estimation methods match or improve on DNNs for quartet accuracy, especially, but not exclusively, when used in a global manner (i.e., the tree on the full dataset is computed and then the induced quartet trees are extracted from the full tree). Thus, our study provides evidence that a major challenge impacting the utility of current DNNs for phylogeny estimation is their restriction to estimating quartet trees that must subsequently be combined into a tree on the full dataset. In contrast, global methods (i.e., those that estimate trees from the full set of sequences) are able to benefit from taxon sampling, and hence have higher accuracy on large datasets.
Collapse
Affiliation(s)
- Paul Zaharias
- Department of Computer Science, University of Illinois, Urbana, Illinois, USA
| | | | - Tandy Warnow
- Department of Computer Science, University of Illinois, Urbana, Illinois, USA
| |
Collapse
|
41
|
Blischak PD, Barker MS, Gutenkunst RN. Chromosome-scale inference of hybrid speciation and admixture with convolutional neural networks. Mol Ecol Resour 2021; 21:2676-2688. [PMID: 33682305 PMCID: PMC8675098 DOI: 10.1111/1755-0998.13355] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2020] [Revised: 01/26/2021] [Accepted: 02/05/2021] [Indexed: 11/30/2022]
Abstract
Inferring the frequency and mode of hybridization among closely related organisms is an important step for understanding the process of speciation and can help to uncover reticulated patterns of phylogeny more generally. Phylogenomic methods to test for the presence of hybridization come in many varieties and typically operate by leveraging expected patterns of genealogical discordance in the absence of hybridization. An important assumption made by these tests is that the data (genes or SNPs) are independent given the species tree. However, when the data are closely linked, it is especially important to consider their nonindependence. Recently, deep learning techniques such as convolutional neural networks (CNNs) have been used to perform population genetic inferences with linked SNPs coded as binary images. Here, we use CNNs for selecting among candidate hybridization scenarios using the tree topology (((P1 , P2 ), P3 ), Out) and a matrix of pairwise nucleotide divergence (dXY ) calculated in windows across the genome. Using coalescent simulations to train and independently test a neural network showed that our method, HyDe-CNN, was able to accurately perform model selection for hybridization scenarios across a wide breath of parameter space. We then used HyDe-CNN to test models of admixture in Heliconius butterflies, as well as comparing it to phylogeny-based introgression statistics. Given the flexibility of our approach, the dropping cost of long-read sequencing and the continued improvement of CNN architectures, we anticipate that inferences of hybridization using deep learning methods like ours will help researchers to better understand patterns of admixture in their study organisms.
Collapse
Affiliation(s)
- Paul D. Blischak
- Department of Ecology & Evolutionary Biology, University of Arizona, Tucson, AZ, 85721, USA
- Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ, 85721, USA
| | - Michael S. Barker
- Department of Ecology & Evolutionary Biology, University of Arizona, Tucson, AZ, 85721, USA
| | - Ryan N. Gutenkunst
- Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ, 85721, USA
| |
Collapse
|
42
|
De Maio N, Boulton W, Weilguny L, Walker CR, Turakhia Y, Corbett-Detig R, Goldman N. phastSim: efficient simulation of sequence evolution for pandemic-scale datasets. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2021:2021.03.15.435416. [PMID: 33758852 PMCID: PMC7987011 DOI: 10.1101/2021.03.15.435416] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/30/2022]
Abstract
Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, as well as being part of some inference methods. The ongoing surge in available sequence data is however testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. > 100,000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach, and implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate at each branch of the considered phylogeny. Our open source software is available from https://github.com/NicolaDM/phastSim and allows easy integration with other Python packages as well as a variety of evolutionary models, including indel models and new hypermutatability models that we developed to more realistically represent SARS-CoV-2 genome evolution.
Collapse
Affiliation(s)
- Nicola De Maio
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
| | - William Boulton
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
| | - Lukas Weilguny
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
| | - Conor R. Walker
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Department of Genetics, University of Cambridge, Cambridge, CB2 3EH, UK
| | - Yatish Turakhia
- Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA 92093, USA
| | - Russell Corbett-Detig
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA
| | - Nick Goldman
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
| |
Collapse
|
43
|
Duchêne DA, Mather N, Van Der Wal C, Ho SYW. Excluding loci with substitution saturation improves inferences from phylogenomic data. Syst Biol 2021; 71:676-689. [PMID: 34508605 PMCID: PMC9016599 DOI: 10.1093/sysbio/syab075] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2020] [Accepted: 09/07/2021] [Indexed: 11/21/2022] Open
Abstract
The historical signal in nucleotide sequences becomes eroded over time by substitutions occurring repeatedly at the same sites. This phenomenon, known as substitution saturation, is recognized as one of the primary obstacles to deep-time phylogenetic inference using genome-scale data sets. We present a new test of substitution saturation and demonstrate its performance in simulated and empirical data. For some of the 36 empirical phylogenomic data sets that we examined, we detect substitution saturation in around 50% of loci. We found that saturation tends to be flagged as problematic in loci with highly discordant phylogenetic signals across sites. Within each data set, the loci with smaller numbers of informative sites are more likely to be flagged as containing problematic levels of saturation. The entropy saturation test proposed here is sensitive to high evolutionary rates relative to the evolutionary timeframe, while also being sensitive to several factors known to mislead phylogenetic inference, including short internal branches relative to external branches, short nucleotide sequences, and tree imbalance. Our study demonstrates that excluding loci with substitution saturation can be an effective means of mitigating the negative impact of multiple substitutions on phylogenetic inferences. [Phylogenetic model performance; phylogenomics; substitution model; substitution saturation; test statistics.]
Collapse
Affiliation(s)
- David A Duchêne
- Centre for Evolutionary Hologenomics, University of Copenhagen, 1352 Copenhagen, Denmark
| | - Niklas Mather
- School of Life and Environmental Sciences, University of Sydney, Sydney, NSW 2006, Australia
| | - Cara Van Der Wal
- School of Life and Environmental Sciences, University of Sydney, Sydney, NSW 2006, Australia
| | - Simon Y W Ho
- School of Life and Environmental Sciences, University of Sydney, Sydney, NSW 2006, Australia
| |
Collapse
|
44
|
Abstract
We introduce a supervised machine learning approach with sparsity constraints for phylogenomics, referred to as evolutionary sparse learning (ESL). ESL builds models with genomic loci—such as genes, proteins, genomic segments, and positions—as parameters. Using the Least Absolute Shrinkage and Selection Operator, ESL selects only the most important genomic loci to explain a given phylogenetic hypothesis or presence/absence of a trait. ESL models do not directly involve conventional parameters such as rates of substitutions between nucleotides, rate variation among positions, and phylogeny branch lengths. Instead, ESL directly employs the concordance of variation across sequences in an alignment with the evolutionary hypothesis of interest. ESL provides a natural way to combine different molecular and nonmolecular data types and incorporate biological and functional annotations of genomic loci in model building. We propose positional, gene, function, and hypothesis sparsity scores, illustrate their use through an example, and suggest several applications of ESL. The ESL framework has the potential to drive the development of a new class of computational methods that will complement traditional approaches in evolutionary genomics, particularly for identifying influential loci and sequences given a phylogeny and building models to test hypotheses. ESL’s fast computational times and small memory footprint will also help democratize big data analytics and improve scientific rigor in phylogenomics.
Collapse
Affiliation(s)
- Sudhir Kumar
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA.,Department of Biology, Temple University, Philadelphia, PA.,Center for Excellence in Genome Medicine and Research, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Sudip Sharma
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA.,Department of Biology, Temple University, Philadelphia, PA
| |
Collapse
|
45
|
Fonseca EM, Colli GR, Werneck FP, Carstens BC. Phylogeographic model selection using convolutional neural networks. Mol Ecol Resour 2021; 21:2661-2675. [PMID: 33973350 DOI: 10.1111/1755-0998.13427] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2020] [Revised: 04/02/2021] [Accepted: 04/28/2021] [Indexed: 11/26/2022]
Abstract
The discipline of phylogeography has evolved rapidly in terms of the analytical toolkit used to analyse large genomic data sets. Despite substantial advances, analytical tools that could potentially address the challenges posed by increased model complexity have not been fully explored. For example, deep learning techniques are underutilized for phylogeographic model selection. In non-model organisms, the lack of information about their ecology and evolution can lead to uncertainty about which demographic models are appropriate. Here, we assess the utility of convolutional neural networks (CNNs) for assessing demographic models in South American lizards in the genus Norops. Three demographic scenarios (constant, expansion, and bottleneck) were considered for each of four inferred population-level lineages, and we found that the overall model accuracy was higher than 98% for all lineages. We then evaluated a set of 26 models that accounted for evolutionary relationships, gene flow, and changes in effective population size among the four lineages, identifying a single model with an estimated overall accuracy of 87% when using CNNs. The inferred demography of the lizard system suggests that gene flow between non-sister populations and changes in effective population sizes through time, probably in response to Pleistocene climatic oscillations, have shaped genetic diversity in this system. Approximate Bayesian computation (ABC) was applied to provide a comparison to the performance of CNNs. ABC was unable to identify a single model among the larger set of 26 models in the subsequent analysis. Our results demonstrate that CNNs can be easily and usefully incorporated into the phylogeographer's toolkit.
Collapse
Affiliation(s)
- Emanuel M Fonseca
- Department of Evolution, Ecology and Organismal Biology, The Ohio State University, Columbus, OH, USA
| | - Guarino R Colli
- Departamento de Zoologia, Universidade de Brasília, Brasília, Brazil
| | - Fernanda P Werneck
- Coordenação de Biodiversidade, Programa de Coleções Científicas Biológicas, Instituto Nacional de Pesquisas da Amazônia (INPA), Manaus, Brazil
| | - Bryan C Carstens
- Department of Evolution, Ecology and Organismal Biology, The Ohio State University, Columbus, OH, USA
| |
Collapse
|
46
|
Azouri D, Abadi S, Mansour Y, Mayrose I, Pupko T. Harnessing machine learning to guide phylogenetic-tree search algorithms. Nat Commun 2021; 12:1983. [PMID: 33790270 PMCID: PMC8012635 DOI: 10.1038/s41467-021-22073-8] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2020] [Accepted: 02/26/2021] [Indexed: 02/01/2023] Open
Abstract
Inferring a phylogenetic tree is a fundamental challenge in evolutionary studies. Current paradigms for phylogenetic tree reconstruction rely on performing costly likelihood optimizations. With the aim of making tree inference feasible for problems involving more than a handful of sequences, inference under the maximum-likelihood paradigm integrates heuristic approaches to evaluate only a subset of all potential trees. Consequently, existing methods suffer from the known tradeoff between accuracy and running time. In this proof-of-concept study, we train a machine-learning algorithm over an extensive cohort of empirical data to predict the neighboring trees that increase the likelihood, without actually computing their likelihood. This provides means to safely discard a large set of the search space, thus potentially accelerating heuristic tree searches without losing accuracy. Our analyses suggest that machine learning can guide tree-search methodologies towards the most promising candidate trees.
Collapse
Affiliation(s)
- Dana Azouri
- School of Plant Sciences and Food Security, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel
- The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel
| | - Shiran Abadi
- School of Plant Sciences and Food Security, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel
| | - Yishay Mansour
- Balvatnik School of Computer Science, Tel-Aviv University, Ramat Aviv, Tel-Aviv, Israel
| | - Itay Mayrose
- School of Plant Sciences and Food Security, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel.
| | - Tal Pupko
- The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel.
| |
Collapse
|
47
|
Auslander N, Gussow AB, Koonin EV. Incorporating Machine Learning into Established Bioinformatics Frameworks. Int J Mol Sci 2021; 22:2903. [PMID: 33809353 PMCID: PMC8000113 DOI: 10.3390/ijms22062903] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2021] [Revised: 03/08/2021] [Accepted: 03/10/2021] [Indexed: 12/23/2022] Open
Abstract
The exponential growth of biomedical data in recent years has urged the application of numerous machine learning techniques to address emerging problems in biology and clinical research. By enabling the automatic feature extraction, selection, and generation of predictive models, these methods can be used to efficiently study complex biological systems. Machine learning techniques are frequently integrated with bioinformatic methods, as well as curated databases and biological networks, to enhance training and validation, identify the best interpretable features, and enable feature and model investigation. Here, we review recently developed methods that incorporate machine learning within the same framework with techniques from molecular evolution, protein structure analysis, systems biology, and disease genomics. We outline the challenges posed for machine learning, and, in particular, deep learning in biomedicine, and suggest unique opportunities for machine learning techniques integrated with established bioinformatics approaches to overcome some of these challenges.
Collapse
Affiliation(s)
| | | | - Eugene V. Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA;
| |
Collapse
|
48
|
Xue AT, Schrider DR, Kern AD. Discovery of Ongoing Selective Sweeps within Anopheles Mosquito Populations Using Deep Learning. Mol Biol Evol 2021; 38:1168-1183. [PMID: 33022051 PMCID: PMC7947845 DOI: 10.1093/molbev/msaa259] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/19/2023] Open
Abstract
Identification of partial sweeps, which include both hard and soft sweeps that have not currently reached fixation, provides crucial information about ongoing evolutionary responses. To this end, we introduce partialS/HIC, a deep learning method to discover selective sweeps from population genomic data. partialS/HIC uses a convolutional neural network for image processing, which is trained with a large suite of summary statistics derived from coalescent simulations incorporating population-specific history, to distinguish between completed versus partial sweeps, hard versus soft sweeps, and regions directly affected by selection versus those merely linked to nearby selective sweeps. We perform several simulation experiments under various demographic scenarios to demonstrate partialS/HIC's performance, which exhibits excellent resolution for detecting partial sweeps. We also apply our classifier to whole genomes from eight mosquito populations sampled across sub-Saharan Africa by the Anopheles gambiae 1000 Genomes Consortium, elucidating both continent-wide patterns as well as sweeps unique to specific geographic regions. These populations have experienced intense insecticide exposure over the past two decades, and we observe a strong overrepresentation of sweeps at insecticide resistance loci. Our analysis thus provides a list of candidate adaptive loci that may be relevant to mosquito control efforts. More broadly, our supervised machine learning approach introduces a method to distinguish between completed and partial sweeps, as well as between hard and soft sweeps, under a variety of demographic scenarios. As whole-genome data rapidly accumulate for a greater diversity of organisms, partialS/HIC addresses an increasing demand for useful selection scan tools that can track in-progress evolutionary dynamics.
Collapse
Affiliation(s)
- Alexander T Xue
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
| | - Daniel R Schrider
- Department of Genetics, University of North Carolina, Chapel Hill, NC
| | - Andrew D Kern
- Institute of Ecology and Evolution, 5289 University of Oregon, Eugene, OR
| |
Collapse
|
49
|
Leuchtenberger AF, Crotty SM, Drucks T, Schmidt HA, Burgstaller-Muehlbacher S, von Haeseler A. Distinguishing Felsenstein Zone from Farris Zone Using Neural Networks. Mol Biol Evol 2020; 37:3632-3641. [PMID: 32637998 PMCID: PMC7743852 DOI: 10.1093/molbev/msaa164] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022] Open
Abstract
Maximum likelihood and maximum parsimony are two key methods for phylogenetic tree reconstruction. Under certain conditions, each of these two methods can perform more or less efficiently, resulting in unresolved or disputed phylogenies. We show that a neural network can distinguish between four-taxon alignments that were evolved under conditions susceptible to either long-branch attraction or long-branch repulsion. When likelihood and parsimony methods are discordant, the neural network can provide insight as to which tree reconstruction method is best suited to the alignment. When applied to the contentious case of Strepsiptera evolution, our method shows robust support for the current scientific view, that is, it places Strepsiptera with beetles, distant from flies.
Collapse
Affiliation(s)
- Alina F Leuchtenberger
- Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna and Medical University of Vienna, Vienna, Austria
| | - Stephen M Crotty
- Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna and Medical University of Vienna, Vienna, Austria
- School of Mathematical Sciences, University of Adelaide, Adelaide, SA, Australia
- ARC Centre of Excellence for Mathematical and Statistical Frontiers, University of Adelaide, Adelaide, SA, Australia
| | - Tamara Drucks
- Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna and Medical University of Vienna, Vienna, Austria
| | - Heiko A Schmidt
- Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna and Medical University of Vienna, Vienna, Austria
| | - Sebastian Burgstaller-Muehlbacher
- Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna and Medical University of Vienna, Vienna, Austria
| | - Arndt von Haeseler
- Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna and Medical University of Vienna, Vienna, Austria
- Bioinformatics and Computational Biology, Faculty of Computer Science, University of Vienna, Vienna, Austria
| |
Collapse
|
50
|
Zou Z, Zhang H, Guan Y, Zhang J. Deep Residual Neural Networks Resolve Quartet Molecular Phylogenies. Mol Biol Evol 2020; 37:1495-1507. [PMID: 31868908 PMCID: PMC8453599 DOI: 10.1093/molbev/msz307] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/26/2022] Open
Abstract
Phylogenetic inference is of fundamental importance to evolutionary as well as other fields of biology, and molecular sequences have emerged as the primary data for this task. Although many phylogenetic methods have been developed to explicitly take into account substitution models of sequence evolution, such methods could fail due to model misspecification or insufficiency, especially in the face of heterogeneities in substitution processes across sites and among lineages. In this study, we propose to infer topologies of four-taxon trees using deep residual neural networks, a machine learning approach needing no explicit modeling of the subject system and having a record of success in solving complex nonlinear inference problems. We train residual networks on simulated protein sequence data with extensive amino acid substitution heterogeneities. We show that the well-trained residual network predictors can outperform existing state-of-the-art inference methods such as the maximum likelihood method on diverse simulated test data, especially under extensive substitution heterogeneities. Reassuringly, residual network predictors generally agree with existing methods in the trees inferred from real phylogenetic data with known or widely believed topologies. Furthermore, when combined with the quartet puzzling algorithm, residual network predictors can be used to reconstruct trees with more than four taxa. We conclude that deep learning represents a powerful new approach to phylogenetic reconstruction, especially when sequences evolve via heterogeneous substitution processes. We present our best trained predictor in a freely available program named Phylogenetics by Deep Learning (PhyDL, https://gitlab.com/ztzou/phydl; last accessed January 3, 2020).
Collapse
Affiliation(s)
- Zhengting Zou
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI
| | - Hongjiu Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI
| | - Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI
- Department of Internal Medicine, University of Michigan, Ann Arbor, MI
| | - Jianzhi Zhang
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI
| |
Collapse
|