1
|
Nesterenko L, Blassel L, Veber P, Boussau B, Jacob L. Phyloformer: Fast, Accurate, and Versatile Phylogenetic Reconstruction with Deep Neural Networks. Mol Biol Evol 2025; 42:msaf051. [PMID: 40066802 PMCID: PMC11965795 DOI: 10.1093/molbev/msaf051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2024] [Revised: 01/16/2025] [Accepted: 01/27/2025] [Indexed: 04/04/2025] Open
Abstract
Phylogenetic inference aims at reconstructing the tree describing the evolution of a set of sequences descending from a common ancestor. The high computational cost of state-of-the-art maximum likelihood and Bayesian inference methods limits their usability under realistic evolutionary models. Harnessing recent advances in likelihood-free inference and geometric deep learning, we introduce Phyloformer, a fast and accurate method for evolutionary distance estimation and phylogenetic reconstruction. Sampling many trees and sequences under an evolutionary model, we train the network to learn a function that enables predicting a tree from a multiple sequence alignment. On simulated data, we compare Phyloformer to FastME-a distance method-and two maximum likelihood methods: FastTree and IQTree. Under a commonly used model of protein sequence evolution and exploiting graphics processing unit (GPU) acceleration, Phyloformer outpaces all other approaches and exceeds their accuracy in the Kuhner-Felsenstein metric that accounts for both the topology and branch lengths. In terms of topological accuracy alone, Phyloformer outperforms FastME, but falls behind maximum likelihood approaches, especially as the number of sequences increases. When a model of sequence evolution that includes dependencies between sites is used, Phyloformer outperforms all other methods across all metrics on alignments with fewer than 80 sequences. On 3,801 empirical gene alignments from five different datasets, Phyloformer matches the topological accuracy of the two maximum likelihood implementations. Our results pave the way for the adoption of sophisticated realistic models for phylogenetic inference.
Collapse
Affiliation(s)
- Luca Nesterenko
- Laboratoire de Biométrie et Biologie Évolutive, Université Lyon 1, Villeurbanne, France
| | - Luc Blassel
- Laboratoire de Biométrie et Biologie Évolutive, Université Lyon 1, Villeurbanne, France
| | - Philippe Veber
- Laboratoire de Biométrie et Biologie Évolutive, Université Lyon 1, Villeurbanne, France
| | - Bastien Boussau
- Laboratoire de Biométrie et Biologie Évolutive, Université Lyon 1, Villeurbanne, France
| | - Laurent Jacob
- Laboratory of Computational and Quantitative Biology, Sorbonne Université, Paris, France
| |
Collapse
|
2
|
Liu R, Qiao X, Shi Y, Peterson CB, Bush WS, Cominelli F, Wang M, Zhang L. Constructing phylogenetic trees for microbiome data analysis: A mini-review. Comput Struct Biotechnol J 2024; 23:3859-3868. [PMID: 39554614 PMCID: PMC11564040 DOI: 10.1016/j.csbj.2024.10.032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/31/2024] [Revised: 10/20/2024] [Accepted: 10/20/2024] [Indexed: 11/19/2024] Open
Abstract
As next-generation sequencing technologies advance rapidly and the cost of metagenomic sequencing continues to decrease, researchers now face an unprecedented volume of microbiome data. This surge has stimulated the development of scalable microbiome data analysis methods and necessitated the incorporation of phylogenetic information into microbiome analysis for improved accuracy. Tools for constructing phylogenetic trees from 16S rRNA sequencing data are well-established, as the highly conserved regions of the 16S gene are limited, simplifying the identification of marker genes. In contrast, metagenomic and whole genome shotgun (WGS) sequencing involve sequencing from random fragments of the entire gene, making identification of consistent marker genes challenging owing to the vast diversity of genomic regions, resulting in a scarcity of robust tools for constructing phylogenetic trees. Although bacterial sequence tree construction tools exist for upstream bioinformatics, many downstream researchers-those integrating these trees into statistical models or machine learning-are either unaware of these tools or find them difficult to use due to the steep learning curve of processing raw sequences. This is compounded by the fact that public datasets often lack phylogenetic trees, providing only abundance tables and taxonomic classifications. To address this, we present a comprehensive review of phylogenetic tree construction techniques for microbiome data (16S rRNA or whole-genome shotgun sequencing). We outline the strengths and limitations of current methods, offering expert insights and step-by-step guidance to make these tools more accessible and widely applicable in quantitative microbiome data analysis.
Collapse
Affiliation(s)
- Ruitao Liu
- Department of Population and Quantitative Health Sciences, School of Medicine, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, 44106, OH, United States
| | - Xi Qiao
- Department of Population and Quantitative Health Sciences, School of Medicine, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, 44106, OH, United States
| | - Yushu Shi
- Weill Cornell Medicine, Cornell University, 1300 York Ave, New York, 10065, NY, United States
| | - Christine B. Peterson
- The University of Texas MD Anderson Cancer Center, 1515 Holcombe Blvd, Houston, 77030, TX, United States
| | - William S. Bush
- Department of Population and Quantitative Health Sciences, School of Medicine, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, 44106, OH, United States
| | - Fabio Cominelli
- Department of Pathology, School of Medicine, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, 44106, OH, United States
- Case Digestive Health Research Institute, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, 44106, OH, United States
| | - Ming Wang
- Department of Population and Quantitative Health Sciences, School of Medicine, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, 44106, OH, United States
| | - Liangliang Zhang
- Department of Population and Quantitative Health Sciences, School of Medicine, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, 44106, OH, United States
- Case Comprehensive Cancer Center, 10900 Euclid Avenue, Cleveland, 44106, OH, United States
| |
Collapse
|
3
|
Silvestro D, Latrille T, Salamin N. Toward a Semi-Supervised Learning Approach to Phylogenetic Estimation. Syst Biol 2024; 73:789-806. [PMID: 38916476 PMCID: PMC11639169 DOI: 10.1093/sysbio/syae029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2023] [Revised: 05/21/2024] [Accepted: 06/24/2024] [Indexed: 06/26/2024] Open
Abstract
Models have always been central to inferring molecular evolution and to reconstructing phylogenetic trees. Their use typically involves the development of a mechanistic framework reflecting our understanding of the underlying biological processes, such as nucleotide substitutions, and the estimation of model parameters by maximum likelihood or Bayesian inference. However, deriving and optimizing the likelihood of the data is not always possible under complex evolutionary scenarios or even tractable for large datasets, often leading to unrealistic simplifying assumptions in the fitted models. To overcome this issue, we coupled stochastic simulations of genome evolution with a new supervised deep-learning model to infer key parameters of molecular evolution. Our model is designed to directly analyze multiple sequence alignments and estimate per-site evolutionary rates and divergence without requiring a known phylogenetic tree. The accuracy of our predictions matched that of likelihood-based phylogenetic inference when rate heterogeneity followed a simple gamma distribution, but it strongly exceeded it under more complex patterns of rate variation, such as codon models. Our approach is highly scalable and can be efficiently applied to genomic data, as we showed on a dataset of 26 million nucleotides from the clownfish clade. Our simulations also showed that the integration of per-site rates obtained by deep learning within a Bayesian framework led to significantly more accurate phylogenetic inference, particularly with respect to the estimated branch lengths. We thus propose that future advancements in phylogenetic analysis will benefit from a semi-supervised learning approach that combines deep-learning estimation of substitution rates, which allows for more flexible models of rate variation, and probabilistic inference of the phylogenetic tree, which guarantees interpretability and a rigorous assessment of statistical support.
Collapse
Affiliation(s)
- Daniele Silvestro
- Department of Biology, University of Fribourg and Swiss Institute of Bioinformatics, 1700 Fribourg, Switzerland
- Department of Biological and Environmental Sciences, Gothenburg Global Biodiversity Centre, University of Gothenburg, 40530 Gothenburg, Sweden
| | - Thibault Latrille
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
| | - Nicolas Salamin
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
| |
Collapse
|
4
|
Yang B, Zhou X, Liu S. Tracing the genealogy origin of geographic populations based on genomic variation and deep learning. Mol Phylogenet Evol 2024; 198:108142. [PMID: 38964594 DOI: 10.1016/j.ympev.2024.108142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2023] [Revised: 05/30/2024] [Accepted: 07/01/2024] [Indexed: 07/06/2024]
Abstract
Assigning a query individual animal or plant to its derived population is a prime task in diverse applications related to organismal genealogy. Such endeavors have conventionally relied on short DNA sequences under a phylogenetic framework. These methods naturally show constraints when the inferred population sources are ambiguously phylogenetically structured, a scenario demanding substantially more informative genetic signals. Recent advances in cost-effective production of whole-genome sequences and artificial intelligence have created an unprecedented opportunity to trace the population origin for essentially any given individual, as long as the genome reference data are comprehensive and standardized. Here, we developed a convolutional neural network method to identify population origins using genomic SNPs. Three empirical datasets (an Asian honeybee, a red fire ant, and a chicken datasets) and two simulated populations are used for the proof of concepts. The performance tests indicate that our method can accurately identify the genealogy origin of query individuals, with success rates ranging from 93 % to 100 %. We further showed that the accuracy of the model can be significantly increased by refining the informative sites through FST filtering. Our method is robust to configurations related to batch sizes and epochs, whereas model learning benefits from the setting of a proper preset learning rate. Moreover, we explained the importance score of key sites for algorithm interpretability and credibility, which has been largely ignored. We anticipate that by coupling genomics and deep learning, our method will see broad potential in conservation and management applications that involve natural resources, invasive pests and weeds, and illegal trades of wildlife products.
Collapse
Affiliation(s)
- Bing Yang
- Department of Entomology, China Agricultural University, Beijing 100193, China
| | - Xin Zhou
- Department of Entomology, China Agricultural University, Beijing 100193, China.
| | - Shanlin Liu
- Department of Entomology, China Agricultural University, Beijing 100193, China; Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China.
| |
Collapse
|
5
|
Mo YK, Hahn MW, Smith ML. Applications of machine learning in phylogenetics. Mol Phylogenet Evol 2024; 196:108066. [PMID: 38565358 DOI: 10.1016/j.ympev.2024.108066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2023] [Revised: 02/16/2024] [Accepted: 03/21/2024] [Indexed: 04/04/2024]
Abstract
Machine learning has increasingly been applied to a wide range of questions in phylogenetic inference. Supervised machine learning approaches that rely on simulated training data have been used to infer tree topologies and branch lengths, to select substitution models, and to perform downstream inferences of introgression and diversification. Here, we review how researchers have used several promising machine learning approaches to make phylogenetic inferences. Despite the promise of these methods, several barriers prevent supervised machine learning from reaching its full potential in phylogenetics. We discuss these barriers and potential paths forward. In the future, we expect that the application of careful network designs and data encodings will allow supervised machine learning to accommodate the complex processes that continue to confound traditional phylogenetic methods.
Collapse
Affiliation(s)
- Yu K Mo
- Department of Computer Science, Indiana University, Bloomington, IN 47405, USA
| | - Matthew W Hahn
- Department of Computer Science, Indiana University, Bloomington, IN 47405, USA; Department of Biology, Indiana University, Bloomington, IN 47405, USA
| | - Megan L Smith
- Department of Biological Sciences, Mississippi State University, Starkville, MS 39762, USA.
| |
Collapse
|
6
|
Jiang Y, McDonald D, Perry D, Knight R, Mirarab S. Scaling DEPP phylogenetic placement to ultra-large reference trees: a tree-aware ensemble approach. Bioinformatics 2024; 40:btae361. [PMID: 38870525 PMCID: PMC11193062 DOI: 10.1093/bioinformatics/btae361] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2023] [Revised: 04/09/2024] [Accepted: 06/12/2024] [Indexed: 06/15/2024] Open
Abstract
MOTIVATION Phylogenetic placement of a query sequence on a backbone tree is increasingly used across biomedical sciences to identify the content of a sample from its DNA content. The accuracy of such analyses depends on the density of the backbone tree, making it crucial that placement methods scale to very large trees. Moreover, a new paradigm has been recently proposed to place sequences on the species tree using single-gene data. The goal is to better characterize the samples and to enable combined analyses of marker-gene (e.g., 16S rRNA gene amplicon) and genome-wide data. The recent method DEPP enables performing such analyses using metric learning. However, metric learning is hampered by a need to compute and save a quadratically growing matrix of pairwise distances during training. Thus, the training phase of DEPP does not scale to more than roughly 10 000 backbone species, a problem that we faced when trying to use our recently released Greengenes2 (GG2) reference tree containing 331 270 species. RESULTS This paper explores divide-and-conquer for training ensembles of DEPP models, culminating in a method called C-DEPP. While divide-and-conquer has been extensively used in phylogenetics, applying divide-and-conquer to data-hungry machine-learning methods needs nuance. C-DEPP uses carefully crafted techniques to enable quasi-linear scaling while maintaining accuracy. C-DEPP enables placing 20 million 16S fragments on the GG2 reference tree in 41 h of computation. AVAILABILITY AND IMPLEMENTATION The dataset and C-DEPP software are freely available at https://github.com/yueyujiang/dataset_cdepp/.
Collapse
Affiliation(s)
- Yueyu Jiang
- Electrical and Computer Engineering Department, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, United States
| | - Daniel McDonald
- Pediatrics Department, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, United States
| | - Daniela Perry
- Pediatrics Department, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, United States
| | - Rob Knight
- Pediatrics Department, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, United States
- Center for Microbiome Innovation, Jacobs School of Engineering, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, United States
| | - Siavash Mirarab
- Electrical and Computer Engineering Department, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, United States
- Center for Microbiome Innovation, Jacobs School of Engineering, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, United States
| |
Collapse
|
7
|
Balaban M, Jiang Y, Zhu Q, McDonald D, Knight R, Mirarab S. Generation of accurate, expandable phylogenomic trees with uDance. Nat Biotechnol 2024; 42:768-777. [PMID: 37500914 PMCID: PMC10818028 DOI: 10.1038/s41587-023-01868-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2022] [Accepted: 06/20/2023] [Indexed: 07/29/2023]
Abstract
Phylogenetic trees provide a framework for organizing evolutionary histories across the tree of life and aid downstream comparative analyses such as metagenomic identification. Methods that rely on single-marker genes such as 16S rRNA have produced trees of limited accuracy with hundreds of thousands of organisms, whereas methods that use genome-wide data are not scalable to large numbers of genomes. We introduce updating trees using divide-and-conquer (uDance), a method that enables updatable genome-wide inference using a divide-and-conquer strategy that refines different parts of the tree independently and can build off of existing trees, with high accuracy and scalability. With uDance, we infer a species tree of roughly 200,000 genomes using 387 marker genes, totaling 42.5 billion amino acid residues.
Collapse
Affiliation(s)
- Metin Balaban
- Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, USA
| | - Yueyu Jiang
- Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, CA, USA
| | - Qiyun Zhu
- Biodesign Center for Fundamental and Applied Microbiomics, Arizona State University, Tempe, AZ, USA
- School of Life Sciences, Arizona State University, Tempe, AZ, USA
| | - Daniel McDonald
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
| | - Rob Knight
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Department of Computer Science and Engineering, Jacobs School of Engineering, University of California San Diego, La Jolla, CA, USA
- Department of Bioengineering, University of California San Diego, La Jolla, CA, USA
- Center for Microbiome Innovation, Jacobs School of Engineering, University of California San Diego, La Jolla, CA, USA
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, CA, USA.
- Department of Computer Science and Engineering, Jacobs School of Engineering, University of California San Diego, La Jolla, CA, USA.
- Center for Microbiome Innovation, Jacobs School of Engineering, University of California San Diego, La Jolla, CA, USA.
| |
Collapse
|
8
|
Arasti S, Mirarab S. Median quartet tree search algorithms using optimal subtree prune and regraft. Algorithms Mol Biol 2024; 19:12. [PMID: 38481327 PMCID: PMC10938725 DOI: 10.1186/s13015-024-00257-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2023] [Accepted: 02/13/2024] [Indexed: 03/17/2024] Open
Abstract
Gene trees can be different from the species tree due to biological processes and inference errors. One way to obtain a species tree is to find one that maximizes some measure of similarity to a set of gene trees. The number of shared quartets between a potential species tree and gene trees provides a statistically justifiable score; if maximized properly, it could result in a statistically consistent estimator of the species tree under several statistical models of discordance. However, finding the median quartet score tree, one that maximizes this score, is NP-Hard, motivating several existing heuristic algorithms. These heuristics do not follow the hill-climbing paradigm used extensively in phylogenetics. In this paper, we make theoretical contributions that enable an efficient hill-climbing approach. Specifically, we show that a subtree of size m can be placed optimally on a tree of size n in quasi-linear time with respect to n and (almost) independently of m. This result enables us to perform subtree prune and regraft (SPR) rearrangements as part of a hill-climbing search. We show that this approach can slightly improve upon the results of widely-used methods such as ASTRAL in terms of the optimization score but not necessarily accuracy.
Collapse
Affiliation(s)
- Shayesteh Arasti
- Computer Science and Engineering Department, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA
| | - Siavash Mirarab
- Electrical and Computer Engineering Department, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA.
| |
Collapse
|
9
|
Patané JSL, Martins J, Setubal JC. A Guide to Phylogenomic Inference. Methods Mol Biol 2024; 2802:267-345. [PMID: 38819564 DOI: 10.1007/978-1-0716-3838-5_11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/01/2024]
Abstract
Phylogenomics aims at reconstructing the evolutionary histories of organisms taking into account whole genomes or large fractions of genomes. Phylogenomics has significant applications in fields such as evolutionary biology, systematics, comparative genomics, and conservation genetics, providing valuable insights into the origins and relationships of species and contributing to our understanding of biological diversity and evolution. This chapter surveys phylogenetic concepts and methods aimed at both gene tree and species tree reconstruction while also addressing common pitfalls, providing references to relevant computer programs. A practical phylogenomic analysis example including bacterial genomes is presented at the end of the chapter.
Collapse
Affiliation(s)
- José S L Patané
- Laboratório de Genética e Cardiologia Molecular, Instituto do Coração/Heart Institute Hospital das Clínicas - Faculdade de Medicina da Universidade de São Paulo São Paulo, São Paulo, SP, Brazil
| | - Joaquim Martins
- Integrative Omics group, Biorenewables National Laboratory, Brazilian Center for Research in Energy and Materials, Campinas, SP, Brazil
| | - João Carlos Setubal
- Departmento de Bioquímica, Instituto de Química, Universidade de São Paulo, São Paulo, SP, Brazil.
| |
Collapse
|
10
|
Rachtman E, Sarmashghi S, Bafna V, Mirarab S. Quantifying the uncertainty of assembly-free genome-wide distance estimates and phylogenetic relationships using subsampling. Cell Syst 2022; 13:817-829.e3. [PMID: 36265468 PMCID: PMC9589918 DOI: 10.1016/j.cels.2022.06.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/24/2021] [Revised: 03/14/2022] [Accepted: 06/28/2022] [Indexed: 01/26/2023]
Abstract
Computing distance between two genomes without alignments or even access to assemblies has many downstream analyses. However, alignment-free methods, including in the fast-growing field of genome skimming, are hampered by a significant methodological gap. While accurate methods (many k-mer-based) for assembly-free distance calculation exist, measuring the uncertainty of estimated distances has not been sufficiently studied. In this paper, we show that bootstrapping, the standard non-parametric method of measuring estimator uncertainty, is not accurate for k-mer-based methods that rely on k-mer frequency profiles. Instead, we propose using subsampling (with no replacement) in combination with a correction step to reduce the variance of the inferred distribution. We show that the distribution of distances using our procedure matches the true uncertainty of the estimator. The resulting phylogenetic support values effectively differentiate between correct and incorrect branches and identify controversial branches that change across alignment-free and alignment-based phylogenies reported in the literature.
Collapse
Affiliation(s)
- Eleonora Rachtman
- Bioinformatics and Systems Biology Graduate Program, UC San Diego, San Diego, CA 92093, USA
| | - Shahab Sarmashghi
- Department of Electrical and Computer Engineering, UC San Diego, San Diego, CA 92093, USA
| | - Vineet Bafna
- Department of Computer Science and Engineering, UC San Diego, San Diego, CA 92093, USA
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, UC San Diego, San Diego, CA 92093, USA.
| |
Collapse
|
11
|
Zaharias P, Warnow T. Recent progress on methods for estimating and updating large phylogenies. Philos Trans R Soc Lond B Biol Sci 2022; 377:20210244. [PMID: 35989607 PMCID: PMC9393559 DOI: 10.1098/rstb.2021.0244] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2021] [Accepted: 01/07/2022] [Indexed: 12/20/2022] Open
Abstract
With the increased availability of sequence data and even of fully sequenced and assembled genomes, phylogeny estimation of very large trees (even of hundreds of thousands of sequences) is now a goal for some biologists. Yet, the construction of these phylogenies is a complex pipeline presenting analytical and computational challenges, especially when the number of sequences is very large. In the past few years, new methods have been developed that aim to enable highly accurate phylogeny estimations on these large datasets, including divide-and-conquer techniques for multiple sequence alignment and/or tree estimation, methods that can estimate species trees from multi-locus datasets while addressing heterogeneity due to biological processes (e.g. incomplete lineage sorting and gene duplication and loss), and methods to add sequences into large gene trees or species trees. Here we present some of these recent advances and discuss opportunities for future improvements. This article is part of a discussion meeting issue 'Genomic population structures of microbial pathogens'.
Collapse
Affiliation(s)
- Paul Zaharias
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| |
Collapse
|
12
|
Jiang Y, Tabaghi P, Mirarab S. Learning Hyperbolic Embedding for Phylogenetic Tree Placement and Updates. BIOLOGY 2022; 11:biology11091256. [PMID: 36138735 PMCID: PMC9495508 DOI: 10.3390/biology11091256] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/12/2022] [Revised: 08/11/2022] [Accepted: 08/19/2022] [Indexed: 11/20/2022]
Abstract
Simple Summary We show how the conventional (Euclidean) deep learning methods developed for phylogenetics can benefit from using hyperbolic geometry. The results point to lowered distance distortion and better accuracy in updating trees but not necessarily for phylogenetic placement. Abstract Phylogenetic placement, used widely in ecological analyses, seeks to add a new species to an existing tree. A deep learning approach was previously proposed to estimate the distance between query and backbone species by building a map from gene sequences to a high-dimensional space that preserves species tree distances. They then use a distance-based placement method to place the queries on that species tree. In this paper, we examine the appropriate geometry for faithfully representing tree distances while embedding gene sequences. Theory predicts that hyperbolic spaces should provide a drastic reduction in distance distortion compared to the conventional Euclidean space. Nevertheless, hyperbolic embedding imposes its own unique challenges related to arithmetic operations, exponentially-growing functions, and limited bit precision, and we address these challenges. Our results confirm that hyperbolic embeddings have substantially lower distance errors than Euclidean space. However, these better-estimated distances do not always lead to better phylogenetic placement. We then show that the deep learning framework can be used not just to place on a backbone tree but to update it to obtain a fully resolved tree. With our hyperbolic embedding framework, species trees can be updated remarkably accurately with only a handful of genes.
Collapse
Affiliation(s)
- Yueyu Jiang
- Electrical and Computer Engineering, University of California San Diego, La Jolla, CA 92093, USA
| | - Puoya Tabaghi
- Halıcıoğlu Data Science Institute, University of California San Diego, La Jolla, CA 92093, USA
| | - Siavash Mirarab
- Electrical and Computer Engineering, University of California San Diego, La Jolla, CA 92093, USA
- Correspondence: ; Tel.: +1-858-822-6245
| |
Collapse
|
13
|
Hasan NB, Balaban M, Biswas A, Bayzid MS, Mirarab S. Distance-Based Phylogenetic Placement with Statistical Support. BIOLOGY 2022; 11:1212. [PMID: 36009839 PMCID: PMC9404983 DOI: 10.3390/biology11081212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/30/2022] [Revised: 07/30/2022] [Accepted: 08/02/2022] [Indexed: 11/16/2022]
Abstract
Phylogenetic identification of unknown sequences by placing them on a tree is routinely attempted in modern ecological studies. Such placements are often obtained from incomplete and noisy data, making it essential to augment the results with some notion of uncertainty. While the standard likelihood-based methods designed for placement naturally provide such measures of uncertainty, the newer and more scalable distance-based methods lack this crucial feature. Here, we adopt several parametric and nonparametric sampling methods for measuring the support of phylogenetic placements that have been obtained with the use of distances. Comparing the alternative strategies, we conclude that nonparametric bootstrapping is more accurate than the alternatives. We go on to show how bootstrapping can be performed efficiently using a linear algebraic formulation that makes it up to 30 times faster and implement this optimized version as part of the distance-based placement software APPLES. By examining a wide range of applications, we show that the relative accuracy of maximum likelihood (ML) support values as compared to distance-based methods depends on the application and the dataset. ML is advantageous for fragmentary queries, while distance-based support values are more accurate for full-length and multi-gene datasets. With the quantification of uncertainty, our work fills a crucial gap that prevents the broader adoption of distance-based placement tools.
Collapse
Affiliation(s)
- Navid Bin Hasan
- Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
| | - Metin Balaban
- Bioinformatics and System Biology Program, UC San Diego, San Diego, CA 92093, USA
| | - Avijit Biswas
- Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
| | - Md. Shamsuzzoha Bayzid
- Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
| | - Siavash Mirarab
- Electrical and Computer Engineering, UC San Diego, San Diego, CA 92093, USA
| |
Collapse
|