1
|
Hakim SA, Ratul MRZ, Bayzid MS. wQFM-DISCO: DISCO-enabled wQFM improves phylogenomic analyses despite the presence of paralogs. BIOINFORMATICS ADVANCES 2024; 4:vbae189. [PMID: 39664861 PMCID: PMC11634537 DOI: 10.1093/bioadv/vbae189] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 08/22/2024] [Revised: 10/18/2024] [Accepted: 11/24/2024] [Indexed: 12/13/2024]
Abstract
Motivation Gene trees often differ from the species trees that contain them due to various factors, including incomplete lineage sorting (ILS) and gene duplication and loss (GDL). Several highly accurate species tree estimation methods have been introduced to explicitly address ILS, including ASTRAL, a widely used statistically consistent method, and wQFM, a quartet amalgamation approach experimentally shown to be more accurate than ASTRAL. Two recent advancements, ASTRAL-Pro and DISCO, have emerged in phylogenomics to consider GDL. ASTRAL-Pro introduces a refined quartet similarity measure, accounting for orthology and paralogy. On the other hand, DISCO offers a general strategy to decompose multi-copy gene trees into a collection of single-copy trees, allowing the utilization of methods previously designed for species tree inference in the context of single-copy gene trees. Results In this study, we first introduce some variants of DISCO to examine its underlying hypotheses and present analytical results on the statistical guarantees of DISCO. In particular, we introduce DISCO-R, a variant of DISCO with a refined and improved pruning strategy that provides more accurate and robust results. We then demonstrate with extensive evaluation studies on a collection of simulated and real data sets that wQFM paired with DISCO variants consistently matches or outperforms ASTRAL-Pro and other competing methods. Availability and implementation DISCO-R and other variants are freely available at https://github.com/skhakim/DISCO-variants.
Collapse
Affiliation(s)
- Sheikh Azizul Hakim
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
| | - Md Rownok Zahan Ratul
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
| | - Md Shamsuzzoha Bayzid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
| |
Collapse
|
2
|
Habib M, Roy K, Hasan S, Rahman AH, Bayzid MS. Terraces in species tree inference from gene trees. BMC Ecol Evol 2024; 24:135. [PMID: 39497030 PMCID: PMC11533290 DOI: 10.1186/s12862-024-02309-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/12/2023] [Accepted: 09/16/2024] [Indexed: 11/06/2024] Open
Abstract
A terrace in a phylogenetic tree space is a region where all trees contain the same set of subtrees, due to certain patterns of missing data among the taxa sampled, resulting in an identical optimality score for a given data set. This was first investigated in the context of phylogenetic tree estimation from sequence alignments using maximum likelihood (ML) and maximum parsimony (MP). It was later extended to the species tree inference problem from a collection of gene trees, where a set of equally optimal species trees was referred to as a "pseudo" species tree terrace which does not consider the topological proximity of the trees in terms of the induced subtrees resulting from certain patterns of missing data. In this study, we mathematically characterize species tree terraces and investigate the mathematical properties and conditions that lead multiple species trees to induce/display an identical set of locus-specific subtrees owing to missing data. We report that species tree terraces are agnostic to gene tree heterogeneity. Therefore, we introduce and characterize a special type of gene tree topology-aware terrace which we call "peak terrace". Moreover, we empirically investigated various challenges and opportunities related to species tree terraces through extensive empirical studies using simulated and real biological data. We demonstrate the prevalence of species tree terraces and the resulting ambiguity created for tree search algorithms. Remarkably, our findings indicate that the identification of terraces could potentially lead to advances that enhance the accuracy of summary methods and provide reasonably accurate branch support.
Collapse
Affiliation(s)
- Mursalin Habib
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
| | - Kowshic Roy
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
| | - Saem Hasan
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
| | - Atif Hasan Rahman
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
| | - Md Shamsuzzoha Bayzid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh.
| |
Collapse
|
3
|
Schrago CG, Mello B. Challenges in Assembling the Dated Tree of Life. Genome Biol Evol 2024; 16:evae229. [PMID: 39475308 PMCID: PMC11523137 DOI: 10.1093/gbe/evae229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 10/15/2024] [Indexed: 11/02/2024] Open
Abstract
The assembly of a comprehensive and dated Tree of Life (ToL) remains one of the most formidable challenges in evolutionary biology. The complexity of life's history, involving both vertical and horizontal transmission of genetic information, defies its representation by a simple bifurcating phylogeny. With the advent of genome and metagenome sequencing, vast amounts of data have become available. However, employing this information for phylogeny and divergence time inference has introduced significant theoretical and computational hurdles. This perspective addresses some key methodological challenges in assembling the dated ToL, namely, the identification and classification of homologous genes, accounting for gene tree-species tree mismatch due to population-level processes along with duplication, loss, and horizontal gene transfer, and the accurate dating of evolutionary events. Ultimately, the success of this endeavor requires new approaches that integrate knowledge databases with optimized phylogenetic algorithms capable of managing complex evolutionary models.
Collapse
Affiliation(s)
- Carlos G Schrago
- Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
| | - Beatriz Mello
- Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
| |
Collapse
|
4
|
Schmidt S, Toivonen S, Medvedev P, Tomescu AI. Applying the Safe-And-Complete Framework to Practical Genome Assembly. LIPICS : LEIBNIZ INTERNATIONAL PROCEEDINGS IN INFORMATICS 2024; 312:8. [PMID: 40297742 PMCID: PMC12037172 DOI: 10.4230/lipics.wabi.2024.8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Figures] [Subscribe] [Scholar Register] [Indexed: 04/30/2025]
Abstract
Despite the long history of genome assembly research, there remains a large gap between the theoretical and practical work. There is practical software with little theoretical underpinning of accuracy on one hand and theoretical algorithms which have not been adopted in practice on the other. In this paper we attempt to bridge the gap between theory and practice by showing how the theoretical safe-and-complete framework can be integrated into existing assemblers in order to improve contiguity. The optimal algorithm in this framework, called the omnitig algorithm, has not been used in practice due to its complexity and its lack of robustness to real data. Instead, we pursue a simplified notion of omnitigs (simple omnitigs), giving an efficient algorithm to compute them and demonstrating their safety under certain conditions. We modify two assemblers (wtdbg2 and Flye) by replacing their unitig algorithm with the simple omnitig algorithm. We test our modifications using real HiFi data from the D. melanogaster and the C. elegans genomes. Our modified algorithms lead to a substantial improvement in alignment-based contiguity, with negligible additional computational costs and either no or a small increase in the number of misassemblies.
Collapse
Affiliation(s)
| | | | - Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA
| | | |
Collapse
|
5
|
Dai J, Rubel T, Han Y, Molloy EK. Dollo-CDP: a polynomial-time algorithm for the clade-constrained large Dollo parsimony problem. Algorithms Mol Biol 2024; 19:2. [PMID: 38191515 PMCID: PMC10775561 DOI: 10.1186/s13015-023-00249-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2023] [Accepted: 12/10/2023] [Indexed: 01/10/2024] Open
Abstract
The last decade of phylogenetics has seen the development of many methods that leverage constraints plus dynamic programming. The goal of this algorithmic technique is to produce a phylogeny that is optimal with respect to some objective function and that lies within a constrained version of tree space. The popular species tree estimation method ASTRAL, for example, returns a tree that (1) maximizes the quartet score computed with respect to the input gene trees and that (2) draws its branches (bipartitions) from the input constraint set. This technique has yet to be used for parsimony problems where the input are binary characters, sometimes with missing values. Here, we introduce the clade-constrained character parsimony problem and present an algorithm that solves this problem for the Dollo criterion score in [Formula: see text] time, where n is the number of leaves, k is the number of characters, and [Formula: see text] is the set of clades used as constraints. Dollo parsimony, which requires traits/mutations to be gained at most once but allows them to be lost any number of times, is widely used for tumor phylogenetics as well as species phylogenetics, for example analyses of low-homoplasy retroelement insertions across the vertebrate tree of life. This motivated us to implement our algorithm in a software package, called Dollo-CDP, and evaluate its utility for analyzing retroelement insertion presence / absence patterns for bats, birds, toothed whales as well as simulated data. Our results show that Dollo-CDP can improve upon heuristic search from a single starting tree, often recovering a better scoring tree. Moreover, Dollo-CDP scales to data sets with much larger numbers of taxa than branch-and-bound while still having an optimality guarantee, albeit a more restricted one. Lastly, we show that our algorithm for Dollo parsimony can easily be adapted to Camin-Sokal parsimony but not Fitch parsimony.
Collapse
Affiliation(s)
- Junyan Dai
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Tobias Rubel
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Yunheng Han
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Erin K Molloy
- Department of Computer Science, University of Maryland, College Park, MD, USA.
- University of Maryland Institute for Advanced Computer Studies, College Park, MD, USA.
| |
Collapse
|
6
|
Schmidt S, Khan S, Alanko JN, Pibiri GE, Tomescu AI. Matchtigs: minimum plain text representation of k-mer sets. Genome Biol 2023; 24:136. [PMID: 37296461 PMCID: PMC10251615 DOI: 10.1186/s13059-023-02968-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2021] [Accepted: 05/10/2023] [Indexed: 06/12/2023] Open
Abstract
We propose a polynomial algorithm computing a minimum plain-text representation of k-mer sets, as well as an efficient near-minimum greedy heuristic. When compressing read sets of large model organisms or bacterial pangenomes, with only a minor runtime increase, we shrink the representation by up to 59% over unitigs and 26% over previous work. Additionally, the number of strings is decreased by up to 97% over unitigs and 90% over previous work. Finally, a small representation has advantages in downstream applications, as it speeds up SSHash-Lite queries by up to 4.26× over unitigs and 2.10× over previous work.
Collapse
Affiliation(s)
- Sebastian Schmidt
- Department of Computer Science, University of Helsinki, Helsinki, Finland
| | - Shahbaz Khan
- Department of Computer Science and Engineering, Indian Institute of Technology Roorkee, Roorkee, India
| | - Jarno N. Alanko
- Department of Computer Science, University of Helsinki, Helsinki, Finland
- Faculty of Computer Science, Dalhousie University, Halifax, Canada
| | - Giulio E. Pibiri
- Department of Environmental Sciences, Informatics and Statistics, Ca’ Foscari University of Venice, Venice, Italy
- ISTI-CNR, Pisa, Italy
| | | |
Collapse
|
7
|
Bayzid MS. Inferring Optimal Species Trees in the Presence of Gene Duplication and Loss: Beyond Rooted Gene Trees. J Comput Biol 2023; 30:161-175. [PMID: 36251762 DOI: 10.1089/cmb.2021.0522] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022] Open
Abstract
Estimating species trees from multiple genes is complicated and challenging due to gene tree-species tree discordance. One of the basic approaches to understanding differences between gene trees and species trees is gene duplication and loss events. Minimize Gene Duplication and Loss (MGDL) is a popular technique for inferring species trees from gene trees when the gene trees are discordant due to gene duplications and losses. Previously, exact algorithms for estimating species trees from rooted, binary trees under MGDL were proposed. However, gene trees are usually estimated using time-reversible mutation models, which result in unrooted trees. In this article, we propose a dynamic programming (DP) algorithm that can be used for an exact but exponential time solution for the case when gene trees are not rooted. We also show that a constrained version of this problem can be solved by this DP algorithm in time that is polynomial in the number of gene trees and taxa. We have proved important structural properties that allow us to extend the algorithms for rooted gene trees to unrooted gene trees. We propose a linear time algorithm for finding the optimal rooted version of an unrooted gene tree given a rooted species tree so that the duplication and loss cost is minimized. Moreover, we prove that the optimal rooting under MGDL is also optimal under the MDC (minimize deep coalescence) criterion. The proposed methods can be applied to both orthologous genes and gene families that by definition include both paralogs and orthologs. Therefore, we hope that these techniques will be useful for estimating species trees from genes sampled throughout the whole genome.
Collapse
Affiliation(s)
- Md Shamsuzzoha Bayzid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
| |
Collapse
|
8
|
Mahbub S, Sawmya S, Saha A, Reaz R, Rahman MS, Bayzid MS. Quartet Based Gene Tree Imputation Using Deep Learning Improves Phylogenomic Analyses Despite Missing Data. JOURNAL OF COMPUTATIONAL BIOLOGY : A JOURNAL OF COMPUTATIONAL MOLECULAR CELL BIOLOGY 2022; 29:1156-1172. [PMID: 36048555 DOI: 10.1089/cmb.2022.0212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, for a combination of reasons (ranging from sampling biases to more biological causes, as in gene birth and loss), gene trees are often incomplete, meaning that not all species of interest have a common set of genes. Incomplete gene trees can potentially impact the accuracy of phylogenomic inference. We, for the first time, introduce the problem of imputing the quartet distribution induced by a set of incomplete gene trees, which involves adding the missing quartets back to the quartet distribution. We present Quartet based Gene tree Imputation using Deep Learning (QT-GILD), an automated and specially tailored unsupervised deep learning technique, accompanied by cues from natural language processing, which learns the quartet distribution in a given set of incomplete gene trees and generates a complete set of quartets accordingly. QT-GILD is a general-purpose technique needing no explicit modeling of the subject system or reasons for missing data or gene tree heterogeneity. Experimental studies on a collection of simulated and empirical datasets suggest that QT-GILD can effectively impute the quartet distribution, which results in a dramatic improvement in the species tree accuracy. Remarkably, QT-GILD not only imputes the missing quartets but can also account for gene tree estimation error. Therefore, QT-GILD advances the state-of-the-art in species tree estimation from gene trees in the face of missing data.
Collapse
Affiliation(s)
- Sazan Mahbub
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh.,Department of Computer Science, University of Maryland, College Park, Maryland, USA
| | - Shashata Sawmya
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
| | - Arpita Saha
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
| | - Rezwana Reaz
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
| | - M Sohel Rahman
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
| | - Md Shamsuzzoha Bayzid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
| |
Collapse
|
9
|
Cerón-Romero MA, Fonseca MM, de Oliveira Martins L, Posada D, Katz LA. Phylogenomic Analyses of 2,786 Genes in 158 Lineages Support a Root of the Eukaryotic Tree of Life between Opisthokonts and All Other Lineages. Genome Biol Evol 2022; 14:evac119. [PMID: 35880421 PMCID: PMC9366629 DOI: 10.1093/gbe/evac119] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/11/2022] [Indexed: 12/02/2022] Open
Abstract
Advances in phylogenomics and high-throughput sequencing have allowed the reconstruction of deep phylogenetic relationships in the evolution of eukaryotes. Yet, the root of the eukaryotic tree of life remains elusive. The most popular hypothesis in textbooks and reviews is a root between Unikonta (Opisthokonta + Amoebozoa) and Bikonta (all other eukaryotes), which emerged from analyses of a single-gene fusion. Subsequent, highly cited studies based on concatenation of genes supported this hypothesis with some variations or proposed a root within Excavata. However, concatenation of genes does not consider phylogenetically-informative events like gene duplications and losses. A recent study using gene tree parsimony (GTP) suggested the root lies between Opisthokonta and all other eukaryotes, but only including 59 taxa and 20 genes. Here we use GTP with a duplication-loss model in a gene-rich and taxon-rich dataset (i.e., 2,786 gene families from two sets of 155 and 158 diverse eukaryotic lineages) to assess the root, and we iterate each analysis 100 times to quantify tree space uncertainty. We also contrasted our results and discarded alternative hypotheses from the literature using GTP and the likelihood-based method SpeciesRax. Our estimates suggest a root between Fungi or Opisthokonta and all other eukaryotes; but based on further analysis of genome size, we propose that the root between Opisthokonta and all other eukaryotes is the most likely.
Collapse
Affiliation(s)
- Mario A Cerón-Romero
- Department of Biological Sciences, Smith College, Northampton, Massachusetts, USA
- Program in Organismic and Evolutionary Biology, University of Massachusetts Amherst, Amherst, Massachusetts, USA
- Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana-Champaign, Illinois, USA
| | - Miguel M Fonseca
- CIIMAR - Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Porto, Portugal
| | - Leonardo de Oliveira Martins
- Department of Biochemistry, Genetics and Immunology, University of Vigo, 36310 Vigo, Spain
- Quadram Institute Bioscience, Norwich, United Kingdom
| | - David Posada
- Department of Biochemistry, Genetics and Immunology, University of Vigo, 36310 Vigo, Spain
- CINBIO, Universidade de Vigo, 36310 Vigo, Spain
- Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Vigo, Spain
| | - Laura A Katz
- Department of Biological Sciences, Smith College, Northampton, Massachusetts, USA
- Program in Organismic and Evolutionary Biology, University of Massachusetts Amherst, Amherst, Massachusetts, USA
| |
Collapse
|
10
|
Pinheiro D, Santander-Jimenéz S, Ilic A. PhyloMissForest: a random forest framework to construct phylogenetic trees with missing data. BMC Genomics 2022; 23:377. [PMID: 35585494 PMCID: PMC9116704 DOI: 10.1186/s12864-022-08540-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2021] [Accepted: 04/01/2022] [Indexed: 11/10/2022] Open
Abstract
Background In the pursuit of a better understanding of biodiversity, evolutionary biologists rely on the study of phylogenetic relationships to illustrate the course of evolution. The relationships among natural organisms, depicted in the shape of phylogenetic trees, not only help to understand evolutionary history but also have a wide range of additional applications in science. One of the most challenging problems that arise when building phylogenetic trees is the presence of missing biological data. More specifically, the possibility of inferring wrong phylogenetic trees increases proportionally to the amount of missing values in the input data. Although there are methods proposed to deal with this issue, their applicability and accuracy is often restricted by different constraints. Results We propose a framework, called PhyloMissForest, to impute missing entries in phylogenetic distance matrices and infer accurate evolutionary relationships. PhyloMissForest is built upon a random forest structure that infers the missing entries of the input data, based on the known parts of it. PhyloMissForest contributes with a robust and configurable framework that incorporates multiple search strategies and machine learning, complemented by phylogenetic techniques, to provide a more accurate inference of lost phylogenetic distances. We evaluate our framework by examining three real-world datasets, two DNA-based sequence alignments and one containing amino acid data, and two additional instances with simulated DNA data. Moreover, we follow a design of experiments methodology to define the hyperparameter values of our algorithm, which is a concise method, preferable in comparison to the well-known exhaustive parameters search. By varying the percentages of missing data from 5% to 60%, we generally outperform the state-of-the-art alternative imputation techniques in the tests conducted on real DNA data. In addition, significant improvements in execution time are observed for the amino acid instance. The results observed on simulated data also denote the attainment of improved imputations when dealing with large percentages of missing data. Conclusions By merging multiple search strategies, machine learning, and phylogenetic techniques, PhyloMissForest provides a highly customizable and robust framework for phylogenetic missing data imputation, with significant topological accuracy and effective speedups over the state of the art. Supplementary Information The online version contains supplementary material available at (10.1186/s12864-022-08540-6).
Collapse
Affiliation(s)
- Diogo Pinheiro
- INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Rua Alves Redol 9, Lisboa, 1000-029, Portugal
| | - Sergio Santander-Jimenéz
- Department of Computer and Communications Technologies, University of Extremadura, Campus universitario s/n, Cáceres, 10003, Spain
| | - Aleksandar Ilic
- INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Rua Alves Redol 9, Lisboa, 1000-029, Portugal.
| |
Collapse
|
11
|
Farah IT, Islam MM, Zinat KT, Rahman AH, Bayzid MS. Species tree estimation from gene trees by minimizing deep coalescence and maximizing quartet consistency: a comparative study and the presence of pseudo species tree terraces. Syst Biol 2021; 70:1213-1231. [PMID: 33844023 DOI: 10.1093/sysbio/syab026] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2020] [Revised: 03/25/2021] [Accepted: 03/29/2021] [Indexed: 11/14/2022] Open
Abstract
Species tree estimation from multi-locus datasets is extremely challenging, especially in the presence of gene tree heterogeneity across the genome due to incomplete lineage sorting (ILS). Summary methods have been developed which estimate gene trees and then combine the gene trees to estimate a species tree by optimizing various optimization scores. In this study, we have extended and adapted the concept of phylogenetic terraces to species tree estimation by "summarizing" a set of gene trees, where multiple species trees with distinct topologies may have exactly the same optimality score (i.e., quartet score, extra lineage score, etc.). We particularly investigated the presence and impacts of equally optimal trees in species tree estimation from multi-locus data using summary methods by taking ILS into account. We analyzed two of the most popular ILS-aware optimization criteria: maximize quartet consistency (MQC) and minimize deep coalescence (MDC). Methods based on MQC are provably statistically consistent, whereas MDC is not a consistent criterion for species tree estimation. We present a comprehensive comparative study of these two optimality criteria. Our experiments, on a collection of datasets simulated under ILS, indicate that MDC may result in competitive or identical quartet consistency score as MQC, but could be significantly worse than MQC in terms of tree accuracy - demonstrating the presence and impacts of equally optimal species trees. This is the first known study that provides the conditions for the datasets to have equally optimal trees in the context of phylogenomic inference using summary methods.
Collapse
Affiliation(s)
- Ishrat Tanzila Farah
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology Dhaka-1205, Bangladesh
| | - Md Muktadirul Islam
- Applied Statistics and Data Science (ASDS), Department of Statistics Jahangirnagar University Dhaka-1342, Bangladesh
| | - Kazi Tasnim Zinat
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology Dhaka-1205, Bangladesh.,Department of Computer Science University of Maryland, College Park, Maryland, USA
| | - Atif Hasan Rahman
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology Dhaka-1205, Bangladesh
| | - Md Shamsuzzoha Bayzid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology Dhaka-1205, Bangladesh
| |
Collapse
|
12
|
New Approaches for Inferring Phylogenies in the Presence of Paralogs. Trends Genet 2021; 37:174-187. [DOI: 10.1016/j.tig.2020.08.012] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2020] [Revised: 08/13/2020] [Accepted: 08/19/2020] [Indexed: 12/18/2022]
|
13
|
|
14
|
Molloy EK, Warnow T. FastMulRFS: fast and accurate species tree estimation under generic gene duplication and loss models. Bioinformatics 2020; 36:i57-i65. [PMID: 32657396 PMCID: PMC7355287 DOI: 10.1093/bioinformatics/btaa444] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022] Open
Abstract
MOTIVATION Species tree estimation is a basic part of biological research but can be challenging because of gene duplication and loss (GDL), which results in genes that can appear more than once in a given genome. All common approaches in phylogenomic studies either reduce available data or are error-prone, and thus, scalable methods that do not discard data and have high accuracy on large heterogeneous datasets are needed. RESULTS We present FastMulRFS, a polynomial-time method for estimating species trees without knowledge of orthology. We prove that FastMulRFS is statistically consistent under a generic model of GDL when adversarial GDL does not occur. Our extensive simulation study shows that FastMulRFS matches the accuracy of MulRF (which tries to solve the same optimization problem) and has better accuracy than prior methods, including ASTRAL-multi (the only method to date that has been proven statistically consistent under GDL), while being much faster than both methods. AVAILABILITY AND IMPEMENTATION FastMulRFS is available on Github (https://github.com/ekmolloy/fastmulrfs). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Erin K Molloy
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| |
Collapse
|
15
|
Islam M, Sarker K, Das T, Reaz R, Bayzid MS. STELAR: a statistically consistent coalescent-based species tree estimation method by maximizing triplet consistency. BMC Genomics 2020; 21:136. [PMID: 32039704 PMCID: PMC7011378 DOI: 10.1186/s12864-020-6519-y] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2019] [Accepted: 01/20/2020] [Indexed: 12/14/2022] Open
Abstract
Background Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, estimating a species tree from a collection of gene trees can be complicated due to the presence of gene tree incongruence resulting from incomplete lineage sorting (ILS), which is modelled by the multi-species coalescent process. Maximum likelihood and Bayesian MCMC methods can potentially result in accurate trees, but they do not scale well to large datasets. Results We present STELAR (Species Tree Estimation by maximizing tripLet AgReement), a new fast and highly accurate statistically consistent coalescent-based method for estimating species trees from a collection of gene trees. We formalized the constrained triplet consensus (CTC) problem and showed that the solution to the CTC problem is a statistically consistent estimate of the species tree under the multi-species coalescent (MSC) model. STELAR is an efficient dynamic programming based solution to the CTC problem which is highly accurate and scalable. We evaluated the accuracy of STELAR in comparison with SuperTriplets, which is an alternate fast and highly accurate triplet-based supertree method, and with MP-EST and ASTRAL – two of the most popular and accurate coalescent-based methods. Experimental results suggest that STELAR matches the accuracy of ASTRAL and improves on MP-EST and SuperTriplets. Conclusions Theoretical and empirical results (on both simulated and real biological datasets) suggest that STELAR is a valuable technique for species tree estimation from gene tree distributions.
Collapse
Affiliation(s)
- Mazharul Islam
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
| | - Kowshika Sarker
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
| | - Trisha Das
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
| | - Rezwana Reaz
- Department of Computer Science, The University of Texas at Austin, Texas, 78712, USA
| | - Md Shamsuzzoha Bayzid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh.
| |
Collapse
|
16
|
Polynomial-Time Statistical Estimation of Species Trees Under Gene Duplication and Loss. LECTURE NOTES IN COMPUTER SCIENCE 2020. [DOI: 10.1007/978-3-030-45257-5_8] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/16/2022]
|
17
|
Molloy EK, Warnow T. TreeMerge: a new method for improving the scalability of species tree estimation methods. Bioinformatics 2019; 35:i417-i426. [PMID: 31510668 PMCID: PMC6612878 DOI: 10.1093/bioinformatics/btz344] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Abstract
MOTIVATION At RECOMB-CG 2018, we presented NJMerge and showed that it could be used within a divide-and-conquer framework to scale computationally intensive methods for species tree estimation to larger datasets. However, NJMerge has two significant limitations: it can fail to return a tree and, when used within the proposed divide-and-conquer framework, has O(n5) running time for datasets with n species. RESULTS Here we present a new method called 'TreeMerge' that improves on NJMerge in two ways: it is guaranteed to return a tree and it has dramatically faster running time within the same divide-and-conquer framework-only O(n2) time. We use a simulation study to evaluate TreeMerge in the context of multi-locus species tree estimation with two leading methods, ASTRAL-III and RAxML. We find that the divide-and-conquer framework using TreeMerge has a minor impact on species tree accuracy, dramatically reduces running time, and enables both ASTRAL-III and RAxML to complete on datasets (that they would otherwise fail on), when given 64 GB of memory and 48 h maximum running time. Thus, TreeMerge is a step toward a larger vision of enabling researchers with limited computational resources to perform large-scale species tree estimation, which we call Phylogenomics for All. AVAILABILITY AND IMPLEMENTATION TreeMerge is publicly available on Github (http://github.com/ekmolloy/treemerge). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Erin K Molloy
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
| |
Collapse
|
18
|
Soto Gomez M, Pokorny L, Kantar MB, Forest F, Leitch IJ, Gravendeel B, Wilkin P, Graham SW, Viruel J. A customized nuclear target enrichment approach for developing a phylogenomic baseline for Dioscorea yams (Dioscoreaceae). APPLICATIONS IN PLANT SCIENCES 2019; 7:e11254. [PMID: 31236313 PMCID: PMC6580989 DOI: 10.1002/aps3.11254] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/13/2018] [Accepted: 04/01/2019] [Indexed: 05/14/2023]
Abstract
PREMISE We developed a target enrichment panel for phylogenomic studies of Dioscorea, an economically important genus with incompletely resolved relationships. METHODS Our bait panel comprises 260 low- to single-copy nuclear genes targeted to work in Dioscorea, assessed here using a preliminary taxon sampling that includes both distantly and closely related taxa, including several yam crops and potential crop wild relatives. We applied coalescent-based and maximum likelihood phylogenomic inference approaches to the pilot taxon set, incorporating new and published transcriptome data from additional species. RESULTS The custom panel retrieved ~94% of targets and >80% of full gene length from 88% and 68% of samples, respectively. It has minimal gene overlap with existing panels designed for angiosperm-wide studies and generally recovers longer and more variable targets. Pilot phylogenomic analyses consistently resolve most deep and recent relationships with strong support across analyses and point to revised relationships between the crop species D. alata and candidate crop wild relatives. DISCUSSION Our customized panel reliably retrieves targeted loci from Dioscorea, is informative for resolving relationships in denser samplings, and is suitable for refining our understanding of the independent origins of cultivated yam species; the panel likely has broader promise for phylogenomic studies across Dioscoreales.
Collapse
Affiliation(s)
- Marybel Soto Gomez
- Department of BotanyUniversity of British Columbia6270 University BoulevardVancouverBritish ColumbiaV6T 1Z4Canada
- UBC Botanical Garden and Centre for Plant ResearchUniversity of British Columbia6804 Marine Drive SWVancouverBritish ColumbiaV6T 1Z4Canada
| | - Lisa Pokorny
- Royal Botanic GardensKew, RichmondSurreyTW9 3DSUnited Kingdom
| | - Michael B. Kantar
- Department of Tropical Plant and Soil SciencesUniversity of Hawai‘i at ManoaHonoluluHawai‘i96822USA
| | - Félix Forest
- Royal Botanic GardensKew, RichmondSurreyTW9 3DSUnited Kingdom
| | - Ilia J. Leitch
- Royal Botanic GardensKew, RichmondSurreyTW9 3DSUnited Kingdom
| | - Barbara Gravendeel
- Naturalis Biodiversity CenterEndless FormsSylviusweg 72Leiden2333 BEThe Netherlands
- Institute Biology LeidenLeiden UniversitySylviusweg 72Leiden2333 BEThe Netherlands
- Faculty of Science and TechnologyUniversity of Applied Sciences LeidenZernikedreef 11Leiden2333 CKThe Netherlands
| | - Paul Wilkin
- Royal Botanic GardensKew, RichmondSurreyTW9 3DSUnited Kingdom
| | - Sean W. Graham
- Department of BotanyUniversity of British Columbia6270 University BoulevardVancouverBritish ColumbiaV6T 1Z4Canada
- UBC Botanical Garden and Centre for Plant ResearchUniversity of British Columbia6804 Marine Drive SWVancouverBritish ColumbiaV6T 1Z4Canada
| | - Juan Viruel
- Royal Botanic GardensKew, RichmondSurreyTW9 3DSUnited Kingdom
| |
Collapse
|