1
|
Tabatabaee Y, Zhang C, Arasti S, Mirarab S. Species tree branch length estimation despite incomplete lineage sorting, duplication, and loss. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.02.20.639320. [PMID: 40027742 PMCID: PMC11870528 DOI: 10.1101/2025.02.20.639320] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/05/2025]
Abstract
Phylogenetic branch lengths are essential for many analyses, such as estimating divergence times, analyzing rate changes, and studying adaptation. However, true gene tree heterogeneity due to incomplete lineage sorting (ILS), gene duplication and loss (GDL), and horizontal gene transfer (HGT) can complicate the estimation of species tree branch lengths. While several tools exist for estimating the topology of a species tree addressing various causes of gene tree discordance, much less attention has been paid to branch length estimation on multi-locus datasets. For single-copy gene trees, some methods are available that summarize gene tree branch lengths onto a species tree, including coalescent-based methods that account for heterogeneity due to ILS. However, no such branch length estimation method exists for multi-copy gene family trees that have evolved with gene duplication and loss. To address this gap, we introduce the CASTLES-Pro algorithm for estimating species tree branch lengths while accounting for both GDL and ILS. CASTLES-Pro improves on the existing coalescent-based branch length estimation method CASTLES by increasing its accuracy for single-copy gene trees and extends it to handle multi-copy ones. Our simulation studies show that CASTLES-Pro is generally more accurate than alternatives, eliminating the systematic bias toward overestimating terminal branch lengths often observed when using concatenation. Moreover, while not theoretically designed for HGT, we show that CASTLES-Pro maintains relatively high accuracy under high rates of random HGT. Code availability CASTLES-Pro is implemented inside the software package ASTER, available at https://github.com/chaoszhang/ASTER . Data availability The datasets and scripts used in this study are available at https://github.com/ytabatabaee/CASTLES-Pro-paper .
Collapse
|
2
|
Kong S, Swofford DL, Kubatko LS. Inference of Phylogenetic Networks From Sequence Data Using Composite Likelihood. Syst Biol 2025; 74:53-69. [PMID: 39387633 DOI: 10.1093/sysbio/syae054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2022] [Revised: 09/13/2024] [Accepted: 10/08/2024] [Indexed: 10/12/2024] Open
Abstract
While phylogenies have been essential in understanding how species evolve, they do not adequately describe some evolutionary processes. For instance, hybridization, a common phenomenon where interbreeding between 2 species leads to formation of a new species, must be depicted by a phylogenetic network, a structure that modifies a phylogenetic tree by allowing 2 branches to merge into 1, resulting in reticulation. However, existing methods for estimating networks become computationally expensive as the dataset size and/or topological complexity increase. The lack of methods for scalable inference hampers phylogenetic networks from being widely used in practice, despite accumulating evidence that hybridization occurs frequently in nature. Here, we propose a novel method, PhyNEST (Phylogenetic Network Estimation using SiTe patterns), that estimates binary, level-1 phylogenetic networks with a fixed, user-specified number of reticulations directly from sequence data. By using the composite likelihood as the basis for inference, PhyNEST is able to use the full genomic data in a computationally tractable manner, eliminating the need to summarize the data as a set of gene trees prior to network estimation. To search network space, PhyNEST implements both hill climbing and simulated annealing algorithms. PhyNEST assumes that the data are composed of coalescent independent sites that evolve according to the Jukes-Cantor substitution model and that the network has a constant effective population size. Simulation studies demonstrate that PhyNEST is often more accurate than 2 existing composite likelihood summary methods (SNaQand PhyloNet) and that it is robust to at least one form of model misspecification (assuming a less complex nucleotide substitution model than the true generating model). We applied PhyNEST to reconstruct the evolutionary relationships among Heliconius butterflies and Papionini primates, characterized by hybrid speciation and widespread introgression, respectively. PhyNEST is implemented in an open-source Julia package and is publicly available at https://github.com/sungsik-kong/PhyNEST.jl.
Collapse
Affiliation(s)
- Sungsik Kong
- Department of Evolution, Ecology, and Organismal Biology, The Ohio State University, Columbus, OH 43210, USA
- Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI 53715, USA
| | - David L Swofford
- Florida Museum of Natural History, University of Florida, Gainesville, FL 32611, USA
| | - Laura S Kubatko
- Department of Evolution, Ecology, and Organismal Biology, The Ohio State University, Columbus, OH 43210, USA
- Department of Statistics, The Ohio State University, Columbus, OH 43210, USA
| |
Collapse
|
3
|
Hakim SA, Ratul MRZ, Bayzid MS. wQFM-DISCO: DISCO-enabled wQFM improves phylogenomic analyses despite the presence of paralogs. BIOINFORMATICS ADVANCES 2024; 4:vbae189. [PMID: 39664861 PMCID: PMC11634537 DOI: 10.1093/bioadv/vbae189] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 08/22/2024] [Revised: 10/18/2024] [Accepted: 11/24/2024] [Indexed: 12/13/2024]
Abstract
Motivation Gene trees often differ from the species trees that contain them due to various factors, including incomplete lineage sorting (ILS) and gene duplication and loss (GDL). Several highly accurate species tree estimation methods have been introduced to explicitly address ILS, including ASTRAL, a widely used statistically consistent method, and wQFM, a quartet amalgamation approach experimentally shown to be more accurate than ASTRAL. Two recent advancements, ASTRAL-Pro and DISCO, have emerged in phylogenomics to consider GDL. ASTRAL-Pro introduces a refined quartet similarity measure, accounting for orthology and paralogy. On the other hand, DISCO offers a general strategy to decompose multi-copy gene trees into a collection of single-copy trees, allowing the utilization of methods previously designed for species tree inference in the context of single-copy gene trees. Results In this study, we first introduce some variants of DISCO to examine its underlying hypotheses and present analytical results on the statistical guarantees of DISCO. In particular, we introduce DISCO-R, a variant of DISCO with a refined and improved pruning strategy that provides more accurate and robust results. We then demonstrate with extensive evaluation studies on a collection of simulated and real data sets that wQFM paired with DISCO variants consistently matches or outperforms ASTRAL-Pro and other competing methods. Availability and implementation DISCO-R and other variants are freely available at https://github.com/skhakim/DISCO-variants.
Collapse
Affiliation(s)
- Sheikh Azizul Hakim
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
| | - Md Rownok Zahan Ratul
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
| | - Md Shamsuzzoha Bayzid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
| |
Collapse
|
4
|
Morel B, Williams TA, Stamatakis A, Szöllősi GJ. AleRax: a tool for gene and species tree co-estimation and reconciliation under a probabilistic model of gene duplication, transfer, and loss. Bioinformatics 2024; 40:btae162. [PMID: 38514421 PMCID: PMC10990685 DOI: 10.1093/bioinformatics/btae162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2023] [Revised: 01/30/2024] [Accepted: 03/19/2024] [Indexed: 03/23/2024] Open
Abstract
MOTIVATION Genomes are a rich source of information on the pattern and process of evolution across biological scales. How best to make use of that information is an active area of research in phylogenetics. Ideally, phylogenetic methods should not only model substitutions along gene trees, which explain differences between homologous gene sequences, but also the processes that generate the gene trees themselves along a shared species tree. To conduct accurate inferences, one needs to account for uncertainty at both levels, that is, in gene trees estimated from inherently short sequences and in their diverse evolutionary histories along a shared species tree. RESULTS We present AleRax, a software that can infer reconciled gene trees together with a shared species tree using a simple, yet powerful, probabilistic model of gene duplication, transfer, and loss. A key feature of AleRax is its ability to account for uncertainty in the gene tree and its reconciliation by using an efficient approximation to calculate the joint phylogenetic-reconciliation likelihood and sample reconciled gene trees accordingly. Simulations and analyses of empirical data show that AleRax is one order of magnitude faster than competing gene tree inference tools while attaining the same accuracy. It is consistently more robust than species tree inference methods such as SpeciesRax and ASTRAL-Pro 2 under gene tree uncertainty. Finally, AleRax can process multiple gene families in parallel thereby allowing users to compare competing phylogenetic hypotheses and estimate model parameters, such as duplication, transfer, and loss probabilities for genome-scale datasets with hundreds of taxa. AVAILABILITY AND IMPLEMENTATION GNU GPL at https://github.com/BenoitMorel/AleRax and data are made available at https://cme.h-its.org/exelixis/material/alerax_data.tar.gz.
Collapse
Affiliation(s)
- Benoit Morel
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg 69118, Germany
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe 76131, Germany
| | - Tom A Williams
- School of Biological Sciences, University of Bristol, Bristol BS8 1TQ, United Kingdom
| | - Alexandros Stamatakis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg 69118, Germany
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe 76131, Germany
- Institute of Computer Science, Biodiversity Computing Group, Heraklion GR-70013, Greece
| | - Gergely J Szöllősi
- ELTE-MTA “Lendület”, Evolutionary Genomics Research Group, Budapest H-1117, Hungary
- Institute of Evolution, HUN-REN Centre for Ecological Research, Budapest H-1121, Hungary
- Model-Based Evolutionary Genomics Unit, Okinawa Institute of Science and Technology Graduate University, Okinawa 904-0495, Japan
| |
Collapse
|
5
|
Górecki P, Rutecka N, Mykowiecka A, Paszek J. Unifying duplication episode clustering and gene-species mapping inference. Algorithms Mol Biol 2024; 19:7. [PMID: 38355611 PMCID: PMC10865717 DOI: 10.1186/s13015-024-00252-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2023] [Accepted: 01/04/2024] [Indexed: 02/16/2024] Open
Abstract
We present a novel problem, called MetaEC, which aims to infer gene-species assignments in a collection of partially leaf-labeled gene trees labels by minimizing the size of duplication episode clustering (EC). This problem is particularly relevant in metagenomics, where incomplete data often poses a challenge in the accurate reconstruction of gene histories. To solve MetaEC, we propose a polynomial time dynamic programming (DP) formulation that verifies the existence of a set of duplication episodes from a predefined set of episode candidates. In addition, we design a method to infer distributions of gene-species mappings. We then demonstrate how to use DP to design an algorithm that solves MetaEC. Although the algorithm is exponential in the worst case, we introduce a heuristic modification of the algorithm that provides a solution with the knowledge that it is exact. To evaluate our method, we perform two computational experiments on simulated and empirical data containing whole genome duplication events, showing that our algorithm is able to accurately infer the corresponding events.
Collapse
Affiliation(s)
- Paweł Górecki
- Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Banacha 2, Warsaw, 02-097, Poland.
| | - Natalia Rutecka
- Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Banacha 2, Warsaw, 02-097, Poland
| | - Agnieszka Mykowiecka
- Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Banacha 2, Warsaw, 02-097, Poland
| | - Jarosław Paszek
- Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Banacha 2, Warsaw, 02-097, Poland
| |
Collapse
|
6
|
Dai J, Rubel T, Han Y, Molloy EK. Dollo-CDP: a polynomial-time algorithm for the clade-constrained large Dollo parsimony problem. Algorithms Mol Biol 2024; 19:2. [PMID: 38191515 PMCID: PMC10775561 DOI: 10.1186/s13015-023-00249-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2023] [Accepted: 12/10/2023] [Indexed: 01/10/2024] Open
Abstract
The last decade of phylogenetics has seen the development of many methods that leverage constraints plus dynamic programming. The goal of this algorithmic technique is to produce a phylogeny that is optimal with respect to some objective function and that lies within a constrained version of tree space. The popular species tree estimation method ASTRAL, for example, returns a tree that (1) maximizes the quartet score computed with respect to the input gene trees and that (2) draws its branches (bipartitions) from the input constraint set. This technique has yet to be used for parsimony problems where the input are binary characters, sometimes with missing values. Here, we introduce the clade-constrained character parsimony problem and present an algorithm that solves this problem for the Dollo criterion score in [Formula: see text] time, where n is the number of leaves, k is the number of characters, and [Formula: see text] is the set of clades used as constraints. Dollo parsimony, which requires traits/mutations to be gained at most once but allows them to be lost any number of times, is widely used for tumor phylogenetics as well as species phylogenetics, for example analyses of low-homoplasy retroelement insertions across the vertebrate tree of life. This motivated us to implement our algorithm in a software package, called Dollo-CDP, and evaluate its utility for analyzing retroelement insertion presence / absence patterns for bats, birds, toothed whales as well as simulated data. Our results show that Dollo-CDP can improve upon heuristic search from a single starting tree, often recovering a better scoring tree. Moreover, Dollo-CDP scales to data sets with much larger numbers of taxa than branch-and-bound while still having an optimality guarantee, albeit a more restricted one. Lastly, we show that our algorithm for Dollo parsimony can easily be adapted to Camin-Sokal parsimony but not Fitch parsimony.
Collapse
Affiliation(s)
- Junyan Dai
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Tobias Rubel
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Yunheng Han
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Erin K Molloy
- Department of Computer Science, University of Maryland, College Park, MD, USA.
- University of Maryland Institute for Advanced Computer Studies, College Park, MD, USA.
| |
Collapse
|
7
|
Willson J, Tabatabaee Y, Liu B, Warnow T. DISCO+QR: rooting species trees in the presence of GDL and ILS. BIOINFORMATICS ADVANCES 2023; 3:vbad015. [PMID: 36789293 PMCID: PMC9923442 DOI: 10.1093/bioadv/vbad015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/12/2022] [Revised: 01/21/2023] [Accepted: 02/06/2023] [Indexed: 02/10/2023]
Abstract
Motivation Genes evolve under processes such as gene duplication and loss (GDL), so that gene family trees are multi-copy, as well as incomplete lineage sorting (ILS); both processes produce gene trees that differ from the species tree. The estimation of species trees from sets of gene family trees is challenging, and the estimation of rooted species trees presents additional analytical challenges. Two of the methods developed for this problem are STRIDE, which roots species trees by considering GDL events, and Quintet Rooting (QR), which roots species trees by considering ILS. Results We present DISCO+QR, a new approach to rooting species trees that first uses DISCO to address GDL and then uses QR to perform rooting in the presence of ILS. DISCO+QR operates by taking the input gene family trees and decomposing them into single-copy trees using DISCO and then roots the given species tree using the information in the single-copy gene trees using QR. We show that the relative accuracy of STRIDE and DISCO+QR depend on the properties of the dataset (number of species, genes, rate of gene duplication, degree of ILS and gene tree estimation error), and that each provides advantages over the other under some conditions. Availability and implementation DISCO and QR are available in github. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Collapse
Affiliation(s)
- James Willson
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Yasamin Tabatabaee
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Baqiao Liu
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | | |
Collapse
|
8
|
Bayzid MS. Inferring Optimal Species Trees in the Presence of Gene Duplication and Loss: Beyond Rooted Gene Trees. J Comput Biol 2023; 30:161-175. [PMID: 36251762 DOI: 10.1089/cmb.2021.0522] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022] Open
Abstract
Estimating species trees from multiple genes is complicated and challenging due to gene tree-species tree discordance. One of the basic approaches to understanding differences between gene trees and species trees is gene duplication and loss events. Minimize Gene Duplication and Loss (MGDL) is a popular technique for inferring species trees from gene trees when the gene trees are discordant due to gene duplications and losses. Previously, exact algorithms for estimating species trees from rooted, binary trees under MGDL were proposed. However, gene trees are usually estimated using time-reversible mutation models, which result in unrooted trees. In this article, we propose a dynamic programming (DP) algorithm that can be used for an exact but exponential time solution for the case when gene trees are not rooted. We also show that a constrained version of this problem can be solved by this DP algorithm in time that is polynomial in the number of gene trees and taxa. We have proved important structural properties that allow us to extend the algorithms for rooted gene trees to unrooted gene trees. We propose a linear time algorithm for finding the optimal rooted version of an unrooted gene tree given a rooted species tree so that the duplication and loss cost is minimized. Moreover, we prove that the optimal rooting under MGDL is also optimal under the MDC (minimize deep coalescence) criterion. The proposed methods can be applied to both orthologous genes and gene families that by definition include both paralogs and orthologs. Therefore, we hope that these techniques will be useful for estimating species trees from genes sampled throughout the whole genome.
Collapse
Affiliation(s)
- Md Shamsuzzoha Bayzid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
| |
Collapse
|
9
|
Zaharias P, Warnow T. Recent progress on methods for estimating and updating large phylogenies. Philos Trans R Soc Lond B Biol Sci 2022; 377:20210244. [PMID: 35989607 PMCID: PMC9393559 DOI: 10.1098/rstb.2021.0244] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2021] [Accepted: 01/07/2022] [Indexed: 12/20/2022] Open
Abstract
With the increased availability of sequence data and even of fully sequenced and assembled genomes, phylogeny estimation of very large trees (even of hundreds of thousands of sequences) is now a goal for some biologists. Yet, the construction of these phylogenies is a complex pipeline presenting analytical and computational challenges, especially when the number of sequences is very large. In the past few years, new methods have been developed that aim to enable highly accurate phylogeny estimations on these large datasets, including divide-and-conquer techniques for multiple sequence alignment and/or tree estimation, methods that can estimate species trees from multi-locus datasets while addressing heterogeneity due to biological processes (e.g. incomplete lineage sorting and gene duplication and loss), and methods to add sequences into large gene trees or species trees. Here we present some of these recent advances and discuss opportunities for future improvements. This article is part of a discussion meeting issue 'Genomic population structures of microbial pathogens'.
Collapse
Affiliation(s)
- Paul Zaharias
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| |
Collapse
|
10
|
Zhang C, Mirarab S. ASTRAL-Pro 2: ultrafast species tree reconstruction from multi-copy gene family trees. Bioinformatics 2022; 38:4949-4950. [PMID: 36094339 DOI: 10.1093/bioinformatics/btac620] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2022] [Revised: 09/03/2022] [Accepted: 09/09/2022] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Species tree inference from multi-copy gene trees has long been a challenge in phylogenomics. The recent method ASTRAL-Pro has made large strides by enabling multi-copy gene family trees as input and has been quickly adopted. Yet, its scalability, especially memory usage, needs to improve to accommodate the ever-growing dataset size. RESULTS We present ASTRAL-Pro 2, an ultrafast and memory efficient version of ASTRAL-Pro that adopts a placement-based optimization algorithm for significantly better scalability without sacrificing accuracy. AVAILABILITY The source code and binary files are publicly available at https://github.com/chaoszhang/ASTER; data are available at https://github.com/chaoszhang/A-Pro2_data. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Chao Zhang
- Bioinformatics and Systems Biology, UC San Diego, La Jolla, CA, 92093, USA
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, UC San Diego, La Jolla, CA, 92093, USA
| |
Collapse
|
11
|
Cerón-Romero MA, Fonseca MM, de Oliveira Martins L, Posada D, Katz LA. Phylogenomic Analyses of 2,786 Genes in 158 Lineages Support a Root of the Eukaryotic Tree of Life between Opisthokonts and All Other Lineages. Genome Biol Evol 2022; 14:evac119. [PMID: 35880421 PMCID: PMC9366629 DOI: 10.1093/gbe/evac119] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/11/2022] [Indexed: 12/02/2022] Open
Abstract
Advances in phylogenomics and high-throughput sequencing have allowed the reconstruction of deep phylogenetic relationships in the evolution of eukaryotes. Yet, the root of the eukaryotic tree of life remains elusive. The most popular hypothesis in textbooks and reviews is a root between Unikonta (Opisthokonta + Amoebozoa) and Bikonta (all other eukaryotes), which emerged from analyses of a single-gene fusion. Subsequent, highly cited studies based on concatenation of genes supported this hypothesis with some variations or proposed a root within Excavata. However, concatenation of genes does not consider phylogenetically-informative events like gene duplications and losses. A recent study using gene tree parsimony (GTP) suggested the root lies between Opisthokonta and all other eukaryotes, but only including 59 taxa and 20 genes. Here we use GTP with a duplication-loss model in a gene-rich and taxon-rich dataset (i.e., 2,786 gene families from two sets of 155 and 158 diverse eukaryotic lineages) to assess the root, and we iterate each analysis 100 times to quantify tree space uncertainty. We also contrasted our results and discarded alternative hypotheses from the literature using GTP and the likelihood-based method SpeciesRax. Our estimates suggest a root between Fungi or Opisthokonta and all other eukaryotes; but based on further analysis of genome size, we propose that the root between Opisthokonta and all other eukaryotes is the most likely.
Collapse
Affiliation(s)
- Mario A Cerón-Romero
- Department of Biological Sciences, Smith College, Northampton, Massachusetts, USA
- Program in Organismic and Evolutionary Biology, University of Massachusetts Amherst, Amherst, Massachusetts, USA
- Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana-Champaign, Illinois, USA
| | - Miguel M Fonseca
- CIIMAR - Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Porto, Portugal
| | - Leonardo de Oliveira Martins
- Department of Biochemistry, Genetics and Immunology, University of Vigo, 36310 Vigo, Spain
- Quadram Institute Bioscience, Norwich, United Kingdom
| | - David Posada
- Department of Biochemistry, Genetics and Immunology, University of Vigo, 36310 Vigo, Spain
- CINBIO, Universidade de Vigo, 36310 Vigo, Spain
- Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Vigo, Spain
| | - Laura A Katz
- Department of Biological Sciences, Smith College, Northampton, Massachusetts, USA
- Program in Organismic and Evolutionary Biology, University of Massachusetts Amherst, Amherst, Massachusetts, USA
| |
Collapse
|
12
|
Xiong H, Wang D, Shao C, Yang X, Yang J, Ma T, Davis CC, Liu L, Xi Z. Species Tree Estimation and the Impact of Gene Loss Following Whole-Genome Duplication. Syst Biol 2022; 71:1348-1361. [PMID: 35689633 PMCID: PMC9558847 DOI: 10.1093/sysbio/syac040] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2021] [Revised: 06/03/2022] [Accepted: 06/07/2022] [Indexed: 12/02/2022] Open
Abstract
Whole-genome duplication (WGD) occurs broadly and repeatedly across the history of eukaryotes and is recognized as a prominent evolutionary force, especially in plants. Immediately following WGD, most genes are present in two copies as paralogs. Due to this redundancy, one copy of a paralog pair commonly undergoes pseudogenization and is eventually lost. When speciation occurs shortly after WGD; however, differential loss of paralogs may lead to spurious phylogenetic inference resulting from the inclusion of pseudoorthologs–paralogous genes mistakenly identified as orthologs because they are present in single copies within each sampled species. The influence and impact of including pseudoorthologs versus true orthologs as a result of gene extinction (or incomplete laboratory sampling) are only recently gaining empirical attention in the phylogenomics community. Moreover, few studies have yet to investigate this phenomenon in an explicit coalescent framework. Here, using mathematical models, numerous simulated data sets, and two newly assembled empirical data sets, we assess the effect of pseudoorthologs on species tree estimation under varying degrees of incomplete lineage sorting (ILS) and differential gene loss scenarios following WGD. When gene loss occurs along the terminal branches of the species tree, alignment-based (BPP) and gene-tree-based (ASTRAL, MP-EST, and STAR) coalescent methods are adversely affected as the degree of ILS increases. This can be greatly improved by sampling a sufficiently large number of genes. Under the same circumstances, however, concatenation methods consistently estimate incorrect species trees as the number of genes increases. Additionally, pseudoorthologs can greatly mislead species tree inference when gene loss occurs along the internal branches of the species tree. Here, both coalescent and concatenation methods yield inconsistent results. These results underscore the importance of understanding the influence of pseudoorthologs in the phylogenomics era. [Coalescent method; concatenation method; incomplete lineage sorting; pseudoorthologs; single-copy gene; whole-genome duplication.]
Collapse
Affiliation(s)
- Haifeng Xiong
- Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610065, China
| | - Danying Wang
- Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610065, China
| | - Chen Shao
- Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610065, China
| | - Xuchen Yang
- Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610065, China
| | - Jialin Yang
- Department of Statistics and Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
| | - Tao Ma
- Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610065, China
| | - Charles C Davis
- Department of Organismic and Evolutionary Biology, Harvard University Herbaria, Cambridge, MA 02138, USA
| | - Liang Liu
- Department of Statistics and Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
| | - Zhenxiang Xi
- Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610065, China
| |
Collapse
|
13
|
Wawerka M, Dąbkowski D, Rutecka N, Mykowiecka A, Górecki P. Embedding gene trees into phylogenetic networks by conflict resolution algorithms. Algorithms Mol Biol 2022; 17:11. [PMID: 35590416 PMCID: PMC9119282 DOI: 10.1186/s13015-022-00218-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2021] [Accepted: 03/22/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Phylogenetic networks are mathematical models of evolutionary processes involving reticulate events such as hybridization, recombination, or horizontal gene transfer. One of the crucial notions in phylogenetic network modelling is displayed tree, which is obtained from a network by removing a set of reticulation edges. Displayed trees may represent an evolutionary history of a gene family if the evolution is shaped by reticulation events. RESULTS We address the problem of inferring an optimal tree displayed by a network, given a gene tree G and a tree-child network N, under the deep coalescence and duplication costs. We propose an O(mn)-time dynamic programming algorithm (DP) to compute a lower bound of the optimal displayed tree cost, where m and n are the sizes of G and N, respectively. In addition, our algorithm can verify whether the solution is exact. Moreover, it provides a set of reticulation edges corresponding to the obtained cost. If the cost is exact, the set induces an optimal displayed tree. Otherwise, the set contains pairs of conflicting edges, i.e., edges sharing a reticulation node. Next, we show a conflict resolution algorithm that requires [Formula: see text] invocations of DP in the worst case, where r is the number of reticulations. We propose a similar [Formula: see text]-time algorithm for level-k tree-child networks and a branch and bound solution to compute lower and upper bounds of optimal costs. We also extend the algorithms to a broader class of phylogenetic networks. Based on simulated data, the average runtime is [Formula: see text] under the deep-coalescence cost and [Formula: see text] under the duplication cost. CONCLUSIONS Despite exponential complexity in the worst case, our algorithms perform significantly well on empirical and simulated datasets, due to the strategy of resolving internal dissimilarities between gene trees and networks. Therefore, the algorithms are efficient alternatives to enumeration strategies commonly proposed in the literature and enable analyses of complex networks with dozens of reticulations.
Collapse
|
14
|
Willson J, Roddur MS, Liu B, Zaharias P, Warnow T. DISCO: Species Tree Inference using Multicopy Gene Family Tree Decomposition. Syst Biol 2022; 71:610-629. [PMID: 34450658 PMCID: PMC9016570 DOI: 10.1093/sysbio/syab070] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2021] [Revised: 08/18/2021] [Accepted: 08/23/2021] [Indexed: 11/21/2022] Open
Abstract
Species tree inference from gene family trees is a significant problem in computational biology. However, gene tree heterogeneity, which can be caused by several factors including gene duplication and loss, makes the estimation of species trees very challenging. While there have been several species tree estimation methods introduced in recent years to specifically address gene tree heterogeneity due to gene duplication and loss (such as DupTree, FastMulRFS, ASTRAL-Pro, and SpeciesRax), many incur high cost in terms of both running time and memory. We introduce a new approach, DISCO, that decomposes the multi-copy gene family trees into many single copy trees, which allows for methods previously designed for species tree inference in a single copy gene tree context to be used. We prove that using DISCO with ASTRAL (i.e., ASTRAL-DISCO) is statistically consistent under the GDL model, provided that ASTRAL-Pro correctly roots and tags each gene family tree. We evaluate DISCO paired with different methods for estimating species trees from single copy genes (e.g., ASTRAL, ASTRID, and IQ-TREE) under a wide range of model conditions, and establish that high accuracy can be obtained even when ASTRAL-Pro is not able to correctly roots and tags the gene family trees. We also compare results using MI, an alternative decomposition strategy from Yang Y. and Smith S.A. (2014), and find that DISCO provides better accuracy, most likely as a result of covering more of the gene family tree leafset in the output decomposition. [Concatenation analysis; gene duplication and loss; species tree inference; summary method.].
Collapse
Affiliation(s)
- James Willson
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Mrinmoy Saha Roddur
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Baqiao Liu
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Paul Zaharias
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| |
Collapse
|
15
|
Yan Z, Smith ML, Du P, Hahn MW, Nakhleh L. Species Tree Inference Methods Intended to Deal with Incomplete Lineage Sorting Are Robust to the Presence of Paralogs. Syst Biol 2022; 71:367-381. [PMID: 34245291 PMCID: PMC8978208 DOI: 10.1093/sysbio/syab056] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2020] [Revised: 06/23/2021] [Accepted: 06/30/2021] [Indexed: 11/24/2022] Open
Abstract
Many recent phylogenetic methods have focused on accurately inferring species trees when there is gene tree discordance due to incomplete lineage sorting (ILS). For almost all of these methods, and for phylogenetic methods in general, the data for each locus are assumed to consist of orthologous, single-copy sequences. Loci that are present in more than a single copy in any of the studied genomes are excluded from the data. These steps greatly reduce the number of loci available for analysis. The question we seek to answer in this study is: what happens if one runs such species tree inference methods on data where paralogy is present, in addition to or without ILS being present? Through simulation studies and analyses of two large biological data sets, we show that running such methods on data with paralogs can still provide accurate results. We use multiple different methods, some of which are based directly on the multispecies coalescent model, and some of which have been proven to be statistically consistent under it. We also treat the paralogous loci in multiple ways: from explicitly denoting them as paralogs, to randomly selecting one copy per species. In all cases, the inferred species trees are as accurate as equivalent analyses using single-copy orthologs. Our results have significant implications for the use of ILS-aware phylogenomic analyses, demonstrating that they do not have to be restricted to single-copy loci. This will greatly increase the amount of data that can be used for phylogenetic inference.[Gene duplication and loss; incomplete lineage sorting; multispecies coalescent; orthology; paralogy.].
Collapse
Affiliation(s)
- Zhi Yan
- Department of Computer Science, Rice University,
6100 Main Street, Houston, TX 77005, USA
| | - Megan L Smith
- Department of Biology and Department of Computer Science,
Indiana University, 1001 East Third Street, Bloomington,
IN 47405, USA
| | - Peng Du
- Department of Computer Science, Rice University,
6100 Main Street, Houston, TX 77005, USA
| | - Matthew W Hahn
- Department of Biology and Department of Computer Science,
Indiana University, 1001 East Third Street, Bloomington,
IN 47405, USA
| | - Luay Nakhleh
- Department of Computer Science, Rice University,
6100 Main Street, Houston, TX 77005, USA
- Department of BioSciences, Rice University, 6100
Main Street, Houston, TX 77005, USA
| |
Collapse
|
16
|
Morel B, Schade P, Lutteropp S, Williams TA, Szöllősi GJ, Stamatakis A. SpeciesRax: A Tool for Maximum Likelihood Species Tree Inference from Gene Family Trees under Duplication, Transfer, and Loss. Mol Biol Evol 2022; 39:msab365. [PMID: 35021210 PMCID: PMC8826479 DOI: 10.1093/molbev/msab365] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
Species tree inference from gene family trees is becoming increasingly popular because it can account for discordance between the species tree and the corresponding gene family trees. In particular, methods that can account for multiple-copy gene families exhibit potential to leverage paralogy as informative signal. At present, there does not exist any widely adopted inference method for this purpose. Here, we present SpeciesRax, the first maximum likelihood method that can infer a rooted species tree from a set of gene family trees and can account for gene duplication, loss, and transfer events. By explicitly modeling events by which gene trees can depart from the species tree, SpeciesRax leverages the phylogenetic rooting signal in gene trees. SpeciesRax infers species tree branch lengths in units of expected substitutions per site and branch support values via paralogy-aware quartets extracted from the gene family trees. Using both empirical and simulated data sets we show that SpeciesRax is at least as accurate as the best competing methods while being one order of magnitude faster on large data sets at the same time. We used SpeciesRax to infer a biologically plausible rooted phylogeny of the vertebrates comprising 188 species from 31,612 gene families in 1 h using 40 cores. SpeciesRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax and on BioConda.
Collapse
Affiliation(s)
- Benoit Morel
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | - Paul Schade
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | - Sarah Lutteropp
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
| | - Tom A Williams
- School of Biological Sciences, University of Bristol, Bristol, United Kingdom
| | - Gergely J Szöllősi
- ELTE-MTA “Lendület” Evolutionary Genomics Research Group, Budapest, Hungary
- Department of Biological Physics, Eötvös University, Budapest, Hungary
- Institute of Evolution, Centre for Ecological Research, Budapest, Hungary
| | - Alexandros Stamatakis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| |
Collapse
|
17
|
Morales-Briones DF, Gehrke B, Huang CH, Liston A, Ma H, Marx HE, Tank DC, Yang Y. Analysis of Paralogs in Target Enrichment Data Pinpoints Multiple Ancient Polyploidy Events in Alchemilla s.l. (Rosaceae). Syst Biol 2021; 71:190-207. [PMID: 33978764 PMCID: PMC8677558 DOI: 10.1093/sysbio/syab032] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2020] [Revised: 04/28/2021] [Accepted: 05/03/2021] [Indexed: 12/16/2022] Open
Abstract
Target enrichment is becoming increasingly popular for phylogenomic studies. Although baits for enrichment are typically designed to target single-copy genes, paralogs are often recovered with increased sequencing depth, sometimes from a significant proportion of loci, especially in groups experiencing whole-genome duplication (WGD) events. Common approaches for processing paralogs in target enrichment data sets include random selection, manual pruning, and mainly, the removal of entire genes that show any evidence of paralogy. These approaches are prone to errors in orthology inference or removing large numbers of genes. By removing entire genes, valuable information that could be used to detect and place WGD events is discarded. Here, we used an automated approach for orthology inference in a target enrichment data set of 68 species of Alchemilla s.l. (Rosaceae), a widely distributed clade of plants primarily from temperate climate regions. Previous molecular phylogenetic studies and chromosome numbers both suggested ancient WGDs in the group. However, both the phylogenetic location and putative parental lineages of these WGD events remain unknown. By taking paralogs into consideration and inferring orthologs from target enrichment data, we identified four nodes in the backbone of Alchemilla s.l. with an elevated proportion of gene duplication. Furthermore, using a gene-tree reconciliation approach, we established the autopolyploid origin of the entire Alchemilla s.l. and the nested allopolyploid origin of four major clades within the group. Here, we showed the utility of automated tree-based orthology inference methods, previously designed for genomic or transcriptomic data sets, to study complex scenarios of polyploidy and reticulate evolution from target enrichment data sets.[Alchemilla; allopolyploidy; autopolyploidy; gene tree discordance; orthology inference; paralogs; Rosaceae; target enrichment; whole genome duplication.].
Collapse
Affiliation(s)
- Diego F Morales-Briones
- Department of Plant and Microbial Biology, University of Minnesota-Twin Cities, 1445 Gortner Avenue, St. Paul, MN 55108, USA
- Department of Biological Sciences and Institute for Bioinformatics and Evolutionary Studies, University of Idaho, 875 Perimeter Drive MS 3051, Moscow, ID 83844, USA
| | - Berit Gehrke
- University Gardens, University Museum, University of Bergen, Mildeveien 240, 5259 Hjellestad, Norway
| | - Chien-Hsun Huang
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center of Genetics and Development, Ministry of Education Key Laboratory of Biodiversity and Ecological Engineering, Institute of Plant Biology, Center of Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai 200433, China
| | - Aaron Liston
- Department of Botany and Plant Pathology, Oregon State University, 2082 Cordley Hall, Corvallis, OR 97331, USA
| | - Hong Ma
- Department of Biology, the Huck Institute of the Life Sciences, the Pennsylvania State University, 510D Mueller Laboratory, University Park, PA 16802 USA
| | - Hannah E Marx
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109-1048, USA
- Museum of Southwestern Biology and Department of Biology, University of New Mexico, Albuquerque, NM 87131, USA
| | - David C Tank
- Department of Biological Sciences and Institute for Bioinformatics and Evolutionary Studies, University of Idaho, 875 Perimeter Drive MS 3051, Moscow, ID 83844, USA
| | - Ya Yang
- Department of Plant and Microbial Biology, University of Minnesota-Twin Cities, 1445 Gortner Avenue, St. Paul, MN 55108, USA
| |
Collapse
|
18
|
Mirarab S, Nakhleh L, Warnow T. Multispecies Coalescent: Theory and Applications in Phylogenetics. ANNUAL REVIEW OF ECOLOGY, EVOLUTION, AND SYSTEMATICS 2021. [DOI: 10.1146/annurev-ecolsys-012121-095340] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Species tree estimation is a basic part of many biological research projects, ranging from answering basic evolutionary questions (e.g., how did a group of species adapt to their environments?) to addressing questions in functional biology. Yet, species tree estimation is very challenging, due to processes such as incomplete lineage sorting, gene duplication and loss, horizontal gene transfer, and hybridization, which can make gene trees differ from each other and from the overall evolutionary history of the species. Over the last 10–20 years, there has been tremendous growth in methods and mathematical theory for estimating species trees and phylogenetic networks, and some of these methods are now in wide use. In this survey, we provide an overview of the current state of the art, identify the limitations of existing methods and theory, and propose additional research problems and directions.
Collapse
Affiliation(s)
- Siavash Mirarab
- Electrical and Computer Engineering Department, University of California, San Diego, La Jolla, California 92093, USA
| | - Luay Nakhleh
- Department of Computer Science, Rice University, Houston, Texas 77005, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois 61801, USA
| |
Collapse
|
19
|
Yu X, Le T, Christensen SA, Molloy EK, Warnow T. Using Robinson-Foulds supertrees in divide-and-conquer phylogeny estimation. Algorithms Mol Biol 2021; 16:12. [PMID: 34183037 PMCID: PMC8240396 DOI: 10.1186/s13015-021-00189-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/25/2021] [Accepted: 06/05/2021] [Indexed: 12/01/2022] Open
Abstract
One of the Grand Challenges in Science is the construction of the Tree of Life, an evolutionary tree containing several million species, spanning all life on earth. However, the construction of the Tree of Life is enormously computationally challenging, as all the current most accurate methods are either heuristics for NP-hard optimization problems or Bayesian MCMC methods that sample from tree space. One of the most promising approaches for improving scalability and accuracy for phylogeny estimation uses divide-and-conquer: a set of species is divided into overlapping subsets, trees are constructed on the subsets, and then merged together using a "supertree method". Here, we present Exact-RFS-2, the first polynomial-time algorithm to find an optimal supertree of two trees, using the Robinson-Foulds Supertree (RFS) criterion (a major approach in supertree estimation that is related to maximum likelihood supertrees), and we prove that finding the RFS of three input trees is NP-hard. Exact-RFS-2 is available in open source form on Github at https://github.com/yuxilin51/GreedyRFS .
Collapse
Affiliation(s)
| | - Thien Le
- Department of EECS, Massachusetts Institute of Technology, Cambridge, USA
| | - Sarah A. Christensen
- Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, USA
| | - Erin K. Molloy
- Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, USA
| | - Tandy Warnow
- Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, USA
| |
Collapse
|
20
|
Dibaeinia P, Tabe-Bordbar S, Warnow T. FASTRAL: Improving scalability of phylogenomic analysis. Bioinformatics 2021; 37:2317-2324. [PMID: 33576396 PMCID: PMC8388037 DOI: 10.1093/bioinformatics/btab093] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2020] [Revised: 02/02/2021] [Accepted: 02/04/2021] [Indexed: 01/22/2023] Open
Abstract
MOTIVATION ASTRAL is the current leading method for species tree estimation from phylogenomic datasets (i.e., hundreds to thousands of genes) that addresses gene tree discord resulting from incomplete lineage sorting (ILS). ASTRAL is statistically consistent under the multi-locus coalescent model (MSC), runs in polynomial time, and is able to run on large datasets. Key to ASTRAL's algorithm is the use of dynamic programming to find an optimal solution to the MQSST (maximum quartet support supertree) within a constraint space that it computes from the input. Yet, ASTRAL can fail to complete within reasonable timeframes on large datasets with many genes and species, because in these cases the constraint space it computes is too large. RESULTS Here we introduce FASTRAL, a phylogenomic estimation method. FASTRAL is based on ASTRAL, but uses a different technique for constructing the constraint space. The technique we use to define the constraint space maintains statistical consistency and is polynomial time; thus we prove that FASTRAL is a polynomial time algorithm that is statistically consistent under the MSC. Our performance study on both biological and simulated data sets demonstrates that FASTRAL matches or improves on ASTRAL with respect to species tree topology accuracy (and under high ILS conditions it is statistically significantly more accurate), while being dramatically faster-especially on datasets with large numbers of genes and high ILS-due to using a significantly smaller constraint space. AVAILABILITY FASTRAL is available in open-source form at https://github.com/PayamDiba/FASTRAL. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Payam Dibaeinia
- Department of Computer Science, University of Illinois, Urbana, IL 61801, USA
| | - Shayan Tabe-Bordbar
- Department of Computer Science, University of Illinois, Urbana, IL 61801, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois, Urbana, IL 61801, USA,To whom correspondence should be addressed.
| |
Collapse
|
21
|
New Approaches for Inferring Phylogenies in the Presence of Paralogs. Trends Genet 2021; 37:174-187. [DOI: 10.1016/j.tig.2020.08.012] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2020] [Revised: 08/13/2020] [Accepted: 08/19/2020] [Indexed: 12/18/2022]
|
22
|
Zhang C, Scornavacca C, Molloy EK, Mirarab S. ASTRAL-Pro: Quartet-Based Species-Tree Inference despite Paralogy. Mol Biol Evol 2020; 37:3292-3307. [PMID: 32886770 PMCID: PMC7751180 DOI: 10.1093/molbev/msaa139] [Citation(s) in RCA: 107] [Impact Index Per Article: 21.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/18/2022] Open
Abstract
Phylogenetic inference from genome-wide data (phylogenomics) has revolutionized the study of evolution because it enables accounting for discordance among evolutionary histories across the genome. To this end, summary methods have been developed to allow accurate and scalable inference of species trees from gene trees. However, most of these methods, including the widely used ASTRAL, can only handle single-copy gene trees and do not attempt to model gene duplication and gene loss. As a result, most phylogenomic studies have focused on single-copy genes and have discarded large parts of the data. Here, we first propose a measure of quartet similarity between single-copy and multicopy trees that accounts for orthology and paralogy. We then introduce a method called ASTRAL-Pro (ASTRAL for PaRalogs and Orthologs) to find the species tree that optimizes our quartet similarity measure using dynamic programing. By studying its performance on an extensive collection of simulated data sets and on real data sets, we show that ASTRAL-Pro is more accurate than alternative methods.
Collapse
Affiliation(s)
- Chao Zhang
- Bioinformatics and Systems Biology, University of California San Diego, San Diego, CA
| | | | - Erin K Molloy
- Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA
| |
Collapse
|