1
|
Cortada Garcia M, Diéguez Moscardó A, Casanellas M. Generating Heterogeneous Data on Gene Trees. J Comput Biol 2025. [PMID: 40340442 DOI: 10.1089/cmb.2024.0843] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/10/2025] Open
Abstract
We introduce GenPhylo, a Python module that simulates nucleotide sequence data along a phylogeny avoiding the restriction of continuous-time Markov processes. GenPhylo uses directly a general Markov model and therefore naturally incorporates heterogeneity across lineages. We solve the challenge of generating transition matrices with a pre-given expected number of substitutions (the branch length information) by providing an algorithm that can be incorporated in other simulation software.
Collapse
Affiliation(s)
- Martí Cortada Garcia
- Department of Mathematics, Universitat Politècnica de Catalunya, Barcelona, Spain
| | | | - Marta Casanellas
- Department of Mathematics, Universitat Politècnica de Catalunya, Barcelona, Spain
- Centre de Recerca Matemàtica, Bellaterra, Spain
| |
Collapse
|
2
|
Arasti S, Tabaghi P, Tabatabaee Y, Mirarab S. Branch Length Transforms using Optimal Tree Metric Matching. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2023.11.13.566962. [PMID: 38746464 PMCID: PMC11092445 DOI: 10.1101/2023.11.13.566962] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/16/2024]
Abstract
The abundant discordance between evolutionary relationships across the genome has rekindled interest in methods for comparing and averaging trees on a shared leaf set. However, most attempts at comparing and matching trees have focused on tree topology. Comparing branch lengths has been more elusive due to several challenges. Species tree branch lengths can be measured in various units, often different from gene trees. Moreover, rates of evolution change across the genome, the species tree, and specific branches of gene trees. These factors compound the stochasticity of coalescence times and estimation noise, making branch lengths highly heterogeneous across the genome. For many downstream applications in phylogenomic analyses, branch lengths are as important as the topology, and yet, existing tools to compare and combine weighted trees are limited. In this paper, we address the question of matching one tree to another, accounting for their branch lengths. We define a series of computational problems called Topology-Constrained Metric Matching (TCMM) that seek to transform the branch lengths of a query tree based on a reference tree. We show that TCMM problems can be solved in quadratic time and memory using a linear algebraic formulation coupled with dynamic programming preprocessing. While many applications can be imagined for this framework, we explore two applications in this paper: embedding leaves of gene trees in Euclidean space to find outliers potentially indicative of errors and summarizing gene tree branch lengths onto the species tree. In these applications, our method, when paired with existing methods, increases their accuracy at limited computational expense. Code and Data availability The software is available at https://github.com/shayesteh99/TCMM.git . Data is available on Github https://github.com/shayesteh99/TCMM-Data.git .
Collapse
|
3
|
Zhang C, Nielsen R, Mirarab S. CASTER: Direct species tree inference from whole-genome alignments. Science 2025; 387:eadk9688. [PMID: 39847611 PMCID: PMC12038793 DOI: 10.1126/science.adk9688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2023] [Revised: 08/05/2024] [Accepted: 12/04/2024] [Indexed: 01/25/2025]
Abstract
Genomes contain mosaics of discordant evolutionary histories, challenging the accurate inference of the tree of life. Although genome-wide data are routinely used for discordance-aware phylogenomic analyses, because of modeling and scalability limitations, the current practice leaves out large chunks of genomes. As more high-quality genomes become available, we urgently need discordance-aware methods to infer the tree directly from a multiple genome alignment. In this study, we introduce Coalescence-Aware Alignment-Based Species Tree Estimator (CASTER), a theoretically justified site-based method that eliminates the need to predefine recombination-free loci. CASTER is scalable to hundreds of mammalian whole genomes. We demonstrate the accuracy and scalability of CASTER in simulations that include recombination and apply CASTER to several biological datasets, showing that its per-site scores can reveal both biological and artifactual patterns of discordance across the genome.
Collapse
Affiliation(s)
- Chao Zhang
- Bioinformatics and Systems Biology, University of
California San Diego, 9500 Gilman Drive, La Jolla, 92093, CA, USA
- Integrative Biology Department, University of California
Berkeley, 110 Sproul Hall, Berkeley, 94704, CA, USA
- Globe Institute, University of Copenhagen, Øster
Voldgade 5-7, Copenhagen, 1350, Denmark
| | - Rasmus Nielsen
- Integrative Biology Department, University of California
Berkeley, 110 Sproul Hall, Berkeley, 94704, CA, USA
- Globe Institute, University of Copenhagen, Øster
Voldgade 5-7, Copenhagen, 1350, Denmark
| | - Siavash Mirarab
- Electrical and Computer Engineering, University of
California San Diego, 9500 Gilman Drive, La Jolla, 92093, CA, USA
| |
Collapse
|
4
|
Tabatabaee Y, Zhang C, Arasti S, Mirarab S. Species tree branch length estimation despite incomplete lineage sorting, duplication, and loss. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.02.20.639320. [PMID: 40027742 PMCID: PMC11870528 DOI: 10.1101/2025.02.20.639320] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/05/2025]
Abstract
Phylogenetic branch lengths are essential for many analyses, such as estimating divergence times, analyzing rate changes, and studying adaptation. However, true gene tree heterogeneity due to incomplete lineage sorting (ILS), gene duplication and loss (GDL), and horizontal gene transfer (HGT) can complicate the estimation of species tree branch lengths. While several tools exist for estimating the topology of a species tree addressing various causes of gene tree discordance, much less attention has been paid to branch length estimation on multi-locus datasets. For single-copy gene trees, some methods are available that summarize gene tree branch lengths onto a species tree, including coalescent-based methods that account for heterogeneity due to ILS. However, no such branch length estimation method exists for multi-copy gene family trees that have evolved with gene duplication and loss. To address this gap, we introduce the CASTLES-Pro algorithm for estimating species tree branch lengths while accounting for both GDL and ILS. CASTLES-Pro improves on the existing coalescent-based branch length estimation method CASTLES by increasing its accuracy for single-copy gene trees and extends it to handle multi-copy ones. Our simulation studies show that CASTLES-Pro is generally more accurate than alternatives, eliminating the systematic bias toward overestimating terminal branch lengths often observed when using concatenation. Moreover, while not theoretically designed for HGT, we show that CASTLES-Pro maintains relatively high accuracy under high rates of random HGT. Code availability CASTLES-Pro is implemented inside the software package ASTER, available at https://github.com/chaoszhang/ASTER . Data availability The datasets and scripts used in this study are available at https://github.com/ytabatabaee/CASTLES-Pro-paper .
Collapse
|
5
|
Habib M, Roy K, Hasan S, Rahman AH, Bayzid MS. Terraces in species tree inference from gene trees. BMC Ecol Evol 2024; 24:135. [PMID: 39497030 PMCID: PMC11533290 DOI: 10.1186/s12862-024-02309-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/12/2023] [Accepted: 09/16/2024] [Indexed: 11/06/2024] Open
Abstract
A terrace in a phylogenetic tree space is a region where all trees contain the same set of subtrees, due to certain patterns of missing data among the taxa sampled, resulting in an identical optimality score for a given data set. This was first investigated in the context of phylogenetic tree estimation from sequence alignments using maximum likelihood (ML) and maximum parsimony (MP). It was later extended to the species tree inference problem from a collection of gene trees, where a set of equally optimal species trees was referred to as a "pseudo" species tree terrace which does not consider the topological proximity of the trees in terms of the induced subtrees resulting from certain patterns of missing data. In this study, we mathematically characterize species tree terraces and investigate the mathematical properties and conditions that lead multiple species trees to induce/display an identical set of locus-specific subtrees owing to missing data. We report that species tree terraces are agnostic to gene tree heterogeneity. Therefore, we introduce and characterize a special type of gene tree topology-aware terrace which we call "peak terrace". Moreover, we empirically investigated various challenges and opportunities related to species tree terraces through extensive empirical studies using simulated and real biological data. We demonstrate the prevalence of species tree terraces and the resulting ambiguity created for tree search algorithms. Remarkably, our findings indicate that the identification of terraces could potentially lead to advances that enhance the accuracy of summary methods and provide reasonably accurate branch support.
Collapse
Affiliation(s)
- Mursalin Habib
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
| | - Kowshic Roy
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
| | - Saem Hasan
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
| | - Atif Hasan Rahman
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
| | - Md Shamsuzzoha Bayzid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh.
| |
Collapse
|
6
|
Leal JL, Milesi P, Hodková E, Zhou Q, James J, Eklund DM, Pyhäjärvi T, Salojärvi J, Lascoux M. Complex Polyploids: Origins, Genomic Composition, and Role of Introgressed Alleles. Syst Biol 2024; 73:392-418. [PMID: 38613229 PMCID: PMC11282369 DOI: 10.1093/sysbio/syae012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2023] [Revised: 12/18/2023] [Accepted: 03/28/2024] [Indexed: 04/14/2024] Open
Abstract
Introgression allows polyploid species to acquire new genomic content from diploid progenitors or from other unrelated diploid or polyploid lineages, contributing to genetic diversity and facilitating adaptive allele discovery. In some cases, high levels of introgression elicit the replacement of large numbers of alleles inherited from the polyploid's ancestral species, profoundly reshaping the polyploid's genomic composition. In such complex polyploids, it is often difficult to determine which taxa were the progenitor species and which taxa provided additional introgressive blocks through subsequent hybridization. Here, we use population-level genomic data to reconstruct the phylogenetic history of Betula pubescens (downy birch), a tetraploid species often assumed to be of allopolyploid origin and which is known to hybridize with at least four other birch species. This was achieved by modeling polyploidization and introgression events under the multispecies coalescent and then using an approximate Bayesian computation rejection algorithm to evaluate and compare competing polyploidization models. We provide evidence that B. pubescens is the outcome of an autoploid genome doubling event in the common ancestor of B. pendula and its extant sister species, B. platyphylla, that took place approximately 178,000-188,000 generations ago. Extensive hybridization with B. pendula, B. nana, and B. humilis followed in the aftermath of autopolyploidization, with the relative contribution of each of these species to the B. pubescens genome varying markedly across the species' range. Functional analysis of B. pubescens loci containing alleles introgressed from B. nana identified multiple genes involved in climate adaptation, while loci containing alleles derived from B. humilis revealed several genes involved in the regulation of meiotic stability and pollen viability in plant species.
Collapse
Affiliation(s)
- J Luis Leal
- Plant Ecology and Evolution, Department of Ecology and Genetics, Uppsala University, Norbyvägen 18D, 75236 Uppsala, Sweden
| | - Pascal Milesi
- Plant Ecology and Evolution, Department of Ecology and Genetics, Uppsala University, Norbyvägen 18D, 75236 Uppsala, Sweden
- Science for Life Laboratory (SciLifeLab), Uppsala University, 75237 Uppsala, Sweden
| | - Eva Hodková
- Plant Ecology and Evolution, Department of Ecology and Genetics, Uppsala University, Norbyvägen 18D, 75236 Uppsala, Sweden
- Faculty of Environmental Sciences, Czech University of Life Sciences Prague, Kamýcká 129, 16521 Prague, Czech Republic
| | - Qiujie Zhou
- Plant Ecology and Evolution, Department of Ecology and Genetics, Uppsala University, Norbyvägen 18D, 75236 Uppsala, Sweden
| | - Jennifer James
- Plant Ecology and Evolution, Department of Ecology and Genetics, Uppsala University, Norbyvägen 18D, 75236 Uppsala, Sweden
| | - D Magnus Eklund
- Physiology and Environmental Toxicology, Department of Organismal Biology, Uppsala University, Norbyvägen 18A, 75236 Uppsala, Sweden
| | - Tanja Pyhäjärvi
- Organismal and Evolutionary Biology Research Program, Faculty of Biological and Environmental Sciences, and Viikki Plant Science Centre, University of Helsinki, P.O. Box 65 (Viikinkaari 1), 00014 Helsinki, Finland
- Department of Forest Sciences, University of Helsinki, 00014 Helsinki, Finland
| | - Jarkko Salojärvi
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Organismal and Evolutionary Biology Research Program, Faculty of Biological and Environmental Sciences, and Viikki Plant Science Centre, University of Helsinki, P.O. Box 65 (Viikinkaari 1), 00014 Helsinki, Finland
| | - Martin Lascoux
- Plant Ecology and Evolution, Department of Ecology and Genetics, Uppsala University, Norbyvägen 18D, 75236 Uppsala, Sweden
- Science for Life Laboratory (SciLifeLab), Uppsala University, 75237 Uppsala, Sweden
| |
Collapse
|
7
|
Li Q, Chan YB, Galtier N, Scornavacca C. The Effect of Copy Number Hemiplasy on Gene Family Evolution. Syst Biol 2024; 73:355-374. [PMID: 38330161 DOI: 10.1093/sysbio/syae007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2022] [Revised: 01/24/2024] [Accepted: 02/03/2024] [Indexed: 02/10/2024] Open
Abstract
The evolution of gene families is complex, involving gene-level evolutionary events such as gene duplication, horizontal gene transfer, and gene loss, and other processes such as incomplete lineage sorting (ILS). Because of this, topological differences often exist between gene trees and species trees. A number of models have been recently developed to explain these discrepancies, the most realistic of which attempts to consider both gene-level events and ILS. When unified in a single model, the interaction between ILS and gene-level events can cause polymorphism in gene copy number, which we refer to as copy number hemiplasy (CNH). In this paper, we extend the Wright-Fisher process to include duplications and losses over several species, and show that the probability of CNH for this process can be significant. We study how well two unified models-multilocus multispecies coalescent (MLMSC), which models CNH, and duplication, loss, and coalescence (DLCoal), which does not-approximate the Wright-Fisher process with duplication and loss. We then study the effect of CNH on gene family evolution by comparing MLMSC and DLCoal. We generate comparable gene trees under both models, showing significant differences in various summary statistics; most importantly, CNH reduces the number of gene copies greatly. If this is not taken into account, the traditional method of estimating duplication rates (by counting the number of gene copies) becomes inaccurate. The simulated gene trees are also used for species tree inference with the summary methods ASTRAL and ASTRAL-Pro, demonstrating that their accuracy, based on CNH-unaware simulations calibrated on real data, may have been overestimated.
Collapse
Affiliation(s)
- Qiuyi Li
- School of Mathematics and Statistics/Melbourne Integrative Genomics, The University of Melbourne, Melbourne 3010, Australia
- Alibaba Cloud, Hangzhou, China
| | - Yao-Ban Chan
- School of Mathematics and Statistics/Melbourne Integrative Genomics, The University of Melbourne, Melbourne 3010, Australia
| | - Nicolas Galtier
- Institut des Sciences de lEvolution, Université Montpellier, CNRS, IRD, EPHE, Montpellier 34095, France
| | - Celine Scornavacca
- Institut des Sciences de l'Evolution, Université Montpellier, CNRS, IRD, EPHE, Montpellier 34095, France
| |
Collapse
|
8
|
Jiang Y, McDonald D, Perry D, Knight R, Mirarab S. Scaling DEPP phylogenetic placement to ultra-large reference trees: a tree-aware ensemble approach. Bioinformatics 2024; 40:btae361. [PMID: 38870525 PMCID: PMC11193062 DOI: 10.1093/bioinformatics/btae361] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2023] [Revised: 04/09/2024] [Accepted: 06/12/2024] [Indexed: 06/15/2024] Open
Abstract
MOTIVATION Phylogenetic placement of a query sequence on a backbone tree is increasingly used across biomedical sciences to identify the content of a sample from its DNA content. The accuracy of such analyses depends on the density of the backbone tree, making it crucial that placement methods scale to very large trees. Moreover, a new paradigm has been recently proposed to place sequences on the species tree using single-gene data. The goal is to better characterize the samples and to enable combined analyses of marker-gene (e.g., 16S rRNA gene amplicon) and genome-wide data. The recent method DEPP enables performing such analyses using metric learning. However, metric learning is hampered by a need to compute and save a quadratically growing matrix of pairwise distances during training. Thus, the training phase of DEPP does not scale to more than roughly 10 000 backbone species, a problem that we faced when trying to use our recently released Greengenes2 (GG2) reference tree containing 331 270 species. RESULTS This paper explores divide-and-conquer for training ensembles of DEPP models, culminating in a method called C-DEPP. While divide-and-conquer has been extensively used in phylogenetics, applying divide-and-conquer to data-hungry machine-learning methods needs nuance. C-DEPP uses carefully crafted techniques to enable quasi-linear scaling while maintaining accuracy. C-DEPP enables placing 20 million 16S fragments on the GG2 reference tree in 41 h of computation. AVAILABILITY AND IMPLEMENTATION The dataset and C-DEPP software are freely available at https://github.com/yueyujiang/dataset_cdepp/.
Collapse
Affiliation(s)
- Yueyu Jiang
- Electrical and Computer Engineering Department, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, United States
| | - Daniel McDonald
- Pediatrics Department, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, United States
| | - Daniela Perry
- Pediatrics Department, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, United States
| | - Rob Knight
- Pediatrics Department, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, United States
- Center for Microbiome Innovation, Jacobs School of Engineering, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, United States
| | - Siavash Mirarab
- Electrical and Computer Engineering Department, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, United States
- Center for Microbiome Innovation, Jacobs School of Engineering, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, United States
| |
Collapse
|
9
|
Rick JA, Brock CD, Lewanski AL, Golcher-Benavides J, Wagner CE. Reference Genome Choice and Filtering Thresholds Jointly Influence Phylogenomic Analyses. Syst Biol 2024; 73:76-101. [PMID: 37881861 DOI: 10.1093/sysbio/syad065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2022] [Revised: 09/20/2023] [Accepted: 10/20/2023] [Indexed: 10/27/2023] Open
Abstract
Molecular phylogenies are a cornerstone of modern comparative biology and are commonly employed to investigate a range of biological phenomena, such as diversification rates, patterns in trait evolution, biogeography, and community assembly. Recent work has demonstrated that significant biases may be introduced into downstream phylogenetic analyses from processing genomic data; however, it remains unclear whether there are interactions among bioinformatic parameters or biases introduced through the choice of reference genome for sequence alignment and variant calling. We address these knowledge gaps by employing a combination of simulated and empirical data sets to investigate the extent to which the choice of reference genome in upstream bioinformatic processing of genomic data influences phylogenetic inference, as well as the way that reference genome choice interacts with bioinformatic filtering choices and phylogenetic inference method. We demonstrate that more stringent minor allele filters bias inferred trees away from the true species tree topology, and that these biased trees tend to be more imbalanced and have a higher center of gravity than the true trees. We find the greatest topological accuracy when filtering sites for minor allele count (MAC) >3-4 in our 51-taxa data sets, while tree center of gravity was closest to the true value when filtering for sites with MAC >1-2. In contrast, filtering for missing data increased accuracy in the inferred topologies; however, this effect was small in comparison to the effect of minor allele filters and may be undesirable due to a subsequent mutation spectrum distortion. The bias introduced by these filters differs based on the reference genome used in short read alignment, providing further support that choosing a reference genome for alignment is an important bioinformatic decision with implications for downstream analyses. These results demonstrate that attributes of the study system and dataset (and their interaction) add important nuance for how best to assemble and filter short-read genomic data for phylogenetic inference.
Collapse
Affiliation(s)
- Jessica A Rick
- School of Natural Resources & the Environment, University of Arizona, Tucson, AZ 85719, USA
| | - Chad D Brock
- Department of Biological Sciences, Tarleton State University, Stephenville, TX 76401, USA
| | - Alexander L Lewanski
- Department of Integrative Biology and W.K. Kellogg Biological Station, Michigan State University, East Lansing, MI 48824, USA
| | - Jimena Golcher-Benavides
- Department of Natural Resource Ecology and Management, Iowa State University, Ames, IA 50011, USA
| | - Catherine E Wagner
- Program in Ecology and Evolution, University of Wyoming, Laramie, WY 82071, USA
- Department of Botany, University of Wyoming, Laramie, WY 82071, USA
| |
Collapse
|
10
|
Balaban M, Jiang Y, Zhu Q, McDonald D, Knight R, Mirarab S. Generation of accurate, expandable phylogenomic trees with uDance. Nat Biotechnol 2024; 42:768-777. [PMID: 37500914 PMCID: PMC10818028 DOI: 10.1038/s41587-023-01868-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2022] [Accepted: 06/20/2023] [Indexed: 07/29/2023]
Abstract
Phylogenetic trees provide a framework for organizing evolutionary histories across the tree of life and aid downstream comparative analyses such as metagenomic identification. Methods that rely on single-marker genes such as 16S rRNA have produced trees of limited accuracy with hundreds of thousands of organisms, whereas methods that use genome-wide data are not scalable to large numbers of genomes. We introduce updating trees using divide-and-conquer (uDance), a method that enables updatable genome-wide inference using a divide-and-conquer strategy that refines different parts of the tree independently and can build off of existing trees, with high accuracy and scalability. With uDance, we infer a species tree of roughly 200,000 genomes using 387 marker genes, totaling 42.5 billion amino acid residues.
Collapse
Affiliation(s)
- Metin Balaban
- Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, USA
| | - Yueyu Jiang
- Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, CA, USA
| | - Qiyun Zhu
- Biodesign Center for Fundamental and Applied Microbiomics, Arizona State University, Tempe, AZ, USA
- School of Life Sciences, Arizona State University, Tempe, AZ, USA
| | - Daniel McDonald
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
| | - Rob Knight
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Department of Computer Science and Engineering, Jacobs School of Engineering, University of California San Diego, La Jolla, CA, USA
- Department of Bioengineering, University of California San Diego, La Jolla, CA, USA
- Center for Microbiome Innovation, Jacobs School of Engineering, University of California San Diego, La Jolla, CA, USA
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, CA, USA.
- Department of Computer Science and Engineering, Jacobs School of Engineering, University of California San Diego, La Jolla, CA, USA.
- Center for Microbiome Innovation, Jacobs School of Engineering, University of California San Diego, La Jolla, CA, USA.
| |
Collapse
|
11
|
Arasti S, Mirarab S. Median quartet tree search algorithms using optimal subtree prune and regraft. Algorithms Mol Biol 2024; 19:12. [PMID: 38481327 PMCID: PMC10938725 DOI: 10.1186/s13015-024-00257-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2023] [Accepted: 02/13/2024] [Indexed: 03/17/2024] Open
Abstract
Gene trees can be different from the species tree due to biological processes and inference errors. One way to obtain a species tree is to find one that maximizes some measure of similarity to a set of gene trees. The number of shared quartets between a potential species tree and gene trees provides a statistically justifiable score; if maximized properly, it could result in a statistically consistent estimator of the species tree under several statistical models of discordance. However, finding the median quartet score tree, one that maximizes this score, is NP-Hard, motivating several existing heuristic algorithms. These heuristics do not follow the hill-climbing paradigm used extensively in phylogenetics. In this paper, we make theoretical contributions that enable an efficient hill-climbing approach. Specifically, we show that a subtree of size m can be placed optimally on a tree of size n in quasi-linear time with respect to n and (almost) independently of m. This result enables us to perform subtree prune and regraft (SPR) rearrangements as part of a hill-climbing search. We show that this approach can slightly improve upon the results of widely-used methods such as ASTRAL in terms of the optimization score but not necessarily accuracy.
Collapse
Affiliation(s)
- Shayesteh Arasti
- Computer Science and Engineering Department, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA
| | - Siavash Mirarab
- Electrical and Computer Engineering Department, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA.
| |
Collapse
|
12
|
Górecki P, Rutecka N, Mykowiecka A, Paszek J. Unifying duplication episode clustering and gene-species mapping inference. Algorithms Mol Biol 2024; 19:7. [PMID: 38355611 PMCID: PMC10865717 DOI: 10.1186/s13015-024-00252-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2023] [Accepted: 01/04/2024] [Indexed: 02/16/2024] Open
Abstract
We present a novel problem, called MetaEC, which aims to infer gene-species assignments in a collection of partially leaf-labeled gene trees labels by minimizing the size of duplication episode clustering (EC). This problem is particularly relevant in metagenomics, where incomplete data often poses a challenge in the accurate reconstruction of gene histories. To solve MetaEC, we propose a polynomial time dynamic programming (DP) formulation that verifies the existence of a set of duplication episodes from a predefined set of episode candidates. In addition, we design a method to infer distributions of gene-species mappings. We then demonstrate how to use DP to design an algorithm that solves MetaEC. Although the algorithm is exponential in the worst case, we introduce a heuristic modification of the algorithm that provides a solution with the knowledge that it is exact. To evaluate our method, we perform two computational experiments on simulated and empirical data containing whole genome duplication events, showing that our algorithm is able to accurately infer the corresponding events.
Collapse
Affiliation(s)
- Paweł Górecki
- Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Banacha 2, Warsaw, 02-097, Poland.
| | - Natalia Rutecka
- Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Banacha 2, Warsaw, 02-097, Poland
| | - Agnieszka Mykowiecka
- Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Banacha 2, Warsaw, 02-097, Poland
| | - Jarosław Paszek
- Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Banacha 2, Warsaw, 02-097, Poland
| |
Collapse
|
13
|
Dai J, Rubel T, Han Y, Molloy EK. Dollo-CDP: a polynomial-time algorithm for the clade-constrained large Dollo parsimony problem. Algorithms Mol Biol 2024; 19:2. [PMID: 38191515 PMCID: PMC10775561 DOI: 10.1186/s13015-023-00249-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2023] [Accepted: 12/10/2023] [Indexed: 01/10/2024] Open
Abstract
The last decade of phylogenetics has seen the development of many methods that leverage constraints plus dynamic programming. The goal of this algorithmic technique is to produce a phylogeny that is optimal with respect to some objective function and that lies within a constrained version of tree space. The popular species tree estimation method ASTRAL, for example, returns a tree that (1) maximizes the quartet score computed with respect to the input gene trees and that (2) draws its branches (bipartitions) from the input constraint set. This technique has yet to be used for parsimony problems where the input are binary characters, sometimes with missing values. Here, we introduce the clade-constrained character parsimony problem and present an algorithm that solves this problem for the Dollo criterion score in [Formula: see text] time, where n is the number of leaves, k is the number of characters, and [Formula: see text] is the set of clades used as constraints. Dollo parsimony, which requires traits/mutations to be gained at most once but allows them to be lost any number of times, is widely used for tumor phylogenetics as well as species phylogenetics, for example analyses of low-homoplasy retroelement insertions across the vertebrate tree of life. This motivated us to implement our algorithm in a software package, called Dollo-CDP, and evaluate its utility for analyzing retroelement insertion presence / absence patterns for bats, birds, toothed whales as well as simulated data. Our results show that Dollo-CDP can improve upon heuristic search from a single starting tree, often recovering a better scoring tree. Moreover, Dollo-CDP scales to data sets with much larger numbers of taxa than branch-and-bound while still having an optimality guarantee, albeit a more restricted one. Lastly, we show that our algorithm for Dollo parsimony can easily be adapted to Camin-Sokal parsimony but not Fitch parsimony.
Collapse
Affiliation(s)
- Junyan Dai
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Tobias Rubel
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Yunheng Han
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Erin K Molloy
- Department of Computer Science, University of Maryland, College Park, MD, USA.
- University of Maryland Institute for Advanced Computer Studies, College Park, MD, USA.
| |
Collapse
|
14
|
Comte A, Tricou T, Tannier E, Joseph J, Siberchicot A, Penel S, Allio R, Delsuc F, Dray S, de Vienne DM. PhylteR: Efficient Identification of Outlier Sequences in Phylogenomic Datasets. Mol Biol Evol 2023; 40:msad234. [PMID: 37879113 PMCID: PMC10655845 DOI: 10.1093/molbev/msad234] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2023] [Revised: 09/29/2023] [Accepted: 10/18/2023] [Indexed: 10/27/2023] Open
Abstract
In phylogenomics, incongruences between gene trees, resulting from both artifactual and biological reasons, can decrease the signal-to-noise ratio and complicate species tree inference. The amount of data handled today in classical phylogenomic analyses precludes manual error detection and removal. However, a simple and efficient way to automate the identification of outliers from a collection of gene trees is still missing. Here, we present PhylteR, a method that allows rapid and accurate detection of outlier sequences in phylogenomic datasets, i.e. species from individual gene trees that do not follow the general trend. PhylteR relies on DISTATIS, an extension of multidimensional scaling to 3 dimensions to compare multiple distance matrices at once. In PhylteR, these distance matrices extracted from individual gene phylogenies represent evolutionary distances between species according to each gene. On simulated datasets, we show that PhylteR identifies outliers with more sensitivity and precision than a comparable existing method. We also show that PhylteR is not sensitive to ILS-induced incongruences, which is a desirable feature. On a biological dataset of 14,463 genes for 53 species previously assembled for Carnivora phylogenomics, we show (i) that PhylteR identifies as outliers sequences that can be considered as such by other means, and (ii) that the removal of these sequences improves the concordance between the gene trees and the species tree. Thanks to the generation of numerous graphical outputs, PhylteR also allows for the rapid and easy visual characterization of the dataset at hand, thus aiding in the precise identification of errors. PhylteR is distributed as an R package on CRAN and as containerized versions (docker and singularity).
Collapse
Affiliation(s)
- Aurore Comte
- French Institute of Bioinformatics (IFB)—South Green Bioinformatics Platform, Bioversity, CIRAD, INRAE, IRD, Montpellier, France
- IRD, CIRAD, INRAE, Institut Agro, PHIM Plant Health Institute, Montpellier University, Montpellier, France
| | - Théo Tricou
- Université de Lyon, Université Lyon 1, UMR CNRS 5558 Laboratoire de Biométrie et Biologie Évolutive, Villeurbanne, France
| | - Eric Tannier
- Université de Lyon, Université Lyon 1, UMR CNRS 5558 Laboratoire de Biométrie et Biologie Évolutive, Villeurbanne, France
- Centre de Recherches Inria de Lyon, Villeurbanne, France
| | - Julien Joseph
- Université de Lyon, Université Lyon 1, UMR CNRS 5558 Laboratoire de Biométrie et Biologie Évolutive, Villeurbanne, France
| | - Aurélie Siberchicot
- Université de Lyon, Université Lyon 1, UMR CNRS 5558 Laboratoire de Biométrie et Biologie Évolutive, Villeurbanne, France
| | - Simon Penel
- Université de Lyon, Université Lyon 1, UMR CNRS 5558 Laboratoire de Biométrie et Biologie Évolutive, Villeurbanne, France
| | - Rémi Allio
- CBGP, INRAE, CIRAD, IRD, Montpellier SupAgro, Univ. Montpellier, Montpellier, France
| | | | - Stéphane Dray
- Université de Lyon, Université Lyon 1, UMR CNRS 5558 Laboratoire de Biométrie et Biologie Évolutive, Villeurbanne, France
| | - Damien M de Vienne
- Université de Lyon, Université Lyon 1, UMR CNRS 5558 Laboratoire de Biométrie et Biologie Évolutive, Villeurbanne, France
| |
Collapse
|
15
|
Tabatabaee Y, Roch S, Warnow T. QR-STAR: A Polynomial-Time Statistically Consistent Method for Rooting Species Trees Under the Coalescent. J Comput Biol 2023; 30:1146-1181. [PMID: 37902986 DOI: 10.1089/cmb.2023.0185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/01/2023] Open
Abstract
We address the problem of rooting an unrooted species tree given a set of unrooted gene trees, under the assumption that gene trees evolve within the model species tree under the multispecies coalescent (MSC) model. Quintet Rooting (QR) is a polynomial time algorithm that was recently proposed for this problem, which is based on the theory developed by Allman, Degnan, and Rhodes that proves the identifiability of rooted 5-taxon trees from unrooted gene trees under the MSC. However, although QR had good accuracy in simulations, its statistical consistency was left as an open problem. We present QR-STAR, a variant of QR with an additional step and a different cost function, and prove that it is statistically consistent under the MSC. Moreover, we derive sample complexity bounds for QR-STAR and show that a particular variant of it based on "short quintets" has polynomial sample complexity. Finally, our simulation study under a variety of model conditions shows that QR-STAR matches or improves on the accuracy of QR. QR-STAR is available in open-source form on github.
Collapse
Affiliation(s)
- Yasamin Tabatabaee
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
| | - Sebastien Roch
- Department of Mathematics, University of Wisconsin-Madison, Madison, Wisconsin, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
| |
Collapse
|
16
|
Fogg J, Allman ES, Ané C. PhyloCoalSimulations: A Simulator for Network Multispecies Coalescent Models, Including a New Extension for the Inheritance of Gene Flow. Syst Biol 2023; 72:1171-1179. [PMID: 37254872 DOI: 10.1093/sysbio/syad030] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2023] [Revised: 05/03/2023] [Accepted: 05/15/2023] [Indexed: 06/01/2023] Open
Abstract
We consider the evolution of phylogenetic gene trees along phylogenetic species networks, according to the network multispecies coalescent process, and introduce a new network coalescent model with correlated inheritance of gene flow. This model generalizes two traditional versions of the network coalescent: with independent or common inheritance. At each reticulation, multiple lineages of a given locus are inherited from parental populations chosen at random, either independently across lineages or with positive correlation according to a Dirichlet process. This process may account for locus-specific probabilities of inheritance, for example. We implemented the simulation of gene trees under these network coalescent models in the Julia package PhyloCoalSimulations, which depends on PhyloNetworks and its powerful network manipulation tools. Input species phylogenies can be read in extended Newick format, either in numbers of generations or in coalescent units. Simulated gene trees can be written in Newick format, and in a way that preserves information about their embedding within the species network. This embedding can be used for downstream purposes, such as to simulate species-specific processes like rate variation across species, or for other scenarios as illustrated in this note. This package should be useful for simulation studies and simulation-based inference methods. The software is available open source with documentation and a tutorial at https://github.com/cecileane/PhyloCoalSimulations.jl.
Collapse
Affiliation(s)
- John Fogg
- Department of Statistics, University of Wisconsin - Madison, WI, 53706, USA
| | - Elizabeth S Allman
- Department of Mathematics and Statistics, University of Alaska - Fairbanks, AK, 99775, USA
| | - Cécile Ané
- Department of Statistics, University of Wisconsin - Madison, WI, 53706, USA
- Department of Botany, University of Wisconsin - Madison, WI, 53706, USA
| |
Collapse
|
17
|
Tabatabaee Y, Zhang C, Warnow T, Mirarab S. Phylogenomic branch length estimation using quartets. Bioinformatics 2023; 39:i185-i193. [PMID: 37387151 PMCID: PMC10311336 DOI: 10.1093/bioinformatics/btad221] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION Branch lengths and topology of a species tree are essential in most downstream analyses, including estimation of diversification dates, characterization of selection, understanding adaptation, and comparative genomics. Modern phylogenomic analyses often use methods that account for the heterogeneity of evolutionary histories across the genome due to processes such as incomplete lineage sorting. However, these methods typically do not generate branch lengths in units that are usable by downstream applications, forcing phylogenomic analyses to resort to alternative shortcuts such as estimating branch lengths by concatenating gene alignments into a supermatrix. Yet, concatenation and other available approaches for estimating branch lengths fail to address heterogeneity across the genome. RESULTS In this article, we derive expected values of gene tree branch lengths in substitution units under an extension of the multispecies coalescent (MSC) model that allows substitutions with varying rates across the species tree. We present CASTLES, a new technique for estimating branch lengths on the species tree from estimated gene trees that uses these expected values, and our study shows that CASTLES improves on the most accurate prior methods with respect to both speed and accuracy. AVAILABILITY AND IMPLEMENTATION CASTLES is available at https://github.com/ytabatabaee/CASTLES.
Collapse
Affiliation(s)
- Yasamin Tabatabaee
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, United States
| | - Chao Zhang
- Department of Integrative Biology, University of California at Berkeley, Berkeley, CA 94720, United States
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, United States
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA 92093, United States
| |
Collapse
|
18
|
Leal JL, Milesi P, Salojärvi J, Lascoux M. Phylogenetic Analysis of Allotetraploid Species Using Polarized Genomic Sequences. Syst Biol 2023; 72:372-390. [PMID: 36932679 PMCID: PMC10275558 DOI: 10.1093/sysbio/syad009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2022] [Revised: 10/14/2022] [Accepted: 03/10/2023] [Indexed: 03/19/2023] Open
Abstract
Phylogenetic analysis of polyploid hybrid species has long posed a formidable challenge as it requires the ability to distinguish between alleles of different ancestral origins in order to disentangle their individual evolutionary history. This problem has been previously addressed by conceiving phylogenies as reticulate networks, using a two-step phasing strategy that first identifies and segregates homoeologous loci and then, during a second phasing step, assigns each gene copy to one of the subgenomes of an allopolyploid species. Here, we propose an alternative approach, one that preserves the core idea behind phasing-to produce separate nucleotide sequences that capture the reticulate evolutionary history of a polyploid-while vastly simplifying its implementation by reducing a complex multistage procedure to a single phasing step. While most current methods used for phylogenetic reconstruction of polyploid species require sequencing reads to be pre-phased using experimental or computational methods-usually an expensive, complex, and/or time-consuming endeavor-phasing executed using our algorithm is performed directly on the multiple-sequence alignment (MSA), a key change that allows for the simultaneous segregation and sorting of gene copies. We introduce the concept of genomic polarization that, when applied to an allopolyploid species, produces nucleotide sequences that capture the fraction of a polyploid genome that deviates from that of a reference sequence, usually one of the other species present in the MSA. We show that if the reference sequence is one of the parental species, the polarized polyploid sequence has a close resemblance (high pairwise sequence identity) to the second parental species. This knowledge is harnessed to build a new heuristic algorithm where, by replacing the allopolyploid genomic sequence in the MSA by its polarized version, it is possible to identify the phylogenetic position of the polyploid's ancestral parents in an iterative process. The proposed methodology can be used with long-read and short-read high-throughput sequencing data and requires only one representative individual for each species to be included in the phylogenetic analysis. In its current form, it can be used in the analysis of phylogenies containing tetraploid and diploid species. We test the newly developed method extensively using simulated data in order to evaluate its accuracy. We show empirically that the use of polarized genomic sequences allows for the correct identification of both parental species of an allotetraploid with up to 97% certainty in phylogenies with moderate levels of incomplete lineage sorting (ILS) and 87% in phylogenies containing high levels of ILS. We then apply the polarization protocol to reconstruct the reticulate histories of Arabidopsis kamchatica and Arabidopsis suecica, two allopolyploids whose ancestry has been well documented. [Allopolyploidy; Arabidopsis; genomic polarization; homoeologs; incomplete lineage sorting; phasing; polyploid phylogenetics; reticulate evolution.].
Collapse
Affiliation(s)
- J Luis Leal
- Plant Ecology and Evolution, Department of Ecology and Genetics, Uppsala University, Norbyvägen 18D, 75236 Uppsala, Sweden
| | - Pascal Milesi
- Plant Ecology and Evolution, Department of Ecology and Genetics, Uppsala University, Norbyvägen 18D, 75236 Uppsala, Sweden
- Science for Life Laboratory (SciLifeLab), Uppsala University, 75237 Uppsala, Sweden
| | - Jarkko Salojärvi
- Organismal and Evolutionary Biology Research Program, Faculty of Biological and Environmental Sciences, and Viikki Plant Science Centre, University of Helsinki, P.O. Box 65 (Viikinkaari 1), 00014 Helsinki, Finland
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
| | - Martin Lascoux
- Plant Ecology and Evolution, Department of Ecology and Genetics, Uppsala University, Norbyvägen 18D, 75236 Uppsala, Sweden
- Science for Life Laboratory (SciLifeLab), Uppsala University, 75237 Uppsala, Sweden
| |
Collapse
|
19
|
Allman ES, Banos H, Rhodes JA. Testing Multispecies Coalescent Simulators Using Summary Statistics. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:1613-1618. [PMID: 35617176 PMCID: PMC10183998 DOI: 10.1109/tcbb.2022.3177956] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/17/2023]
Abstract
As genomic-scale datasets motivate research on species tree inference, simulators of the multispecies coalescent (MSC) process have become essential for the testing and evaluation of new inference methods. However, the simulators themselves must be tested to ensure that they give valid samples. This work develops methods for checking whether a collection of gene trees is in accord with the MSC model on a given species tree. When applied to well-known simulators, we find that several give flawed samples. The tests presented are capable of validating both topological and metric properties of gene tree samples, and are implemented in a freely available R package MSCsimtester so that developers and users may easily apply them.
Collapse
|
20
|
Willson J, Tabatabaee Y, Liu B, Warnow T. DISCO+QR: rooting species trees in the presence of GDL and ILS. BIOINFORMATICS ADVANCES 2023; 3:vbad015. [PMID: 36789293 PMCID: PMC9923442 DOI: 10.1093/bioadv/vbad015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/12/2022] [Revised: 01/21/2023] [Accepted: 02/06/2023] [Indexed: 02/10/2023]
Abstract
Motivation Genes evolve under processes such as gene duplication and loss (GDL), so that gene family trees are multi-copy, as well as incomplete lineage sorting (ILS); both processes produce gene trees that differ from the species tree. The estimation of species trees from sets of gene family trees is challenging, and the estimation of rooted species trees presents additional analytical challenges. Two of the methods developed for this problem are STRIDE, which roots species trees by considering GDL events, and Quintet Rooting (QR), which roots species trees by considering ILS. Results We present DISCO+QR, a new approach to rooting species trees that first uses DISCO to address GDL and then uses QR to perform rooting in the presence of ILS. DISCO+QR operates by taking the input gene family trees and decomposing them into single-copy trees using DISCO and then roots the given species tree using the information in the single-copy gene trees using QR. We show that the relative accuracy of STRIDE and DISCO+QR depend on the properties of the dataset (number of species, genes, rate of gene duplication, degree of ILS and gene tree estimation error), and that each provides advantages over the other under some conditions. Availability and implementation DISCO and QR are available in github. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Collapse
Affiliation(s)
- James Willson
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Yasamin Tabatabaee
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Baqiao Liu
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | | |
Collapse
|
21
|
Morel B, Williams TA, Stamatakis A. Asteroid: a new algorithm to infer species trees from gene trees under high proportions of missing data. Bioinformatics 2022; 39:6964379. [PMID: 36576010 PMCID: PMC9838317 DOI: 10.1093/bioinformatics/btac832] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2022] [Revised: 12/12/2022] [Accepted: 12/26/2022] [Indexed: 12/29/2022] Open
Abstract
MOTIVATION Missing data and incomplete lineage sorting (ILS) are two major obstacles to accurate species tree inference. Gene tree summary methods such as ASTRAL and ASTRID have been developed to account for ILS. However, they can be severely affected by high levels of missing data. RESULTS We present Asteroid, a novel algorithm that infers an unrooted species tree from a set of unrooted gene trees. We show on both empirical and simulated datasets that Asteroid is substantially more accurate than ASTRAL and ASTRID for very high proportions (>80%) of missing data. Asteroid is several orders of magnitude faster than ASTRAL for datasets that contain thousands of genes. It offers advanced features such as parallelization, support value computation and support for multi-copy and multifurcating gene trees. AVAILABILITY AND IMPLEMENTATION Asteroid is freely available at https://github.com/BenoitMorel/Asteroid. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Tom A Williams
- School of Biological Sciences, University of Bristol, Bristol BS8, UK
| | - Alexandros Stamatakis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg 69118, Germany,Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe 76131, Germany
| |
Collapse
|
22
|
Super-Mitobarcoding in Plant Species Identification? It Can Work! The Case of Leafy Liverworts Belonging to the Genus Calypogeia. Int J Mol Sci 2022; 23:ijms232415570. [PMID: 36555212 PMCID: PMC9779425 DOI: 10.3390/ijms232415570] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/01/2022] [Revised: 12/04/2022] [Accepted: 12/05/2022] [Indexed: 12/14/2022] Open
Abstract
Molecular identification of species is especially important where traditional taxonomic methods fail. The genus Calypogeia belongs to one of the tricky taxons. The simple morphology of these species and a tendency towards environmental plasticity make them complicated in identification. The finding of the universal single-locus DNA barcode in plants seems to be 'the Holy Grail'; therefore, researchers are increasingly looking for multiloci DNA barcodes or super-barcoding. Since the mitochondrial genome has low sequence variation in plants, species delimitation is usually based on the chloroplast genome. Unexpectedly, our research shows that super-mitobarcoding can also work! However, our outcomes showed that a single method of molecular species delimitation should be avoided. Moreover, it is recommended to interpret the results of molecular species delimitation alongside other types of evidence, such as ecology, population genetics or comparative morphology. Here, we also presented genetic data supporting the view that C. suecica is not a homogeneous species.
Collapse
|
23
|
Zhang C, Mirarab S. Weighting by Gene Tree Uncertainty Improves Accuracy of Quartet-based Species Trees. Mol Biol Evol 2022; 39:6750035. [PMID: 36201617 PMCID: PMC9750496 DOI: 10.1093/molbev/msac215] [Citation(s) in RCA: 51] [Impact Index Per Article: 17.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2022] [Revised: 09/20/2022] [Accepted: 10/03/2022] [Indexed: 01/07/2023] Open
Abstract
Phylogenomic analyses routinely estimate species trees using methods that account for gene tree discordance. However, the most scalable species tree inference methods, which summarize independently inferred gene trees to obtain a species tree, are sensitive to hard-to-avoid errors introduced in the gene tree estimation step. This dilemma has created much debate on the merits of concatenation versus summary methods and practical obstacles to using summary methods more widely and to the exclusion of concatenation. The most successful attempt at making summary methods resilient to noisy gene trees has been contracting low support branches from the gene trees. Unfortunately, this approach requires arbitrary thresholds and poses new challenges. Here, we introduce threshold-free weighting schemes for the quartet-based species tree inference, the metric used in the popular method ASTRAL. By reducing the impact of quartets with low support or long terminal branches (or both), weighting provides stronger theoretical guarantees and better empirical performance than the unweighted ASTRAL. Our simulations show that weighting improves accuracy across many conditions and reduces the gap with concatenation in conditions with low gene tree discordance and high noise. On empirical data, weighting improves congruence with concatenation and increases support. Together, our results show that weighting, enabled by a new optimization algorithm we introduce, improves the utility of summary methods and can reduce the incongruence often observed across analytical pipelines.
Collapse
Affiliation(s)
- Chao Zhang
- Bioinformatics and Systems Biology, UC San Diego, La Jolla, CA, USA
| | | |
Collapse
|
24
|
Balaban M, Bristy NA, Faisal A, Bayzid MS, Mirarab S. Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model. BIOINFORMATICS ADVANCES 2022; 2:vbac055. [PMID: 35992043 PMCID: PMC9383262 DOI: 10.1093/bioadv/vbac055] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/21/2022] [Accepted: 08/09/2022] [Indexed: 01/27/2023]
Abstract
While alignment has been the dominant approach for determining homology prior to phylogenetic inference, alignment-free methods can simplify the analysis, especially when analyzing genome-wide data. Furthermore, alignment-free methods present the only option for emerging forms of data, such as genome skims, which do not permit assembly. Despite the appeal, alignment-free methods have not been competitive with alignment-based methods in terms of accuracy. One limitation of alignment-free methods is their reliance on simplified models of sequence evolution such as Jukes-Cantor. If we can estimate frequencies of base substitutions in an alignment-free setting, we can compute pairwise distances under more complex models. However, since the strand of DNA sequences is unknown for many forms of genome-wide data, which arguably present the best use case for alignment-free methods, the most complex models that one can use are the so-called no strand-bias models. We show how to calculate distances under a four-parameter no strand-bias model called TK4 without relying on alignments or assemblies. The main idea is to replace letters in the input sequences and recompute Jaccard indices between k-mer sets. However, on larger genomes, we also need to compute the number of k-mer mismatches after replacement due to random chance as opposed to homology. We show in simulation that alignment-free distances can be highly accurate when genomes evolve under the assumed models and study the accuracy on assembled and unassembled biological data. Availability and implementation Our software is available open source at https://github.com/nishatbristy007/NSB. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Collapse
Affiliation(s)
| | | | - Ahnaf Faisal
- Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
| | - Md Shamsuzzoha Bayzid
- Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
| | | |
Collapse
|
25
|
Cui X, Xue Y, McCormack C, Garces A, Rachman TW, Yi Y, Stolzer M, Durand D. Simulating domain architecture evolution. Bioinformatics 2022; 38:i134-i142. [PMID: 35758772 PMCID: PMC9236583 DOI: 10.1093/bioinformatics/btac242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022] Open
Abstract
Motivation Simulation is an essential technique for generating biomolecular data with a ‘known’ history for use in validating phylogenetic inference and other evolutionary methods. On longer time scales, simulation supports investigations of equilibrium behavior and provides a formal framework for testing competing evolutionary hypotheses. Twenty years of molecular evolution research have produced a rich repertoire of simulation methods. However, current models do not capture the stringent constraints acting on the domain insertions, duplications, and deletions by which multidomain architectures evolve. Although these processes have the potential to generate any combination of domains, only a tiny fraction of possible domain combinations are observed in nature. Modeling these stringent constraints on domain order and co-occurrence is a fundamental challenge in domain architecture simulation that does not arise with sequence and gene family simulation. Results Here, we introduce a stochastic model of domain architecture evolution to simulate evolutionary trajectories that reflect the constraints on domain order and co-occurrence observed in nature. This framework is implemented in a novel domain architecture simulator, DomArchov, using the Metropolis–Hastings algorithm with data-driven transition probabilities. The use of a data-driven event module enables quick and easy redeployment of the simulator for use in different taxonomic and protein function contexts. Using empirical evaluation with metazoan datasets, we demonstrate that domain architectures simulated by DomArchov recapitulate properties of genuine domain architectures that reflect the constraints on domain order and adjacency seen in nature. This work expands the realm of evolutionary processes that are amenable to simulation. Availability and implementation DomArchov is written in Python 3 and is available at http://www.cs.cmu.edu/~durand/DomArchov. The data underlying this article are available via the same link. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Xiaoyue Cui
- Computational Biology, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Yifan Xue
- Computational Biology, Carnegie Mellon University, Pittsburgh, PA 15213, USA.,Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Collin McCormack
- Computational Biology, Carnegie Mellon University, Pittsburgh, PA 15213, USA.,Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Alejandro Garces
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Thomas W Rachman
- Computational Biology, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Yang Yi
- Computational Biology, Carnegie Mellon University, Pittsburgh, PA 15213, USA.,Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Maureen Stolzer
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Dannie Durand
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| |
Collapse
|
26
|
Carson J, Ledda A, Ferretti L, Keeling M, Didelot X. The bounded coalescent model: Conditioning a genealogy on a minimum root date. J Theor Biol 2022; 548:111186. [PMID: 35697144 DOI: 10.1016/j.jtbi.2022.111186] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2022] [Revised: 05/05/2022] [Accepted: 06/02/2022] [Indexed: 01/27/2023]
Abstract
The coalescent model represents how individuals sampled from a population may have originated from a last common ancestor. The bounded coalescent model is obtained by conditioning the coalescent model such that the last common ancestor must have existed after a certain date. This conditioned model arises in a variety of applications, such as speciation, horizontal gene transfer or transmission analysis, and yet the bounded coalescent model has not been previously analysed in detail. Here we describe a new algorithm to simulate from this model directly, without resorting to rejection sampling. We show that this direct simulation algorithm is more computationally efficient than the rejection sampling approach. We also show how to calculate the probability of the last common ancestor occurring after a given date, which is required to compute the probability density of realisations under the bounded coalescent model. Our results are applicable in both the isochronous (when all samples have the same date) and heterochronous (where samples can have different dates) settings. We explore the effect of setting a bound on the date of the last common ancestor, and show that it affects a number of properties of the resulting phylogenies. All our methods are implemented in a new R package called BoundedCoalescent which is freely available online.
Collapse
Affiliation(s)
- Jake Carson
- Mathematics Institute, University of Warwick, United Kingdom
| | - Alice Ledda
- HCAI, Fungal, AMR, AMU & Sepsis Division, UK Health Security Agency, United Kingdom
| | - Luca Ferretti
- Big Data Institute, University of Oxford, United Kingdom
| | - Matt Keeling
- Mathematics Institute, University of Warwick, United Kingdom
| | - Xavier Didelot
- Department of Statistics and School of Life Sciences, University of Warwick, United Kingdom
| |
Collapse
|
27
|
Wawerka M, Dąbkowski D, Rutecka N, Mykowiecka A, Górecki P. Embedding gene trees into phylogenetic networks by conflict resolution algorithms. Algorithms Mol Biol 2022; 17:11. [PMID: 35590416 PMCID: PMC9119282 DOI: 10.1186/s13015-022-00218-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2021] [Accepted: 03/22/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Phylogenetic networks are mathematical models of evolutionary processes involving reticulate events such as hybridization, recombination, or horizontal gene transfer. One of the crucial notions in phylogenetic network modelling is displayed tree, which is obtained from a network by removing a set of reticulation edges. Displayed trees may represent an evolutionary history of a gene family if the evolution is shaped by reticulation events. RESULTS We address the problem of inferring an optimal tree displayed by a network, given a gene tree G and a tree-child network N, under the deep coalescence and duplication costs. We propose an O(mn)-time dynamic programming algorithm (DP) to compute a lower bound of the optimal displayed tree cost, where m and n are the sizes of G and N, respectively. In addition, our algorithm can verify whether the solution is exact. Moreover, it provides a set of reticulation edges corresponding to the obtained cost. If the cost is exact, the set induces an optimal displayed tree. Otherwise, the set contains pairs of conflicting edges, i.e., edges sharing a reticulation node. Next, we show a conflict resolution algorithm that requires [Formula: see text] invocations of DP in the worst case, where r is the number of reticulations. We propose a similar [Formula: see text]-time algorithm for level-k tree-child networks and a branch and bound solution to compute lower and upper bounds of optimal costs. We also extend the algorithms to a broader class of phylogenetic networks. Based on simulated data, the average runtime is [Formula: see text] under the deep-coalescence cost and [Formula: see text] under the duplication cost. CONCLUSIONS Despite exponential complexity in the worst case, our algorithms perform significantly well on empirical and simulated datasets, due to the strategy of resolving internal dissimilarities between gene trees and networks. Therefore, the algorithms are efficient alternatives to enumeration strategies commonly proposed in the literature and enable analyses of complex networks with dozens of reticulations.
Collapse
|
28
|
Jiang Y, Balaban M, Zhu Q, Mirarab S. DEPP: Deep Learning Enables Extending Species Trees using Single Genes. Syst Biol 2022; 72:17-34. [PMID: 35485976 DOI: 10.1093/sysbio/syac031] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2021] [Revised: 04/13/2022] [Accepted: 04/22/2022] [Indexed: 11/13/2022] Open
Abstract
Placing new sequences onto reference phylogenies is increasingly used for analyzing environmental samples, especially microbiomes. Existing placement methods assume that query sequences have evolved under specific models directly on the reference phylogeny. For example, they assume single-gene data (e.g., 16S rRNA amplicons) have evolved under the GTR model on a gene tree. Placement, however, often has a more ambitious goal: extending a (genome-wide) species tree given data from individual genes without knowing the evolutionary model. Addressing this challenging problem requires new directions. Here, we introduce Deep-learning Enabled Phylogenetic Placement (DEPP), an algorithm that learns to extend species trees using single genes without pre-specified models. In simulations and on real data, we show that DEPP can match the accuracy of model-based methods without any prior knowledge of the model. We also show that DEPP can update the multi-locus microbial tree-of-life with single genes with high accuracy. We further demonstrate that DEPP can combine 16S and metagenomic data onto a single tree, enabling community structure analyses that take advantage of both sources of data.
Collapse
Affiliation(s)
- Yueyu Jiang
- Department of Electrical and Computer Engineering, UC San Diego, CA 92093, USA
| | - Metin Balaban
- Bioinformatics and Systems Biology Graduate Program, UC San Diego, CA 92093, USA
| | - Qiyun Zhu
- Center for Fundamental and Applied Microbiomics, Arizona State University, Tempe, AZ 85281, USA
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, UC San Diego, CA 92093, USA
| |
Collapse
|
29
|
Willson J, Roddur MS, Liu B, Zaharias P, Warnow T. DISCO: Species Tree Inference using Multicopy Gene Family Tree Decomposition. Syst Biol 2022; 71:610-629. [PMID: 34450658 PMCID: PMC9016570 DOI: 10.1093/sysbio/syab070] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2021] [Revised: 08/18/2021] [Accepted: 08/23/2021] [Indexed: 11/21/2022] Open
Abstract
Species tree inference from gene family trees is a significant problem in computational biology. However, gene tree heterogeneity, which can be caused by several factors including gene duplication and loss, makes the estimation of species trees very challenging. While there have been several species tree estimation methods introduced in recent years to specifically address gene tree heterogeneity due to gene duplication and loss (such as DupTree, FastMulRFS, ASTRAL-Pro, and SpeciesRax), many incur high cost in terms of both running time and memory. We introduce a new approach, DISCO, that decomposes the multi-copy gene family trees into many single copy trees, which allows for methods previously designed for species tree inference in a single copy gene tree context to be used. We prove that using DISCO with ASTRAL (i.e., ASTRAL-DISCO) is statistically consistent under the GDL model, provided that ASTRAL-Pro correctly roots and tags each gene family tree. We evaluate DISCO paired with different methods for estimating species trees from single copy genes (e.g., ASTRAL, ASTRID, and IQ-TREE) under a wide range of model conditions, and establish that high accuracy can be obtained even when ASTRAL-Pro is not able to correctly roots and tags the gene family trees. We also compare results using MI, an alternative decomposition strategy from Yang Y. and Smith S.A. (2014), and find that DISCO provides better accuracy, most likely as a result of covering more of the gene family tree leafset in the output decomposition. [Concatenation analysis; gene duplication and loss; species tree inference; summary method.].
Collapse
Affiliation(s)
- James Willson
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Mrinmoy Saha Roddur
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Baqiao Liu
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Paul Zaharias
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| |
Collapse
|
30
|
Yan Z, Smith ML, Du P, Hahn MW, Nakhleh L. Species Tree Inference Methods Intended to Deal with Incomplete Lineage Sorting Are Robust to the Presence of Paralogs. Syst Biol 2022; 71:367-381. [PMID: 34245291 PMCID: PMC8978208 DOI: 10.1093/sysbio/syab056] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2020] [Revised: 06/23/2021] [Accepted: 06/30/2021] [Indexed: 11/24/2022] Open
Abstract
Many recent phylogenetic methods have focused on accurately inferring species trees when there is gene tree discordance due to incomplete lineage sorting (ILS). For almost all of these methods, and for phylogenetic methods in general, the data for each locus are assumed to consist of orthologous, single-copy sequences. Loci that are present in more than a single copy in any of the studied genomes are excluded from the data. These steps greatly reduce the number of loci available for analysis. The question we seek to answer in this study is: what happens if one runs such species tree inference methods on data where paralogy is present, in addition to or without ILS being present? Through simulation studies and analyses of two large biological data sets, we show that running such methods on data with paralogs can still provide accurate results. We use multiple different methods, some of which are based directly on the multispecies coalescent model, and some of which have been proven to be statistically consistent under it. We also treat the paralogous loci in multiple ways: from explicitly denoting them as paralogs, to randomly selecting one copy per species. In all cases, the inferred species trees are as accurate as equivalent analyses using single-copy orthologs. Our results have significant implications for the use of ILS-aware phylogenomic analyses, demonstrating that they do not have to be restricted to single-copy loci. This will greatly increase the amount of data that can be used for phylogenetic inference.[Gene duplication and loss; incomplete lineage sorting; multispecies coalescent; orthology; paralogy.].
Collapse
Affiliation(s)
- Zhi Yan
- Department of Computer Science, Rice University,
6100 Main Street, Houston, TX 77005, USA
| | - Megan L Smith
- Department of Biology and Department of Computer Science,
Indiana University, 1001 East Third Street, Bloomington,
IN 47405, USA
| | - Peng Du
- Department of Computer Science, Rice University,
6100 Main Street, Houston, TX 77005, USA
| | - Matthew W Hahn
- Department of Biology and Department of Computer Science,
Indiana University, 1001 East Third Street, Bloomington,
IN 47405, USA
| | - Luay Nakhleh
- Department of Computer Science, Rice University,
6100 Main Street, Houston, TX 77005, USA
- Department of BioSciences, Rice University, 6100
Main Street, Houston, TX 77005, USA
| |
Collapse
|
31
|
Morel B, Schade P, Lutteropp S, Williams TA, Szöllősi GJ, Stamatakis A. SpeciesRax: A Tool for Maximum Likelihood Species Tree Inference from Gene Family Trees under Duplication, Transfer, and Loss. Mol Biol Evol 2022; 39:msab365. [PMID: 35021210 PMCID: PMC8826479 DOI: 10.1093/molbev/msab365] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
Species tree inference from gene family trees is becoming increasingly popular because it can account for discordance between the species tree and the corresponding gene family trees. In particular, methods that can account for multiple-copy gene families exhibit potential to leverage paralogy as informative signal. At present, there does not exist any widely adopted inference method for this purpose. Here, we present SpeciesRax, the first maximum likelihood method that can infer a rooted species tree from a set of gene family trees and can account for gene duplication, loss, and transfer events. By explicitly modeling events by which gene trees can depart from the species tree, SpeciesRax leverages the phylogenetic rooting signal in gene trees. SpeciesRax infers species tree branch lengths in units of expected substitutions per site and branch support values via paralogy-aware quartets extracted from the gene family trees. Using both empirical and simulated data sets we show that SpeciesRax is at least as accurate as the best competing methods while being one order of magnitude faster on large data sets at the same time. We used SpeciesRax to infer a biologically plausible rooted phylogeny of the vertebrates comprising 188 species from 31,612 gene families in 1 h using 40 cores. SpeciesRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax and on BioConda.
Collapse
Affiliation(s)
- Benoit Morel
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | - Paul Schade
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | - Sarah Lutteropp
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
| | - Tom A Williams
- School of Biological Sciences, University of Bristol, Bristol, United Kingdom
| | - Gergely J Szöllősi
- ELTE-MTA “Lendület” Evolutionary Genomics Research Group, Budapest, Hungary
- Department of Biological Physics, Eötvös University, Budapest, Hungary
- Institute of Evolution, Centre for Ecological Research, Budapest, Hungary
| | - Alexandros Stamatakis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| |
Collapse
|
32
|
Abstract
Motivation Phylogenomics faces a dilemma: on the one hand, most accurate species and gene tree estimation methods are those that co-estimate them; on the other hand, these co-estimation methods do not scale to moderately large numbers of species. The summary-based methods, which first infer gene trees independently and then combine them, are much more scalable but are prone to gene tree estimation error, which is inevitable when inferring trees from limited-length data. Gene tree estimation error is not just random noise and can create biases such as long-branch attraction. Results We introduce a scalable likelihood-based approach to co-estimation under the multi-species coalescent model. The method, called quartet co-estimation (QuCo), takes as input independently inferred distributions over gene trees and computes the most likely species tree topology and internal branch length for each quartet, marginalizing over gene tree topologies and ignoring branch lengths by making several simplifying assumptions. It then updates the gene tree posterior probabilities based on the species tree. The focus on gene tree topologies and the heuristic division to quartets enables fast likelihood calculations. We benchmark our method with extensive simulations for quartet trees in zones known to produce biased species trees and further with larger trees. We also run QuCo on a biological dataset of bees. Our results show better accuracy than the summary-based approach ASTRAL run on estimated gene trees. Availability and implementation QuCo is available on https://github.com/maryamrabiee/quco. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Maryam Rabiee
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093, USA
| | | |
Collapse
|
33
|
Höhler D, Pfeiffer W, Ioannidis V, Stockinger H, Stamatakis A. RAxML Grove: an empirical phylogenetic tree database. Bioinformatics 2021; 38:1741-1742. [PMID: 34962976 PMCID: PMC8896636 DOI: 10.1093/bioinformatics/btab863] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2021] [Revised: 12/16/2021] [Accepted: 12/22/2021] [Indexed: 02/03/2023] Open
Abstract
SUMMARY The assessment of novel phylogenetic models and inference methods is routinely being conducted via experiments on simulated as well as empirical data. When generating synthetic data it is often unclear how to set simulation parameters for the models and generate trees that appropriately reflect empirical model parameter distributions and tree shapes. As a solution, we present and make available a new database called 'RAxML Grove' currently comprising more than 60 000 inferred trees and respective model parameter estimates from fully anonymized empirical datasets that were analyzed using RAxML and RAxML-NG on two web servers. We also describe and make available two simple applications of RAxML Grove to exemplify its usage and highlight its utility for designing realistic simulation studies and analyzing empirical model parameter and tree shape distributions. AVAILABILITY AND IMPLEMENTATION RAxML Grove is freely available at https://github.com/angtft/RAxMLGrove. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Dimitri Höhler
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, 69118 Heidelberg, Germany
| | - Wayne Pfeiffer
- San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA 92093-0505, USA
| | - Vassilios Ioannidis
- Core-IT Group, SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Heinz Stockinger
- Core-IT Group, SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | | |
Collapse
|
34
|
Smith ML, Hahn MW. The Frequency and Topology of Pseudoorthologs. Syst Biol 2021; 71:649-659. [PMID: 34951639 DOI: 10.1093/sysbio/syab097] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2021] [Revised: 12/15/2021] [Accepted: 12/17/2021] [Indexed: 11/12/2022] Open
Abstract
Phylogenetics has long relied on the use of orthologs, or genes related through speciation events, to infer species relationships. However, identifying orthologs is difficult because gene duplication can obscure relationships among genes. Researchers have been particularly concerned with the insidious effects of pseudoorthologs-duplicated genes that are mistaken for orthologs because they are present in a single copy in each sampled species. Because gene tree topologies of pseudoorthologs may differ from the species tree topology, they have often been invoked as the cause of counterintuitive results in phylogenetics. Despite these perceived problems, no previous work has calculated the probabilities of pseudoortholog topologies, or has been able to circumscribe the regions of parameter space in which pseudoorthologs are most likely to occur. Here, we introduce a model for calculating the probabilities and branch lengths of orthologs and pseudoorthologs, including concordant and discordant pseudoortholog topologies, on a rooted three-taxon species tree. We show that the probability of orthologs is high relative to the probability of pseudoorthologs across reasonable regions of parameter space. Furthermore, the probabilities of the two discordant topologies are equal and never exceed that of the concordant topology, generally being much lower. We describe the species tree topologies most prone to generating pseudoorthologs, finding that they are likely to present problems to phylogenetic inference irrespective of the presence of pseudoorthologs. Overall, our results suggest that pseudoorthologs are unlikely to mislead inferences of species relationships under the biological scenarios considered here.
Collapse
Affiliation(s)
- Megan L Smith
- Department of Biology and Department of Computer Science, Indiana University, Bloomington, IN 47405, USA
| | - Matthew W Hahn
- Department of Biology and Department of Computer Science, Indiana University, Bloomington, IN 47405, USA
| |
Collapse
|
35
|
Walker JF, Smith SA, Hodel RGJ, Moyroud E. Concordance-based approaches for the inference of relationships and molecular rates with phylogenomic datasets. Syst Biol 2021; 71:943-958. [PMID: 34240209 DOI: 10.1093/sysbio/syab052] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/05/2020] [Revised: 06/23/2021] [Accepted: 07/01/2021] [Indexed: 11/12/2022] Open
Abstract
Gene tree conflict is common and finding methods to analyze and alleviate the negative effects that conflict has on species tree analysis is a crucial part of phylogenomics. This study aims to expand the discussion of inferring species trees and molecular branch lengths when conflict is present. Conflict is typically examined in two ways: inferring its prevalence, and inferring the influence of the individual genes (how strongly one gene supports any given topology compared to an alternative topology). Here, we examine a procedure for incorporating both conflict and the influence of genes in order to infer evolutionary relationships. All supported relationships in the gene trees are analyzed and the likelihood of the genes constrained to these relationships is summed to provide a likelihood for the relationship. Consensus tree assembly is conducted based on the sum of likelihoods for a given relationship and choosing relationships based on the most likely relationship assuming it does not conflict with a relationship that has a higher likelihood score. If it is not possible for all most likely relationships to be combined into a single bifurcating tree then multiple trees are produced and a consensus tree with a polytomy is created. This procedure allows for more influential genes to have greater influence on an inferred relationship, does not assume conflict has arisen from any one source, and does not force the dataset to produce a single bifurcating tree. Using this approach on three empirical datasets, we examine and discuss the relationship between influence and prevalence of gene tree conflict. We find that in one of the datasets, assembling a bifurcating consensus tree solely composed of the most likely relationships is impossible. To account for conflict in molecular rate analysis we also introduce a concordance-based approach to the summary and estimation of branch lengths suitable for downstream comparative analyses. We demonstrate through simulation that even under high levels of stochastic conflict, the mean and median of the concordant rates recapitulate the true molecular rate better than using a supermatrix approach. Using a large phylogenomic dataset, we examine rate heterogeneity across concordant genes with a focus on the branch subtending crown angiosperms. Notably, we find highly variable rates of evolution along the branch subtending crown angiosperms. The approaches outlined here have several limitations, but they also represent some alternative methods for harnessing the complexity of phylogenomic datasets and enrich our inferences of both species' relationships and evolutionary processes.
Collapse
Affiliation(s)
- Joseph F Walker
- The Sainsbury Laboratory, University of Cambridge, 47 Bateman Street, Cambridge CB2 1LR, UK.,Department of Biological Sciences, University of Illinois at Chicago, Chicago, IL, 60607 U.S.A
| | - Stephen A Smith
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI, USA
| | - Richard G J Hodel
- Department of Botany, National Museum of Natural History, MRC 166, Smithsonian Institution, Washington, DC, 20013-7012, USA
| | - Edwige Moyroud
- The Sainsbury Laboratory, University of Cambridge, 47 Bateman Street, Cambridge CB2 1LR, UK.,Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK
| |
Collapse
|
36
|
Paszek J, Markin A, Górecki P, Eulenstein O. Taming the Duplication-Loss-Coalescence Model with Integer Linear Programming. J Comput Biol 2021; 28:758-773. [PMID: 34125600 DOI: 10.1089/cmb.2021.0011] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
The duplication-loss-coalescence (DLC) parsimony model is invaluable for analyzing the complex scenarios of concurrent duplication loss and deep coalescence events in the evolution of gene families. However, inferring such scenarios for already moderately sized families is prohibitive owing to the computational complexity involved. To overcome this stringent limitation, we make the first step by describing a flexible integer linear programming (ILP) formulation for inferring DLC evolutionary scenarios. Then, to make the DLC model more scalable, we introduce four sensibly constrained versions of the model and describe modified versions of our ILP formulation reflecting these constraints. Our simulation studies showcase that our constrained ILP formulations compute evolutionary scenarios that are substantially larger than scenarios computable under our original ILP formulation and the original dynamic programming algorithm by Wu et al. Furthermore, scenarios computed under our constrained DLC models are remarkably accurate compared with corresponding scenarios under the original DLC model, which we also confirm in an empirical study with thousands of gene families.
Collapse
Affiliation(s)
- Jarosław Paszek
- Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warszawa, Poland
| | - Alexey Markin
- Department of Computer Science, Iowa State University, Ames, Iowa, USA
| | - Paweł Górecki
- Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warszawa, Poland
| | - Oliver Eulenstein
- Department of Computer Science, Iowa State University, Ames, Iowa, USA
| |
Collapse
|
37
|
Dismukes W, Heath TA. treeducken: An R package for simulating cophylogenetic systems. Methods Ecol Evol 2021. [DOI: 10.1111/2041-210x.13625] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Affiliation(s)
- Wade Dismukes
- Department of Ecology, Evolution, and Organismal Biology Iowa State University Ames IA USA
| | - Tracy A. Heath
- Department of Ecology, Evolution, and Organismal Biology Iowa State University Ames IA USA
| |
Collapse
|
38
|
DNA Barcoding of Marine Mollusks Associated with Corallina officinalis Turfs in Southern Istria (Adriatic Sea). DIVERSITY 2021. [DOI: 10.3390/d13050196] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
Abstract
Presence of mollusk assemblages was studied within red coralligenous algae Corallina officinalis L. along the southern Istrian coast. C. officinalis turfs can be considered a biodiversity reservoir, as they shelter numerous invertebrate species. The aim of this study was to identify mollusk species within these settlements using DNA barcoding as a method for detailed identification of mollusks. Nine locations and 18 localities with algal coverage range above 90% were chosen at four research areas. From 54 collected samples of C. officinalis turfs, a total of 46 mollusk species were identified. Molecular methods helped identify 16 gastropod, 14 bivalve and one polyplacophoran species. COI sequences for two bivalve species (Musculus cf. costulatus (Risso, 1826) and Gregariella semigranata (Reeve, 1858)) and seven gastropod species (Megastomia winfriedi Peñas & Rolán, 1999, Eatonina sp. Thiele, 1912, Eatonina cossurae (Calcara, 1841), Crisilla cf. maculata (Monterosato, 1869), Alvania cf. carinata (da Costa, 1778), Vitreolina antiflexa (Monterosato, 1884) and Odostomia plicata (Montagu, 1803)) represent new BINs in BOLD database. This study contributes to new findings related to the high biodiversity of mollusks associated with widespread C. officinalis settlements along the southern coastal area of Istria.
Collapse
|
39
|
Morel B, Kozlov AM, Stamatakis A, Szöllősi GJ. GeneRax: A Tool for Species-Tree-Aware Maximum Likelihood-Based Gene Family Tree Inference under Gene Duplication, Transfer, and Loss. Mol Biol Evol 2021; 37:2763-2774. [PMID: 32502238 PMCID: PMC8312565 DOI: 10.1093/molbev/msaa141] [Citation(s) in RCA: 80] [Impact Index Per Article: 20.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open
Abstract
Inferring phylogenetic trees for individual homologous gene families is difficult because
alignments are often too short, and thus contain insufficient signal, while substitution
models inevitably fail to capture the complexity of the evolutionary processes. To
overcome these challenges, species-tree-aware methods also leverage information from a
putative species tree. However, only few methods are available that implement a full
likelihood framework or account for horizontal gene transfers. Furthermore, these methods
often require expensive data preprocessing (e.g., computing bootstrap trees) and rely on
approximations and heuristics that limit the degree of tree space exploration. Here, we
present GeneRax, the first maximum likelihood species-tree-aware phylogenetic inference
software. It simultaneously accounts for substitutions at the sequence level as well as
gene level events, such as duplication, transfer, and loss relying on established maximum
likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for
multiple gene families, directly from the per-gene sequence alignments and a rooted, yet
undated, species tree. We show that compared with competing tools, on simulated data
GeneRax infers trees that are the closest to the true tree in 90% of the simulations in
terms of relative Robinson–Foulds distance. On empirical data sets, GeneRax is the fastest
among all tested methods when starting from aligned sequences, and it infers trees with
the highest likelihood score, based on our model. GeneRax completed tree inferences and
reconciliations for 1,099 Cyanobacteria families in 8 min on 512 CPU cores. Thus, its
parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at
https://github.com/BenoitMorel/GeneRax (last accessed June 17, 2020).
Collapse
Affiliation(s)
- Benoit Morel
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
| | - Alexey M Kozlov
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
| | - Alexandros Stamatakis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany.,Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | - Gergely J Szöllősi
- ELTE-MTA "Lendület" Evolutionary Genomics Research Group, Budapest, Hungary.,Department of Biological Physics, Eötvös University, Budapest, Hungary.,Evolutionary Systems Research Group, Centre for Ecological Research, Hungarian Academy of Sciences, Tihany, Hungary
| |
Collapse
|
40
|
Shen XX, Steenwyk JL, Rokas A. Dissecting incongruence between concatenation- and quartet-based approaches in phylogenomic data. Syst Biol 2021; 70:997-1014. [PMID: 33616672 DOI: 10.1093/sysbio/syab011] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2020] [Revised: 02/10/2021] [Accepted: 02/17/2021] [Indexed: 12/12/2022] Open
Abstract
Topological conflict or incongruence is widespread in phylogenomic data. Concatenation- and coalescent-based approaches often result in incongruent topologies, but the causes of this conflict can be difficult to characterize. We examined incongruence stemming from conflict between likelihood-based signal (quantified by the difference in gene-wise log likelihood score or ΔGLS) and quartet-based topological signal (quantified by the difference in gene-wise quartet score or ΔGQS) for every gene in three phylogenomic studies in animals, fungi, and plants, which were chosen because their concatenation-based IQ-TREE (T1) and quartet-based ASTRAL (T2) phylogenies are known to produce eight conflicting internal branches (bipartitions). By comparing the types of phylogenetic signal for all genes in these three data matrices, we found that 30% - 36% of genes in each data matrix are inconsistent, that is, each of these genes has higher log likelihood score for T1 versus T2 (i.e., ΔGLS >0) whereas its T1 topology has lower quartet score than its T2 topology (i.e., ΔGQS <0) or vice versa. Comparison of inconsistent and consistent genes using a variety of metrics (e.g., evolutionary rate, gene tree topology, distribution of branch lengths, hidden paralogy, and gene tree discordance) showed that inconsistent genes are more likely to recover neither T1 nor T2 and have higher levels of gene tree discordance than consistent genes. Simulation analyses demonstrate that removal of inconsistent genes from datasets with low levels of incomplete lineage sorting (ILS) and low and medium levels of gene tree estimation error (GTEE) reduced incongruence and increased accuracy. In contrast, removal of inconsistent genes from datasets with medium and high ILS levels and high GTEE levels eliminated or extensively reduced incongruence, but the resulting congruent species phylogenies were not always topologically identical to the true species trees.
Collapse
Affiliation(s)
- Xing-Xing Shen
- State Key Laboratory of Rice Biology and Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, Hangzhou, China.,Institute of Insect Sciences, Zhejiang University, Hangzhou, China
| | - Jacob L Steenwyk
- Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA
| | - Antonis Rokas
- Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA
| |
Collapse
|
41
|
Allman ES, Mitchell JD, Rhodes JA. Gene tree discord, simplex plots, and statistical tests under the coalescent. Syst Biol 2021; 71:929-942. [PMID: 33560348 DOI: 10.1093/sysbio/syab008] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/12/2020] [Revised: 01/31/2021] [Accepted: 02/03/2021] [Indexed: 02/06/2023] Open
Abstract
A simple graphical device, the simplex plot of quartet concordance factors, is introduced to aid in the exploration of a collection of gene trees on a common set of taxa. A single plot summarizes all gene tree discord, and allows for visual comparison to the expected discord from the multispecies coalescent model (MSC) of incomplete lineage sorting on a species tree. A formal statistical procedure is described that can quantify the deviation from expectation for each subset of four taxa, suggesting when the data is not in accord with the MSC, and thus that either gene tree inference error is substantial or a more complex model such as that on a network may be required. If the collection of gene trees is in accord with the MSC, the plots reveal when substantial incomplete lineage sorting is present. Applications to both simulated and empirical multilocus data sets illustrate the insights provided.
Collapse
Affiliation(s)
- Elizabeth S Allman
- Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK 99709, USA
| | - Jonathan D Mitchell
- Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK 99709, USA.,Unité Bioinformatique Evolutive, C3BI USR 3756, Institut Pasteur & CNRS, Paris, France
| | - John A Rhodes
- Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK 99709, USA
| |
Collapse
|
42
|
Rabiee M, Mirarab S. SODA: Multi-locus species delimitation using quartet frequencies. Bioinformatics 2021; 36:5623-5631. [PMID: 33555318 DOI: 10.1093/bioinformatics/btaa1010] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2020] [Revised: 10/19/2020] [Accepted: 11/21/2020] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Species delimitation, the process of deciding how to group a set of organisms into units called species, is one of the most challenging problems in evolutionary computational biology. While many methods exist for species delimitation, most based on the coalescent theory, few are scalable to very large datasets, and methods that scale tend to be not accurate. Species delimitation is closely related to species tree inference from discordant gene trees, a problem that has enjoyed rapid advances in recent years. RESULTS In this paper, we build on the accuracy and scalability of recent quartet-based methods for species tree estimation and propose a new method called SODA for species delimitation. SODA relies heavily on a recently developed method for testing zero branch length in species trees. In extensive simulations, we show that SODA can easily scale to very large datasets while maintaining high accuracy. AVAILABILITY The code and data presented here are available on https://github.com/maryamrabiee/SODA. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Maryam Rabiee
- Computer Science and Engineering, University of California, San Diego, US
| | - Siavash Mirarab
- Electrical and Computer Engineering, University of California, San Diego, US
| |
Collapse
|
43
|
Tabaszewski P, Gorecki P, Markin A, Anderson T, Eulenstein O. Consensus of All Solutions for Intractable Phylogenetic Tree Inference. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:149-161. [PMID: 31613775 DOI: 10.1109/tcbb.2019.2947051] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Solving median tree problems is a classic approach for inferring species trees from a collection of discordant gene trees. Median tree problems are typically NP-hard and dealt with by local search heuristics. Unfortunately, such heuristics generally lack provable correctness and precision. Algorithmic advances addressing this uncertainty have led to exact dynamic programming formulations suitable to solve a well-studied group of median tree problems for smaller phylogenetic analyses. However, these formulations allow computing only very few optimal species trees out of possibly many such trees, and phylogenetic studies often require the analysis of all optimal solutions through their consensus tree. Here, we describe a significant algorithmic modification of the dynamic programming formulations that compute the cluster counts of all optimal species trees from which various types of consensus trees can be efficiently computed. Through experimental studies, we demonstrate that our parallel implementation of the modified dynamic programming formulation is more efficient than a previous implementation of the original formulation. Finally, we show that the parallel implementation can rapidly identify novel reassorted influenza A viruses potentially facilitating pandemic preparedness efforts.
Collapse
|
44
|
Legried B, Molloy EK, Warnow T, Roch S. Polynomial-Time Statistical Estimation of Species Trees Under Gene Duplication and Loss. J Comput Biol 2020; 28:452-468. [DOI: 10.1089/cmb.2020.0424] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023] Open
Affiliation(s)
- Brandon Legried
- Department of Mathematics, University of Wisconsin-Madison, Madison, Wisconsin, USA
| | - Erin K. Molloy
- Department of Computer Science, University of California, Los Angeles, Los Angeles, California, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
| | - Sébastien Roch
- Department of Mathematics, University of Wisconsin-Madison, Madison, Wisconsin, USA
| |
Collapse
|
45
|
Li Q, Scornavacca C, Galtier N, Chan YB. The Multilocus Multispecies Coalescent: A Flexible New Model of Gene Family Evolution. Syst Biol 2020; 70:822-837. [PMID: 33169795 DOI: 10.1093/sysbio/syaa084] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2020] [Revised: 05/07/2020] [Accepted: 10/19/2020] [Indexed: 02/06/2023] Open
Abstract
Incomplete lineage sorting (ILS), the interaction between coalescence and speciation, can generate incongruence between gene trees and species trees, as can gene duplication (D), transfer (T), and loss (L). These processes are usually modeled independently, but in reality, ILS can affect gene copy number polymorphism, that is, interfere with DTL. This has been previously recognized, but not treated in a satisfactory way, mainly because DTL events are naturally modeled forward-in-time, while ILS is naturally modeled backward-in-time with the coalescent. Here, we consider the joint action of ILS and DTL on the gene tree/species tree problem in all its complexity. In particular, we show that the interaction between ILS and duplications/transfers (without losses) can result in patterns usually interpreted as resulting from gene loss, and that the realized rate of D, T, and L becomes nonhomogeneous in time when ILS is taken into account. We introduce algorithmic solutions to these problems. Our new model, the multilocus multispecies coalescent, which also accounts for any level of linkage between loci, generalizes the multispecies coalescent (MSC) model and offers a versatile, powerful framework for proper simulation, and inference of gene family evolution. [Gene duplication; gene loss; horizontal gene transfer; incomplete lineage sorting; multispecies coalescent; hemiplasy; recombination.].
Collapse
Affiliation(s)
- Qiuyi Li
- School of Mathematics and Statistics / Melbourne Integrative Genomics, The University of Melbourne, Melbourne 3010, Australia
| | - Celine Scornavacca
- Institut des Sciences de l'Evolution, Université Montpellier, CNRS, IRD, EPHE, Montpellier, 34095, France
| | - Nicolas Galtier
- Institut des Sciences de l'Evolution, Université Montpellier, CNRS, IRD, EPHE, Montpellier, 34095, France
| | - Yao-Ban Chan
- School of Mathematics and Statistics / Melbourne Integrative Genomics, The University of Melbourne, Melbourne 3010, Australia
| |
Collapse
|
46
|
Zhang C, Scornavacca C, Molloy EK, Mirarab S. ASTRAL-Pro: Quartet-Based Species-Tree Inference despite Paralogy. Mol Biol Evol 2020; 37:3292-3307. [PMID: 32886770 PMCID: PMC7751180 DOI: 10.1093/molbev/msaa139] [Citation(s) in RCA: 111] [Impact Index Per Article: 22.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/18/2022] Open
Abstract
Phylogenetic inference from genome-wide data (phylogenomics) has revolutionized the study of evolution because it enables accounting for discordance among evolutionary histories across the genome. To this end, summary methods have been developed to allow accurate and scalable inference of species trees from gene trees. However, most of these methods, including the widely used ASTRAL, can only handle single-copy gene trees and do not attempt to model gene duplication and gene loss. As a result, most phylogenomic studies have focused on single-copy genes and have discarded large parts of the data. Here, we first propose a measure of quartet similarity between single-copy and multicopy trees that accounts for orthology and paralogy. We then introduce a method called ASTRAL-Pro (ASTRAL for PaRalogs and Orthologs) to find the species tree that optimizes our quartet similarity measure using dynamic programing. By studying its performance on an extensive collection of simulated data sets and on real data sets, we show that ASTRAL-Pro is more accurate than alternative methods.
Collapse
Affiliation(s)
- Chao Zhang
- Bioinformatics and Systems Biology, University of California San Diego, San Diego, CA
| | | | - Erin K Molloy
- Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA
| |
Collapse
|
47
|
Davín AA, Tricou T, Tannier E, de Vienne DM, Szöllősi GJ. Zombi: a phylogenetic simulator of trees, genomes and sequences that accounts for dead linages. Bioinformatics 2020; 36:1286-1288. [PMID: 31566657 PMCID: PMC7031779 DOI: 10.1093/bioinformatics/btz710] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2019] [Revised: 09/09/2019] [Accepted: 09/26/2019] [Indexed: 11/14/2022] Open
Abstract
Summary Here we present Zombi, a tool to simulate the evolution of species, genomes and sequences in silico, that considers for the first time the evolution of genomes in extinct lineages. It also incorporates various features that have not to date been combined in a single simulator, such as the possibility of generating species trees with a pre-defined variation of speciation and extinction rates through time, simulating explicitly intergenic sequences of variable length and outputting gene tree—species tree reconciliations. Availability and implementation Source code and manual are freely available in https://github.com/AADavin/ZOMBI/. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Adrián A Davín
- MTA-ELTE Lendület Evolutionary Genomics Research Group, Budapest, Hungary.,Department of Biological Physics, Eötvös Loránd, Budapest, Hungary
| | - Théo Tricou
- Univ Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Évolutive UMR5558, Villeurbanne F-69622, France
| | - Eric Tannier
- Univ Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Évolutive UMR5558, Villeurbanne F-69622, France.,INRIA Grenoble Rhône-Alpes, Montbonnot-Saint-Martin F-38334, France
| | - Damien M de Vienne
- Univ Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Évolutive UMR5558, Villeurbanne F-69622, France
| | - Gergely J Szöllősi
- MTA-ELTE Lendület Evolutionary Genomics Research Group, Budapest, Hungary.,Department of Biological Physics, Eötvös Loránd, Budapest, Hungary.,Evolutionary Systems Research Group, Centre for Ecological Research, Hungarian Academy of Sciences, Tihany H-8237, Hungary
| |
Collapse
|
48
|
Van Dam MH, Henderson JB, Esposito L, Trautwein M. Genomic Characterization and Curation of UCEs Improves Species Tree Reconstruction. Syst Biol 2020; 70:307-321. [PMID: 32750133 PMCID: PMC7875437 DOI: 10.1093/sysbio/syaa063] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2019] [Revised: 07/26/2020] [Accepted: 07/29/2020] [Indexed: 12/12/2022] Open
Abstract
Ultraconserved genomic elements (UCEs) are generally treated as independent loci in phylogenetic analyses. The identification pipeline for UCE probes does not require prior knowledge of genetic identity, only selecting loci that are highly conserved, single copy, without repeats, and of a particular length. Here, we characterized UCEs from 11 phylogenomic studies across the animal tree of life, from birds to marine invertebrates. We found that within vertebrate lineages, UCEs are mostly intronic and intergenic, while in invertebrates, the majority are in exons. We then curated four different sets of UCE markers by genomic category from five different studies including: birds, mammals, fish, Hymenoptera (ants, wasps, and bees), and Coleoptera (beetles). Of genes captured by UCEs, we find that many are represented by two or more UCEs, corresponding to nonoverlapping segments of a single gene. We considered these UCEs to be nonindependent, merged all UCEs that belonged to a particular gene, constructed gene and species trees, and then evaluated the subsequent effect of merging cogenic UCEs on gene and species tree reconstruction. Average bootstrap support for merged UCE gene trees was significantly improved across all data sets apparently driven by the increase in loci length. Additionally, we conducted simulations and found that gene trees generated from merged UCEs were more accurate than those generated by unmerged UCEs. As loci length improves gene tree accuracy, this modest degree of UCE characterization and curation impacts downstream analyses and demonstrates the advantages of incorporating basic genomic characterizations into phylogenomic analyses. [Anchored hybrid enrichment; ants; ASTRAL; bait capture; carangimorph; Coleoptera; conserved nonexonic elements; exon capture; gene tree; Hymenoptera; mammal; phylogenomic markers; songbird; species tree; ultraconserved elements; weevils.]
Collapse
Affiliation(s)
- Matthew H Van Dam
- Entomology Department, Institute for Biodiversity Science and Sustainability, California Academy of Sciences, 55 Music Concourse Dr., San Francisco, CA 94118, USA.,Center for Comparative Genomics, Institute for Biodiversity Science and Sustainability, California Academy of Sciences, 55 Music Concourse Dr., San Francisco, CA 94118, USA
| | - James B Henderson
- Center for Comparative Genomics, Institute for Biodiversity Science and Sustainability, California Academy of Sciences, 55 Music Concourse Dr., San Francisco, CA 94118, USA
| | - Lauren Esposito
- Entomology Department, Institute for Biodiversity Science and Sustainability, California Academy of Sciences, 55 Music Concourse Dr., San Francisco, CA 94118, USA.,Center for Comparative Genomics, Institute for Biodiversity Science and Sustainability, California Academy of Sciences, 55 Music Concourse Dr., San Francisco, CA 94118, USA
| | - Michelle Trautwein
- Entomology Department, Institute for Biodiversity Science and Sustainability, California Academy of Sciences, 55 Music Concourse Dr., San Francisco, CA 94118, USA.,Center for Comparative Genomics, Institute for Biodiversity Science and Sustainability, California Academy of Sciences, 55 Music Concourse Dr., San Francisco, CA 94118, USA
| |
Collapse
|
49
|
Bobay LM. CoreSimul: a forward-in-time simulator of genome evolution for prokaryotes modeling homologous recombination. BMC Bioinformatics 2020; 21:264. [PMID: 32580695 PMCID: PMC7315543 DOI: 10.1186/s12859-020-03619-x] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2020] [Accepted: 06/19/2020] [Indexed: 12/26/2022] Open
Abstract
Background Prokaryotes are asexual, but these organisms frequently engage in homologous recombination, a process that differs from meiotic recombination in sexual organisms. Most tools developed to simulate genome evolution either assume sexual reproduction or the complete absence of DNA flux in the population. As a result, very few simulators are adapted to model prokaryotic genome evolution while accounting for recombination. Moreover, many simulators are based on the coalescent, which assumes a neutral model of genomic evolution, and those are best suited for organisms evolving under weak selective pressures, such as animals and plants. In contrast, prokaryotes are thought to be evolving under much stronger selective pressures, suggesting that forward-in-time simulators are better suited for these organisms. Results Here, I present CoreSimul, a forward-in-time simulator of core genome evolution for prokaryotes modeling homologous recombination. Simulations are guided by a phylogenetic tree and incorporate different substitution models, including models of codon selection. Conclusions CoreSimul is a flexible forward-in-time simulator that constitutes a significant addition to the limited list of available simulators applicable to prokaryote genome evolution.
Collapse
Affiliation(s)
- Louis-Marie Bobay
- Department of Biology, University of North Carolina Greensboro, 321 McIver Street, PO Box 26170, Greensboro, NC, 27402, USA.
| |
Collapse
|
50
|
Jermiin LS, Catullo RA, Holland BR. A new phylogenetic protocol: dealing with model misspecification and confirmation bias in molecular phylogenetics. NAR Genom Bioinform 2020; 2:lqaa041. [PMID: 33575594 PMCID: PMC7671319 DOI: 10.1093/nargab/lqaa041] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2020] [Revised: 05/18/2020] [Accepted: 06/04/2020] [Indexed: 12/15/2022] Open
Abstract
Molecular phylogenetics plays a key role in comparative genomics and has increasingly significant impacts on science, industry, government, public health and society. In this paper, we posit that the current phylogenetic protocol is missing two critical steps, and that their absence allows model misspecification and confirmation bias to unduly influence phylogenetic estimates. Based on the potential offered by well-established but under-used procedures, such as assessment of phylogenetic assumptions and tests of goodness of fit, we introduce a new phylogenetic protocol that will reduce confirmation bias and increase the accuracy of phylogenetic estimates.
Collapse
Affiliation(s)
- Lars S Jermiin
- CSIRO Land & Water, Canberra, ACT 2601, Australia
- Research School of Biology, Australian National University, Canberra, ACT 2601, Australia
- School of Biology & Environment Science, University College Dublin, Belfield, Dublin 4, Ireland
- Earth Institute, University College Dublin, Belfield, Dublin 4, Ireland
| | - Renee A Catullo
- CSIRO Land & Water, Canberra, ACT 2601, Australia
- Research School of Biology, Australian National University, Canberra, ACT 2601, Australia
- School of Science and Health & Hawkesbury Institute of the Environment, Western Sydney University, Penrith, NSW 2751, Australia
| | - Barbara R Holland
- School of Natural Sciences, University of Tasmania, Hobart, TAS 7001, Australia
| |
Collapse
|