1
|
De Maio N. The Cumulative Indel Model: Fast and Accurate Statistical Evolutionary Alignment. Syst Biol 2021; 70:236-257. [PMID: 32653921 PMCID: PMC8559576 DOI: 10.1093/sysbio/syaa050] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2019] [Revised: 06/21/2020] [Accepted: 06/23/2020] [Indexed: 01/20/2023] Open
Abstract
Sequence alignment is essential for phylogenetic and molecular evolution inference, as well as in many other areas of bioinformatics and evolutionary biology. Inaccurate alignments can lead to severe biases in most downstream statistical analyses. Statistical alignment based on probabilistic models of sequence evolution addresses these issues by replacing heuristic score functions with evolutionary model-based probabilities. However, score-based aligners and fixed-alignment phylogenetic approaches are still more prevalent than methods based on evolutionary indel models, mostly due to computational convenience. Here, I present new techniques for improving the accuracy and speed of statistical evolutionary alignment. The "cumulative indel model" approximates realistic evolutionary indel dynamics using differential equations. "Adaptive banding" reduces the computational demand of most alignment algorithms without requiring prior knowledge of divergence levels or pseudo-optimal alignments. Using simulations, I show that these methods lead to fast and accurate pairwise alignment inference. Also, I show that it is possible, with these methods, to align and infer evolutionary parameters from a single long synteny block ($\approx$530 kbp) between the human and chimp genomes. The cumulative indel model and adaptive banding can therefore improve the performance of alignment and phylogenetic methods. [Evolutionary alignment; pairHMM; sequence evolution; statistical alignment; statistical genetics.].
Collapse
Affiliation(s)
- Nicola De Maio
- European Molecular Biology Laboratory, European Bioinformatics Institute
(EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| |
Collapse
|
2
|
Maldonado E, Antunes A. LMAP_S: Lightweight Multigene Alignment and Phylogeny eStimation. BMC Bioinformatics 2019; 20:739. [PMID: 31888452 PMCID: PMC6937843 DOI: 10.1186/s12859-019-3292-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2019] [Accepted: 11/26/2019] [Indexed: 01/22/2023] Open
Abstract
Background Recent advances in genome sequencing technologies and the cost drop in high-throughput sequencing continue to give rise to a deluge of data available for downstream analyses. Among others, evolutionary biologists often make use of genomic data to uncover phenotypic diversity and adaptive evolution in protein-coding genes. Therefore, multiple sequence alignments (MSA) and phylogenetic trees (PT) need to be estimated with optimal results. However, the preparation of an initial dataset of multiple sequence file(s) (MSF) and the steps involved can be challenging when considering extensive amount of data. Thus, it becomes necessary the development of a tool that removes the potential source of error and automates the time-consuming steps of a typical workflow with high-throughput and optimal MSA and PT estimations. Results We introduce LMAP_S (Lightweight Multigene Alignment and Phylogeny eStimation), a user-friendly command-line and interactive package, designed to handle an improved alignment and phylogeny estimation workflow: MSF preparation, MSA estimation, outlier detection, refinement, consensus, phylogeny estimation, comparison and editing, among which file and directory organization, execution, manipulation of information are automated, with minimal manual user intervention. LMAP_S was developed for the workstation multi-core environment and provides a unique advantage for processing multiple datasets. Our software, proved to be efficient throughout the workflow, including, the (unlimited) handling of more than 20 datasets. Conclusions We have developed a simple and versatile LMAP_S package enabling researchers to effectively estimate multiple datasets MSAs and PTs in a high-throughput fashion. LMAP_S integrates more than 25 software providing overall more than 65 algorithm choices distributed in five stages. At minimum, one FASTA file is required within a single input directory. To our knowledge, no other software combines MSA and phylogeny estimation with as many alternatives and provides means to find optimal MSAs and phylogenies. Moreover, we used a case study comparing methodologies that highlighted the usefulness of our software. LMAP_S has been developed as an open-source package, allowing its integration into more complex open-source bioinformatics pipelines. LMAP_S package is released under GPLv3 license and is freely available at https://lmap-s.sourceforge.io/.
Collapse
Affiliation(s)
- Emanuel Maldonado
- CIIMAR/CIMAR - Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos, s/n, 4450-208, Porto, Portugal
| | - Agostinho Antunes
- CIIMAR/CIMAR - Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos, s/n, 4450-208, Porto, Portugal. .,Department of Biology, Faculty of Sciences, University of Porto, Rua do Campo Alegre, 4169-007, Porto, Portugal.
| |
Collapse
|
3
|
Abstract
Whole-genome alignment (WGA) is the prediction of evolutionary relationships at the nucleotide level between two or more genomes. It combines aspects of both colinear sequence alignment and gene orthology prediction and is typically more challenging to address than either of these tasks due to the size and complexity of whole genomes. Despite the difficulty of this problem, numerous methods have been developed for its solution because WGAs are valuable for genome-wide analyses such as phylogenetic inference, genome annotation, and function prediction. In this chapter, we discuss the meaning and significance of WGA and present an overview of the methods that address it. We also examine the problem of evaluating whole-genome aligners and offer a set of methodological challenges that need to be tackled in order to make most effective use of our rapidly growing databases of whole genomes.
Collapse
Affiliation(s)
- Colin N Dewey
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA.
| |
Collapse
|
4
|
Herman JL. Enhancing Statistical Multiple Sequence Alignment and Tree Inference Using Structural Information. Methods Mol Biol 2019; 1851:183-214. [PMID: 30298398 DOI: 10.1007/978-1-4939-8736-8_10] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
For highly divergent sequences, there is often insufficient information to reliably construct alignments and phylogenetic trees. Since protein structure may be strongly conserved despite large divergences in sequence, structural information can be used to help identify homology in such cases.While there exist well-studied models of sequence evolution, structurally informed alignment methods have typically made use of geometric measures of deviation that do not take into account the underlying mutational processes. In order to integrate structural information into sequence-based evolutionary models, we recently developed a stochastic model of structural evolution on a phylogenetic tree and implemented this as the StructAlign plugin for the StatAlign statistical alignment package.In this chapter, we will outline the types of analyses that can be carried out using StructAlign, illustrating how the inclusion of structural information can be used to inform joint estimation of alignments and trees. StructAlign can also be used to infer branch-specific rates of structural evolution, and analysis of an example globin dataset highlights strong variation in the inferred rate across the tree. While structure is more highly conserved within clades, the rate of structural divergence as a function of sequence variation is larger between functionally divergent proteins. Allowing for the rate of structural divergence to vary over the tree results in an improved fit to the empirically observed pairwise RMSD values.
Collapse
Affiliation(s)
- Joseph L Herman
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
| |
Collapse
|
5
|
Holmes IH. Historian: accurate reconstruction of ancestral sequences and evolutionary rates. Bioinformatics 2017; 33:1227-1229. [PMID: 28104629 PMCID: PMC6074814 DOI: 10.1093/bioinformatics/btw791] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2016] [Accepted: 12/11/2016] [Indexed: 01/09/2023] Open
Abstract
Motivation Reconstruction of ancestral sequence histories, and estimation of parameters like indel rates, are improved by using explicit evolutionary models and summing over uncertain alignments. The previous best tool for this purpose (according to simulation benchmarks) was ProtPal, but this tool was too slow for practical use. Results Historian combines an efficient reimplementation of the ProtPal algorithm with performance-improving heuristics from other alignment tools. Simulation results on fidelity of rate estimation via ancestral reconstruction, along with evaluations on the structurally informed alignment dataset BAliBase 3.0, recommend Historian over other alignment tools for evolutionary applications. Availability and Implementation Historian is available at https://github.com/evoldoers/historian under the Creative Commons Attribution 3.0 US license. Contact ihholmes+historian@gmail.com.
Collapse
Affiliation(s)
- Ian H Holmes
- Department of Bioengineering, University of California, Berkeley, CA, USA,
| |
Collapse
|
6
|
Abstract
Background Multiple sequence alignment is an important task in bioinformatics, and alignments of large datasets containing hundreds or thousands of sequences are increasingly of interest. While many alignment methods exist, the most accurate alignments are likely to be based on stochastic models where sequences evolve down a tree with substitutions, insertions, and deletions. While some methods have been developed to estimate alignments under these stochastic models, only the Bayesian method BAli-Phy has been able to run on even moderately large datasets, containing 100 or so sequences. A technique to extend BAli-Phy to enable alignments of thousands of sequences could potentially improve alignment and phylogenetic tree accuracy on large-scale data beyond the best-known methods today. Results We use simulated data with up to 10,000 sequences representing a variety of model conditions, including some that are significantly divergent from the statistical models used in BAli-Phy and elsewhere. We give a method for incorporating BAli-Phy into PASTA and UPP, two strategies for enabling alignment methods to scale to large datasets, and give alignment and tree accuracy results measured against the ground truth from simulations. Comparable results are also given for other methods capable of aligning this many sequences. Conclusions Extensions of BAli-Phy using PASTA and UPP produce significantly more accurate alignments and phylogenetic trees than the current leading methods. Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-3101-8) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Michael Nute
- Department of Statistics, University of Illinois at Urbana-Champaign, 725 S. Wright St, Champaign, 61820, IL, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 North Goodwin Ave, Urbana, 61801, IL, USA. .,Department of Bioengineering, University of Illinois at Urbana-Champaign, 1270 Digital Computing Laboratory, MC-278, Urbana, 61801, IL, USA. .,National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, 1205 W. Clark St., MC-257, Urbana, 61801, IL, USA. .,Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1206 W. Gregory Dr., MC-195, Urbana, 61801, IL, USA.
| |
Collapse
|
7
|
General continuous-time Markov model of sequence evolution via insertions/deletions: local alignment probability computation. BMC Bioinformatics 2016; 17:397. [PMID: 27677569 PMCID: PMC5039815 DOI: 10.1186/s12859-016-1167-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2016] [Accepted: 08/09/2016] [Indexed: 11/16/2022] Open
Abstract
Background Insertions and deletions (indels) account for more nucleotide differences between two related DNA sequences than substitutions do, and thus it is imperative to develop a method to reliably calculate the occurrence probabilities of sequence alignments via evolutionary processes on an entire sequence. Previously, we presented a perturbative formulation that facilitates the ab initio calculation of alignment probabilities under a continuous-time Markov model, which describes the stochastic evolution of an entire sequence via indels with quite general rate parameters. And we demonstrated that, under some conditions, the ab initio probability of an alignment can be factorized into the product of an overall factor and contributions from regions (or local alignments) delimited by gapless columns. Results Here, using our formulation, we attempt to approximately calculate the probabilities of local alignments under space-homogeneous cases. First, for each of all types of local pairwise alignments (PWAs) and some typical types of local multiple sequence alignments (MSAs), we numerically computed the total contribution from all parsimonious indel histories and that from all next-parsimonious histories, and compared them. Second, for some common types of local PWAs, we derived two integral equation systems that can be numerically solved to give practically exact solutions. We compared the total parsimonious contribution with the practically exact solution for each such local PWA. Third, we developed an algorithm that calculates the first-approximate MSA probability by multiplying total parsimonious contributions from all local MSAs. Then we compared the first-approximate probability of each local MSA with its absolute frequency in the MSAs created via a genuine sequence evolution simulator, Dawg. In all these analyses, the total parsimonious contributions approximated the multiplication factors fairly well, as long as gap sizes and branch lengths are at most moderate. Examination of the accuracy of another indel probabilistic model in the light of our formulation indicated some modifications necessary for the model’s accuracy improvement. Conclusions At least under moderate conditions, the approximate methods can quite accurately calculate ab initio alignment probabilities under biologically more realistic models than before. Thus, our formulation will provide other indel probabilistic models with a sound reference point. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1167-6) contains supplementary material, which is available to authorized users.
Collapse
|
8
|
Ezawa K. General continuous-time Markov model of sequence evolution via insertions/deletions: are alignment probabilities factorable? BMC Bioinformatics 2016; 17:304. [PMID: 27638547 PMCID: PMC5026781 DOI: 10.1186/s12859-016-1105-7] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2016] [Accepted: 05/26/2016] [Indexed: 11/10/2022] Open
Abstract
Background Insertions and deletions (indels) account for more nucleotide differences between two related DNA sequences than substitutions do, and thus it is imperative to develop a stochastic evolutionary model that enables us to reliably calculate the probability of the sequence evolution through indel processes. Recently, indel probabilistic models are mostly based on either hidden Markov models (HMMs) or transducer theories, both of which give the indel component of the probability of a given sequence alignment as a product of either probabilities of column-to-column transitions or block-wise contributions along the alignment. However, it is not a priori clear how these models are related with any genuine stochastic evolutionary model, which describes the stochastic evolution of an entire sequence along the time-axis. Moreover, currently none of these models can fully accommodate biologically realistic features, such as overlapping indels, power-law indel-length distributions, and indel rate variation across regions. Results Here, we theoretically dissect the ab initio calculation of the probability of a given sequence alignment under a genuine stochastic evolutionary model, more specifically, a general continuous-time Markov model of the evolution of an entire sequence via insertions and deletions. Our model is a simple extension of the general “substitution/insertion/deletion (SID) model”. Using the operator representation of indels and the technique of time-dependent perturbation theory, we express the ab initio probability as a summation over all alignment-consistent indel histories. Exploiting the equivalence relations between different indel histories, we find a “sufficient and nearly necessary” set of conditions under which the probability can be factorized into the product of an overall factor and the contributions from regions separated by gapless columns of the alignment, thus providing a sort of generalized HMM. The conditions distinguish evolutionary models with factorable alignment probabilities from those without ones. The former category includes the “long indel” model (a space-homogeneous SID model) and the model used by Dawg, a genuine sequence evolution simulator. Conclusions With intuitive clarity and mathematical preciseness, our theoretical formulation will help further advance the ab initio calculation of alignment probabilities under biologically realistic models of sequence evolution via indels. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1105-7) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Kiyoshi Ezawa
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Iizuka, 820-8502, Japan. .,Department of Biology and Biochemistry, University of Houston, Houston, TX, 77204-5001, USA.
| |
Collapse
|
9
|
|
10
|
Levy Karin E, Rabin A, Ashkenazy H, Shkedy D, Avram O, Cartwright RA, Pupko T. Inferring Indel Parameters using a Simulation-based Approach. Genome Biol Evol 2015; 7:3226-38. [PMID: 26537226 PMCID: PMC4700945 DOI: 10.1093/gbe/evv212] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Open
Abstract
In this study, we present a novel methodology to infer indel parameters from multiple sequence alignments (MSAs) based on simulations. Our algorithm searches for the set of evolutionary parameters describing indel dynamics which best fits a given input MSA. In each step of the search, we use parametric bootstraps and the Mahalanobis distance to estimate how well a proposed set of parameters fits input data. Using simulations, we demonstrate that our methodology can accurately infer the indel parameters for a large variety of plausible settings. Moreover, using our methodology, we show that indel parameters substantially vary between three genomic data sets: Mammals, bacteria, and retroviruses. Finally, we demonstrate how our methodology can be used to simulate MSAs based on indel parameters inferred from real data sets.
Collapse
Affiliation(s)
- Eli Levy Karin
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, Israel
| | - Avigayel Rabin
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, Israel
| | - Haim Ashkenazy
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, Israel
| | - Dafna Shkedy
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, Israel
| | - Oren Avram
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, Israel The Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
| | - Reed A Cartwright
- The Biodesign Institute, Arizona State University, Tempe School of Life Sciences, Arizona State University, Tempe
| | - Tal Pupko
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, Israel
| |
Collapse
|