1
|
Nute M, Saleh E, Warnow T. Evaluating Statistical Multiple Sequence Alignment in Comparison to Other Alignment Methods on Protein Data Sets. Syst Biol 2019; 68:396-411. [PMID: 30329135 PMCID: PMC6472439 DOI: 10.1093/sysbio/syy068] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2018] [Revised: 09/27/2018] [Accepted: 10/11/2018] [Indexed: 01/15/2023] Open
Abstract
The estimation of multiple sequence alignments of protein sequences is a basic step in many bioinformatics pipelines, including protein structure prediction, protein family identification, and phylogeny estimation. Statistical coestimation of alignments and trees under stochastic models of sequence evolution has long been considered the most rigorous technique for estimating alignments and trees, but little is known about the accuracy of such methods on biological benchmarks. We report the results of an extensive study evaluating the most popular protein alignment methods as well as the statistical coestimation method BAli-Phy on 1192 protein data sets from established benchmarks as well as on 120 simulated data sets. Our study (which used more than 230 CPU years for the BAli-Phy analyses alone) shows that BAli-Phy has better precision and recall (with respect to the true alignments) than the other alignment methods on the simulated data sets but has consistently lower recall on the biological benchmarks (with respect to the reference alignments) than many of the other methods. In other words, we find that BAli-Phy systematically underaligns when operating on biological sequence data but shows no sign of this on simulated data. There are several potential causes for this change in performance, including model misspecification, errors in the reference alignments, and conflicts between structural alignment and evolutionary alignments, and future research is needed to determine the most likely explanation. We conclude with a discussion of the potential ramifications for each of these possibilities. [BAli-Phy; homology; multiple sequence alignment; protein sequences; structural alignment.]
Collapse
Affiliation(s)
- Michael Nute
- Department of Statistics, University of Illinois at Urbana-Champaign, 725 S Wright St #101, Champaign, IL 61820, USA
| | - Ehsan Saleh
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Ave, Urbana, IL 61801, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Ave, Urbana, IL 61801, USA.,Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1205 W. Clark St., Urbana, IL 61801, USA.,National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| |
Collapse
|
2
|
Levy Karin E, Ashkenazy H, Hein J, Pupko T. A Simulation-Based Approach to Statistical Alignment. Syst Biol 2018; 68:252-266. [DOI: 10.1093/sysbio/syy059] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2018] [Accepted: 09/10/2018] [Indexed: 12/26/2022] Open
Affiliation(s)
- Eli Levy Karin
- School of Molecular Cell Biology & Biotechnology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv 69978, Israel
| | - Haim Ashkenazy
- School of Molecular Cell Biology & Biotechnology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv 69978, Israel
| | - Jotun Hein
- School of Molecular Cell Biology & Biotechnology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv 69978, Israel
- Department of Statistics, University of Oxford, Oxford, UK
| | - Tal Pupko
- School of Molecular Cell Biology & Biotechnology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv 69978, Israel
| |
Collapse
|
3
|
Abstract
BACKGROUND Despite the long-anticipated possibility of putting sequence alignment on the same footing as statistical phylogenetics, theorists have struggled to develop time-dependent evolutionary models for indels that are as tractable as the analogous models for substitution events. MAIN TEXT This paper discusses progress in the area of insertion-deletion models, in view of recent work by Ezawa (BMC Bioinformatics 17:304, 2016); (BMC Bioinformatics 17:397, 2016); (BMC Bioinformatics 17:457, 2016) on the calculation of time-dependent gap length distributions in pairwise alignments, and current approaches for extending these approaches from ancestor-descendant pairs to phylogenetic trees. CONCLUSIONS While approximations that use finite-state machines (Pair HMMs and transducers) currently represent the most practical approach to problems such as sequence alignment and phylogeny, more rigorous approaches that work directly with the matrix exponential of the underlying continuous-time Markov chain also show promise, especially in view of recent advances.
Collapse
Affiliation(s)
- Ian H. Holmes
- 0000 0001 2181 7878grid.47840.3fDept of Bioengineering, University of California, Berkeley, 94720 USA
| |
Collapse
|
4
|
Abstract
Background Multiple sequence alignment is an important task in bioinformatics, and alignments of large datasets containing hundreds or thousands of sequences are increasingly of interest. While many alignment methods exist, the most accurate alignments are likely to be based on stochastic models where sequences evolve down a tree with substitutions, insertions, and deletions. While some methods have been developed to estimate alignments under these stochastic models, only the Bayesian method BAli-Phy has been able to run on even moderately large datasets, containing 100 or so sequences. A technique to extend BAli-Phy to enable alignments of thousands of sequences could potentially improve alignment and phylogenetic tree accuracy on large-scale data beyond the best-known methods today. Results We use simulated data with up to 10,000 sequences representing a variety of model conditions, including some that are significantly divergent from the statistical models used in BAli-Phy and elsewhere. We give a method for incorporating BAli-Phy into PASTA and UPP, two strategies for enabling alignment methods to scale to large datasets, and give alignment and tree accuracy results measured against the ground truth from simulations. Comparable results are also given for other methods capable of aligning this many sequences. Conclusions Extensions of BAli-Phy using PASTA and UPP produce significantly more accurate alignments and phylogenetic trees than the current leading methods. Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-3101-8) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Michael Nute
- Department of Statistics, University of Illinois at Urbana-Champaign, 725 S. Wright St, Champaign, 61820, IL, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 North Goodwin Ave, Urbana, 61801, IL, USA. .,Department of Bioengineering, University of Illinois at Urbana-Champaign, 1270 Digital Computing Laboratory, MC-278, Urbana, 61801, IL, USA. .,National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, 1205 W. Clark St., MC-257, Urbana, 61801, IL, USA. .,Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1206 W. Gregory Dr., MC-195, Urbana, 61801, IL, USA.
| |
Collapse
|
5
|
Levy Karin E, Rabin A, Ashkenazy H, Shkedy D, Avram O, Cartwright RA, Pupko T. Inferring Indel Parameters using a Simulation-based Approach. Genome Biol Evol 2015; 7:3226-38. [PMID: 26537226 PMCID: PMC4700945 DOI: 10.1093/gbe/evv212] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Open
Abstract
In this study, we present a novel methodology to infer indel parameters from multiple sequence alignments (MSAs) based on simulations. Our algorithm searches for the set of evolutionary parameters describing indel dynamics which best fits a given input MSA. In each step of the search, we use parametric bootstraps and the Mahalanobis distance to estimate how well a proposed set of parameters fits input data. Using simulations, we demonstrate that our methodology can accurately infer the indel parameters for a large variety of plausible settings. Moreover, using our methodology, we show that indel parameters substantially vary between three genomic data sets: Mammals, bacteria, and retroviruses. Finally, we demonstrate how our methodology can be used to simulate MSAs based on indel parameters inferred from real data sets.
Collapse
Affiliation(s)
- Eli Levy Karin
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, Israel
| | - Avigayel Rabin
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, Israel
| | - Haim Ashkenazy
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, Israel
| | - Dafna Shkedy
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, Israel
| | - Oren Avram
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, Israel The Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
| | - Reed A Cartwright
- The Biodesign Institute, Arizona State University, Tempe School of Life Sciences, Arizona State University, Tempe
| | - Tal Pupko
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, Israel
| |
Collapse
|
6
|
Bouchard-Côté A. A note on probabilistic models over strings: the linear algebra approach. Bull Math Biol 2013; 75:2529-50. [PMID: 24135792 DOI: 10.1007/s11538-013-9906-6] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2013] [Accepted: 09/19/2013] [Indexed: 11/28/2022]
Abstract
Probabilistic models over strings have played a key role in developing methods that take into consideration indels as phylogenetically informative events. There is an extensive literature on using automata and transducers on phylogenies to do inference on these probabilistic models, in which an important theoretical question is the complexity of computing the normalization of a class of string-valued graphical models. This question has been investigated using tools from combinatorics, dynamic programming, and graph theory, and has practical applications in Bayesian phylogenetics. In this work, we revisit this theoretical question from a different point of view, based on linear algebra. The main contribution is a set of results based on this linear algebra view that facilitate the analysis and design of inference algorithms on string-valued graphical models. As an illustration, we use this method to give a new elementary proof of a known result on the complexity of inference on the "TKF91" model, a well-known probabilistic model over strings. Compared to previous work, our proving method is easier to extend to other models, since it relies on a novel weak condition, triangular transducers, which is easy to establish in practice. The linear algebra view provides a concise way of describing transducer algorithms and their compositions, opens the possibility of transferring fast linear algebra libraries (for example, based on GPUs), as well as low rank matrix approximation methods, to string-valued inference problems.
Collapse
Affiliation(s)
- Alexandre Bouchard-Côté
- Department of Statistics, The University of British Columbia, 3182 Earth Sciences Building, 2207 Main Mall, Vancouver, BC, V6T 1Z4, Canada,
| |
Collapse
|
7
|
Warnow T. Large-Scale Multiple Sequence Alignment and Phylogeny Estimation. MODELS AND ALGORITHMS FOR GENOME EVOLUTION 2013. [DOI: 10.1007/978-1-4471-5298-9_6] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/30/2022]
|
8
|
Abstract
The standard approach to phylogeny estimation uses two phases, in which the first phase produces an alignment on a set of homologous sequences, and the second phase estimates a tree on the multiple sequence alignment. POY, a method which seeks a tree/alignment pair minimizing the total treelength, is the most widely used alternative to this two-phase approach. The topological accuracy of trees computed under treelength optimization is, however, controversial. In particular, one study showed that treelength optimization using simple gap penalties produced poor trees and alignments, and suggested the possibility that if POY were used with an affine gap penalty, it might be able to be competitive with the best two-phase methods. In this paper we report on a study addressing this possibility. We present a new heuristic for treelength, called BeeTLe (Better Treelength), that is guaranteed to produce trees at least as short as POY. We then use this heuristic to analyze a large number of simulated and biological datasets, and compare the resultant trees and alignments to those produced using POY and also maximum likelihood (ML) and maximum parsimony (MP) trees computed on a number of alignments. In general, we find that trees produced by BeeTLe are shorter and more topologically accurate than POY trees, but that neither POY nor BeeTLe produces trees as topologically accurate as ML trees produced on standard alignments. These findings, taken as a whole, suggest that treelength optimization is not as good an approach to phylogenetic tree estimation as maximum likelihood based upon good alignment methods.
Collapse
Affiliation(s)
| | - Tandy Warnow
- Department of Computer Science, University of Texas at Austin, Austin, Texas, United States of America
| |
Collapse
|
9
|
Warnow T. Standard maximum likelihood analyses of alignments with gaps can be statistically inconsistent. PLOS CURRENTS 2012; 4:RRN1308. [PMID: 22453901 PMCID: PMC3299439 DOI: 10.1371/currents.rrn1308] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Accepted: 03/12/2012] [Indexed: 12/04/2022]
Abstract
BackgroundMost statistical methods for phylogenetic estimation in use today treat a gap (generally representing an insertion or deletion, i.e., indel) within the input sequence alignment as missing data. However, the statistical properties of this treatment of indels have not been fully investigated.ResultsWe prove that maximum likelihood phylogeny estimation, treating indels as missing data, can be statistically inconsistent for a general (and rather simple) model of sequence evolution, even when given the true alignment. Therefore, accurate phylogeny estimation cannot be guaranteed for maximum likelihood analyses, even given arbitrarily long sequences, when indels are present and treated as missing data.ConclusionsOur result shows that the standard statistical techniques used to estimate phylogenies from sequence alignments may have unfavorable statistical properties, even when the sequence alignment is accurate and the assumed substitution model matches the generation model. This suggests that the recent research focus on developing statistical methods that treat indel events properly is an important direction for phylogeny estimation.
Collapse
Affiliation(s)
- Tandy Warnow
- Professor of Computer Science, Department of Computer Science, University of Texas at Austin
| |
Collapse
|
10
|
Abstract
Natural products researchers are increasingly employing evolutionary analyses of genes and gene products that rely on phylogenetic trees. The field of phylogenetic inference and of evolutionary analyses based on phylogenies is growing at an amazing rate, making it difficult to keep up with the latest methodologies. Here, we summarize phylogenetic applications in natural products research, and review methods and software useful for carrying out analyses inferring or using phylogenetic trees. We include an updated overview of available alignment methods and programs, as well as a selection of some useful phylogenetic analysis tools. This review covers primarily the period 2000-2009 for applications of phylogenetic methods in natural product research, and 1990-2009 for phylogenetic methods, with some references going back to the 1960s.
Collapse
Affiliation(s)
- Imke Schmitt
- Department of Plant Biology and Bell Museum of Natural History, University of Minnesota, 250 Biological Sciences Center, 1445 Gortner Ave., St. Paul, MN 55108, USA.
| | | |
Collapse
|
11
|
Bradley RK, Holmes I. Evolutionary triplet models of structured RNA. PLoS Comput Biol 2009; 5:e1000483. [PMID: 19714212 PMCID: PMC2725318 DOI: 10.1371/journal.pcbi.1000483] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/21/2008] [Accepted: 07/23/2009] [Indexed: 12/31/2022] Open
Abstract
The reconstruction and synthesis of ancestral RNAs is a feasible goal for paleogenetics. This will require new bioinformatics methods, including a robust statistical framework for reconstructing histories of substitutions, indels and structural changes. We describe a "transducer composition" algorithm for extending pairwise probabilistic models of RNA structural evolution to models of multiple sequences related by a phylogenetic tree. This algorithm draws on formal models of computational linguistics as well as the 1985 protosequence algorithm of David Sankoff. The output of the composition algorithm is a multiple-sequence stochastic context-free grammar. We describe dynamic programming algorithms, which are robust to null cycles and empty bifurcations, for parsing this grammar. Example applications include structural alignment of non-coding RNAs, propagation of structural information from an experimentally-characterized sequence to its homologs, and inference of the ancestral structure of a set of diverged RNAs. We implemented the above algorithms for a simple model of pairwise RNA structural evolution; in particular, the algorithms for maximum likelihood (ML) alignment of three known RNA structures and a known phylogeny and inference of the common ancestral structure. We compared this ML algorithm to a variety of related, but simpler, techniques, including ML alignment algorithms for simpler models that omitted various aspects of the full model and also a posterior-decoding alignment algorithm for one of the simpler models. In our tests, incorporation of basepair structure was the most important factor for accurate alignment inference; appropriate use of posterior-decoding was next; and fine details of the model were least important. Posterior-decoding heuristics can be substantially faster than exact phylogenetic inference, so this motivates the use of sum-over-pairs heuristics where possible (and approximate sum-over-pairs). For more exact probabilistic inference, we discuss the use of transducer composition for ML (or MCMC) inference on phylogenies, including possible ways to make the core operations tractable.
Collapse
Affiliation(s)
- Robert K. Bradley
- Biophysics Graduate Group, University of California, Berkeley, California, United States of America
| | - Ian Holmes
- Biophysics Graduate Group, University of California, Berkeley, California, United States of America
- Department of Bioengineering, University of California, Berkeley, California, United States of America
- * E-mail:
| |
Collapse
|
12
|
Abstract
We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment—previously available only with alignment programs which use computationally-expensive Markov Chain Monte Carlo approaches—yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/. Biological sequence alignment is one of the fundamental problems in comparative genomics, yet it remains unsolved. Over sixty sequence alignment programs are listed on Wikipedia, and many new programs are published every year. However, many popular programs suffer from pathologies such as aligning unrelated sequences and producing discordant alignments in protein (amino acid) and codon (nucleotide) space, casting doubt on the accuracy of the inferred alignments. Inaccurate alignments can introduce large and unknown systematic biases into downstream analyses such as phylogenetic tree reconstruction and substitution rate estimation. We describe a new program for multiple sequence alignment which can align protein, RNA and DNA sequence and improves on the accuracy of existing approaches on benchmarks of protein and RNA structural alignments and simulated mammalian and fly genomic alignments. Our approach, which seeks to find the alignment which is closest to the truth under our statistical model, leaves unrelated sequences largely unaligned and produces concordant alignments in protein and codon space. It is fast enough for difficult problems such as aligning orthologous genomic regions or aligning hundreds or thousands of proteins. It furthermore has a companion GUI for visualizing the estimated alignment reliability.
Collapse
|
13
|
Miklós I, Novák Á, Satija R, Lyngsø R, Hein J. Stochastic models of sequence evolution including insertion—deletion events. Stat Methods Med Res 2009; 18:453-85. [DOI: 10.1177/0962280208099500] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
Comparison of sequences that have descended from a common ancestor based on an explicit stochastic model of substitutions, insertions and deletions has risen to prominence in the last decade. Making statements about the positions of insertions-deletions (abbr. indels) is central in sequence and genome analysis and is called alignment. This statistical approach is harder conceptually and computationally, than competing approaches based on choosing an alignment according to some optimality criteria. But it has major practical advantages in terms of testing evolutionary hypotheses and parameter estimation. Basic dynamic approaches can allow the analysis of up to 4—5 sequences. MCMC techniques can bring this to about 10—15 sequences. Beyond this, different or heuristic approaches must be used. Besides the computational challenges, increasing realism in the underlying models is presently being addressed. A recent development that has been especially fruitful is combining statistical alignment with the problem of sequence annotation, making statements about the function of each nucleotide/amino acid. So far gene finding, protein secondary structure prediction and regulatory signal detection has been tackled within this framework. Much progress can be reported, but clearly major challenges remain if this approach is to be central in the analyses of large incoming sequence data sets.
Collapse
Affiliation(s)
- István Miklós
- Bioinformatics Group, Alfréd Rényi Institute of Mathematics, Hungarian Academy of Sciences, 1053 Budapest, Reáltanoda u. 13-15, Hungary, , Bioinformatics Group, Department of Statistics, University of Oxford, 1 South Parks Road, OX1 3TG Oxford, UK, Data Mining and Search Research Group, Computer and Automation Institute, Hungarian Academy of Sciences, 1111 Budapest, Lágymányosi u. 11., Hungary
| | - Ádám Novák
- Bioinformatics Group, Department of Statistics, University of Oxford, 1 South Parks Road, OX1 3TG Oxford, UK
| | - Rahul Satija
- Bioinformatics Group, Department of Statistics, University of Oxford, 1 South Parks Road, OX1 3TG Oxford, UK
| | - Rune Lyngsø
- Bioinformatics Group, Department of Statistics, University of Oxford, 1 South Parks Road, OX1 3TG Oxford, UK
| | - Jotun Hein
- Bioinformatics Group, Department of Statistics, University of Oxford, 1 South Parks Road, OX1 3TG Oxford, UK
| |
Collapse
|
14
|
Probabilistic phylogenetic inference with insertions and deletions. PLoS Comput Biol 2008; 4:e1000172. [PMID: 18787703 PMCID: PMC2527138 DOI: 10.1371/journal.pcbi.1000172] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2007] [Accepted: 07/31/2008] [Indexed: 11/19/2022] Open
Abstract
A fundamental task in sequence analysis is to calculate the probability of a multiple alignment given a phylogenetic tree relating the sequences and an evolutionary model describing how sequences change over time. However, the most widely used phylogenetic models only account for residue substitution events. We describe a probabilistic model of a multiple sequence alignment that accounts for insertion and deletion events in addition to substitutions, given a phylogenetic tree, using a rate matrix augmented by the gap character. Starting from a continuous Markov process, we construct a non-reversible generative (birth-death) evolutionary model for insertions and deletions. The model assumes that insertion and deletion events occur one residue at a time. We apply this model to phylogenetic tree inference by extending the program dnaml in phylip. Using standard benchmarking methods on simulated data and a new "concordance test" benchmark on real ribosomal RNA alignments, we show that the extended program dnamlepsilon improves accuracy relative to the usual approach of ignoring gaps, while retaining the computational efficiency of the Felsenstein peeling algorithm.
Collapse
|
15
|
Abstract
Protein sequence alignment is the task of identifying evolutionarily or structurally related positions in a collection of amino acid sequences. Although the protein alignment problem has been studied for several decades, many recent studies have demonstrated considerable progress in improving the accuracy or scalability of multiple and pairwise alignment tools, or in expanding the scope of tasks handled by an alignment program. In this chapter, we review state-of-the-art protein sequence alignment and provide practical advice for users of alignment tools.
Collapse
Affiliation(s)
- Chuong B Do
- Computer Science Department, Stanford University, Stanford, CA, USA
| | | |
Collapse
|
16
|
Bradley RK, Holmes I. Transducers: an emerging probabilistic framework for modeling indels on trees. Bioinformatics 2007; 23:3258-62. [PMID: 17804440 DOI: 10.1093/bioinformatics/btm402] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
|
17
|
Diallo AB, Makarenkov V, Blanchette M. Exact and heuristic algorithms for the Indel Maximum Likelihood Problem. J Comput Biol 2007; 14:446-61. [PMID: 17572023 DOI: 10.1089/cmb.2007.a006] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Given a multiple alignment of orthologous DNA sequences and a phylogenetic tree for these sequences, we investigate the problem of reconstructing the most likely scenario of insertions and deletions capable of explaining the gaps observed in the alignment. This problem, that we called the Indel Maximum Likelihood Problem (IMLP), is an important step toward the reconstruction of ancestral genomics sequences, and is important for studying evolutionary processes, genome function, adaptation and convergence. We solve the IMLP using a new type of tree hidden Markov model whose states correspond to single-base evolutionary scenarios and where transitions model dependencies between neighboring columns. The standard Viterbi and Forward-backward algorithms are optimized to produce the most likely ancestral reconstruction and to compute the level of confidence associated to specific regions of the reconstruction. A heuristic is presented to make the method practical for large data sets, while retaining an extremely high degree of accuracy. The methods are illustrated on a 1-Mb alignment of the CFTR regions from 12 mammals.
Collapse
Affiliation(s)
- Abdoulaye Banire Diallo
- McGill Centre for Bioinformatics and School of Computer Science, McGill University, Montréal, Québec, Canada
| | | | | |
Collapse
|
18
|
Mitrophanov AY, Borodovsky M. Convergence rate estimation for the TKF91 model of biological sequence length evolution. Math Biosci 2007; 209:470-85. [PMID: 17448505 DOI: 10.1016/j.mbs.2007.02.011] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2006] [Revised: 02/17/2007] [Accepted: 02/23/2007] [Indexed: 10/23/2022]
Abstract
The TKF91 model of biological sequence evolution describes changes in the sequence length via an infinite state-space birth-death process, which we term the TKF91-BD process. The TKF91 model assumes that, for any pair of modern sequences, the ancestral sequence has equilibrium length distribution, an assumption whose validity has not been rigorously investigated. We obtain explicit upper and lower bounds on the rate of convergence to equilibrium for the distribution of the TKF91-BD process. We show that the rate of convergence of the TKF91-BD process for protein sequences with parameter values inferred from sequence data on alpha and beta globins is too low to guarantee convergence to equilibrium on a reasonable timescale. For the analyzed nucleotide sequences, the convergence is faster, but the equilibrium sequence length is unrealistically small. The Jukes-Cantor model of nucleotide substitutions can converge considerably faster than the length evolution model for both amino acid and nucleotide sequences, while the speed of convergence for the Kimura model is close to that for the TKF91-BD process describing nucleotide sequences.
Collapse
|
19
|
Chindelevitch L, Li Z, Blais E, Blanchette M. On the inference of parsimonious indel evolutionary scenarios. J Bioinform Comput Biol 2006; 4:721-44. [PMID: 16960972 DOI: 10.1142/s0219720006002168] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2005] [Revised: 12/02/2005] [Accepted: 12/31/2005] [Indexed: 11/18/2022]
Abstract
Given a multiple alignment of orthologous DNA sequences and a phylogenetic tree for these sequences, we investigate the problem of reconstructing a most parsimonious scenario of insertions and deletions capable of explaining the gaps observed in the alignment. This problem, called the Indel Parsimony Problem, is a crucial component of the problem of ancestral genome reconstruction, and its solution provides valuable information to many genome functional annotation approaches. We first show that the problem is NP-complete. Second, we provide an algorithm, based on the fractional relaxation of an integer linear programming formulation. The algorithm is fast in practice, and the solutions it produces are, in most cases, provably optimal. We describe a divide-and-conquer approach that makes it possible to solve very large instances on a simple desktop machine, while retaining guaranteed optimality. Our algorithms are tested and shown efficient and accurate on a set of 1.8 Mb mammalian orthologous sequences in the CFTR region.
Collapse
Affiliation(s)
- Leonid Chindelevitch
- School of Computer Science, McGill University, 3480 University Street, Montreal, Quebec, H3A 2A7, Canada.
| | | | | | | |
Collapse
|
20
|
Song YS. A Sufficient Condition for Reducing Recursions in Hidden Markov Models. Bull Math Biol 2006; 68:361-84. [PMID: 16794935 DOI: 10.1007/s11538-005-9045-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2004] [Accepted: 05/09/2005] [Indexed: 11/25/2022]
Abstract
In hidden Markov models, the probability of observing a set of strings can be computed using recursion relations. We construct a sufficient condition for simplifying the recursion relations for a certain class of hidden Markov models. If the condition is satisfied, then one can construct a reduced recursion where the dependence on Markov states completely disappears. We discuss a specific example--namely, statistical multiple alignment based on the TKF-model--in which the sufficient condition is satisfied.
Collapse
Affiliation(s)
- Yun S Song
- Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK.
| |
Collapse
|
21
|
Fleissner R, Metzler D, von Haeseler A. Simultaneous statistical multiple alignment and phylogeny reconstruction. Syst Biol 2006; 54:548-61. [PMID: 16085574 DOI: 10.1080/10635150590950371] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022] Open
Abstract
Although the reconstruction of phylogenetic trees and the computation of multiple sequence alignments are highly interdependent, these two areas of research lead quite separate lives, the former often making use of stochastic modeling, whereas the latter normally does not. Despite the fact that reasonable insertion and deletion models for sequence pairs were already introduced more than 10 years ago, they have only recently been applied to multiple alignment and only in their simplest version. In this paper we present and discuss a strategy based on simulated annealing, which makes use of these models to infer a phylogenetic tree for a set of DNA or protein sequences together with the sequences'indel history, i.e., their multiple alignment augmented with information about the positioning of insertion and deletion events in the tree. Our method is also the first application of the TKF2 model in the context of multiple sequence alignment. We validate the method via simulations and illustrate it using a data set of primate mtDNA.
Collapse
Affiliation(s)
- Roland Fleissner
- Initiative for Bioinformatics and Evolutionary Studies (IBEST), University of Idaho Moscow, Idaho 83844-1103, USA.
| | | | | |
Collapse
|
22
|
|
23
|
Abstract
We describe a novel model and algorithm for simultaneously estimating multiple molecular sequence alignments and the phylogenetic trees that relate the sequences. Unlike current techniques that base phylogeny estimates on a single estimate of the alignment, we take alignment uncertainty into account by considering all possible alignments. Furthermore, because the alignment and phylogeny are constructed simultaneously, a guide tree is not needed. This sidesteps the problem in which alignments created by progressive alignment are biased toward the guide tree used to generate them. Joint estimation also allows us to model rate variation between sites when estimating the alignment and to use the evidence in shared insertion/deletions (indels) to group sister taxa in the phylogeny. Our indel model makes use of affine gap penalties and considers indels of multiple letters. We make the simplifying assumption that the indel process is identical on all branches. As a result, the probability of a gap is independent of branch length. We use a Markov chain Monte Carlo (MCMC) method to sample from the posterior of the joint model, estimating the most probable alignment and tree and their support simultaneously. We describe a new MCMC transition kernel that improves our algorithm's mixing efficiency, allowing the MCMC chains to converge even when started from arbitrary alignments. Our software implementation can estimate alignment uncertainty and we describe a method for summarizing this uncertainty in a single plot.
Collapse
Affiliation(s)
- Benjamin D Redelings
- Department of Biomathematics, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1766, USA
| | | |
Collapse
|
24
|
Csurös M, Miklós I. Statistical alignment of retropseudogenes and their functional paralogs. Mol Biol Evol 2005; 22:2457-71. [PMID: 16107591 DOI: 10.1093/molbev/msi238] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022] Open
Abstract
We describe a model for the sequence evolution of a processed pseudogene and its paralog from a common protein-coding ancestor. The model accounts for substitutions, insertions, and deletions and combines nucleotide- and codon-level mutation models. We give a dynamic programming method for calculating the likelihood of homology between two sequences in the model and describe the accompanying alignment algorithm. We also describe how ancestral codons can be computed when the same gene produced multiple pseudogene homologs. We apply our methods to the evolution of human cytochrome c.
Collapse
Affiliation(s)
- Miklós Csurös
- Department of Computer Science and Operations Research, Université de Montréal, succursale Centre-Ville, Montréal, Québec H3C 3J7, Canada.
| | | |
Collapse
|
25
|
Lunter G, Miklós I, Drummond A, Jensen JL, Hein J. Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics 2005; 6:83. [PMID: 15804354 PMCID: PMC1087833 DOI: 10.1186/1471-2105-6-83] [Citation(s) in RCA: 158] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2005] [Accepted: 04/01/2005] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Two central problems in computational biology are the determination of the alignment and phylogeny of a set of biological sequences. The traditional approach to this problem is to first build a multiple alignment of these sequences, followed by a phylogenetic reconstruction step based on this multiple alignment. However, alignment and phylogenetic inference are fundamentally interdependent, and ignoring this fact leads to biased and overconfident estimations. Whether the main interest be in sequence alignment or phylogeny, a major goal of computational biology is the co-estimation of both. RESULTS We developed a fully Bayesian Markov chain Monte Carlo method for coestimating phylogeny and sequence alignment, under the Thorne-Kishino-Felsenstein model of substitution and single nucleotide insertion-deletion (indel) events. In our earlier work, we introduced a novel and efficient algorithm, termed the "indel peeling algorithm", which includes indels as phylogenetically informative evolutionary events, and resembles Felsenstein's peeling algorithm for substitutions on a phylogenetic tree. For a fixed alignment, our extension analytically integrates out both substitution and indel events within a proper statistical model, without the need for data augmentation at internal tree nodes, allowing for efficient sampling of tree topologies and edge lengths. To additionally sample multiple alignments, we here introduce an efficient partial Metropolized independence sampler for alignments, and combine these two algorithms into a fully Bayesian co-estimation procedure for the alignment and phylogeny problem. Our approach results in estimates for the posterior distribution of evolutionary rate parameters, for the maximum a-posteriori (MAP) phylogenetic tree, and for the posterior decoding alignment. Estimates for the evolutionary tree and multiple alignment are augmented with confidence estimates for each node height and alignment column. Our results indicate that the patterns in reliability broadly correspond to structural features of the proteins, and thus provides biologically meaningful information which is not existent in the usual point-estimate of the alignment. Our methods can handle input data of moderate size (10-20 protein sequences, each 100-200 bp), which we analyzed overnight on a standard 2 GHz personal computer. CONCLUSION Joint analysis of multiple sequence alignment, evolutionary trees and additional evolutionary parameters can be now done within a single coherent statistical framework.
Collapse
Affiliation(s)
- Gerton Lunter
- Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK
| | - István Miklós
- MTA-ELTE Theoretical Biology and Ecology Group, Pázmány Péter sétány 1/c 1117 Budapest, Hungary
| | - Alexei Drummond
- Department of Zoology, University of Oxford, South Parks Road, Oxford OX1 3PS, UK
| | - Jens Ledet Jensen
- Department of Mathematical Sciences, University of Aarhus, Ny Munkegade, Building 530, DK-8000 Aarhus C, Denmark
| | - Jotun Hein
- Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK
| |
Collapse
|
26
|
Blanchette M, Green ED, Miller W, Haussler D. Reconstructing large regions of an ancestral mammalian genome in silico. Genome Res 2004; 14:2412-23. [PMID: 15574820 PMCID: PMC534665 DOI: 10.1101/gr.2800104] [Citation(s) in RCA: 106] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2004] [Accepted: 10/05/2004] [Indexed: 11/25/2022]
Abstract
It is believed that most modern mammalian lineages arose from a series of rapid speciation events near the Cretaceous-Tertiary boundary. It is shown that such a phylogeny makes the common ancestral genome sequence an ideal target for reconstruction. Simulations suggest that with methods currently available, we can expect to get 98% of the bases correct in reconstructing megabase-scale euchromatic regions of an eutherian ancestral genome from the genomes of approximately 20 optimally chosen modern mammals. Using actual genomic sequences from 19 extant mammals, we reconstruct 1.1 Mb of ancient genome sequence around the CFTR locus. Detailed examination suggests the reconstruction is accurate and that it allows us to identify features in modern species, such as remnants of ancient transposon insertions, that were not identified by direct analysis. Tracing the predicted evolutionary history of the bases in the reconstructed region, estimates are made of the amount of DNA turnover due to insertion, deletion, and substitution in the different placental mammalian lineages since the common eutherian ancestor, showing considerable variation between lineages. In coming years, such reconstructions may help in identifying and understanding the genetic features common to eutherian mammals and may shed light on the evolution of human or primate-specific traits.
Collapse
Affiliation(s)
- Mathieu Blanchette
- Howard Hughes Medical Institute, University of California, Santa Cruz, California 95064, USA.
| | | | | | | |
Collapse
|
27
|
Knudsen B, Miyamoto MM. Sequence Alignments and Pair Hidden Markov Models Using Evolutionary History. J Mol Biol 2003; 333:453-60. [PMID: 14529629 DOI: 10.1016/j.jmb.2003.08.015] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Abstract
This work presents a novel pairwise statistical alignment method based on an explicit evolutionary model of insertions and deletions (indels). Indel events of any length are possible according to a geometric distribution. The geometric distribution parameter, the indel rate, and the evolutionary time are all maximum likelihood estimated from the sequences being aligned. Probability calculations are done using a pair hidden Markov model (HMM) with transition probabilities calculated from the indel parameters. Equations for the transition probabilities make the pair HMM closely approximate the specified indel model. The method provides an optimal alignment, its likelihood, the likelihood of all possible alignments, and the reliability of individual alignment regions. Human alpha and beta-hemoglobin sequences are aligned, as an illustration of the potential utility of this pair HMM approach.
Collapse
Affiliation(s)
- Bjarne Knudsen
- Department of Zoology, Box 118525, University of Florida, Gainesville, FL 32611-8525, USA.
| | | |
Collapse
|