Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For:	Miyazawa S. A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng 1995;8:999-1009. [PMID: 8771180 DOI: 10.1093/protein/8.10.999] [Citation(s) in RCA: 69] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/02/2023]

Number

Cited by Other Article(s)

Zhan Q, Wang N, Jin S, Tan R, Jiang Q, Wang Y. ProbPFP: a multiple sequence alignment algorithm combining hidden Markov model optimized by particle swarm optimization with partition function. BMC Bioinformatics 2019;20:573. [PMID: 31760933 PMCID: PMC6876095 DOI: 10.1186/s12859-019-3132-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

During procedures for conducting multiple sequence alignment, that is so essential to use the substitution score of pairwise alignment. To compute adaptive scores for alignment, researchers usually use Hidden Markov Model or probabilistic consistency methods such as partition function. Recent studies show that optimizing the parameters for hidden Markov model, as well as integrating hidden Markov model with partition function can raise the accuracy of alignment. The combination of partition function and optimized HMM, which could further improve the alignment's accuracy, however, was ignored by these researches.

RESULTS

A novel algorithm for MSA called ProbPFP is presented in this paper. It intergrate optimized HMM by particle swarm with partition function. The algorithm of PSO was applied to optimize HMM's parameters. After that, the posterior probability obtained by the HMM was combined with the one obtained by partition function, and thus to calculate an integrated substitution score for alignment. In order to evaluate the effectiveness of ProbPFP, we compared it with 13 outstanding or classic MSA methods. The results demonstrate that the alignments obtained by ProbPFP got the maximum mean TC scores and mean SP scores on these two benchmark datasets: SABmark and OXBench, and it got the second highest mean TC scores and mean SP scores on the benchmark dataset BAliBASE. ProbPFP is also compared with 4 other outstanding methods, by reconstructing the phylogenetic trees for six protein families extracted from the database TreeFam, based on the alignments obtained by these 5 methods. The result indicates that the reference trees are closer to the phylogenetic trees reconstructed from the alignments obtained by ProbPFP than the other methods.

CONCLUSIONS

We propose a new multiple sequence alignment method combining optimized HMM and partition function in this paper. The performance validates this method could make a great improvement of the alignment's accuracy.

Collapse

Frith MC. How sequence alignment scores correspond to probability models. Bioinformatics 2019;36:408-415. [PMID: 31329241 PMCID: PMC9883716 DOI: 10.1093/bioinformatics/btz576] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2019] [Revised: 05/31/2019] [Accepted: 07/17/2019] [Indexed: 02/03/2023] Open

Guo R, Zhao Y, Zou Q, Fang X, Peng S. Bioinformatics applications on Apache Spark. Gigascience 2018;7:5067872. [PMID: 30101283 PMCID: PMC6113509 DOI: 10.1093/gigascience/giy098] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2018] [Accepted: 07/28/2018] [Indexed: 11/13/2022] Open

Lladós J, Guirado F, Cores F. PPCAS: Implementation of a Probabilistic Pairwise Model for Consistency-Based Multiple Alignment in Apache Spark. In: Ibrahim S, Choo KR, Yan Z, Pedrycz W, editors. Algorithms and Architectures for Parallel Processing. Cham: Springer International Publishing; 2017. pp. 601-10. [DOI: 10.1007/978-3-319-65482-9_45] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/13/2023]

Gudyś A, Deorowicz S. QuickProbs 2: Towards rapid construction of high-quality alignments of large protein families. Sci Rep 2017;7:41553. [PMID: 28139687 PMCID: PMC5282490 DOI: 10.1038/srep41553] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2016] [Accepted: 12/21/2016] [Indexed: 01/05/2023] Open

Herman JL, Novák Á, Lyngsø R, Szabó A, Miklós I, Hein J. Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs. BMC Bioinformatics 2015;16:108. [PMID: 25888064 PMCID: PMC4395974 DOI: 10.1186/s12859-015-0516-1] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2014] [Accepted: 02/24/2015] [Indexed: 11/30/2022] Open

Abstract

BACKGROUND

A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment.

RESULTS

In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased.

CONCLUSIONS

The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference. Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign .

Collapse

Song Y, Hua L, Shapiro BA, Wang JTL. Effective alignment of RNA pseudoknot structures using partition function posterior log-odds scores. BMC Bioinformatics 2015;16:39. [PMID: 25727492 PMCID: PMC4339682 DOI: 10.1186/s12859-015-0464-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2014] [Accepted: 01/13/2015] [Indexed: 11/18/2022] Open

Asai K, Hamada M. RNA structural alignments, part II: non-Sankoff approaches for structural alignments. Methods Mol Biol 2014;1097:291-301. [PMID: 24639165 DOI: 10.1007/978-1-62703-709-9_14] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/05/2023]

Modzelewski M, Dojer N. MSARC: Multiple sequence alignment by residue clustering. Algorithms Mol Biol 2014;9:12. [PMID: 24735785 DOI: 10.1186/1748-7188-9-12] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2013] [Accepted: 04/06/2014] [Indexed: 11/10/2022] Open

Nánási M, Vinař T, Brejová B. Probabilistic approaches to alignment with tandem repeats. Algorithms Mol Biol 2014;9:3. [PMID: 24580741 PMCID: PMC3975930 DOI: 10.1186/1748-7188-9-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/09/2013] [Accepted: 02/24/2014] [Indexed: 11/16/2022] Open

Yoon BJ. Sequence alignment by passing messages. BMC Genomics 2014;15 Suppl 1:S14. [PMID: 24564436 PMCID: PMC4046711 DOI: 10.1186/1471-2164-15-s1-s14] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/27/2022] Open

Roshan U. Multiple sequence alignment using Probcons and Probalign. Methods Mol Biol 2014;1079:147-153. [PMID: 24170400 DOI: 10.1007/978-1-62703-646-7_9] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/02/2023]

Liu Y, Schmidt B. Multiple protein sequence alignment with MSAProbs. Methods Mol Biol 2014;1079:211-8. [PMID: 24170405 DOI: 10.1007/978-1-62703-646-7_14] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/27/2023]

Gotoh O. Heuristic alignment methods. Methods Mol Biol 2014;1079:29-43. [PMID: 24170393 DOI: 10.1007/978-1-62703-646-7_2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Subscribe] [Scholar Register] [Indexed: 10/26/2022]

Rivas E. The four ingredients of single-sequence RNA secondary structure prediction. A unifying perspective. RNA Biol 2013;10:1185-96. [PMID: 23695796 PMCID: PMC3849167 DOI: 10.4161/rna.24971] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2013] [Revised: 05/06/2013] [Accepted: 05/08/2013] [Indexed: 12/31/2022] Open

Hamada M. Fighting against uncertainty: an essential issue in bioinformatics. Brief Bioinform 2013;15:748-67. [PMID: 23803300 DOI: 10.1093/bib/bbt038] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open

Hamada M, Asai K. A classification of bioinformatics algorithms from the viewpoint of maximizing expected accuracy (MEA). J Comput Biol 2012;19:532-49. [PMID: 22313125 DOI: 10.1089/cmb.2011.0197] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

Rivas E, Lang R, Eddy SR. A range of complex probabilistic models for RNA secondary structure prediction that includes the nearest-neighbor model and more. RNA 2012;18:193-212. [PMID: 22194308 PMCID: PMC3264907 DOI: 10.1261/rna.030049.111] [Citation(s) in RCA: 64] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/23/2011] [Accepted: 11/01/2011] [Indexed: 05/07/2023]

Dimitrov R, Gouliamova D. New Method for Sequence Alignment Based on Probabilities of Nucleotide Correspondences. BIOTECHNOL BIOTEC EQ 2012. [DOI: 10.5504/50yrtimb.2011.0039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/27/2022] Open

Dimitrov R, Gouliamova D. Biological Sequence Comparison, Molecular Evolution and Phylogenetics. BIOTECHNOL BIOTEC EQ 2012. [DOI: 10.5504/50yrtimb.2011.0038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/27/2022] Open

Zaki N, Wolfsheimer S, Nuel G, Khuri S. Conotoxin protein classification using free scores of words and support vector machines. BMC Bioinformatics 2011;12:217. [PMID: 21619696 DOI: 10.1186/1471-2105-12-217] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/27/2010] [Accepted: 05/29/2011] [Indexed: 11/23/2022] Open

Hamada M, Kiryu H, Iwasaki W, Asai K. Generalized centroid estimators in bioinformatics. PLoS One 2011;6:e16450. [PMID: 21365017 PMCID: PMC3041832 DOI: 10.1371/journal.pone.0016450] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2010] [Accepted: 12/22/2010] [Indexed: 11/27/2022] Open

Chen H, Kihara D. Effect of using suboptimal alignments in template-based protein structure prediction. Proteins 2011;79:315-34. [PMID: 21058297 PMCID: PMC3058269 DOI: 10.1002/prot.22885] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022]

Liu Y, Schmidt B, Maskell DL. MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities. Bioinformatics 2010;26:1958-64. [PMID: 20576627 DOI: 10.1093/bioinformatics/btq338] [Citation(s) in RCA: 188] [Impact Index Per Article: 13.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open

Sierk ML, Smoot ME, Bass EJ, Pearson WR. Improving pairwise sequence alignment accuracy using near-optimal protein sequence alignments. BMC Bioinformatics 2010;11:146. [PMID: 20307279 PMCID: PMC2850363 DOI: 10.1186/1471-2105-11-146] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2009] [Accepted: 03/22/2010] [Indexed: 11/10/2022] Open

Abstract

Background

While the pairwise alignments produced by sequence similarity searches are a powerful tool for identifying homologous proteins - proteins that share a common ancestor and a similar structure; pairwise sequence alignments often fail to represent accurately the structural alignments inferred from three-dimensional coordinates. Since sequence alignment algorithms produce optimal alignments, the best structural alignments must reflect suboptimal sequence alignment scores. Thus, we have examined a range of suboptimal sequence alignments and a range of scoring parameters to understand better which sequence alignments are likely to be more structurally accurate.

Results

We compared near-optimal protein sequence alignments produced by the Zuker algorithm and a set of probabilistic alignments produced by the probA program with structural alignments produced by four different structure alignment algorithms. There is significant overlap between the solution spaces of structural alignments and both the near-optimal sequence alignments produced by commonly used scoring parameters for sequences that share significant sequence similarity (E-values < 10^-5) and the ensemble of probA alignments. We constructed a logistic regression model incorporating three input variables derived from sets of near-optimal alignments: robustness, edge frequency, and maximum bits-per-position. A ROC analysis shows that this model more accurately classifies amino acid pairs (edges in the alignment path graph) according to the likelihood of appearance in structural alignments than the robustness score alone. We investigated various trimming protocols for removing incorrect edges from the optimal sequence alignment; the most effective protocol is to remove matches from the semi-global optimal alignment that are outside the boundaries of the local alignment, although trimming according to the model-generated probabilities achieves a similar level of improvement. The model can also be used to generate novel alignments by using the probabilities in lieu of a scoring matrix. These alignments are typically better than the optimal sequence alignment, and include novel correct structural edges. We find that the probA alignments sample a larger variety of alignments than the Zuker set, which more frequently results in alignments that are closer to the structural alignments, but that using the probA alignments as input to the regression model does not increase performance.

Conclusions

The pool of suboptimal pairwise protein sequence alignments substantially overlaps structure-based alignments for pairs with statistically significant similarity, and a regression model based on information contained in this alignment pool improves the accuracy of pairwise alignments with respect to structure-based alignments.

Collapse

Frith MC, Hamada M, Horton P. Parameters for accurate genome alignment. BMC Bioinformatics 2010;11:80. [PMID: 20144198 PMCID: PMC2829014 DOI: 10.1186/1471-2105-11-80] [Citation(s) in RCA: 136] [Impact Index Per Article: 9.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2009] [Accepted: 02/09/2010] [Indexed: 11/25/2022] Open

Wolfsheimer S, Melchert O, Hartmann AK. Finite-temperature local protein sequence alignment: percolation and free-energy distribution. Phys Rev E Stat Nonlin Soft Matter Phys 2009;80:061913. [PMID: 20365196 DOI: 10.1103/physreve.80.061913] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/13/2009] [Indexed: 05/29/2023]

Hamada M, Sato K, Kiryu H, Mituyama T, Asai K. CentroidAlign: fast and accurate aligner for structured RNAs by maximizing expected sum-of-pairs score. Bioinformatics 2009;25:3236-43. [DOI: 10.1093/bioinformatics/btp580] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022] Open

Hamada M, Sato K, Kiryu H, Mituyama T, Asai K. Predictions of RNA secondary structure by combining homologous sequence information. ACTA ACUST UNITED AC 2009;25:i330-8. [PMID: 19478007 PMCID: PMC2687982 DOI: 10.1093/bioinformatics/btp228] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]

Kinjo AR. Profile conditional random fields for modeling protein families with structural information. Biophysics (Nagoya-shi) 2009;5:37-44. [PMID: 27857577 PMCID: PMC5036637 DOI: 10.2142/biophysics.5.37] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2009] [Accepted: 05/12/2009] [Indexed: 12/01/2022] Open

Newberg LA, Lawrence CE. Exact calculation of distributions on integers, with application to sequence alignment. J Comput Biol 2009;16:1-18. [PMID: 19119992 DOI: 10.1089/cmb.2008.0137] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

Morita K, Saito Y, Sato K, Oka K, Hotta K, Sakakibara Y. Genome-wide searching with base-pairing kernel functions for noncoding RNAs: computational and expression analysis of snoRNA families in Caenorhabditis elegans. Nucleic Acids Res 2009;37:999-1009. [PMID: 19129214 PMCID: PMC2647286 DOI: 10.1093/nar/gkn1054] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open

Seemann SE, Gorodkin J, Backofen R. Unifying evolutionary and thermodynamic information for RNA folding of multiple alignments. Nucleic Acids Res 2008;36:6355-62. [PMID: 18836192 PMCID: PMC2582601 DOI: 10.1093/nar/gkn544] [Citation(s) in RCA: 65] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open

Frith MC, Park Y, Sheetlin SL, Spouge JL. The whole alignment and nothing but the alignment: the problem of spurious alignment flanks. Nucleic Acids Res 2008;36:5863-71. [PMID: 18796526 PMCID: PMC2566872 DOI: 10.1093/nar/gkn579] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open

Chen H, Kihara D. Estimating quality of template-based protein models by alignment stability. Proteins 2008;71:1255-74. [PMID: 18041762 DOI: 10.1002/prot.21819] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/26/2022]

Abstract

The error in protein tertiary structure prediction is unavoidable, but it is not explicitly shown in most of the current prediction algorithms. Estimated error of a predicted structure is crucial information for experimental biologists to use the prediction model for design and interpretation of experiments. Here, we propose a method to estimate errors in predicted structures based on the stability of the optimal target-template alignment when compared with a set of suboptimal alignments. The stability of the optimal alignment is quantified by an index named the SuboPtimal Alignment Diversity (SPAD). We implemented SPAD in a profile-based threading algorithm and investigated how well SPAD can indicate errors in threading models using a large benchmark dataset of 5232 alignments. SPAD shows a very good correlation not only to alignment shift errors but also structure-level errors, the root mean square deviation (RMSD) of predicted structure models to the native structures (i.e. global errors), and local errors at each residue position. We have further compared SPAD with seven other quality measures, six from sequence alignment-based measures and one atomic statistical potential, discrete optimized protein energy (DOPE), in terms of the correlation coefficient to the global and local structure-level errors. In terms of the correlation to the RMSD of structure models, when a target and a template are in the same SCOP family, the sequence identity showed a best correlation to the RMSD; in the superfamily level, SPAD was the best; and in the fold level, DOPE was best. However, in a head-to-head comparison, SPAD wins over the other measures. Next, SPAD is compared with three other measures of local errors. In this comparison, SPAD was best in all of the family, the superfamily and the fold levels. Using the discovered correlation, we have also predicted the global and local error of our predicted structures of CASP7 targets by the SPAD. Finally, we proposed a sausage representation of predicted tertiary structures which intuitively indicate the predicted structure and the estimated error range of the structure simultaneously.

Collapse

Newberg LA. Memory-efficient dynamic programming backtrace and pairwise local sequence alignment. ACTA ACUST UNITED AC 2008;24:1772-8. [PMID: 18558620 PMCID: PMC2668612 DOI: 10.1093/bioinformatics/btn308] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2022]

Eddy SR. A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol 2008;4:e1000069. [PMID: 18516236 PMCID: PMC2396288 DOI: 10.1371/journal.pcbi.1000069] [Citation(s) in RCA: 221] [Impact Index Per Article: 13.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2007] [Accepted: 03/26/2008] [Indexed: 11/19/2022] Open

Abstract

Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (λ) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty (“Forward” scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores (“Viterbi” scores) are Gumbel-distributed with constant λ = log 2, and the high scoring tail of Forward scores is exponential with the same constant λ. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments.

Sequence database searches are a fundamental tool of molecular biology, enabling researchers to identify related sequences in other organisms, which often provides invaluable clues to the function and evolutionary history of genes. The power of database searches to detect more and more remote evolutionary relationships – essentially, to look back deeper in time – has improved steadily, with the adoption of more complex and realistic models. However, database searches require not just a realistic scoring model, but also the ability to distinguish good scores from bad ones – the ability to calculate the statistical significance of scores. For many models and scoring schemes, accurate statistical significance calculations have either involved expensive computational simulations, or not been feasible at all. Here, I introduce a probabilistic model of local sequence alignment that has readily predictable score statistics for position-specific profile scoring systems, and not just for traditional optimal alignment scores, but also for more powerful log-likelihood ratio scores derived in a full probabilistic inference framework. These results remove one of the main obstacles that have impeded the use of more powerful and biologically realistic statistical inference methods in sequence homology searches.

Collapse

Webb-Robertson BJ, McCue LA, Lawrence CE. Measuring global credibility with application to local sequence alignment. PLoS Comput Biol 2008;4:e1000077. [PMID: 18464927 DOI: 10.1371/journal.pcbi.1000077] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/29/2007] [Accepted: 03/31/2008] [Indexed: 11/19/2022] Open

Abstract

Computational biology is replete with high-dimensional (high-D) discrete prediction and inference problems, including sequence alignment, RNA structure prediction, phylogenetic inference, motif finding, prediction of pathways, and model selection problems in statistical genetics. Even though prediction and inference in these settings are uncertain, little attention has been focused on the development of global measures of uncertainty. Regardless of the procedure employed to produce a prediction, when a procedure delivers a single answer, that answer is a point estimate selected from the solution ensemble, the set of all possible solutions. For high-D discrete space, these ensembles are immense, and thus there is considerable uncertainty. We recommend the use of Bayesian credibility limits to describe this uncertainty, where a (1−α)%, 0≤α≤1, credibility limit is the minimum Hamming distance radius of a hyper-sphere containing (1−α)% of the posterior distribution. Because sequence alignment is arguably the most extensively used procedure in computational biology, we employ it here to make these general concepts more concrete. The maximum similarity estimator (i.e., the alignment that maximizes the likelihood) and the centroid estimator (i.e., the alignment that minimizes the mean Hamming distance from the posterior weighted ensemble of alignments) are used to demonstrate the application of Bayesian credibility limits to alignment estimators. Application of Bayesian credibility limits to the alignment of 20 human/rodent orthologous sequence pairs and 125 orthologous sequence pairs from six Shewanella species shows that credibility limits of the alignments of promoter sequences of these species vary widely, and that centroid alignments dependably have tighter credibility limits than traditional maximum similarity alignments.

Sequence alignment is the cornerstone capability used by a multitude of computational biology applications, such as phylogeny reconstruction and identification of common regulatory mechanisms. Sequence alignment methods typically seek a high-scoring alignment between a pair of sequences, and assign a statistical significance to this single alignment. However, because a single alignment of two (or more) sequences is a point estimate, it may not be representative of the entire set (ensemble) of possible alignments of those sequences; thus, there may be considerable uncertainty associated with any one alignment among an immense ensemble of possibilities. To address the uncertainty of a proposed alignment, we used a Bayesian probabilistic approach to assess an alignment's reliability in the context of the entire ensemble of possible alignments. Our approach performs a global assessment of the degree to which the members of the ensemble depart from a selected alignment, thereby determining a credibility limit. In an evaluation of the popular maximum similarity alignment and the centroid alignment (i.e., the alignment that is in the center of the posterior distribution of alignments), we find that the centroid yields tighter credibility limits (on average) than the maximum similarity alignment. Beyond the usual interest in putting error limits on point estimates, our findings of substantial variability in credibility limits of alignments argue for wider adoption of these limits, so the degree of error is delineated prior to the subsequent use of the alignments.

Collapse

Carvalho LE, Lawrence CE. Centroid estimation in discrete high-dimensional spaces with applications in biology. Proc Natl Acad Sci U S A 2008;105:3209-14. [PMID: 18305160 DOI: 10.1073/pnas.0712329105] [Citation(s) in RCA: 66] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open

Biegert A, Söding J. De novo identification of highly diverged protein repeats by probabilistic consistency. Bioinformatics 2008;24:807-14. [DOI: 10.1093/bioinformatics/btn039] [Citation(s) in RCA: 123] [Impact Index Per Article: 7.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

Roshan U, Chikkagoudar S, Livesay DR. Searching for evolutionary distant RNA homologs within genomic sequences using partition function posterior probabilities. BMC Bioinformatics 2008;9:61. [PMID: 18226231 PMCID: PMC2248559 DOI: 10.1186/1471-2105-9-61] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/31/2007] [Accepted: 01/28/2008] [Indexed: 11/11/2022] Open

Abstract

Background

Identification of RNA homologs within genomic stretches is difficult when pairwise sequence identity is low or unalignable flanking residues are present. In both cases structure-sequence or profile/family-sequence alignment programs become difficult to apply because of unreliable RNA structures or family alignments. As such, local sequence-sequence alignment programs are frequently used instead. We have recently demonstrated that maximal expected accuracy alignments using partition function match probabilities (implemented in Probalign) are significantly better than contemporary methods on heterogeneous length protein sequence datasets, thus suggesting an affinity for local alignment.

Results

We create a pairwise RNA-genome alignment benchmark from RFAM families with average pairwise sequence identity up to 60%. Each dataset contains a query RNA aligned to a target RNA (of the same family) embedded in a genomic sequence at least 5K nucleotides long. To simulate common conditions when exact ends of an ncRNA are unknown, each query RNA has 5' and 3' genomic flanks of size 50, 100, and 150 nucleotides. We subsequently compare the error of the Probalign program (adjusted for local alignment) to the commonly used local alignment programs HMMER, SSEARCH, and BLAST, and the popular ClustalW program with zero end-gap penalties. Parameters were optimized for each program on a small subset of the benchmark. Probalign has overall highest accuracies on the full benchmark. It leads by 10% accuracy over SSEARCH (the next best method) on 5 out of 22 families. On datasets restricted to maximum of 30% sequence identity, Probalign's overall median error is 71.2% vs. 83.4% for SSEARCH (P-value < 0.05). Furthermore, on these datasets Probalign leads SSEARCH by at least 10% on five families; SSEARCH leads Probalign by the same margin on two of the fourteen families. We also demonstrate that the Probalign mean posterior probability, compared to the normalized SSEARCH Z-score, is a better discriminator of alignment quality. All datasets and software are available online.

Conclusion

We demonstrate, for the first time, that partition function match probabilities used for expected accuracy alignment, as done in Probalign, provide statistically significant improvement over current approaches for identifying distantly related RNA sequences in larger genomic segments.

Collapse

Tabei Y, Kiryu H, Kin T, Asai K. A fast structural multiple alignment method for long RNA sequences. BMC Bioinformatics 2008;9:33. [PMID: 18215258 PMCID: PMC2375124 DOI: 10.1186/1471-2105-9-33] [Citation(s) in RCA: 132] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2007] [Accepted: 01/23/2008] [Indexed: 11/10/2022] Open

Kiryu H, Kin T, Asai K. Rfold: an exact algorithm for computing local base pairing probabilities. ACTA ACUST UNITED AC 2007;24:367-73. [PMID: 18056736 DOI: 10.1093/bioinformatics/btm591] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]

Xu X, Ji Y, Stormo GD. RNA Sampler: a new sampling based algorithm for common RNA secondary structure prediction and structural alignment. Bioinformatics 2007;23:1883-91. [PMID: 17537756 DOI: 10.1093/bioinformatics/btm272] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/11/2023] Open

Gront D, Kolinski A. T-Pile--a package for thermodynamic calculations for biomolecules. Bioinformatics 2007;23:1840-2. [PMID: 17510173 DOI: 10.1093/bioinformatics/btm259] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open

Kiryu H, Tabei Y, Kin T, Asai K. Murlet: a practical multiple alignment tool for structural RNA sequences. ACTA ACUST UNITED AC 2007;23:1588-98. [PMID: 17459961 DOI: 10.1093/bioinformatics/btm146] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]

Abstract

MOTIVATION

Structural RNA genes exhibit unique evolutionary patterns that are designed to conserve their secondary structures; these patterns should be taken into account while constructing accurate multiple alignments of RNA genes. The Sankoff algorithm is a natural alignment algorithm that includes the effect of base-pair covariation in the alignment model. However, the extremely high computational cost of the Sankoff algorithm precludes its application to most RNA sequences.

RESULTS

We propose an efficient algorithm for the multiple alignment of structural RNA sequences. Our algorithm is a variant of the Sankoff algorithm, and it uses an efficient scoring system that reduces the time and space requirements considerably without compromising on the alignment quality. First, our algorithm computes the match probability matrix that measures the alignability of each position pair between sequences as well as the base pairing probability matrix for each sequence. These probabilities are then combined to score the alignment using the Sankoff algorithm. By itself, our algorithm does not predict the consensus secondary structure of the alignment but uses external programs for the prediction. We demonstrate that both the alignment quality and the accuracy of the consensus secondary structure prediction from our alignment are the highest among the other programs examined. We also demonstrate that our algorithm can align relatively long RNA sequences such as the eukaryotic-type signal recognition particle RNA that is approximately 300 nt in length; multiple alignment of such sequences has not been possible by using other Sankoff-based algorithms. The algorithm is implemented in the software named 'Murlet'.

AVAILABILITY

The C++ source code of the Murlet software and the test dataset used in this study are available at http://www.ncrna.org/papers/Murlet/.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Collapse

Yeramian E, Debonneuil E. Probabilistic sequence alignments: realistic models with efficient algorithms. Phys Rev Lett 2007;98:078101. [PMID: 17359063 DOI: 10.1103/physrevlett.98.078101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/19/2006] [Indexed: 05/14/2023]

Kiryu H, Kin T, Asai K. Robust prediction of consensus secondary structures using averaged base pairing probability matrices. Bioinformatics 2006;23:434-41. [PMID: 17182698 DOI: 10.1093/bioinformatics/btl636] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

Abstract

MOTIVATION

Recent transcriptomic studies have revealed the existence of a considerable number of non-protein-coding RNA transcripts in higher eukaryotic cells. To investigate the functional roles of these transcripts, it is of great interest to find conserved secondary structures from multiple alignments on a genomic scale. Since multiple alignments are often created using alignment programs that neglect the special conservation patterns of RNA secondary structures for computational efficiency, alignment failures can cause potential risks of overlooking conserved stem structures.

RESULTS

We investigated the dependence of the accuracy of secondary structure prediction on the quality of alignments. We compared three algorithms that maximize the expected accuracy of secondary structures as well as other frequently used algorithms. We found that one of our algorithms, called McCaskill-MEA, was more robust against alignment failures than others. The McCaskill-MEA method first computes the base pairing probability matrices for all the sequences in the alignment and then obtains the base pairing probability matrix of the alignment by averaging over these matrices. The consensus secondary structure is predicted from this matrix such that the expected accuracy of the prediction is maximized. We show that the McCaskill-MEA method performs better than other methods, particularly when the alignment quality is low and when the alignment consists of many sequences. Our model has a parameter that controls the sensitivity and specificity of predictions. We discussed the uses of that parameter for multi-step screening procedures to search for conserved secondary structures and for assigning confidence values to the predicted base pairs.

AVAILABILITY

The C++ source code that implements the McCaskill-MEA algorithm and the test dataset used in this paper are available at http://www.ncrna.org/papers/McCaskillMEA/.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Collapse

Roshan U, Livesay DR. Probalign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics 2006;22:2715-21. [PMID: 16954142 DOI: 10.1093/bioinformatics/btl472] [Citation(s) in RCA: 143] [Impact Index Per Article: 7.9] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open

Pei J, Grishin NV. MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Res 2006;34:4364-74. [PMID: 16936316 PMCID: PMC1636350 DOI: 10.1093/nar/gkl514] [Citation(s) in RCA: 89] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open