1
|
Kuhle B, Chen Q, Schimmel P. tRNA renovatio: Rebirth through fragmentation. Mol Cell 2023; 83:3953-3971. [PMID: 37802077 PMCID: PMC10841463 DOI: 10.1016/j.molcel.2023.09.016] [Citation(s) in RCA: 40] [Impact Index Per Article: 20.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2023] [Revised: 08/15/2023] [Accepted: 09/12/2023] [Indexed: 10/08/2023]
Abstract
tRNA function is based on unique structures that enable mRNA decoding using anticodon trinucleotides. These structures interact with specific aminoacyl-tRNA synthetases and ribosomes using 3D shape and sequence signatures. Beyond translation, tRNAs serve as versatile signaling molecules interacting with other RNAs and proteins. Through evolutionary processes, tRNA fragmentation emerges as not merely random degradation but an act of recreation, generating specific shorter molecules called tRNA-derived small RNAs (tsRNAs). These tsRNAs exploit their linear sequences and newly arranged 3D structures for unexpected biological functions, epitomizing the tRNA "renovatio" (from Latin, meaning renewal, renovation, and rebirth). Emerging methods to uncover full tRNA/tsRNA sequences and modifications, combined with techniques to study RNA structures and to integrate AI-powered predictions, will enable comprehensive investigations of tRNA fragmentation products and new interaction potentials in relation to their biological functions. We anticipate that these directions will herald a new era for understanding biological complexity and advancing pharmaceutical engineering.
Collapse
Affiliation(s)
- Bernhard Kuhle
- Department of Molecular Medicine, The Scripps Research Institute, La Jolla, CA, USA; Department of Cellular Biochemistry, University Medical Center Göttingen, Göttingen, Germany
| | - Qi Chen
- Molecular Medicine Program, Department of Human Genetics, and Division of Urology, Department of Surgery, University of Utah School of Medicine, Salt Lake City, UT, USA
| | - Paul Schimmel
- Department of Molecular Medicine, The Scripps Research Institute, La Jolla, CA, USA.
| |
Collapse
|
2
|
Dupont MJ, Major F. D-ORB: A Web Server to Extract Structural Features of Related But Unaligned RNA Sequences. J Mol Biol 2023; 435:168181. [PMID: 37468182 DOI: 10.1016/j.jmb.2023.168181] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2022] [Revised: 06/02/2023] [Accepted: 06/06/2023] [Indexed: 07/21/2023]
Abstract
Identifying the common structural elements of functionally related RNA sequences (family) is usually based on an alignment of the sequences, which is often subject to human bias and may not be accurate. The resulting covariance model (CM) provides probabilities for each base to covary with another, which allows to support evolutionarily the formation of double helical regions and possibly pseudoknots. The coexistence of alternative folds in RNA, resulting from its dynamic nature, may lead to the potential omission of motifs by CM. To overcome this limitation, we present D-ORB, a system of algorithms that identifies overrepresented motifs in the secondary conformational landscapes of a family when compared to those of unrelated sequences. The algorithms are bundled into an easy-to-use website allowing users to submit a family, and optionally provide unrelated sequences. D-ORB produces a non-pseudoknotted secondary structure based on the overrepresented motifs, a deep neural network classifier and two decision trees. When used to model an Rfam family, D-ORB fits overrepresented motifs in the corresponding Rfam structure; more than a hundred Rfam families have been modeled. The statistical approach behind D-ORB derives the structural composition of an RNA family, making it a valuable tool for analyzing and modeling it. Its easy-to-use interface and advanced algorithms make it an essential resource for researchers studying RNA structure. D-ORB is available at https://d-orb.major.iric.ca/.
Collapse
Affiliation(s)
- Mathieu J Dupont
- Department of Computer Science and Operations Research, and the Institute for Research in Immunology and Cancer, Université de Montréal, Montreal, Quebec H3C 3J7, Canada
| | - François Major
- Department of Computer Science and Operations Research, and the Institute for Research in Immunology and Cancer, Université de Montréal, Montreal, Quebec H3C 3J7, Canada. https://twitter.com/francois_major
| |
Collapse
|
3
|
Sato K, Hamada M. Recent trends in RNA informatics: a review of machine learning and deep learning for RNA secondary structure prediction and RNA drug discovery. Brief Bioinform 2023; 24:bbad186. [PMID: 37232359 PMCID: PMC10359090 DOI: 10.1093/bib/bbad186] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2023] [Revised: 04/24/2023] [Accepted: 04/25/2023] [Indexed: 05/27/2023] Open
Abstract
Computational analysis of RNA sequences constitutes a crucial step in the field of RNA biology. As in other domains of the life sciences, the incorporation of artificial intelligence and machine learning techniques into RNA sequence analysis has gained significant traction in recent years. Historically, thermodynamics-based methods were widely employed for the prediction of RNA secondary structures; however, machine learning-based approaches have demonstrated remarkable advancements in recent years, enabling more accurate predictions. Consequently, the precision of sequence analysis pertaining to RNA secondary structures, such as RNA-protein interactions, has also been enhanced, making a substantial contribution to the field of RNA biology. Additionally, artificial intelligence and machine learning are also introducing technical innovations in the analysis of RNA-small molecule interactions for RNA-targeted drug discovery and in the design of RNA aptamers, where RNA serves as its own ligand. This review will highlight recent trends in the prediction of RNA secondary structure, RNA aptamers and RNA drug discovery using machine learning, deep learning and related technologies, and will also discuss potential future avenues in the field of RNA informatics.
Collapse
Affiliation(s)
- Kengo Sato
- School of System Design and Technology, Tokyo Denki University, 5 Senju Asahi-cho, Adachi-ku, Tokyo 120-8551, Japan
| | - Michiaki Hamada
- Department of Electrical Engineering and Bioscience, Faculty of Science and Engineering, Waseda University, 55N-06-10, 3-4-1, Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL) , National Institute of Advanced Industrial Science and Technology (AIST), 3-4-1, Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
- Graduate School of Medicine, Nippon Medical School, 1-1-5, Sendagi, Bunkyo-ku, Tokyo 113-8602, Japan
| |
Collapse
|
4
|
RNA Secondary Structure Prediction Based on Energy Models. Methods Mol Biol 2023; 2586:89-105. [PMID: 36705900 DOI: 10.1007/978-1-0716-2768-6_6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/28/2023]
Abstract
This chapter introduces the RNA secondary structure prediction based on the nearest neighbor energy model, which is one of the most popular architectures of modeling RNA secondary structure without pseudoknots. We discuss the parameterization and the parameter determination by experimental and machine learning-based approaches as well as an integrated approach that compensates each other's shortcomings. Then, folding algorithms for the minimum free energy and the maximum expected accuracy using the dynamic programming technique are introduced. Finally, we compare the prediction accuracy of the method described so far with benchmark datasets.
Collapse
|
5
|
Xiao H, Yang X, Zhang Y, Zhang Z, Zhang G, Zhang BT. RNA-targeted small-molecule drug discoveries: a machine-learning perspective. RNA Biol 2023; 20:384-397. [PMID: 37337437 PMCID: PMC10283424 DOI: 10.1080/15476286.2023.2223498] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 06/06/2023] [Indexed: 06/21/2023] Open
Abstract
In the past two decades, machine learning (ML) has been extensively adopted in protein-targeted small molecule (SM) discovery. Once trained, ML models could exert their predicting abilities on large volumes of molecules within a short time. However, applying ML approaches to discover RNA-targeted SMs is still in its early stages. This is primarily because of the intrinsic structural instability of RNA molecules that impede the structure-based screening or designing of RNA-targeted SMs. Recently, with more studies revealing RNA structures and a growing number of RNA-targeted ligands being identified, it resulted in an increased interest in the field of drugging RNA. Undeniably, intracellular RNA is much more abundant than protein and, if successfully targeted, will be a major alternative target for therapeutics. Therefore, in this context, as well as under the premise of having RNA-related research data, ML-based methods can get involved in improving the speed of traditional experimental processes. [Figure: see text].
Collapse
Affiliation(s)
- Huan Xiao
- School of Chinese Medicine, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong, China
| | - Xin Yang
- School of Chinese Medicine, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong, China
| | - Yihao Zhang
- School of Chinese Medicine, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong, China
| | - Zongkang Zhang
- School of Chinese Medicine, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong, China
| | - Ge Zhang
- Law Sau Fai Institute for Advancing Translational Medicine in Bone & Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, China
- Institute of Integrated Bioinformedicine and Translational Science, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, China
- Institute of Precision Medicine and Innovative Drug Discovery, HKBU Institute for Research and Continuing Education, Shenzhen, China
| | - Bao-Ting Zhang
- School of Chinese Medicine, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong, China
| |
Collapse
|
6
|
Loomis SP, Crutchfield JP. Exploring predictive states via Cantor embeddings and Wasserstein distance. CHAOS (WOODBURY, N.Y.) 2022; 32:123115. [PMID: 36587324 DOI: 10.1063/5.0102603] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/10/2022] [Accepted: 11/02/2022] [Indexed: 06/17/2023]
Abstract
Predictive states for stochastic processes are a nonparametric and interpretable construct with relevance across a multitude of modeling paradigms. Recent progress on the self-supervised reconstruction of predictive states from time-series data focused on the use of reproducing kernel Hilbert spaces. Here, we examine how Wasserstein distances may be used to detect predictive equivalences in symbolic data. We compute Wasserstein distances between distributions over sequences ("predictions") using a finite-dimensional embedding of sequences based on the Cantor set for the underlying geometry. We show that exploratory data analysis using the resulting geometry via hierarchical clustering and dimension reduction provides insight into the temporal structure of processes ranging from the relatively simple (e.g., generated by finite-state hidden Markov models) to the very complex (e.g., generated by infinite-state indexed grammars).
Collapse
Affiliation(s)
- Samuel P Loomis
- Complexity Sciences Center and Department of Physics and Astronomy, University of California at Davis, One Shields Avenue, Davis, California 95616, USA
| | - James P Crutchfield
- Complexity Sciences Center and Department of Physics and Astronomy, University of California at Davis, One Shields Avenue, Davis, California 95616, USA
| |
Collapse
|
7
|
Iwano N, Adachi T, Aoki K, Nakamura Y, Hamada M. Generative aptamer discovery using RaptGen. NATURE COMPUTATIONAL SCIENCE 2022; 2:378-386. [PMID: 38177576 PMCID: PMC10766510 DOI: 10.1038/s43588-022-00249-6] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/13/2021] [Accepted: 04/21/2022] [Indexed: 01/06/2024]
Abstract
Nucleic acid aptamers are generated by an in vitro molecular evolution method known as systematic evolution of ligands by exponential enrichment (SELEX). Various candidates are limited by actual sequencing data from an experiment. Here we developed RaptGen, which is a variational autoencoder for in silico aptamer generation. RaptGen exploits a profile hidden Markov model decoder to represent motif sequences effectively. We showed that RaptGen embedded simulation sequence data into low-dimensional latent space on the basis of motif information. We also performed sequence embedding using two independent SELEX datasets. RaptGen successfully generated aptamers from the latent space even though they were not included in high-throughput sequencing. RaptGen could also generate a truncated aptamer with a short learning model. We demonstrated that RaptGen could be applied to activity-guided aptamer generation according to Bayesian optimization. We concluded that a generative method by RaptGen and latent representation are useful for aptamer discovery.
Collapse
Affiliation(s)
- Natsuki Iwano
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo, Japan
| | | | | | | | - Michiaki Hamada
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo, Japan.
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan.
- Graduate School of Medicine, Nippon Medical School, Tokyo, Japan.
| |
Collapse
|
8
|
Zhao Q, Zhao Z, Fan X, Yuan Z, Mao Q, Yao Y. Review of machine learning methods for RNA secondary structure prediction. PLoS Comput Biol 2021; 17:e1009291. [PMID: 34437528 PMCID: PMC8389396 DOI: 10.1371/journal.pcbi.1009291] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/05/2022] Open
Abstract
Secondary structure plays an important role in determining the function of noncoding RNAs. Hence, identifying RNA secondary structures is of great value to research. Computational prediction is a mainstream approach for predicting RNA secondary structure. Unfortunately, even though new methods have been proposed over the past 40 years, the performance of computational prediction methods has stagnated in the last decade. Recently, with the increasing availability of RNA structure data, new methods based on machine learning (ML) technologies, especially deep learning, have alleviated the issue. In this review, we provide a comprehensive overview of RNA secondary structure prediction methods based on ML technologies and a tabularized summary of the most important methods in this field. The current pending challenges in the field of RNA secondary structure prediction and future trends are also discussed.
Collapse
Affiliation(s)
- Qi Zhao
- College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, China
| | - Zheng Zhao
- School of Information Science and Technology, Dalian Maritime University, Dalian, Liaoning, China
| | - Xiaoya Fan
- School of Software, Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, Dalian University of Technology, Dalian, Liaoning, China
| | - Zhengwei Yuan
- Key Laboratory of Health Ministry for Congenital Malformation, Shengjing Hospital of China Medical University, Shenyang, Liaoning, China
| | - Qian Mao
- College of Light Industry, Liaoning University, Shenyang, Liaoning, China
- Key Laboratory of Agroproducts Processing Technology, Changchun University, Changchun, Jilin, China
| | - Yudong Yao
- Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, New Jersey, United States of America
| |
Collapse
|
9
|
|
10
|
Novikov IB, Wilkins AD, Lichtarge O. An Evolutionary Trace method defines functionally important bases and sites common to RNA families. PLoS Comput Biol 2020; 16:e1007583. [PMID: 32208421 PMCID: PMC7092961 DOI: 10.1371/journal.pcbi.1007583] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2018] [Accepted: 11/27/2019] [Indexed: 11/18/2022] Open
Abstract
Functional non-coding (fnc)RNAs are nucleotide sequences of varied lengths, structures, and mechanisms that ubiquitously influence gene expression and translation, genome stability and dynamics, and human health and disease. Here, to shed light on their functional determinants, we seek to exploit the evolutionary record of variation and divergence read from sequence comparisons. The approach follows the phylogenetic Evolutionary Trace (ET) paradigm, first developed and extensively validated on proteins. We assigned a relative rank of importance to every base in a study of 1070 functional RNAs, including the ribosome, and observed evolutionary patterns strikingly similar to those seen in proteins, namely, (1) the top-ranked bases clustered in secondary and tertiary structures. (2) In turn, these clusters mapped functional regions for catalysis, binding proteins and drugs, post-transcriptional modification, and deleterious mutations. (3) Moreover, the quantitative quality of these clusters correlated with the identification of functional regions. (4) As a result of this correlation, smoother structural distributions of evolutionary important nucleotides improved functional site predictions. Thus, in practice, phylogenetic analysis can broadly identify functional determinants in RNA sequences and functional sites in RNA structures, and reveal details on the basis of RNA molecular functions. As example of application, we report several previously undocumented and potentially functional ET nucleotide clusters in the ribosome. This work is broadly relevant to studies of structure-function in ribonucleic acids. Additionally, this generalization of ET shows that evolutionary constraints among sequence, structure, and function are similar in structured RNA and proteins. RNA ET is currently available as part of the ET command-line package, and will be available as a web-server. Traditionally, RNA has been delegated to the role of an intermediate between DNA and proteins. However, we now recognize that RNAs are broadly functional beyond their role in translation, and that a number of diverse classes exist. Because functional, non-coding RNAs are prevalent in biology and impact human health, it is important to better understand their functional determinants. However, the classical solution to this problem, targeted mutagenesis, is time-consuming and scales poorly. We propose an alternative computational approach to this problem, the Evolutionary Trace method. Previously developed and validated for proteins, Evolutionary Trace examines evolutionary history of a molecule and predicts evolutionarily important residues in the sequence. We apply Evolutionary Trace to a set of diverse RNAs, and find that the evolutionarily important nucleotides cluster on the three-dimensional structure, and that these clusters closely overlap functional sites. We also find that the clustering property can be used to refine and improve predictions. These findings are in close agreement with our observations of Evolutionary Trace in proteins, and suggest that structured functional RNAs and proteins evolve under similar constraints. In practice, the approach is to be used by RNA researches seeking insight into their molecule of interest, and the Evolutionary Trace program, along with a working example, is available at https://github.com/LichtargeLab/RNA_ET_ms.
Collapse
Affiliation(s)
- Ilya B. Novikov
- Department of Biochemistry and Molecular Biology, Baylor College of Medicine, Houston, Texas, United States of America
| | - Angela D. Wilkins
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
| | - Olivier Lichtarge
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- * E-mail:
| |
Collapse
|
11
|
Nawrocki EP, Jones TA, Eddy SR. Group I introns are widespread in archaea. Nucleic Acids Res 2019; 46:7970-7976. [PMID: 29788499 PMCID: PMC6125680 DOI: 10.1093/nar/gky414] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2018] [Accepted: 05/04/2018] [Indexed: 01/28/2023] Open
Abstract
Group I catalytic introns have been found in bacterial, viral, organellar, and some eukaryotic genomes, but not in archaea. All known archaeal introns are bulge-helix-bulge (BHB) introns, with the exception of a few group II introns. It has been proposed that BHB introns arose from extinct group I intron ancestors, much like eukaryotic spliceosomal introns are thought to have descended from group II introns. However, group I introns have little sequence conservation, making them difficult to detect with standard sequence similarity searches. Taking advantage of recent improvements in a computational homology search method that accounts for both conserved sequence and RNA secondary structure, we have identified 39 group I introns in a wide range of archaeal phyla, including examples of group I introns and BHB introns in the same host gene.
Collapse
Affiliation(s)
- Eric P Nawrocki
- National Center for Biotechnology Information, U.S. National Library of Medicine, Bethesda, MD 20894, USA
| | - Thomas A Jones
- Howard Hughes Medical Institute, Harvard University, Cambridge, USA.,Department of Molecular and Cellular Biology, Harvard University, Cambridge, USA
| | - Sean R Eddy
- Howard Hughes Medical Institute, Harvard University, Cambridge, USA.,Department of Molecular and Cellular Biology, Harvard University, Cambridge, USA.,School of Engineering and Applied Sciences, Harvard University, Cambridge, USA
| |
Collapse
|
12
|
|
13
|
Akiyama M, Sato K, Sakakibara Y. A max-margin training of RNA secondary structure prediction integrated with the thermodynamic model. J Bioinform Comput Biol 2019; 16:1840025. [PMID: 30616476 DOI: 10.1142/s0219720018400255] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023]
Abstract
A popular approach for predicting RNA secondary structure is the thermodynamic nearest-neighbor model that finds a thermodynamically most stable secondary structure with minimum free energy (MFE). For further improvement, an alternative approach that is based on machine learning techniques has been developed. The machine learning-based approach can employ a fine-grained model that includes much richer feature representations with the ability to fit the training data. Although a machine learning-based fine-grained model achieved extremely high performance in prediction accuracy, a possibility of the risk of overfitting for such a model has been reported. In this paper, we propose a novel algorithm for RNA secondary structure prediction that integrates the thermodynamic approach and the machine learning-based weighted approach. Our fine-grained model combines the experimentally determined thermodynamic parameters with a large number of scoring parameters for detailed contexts of features that are trained by the structured support vector machine (SSVM) with the [Formula: see text] regularization to avoid overfitting. Our benchmark shows that our algorithm achieves the best prediction accuracy compared with existing methods, and heavy overfitting cannot be observed. The implementation of our algorithm is available at https://github.com/keio-bioinformatics/mxfold .
Collapse
Affiliation(s)
- Manato Akiyama
- Department of Biosciences and Informatics, Keio University, 3–14–1 Hiyoshi, Kohoku-ku, Yokohama 223–8522, Japan
| | - Kengo Sato
- Department of Biosciences and Informatics, Keio University, 3–14–1 Hiyoshi, Kohoku-ku, Yokohama 223–8522, Japan
| | - Yasubumi Sakakibara
- Department of Biosciences and Informatics, Keio University, 3–14–1 Hiyoshi, Kohoku-ku, Yokohama 223–8522, Japan
| |
Collapse
|
14
|
Arslan AN, Anandan J, Fry E, Monschke K, Ganneboina N, Bowerman J. Efficient RNA structure comparison algorithms. J Bioinform Comput Biol 2017; 15:1740009. [DOI: 10.1142/s0219720017400091] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Recently proposed relative addressing-based ([Formula: see text]) RNA secondary structure representation has important features by which an RNA structure database can be stored into a suffix array. A fast substructure search algorithm has been proposed based on binary search on this suffix array. Using this substructure search algorithm, we present a fast algorithm that finds the largest common substructure of given multiple RNA structures in [Formula: see text] format. The multiple RNA structure comparison problem is NP-hard in its general formulation. We introduced a new problem for comparing multiple RNA structures. This problem has more strict similarity definition and objective, and we propose an algorithm that solves this problem efficiently. We also develop another comparison algorithm that iteratively calls this algorithm to locate nonoverlapping large common substructures in compared RNAs. With the new resulting tools, we improved the RNASSAC website (linked from http://faculty.tamuc.edu/aarslan ). This website now also includes two drawing tools: one specialized for preparing RNA substructures that can be used as input by the search tool, and another one for automatically drawing the entire RNA structure from a given structure sequence.
Collapse
Affiliation(s)
- Abdullah N. Arslan
- Department of Computer Science, Texas A&M University-Commerce, Commerce, TX 75428, USA
| | - Jithendar Anandan
- Department of Computer Science, Texas A&M University-Commerce, Commerce, TX 75428, USA
| | - Eric Fry
- Department of Computer Science, Texas A&M University-Commerce, Commerce, TX 75428, USA
| | - Keith Monschke
- Department of Computer Science, Texas A&M University-Commerce, Commerce, TX 75428, USA
| | - Nitin Ganneboina
- Department of Computer Science, Texas A&M University-Commerce, Commerce, TX 75428, USA
| | - Jason Bowerman
- Department of Computer Science, Texas A&M University-Commerce, Commerce, TX 75428, USA
| |
Collapse
|
15
|
Choudhary K, Deng F, Aviran S. Comparative and integrative analysis of RNA structural profiling data: current practices and emerging questions. QUANTITATIVE BIOLOGY 2017; 5:3-24. [PMID: 28717530 PMCID: PMC5510538 DOI: 10.1007/s40484-017-0093-6] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2016] [Revised: 12/08/2016] [Accepted: 12/15/2016] [Indexed: 12/30/2022]
Abstract
BACKGROUND Structure profiling experiments provide single-nucleotide information on RNA structure. Recent advances in chemistry combined with application of high-throughput sequencing have enabled structure profiling at transcriptome scale and in living cells, creating unprecedented opportunities for RNA biology. Propelled by these experimental advances, massive data with ever-increasing diversity and complexity have been generated, which give rise to new challenges in interpreting and analyzing these data. RESULTS We review current practices in analysis of structure profiling data with emphasis on comparative and integrative analysis as well as highlight emerging questions. Comparative analysis has revealed structural patterns across transcriptomes and has become an integral component of recent profiling studies. Additionally, profiling data can be integrated into traditional structure prediction algorithms to improve prediction accuracy. CONCLUSIONS To keep pace with experimental developments, methods to facilitate, enhance and refine such analyses are needed. Parallel advances in analysis methodology will complement profiling technologies and help them reach their full potential.
Collapse
Affiliation(s)
| | | | - Sharon Aviran
- Department of Biomedical Engineering and Genome Center, University of California at Davis, Davis, CA 95616, USA
| |
Collapse
|
16
|
Huang FW, Reidys CM. Topological language for RNA. Math Biosci 2016; 282:109-120. [DOI: 10.1016/j.mbs.2016.10.006] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2016] [Revised: 10/17/2016] [Accepted: 10/17/2016] [Indexed: 12/26/2022]
|
17
|
Barquist L, Burge SW, Gardner PP. Studying RNA Homology and Conservation with Infernal: From Single Sequences to RNA Families. CURRENT PROTOCOLS IN BIOINFORMATICS 2016; 54:12.13.1-12.13.25. [PMID: 27322404 PMCID: PMC5010141 DOI: 10.1002/cpbi.4] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Emerging high-throughput technologies have led to a deluge of putative non-coding RNA (ncRNA) sequences identified in a wide variety of organisms. Systematic characterization of these transcripts will be a tremendous challenge. Homology detection is critical to making maximal use of functional information gathered about ncRNAs: identifying homologous sequence allows us to transfer information gathered in one organism to another quickly and with a high degree of confidence. ncRNA presents a challenge for homology detection, as the primary sequence is often poorly conserved and de novo secondary structure prediction and search remain difficult. This unit introduces methods developed by the Rfam database for identifying "families" of homologous ncRNAs starting from single "seed" sequences, using manually curated sequence alignments to build powerful statistical models of sequence and structure conservation known as covariance models (CMs), implemented in the Infernal software package. We provide a step-by-step iterative protocol for identifying ncRNA homologs and then constructing an alignment and corresponding CM. We also work through an example for the bacterial small RNA MicA, discovering a previously unreported family of divergent MicA homologs in genus Xenorhabdus in the process. © 2016 by John Wiley & Sons, Inc.
Collapse
Affiliation(s)
- Lars Barquist
- Institute for Molecular Infection Biology, University of Würzburg, Würzburg, D-97080 Germany
- Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA United Kingdom; Fax: +44 (0)1223 494919
| | - Sarah W. Burge
- Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA United Kingdom; Fax: +44 (0)1223 494919
| | - Paul P. Gardner
- School of Biological Sciences, University of Canterbury, Private Bag 4800, Christchurch, New Zealand
- Biomolecular Interaction Centre, University of Canterbury, Private Bag 4800, Christchurch, New Zealand
| |
Collapse
|
18
|
Sahoo S, Świtnicki MP, Pedersen JS. ProbFold: a probabilistic method for integration of probing data in RNA secondary structure prediction. Bioinformatics 2016; 32:2626-35. [PMID: 27153612 DOI: 10.1093/bioinformatics/btw175] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2015] [Accepted: 03/28/2016] [Indexed: 12/22/2022] Open
Abstract
MOTIVATION Recently, new RNA secondary structure probing techniques have been developed, including Next Generation Sequencing based methods capable of probing transcriptome-wide. These techniques hold great promise for improving structure prediction accuracy. However, each new data type comes with its own signal properties and biases, which may even be experiment specific. There is therefore a growing need for RNA structure prediction methods that can be automatically trained on new data types and readily extended to integrate and fully exploit multiple types of data. RESULTS Here, we develop and explore a modular probabilistic approach for integrating probing data in RNA structure prediction. It can be automatically trained given a set of known structures with probing data. The approach is demonstrated on SHAPE datasets, where we evaluate and selectively model specific correlations. The approach often makes superior use of the probing data signal compared to other methods. We illustrate the use of ProbFold on multiple data types using both simulations and a small set of structures with both SHAPE, DMS and CMCT data. Technically, the approach combines stochastic context-free grammars (SCFGs) with probabilistic graphical models. This approach allows rapid adaptation and integration of new probing data types. AVAILABILITY AND IMPLEMENTATION ProbFold is implemented in C ++. Models are specified using simple textual formats. Data reformatting is done using separate C ++ programs. Source code, statically compiled binaries for x86 Linux machines, C ++ programs, example datasets and a tutorial is available from http://moma.ki.au.dk/prj/probfold/ CONTACT : jakob.skou@clin.au.dk SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Sudhakar Sahoo
- Department of Molecular Medicine (MOMA), Aarhus University Hospital, Aarhus N 8200, Denmark
| | - Michał P Świtnicki
- Department of Molecular Medicine (MOMA), Aarhus University Hospital, Aarhus N 8200, Denmark
| | - Jakob Skou Pedersen
- Department of Molecular Medicine (MOMA), Aarhus University Hospital, Aarhus N 8200, Denmark Bioinformatics Research Centre, Aarhus University, Aarhus C DK-8000, Denmark
| |
Collapse
|
19
|
Zirbel CL, Roll J, Sweeney BA, Petrov AI, Pirrung M, Leontis NB. Identifying novel sequence variants of RNA 3D motifs. Nucleic Acids Res 2015; 43:7504-20. [PMID: 26130723 PMCID: PMC4551918 DOI: 10.1093/nar/gkv651] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2015] [Accepted: 05/29/2015] [Indexed: 02/06/2023] Open
Abstract
Predicting RNA 3D structure from sequence is a major challenge in biophysics. An important sub-goal is accurately identifying recurrent 3D motifs from RNA internal and hairpin loop sequences extracted from secondary structure (2D) diagrams. We have developed and validated new probabilistic models for 3D motif sequences based on hybrid Stochastic Context-Free Grammars and Markov Random Fields (SCFG/MRF). The SCFG/MRF models are constructed using atomic-resolution RNA 3D structures. To parameterize each model, we use all instances of each motif found in the RNA 3D Motif Atlas and annotations of pairwise nucleotide interactions generated by the FR3D software. Isostericity relations between non-Watson–Crick basepairs are used in scoring sequence variants. SCFG techniques model nested pairs and insertions, while MRF ideas handle crossing interactions and base triples. We use test sets of randomly-generated sequences to set acceptance and rejection thresholds for each motif group and thus control the false positive rate. Validation was carried out by comparing results for four motif groups to RMDetect. The software developed for sequence scoring (JAR3D) is structured to automatically incorporate new motifs as they accumulate in the RNA 3D Motif Atlas when new structures are solved and is available free for download.
Collapse
Affiliation(s)
- Craig L Zirbel
- Department of Mathematics and Statistics, Bowling Green State University, Bowling Green, OH 43403, USA
| | - James Roll
- Department of Mathematics and Statistics, Bowling Green State University, Bowling Green, OH 43403, USA
| | - Blake A Sweeney
- Department of Biology, Bowling Green State University, Bowling Green, OH 43403, USA
| | - Anton I Petrov
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Meg Pirrung
- Department of Pharmacology, University of Colorado Denver, Aurora, CO 80045, USA
| | - Neocles B Leontis
- Department of Chemistry, Bowling Green State University, Bowling Green, OH 43403, USA
| |
Collapse
|
20
|
Datta S, Mukhopadhyay S. A grammar inference approach for predicting kinase specific phosphorylation sites. PLoS One 2015; 10:e0122294. [PMID: 25886273 PMCID: PMC4401752 DOI: 10.1371/journal.pone.0122294] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/21/2014] [Accepted: 02/13/2015] [Indexed: 01/22/2023] Open
Abstract
Kinase mediated phosphorylation site detection is the key mechanism of post translational mechanism that plays an important role in regulating various cellular processes and phenotypes. Many diseases, like cancer are related with the signaling defects which are associated with protein phosphorylation. Characterizing the protein kinases and their substrates enhances our ability to understand the mechanism of protein phosphorylation and extends our knowledge of signaling network; thereby helping us to treat such diseases. Experimental methods for predicting phosphorylation sites are labour intensive and expensive. Also, manifold increase of protein sequences in the databanks over the years necessitates the improvement of high speed and accurate computational methods for predicting phosphorylation sites in protein sequences. Till date, a number of computational methods have been proposed by various researchers in predicting phosphorylation sites, but there remains much scope of improvement. In this communication, we present a simple and novel method based on Grammatical Inference (GI) approach to automate the prediction of kinase specific phosphorylation sites. In this regard, we have used a popular GI algorithm Alergia to infer Deterministic Stochastic Finite State Automata (DSFA) which equally represents the regular grammar corresponding to the phosphorylation sites. Extensive experiments on several datasets generated by us reveal that, our inferred grammar successfully predicts phosphorylation sites in a kinase specific manner. It performs significantly better when compared with the other existing phosphorylation site prediction methods. We have also compared our inferred DSFA with two other GI inference algorithms. The DSFA generated by our method performs superior which indicates that our method is robust and has a potential for predicting the phosphorylation sites in a kinase specific manner.
Collapse
Affiliation(s)
- Sutapa Datta
- Department of Biophysics, Molecular Biology and Bioinformatics and Distributed Information Centre for Bioinformatics, University of Calcutta, Kolkata, West Bengal, India
| | - Subhasis Mukhopadhyay
- Department of Biophysics, Molecular Biology and Bioinformatics and Distributed Information Centre for Bioinformatics, University of Calcutta, Kolkata, West Bengal, India
| |
Collapse
|
21
|
Gardner PP, Fasold M, Burge SW, Ninova M, Hertel J, Kehr S, Steeves TE, Griffiths-Jones S, Stadler PF. Conservation and losses of non-coding RNAs in avian genomes. PLoS One 2015; 10:e0121797. [PMID: 25822729 PMCID: PMC4378963 DOI: 10.1371/journal.pone.0121797] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2014] [Accepted: 02/03/2015] [Indexed: 11/21/2022] Open
Abstract
Here we present the results of a large-scale bioinformatics annotation of non-coding RNA loci in 48 avian genomes. Our approach uses probabilistic models of hand-curated families from the Rfam database to infer conserved RNA families within each avian genome. We supplement these annotations with predictions from the tRNA annotation tool, tRNAscan-SE and microRNAs from miRBase. We identify 34 lncRNA-associated loci that are conserved between birds and mammals and validate 12 of these in chicken. We report several intriguing cases where a reported mammalian lncRNA, but not its function, is conserved. We also demonstrate extensive conservation of classical ncRNAs (e.g., tRNAs) and more recently discovered ncRNAs (e.g., snoRNAs and miRNAs) in birds. Furthermore, we describe numerous “losses” of several RNA families, and attribute these to either genuine loss, divergence or missing data. In particular, we show that many of these losses are due to the challenges associated with assembling avian microchromosomes. These combined results illustrate the utility of applying homology-based methods for annotating novel vertebrate genomes.
Collapse
Affiliation(s)
- Paul P. Gardner
- School of Biological Sciences, University of Canterbury, Christchurch, New Zealand
- Biomolecular Interaction Centre, University of Canterbury, Christchurch, New Zealand
- * E-mail:
| | - Mario Fasold
- Bioinformatics Group, Department of Computer Science; and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstrasse 16-18, D-04107 Leipzig, Germany
- ecSeq Bioinformatics, Brandvorwerkstr.43, D-04275 Leipzig, Germany
| | - Sarah W. Burge
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK
| | - Maria Ninova
- Faculty of Life Sciences, University of Manchester, Manchester, United Kingdom
| | - Jana Hertel
- Bioinformatics Group, Department of Computer Science; and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstrasse 16-18, D-04107 Leipzig, Germany
| | - Stephanie Kehr
- Bioinformatics Group, Department of Computer Science; and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstrasse 16-18, D-04107 Leipzig, Germany
| | - Tammy E. Steeves
- School of Biological Sciences, University of Canterbury, Christchurch, New Zealand
| | - Sam Griffiths-Jones
- Faculty of Life Sciences, University of Manchester, Manchester, United Kingdom
| | - Peter F. Stadler
- Bioinformatics Group, Department of Computer Science; and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstrasse 16-18, D-04107 Leipzig, Germany
- Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany
- Fraunhofer Institute for Cell Therapy and Immunology, Perlickstrasse 1, D-04103 Leipzig, Germany
- Department of Theoretical Chemistry of the University of Vienna, Währingerstrasse 17, A-1090 Vienna, Austria
- Center for RNA in Technology and Health, Univ. Copenhagen, Grønnegårdsvej 3, Frederiksberg C, Denmark
- Santa Fe Institute, 1399 Hyde Park Road, Santa Fe NM 87501, USA
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Germany
| |
Collapse
|
22
|
Gardner PP, Eldai H. Annotating RNA motifs in sequences and alignments. Nucleic Acids Res 2015; 43:691-8. [PMID: 25520192 PMCID: PMC4333381 DOI: 10.1093/nar/gku1327] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2014] [Revised: 11/30/2014] [Accepted: 12/05/2014] [Indexed: 11/21/2022] Open
Abstract
RNA performs a diverse array of important functions across all cellular life. These functions include important roles in translation, building translational machinery and maturing messenger RNA. More recent discoveries include the miRNAs and bacterial sRNAs that regulate gene expression, the thermosensors, riboswitches and other cis-regulatory elements that help prokaryotes sense their environment and eukaryotic piRNAs that suppress transposition. However, there can be a long period between the initial discovery of a RNA and determining its function. We present a bioinformatic approach to characterize RNA motifs, which are critical components of many RNA structure-function relationships. These motifs can, in some instances, provide researchers with functional hypotheses for uncharacterized RNAs. Moreover, we introduce a new profile-based database of RNA motifs--RMfam--and illustrate some applications for investigating the evolution and functional characterization of RNA. All the data and scripts associated with this work are available from: https://github.com/ppgardne/RMfam.
Collapse
Affiliation(s)
- Paul P Gardner
- School of Biological Sciences, University of Canterbury, Private Bag 4800, Christchurch 8140, New Zealand Biomolecular Interaction Centre, University of Canterbury, Private Bag 4800, Christchurch 8140, New Zealand
| | - Hisham Eldai
- School of Biological Sciences, University of Canterbury, Private Bag 4800, Christchurch 8140, New Zealand Biomolecular Interaction Centre, University of Canterbury, Private Bag 4800, Christchurch 8140, New Zealand
| |
Collapse
|
23
|
Abstract
Stochastic context free grammars are a formalism which plays a prominent role in RNA secondary structure analysis. This chapter provides the theoretical background on stochastic context free grammars. We recall the general definitions and study the basic properties, virtues, and shortcomings of stochastic context free grammars. We then introduce two ways in which they are used in RNA secondary structure analysis, secondary structure prediction and RNA family modeling. This prepares for the discussion of applications of stochastic context free grammars in the chapters on RFAM (6), Pfold (8), and INFERNAL (9).
Collapse
|
24
|
Li N, Cushing W, Kambhampati S, Yoon S. Learning Probabilistic Hierarchical Task Networks as Probabilistic Context-Free Grammars to Capture User Preferences. ACM T INTEL SYST TEC 2014. [DOI: 10.1145/2589481] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]
Abstract
We introduce an algorithm to automatically learn probabilistic hierarchical task networks (pHTNs) that capture a user's preferences on plans by observing only the user's behavior. HTNs are a common choice of representation for a variety of purposes in planning, including work on learning in planning. Our contributions are twofold. First, in contrast with prior work, which employs HTNs to represent domain physics or search control knowledge, we use HTNs to model user preferences. Second, while most prior work on HTN learning requires additional information (e.g., annotated traces or tasks) to assist the learning process, our system only takes plan traces as input. Initially, we will assume that users carry out preferred plans more frequently, and thus the observed distribution of plans is an accurate representation of user preference. We then generalize to the situation where feasibility constraints frequently prevent the execution of preferred plans. Taking the prevalent perspective of viewing HTNs as grammars over primitive actions, we adapt an expectation-maximization (EM) technique from the discipline of probabilistic grammar induction to acquire probabilistic context-free grammars (pCFG) that capture the distribution on plans. To account for the difference between the distributions of possible and preferred plans, we subsequently modify this core EM technique by rescaling its input. We empirically demonstrate that the proposed approaches are able to learn HTNs representing user preferences better than the inside-outside algorithm. Furthermore, when feasibility constraints are obfuscated, the algorithm with rescaled input performs better than the algorithm with the original input.
Collapse
Affiliation(s)
- Nan Li
- Carnegie Mellon University, Pittsburgh, PA
| | | | | | | |
Collapse
|
25
|
Sun EI, Rodionov DA. Computational analysis of riboswitch-based regulation. BIOCHIMICA ET BIOPHYSICA ACTA-GENE REGULATORY MECHANISMS 2014; 1839:900-907. [PMID: 24583554 DOI: 10.1016/j.bbagrm.2014.02.011] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/30/2013] [Revised: 01/28/2014] [Accepted: 02/18/2014] [Indexed: 11/17/2022]
Abstract
Advances in computational analysis of riboswitches in the last decade have contributed greatly to our understanding of riboswitch regulatory roles and mechanisms. Riboswitches were originally discovered as part of the sequence analysis of the 5'-untranslated region of mRNAs in the hope of finding novel gene regulatory sites, and the existence of structural RNAs appeared to be a spurious phenomenon. As more riboswitches were discovered, they illustrated the diversity and adaptability of these RNA regulatory sequences. The fact that a chemically monotonous molecule like RNA can discern a wide range of substrates and exert a variety of regulatory mechanisms was subsequently demonstrated in diverse genomes and has hastened the development of sophisticated algorithms for their analysis and prediction. In this review, we focus on some of the computational tools for riboswitch detection and secondary structure prediction. The study of this simple yet efficient form of gene regulation promises to provide a more complete picture of a world that RNA once dominated and allows rational design of artificial riboswitches. This article is part of a Special Issue entitled: Riboswitches.
Collapse
Affiliation(s)
- Eric I Sun
- Department of Molecular Biology, Division of Biological Sciences, University of California at San Diego, La Jolla, CA 92093, USA
| | - Dmitry A Rodionov
- Sanford-Burnham Medical Research Institute, La Jolla, CA 92037, USA; A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow 127994, Russia.
| |
Collapse
|
26
|
Asymptotic distribution of motifs in a stochastic context-free grammar model of RNA folding. J Math Biol 2014; 69:1743-72. [PMID: 24384698 DOI: 10.1007/s00285-013-0750-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2012] [Revised: 08/23/2013] [Indexed: 10/25/2022]
Abstract
We analyze the distribution of RNA secondary structures given by the Knudsen-Hein stochastic context-free grammar used in the prediction program Pfold. Our main theorem gives relations between the expected number of these motifs--independent of the grammar probabilities. These relations are a consequence of proving that the distribution of base pairs, of helices, and of different types of loops is asymptotically Gaussian in this model of RNA folding. Proof techniques use singularity analysis of probability generating functions. We also demonstrate that these asymptotic results capture well the expected number of RNA base pairs in native ribosomal structures, and certain other aspects of their predicted secondary structures. In particular, we find that the predicted structures largely satisfy the expected relations, although the native structures do not.
Collapse
|
27
|
Abstract
De novo discovery of "motifs" capturing the commonalities among related noncoding ncRNA structured RNAs is among the most difficult problems in computational biology. This chapter outlines the challenges presented by this problem, together with some approaches towards solving them, with an emphasis on an approach based on the CMfinder CMfinder program as a case study. Applications to genomic screens for novel de novo structured ncRNA ncRNA s, including structured RNA elements in untranslated portions of protein-coding genes, are presented.
Collapse
Affiliation(s)
- Walter L Ruzzo
- Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA
| | | |
Collapse
|
28
|
Combinatorial Insights into RNA Secondary Structure. DISCRETE AND TOPOLOGICAL MODELS IN MOLECULAR BIOLOGY 2014. [DOI: 10.1007/978-3-642-40193-0_7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
|
29
|
|
30
|
Abstract
Transcriptomics experiments and computational predictions both enable systematic discovery of new functional RNAs. However, many putative noncoding transcripts arise instead from artifacts and biological noise, and current computational prediction methods have high false positive rates. I discuss prospects for improving computational methods for analyzing and identifying functional RNAs, with a focus on detecting signatures of conserved RNA secondary structure. An interesting new front is the application of chemical and enzymatic experiments that probe RNA structure on a transcriptome-wide scale. I review several proposed approaches for incorporating structure probing data into the computational prediction of RNA secondary structure. Using probabilistic inference formalisms, I show how all these approaches can be unified in a well-principled framework, which in turn allows RNA probing data to be easily integrated into a wide range of analyses that depend on RNA secondary structure inference. Such analyses include homology search and genome-wide detection of new structural RNAs.
Collapse
Affiliation(s)
- Sean R Eddy
- Howard Hughes Medical Institute Janelia Farm Research Campus, Ashburn, Virginia 20147;
| |
Collapse
|
31
|
Probabilistic grammatical model for helix-helix contact site classification. Algorithms Mol Biol 2013; 8:31. [PMID: 24350601 PMCID: PMC3892132 DOI: 10.1186/1748-7188-8-31] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2013] [Accepted: 11/28/2013] [Indexed: 11/25/2022] Open
Abstract
Background Hidden Markov Models power many state‐of‐the‐art tools in
the field of protein bioinformatics. While excelling in their tasks, these
methods of protein analysis do not convey directly information on
medium‐ and long‐range residue‐residue interactions. This
requires an expressive power of at least context‐free grammars.
However, application of more powerful grammar formalisms to protein analysis
has been surprisingly limited. Results In this work, we present a probabilistic grammatical framework for
problem‐specific protein languages and apply it to classification of
transmembrane helix‐helix pairs configurations. The core of the model
consists of a probabilistic context‐free grammar, automatically
inferred by a genetic algorithm from only a generic set of
expert‐based rules and positive training samples. The model was
applied to produce sequence based descriptors of four classes of
transmembrane helix‐helix contact site configurations. The highest
performance of the classifiers reached AUCROC of 0.70. The analysis of grammar parse trees revealed the ability
of representing structural features of helix‐helix contact sites. Conclusions We demonstrated that our probabilistic context‐free framework for
analysis of protein sequences outperforms the state of the art in the task
of helix‐helix contact site classification. However, this is achieved
without necessarily requiring modeling long range dependencies between
interacting residues. A significant feature of our approach is that grammar
rules and parse trees are human‐readable. Thus they could provide
biologically meaningful information for molecular biologists.
Collapse
|
32
|
Li X, Kazan H, Lipshitz HD, Morris QD. Finding the target sites of RNA-binding proteins. WILEY INTERDISCIPLINARY REVIEWS-RNA 2013; 5:111-30. [PMID: 24217996 PMCID: PMC4253089 DOI: 10.1002/wrna.1201] [Citation(s) in RCA: 72] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/15/2013] [Revised: 09/27/2013] [Accepted: 10/01/2013] [Indexed: 12/15/2022]
Abstract
RNA–protein interactions differ from DNA–protein interactions because of the central role of RNA secondary structure. Some RNA-binding domains (RBDs) recognize their target sites mainly by their shape and geometry and others are sequence-specific but are sensitive to secondary structure context. A number of small- and large-scale experimental approaches have been developed to measure RNAs associated in vitro and in vivo with RNA-binding proteins (RBPs). Generalizing outside of the experimental conditions tested by these assays requires computational motif finding. Often RBP motif finding is done by adapting DNA motif finding methods; but modeling secondary structure context leads to better recovery of RBP-binding preferences. Genome-wide assessment of mRNA secondary structure has recently become possible, but these data must be combined with computational predictions of secondary structure before they add value in predicting in vivo binding. There are two main approaches to incorporating structural information into motif models: supplementing primary sequence motif models with preferred secondary structure contexts (e.g., MEMERIS and RNAcontext) and directly modeling secondary structure recognized by the RBP using stochastic context-free grammars (e.g., CMfinder and RNApromo). The former better reconstruct known binding preferences for sequence-specific RBPs but are not suitable for modeling RBPs that recognize shape and geometry of RNAs. Future work in RBP motif finding should incorporate interactions between multiple RBDs and multiple RBPs in binding to RNA. WIREs RNA 2014, 5:111–130. doi: 10.1002/wrna.1201
Collapse
Affiliation(s)
- Xiao Li
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
| | | | | | | |
Collapse
|
33
|
Rivas E. The four ingredients of single-sequence RNA secondary structure prediction. A unifying perspective. RNA Biol 2013; 10:1185-96. [PMID: 23695796 PMCID: PMC3849167 DOI: 10.4161/rna.24971] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2013] [Revised: 05/06/2013] [Accepted: 05/08/2013] [Indexed: 12/31/2022] Open
Abstract
Any method for RNA secondary structure prediction is determined by four ingredients. The architecture is the choice of features implemented by the model (such as stacked basepairs, loop length distributions, etc.). The architecture determines the number of parameters in the model. The scoring scheme is the nature of those parameters (whether thermodynamic, probabilistic, or weights). The parameterization stands for the specific values assigned to the parameters. These three ingredients are referred to as "the model." The fourth ingredient is the folding algorithms used to predict plausible secondary structures given the model and the sequence of a structural RNA. Here, I make several unifying observations drawn from looking at more than 40 years of methods for RNA secondary structure prediction in the light of this classification. As a final observation, there seems to be a performance ceiling that affects all methods with complex architectures, a ceiling that impacts all scoring schemes with remarkable similarity. This suggests that modeling RNA secondary structure by using intrinsic sequence-based plausible "foldability" will require the incorporation of other forms of information in order to constrain the folding space and to improve prediction accuracy. This could give an advantage to probabilistic scoring systems since a probabilistic framework is a natural platform to incorporate different sources of information into one single inference problem.
Collapse
Affiliation(s)
- Elena Rivas
- Janelia Farm Research Campus; Howard Hughes Medical Institute; Ashburn, VA USA
| |
Collapse
|
34
|
Ge P, Zhang S. Incorporating phylogenetic-based covarying mutations into RNAalifold for RNA consensus structure prediction. BMC Bioinformatics 2013; 14:142. [PMID: 23621982 PMCID: PMC3691524 DOI: 10.1186/1471-2105-14-142] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2012] [Accepted: 04/04/2013] [Indexed: 01/18/2023] Open
Abstract
Background RNAalifold, a popular computational method for RNA consensus structure prediction, incorporates covarying mutations into a thermodynamic model to fold the aligned RNA sequences. When quantifying covariance, it evaluates conserved signals of two aligned columns with base-pairing rules. This scoring scheme performs better than some other approaches, such as mutual information. However it ignores the phylogenetic history of the aligned sequences, which is an important criterion to evaluate the level of sequence covariance. Results In this article, in order to improve the accuracy of consensus structure folding, we propose a novel approach named PhyloRNAalifold. It incorporates the number of covarying mutations on the phylogenetic tree of the aligned sequences into the covariance scoring of RNAalifold. The benchmarking results show that the new scoring scheme of PhyloRNAalifold can improve the consensus structure detection of RNAalifold. Conclusion Incorporating additional phylogenetic information of aligned sequences into the covariance scoring of RNAalifold can improve its performance of consensus structures folding. This improvement is correlated with alignment characteristics, such as pair-wise identity and the number of sequences in the alignment.
Collapse
Affiliation(s)
- Ping Ge
- Department of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL 32816-2362, USA
| | | |
Collapse
|
35
|
Datta S, Mukhopadhyay S. A composite method based on formal grammar and DNA structural features in detecting human polymerase II promoter region. PLoS One 2013; 8:e54843. [PMID: 23437045 PMCID: PMC3577817 DOI: 10.1371/journal.pone.0054843] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2012] [Accepted: 12/17/2012] [Indexed: 11/25/2022] Open
Abstract
An important step in understanding gene regulation is to identify the promoter regions where the transcription factor binding takes place. Predicting a promoter region de novo has been a theoretical goal for many researchers for a long time. There exists a number of in silico methods to predict the promoter region de novo but most of these methods are still suffering from various shortcomings, a major one being the selection of appropriate features of promoter region distinguishing them from non-promoters. In this communication, we have proposed a new composite method that predicts promoter sequences based on the interrelationship between structural profiles of DNA and primary sequence elements of the promoter regions. We have shown that a Context Free Grammar (CFG) can formalize the relationships between different primary sequence features and by utilizing the CFG, we demonstrate that an efficient parser can be constructed for extracting these relationships from DNA sequences to distinguish the true promoter sequences from non-promoter sequences. Along with CFG, we have extracted the structural features of the promoter region to improve upon the efficiency of our prediction system. Extensive experiments performed on different datasets reveals that our method is effective in predicting promoter sequences on a genome-wide scale and performs satisfactorily as compared to other promoter prediction techniques.
Collapse
Affiliation(s)
- Sutapa Datta
- Department of Biophysics, Molecular Biology and Bioinformatics and Distributed Information Centre for Bioinformatics, University of Calcutta, Kolkata, West Bengal, India.
| | | |
Collapse
|
36
|
Sato K, Kato Y, Akutsu T, Asai K, Sakakibara Y. DAFS: simultaneous aligning and folding of RNA sequences via dual decomposition. ACTA ACUST UNITED AC 2012; 28:3218-24. [PMID: 23060618 DOI: 10.1093/bioinformatics/bts612] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022]
Abstract
MOTIVATION It is well known that the accuracy of RNA secondary structure prediction from a single sequence is limited, and thus a comparative approach that predicts a common secondary structure from aligned sequences is a better choice if homologous sequences with reliable alignments are available. However, correct secondary structure information is needed to produce reliable alignments of RNA sequences. To tackle this dilemma, we require a fast and accurate aligner that takes structural information into consideration to yield reliable structural alignments, which are suitable for common secondary structure prediction. RESULTS We develop DAFS, a novel algorithm that simultaneously aligns and folds RNA sequences based on maximizing expected accuracy of a predicted common secondary structure and its alignment. DAFS decomposes the pairwise structural alignment problem into two independent secondary structure prediction problems and one pairwise (non-structural) alignment problem by the dual decomposition technique, and maintains the consistency of a pairwise structural alignment by imposing penalties on inconsistent base pairs and alignment columns that are iteratively updated. Furthermore, we extend DAFS to consider pseudoknots in RNA structural alignments by integrating IPknot for predicting a pseudoknotted structure. The experiments on publicly available datasets showed that DAFS can produce reliable structural alignments from unaligned sequences in terms of accuracy of common secondary structure prediction.
Collapse
Affiliation(s)
- Kengo Sato
- Department of Biosciences and Informatics, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan.
| | | | | | | | | |
Collapse
|
37
|
Fitch WT, Friederici AD. Artificial grammar learning meets formal language theory: an overview. Philos Trans R Soc Lond B Biol Sci 2012; 367:1933-55. [PMID: 22688631 PMCID: PMC3367694 DOI: 10.1098/rstb.2012.0103] [Citation(s) in RCA: 100] [Impact Index Per Article: 7.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Formal language theory (FLT), part of the broader mathematical theory of computation, provides a systematic terminology and set of conventions for describing rules and the structures they generate, along with a rich body of discoveries and theorems concerning generative rule systems. Despite its name, FLT is not limited to human language, but is equally applicable to computer programs, music, visual patterns, animal vocalizations, RNA structure and even dance. In the last decade, this theory has been profitably used to frame hypotheses and to design brain imaging and animal-learning experiments, mostly using the 'artificial grammar-learning' paradigm. We offer a brief, non-technical introduction to FLT and then a more detailed analysis of empirical research based on this theory. We suggest that progress has been hampered by a pervasive conflation of distinct issues, including hierarchy, dependency, complexity and recursion. We offer clarifications of several relevant hypotheses and the experimental designs necessary to test them. We finally review the recent brain imaging literature, using formal languages, identifying areas of convergence and outstanding debates. We conclude that FLT has much to offer scientists who are interested in rigorous empirical investigations of human cognition from a neuroscientific and comparative perspective.
Collapse
Affiliation(s)
- W Tecumseh Fitch
- Department of Cognitive Biology, University of Vienna, Althanstrasse 14, Vienna 1090, Austria.
| | | |
Collapse
|
38
|
WJ Anderson J, Tataru P, Staines J, Hein J, Lyngsø R. Evolving stochastic context--free grammars for RNA secondary structure prediction. BMC Bioinformatics 2012; 13:78. [PMID: 22559985 PMCID: PMC3464655 DOI: 10.1186/1471-2105-13-78] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2011] [Accepted: 05/04/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Stochastic Context-Free Grammars (SCFGs) were applied successfully to RNA secondary structure prediction in the early 90s, and used in combination with comparative methods in the late 90s. The set of SCFGs potentially useful for RNA secondary structure prediction is very large, but a few intuitively designed grammars have remained dominant. In this paper we investigate two automatic search techniques for effective grammars - exhaustive search for very compact grammars and an evolutionary algorithm to find larger grammars. We also examine whether grammar ambiguity is as problematic to structure prediction as has been previously suggested. RESULTS These search techniques were applied to predict RNA secondary structure on a maximal data set and revealed new and interesting grammars, though none are dramatically better than classic grammars. In general, results showed that many grammars with quite different structure could have very similar predictive ability. Many ambiguous grammars were found which were at least as effective as the best current unambiguous grammars. CONCLUSIONS Overall the method of evolving SCFGs for RNA secondary structure prediction proved effective in finding many grammars that had strong predictive accuracy, as good or slightly better than those designed manually. Furthermore, several of the best grammars found were ambiguous, demonstrating that such grammars should not be disregarded.
Collapse
Affiliation(s)
- James WJ Anderson
- Department of Statistics, University of Oxford, 1 South Parks Road, UK
| | - Paula Tataru
- Bioinformatics Research Centre, Aarhus University, C. F. Møllers Allé 8, Denmark
| | - Joe Staines
- Department of Computer Science, University College London, Gower Street, UK
| | - Jotun Hein
- Department of Statistics, University of Oxford, 1 South Parks Road, UK
| | - Rune Lyngsø
- Department of Statistics, University of Oxford, 1 South Parks Road, UK
| |
Collapse
|
39
|
Mukherjee S, Mitra S. HIDDEN MARKOV MODELS, GRAMMARS, AND BIOLOGY: A TUTORIAL. J Bioinform Comput Biol 2011; 3:491-526. [PMID: 15852517 DOI: 10.1142/s0219720005001077] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2004] [Revised: 01/05/2004] [Accepted: 01/06/2005] [Indexed: 11/18/2022]
Abstract
Biological sequences and structures have been modelled using various machine learning techniques and abstract mathematical concepts. This article surveys methods using Hidden Markov Model and functional grammars for this purpose. We provide a formal introduction to Hidden Markov Model and grammars, stressing on a comprehensive mathematical description of the methods and their natural continuity. The basic algorithms and their application to analyzing biological sequences and modelling structures of bio-molecules like proteins and nucleic acids are discussed. A comparison of the different approaches is discussed, and possible areas of work and problems are highlighted. Related databases and softwares, available on the internet, are also mentioned.
Collapse
Affiliation(s)
- Shibaji Mukherjee
- Association for Studies in Computational Biology, Kolkata 700 018, India.
| | | |
Collapse
|
40
|
SATO KENGO, HAMADA MICHIAKI, MITUYAMA TOUTAI, ASAI KIYOSHI, SAKAKIBARA YASUBUMI. A NON-PARAMETRIC BAYESIAN APPROACH FOR PREDICTING RNA SECONDARY STRUCTURES. J Bioinform Comput Biol 2011. [DOI: 10.1142/s0219720010004926] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Since many functional RNAs form stable secondary structures which are related to their functions, RNA secondary structure prediction is a crucial problem in bioinformatics. We propose a novel model for generating RNA secondary structures based on a non-parametric Bayesian approach, called hierarchical Dirichlet processes for stochastic context-free grammars (HDP-SCFGs). Here non-parametric means that some meta-parameters, such as the number of non-terminal symbols and production rules, do not have to be fixed. Instead their distributions are inferred in order to be adapted (in the Bayesian sense) to the training sequences provided. The results of our RNA secondary structure predictions show that HDP-SCFGs are more accurate than the MFE-based and other generative models.
Collapse
Affiliation(s)
- KENGO SATO
- Graduate School of Frontier Sciences, University of Tokyo, 5–1–5 Kashiwanoha, Kashiwa 277–8562, Japan
| | - MICHIAKI HAMADA
- Mizuho Information & Research Institute, Inc., 2–3 Kanda-Nishikicho, Chiyoda-ku, Tokyo 101–8443, Japan
| | - TOUTAI MITUYAMA
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2–41–6, Aomi, Koto-ku, Tokyo 135–0064, Japan
| | - KIYOSHI ASAI
- Graduate School of Frontier Sciences, University of Tokyo, 5–1–5 Kashiwanoha, Kashiwa 277–8562, Japan
| | - YASUBUMI SAKAKIBARA
- Department of Biosciences and Informatics, Keio University, 3–14–1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa 223–8522, Japan
| |
Collapse
|
41
|
Nebel ME, Scheid A. Analysis of the free energy in a stochastic RNA secondary structure model. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:1468-82. [PMID: 21116040 DOI: 10.1109/tcbb.2010.126] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/07/2023]
Abstract
There are two custom ways for predicting RNA secondary structures: minimizing the free energy of a conformation according to a thermodynamic model and maximizing the probability of a folding according to a stochastic model. In most cases, stochastic grammars are used for the latter alternative applying the maximum likelihood principle for determining a grammar's probabilities. In this paper, building on such a stochastic model, we will analyze the expected minimum free energy of an RNA molecule according to Turner's energy rules. Even if the parameters of our grammar are chosen with respect to structural properties of native molecules only (and therefore, independent of molecules' free energy), we prove formulae for the expected minimum free energy and the corresponding variance as functions of the molecule's size which perfectly fit the native behavior of free energies. This gives proof for a high quality of our stochastic model making it a handy tool for further investigations. In fact, the stochastic model for RNA secondary structures presented in this work has, for example, been used as the basis of a new algorithm for the (nonuniform) generation of random RNA secondary structures.
Collapse
Affiliation(s)
- Markus E Nebel
- Department of Computer Science, University of Kaiserslautern, PO Box 3049, D-67653 Kaiserslautern, Germany
| | | |
Collapse
|
42
|
Zakov S, Goldberg Y, Elhadad M, Ziv-Ukelson M. Rich parameterization improves RNA structure prediction. J Comput Biol 2011; 18:1525-42. [PMID: 22035327 DOI: 10.1089/cmb.2011.0184] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Current approaches to RNA structure prediction range from physics-based methods, which rely on thousands of experimentally measured thermodynamic parameters, to machine-learning (ML) techniques. While the methods for parameter estimation are successfully shifting toward ML-based approaches, the model parameterizations so far remained fairly constant. We study the potential contribution of increasing the amount of information utilized by RNA folding prediction models to the improvement of their prediction quality. This is achieved by proposing novel models, which refine previous ones by examining more types of structural elements, and larger sequential contexts for these elements. Our proposed fine-grained models are made practical thanks to the availability of large training sets, advances in machine-learning, and recent accelerations to RNA folding algorithms. We show that the application of more detailed models indeed improves prediction quality, while the corresponding running time of the folding algorithm remains fast. An additional important outcome of this experiment is a new RNA folding prediction model (coupled with a freely available implementation), which results in a significantly higher prediction quality than that of previous models. This final model has about 70,000 free parameters, several orders of magnitude more than previous models. Being trained and tested over the same comprehensive data sets, our model achieves a score of 84% according to the F₁-measure over correctly-predicted base-pairs (i.e., 16% error rate), compared to the previously best reported score of 70% (i.e., 30% error rate). That is, the new model yields an error reduction of about 50%. Trained models and source code are available at www.cs.bgu.ac.il/?negevcb/contextfold.
Collapse
Affiliation(s)
- Shay Zakov
- Department of Computer Science, Ben-Gurion University of the Negev, Be'er Sheva, Israel
| | | | | | | |
Collapse
|
43
|
Nebel ME, Scheid A, Weinberg F. Random generation of RNA secondary structures according to native distributions. Algorithms Mol Biol 2011; 6:24. [PMID: 21992500 PMCID: PMC3354341 DOI: 10.1186/1748-7188-6-24] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2011] [Accepted: 10/12/2011] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND Random biological sequences are a topic of great interest in genome analysis since, according to a powerful paradigm, they represent the background noise from which the actual biological information must differentiate. Accordingly, the generation of random sequences has been investigated for a long time. Similarly, random object of a more complicated structure like RNA molecules or proteins are of interest. RESULTS In this article, we present a new general framework for deriving algorithms for the non-uniform random generation of combinatorial objects according to the encoding and probability distribution implied by a stochastic context-free grammar. Briefly, the framework extends on the well-known recursive method for (uniform) random generation and uses the popular framework of admissible specifications of combinatorial classes, introducing weighted combinatorial classes to allow for the non-uniform generation by means of unranking. This framework is used to derive an algorithm for the generation of RNA secondary structures of a given fixed size. We address the random generation of these structures according to a realistic distribution obtained from real-life data by using a very detailed context-free grammar (that models the class of RNA secondary structures by distinguishing between all known motifs in RNA structure). Compared to well-known sampling approaches used in several structure prediction tools (such as SFold) ours has two major advantages: Firstly, after a preprocessing step in time O(n2) for the computation of all weighted class sizes needed, with our approach a set of m random secondary structures of a given structure size n can be computed in worst-case time complexity Om⋅n⋅ log(n) while other algorithms typically have a runtime in O(m⋅n2). Secondly, our approach works with integer arithmetic only which is faster and saves us from all the discomforting details of using floating point arithmetic with logarithmized probabilities. CONCLUSION A number of experimental results shows that our random generation method produces realistic output, at least with respect to the appearance of the different structural motifs. The algorithm is available as a webservice at http://wwwagak.cs.uni-kl.de/NonUniRandGen and can be used for generating random secondary structures of any specified RNA type. A link to download an implementation of our method (in Wolfram Mathematica) can be found there, too.
Collapse
Affiliation(s)
- Markus E Nebel
- Department of Computer Science, University of Kaiserslautern, Germany
| | - Anika Scheid
- Department of Computer Science, University of Kaiserslautern, Germany
| | - Frank Weinberg
- Department of Computer Science, University of Kaiserslautern, Germany
| |
Collapse
|
44
|
Zakov S, Tsur D, Ziv-Ukelson M. Reducing the worst case running times of a family of RNA and CFG problems, using Valiant's approach. Algorithms Mol Biol 2011; 6:20. [PMID: 21851589 PMCID: PMC3741081 DOI: 10.1186/1748-7188-6-20] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2010] [Accepted: 08/18/2011] [Indexed: 01/11/2023] Open
Abstract
Background RNA secondary structure prediction is a mainstream bioinformatic domain, and is key to computational analysis of functional RNA. In more than 30 years, much research has been devoted to defining different variants of RNA structure prediction problems, and to developing techniques for improving prediction quality. Nevertheless, most of the algorithms in this field follow a similar dynamic programming approach as that presented by Nussinov and Jacobson in the late 70's, which typically yields cubic worst case running time algorithms. Recently, some algorithmic approaches were applied to improve the complexity of these algorithms, motivated by new discoveries in the RNA domain and by the need to efficiently analyze the increasing amount of accumulated genome-wide data. Results We study Valiant's classical algorithm for Context Free Grammar recognition in sub-cubic time, and extract features that are common to problems on which Valiant's approach can be applied. Based on this, we describe several problem templates, and formulate generic algorithms that use Valiant's technique and can be applied to all problems which abide by these templates, including many problems within the world of RNA Secondary Structures and Context Free Grammars. Conclusions The algorithms presented in this paper improve the theoretical asymptotic worst case running time bounds for a large family of important problems. It is also possible that the suggested techniques could be applied to yield a practical speedup for these problems. For some of the problems (such as computing the RNA partition function and base-pair binding probabilities), the presented techniques are the only ones which are currently known for reducing the asymptotic running time bounds of the standard algorithms.
Collapse
|
45
|
Abstract
Non-coding RNAs (ncRNAs) are receiving more and more attention not only as an abundant class of genes, but also as regulatory structural elements (some located in mRNAs). A key feature of RNA function is its structure. Computational methods were developed early for folding and prediction of RNA structure with the aim of assisting in functional analysis. With the discovery of more and more ncRNAs, it has become clear that a large fraction of these are highly structured. Interestingly, a large part of the structure is comprised of regular Watson-Crick and GU wobble base pairs. This and the increased amount of available genomes have made it possible to employ structure-based methods for genomic screens. The field has moved from folding prediction of single sequences to computational screens for ncRNAs in genomic sequence using the RNA structure as the main characteristic feature. Whereas early methods focused on energy-directed folding of single sequences, comparative analysis based on structure preserving changes of base pairs has been efficient in improving accuracy, and today this constitutes a key component in genomic screens. Here, we cover the basic principles of RNA folding and touch upon some of the concepts in current methods that have been applied in genomic screens for de novo RNA structures in searches for novel ncRNA genes and regulatory RNA structure on mRNAs. We discuss the strengths and weaknesses of the different strategies and how they can complement each other.
Collapse
|
46
|
Wei D, Alpert LV, Lawrence CE. RNAG: a new Gibbs sampler for predicting RNA secondary structure for unaligned sequences. ACTA ACUST UNITED AC 2011; 27:2486-93. [PMID: 21788211 PMCID: PMC3167047 DOI: 10.1093/bioinformatics/btr421] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
MOTIVATION RNA secondary structure plays an important role in the function of many RNAs, and structural features are often key to their interaction with other cellular components. Thus, there has been considerable interest in the prediction of secondary structures for RNA families. In this article, we present a new global structural alignment algorithm, RNAG, to predict consensus secondary structures for unaligned sequences. It uses a blocked Gibbs sampling algorithm, which has a theoretical advantage in convergence time. This algorithm iteratively samples from the conditional probability distributions P(Structure | Alignment) and P(Alignment | Structure). Not surprisingly, there is considerable uncertainly in the high-dimensional space of this difficult problem, which has so far received limited attention in this field. We show how the samples drawn from this algorithm can be used to more fully characterize the posterior space and to assess the uncertainty of predictions. RESULTS Our analysis of three publically available datasets showed a substantial improvement in RNA structure prediction by RNAG over extant prediction methods. Additionally, our analysis of 17 RNA families showed that the RNAG sampled structures were generally compact around their ensemble centroids, and at least 11 families had at least two well-separated clusters of predicted structures. In general, the distance between a reference structure and our predicted structure was large relative to the variation among structures within an ensemble. AVAILABILITY The Perl implementation of the RNAG algorithm and the data necessary to reproduce the results described in Sections 3.1 and 3.2 are available at http://ccmbweb.ccv.brown.edu/rnag.html CONTACT charles_lawrence@brown.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Donglai Wei
- Department of Mathematics, Center for Computational Molecular Biology, Brown University, Providence, RI 02912, USA
| | | | | |
Collapse
|
47
|
Yoon BJ. Hidden Markov Models and their Applications in Biological Sequence Analysis. Curr Genomics 2011; 10:402-15. [PMID: 20190955 PMCID: PMC2766791 DOI: 10.2174/138920209789177575] [Citation(s) in RCA: 146] [Impact Index Per Article: 10.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2008] [Revised: 02/28/2009] [Accepted: 03/02/2009] [Indexed: 12/21/2022] Open
Abstract
Hidden Markov models (HMMs) have been extensively used in biological sequence analysis. In this paper, we give a tutorial review of HMMs and their applications in a variety of problems in molecular biology. We especially focus on three types of HMMs: the profile-HMMs, pair-HMMs, and context-sensitive HMMs. We show how these HMMs can be used to solve various sequence analysis problems, such as pairwise and multiple sequence alignments, gene annotation, classification, similarity search, and many others.
Collapse
Affiliation(s)
- Byung-Jun Yoon
- Department of Electrical & Computer Engineering, Texas A&M University, College Station, TX 77843-3128, USA
| |
Collapse
|
48
|
D'Ulizia A, Ferri F, Grifoni P. A Learning Algorithm for Multimodal Grammar Inference. IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS. PART B, CYBERNETICS : A PUBLICATION OF THE IEEE SYSTEMS, MAN, AND CYBERNETICS SOCIETY 2011; 41:1495-510. [PMID: 21724519 DOI: 10.1109/tsmcb.2011.2155057] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
The high costs of development and maintenance of multimodal grammars in integrating and understanding input in multimodal interfaces lead to the investigation of novel algorithmic solutions in automating grammar generation and in updating processes. Many algorithms for context-free grammar inference have been developed in the natural language processing literature. An extension of these algorithms toward the inference of multimodal grammars is necessary for multimodal input processing. In this paper, we propose a novel grammar inference mechanism that allows us to learn a multimodal grammar from its positive samples of multimodal sentences. The algorithm first generates the multimodal grammar that is able to parse the positive samples of sentences and, afterward, makes use of two learning operators and the minimum description length metrics in improving the grammar description and in avoiding the over-generalization problem. The experimental results highlight the acceptable performances of the algorithm proposed in this paper since it has a very high probability of parsing valid sentences.
Collapse
|
49
|
Gardner PP, Barquist L, Bateman A, Nawrocki EP, Weinberg Z. RNIE: genome-wide prediction of bacterial intrinsic terminators. Nucleic Acids Res 2011; 39:5845-52. [PMID: 21478170 PMCID: PMC3152330 DOI: 10.1093/nar/gkr168] [Citation(s) in RCA: 66] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Bacterial Rho-independent terminators (RITs) are important genomic landmarks involved in gene regulation and terminating gene expression. In this investigation we present RNIE, a probabilistic approach for predicting RITs. The method is based upon covariance models which have been known for many years to be the most accurate computational tools for predicting homology in structural non-coding RNAs. We show that RNIE has superior performance in model species from a spectrum of bacterial phyla. Further analysis of species where a low number of RITs were predicted revealed a highly conserved structural sequence motif enriched near the genic termini of the pathogenic Actinobacteria, Mycobacterium tuberculosis. This motif, together with classical RITs, account for up to 90% of all the significantly structured regions from the termini of M. tuberculosis genic elements. The software, predictions and alignments described below are available from http://github.com/ppgardne/RNIE.
Collapse
Affiliation(s)
- Paul P Gardner
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA0, UK.
| | | | | | | | | |
Collapse
|
50
|
Backofen R, Tsur D, Zakov S, Ziv-Ukelson M. Sparse RNA folding: Time and space efficient algorithms. ACTA ACUST UNITED AC 2011. [DOI: 10.1016/j.jda.2010.09.001] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
|