1
Krueger RK, Ward M. JAX-RNAfold: scalable differentiable folding. Bioinformatics 2025; 41:btaf203. [PMID: 40279486 PMCID: PMC12064173 DOI: 10.1093/bioinformatics/btaf203] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2024] [Revised: 03/11/2025] [Accepted: 04/24/2025] [Indexed: 04/27/2025] Open
Abstract
SUMMARY: Differentiable folding is an emerging paradigm for RNA design in which a probabilistic sequence representation is optimized via gradient descent. However, given the significant memory overhead of differentiating the expected partition function over all RNA sequences, the existing proof-of-concept algorithm only scales to ≤50 nucleotides. We present JAX-RNAfold, an open-source software package for our drastically improved differentiable folding algorithm that scales to 1,250 nucleotides on a single GPU. Our software permits the natural inclusion of differentiable folding as a module in larger deep learning pipelines, as well as complex RNA design procedures such as mRNA design with flexible objective functions. AVAILABILITY AND IMPLEMENTATION: JAX-RNAfold is hosted on GitHub (https://github.com/rkruegs123/jax-rnafold) and can be installed locally as a Python package. All source code is also archived on Zenodo (https://doi.org/10.5281/zenodo.15003072).
Affiliation(s)
- Ryan K Krueger
  - School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, United States
- Max Ward
  - Department of Computer Science and Software Engineering, The University of Western Australia, Crawley, WA 6009, Australia
2
Stewart JM. RNA nanotechnology on the horizon: Self-assembly, chemical modifications, and functional applications. Curr Opin Chem Biol 2024; 81:102479. [PMID: 38889473 DOI: 10.1016/j.cbpa.2024.102479] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2024] [Revised: 05/20/2024] [Accepted: 05/25/2024] [Indexed: 06/20/2024]
Abstract
RNA nanotechnology harnesses the unique chemical and structural properties of RNA to build nanoassemblies and supramolecular structures with dynamic and functional capabilities. This review focuses on design and assembly approaches to building RNA structures, the RNA chemical modifications used to enhance stability and functionality, and modern-day applications in therapeutics, biosensing, and bioimaging.
3
Rinaldi S, Moroni E, Rozza R, Magistrato A. Frontiers and Challenges of Computing ncRNAs Biogenesis, Function and Modulation. J Chem Theory Comput 2024; 20:993-1018. [PMID: 38287883 DOI: 10.1021/acs.jctc.3c01239] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/31/2024]
Abstract
Non-coding RNAs (ncRNAs), generated from nonprotein coding DNA sequences, constitute 98-99% of the human genome. Non-coding RNAs encompass diverse functional classes, including microRNAs, small interfering RNAs, PIWI-interacting RNAs, small nuclear RNAs, small nucleolar RNAs, and long non-coding RNAs. With critical involvement in gene expression and regulation across various biological and physiopathological contexts, such as neuronal disorders, immune responses, cardiovascular diseases, and cancer, non-coding RNAs are emerging as disease biomarkers and therapeutic targets. In this review, after providing an overview of non-coding RNAs' role in cell homeostasis, we illustrate the potential and the challenges of state-of-the-art computational methods exploited to study non-coding RNAs biogenesis, function, and modulation. This can be done by directly targeting them with small molecules or by altering their expression by targeting the cellular engines underlying their biosynthesis. Drawing from applications, also taken from our work, we showcase the significance and role of computer simulations in uncovering fundamental facets of ncRNA mechanisms and modulation. This information may set the basis to advance gene modulation tools and therapeutic strategies to address unmet medical needs.
Affiliation(s)
- Silvia Rinaldi
  - National Research Council of Italy (CNR) - Institute of Chemistry of OrganoMetallic Compounds (ICCOM), c/o Area di Ricerca CNR di Firenze Via Madonna del Piano 10, 50019 Sesto Fiorentino, Florence, Italy
- Elisabetta Moroni
  - National Research Council of Italy (CNR) - Institute of Chemical Sciences and Technologies (SCITEC), via Mario Bianco 9, 20131 Milano, Italy
- Riccardo Rozza
  - National Research Council of Italy (CNR) - Institute of Material Foundry (IOM) c/o International School for Advanced Studies (SISSA), Via Bonomea, 265, 34136 Trieste, Italy
- Alessandra Magistrato
  - National Research Council of Italy (CNR) - Institute of Material Foundry (IOM) c/o International School for Advanced Studies (SISSA), Via Bonomea, 265, 34136 Trieste, Italy
4
Nasaev SS, Mukanov AR, Kuznetsov II, Veselovsky AV. AliNA - a deep learning program for RNA secondary structure prediction. Mol Inform 2023; 42:e202300113. [PMID: 37710142 DOI: 10.1002/minf.202300113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2023] [Revised: 09/13/2023] [Accepted: 09/14/2023] [Indexed: 09/16/2023]
Abstract
Nowadays there are numerous discovered natural RNA variations participating in different cellular processes and artificial RNA, e.g., aptamers, riboswitches. One of the required tasks in the investigation of their functions and mechanism of influence on cells and interaction with targets is the prediction of RNA secondary structures. The classic thermodynamic-based prediction algorithms do not consider the specificity of biological folding and deep learning methods that were designed to resolve this issue suffer from homology-based methods problems. Herein, we present a method for RNA secondary structure prediction based on deep learning - AliNA (ALIgned Nucleic Acids). Our method successfully predicts secondary structures for non-homologous to train-data RNA families thanks to usage of the data augmentation techniques. Augmentation extends existing datasets with easily-accessible simulated data. The proposed method shows a high quality of prediction across different benchmarks including pseudoknots. The method is available on GitHub for free (https://github.com/Arty40m/AliNA).
Affiliation(s)
- Shamsudin S Nasaev
  - Institute of Biomedical Chemistry, 10, Pogodinskaya str., 119121, Moscow, Russia
- Artem R Mukanov
  - A.M. Butlerov Institute of Chemistry, Kazan Federal University, 18, Kremlyovskaya str., 420008, Kazan, Russia
- Ivan I Kuznetsov
  - Moscow University of Finance and Law, 10 block 1, Serpuhovsky val str., 115191, Moscow, Russia
5
Sieg JP, Jolley EA, Huot MJ, Babitzke P, Bevilacqua P. In vivo-like nearest neighbor parameters improve prediction of fractional RNA base-pairing in cells. Nucleic Acids Res 2023; 51:11298-11317. [PMID: 37855684 PMCID: PMC10639048 DOI: 10.1093/nar/gkad807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2023] [Revised: 09/11/2023] [Accepted: 09/27/2023] [Indexed: 10/20/2023] Open
Abstract
We conducted a thermodynamic analysis of RNA stability in Eco80 artificial cytoplasm, which mimics in vivo conditions, and compared it to transcriptome-wide probing of mRNA. Eco80 contains 80% of Escherichia coli metabolites, with biological concentrations of metal ions, including 2 mM free Mg²⁺ and 29 mM metabolite-chelated Mg²⁺. Fluorescence-detected binding isotherms (FDBI) were used to conduct a thermodynamic analysis of 24 RNA helices and found that these helices, which have an average stability of -12.3 kcal/mol, are less stable by ΔΔG°₃₇ ≈ 1 kcal/mol. The FDBI data was used to determine a set of Watson-Crick free energy nearest neighbor parameters (NNPs), which revealed that Eco80 reduces the stability of three NNPs. This information was used to adjust the NN model using the RNAstructure package. The in vivo-like adjustments have minimal effects on the prediction of RNA secondary structures determined in vitro and in silico, but markedly improve prediction of fractional RNA base pairing in E. coli, as benchmarked with our in vivo DMS and EDC RNA chemical probing data. In summary, our thermodynamic and chemical probing analyses of RNA helices indicate that RNA secondary structures are less stable in cells than in artificially stable in vitro buffer conditions.
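To give a sense of scale for the reported destabilization of about 1 kcal/mol at 37 °C, the sketch below converts a free energy difference into the factor by which a folding equilibrium constant changes, using the standard relation K = exp(-ΔG°/RT). The function name and constants are illustrative and are not taken from the paper or from RNAstructure.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)
T_37C = 310.15        # 37 degrees Celsius expressed in kelvin


def stability_fold_change(ddg_kcal_per_mol, temperature_k=T_37C):
    """Return the factor by which the folding equilibrium constant drops
    when the folding free energy becomes less favorable by ddg.

    Derived from K = exp(-dG/RT), so K_ref / K_shifted = exp(ddG / RT).
    """
    return math.exp(ddg_kcal_per_mol / (R_KCAL * temperature_k))


if __name__ == "__main__":
    # A 1 kcal/mol destabilization lowers the equilibrium constant
    # roughly fivefold at 37 degrees Celsius.
    print(round(stability_fold_change(1.0), 2))
```

Under these assumptions, a shift of 1 kcal/mol corresponds to an equilibrium constant that is about five times smaller, which is why a seemingly small energy difference can change the predicted fraction of paired bases noticeably.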
Affiliation(s)
- Jacob P Sieg
  - Department of Chemistry, Pennsylvania State University, University Park, PA 16802, USA
  - Center for RNA Molecular Biology, Pennsylvania State University, University Park, PA 16802, USA
- Elizabeth A Jolley
  - Department of Chemistry, Pennsylvania State University, University Park, PA 16802, USA
  - Center for RNA Molecular Biology, Pennsylvania State University, University Park, PA 16802, USA
- Melanie J Huot
  - Department of Biology, Pennsylvania State University, University Park, PA 16802, USA
  - Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA 16802, USA
- Paul Babitzke
  - Center for RNA Molecular Biology, Pennsylvania State University, University Park, PA 16802, USA
  - Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA 16802, USA
- Philip C Bevilacqua
  - Department of Chemistry, Pennsylvania State University, University Park, PA 16802, USA
  - Center for RNA Molecular Biology, Pennsylvania State University, University Park, PA 16802, USA
  - Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA 16802, USA
6
Zhang H, Zhang L, Lin A, Xu C, Li Z, Liu K, Liu B, Ma X, Zhao F, Jiang H, Chen C, Shen H, Li H, Mathews DH, Zhang Y, Huang L. Algorithm for optimized mRNA design improves stability and immunogenicity. Nature 2023; 621:396-403. [PMID: 37130545 PMCID: PMC10499610 DOI: 10.1038/s41586-023-06127-z] [Citation(s) in RCA: 152] [Impact Index Per Article: 76.0] [Received: 03/12/2022] [Accepted: 04/25/2023] [Indexed: 05/04/2023]
Abstract
Messenger RNA (mRNA) vaccines are being used to combat the spread of COVID-19 (refs. 1-3), but they still exhibit critical limitations caused by mRNA instability and degradation, which are major obstacles for the storage, distribution and efficacy of the vaccine products⁴. Increasing secondary structure lengthens mRNA half-life, which, together with optimal codons, improves protein expression⁵. Therefore, a principled mRNA design algorithm must optimize both structural stability and codon usage. However, owing to synonymous codons, the mRNA design space is prohibitively large: for example, there are around 2.4 × 10⁶³² candidate mRNA sequences for the SARS-CoV-2 spike protein. This poses insurmountable computational challenges. Here we provide a simple and unexpected solution using the classical concept of lattice parsing in computational linguistics, where finding the optimal mRNA sequence is analogous to identifying the most likely sentence among similar-sounding alternatives⁶. Our algorithm LinearDesign finds an optimal mRNA design for the spike protein in just 11 minutes, and can concurrently optimize stability and codon usage. LinearDesign substantially improves mRNA half-life and protein expression, and profoundly increases antibody titre by up to 128 times in mice compared to the codon-optimization benchmark on mRNA vaccines for COVID-19 and varicella-zoster virus. This result reveals the great potential of principled mRNA design and enables the exploration of previously unreachable but highly stable and efficient designs. Our work is a timely tool for vaccines and other mRNA-based medicines encoding therapeutic proteins such as monoclonal antibodies and anti-cancer drugs⁷,⁸.
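The size of the design space mentioned above follows from multiplying the number of synonymous codons at each position. The sketch below illustrates that counting argument only; it is not part of LinearDesign. The degeneracy table is the standard genetic code with stop codons excluded, and the function name and example peptide are hypothetical.

```python
from math import log10, prod

# Number of synonymous codons per amino acid in the standard genetic code
# (one-letter amino acid codes; stop codons are not included).
CODON_DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_mrna_candidates(protein):
    """Count the coding sequences that translate to `protein`.

    Each residue contributes an independent choice among its synonymous
    codons, so the total is the product of the per-residue degeneracies.
    """
    return prod(CODON_DEGENERACY[aa] for aa in protein)


if __name__ == "__main__":
    peptide = "MKLV"  # hypothetical four-residue example
    total = count_mrna_candidates(peptide)
    print(total, f"(about 10^{log10(total):.1f})")
```

Because the count is a product over residues, it grows exponentially with protein length, which is why exhaustive enumeration is infeasible for a protein of more than a thousand residues.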
Affiliation(s)
- He Zhang
  - Baidu Research USA, Sunnyvale, CA, USA
  - School of EECS, Oregon State University, Corvallis, OR, USA
- Liang Zhang
  - Baidu Research USA, Sunnyvale, CA, USA
  - School of EECS, Oregon State University, Corvallis, OR, USA
  - Vaccine Center, School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing, China
- Ang Lin
  - StemiRNA Therapeutics, Shanghai, China
  - Vaccine Center, School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing, China
- Ziyu Li
  - Baidu Research USA, Sunnyvale, CA, USA
- Kaibo Liu
  - Baidu Research USA, Sunnyvale, CA, USA
  - School of EECS, Oregon State University, Corvallis, OR, USA
- Boxiang Liu
  - Baidu Research USA, Sunnyvale, CA, USA
  - Department of Pharmacy, National University of Singapore, Singapore, Singapore
- David H Mathews
  - Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY, USA
  - Center for RNA Biology, University of Rochester Medical Center, Rochester, NY, USA
  - Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, NY, USA
  - Coderna.ai, Inc., Sunnyvale, CA, USA
- Yujian Zhang
  - StemiRNA Therapeutics, Shanghai, China
  - , Gaithersburg, MD, USA
- Liang Huang
  - Baidu Research USA, Sunnyvale, CA, USA
  - School of EECS, Oregon State University, Corvallis, OR, USA
  - Coderna.ai, Inc., Sunnyvale, CA, USA
7
Sato K, Hamada M. Recent trends in RNA informatics: a review of machine learning and deep learning for RNA secondary structure prediction and RNA drug discovery. Brief Bioinform 2023; 24:bbad186. [PMID: 37232359 PMCID: PMC10359090 DOI: 10.1093/bib/bbad186] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Received: 01/16/2023] [Revised: 04/24/2023] [Accepted: 04/25/2023] [Indexed: 05/27/2023] Open
Abstract
Computational analysis of RNA sequences constitutes a crucial step in the field of RNA biology. As in other domains of the life sciences, the incorporation of artificial intelligence and machine learning techniques into RNA sequence analysis has gained significant traction in recent years. Historically, thermodynamics-based methods were widely employed for the prediction of RNA secondary structures; however, machine learning-based approaches have demonstrated remarkable advancements in recent years, enabling more accurate predictions. Consequently, the precision of sequence analysis pertaining to RNA secondary structures, such as RNA-protein interactions, has also been enhanced, making a substantial contribution to the field of RNA biology. Additionally, artificial intelligence and machine learning are also introducing technical innovations in the analysis of RNA-small molecule interactions for RNA-targeted drug discovery and in the design of RNA aptamers, where RNA serves as its own ligand. This review will highlight recent trends in the prediction of RNA secondary structure, RNA aptamers and RNA drug discovery using machine learning, deep learning and related technologies, and will also discuss potential future avenues in the field of RNA informatics.
Affiliation(s)
- Kengo Sato
  - School of System Design and Technology, Tokyo Denki University, 5 Senju Asahi-cho, Adachi-ku, Tokyo 120-8551, Japan
- Michiaki Hamada
  - Department of Electrical Engineering and Bioscience, Faculty of Science and Engineering, Waseda University, 55N-06-10, 3-4-1, Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
  - Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), 3-4-1, Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
  - Graduate School of Medicine, Nippon Medical School, 1-1-5, Sendagi, Bunkyo-ku, Tokyo 113-8602, Japan
8
Qiu X. Sequence similarity governs generalizability of de novo deep learning models for RNA secondary structure prediction. PLoS Comput Biol 2023; 19:e1011047. [PMID: 37068100 PMCID: PMC10138783 DOI: 10.1371/journal.pcbi.1011047] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 01/02/2023] [Revised: 04/27/2023] [Accepted: 03/25/2023] [Indexed: 04/18/2023] Open
Abstract
Making no use of physical laws or co-evolutionary information, de novo deep learning (DL) models for RNA secondary structure prediction have achieved far superior performances than traditional algorithms. However, their statistical underpinning raises the crucial question of generalizability. We present a quantitative study of the performance and generalizability of a series of de novo DL models, with a minimal two-module architecture and no post-processing, under varied similarities between seen and unseen sequences. Our models demonstrate excellent expressive capacities and outperform existing methods on common benchmark datasets. However, model generalizability, i.e., the performance gap between the seen and unseen sets, degrades rapidly as the sequence similarity decreases. The same trends are observed from several recent DL and machine learning models. And an inverse correlation between performance and generalizability is revealed collectively across all learning-based models with wide-ranging architectures and sizes. We further quantitate how generalizability depends on sequence and structure identity scores via pairwise alignment, providing unique quantitative insights into the limitations of statistical learning. Generalizability thus poses a major hurdle for deploying de novo DL models in practice and various pathways for future advances are discussed.
Affiliation(s)
- Xiangyun Qiu
  - Department of Physics, George Washington University, Washington DC, United States of America
9
RNA Secondary Structure Prediction Based on Energy Models. Methods Mol Biol 2023; 2586:89-105. [PMID: 36705900 DOI: 10.1007/978-1-0716-2768-6_6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2023]
Abstract
This chapter introduces the RNA secondary structure prediction based on the nearest neighbor energy model, which is one of the most popular architectures of modeling RNA secondary structure without pseudoknots. We discuss the parameterization and the parameter determination by experimental and machine learning-based approaches as well as an integrated approach that compensates each other's shortcomings. Then, folding algorithms for the minimum free energy and the maximum expected accuracy using the dynamic programming technique are introduced. Finally, we compare the prediction accuracy of the method described so far with benchmark datasets.
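As a rough illustration of the dynamic programming idea behind such folding algorithms, the sketch below implements a Nussinov-style recursion that maximizes the number of nested base pairs. This is a deliberately simplified stand-in: it counts pairs instead of summing nearest neighbor free energies, so it is not the minimum free energy algorithm the chapter describes. The function name and the minimum hairpin loop size of 3 are assumptions made for this example.

```python
# Canonical Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Return the maximum number of nested (pseudoknot-free) base pairs.

    dp[i][j] holds the best pair count for the subsequence seq[i..j].
    A pair (k, j) is allowed only if at least `min_loop` unpaired
    nucleotides lie between k and j.
    """
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    # Fill the table by increasing subsequence length.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j stays unpaired
            # case 2: position j pairs with some k, splitting the interval
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


if __name__ == "__main__":
    # A small hairpin: three stacked pairs closing a three-nucleotide loop.
    print(max_base_pairs("GGGAAAUCC"))
```

An energy-based folding algorithm uses the same interval decomposition but replaces the "+1 per pair" score with loop and stacking energies from the nearest neighbor model and minimizes the total.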
10
Zhang J, Fei Y, Sun L, Zhang QC. Advances and opportunities in RNA structure experimental determination and computational modeling. Nat Methods 2022; 19:1193-1207. [PMID: 36203019 DOI: 10.1038/s41592-022-01623-y] [Citation(s) in RCA: 64] [Impact Index Per Article: 21.3] [Received: 06/10/2022] [Accepted: 08/23/2022] [Indexed: 11/09/2022]
Abstract
Beyond transferring genetic information, RNAs are molecules with diverse functions that include catalyzing biochemical reactions and regulating gene expression. Most of these activities depend on RNAs' specific structures. Therefore, accurately determining RNA structure is integral to advancing our understanding of RNA functions. Here, we summarize the state-of-the-art experimental and computational technologies developed to evaluate RNA secondary and tertiary structures. We also highlight how the rapid increase of experimental data facilitates the integrative modeling approaches for better resolving RNA structures. Finally, we provide our thoughts on the latest advances and challenges in RNA structure determination methods, as well as on future directions for both experimental approaches and artificial intelligence-based computational tools to model RNA structure. Ultimately, we hope the technological advances will deepen our understanding of RNA biology and facilitate RNA structure-based biomedical research such as designing specific RNA structures for therapeutics and deploying RNA-targeting small-molecule drugs.
Affiliation(s)
- Jinsong Zhang
  - MOE Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing, China
  - Beijing Advanced Innovation Center for Structural Biology & Frontier Research Center for Biological Structure, School of Life Sciences, Tsinghua University, Beijing, China
  - Tsinghua-Peking Center for Life Sciences, Beijing, China
- Yuhan Fei
  - MOE Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing, China
  - Beijing Advanced Innovation Center for Structural Biology & Frontier Research Center for Biological Structure, School of Life Sciences, Tsinghua University, Beijing, China
  - Tsinghua-Peking Center for Life Sciences, Beijing, China
- Lei Sun
  - MOE Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing, China
  - Beijing Advanced Innovation Center for Structural Biology & Frontier Research Center for Biological Structure, School of Life Sciences, Tsinghua University, Beijing, China
  - Tsinghua-Peking Center for Life Sciences, Beijing, China
- Qiangfeng Cliff Zhang
  - MOE Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing, China
  - Beijing Advanced Innovation Center for Structural Biology & Frontier Research Center for Biological Structure, School of Life Sciences, Tsinghua University, Beijing, China
  - Tsinghua-Peking Center for Life Sciences, Beijing, China
11
Szikszai M, Wise M, Datta A, Ward M, Mathews DH. Deep learning models for RNA secondary structure prediction (probably) do not generalize across families. Bioinformatics 2022; 38:3892-3899. [PMID: 35748706 PMCID: PMC9364374 DOI: 10.1093/bioinformatics/btac415] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 03/28/2022] [Revised: 06/09/2022] [Accepted: 06/21/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION: The secondary structure of RNA is of importance to its function. Over the last few years, several papers attempted to use machine learning to improve de novo RNA secondary structure prediction. Many of these papers report impressive results for intra-family predictions but seldom address the much more difficult (and practical) inter-family problem. RESULTS: We demonstrate that it is nearly trivial with convolutional neural networks to generate pseudo-free energy changes, modelled after structure mapping data that improve the accuracy of structure prediction for intra-family cases. We propose a more rigorous method for inter-family cross-validation that can be used to assess the performance of learning-based models. Using this method, we further demonstrate that intra-family performance is insufficient proof of generalization despite the widespread assumption in the literature and provide strong evidence that many existing learning-based models have not generalized inter-family. AVAILABILITY AND IMPLEMENTATION: Source code and data are available at https://github.com/marcellszi/dl-rna. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
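The distinction between intra-family and inter-family evaluation can be made concrete with a family-wise split, in which every sequence from the held-out family is excluded from training. The sketch below shows one common way to build such splits (leave one family out). The abstract does not specify the authors' exact protocol, so this should be read as an assumption-laden illustration; the function name and the example records are hypothetical.

```python
from collections import defaultdict


def leave_one_family_out(records):
    """Yield (held_out_family, train_names, test_names) splits.

    `records` is an iterable of (sequence_name, family) tuples. In each
    split, all sequences of one family form the test set and all remaining
    families form the training set, so no family appears on both sides.
    """
    by_family = defaultdict(list)
    for name, family in records:
        by_family[family].append(name)

    families = sorted(by_family)
    for held_out in families:
        test = list(by_family[held_out])
        train = [n for fam in families if fam != held_out for n in by_family[fam]]
        yield held_out, train, test


if __name__ == "__main__":
    # Hypothetical records: (sequence identifier, RNA family label).
    data = [
        ("seq1", "tRNA"), ("seq2", "tRNA"),
        ("seq3", "5S rRNA"), ("seq4", "SRP"),
    ]
    for family, train, test in leave_one_family_out(data):
        print(family, train, test)
```

A random split over the same records would usually place members of one family in both training and test sets, which is the situation the abstract warns can overstate generalization.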
Affiliation(s)
- Marcell Szikszai
  - Department of Computer Science & Software Engineering, The University of Western Australia, Perth, WA 6009, Australia
- Michael Wise
  - Department of Computer Science & Software Engineering, The University of Western Australia, Perth, WA 6009, Australia
  - The Marshall Centre for Infectious Diseases Research and Training, The University of Western Australia, Perth, WA 6009, Australia
- Amitava Datta
  - Department of Computer Science & Software Engineering, The University of Western Australia, Perth, WA 6009, Australia
- Max Ward
  - Department of Computer Science & Software Engineering, The University of Western Australia, Perth, WA 6009, Australia
  - Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA 02138, USA
- David H Mathews
  - Department of Biochemistry & Biophysics, Center for RNA Biology, and Department of Biostatistics & Computational Biology, University of Rochester, Rochester, NY 14642, USA
12
Flamm C, Wielach J, Wolfinger MT, Badelt S, Lorenz R, Hofacker IL. Caveats to Deep Learning Approaches to RNA Secondary Structure Prediction. Front Bioinform 2022; 2:835422. [PMID: 36304289 PMCID: PMC9580944 DOI: 10.3389/fbinf.2022.835422] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 12/14/2021] [Accepted: 06/09/2022] [Indexed: 11/18/2022] Open
Abstract
Machine learning (ML) and in particular deep learning techniques have gained popularity for predicting structures from biopolymer sequences. An interesting case is the prediction of RNA secondary structures, where well established biophysics based methods exist. The accuracy of these classical methods is limited due to lack of experimental parameters and certain simplifying assumptions and has seen little improvement over the last decade. This makes RNA folding an attractive target for machine learning and consequently several deep learning models have been proposed in recent years. However, for ML approaches to be competitive for de-novo structure prediction, the models must not just demonstrate good phenomenological fits, but be able to learn a (complex) biophysical model. In this contribution we discuss limitations of current approaches, in particular due to biases in the training data. Furthermore, we propose to study capabilities and limitations of ML models by first applying them on synthetic data (obtained from a simplified biophysical model) that can be generated in arbitrary amounts and where all biases can be controlled. We assume that a deep learning model that performs well on these synthetic, would also perform well on real data, and vice versa. We apply this idea by testing several ML models of varying complexity. Finally, we show that the best models are capable of capturing many, but not all, properties of RNA secondary structures. Most severely, the number of predicted base pairs scales quadratically with sequence length, even though a secondary structure can only accommodate a linear number of pairs.
Affiliation(s)
- Christoph Flamm
  - Department of Theoretical Chemistry, University of Vienna, Vienna, Austria
- Julia Wielach
  - Department of Theoretical Chemistry, University of Vienna, Vienna, Austria
- Michael T. Wolfinger
  - Department of Theoretical Chemistry, University of Vienna, Vienna, Austria
  - Research Group Bioinformatics and Computational Biology, Faculty of Computer Science, University of Vienna, Vienna, Austria
- Stefan Badelt
  - Department of Theoretical Chemistry, University of Vienna, Vienna, Austria
- Ronny Lorenz
  - Department of Theoretical Chemistry, University of Vienna, Vienna, Austria
- Ivo L. Hofacker
  - Department of Theoretical Chemistry, University of Vienna, Vienna, Austria
  - Research Group Bioinformatics and Computational Biology, Faculty of Computer Science, University of Vienna, Vienna, Austria
  - *Correspondence: Ivo L. Hofacker,
13
Abstract
Recent events have pushed RNA research into the spotlight. Continued discoveries of RNA with unexpected diverse functions in healthy and diseased cells, such as the role of RNA as both the source and countermeasure to a severe acute respiratory syndrome coronavirus 2 infection, are igniting a new passion for understanding this functionally and structurally versatile molecule. Although RNA structure is key to function, many foundational characteristics of RNA structure are misunderstood, and the default state of RNA is often thought of and depicted as a single floppy strand. The purpose of this perspective is to help adjust mental models, equipping the community to better use the fundamental aspects of RNA structural information in new mechanistic models, enhance experimental design to test these models, and refine data interpretation. We discuss six core observations focused on the inherent nature of RNA structure and how to incorporate these characteristics to better understand RNA structure. We also offer some ideas for future efforts to make validated RNA structural information available and readily used by all researchers.
Affiliation(s)
- Quentin Vicens
  - Department of Biochemistry and Molecular Genetics, University of Colorado Anschutz Medical Campus, School of Medicine, Aurora, CO 80045
  - RNA BioScience Initiative, University of Colorado Denver School of Medicine, Aurora, CO 80045
- Jeffrey S. Kieft
  - Department of Biochemistry and Molecular Genetics, University of Colorado Anschutz Medical Campus, School of Medicine, Aurora, CO 80045
  - RNA BioScience Initiative, University of Colorado Denver School of Medicine, Aurora, CO 80045
14
Secondary structure prediction for RNA sequences including N6-methyladenosine. Nat Commun 2022; 13:1271. [PMID: 35277476 PMCID: PMC8917230 DOI: 10.1038/s41467-022-28817-4] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Received: 05/12/2021] [Accepted: 02/10/2022] [Indexed: 01/22/2023] Open
Abstract
There is increasing interest in the roles of covalently modified nucleotides in RNA. There has been, however, an inability to account for modifications in secondary structure prediction because of a lack of software and thermodynamic parameters. We report the solution for these issues for N6-methyladenosine (m6A), allowing secondary structure prediction for an alphabet of A, C, G, U, and m6A. The RNAstructure software now works with user-defined nucleotide alphabets of any size. We also report a set of nearest neighbor parameters for helices and loops containing m6A, using experiments. Interestingly, N6-methylation decreases folding stability for adenosines in the middle of a helix, has little effect on folding stability for adenosines at the ends of helices, and increases folding stability for unpaired adenosines stacked on a helix. We demonstrate predictions for an N6-methylation-activated protein recognition site from MALAT1 and human transcriptome-wide effects of N6-methylation on the probability of adenosine being buried in a helix. RNA folding free energy nearest neighbor parameters were determined for sequences with the nucleotide m6A. The RNAstructure software package can accommodate modified nucleotides, enabling secondary structure prediction of sequences with m6A.
15
Zhao Q, Zhao Z, Fan X, Yuan Z, Mao Q, Yao Y. Review of machine learning methods for RNA secondary structure prediction. PLoS Comput Biol 2021; 17:e1009291. [PMID: 34437528 PMCID: PMC8389396 DOI: 10.1371/journal.pcbi.1009291] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Indexed: 12/05/2022] Open
Abstract
Secondary structure plays an important role in determining the function of noncoding RNAs. Hence, identifying RNA secondary structures is of great value to research. Computational prediction is a mainstream approach for predicting RNA secondary structure. Unfortunately, even though new methods have been proposed over the past 40 years, the performance of computational prediction methods has stagnated in the last decade. Recently, with the increasing availability of RNA structure data, new methods based on machine learning (ML) technologies, especially deep learning, have alleviated the issue. In this review, we provide a comprehensive overview of RNA secondary structure prediction methods based on ML technologies and a tabularized summary of the most important methods in this field. The current pending challenges in the field of RNA secondary structure prediction and future trends are also discussed.
Affiliation(s)
- Qi Zhao
  - College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, China
- Zheng Zhao
  - School of Information Science and Technology, Dalian Maritime University, Dalian, Liaoning, China
- Xiaoya Fan
  - School of Software, Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, Dalian University of Technology, Dalian, Liaoning, China
- Zhengwei Yuan
  - Key Laboratory of Health Ministry for Congenital Malformation, Shengjing Hospital of China Medical University, Shenyang, Liaoning, China
- Qian Mao
  - College of Light Industry, Liaoning University, Shenyang, Liaoning, China
  - Key Laboratory of Agroproducts Processing Technology, Changchun University, Changchun, Jilin, China
- Yudong Yao
  - Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, New Jersey, United States of America
16
Rivas E. Evolutionary conservation of RNA sequence and structure. WILEY INTERDISCIPLINARY REVIEWS-RNA 2021; 12:e1649. [PMID: 33754485 PMCID: PMC8250186 DOI: 10.1002/wrna.1649] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 12/17/2020] [Revised: 02/24/2021] [Accepted: 02/25/2021] [Indexed: 12/22/2022]
Abstract
An RNA structure prediction from a single‐sequence RNA folding program is not evidence for an RNA whose structure is important for function. Random sequences have plausible and complex predicted structures not easily distinguishable from those of structural RNAs. How to tell when an RNA has a conserved structure is a question that requires looking at the evolutionary signature left by the conserved RNA. This question is important not just for long noncoding RNAs which usually lack an identified function, but also for RNA binding protein motifs which can be single stranded RNAs or structures. Here we review recent advances using sequence and structural analysis to determine when RNA structure is conserved or not. Although covariation measures assess structural RNA conservation, one must distinguish covariation due to RNA structure from covariation due to independent phylogenetic substitutions. We review a statistical test to measure false positives expected under the null hypothesis of phylogenetic covariation alone (specificity). We also review a complementary test that measures power, that is, expected covariation derived from sequence variation alone (sensitivity). Power in the absence of covariation signals the absence of a conserved RNA structure. We analyze artifacts that falsely identify conserved RNA structure such as the misuse of programs that do not assess significance, the use of inappropriate statistics confounded by signals other than covariation, or misalignments that induce spurious covariation. Among artifacts that obscure the signal of a conserved RNA structure, we discuss the inclusion of pseudogenes in alignments which increase power but destroy covariation. This article is categorized under: RNA Structure and Dynamics > RNA Structure, Dynamics and Chemistry; RNA Evolution and Genomics > Computational Analyses of RNA; RNA Evolution and Genomics > RNA and Ribonucleoprotein Evolution.
Affiliation(s)
- Elena Rivas
  - Department of Molecular and Cellular Biology, Harvard University, Cambridge, Massachusetts, USA
17
Li P, Zhou X, Xu K, Zhang QC. RASP: an atlas of transcriptome-wide RNA secondary structure probing data. Nucleic Acids Res 2021; 49:D183-D191. [PMID: 33068412 PMCID: PMC7779053 DOI: 10.1093/nar/gkaa880] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 08/14/2020] [Revised: 09/13/2020] [Accepted: 09/26/2020] [Indexed: 02/06/2023] Open
Abstract
RNA molecules fold into complex structures that are important across many biological processes. Recent technological developments have enabled transcriptome-wide probing of RNA secondary structure using nucleases and chemical modifiers. These approaches have been widely applied to capture RNA secondary structure in many studies, but gathering and presenting such data from very different technologies in a comprehensive and accessible way has been challenging. Existing RNA structure probing databases usually focus on low-throughput or very specific datasets. Here, we present a comprehensive RNA structure probing database called RASP (RNA Atlas of Structure Probing) by collecting 161 deduplicated transcriptome-wide RNA secondary structure probing datasets from 38 papers. RASP covers 18 species across animals, plants, bacteria, fungi, and viruses, and categorizes 18 experimental methods, including DMS-seq, SHAPE-Seq, SHAPE-MaP, and icSHAPE. In particular, RASP curates up-to-date datasets from several RNA secondary structure probing studies of the RNA genome of SARS-CoV-2, the RNA virus that caused the ongoing COVID-19 pandemic. RASP also provides a user-friendly interface to query, browse, and visualize RNA structure profiles, offering a shortcut to accessing RNA secondary structures grounded in experimental data. The database is freely available at http://rasp.zhanglab.net.
MESH Headings
- Animals
- COVID-19/epidemiology
- COVID-19/prevention & control
- COVID-19/virology
- Computational Biology/methods
- Computational Biology/statistics & numerical data
- Databases, Genetic/statistics & numerical data
- Genome, Viral/genetics
- High-Throughput Nucleotide Sequencing/methods
- High-Throughput Nucleotide Sequencing/statistics & numerical data
- Humans
- Nucleic Acid Conformation
- Pandemics
- RNA/chemistry
- RNA/genetics
- RNA Probes/genetics
- RNA, Bacterial/chemistry
- RNA, Bacterial/genetics
- RNA, Fungal/chemistry
- RNA, Fungal/genetics
- RNA, Plant/chemistry
- RNA, Plant/genetics
- RNA, Viral/chemistry
- RNA, Viral/genetics
- SARS-CoV-2/genetics
- SARS-CoV-2/physiology
- Transcriptome
Affiliation(s)
- Pan Li
  - MOE Key Laboratory of Bioinformatics, Beijing Advanced Innovation Center for Structural Biology & Frontier Research Center for Biological Structure, Center for Synthetic and Systems Biology, Tsinghua-Peking Joint Center for Life Sciences, School of Life Sciences, Tsinghua University, Beijing, 100084, China
- Xiaolin Zhou
  - MOE Key Laboratory of Bioinformatics, Beijing Advanced Innovation Center for Structural Biology & Frontier Research Center for Biological Structure, Center for Synthetic and Systems Biology, Tsinghua-Peking Joint Center for Life Sciences, School of Life Sciences, Tsinghua University, Beijing, 100084, China
- Kui Xu
  - MOE Key Laboratory of Bioinformatics, Beijing Advanced Innovation Center for Structural Biology & Frontier Research Center for Biological Structure, Center for Synthetic and Systems Biology, Tsinghua-Peking Joint Center for Life Sciences, School of Life Sciences, Tsinghua University, Beijing, 100084, China
- Qiangfeng Cliff Zhang
  - MOE Key Laboratory of Bioinformatics, Beijing Advanced Innovation Center for Structural Biology & Frontier Research Center for Biological Structure, Center for Synthetic and Systems Biology, Tsinghua-Peking Joint Center for Life Sciences, School of Life Sciences, Tsinghua University, Beijing, 100084, China
18
Ward M, Sun H, Datta A, Wise M, Mathews DH. Determining parameters for non-linear models of multi-loop free energy change. Bioinformatics 2020; 35:4298-4306. [PMID: 30923811 DOI: 10.1093/bioinformatics/btz222] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 09/24/2018] [Revised: 02/10/2019] [Accepted: 03/27/2019] [Indexed: 12/12/2022] Open
Abstract
MOTIVATION Predicting the secondary structure of RNA is a fundamental task in bioinformatics. Algorithms that predict secondary structure given only the primary sequence, and a model to evaluate the quality of a structure, are an integral part of this. These algorithms have been updated as our model of RNA thermodynamics changed and expanded. An exception to this has been the treatment of multi-loops. Although more advanced models of multi-loop free energy change have been suggested, a simple, linear model has been used since the 1980s. However, recently, new dynamic programming algorithms for secondary structure prediction that could incorporate these models were presented. Unfortunately, these models appear to have lower accuracy for secondary structure prediction. RESULTS We apply linear regression and a new parameter optimization algorithm to find better parameters for the existing linear model and advanced non-linear multi-loop models. These include the Jacobson-Stockmayer and Aalberts & Nandagopal models. We find that the current linear model parameters may be near optimal for the linear model, and that no advanced model performs better than the existing linear model parameters even after parameter optimization. AVAILABILITY AND IMPLEMENTATION Source code and data are available at https://github.com/maxhwardg/advanced_multiloops. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Max Ward
  - Computer Science & Software Engineering, The University of Western Australia, Crawley, WA, Australia
- Hongying Sun
  - Department of Biochemistry & Biophysics, University of Rochester, Rochester, NY, USA
  - Center for RNA Biology, University of Rochester, Rochester, NY, USA
- Amitava Datta
  - Computer Science & Software Engineering, The University of Western Australia, Crawley, WA, Australia
- Michael Wise
  - Computer Science & Software Engineering, The University of Western Australia, Crawley, WA, Australia
  - The Marshall Centre for Infectious Diseases Research and Training, The University of Western Australia, Crawley, WA, Australia
- David H Mathews
  - Department of Biostatistics & Computational Biology, University of Rochester, Rochester, NY, USA
19
Singh J, Hanson J, Paliwal K, Zhou Y. RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning. Nat Commun 2019; 10:5407. [PMID: 31776342 PMCID: PMC6881452 DOI: 10.1038/s41467-019-13395-9] [Citation(s) in RCA: 175] [Impact Index Per Article: 29.2] [Received: 06/12/2019] [Accepted: 11/01/2019] [Indexed: 01/03/2023] Open
Abstract
The majority of our human genome transcribes into noncoding RNAs with unknown structures and functions. Obtaining functional clues for noncoding RNAs requires accurate base-pairing or secondary-structure prediction. However, the performance of such predictions by current folding-based algorithms has stagnated for more than a decade. Here, we propose the use of deep contextual learning for base-pair prediction, including noncanonical and non-nested (pseudoknot) base pairs stabilized by tertiary interactions. Since only <250 nonredundant, high-resolution RNA structures are available for model training, we utilize transfer learning from a model initially trained with a recent high-quality bpRNA dataset of >10,000 nonredundant RNAs made available through comparative analysis. The resulting method achieves large, statistically significant improvement in predicting all base pairs, noncanonical and non-nested base pairs in particular. The proposed method (SPOT-RNA), with a freely available server and standalone software, should be useful for improving RNA structure modeling, sequence alignment, and functional annotations.
Affiliation(s)
- Jaswinder Singh
  - Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD, 4111, Australia
- Jack Hanson
  - Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD, 4111, Australia
- Kuldip Paliwal
  - Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD, 4111, Australia
- Yaoqi Zhou
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr., Southport, QLD, 4222, Australia
20
Petersen NP, Ort T, Torda AE. Improving the Numerical Stability of the NAST Force Field for RNA Simulations. J Chem Theory Comput 2019; 15:3402-3409. [DOI: 10.1021/acs.jctc.9b00089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Affiliation(s)
- Nils P. Petersen
  - Centre for Bioinformatics, University of Hamburg, Bundesstrasse 43, 20146 Hamburg, Germany
- Thomas Ort
  - Laboratory Automation and Biomanufacturing Engineering, Fraunhofer Institute for Manufacturing Engineering and Automation IPA, Nobelstrasse 12, 70569 Stuttgart, Germany
- Andrew E. Torda
  - Centre for Bioinformatics, University of Hamburg, Bundesstrasse 43, 20146 Hamburg, Germany
21
Geary C, Meunier PÉ, Schabanel N, Seki S. Oritatami: A Computational Model for Molecular Co-Transcriptional Folding. Int J Mol Sci 2019; 20:ijms20092259. [PMID: 31067813 PMCID: PMC6539498 DOI: 10.3390/ijms20092259] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 04/15/2019] [Revised: 04/25/2019] [Accepted: 04/30/2019] [Indexed: 12/12/2022] Open
Abstract
We introduce and study the computational power of Oritatami, a theoretical model that explores greedy molecular folding, whereby a molecular strand begins to fold before its production is complete. This model is inspired by our recent experimental work demonstrating the construction of shapes at the nanoscale from RNA, where strands of RNA fold into programmable shapes during their transcription from an engineered sequence of synthetic DNA. In the model of Oritatami, we explore the process of folding a single-strand bit by bit in such a way that the final fold emerges as a space-time diagram of computation. One major requirement in order to compute within this model is the ability to program a single sequence to fold into different shapes dependent on the state of the surrounding inputs. Another challenge is to embed all of the computing components within a contiguous strand, and in such a way that different fold patterns of the same strand perform different functions of computation. Here, we introduce general design techniques to solve these challenges in the Oritatami model. Our main result in this direction is the demonstration of a periodic Oritatami system that folds upon itself algorithmically into a prescribed set of shapes, depending on its current local environment, and whose final folding displays the sequence of binary integers from 0 to N = 2^k − 1 with a seed of size O(k). We prove that designing Oritatami is NP-hard in the number of possible local environments for the folding. Nevertheless, we provide an efficient algorithm, linear in the length of the sequence, that solves the Oritatami design problem when the number of local environments is a small fixed constant. This shows that this problem is in fact fixed parameter tractable (FPT) and can thus be solved in practice efficiently.
We hope that the numerous structural strategies employed in Oritatami enabling computation will inspire new architectures for computing in RNA that take advantage of the rapid kinetic-folding of RNA.
Affiliation(s)
- Cody Geary
  - Computer Science Computation and Neural Systems Bioengineering Caltech, MS 136-93, Moore Building, Pasadena, CA 91125, USA
- Nicolas Schabanel
  - CNRS, École normale supérieure de Lyon (LIP), CEDEX 07, 69364 Lyon, France
- Shinnosuke Seki
  - Computer and Network Engineering Dept, University of Electro-Communications, 1-5-1, Chofugaoka, Chofu, Tokyo 1828585, Japan
22
Mathews DH. How to benchmark RNA secondary structure prediction accuracy. Methods 2019; 162-163:60-67. [PMID: 30951834 DOI: 10.1016/j.ymeth.2019.04.003] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 12/31/2018] [Revised: 03/24/2019] [Accepted: 04/01/2019] [Indexed: 11/18/2022] Open
Abstract
RNA secondary structure prediction is widely used. As new methods are developed, these are often benchmarked for accuracy against existing methods. This review discusses good practices for performing these benchmarks, including the choice of benchmarking structures, metrics to quantify accuracy, the importance of allowing flexibility for pairs in the accepted structure, and the importance of statistical testing for significance.
Affiliation(s)
- David H Mathews
  - Center for RNA Biology, Department of Biochemistry & Biophysics, and Department of Biostatistics & Computational Biology, University of Rochester Medical Center, 601 Elmwood Avenue, Box 712, Rochester, NY 14642, United States
23
Akiyama M, Sato K, Sakakibara Y. A max-margin training of RNA secondary structure prediction integrated with the thermodynamic model. J Bioinform Comput Biol 2019; 16:1840025. [PMID: 30616476 DOI: 10.1142/s0219720018400255] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Indexed: 01/01/2023]
Abstract
A popular approach for predicting RNA secondary structure is the thermodynamic nearest-neighbor model, which finds the thermodynamically most stable secondary structure, that is, the one with minimum free energy (MFE). For further improvement, an alternative approach based on machine learning techniques has been developed. The machine learning-based approach can employ a fine-grained model that includes much richer feature representations with the ability to fit the training data. Although a machine learning-based fine-grained model achieved extremely high prediction accuracy, a risk of overfitting has been reported for such a model. In this paper, we propose a novel algorithm for RNA secondary structure prediction that integrates the thermodynamic approach and the machine learning-based weighted approach. Our fine-grained model combines the experimentally determined thermodynamic parameters with a large number of scoring parameters for detailed contexts of features that are trained by the structured support vector machine (SSVM) with the [Formula: see text] regularization to avoid overfitting. Our benchmark shows that our algorithm achieves the best prediction accuracy compared with existing methods, and no heavy overfitting is observed. The implementation of our algorithm is available at https://github.com/keio-bioinformatics/mxfold.
Affiliation(s)
- Manato Akiyama
  - Department of Biosciences and Informatics, Keio University, 3–14–1 Hiyoshi, Kohoku-ku, Yokohama 223–8522, Japan
- Kengo Sato
  - Department of Biosciences and Informatics, Keio University, 3–14–1 Hiyoshi, Kohoku-ku, Yokohama 223–8522, Japan
- Yasubumi Sakakibara
  - Department of Biosciences and Informatics, Keio University, 3–14–1 Hiyoshi, Kohoku-ku, Yokohama 223–8522, Japan
24
Zhu Y, Xie Z, Li Y, Zhu M, Chen YPP. Research on folding diversity in statistical learning methods for RNA secondary structure prediction. Int J Biol Sci 2018; 14:872-882. [PMID: 29989089 PMCID: PMC6036747 DOI: 10.7150/ijbs.24595] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 12/27/2017] [Accepted: 02/21/2018] [Indexed: 12/24/2022] Open
Abstract
How to improve the prediction accuracy of RNA secondary structure is currently a hot topic. The existing prediction methods for a single sequence do not fully consider the folding diversity which may occur among RNAs with different functions or sources. This paper explores the relationship between folding diversity and prediction accuracy, and puts forward a new method to improve the prediction accuracy of RNA secondary structure. Our research investigates the following: 1. The folding feature based on stochastic context-free grammar is proposed. By using dimension reduction and clustering techniques, some public data sets are analyzed. The results show that there is significant folding diversity among different RNA families. 2. To assign folding rules to RNAs without structural information, a classification method based on production probability is proposed. The experimental results show that the classification method proposed in this paper can effectively classify the RNAs of unknown structure. 3. Based on the existing prediction methods of statistical learning models, an RNA secondary structure prediction framework is proposed, namely "Cluster - Training - Parameter Selection - Prediction". The results show that, with information on folding diversity, prediction accuracy can be significantly improved.
Affiliation(s)
- Yu Zhu
  - College of Computer Science, Sichuan University, China
- ZhaoYang Xie
  - College of Computer Science, Sichuan University, China
- YiZhou Li
  - College of Chemistry, Sichuan University, China
- Min Zhu
  - Vice Dean of College of Computer Science, Sichuan University
- Yi-Ping Phoebe Chen
  - Department of Computer Science and Information Technology, La Trobe University, Australia
25
Findeiß S, Etzel M, Will S, Mörl M, Stadler PF. Design of Artificial Riboswitches as Biosensors. SENSORS 2017; 17:s17091990. [PMID: 28867802 PMCID: PMC5621056 DOI: 10.3390/s17091990] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Received: 08/08/2017] [Revised: 08/23/2017] [Accepted: 08/25/2017] [Indexed: 12/11/2022]
Abstract
RNA aptamers readily recognize small organic molecules, polypeptides, as well as other nucleic acids in a highly specific manner. Many such aptamers have evolved as parts of regulatory systems in nature. Experimental selection techniques such as SELEX have been very successful in finding artificial aptamers for a wide variety of natural and synthetic ligands. Changes in structure and/or stability of aptamers upon ligand binding can propagate through larger RNA constructs and cause specific structural changes at distal positions. In turn, these may affect transcription, translation, splicing, or binding events. The RNA secondary structure model realistically describes both thermodynamic and kinetic aspects of RNA structure formation and refolding at a single, consistent level of modelling. Thus, this framework allows studying the function of natural riboswitches in silico. Moreover, it enables rationally designing artificial switches, combining essentially arbitrary sensors with a broad choice of read-out systems. Eventually, this approach sets the stage for constructing versatile biosensors.
Affiliation(s)
- Sven Findeiß
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University Leipzig, Härtelstraße 16-18, 04107 Leipzig, Germany
  - Faculty of Computer Science, Research Group Bioinformatics and Computational Biology, University of Vienna, Währingerstraße 29, A-1090 Vienna, Austria
  - Faculty of Chemistry, Department of Theoretical Chemistry, University of Vienna, Währingerstraße 17, A-1090 Vienna, Austria
- Maja Etzel
  - Institute for Biochemistry, Leipzig University, Brüderstraße 34, 04103 Leipzig, Germany
- Sebastian Will
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University Leipzig, Härtelstraße 16-18, 04107 Leipzig, Germany
  - Faculty of Chemistry, Department of Theoretical Chemistry, University of Vienna, Währingerstraße 17, A-1090 Vienna, Austria
  - Institute for Biochemistry, Leipzig University, Brüderstraße 34, 04103 Leipzig, Germany
- Mario Mörl
  - Institute for Biochemistry, Leipzig University, Brüderstraße 34, 04103 Leipzig, Germany
- Peter F Stadler
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University Leipzig, Härtelstraße 16-18, 04107 Leipzig, Germany
  - Faculty of Chemistry, Department of Theoretical Chemistry, University of Vienna, Währingerstraße 17, A-1090 Vienna, Austria
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, 04103 Leipzig, Germany
  - Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, 04103 Leipzig, Germany
  - Fraunhofer Institute for Cell Therapy and Immunology, Perlickstrasse 1, 04103 Leipzig, Germany
  - Center for RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, 1870 Frederiksberg, Denmark
  - Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA
26
Yang Y, Li X, Zhao H, Zhan J, Wang J, Zhou Y. Genome-scale characterization of RNA tertiary structures and their functional impact by RNA solvent accessibility prediction. RNA (NEW YORK, N.Y.) 2017; 23:14-22. [PMID: 27807179 PMCID: PMC5159645 DOI: 10.1261/rna.057364.116] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 05/08/2016] [Accepted: 10/31/2016] [Indexed: 06/06/2023]
Abstract
As most RNA structures are elusive to structure determination, obtaining solvent accessible surface areas (ASAs) of nucleotides in an RNA structure is an important first step to characterize potential functional sites and core structural regions. Here, we developed RNAsnap, the first machine-learning method trained on protein-bound RNA structures for solvent accessibility prediction. Built on sequence profiles from multiple sequence alignment (RNAsnap-prof), the method provided robust prediction in fivefold cross-validation and an independent test (Pearson correlation coefficients, r, between predicted and actual ASA values are 0.66 and 0.63, respectively). Application of the method to 6178 mRNAs revealed its positive correlation to mRNA accessibility by dimethyl sulphate (DMS) experimentally measured in vivo (r = 0.37) but not in vitro (r = 0.07), despite the lack of training on mRNAs and the fact that DMS accessibility is only an approximation to solvent accessibility. We further found strong association across coding and noncoding regions between predicted solvent accessibility of the mutation site of a single nucleotide variant (SNV) and the frequency of that variant in the population for 2.2 million SNVs obtained in the 1000 Genomes Project. Moreover, mapping solvent accessibility of RNAs to the human genome indicated that introns, 5' cap of 5' and 3' cap of 3' untranslated regions, are more solvent accessible, consistent with their respective functional roles. These results support conformational selections as the mechanism for the formation of RNA-protein complexes and highlight the utility of genome-scale characterization of RNA tertiary structures by RNAsnap. The server and its stand-alone downloadable version are available at http://sparks-lab.org.
Affiliation(s)
- Yuedong Yang
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Gold Coast, QLD 4222, Australia
- Xiaomei Li
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Gold Coast, QLD 4222, Australia
  - School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China
- Huiying Zhao
  - Institute of Health and Biomedical Innovation, Queensland University of Technology, Queensland 4222, Australia
- Jian Zhan
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Gold Coast, QLD 4222, Australia
- Jihua Wang
  - Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou 253023, China
- Yaoqi Zhou
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Gold Coast, QLD 4222, Australia
  - Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou 253023, China
27
Effects of metal ions and cosolutes on G-quadruplex topology. J Inorg Biochem 2016; 166:190-198. [PMID: 27665315 DOI: 10.1016/j.jinorgbio.2016.09.001] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.0] [Received: 04/01/2016] [Revised: 08/31/2016] [Accepted: 09/13/2016] [Indexed: 12/11/2022]
Abstract
Topologies of G-quadruplexes depend on oligonucleotide sequences and on environmental factors, and the diversity of G-quadruplex topologies complicates investigation of functions of these nucleic acid structures. To investigate how metal ions and cosolutes regulate topologies of G-quadruplexes, we stabilized the antiparallel conformation by insertion of 2'-deoxyxanthosine and 8-oxo-2'-deoxyguanosine into selected positions of an oligonucleotide. Thermodynamic analyses of the oligonucleotide revealed that Na+ stabilized the antiparallel G-quadruplex, whereas K+ destabilized this topology. This result suggests that metal ions selectively stabilize G-quadruplex topologies with cavities between G-quartet planes of certain sizes. In the presence of KCl in 20 wt% poly(ethylene glycol) with average molecular weight of 200, the antiparallel basket-type G-quadruplex conformation was not stabilized compared with the dilute condition. In the presence of NaCl, the cosolute did stabilize the G-quadruplex with respect to the dilute condition. The presented data show that metal ions and cosolutes regulate topologies of G-quadruplexes through mechanisms that depend on sizes of metal ion cavities and hydration states.
|
28
|
Wu Y, Shi B, Ding X, Liu T, Hu X, Yip KY, Yang ZR, Mathews DH, Lu ZJ. Improved prediction of RNA secondary structure by integrating the free energy model with restraints derived from experimental probing data. Nucleic Acids Res 2015; 43:7247-59. [PMID: 26170232 PMCID: PMC4551937 DOI: 10.1093/nar/gkv706] [Citation(s) in RCA: 62] [Impact Index Per Article: 6.2] [Received: 02/04/2015] [Accepted: 06/30/2015] [Indexed: 12/30/2022] Open
Abstract
Recently, several experimental techniques have emerged for probing RNA structures based on high-throughput sequencing. However, most secondary structure prediction tools that incorporate probing data are designed and optimized for particular types of experiments. For example, RNAstructure-Fold is optimized for SHAPE data, while SeqFold is optimized for PARS data. Here, we report a new RNA secondary structure prediction method, restrained MaxExpect (RME), which can incorporate multiple types of experimental probing data and is based on a free energy model and an MEA (maximizing expected accuracy) algorithm. We first demonstrated that RME substantially improved secondary structure prediction with perfect restraints (base pair information of known structures). Next, we collected structure-probing data from diverse experiments (e.g. SHAPE, PARS and DMS-seq) and transformed them into a unified set of pairing probabilities with a posterior probabilistic model. By using the probability scores as restraints in RME, we compared its secondary structure prediction performance with two other well-known tools, RNAstructure-Fold (based on a free energy minimization algorithm) and SeqFold (based on a sampling algorithm). For SHAPE data, RME and RNAstructure-Fold performed better than SeqFold, because they markedly altered the energy model with the experimental restraints. For high-throughput data (e.g. PARS and DMS-seq) with lower probing efficiency, the secondary structure prediction performances of the tested tools were comparable, with performance improvements for only a portion of the tested RNAs. However, when the effects of tertiary structure and protein interactions were removed, RME showed the highest prediction accuracy in the DMS-accessible regions by incorporating in vivo DMS-seq data.
Affiliation(s)
- Yang Wu
- MOE Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Center for Plant Biology and Tsinghua-Peking Joint Center for Life Sciences, School of Life Sciences, Tsinghua University, Beijing 100084, China
| | - Binbin Shi
- MOE Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Center for Plant Biology and Tsinghua-Peking Joint Center for Life Sciences, School of Life Sciences, Tsinghua University, Beijing 100084, China
| | - Xinqiang Ding
- MOE Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Center for Plant Biology and Tsinghua-Peking Joint Center for Life Sciences, School of Life Sciences, Tsinghua University, Beijing 100084, China
| | - Tong Liu
- MOE Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Center for Plant Biology and Tsinghua-Peking Joint Center for Life Sciences, School of Life Sciences, Tsinghua University, Beijing 100084, China
| | - Xihao Hu
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong, China
| | - Kevin Y Yip
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong, China
| | - Zheng Rong Yang
- School of Biosciences, University of Exeter, Exeter EX4 4QD, UK
| | - David H Mathews
- Department of Biochemistry and Biophysics and Center for RNA Biology, University of Rochester Medical Center, Rochester, New York 14642, USA
| | - Zhi John Lu
- MOE Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Center for Plant Biology and Tsinghua-Peking Joint Center for Life Sciences, School of Life Sciences, Tsinghua University, Beijing 100084, China
|
29
|
Saule C, Giegerich R. Pareto optimization in algebraic dynamic programming. Algorithms Mol Biol 2015; 10:22. [PMID: 26150892 PMCID: PMC4491898 DOI: 10.1186/s13015-015-0051-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 08/28/2014] [Accepted: 05/07/2015] [Indexed: 11/10/2022] Open
Abstract
Pareto optimization combines independent objectives by computing the Pareto front of its search space, defined as the set of all solutions for which no other candidate solution scores better under all objectives. This gives, in a precise sense, better information than an artificial amalgamation of different scores into a single objective, but is more costly to compute. Pareto optimization naturally occurs with genetic algorithms, albeit in a heuristic fashion. Non-heuristic Pareto optimization so far has been used only with a few applications in bioinformatics. We study exact Pareto optimization for two objectives in a dynamic programming framework. We define a binary Pareto product operator [Formula: see text] on arbitrary scoring schemes. Independent of a particular algorithm, we prove that for two scoring schemes A and B used in dynamic programming, the scoring scheme [Formula: see text] correctly performs Pareto optimization over the same search space. We study different implementations of the Pareto operator with respect to their asymptotic and empirical efficiency. Without artificial amalgamation of objectives, and with no heuristics involved, Pareto optimization is faster than computing the same number of answers separately for each objective. For RNA structure prediction under the minimum free energy versus the maximum expected accuracy model, we show that the empirical size of the Pareto front remains within reasonable bounds. Pareto optimization lends itself to the comparative investigation of the behavior of two alternative scoring schemes for the same purpose. For the above scoring schemes, we observe that the Pareto front can be seen as a composition of a few macrostates, each consisting of several microstates that differ in the same limited way. We also study the relationship between abstract shape analysis and the Pareto front, and find that they extract information of a different nature from the folding space and can be meaningfully combined.
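As a rough illustration of the Pareto front described in this abstract, the sketch below filters a list of two-objective scores down to the non-dominated ones. It is not the authors' Pareto product operator or their implementation. The function name `pareto_front`, the example score pairs, and the choice to maximize both objectives are assumptions made for illustration. The dominance rule used is the common convention (at least as good in both objectives and strictly better in one), which is also an assumption and is slightly stricter than the abstract's one-sentence wording.

```python
from typing import List, Tuple

Score = Tuple[float, float]  # (objective A, objective B), both maximized (assumption)


def pareto_front(candidates: List[Score]) -> List[Score]:
    """Return the non-dominated score pairs, ordered by objective A descending."""
    # Sort by A descending, then B descending, after dropping exact duplicates.
    ordered = sorted(set(candidates), key=lambda s: (-s[0], -s[1]))
    front: List[Score] = []
    best_b = float("-inf")
    for a, b in ordered:
        # Every pair seen so far has A >= a, so (a, b) is non-dominated
        # only if its B value beats the best B seen so far.
        if b > best_b:
            front.append((a, b))
            best_b = b
    return front


# Hypothetical scores, e.g. (negated free energy, expected accuracy).
scores = [(1, 5), (2, 4), (3, 3), (2, 2), (3, 1), (1, 1)]
front = pareto_front(scores)
print(front)
```

The sort-and-sweep approach takes O(n log n) time for two objectives; a pairwise comparison of all candidates would give the same front in O(n^2).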
|
30
|
Abstract
The ViennaRNA package is a widely used collection of programs for thermodynamic RNA secondary structure prediction. Over the years, many additional tools have been developed building on the core programs of the package to also address issues related to noncoding RNA detection, RNA folding kinetics, or efficient sequence design considering RNA-RNA hybridizations. The ViennaRNA web services provide easy and user-friendly web access to these tools. This chapter describes how to use this online platform to perform tasks such as prediction of minimum free energy structures, prediction of RNA-RNA hybrids, or noncoding RNA detection. The ViennaRNA web services can be used free of charge and can be accessed via http://rna.tbi.univie.ac.at.
Affiliation(s)
- Andreas R Gruber
- Biozentrum, University of Basel, Klingelbergstrasse 50-70, 4056, Basel, Switzerland
|
31
|
Sterpone F, Melchionna S, Tuffery P, Pasquali S, Mousseau N, Cragnolini T, Chebaro Y, St-Pierre JF, Kalimeri M, Barducci A, Laurin Y, Tek A, Baaden M, Nguyen PH, Derreumaux P. The OPEP protein model: from single molecules, amyloid formation, crowding and hydrodynamics to DNA/RNA systems. Chem Soc Rev 2014; 43:4871-93. [PMID: 24759934 PMCID: PMC4426487 DOI: 10.1039/c4cs00048j] [Citation(s) in RCA: 123] [Impact Index Per Article: 11.2] [Indexed: 12/22/2022]
Abstract
The OPEP coarse-grained protein model has been applied to a wide range of applications since its first release 15 years ago. The model, which combines energetic and structural accuracy and chemical specificity, allows the study of single protein properties, DNA-RNA complexes, amyloid fibril formation and protein suspensions in a crowded environment. Here we first review the current state of the model and the most exciting applications using advanced conformational sampling methods. We then present the current limitations and a perspective on the ongoing developments.
Affiliation(s)
- Fabio Sterpone
- Laboratoire de Biochimie Théorique, UPR 9080 CNRS, Université Paris Diderot, Sorbonne Paris Cité, IBPC, 13 rue Pierre et Marie Curie, 75005, Paris, France.
|