1
|
Winkler J, Urgese G, Ficarra E, Reinert K. LaRA 2: parallel and vectorized program for sequence-structure alignment of RNA sequences. BMC Bioinformatics 2022; 23:18. [PMID: 34991448 PMCID: PMC8734264 DOI: 10.1186/s12859-021-04532-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/08/2020] [Accepted: 12/13/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The function of non-coding RNA sequences is largely determined by their spatial conformation, namely the secondary structure of the molecule, formed by Watson-Crick interactions between nucleotides. Hence, modern RNA alignment algorithms routinely take structural information into account. In order to discover yet unknown RNA families and infer their possible functions, the structural alignment of RNAs is an essential task. This task demands a lot of computational resources, especially for aligning many long sequences, and it therefore requires efficient algorithms that utilize modern hardware when available. A subset of the secondary structures contains overlapping interactions (called pseudoknots), which add additional complexity to the problem and are often ignored in available software. RESULTS We present the SeqAn-based software LaRA 2 that is significantly faster than comparable software for accurate pairwise and multiple alignments of structured RNA sequences. In contrast to other programs our approach can handle arbitrary pseudoknots. As an improved re-implementation of the LaRA tool for structural alignments, LaRA 2 uses multi-threading and vectorization for parallel execution and a new heuristic for computing a lower boundary of the solution. Our algorithmic improvements yield a program that is up to 130 times faster than the previous version. CONCLUSIONS With LaRA 2 we provide a tool to analyse large sets of RNA secondary structures in relatively short time, based on structural alignment. The produced alignments can be used to derive structural motifs for the search in genomic databases.
Collapse
Affiliation(s)
- Jörg Winkler
- Department of Mathematics and Computer Science, Free University Berlin, Takustraße 9, 14195 Berlin, Germany
- Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, 14195 Berlin, Germany
| | - Gianvito Urgese
- Interuniversity Department of Regional and Urban Studies and Planning, Politecnico di Torino, C.so Duca degli Abruzzi 24, 10129 Turin, Italy
| | - Elisa Ficarra
- Department of Control and Computer Science, Politecnico di Torino, C.so Duca degli Abruzzi 24, 10129 Turin, Italy
| | - Knut Reinert
- Department of Mathematics and Computer Science, Free University Berlin, Takustraße 9, 14195 Berlin, Germany
- Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, 14195 Berlin, Germany
| |
Collapse
|
2
|
Ahmed SA, Mneimneh S. Gibbs/MCMC Sampling for Multiple RNA Interaction with Sub-optimal Solutions. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:703-712. [PMID: 30629511 DOI: 10.1109/tcbb.2018.2890519] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Multiple RNA interaction can be modeled as a problem in combinatorial optimization, where the "optimal" structure is driven by an energy-minimization-like algorithm. However, the actual structure may not be optimal in this computational sense. Moreover, it is not necessarily unique. Therefore, alternative sub-optimal solutions are needed to cover the biological ground. We present a combinatorial formulation for the Multiple RNA Interaction problem with approximation algorithms to handle various interaction patterns, which when combined with Gibbs sampling and MCMC (Markov Chain Monte Carlo), can efficiently generate a reasonable number of optimal and sub-optimal solutions. When viable structures are far from an optimal solution, exploring dependence among different parts of the interaction can increase their score and boost their candidacy for the sampling algorithm. By clustering the solutions, we identify few representatives that are distinct enough to suggest possible alternative structures.
Collapse
|
3
|
Abstract
RNA structure is conserved by evolution to a greater extent than sequence. Predicting the conserved structure for multiple homologous sequences can be much more accurate than predicting the structure for a single sequence. RNAstructure is a software package that includes the programs Dynalign, Multilign, TurboFold, and PARTS for predicting conserved RNA secondary structure. This chapter provides protocols for using these programs.
Collapse
|
4
|
Sloma MF, Mathews DH. Improving RNA secondary structure prediction with structure mapping data. Methods Enzymol 2015; 553:91-114. [PMID: 25726462 DOI: 10.1016/bs.mie.2014.10.053] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Abstract
Methods to probe RNA secondary structure, such as small molecule modifying agents, secondary structure-specific nucleases, inline probing, and SHAPE chemistry, are widely used to study the structure of functional RNA. Computational secondary structure prediction programs can incorporate probing data to predict structure with high accuracy. In this chapter, an overview of current methods for probing RNA secondary structure is provided, including modern high-throughput methods. Methods for guiding secondary structure prediction algorithms using these data are explained, and best practices for using these data are provided. This chapter concludes by listing a number of open questions about how to best use probing data, and what these data can provide.
Collapse
Affiliation(s)
- Michael F Sloma
- Department of Biochemistry & Biophysics, University of Rochester Medical Center, Box 712, Rochester, New York, USA; Center for RNA Biology, University of Rochester Medical Center, Box 712, Rochester, New York, USA
| | - David H Mathews
- Department of Biochemistry & Biophysics, University of Rochester Medical Center, Box 712, Rochester, New York, USA; Center for RNA Biology, University of Rochester Medical Center, Box 712, Rochester, New York, USA.
| |
Collapse
|
5
|
Abstract
It has been well accepted that the RNA secondary structures of most functional non-coding RNAs (ncRNAs) are closely related to their functions and are conserved during evolution. Hence, prediction of conserved secondary structures from evolutionarily related sequences is one important task in RNA bioinformatics; the methods are useful not only to further functional analyses of ncRNAs but also to improve the accuracy of secondary structure predictions and to find novel functional RNAs from the genome. In this review, I focus on common secondary structure prediction from a given aligned RNA sequence, in which one secondary structure whose length is equal to that of the input alignment is predicted. I systematically review and classify existing tools and algorithms for the problem, by utilizing the information employed in the tools and by adopting a unified viewpoint based on maximum expected gain (MEG) estimators. I believe that this classification will allow a deeper understanding of each tool and provide users with useful information for selecting tools for common secondary structure predictions.
Collapse
|
6
|
Giulietti M, Grillo G, Liuni S, Pesole G. A guideline for the annotation of UTR regulatory elements in the UTRsite collection. Methods Mol Biol 2015; 1269:339-48. [PMID: 25577389 DOI: 10.1007/978-1-4939-2291-8_21] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/25/2023]
Abstract
Gene expression regulatory elements are scattered in gene promoters and pre-mRNAs. In particular, RNA elements lying in untranslated regions (5' and 3'UTRs) are poorly studied because of their peculiar features (i.e., a combination of primary and secondary structure elements) which also pose remarkable computational challenges. Several years ago, we began collecting experimentally characterized UTR regulatory elements, developing the specialized database UTRsite. This paper describes the detailed guidelines to annotate cis-regulatory elements in 5' and 3' UnTranslated Regions (UTRs) by computational analyses, retracing all main steps used by UTRsite curators.
Collapse
Affiliation(s)
- Matteo Giulietti
- Institute of Biomembranes and Bioenergetics, National Research Council, 70126, Bari, Italy
| | | | | | | |
Collapse
|
7
|
Pervouchine DD. IRBIS: a systematic search for conserved complementarity. RNA (NEW YORK, N.Y.) 2014; 20:1519-31. [PMID: 25142064 PMCID: PMC4174434 DOI: 10.1261/rna.045088.114] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/11/2014] [Accepted: 06/26/2014] [Indexed: 05/28/2023]
Abstract
IRBIS is a computational pipeline for detecting conserved complementary regions in unaligned orthologous sequences. Unlike other methods, it follows the "first-fold-then-align" principle in which all possible combinations of complementary k-mers are searched for simultaneous conservation. The novel trimming procedure reduces the size of the search space and improves the performance to the point where large-scale analyses of intra- and intermolecular RNA-RNA interactions become possible. In this article, I provide a rigorous description of the method, benchmarking on simulated and real data, and a set of stringent predictions of intramolecular RNA structure in placental mammals, drosophilids, and nematodes. I discuss two particular cases of long-range RNA structures that are likely to have a causal effect on single- and multiple-exon skipping, one in the mammalian gene Dystonin and the other in the insect gene Ca-α1D. In Dystonin, one of the two complementary boxes contains a binding site of Rbfox protein similar to one recently described in Enah gene. I also report that snoRNAs and long noncoding RNAs (lncRNAs) have a high capacity of base-pairing to introns of protein-coding genes, suggesting possible involvement of these transcripts in splicing regulation. I also find that conserved sequences that occur equally likely on both strands of DNA (e.g., transcription factor binding sites) contribute strongly to the false-discovery rate and, therefore, would confound every such analysis. IRBIS is an open-source software that is available at http://genome.crg.es/~dmitri/irbis/.
Collapse
Affiliation(s)
- Dmitri D Pervouchine
- Centre for Genomic Regulation and UPF, Barcelona 08003, Spain Faculty of Bioengineering and Bioinformatics, Moscow State University, 119992 Moscow, Russia
| |
Collapse
|
8
|
Abstract
Transcriptomics experiments and computational predictions both enable systematic discovery of new functional RNAs. However, many putative noncoding transcripts arise instead from artifacts and biological noise, and current computational prediction methods have high false positive rates. I discuss prospects for improving computational methods for analyzing and identifying functional RNAs, with a focus on detecting signatures of conserved RNA secondary structure. An interesting new front is the application of chemical and enzymatic experiments that probe RNA structure on a transcriptome-wide scale. I review several proposed approaches for incorporating structure probing data into the computational prediction of RNA secondary structure. Using probabilistic inference formalisms, I show how all these approaches can be unified in a well-principled framework, which in turn allows RNA probing data to be easily integrated into a wide range of analyses that depend on RNA secondary structure inference. Such analyses include homology search and genome-wide detection of new structural RNAs.
Collapse
Affiliation(s)
- Sean R Eddy
- Howard Hughes Medical Institute Janelia Farm Research Campus, Ashburn, Virginia 20147;
| |
Collapse
|
9
|
Abstract
Many bioinformatics problems, such as sequence alignment, gene prediction, phylogenetic tree estimation and RNA secondary structure prediction, are often affected by the 'uncertainty' of a solution, that is, the probability of the solution is extremely small. This situation arises for estimation problems on high-dimensional discrete spaces in which the number of possible discrete solutions is immense. In the analysis of biological data or the development of prediction algorithms, this uncertainty should be handled carefully and appropriately. In this review, I will explain several methods to combat this uncertainty, presenting a number of examples in bioinformatics. The methods include (i) avoiding point estimation, (ii) maximum expected accuracy (MEA) estimations and (iii) several strategies to design a pipeline involving several prediction methods. I believe that the basic concepts and ideas described in this review will be generally useful for estimation problems in various areas of bioinformatics.
Collapse
|
10
|
Hamada M. Direct updating of an RNA base-pairing probability matrix with marginal probability constraints. J Comput Biol 2013; 19:1265-76. [PMID: 23210474 DOI: 10.1089/cmb.2012.0215] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022] Open
Abstract
A base-pairing probability matrix (BPPM) stores the probabilities for every possible base pair in an RNA sequence and has been used in many algorithms in RNA informatics (e.g., RNA secondary structure prediction and motif search). In this study, we propose a novel algorithm to perform iterative updates of a given BPPM, satisfying marginal probability constraints that are (approximately) given by recently developed biochemical experiments, such as SHAPE, PAR, and FragSeq. The method is easily implemented and is applicable to common models for RNA secondary structures, such as energy-based or machine-learning-based models. In this article, we focus mainly on the details of the algorithms, although preliminary computational experiments will also be presented.
Collapse
Affiliation(s)
- Michiaki Hamada
- The University of Tokyo, Graduate School of Frontier Science, Kashiwa, Japan.
| |
Collapse
|
11
|
Puton T, Kozlowski LP, Rother KM, Bujnicki JM. CompaRNA: a server for continuous benchmarking of automated methods for RNA secondary structure prediction. Nucleic Acids Res 2013; 41:4307-23. [PMID: 23435231 PMCID: PMC3627593 DOI: 10.1093/nar/gkt101] [Citation(s) in RCA: 81] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/26/2022] Open
Abstract
We present a continuous benchmarking approach for the assessment of RNA secondary structure prediction methods implemented in the CompaRNA web server. As of 3 October 2012, the performance of 28 single-sequence and 13 comparative methods has been evaluated on RNA sequences/structures released weekly by the Protein Data Bank. We also provide a static benchmark generated on RNA 2D structures derived from the RNAstrand database. Benchmarks on both data sets offer insight into the relative performance of RNA secondary structure prediction methods on RNAs of different size and with respect to different types of structure. According to our tests, on the average, the most accurate predictions obtained by a comparative approach are generated by CentroidAlifold, MXScarna, RNAalifold and TurboFold. On the average, the most accurate predictions obtained by single-sequence analyses are generated by CentroidFold, ContextFold and IPknot. The best comparative methods typically outperform the best single-sequence methods if an alignment of homologous RNA sequences is available. This article presents the results of our benchmarks as of 3 October 2012, whereas the rankings presented online are continuously updated. We will gladly include new prediction methods and new measures of accuracy in the new editions of CompaRNA benchmarks.
Collapse
Affiliation(s)
- Tomasz Puton
- Bioinformatics Laboratory, Institute for Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University, ul. Umultowska 89, 61-614 Poznan, Poland
| | | | | | | |
Collapse
|
12
|
Hamada M, Asai K. A classification of bioinformatics algorithms from the viewpoint of maximizing expected accuracy (MEA). J Comput Biol 2012; 19:532-49. [PMID: 22313125 DOI: 10.1089/cmb.2011.0197] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Many estimation problems in bioinformatics are formulated as point estimation problems in a high-dimensional discrete space. In general, it is difficult to design reliable estimators for this type of problem, because the number of possible solutions is immense, which leads to an extremely low probability for every solution-even for the one with the highest probability. Therefore, maximum score and maximum likelihood estimators do not work well in this situation although they are widely employed in a number of applications. Maximizing expected accuracy (MEA) estimation, in which accuracy measures of the target problem and the entire distribution of solutions are considered, is a more successful approach. In this review, we provide an extensive discussion of algorithms and software based on MEA. We describe how a number of algorithms used in previous studies can be classified from the viewpoint of MEA. We believe that this review will be useful not only for users wishing to utilize software to solve the estimation problems appearing in this article, but also for developers wishing to design algorithms on the basis of MEA.
Collapse
Affiliation(s)
- Michiaki Hamada
- Graduate School of Frontier Sciences, University of Tokyo, Kashiwa, Japan.
| | | |
Collapse
|