1
Mittal A, Ali SE, Mathews DH. Using the RNAstructure Software Package to Predict Conserved RNA Structures. Curr Protoc 2024; 4:e70054. [PMID: 39540715 DOI: 10.1002/cpz1.70054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2024]
Abstract
The structures of many non-coding RNAs (ncRNA) are conserved by evolution to a greater extent than their sequences. By predicting the conserved structure of two or more homologous sequences, the accuracy of secondary structure prediction can be improved as compared to structure prediction for a single sequence. Here, we provide protocols for the use of four programs in the RNAstructure suite to predict conserved structures: Multilign, TurboFold, Dynalign, and PARTS. TurboFold iteratively aligns multiple homologous sequences and estimates the pairing probabilities for the conserved structure. Dynalign, PARTS, and Multilign are dynamic programming algorithms that simultaneously align sequences and identify the common secondary structure. Dynalign uses a pair of homologs and finds the lowest free energy common structure. PARTS uses a pair of homologs and estimates pairing probabilities from the base pairing probabilities estimated for each sequence. Multilign uses two or more homologs and finds the lowest free energy common structure using multiple pairwise calculations with Dynalign. It scales linearly with the number of sequences. We outline the strengths of each program. These programs can be run through web servers, on the command line, or with graphical user interfaces. © 2024 Wiley Periodicals LLC.
- Basic Protocol 1: Predicting a structure conserved in three or more sequences with the RNAstructure web server
- Basic Protocol 2: Predicting a structure conserved in two sequences with the RNAstructure web server
- Alternative Protocol 1: Predicting a structure conserved in multiple sequences in the RNAstructure graphical user interface
- Alternative Protocol 2: Predicting a structure conserved in two sequences with Dynalign in the RNAstructure graphical user interface
- Alternative Protocol 3: Running TurboFold on the command line
Affiliation(s)
- Abhinav Mittal
  - Department of Biochemistry & Biophysics and Center for RNA Biology, University of Rochester Medical Center, Rochester, New York
- Sara E Ali
  - Department of Biochemistry & Biophysics and Center for RNA Biology, University of Rochester Medical Center, Rochester, New York
- David H Mathews
  - Department of Biochemistry & Biophysics and Center for RNA Biology, University of Rochester Medical Center, Rochester, New York
2
Penev PI, Fakhretaha-Aval S, Patel VJ, Cannone JJ, Gutell RR, Petrov AS, Williams LD, Glass JB. Supersized Ribosomal RNA Expansion Segments in Asgard Archaea. Genome Biol Evol 2021; 12:1694-1710. [PMID: 32785681 PMCID: PMC7594248 DOI: 10.1093/gbe/evaa170] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Accepted: 08/07/2020] [Indexed: 12/11/2022]
Abstract
The ribosome’s common core, comprised of ribosomal RNA (rRNA) and universal ribosomal proteins, connects all life back to a common ancestor and serves as a window to relationships among organisms. The rRNA of the common core is similar to rRNA of extant bacteria. In eukaryotes, the rRNA of the common core is decorated by expansion segments (ESs) that vastly increase its size. Supersized ESs have not been observed previously in Archaea, and the origin of eukaryotic ESs remains enigmatic. We discovered that the large ribosomal subunit (LSU) rRNA of two Asgard phyla, Lokiarchaeota and Heimdallarchaeota, considered to be the closest modern archaeal cell lineages to Eukarya, bridge the gap in size between prokaryotic and eukaryotic LSU rRNAs. The elongated LSU rRNAs in Lokiarchaeota and Heimdallarchaeota stem from two supersized ESs, called ES9 and ES39. We applied chemical footprinting experiments to study the structure of Lokiarchaeota ES39. Furthermore, we used covariation and sequence analysis to study the evolution of Asgard ES39s and ES9s. By defining the common eukaryotic ES39 signature fold, we found that Asgard ES39s have more and longer helices than eukaryotic ES39s. Although Asgard ES39s have sequences and structures distinct from eukaryotic ES39s, we found overall conservation of a three-way junction across the Asgard species that matches eukaryotic ES39 topology, a result consistent with the accretion model of ribosomal evolution.
Affiliation(s)
- Petar I Penev
  - Georgia Institute of Technology, NASA Center for the Origin of Life, Atlanta, Georgia; School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia
- Sara Fakhretaha-Aval
  - Georgia Institute of Technology, NASA Center for the Origin of Life, Atlanta, Georgia; School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia
- Vaishnavi J Patel
  - Department of Integrative Biology, The University of Texas at Austin, Austin, Texas
- Jamie J Cannone
  - Department of Integrative Biology, The University of Texas at Austin, Austin, Texas
- Robin R Gutell
  - Department of Integrative Biology, The University of Texas at Austin, Austin, Texas
- Anton S Petrov
  - Georgia Institute of Technology, NASA Center for the Origin of Life, Atlanta, Georgia; School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia
- Loren Dean Williams
  - Georgia Institute of Technology, NASA Center for the Origin of Life, Atlanta, Georgia; School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia; School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia
- Jennifer B Glass
  - Georgia Institute of Technology, NASA Center for the Origin of Life, Atlanta, Georgia; School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia; School of Earth and Atmospheric Sciences, Georgia Institute of Technology, Atlanta, Georgia
3
Discoveries of Exoribonuclease-Resistant Structures of Insect-Specific Flaviviruses Isolated in Zambia. Viruses 2020; 12:v12091017. [PMID: 32933075 PMCID: PMC7551683 DOI: 10.3390/v12091017] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 07/15/2020] [Revised: 09/08/2020] [Accepted: 09/08/2020] [Indexed: 12/13/2022]
Abstract
To monitor the arthropod-borne virus transmission in mosquitoes, we have attempted both to detect and isolate viruses from 3304 wild-caught female mosquitoes in the Livingstone (Southern Province) and Mongu (Western Province) regions in Zambia in 2017. A pan-flavivirus RT-PCR assay was performed to identify flavivirus genomes in total RNA extracted from mosquito lysates, followed by virus isolation and full genome sequence analysis using next-generation sequencing and rapid amplification of cDNA ends. We isolated a newly identified Barkedji virus (BJV Zambia) (10,899 nt) and a novel flavivirus, tentatively termed Barkedji-like virus (BJLV) (10,885 nt) from Culex spp. mosquitoes which shared 96% and 75% nucleotide identity with BJV which has been isolated in Israel, respectively. These viruses could replicate in C6/36 cells but not in mammalian and avian cell lines. In parallel, a comparative genomics screening was conducted to study evolutionary traits of the 5'- and 3'-untranslated regions (UTRs) of isolated viruses. Bioinformatic analyses of the secondary structures in the UTRs of both viruses revealed that the 5'-UTRs exhibit canonical stem-loop structures, while the 3'-UTRs contain structural homologs to exoribonuclease-resistant RNAs (xrRNAs), SL-III, dumbbell, and terminal stem-loop (3'SL) structures. The function of predicted xrRNA structures to stop RNA degradation by Xrn1 exoribonuclease was further proved by the in vitro Xrn1 resistance assay.
4
Chen CC, Qian X, Yoon BJ. RNAdetect: efficient computational detection of novel non-coding RNAs. Bioinformatics 2020; 35:1133-1141. [PMID: 30169792 DOI: 10.1093/bioinformatics/bty765] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 06/08/2017] [Revised: 07/30/2018] [Accepted: 08/30/2018] [Indexed: 11/12/2022]
Abstract
MOTIVATION: Non-coding RNAs (ncRNAs) are known to play crucial roles in various biological processes, and there is a pressing need for accurate computational detection methods that could be used to efficiently scan genomes to detect novel ncRNAs. However, unlike coding genes, ncRNAs often lack distinctive sequence features that could be used for recognizing them. Although many ncRNAs are known to have a well conserved secondary structure, which provides useful cues for computational prediction, it has been also shown that a structure-based approach alone may not be sufficient for detecting ncRNAs in a single sequence. Currently, the most effective ncRNA detection methods combine structure-based techniques with a comparative genome analysis approach to improve the prediction performance.
RESULTS: In this paper, we propose RNAdetect, a computational method incorporating novel features for accurate detection of ncRNAs in combination with comparative genome analysis. Given a sequence alignment, RNAdetect can accurately detect the presence of functional ncRNAs by incorporating novel predictive features based on the concept of generalized ensemble defect (GED), which assesses the degree of structure conservation across multiple related sequences and the conformation of the individual folding structures to a common consensus structure. Furthermore, n-gram models (NGMs) are used to extract features that can effectively capture sequence homology to known ncRNA families. Utilization of NGMs can enhance the detection of ncRNAs that have sparse folding structures with many unpaired bases. Extensive performance evaluation based on the Rfam database and bacterial genomes demonstrate that RNAdetect can accurately and reliably detect novel ncRNAs, outperforming the current state-of-the-art methods.
AVAILABILITY AND IMPLEMENTATION: The source code for RNAdetect and the benchmark data used in this paper can be downloaded at https://github.com/bjyoontamu/RNAdetect.
Affiliation(s)
- Chun-Chi Chen
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA; TEES-AgriLife Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, College Station, TX, USA
- Xiaoning Qian
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA; TEES-AgriLife Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, College Station, TX, USA
- Byung-Jun Yoon
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA; TEES-AgriLife Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, College Station, TX, USA
5
Abstract
During the last decade, ncRNAs have been investigated intensively, revealing their regulatory roles in various biological processes. Worldwide research efforts have identified numerous ncRNAs and multiple RNA subtypes, which are associated with diverse functions and are known to interact with different functional layers, from DNA and RNA to proteins. This makes the prediction of functions for newly identified ncRNAs challenging. Current bioinformatics and systems biology approaches show promising results in facilitating the identification of these diverse ncRNA functionalities. Here, we review (a) current experimental protocols, e.g., for Next Generation Sequencing, for a successful identification of ncRNAs; (b) sequencing data analysis workflows as well as available computational environments; and (c) state-of-the-art approaches to functionally characterize ncRNAs, e.g., by means of transcriptome-wide association studies, molecular network analyses, or artificial intelligence guided prediction. In addition, we present a strategy to cover the identification and functional characterization of unknown transcripts by using connective workflows.
6
Akhter S, Aziz RK, Kashef MT, Ibrahim ES, Bailey B, Edwards RA. Kullback Leibler divergence in complete bacterial and phage genomes. PeerJ 2017; 5:e4026. [PMID: 29204318 PMCID: PMC5712468 DOI: 10.7717/peerj.4026] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/30/2017] [Accepted: 10/22/2017] [Indexed: 12/11/2022]
Abstract
The amino acid content of the proteins encoded by a genome may predict the coding potential of that genome and may reflect lifestyle restrictions of the organism. Here, we calculated the Kullback–Leibler divergence from the mean amino acid content as a metric to compare the amino acid composition for a large set of bacterial and phage genome sequences. Using these data, we demonstrate that (i) there is a significant difference between amino acid utilization in different phylogenetic groups of bacteria and phages; (ii) many of the bacteria with the most skewed amino acid utilization profiles, or the bacteria that host phages with the most skewed profiles, are endosymbionts or parasites; (iii) the skews in the distribution are not restricted to certain metabolic processes but are common across all bacterial genomic subsystems; (iv) amino acid utilization profiles strongly correlate with GC content in bacterial genomes but very weakly correlate with the G+C percent in phage genomes. These findings might be exploited to distinguish coding from non-coding sequences in large data sets, such as metagenomic sequence libraries, to help in prioritizing subsequent analyses.
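The metric described in this abstract can be illustrated with a minimal sketch. It assumes that each genome is summarized as a 20-letter amino acid frequency profile and that the reference distribution is the unweighted mean of those profiles. The function names, the pseudocount, and the base-2 logarithm are illustrative choices, not details taken from the paper.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(proteins, pseudocount=1.0):
    """Amino acid frequency profile of a genome given its protein sequences.

    The pseudocount (an assumption of this sketch) keeps every frequency
    above zero so that the logarithm below is always defined.
    """
    counts = {aa: pseudocount for aa in AMINO_ACIDS}
    for seq in proteins:
        for ch in seq.upper():
            if ch in counts:  # ignore ambiguous or non-standard residues
                counts[ch] += 1
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}


def mean_profile(profiles):
    """Unweighted mean of several frequency profiles."""
    n = len(profiles)
    return {aa: sum(p[aa] for p in profiles) / n for aa in AMINO_ACIDS}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS if p[aa] > 0)
```

A genome whose profile equals the mean has a divergence of zero, and the value grows as its amino acid usage becomes more skewed relative to the mean.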
Affiliation(s)
- Sajia Akhter
  - Computational Science Research Center, San Diego State University, San Diego, CA, USA
- Ramy K Aziz
  - Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, Cairo, Egypt; Department of Computer Science, San Diego State University, San Diego, CA, United States of America
- Mona T Kashef
  - Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, Cairo, Egypt
- Eslam S Ibrahim
  - Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, Cairo, Egypt
- Barbara Bailey
  - Department of Mathematics & Statistics, San Diego State University, San Diego, CA, USA
- Robert A Edwards
  - Computational Science Research Center, San Diego State University, San Diego, CA, USA; Department of Computer Science, San Diego State University, San Diego, CA, United States of America; Department of Mathematics & Statistics, San Diego State University, San Diego, CA, USA; Department of Biology, San Diego State University, San Diego, CA, USA
7
Pian C, Zhang G, Chen Z, Chen Y, Zhang J, Yang T, Zhang L. LncRNApred: Classification of Long Non-Coding RNAs and Protein-Coding Transcripts by the Ensemble Algorithm with a New Hybrid Feature. PLoS One 2016; 11:e0154567. [PMID: 27228152 PMCID: PMC4882039 DOI: 10.1371/journal.pone.0154567] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Received: 09/18/2015] [Accepted: 04/15/2016] [Indexed: 12/31/2022]
Abstract
As a novel class of noncoding RNAs, long noncoding RNAs (lncRNAs) have been verified to be associated with various diseases. As transcripts are generated on a large scale every year, it is important to identify lncRNAs accurately and quickly among thousands of assembled transcripts. To accurately discover new lncRNAs, we developed a random forest (RF) classification tool named LncRNApred, based on a new hybrid feature. This hybrid feature set includes three newly proposed features: MaxORF, RMaxORF and SNR. LncRNApred is effective for classifying lncRNAs and protein-coding transcripts accurately and quickly. Moreover, our RF model requires training only on data from human coding and non-coding transcripts. Transcripts from other species can also be classified using LncRNApred. The results show that our method is more effective than the Coding Potential Calculator (CPC). The web server of LncRNApred is available for free at http://mm20132014.wicp.net:57203/LncRNApred/home.jsp.
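The abstract does not define its features, so the following is only a sketch of an ORF-length feature under stated assumptions. It takes MaxORF to mean the length in nucleotides of the longest open reading frame, scans only the three forward frames, and requires an ATG start and an in-frame stop codon. The function name `longest_orf_length` is hypothetical, and the paper's actual definition may differ.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq):
    """Length (nt, stop codon included) of the longest ATG-to-stop ORF.

    Only the three forward reading frames are scanned, and an ORF without
    an in-frame stop codon is not counted. Both are simplifying assumptions.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    best = 0
    for frame in range(3):
        start = None  # index of the first ATG of the ORF currently open
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best
```

Protein-coding transcripts tend to contain one long ORF, whereas lncRNAs usually contain only short ones, which is why a feature of this kind can help a classifier separate the two classes.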
Affiliation(s)
- Cong Pian
  - Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, Jiangsu, People’s Republic of China
- Guangle Zhang
  - Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, Jiangsu, People’s Republic of China
- Zhi Chen
  - Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, Jiangsu, People’s Republic of China
- Yuanyuan Chen
  - Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, Jiangsu, People’s Republic of China
- Jin Zhang
  - Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, Jiangsu, People’s Republic of China
- Tao Yang
  - Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, Jiangsu, People’s Republic of China
- Liangyun Zhang
  - Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, Jiangsu, People’s Republic of China
8
Chen JL, Bellaousov S, Tubbs JD, Kennedy SD, Lopez MJ, Mathews DH, Turner DH. Nuclear Magnetic Resonance-Assisted Prediction of Secondary Structure for RNA: Incorporation of Direction-Dependent Chemical Shift Constraints. Biochemistry 2015; 54:6769-82. [PMID: 26451676 PMCID: PMC4666457 DOI: 10.1021/acs.biochem.5b00833] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Indexed: 11/30/2022]
Abstract
Knowledge of RNA structure is necessary to determine structure–function relationships and to facilitate design of potential therapeutics. RNA secondary structure prediction can be improved by applying constraints from nuclear magnetic resonance (NMR) experiments to a dynamic programming algorithm. Imino proton walks from NOESY spectra reveal double-stranded regions. Chemical shifts of protons in GH1, UH3, and UH5 of GU pairs, UH3, UH5, and AH2 of AU pairs, and GH1 of GC pairs were analyzed to identify constraints for the 5′ to 3′ directionality of base pairs in helices. The 5′ to 3′ directionality constraints were incorporated into an NMR-assisted prediction of secondary structure (NAPSS-CS) program. When it was tested on 18 structures, including nine pseudoknots, the sensitivity and positive predictive value were improved relative to those of three unrestrained programs. The prediction accuracy for the pseudoknots improved the most. The program also facilitates assignment of chemical shifts to individual nucleotides, a necessary step for determining three-dimensional structure.
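The two accuracy measures named in this abstract can be sketched as follows. Here sensitivity is the fraction of known base pairs that were predicted, and positive predictive value (PPV) is the fraction of predicted pairs that are in the known structure. This sketch assumes exact matching of pairs given as (i, j) index tuples. Published benchmarks often also accept pairs shifted by one nucleotide, which is not modeled here, and the function name is hypothetical.

```python
def sensitivity_ppv(predicted_pairs, known_pairs):
    """Return (sensitivity, PPV) for two collections of base pairs.

    Each pair is an (i, j) tuple of nucleotide indices. Pairs are
    normalized so that (i, j) and (j, i) count as the same pair.
    """
    predicted = {tuple(sorted(p)) for p in predicted_pairs}
    known = {tuple(sorted(p)) for p in known_pairs}
    true_positives = len(predicted & known)
    sensitivity = true_positives / len(known) if known else 0.0
    ppv = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, ppv
```

Reporting both values matters because a method can raise sensitivity simply by predicting more pairs, which PPV then penalizes.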
Affiliation(s)
- Jonathan L Chen
  - Department of Chemistry, University of Rochester, Rochester, New York 14627, United States
- Stanislav Bellaousov
  - Department of Biochemistry and Biophysics, University of Rochester School of Medicine and Dentistry, Rochester, New York 14642, United States
- Jason D Tubbs
  - Department of Chemistry, University of Rochester, Rochester, New York 14627, United States
- Scott D Kennedy
  - Department of Biochemistry and Biophysics, University of Rochester School of Medicine and Dentistry, Rochester, New York 14642, United States
- Michael J Lopez
  - Department of Chemistry, University of Rochester, Rochester, New York 14627, United States
- David H Mathews
  - Department of Biochemistry and Biophysics, University of Rochester School of Medicine and Dentistry, Rochester, New York 14642, United States; Center for RNA Biology, University of Rochester, Rochester, New York 14642, United States
- Douglas H Turner
  - Department of Chemistry, University of Rochester, Rochester, New York 14627, United States; Center for RNA Biology, University of Rochester, Rochester, New York 14642, United States
9
Abstract
Genomic studies have greatly expanded our knowledge of structural non-coding RNAs (ncRNAs). These RNAs fold into characteristic secondary structures and perform specific structure-dependent biological functions. Hence, RNA secondary structure prediction is one of the most well-studied problems in computational RNA biology. Comparative sequence analysis is one of the more reliable RNA structure prediction approaches, as it exploits information from multiple related sequences to infer the consensus secondary structure. This class of methods essentially learns a global secondary structure from the input sequences. In this paper, we consider the more general problem of unearthing common local secondary-structure-based patterns from a set of related sequences. The input sequences could, for example, correspond to the 3′ or 5′ untranslated regions of a set of orthologous genes, and the unearthed local patterns could correspond to regulatory motifs found in these regions. These sequences could also correspond to in vitro selected RNA, genomic segments housing ncRNA genes from the same family, and so on. Here, we give a detailed review of the various computational techniques proposed in the literature to solve this general motif discovery problem. We also give empirical comparisons of some of the current state-of-the-art methods and point out future directions of research.
Affiliation(s)
- Avinash Achar
  - Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway
- Pål Sætrom
  - Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway
  - Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway
10
Corley M, Solem A, Qu K, Chang HY, Laederach A. Detecting riboSNitches with RNA folding algorithms: a genome-wide benchmark. Nucleic Acids Res 2015; 43:1859-68. [PMID: 25618847 PMCID: PMC4330374 DOI: 10.1093/nar/gkv010] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.7] [Indexed: 11/23/2022]
Abstract
Ribonucleic acid (RNA) secondary structure prediction continues to be a significant challenge, in particular when attempting to model sequences with less rigidly defined structures, such as messenger and non-coding RNAs. Crucial to interpreting RNA structures as they pertain to individual phenotypes is the ability to detect RNAs with large structural disparities caused by a single nucleotide variant (SNV) or riboSNitches. A recently published human genome-wide parallel analysis of RNA structure (PARS) study identified a large number of riboSNitches as well as non-riboSNitches, providing an unprecedented set of RNA sequences against which to benchmark structure prediction algorithms. Here we evaluate 11 different RNA folding algorithms’ riboSNitch prediction performance on these data. We find that recent algorithms designed specifically to predict the effects of SNVs on RNA structure, in particular remuRNA, RNAsnp and SNPfold, perform best on the most rigorously validated subsets of the benchmark data. In addition, our benchmark indicates that general structure prediction algorithms (e.g. RNAfold and RNAstructure) have overall better performance if base pairing probabilities are considered rather than minimum free energy calculations. Although overall aggregate algorithmic performance on the full set of riboSNitches is relatively low, significant improvement is possible if the highest confidence predictions are evaluated independently.
Affiliation(s)
- Meredith Corley
  - Department of Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA; Curriculum in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Amanda Solem
  - Department of Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Kun Qu
  - Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA 94305, USA
- Howard Y Chang
  - Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA 94305, USA; Howard Hughes Medical Institute, Stanford University, Stanford, CA 94305, USA
- Alain Laederach
  - Department of Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA; Curriculum in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
11
Mathews DH. Using the RNAstructure Software Package to Predict Conserved RNA Structures. Curr Protoc Bioinformatics 2014; 46:12.4.1-12.4.22. [PMID: 24939126 DOI: 10.1002/0471250953.bi1204s46] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Indexed: 11/10/2022]
Abstract
The structures of many non-coding RNA (ncRNA) are conserved by evolution to a greater extent than their sequences. By predicting the conserved structure of two or more homologous sequences, the accuracy of secondary structure prediction can be improved as compared to structure prediction for a single sequence. This unit provides protocols for the use of four programs in the RNAstructure suite for prediction of conserved structures, Multilign, TurboFold, Dynalign, and PARTS. These programs can be run via Web servers, on the command line, or with graphical interfaces.
Affiliation(s)
- David H Mathews
  - Department of Biochemistry & Biophysics and Center for RNA Biology, University of Rochester Medical Center, Rochester, New York
12
Lertampaiporn S, Thammarongtham C, Nukoolkit C, Kaewkamnerdpong B, Ruengjitchatchawalya M. Identification of non-coding RNAs with a new composite feature in the Hybrid Random Forest Ensemble algorithm. Nucleic Acids Res 2014; 42:e93. [PMID: 24771344 PMCID: PMC4066759 DOI: 10.1093/nar/gku325] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.8] [Received: 02/10/2014] [Revised: 04/02/2014] [Accepted: 04/07/2014] [Indexed: 12/13/2022]
Abstract
To identify non-coding RNA (ncRNA) signals within genomic regions, a classification tool was developed based on a hybrid random forest (RF) with a logistic regression model to efficiently discriminate short ncRNA sequences as well as long complex ncRNA sequences. This RF-based classifier was trained on a well-balanced dataset with a discriminative set of features and achieved an accuracy, sensitivity and specificity of 92.11%, 90.7% and 93.5%, respectively. The selected feature set includes a new proposed feature, SCORE. This feature is generated based on a logistic regression function that combines five significant features (structure, sequence, modularity, structural robustness and coding potential) to enable improved characterization of long ncRNA (lncRNA) elements. The use of SCORE improved the performance of the RF-based classifier in the identification of Rfam lncRNA families. A genome-wide ncRNA classification framework was applied to a wide variety of organisms, with an emphasis on those of economic, social, public health, environmental and agricultural significance, such as various bacteria genomes, the Arthrospira (Spirulina) genome, and rice and human genomic regions. Our framework was able to identify known ncRNAs with sensitivities of greater than 90% and 77.7% for prokaryotic and eukaryotic sequences, respectively. Our classifier is available at http://ncrna-pred.com/HLRF.htm.
Affiliation(s)
- Supatcha Lertampaiporn
  - Biological Engineering Program, Faculty of Engineering, King Mongkut's University of Technology Thonburi, 126 Pracha Uthit Rd, Bangmod, Thung Khru, Bangkok 10140, Thailand
- Chinae Thammarongtham
  - Biochemical Engineering and Pilot Plant Research and Development Unit, National Center for Genetic Engineering and Biotechnology at King Mongkut's University of Technology Thonburi (Bang Khun Thian Campus), 49 Soi Thian Thale 25, Bang Khun Thian Chai Thale Rd, Tha Kham, Bangkok 10150, Thailand
- Chakarida Nukoolkit
  - School of Information Technology, King Mongkut's University of Technology Thonburi, 126 Pracha Uthit Rd, Bangmod, Thung Khru, Bangkok 10140, Thailand
- Boonserm Kaewkamnerdpong
  - Biological Engineering Program, Faculty of Engineering, King Mongkut's University of Technology Thonburi, 126 Pracha Uthit Rd, Bangmod, Thung Khru, Bangkok 10140, Thailand
- Marasri Ruengjitchatchawalya
  - Biotechnology Program, School of Bioresources and Technology, King Mongkut's University of Technology Thonburi (Bang Khun Thian Campus), 49 Soi Thian Thale 25, Bang Khun Thian Chai Thale Rd, Tha Kham, Bangkok 10150, Thailand; Bioinformatics and Systems Biology Program, King Mongkut's University of Technology Thonburi (Bang Khun Thian Campus), 49 Soi Thian Thale 25, Bang Khun Thian Chai Thale Rd, Tha Kham, Bangkok 10150, Thailand
13
Smith MA, Gesell T, Stadler PF, Mattick JS. Widespread purifying selection on RNA structure in mammals. Nucleic Acids Res 2013; 41:8220-36. [PMID: 23847102 PMCID: PMC3783177 DOI: 10.1093/nar/gkt596] [Citation(s) in RCA: 130] [Impact Index Per Article: 10.8] [Received: 01/30/2013] [Revised: 05/29/2013] [Accepted: 06/16/2013] [Indexed: 12/14/2022]
Abstract
Evolutionarily conserved RNA secondary structures are a robust indicator of purifying selection and, consequently, molecular function. Evaluating their genome-wide occurrence through comparative genomics has consistently been plagued by high false-positive rates and divergent predictions. We present a novel benchmarking pipeline aimed at calibrating the precision of genome-wide scans for consensus RNA structure prediction. The benchmarking data obtained from two refined structure prediction algorithms, RNAz and SISSIz, were then analyzed to fine-tune the parameters of an optimized workflow for genomic sliding window screens. When applied to consistency-based multiple genome alignments of 35 mammals, our approach confidently identifies >4 million evolutionarily constrained RNA structures using a conservative sensitivity threshold that entails historically low false discovery rates for such analyses (5-22%). These predictions comprise 13.6% of the human genome, 88% of which fall outside any known sequence-constrained element, suggesting that a large proportion of the mammalian genome is functional. As an example, our findings identify both known and novel conserved RNA structure motifs in the long noncoding RNA MALAT1. This study provides an extensive set of functional transcriptomic annotations that will assist researchers in uncovering the precise mechanisms underlying the developmental ontologies of higher eukaryotes.
Affiliation(s)
- Martin A. Smith, Tanja Gesell, Peter F. Stadler, John S. Mattick
- RNA Biology and Plasticity Laboratory, Garvan Institute of Medical Research, 384 Victoria Street, Darlinghurst, Sydney, NSW 2010 Australia, Genomics and Computational Biology Division, Institute for Molecular Bioscience, 306 Carmody Rd, University of Queensland, Brisbane, 4067 Australia, Department of Structural and Computational Biology; and Center for Integrative Bioinformatics Vienna (CIBIV), Max F. Perutz Laboratories (MFPL), University of Vienna, Medical University of Vienna, Dr. Bohr-Gasse 9, A-1030 Vienna, Austria, Bioinformatics Group, Department of Computer Science; and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstrasse 16–18, D-04107 Leipzig, Germany, Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany, Center for Non-coding RNA in Technology and Health, Department of Basic Veterinary and Animal Sciences, Faculty of Life Sciences University of Copenhagen, Grønnegårdsvej 3, 1870 Frederiksberg C Denmark, Santa Fe Institute, 1399 Hyde Park Rd, Santa Fe, NM 87501, USA and St Vincent’s Clinical School, University of New South Wales, Level 5, de Lacy, Victoria St, St Vincent's Hospital, Sydney, NSW 2010 Australia
14
Bussotti G, Notredame C, Enright AJ. Detecting and comparing non-coding RNAs in the high-throughput era. Int J Mol Sci 2013; 14:15423-58. [PMID: 23887659 PMCID: PMC3759867 DOI: 10.3390/ijms140815423] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Received: 05/31/2013] [Revised: 07/16/2013] [Accepted: 07/17/2013] [Indexed: 02/07/2023]
Abstract
In recent years there has been a growing interest in the field of non-coding RNA. This surge is a direct consequence of the discovery of a huge number of new non-coding genes and of the finding that many of these transcripts are involved in key cellular functions. In this context, accurately detecting and comparing RNA sequences has become important. Aligning nucleotide sequences is a key requisite when searching for homologous genes. Accurate alignments reveal evolutionary relationships, conserved regions and more generally any biologically relevant pattern. Comparing RNA molecules is, however, a challenging task. The nucleotide alphabet is simpler and therefore less informative than that of amino-acids. Moreover for many non-coding RNAs, evolution is likely to be mostly constrained at the structural level and not at the sequence level. This results in very poor sequence conservation impeding comparison of these molecules. These difficulties define a context where new methods are urgently needed in order to exploit experimental results to their full potential. This review focuses on the comparative genomics of non-coding RNAs in the context of new sequencing technologies and especially dealing with two extremely important and timely research aspects: the development of new methods to align RNAs and the analysis of high-throughput data.
Affiliation(s)
- Giovanni Bussotti
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Cedric Notredame
- Bioinformatics and Genomics Program, Centre for Genomic Regulation (CRG), Aiguader, 88, 08003 Barcelona, Spain
- Anton J. Enright
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
15
Kishore S, Gruber AR, Jedlinski DJ, Syed AP, Jorjani H, Zavolan M. Insights into snoRNA biogenesis and processing from PAR-CLIP of snoRNA core proteins and small RNA sequencing. Genome Biol 2013; 14:R45. [PMID: 23706177 PMCID: PMC4053766 DOI: 10.1186/gb-2013-14-5-r45] [Citation(s) in RCA: 106] [Impact Index Per Article: 8.8] [Received: 03/18/2013] [Revised: 05/15/2013] [Accepted: 05/26/2013] [Indexed: 11/10/2022]
Abstract
BACKGROUND In recent years, a variety of small RNAs derived from other RNAs with well-known functions such as tRNAs and snoRNAs, have been identified. The functional relevance of these RNAs is largely unknown. To gain insight into the complexity of snoRNA processing and the functional relevance of snoRNA-derived small RNAs, we sequence long and short RNAs, small RNAs that co-precipitate with the Argonaute 2 protein and RNA fragments obtained in photoreactive nucleotide-enhanced crosslinking and immunoprecipitation (PAR-CLIP) of core snoRNA-associated proteins. RESULTS Analysis of these data sets reveals that many loci in the human genome reproducibly give rise to C/D box-like snoRNAs, whose expression and evolutionary conservation are typically less pronounced relative to the snoRNAs that are currently cataloged. We further find that virtually all C/D box snoRNAs are specifically processed inside the regions of terminal complementarity, retaining in the mature form only 4-5 nucleotides upstream of the C box and 2-5 nucleotides downstream of the D box. Sequencing of the total and Argonaute 2-associated populations of small RNAs reveals that despite their cellular abundance, C/D box-derived small RNAs are not efficiently incorporated into the Ago2 protein. CONCLUSIONS We conclude that the human genome encodes a large number of snoRNAs that are processed along the canonical pathway and expressed at relatively low levels. Generation of snoRNA-derived processing products with alternative, particularly miRNA-like, functions appears to be uncommon.
Affiliation(s)
- Shivendra Kishore, Andreas R Gruber, Dominik J Jedlinski, Afzal P Syed, Hadi Jorjani, Mihaela Zavolan
- Computational and Systems Biology, Biozentrum, University of Basel, Klingelbergstrasse 50-70, 4056 Basel, Switzerland
16
Lei J, Techa-Angkoon P, Sun Y. Chain-RNA: a comparative ncRNA search tool based on the two-dimensional chain algorithm. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2013; 10:274-285. [PMID: 23929857 DOI: 10.1109/tcbb.2012.137] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/02/2023]
Abstract
Noncoding RNA (ncRNA) identification is highly important to modern biology. The state-of-the-art method for ncRNA identification is based on comparative genomics, in which evolutionary conservations of sequences and secondary structures provide important evidence for ncRNA search. For ncRNAs with low sequence conservation but high structural similarity, conventional local alignment tools such as BLAST yield low sensitivity. Thus, there is a need for ncRNA search methods that can incorporate both sequence and structural similarities. We introduce chain-RNA, a pairwise structural alignment tool that can effectively locate cross-species conserved RNA elements with low sequence similarity. In chain-RNA, stem-loop structures are extracted from dot plots generated by an efficient local-folding algorithm. Then, we formulate stem alignment as an extended 2D chain problem and employ existing chain algorithms. Chain-RNA is tested on a data set containing annotated ncRNA homologs and is applied to novel ncRNA search in a transcriptomic data set. The experimental results show that chain-RNA has better tradeoff between sensitivity and false positive rate in ncRNA prediction than conventional sequence similarity search tools and is more time efficient than structural alignment tools. The source codes of chain-RNA can be downloaded at http://sourceforge.net/projects/chain-rna/ or at http://www.cse.msu.edu/~leijikai/chain-rna/.
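The 2D chaining step mentioned in this abstract can be illustrated with a generic maximum-weight chain over anchors. This is a minimal sketch, not chain-RNA's extended formulation: the `Anchor` tuple, the weights, and the function name `best_chain` are assumptions made for illustration, and the quadratic dynamic program stands in for whatever chain algorithm the tool actually employs.

```python
from typing import List, Tuple

# Hypothetical anchor: (x, y, weight), e.g. a stem at position x in one
# sequence matched to a stem at position y in the other, with a match score.
Anchor = Tuple[int, int, float]


def best_chain(anchors: List[Anchor]) -> Tuple[float, List[Anchor]]:
    """Return the maximum-weight chain of anchors increasing in both x and y."""
    if not anchors:
        return 0.0, []
    pts = sorted(anchors)                 # sort by x, then y
    score = [w for _, _, w in pts]        # best chain score ending at each anchor
    prev = [-1] * len(pts)                # back-pointers for traceback
    for j in range(len(pts)):
        for i in range(j):
            # Anchor i may precede anchor j only if it is strictly
            # smaller in both coordinates (colinear order).
            if pts[i][0] < pts[j][0] and pts[i][1] < pts[j][1]:
                candidate = score[i] + pts[j][2]
                if candidate > score[j]:
                    score[j] = candidate
                    prev[j] = i
    end = max(range(len(pts)), key=score.__getitem__)
    chain = []
    k = end
    while k != -1:                        # walk back-pointers to recover the chain
        chain.append(pts[k])
        k = prev[k]
    return score[end], chain[::-1]
```

Calling `best_chain([(1, 1, 2.0), (2, 5, 1.0), (3, 3, 2.0), (6, 6, 1.5)])` selects the three anchors on the diagonal and skips `(2, 5, 1.0)`, which would break the ordering in y.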
Affiliation(s)
- Jikai Lei
- Michigan State University, East Lansing, MI 48824, USA
17
Achawanantakun R, Sun Y. Shape and secondary structure prediction for ncRNAs including pseudoknots based on linear SVM. BMC Bioinformatics 2013; 14 Suppl 2:S1. [PMID: 23369147 PMCID: PMC3549817 DOI: 10.1186/1471-2105-14-s2-s1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022]
Abstract
Background Accurate secondary structure prediction provides important information to understanding the tertiary structures and thus the functions of ncRNAs. However, the accuracy of the native structure derivation of ncRNAs is still not satisfactory, especially on sequences containing pseudoknots. It was recently shown that using the abstract shapes, which retain adjacency and nesting of structural features but disregard the length details of helix and loop regions, can improve the performance of structure prediction. In this work, we use SVM-based feature selection to derive the consensus abstract shape of homologous ncRNAs and apply the predicted shape to structure prediction including pseudoknots. Results Our approach was applied to predict shapes and secondary structures on hundreds of ncRNA data sets with and without pseudoknots. The experimental results show that we can achieve 18% higher accuracy in shape prediction than the state-of-the-art consensus shape prediction tools. Using predicted shapes in structure prediction allows us to achieve approximately 29% higher sensitivity and 10% higher positive predictive value than other pseudoknot prediction tools. Conclusions Extensive analysis of RNA properties based on SVM allows us to identify important properties of sequences and structures related to their shapes. The combination of mass data analysis and SVM-based feature selection makes our approach a promising method for shape and structure prediction. The implemented tools, KnotShape and KnotStructure, are open source software and can be downloaded at: http://www.cse.msu.edu/~achawana/KnotShape.
Affiliation(s)
- Rujira Achawanantakun
- Department of Computer Science and Engineering, Michigan State University, Michigan, USA
18
Karlin D, Belshaw R. Detecting remote sequence homology in disordered proteins: discovery of conserved motifs in the N-termini of Mononegavirales phosphoproteins. PLoS One 2012; 7:e31719. [PMID: 22403617 PMCID: PMC3293882 DOI: 10.1371/journal.pone.0031719] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.8] [Received: 11/28/2011] [Accepted: 01/18/2012] [Indexed: 11/19/2022]
Abstract
Paramyxovirinae are a large group of viruses that includes measles virus and parainfluenza viruses. The viral Phosphoprotein (P) plays a central role in viral replication. It is composed of a highly variable, disordered N-terminus and a conserved C-terminus. A second viral protein alternatively expressed, the V protein, also contains the N-terminus of P, fused to a zinc finger. We suspected that, despite their high variability, the N-termini of P/V might all be homologous; however, using standard approaches, we could previously identify sequence conservation only in some Paramyxovirinae. We now compared the N-termini using sensitive sequence similarity search programs, able to detect residual similarities unnoticeable by conventional approaches. We discovered that all Paramyxovirinae share a short sequence motif in their first 40 amino acids, which we called soyuz1. Despite its short length (11-16aa), several arguments allow us to conclude that soyuz1 probably evolved by homologous descent, unlike linear motifs. Conservation across such evolutionary distances suggests that soyuz1 plays a crucial role and experimental data suggest that it binds the viral nucleoprotein to prevent its illegitimate self-assembly. In some Paramyxovirinae, the N-terminus of P/V contains a second motif, soyuz2, which might play a role in blocking interferon signaling. Finally, we discovered that the P of related Mononegavirales contain similarly overlooked motifs in their N-termini, and that their C-termini share a previously unnoticed structural similarity suggesting a common origin. Our results suggest several testable hypotheses regarding the replication of Mononegavirales and suggest that disordered regions with little overall sequence similarity, common in viral and eukaryotic proteins, might contain currently overlooked motifs (intermediate in length between linear motifs and disordered domains) that could be detected simply by comparing orthologous proteins.
Affiliation(s)
- David Karlin
- Department of Zoology, University of Oxford, Oxford, United Kingdom.
19
Meyer M, Westhof E, Masquida B. A structural module in RNase P expands the variety of RNA kinks. RNA Biol 2012; 9:254-60. [PMID: 22336704 DOI: 10.4161/rna.19434] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/19/2022]
Abstract
RNA structures are built from recurrent modules that can be identified by structural and comparative sequence analysis. In order to assemble sets of helices in compact architectures, modules that introduce bends and kinks are necessary. Among such modules, kink-turns form an important family that presents sequence and structural characteristics. Here, we describe an internal loop in the bacterial type A RNase P RNA that sets helices bound at the junctions exactly in the same relative positions as in kink-turns but without the structural signatures typical of kink-turns. Our work suggests that identifying a structural module in a subset of RNA sequences constitutes a strategy to identify distinct sequential motifs sharing common structural characteristics.
Affiliation(s)
- Mélanie Meyer
- Architecture et Réactivité de l'ARN, Université de Strasbourg, IBMC, CNRS, Strasbourg, France
20
Abstract
RNA is now appreciated to serve numerous cellular roles, and understanding RNA structure is important for understanding a mechanism of action. This contribution discusses the methods available for predicting RNA structure. Secondary structure is the set of the canonical base pairs, and secondary structure can be accurately determined by comparative sequence analysis. Secondary structure can also be predicted. The most commonly used method is free energy minimization. The accuracy of structure prediction is improved either by using experimental mapping data or by predicting a structure conserved in a set of homologous sequences. Additionally, tertiary structure, the three-dimensional arrangement of atoms, can be modeled with guidance from comparative analysis and experimental techniques. New approaches are also available for predicting tertiary structure.
Affiliation(s)
- Matthew G Seetin
- Department of Biochemistry & Biophysics, University of Rochester Medical Center, Rochester, NY, USA
21
Chursov A, Walter MC, Schmidt T, Mironov A, Shneider A, Frishman D. Sequence-structure relationships in yeast mRNAs. Nucleic Acids Res 2011; 40:956-62. [PMID: 21954438 PMCID: PMC3273797 DOI: 10.1093/nar/gkr790] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 01/27/2023]
Abstract
It is generally accepted that functionally important RNA structure is more conserved than sequence due to compensatory mutations that may alter the sequence without disrupting the structure. For small RNA molecules sequence–structure relationships are relatively well understood. However, structural bioinformatics of mRNAs is still in its infancy due to a virtual absence of experimental data. This report presents the first quantitative assessment of sequence–structure divergence in the coding regions of mRNA molecules based on recently published transcriptome-wide experimental determination of their base pairing patterns. Structural resemblance in paralogous mRNA pairs quickly drops as sequence identity decreases from 100% to 85–90%. Structures of mRNAs sharing sequence identity below roughly 85% are essentially uncorrelated. This outcome is in dramatic contrast to small functional non-coding RNAs where sequence and structure divergence are correlated at very low levels of sequence similarity. The fact that very similar mRNA sequences can have vastly different secondary structures may imply that the particular global shape of base paired elements in coding regions does not play a major role in modulating gene expression and translation efficiency. Apparently, the need to maintain stable three-dimensional structures of encoded proteins places a much higher evolutionary pressure on mRNA sequences than on their RNA structures.
Affiliation(s)
- Andrey Chursov
- Department of Genome Oriented Bioinformatics, Technische Universität München, Wissenschaftzentrum Weihenstephan, Maximus-von-Imhof-Forum 3, D-85354, Freising, Germany
22
Harmanci AO, Sharma G, Mathews DH. TurboFold: iterative probabilistic estimation of secondary structures for multiple RNA sequences. BMC Bioinformatics 2011; 12:108. [PMID: 21507242 PMCID: PMC3120699 DOI: 10.1186/1471-2105-12-108] [Citation(s) in RCA: 69] [Impact Index Per Article: 4.9] [Received: 09/03/2010] [Accepted: 04/20/2011] [Indexed: 01/07/2023]
Abstract
Background The prediction of secondary structure, i.e. the set of canonical base pairs between nucleotides, is a first step in developing an understanding of the function of an RNA sequence. The most accurate computational methods predict conserved structures for a set of homologous RNA sequences. These methods usually suffer from high computational complexity. In this paper, TurboFold, a novel and efficient method for secondary structure prediction for multiple RNA sequences, is presented. Results TurboFold takes, as input, a set of homologous RNA sequences and outputs estimates of the base pairing probabilities for each sequence. The base pairing probabilities for a sequence are estimated by combining intrinsic information, derived from the sequence itself via the nearest neighbor thermodynamic model, with extrinsic information, derived from the other sequences in the input set. For a given sequence, the extrinsic information is computed by using pairwise-sequence-alignment-based probabilities for co-incidence with each of the other sequences, along with estimated base pairing probabilities, from the previous iteration, for the other sequences. The extrinsic information is introduced as free energy modifications for base pairing in a partition function computation based on the nearest neighbor thermodynamic model. This process yields updated estimates of base pairing probability. The updated base pairing probabilities in turn are used to recompute extrinsic information, resulting in the overall iterative estimation procedure that defines TurboFold. TurboFold is benchmarked on a number of ncRNA datasets and compared against alternative secondary structure prediction methods. The iterative procedure in TurboFold is shown to improve estimates of base pairing probability with each iteration, though only small gains are obtained beyond three iterations. 
Secondary structures composed of base pairs with estimated probabilities higher than a significance threshold are shown to be more accurate for TurboFold than for alternative methods that estimate base pairing probabilities. TurboFold-MEA, which uses base pairing probabilities from TurboFold in a maximum expected accuracy algorithm for secondary structure prediction, has accuracy comparable to the best performing secondary structure prediction methods. The computational and memory requirements for TurboFold are modest and, in terms of sequence length and number of sequences, scale much more favorably than joint alignment and folding algorithms. Conclusions TurboFold is an iterative probabilistic method for predicting secondary structures for multiple RNA sequences that efficiently and accurately combines the information from the comparative analysis between sequences with the thermodynamic folding model. Unlike most other multi-sequence structure prediction methods, TurboFold does not enforce strict commonality of structures and is therefore useful for predicting structures for homologous sequences that have diverged significantly. TurboFold can be downloaded as part of the RNAstructure package at http://rna.urmc.rochester.edu.
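The thresholding described in this abstract, building a structure from base pairs whose estimated probability exceeds a significance threshold, might look like the following minimal sketch. The dictionary layout and the name `pairs_above_threshold` are assumptions for illustration; this is not the TurboFold or RNAstructure interface, and the probabilities would come from a separate partition function calculation.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]  # (i, j) with i < j, 1-based nucleotide indices


def pairs_above_threshold(pair_probs: Dict[Pair, float],
                          threshold: float = 0.5) -> List[Pair]:
    """Keep only base pairs whose estimated probability exceeds the threshold."""
    return sorted(pair for pair, p in pair_probs.items() if p > threshold)


def is_valid_structure(pairs: List[Pair]) -> bool:
    """Check that no nucleotide takes part in more than one selected pair."""
    used = [index for pair in pairs for index in pair]
    return len(used) == len(set(used))
```

With a threshold of 0.5 or higher, and provided the probabilities come from a proper distribution over structures, no nucleotide can appear in two selected pairs, because the pairing probabilities of a single nucleotide sum to at most 1. Lower thresholds can select conflicting pairs, which is what `is_valid_structure` is meant to detect.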
Affiliation(s)
- Arif O Harmanci
- Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY, USA
23
Hamada M, Kiryu H, Iwasaki W, Asai K. Generalized centroid estimators in bioinformatics. PLoS One 2011; 6:e16450. [PMID: 21365017 PMCID: PMC3041832 DOI: 10.1371/journal.pone.0016450] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 09/01/2010] [Accepted: 12/22/2010] [Indexed: 11/27/2022]
Abstract
In a number of estimation problems in bioinformatics, accuracy measures of the target problem are usually given, and it is important to design estimators that are suitable to those accuracy measures. However, there is often a discrepancy between an employed estimator and a given accuracy measure of the problem. In this study, we introduce a general class of efficient estimators for estimation problems on high-dimensional binary spaces, which represent many fundamental problems in bioinformatics. Theoretical analysis reveals that the proposed estimators generally fit commonly used accuracy measures (e.g. sensitivity, PPV, MCC and F-score), can be computed efficiently in many cases, and cover a wide range of problems in bioinformatics from the viewpoint of the principle of maximum expected accuracy (MEA). It is also shown that some important algorithms in bioinformatics can be interpreted in a unified manner. Not only does the concept presented in this paper give a useful framework for designing MEA-based estimators, but it is also highly extendable and sheds new light on many problems in bioinformatics.
Affiliation(s)
- Michiaki Hamada
- Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan.
24
Xu Z, Mathews DH. Multilign: an algorithm to predict secondary structures conserved in multiple RNA sequences. Bioinformatics 2010; 27:626-32. [PMID: 21193521 DOI: 10.1093/bioinformatics/btq726] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.2] [Indexed: 01/28/2023]
Abstract
MOTIVATION With recent advances in sequencing, structural and functional studies of RNA lag behind the discovery of sequences. Computational analysis of RNA is increasingly important to reveal structure-function relationships with low cost and speed. The purpose of this study is to use multiple homologous sequences to infer a conserved RNA structure. RESULTS A new algorithm, called Multilign, is presented to find the lowest free energy RNA secondary structure common to multiple sequences. Multilign is based on Dynalign, which is a program that simultaneously aligns and folds two sequences to find the lowest free energy conserved structure. For Multilign, Dynalign is used to progressively construct a conserved structure from multiple pairwise calculations, with one sequence used in all pairwise calculations. A base pair is predicted only if it is contained in the set of low free energy structures predicted by all Dynalign calculations. In this way, Multilign improves prediction accuracy by keeping the genuine base pairs and excluding competing false base pairs. Multilign has computational complexity that scales linearly in the number of sequences. Multilign was tested on extensive datasets of sequences with known structure and its prediction accuracy is among the best of available algorithms. Multilign can run on long sequences (> 1500 nt) and an arbitrarily large number of sequences. AVAILABILITY The algorithm is implemented in ANSI C++ and can be downloaded as part of the RNAstructure package at: http://rna.urmc.rochester.edu.
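The filtering rule stated in this abstract, that a base pair is kept only if it appears among the low free energy structures of every pairwise calculation, amounts to a set intersection. The sketch below assumes each pairwise Dynalign run has already been reduced to a set of base pairs for the shared sequence; the function name and input format are hypothetical and do not reflect the RNAstructure implementation.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[int, int]  # (i, j) base pair in the sequence shared by all runs


def conserved_pairs(pair_sets: Iterable[Set[Pair]]) -> Set[Pair]:
    """Keep the base pairs present in every pairwise calculation's pair set."""
    sets = list(pair_sets)
    if not sets:
        return set()
    kept = set(sets[0])        # copy so the caller's first set is not modified
    for pairs in sets[1:]:
        kept &= pairs          # drop pairs missing from this calculation
    return kept
```

Because the shared sequence takes part in one pairwise calculation per additional homolog, the number of sets grows linearly with the number of sequences, which is consistent with the linear scaling described in the abstract.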
Affiliation(s)
- Zhenjiang Xu
- Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY, USA
25
Hamada M, Sato K, Asai K. Improving the accuracy of predicting secondary structure for aligned RNA sequences. Nucleic Acids Res 2010; 39:393-402. [PMID: 20843778 PMCID: PMC3025558 DOI: 10.1093/nar/gkq792] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.2] [Indexed: 11/14/2022]
Abstract
Considerable attention has been focused on predicting the secondary structure for aligned RNA sequences since it is useful not only for improving the limiting accuracy of conventional secondary structure prediction but also for finding non-coding RNAs in genomic sequences. Although many algorithms exist for predicting secondary structure for aligned RNA sequences, further improvement of the accuracy is still awaited. In this article, toward improving the accuracy, a theoretical classification of state-of-the-art algorithms for predicting secondary structure for aligned RNA sequences is presented. The classification is based on the viewpoint of maximum expected accuracy (MEA), which has been successfully applied in various problems in bioinformatics. The classification reveals several disadvantages of the current algorithms, and we propose an improvement of a previously introduced algorithm (CentroidAlifold). Finally, computational experiments strongly support the theoretical classification and indicate that the improved CentroidAlifold substantially outperforms other algorithms.
Affiliation(s)
- Michiaki Hamada
- Mizuho Information & Research Institute, Inc, Chiyoda-ku, Tokyo, Japan.
26