1
Gadekar V, Munk AW, Miladi M, Junge A, Backofen R, Seemann S, Gorodkin J. Clusters of mammalian conserved RNA structures in UTRs associate with RBP binding sites. NAR Genom Bioinform 2024; 6:lqae089. [PMID: 39131818 PMCID: PMC11310781 DOI: 10.1093/nargab/lqae089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2023] [Revised: 06/26/2024] [Accepted: 07/16/2024] [Indexed: 08/13/2024] [Open access]
Abstract
RNA secondary structures play essential roles in the formation of the tertiary structure and function of a transcript. Recent genome-wide studies highlight significant potential for RNA structures in the mammalian genome. However, a major challenge is assigning functional roles to these structured RNAs. In this study, we conduct a guilt-by-association analysis of clusters of computationally predicted conserved RNA structures (CRSs) in human untranslated regions (UTRs) to associate them with gene functions. We filtered a broad pool of ∼500 000 human CRSs for UTR overlap, resulting in 4734 and 24 754 CRSs from the 5' and 3' UTRs of protein-coding genes, respectively. We clustered the CRSs of each set separately using RNAscClust, obtaining 793 and 2403 clusters, with an average of five CRSs per cluster. We identified overrepresented binding sites for 60 and 43 RNA-binding proteins co-localizing with the clustered CRSs. Furthermore, 104 and 441 clusters from the 5' and 3' UTRs, respectively, showed enrichment for various Gene Ontology terms, including biological processes such as 'signal transduction' and 'nervous system development', molecular functions such as 'transferase activity', and cellular components such as 'synapse'. Our study shows that significant functional insights can be gained by clustering RNA structures based on their structural characteristics.
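The abstract reports overrepresentation of RBP binding sites and Gene Ontology terms among clustered CRSs but does not name the statistical test. As a minimal sketch, assuming a one-sided hypergeometric test (a common choice for overrepresentation analysis, not confirmed here as the authors' method), the p-value for one cluster and one annotation could be computed as below. The function name and the example counts other than the 4734 CRSs are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(population, annotated, sample, hits):
    """One-sided hypergeometric p-value P(X >= hits).

    population: total number of items (e.g. all 5' UTR CRSs)
    annotated:  items in the population carrying the annotation
    sample:     items in the cluster under test
    hits:       items in the cluster carrying the annotation
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability mass of observing `hits` or more annotated items.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: a cluster of 5 CRSs, 3 of which carry an
    # annotation shared by 200 of the 4734 CRSs in the 5' UTR set.
    p = hypergeom_enrichment_pvalue(population=4734, annotated=200, sample=5, hits=3)
    print(f"p-value: {p:.3g}")
```

In a real analysis the p-values from all cluster and term pairs would also need a multiple-testing correction, which this sketch omits.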
Affiliation(s)
- Veerendra P Gadekar
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Ridebanevej 9, 1870 Frederiksberg, Denmark
- Department of Veterinary and Animal Sciences, Faculty of Health and Medical Sciences, University of Copenhagen, Frederiksberg, 1870 Frederiksberg, Denmark
- Centre for Integrative Biology and Systems Medicine (IBSE), IIT Madras, Chennai, India
- Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras, Chennai, India
- Alexander Welford Munk
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Ridebanevej 9, 1870 Frederiksberg, Denmark
- Department of Veterinary and Animal Sciences, Faculty of Health and Medical Sciences, University of Copenhagen, Frederiksberg, 1870 Frederiksberg, Denmark
- Milad Miladi
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg im Breisgau, Germany
- Alexander Junge
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Ridebanevej 9, 1870 Frederiksberg, Denmark
- Department of Veterinary and Animal Sciences, Faculty of Health and Medical Sciences, University of Copenhagen, Frederiksberg, 1870 Frederiksberg, Denmark
- Rolf Backofen
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg im Breisgau, Germany
- Stefan E Seemann
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Ridebanevej 9, 1870 Frederiksberg, Denmark
- Department of Veterinary and Animal Sciences, Faculty of Health and Medical Sciences, University of Copenhagen, Frederiksberg, 1870 Frederiksberg, Denmark
- Jan Gorodkin
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Ridebanevej 9, 1870 Frederiksberg, Denmark
- Department of Veterinary and Animal Sciences, Faculty of Health and Medical Sciences, University of Copenhagen, Frederiksberg, 1870 Frederiksberg, Denmark
2
Backofen R, Gorodkin J, Hofacker IL, Stadler PF. Comparative RNA Genomics. Methods Mol Biol 2024; 2802:347-393. [PMID: 38819565 DOI: 10.1007/978-1-0716-3838-5_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2024]
Abstract
Over the last quarter of a century it has become clear that RNA is much more than just a boring intermediate in protein expression. Ancient RNAs still appear in the core information metabolism and comprise a surprisingly large component in bacterial gene regulation. A common theme with these types of mostly small RNAs is their reliance on conserved secondary structures. Large-scale sequencing projects, on the other hand, have profoundly changed our understanding of eukaryotic genomes. Pervasively transcribed, they give rise to a plethora of large and evolutionarily extremely flexible non-coding RNAs that exert a vastly diverse array of molecular functions. In this chapter we provide a (necessarily incomplete) overview of the current state of comparative analysis of non-coding RNAs, emphasizing computational approaches as a means to gain a global picture of the modern RNA world.
Affiliation(s)
- Rolf Backofen
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany
- Center for Non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg, Denmark
- Jan Gorodkin
- Center for Non-coding RNA in Technology and Health, Department of Veterinary and Animal Sciences, University of Copenhagen, Frederiksberg, Denmark
- Ivo L Hofacker
- Institute for Theoretical Chemistry, University of Vienna, Wien, Austria
- Bioinformatics and Computational Biology research group, University of Vienna, Vienna, Austria
- Center for Non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg, Denmark
- Peter F Stadler
- Bioinformatics Group, Department of Computer Science, University of Leipzig, Leipzig, Germany.
- Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany.
- Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany.
- Universidad Nacional de Colombia, Bogotá, Colombia.
- Institute for Theoretical Chemistry, University of Vienna, Wien, Austria.
- Center for Non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg, Denmark.
- Santa Fe Institute, Santa Fe, NM, USA.
3
Genome-Wide RNA Secondary Structure Prediction. Methods Mol Biol 2023; 2586:35-48. [PMID: 36705897 DOI: 10.1007/978-1-0716-2768-6_3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2023]
Abstract
Information on RNA secondary structure has been widely applied to the inference of RNA function. However, classical prediction methods are not feasible for long RNAs such as mRNAs because of computational time and numerical errors. Sliding-window methods have been applied to overcome these problems, but their results are not directly comparable to global RNA structure prediction. In this chapter, we introduce ParasoR, a method designed for parallel computation of genome-wide RNA secondary structures. To enable genome-wide prediction, ParasoR distributes the dynamic programming (DP) matrices required for structure prediction across multiple computational nodes. By storing in its database the ratios of DP variables rather than the original variables, ParasoR can locally compute structure scores such as stem probability or accessibility on demand. A comprehensive analysis of local secondary structures by ParasoR is expected to be a promising way to detect the statistical constraints on long RNAs.
4
Sim M, Lee J, Kwon D, Lee D, Park N, Wy S, Ko Y, Kim J. Reference-based read clustering improves the de novo genome assembly of microbial strains. Comput Struct Biotechnol J 2022; 21:444-451. [PMID: 36618978 PMCID: PMC9804104 DOI: 10.1016/j.csbj.2022.12.032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2022] [Revised: 12/17/2022] [Accepted: 12/19/2022] [Indexed: 12/24/2022] [Open access]
Abstract
Constructing accurate microbial genome assemblies is necessary to understand genetic diversity in microbial genomes and its functional consequences. However, this remains a challenging task, especially when only short-read sequencing technologies are used. Here, we present a new read-clustering algorithm, called RBRC, for improving de novo microbial genome assembly by accurately estimating read proximity using multiple reference genomes. The performance of RBRC was confirmed by simulation-based evaluation in terms of assembly contiguity and the number of misassemblies, and RBRC was successfully applied to existing fungal and bacterial genomes, improving the quality of the assemblies without using additional sequencing data. RBRC is a read-clustering algorithm that can be used (i) for generating high-quality genome assemblies of microbial strains when genome assemblies of related strains are available, and (ii) for upgrading existing microbial genome assemblies when the generation of additional sequencing data, such as long reads, is difficult.
Affiliation(s)
- Mikang Sim
- Department of Biomedical Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea
- Jongin Lee
- Department of Biomedical Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea
- Daehong Kwon
- Department of Biomedical Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea
- Daehwan Lee
- Department of Biomedical Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea
- Nayoung Park
- Department of Biomedical Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea
- Suyeon Wy
- Department of Biomedical Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea
- Younhee Ko
- Division of Biomedical Engineering, Hankuk University of Foreign Studies, Gyeonggi-do 17035, Republic of Korea
- Jaebum Kim (corresponding author)
- Department of Biomedical Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea
5
Ono Y, Katayama K, Onuma T, Kubo K, Tsuyuzaki H, Hamada M, Sato M. Structure-based screening for functional non-coding RNAs in fission yeast identifies a factor repressing untimely initiation of sexual differentiation. Nucleic Acids Res 2022; 50:11229-11242. [PMID: 36259651 PMCID: PMC9638895 DOI: 10.1093/nar/gkac825] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/20/2021] [Revised: 09/06/2022] [Accepted: 09/14/2022] [Indexed: 12/04/2022] [Open access]
Abstract
Non-coding RNAs (ncRNAs) ubiquitously exist in normal and cancer cells. Despite their prevalent distribution, the functions of most long ncRNAs remain uncharacterized. The fission yeast Schizosaccharomyces pombe expresses >1800 ncRNAs annotated to date, but most unconventional ncRNAs (excluding tRNA, rRNA, snRNA and snoRNA) remain uncharacterized. To discover functional ncRNAs, we performed a combined screening of computational and biological tests. First, all S. pombe ncRNAs were screened in silico for those showing conservation in sequence as well as in secondary structure with ncRNAs in closely related species. Almost half of the 151 selected conserved ncRNA genes were uncharacterized. Twelve ncRNA genes that did not overlap with protein-coding sequences were next chosen for biological screening that examines defects in growth or sexual differentiation, as well as sensitivities to drugs and stresses. Finally, we highlighted an ncRNA transcribed from SPNCRNA.1669, which inhibited untimely initiation of sexual differentiation. A domain that was predicted as a conserved secondary structure by the computational analysis was essential for the ncRNA to function. Thus, this study demonstrates that in silico selection focusing on conservation of the secondary structure across species is a powerful method to pinpoint novel functional ncRNAs.
Affiliation(s)
- Yu Ono
- Laboratory of Cytoskeletal Logistics, Department of Life Science and Medical Bioscience, School of Advanced Science and Engineering, Waseda University, 2-2 Wakamatsucho, Shinjuku-ku, Tokyo 162-8480, Japan
- Kenta Katayama
- Laboratory of Cytoskeletal Logistics, Department of Life Science and Medical Bioscience, School of Advanced Science and Engineering, Waseda University, 2-2 Wakamatsucho, Shinjuku-ku, Tokyo 162-8480, Japan
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
- Tomoki Onuma
- Laboratory of Cytoskeletal Logistics, Department of Life Science and Medical Bioscience, School of Advanced Science and Engineering, Waseda University, 2-2 Wakamatsucho, Shinjuku-ku, Tokyo 162-8480, Japan
- Kento Kubo
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
- Bioinformatics Laboratory, Department of Electrical Engineering and Bioscience, School of Advanced Science and Engineering, Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
- Hayato Tsuyuzaki
- Laboratory of Cytoskeletal Logistics, Department of Life Science and Medical Bioscience, School of Advanced Science and Engineering, Waseda University, 2-2 Wakamatsucho, Shinjuku-ku, Tokyo 162-8480, Japan
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
- Michiaki Hamada
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
- Bioinformatics Laboratory, Department of Electrical Engineering and Bioscience, School of Advanced Science and Engineering, Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
- Institute for Medical-oriented Structural Biology, Waseda University, 2-2 Wakamatsucho, Shinjuku-ku, Tokyo 162-8480, Japan
- Masamitsu Sato
- Laboratory of Cytoskeletal Logistics, Department of Life Science and Medical Bioscience, School of Advanced Science and Engineering, Waseda University, 2-2 Wakamatsucho, Shinjuku-ku, Tokyo 162-8480, Japan
- Institute for Medical-oriented Structural Biology, Waseda University, 2-2 Wakamatsucho, Shinjuku-ku, Tokyo 162-8480, Japan
- Institute for Advanced Research of Biosystem Dynamics, Waseda Research Institute for Science and Engineering, Graduate School of Advanced Science and Engineering, Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
6
Asim MN, Ibrahim MA, Imran Malik M, Dengel A, Ahmed S. Advances in Computational Methodologies for Classification and Sub-Cellular Locality Prediction of Non-Coding RNAs. Int J Mol Sci 2021; 22:8719. [PMID: 34445436 PMCID: PMC8395733 DOI: 10.3390/ijms22168719] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 07/17/2021] [Revised: 08/02/2021] [Accepted: 08/03/2021] [Indexed: 02/06/2023] [Open access]
Abstract
Apart from protein-coding ribonucleic acids (RNAs), there exists a variety of non-coding RNAs (ncRNAs) which regulate complex cellular and molecular processes. High-throughput sequencing technologies and bioinformatics approaches have largely promoted the exploration of ncRNAs, which revealed their crucial roles in gene regulation, miRNA binding, protein interactions, and splicing. Furthermore, ncRNAs are involved in the development of complicated diseases like cancer. Categorization of ncRNAs is essential to understand the mechanisms of diseases and to develop effective treatments. Sub-cellular localization information of ncRNAs demystifies diverse functionalities of ncRNAs. To date, several computational methodologies have been proposed to precisely identify the class as well as sub-cellular localization patterns of RNAs. This paper discusses different types of ncRNAs and reviews computational approaches proposed in the last 10 years to distinguish coding RNA from ncRNA, to identify sub-types of ncRNAs such as piwi-associated RNA, micro RNA, long ncRNA, and circular RNA, and to determine sub-cellular localization of distinct ncRNAs and RNAs. Furthermore, it summarizes diverse ncRNA classification and sub-cellular localization determination datasets along with benchmark performance to aid the development and evaluation of novel computational methodologies. It identifies research gaps, heterogeneity, and challenges in the development of computational approaches for RNA sequence analysis. We consider that our expert analysis will assist Artificial Intelligence researchers with knowing state-of-the-art performance, model selection for various tasks on one platform, dominantly used sequence descriptors, neural architectures, and interpreting inter-species and intra-species performance deviation.
Affiliation(s)
- Muhammad Nabeel Asim
- German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
- Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
- Muhammad Ali Ibrahim
- German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
- Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
- Muhammad Imran Malik
- National Center for Artificial Intelligence (NCAI), National University of Sciences and Technology, Islamabad 44000, Pakistan
- School of Electrical Engineering & Computer Science, National University of Sciences and Technology, Islamabad 44000, Pakistan
- Andreas Dengel
- German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
- Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
- Sheraz Ahmed
- German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
- DeepReader GmbH, Trippstadter Str. 122, 67663 Kaiserslautern, Germany
7
Miladi M, Raden M, Will S, Backofen R. Fast and accurate structure probability estimation for simultaneous alignment and folding of RNAs with Markov chains. Algorithms Mol Biol 2020; 15:19. [PMID: 33292340 PMCID: PMC7666477 DOI: 10.1186/s13015-020-00179-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/18/2019] [Accepted: 10/16/2020] [Indexed: 12/14/2022] [Open access]
Abstract
MOTIVATION: Simultaneous alignment and folding (SA&F) of RNAs is the indispensable gold standard for inferring the structure of non-coding RNAs and their general analysis. The original algorithm, proposed by Sankoff, solves the theoretical problem exactly with a complexity of [Formula: see text] in the full energy model. Over the last two decades, several variants and improvements of the Sankoff algorithm have been proposed to reduce its extreme complexity by using simplified energy models or imposing restrictions on the predicted alignments. RESULTS: Here, we introduce a novel variant of Sankoff's algorithm that reconciles the simplifications of PMcomp, namely moving from the full energy model to a simpler base pair-based model, with the accuracy of the loop-based full energy model. Instead of estimating pseudo-energies from unconditional base pair probabilities, our model calculates energies from conditional base pair probabilities, which allows it to accurately capture structure probabilities that obey a conditional dependency. This model gives rise to the fast and highly accurate novel algorithm Pankov (Probabilistic Sankoff-like simultaneous alignment and folding of RNAs inspired by Markov chains). CONCLUSIONS: Pankov benefits from the speed-up of excluding unreliable base-pairing without compromising the loop-based free energy model of Sankoff's algorithm. We show that Pankov outperforms its predecessors LocARNA and SPARSE in folding quality and is faster than LocARNA.
Affiliation(s)
- Milad Miladi
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Köhler-Allee 106, Freiburg, Germany
- Martin Raden
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Köhler-Allee 106, Freiburg, Germany
- Sebastian Will
- Theoretical Biochemistry Group (TBI), Institute for Theoretical Chemistry, University of Vienna, Währingerstrasse 17, Vienna, Austria
- Bioinformatics group (AMIBIO), Laboratoire d’Informatique de l’École Polytechnique (LIX), Institut Polytechnique de Paris (IPP), Batiment Turing, 1 rue d’Estienne d’Orve, Palaiseau, France
- Rolf Backofen
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Köhler-Allee 106, Freiburg, Germany
- Signalling Research Centres BIOSS and CIBSS, University of Freiburg, Schänzlestr. 18, Freiburg, Germany
8
Chatterjee A, Mukherjee S, Basu B, Roy D, Basu R, Ghosh H, Mishra L, Bhattacharya M, Chakraborty N. Insight into the distinctive paradigm of Human Cytomegalovirus associated intrahepatic and extrahepatic cholestasis in neonates. Sci Rep 2020; 10:15861. [PMID: 32985571 PMCID: PMC7522230 DOI: 10.1038/s41598-020-73009-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 03/21/2020] [Accepted: 08/18/2020] [Indexed: 01/15/2023] [Open access]
Abstract
Human Cytomegalovirus (HCMV) has been implicated as a probable cause for the development of hepatic cholestasis among neonates. Our study tried to ascertain the exact demographic, biochemical and immunological markers to differentially diagnose patients with HCMV-associated intrahepatic and extrahepatic cholestasis and also to decipher the phylogenetic variability among the viral strains infecting the two groups. A total of 110 neonates, recruited over a span of 2 years, were selected for the study and classified into four groups based on the presence of hepatic cholestasis and active HCMV infection. Our analysis predicted that total cholesterol, GGT, ALP and TNFα were the only significant biological markers with exact cut-off scores capable of distinguishing between HCMV-associated intrahepatic and extrahepatic cholestasis. We confirmed that in patients belonging to both of these groups, the inflammasome is activated and the extent of this activation is more or less the same except for the initial activators, NLRP3 and AIM2, respectively. When we performed two separate phylogenetic analyses with HCMV gM and gN gene sequences, we found that in both cases the sequences from the IHC and EHC groups formed almost separate phylogenetic clusters. Our study has shown that the HCMV clinical strains infecting at intrahepatic and extrahepatic sites are phylogenetically segregated as distinct clusters. These two separate groups show different physiological as well as immunological modulations while infecting a similar host.
Affiliation(s)
- Aroni Chatterjee
- Virus Research Laboratory, ICMR-National Institute of Cholera and Enteric Diseases, GB4, ID & BG Hospital Campus, Dr. S.C Banerjee Road, Beliaghata, Kolkata, West Bengal, 700010, India
- Sumit Mukherjee
- Azrieli Faculty of Medicine, Bar-Ilan University, Safed, Israel
- Biswanath Basu
- Department of Paediatrics, N. R. S. Medical College & Hospital, Kolkata, 700014, India
- Debsopan Roy
- Virus Research Laboratory, ICMR-National Institute of Cholera and Enteric Diseases, GB4, ID & BG Hospital Campus, Dr. S.C Banerjee Road, Beliaghata, Kolkata, West Bengal, 700010, India
- Rivu Basu
- Department of Community Medicine, R. G. Kar Medical College and Hospital, Kolkata, 700004, India
- Hiya Ghosh
- Department of Endocrinology and Metabolism, IPGMER & SSKM Hospital, Kolkata, 700020, India
- Lopamudra Mishra
- Department of Paediatrics, Dr. B. C. Roy Post Graduate Institute of Paediatric Sciences, Kolkata, 700054, India
- Mala Bhattacharya
- Department of Paediatrics, Dr. B. C. Roy Post Graduate Institute of Paediatric Sciences, Kolkata, 700054, India
- Nilanjan Chakraborty
- Virus Research Laboratory, ICMR-National Institute of Cholera and Enteric Diseases, GB4, ID & BG Hospital Campus, Dr. S.C Banerjee Road, Beliaghata, Kolkata, West Bengal, 700010, India.
9
Müller T, Miladi M, Hutter F, Hofacker I, Will S, Backofen R. The locality dilemma of Sankoff-like RNA alignments. Bioinformatics 2020; 36:i242-i250. [PMID: 32657398 PMCID: PMC7355259 DOI: 10.1093/bioinformatics/btaa431] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022] [Open access]
Abstract
Motivation: Elucidating the functions of non-coding RNAs by homology has been strongly limited due to fundamental computational and modeling issues. While existing simultaneous alignment and folding (SA&F) algorithms successfully align homologous RNAs with precisely known boundaries (global SA&F), the more pressing problem of identifying new classes of homologous RNAs in the genome (local SA&F) is intrinsically more difficult and much less understood. Typically, the length of local alignments is strongly overestimated and alignment boundaries are dramatically mispredicted. We hypothesize that local SA&F approaches are compromised this way due to a score bias, which is caused by the contribution of RNA structure similarity to their overall alignment score. Results: In the light of this hypothesis, we study pairwise local SA&F systematically for the first time, based on a novel local RNA alignment benchmark set and quality measure. First, we vary the relative influence of structure similarity compared to sequence similarity. Putting more emphasis on the structure component leads to overestimating the length of local alignments. This clearly shows the bias of current scores and strongly hints at the structure component as its origin. Second, we study the interplay of several important scoring parameters by learning parameters for local and global SA&F. The divergence of these optimized parameter sets underlines the fundamental obstacles for local SA&F. Third, by introducing a position-wise correction term in local SA&F, we constructively solve its principal issues. Availability and implementation: The benchmark data, detailed results and scripts are available at https://github.com/BackofenLab/local_alignment. The RNA alignment tool LocARNA, including the modifications proposed in this work, is available at https://github.com/s-will/LocARNA/releases/tag/v2.0.0RC6. Supplementary information: Supplementary data are available at Bioinformatics online.
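How a position-wise correction can shorten an over-extended local alignment may be illustrated with a deliberately simplified sketch. It assumes a sequence-only Smith-Waterman recursion (no structure term) with linear gap costs and a constant penalty charged for every alignment column; the function name, the scoring values, and the `correction` parameter are hypothetical and do not reproduce LocARNA's scoring.

```python
def local_alignment(a, b, match=2.0, mismatch=-1.0, gap=-2.0, correction=0.0):
    """Return (best score, number of columns) of a local alignment of a and b.

    Simplified stand-in: plain Smith-Waterman where every alignment column
    is additionally charged `correction` (hypothetical parameter).
    """
    prev = [(0.0, 0)] * (len(b) + 1)
    best = (0.0, 0)
    for i in range(1, len(a) + 1):
        cur = [(0.0, 0)] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [
                (0.0, 0),  # start a new local alignment here
                (prev[j - 1][0] + sub - correction, prev[j - 1][1] + 1),  # (mis)match
                (prev[j][0] + gap - correction, prev[j][1] + 1),  # gap in b
                (cur[j - 1][0] + gap - correction, cur[j - 1][1] + 1),  # gap in a
            ]
            cur[j] = max(candidates)  # compare by score, then by column count
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    # Two conserved blocks separated by a short divergent stretch.
    a = "ACGTACGT" + "TT" + "ACGT"
    b = "ACGTACGT" + "GG" + "ACGT"
    print("no correction:  ", local_alignment(a, b))
    print("with correction:", local_alignment(a, b, correction=1.2))
```

With a positive per-column charge, extending an alignment across a weakly similar region pays off less, so the reported alignment tends to be shorter.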
Affiliation(s)
- Teresa Müller
- Bioinformatics Group, University of Freiburg, Freiburg 79110, Germany
- Milad Miladi
- Bioinformatics Group, University of Freiburg, Freiburg 79110, Germany
- Frank Hutter
- Machine Learning Lab, Department of Computer Science, University of Freiburg, Freiburg 79110, Germany
- Ivo Hofacker
- Theoretical Biochemistry Group (TBI), Institute for Theoretical Chemistry, University of Vienna, Vienna, Wien 1090, Austria
- Sebastian Will
- Theoretical Biochemistry Group (TBI), Institute for Theoretical Chemistry, University of Vienna, Vienna, Wien 1090, Austria
- Bioinformatics Group AMIBio, LIX-Laboratoire d'Informatique d'École Polytechnique, IPP, Palaiseau 91120, France
- Rolf Backofen
- Bioinformatics Group, University of Freiburg, Freiburg 79110, Germany
- Signalling Research Centres BIOSS and CIBSS, University of Freiburg, Freiburg 79104, Germany
10
Mautner S, Montaseri S, Miladi M, Raden M, Costa F, Backofen R. ShaKer: RNA SHAPE prediction using graph kernel. Bioinformatics 2020; 35:i354-i359. [PMID: 31510707 PMCID: PMC6612843 DOI: 10.1093/bioinformatics/btz395] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 02/05/2023] [Open access]
Abstract
Summary: SHAPE experiments are used to probe the structure of RNA molecules. We present ShaKer to predict SHAPE data for RNA using a graph-kernel-based machine learning approach that is trained on experimental SHAPE information. While other available methods require a manually curated reference structure, ShaKer predicts reactivity data based on sequence input only and by sampling the ensemble of possible structures. Thus, ShaKer is well placed to enable experiment-driven, transcriptome-wide SHAPE data prediction, supporting the study of RNA structuredness and improving RNA structure and RNA–RNA interaction prediction. For performance evaluation, we compare accuracy and accessibility against experimental SHAPE data and competing methods. We show that ShaKer outperforms its competitors and is able to predict high-quality SHAPE annotations even when no reference structure is provided. Availability and implementation: ShaKer is freely available at https://github.com/BackofenLab/ShaKer.
Affiliation(s)
- Stefan Mautner
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany
- Soheila Montaseri
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany
- Milad Miladi
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany
- Martin Raden
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany
- Fabrizio Costa
- Department of Computer Science, University of Exeter, Exeter, UK
- Rolf Backofen
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany
- Signalling Research Centres BIOSS and CIBSS, University of Freiburg, Freiburg, Germany
11
Berkemer SJ, Maier LK, Amman F, Bernhart SH, Wörtz J, Märkle P, Pfeiffer F, Stadler PF, Marchfelder A. Identification of RNA 3´ ends and termination sites in Haloferax volcanii. RNA Biol 2020; 17:663-676. [PMID: 32041469 PMCID: PMC7237163 DOI: 10.1080/15476286.2020.1723328] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 12/22/2022] [Open access]
Abstract
Archaeal genomes are densely packed; thus, correct transcription termination is an important factor for orchestrated gene expression. A systematic analysis of RNA 3´ termini, to identify transcription termination sites (TTS) using RNAseq data has hitherto only been performed in two archaea, Methanosarcina mazei and Sulfolobus acidocaldarius. In this study, only regions directly downstream of annotated genes were analysed, and thus, only part of the genome had been investigated. Here, we developed a novel algorithm (Internal Enrichment-Peak Calling) that allows an unbiased, genome-wide identification of RNA 3´ termini independent of annotation. In an RNA fraction enriched for primary transcripts by terminator exonuclease (TEX) treatment we identified 1,543 RNA 3´ termini. Approximately half of these were located in intergenic regions, and the remainder were found in coding regions. A strong sequence signature consistent with known termination events at intergenic loci indicates a clear enrichment for native TTS among them. Using these data we determined distinct putative termination motifs for intergenic (a T stretch) and coding regions (AGATC). In vivo reporter gene tests of selected TTS confirmed termination at these sites, which exemplify the different motifs. For several genes, more than one termination site was detected, resulting in transcripts with different lengths of the 3´ untranslated region (3´ UTR).
Affiliation(s)
- Sarah J Berkemer
- Bioinformatics Group, Department of Computer Science - and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany.,Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
- Fabian Amman
- Institute for Theoretical Chemistry, University of Vienna, Vienna, Austria.,Division of Cell and Developmental Biology, Medical University Vienna, Vienna, Austria
- Stephan H Bernhart
- Bioinformatics Group, Department of Computer Science - and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany.,Transcriptome Bioinformatics, Interdisciplinary Center for Bioinformatics, Leipzig University, Leipzig, Germany
- Friedhelm Pfeiffer
- Computational Biology Group, Max Planck Institute of Biochemistry, Martinsried, Germany
- Peter F Stadler
- Bioinformatics Group, Department of Computer Science - and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany.,Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany.,Institute for Theoretical Chemistry, University of Vienna, Vienna, Austria.,Facultad de Ciencias, Universidad Nacional de Colombia, Bogotá, Colombia.,Center for RNA in Technology and Health, University Copenhagen, Frederiksberg C, Denmark.,Santa Fe Institute, Santa Fe, NM, USA.,German Centre for Integrative Biodiversity Research (iDiv), Halle, Jena and Leipzig, Germany.,Competence Center for Scalable Data Services and Solutions, and Leipzig, Research Center for Civilization Diseases, University Leipzig, Leipzig, Germany
12
Miladi M, Sokhoyan E, Houwaart T, Heyne S, Costa F, Grüning B, Backofen R. GraphClust2: Annotation and discovery of structured RNAs with scalable and accessible integrative clustering. Gigascience 2019; 8:giz150. [PMID: 31808801 PMCID: PMC6897289 DOI: 10.1093/gigascience/giz150] [Received: 03/15/2019] [Revised: 08/23/2019] [Accepted: 11/20/2019] [Indexed: 11/18/2022]
Abstract
BACKGROUND: RNA plays essential roles in all known forms of life. Clustering RNA sequences with common sequence and structure is an essential step towards studying RNA function. With the advent of high-throughput sequencing techniques, experimental and genomic data are expanding to complement the predictive methods. However, the existing methods do not effectively utilize and cope with the immense amount of data becoming available. RESULTS: Hundreds of thousands of non-coding RNAs have been detected; however, their annotation is lagging behind. Here we present GraphClust2, a comprehensive approach for scalable clustering of RNAs based on sequence and structural similarities. GraphClust2 bridges the gap between high-throughput sequencing and structural RNA analysis and provides an integrative solution by incorporating diverse experimental and genomic data in an accessible manner via the Galaxy framework. GraphClust2 can efficiently cluster and annotate large datasets of RNAs and supports structure-probing data. We demonstrate that the annotation performance of clustering functional RNAs can be considerably improved. Furthermore, an off-the-shelf procedure is introduced for identifying locally conserved structure candidates in long RNAs. We suggest the presence and the sparseness of phylogenetically conserved local structures for a collection of long non-coding RNAs. CONCLUSIONS: By clustering data from 2 cross-linking immunoprecipitation experiments, we demonstrate the benefits of GraphClust2 for motif discovery under the presence of biological and methodological biases. Finally, we uncover prominent targets of double-stranded RNA binding protein Roquin-1, such as BCOR's 3' untranslated region that contains multiple binding stem-loops that are evolutionarily conserved.
Affiliation(s)
- Milad Miladi
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Koehler-Allee 106, 79110 Freiburg, Germany
- Eteri Sokhoyan
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Koehler-Allee 106, 79110 Freiburg, Germany
- Torsten Houwaart
- Institute of Medical Microbiology and Hospital Hygiene, University of Dusseldorf, Universitaetsstr. 1, 40225 Dusseldorf, Germany
- Steffen Heyne
- Max Planck Institute of Immunobiology and Epigenetics, Freiburg, Stuebeweg 51, 79108 Freiburg, Germany
- Fabrizio Costa
- Department of Computer Science, University of Exeter, North Park Road, EX4 4QF Exeter, UK
- Björn Grüning
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Koehler-Allee 106, 79110 Freiburg, Germany
- ZBSA Centre for Biological Systems Analysis, University of Freiburg, Hauptstr. 1, 79104 Freiburg, Germany
- Rolf Backofen
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Koehler-Allee 106, 79110 Freiburg, Germany
- ZBSA Centre for Biological Systems Analysis, University of Freiburg, Hauptstr. 1, 79104 Freiburg, Germany
- Signalling Research Centres BIOSS and CIBSS, University of Freiburg, Schaenzlestr. 18, 79104 Freiburg, Germany
13
Crum M, Ram-Mohan N, Meyer MM. Regulatory context drives conservation of glycine riboswitch aptamers. PLoS Comput Biol 2019; 15:e1007564. [PMID: 31860665 PMCID: PMC6944388 DOI: 10.1371/journal.pcbi.1007564] [Received: 09/04/2019] [Revised: 01/06/2020] [Accepted: 11/25/2019] [Indexed: 12/13/2022]
Abstract
In comparison to protein coding sequences, the impact of mutation and natural selection on the sequence and function of non-coding (ncRNA) genes is not well understood. Many ncRNA genes are narrowly distributed to only a few organisms, and appear to be rapidly evolving. Compared to protein coding sequences, there are many challenges associated with assessment of ncRNAs that are not well addressed by conventional phylogenetic approaches, including: short sequence length, lack of primary sequence conservation, and the importance of secondary structure for biological function. Riboswitches are structured ncRNAs that directly interact with small molecules to regulate gene expression in bacteria. They typically consist of a ligand-binding domain (aptamer) whose folding changes drive changes in gene expression. The glycine riboswitch is among the most well-studied due to the widespread occurrence of a tandem aptamer arrangement (tandem), wherein two homologous aptamers interact with glycine and each other to regulate gene expression. However, a significant proportion of glycine riboswitches are comprised of single aptamers (singleton). Here we use graph clustering to circumvent the limitations of traditional phylogenetic analysis when studying the relationship between the tandem and singleton glycine aptamers. Graph clustering enables a broader range of pairwise comparison measures to be used to assess aptamer similarity. Using this approach, we show that one aptamer of the tandem glycine riboswitch pair is typically much more highly conserved, and that which aptamer is conserved depends on the regulated gene. Furthermore, our analysis also reveals that singleton aptamers are more similar to either the first or second tandem aptamer, again based on the regulated gene. Taken together, our findings suggest that tandem glycine riboswitches degrade into functional singletons, with the regulated gene(s) dictating which glycine-binding aptamer is conserved.
Affiliation(s)
- Matt Crum
- Department of Biology, Boston College, Chestnut Hill, Massachusetts, United States of America
- Nikhil Ram-Mohan
- Department of Biology, Boston College, Chestnut Hill, Massachusetts, United States of America
- Michelle M. Meyer
- Department of Biology, Boston College, Chestnut Hill, Massachusetts, United States of America
14
Eggenhofer F, Hofacker IL, Backofen R, Höner Zu Siederdissen C. CMV: visualization for RNA and protein family models and their comparisons. Bioinformatics 2019; 34:2676-2678. [PMID: 29554223 PMCID: PMC6061798 DOI: 10.1093/bioinformatics/bty158] [Received: 06/21/2017] [Accepted: 03/13/2018] [Indexed: 11/14/2022]
Abstract
Summary: A standard method for the identification of novel RNAs or proteins is homology search via probabilistic models. One approach relies on the definition of families, which can be encoded as covariance models (CMs) or Hidden Markov Models (HMMs). While being powerful tools, their complexity makes it tedious to investigate them in their (default) tabulated form. This specifically applies to the interpretation of comparisons between multiple models as in family clans. The Covariance model visualization tools (CMV) visualize CMs or HMMs to: I) Obtain an easily interpretable representation of HMMs and CMs; II) Put them in context with the structural sequence alignments they have been created from; III) Investigate results of model comparisons and highlight regions of interest. Availability and implementation: Source code (http://www.github.com/eggzilla/cmv), web-service (http://rna.informatik.uni-freiburg.de/CMVS). Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Florian Eggenhofer
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany.,Institute for Theoretical Chemistry, University of Vienna, Vienna, Austria
- Ivo L Hofacker
- Institute for Theoretical Chemistry, University of Vienna, Vienna, Austria.,Bioinformatics and Computational Biology Research Group, University of Vienna, Vienna, Austria
- Rolf Backofen
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany.,Centre for Biological Signalling Studies (BIOSS), University of Freiburg, Freiburg, Germany
- Christian Höner Zu Siederdissen
- Institute for Theoretical Chemistry, University of Vienna, Vienna, Austria.,Bioinformatics Group, Department of Computer Science, University of Leipzig, D-04107 Leipzig, Germany.,Interdisciplinary Center for Bioinformatics, University of Leipzig, D-04107 Leipzig, Germany
15
Aoki G, Sakakibara Y. Convolutional neural networks for classification of alignments of non-coding RNA sequences. Bioinformatics 2019; 34:i237-i244. [PMID: 29949978 PMCID: PMC6022636 DOI: 10.1093/bioinformatics/bty228] [Indexed: 12/13/2022]
Abstract
Motivation: The convolutional neural network (CNN) has been applied to the classification problem of DNA sequences, with the additional purpose of motif discovery. The training of CNNs with distributed representations of four nucleotides has successfully derived position weight matrices on the learned kernels that corresponded to sequence motifs such as protein-binding sites. Results: We propose a novel application of CNNs to classification of pairwise alignments of sequences for accurate clustering of sequences and show the benefits of the CNN method of inputting pairwise alignments for clustering of non-coding RNA (ncRNA) sequences and for motif discovery. Classification of a pairwise alignment of two sequences into positive and negative classes corresponds to the clustering of the input sequences. After we combined the distributed representation of RNA nucleotides with the secondary-structure information specific to ncRNAs and furthermore with mapping profiles of next-generation sequence reads, the training of CNNs for classification of alignments of RNA sequences yielded accurate clustering in terms of ncRNA families and outperformed the existing clustering methods for ncRNA sequences. Several interesting sequence motifs and secondary-structure motifs known for the snoRNA family and specific to microRNA and tRNA families were identified. Availability and implementation: The source code of our CNN software in the deep-learning framework Chainer is available at http://www.dna.bio.keio.ac.jp/cnn/, and the dataset used for performance evaluation in this work is available at the same URL.
Affiliation(s)
- Genta Aoki
- Department of Biosciences and Informatics, Keio University, Yokohama, Japan
16
Kirsch R, Seemann SE, Ruzzo WL, Cohen SM, Stadler PF, Gorodkin J. Identification and characterization of novel conserved RNA structures in Drosophila. BMC Genomics 2018; 19:899. [PMID: 30537930 PMCID: PMC6288889 DOI: 10.1186/s12864-018-5234-4] [Received: 03/22/2018] [Accepted: 11/08/2018] [Indexed: 12/20/2022]
Abstract
BACKGROUND: Comparative genomics approaches have facilitated the discovery of many novel non-coding and structured RNAs (ncRNAs). The increasing availability of related genomes now makes it possible to systematically search for compensatory base changes, and thus for conserved secondary structures, even in genomic regions that are poorly alignable in the primary sequence. The wealth of available transcriptome data can add valuable insight into expression and possible function for new ncRNA candidates. Earlier work identifying ncRNAs in Drosophila melanogaster made use of sequence-based alignments and employed a sliding window approach, inevitably biasing identification toward RNAs encoded in the more conserved parts of the genome. RESULTS: To search for conserved RNA structures (CRSs) that may not be highly conserved in sequence and to assess the expression of CRSs, we conducted a genome-wide structural alignment screen of 27 insect genomes including D. melanogaster and integrated this with an extensive set of tiling array data. The structural alignment screen revealed ∼30,000 novel candidate CRSs at an estimated false discovery rate of less than 10%. With more than one quarter of all individual CRS motifs showing sequence identities below 60%, the predicted CRSs largely complement the findings of sliding window approaches applied previously. While a sixth of the CRSs were ubiquitously expressed, we found that most were expressed in specific developmental stages or cell lines. Notably, the most statistically significant enrichments of CRSs were observed in pupae, mainly in exons of untranslated regions, promoters, enhancers, and long ncRNAs. Interestingly, cell lines were found to express a different set of CRSs than were found in vivo. Only a small fraction of intergenic CRSs were co-expressed with the adjacent protein coding genes, which suggests that most intergenic CRSs are independent genetic units.
CONCLUSIONS: This study provides a more comprehensive view of the ncRNA transcriptome in fly as well as evidence for differential expression of CRSs during development and in cell lines.
Affiliation(s)
- Rebecca Kirsch
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
- Department of Veterinary and Animal Science, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16–18, Leipzig, D-04107 Germany
- Stefan E. Seemann
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
- Department of Veterinary and Animal Science, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
- Walter L. Ruzzo
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
- School of Computer Science and Engineering, University of Washington, Box 352350, Seattle, 98195-2350 WA USA
- Department of Genome Sciences, University of Washington, Box 355065, Seattle, 98195-5065 WA USA
- Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., Seattle, 98109-1024 WA USA
- Stephen M. Cohen
- Department of Cellular and Molecular Medicine, University of Copenhagen, Blegdamsvej 3, Copenhagen N, DK-2200 Denmark
- Peter F. Stadler
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16–18, Leipzig, D-04107 Germany
- Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, Leipzig, D-04103 Germany
- Facultad de Ciencias, Universidad Nacional de Colombia, Sede Bogotá, Ciudad Universitaria, Bogotá, COL-111321 D.C. Colombia
- Department of Theoretical Chemistry, University of Vienna, Währinger Straße 17, Vienna, A-1090 Austria
- Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501 USA
- Jan Gorodkin
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
- Department of Veterinary and Animal Science, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
17
Multiple Sequence Alignments Enhance Boundary Definition of RNA Structures. Genes (Basel) 2018; 9:genes9120604. [PMID: 30518121 PMCID: PMC6315940 DOI: 10.3390/genes9120604] [Received: 09/25/2018] [Revised: 11/28/2018] [Accepted: 11/29/2018] [Indexed: 02/03/2023]
Abstract
Self-contained structured domains of RNA sequences often have distinct molecular functions. Determining the boundaries of structured domains of a non-coding RNA (ncRNA) is needed for many ncRNA gene finder programs that predict RNA secondary structures in aligned genomes because these methods do not necessarily provide precise information about the boundaries or the location of the RNA structure inside the predicted ncRNA. Even without having a structure prediction, it is of interest to search for structured domains, such as for finding common RNA motifs in RNA-protein binding assays. The precise definition of the boundaries is essential for downstream analyses such as RNA structure modelling, e.g., through covariance models, and RNA structure clustering for the search of common motifs. Such efforts have so far been focused on single sequences; here we therefore present a comparison of boundary definition between single sequences and multiple sequence alignments. We also present a novel approach, named RNAbound, for finding the boundaries that is based on probabilities of evolutionarily conserved base pairings. We tested the performance of two different methods on a limited number of Rfam families using the annotated structured RNA regions in the human genome and their multiple sequence alignments created from 14 species. The results show that multiple sequence alignments improve the boundary prediction for branched structures compared to single sequences independent of the chosen method. The actual performance of the two methods differs on single hairpin structures and branched structures. For the RNA families with branched structures, including transfer RNA (tRNA) and small nucleolar RNAs (snoRNAs), RNAbound improves the boundary predictions using multiple sequence alignments to median differences of −6 and −11.5 nucleotides (nts) for left and right boundary, respectively (window size of 200 nts).
18
Phylogenomic and comparative analysis of the distribution and regulatory patterns of TPP riboswitches in fungi. Sci Rep 2018; 8:5563. [PMID: 29615754 PMCID: PMC5882874 DOI: 10.1038/s41598-018-23900-7] [Received: 08/25/2017] [Accepted: 03/20/2018] [Indexed: 01/01/2023]
Abstract
Riboswitches are metabolite or ion sensing cis-regulatory elements that regulate the expression of the associated genes involved in biosynthesis or transport of the corresponding metabolite. Among the nearly 40 different classes of riboswitches discovered in bacteria so far, only the TPP riboswitch has also been found in algae, plants, and in fungi where their presence has been experimentally validated in a few instances. We analyzed all the available complete fungal and related genomes and identified TPP riboswitch-based regulation systems in 138 fungi and 15 oomycetes. We find that TPP riboswitches are most abundant in Ascomycota and Basidiomycota where they regulate TPP biosynthesis and/or transporter genes. Many of these transporter genes were found to contain conserved domains consistent with nucleoside, urea and amino acid transporter gene families. The genomic location of TPP riboswitches when correlated with the intron structure of the regulated genes enabled prediction of the precise regulation mechanism employed by each riboswitch. Our comprehensive analysis of TPP riboswitches in fungi provides insights about the phylogenomic distribution, regulatory patterns and functioning mechanisms of TPP riboswitches across diverse fungal species and provides a useful resource that will enhance the understanding of RNA-based gene regulation in eukaryotes.
19
Dotu I, Adamson SI, Coleman B, Fournier C, Ricart-Altimiras E, Eyras E, Chuang JH. SARNAclust: Semi-automatic detection of RNA protein binding motifs from immunoprecipitation data. PLoS Comput Biol 2018; 14:e1006078. [PMID: 29596423 PMCID: PMC5892938 DOI: 10.1371/journal.pcbi.1006078] [Received: 11/28/2017] [Revised: 04/10/2018] [Accepted: 03/05/2018] [Indexed: 12/02/2022]
Abstract
RNA-protein binding is critical to gene regulation, controlling fundamental processes including splicing, translation, localization and stability, and aberrant RNA-protein interactions are known to play a role in a wide variety of diseases. However, molecular understanding of RNA-protein interactions remains limited; in particular, identification of RNA motifs that bind proteins has long been challenging, especially when such motifs depend on both sequence and structure. Moreover, although RNA binding proteins (RBPs) often contain more than one binding domain, algorithms capable of identifying more than one binding motif simultaneously have not been developed. In this paper we present a novel pipeline to determine binding peaks in crosslinking immunoprecipitation (CLIP) data, to discover multiple possible RNA sequence/structure motifs among them, and to experimentally validate such motifs. At the core is a new semi-automatic algorithm SARNAclust, the first unsupervised method to identify and deconvolve multiple sequence/structure motifs simultaneously. SARNAclust computes similarity between sequence/structure objects using a graph kernel, providing the ability to isolate the impact of specific features through the bulge graph formalism. Application of SARNAclust to synthetic data shows its capability of clustering 5 motifs at once with a V-measure value of over 0.95, while GraphClust achieves only a V-measure of 0.083 and RNAcontext cannot detect any of the motifs. When applied to existing eCLIP sets, SARNAclust finds known motifs for SLBP and HNRNPC and novel motifs for several other RBPs such as AGGF1, AKAP8L and ILF3. We demonstrate an experimental validation protocol, a targeted Bind-n-Seq-like high-throughput sequencing approach that relies on RNA inverse folding for oligo pool design, that can validate the components within the SLBP motif. Finally, we use this protocol to experimentally interrogate the SARNAclust motif predictions for protein ILF3. 
Our results support a newly identified partially double-stranded UUUUUGAGA motif similar to that known for the splicing factor HNRNPC.
Author summary
RNA-protein binding is critical to gene regulation, and aberrant RNA-protein interactions play a role in a wide variety of diseases. However, molecular understanding of these interactions remains limited because of the difficulty of ascertaining the motifs that bind each protein. To address this challenge, we have developed a novel algorithm, SARNAclust, to computationally identify combined structure/sequence motifs from immunoprecipitation data. SARNAclust can deconvolve multiple motifs simultaneously and determine the importance of specific features through a graph kernel and bulge graph formalism. We have verified SARNAclust to be effective on synthetic motif data and also tested it on ENCODE eCLIP datasets, identifying known motifs and novel predictions. We have experimentally validated SARNAclust for two proteins, SLBP and ILF3, using RNA Bind-n-Seq measurements. Applying SARNAclust to ENCODE data provides new evidence for previously unknown regulatory interactions, notably splicing co-regulation by ILF3 and the splicing factor hnRNPC.
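The SARNAclust abstract compares tools by V-measure (over 0.95 versus 0.083). As a hedged illustration of what that score measures, the sketch below computes it from first principles as the harmonic mean of homogeneity and completeness, which is the usual definition with equal weighting. It assumes that definition is the one used in the paper; the function names and the toy labels are invented here and do not reproduce the reported values.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy (natural log) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """H(target | given) for two equal-length label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g])
                for (g, _), c in joint.items())


def v_measure(true_labels, cluster_labels):
    """Harmonic mean of homogeneity and completeness (equal weighting)."""
    h_true = _entropy(true_labels)
    h_clust = _entropy(cluster_labels)
    homogeneity = 1.0 if h_true == 0 else (
        1.0 - _conditional_entropy(true_labels, cluster_labels) / h_true)
    completeness = 1.0 if h_clust == 0 else (
        1.0 - _conditional_entropy(cluster_labels, true_labels) / h_clust)
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Invented toy example: four sequences drawn from two planted motifs.
truth = ["motifA", "motifA", "motifB", "motifB"]
print(v_measure(truth, [0, 0, 1, 1]))  # clusters match the planted motifs
print(v_measure(truth, [0, 0, 0, 0]))  # everything lumped into one cluster
```

A clustering that recovers the planted motifs exactly scores at the top of the range, while lumping everything into one cluster is complete but not homogeneous and scores at the bottom, which is why a value near 0.083 indicates that the clusters carry almost no information about the true motifs.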
Affiliation(s)
- Ivan Dotu
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, United States of America
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM)–Pompeu Fabra University (UPF), Barcelona, Spain
- Scott I. Adamson
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, United States of America
- UCONN Health, Department of Genetics and Genome Sciences, Farmington, CT, United States of America
- Benjamin Coleman
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, United States of America
- Cyril Fournier
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, United States of America
- Emma Ricart-Altimiras
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, United States of America
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM)–Pompeu Fabra University (UPF), Barcelona, Spain
- Eduardo Eyras
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM)–Pompeu Fabra University (UPF), Barcelona, Spain
- Catalan Institution for Research and Advanced Studies (ICREA), Barcelona, Spain
- Jeffrey H. Chuang
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, United States of America
- UCONN Health, Department of Genetics and Genome Sciences, Farmington, CT, United States of America
20
Smith MA, Seemann SE, Quek XC, Mattick JS. DotAligner: identification and clustering of RNA structure motifs. Genome Biol 2017; 18:244. [PMID: 29284541 PMCID: PMC5747123 DOI: 10.1186/s13059-017-1371-3] [Received: 06/01/2017] [Accepted: 12/05/2017] [Indexed: 01/01/2023]
Abstract
The diversity of processed transcripts in eukaryotic genomes poses a challenge for the classification of their biological functions. Sparse sequence conservation in non-coding sequences and the unreliable nature of RNA structure predictions further exacerbate this conundrum. Here, we describe a computational method, DotAligner, for the unsupervised discovery and classification of homologous RNA structure motifs from a set of sequences of interest. Our approach outperforms comparable algorithms at clustering known RNA structure families, both in speed and accuracy. It identifies clusters of known and novel structure motifs from ENCODE immunoprecipitation data for 44 RNA-binding proteins.
Affiliation(s)
- Martin A Smith
- RNA Biology and Plasticity Group, Garvan Institute of Medical Research, 384 Victoria Street, Sydney, NSW 2010, Australia.,St Vincent's Clinical School, Faculty of Medicine, UNSW Australia, Sydney, NSW 2010, Australia
- Stefan E Seemann
- Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, Groennegaardsvej 3, Frederiksberg, 1870, Denmark.,Department of Veterinary and Animal Sciences, Faculty of Health and Medical Sciences, University of Copenhagen, DK-1870, Frederiksberg, Denmark
- Xiu Cheng Quek
- RNA Biology and Plasticity Group, Garvan Institute of Medical Research, 384 Victoria Street, Sydney, NSW 2010, Australia.,St Vincent's Clinical School, Faculty of Medicine, UNSW Australia, Sydney, NSW 2010, Australia
- John S Mattick
- RNA Biology and Plasticity Group, Garvan Institute of Medical Research, 384 Victoria Street, Sydney, NSW 2010, Australia.,St Vincent's Clinical School, Faculty of Medicine, UNSW Australia, Sydney, NSW 2010, Australia
21
Kato Y, Gorodkin J, Havgaard JH. Alignment-free comparative genomic screen for structured RNAs using coarse-grained secondary structure dot plots. BMC Genomics 2017; 18:935. [PMID: 29197323 PMCID: PMC5712110 DOI: 10.1186/s12864-017-4309-y] [Received: 03/30/2017] [Accepted: 11/15/2017] [Indexed: 01/01/2023]
Abstract
Background: Structured non-coding RNAs play many different roles in the cells, but the annotation of these RNAs is lacking even within the human genome. The currently available computational tools are either too computationally heavy for use in full genomic screens or rely on pre-aligned sequences. Methods: Here we present a fast and efficient method, DotcodeR, for detecting structurally similar RNAs in genomic sequences by comparing their corresponding coarse-grained secondary structure dot plots at string level. This allows us to perform an all-against-all scan of all window pairs from two genomes without alignment. Results: Our computational experiments with simulated data and real chromosomes demonstrate that the presented method has good sensitivity. Conclusions: DotcodeR can be useful as a pre-filter in a genomic comparative scan for structured RNAs.
Affiliation(s)
- Yuki Kato
- Department of RNA Biology and Neuroscience, Graduate School of Medicine, Osaka University, 2-2 Yamadaoka, Suita, 565-0871, Japan.,Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, Groennegaardsvej 3, Frederiksberg, 1870, Denmark
- Jan Gorodkin
- Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, Groennegaardsvej 3, Frederiksberg, 1870, Denmark
- Jakob Hull Havgaard
- Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, Groennegaardsvej 3, Frederiksberg, 1870, Denmark.
| |
Collapse
|
22
|
Li X, Wong KC. Multiobjective Patient Stratification Using Evolutionary Multiobjective Optimization. IEEE J Biomed Health Inform 2017; 22:1619-1629. [PMID: 29990162 DOI: 10.1109/jbhi.2017.2769711] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 11/09/2022]
Abstract
One of the main challenges in modern medicine is to stratify patients for personalized care. Many different clustering methods have been proposed to solve the problem in both quantitative and biologically meaningful manners. However, existing clustering algorithms suffer from numerous restrictions such as experimental noise, high dimensionality, and poor interpretability. To overcome those limitations altogether, we propose and formulate a multiobjective framework based on evolutionary multiobjective optimization to balance the feature relevance and redundancy for patient stratification. To demonstrate the effectiveness of our proposed algorithms, we benchmark our algorithms across 55 synthetic datasets based on a real human transcription regulation network model, 35 real cancer gene expression datasets, and two case studies. Experimental results suggest that the proposed algorithms perform better than recent state-of-the-art methods. In addition, time complexity analysis, convergence analysis, and parameter analysis are conducted to demonstrate the robustness of the proposed methods from different perspectives. Finally, t-Distributed Stochastic Neighbor Embedding (t-SNE) is applied to project the selected feature subsets onto two or three dimensions to visualize the high-dimensional patient stratification data.
|
23
|
Seemann SE, Mirza AH, Hansen C, Bang-Berthelsen CH, Garde C, Christensen-Dalsgaard M, Torarinsson E, Yao Z, Workman CT, Pociot F, Nielsen H, Tommerup N, Ruzzo WL, Gorodkin J. The identification and functional annotation of RNA structures conserved in vertebrates. Genome Res 2017; 27:1371-1383. [PMID: 28487280 PMCID: PMC5538553 DOI: 10.1101/gr.208652.116] [Citation(s) in RCA: 62] [Impact Index Per Article: 7.8] [Received: 04/19/2016] [Accepted: 05/04/2017] [Indexed: 01/15/2023]
Abstract
Structured elements of RNA molecules are essential in, e.g., RNA stabilization, localization, and protein interaction, and their conservation across species suggests a common functional role. We computationally screened vertebrate genomes for conserved RNA structures (CRSs), leveraging structure-based, rather than sequence-based, alignments. After careful correction for sequence identity and GC content, we predict ∼516,000 human genomic regions containing CRSs. We find that a substantial fraction of human–mouse CRS regions (1) colocalize consistently with binding sites of the same RNA binding proteins (RBPs) or (2) are transcribed in corresponding tissues. Additionally, a CaptureSeq experiment revealed expression of many of our CRS regions in human fetal brain, including 662 novel ones. For selected human and mouse candidate pairs, qRT-PCR and in vitro RNA structure probing supported both shared expression and shared structure despite low abundance and low sequence identity. About 30,000 CRS regions are located near coding or long noncoding RNA genes or within enhancers. Structured (CRS overlapping) enhancer RNAs and extended 3′ ends have significantly increased expression levels over their nonstructured counterparts. Our findings of transcribed uncharacterized regulatory regions that contain CRSs support their RNA-mediated functionality.
Affiliation(s)
- Stefan E Seemann
  - Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark; Department of Veterinary and Animal Sciences, Faculty of Health and Medical Sciences, University of Copenhagen, DK-1870 Frederiksberg, Denmark.
- Aashiq H Mirza
  - Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark; Copenhagen Diabetes Research Center (CPH-DIRECT), Herlev University Hospital, DK-2730 Herlev, Denmark.
- Claus Hansen
  - Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark; Department of Cellular and Molecular Medicine (ICMM), Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark.
- Claus H Bang-Berthelsen
  - Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark; Department of Obesity Biology and Department of Molecular Genetics, Novo Nordisk A/S, DK-2880 Bagsværd, Denmark.
- Christian Garde
  - Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark; Department of Biotechnology and Biomedicine, Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark.
- Mikkel Christensen-Dalsgaard
  - Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark; Department of Cellular and Molecular Medicine (ICMM), Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark.
- Elfar Torarinsson
  - Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark.
- Zizhen Yao
  - Allen Institute for Brain Science, Seattle, Washington 98109, USA.
- Christopher T Workman
  - Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark; Department of Biotechnology and Biomedicine, Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark.
- Flemming Pociot
  - Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark; Copenhagen Diabetes Research Center (CPH-DIRECT), Herlev University Hospital, DK-2730 Herlev, Denmark.
- Henrik Nielsen
  - Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark; Department of Cellular and Molecular Medicine (ICMM), Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark.
- Niels Tommerup
  - Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark; Department of Cellular and Molecular Medicine (ICMM), Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark.
- Walter L Ruzzo
  - Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark; School of Computer Science and Engineering and Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA; Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, USA.
- Jan Gorodkin
  - Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark; Department of Veterinary and Animal Sciences, Faculty of Health and Medical Sciences, University of Copenhagen, DK-1870 Frederiksberg, Denmark.
|