1
|
Ma H, Wen H, Xue Z, Li G, Zhang Z. RNANetMotif: Identifying sequence-structure RNA network motifs in RNA-protein binding sites. PLoS Comput Biol 2022; 18:e1010293. [PMID: 35819951 PMCID: PMC9275694 DOI: 10.1371/journal.pcbi.1010293] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/16/2021] [Accepted: 06/09/2022] [Indexed: 11/19/2022] Open
Abstract
RNA molecules can adopt stable secondary and tertiary structures, which are essential in mediating physical interactions with other partners such as RNA binding proteins (RBPs) and in carrying out their cellular functions. In vitro and in vivo experiments such as RNAcompete and eCLIP have revealed in vitro binding preferences of RBPs to RNA oligomers and in vivo binding sites in cells. Analysis of these binding data showed that the structural properties of the RNAs in these binding sites are important determinants of the binding events; however, it has been a challenge to incorporate the structure information into an interpretable model. Here we describe a new approach, RNANetMotif, which takes the predicted secondary structures of thousands of RNA sequences bound by an RBP as input and uses a graph theory approach to recognize enriched subgraphs. These enriched subgraphs are in essence shared sequence-structure elements that are important in RBP-RNA binding. To validate our approach, we performed RNA structure modeling via coarse-grained molecular dynamics folding simulations for 4 selected RBPs, and RNA-protein docking for LIN28B. The simulation results, e.g., solvent accessibility and energetics, further support the biological relevance of the discovered network subgraphs.
RNA binding proteins (RBPs) regulate every aspect of RNA biology, including splicing, translation, transportation, and degradation. High-throughput technologies such as eCLIP have identified thousands of binding sites for a given RBP throughout the genome. Earlier studies have shown that, in addition to nucleotide sequences, the structure and conformation of RNAs also play an important role in RBP-RNA interactions. Analogous to protein-protein interactions or protein-DNA interactions, it is likely that there exist intrinsic sequence-structure motifs common to these RNAs that underlie their binding specificity to specific RBPs.
It is known that RNAs form energetically favorable secondary structures, which can be represented as graphs, with nucleotides being nodes and backbone covalent bonds and base-pairing hydrogen bonds representing edges. We hypothesize that these graphs can be mined by graph theory approaches to identify sequence-structure motifs as enriched subgraphs. In this article, we describe the details of this approach, termed RNANetMotif, and the associated new concepts, namely EKS (Extended K-mer Subgraph) and the GraphK graph algorithm. To test the utility of our approach, we conducted 3D structure modeling of selected RNA sequences through molecular dynamics (MD) folding simulation and evaluated the significance of the discovered RNA motifs by comparing their spatial exposure with other regions on the RNA. We believe that this approach has the novelty of treating the RNA sequence as a graph and RBP binding sites as enriched subgraphs, which has broader applications beyond RBP-RNA interactions.
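The graph representation described in this abstract (nucleotides as nodes, backbone bonds and base pairs as edges) can be sketched as follows. This is only an illustrative construction from a dot-bracket string, not the RNANetMotif implementation; the function name `structure_graph` and the edge labels are assumptions made for the example, and pseudoknots are not handled.

```python
def structure_graph(sequence, dot_bracket):
    """Build a graph from an RNA sequence and its dot-bracket structure.

    Returns (nodes, edges) where nodes is a list of (index, nucleotide)
    and edges is a set of (i, j, kind) with i < j and kind being either
    "backbone" or "basepair". Hypothetical helper, for illustration only.
    """
    if len(sequence) != len(dot_bracket):
        raise ValueError("sequence and structure must have equal length")
    nodes = list(enumerate(sequence))
    edges = set()
    # Covalent backbone bonds link consecutive nucleotides.
    for i in range(len(sequence) - 1):
        edges.add((i, i + 1, "backbone"))
    # Matching parentheses give the base-pairing (hydrogen bond) edges.
    stack = []
    for j, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            edges.add((stack.pop(), j, "basepair"))
    if stack:
        raise ValueError("unbalanced structure")
    return nodes, edges


nodes, edges = structure_graph("GGGAAACCC", "(((...)))")
basepairs = sorted((i, j) for i, j, kind in edges if kind == "basepair")
backbone = [e for e in edges if e[2] == "backbone"]
```

On such a graph, a sequence-structure motif would correspond to a small labelled subgraph that recurs across many bound RNAs; how enrichment is scored is specific to the paper and is not reproduced here.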
Affiliation(s)
- Hongli Ma
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
  - Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
  - School of Mathematics, Shandong University, Jinan, China
- Han Wen
  - Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
- Zhiyuan Xue
  - West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
- Guojun Li
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
  - School of Mathematics, Shandong University, Jinan, China
  - School of Mathematical Science, Liaocheng University, Liaocheng, China
- Zhaolei Zhang
  - Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
|
2
|
Method to Determine the Centroid of Non-Homogeneous Polygons Based on Suspension Theory. ISPRS International Journal of Geo-Information 2022. [DOI: 10.3390/ijgi11040233] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2023]
Abstract
The centroid is most often used to describe the average position of an object’s mass and has very important applications in computational geometry, applied physics, and spatial information fields, amongst others. Based on the suspension theory of physics, this paper proposes a new method to determine the centroid of a non-homogeneous polygon by the intersection of the two balance lines. By considering the inside point value and distance to the balance line, the proposed method overcomes the traditional method’s limitation of only considering the geometric coordinates of the boundary points of the polygon. The results show that the consideration of grid distance and grid value is logical and consistent with the calculation of the centroid of a non-homogeneous polygon. While using this method, suitable values for the relevant parameters need to be established according to specific application instances. The proposed method can be applied to aid in solving specific problems such as location assessment, allocation of resources, spatial optimization, and other related uses.
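For comparison, the textbook value-weighted centroid of a gridded polygon can be sketched as below. This is not the balance-line procedure proposed in the paper, whose details and parameters are not given in the abstract; it only illustrates why interior grid values, and not just boundary coordinates, move the centroid. The function name `weighted_centroid` and the (x, y, value) cell layout are assumptions.

```python
def weighted_centroid(cells):
    """Value-weighted mean position of grid cells.

    cells: list of (x, y, value) tuples, one per grid cell inside the
    polygon, with non-negative values. Hypothetical helper.
    """
    total = sum(value for _, _, value in cells)
    if total <= 0:
        raise ValueError("total value must be positive")
    cx = sum(x * value for x, _, value in cells) / total
    cy = sum(y * value for _, y, value in cells) / total
    return cx, cy


# Four cells at the corners of a square; one corner carries more value.
cells = [(0, 0, 1), (2, 0, 1), (0, 2, 1), (2, 2, 5)]
centroid = weighted_centroid(cells)

# With equal values the result falls back to the geometric centre.
uniform = weighted_centroid([(x, y, 1) for x, y, _ in cells])
```

In this toy case the heavier corner pulls the centroid from (1.0, 1.0) to (1.5, 1.5).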
|
3
|
Yu B, Lu Y, Zhang QC, Hou L. Prediction and differential analysis of RNA secondary structure. Quantitative Biology 2020. [DOI: 10.1007/s40484-020-0205-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 10/24/2022]
|
4
|
Takahashi S, Sugimoto N. Stability prediction of canonical and non-canonical structures of nucleic acids in various molecular environments and cells. Chem Soc Rev 2020; 49:8439-8468. [DOI: 10.1039/d0cs00594k] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Indexed: 12/13/2022]
Abstract
This review provides the biophysicochemical background and recent advances in stability prediction of canonical and non-canonical structures of nucleic acids in various molecular environments and cells.
Affiliation(s)
- Shuntaro Takahashi
  - Frontier Institute for Biomolecular Engineering Research (FIBER), Konan University, Kobe, Japan
- Naoki Sugimoto
  - Frontier Institute for Biomolecular Engineering Research (FIBER), Konan University, Kobe, Japan
  - Graduate School of Frontiers of Innovative Research in Science and Technology (FIRST)
|
5
|
Akiyama M, Sato K, Sakakibara Y. A max-margin training of RNA secondary structure prediction integrated with the thermodynamic model. J Bioinform Comput Biol 2019; 16:1840025. [PMID: 30616476 DOI: 10.1142/s0219720018400255] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Indexed: 01/01/2023]
Abstract
A popular approach for predicting RNA secondary structure is the thermodynamic nearest-neighbor model, which finds the thermodynamically most stable secondary structure, that is, the one with minimum free energy (MFE). For further improvement, an alternative approach based on machine learning techniques has been developed. The machine learning-based approach can employ a fine-grained model that includes much richer feature representations with the ability to fit the training data. Although a machine learning-based fine-grained model achieved extremely high prediction accuracy, a risk of overfitting for such a model has been reported. In this paper, we propose a novel algorithm for RNA secondary structure prediction that integrates the thermodynamic approach and the machine learning-based weighted approach. Our fine-grained model combines the experimentally determined thermodynamic parameters with a large number of scoring parameters for detailed contexts of features that are trained by the structured support vector machine (SSVM) with the [Formula: see text] regularization to avoid overfitting. Our benchmark shows that our algorithm achieves the best prediction accuracy compared with existing methods, and that no heavy overfitting is observed. The implementation of our algorithm is available at https://github.com/keio-bioinformatics/mxfold.
Affiliation(s)
- Manato Akiyama
  - Department of Biosciences and Informatics, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan
- Kengo Sato
  - Department of Biosciences and Informatics, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan
- Yasubumi Sakakibara
  - Department of Biosciences and Informatics, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan
|
6
|
Johnston I, Hancock T, Mamitsuka H, Carvalho L. Gene-proximity models for genome-wide association studies. Ann Appl Stat 2016. [DOI: 10.1214/16-aoas907] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/19/2022]
|
7
|
Pham LM, Carvalho L, Schaus S, Kolaczyk ED. Perturbation Detection Through Modeling of Gene Expression on a Latent Biological Pathway Network: A Bayesian hierarchical approach. J Am Stat Assoc 2016; 111:73-92. [PMID: 27647944 DOI: 10.1080/01621459.2015.1110523] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 10/22/2022]
Abstract
Cellular response to a perturbation is the result of a dynamic system of biological variables linked in a complex network. A major challenge in drug and disease studies is identifying the key factors of a biological network that are essential in determining the cell's fate. Here our goal is the identification of perturbed pathways from high-throughput gene expression data. We develop a three-level hierarchical model, where (i) the first level captures the relationship between gene expression and biological pathways using confirmatory factor analysis, (ii) the second level models the behavior within an underlying network of pathways induced by an unknown perturbation using a conditional autoregressive model, and (iii) the third level is a spike-and-slab prior on the perturbations. We then identify perturbations through posterior-based variable selection. We illustrate our approach using gene transcription drug perturbation profiles from the DREAM7 drug sensitivity prediction challenge data set. Our proposed method identified regulatory pathways that are known to play a causative role and that were not readily resolved using gene set enrichment analysis or exploratory factor models. Simulation results are presented assessing the performance of this model relative to a network-free variant and its robustness to inaccuracies in biological databases.
|
8
|
Peng L, Carvalho L. Bayesian degree-corrected stochastic blockmodels for community detection. Electron J Stat 2016. [DOI: 10.1214/16-ejs1163] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 11/19/2022]
|
9
|
Parks MM, Lawrence CE, Raphael BJ. Detecting non-allelic homologous recombination from high-throughput sequencing data. Genome Biol 2015; 16:72. [PMID: 25886137 PMCID: PMC4425883 DOI: 10.1186/s13059-015-0633-1] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Received: 01/23/2015] [Accepted: 03/16/2015] [Indexed: 12/27/2022] Open
Abstract
Non-allelic homologous recombination (NAHR) is a common mechanism for generating genome rearrangements and is implicated in numerous genetic disorders, but its detection in high-throughput sequencing data poses a serious challenge. We present a probabilistic model of NAHR and demonstrate its ability to find NAHR in low-coverage sequencing data from 44 individuals. We identify NAHR-mediated deletions or duplications in 109 of 324 potential NAHR loci in at least one of the individuals. These calls segregate by ancestry, are more common in closely spaced repeats, often result in duplicated genes or pseudogenes, and affect highly studied genes such as GBA and CYP2E1.
Affiliation(s)
- Matthew M Parks
  - Division of Applied Mathematics, Brown University, Providence, USA.
- Charles E Lawrence
  - Division of Applied Mathematics, Brown University, Providence, USA.
  - Center for Computational Molecular Biology, Brown University, Providence, USA.
- Benjamin J Raphael
  - Center for Computational Molecular Biology, Brown University, Providence, USA.
  - Department of Computer Science, Brown University, Providence, USA.
|
10
|
Herman JL, Novák Á, Lyngsø R, Szabó A, Miklós I, Hein J. Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs. BMC Bioinformatics 2015; 16:108. [PMID: 25888064 PMCID: PMC4395974 DOI: 10.1186/s12859-015-0516-1] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 03/24/2014] [Accepted: 02/24/2015] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment. RESULTS In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased. CONCLUSIONS The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference. 
Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign.
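The alignment-DAG idea in this abstract (nodes are alignment columns, each path is a valid alignment, column probabilities come from empirical frequencies) can be illustrated with a toy sketch. This is not the WeaveAlign code; the column encoding (a tuple of residue indices per sequence, `None` for a gap), the `START`/`END` sentinels, and both function names are assumptions for the example.

```python
from collections import Counter, defaultdict
from functools import lru_cache

START, END = "START", "END"


def build_alignment_dag(samples):
    """Merge sampled alignments into one DAG over columns.

    samples: list of alignments, each a list of hashable columns.
    Returns (column frequencies, successor sets).
    """
    column_counts = Counter()
    successors = defaultdict(set)
    for alignment in samples:
        column_counts.update(set(alignment))
        path = [START, *alignment, END]
        for a, b in zip(path, path[1:]):
            successors[a].add(b)
    freqs = {col: n / len(samples) for col, n in column_counts.items()}
    return freqs, successors


def count_paths(successors):
    """Count START-to-END paths, i.e. alignments encoded by the DAG."""
    @lru_cache(maxsize=None)
    def paths_from(node):
        if node == END:
            return 1
        return sum(paths_from(nxt) for nxt in successors[node])
    return paths_from(START)


# Two sampled pairwise alignments that share the middle column (1, 1).
sample_a = [(0, None), (None, 0), (1, 1), (2, 2)]
sample_b = [(0, 0), (1, 1), (2, None), (None, 2)]
freqs, successors = build_alignment_dag([sample_a, sample_b])
n_alignments = count_paths(successors)
```

Because the two samples cross at a shared column, the DAG encodes four alignments from only two samples, which is the "larger effective sample size" effect the abstract describes, here in miniature.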
Affiliation(s)
- Joseph L Herman
  - Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK.
  - Division of Mathematical Biology, National Institute of Medical Research, The Ridgeway, London, NW7 1AA, UK.
- Ádám Novák
  - Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK.
- Rune Lyngsø
  - Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK.
- Adrienn Szabó
  - Institute of Computer Science and Control, Hungarian Academy of Sciences, Lagymanyosi u. 11., Budapest, 1111, Hungary.
- István Miklós
  - Institute of Computer Science and Control, Hungarian Academy of Sciences, Lagymanyosi u. 11., Budapest, 1111, Hungary.
  - Department of Stochastics, Rényi Institute, Reáltanoda u. 13-15, Budapest, 1053, Hungary.
- Jotun Hein
  - Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK.
|
11
|
Wang Z, He Q, Larget B, Newton MA. A multi-functional analyzer uses parameter constraints to improve the efficiency of model-based gene-set analysis. Ann Appl Stat 2015. [DOI: 10.1214/14-aoas777] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
|
12
|
Abstract
It has been well accepted that the RNA secondary structures of most functional non-coding RNAs (ncRNAs) are closely related to their functions and are conserved during evolution. Hence, prediction of conserved secondary structures from evolutionarily related sequences is one important task in RNA bioinformatics; the methods are useful not only to further functional analyses of ncRNAs but also to improve the accuracy of secondary structure predictions and to find novel functional RNAs from the genome. In this review, I focus on common secondary structure prediction from a given aligned RNA sequence, in which one secondary structure whose length is equal to that of the input alignment is predicted. I systematically review and classify existing tools and algorithms for the problem, by utilizing the information employed in the tools and by adopting a unified viewpoint based on maximum expected gain (MEG) estimators. I believe that this classification will allow a deeper understanding of each tool and provide users with useful information for selecting tools for common secondary structure predictions.
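For orientation, a maximum expected gain (MEG) estimator is commonly written in the following general form. The notation is an assumption made for this note rather than taken from the review: $Y$ is the space of candidate structures, $\theta$ ranges over possible reference structures, $A$ is the input alignment, $p(\theta \mid A)$ is the posterior distribution over structures, and $G(\theta, y)$ is a gain function measuring how well a prediction $y$ agrees with $\theta$.

```latex
\hat{y} \;=\; \operatorname*{arg\,max}_{y \in Y} \; \sum_{\theta \in Y} G(\theta, y)\, p(\theta \mid A)
```

With the gain $G(\theta, y) = 1$ if $\theta = y$ and $0$ otherwise, this reduces to choosing the single most probable structure; gains that count correctly predicted base pairs instead reward structures that agree with the bulk of the posterior.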
|
13
|
Mori R, Hamada M, Asai K. Efficient calculation of exact probability distributions of integer features on RNA secondary structures. BMC Genomics 2014; 15 Suppl 10:S6. [PMID: 25560710 PMCID: PMC4304215 DOI: 10.1186/1471-2164-15-s10-s6] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Although the need for analyses of secondary structures of RNAs is increasing, prediction of the secondary structures of RNAs is not always reliable. Because an RNA may have a complicated energy landscape, comprehensive representations of the whole ensemble of the secondary structures, such as the probability distributions of various features of RNA secondary structures, are required. RESULTS A general method to efficiently compute the distribution of any integer scalar/vector function on the secondary structure is proposed. We also show two concrete algorithms, for Hamming distance from a reference structure and for 5'-3' distance, which can be constructed by following our general method. These practical applications show the effectiveness of the proposed method. CONCLUSIONS The proposed method provides a clear and comprehensive procedure to construct algorithms for distributions of various integer features. In addition, distributions of integer vectors, that is, combinations of different integer scores, can also be described by applying our 2D expanding technique.
Affiliation(s)
- Ryota Mori
  - Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa City, Chiba, Japan
- Michiaki Hamada
  - Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Koto Ward, Tokyo, Japan
  - Faculty of Science and Engineering, Waseda University, Shinjuku Ward, Tokyo, Japan
- Kiyoshi Asai
  - Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa City, Chiba, Japan
  - Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Koto Ward, Tokyo, Japan
|
14
|
Johnston I, Carvalho LE. A Bayesian hierarchical gene model on latent genotypes for genome-wide association studies. BMC Proc 2014; 8:S45. [PMID: 25519327 PMCID: PMC4143727 DOI: 10.1186/1753-6561-8-s1-s45] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022] Open
Abstract
The primary goal of genome-wide association studies is to determine which genetic markers are associated with genetic traits, most commonly human diseases. As a result of the "large p, small n" nature of genome-wide association study data sets, and especially because of the collinearity due to linkage disequilibrium, multivariate regression results in an ill-posed problem. To overcome these obstacles, we propose preprocessing single-nucleotide polymorphisms to adjust for linkage disequilibrium, and a novel Bayesian statistical model that exploits a hierarchical structure between single-nucleotide polymorphisms and genes. We obtain posterior samples using a hybrid Metropolis-within-Gibbs sampler, and further conduct inference on single-nucleotide polymorphism and gene associations using centroid estimation. Finally, we illustrate the proposed model and estimation procedure and discuss results obtained on the data provided for the Genetic Analysis Workshop 18.
Affiliation(s)
- Ian Johnston
  - Mathematics and Statistics Department, Boston University, 111 Cummington Mall, Boston, MA 02215, USA
- Luis E Carvalho
  - Mathematics and Statistics Department, Boston University, 111 Cummington Mall, Boston, MA 02215, USA
|
15
|
Ruggieri E, Lawrence CE. The Bayesian Change Point and Variable Selection Algorithm: Application to the δ18O Proxy Record of the Plio-Pleistocene. J Comput Graph Stat 2014. [DOI: 10.1080/10618600.2012.707852] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 10/25/2022]
|
16
|
Abstract
Motivation: Reconstruction of the network-level evolutionary history of protein–protein interactions provides a principled way to relate interactions in several present-day networks. Here, we present a general framework for inferring such histories and demonstrate how it can be used to determine what interactions existed in the ancestral networks, which present-day interactions we might expect to exist based on evolutionary evidence and what information extant networks contain about the order of ancestral protein duplications. Results: Our framework characterizes the space of likely parsimonious network histories. It results in a structure that can be used to find probabilities for a number of events associated with the histories. The framework is based on a directed hypergraph formulation of dynamic programming that we extend to enumerate many optimal and near-optimal solutions. The algorithm is applied to reconstructing ancestral interactions among bZIP transcription factors, imputing missing present-day interactions among the bZIPs and among proteins from five herpes viruses, and determining relative protein duplication order in the bZIP family. Our approach more accurately reconstructs ancestral interactions than existing approaches. In cross-validation tests, we find that our approach ranks the majority of the left-out present-day interactions among the top 2 and 17% of possible edges for the bZIP and herpes networks, respectively, making it a competitive approach for edge imputation. It also estimates relative bZIP protein duplication orders, using only interaction data and phylogenetic tree topology, which are significantly correlated with sequence-based estimates. Availability: The algorithm is implemented in C++, is open source and is available at http://www.cs.cmu.edu/ckingsf/software/parana2. Contact: robp@cs.cmu.edu or carlk@cs.cmu.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Rob Patro
  - Lane Center for Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
|
17
|
Carvalho L. Bayesian centroid estimation for motif discovery. PLoS One 2013; 8:e80511. [PMID: 24324603 PMCID: PMC3855595 DOI: 10.1371/journal.pone.0080511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2013] [Accepted: 10/03/2013] [Indexed: 11/29/2022] Open
Abstract
Biological sequences may contain patterns that signal important biomolecular functions; a classical example is regulation of gene expression by transcription factors that bind to specific patterns in genomic promoter regions. In motif discovery we are given a set of sequences that share a common motif and aim to identify not only the motif composition, but also the binding sites in each sequence of the set. We propose a new centroid estimator that arises from a refined and meaningful loss function for binding site inference. We discuss the main advantages of centroid estimation for motif discovery, including computational convenience, and how its principled derivation offers further insights about the posterior distribution of binding site configurations. We also illustrate, using simulated and real datasets, that the centroid estimator can differ from the traditional maximum a posteriori or maximum likelihood estimators.
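The closing claim of this abstract, that a centroid estimate can differ from the maximum a posteriori (MAP) estimate, can be seen in a small sketch. It uses the generic Hamming-loss centroid for unconstrained binary vectors (keep a position when its marginal posterior probability exceeds 1/2), not the refined loss function proposed in the paper, and it estimates both quantities from posterior samples. The function names and the toy samples are assumptions.

```python
from collections import Counter


def map_estimate(samples):
    """Most frequent sampled configuration (sample-based MAP proxy)."""
    return Counter(samples).most_common(1)[0][0]


def centroid_estimate(samples):
    """Hamming-loss centroid: threshold each marginal at 1/2."""
    n = len(samples)
    length = len(samples[0])
    marginals = [sum(s[i] for s in samples) / n for i in range(length)]
    return tuple(1 if p > 0.5 else 0 for p in marginals)


# Toy posterior sample over three candidate binding-site indicators.
samples = [(0, 0, 1)] * 4 + [(1, 1, 0)] * 3 + [(1, 0, 0)] * 3
map_config = map_estimate(samples)
centroid_config = centroid_estimate(samples)
```

Here the single most probable configuration is (0, 0, 1), yet the first site has marginal probability 0.6 and the third only 0.4, so the centroid selects (1, 0, 0) instead.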
Affiliation(s)
- Luis Carvalho
  - Department of Mathematics and Statistics, Boston University, Boston, Massachusetts, United States of America
|
18
|
Boca SM, Bravo HC, Caffo B, Leek JT, Parmigiani G. A decision-theory approach to interpretable set analysis for high-dimensional data. Biometrics 2013; 69:614-23. [PMID: 23909925 DOI: 10.1111/biom.12060] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 03/01/2012] [Revised: 03/01/2013] [Accepted: 03/01/2013] [Indexed: 02/06/2023]
Abstract
A key problem in high-dimensional significance analysis is to find pre-defined sets that show enrichment for a statistical signal of interest; the classic example is the enrichment of gene sets for differentially expressed genes. Here, we propose a new decision-theory approach to the analysis of gene sets which focuses on estimating the fraction of non-null variables in a set. We introduce the idea of "atoms," non-overlapping sets based on the original pre-defined set annotations. Our approach focuses on finding the union of atoms that minimizes a weighted average of the number of false discoveries and missed discoveries. We introduce a new false discovery rate for sets, called the atomic false discovery rate (afdr), and prove that the optimal estimator in our decision-theory framework is to threshold the afdr. These results provide a coherent and interpretable framework for the analysis of sets that addresses the key issues of overlapping annotations and difficulty in interpreting p values in both competitive and self-contained tests. We illustrate our method and compare it to a popular existing method using simulated examples, as well as gene-set and brain ROI data analyses.
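One natural reading of the "atoms" in this abstract is that they are the non-overlapping pieces obtained by grouping variables that belong to exactly the same collection of annotated sets. The sketch below implements that reading only; the paper's precise definition may differ, and the function name `atoms` and the toy gene sets are assumptions.

```python
from collections import defaultdict


def atoms(named_sets):
    """Split overlapping sets into disjoint atoms.

    named_sets: dict mapping a set name to a set of elements.
    Returns a dict mapping each membership signature (a frozenset of
    set names) to the elements that carry exactly that signature.
    """
    membership = defaultdict(set)
    for name, members in named_sets.items():
        for element in members:
            membership[element].add(name)
    grouped = defaultdict(set)
    for element, names in membership.items():
        grouped[frozenset(names)].add(element)
    return dict(grouped)


gene_sets = {"A": {"g1", "g2", "g3"}, "B": {"g3", "g4"}}
result = atoms(gene_sets)
```

Every original set is then a union of atoms (set A is the union of the {A} and {A, B} atoms), so a decision made at the atom level can be reported without double-counting shared genes.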
Affiliation(s)
- Simina M Boca
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland 20892, U.S.A
|
19
|
Abstract
Many bioinformatics problems, such as sequence alignment, gene prediction, phylogenetic tree estimation and RNA secondary structure prediction, are often affected by the 'uncertainty' of a solution, that is, the probability of the solution is extremely small. This situation arises for estimation problems on high-dimensional discrete spaces in which the number of possible discrete solutions is immense. In the analysis of biological data or the development of prediction algorithms, this uncertainty should be handled carefully and appropriately. In this review, I will explain several methods to combat this uncertainty, presenting a number of examples in bioinformatics. The methods include (i) avoiding point estimation, (ii) maximum expected accuracy (MEA) estimations and (iii) several strategies to design a pipeline involving several prediction methods. I believe that the basic concepts and ideas described in this review will be generally useful for estimation problems in various areas of bioinformatics.
|
20
|
Hamada M. Direct updating of an RNA base-pairing probability matrix with marginal probability constraints. J Comput Biol 2013; 19:1265-76. [PMID: 23210474 DOI: 10.1089/cmb.2012.0215] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 12/22/2022] Open
Abstract
A base-pairing probability matrix (BPPM) stores the probabilities for every possible base pair in an RNA sequence and has been used in many algorithms in RNA informatics (e.g., RNA secondary structure prediction and motif search). In this study, we propose a novel algorithm to perform iterative updates of a given BPPM, satisfying marginal probability constraints that are (approximately) given by recently developed biochemical experiments, such as SHAPE, PAR, and FragSeq. The method is easily implemented and is applicable to common models for RNA secondary structures, such as energy-based or machine-learning-based models. In this article, we focus mainly on the details of the algorithms, although preliminary computational experiments will also be presented.
Affiliation(s)
- Michiaki Hamada
  - The University of Tokyo, Graduate School of Frontier Science, Kashiwa, Japan.
|
21
|
Sato K, Kato Y, Akutsu T, Asai K, Sakakibara Y. DAFS: simultaneous aligning and folding of RNA sequences via dual decomposition. Bioinformatics 2012; 28:3218-24. [PMID: 23060618 DOI: 10.1093/bioinformatics/bts612] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Indexed: 12/23/2022]
Abstract
MOTIVATION It is well known that the accuracy of RNA secondary structure prediction from a single sequence is limited, and thus a comparative approach that predicts a common secondary structure from aligned sequences is a better choice if homologous sequences with reliable alignments are available. However, correct secondary structure information is needed to produce reliable alignments of RNA sequences. To tackle this dilemma, we require a fast and accurate aligner that takes structural information into consideration to yield reliable structural alignments, which are suitable for common secondary structure prediction. RESULTS We develop DAFS, a novel algorithm that simultaneously aligns and folds RNA sequences based on maximizing expected accuracy of a predicted common secondary structure and its alignment. DAFS decomposes the pairwise structural alignment problem into two independent secondary structure prediction problems and one pairwise (non-structural) alignment problem by the dual decomposition technique, and maintains the consistency of a pairwise structural alignment by imposing penalties on inconsistent base pairs and alignment columns that are iteratively updated. Furthermore, we extend DAFS to consider pseudoknots in RNA structural alignments by integrating IPknot for predicting a pseudoknotted structure. The experiments on publicly available datasets showed that DAFS can produce reliable structural alignments from unaligned sequences in terms of accuracy of common secondary structure prediction.
Affiliation(s)
- Kengo Sato
- Department of Biosciences and Informatics, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan.
|
22
|
Ruggieri E, Lawrence CE. On efficient calculations for Bayesian variable selection. Comput Stat Data Anal 2012. [DOI: 10.1016/j.csda.2011.09.026] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/27/2022]
|
23
|
Lange SJ, Maticzka D, Möhl M, Gagnon JN, Brown CM, Backofen R. Global or local? Predicting secondary structure and accessibility in mRNAs. Nucleic Acids Res 2012; 40:5215-26. [PMID: 22373926 PMCID: PMC3384308 DOI: 10.1093/nar/gks181] [Citation(s) in RCA: 116] [Impact Index Per Article: 9.7] [Indexed: 12/23/2022] Open
Abstract
Determining the structural properties of mRNA is key to understanding vital post-transcriptional processes. As experimental data on mRNA structure are scarce, accurate structure prediction is required to characterize RNA regulatory mechanisms. Although various structure prediction approaches are available, it is often unclear which to choose and how to set their parameters. Furthermore, no standard measure to compare predictions of local structure exists. We assessed the performance of different methods using two types of data: transcriptome-wide enzymatic probing information and a large, curated set of cis-regulatory elements. To compare the approaches, we introduced structure accuracy, a measure that is applicable to both global and local methods. Our results showed that local folding was more accurate than the classic global approach. We investigated how the locality parameters, maximum base pair span and window size, influenced the prediction performance. A span of 150 provided a reasonable balance between maximizing the number of accurately predicted base pairs, while minimizing effects of incorrect long-range predictions. We characterized the error at artificial sequence ends, which we reduced by setting the window size sufficiently greater than the maximum span. Our method, LocalFold, diminished all border effects and produced the most robust performance.
Affiliation(s)
- Sita J Lange
- Department of Computer Science and Centre for Biological Signalling Studies (BIOSS), Albert-Ludwigs-Universität Freiburg, Germany
|
24
|
Hamada M, Asai K. A classification of bioinformatics algorithms from the viewpoint of maximizing expected accuracy (MEA). J Comput Biol 2012; 19:532-49. [PMID: 22313125 DOI: 10.1089/cmb.2011.0197] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022] Open
Abstract
Many estimation problems in bioinformatics are formulated as point estimation problems in a high-dimensional discrete space. In general, it is difficult to design reliable estimators for this type of problem, because the number of possible solutions is immense, which leads to an extremely low probability for every solution, even the one with the highest probability. Therefore, maximum score and maximum likelihood estimators do not work well in this situation, although they are widely employed in a number of applications. Maximizing expected accuracy (MEA) estimation, in which accuracy measures of the target problem and the entire distribution of solutions are considered, is a more successful approach. In this review, we provide an extensive discussion of algorithms and software based on MEA. We describe how a number of algorithms used in previous studies can be classified from the viewpoint of MEA. We believe that this review will be useful not only for users wishing to utilize software to solve the estimation problems appearing in this article, but also for developers wishing to design algorithms on the basis of MEA.
Affiliation(s)
- Michiaki Hamada
- Graduate School of Frontier Sciences, University of Tokyo, Kashiwa, Japan.
|
25
|
Okada Y, Saito Y, Sato K, Sakakibara Y. Improved measurements of RNA structure conservation with generalized centroid estimators. Front Genet 2012; 2:54. [PMID: 22303350 PMCID: PMC3268607 DOI: 10.3389/fgene.2011.00054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2011] [Accepted: 08/08/2011] [Indexed: 11/13/2022] Open
Abstract
Identification of non-protein-coding RNAs (ncRNAs) in genomes is a crucial task for not only molecular cell biology but also bioinformatics. Secondary structures of ncRNAs are employed as a key feature of ncRNA analysis since biological functions of ncRNAs are deeply related to their secondary structures. Although the minimum free energy (MFE) structure of an RNA sequence is regarded as the most stable structure, MFE alone is not an appropriate measure for identifying ncRNAs since the free energy is heavily biased by the nucleotide composition. Therefore, instead of MFE itself, several alternative measures for identifying ncRNAs have been proposed, such as the structure conservation index (SCI) and the base pair distance (BPD), both of which employ MFE structures. However, these measurements are not suitable for identifying ncRNAs in some cases, including genome-wide search, and incur a high false discovery rate. In this study, we propose improved measurements based on SCI and BPD, applying generalized centroid estimators to add robustness against low-quality multiple alignments. Our experiments show that our proposed methods achieve higher accuracy than the original SCI and BPD for not only human-curated structural alignments but also low-quality alignments produced by CLUSTAL W. Furthermore, the centroid-based SCI on CLUSTAL W alignments is more accurate than or comparable with the original SCI on structural alignments generated with RAF, a high-quality structural aligner that requires about twice the computation time on average. We conclude that our methods are more suitable than the original SCI and BPD for genome-wide alignments, which are of low quality from the point of view of secondary structure.
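The two baseline measures named in this abstract can be illustrated with a minimal Python sketch. It assumes the commonly used definitions (SCI as the consensus folding energy divided by the mean single-sequence MFE, and base pair distance as the number of base pairs found in exactly one of two structures). The function names and inputs are hypothetical, energies are expected to come from an external folding tool, and the centroid-based variants proposed by the authors are not reproduced here.

```python
def parse_pairs(dot_bracket):
    """Return the set of base pairs (i, j) encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for pos, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def base_pair_distance(struct_a, struct_b):
    """Count base pairs present in exactly one of the two structures."""
    return len(parse_pairs(struct_a) ^ parse_pairs(struct_b))


def structure_conservation_index(consensus_energy, single_energies):
    """SCI = consensus energy / mean single-sequence MFE (same energy units)."""
    mean_mfe = sum(single_energies) / len(single_energies)
    return consensus_energy / mean_mfe
```

An SCI near 1 indicates that the consensus structure is about as stable as the individual folds, while a value near 0 indicates little shared structure.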
Affiliation(s)
- Yohei Okada
- Department of Biosciences and Informatics, Keio University Yokohama, Japan
|
26
|
A model-based analysis to infer the functional content of a gene list. Stat Appl Genet Mol Biol 2012; 11(2). [PMID: 22499692 DOI: 10.2202/1544-6115.1716] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Indexed: 01/01/2023]
Abstract
An important challenge in statistical genomics concerns integrating experimental data with exogenous information about gene function. A number of statistical methods are available to address this challenge, but most do not accommodate complexities in the functional record. To infer activity of a functional category (e.g., a gene ontology term), most methods use gene-level data on that category, but do not use other functional properties of the same genes. Not doing so creates undue errors in inference. Recent developments in model-based category analysis aim to overcome this difficulty, but in attempting to do so they are faced with serious computational problems. This paper investigates statistical properties and the structure of posterior computation in one such model for the analysis of functional category data. We examine the graphical structures underlying posterior computation in the original parameterization and in a new parameterization aimed at leveraging elements of the model. We characterize identifiability of the underlying activation states, describe a new prior distribution, and introduce approximations that aim to support numerical methods for posterior inference.
|
27
|
Sato K, Hamada M, Mituyama T, Asai K, Sakakibara Y. A non-parametric Bayesian approach for predicting RNA secondary structures. J Bioinform Comput Biol 2011. [DOI: 10.1142/s0219720010004926] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 11/18/2022]
Abstract
Since many functional RNAs form stable secondary structures which are related to their functions, RNA secondary structure prediction is a crucial problem in bioinformatics. We propose a novel model for generating RNA secondary structures based on a non-parametric Bayesian approach, called hierarchical Dirichlet processes for stochastic context-free grammars (HDP-SCFGs). Here non-parametric means that some meta-parameters, such as the number of non-terminal symbols and production rules, do not have to be fixed. Instead their distributions are inferred in order to be adapted (in the Bayesian sense) to the training sequences provided. The results of our RNA secondary structure predictions show that HDP-SCFGs are more accurate than the MFE-based and other generative models.
Affiliation(s)
- Kengo Sato
- Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa 277-8562, Japan
- Michiaki Hamada
- Mizuho Information & Research Institute, Inc., 2-3 Kanda-Nishikicho, Chiyoda-ku, Tokyo 101-8443, Japan
- Toutai Mituyama
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-41-6, Aomi, Koto-ku, Tokyo 135-0064, Japan
- Kiyoshi Asai
- Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa 277-8562, Japan
- Yasubumi Sakakibara
- Department of Biosciences and Informatics, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa 223-8522, Japan
|
28
|
Sato K, Kato Y, Hamada M, Akutsu T, Asai K. IPknot: fast and accurate prediction of RNA secondary structures with pseudoknots using integer programming. Bioinformatics 2011; 27:i85-93. [PMID: 21685106 PMCID: PMC3117384 DOI: 10.1093/bioinformatics/btr215] [Citation(s) in RCA: 154] [Impact Index Per Article: 11.8] [Indexed: 12/20/2022]
Abstract
MOTIVATION Pseudoknots found in secondary structures of a number of functional RNAs play various roles in biological processes. Recent methods for predicting RNA secondary structures cover certain classes of pseudoknotted structures, but only a few of them achieve satisfying predictions in terms of both speed and accuracy. RESULTS We propose IPknot, a novel computational method for predicting RNA secondary structures with pseudoknots based on maximizing expected accuracy of a predicted structure. IPknot decomposes a pseudoknotted structure into a set of pseudoknot-free substructures and approximates a base-pairing probability distribution that considers pseudoknots, leading to the capability of modeling a wide class of pseudoknots and running quite fast. In addition, we propose a heuristic algorithm for refining base-pairing probabilities to improve the prediction accuracy of IPknot. The problem of maximizing expected accuracy is solved by using integer programming with threshold cut. We also extend IPknot so that it can predict the consensus secondary structure with pseudoknots when a multiple sequence alignment is given. IPknot is validated through extensive experiments on various datasets, showing that IPknot achieves better prediction accuracy and faster running time as compared with several competitive prediction methods. AVAILABILITY The program of IPknot is available at http://www.ncrna.org/software/ipknot/. IPknot is also available as a web server at http://rna.naist.jp/ipknot/. CONTACT satoken@k.u-tokyo.ac.jp; ykato@is.naist.jp SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kengo Sato
- Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8561, Japan.
|
29
|
Wei D, Alpert LV, Lawrence CE. RNAG: a new Gibbs sampler for predicting RNA secondary structure for unaligned sequences. Bioinformatics 2011; 27:2486-93. [PMID: 21788211 PMCID: PMC3167047 DOI: 10.1093/bioinformatics/btr421] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 11/14/2022]
Abstract
MOTIVATION RNA secondary structure plays an important role in the function of many RNAs, and structural features are often key to their interaction with other cellular components. Thus, there has been considerable interest in the prediction of secondary structures for RNA families. In this article, we present a new global structural alignment algorithm, RNAG, to predict consensus secondary structures for unaligned sequences. It uses a blocked Gibbs sampling algorithm, which has a theoretical advantage in convergence time. This algorithm iteratively samples from the conditional probability distributions P(Structure | Alignment) and P(Alignment | Structure). Not surprisingly, there is considerable uncertainty in the high-dimensional space of this difficult problem, which has so far received limited attention in this field. We show how the samples drawn from this algorithm can be used to more fully characterize the posterior space and to assess the uncertainty of predictions. RESULTS Our analysis of three publicly available datasets showed a substantial improvement in RNA structure prediction by RNAG over extant prediction methods. Additionally, our analysis of 17 RNA families showed that the RNAG sampled structures were generally compact around their ensemble centroids, and at least 11 families had at least two well-separated clusters of predicted structures. In general, the distance between a reference structure and our predicted structure was large relative to the variation among structures within an ensemble. AVAILABILITY The Perl implementation of the RNAG algorithm and the data necessary to reproduce the results described in Sections 3.1 and 3.2 are available at http://ccmbweb.ccv.brown.edu/rnag.html CONTACT charles_lawrence@brown.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
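The alternation between P(Structure | Alignment) and P(Alignment | Structure) follows the general two-block Gibbs pattern, which can be sketched on a toy problem. The Python example below is not RNAG: the two blocks are small discrete variables, the joint table and the function name are hypothetical, and block `a` merely stands in for a structure and block `b` for an alignment.

```python
import random


def gibbs_two_blocks(joint, n_samples, seed=0):
    """Alternate draws from P(a | b) and P(b | a) for a joint table joint[a][b]."""
    rng = random.Random(seed)
    n_a, n_b = len(joint), len(joint[0])
    a, b = 0, 0
    counts = [[0] * n_b for _ in range(n_a)]
    for _ in range(n_samples):
        # Draw block a given the current b: weights are column b of the table.
        a = rng.choices(range(n_a), weights=[joint[x][b] for x in range(n_a)])[0]
        # Draw block b given the new a: weights are row a of the table.
        b = rng.choices(range(n_b), weights=joint[a])[0]
        counts[a][b] += 1
    # Empirical frequencies approximate the joint distribution.
    return [[c / n_samples for c in row] for row in counts]
```

With enough sweeps the visited states approximate the joint distribution, which is what allows sampled structures to be summarized by clusters and centroids rather than by a single point estimate.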
Affiliation(s)
- Donglai Wei
- Department of Mathematics, Center for Computational Molecular Biology, Brown University, Providence, RI 02912, USA
|
30
|
Kiryu H, Terai G, Imamura O, Yoneyama H, Suzuki K, Asai K. A detailed investigation of accessibilities around target sites of siRNAs and miRNAs. Bioinformatics 2011; 27:1788-97. [PMID: 21531769 DOI: 10.1093/bioinformatics/btr276] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.4] [Indexed: 12/14/2022]
Abstract
MOTIVATION The importance of RNA sequence analysis has been increasing since the discovery of various types of non-coding RNAs transcribed in animal cells. Conventional RNA sequence analyses have mainly focused on structured regions, which are stabilized by the stacking energies acting on adjacent base pairs. On the other hand, recent findings regarding the mechanisms of small interfering RNAs (siRNAs) and transcription regulation by microRNAs (miRNAs) indicate the importance of analyzing accessible regions where no base pairs exist. So far, relatively few studies have investigated the nature of such regions. RESULTS We have conducted a detailed investigation of accessibilities around the target sites of siRNAs and miRNAs. We have exhaustively calculated the correlations between the accessibilities around the target sites and the repression levels of the corresponding mRNAs. We have computed the accessibilities with an originally developed software package, called 'Raccess', which computes the accessibility of all the segments of a fixed length for a given RNA sequence when the maximal distance between base pairs is limited to a fixed size W. We show that the computed accessibilities are relatively insensitive to the choice of the maximal span W. We have found that the efficacy of siRNAs depends strongly on the accessibility of the very 3'-end of their binding sites, which might reflect a target site recognition mechanism in the RNA-induced silencing complex. We also show that the efficacy of miRNAs has a similar dependence on the accessibilities, but some miRNAs also show positive correlations between the efficacy and the accessibilities in broad regions downstream of their putative binding sites, which might imply that the downstream regions of the target sites are bound by other proteins that allow the miRNAs to implement their functions. We have also investigated the off-target effects of an siRNA as a potential RNAi therapeutic. 
We show that the off-target effects of the siRNA have similar correlations to the miRNA repression, indicating that they are caused by the same mechanism. AVAILABILITY The C++ source code of the Raccess software is available at http://www.ncrna.org/software/Raccess/ The microarray data on the measurements of the siRNA off-target effects are also available at the same site. CONTACT kiryu-h@k.u-tokyo.ac.jp
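The notion of accessibility used in this abstract can be illustrated with a short Python sketch. It assumes the common definition (the probability that every base in a segment is unpaired) and estimates it as a frequency over a hypothetical list of sampled dot-bracket structures. Raccess computes accessibilities with its own algorithm under a maximal span W, which is not reproduced here; the function names are illustrative only.

```python
def segment_accessibility(structures, start, length):
    """Fraction of sampled structures in which [start, start + length) is unpaired."""
    segment = slice(start, start + length)
    unpaired = sum(1 for s in structures if set(s[segment]) == {"."})
    return unpaired / len(structures)


def accessibility_profile(structures, length):
    """Accessibility of every fixed-length segment along the sequence."""
    n = len(structures[0])
    return [segment_accessibility(structures, i, length)
            for i in range(n - length + 1)]
```

A profile of this kind is what would be correlated with repression levels around a target site, for example by comparing the values at the 3'-end of each binding site across many siRNAs.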
Affiliation(s)
- Hisanori Kiryu
- Department of Computational Biology, Faculty of Frontier Science, The University of Tokyo, Chiba 277-8561, Japan.
|
31
|
Hamada M, Kiryu H, Iwasaki W, Asai K. Generalized centroid estimators in bioinformatics. PLoS One 2011; 6:e16450. [PMID: 21365017 PMCID: PMC3041832 DOI: 10.1371/journal.pone.0016450] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 09/01/2010] [Accepted: 12/22/2010] [Indexed: 11/27/2022] Open
Abstract
In a number of estimation problems in bioinformatics, accuracy measures of the target problem are usually given, and it is important to design estimators that are suitable to those accuracy measures. However, there is often a discrepancy between an employed estimator and a given accuracy measure of the problem. In this study, we introduce a general class of efficient estimators for estimation problems on high-dimensional binary spaces, which represent many fundamental problems in bioinformatics. Theoretical analysis reveals that the proposed estimators generally fit commonly used accuracy measures (e.g. sensitivity, PPV, MCC and F-score), can be computed efficiently in many cases, and cover a wide range of problems in bioinformatics from the viewpoint of the principle of maximum expected accuracy (MEA). It is also shown that some important algorithms in bioinformatics can be interpreted in a unified manner. The concept presented in this paper not only gives a useful framework for designing MEA-based estimators but is also highly extendable and sheds new light on many problems in bioinformatics.
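For RNA secondary structure, an estimator of this class can be sketched as a Nussinov-style dynamic program over base-pairing probabilities. The Python example below assumes the γ-centroid gain, in which a predicted pair (i, j) with probability p contributes (γ + 1)p − 1, and restricts solutions to nested structures. The function name, the `min_loop` parameter and the dictionary input format are hypothetical; in practice the probabilities would come from a partition-function calculation.

```python
def gamma_centroid(n, pair_probs, gamma=1.0, min_loop=3):
    """Nested structure maximising the sum of (gamma + 1) * p_ij - 1 over its pairs."""
    def score(i, j):
        return (gamma + 1.0) * pair_probs.get((i, j), 0.0) - 1.0

    # best[i][j] holds the optimal gain for the subsequence i..j.
    best = [[0.0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            value = max(best[i + 1][j], best[i][j - 1])
            value = max(value, best[i + 1][j - 1] + score(i, j))
            for k in range(i + 1, j):
                value = max(value, best[i][k] + best[k + 1][j])
            best[i][j] = value

    # Trace back one optimal set of base pairs.
    pairs, stack = [], [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if best[i][j] == best[i + 1][j]:
            stack.append((i + 1, j))
        elif best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
        elif best[i][j] == best[i + 1][j - 1] + score(i, j):
            pairs.append((i, j))
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if best[i][j] == best[i][k] + best[k + 1][j]:
                    stack.extend([(i, k), (k + 1, j)])
                    break
    return sorted(pairs)
```

Larger γ lowers the probability a pair needs to be worth predicting, which trades positive predictive value for sensitivity.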
Affiliation(s)
- Michiaki Hamada
- Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan.
|
32
|
Clyde MA, Ghosh J, Littman ML. Bayesian Adaptive Sampling for Variable Selection and Model Averaging. J Comput Graph Stat 2011. [DOI: 10.1198/jcgs.2010.09049] [Citation(s) in RCA: 114] [Impact Index Per Article: 8.8] [Indexed: 11/21/2022]
|
33
|
Hamada M, Sato K, Asai K. Prediction of RNA secondary structure by maximizing pseudo-expected accuracy. BMC Bioinformatics 2010; 11:586. [PMID: 21118522 PMCID: PMC3003279 DOI: 10.1186/1471-2105-11-586] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Received: 07/02/2010] [Accepted: 11/30/2010] [Indexed: 12/17/2022] Open
Abstract
Background Recent studies have revealed the importance of considering the entire distribution of possible secondary structures in RNA secondary structure predictions; therefore, new types of estimators have been proposed, including the maximum expected accuracy (MEA) estimator. The MEA-based estimators have been designed to maximize the expected accuracy of the base-pairs and have achieved the highest level of accuracy. Those methods, however, do not give the single best prediction of the structure, but employ parameters to control the trade-off between the sensitivity and the positive predictive value (PPV). It is unclear what parameter value we should use, and even the well-trained default parameter value does not, in general, give the best result in popular accuracy measures for each RNA sequence. Results Instead of using the expected values of the popular accuracy measures for RNA secondary structure prediction, which are difficult to calculate, the pseudo-expected accuracy, which can easily be computed from base-pairing probabilities, is introduced. It is shown that the pseudo-expected accuracy is a good approximation in terms of sensitivity, PPV, MCC, or F-score. The pseudo-expected accuracy can be approximately maximized for each RNA sequence by stochastic sampling. It is also shown that well-balanced secondary structures between sensitivity and PPV can be predicted with a small computational overhead by combining the pseudo-expected accuracy of MCC or F-score with the γ-centroid estimator. Conclusions This study gives not only a method for predicting the secondary structure that balances between sensitivity and PPV, but also a general method for approximately maximizing the (pseudo-)expected accuracy with respect to various evaluation measures including MCC and F-score.
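The idea of scoring a candidate structure directly from base-pairing probabilities can be sketched in Python. The example assumes that the pseudo-expected counts are obtained by treating each probability as a soft label (a predicted pair with probability p adds p to TP and 1 − p to FP, and an unpredicted pair adds p to FN and 1 − p to TN) and then plugging those counts into the usual F-score and MCC formulas. This is our reading of the approach rather than the authors' code, and all names are hypothetical.

```python
import math


def pseudo_expected_counts(n, pair_probs, predicted_pairs):
    """Soft TP, FP, FN, TN of a predicted structure under base-pairing probabilities."""
    predicted = set(predicted_pairs)
    tp = fp = fn = tn = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            p = pair_probs.get((i, j), 0.0)
            if (i, j) in predicted:
                tp += p
                fp += 1.0 - p
            else:
                fn += p
                tn += 1.0 - p
    return tp, fp, fn, tn


def f_score(tp, fp, fn):
    """F-score computed from (possibly fractional) counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient computed from (possibly fractional) counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Given several candidate structures for the same sequence, for instance γ-centroid predictions at different γ values, the one with the highest pseudo-expected F-score or MCC would be selected.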
Affiliation(s)
- Michiaki Hamada
- Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba, Japan.
|
34
|
Sheridan P, Kamimura T, Shimodaira H. A scale-free structure prior for graphical models with applications in functional genomics. PLoS One 2010; 5:e13580. [PMID: 21079769 PMCID: PMC2974640 DOI: 10.1371/journal.pone.0013580] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Received: 02/17/2010] [Accepted: 09/28/2010] [Indexed: 11/18/2022] Open
Abstract
The problem of reconstructing large-scale gene regulatory networks from gene expression data has garnered considerable attention in bioinformatics over the past decade, with the graphical modeling paradigm having emerged as a popular framework for inference. Analysis in a full Bayesian setting is contingent upon the assignment of a so-called structure prior, a probability distribution on networks that encodes a priori biological knowledge either in the form of supplemental data or high-level topological features. A key topological consideration is that a wide range of cellular networks are approximately scale-free, meaning that the fraction, $P(k)$, of nodes in a network with degree $k$ is roughly described by a power law $P(k) \propto k^{-\gamma}$ with exponent $\gamma$ between 2 and 3. The standard practice, however, is to utilize a random structure prior, which favors networks with binomially distributed degree distributions. In this paper, we introduce a scale-free structure prior for graphical models based on the formula for the probability of a network under a simple scale-free network model. Unlike the random structure prior, its scale-free counterpart requires a node labeling as a parameter. In order to use this prior for large-scale network inference, we design a novel Metropolis-Hastings sampler for graphical models that includes a node labeling as a state space variable. In a simulation study, we demonstrate that the scale-free structure prior outperforms the random structure prior at recovering scale-free networks while at the same time retaining the ability to recover random networks. We then estimate a gene association network from gene expression data taken from a breast cancer tumor study, showing that the scale-free structure prior recovers hubs, including the previously unknown hub SLC39A6, which is a zinc transporter that has been implicated in the spread of breast cancer to the lymph nodes. 
Our analysis of the breast cancer expression data underscores the value of the scale-free structure prior as an instrument to aid in the identification of candidate hub genes with the potential to direct the hypotheses of molecular biologists, and thus drive future experiments.
Affiliation(s)
- Paul Sheridan
- Department of Mathematical and Computing Sciences, Tokyo Institute of Technology, Tokyo, Japan.
|
35
|
Hamada M, Sato K, Asai K. Improving the accuracy of predicting secondary structure for aligned RNA sequences. Nucleic Acids Res 2010; 39:393-402. [PMID: 20843778 PMCID: PMC3025558 DOI: 10.1093/nar/gkq792] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.4] [Indexed: 11/14/2022] Open
Abstract
Considerable attention has been focused on predicting the secondary structure for aligned RNA sequences since it is useful not only for improving the limiting accuracy of conventional secondary structure prediction but also for finding non-coding RNAs in genomic sequences. Although there exist many algorithms of predicting secondary structure for aligned RNA sequences, further improvement of the accuracy is still awaited. In this article, toward improving the accuracy, a theoretical classification of state-of-the-art algorithms of predicting secondary structure for aligned RNA sequences is presented. The classification is based on the viewpoint of maximum expected accuracy (MEA), which has been successfully applied in various problems in bioinformatics. The classification reveals several disadvantages of the current algorithms but we propose an improvement of a previously introduced algorithm (CentroidAlifold). Finally, computational experiments strongly support the theoretical classification and indicate that the improved CentroidAlifold substantially outperforms other algorithms.
Affiliation(s)
- Michiaki Hamada
- Mizuho Information & Research Institute, Inc, Chiyoda-ku, Tokyo, Japan.
|
36
|
Iwasaki W, Takagi T. An intuitive, informative, and most balanced representation of phylogenetic topologies. Syst Biol 2010; 59:584-93. [PMID: 20817714 PMCID: PMC2950835 DOI: 10.1093/sysbio/syq044] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 12/02/2022] Open
Abstract
The recent explosion in the availability of genetic sequence data has made large-scale phylogenetic inference routine in many life sciences laboratories. The outcomes of such analyses are, typically, a variety of candidate phylogenetic relationships or tree topologies, even when the power of genome-scale data is exploited. Because much phylogenetic information must be buried in such topology distributions, it is important to reveal that information as effectively as possible; however, existing methods need to adopt complex structures to represent such information. Hence, researchers, in particular those not experts in evolutionary studies, sometimes hesitate to adopt these methods and much phylogenetic information could be overlooked and wasted. In this paper, we propose the centroid wheel tree representation, which is an informative representation of phylogenetic topology distributions, and which can be readily interpreted even by nonexperts. Furthermore, we mathematically prove this to be the most balanced representation of phylogenetic topologies and efficiently solvable in the framework of the traveling salesman problem, for which very sophisticated program packages are available. This theoretically and practically superior representation should aid biologists faced with abundant data. The centroid representation introduced here is fairly general, so it can be applied to other fields that are characterized by high-dimensional solution spaces and large quantities of noisy data. The software is implemented in Java and available via http://cwt.cb.k.u-tokyo.ac.jp/.
Affiliation(s)
- Wataru Iwasaki
- Department of Computational Biology, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba 277-8568, Japan.
|
37
|
Stovgaard K, Andreetta C, Ferkinghoff-Borg J, Hamelryck T. Calculation of accurate small angle X-ray scattering curves from coarse-grained protein models. BMC Bioinformatics 2010; 11:429. [PMID: 20718956 PMCID: PMC2931518 DOI: 10.1186/1471-2105-11-429] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.1] [Received: 04/08/2010] [Accepted: 08/18/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Genome sequencing projects have expanded the gap between the amount of known protein sequences and structures. The limitations of current high resolution structure determination methods make it unlikely that this gap will disappear in the near future. Small angle X-ray scattering (SAXS) is an established low resolution method for routinely determining the structure of proteins in solution. The purpose of this study is to develop a method for the efficient calculation of accurate SAXS curves from coarse-grained protein models. Such a method can for example be used to construct a likelihood function, which is paramount for structure determination based on statistical inference. RESULTS We present a method for the efficient calculation of accurate SAXS curves based on the Debye formula and a set of scattering form factors for dummy atom representations of amino acids. Such a method avoids the computationally costly iteration over all atoms. We estimated the form factors using generated data from a set of high quality protein structures. No ad hoc scaling or correction factors are applied in the calculation of the curves. Two coarse-grained representations of protein structure were investigated; two scattering bodies per amino acid led to significantly better results than a single scattering body. CONCLUSION We show that the obtained point estimates allow the calculation of accurate SAXS curves from coarse-grained protein models. The resulting curves are on par with the current state-of-the-art program CRYSOL, which requires full atomic detail. Our method was also comparable to CRYSOL in recognizing native structures among native-like decoys. As a proof-of-concept, we combined the coarse-grained Debye calculation with a previously described probabilistic model of protein structure, TorusDBN. This resulted in a significant improvement in the decoy recognition performance. 
In conclusion, the presented method shows great promise for use in statistical inference of protein structures from SAXS data.
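The Debye formula on which this method is based can be sketched in a few lines of Python. The example sums F_i F_j sin(q r_ij) / (q r_ij) over all pairs of scattering bodies. As a simplifying assumption, each form factor is a constant here, whereas the paper estimates q-dependent form factors for dummy-atom representations of amino acids. The function name and input layout are hypothetical.

```python
import math


def debye_intensity(q, positions, form_factors):
    """I(q) = sum_i sum_j F_i * F_j * sin(q * r_ij) / (q * r_ij)."""
    total = 0.0
    for pos_i, f_i in zip(positions, form_factors):
        for pos_j, f_j in zip(positions, form_factors):
            x = q * math.dist(pos_i, pos_j)
            # The limit of sin(x) / x as x approaches 0 is 1 (self terms and q = 0).
            sinc = 1.0 if x == 0.0 else math.sin(x) / x
            total += f_i * f_j * sinc
    return total
```

The double loop is quadratic in the number of scattering bodies, which is why using one or two bodies per amino acid instead of all atoms reduces the cost substantially.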
Affiliation(s)
- Kasper Stovgaard
- Department of Biology, University of Copenhagen, The Bioinformatics Centre, Denmark
|
38
|
Fromer M, Yanover C, Linial M. Design of multispecific protein sequences using probabilistic graphical modeling. Proteins 2010; 78:530-47. [PMID: 19842166 DOI: 10.1002/prot.22575] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Indexed: 11/10/2022]
Abstract
In nature, proteins partake in numerous protein-protein interactions that mediate their functions. Moreover, proteins have been shown to be physically stable in multiple structures, induced by cellular conditions, small ligands, or covalent modifications. Understanding how protein sequences achieve this structural promiscuity at the atomic level is a fundamental step in the drug design pipeline and a critical question in protein physics. One way to investigate this subject is to computationally predict protein sequences that are compatible with multiple states, i.e., multiple target structures or binding to distinct partners. The goal of engineering such proteins has been termed multispecific protein design. We develop a novel computational framework to efficiently and accurately perform multispecific protein design. This framework utilizes recent advances in probabilistic graphical modeling to predict sequences with low energies in multiple target states. Furthermore, it is also geared to specifically yield positional amino acid probability profiles compatible with these target states. Such profiles can be used as input to randomly bias high-throughput experimental sequence screening techniques, such as phage display, thus providing an alternative avenue for elucidating the multispecificity of natural proteins and the synthesis of novel proteins with specific functionalities. We prove the utility of such multispecific design techniques in better recovering amino acid sequence diversities similar to those resulting from millions of years of evolution. We then compare the approaches of prediction of low energy ensembles and of amino acid profiles and demonstrate their complementarity in providing more robust predictions for protein design.
Affiliation(s)
- Menachem Fromer
- School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel.
|
39
|
Frith MC, Hamada M, Horton P. Parameters for accurate genome alignment. BMC Bioinformatics 2010; 11:80. [PMID: 20144198 PMCID: PMC2829014 DOI: 10.1186/1471-2105-11-80] [Citation(s) in RCA: 136] [Impact Index Per Article: 9.7] [Received: 10/16/2009] [Accepted: 02/09/2010] [Indexed: 11/25/2022] Open
Abstract
Background Genome sequence alignments form the basis of much research. Genome alignment depends on various mundane but critical choices, such as how to mask repeats and which score parameters to use. Surprisingly, there has been no large-scale assessment of these choices using real genomic data. Moreover, rigorous procedures to control the rate of spurious alignment have not been employed. Results We have assessed 495 combinations of score parameters for alignment of animal, plant, and fungal genomes. As our gold standard of accuracy, we used genome alignments implied by multiple alignments of proteins and of structural RNAs. We found the HOXD scoring schemes underlying alignments in the UCSC genome database to be far from optimal, and suggest better parameters. Higher values of the X-drop parameter are not always better. E-values accurately indicate the rate of spurious alignment, but only if tandem repeats are masked in a non-standard way. Finally, we show that γ-centroid (probabilistic) alignment can find highly reliable subsets of aligned bases. Conclusions These results enable more accurate genome alignment, with reliability measures for local alignments and for individual aligned bases. This study was made possible by our new software, LAST, which can align vertebrate genomes in a few hours (http://last.cbrc.jp/).
Affiliation(s)
- Martin C Frith
- Computational Biology Research Center, Institute for Advanced Industrial Science and Technology, Tokyo 135-0064, Japan.
|
40
|
Abstract
Motivation: Abstract shape analysis allows efficient computation of a representative sample of low-energy foldings of an RNA molecule. More comprehensive information is obtained by computing shape probabilities, accumulating the Boltzmann probabilities of all structures within each abstract shape. Such information is superior to free energies because it is independent of sequence length and base composition. However, up to this point, computation of shape probabilities evaluates all shapes simultaneously and comes with a computation cost which is exponential in the length of the sequence. Results: We devise an approach called RapidShapes that computes the shapes above a specified probability threshold T by generating a list of promising shapes and constructing specialized folding programs for each shape to compute its share of Boltzmann probability. This aims at a heuristic improvement of runtime, while still computing exact probability values. Conclusion: Evaluating this approach and several substrategies, we find that only a small proportion of shapes have to be actually computed. For an RNA sequence of length 400, this leads, depending on the threshold, to a 10–138 fold speed-up compared with the previous complete method. Thus, probabilistic shape analysis has become feasible in medium-scale applications, such as the screening of RNA transcripts in a bacterial genome. Availability: RapidShapes is available via http://bibiserv.cebitec.uni-bielefeld.de/rnashapes Contact: robert@techfak.uni-bielefeld.de Supplementary information: Supplementary data are available at Bioinformatics online.
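The accumulation of Boltzmann probabilities per abstract shape can be sketched as follows, assuming a small, already enumerated set of structures, each labelled with its shape string and free energy. The toy ensemble, the `shape_probabilities` helper, and the RT value are hypothetical; RapidShapes itself relies on specialized folding programs rather than explicit enumeration.

```python
import math
from collections import defaultdict

RT = 0.6163  # kcal/mol at about 37 °C (gas constant times temperature); assumed

def shape_probabilities(structures, threshold=0.0):
    """Sum Boltzmann probabilities of structures sharing an abstract shape.

    structures -- iterable of (shape_string, free_energy_kcal_per_mol)
    threshold  -- keep only shapes whose probability is at least this value
    """
    weights = defaultdict(float)
    for shape, energy in structures:
        weights[shape] += math.exp(-energy / RT)
    z = sum(weights.values())  # partition function over the enumerated set
    return {s: w / z for s, w in weights.items() if w / z >= threshold}

# Hypothetical toy ensemble: (abstract shape, free energy)
ensemble = [("[]", -5.0), ("[]", -4.2), ("[][]", -3.9), ("[[][]]", -1.0)]
probs = shape_probabilities(ensemble, threshold=0.05)
```

The `threshold` argument plays the role of the probability threshold T in the abstract: shapes below it are dropped from the result.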
Affiliation(s)
- Stefan Janssen
- Practical Computer Science, Faculty of Technology, Bielefeld University, D-33615 Bielefeld, Germany
|
41
|
Distribution of distances between topologies and its effect on detection of phylogenetic recombination. Ann Inst Stat Math 2009. [DOI: 10.1007/s10463-009-0259-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
|
42
|
Hamada M, Sato K, Kiryu H, Mituyama T, Asai K. CentroidAlign: fast and accurate aligner for structured RNAs by maximizing expected sum-of-pairs score. Bioinformatics 2009; 25:3236-43. [DOI: 10.1093/bioinformatics/btp580] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.7] [Indexed: 12/23/2022] Open
|
43
|
Hamada M, Sato K, Kiryu H, Mituyama T, Asai K. Predictions of RNA secondary structure by combining homologous sequence information. Bioinformatics 2009; 25:i330-8. [PMID: 19478007 PMCID: PMC2687982 DOI: 10.1093/bioinformatics/btp228] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.3] [Indexed: 11/14/2022]
Abstract
Motivation: Secondary structure prediction of RNA sequences is an important problem. There has been progress in this area, but the accuracy of prediction from an RNA sequence is still limited. In many cases, however, homologous RNA sequences are available with the target RNA sequence whose secondary structure is to be predicted. Results: In this article, we propose a new method for secondary structure predictions of individual RNA sequences by taking the information of their homologous sequences into account without assuming the common secondary structure of the entire sequences. The proposed method is based on posterior decoding techniques, which consider all the suboptimal secondary structures of the target and homologous sequences and all the suboptimal alignments between the target sequence and each of the homologous sequences. In our computational experiments, the proposed method provides better predictions than those performed only on the basis of the formation of individual RNA sequences and those performed by using methods for predicting the common secondary structure of the homologous sequences. Remarkably, we found that the common secondary predictions sometimes give worse predictions for the secondary structure of a target sequence than the predictions from the individual target sequence, while the proposed method always gives good predictions for the secondary structure of target sequences in all tested cases. Availability: Supporting information and software are available online at: http://www.ncrna.org/software/centroidfold/ismb2009/. Contact: hamada-michiaki@aist.go.jp Supplementary information: Supplementary data are available at Bioinformatics online.
|
44
|
Sato K, Hamada M, Asai K, Mituyama T. CENTROIDFOLD: a web server for RNA secondary structure prediction. Nucleic Acids Res 2009; 37:W277-80. [PMID: 19435882 PMCID: PMC2703931 DOI: 10.1093/nar/gkp367] [Citation(s) in RCA: 198] [Impact Index Per Article: 13.2] [Indexed: 11/25/2022] Open
Abstract
The CentroidFold web server (http://www.ncrna.org/centroidfold/) is a web application for RNA secondary structure prediction powered by one of the most accurate prediction engines. The server accepts two kinds of sequence data: a single RNA sequence and a multiple alignment of RNA sequences. It responds with a prediction result shown as a popular base-pair notation and a graph representation. A PDF version of the graph representation is also available. For a multiple alignment sequence, the server predicts a common secondary structure. Usage of the server is quite simple. You can paste a single RNA sequence (FASTA or plain sequence text) or a multiple alignment (CLUSTAL-W format) into the textarea, then click on the ‘execute CentroidFold’ button. The server quickly responds with a prediction result. The major advantage of this server is that it employs our original CentroidFold software as its prediction engine, which scores the best accuracy in our benchmark results. Our web server is freely available with no login requirement.
Affiliation(s)
- Kengo Sato
- Japan Biological Informatics Consortium, 2-45 Aomi, Koto-ku, Tokyo 135-8073, Japan.
|
45
|
Tabei Y, Asai K. A local multiple alignment method for detection of non-coding RNA sequences. Bioinformatics 2009; 25:1498-505. [PMID: 19376823 DOI: 10.1093/bioinformatics/btp261] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/12/2022]
Abstract
MOTIVATION Non-coding RNAs (ncRNAs) show a unique evolutionary process in which the substitutions of distant bases are correlated in order to conserve the secondary structure of the ncRNA molecule. Therefore, the multiple alignment method for the detection of ncRNAs should take into account both the primary sequence and the secondary structure. Recently, there has been intense focus on multiple alignment investigations for the detection of ncRNAs; however, most of the proposed methods are designed for global multiple alignments. For this reason, these methods are not appropriate to identify locally conserved ncRNAs among genomic sequences. A more efficient local multiple alignment method for the detection of ncRNAs is required. RESULTS We propose a new local multiple alignment method for the detection of ncRNAs. This method uses a local multiple alignment construction procedure inspired by ProDA, which is a local multiple aligner program for protein sequences with repeated and shuffled elements. To align sequences based on secondary structure information, we propose a new alignment model which incorporates secondary structure features. We define the conditional probability of an alignment via a conditional random field and use a gamma-centroid estimator to align sequences. The locally aligned subsequences are clustered into blocks of approximately globally alignable subsequences between pairwise alignments. Finally, these blocks are multiply aligned via MXSCARNA. In benchmark experiments, we demonstrate the high ability of the implemented software, SCARNA_LM, for local multiple alignment for the detection of ncRNAs. AVAILABILITY The C++ source code for SCARNA_LM and its experimental datasets are available at http://www.ncrna.org/software/scarna_lm/download. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yasuo Tabei
- Department of Computational biology, Graduate School of Frontier Sciences, University of Tokyo, Kashiwa, Chiba, Japan.
|
46
|
Advances in RNA structure prediction from sequence: new tools for generating hypotheses about viral RNA structure-function relationships. J Virol 2009; 83:6326-34. [PMID: 19369331 DOI: 10.1128/jvi.00251-09] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.6] [Indexed: 11/20/2022] Open
|
47
|
Newberg LA, Lawrence CE. Exact calculation of distributions on integers, with application to sequence alignment. J Comput Biol 2009; 16:1-18. [PMID: 19119992 DOI: 10.1089/cmb.2008.0137] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Indexed: 11/12/2022] Open
Abstract
Computational biology is replete with high-dimensional discrete prediction and inference problems. Dynamic programming recursions can be applied to several of the most important of these, including sequence alignment, RNA secondary-structure prediction, phylogenetic inference, and motif finding. In these problems, attention is frequently focused on some scalar quantity of interest, a score, such as an alignment score or the free energy of an RNA secondary structure. In many cases, the score is naturally defined on integers, such as a count of the number of pairing differences between two sequence alignments, or else an integer score has been adopted for computational reasons, such as in the test of significance of motif scores. The probability distribution of the score under an appropriate probabilistic model is of interest, such as in tests of significance of motif scores, or in calculation of Bayesian confidence limits around an alignment. Here we present three algorithms for calculating the exact distribution of a score of this type; then, in the context of pairwise local sequence alignments, we apply the approach so as to find the alignment score distribution and Bayesian confidence limits.
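A much simpler case than the alignment setting conveys what an exact integer-score distribution looks like: the total of independent per-position integer scores, obtained by repeated convolution with exact rational arithmetic. This is only a generic illustration and not one of the three algorithms presented in the paper; the function names and the toy per-position distribution are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

def convolve(dist_a, dist_b):
    """Exact distribution of the sum of two independent integer scores."""
    out = defaultdict(Fraction)
    for a, pa in dist_a.items():
        for b, pb in dist_b.items():
            out[a + b] += pa * pb
    return dict(out)

def total_score_distribution(position_dists):
    """Fold per-position score distributions into the total-score distribution."""
    total = {0: Fraction(1)}  # an empty sum scores 0 with probability 1
    for dist in position_dists:
        total = convolve(total, dist)
    return total

# Hypothetical per-position scores: +1 for a match, -1 for a mismatch.
position = {1: Fraction(1, 4), -1: Fraction(3, 4)}
dist = total_score_distribution([position] * 3)
```

Because the scores are integers, the distribution lives on a finite set of values and can be tabulated exactly rather than estimated by sampling.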
Affiliation(s)
- Lee A Newberg
- Center for Bioinformatics, Wadsworth Center, New York State Department of Health, Albany, New York, USA.
|
48
|
Joshi A, De Smet R, Marchal K, Van de Peer Y, Michoel T. Module networks revisited: computational assessment and prioritization of model predictions. Bioinformatics 2009; 25:490-6. [PMID: 19136553 DOI: 10.1093/bioinformatics/btn658] [Citation(s) in RCA: 77] [Impact Index Per Article: 5.1] [Indexed: 02/07/2023] Open
Abstract
MOTIVATION The solution of high-dimensional inference and prediction problems in computational biology is almost always a compromise between mathematical theory and practical constraints, such as limited computational resources. As time progresses, computational power increases but well-established inference methods often remain locked in their initial suboptimal solution. RESULTS We revisit the approach of Segal et al. to infer regulatory modules and their condition-specific regulators from gene expression data. In contrast to their direct optimization-based solution, we use a more representative centroid-like solution extracted from an ensemble of possible statistical models to explain the data. The ensemble method automatically selects a subset of most informative genes and builds a quantitatively better model for them. Genes which cluster together in the majority of models produce functionally more coherent modules. Regulators which are consistently assigned to a module are more often supported by literature, but a single model always contains many regulator assignments not supported by the ensemble. Reliably detecting condition-specific or combinatorial regulation is particularly hard in a single optimum but can be achieved using ensemble averaging. AVAILABILITY All software developed for this study is available from http://bioinformatics.psb.ugent.be/software.
Affiliation(s)
- Anagha Joshi
- Department of Plant Systems Biology, VIB, Ghent University, Technologiepark 927, B-9052 Gent, Belgium
|
49
|
Morita K, Saito Y, Sato K, Oka K, Hotta K, Sakakibara Y. Genome-wide searching with base-pairing kernel functions for noncoding RNAs: computational and expression analysis of snoRNA families in Caenorhabditis elegans. Nucleic Acids Res 2009; 37:999-1009. [PMID: 19129214 PMCID: PMC2647286 DOI: 10.1093/nar/gkn1054] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/14/2022] Open
Abstract
Despite the accumulating research on noncoding RNAs (ncRNAs), it is likely that we are seeing only the tip of the iceberg regarding our understanding of the functions and the regulatory roles served by ncRNAs in cellular metabolism, pathogenesis and host-pathogen interactions. Therefore, more powerful computational and experimental tools for analyzing ncRNAs need to be developed. To this end, we propose novel kernel functions, called base-pairing profile local alignment (BPLA) kernels, for analyzing functional ncRNA sequences using support vector machines (SVMs). We extend the local alignment kernels for amino acid sequences in order to handle RNA sequences by using STRAL's scoring function, which takes into account sequence similarities as well as upstream and downstream base-pairing probabilities, thus enabling us to model secondary structures of RNA sequences. As a test of the performance of BPLA kernels, we applied our kernels to the problem of discriminating members of an RNA family from nonmembers using SVMs. The results indicated that the discrimination ability of our kernels is stronger than that of other existing methods. Furthermore, we demonstrated the applicability of our kernels to the problem of genome-wide search of snoRNA families in the Caenorhabditis elegans genome, and confirmed that the expression is valid in 14 out of 48 of our predicted candidates by using qRT-PCR. Finally, six highly expressed candidates were identified as the original target regions by DNA sequencing.
Affiliation(s)
- Kensuke Morita
- Department of Biosciences and Informatics, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa 223-8522, Japan
|
50
|
Hamada M, Kiryu H, Sato K, Mituyama T, Asai K. Prediction of RNA secondary structure using generalized centroid estimators. Bioinformatics 2008; 25:465-73. [PMID: 19095700 DOI: 10.1093/bioinformatics/btn601] [Citation(s) in RCA: 157] [Impact Index Per Article: 9.8] [Indexed: 11/14/2022]
Abstract
MOTIVATION Recent studies have shown that methods for predicting secondary structures of RNAs on the basis of posterior decoding of the base-pairing probabilities have an advantage with respect to prediction accuracy over the conventionally utilized minimum free energy methods. However, there is room for improvement in the objective functions presented in previous studies, which are maximized in the posterior decoding with respect to the accuracy measures for secondary structures. RESULTS We propose novel estimators which improve the accuracy of secondary structure prediction of RNAs. The proposed estimators maximize an objective function which is the weighted sum of the expected number of the true positives and that of the true negatives of the base pairs. The proposed estimators are also improved versions of the ones used in previous works, namely CONTRAfold for secondary structure prediction from a single RNA sequence and McCaskill-MEA for common secondary structure prediction from multiple alignments of RNA sequences. We clarify the relations between the proposed estimators and the estimators presented in previous works, and theoretically show that the previous estimators include additional unnecessary terms in the evaluation measures with respect to the accuracy. Furthermore, computational experiments confirm the theoretical analysis by indicating improvement in the empirical accuracy. The proposed estimators represent extensions of the centroid estimators proposed in Ding et al. and Carvalho and Lawrence, and are applicable to a wide variety of problems in bioinformatics. AVAILABILITY Supporting information and the CentroidFold software are available online at: http://www.ncrna.org/software/centroidfold/.
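If true positives are weighted by γ and true negatives by 1, the expected gain of a predicted structure is, up to a constant, the sum of (γ + 1)·p_ij − 1 over its predicted pairs, where p_ij is the base-pairing probability. The sketch below maximizes that sum with a Nussinov-style dynamic program over non-crossing pairs. It is illustrative only: the function name, the toy probabilities, and the absence of a minimum hairpin-loop length are assumptions, and this is not the CentroidFold implementation.

```python
def gamma_centroid(pair_probs, n, gamma=1.0):
    """Non-crossing pair set maximizing the sum of (gamma + 1) * p_ij - 1.

    pair_probs -- dict mapping (i, j) with i < j to a base-pairing probability
    n          -- sequence length (positions are 0 .. n-1)
    """
    def gain(i, j):
        return (gamma + 1.0) * pair_probs.get((i, j), 0.0) - 1.0

    def split(i, k, j):
        # Score of pairing i with k, plus the best inside and to the right.
        inner = best[i + 1][k - 1] if k - 1 >= i + 1 else 0.0
        outer = best[k + 1][j] if k + 1 <= j else 0.0
        return gain(i, k) + inner + outer

    # best[i][j] = best score on positions i..j (inclusive)
    best = [[0.0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]              # position i left unpaired
            for k in range(i + 1, j + 1):       # position i paired with k
                score = max(score, split(i, k, j))
            best[i][j] = score

    pairs, stack = [], [(0, n - 1)]
    while stack:                                # traceback
        i, j = stack.pop()
        if i >= j:
            continue
        if best[i][j] == best[i + 1][j]:
            stack.append((i + 1, j))
            continue
        for k in range(i + 1, j + 1):
            if best[i][j] == split(i, k, j):
                pairs.append((i, k))
                stack.extend([(i + 1, k - 1), (k + 1, j)])
                break
    return sorted(pairs)

# Hypothetical base-pairing probabilities for a 6-nucleotide toy sequence.
toy_probs = {(0, 5): 0.9, (1, 4): 0.8, (2, 3): 0.3, (0, 4): 0.05}
conservative = gamma_centroid(toy_probs, 6, gamma=1.0)
permissive = gamma_centroid(toy_probs, 6, gamma=4.0)
```

A pair contributes positively only when p_ij exceeds 1/(γ + 1), so raising γ admits lower-probability pairs and yields more predicted base pairs.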
Affiliation(s)
- Michiaki Hamada
- Mizuho Information & Research Institute, Inc, 2-3 Kanda-Nishikicho, Chiyoda-ku, Tokyo 101-8443, Japan.
|