101
Hsu JBK, Bretaña NA, Lee TY, Huang HD. Incorporating evolutionary information and functional domains for identifying RNA splicing factors in humans. PLoS One 2011; 6:e27567. [PMID: 22110674 PMCID: PMC3217973 DOI: 10.1371/journal.pone.0027567] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 05/24/2011] [Accepted: 10/19/2011] [Indexed: 11/19/2022]
Abstract
Regulation of pre-mRNA splicing is achieved through the interaction of RNA sequence elements and a variety of RNA-splicing related proteins (splicing factors). The splicing machinery in humans is not yet fully elucidated, partly because splicing factors in humans have not been exhaustively identified. Furthermore, experimental methods for splicing factor identification are time-consuming and lab-intensive. Although many computational methods have been proposed for the identification of RNA-binding proteins, there exists no development that focuses on the identification of RNA-splicing related proteins so far. Therefore, we are motivated to design a method that focuses on the identification of human splicing factors using experimentally verified splicing factors. The investigation of amino acid composition reveals that there are remarkable differences between splicing factors and non-splicing proteins. A support vector machine (SVM) is utilized to construct a predictive model, and the five-fold cross-validation evaluation indicates that the SVM model trained with amino acid composition could provide a promising accuracy (80.22%). Another basic feature, amino acid dipeptide composition, is also examined to yield a similar predictive performance to amino acid composition. In addition, this work presents that the incorporation of evolutionary information and domain information could improve the predictive performance. The constructed models have been demonstrated to effectively classify (73.65% accuracy) an independent data set of human splicing factors. The result of independent testing indicates that in silico identification could be a feasible means of conducting preliminary analyses of splicing factors and significantly reducing the number of potential targets that require further in vivo or in vitro confirmation.
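The two basic sequence features named in this abstract, amino acid composition and dipeptide composition, can be illustrated with a minimal sketch. The function names, the handling of non-standard residues, and the feature ordering are assumptions made for illustration; the abstract does not describe the authors' implementation.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(seq):
    """Return 20 fractions, one per standard amino acid, in AMINO_ACIDS order."""
    # Assumption: non-standard symbols (e.g. X) are simply ignored.
    residues = [aa for aa in seq.upper() if aa in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(aa) / len(residues) for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return 400 fractions, one per ordered pair of adjacent standard residues."""
    seq = seq.upper()
    # Only count pairs that are adjacent in the original sequence and
    # consist of two standard residues.
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)
             if seq[i] in AMINO_ACIDS and seq[i + 1] in AMINO_ACIDS]
    all_pairs = [a + b for a in AMINO_ACIDS for b in AMINO_ACIDS]
    if not pairs:
        return [0.0] * len(all_pairs)
    return [pairs.count(p) / len(pairs) for p in all_pairs]
```

Either vector (20 or 400 values per protein) could then be passed to a classifier such as an SVM, which is the kind of model the abstract reports.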
Affiliation(s)
- Justin Bo-Kai Hsu
  - Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsin-Chu, Taiwan
- Neil Arvin Bretaña
  - Department of Computer Science and Engineering, Yuan Ze University, Taoyuan, Taiwan
- Tzong-Yi Lee
  - Department of Computer Science and Engineering, Yuan Ze University, Taoyuan, Taiwan
  - * E-mail: (T-YL); (H-DH)
- Hsien-Da Huang
  - Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsin-Chu, Taiwan
  - Department of Biological Science and Technology, National Chiao Tung University, Hsin-Chu, Taiwan
  - Core Facility for Structural Bioinformatics, National Chiao Tung University, Hsin-Chu, Taiwan
  - * E-mail: (T-YL); (H-DH)
102
Zhao H, Yang Y, Zhou Y. Highly accurate and high-resolution function prediction of RNA binding proteins by fold recognition and binding affinity prediction. RNA Biol 2011; 8:988-96. [PMID: 21955494 DOI: 10.4161/rna.8.6.17813] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.4] [Indexed: 01/10/2023]
Abstract
A full understanding of the mechanism of post-transcriptional regulation requires more than simple two-state prediction (binding or not binding) for RNA binding proteins. Here we report a sequence-based technique dedicated to predicting complex structures of protein and RNA by combining fold recognition with binding affinity prediction. The method not only provides a highly accurate complex structure prediction (on average, 77% of residues are within 4 Å RMSD from native for the independent test set) but also achieves the best-performing two-state binding or non-binding prediction, with an accuracy of 98%, precision of 84%, and Matthews correlation coefficient (MCC) of 0.62. Moreover, it predicts binding residues with an accuracy of 84%, precision of 66% and MCC value of 0.51. In addition, it has a success rate of 77% in predicting RNA binding types (mRNA, tRNA or rRNA). We further demonstrate that it improves either precision or sensitivity by more than 10% over PSI-BLAST, HHPRED and our previously developed structure-based technique. This method is expected to be useful for highly accurate genome-scale, high-resolution prediction of RNA-binding proteins and their complex structures. A web server (SPOT) is freely available for academic users at http://sparks.informatics.iupui.edu.
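The abstract reports accuracy, precision and MCC together. The sketch below shows how the three are computed from a confusion matrix and why accuracy can be very high while MCC stays moderate when non-binders greatly outnumber binders. The counts in the usage lines are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, mcc) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: MCC is 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, precision, mcc


# Hypothetical, heavily imbalanced example: 90 binders among 3000 proteins.
accuracy, precision, mcc = binary_metrics(tp=50, fp=10, fn=40, tn=2900)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} mcc={mcc:.3f}")
```

Because true negatives dominate the total, accuracy is driven almost entirely by the majority class, whereas MCC also penalises the missed binders.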
Affiliation(s)
- Huiying Zhao
  - School of Informatics, Indiana University Purdue University, Indianapolis, IN, USA
103
Lin WZ, Fang JA, Xiao X, Chou KC. iDNA-Prot: identification of DNA binding proteins using random forest with grey model. PLoS One 2011; 6:e24756. [PMID: 21935457 PMCID: PMC3174210 DOI: 10.1371/journal.pone.0024756] [Citation(s) in RCA: 196] [Impact Index Per Article: 14.0] [Received: 07/24/2011] [Accepted: 08/16/2011] [Indexed: 11/18/2022]
Abstract
DNA-binding proteins play crucial roles in various cellular processes. Developing high throughput tools for rapidly and effectively identifying DNA-binding proteins is one of the major challenges in the field of genome annotation. Although many efforts have been made in this regard, further effort is needed to enhance the prediction power. By incorporating the features into the general form of pseudo amino acid composition that were extracted from protein sequences via the “grey model” and by adopting the random forest operation engine, we proposed a new predictor, called iDNA-Prot, for identifying uncharacterized proteins as DNA-binding proteins or non-DNA binding proteins based on their amino acid sequences information alone. The overall success rate by iDNA-Prot was 83.96% that was obtained via jackknife tests on a newly constructed stringent benchmark dataset in which none of the proteins included has pairwise sequence identity to any other in a same subset. In addition to achieving high success rate, the computational time for iDNA-Prot is remarkably shorter in comparison with the relevant existing predictors. Hence it is anticipated that iDNA-Prot may become a useful high throughput tool for large-scale analysis of DNA-binding proteins. As a user-friendly web-server, iDNA-Prot is freely accessible to the public at the web-site on http://icpr.jci.edu.cn/bioinfo/iDNA-Prot or http://www.jci-bioinfo.cn/iDNA-Prot. Moreover, for the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the web-server to get the desired results.
Affiliation(s)
- Wei-Zhong Lin
  - Information Science and Technology School, Donghua University, Shanghai, China
  - Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, China
- Jian-An Fang
  - Information Science and Technology School, Donghua University, Shanghai, China
- Xuan Xiao
  - Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, China
  - Gordon Life Science Institute, San Diego, California, United States of America
  - * E-mail:
- Kuo-Chen Chou
  - Gordon Life Science Institute, San Diego, California, United States of America
104
Zhang W, Xiong Y, Zhao M, Zou H, Ye X, Liu J. Prediction of conformational B-cell epitopes from 3D structures by random forests with a distance-based feature. BMC Bioinformatics 2011; 12:341. [PMID: 21846404 PMCID: PMC3228550 DOI: 10.1186/1471-2105-12-341] [Citation(s) in RCA: 79] [Impact Index Per Article: 5.6] [Received: 03/10/2011] [Accepted: 08/17/2011] [Indexed: 11/29/2022]
Abstract
Background: Antigen-antibody interactions are key events in the immune system, which provide important clues to the immune processes and responses. In antigen-antibody interactions, the specific sites on the antigens that are directly bound by the B-cell produced antibodies are known as B-cell epitopes. The identification of epitopes is a hot topic in bioinformatics because of their potential use in epitope-based drug design. Although most B-cell epitopes are discontinuous (or conformational), insufficient effort has been put into conformational epitope prediction, and the performance of existing methods is far from satisfactory.
Results: In order to develop a high-accuracy model, we focus on several aspects that may affect prediction performance, including the impact of interior residues, the different contributions of adjacent residues, and the imbalanced data, which contain many more non-epitope residues than epitope residues. To address these issues, we adopt the following strategies. Firstly, a concept of 'thick surface patch' instead of 'surface patch' is introduced to describe the local spatial context of each surface residue, which considers the impact of interior residues. The comparison between the thick surface patch and the surface patch shows that interior residues contribute to the recognition of epitopes. Secondly, a statistically significant difference in the distance distribution between non-epitope patches and epitope patches is observed; thus an adjacent residue distance feature is presented, which reflects the unequal contributions of adjacent residues to the location of binding sites. Thirdly, a bootstrapping and voting procedure is adopted to deal with the imbalanced dataset. Based on the above ideas, we propose a new method to identify B-cell conformational epitopes from 3D structures by combining conventional features and the proposed feature, with the random forest (RF) algorithm used as the classification engine. The experiments show that our method can predict conformational B-cell epitopes with high accuracy. Evaluated by leave-one-out cross validation (LOOCV), our method achieves a mean AUC value of 0.633 for the benchmark bound dataset, and a mean AUC value of 0.654 for the benchmark unbound dataset. When compared with the state-of-the-art prediction models in the independent test, our method demonstrates comparable or better performance.
Conclusions: Our method is demonstrated to be effective for the prediction of conformational epitopes. Based on the study, we develop a tool to predict conformational epitopes from 3D structures, available at http://code.google.com/p/my-project-bpredictor/downloads/list.
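The bootstrapping and voting idea for imbalanced data mentioned in this abstract can be sketched as follows. This is one common reading of such a procedure, not the authors' exact protocol: each model sees all minority (epitope) examples plus an equally sized bootstrap sample of the majority class, and the final label is a majority vote. The nearest-centroid learner is a deliberately simple stand-in for the random forest named in the abstract, and all function names are hypothetical.

```python
import random


def train_centroid(samples):
    """Stand-in learner (NOT a random forest): one mean vector per class."""
    sums, counts = {}, {}
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict_centroid(model, x):
    """Return the class whose centroid is closest to x."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda y: dist2(model[y]))


def bootstrap_vote_ensemble(positives, negatives, n_models=11, seed=0):
    """Train n_models learners, each on a class-balanced bootstrap sample."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Draw as many negatives as there are positives, with replacement.
        neg_sample = [rng.choice(negatives) for _ in positives]
        data = [(x, 1) for x in positives] + [(x, 0) for x in neg_sample]
        models.append(train_centroid(data))
    return models


def vote(models, x):
    """Majority vote over the ensemble; returns 1 (epitope) or 0."""
    votes = sum(predict_centroid(m, x) for m in models)
    return 1 if votes * 2 > len(models) else 0
```

Using an odd number of models avoids ties in the vote; the balanced resampling keeps any single model from being dominated by the non-epitope class.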
Affiliation(s)
- Wen Zhang
  - School of Computer, Wuhan University, Wuhan 430072, China.
105
Abstract
The awareness of important biological role played by functional, non coding (nc) RNA has grown tremendously in recent years. To perform their tasks, ncRNA molecules typically unite with protein partners, forming ribonucleoprotein complexes. Structural insight into their architectures can be greatly supplemented by computational docking techniques, as they provide means for the integration and refinement of experimental data that is often limited to fragments of larger assemblies or represents multiple levels of spatial resolution. Here, we present a coarse-grained force field for protein-RNA docking, implemented within the framework of the ATTRACT program. Complex structure prediction is based on energy minimization in rotational and translational degrees of freedom of binding partners, with possible extension to include structural flexibility. The coarse-grained representation allows for fast and efficient systematic docking search without any prior knowledge about complex geometry.
Affiliation(s)
- Piotr Setny
  - Physics Department T38, Technical University Munich, James-Franck-Strasse 1, 85748 Garching, Germany.
106
Shazman S, Elber G, Mandel-Gutfreund Y. From face to interface recognition: a differential geometric approach to distinguish DNA from RNA binding surfaces. Nucleic Acids Res 2011; 39:7390-9. [PMID: 21693557 PMCID: PMC3177183 DOI: 10.1093/nar/gkr395] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Indexed: 12/15/2022]
Abstract
Protein-nucleic acid interactions play a critical role in all steps of the gene expression pathway. Nucleic acid (NA) binding proteins interact with their partners, DNA or RNA, via distinct regions on their surface that are characterized by an ensemble of chemical, physical and geometrical properties. In this study, we introduce a novel methodology based on differential geometry, commonly used in face recognition, to characterize and predict NA binding surfaces on proteins. Applying the method to experimentally solved three-dimensional structures of proteins, we successfully distinguish double-stranded DNA (dsDNA) from single-stranded RNA (ssRNA) binding proteins, with 83% accuracy. We show that the method is insensitive to conformational changes that occur upon binding and is applicable to de novo protein-function prediction. Remarkably, when concentrating on the zinc finger motif, we distinguish successfully between RNA and DNA binding interfaces possessing the same binding motif even within the same protein, as demonstrated for the RNA polymerase transcription factor TFIIIA. In conclusion, we present a novel methodology to characterize protein surfaces, which can accurately tell apart dsDNA and ssRNA binding interfaces. The strength of our method in recognizing fine-tuned differences on NA binding interfaces makes it applicable to many other molecular recognition problems, with potential implications for drug design.
Affiliation(s)
- Shula Shazman
  - Department of Computer Science, Technion-Israel Institute of Technology, Haifa, Israel
107
Rao HB, Zhu F, Yang GB, Li ZR, Chen YZ. Update of PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res 2011; 39:W385-90. [PMID: 21609959 PMCID: PMC3125735 DOI: 10.1093/nar/gkr284] [Citation(s) in RCA: 105] [Impact Index Per Article: 7.5] [Indexed: 12/04/2022]
Abstract
Sequence-derived structural and physicochemical features have been extensively used for analyzing and predicting structural, functional, expression and interaction profiles of proteins and peptides. PROFEAT has been developed as a web server for computing commonly used features of proteins and peptides from amino acid sequence. To facilitate more extensive studies of protein and peptides, numerous improvements and updates have been made to PROFEAT. We added new functions for computing descriptors of protein–protein and protein–small molecule interactions, segment descriptors for local properties of protein sequences, topological descriptors for peptide sequences and small molecule structures. We also added new feature groups for proteins and peptides (pseudo-amino acid composition, amphiphilic pseudo-amino acid composition, total amino acid properties and atomic-level topological descriptors) as well as for small molecules (atomic-level topological descriptors). Overall, PROFEAT computes 11 feature groups of descriptors for proteins and peptides, and a feature group of more than 400 descriptors for small molecules plus the derived features for protein–protein and protein–small molecule interactions. Our computational algorithms have been extensively tested and used in a number of published works for predicting proteins of specific structural or functional classes, protein–protein interactions, peptides of specific functions and quantitative structure activity relationships of small molecules. PROFEAT is accessible free of charge at http://bidd.cz3.nus.edu.sg/cgi-bin/prof/protein/profnew.cgi.
Affiliation(s)
- H B Rao
  - College of Chemistry, Sichuan University, Chengdu, 610064, PR China
108
Dou Y, Geng X, Gao H, Yang J, Zheng X, Wang J. Sequence Conservation in the Prediction of Catalytic Sites. Protein J 2011; 30:229-39. [PMID: 21465136 DOI: 10.1007/s10930-011-9324-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 10/18/2022]
109
Lewis BA, Walia RR, Terribilini M, Ferguson J, Zheng C, Honavar V, Dobbs D. PRIDB: a Protein-RNA interface database. Nucleic Acids Res 2011; 39:D277-82. [PMID: 21071426 PMCID: PMC3013700 DOI: 10.1093/nar/gkq1108] [Citation(s) in RCA: 103] [Impact Index Per Article: 7.4] [Received: 08/15/2010] [Revised: 10/15/2010] [Accepted: 10/18/2010] [Indexed: 11/25/2022]
Abstract
The Protein-RNA Interface Database (PRIDB) is a comprehensive database of protein-RNA interfaces extracted from complexes in the Protein Data Bank (PDB). It is designed to facilitate detailed analyses of individual protein-RNA complexes and their interfaces, in addition to automated generation of user-defined data sets of protein-RNA interfaces for statistical analyses and machine learning applications. For any chosen PDB complex or list of complexes, PRIDB rapidly displays interfacial amino acids and ribonucleotides within the primary sequences of the interacting protein and RNA chains. PRIDB also identifies ProSite motifs in protein chains and FR3D motifs in RNA chains and provides links to these external databases, as well as to structure files in the PDB. An integrated JMol applet is provided for visualization of interacting atoms and residues in the context of the 3D complex structures. The current version of PRIDB contains structural information regarding 926 protein-RNA complexes available in the PDB (as of 10 October 2010). Atomic- and residue-level contact information for the entire data set can be downloaded in a simple machine-readable format. Also, several non-redundant benchmark data sets of protein-RNA complexes are provided. The PRIDB database is freely available online at http://bindr.gdcb.iastate.edu/PRIDB.
Affiliation(s)
- Benjamin A Lewis
  - Bioinformatics and Computational Biology Program, Iowa State University, Iowa, USA.
110
Zhao H, Yang Y, Zhou Y. Structure-based prediction of RNA-binding domains and RNA-binding sites and application to structural genomics targets. Nucleic Acids Res 2010; 39:3017-25. [PMID: 21183467 PMCID: PMC3082898 DOI: 10.1093/nar/gkq1266] [Citation(s) in RCA: 94] [Impact Index Per Article: 6.3] [Indexed: 12/31/2022]
Abstract
Mechanistic understanding of many key cellular processes often involves identification of RNA binding proteins (RBPs) and RNA binding sites in two separate steps. Here, they are predicted simultaneously by structural alignment to known protein-RNA complex structures followed by binding assessment with a DFIRE-based statistical energy function. This method achieves 98% accuracy and 91% precision for predicting RBPs and 93% accuracy and 78% precision for predicting RNA-binding amino-acid residues for a large benchmark of 212 RNA binding and 6761 non-RNA binding domains (leave-one-out cross-validation). Additional tests revealed that the method makes no false positive prediction from 311 DNA binding domains but correctly detects six domains binding with both DNA and RNA. In addition, it correctly identified 31 of 75 unbound RNA-binding domains with 92% accuracy and 65% precision for predicted binding residues and achieved 86% success rate in its application to SCOP RNA binding domain superfamily (Structural Classification Of Proteins). It further predicts 25 targets as RBPs in 2076 structural genomics targets: 20 of 25 predicted ones (80%) are putatively RNA binding. The superior performance over existing methods indicates the importance of dividing structures into domains, using a Z-score to measure relative structural similarity, and a statistical energy function to measure protein-RNA binding affinity.
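The abstract credits part of the method's performance to using a Z-score as a relative measure of structural similarity. A generic Z-score can be sketched as below; the choice of raw alignment score and of the background set of scores is an assumption here, since the abstract does not specify either.

```python
import statistics


def z_score(score, background_scores):
    """Express a raw similarity score in standard deviations above the
    mean of a background distribution (population standard deviation)."""
    mean = statistics.mean(background_scores)
    spread = statistics.pstdev(background_scores)
    if spread == 0:
        raise ValueError("background scores have zero variance")
    return (score - mean) / spread


# Hypothetical usage: one query-template score against scores obtained
# for the same query aligned to other templates.
print(z_score(5.0, [1.0, 2.0, 3.0]))
```

Normalising against a background makes scores comparable across queries of different size, which a raw alignment score generally is not.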
Affiliation(s)
- Huiying Zhao
  - School of Informatics, Indiana University Purdue University and Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202, USA