1
Pradhan UK, Naha S, Das R, Gupta A, Parsad R, Meher PK. RBProkCNN: Deep learning on appropriate contextual evolutionary information for RNA binding protein discovery in prokaryotes. Comput Struct Biotechnol J 2024; 23:1631-1640. [PMID: 38660008 PMCID: PMC11039349 DOI: 10.1016/j.csbj.2024.04.034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2024] [Revised: 04/12/2024] [Accepted: 04/12/2024] [Indexed: 04/26/2024] Open
Abstract
RNA-binding proteins (RBPs) are central to key functions such as post-transcriptional regulation, mRNA stability, and adaptation to varied environmental conditions in prokaryotes. While the majority of research has concentrated on eukaryotic RBPs, recent developments underscore the crucial involvement of prokaryotic RBPs. Although computational methods have emerged in recent years to identify RBPs, they have fallen short in accurately identifying prokaryotic RBPs due to their generic nature. To bridge this gap, we introduce RBProkCNN, a novel machine learning-driven computational model meticulously designed for the accurate prediction of prokaryotic RBPs. The prediction process involves the utilization of eight shallow learning algorithms and four deep learning models, incorporating PSSM-based evolutionary features. By leveraging a convolutional neural network (CNN) and evolutionarily significant features selected through extreme gradient boosting variable importance measure, RBProkCNN achieved the highest accuracy in five-fold cross-validation, yielding 98.04% auROC and 98.19% auPRC. Furthermore, RBProkCNN demonstrated robust performance with an independent dataset, showcasing a commendable 95.77% auROC and 95.78% auPRC. Noteworthy is its superior predictive accuracy when compared to several state-of-the-art existing models. RBProkCNN is available as an online prediction tool (https://iasri-sg.icar.gov.in/rbprokcnn/), offering free access to interested users. This tool represents a substantial contribution, enriching the array of resources available for the accurate and efficient prediction of prokaryotic RBPs.
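The abstract above reports performance as auROC. As a minimal sketch of what that number measures, and not the authors' evaluation code, the function below computes auROC as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The function name `auroc` and the toy labels and scores are illustrative assumptions.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 class labels (1 = RBP, 0 = non-RBP).
    scores: iterable of predicted scores, higher meaning "more likely RBP".
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not taken from the paper).
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.4, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)
```

The pairwise form is quadratic in the number of examples, so it suits small illustrations; rank-based implementations are the usual choice for large datasets.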
Affiliation(s)
- Upendra Kumar Pradhan
  - Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Sanchita Naha
  - Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Ritwika Das
  - Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Ajit Gupta
  - Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Rajender Parsad
  - ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Prabina Kumar Meher
  - Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
2
Basu S, Yu J, Kihara D, Kurgan L. Twenty years of advances in prediction of nucleic acid-binding residues in protein sequences. Brief Bioinform 2025; 26:bbaf016. [PMID: 39833102 PMCID: PMC11745544 DOI: 10.1093/bib/bbaf016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/03/2024] [Revised: 12/24/2024] [Accepted: 01/06/2025] [Indexed: 01/22/2025] Open
Abstract
Computational prediction of nucleic acid-binding residues in protein sequences is an active field of research, with over 80 methods that were released in the past 2 decades. We identify and discuss 87 sequence-based predictors that include dozens of recently published methods that are surveyed for the first time. We overview historical progress and examine multiple practical issues that include availability and impact of predictors, key features of their predictive models, and important aspects related to their training and assessment. We observe that the past decade has brought increased use of deep neural networks and protein language models, which contributed to substantial gains in the predictive performance. We also highlight advancements in vital and challenging issues that include cross-predictions between deoxyribonucleic acid (DNA)-binding and ribonucleic acid (RNA)-binding residues and targeting the two distinct sources of binding annotations, structure-based versus intrinsic disorder-based. The methods trained on the structure-annotated interactions tend to perform poorly on the disorder-annotated binding and vice versa, with only a few methods that target and perform well across both annotation types. The cross-predictions are a significant problem, with some predictors of DNA-binding or RNA-binding residues indiscriminately predicting interactions with both nucleic acid types. Moreover, we show that methods with web servers are cited substantially more than tools without implementation or with no longer working implementations, motivating the development and long-term maintenance of the web servers. We close by discussing future research directions that aim to drive further progress in this area.
Affiliation(s)
- Sushmita Basu
  - Department of Computer Science, Virginia Commonwealth University, 401 West Main Street, Richmond, VA 23284, United States
- Jing Yu
  - Department of Computer Science, Virginia Commonwealth University, 401 West Main Street, Richmond, VA 23284, United States
- Daisuke Kihara
  - Department of Biological Sciences, Purdue University, 915 Mitch Daniels Boulevard, West Lafayette, IN 47907, United States
  - Department of Computer Science, Purdue University, 305 N. University Street, West Lafayette, IN 47907, United States
- Lukasz Kurgan
  - Department of Computer Science, Virginia Commonwealth University, 401 West Main Street, Richmond, VA 23284, United States
3
Zeng C, Zhuo C, Gao J, Liu H, Zhao Y. Advances and Challenges in Scoring Functions for RNA-Protein Complex Structure Prediction. Biomolecules 2024; 14:1245. [PMID: 39456178 PMCID: PMC11506084 DOI: 10.3390/biom14101245] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2024] [Revised: 09/24/2024] [Accepted: 09/30/2024] [Indexed: 10/28/2024] Open
Abstract
RNA-protein complexes play a crucial role in cellular functions, providing insights into cellular mechanisms and potential therapeutic targets. However, experimental determination of these complex structures is often time-consuming and resource-intensive, and it rarely yields high-resolution data. Many computational approaches have been developed to predict RNA-protein complex structures in recent years. Despite these advances, achieving accurate and high-resolution predictions remains a formidable challenge, primarily due to the limitations inherent in current RNA-protein scoring functions. These scoring functions are critical tools for evaluating and interpreting RNA-protein interactions. This review comprehensively explores the latest advancements in scoring functions for RNA-protein docking, delving into the fundamental principles underlying various approaches, including coarse-grained knowledge-based, all-atom knowledge-based, and machine-learning-based methods. We critically evaluate the strengths and limitations of existing scoring functions, providing a detailed performance assessment. Considering the significant progress demonstrated by machine learning techniques, we discuss emerging trends and propose future research directions to enhance the accuracy and efficiency of scoring functions in RNA-protein complex prediction. We aim to inspire the development of more sophisticated and reliable computational tools in this rapidly evolving field.
Affiliation(s)
- Yunjie Zhao
  - Institute of Biophysics and Department of Physics, Central China Normal University, Wuhan 430079, China; (C.Z.); (C.Z.); (J.G.); (H.L.)
4
Jia P, Zhang F, Wu C, Li M. A comprehensive review of protein-centric predictors for biomolecular interactions: from proteins to nucleic acids and beyond. Brief Bioinform 2024; 25:bbae162. [PMID: 38739759 PMCID: PMC11089422 DOI: 10.1093/bib/bbae162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2024] [Revised: 02/17/2024] [Accepted: 03/31/2024] [Indexed: 05/16/2024] Open
Abstract
Proteins interact with diverse ligands to perform a large number of biological functions, such as gene expression and signal transduction. Accurate identification of these protein-ligand interactions is crucial to the understanding of molecular mechanisms and the development of new drugs. However, traditional biological experiments are time-consuming and expensive. With the development of high-throughput technologies, an increasing amount of protein data is available. In the past decades, many computational methods have been developed to predict protein-ligand interactions. Here, we review a comprehensive set of over 160 protein-ligand interaction predictors, which cover protein-protein, protein-nucleic acid, protein-peptide and protein-other ligands (nucleotide, heme, ion) interactions. We have carried out a comprehensive analysis of the above four types of predictors from several significant perspectives, including their inputs, feature profiles, models, availability, etc. The current methods primarily rely on protein sequences, especially utilizing evolutionary information. The significant improvement in predictions is attributed to deep learning methods. Additionally, sequence-based pretrained models and structure-based approaches are emerging as new trends.
Affiliation(s)
- Pengzhen Jia
  - School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
- Fuhao Zhang
  - School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
  - College of Information Engineering, Northwest A&F University, No. 3 Taicheng Road, Yangling, Shaanxi 712100, China
- Chaojin Wu
  - School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
- Min Li
  - School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
5
Yan Y, Li W, Wang S, Huang T. Seq-RBPPred: Predicting RNA-Binding Proteins from Sequence. ACS OMEGA 2024; 9:12734-12742. [PMID: 38524500 PMCID: PMC10955590 DOI: 10.1021/acsomega.3c08381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Revised: 12/18/2023] [Accepted: 12/28/2023] [Indexed: 03/26/2024]
Abstract
RNA-binding proteins (RBPs) can interact with RNAs to regulate RNA translation, modification, splicing, and other important biological processes. The accurate identification of RBPs is of paramount importance for gaining insights into the intricate mechanisms underlying organismal life activities. Traditional experimental methods to identify RBPs require a lot of time and money, so it is important to develop computational methods to predict RBPs. However, the existing approaches for RBP prediction still require further improvement due to unidentified RBPs in many species. In this study, we present Seq-RBPPred (predicting RBPs from sequence), a novel method that utilizes a comprehensive feature representation encompassing both biophysical properties and hidden-state features derived from protein sequences. Comprehensive performance evaluations of Seq-RBPPred demonstrate its superiority compared with state-of-the-art methods, yielding 0.922 for overall accuracy, 0.926 for sensitivity, 0.903 for specificity, and a Matthews correlation coefficient (MCC) of 0.757 on the testing set. The data and code of Seq-RBPPred are available at https://github.com/yaoyao-11/Seq-RBPPred.
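The four metrics quoted in the abstract above (accuracy, sensitivity, specificity, MCC) all derive from the same confusion matrix. The sketch below shows the standard definitions; the function name `binary_metrics` and the toy counts are illustrative assumptions, not values from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Toy counts (hypothetical): 40 TP, 10 FP, 40 TN, 10 FN.
toy_metrics = binary_metrics(tp=40, fp=10, tn=40, fn=10)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is one reason it is often reported alongside accuracy on imbalanced RBP datasets.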
Affiliation(s)
- Yuyao Yan
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200021, China
- Wenran Li
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200021, China
- Sijia Wang
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200021, China
- Tao Huang
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200021, China
6
Pradhan UK, Meher PK, Naha S, Pal S, Gupta S, Gupta A, Parsad R. RBPLight: a computational tool for discovery of plant-specific RNA-binding proteins using light gradient boosting machine and ensemble of evolutionary features. Brief Funct Genomics 2023; 22:401-410. [PMID: 37158175 DOI: 10.1093/bfgp/elad016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2022] [Revised: 04/12/2023] [Accepted: 04/21/2023] [Indexed: 05/10/2023] Open
Abstract
RNA-binding proteins (RBPs) are essential for post-transcriptional gene regulation in eukaryotes, including splicing control, mRNA transport and decay. Thus, accurate identification of RBPs is important to understand gene expression and regulation of cell state. A number of computational models have been developed to detect RBPs. These methods made use of datasets from several eukaryotic species, specifically from mice and humans. Although some models have been tested on Arabidopsis, these techniques fall short of correctly identifying RBPs for other plant species. Therefore, a powerful computational model for identifying plant-specific RBPs is needed. In this study, we present a novel computational model for identifying RBPs in plants. Five deep learning models and ten shallow learning algorithms were utilized for prediction with 20 sequence-derived and 20 evolutionary feature sets. The highest repeated five-fold cross-validation accuracy, 91.24% AU-ROC and 91.91% AU-PRC, was achieved by the light gradient boosting machine. When evaluated using an independent dataset, the developed approach achieved 94.00% AU-ROC and 94.50% AU-PRC. The proposed model achieved significantly higher accuracy for predicting plant-specific RBPs than the currently available state-of-the-art RBP prediction models. Although certain models have already been trained and assessed on the model organism Arabidopsis, this is the first comprehensive computational model for the discovery of plant-specific RBPs. The web server RBPLight was also developed and is publicly accessible at https://iasri-sg.icar.gov.in/rbplight/ so that researchers can identify RBPs in plants.
Affiliation(s)
- Upendra K Pradhan
  - Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Prabina K Meher
  - Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Sanchita Naha
  - Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Soumen Pal
  - Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Sagar Gupta
  - CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur (HP) 176061, India
- Ajit Gupta
  - Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Rajender Parsad
  - ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
7
Zeng C, Jian Y, Vosoughi S, Zeng C, Zhao Y. Evaluating native-like structures of RNA-protein complexes through the deep learning method. Nat Commun 2023; 14:1060. [PMID: 36828844 PMCID: PMC9958188 DOI: 10.1038/s41467-023-36720-9] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 10/25/2022] [Accepted: 02/14/2023] [Indexed: 02/26/2023] Open
Abstract
RNA-protein complexes underlie numerous cellular processes, including basic translation and gene regulation. The high-resolution structure determination of the RNA-protein complexes is essential for elucidating their functions. Therefore, computational methods capable of identifying the native-like RNA-protein structures are needed. To address this challenge, we thus develop DRPScore, a deep-learning-based approach for identifying native-like RNA-protein structures. DRPScore is tested on representative sets of RNA-protein complexes with various degrees of binding-induced conformation change ranging from fully rigid docking (bound-bound) to fully flexible docking (unbound-unbound). Out of the top 20 predictions, DRPScore selects native-like structures with a success rate of 91.67% on the testing set of bound RNA-protein complexes and 56.14% on the unbound complexes. DRPScore consistently outperforms existing methods with a roughly 10.53-15.79% improvement, even for the most difficult unbound cases. Furthermore, DRPScore significantly improves the accuracy of the native interface interaction predictions. DRPScore should be broadly useful for modeling and designing RNA-protein complexes.
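The abstract above reports a success rate over the top 20 predictions. One common way to define such a rate, assumed here rather than taken from the paper, is the fraction of complexes for which at least one native-like structure appears among the N best-scored candidates. The function name `top_n_success_rate` and the toy rankings are hypothetical.

```python
def top_n_success_rate(ranked_flags_per_complex, n=20):
    """Fraction of complexes with a native-like structure in the top n.

    ranked_flags_per_complex: one list per complex; each list holds booleans
    ordered from best-scored to worst-scored candidate, where True marks a
    native-like structure.
    """
    if not ranked_flags_per_complex:
        raise ValueError("need at least one complex")
    hits = sum(1 for flags in ranked_flags_per_complex if any(flags[:n]))
    return hits / len(ranked_flags_per_complex)


# Toy rankings (hypothetical) for three complexes.
toy_rankings = [
    [False, True, False],   # native-like structure ranked 2nd
    [False, False, False],  # no native-like structure found
    [True, False],          # native-like structure ranked 1st
]
toy_top1 = top_n_success_rate(toy_rankings, n=1)
toy_top2 = top_n_success_rate(toy_rankings, n=2)
```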
Affiliation(s)
- Chengwei Zeng
  - Institute of Biophysics and Department of Physics, Central China Normal University, Wuhan, 430079, China
- Yiren Jian
  - Department of Computer Science, Dartmouth College, Hanover, NH, 03755, USA
- Soroush Vosoughi
  - Department of Computer Science, Dartmouth College, Hanover, NH, 03755, USA
- Chen Zeng
  - Department of Physics, The George Washington University, Washington, DC, 20052, USA
- Yunjie Zhao
  - Institute of Biophysics and Department of Physics, Central China Normal University, Wuhan, 430079, China
8
Wei Q, Zhang Q, Gao H, Song T, Salhi A, Yu B. DEEPStack-RBP: Accurate identification of RNA-binding proteins based on autoencoder feature selection and deep stacking ensemble classifier. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109875] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/31/2022]
9
Peng X, Wang X, Guo Y, Ge Z, Li F, Gao X, Song J. RBP-TSTL is a two-stage transfer learning framework for genome-scale prediction of RNA-binding proteins. Brief Bioinform 2022; 23:6596984. [PMID: 35649392 PMCID: PMC9294422 DOI: 10.1093/bib/bbac215] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 03/17/2022] [Revised: 04/25/2022] [Accepted: 05/06/2022] [Indexed: 11/27/2022] Open
Abstract
RNA binding proteins (RBPs) are critical for the post-transcriptional control of RNAs and play vital roles in a myriad of biological processes, such as RNA localization and gene regulation. Therefore, computational methods that are capable of accurately identifying RBPs are highly desirable and have important implications for biomedical and biotechnological applications. Here, we propose a two-stage deep transfer learning-based framework, termed RBP-TSTL, for accurate prediction of RBPs. In the first stage, the knowledge from the self-supervised pre-trained model was extracted as feature embeddings and used to represent the protein sequences, while in the second stage, a customized deep learning model was initialized based on an annotated pre-training RBP dataset before being fine-tuned on each corresponding target species dataset. This two-stage transfer learning framework enables the RBP-TSTL model to be trained effectively and improves prediction performance. Extensive performance benchmarking of the RBP-TSTL models trained using the features generated by the self-supervised pre-trained model against other models trained using hand-crafted encoding features demonstrated the effectiveness of the proposed two-stage knowledge transfer strategy based on the self-supervised pre-trained models. Using the best-performing RBP-TSTL models, we further conducted genome-scale RBP predictions for Homo sapiens, Arabidopsis thaliana, Escherichia coli, and Salmonella and established a computational compendium containing all the predicted putative RBP candidates. We anticipate that the proposed RBP-TSTL approach will be explored as a useful tool for the characterization of RNA-binding proteins and exploration of their sequence–structure–function relationships.
Affiliation(s)
- Xinxin Peng
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
  - Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia
- Xiaoyu Wang
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
  - Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia
- Yuming Guo
  - Department of Epidemiology and Preventive Medicine, School of Public Health and Preventive Medicine, Monash University, Melbourne, Victoria 3004, Australia
- Zongyuan Ge
  - Monash e-Research Centre and Faculty of Engineering, Monash University, Melbourne, VIC 3800, Australia
- Fuyi Li
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
  - Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia
  - College of Information Engineering, Northwest A&F University, Yangling, 712100, China
- Xin Gao
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
  - KAUST Computational Bioscience Research Center, King Abdullah University of Science and Technology
- Jiangning Song
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
  - Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia
10
Du X, Zhao X, Zhang Y. DeepBtoD: Improved RNA-binding proteins prediction via integrated deep learning. J Bioinform Comput Biol 2022; 20:2250006. [PMID: 35451938 DOI: 10.1142/s0219720022500068] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/18/2022]
Abstract
RNA-binding proteins (RBPs) have crucial roles in various cellular processes such as alternative splicing and gene regulation. Therefore, the analysis and identification of RBPs is an essential issue. However, although many computational methods have been developed for predicting RBPs, few studies simultaneously consider local and global information from the perspective of the RNA sequence. To address this challenge, we present a novel method called DeepBtoD, which predicts RBPs directly from RNA sequences. First, a [Formula: see text]-BtoD encoding is designed, which takes into account the composition of [Formula: see text]-nucleotides and their relative positions and forms a local module. Second, we designed a multi-scale convolutional module embedded with a self-attentive mechanism, the ms-focusCNN, which is used to learn more effective, diverse, and discriminative high-level features. Finally, global information is considered to supplement local modules with ensemble learning to predict whether the target RNA binds to RBPs. Preliminary results on 24 independent test datasets show that our proposed method can classify RBPs with an area under the curve of 0.933. Remarkably, DeepBtoD shows competitive results against seven state-of-the-art methods, suggesting that RBPs can be recognized accurately by integrating local [Formula: see text]-BtoD and global information from RNA sequences alone. Hence, our integrative method may help improve the power of RBP prediction, which might be particularly useful for modeling protein-nucleic acid interactions in systems biology studies. Our DeepBtoD server can be accessed at http://175.27.228.227/DeepBtoD/.
Affiliation(s)
- XiuQuan Du
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, Anhui, P. R. China
  - School of Computer Science and Technology, Anhui University, Hefei 230601, Anhui, P. R. China
- XiuJuan Zhao
  - School of Computer Science and Technology, Anhui University, Hefei 230601, Anhui, P. R. China
- YanPing Zhang
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, Anhui, P. R. China
11
Xie J, Zhang X, Zheng J, Hong X, Tong X, Liu X, Xue Y, Wang X, Zhang Y, Liu S. Two novel RNA-binding proteins identification through computational prediction and experimental validation. Genomics 2021; 114:149-160. [PMID: 34921931 DOI: 10.1016/j.ygeno.2021.12.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2021] [Revised: 08/05/2021] [Accepted: 12/13/2021] [Indexed: 11/16/2022]
Abstract
Since RBPs play important roles in the cell, it is particularly important to find new RBPs. We performed iRIP-seq and CLIP-seq to verify whether two proteins predicted by RBPPred, CLIP1 and DMD, are RBPs. The experimental results confirm that these two proteins have RNA-binding activity. We identified significantly enriched binding motifs UGGGGAGG, CUUCCG and CCCGU for CLIP1 (iRIP-seq), DMD (iRIP-seq) and DMD (CLIP-seq), respectively. The computational KEGG and GO analyses show that CLIP1 and DMD share some biological processes and functions. In addition, we found that the SNPs between DMD and its RNA partners may be associated with Becker muscular dystrophy, Duchenne muscular dystrophy, dilated cardiomyopathy 3B and cardiovascular phenotype. Across data from thirteen cancers, CLIP1 and another 300 oncogenes always co-occur, and 123 of these 300 genes interact with CLIP1. These cancers may be associated with mutations occurring in both CLIP1 and the genes it interacts with.
Affiliation(s)
- Juan Xie
  - School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Xiaoli Zhang
  - School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Jinfang Zheng
  - School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Xu Hong
  - School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Xiaoxue Tong
  - School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Xudong Liu
  - School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Yaqiang Xue
  - Laboratory for Genome Regulation and Human Health, ABLife Inc., Wuhan, Hubei 430075, China
- Xuelian Wang
  - ABLife BioBigData Institute, Wuhan, Hubei 430075, China
- Yi Zhang
  - ABLife BioBigData Institute, Wuhan, Hubei 430075, China
- Shiyong Liu
  - School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
12
Mishra A, Khanal R, Kabir WU, Hoque T. AIRBP: Accurate identification of RNA-binding proteins using machine learning techniques. Artif Intell Med 2021; 113:102034. [PMID: 33685590 DOI: 10.1016/j.artmed.2021.102034] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 04/11/2020] [Revised: 01/19/2021] [Accepted: 02/09/2021] [Indexed: 12/25/2022]
Abstract
Identification of RNA-binding proteins (RBPs) that bind to ribonucleic acid molecules is an important problem in Computational Biology and Bioinformatics. Identifying RBPs is indispensable because they play crucial roles in post-transcriptional control of RNAs and RNA metabolism and have diverse roles in biological processes such as splicing, mRNA stabilization, mRNA localization, translation, RNA synthesis, folding-unfolding, modification, processing, and degradation. The existing experimental techniques for identifying RBPs are time-consuming and expensive. Therefore, identifying RBPs directly from the sequence using computational methods can be useful to annotate RBPs and assist the experimental design efficiently. In this work, we present a method called AIRBP, which is designed using an advanced machine learning technique, called stacking, to effectively predict RBPs by utilizing features extracted from evolutionary information, physicochemical properties, and disordered properties. Moreover, AIRBP uses the majority vote from RBPPred, DeepRBPPred, and the stacking model for the prediction of RBPs. The results show that AIRBP attains Accuracy (ACC), Balanced Accuracy (BACC), F1-score, and Matthews Correlation Coefficient (MCC) of 95.84%, 94.71%, 0.928, and 0.899, respectively, based on the training dataset, using 10-fold cross-validation (CV). Further evaluation of AIRBP on independent test sets reveals that it achieves ACC, BACC, F1-score, and MCC of 94.36%, 94.28%, 0.897, and 0.860 for the Human test set; 91.25%, 93.00%, 0.896, and 0.835 for the S. cerevisiae test set; and 90.60%, 90.41%, 0.934, and 0.775 for the A. thaliana test set, respectively. These results indicate that AIRBP outperforms the existing Deep- and TriPepSVM methods. Therefore, the proposed better-performing AIRBP can be useful for accurate identification and annotation of RBPs directly from the sequence and help gain valuable insight to treat critical diseases. Availability: Code-data is available here: http://cs.uno.edu/~tamjid/Software/AIRBP/code_data.zip.
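The abstract above describes combining three predictors by majority vote. The sketch below illustrates the voting rule only; it does not call RBPPred, DeepRBPPred, or the stacking model, and the function name `majority_vote` and the toy labels are illustrative assumptions.

```python
def majority_vote(predictions):
    """Return the label (0 or 1) chosen by most predictors.

    predictions: list of 0/1 labels, one per predictor. An odd number of
    predictors is required so that a tie cannot occur.
    """
    if len(predictions) % 2 == 0:
        raise ValueError("use an odd number of predictors to avoid ties")
    positive_votes = sum(predictions)
    return 1 if 2 * positive_votes > len(predictions) else 0


# Toy example (hypothetical outputs of three predictors for one protein):
# two of the three predictors call the protein an RBP.
toy_call = majority_vote([1, 0, 1])
```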
Affiliation(s)
- Avdesh Mishra
  - Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, Kingsville, TX, USA
- Reecha Khanal
  - Department of Computer Science, University of New Orleans, New Orleans, LA, USA
- Wasi Ul Kabir
  - Department of Computer Science, University of New Orleans, New Orleans, LA, USA
- Tamjidul Hoque
  - Department of Computer Science, University of New Orleans, New Orleans, LA, USA
13
|
Zhao Y, Du X. econvRBP: Improved ensemble convolutional neural networks for RNA binding protein prediction directly from sequence. Methods 2020; 181-182:15-23. [PMID: 31513916 DOI: 10.1016/j.ymeth.2019.09.008] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/21/2019] [Revised: 08/21/2019] [Accepted: 09/05/2019] [Indexed: 10/26/2022]
Abstract
RNA binding proteins (RBPs) govern RNA processing from synthesis to decay and play a key role in RNA transport, translation and degradation. Therefore, exploring RBP function from the amino acid sequence using computational methods has become an important topic in genome annotation. However, several challenges remain: (1) Shallow features: although it is well established that sequence determines structure, it is difficult to extract the essential features from the raw sequence. (2) Limited understanding: feature-based prediction methods mainly emphasize feature extraction, while an incomplete understanding of proteins limits the application of feature engineering. (3) Feature fusion: multi-feature fusion is often used, but the features are not well integrated. In view of these challenges, we propose a novel ensemble convolutional neural network (econvRBP) to predict RBPs. To capture the local and global features of RNA binding proteins simultaneously, One Hot and Conjoint Triad encoding methods are first used to transform the amino acid sequence into local and global features, respectively. The local and global features are then combined for further high-level feature extraction using convolutional neural networks. We evaluated our method with 10-fold cross validation, and the results show that it achieves the best performance among all predictors to date. We correctly predicted 99% of 2875 RBPs and 99% of 6782 non-RBPs, with an accuracy of 0.99. In addition, the datasets provided by RBPPred were used to validate our models, giving an accuracy of 0.87. These results indicate that econvRBP is the best-performing method at present and will provide reliable guidance for the detection of RBPs. econvRBP is available at http://47.100.203.218:3389/home.html/.
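The two encodings named in this abstract can be sketched in plain Python. This is an illustration under stated assumptions, not the econvRBP implementation: it uses the seven-class amino acid grouping commonly used for the Conjoint Triad descriptor, normalizes triad counts to frequencies (the abstract does not say which normalization is applied), and the function names `one_hot` and `conjoint_triad` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Seven residue classes commonly used for the Conjoint Triad descriptor
# (assumed here; the abstract does not list the grouping).
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: i for i, group in enumerate(GROUPS) for aa in group}
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq):
    """Per-residue encoding: one 20-dimensional indicator row per position."""
    rows = []
    for aa in seq:
        row = [0] * len(AMINO_ACIDS)
        if aa in AA_TO_INDEX:  # unknown residues stay all-zero
            row[AA_TO_INDEX[aa]] = 1
        rows.append(row)
    return rows

def conjoint_triad(seq):
    """Whole-sequence encoding: 7*7*7 = 343 triad frequencies."""
    counts = [0] * 343
    for i in range(len(seq) - 2):
        window = seq[i:i + 3]
        if all(aa in AA_TO_GROUP for aa in window):  # skip triads with unknowns
            a, b, c = (AA_TO_GROUP[aa] for aa in window)
            counts[a * 49 + b * 7 + c] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 343

local_features = one_hot("AGVRK")          # shape: 5 x 20
global_features = conjoint_triad("AGVRK")  # length 343
print(len(local_features), len(global_features))
```

The one-hot matrix grows with sequence length and keeps positional detail, while the triad vector has a fixed length regardless of the protein, which is consistent with the abstract's description of the two as local and global views.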
Affiliation(s)
- Yuze Zhao
  - School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
- Xiuquan Du
  - School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
|
14
|
Wang K, Hu G, Wu Z, Su H, Yang J, Kurgan L. Comprehensive Survey and Comparative Assessment of RNA-Binding Residue Predictions with Analysis by RNA Type. Int J Mol Sci 2020; 21:E6879. [PMID: 32961749 PMCID: PMC7554811 DOI: 10.3390/ijms21186879] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 08/16/2020] [Revised: 09/15/2020] [Accepted: 09/17/2020] [Indexed: 02/07/2023]
Abstract
With close to 30 sequence-based predictors of RNA-binding residues (RBRs), this comparative survey aims to help with understanding and selection of the appropriate tools. We discuss past reviews on this topic, survey a comprehensive collection of predictors, and comparatively assess six representative methods. We provide a novel and well-designed benchmark dataset and we are the first to report and compare protein-level and datasets-level results, and to contextualize performance to specific types of RNAs. The methods considered here are well-cited and rely on machine learning algorithms on occasion combined with homology-based prediction. Empirical tests reveal that they provide relatively accurate predictions. Virtually all methods perform well for the proteins that interact with rRNAs, some generate accurate predictions for mRNAs, snRNA, SRP and IRES, while proteins that bind tRNAs are predicted poorly. Moreover, except for DRNApred, they confuse DNA and RNA-binding residues. None of the six methods consistently outperforms the others when tested on individual proteins. This variable and complementary protein-level performance suggests that users should not rely on applying just the single best dataset-level predictor. We recommend that future work should focus on the development of approaches that facilitate protein-level selection of accurate predictors and the consensus-based prediction of RBRs.
Affiliation(s)
- Kui Wang
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China
- Gang Hu
  - School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin 300071, China
- Zhonghua Wu
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China
- Hong Su
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China
- Jianyi Yang
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China
- Lukasz Kurgan
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
|
15
|
He J, Tao H, Huang SY. Protein-ensemble-RNA docking by efficient consideration of protein flexibility through homology models. Bioinformatics 2020; 35:4994-5002. [PMID: 31086984 DOI: 10.1093/bioinformatics/btz388] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 01/01/2019] [Revised: 04/28/2019] [Accepted: 05/03/2019] [Indexed: 12/18/2022]
Abstract
MOTIVATION Given the importance of protein-ribonucleic acid (RNA) interactions in many biological processes, a variety of docking algorithms have been developed to predict the complex structure from individual protein and RNA partners in the past decade. However, due to the impact of molecular flexibility, the performance of current methods has hit a bottleneck in realistic unbound docking. Pushing the limit, we have proposed a protein-ensemble-RNA docking strategy to explicitly consider the protein flexibility in protein-RNA docking through an ensemble of multiple protein structures, which is referred to as MPRDock. Instead of taking conformations from MD simulations or experimental structures, we obtained the multiple structures of a protein by building models from its homologous templates in the Protein Data Bank (PDB). RESULTS Our approach can not only avoid the reliability issue of structures from MD simulations but also circumvent the limited number of experimental structures for a target protein in the PDB. Tested on 68 unbound-bound and 18 unbound-unbound protein-RNA complexes, our MPRDock/DITScorePR considerably improved the docking performance and achieved a significantly higher success rate than single-protein rigid docking whether pseudo-unbound templates are included or not. Similar improvements were also observed when combining our ensemble docking strategy with other scoring functions. The present homology model-based ensemble docking approach will have a general application in molecular docking for other interactions. AVAILABILITY AND IMPLEMENTATION http://huanglab.phys.hust.edu.cn/mprdock/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jiahua He
  - Institute of Biophysics, School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Huanyu Tao
  - Institute of Biophysics, School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Sheng-You Huang
  - Institute of Biophysics, School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, China
|
16
|
Qiu L, Zou X. Scoring Functions for Protein-RNA Complex Structure Prediction: Advances, Applications, and Future Directions. Communications in Information and Systems 2020; 20:1-22. [PMID: 33867869 DOI: 10.4310/cis.2020.v20.n1.a1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/22/2022]
Abstract
Protein-RNA interaction is among the most essential biological events in living cells, being involved in protein synthesis, RNA processing and transport, DNA transcription, regulation of gene expression, and many other critical biomolecular activities. A thorough understanding of this interaction is of paramount importance in the fundamental study of a variety of vital cellular processes and in therapeutic applications for a broad range of diseases. Experimental high-resolution 3D structure determination is the primary source of knowledge for protein-RNA complexes. However, due to technical limitations, the existing techniques for experimental structure determination cannot match the demand from fast-growing interest in academia and industry. This problem necessitates alternative high-throughput computational methods for protein-RNA complex structure prediction. Similar to the in silico methods used for protein-protein and protein-DNA interactions, a reliable prediction of protein-RNA complex structure requires a scoring function with commensurate discriminatory power. Derived from determined structures and intended to predict to-be-determined structures, the scoring function is not only a predictive tool but also a gauge of our knowledge of protein-RNA interaction. In this review, we present an overview of the status of existing scoring functions and the scientific principles behind their construction, as well as their strengths and limitations. Finally, we discuss future directions of scoring function development for protein-RNA structure prediction.
Affiliation(s)
- Liming Qiu
  - Dalton Cardiovascular Research Center, University of Missouri, Columbia, Missouri 65211
- Xiaoqin Zou
  - Dalton Cardiovascular Research Center, University of Missouri, Columbia, Missouri 65211
  - Department of Physics & Astronomy, University of Missouri, Columbia, Missouri 65211
  - Department of Biochemistry, University of Missouri, Columbia, Missouri 65211
  - Informatics Institute, University of Missouri, Columbia, Missouri 65211
|
17
|
Chen S, Sun Z, Lin L, Liu Z, Liu X, Chong Y, Lu Y, Zhao H, Yang Y. To Improve Protein Sequence Profile Prediction through Image Captioning on Pairwise Residue Distance Map. J Chem Inf Model 2019; 60:391-399. [PMID: 31800243 DOI: 10.1021/acs.jcim.9b00438] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Indexed: 11/30/2022]
Abstract
Protein sequence profile prediction aims to generate multiple sequences from structural information to advance protein design. Protein sequence profiles can be computationally predicted by energy-based or fragment-based methods. By integrating these methods with neural networks, our previous method, SPIN2, achieved a sequence recovery rate of 34%. However, SPIN2 employed only one-dimensional (1D) structural properties, which are not sufficient to represent three-dimensional (3D) structures. In this study, we represented 3D structures by 2D maps of pairwise residue distances and developed a new method (SPROF) to predict protein sequence profiles based on an image captioning learning framework. To the best of our knowledge, this is the first method to employ a 2D distance map for predicting protein properties. SPROF achieved 39.8% in sequence recovery of residues on the independent test set, representing a 5.2% improvement over SPIN2. We also found that the sequence recovery increased with the number of neighboring residues in 3D structural space, indicating that our method can effectively learn long-range information from the 2D distance map. Thus, such a network architecture using a 2D distance map is expected to be useful for other 3D structure-based applications, such as binding site prediction, protein function prediction, and protein interaction prediction. The online server and the source code are available at http://biomed.nscc-gz.cn and https://github.com/biomed-AI/SPROF, respectively.
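The 2D map of pairwise residue distances described in this abstract can be sketched as follows. This is a minimal illustration rather than SPROF's preprocessing: it assumes one 3D coordinate per residue (for example the Cα atom, which the abstract does not specify), and the function name `distance_map` and the toy coordinates are hypothetical.

```python
import math

def distance_map(coords):
    """Return the symmetric L x L matrix of Euclidean distances
    between per-residue 3D coordinates."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)]
            for i in range(n)]

# Toy coordinates for three residues (hypothetical values).
residues = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 12.0)]
dmap = distance_map(residues)
for row in dmap:
    print(row)
```

Because the map depends only on inter-residue distances, it is unchanged by rotating or translating the structure, which makes it a convenient image-like input for a convolutional model.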
Affiliation(s)
- Sheng Chen
  - School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510000, China
- Zhe Sun
  - School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510000, China
- Lihua Lin
  - School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510000, China
- Zifeng Liu
  - Third Affiliated Hospital of Sun Yat-sen University, Guangzhou 510000, China
- Xun Liu
  - Third Affiliated Hospital of Sun Yat-sen University, Guangzhou 510000, China
- Yutian Chong
  - Third Affiliated Hospital of Sun Yat-sen University, Guangzhou 510000, China
- Yutong Lu
  - School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510000, China
- Huiying Zhao
  - Sun Yat-sen Memorial Hospital, Sun Yat-sen University, Guangzhou 510000, China
- Yuedong Yang
  - School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510000, China
  - Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-sen University) of the Ministry of Education, Guangzhou 510000, China
|
18
|
Gattani S, Mishra A, Hoque MT. StackCBPred: A stacking based prediction of protein-carbohydrate binding sites from sequence. Carbohydr Res 2019; 486:107857. [DOI: 10.1016/j.carres.2019.107857] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 06/25/2019] [Revised: 10/05/2019] [Accepted: 10/23/2019] [Indexed: 11/26/2022]
|
19
|
Poursheikhali Asghari M, Abdolmaleki P. Prediction of RNA- and DNA-Binding Proteins Using Various Machine Learning Classifiers. Avicenna J Med Biotechnol 2019; 11:104-111. [PMID: 30800250 PMCID: PMC6359699] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022]
Abstract
BACKGROUND Nucleic acid-binding proteins play major roles in different biological processes, such as transcription, splicing and translation. Therefore, the nucleic acid-binding function prediction of proteins is a step toward full functional annotation of proteins. The aim of our research was the improvement of nucleic-acid binding function prediction. METHODS In the current study, nine machine-learning algorithms were used to predict RNA- and DNA-binding proteins and also to discriminate between RNA-binding proteins and DNA-binding proteins. The electrostatic features were utilized for prediction of each function in corresponding adapted protein datasets. The leave-one-out cross-validation process was used to measure the performance of employed classifiers. RESULTS Radial basis function classifier gave the best results in predicting RNA- and DNA-binding proteins in comparison with other classifiers applied. In discriminating between RNA- and DNA-binding proteins, multilayer perceptron classifier was the best one. CONCLUSION Our findings show that the prediction of nucleic acid-binding function based on these simple electrostatic features can be improved by applied classifiers. Moreover, a reasonable progress to distinguish between RNA- and DNA-binding proteins has been achieved.
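The leave-one-out cross-validation named in the METHODS can be sketched in a few lines of Python. The classifier here is a plain 1-nearest-neighbor rule chosen only to keep the example self-contained; it is not one of the nine classifiers used in the study, and the function names and toy data are hypothetical.

```python
import math

def nearest_neighbor(train_x, train_y, query):
    """Predict the label of the closest training point (1-NN)."""
    best = min(range(len(train_x)),
               key=lambda j: math.dist(train_x[j], query))
    return train_y[best]

def leave_one_out_accuracy(features, labels, predict):
    """Hold out each sample once, train on the rest, and score the held-out one."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)

# Toy one-dimensional feature values with two well-separated classes.
features = [(0.0,), (0.1,), (1.0,), (1.1,)]
labels = [0, 0, 1, 1]
print(leave_one_out_accuracy(features, labels, nearest_neighbor))
```

Leave-one-out uses every sample for testing exactly once, which suits small datasets, at the cost of training as many models as there are samples.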
Affiliation(s)
- Parviz Abdolmaleki
  - Corresponding author: Parviz Abdolmaleki, Ph.D., Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran, Tel: +98 21 82883404, Fax: +98 21 82884457
|
20
|
Deep-RBPPred: Predicting RNA binding proteins in the proteome scale based on deep learning. Sci Rep 2018; 8:15264. [PMID: 30323214 PMCID: PMC6189057 DOI: 10.1038/s41598-018-33654-x] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.0] [Received: 05/17/2018] [Accepted: 09/28/2018] [Indexed: 12/22/2022]
Abstract
RNA binding proteins (RBPs) play an important role in cellular processes. Identifying RBPs by both computation and experiment is essential. Recently, our group proposed an RBP predictor, RBPPred. However, RBPPred is slow because it needs to generate a PSSM as a feature. Herein, based on the protein features of RBPPred and a Convolutional Neural Network (CNN), we develop a deep learning model called Deep-RBPPred. With the balanced and imbalanced training sets, we obtain the Deep-RBPPred-balance and Deep-RBPPred-imbalance models. Deep-RBPPred has three advantages compared to previous methods. (1) Deep-RBPPred needs only a few physicochemical properties based on protein sequences. (2) Deep-RBPPred runs much faster. (3) Deep-RBPPred has good generalization ability. At the same time, Deep-RBPPred is still as good as the state-of-the-art method. Tested on A. thaliana, S. cerevisiae and H. sapiens proteomes, MCC values are 0.82 (0.82), 0.65 (0.69) and 0.85 (0.80), respectively, for the balance model (imbalance model) when the score cutoff is set to 0.5. On the same testing dataset, different machine learning algorithms (CNN and SVM) are also compared. The results show that the CNN-based model can identify more RBPs than the SVM-based model. In comparing the balance and imbalance models, both the CNN-based and SVM-based models tend to favor the majority class in the imbalanced set. Deep-RBPPred predicts 280 (balance model) and 265 (imbalance model) of 299 new RBPs. The sensitivity of the balance model is about 7% higher than that of the state-of-the-art method. We also apply Deep-RBPPred to 30 eukaryote and 109 bacterial proteomes downloaded from UniProt to estimate all possible RBPs. The estimates show that the rates of RBPs in eukaryote proteomes are much higher than in bacterial proteomes.
|
21
|
Zhao H, Taherzadeh G, Zhou Y, Yang Y. Computational Prediction of Carbohydrate-Binding Proteins and Binding Sites. Curr Protoc Protein Sci 2018; 94:e75. [PMID: 30106511 DOI: 10.1002/cpps.75] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 12/27/2022]
Abstract
Protein-carbohydrate interaction is essential for biological systems, and carbohydrate-binding proteins (CBPs) are important targets when designing antiviral and anticancer drugs. Due to the high cost and difficulty associated with experimental approaches, many computational methods have been developed as complementary approaches to predict CBPs or carbohydrate-binding sites. However, most of these computational methods are not publicly available. Here, we provide a comprehensive review of related studies and demonstrate our two recently developed bioinformatics methods. The method SPOT-CBP is a template-based method for detecting CBPs based on structure through structural homology search combined with a knowledge-based scoring function. This method can yield model complex structure in addition to accurate prediction of CBPs. Furthermore, it has been observed that similarly accurate predictions can be made using structures from homology modeling, which has significantly expanded its applicability. The other method, SPRINT-CBH, is a de novo approach that predicts binding residues directly from protein sequences by using sequence information and predicted structural properties. This approach does not need structurally similar templates and thus is not limited by the current database of known protein-carbohydrate complex structures. These two complementary methods are available at https://sparks-lab.org. © 2018 by John Wiley & Sons, Inc.
Affiliation(s)
- Huiying Zhao
  - Sun Yat-Sen Memorial Hospital, Sun Yat-sen University, Guangzhou, China
- Ghazaleh Taherzadeh
  - School of Information and Communication Technology, Griffith University, Gold Coast, Queensland, Australia
- Yaoqi Zhou
  - School of Information and Communication Technology, Griffith University, Gold Coast, Queensland, Australia
  - Institute for Glycomics, Griffith University, Gold Coast, Queensland, Australia
- Yuedong Yang
  - School of Information and Communication Technology, Griffith University, Gold Coast, Queensland, Australia
  - Institute for Glycomics, Griffith University, Gold Coast, Queensland, Australia
  - School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
|
22
|
Lee CM, Han JI, Kang MH, Kim SG, Park HM. Polymorphism in the serotonin transporter protein gene in Maltese dogs with degenerative mitral valve disease. J Vet Sci 2018; 19:129-135. [PMID: 28693307 PMCID: PMC5799389 DOI: 10.4142/jvs.2018.19.1.129] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 01/11/2017] [Revised: 04/29/2017] [Accepted: 06/21/2017] [Indexed: 11/20/2022]
Abstract
Degenerative mitral valve disease (DMVD) is the most commonly acquired cardiac disease in dogs. This study evaluated the relationship between genetic variations in the serotonin transporter (SERT) gene of Maltese dogs and DMVD. Genomic DNA was extracted from blood samples collected from 20 client-owned DMVD Maltese dogs and 10 healthy control dogs, and each exon of the SERT gene was amplified via polymerase chain reaction. The resulting genetic sequences were aligned and analyzed for variations by comparing with reference sequences; the predicted secondary structures of these variations were modeled and cross-verified by applying computational methods. Genetic variations, including five nonsynonymous genetic variations, were detected in five exons. Protein structure and function of the five nonsynonymous genetic variations were predicted. Three of the five polymorphisms were predicted to be probable causes of damage to protein function and confirmed by protein structure model verification. This study identified six polymorphisms of the SERT gene in Maltese dogs with DMVD, suggesting an association between the SERT gene and canine DMVD. This is the first study of SERT mutation in Maltese dogs with DMVD and is considered a pilot study into clinical genetic examination for early DMVD diagnosis.
Affiliation(s)
- Chang-Min Lee
  - Department of Veterinary Internal Medicine, College of Veterinary Medicine, Konkuk University, Seoul 05030, Korea
- Jae-Ik Han
  - Laboratory of Wildlife Diseases, College of Veterinary Medicine, Chonbuk National University, Iksan 54596, Korea
- Min-Hee Kang
  - Department of Veterinary Internal Medicine, College of Veterinary Medicine, Konkuk University, Seoul 05030, Korea
- Seung-Gon Kim
  - Department of Veterinary Internal Medicine, College of Veterinary Medicine, Konkuk University, Seoul 05030, Korea
- Hee-Myung Park
  - Department of Veterinary Internal Medicine, College of Veterinary Medicine, Konkuk University, Seoul 05030, Korea
|
23
|
Chowdhury S, Zhang J, Kurgan L. In Silico Prediction and Validation of Novel RNA Binding Proteins and Residues in the Human Proteome. Proteomics 2018; 18:e1800064. [PMID: 29806170 DOI: 10.1002/pmic.201800064] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 02/27/2018] [Revised: 05/05/2018] [Indexed: 12/22/2022]
Abstract
Deciphering a complete landscape of protein-RNA interactions in the human proteome remains an elusive challenge. We computationally elucidate RNA binding proteins (RBPs) using an approach that complements previous efforts. We employ two modern complementary sequence-based methods that provide accurate predictions from the structured and the intrinsically disordered sequences, even in the absence of sequence similarity to the known RBPs. We generate and analyze putative RNA binding residues on the whole proteome scale. Using a conservative setting that ensures low, 5% false positive rate, we identify 1511 putative RBPs that include 281 known RBPs and 166 RBPs that were previously predicted. We empirically demonstrate that these overlaps are statistically significant. We also validate the putative RBPs based on two major hallmarks of their RNA binding residues: high levels of evolutionary conservation and enrichment in charged amino acids. Moreover, we show that the novel RBPs are significantly under-annotated functionally which coincides with the fact that they were not yet found to interact with RNAs. We provide two examples of our novel putative RBPs for which there is recent evidence of their interactions with RNAs. The dataset of novel putative RBPs and RNA binding residues for the future hypothesis generation is provided in the Supporting Information.
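The "conservative setting that ensures low, 5% false positive rate" can be illustrated with a short Python sketch that picks a score cutoff from the scores of known negatives. The abstract does not say how the authors calibrated their setting, so this is only one common way to do it; the function name `cutoff_for_fpr` and the toy scores are hypothetical.

```python
def cutoff_for_fpr(negative_scores, max_fpr=0.05):
    """Return a cutoff such that at most max_fpr of the known negatives
    score strictly above it (those would be false positives)."""
    ranked = sorted(negative_scores, reverse=True)
    allowed = int(max_fpr * len(ranked))  # negatives allowed above the cutoff
    return ranked[allowed]

def predict_positive(score, cutoff):
    """Call a protein a putative binder only if it scores above the cutoff."""
    return score > cutoff

# Toy scores for 100 known non-binders (hypothetical values 0..99).
negatives = list(range(100))
cutoff = cutoff_for_fpr(negatives, max_fpr=0.05)
false_positives = sum(predict_positive(s, cutoff) for s in negatives)
print(cutoff, false_positives / len(negatives))
```

Using a strict comparison means that negatives tied with the cutoff are not counted as positives, so the realized false positive rate never exceeds the requested one on the calibration set.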
Affiliation(s)
- Shomeek Chowdhury
  - Dr. Vikram Sarabhai Institute of Cell and Molecular Biology, Maharaja Sayajirao University of Baroda, Gujarat, 390005, India
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, 23284, USA
- Jian Zhang
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, 23284, USA
  - School of Computer and Information Technology, Xinyang Normal University, Xinyang, 464000, P. R. China
- Lukasz Kurgan
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, 23284, USA
|
24
|
Sharan M, Förstner KU, Eulalio A, Vogel J. APRICOT: an integrated computational pipeline for the sequence-based identification and characterization of RNA-binding proteins. Nucleic Acids Res 2017; 45:e96. [PMID: 28334975 PMCID: PMC5499795 DOI: 10.1093/nar/gkx137] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 05/24/2016] [Accepted: 02/27/2017] [Indexed: 11/14/2022]
Abstract
RNA-binding proteins (RBPs) have been established as core components of several post-transcriptional gene regulation mechanisms. Experimental techniques such as cross-linking and co-immunoprecipitation have enabled the identification of RBPs, RNA-binding domains (RBDs) and their regulatory roles in the eukaryotic species such as human and yeast in large-scale. In contrast, our knowledge of the number and potential diversity of RBPs in bacteria is poorer due to the technical challenges associated with the existing global screening approaches. We introduce APRICOT, a computational pipeline for the sequence-based identification and characterization of proteins using RBDs known from experimental studies. The pipeline identifies functional motifs in protein sequences using position-specific scoring matrices and Hidden Markov Models of the functional domains and statistically scores them based on a series of sequence-based features. Subsequently, APRICOT identifies putative RBPs and characterizes them by several biological properties. Here we demonstrate the application and adaptability of the pipeline on large-scale protein sets, including the bacterial proteome of Escherichia coli. APRICOT showed better performance on various datasets compared to other existing tools for the sequence-based prediction of RBPs by achieving an average sensitivity and specificity of 0.90 and 0.91 respectively. The command-line tool and its documentation are available at https://pypi.python.org/pypi/bio-apricot.
Affiliation(s)
- Malvika Sharan
  - Institute of Molecular Infection Biology, University of Würzburg, 97080 Würzburg, Germany
- Konrad U Förstner
  - Core Unit Systems Medicine, University of Würzburg, 97080 Würzburg, Germany
- Ana Eulalio
  - Institute of Molecular Infection Biology, University of Würzburg, 97080 Würzburg, Germany
- Jörg Vogel
  - Institute of Molecular Infection Biology, University of Würzburg, 97080 Würzburg, Germany
|
25
|
Yan J, Kurgan L. DRNApred, fast sequence-based method that accurately predicts and discriminates DNA- and RNA-binding residues. Nucleic Acids Res 2017; 45:e84. [PMID: 28132027 PMCID: PMC5449545 DOI: 10.1093/nar/gkx059] [Citation(s) in RCA: 86] [Impact Index Per Article: 10.8] [Received: 06/10/2016] [Accepted: 01/24/2017] [Indexed: 01/18/2023]
Abstract
Protein-DNA and protein-RNA interactions are part of many diverse and essential cellular functions and yet most of them remain to be discovered and characterized. Recent research shows that sequence-based predictors of DNA-binding residues accurately find these residues but also cross-predict many RNA-binding residues as DNA-binding, and vice versa. Most of these methods are also relatively slow, prohibiting applications on the whole-genome scale. We describe a novel sequence-based method, DRNApred, which accurately and in high-throughput predicts and discriminates between DNA- and RNA-binding residues. DRNApred was designed using a new dataset with both DNA- and RNA-binding proteins, regression that penalizes cross-predictions, and a novel two-layered architecture. DRNApred outperforms state-of-the-art predictors of DNA- or RNA-binding residues on a benchmark test dataset by substantially reducing the cross predictions and predicting arguably higher quality false positives that are located nearby the native binding residues. Moreover, it also more accurately predicts the DNA- and RNA-binding proteins. Application on the human proteome confirms that DRNApred reduces the cross predictions among the native nucleic acid binders. Also, novel putative DNA/RNA-binding proteins that it predicts share similar subcellular locations and residue charge profiles with the known native binding proteins. Webserver of DRNApred is freely available at http://biomine.cs.vcu.edu/servers/DRNApred/.
Affiliation(s)
- Jing Yan
  - Department of Electrical and Computer Engineering, University of Alberta, Edmonton T6G 2V4, Canada
- Lukasz Kurgan
  - Department of Computer Science, Virginia Commonwealth University, Richmond, 23284, USA
|
26
|
Yang Y, Heffernan R, Paliwal K, Lyons J, Dehzangi A, Sharma A, Wang J, Sattar A, Zhou Y. SPIDER2: A Package to Predict Secondary Structure, Accessible Surface Area, and Main-Chain Torsional Angles by Deep Neural Networks. Methods Mol Biol 2017; 1484:55-63. [PMID: 27787820 DOI: 10.1007/978-1-4939-6406-2_6] [Citation(s) in RCA: 105] [Impact Index Per Article: 13.1] [Indexed: 05/13/2023]
Abstract
Predicting one-dimensional structure properties has played an important role in improving the prediction of protein three-dimensional structures and functions. The most commonly predicted properties are secondary structure and accessible surface area (ASA), representing local and nonlocal structural characteristics, respectively. Secondary structure prediction is further complemented by prediction of continuous main-chain torsional angles. Here we describe a newly developed method, SPIDER2, that utilizes three iterations of deep learning neural networks to improve the prediction accuracy of several structural properties simultaneously. For an independent test set of 1199 proteins, SPIDER2 achieves 82% accuracy for secondary structure prediction, 0.76 for the correlation coefficient between predicted and actual solvent accessible surface area, 19° and 30° for mean absolute errors of backbone φ and ψ angles, respectively, and 8° and 32° for mean absolute errors of Cα-based θ and τ angles, respectively. The method provides state-of-the-art, all-in-one accurate prediction of local structure and solvent accessible surface area. The method is implemented as a webserver along with a standalone package, both available on our website: http://sparks-lab.org.
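Mean absolute errors for torsion angles, as reported in this abstract, need some care because angles are periodic. The Python sketch below uses the common periodicity-aware convention in which the error for each residue is the smaller of |d| and 360° − |d|. Whether SPIDER2 uses exactly this convention is an assumption here, and the function name `angular_mae` and the sample angles are hypothetical.

```python
def angular_mae(predicted, actual):
    """Mean absolute error between two lists of angles in degrees,
    taking the shorter way around the circle for each pair."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need two non-empty lists of equal length")
    errors = []
    for p, a in zip(predicted, actual):
        diff = abs(p - a) % 360.0
        errors.append(min(diff, 360.0 - diff))
    return sum(errors) / len(errors)

# Toy angles (hypothetical): 170 and -170 degrees are only 20 degrees apart.
predicted_phi = [170.0, -170.0, 10.0]
actual_phi = [-170.0, 170.0, 20.0]
print(angular_mae(predicted_phi, actual_phi))
```

Without the wrap-around step, the first two pairs would each contribute an error of 340° instead of 20°, which would badly overstate the error for angles near ±180°.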
Affiliation(s)
- Yuedong Yang
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Gold Coast Campus, Science 1 (G24) 2.10, Parklands Drive, Southport, QLD, 4222, Australia
| | - Rhys Heffernan
- Signal Processing Laboratory, School of Engineering, Griffith University, Brisbane, QLD, Australia
| | - Kuldip Paliwal
- Signal Processing Laboratory, School of Engineering, Griffith University, Brisbane, QLD, Australia
| | - James Lyons
- Signal Processing Laboratory, School of Engineering, Griffith University, Brisbane, QLD, Australia
| | - Abdollah Dehzangi
- Department of Psychiatry, Medical Research Center, University of Iowa, Iowa City, IA, USA
| | - Alok Sharma
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, QLD, Australia
- School of Engineering and Physics, University of the South Pacific, Private Mail Bag, Laucala Campus, Suva, Fiji
| | - Jihua Wang
- Shandong Provincial Key Laboratory of Functional Macromolecular Biophysics, Dezhou University, Dezhou, Shandong, China
| | - Abdul Sattar
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, QLD, Australia
- National ICT Australia (NICTA), Brisbane, QLD, Australia
| | - Yaoqi Zhou
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Gold Coast Campus, Science 1 (G24) 2.10, Parklands Drive, Southport, QLD, 4222, Australia.
| |
|
27
|
Kunz M, Wolf B, Schulze H, Atlan D, Walles T, Walles H, Dandekar T. Non-Coding RNAs in Lung Cancer: Contribution of Bioinformatics Analysis to the Development of Non-Invasive Diagnostic Tools. Genes (Basel) 2016; 8:E8. [PMID: 28035947 PMCID: PMC5295003 DOI: 10.3390/genes8010008] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 10/16/2016] [Revised: 12/05/2016] [Accepted: 12/15/2016] [Indexed: 01/11/2023] Open
Abstract
Lung cancer is currently the leading cause of cancer related mortality due to late diagnosis and limited treatment intervention. Non-coding RNAs are not translated into proteins and have emerged as fundamental regulators of gene expression. Recent studies reported that microRNAs and long non-coding RNAs are involved in lung cancer development and progression. Moreover, they appear as new promising non-invasive biomarkers for early lung cancer diagnosis. Here, we highlight their potential as biomarker in lung cancer and present how bioinformatics can contribute to the development of non-invasive diagnostic tools. For this, we discuss several bioinformatics algorithms and software tools for a comprehensive understanding and functional characterization of microRNAs and long non-coding RNAs.
Affiliation(s)
- Meik Kunz
- Functional Genomics and Systems Biology Group, Department of Bioinformatics, Biocenter, University of Wuerzburg, 97074 Wuerzburg, Germany.
| | - Beat Wolf
- Functional Genomics and Systems Biology Group, Department of Bioinformatics, Biocenter, University of Wuerzburg, 97074 Wuerzburg, Germany.
- University of Applied Sciences and Arts of Western Switzerland, Perolles 80, 1700 Fribourg, Switzerland.
| | - Harald Schulze
- Institute of Experimental Biomedicine, University Hospital Wuerzburg, 97080 Wuerzburg, Germany.
| | - David Atlan
- Phenosystems SA, 137 Rue de Tubize, 1440 Braine le Château, Belgium.
| | - Thorsten Walles
- Department of Cardiothoracic Surgery, University Hospital of Wuerzburg, 97080 Wuerzburg, Germany.
| | - Heike Walles
- Department of Tissue Engineering and Regenerative Medicine, University Hospital Wuerzburg, Roentgenring 11, 97070 Wuerzburg, Germany.
- Translational Center Wuerzburg "Regenerative therapies in oncology and musculoskeletal disease" Wuerzburg branch of the Fraunhofer Institute Interfacial Engineering and Biotechnology (IGB), Roentgenring 11, 97070 Wuerzburg, Germany.
| | - Thomas Dandekar
- Functional Genomics and Systems Biology Group, Department of Bioinformatics, Biocenter, University of Wuerzburg, 97074 Wuerzburg, Germany.
- BioComputing Unit, European Molecular Biology Laboratory (EMBL) Heidelberg, Meyerhofstraße 1, 69117 Heidelberg, Germany.
| |
|
28
|
Zhang X, Liu S. RBPPred: predicting RNA-binding proteins from sequence using SVM. Bioinformatics 2016; 33:854-862. [DOI: 10.1093/bioinformatics/btw730] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 06/25/2016] [Accepted: 11/16/2016] [Indexed: 11/13/2022] Open
|
29
|
Marchese D, de Groot NS, Lorenzo Gotor N, Livi CM, Tartaglia GG. Advances in the characterization of RNA-binding proteins. Wiley Interdiscip Rev RNA 2016; 7:793-810. [PMID: 27503141 PMCID: PMC5113702 DOI: 10.1002/wrna.1378] [Citation(s) in RCA: 84] [Impact Index Per Article: 9.3] [Received: 03/28/2016] [Revised: 06/14/2016] [Accepted: 06/23/2016] [Indexed: 12/14/2022]
Abstract
From transcription, to transport, storage, and translation, RNA depends on association with different RNA-binding proteins (RBPs). Methods based on next-generation sequencing and protein mass-spectrometry have started to unveil genome-wide interactions of RBPs but many aspects still remain out of sight. How many of the binding sites identified in high-throughput screenings are functional? A number of computational methods have been developed to analyze experimental data and to obtain insights into the specificity of protein-RNA interactions. How can theoretical models be exploited to identify RBPs? In addition to oligomeric complexes, protein and RNA molecules can associate into granular assemblies whose physical properties are still poorly understood. What protein features promote granule formation and what effects do these assemblies have on cell function? Here, we describe the newest in silico, in vitro, and in vivo advances in the field of protein-RNA interactions. We also present the challenges that experimental and computational approaches will have to face in future studies.
Affiliation(s)
- Domenica Marchese
- Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr. Aiguader 88, 08003 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Natalia Sanchez de Groot
- Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr. Aiguader 88, 08003 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Nieves Lorenzo Gotor
- Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr. Aiguader 88, 08003 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Carmen Maria Livi
- Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr. Aiguader 88, 08003 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
- IFOM Foundation, FIRC Institute of Molecular Oncology Foundation, Milan, Italy
| | - Gian G Tartaglia
- Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr. Aiguader 88, 08003 Barcelona, Spain.
- Universitat Pompeu Fabra (UPF), Barcelona, Spain.
- Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain.
| |
|
30
|
Ghosh P, Sowdhamini R. Genome-wide survey of putative RNA-binding proteins encoded in the human proteome. Mol Biosyst 2016; 12:532-40. [PMID: 26675803 DOI: 10.1039/c5mb00638d] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Indexed: 01/11/2023]
Abstract
RNA-binding proteins (RBPs) are involved in various post-transcriptional gene regulatory processes and are also functionally important members of the ribosome and the spliceosome. However, RBPs and their interactions with RNA are less well-studied in comparison to DNA-binding proteins. We have classified the existing RBP structures, available in complexes with RNA and RNA/DNA hybrids, into different structural families and created Hidden Markov Models (HMMs). These structure-centric family HMMs, along with the sequence-centric family HMMs, were used as a primary database to systematically search the human proteome for the presence of putative RBPs. We have found more than 2600 gene products with RBP signatures in humans, of which around 28% are likely to bind to RNA but not DNA, whereas 9% might bind to both RNA and DNA. 11% of them do not contain an explicit functional annotation yet. Nearly 30% of the putative RBPs are exclusively nuclear, 15% have known disease associations and around 30% are enzymes. Around 40% of the proteins identified in this study are novel and have not been reported by recent large-scale studies on human RBPs.
Affiliation(s)
- Pritha Ghosh
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, Bellary Road, Bangalore, Karnataka 560 065, India.
| | - R Sowdhamini
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, Bellary Road, Bangalore, Karnataka 560 065, India.
| |
|
31
|
Taherzadeh G, Zhou Y, Liew AWC, Yang Y. Sequence-Based Prediction of Protein-Carbohydrate Binding Sites Using Support Vector Machines. J Chem Inf Model 2016; 56:2115-2122. [PMID: 27623166 DOI: 10.1021/acs.jcim.6b00320] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.1] [Indexed: 01/09/2023]
Abstract
Carbohydrate-binding proteins play significant roles in many diseases including cancer. Here, we established a machine-learning-based method (called sequence-based prediction of residue-level interaction sites of carbohydrates, SPRINT-CBH) to predict carbohydrate-binding sites in proteins using support vector machines (SVMs). We found that integrating evolution-derived sequence profiles with additional information on sequence and predicted solvent accessible surface area leads to a reasonably accurate, robust, and predictive method, with area under receiver operating characteristic curve (AUC) of 0.78 and 0.77 and Matthews correlation coefficient of 0.34 and 0.29, respectively for 10-fold cross validation and independent test without balancing binding and nonbinding residues. The quality of the method is further demonstrated by having statistically significantly more binding residues predicted for carbohydrate-binding proteins than presumptive nonbinding proteins in the human proteome, and by the bias of rare alleles toward predicted carbohydrate-binding sites for nonsynonymous mutations from the 1000 genome project. SPRINT-CBH is available as an online server at http://sparks-lab.org/server/SPRINT-CBH.
Affiliation(s)
- Ghazaleh Taherzadeh
- School of Information and Communication Technology and Institute for Glycomics, Griffith University, Parklands Drive, Southport, Queensland 4215, Australia
| | - Yaoqi Zhou
- School of Information and Communication Technology and Institute for Glycomics, Griffith University, Parklands Drive, Southport, Queensland 4215, Australia
| | - Alan Wee-Chung Liew
- School of Information and Communication Technology and Institute for Glycomics, Griffith University, Parklands Drive, Southport, Queensland 4215, Australia
| | - Yuedong Yang
- School of Information and Communication Technology and Institute for Glycomics, Griffith University, Parklands Drive, Southport, Queensland 4215, Australia
| |
|
32
|
Yang Y, Zhan J, Zhou Y. SPOT-Ligand: Fast and effective structure-based virtual screening by binding homology search according to ligand and receptor similarity. J Comput Chem 2016; 37:1734-9. [DOI: 10.1002/jcc.24380] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 11/22/2015] [Revised: 01/12/2016] [Accepted: 03/05/2016] [Indexed: 12/11/2022]
Affiliation(s)
- Yuedong Yang
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr, Southport, QLD 4222, Australia
| | - Jian Zhan
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr, Southport, QLD 4222, Australia
| | - Yaoqi Zhou
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr, Southport, QLD 4222, Australia
| |
|
33
|
Heffernan R, Dehzangi A, Lyons J, Paliwal K, Sharma A, Wang J, Sattar A, Zhou Y, Yang Y. Highly accurate sequence-based prediction of half-sphere exposures of amino acid residues in proteins. Bioinformatics 2015; 32:843-9. [DOI: 10.1093/bioinformatics/btv665] [Citation(s) in RCA: 69] [Impact Index Per Article: 6.9] [Received: 06/30/2015] [Accepted: 11/07/2015] [Indexed: 11/14/2022] Open
Abstract
Motivation: Solvent exposure of amino acid residues of proteins plays an important role in understanding and predicting protein structure, function and interactions. Solvent exposure can be characterized by several measures including solvent accessible surface area (ASA), residue depth (RD) and contact numbers (CN). More recently, an orientation-dependent contact number called half-sphere exposure (HSE) was introduced by separating the contacts within upper and down half spheres defined according to the Cα-Cβ (HSEβ) vector or neighboring Cα-Cα vectors (HSEα). HSEα calculated from protein structures was found to better describe the solvent exposure over ASA, CN and RD in many applications. Thus, a sequence-based prediction is desirable, as most proteins do not have experimentally determined structures. To our best knowledge, there is no method to predict HSEα and only one method to predict HSEβ.
Results: This study developed a novel method for predicting both HSEα and HSEβ (SPIDER-HSE) that achieved a consistent performance for 10-fold cross validation and two independent tests. The correlation coefficients between predicted and measured HSEβ (0.73 for upper sphere, 0.69 for down sphere and 0.76 for contact numbers) for the independent test set of 1199 proteins are significantly higher than existing methods. Moreover, predicted HSEα has a higher correlation coefficient (0.46) to the stability change by residue mutants than predicted HSEβ (0.37) and ASA (0.43). The results, together with its easy Cα-atom-based calculation, highlight the potential usefulness of predicted HSEα for protein structure prediction and refinement as well as function prediction.
Availability and implementation: The method is available at http://sparks-lab.org.
Contact: yuedong.yang@griffith.edu.au or yaoqi.zhou@griffith.edu.au
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Rhys Heffernan
- Signal Processing Laboratory, School of Engineering, Griffith University, Brisbane, Australia
| | - Abdollah Dehzangi
- Signal Processing Laboratory, School of Engineering, Griffith University, Brisbane, Australia
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia
- Medical Research Center (MRC), Department of Psychiatry, University of Iowa, Iowa City, USA
| | - James Lyons
- Signal Processing Laboratory, School of Engineering, Griffith University, Brisbane, Australia
| | - Kuldip Paliwal
- Signal Processing Laboratory, School of Engineering, Griffith University, Brisbane, Australia
| | - Alok Sharma
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia
- School of Engineering and Physics, University of the South Pacific, Private Mail Bag, Laucala Campus, Suva, Fiji
| | - Jihua Wang
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, Shandong 253023, China
| | - Abdul Sattar
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia
- National ICT Australia (NICTA), Brisbane, Australia
| | - Yaoqi Zhou
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, Shandong 253023, China
- Institute for Glycomics and School of Information and Communication Technique, Griffith University, Parklands Dr. Southport, QLD 4222, Australia
| | - Yuedong Yang
- Institute for Glycomics and School of Information and Communication Technique, Griffith University, Parklands Dr. Southport, QLD 4222, Australia
| |
|
34
|
Kumar R, Corbett MA, van Bon BWM, Woenig JA, Weir L, Douglas E, Friend KL, Gardner A, Shaw M, Jolly LA, Tan C, Hunter MF, Hackett A, Field M, Palmer EE, Leffler M, Rogers C, Boyle J, Bienek M, Jensen C, Van Buggenhout G, Van Esch H, Hoffmann K, Raynaud M, Zhao H, Reed R, Hu H, Haas SA, Haan E, Kalscheuer VM, Gecz J. THOC2 Mutations Implicate mRNA-Export Pathway in X-Linked Intellectual Disability. Am J Hum Genet 2015; 97:302-10. [PMID: 26166480 DOI: 10.1016/j.ajhg.2015.05.021] [Citation(s) in RCA: 52] [Impact Index Per Article: 5.2] [Received: 02/20/2015] [Accepted: 05/27/2015] [Indexed: 11/30/2022] Open
Abstract
Export of mRNA from the cell nucleus to the cytoplasm is essential for protein synthesis, a process vital to all living eukaryotic cells. mRNA export is highly conserved and ubiquitous. Mutations affecting mRNA and mRNA processing or export factors, which cause aberrant retention of mRNAs in the nucleus, are thus emerging as contributors to an important class of human genetic disorders. Here, we report that variants in THOC2, which encodes a subunit of the highly conserved TREX mRNA-export complex, cause syndromic intellectual disability (ID). Affected individuals presented with variable degrees of ID and commonly observed features included speech delay, elevated BMI, short stature, seizure disorders, gait disturbance, and tremors. X chromosome exome sequencing revealed four missense variants in THOC2 in four families, including family MRX12, first ascertained in 1971. We show that two variants lead to decreased stability of THOC2 and its TREX-complex partners in cells derived from the affected individuals. Protein structural modeling showed that the altered amino acids are located in the RNA-binding domains of two complex THOC2 structures, potentially representing two different intermediate RNA-binding states of THOC2 during RNA transport. Our results show that disturbance of the canonical molecular pathway of mRNA export is compatible with life but results in altered neuronal development with other comorbidities.
MESH Headings
- Active Transport, Cell Nucleus/genetics
- Amino Acid Sequence
- Base Sequence
- Chromosomes, Human, X/genetics
- Humans
- Mental Retardation, X-Linked/genetics
- Mental Retardation, X-Linked/pathology
- Models, Molecular
- Molecular Sequence Data
- Mutation, Missense/genetics
- Pedigree
- RNA, Messenger/genetics
- RNA, Messenger/metabolism
- RNA-Binding Proteins/chemistry
- RNA-Binding Proteins/genetics
- Sequence Analysis, DNA
- Syndrome
Affiliation(s)
- Raman Kumar
- School of Paediatrics and Reproductive Health, Robinson Research Institute, University of Adelaide, Adelaide, SA 5000, Australia
| | - Mark A Corbett
- School of Paediatrics and Reproductive Health, Robinson Research Institute, University of Adelaide, Adelaide, SA 5000, Australia
| | - Bregje W M van Bon
- School of Paediatrics and Reproductive Health, Robinson Research Institute, University of Adelaide, Adelaide, SA 5000, Australia; Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, 6500 HB Nijmegen, the Netherlands
| | - Joshua A Woenig
- School of Paediatrics and Reproductive Health, Robinson Research Institute, University of Adelaide, Adelaide, SA 5000, Australia
| | - Lloyd Weir
- School of Paediatrics and Reproductive Health, Robinson Research Institute, University of Adelaide, Adelaide, SA 5000, Australia
| | - Evelyn Douglas
- Genetics and Molecular Pathology, SA Pathology, North Adelaide, SA 5006, Australia
| | - Kathryn L Friend
- Genetics and Molecular Pathology, SA Pathology, North Adelaide, SA 5006, Australia
| | - Alison Gardner
- School of Paediatrics and Reproductive Health, Robinson Research Institute, University of Adelaide, Adelaide, SA 5000, Australia
| | - Marie Shaw
- School of Paediatrics and Reproductive Health, Robinson Research Institute, University of Adelaide, Adelaide, SA 5000, Australia
| | - Lachlan A Jolly
- School of Paediatrics and Reproductive Health, Robinson Research Institute, University of Adelaide, Adelaide, SA 5000, Australia
| | - Chuan Tan
- School of Paediatrics and Reproductive Health, Robinson Research Institute, University of Adelaide, Adelaide, SA 5000, Australia
| | - Matthew F Hunter
- Monash Genetics, Monash Medical Centre, Clayton, VIC 3168, Australia; Department of Paediatrics, Monash University, Clayton, VIC 3168, Australia
| | - Anna Hackett
- Genetics of Learning Disability Service, Hunter Genetics, Waratah, NSW 2298, Australia
| | - Michael Field
- Genetics of Learning Disability Service, Hunter Genetics, Waratah, NSW 2298, Australia
| | - Elizabeth E Palmer
- Genetics of Learning Disability Service, Hunter Genetics, Waratah, NSW 2298, Australia
| | - Melanie Leffler
- Genetics of Learning Disability Service, Hunter Genetics, Waratah, NSW 2298, Australia
| | - Carolyn Rogers
- Genetics of Learning Disability Service, Hunter Genetics, Waratah, NSW 2298, Australia
| | - Jackie Boyle
- Genetics of Learning Disability Service, Hunter Genetics, Waratah, NSW 2298, Australia
| | - Melanie Bienek
- Department of Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Ihnestrasse 63-73, 14195 Berlin, Germany
| | - Corinna Jensen
- Department of Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Ihnestrasse 63-73, 14195 Berlin, Germany
| | | | - Hilde Van Esch
- Center for Human Genetics, University Hospitals Leuven, Leuven 3000, Belgium
| | - Katrin Hoffmann
- Institute of Human Genetics, Martin Luther University Halle-Wittenberg, Magdeburger Strasse 2, 06112 Halle (Saale), Germany
| | - Martine Raynaud
- INSERM U930, Imaging and Brain, François-Rabelais University, 37000 Tours, France; INSERM U930, Service de Génétique, Centre Hospitalier Régional Universitaire, 37000 Tours, France
| | - Huiying Zhao
- QIMR Berghofer Medical Research Institute, Brisbane, QLD 4029, Australia
| | - Robin Reed
- Department of Cell Biology, Harvard Medical School, Harvard University, Boston, MA 02115, USA
| | - Hao Hu
- Department of Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Ihnestrasse 63-73, 14195 Berlin, Germany
| | - Stefan A Haas
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestrasse 63-73, 14195 Berlin, Germany
| | - Eric Haan
- School of Paediatrics and Reproductive Health, Robinson Research Institute, University of Adelaide, Adelaide, SA 5000, Australia; South Australian Clinical Genetics Service, SA Pathology, North Adelaide, SA 5006, Australia
| | - Vera M Kalscheuer
- Department of Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Ihnestrasse 63-73, 14195 Berlin, Germany
| | - Jozef Gecz
- School of Paediatrics and Reproductive Health, Robinson Research Institute, University of Adelaide, Adelaide, SA 5000, Australia; School of Molecular and Biomedical Sciences, University of Adelaide, Adelaide, SA 5005, Australia.
| |
|
35
|
SNBRFinder: A Sequence-Based Hybrid Algorithm for Enhanced Prediction of Nucleic Acid-Binding Residues. PLoS One 2015; 10:e0133260. [PMID: 26176857 PMCID: PMC4503397 DOI: 10.1371/journal.pone.0133260] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 03/20/2015] [Accepted: 06/25/2015] [Indexed: 11/19/2022] Open
Abstract
Protein-nucleic acid interactions are central to various fundamental biological processes. Automated methods capable of reliably identifying DNA- and RNA-binding residues in protein sequence are assuming ever-increasing importance. The majority of current algorithms rely on feature-based prediction, but their accuracy remains to be further improved. Here we propose a sequence-based hybrid algorithm SNBRFinder (Sequence-based Nucleic acid-Binding Residue Finder) by merging a feature predictor SNBRFinderF and a template predictor SNBRFinderT. SNBRFinderF was established using the support vector machine whose inputs include sequence profile and other complementary sequence descriptors, while SNBRFinderT was implemented with the sequence alignment algorithm based on profile hidden Markov models to capture the weakly homologous template of query sequence. Experimental results show that SNBRFinderF was clearly superior to the commonly used sequence profile-based predictor and SNBRFinderT can achieve comparable performance to the structure-based template methods. Leveraging the complementary relationship between these two predictors, SNBRFinder reasonably improved the performance of both DNA- and RNA-binding residue predictions. More importantly, the sequence-based hybrid prediction reached competitive performance relative to our previous structure-based counterpart. Our extensive and stringent comparisons show that SNBRFinder has obvious advantages over the existing sequence-based prediction algorithms. The value of our algorithm is highlighted by establishing an easy-to-use web server that is freely accessible at http://ibi.hzau.edu.cn/SNBRFinder.
|
36
|
Yan J, Friedrich S, Kurgan L. A comprehensive comparative review of sequence-based predictors of DNA- and RNA-binding residues. Brief Bioinform 2015; 17:88-105. [DOI: 10.1093/bib/bbv023] [Citation(s) in RCA: 70] [Impact Index Per Article: 7.0] [Received: 12/19/2014] [Indexed: 01/07/2023] Open
|
37
|
Yang XX, Deng ZL, Liu R. RBRDetector: Improved prediction of binding residues on RNA-binding protein structures using complementary feature- and template-based strategies. Proteins 2014; 82:2455-71. [DOI: 10.1002/prot.24610] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 02/02/2014] [Revised: 04/28/2014] [Accepted: 05/09/2014] [Indexed: 11/05/2022]
Affiliation(s)
- Xiao-Xia Yang
- Agricultural Bioinformatics Key Laboratory of Hubei Province; College of Informatics; Huazhong Agricultural University; Wuhan 430070 People's Republic of China
| | - Zhi-Luo Deng
- Agricultural Bioinformatics Key Laboratory of Hubei Province; College of Informatics; Huazhong Agricultural University; Wuhan 430070 People's Republic of China
| | - Rong Liu
- Agricultural Bioinformatics Key Laboratory of Hubei Province; College of Informatics; Huazhong Agricultural University; Wuhan 430070 People's Republic of China
| |
|
38
|
Zhao H, Wang J, Zhou Y, Yang Y. Predicting DNA-binding proteins and binding residues by complex structure prediction and application to human proteome. PLoS One 2014; 9:e96694. [PMID: 24792350 PMCID: PMC4008587 DOI: 10.1371/journal.pone.0096694] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Received: 01/21/2014] [Accepted: 04/10/2014] [Indexed: 12/25/2022] Open
Abstract
As more and more protein sequences are uncovered from increasingly inexpensive sequencing techniques, an urgent task is to find their functions. This work presents a highly reliable computational technique for predicting DNA-binding function at the level of protein-DNA complex structures, rather than low-resolution two-state prediction of DNA-binding as most existing techniques do. The method first predicts protein-DNA complex structure by utilizing the template-based structure prediction technique HHblits, followed by binding affinity prediction based on a knowledge-based energy function (Distance-scaled finite ideal-gas reference state for protein-DNA interactions). A leave-one-out cross validation of the method based on 179 DNA-binding and 3797 non-binding protein domains achieves a Matthews correlation coefficient (MCC) of 0.77 with high precision (94%) and high sensitivity (65%). We further found 51% sensitivity for 82 newly determined structures of DNA-binding proteins and 56% sensitivity for the human proteome. In addition, the method provides a reasonably accurate prediction of DNA-binding residues in proteins based on predicted DNA-binding complex structures. Its application to human proteome leads to more than 300 novel DNA-binding proteins; some of these predicted structures were validated by known structures of homologous proteins in APO forms. The method [SPOT-Seq (DNA)] is available as an on-line server at http://sparks-lab.org.
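As a rough consistency check on the figures quoted in this abstract, the sketch below rebuilds an approximate confusion matrix from the reported set sizes (179 DNA-binding, 3797 non-binding domains), sensitivity (65%) and precision (94%), then computes the Matthews correlation coefficient. It is not the authors' code: the function names are hypothetical, and because the published percentages are rounded, the reconstructed counts are only estimates.

```python
import math


def counts_from_rates(n_pos, n_neg, sensitivity, precision):
    """Estimate (TP, FP, TN, FN) from class sizes and rounded rates.

    Assumption: the reported percentages are rounded, so the counts
    returned here are approximations, not the published values.
    """
    tp = round(sensitivity * n_pos)               # true positives
    fn = n_pos - tp                               # missed positives
    fp = round(tp * (1 - precision) / precision)  # from precision = TP / (TP + FP)
    tn = n_neg - fp                               # remaining negatives
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient for a 2x2 confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Figures quoted in the abstract above.
tp, fp, tn, fn = counts_from_rates(179, 3797, sensitivity=0.65, precision=0.94)
value = mcc(tp, fp, tn, fn)
print(f"TP={tp} FP={fp} TN={tn} FN={fn} MCC={value:.2f}")
```

Under these assumptions the estimate agrees with the reported MCC of 0.77 to two decimal places, which suggests the three quoted metrics are mutually consistent.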
Affiliation(s)
- Huiying Zhao
- School of Informatics, Indiana University Purdue University, Indianapolis, Indiana, United States of America
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana, United States of America
- QIMR Berghofer Medical Research Institute, Brisbane, Queensland, Australia
| | - Jihua Wang
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana, United States of America
- Shandong Provincial Key Laboratory of Functional Macromolecular Biophysics, Dezhou University, Dezhou, Shandong, China
| | - Yaoqi Zhou
- School of Informatics, Indiana University Purdue University, Indianapolis, Indiana, United States of America
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana, United States of America
- Shandong Provincial Key Laboratory of Functional Macromolecular Biophysics, Dezhou University, Dezhou, Shandong, China
- Institute for Glycomics and School of Information and Communication Technique, Griffith University, Southport, Queensland, Australia
- * E-mail: (YZ); (YY)
| | - Yuedong Yang
- School of Informatics, Indiana University Purdue University, Indianapolis, Indiana, United States of America
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana, United States of America
- Institute for Glycomics and School of Information and Communication Technique, Griffith University, Southport, Queensland, Australia
- * E-mail: (YZ); (YY)
| |
|
39
|
Zhao H, Yang Y, Janga SC, Kao CC, Zhou Y. Prediction and validation of the unexplored RNA-binding protein atlas of the human proteome. Proteins 2014; 82:640-7. [PMID: 24123256 PMCID: PMC3949140 DOI: 10.1002/prot.24441] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 06/21/2013] [Revised: 09/13/2013] [Accepted: 09/26/2013] [Indexed: 12/13/2022]
Abstract
Detecting protein-RNA interactions is challenging both experimentally and computationally because RNAs are large in number, diverse in cellular location and function, and flexible in structure. As a result, many RNA-binding proteins (RBPs) remain to be identified. Here, a template-based function-prediction technique for RBPs, SPOT-Seq, is applied to the human proteome, and its results are validated against a recent proteomic experimental discovery of 860 mRNA-binding proteins (mRBPs). The coverage (or sensitivity) is 42.6% for 1217 known RBPs annotated in the Gene Ontology and 43.6% for 860 newly discovered human mRBPs. This consistent sensitivity indicates the robust performance of SPOT-Seq for predicting RBPs. More importantly, SPOT-Seq detects 2418 novel RBPs in the human proteome, 291 of which were validated by the newly discovered mRBP set. Among the 291 validated novel RBPs, 61 are not homologous to any known RBPs. Successful validation of the predicted novel RBPs permits further analysis of their phenotypic roles in disease pathways. The dataset of 2418 predicted novel RBPs, along with confidence levels and complex structures, is available at http://sparks-lab.org (in publications) for experimental confirmation and hypothesis generation.
Affiliation(s)
- Huiying Zhao
- School of Informatics, Indiana University Purdue University, Indianapolis, Indiana, Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 719 Indiana Ave Ste 319, Walker Plaza Building, Indianapolis, Indiana 46202, USA
| | - Yuedong Yang
- School of Informatics, Indiana University Purdue University, Indianapolis, Indiana, Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 719 Indiana Ave Ste 319, Walker Plaza Building, Indianapolis, Indiana 46202, USA
- Institute for Glycomics and School of Informatics and Communication Technology, Griffith University, Parklands Dr., Southport, QLD4215, Australia
| | - Sarath Chandra Janga
- School of Informatics, Indiana University Purdue University, Indianapolis, Indiana, Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 719 Indiana Ave Ste 319, Walker Plaza Building, Indianapolis, Indiana 46202, USA
| | - C. Cheng Kao
- Department of Molecular & Cellular Biochemistry, Indiana University, Bloomington, Indiana, 47405, USA
| | - Yaoqi Zhou
- School of Informatics, Indiana University Purdue University, Indianapolis, Indiana, Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 719 Indiana Ave Ste 319, Walker Plaza Building, Indianapolis, Indiana 46202, USA
- Institute for Glycomics and School of Informatics and Communication Technology, Griffith University, Parklands Dr., Southport, QLD4215, Australia
| |
|
40
|
Huang SY, Zou X. A knowledge-based scoring function for protein-RNA interactions derived from a statistical mechanics-based iterative method. Nucleic Acids Res 2014; 42:e55. [PMID: 24476917 PMCID: PMC3985650 DOI: 10.1093/nar/gku077] [Citation(s) in RCA: 113] [Impact Index Per Article: 10.3] [Indexed: 01/15/2023] Open
Abstract
Protein-RNA interactions play important roles in many biological processes. Given the high cost and technical difficulties of experimental methods, computational prediction of binding complexes from individual protein and RNA structures is urgently needed, and a reliable scoring function is one of its critical components. Here, we have developed a knowledge-based scoring function, referred to as ITScore-PR, for protein-RNA binding mode prediction by using a statistical mechanics-based iterative method. The pairwise distance-dependent atomic interaction potentials of ITScore-PR were derived from experimentally determined protein–RNA complex structures. For validation, we have compared ITScore-PR with 10 other scoring methods on four diverse test sets. For bound docking, ITScore-PR achieved a success rate of up to 86% if the top prediction was considered and up to 94% if the top 10 predictions were considered. For truly unbound docking, the respective success rates of ITScore-PR were up to 24% and 46%. ITScore-PR can be used stand-alone or easily implemented in other docking programs for protein–RNA recognition.
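The top-1 and top-10 success rates quoted in the abstract above can be illustrated with a small sketch. It assumes that a docking run counts as a success when at least one of the top N ranked poses falls within an RMSD cutoff of the native complex; the abstract does not state the criterion, so the 5.0 Å cutoff, the function name `top_n_success_rate`, and the toy RMSD values are all hypothetical and not taken from the paper.

```python
def top_n_success_rate(ranked_rmsds_per_target, n, rmsd_cutoff=5.0):
    """Fraction of targets with at least one acceptable pose among the top n.

    ranked_rmsds_per_target: one list per target, holding the RMSD (in
    angstroms) of each predicted pose, ordered from best to worst score.
    rmsd_cutoff: assumed success threshold (hypothetical, not from the paper).
    """
    if not ranked_rmsds_per_target:
        return 0.0
    hits = sum(
        1
        for rmsds in ranked_rmsds_per_target
        if any(r <= rmsd_cutoff for r in rmsds[:n])
    )
    return hits / len(ranked_rmsds_per_target)


# Hypothetical toy data: four targets, three ranked poses each.
targets = [
    [2.1, 9.0, 12.3],   # acceptable pose ranked first
    [8.4, 3.2, 15.0],   # acceptable pose ranked second
    [11.0, 10.5, 9.8],  # no acceptable pose
    [4.9, 6.0, 7.5],    # acceptable pose ranked first
]

top1 = top_n_success_rate(targets, 1)
top10 = top_n_success_rate(targets, 10)
print(f"top-1 success rate:  {top1:.2f}")
print(f"top-10 success rate: {top10:.2f}")
```

Because the top-N set always contains the top-1 pose, the top-10 rate can never be lower than the top-1 rate, which matches the ordering of the figures reported for both bound and unbound docking.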
Affiliation(s)
- Sheng-You Huang
- Department of Physics and Astronomy, Department of Biochemistry, Dalton Cardiovascular Research Center, and Informatics Institute, University of Missouri, Columbia, MO 65211, USA
| | | |
|
41
|
Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naïve Bayes. PLoS One 2014; 9:e86703. [PMID: 24475169 PMCID: PMC3901691 DOI: 10.1371/journal.pone.0086703] [Citation(s) in RCA: 115] [Impact Index Per Article: 10.5] [Received: 09/29/2013] [Accepted: 12/10/2013] [Indexed: 11/22/2022] Open
Abstract
Developing an efficient method for identifying DNA-binding proteins is highly desirable because of their vital roles in gene regulation, and it would be invaluable for advancing our understanding of protein functions. In this study, we propose a new method for the prediction of DNA-binding proteins that performs feature ranking using random forest and wrapper-based feature selection using a forward best-first search strategy. The features comprise information from the primary sequence, predicted secondary structure, predicted relative solvent accessibility, and position specific scoring matrix. The proposed method, called DBPPred, uses Gaussian naïve Bayes as the underlying classifier since it outperformed five other classifiers: decision tree, logistic regression, k-nearest neighbor, support vector machine with polynomial kernel, and support vector machine with radial basis function. DBPPred yields the highest average accuracy of 0.791 and average MCC of 0.583 according to five-fold cross validation with ten runs on the training benchmark dataset PDB594. Subsequently, blind tests on the independent dataset PDB186 were performed with the proposed model trained on the entire PDB594 dataset and with five other existing methods (iDNA-Prot, DNA-Prot, DNAbinder, DNABIND and DBD-Threader); DBPPred yielded the highest accuracy of 0.769, MCC of 0.538, and AUC of 0.790. Independent tests of DBPPred on a large non-DNA-binding protein dataset and two RNA-binding protein datasets also showed improved or comparable quality when compared with the relevant prediction methods. Moreover, we observed that for the majority of the features selected by the proposed method, the mean feature values differ with statistical significance between the DNA-binding and the non-DNA-binding proteins. All of the experimental results indicate that the proposed DBPPred can be an alternative predictor for large-scale determination of DNA-binding proteins.
|
42
|
Yang Y, Zhao H, Wang J, Zhou Y. SPOT-Seq-RNA: predicting protein-RNA complex structure and RNA-binding function by fold recognition and binding affinity prediction. Methods Mol Biol 2014; 1137:119-30. [PMID: 24573478 PMCID: PMC3937850 DOI: 10.1007/978-1-4939-0366-5_9] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Indexed: 12/13/2022]
Abstract
RNA-binding proteins (RBPs) play key roles in RNA metabolism and post-transcriptional regulation. Computational methods have been developed separately for prediction of RBPs and RNA-binding residues by machine-learning techniques and prediction of protein-RNA complex structures by rigid or semiflexible structure-to-structure docking. Here, we describe a template-based technique called SPOT-Seq-RNA that integrates prediction of RBPs, RNA-binding residues, and protein-RNA complex structures into a single package. This integration is achieved by combining template-based structure-prediction software, SPARKS X, with binding affinity prediction software, DRNA. This tool yields reasonable sensitivity (46%) and high precision (84%) for an independent test set of 215 RBPs and 5,766 non-RBPs. SPOT-Seq-RNA is computationally efficient for genome-scale prediction of RBPs and protein-RNA complex structures. Its application to the human genome has revealed a similar sensitivity and an ability to uncover hundreds of novel RBPs beyond simple homology. The online server and downloadable version of SPOT-Seq-RNA are available at http://sparks-lab.org/server/SPOT-Seq-RNA/.
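The sensitivity and precision reported in the abstract above, together with the test-set sizes (215 RBPs and 5,766 non-RBPs), are enough to back out an approximate confusion matrix. The sketch below does that arithmetic and derives specificity and the Matthews correlation coefficient (MCC) from it. The counts are rounded reconstructions from the two-digit percentages, not figures reported by the authors, and the function names are hypothetical.

```python
import math


def confusion_from_rates(n_pos, n_neg, sensitivity, precision):
    """Approximate (TP, FP, TN, FN) from class sizes and two reported rates."""
    tp = round(sensitivity * n_pos)  # sensitivity = TP / (TP + FN)
    fn = n_pos - tp
    # precision = TP / (TP + FP)  =>  FP = TP * (1 - precision) / precision
    fp = round(tp * (1.0 - precision) / precision)
    tn = n_neg - fp
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Values taken from the abstract: 215 RBPs, 5,766 non-RBPs,
# sensitivity 46%, precision 84%.
tp, fp, tn, fn = confusion_from_rates(215, 5766, 0.46, 0.84)
specificity = tn / (tn + fp)

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"specificity={specificity:.4f}  MCC={mcc(tp, fp, tn, fn):.3f}")
```

With so few positives relative to negatives, a precision of 84% implies only a handful of false positives, so the derived specificity is very close to 1; this is why precision and MCC are more informative than specificity on a test set this imbalanced.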
Affiliation(s)
- Yuedong Yang
- School of Informatics, Indiana University Purdue University, Indianapolis, IN, USA
| | | | | | | |
|
43
|
Zhao H, Yang Y, Zhou Y. Prediction of RNA binding proteins comes of age from low resolution to high resolution. Mol Biosyst 2013; 9:2417-25. [PMID: 23872922 PMCID: PMC3870025 DOI: 10.1039/c3mb70167k] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.6] [Indexed: 11/21/2022]
Abstract
The network of protein-RNA interactions is likely to be larger than protein-protein and protein-DNA interaction networks because RNA transcripts are encoded tens of times more than proteins (e.g. only 3% of the human genome codes for proteins), have diverse functions and localizations, and are controlled by proteins from birth (transcription) to death (degradation). This massive network is evidenced by several recent experimental discoveries of large numbers of previously unknown RNA-binding proteins (RBPs). Meanwhile, more than 400 non-redundant protein-RNA complex structures (at 25% sequence identity or less) have been deposited into the Protein Data Bank. These sequence and structural resources for RBPs provide ample data for the development of computational techniques dedicated to RBP prediction, as experimentally determining RNA-binding functions is time-consuming and expensive. This review compares traditional machine-learning based approaches with emerging template-based methods at several levels of prediction resolution, ranging from two-state binding/non-binding prediction to binding residue prediction and protein-RNA complex structure prediction. The analysis indicates that the two approaches are complementary and that their combination may lead to further improvements.
Affiliation(s)
- Huiying Zhao
- School of Informatics, Indiana University Purdue University, Indianapolis, Indiana 46202, USA.
| | | | | |
|
44
|
Janga SC. From specific to global analysis of posttranscriptional regulation in eukaryotes: posttranscriptional regulatory networks. Brief Funct Genomics 2012; 11:505-21. [PMID: 23124862 DOI: 10.1093/bfgp/els046] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 12/13/2022] Open
Abstract
Regulation of gene expression occurs at several levels in eukaryotic organisms and is a highly controlled process. Although RNAs have been traditionally viewed as passive molecules in the pathway from transcription to translation, there is mounting evidence that their metabolism is controlled by a class of proteins called RNA-binding proteins (RBPs), as well as a number of small RNAs. In this review, I provide an overview of the recent developments in our understanding of the repertoire of RBPs across diverse model systems, and discuss the computational and experimental approaches currently available for the construction of posttranscriptional networks governed by them. I also present an overview of the different roles played by RBPs in the cellular context, based on their cis-regulatory modules identified in the literature, and discuss how their interplay can result in the dynamic, spatial and tissue-specific expression maps of RNAs. I finally present the concept of a posttranscriptional network of RBPs and their cognate RNA targets and discuss their cross-talk with other important posttranscriptional regulatory molecules such as microRNAs, resulting in diverse functional network motifs. I argue that with rapid developments in the genome-wide elucidation of posttranscriptional networks, it would not only be possible to gain a deeper understanding of regulation at a level that has been under-appreciated in the past, but would also allow us to use the newly developed high-throughput approaches to interrogate the prevalence of these phenomena in different states, and thereby study their relevance to physiology and disease across organisms.
Affiliation(s)
- Sarath Chandra Janga
- School of Informatics, Indiana University Purdue University, Indianapolis, Indiana, Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 719 Indiana Ave Ste 319, Walker Plaza Building, IN 46202, USA.
| |
|
45
|
Cirillo D, Agostini F, Tartaglia GG. Predictions of protein-RNA interactions. Wiley Interdiscip Rev Comput Mol Sci 2012. [DOI: 10.1002/wcms.1119] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Indexed: 12/30/2022]
|
46
|
Abstract
A key challenge of modern biology is to uncover the functional role of the protein entities that compose cellular proteomes. To this end, the availability of reliable three-dimensional atomic models of proteins is often crucial. This protocol presents a community-wide web-based method using RaptorX (http://raptorx.uchicago.edu/) for protein secondary structure prediction, template-based tertiary structure modeling, alignment quality assessment and sophisticated probabilistic alignment sampling. RaptorX distinguishes itself from other servers by the quality of the alignment between a target sequence and one or multiple distantly related template proteins (especially those with sparse sequence profiles) and by a novel nonlinear scoring function and a probabilistic-consistency algorithm. Consequently, RaptorX delivers high-quality structural models for many targets with only remote templates. At present, it takes RaptorX ~35 min to finish processing a sequence of 200 amino acids. Since its official release in August 2011, RaptorX has processed ~6,000 sequences submitted by ~1,600 users from around the world.
|
47
|
Yang Y, Zhan J, Zhao H, Zhou Y. A new size-independent score for pairwise protein structure alignment and its application to structure classification and nucleic-acid binding prediction. Proteins 2012; 80:2080-8. [PMID: 22522696 DOI: 10.1002/prot.24100] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.6] [Received: 02/06/2012] [Revised: 04/13/2012] [Accepted: 04/17/2012] [Indexed: 11/12/2022]
Abstract
A structure alignment program aligns two structures by optimizing a scoring function that measures structural similarity. It is highly desirable that such a scoring function be independent of the sizes of the proteins being compared, so that the significance of an alignment is comparable across aligned protein regions of different sizes. Here, we developed a new score called SP-score that fixes the cutoff distance at 4 Å and removes the size dependence using a normalization prefactor. We further built a program called SPalign that optimizes SP-score for structure alignment. SPalign was applied to recognize proteins within the same structure fold and having the same function of DNA or RNA binding. For fold discrimination, SPalign improves sensitivity over TMalign for the chain-level comparison by 12% and over DALI for the domain-level comparison by 13% at the same specificity of 99.6%. The difference between TMalign and SPalign at the chain level is due to the inability of TMalign to detect single-domain similarity between multidomain proteins. For recognizing nucleic acid binding proteins, SPalign consistently improves over TMalign by 12% and over DALI by 31% in the average value of Matthews correlation coefficients for four datasets. SPalign with default settings is 14% faster than TMalign. SPalign is expected to be useful for function prediction and for comparing structures with or without domains defined. The source code for SPalign and the server are available at http://sparks.informatics.iupui.edu.
Affiliation(s)
- Yuedong Yang
- Indiana University School of Informatics, Indiana University-Purdue University, Indianapolis, Indiana 46202, USA
| | | | | | | |
|
48
|
Lu T, Yang Y, Yao B, Liu S, Zhou Y, Zhang C. Template-based structure prediction and classification of transcription factors in Arabidopsis thaliana. Protein Sci 2012; 21:828-38. [PMID: 22549903 DOI: 10.1002/pro.2066] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 12/19/2011] [Revised: 03/14/2012] [Accepted: 03/16/2012] [Indexed: 11/11/2022]
Abstract
Transcription factors (TFs) play important roles in plants. However, there is no systematic study of the structures and functions of most TFs in plants. Here, we performed template-based structure prediction for all TFs in Arabidopsis thaliana, using their full-length sequences as well as their C-terminal and N-terminal regions. A total of 2918 model structures were obtained with a high confidence score. We find that TF families employ only a small number of templates for DNA-binding domains (DBD) but a diverse set of templates for transcription regulatory domains (TRD). Although TF families are classified according to DBD, their sizes have a significant correlation with the number of unique non-DNA-binding templates employed in the family (Pearson correlation coefficient of 0.74). That is, the size of a TF family is related to its functional diversity. Network analysis reveals new connections between TF families based on shared TRD or DBD templates; 81% of TF families share DBD templates and 67% share TRD templates. Two large fully connected family clusters in this network are observed along with 69 island families. In addition, 25 genes with unknown functions are found to be DNA-binding proteins and/or TFs according to predicted structures. This work provides a global view of the classification of TFs based on their DBD or TRD templates, and hence a deeper understanding of DNA-binding and regulatory functions from a structural perspective. All structural models of TFs are deposited in the online database for public usage at http://sysbio.unl.edu/AthTF.
Affiliation(s)
- Tao Lu
- School of Biological Sciences, Center for Plant Science and Innovation, University of Nebraska, Lincoln, Nebraska 68588, USA
| | | | | | | | | | | |
|