1
Patiyal S, Dhall A, Bajaj K, Sahu H, Raghava GPS. Prediction of RNA-interacting residues in a protein using CNN and evolutionary profile. Brief Bioinform 2023; 24:6901899. [PMID: 36516298 DOI: 10.1093/bib/bbac538] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 06/08/2022] [Revised: 09/28/2022] [Accepted: 11/08/2022] [Indexed: 12/15/2022] Open
Abstract
This paper describes Pprint2, an improved version of Pprint, a method developed for predicting RNA-interacting residues in a protein. The training and independent/validation datasets used in this study comprise 545 and 161 non-redundant RNA-binding proteins, respectively. All models were trained on the training dataset and evaluated on the validation dataset. Preliminary analysis reveals that positively charged amino acids such as H, R and K are more prominent among the RNA-interacting residues. Initially, machine learning-based models were developed using a binary profile and obtained a maximum area under the curve (AUC) of 0.68 on the validation dataset. Performance improved significantly, from AUC 0.68 to 0.76, when an evolutionary profile was used instead of the binary profile. The performance of the evolutionary profile-based model improved further, from AUC 0.76 to 0.82, when a convolutional neural network was used to develop the model. The final model, a convolutional neural network using evolutionary information, achieved an AUC of 0.82 with a Matthews correlation coefficient of 0.49 on the validation dataset. The best model outperforms existing methods when evaluated on the independent/validation dataset. User-friendly standalone software and a web server named 'Pprint2' have been developed for predicting RNA-interacting residues (https://webs.iiitd.edu.in/raghava/pprint2 and https://github.com/raghavagps/pprint2).
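The "binary profile" mentioned in the abstract is commonly understood as a one-hot encoding of the residues in a sliding window centred on each position. The following is a minimal sketch under that assumption; the window size, the padding symbol `X`, and the function names are hypothetical and are not taken from Pprint2 itself.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "X"  # hypothetical padding symbol for window positions beyond the termini


def one_hot(residue):
    """Return a 21-dimensional 0/1 vector: 20 amino acids plus one extra slot.

    The extra slot is used for padding and for any non-standard residue.
    """
    vec = [0] * (len(AMINO_ACIDS) + 1)
    idx = AMINO_ACIDS.find(residue)
    vec[idx if idx >= 0 else len(AMINO_ACIDS)] = 1
    return vec


def binary_profile_windows(sequence, window=5):
    """Encode each residue by the concatenated one-hot vectors of its window."""
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    padded = PAD * half + sequence + PAD * half
    features = []
    for i in range(len(sequence)):
        flat = []
        for residue in padded[i:i + window]:
            flat.extend(one_hot(residue))
        features.append(flat)
    return features


# Toy usage: four residues, window of three, so 3 * 21 = 63 features each.
feats = binary_profile_windows("MKRH", window=3)
```

An evolutionary profile would replace each one-hot vector with the corresponding row of a position-specific scoring matrix, keeping the same windowing scheme.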
Affiliation(s)
- Sumeet Patiyal
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
- Anjali Dhall
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
- Khushboo Bajaj
  - Department of Computer Science and Engineering, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
- Harshita Sahu
  - Department of Computer Science and Engineering, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
- Gajendra P S Raghava
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
2
Liu Y, Gong W, Yang Z, Li C. SNB-PSSM: A spatial neighbor-based PSSM used for protein-RNA binding site prediction. J Mol Recognit 2021; 34:e2887. [PMID: 33442949 DOI: 10.1002/jmr.2887] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 06/21/2020] [Revised: 12/22/2020] [Accepted: 12/23/2020] [Indexed: 01/09/2023]
Abstract
Protein-RNA interactions play essential roles in a wide variety of biological processes. Recognition of RNA-binding residues on proteins has been a challenging problem. Most methods utilize the position-specific scoring matrix (PSSM). It has been found that considering the evolutionary information of sequence-neighboring residues can improve the prediction. In this work, we introduce a novel method, SNB-PSSM (spatial neighbor-based PSSM), combined with a structure window scheme in which the evolutionary information of spatially neighboring residues is considered. The results show that our method consistently outperforms the standard and smoothed PSSM methods. Tested on multiple datasets, this approach shows encouraging performance compared with RNABindRPlus, BindN+, PPRInt, xypan, Predict_RBP, SpaPF, PRNA, and KYG, although it is inferior to RNAProSite, RBscore, and aaRNA. In addition, since our method is not sensitive to protein structure changes, it can be applied well to binding site prediction on modeled structures. The results also suggest that the evolution of binding sites is spatially cooperative. As an effective way of incorporating evolutionary information, the proposed method can be widely used for nucleic acid-/protein-binding site prediction and functional motif finding.
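The idea of pooling evolutionary information over spatial rather than sequence neighbors can be illustrated with a small sketch. The abstract does not state how neighbors are weighted, so the uniform average, the 10 Å radius, and the function name below are assumptions made for illustration only, not the SNB-PSSM definition.

```python
import math


def spatial_neighbor_pssm(pssm, coords, radius=10.0):
    """Average each residue's PSSM row with the rows of spatially close residues.

    pssm   : list of rows, one row of scores per residue
    coords : list of (x, y, z) tuples, one per residue (e.g. C-alpha atoms)
    radius : distance cutoff in angstroms (hypothetical default)
    """
    if len(pssm) != len(coords):
        raise ValueError("pssm and coords must describe the same residues")
    smoothed = []
    for i, ci in enumerate(coords):
        # A residue is always its own neighbor because its distance is zero.
        neighbors = [j for j, cj in enumerate(coords) if math.dist(ci, cj) <= radius]
        n_cols = len(pssm[i])
        row = [sum(pssm[j][k] for j in neighbors) / len(neighbors) for k in range(n_cols)]
        smoothed.append(row)
    return smoothed


# Toy usage: residues 0 and 1 are 3 angstroms apart, residue 2 is far away.
toy_pssm = [[1, 2], [3, 4], [5, 6]]  # two columns instead of 20, for brevity
toy_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
snb = spatial_neighbor_pssm(toy_pssm, toy_coords, radius=5.0)
```

In the toy input the first two residues share an averaged row while the isolated third residue keeps its own scores, which is the behaviour that distinguishes a spatial window from a sequence window.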
Affiliation(s)
- Yang Liu
  - Faculty of Environmental and Life Sciences, Beijing University of Technology, Beijing, China
- Weikang Gong
  - Faculty of Environmental and Life Sciences, Beijing University of Technology, Beijing, China
- Zhen Yang
  - Faculty of Environmental and Life Sciences, Beijing University of Technology, Beijing, China
- Chunhua Li
  - Faculty of Environmental and Life Sciences, Beijing University of Technology, Beijing, China
3
Abstract
Background:
Revealing the subcellular location of a newly discovered protein can bring insight into its function and guide research at the cellular level. The experimental methods currently used to identify protein subcellular locations are both time-consuming and expensive. Thus, it is highly desirable to develop computational methods for efficiently and effectively identifying protein subcellular locations. In particular, the rapidly increasing number of protein sequences entering the genome databases calls for the development of automated analysis methods.
Methods:
In this review, we describe recent advances in predicting protein subcellular locations with machine learning from the following aspects: i) protein subcellular location benchmark dataset construction, ii) protein feature representation and feature descriptors, iii) common machine learning algorithms, iv) cross-validation test methods and assessment metrics, v) web servers.
Result & Conclusion:
Given the large number of protein sequences generated by high-throughput technologies, four future directions for predicting protein subcellular locations with machine learning deserve attention. The first is the selection of novel and effective features (e.g., statistical, physicochemical, evolutionary) from the sequences and structures of proteins. The second is the feature fusion strategy. The third is the design of a powerful predictor, and the fourth is the prediction of proteins with multiple location sites.
Affiliation(s)
- Ting-He Zhang
  - School of Automation, Northwestern Polytechnical University, Xi'an, 710072, China
- Shao-Wu Zhang
  - School of Automation, Northwestern Polytechnical University, Xi'an, 710072, China
4
Sun M, Wang X, Zou C, He Z, Liu W, Li H. Accurate prediction of RNA-binding protein residues with two discriminative structural descriptors. BMC Bioinformatics 2016; 17:231. [PMID: 27266516 PMCID: PMC4897909 DOI: 10.1186/s12859-016-1110-x] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Received: 02/21/2016] [Accepted: 06/02/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND RNA-binding proteins participate in many important biological processes concerning RNA-mediated gene regulation, and several computational methods have recently been developed to predict the protein-RNA interactions of RNA-binding proteins. Newly developed discriminative descriptors will help to improve the accuracy of these prediction methods and provide further meaningful information for researchers.
RESULTS In this work, we designed two structural features: residue electrostatic surface potential and triplet interface propensity. Statistical and structural analysis of protein-RNA complexes showed that the two features are powerful for identifying RNA-binding protein residues. Using these two features together with other excellent structure- and sequence-based features, a random forest classifier was constructed to predict RNA-binding residues. The area under the receiver operating characteristic curve (AUC) in five-fold cross-validation on the training set RBP195 was 0.900; when the method was applied to the test set RBP68, the prediction accuracy (ACC) was 0.868 and the F-score was 0.631.
CONCLUSIONS The good prediction performance of our method shows that the two newly designed descriptors are discriminative for inferring protein residues that interact with RNAs. To facilitate the use of our method, a web server called RNAProSite, which implements the proposed method, was constructed and is freely available at http://lilab.ecust.edu.cn/NABind .
Affiliation(s)
- Meijian Sun
  - State Key Laboratory of Bioreactor Engineering, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Mei Long Road, Shanghai, 200237, China
- Xia Wang
  - State Key Laboratory of Bioreactor Engineering, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Mei Long Road, Shanghai, 200237, China
- Chuanxin Zou
  - State Key Laboratory of Bioreactor Engineering, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Mei Long Road, Shanghai, 200237, China
- Zenghui He
  - State Key Laboratory of Bioreactor Engineering, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Mei Long Road, Shanghai, 200237, China
- Wei Liu
  - State Key Laboratory of Bioreactor Engineering, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Mei Long Road, Shanghai, 200237, China
- Honglin Li
  - State Key Laboratory of Bioreactor Engineering, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Mei Long Road, Shanghai, 200237, China
5
Fan XN, Zhang SW. lncRNA-MFDL: identification of human long non-coding RNAs by fusing multiple features and using deep learning. Mol Biosyst 2015; 11:892-7. [DOI: 10.1039/c4mb00650j] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.6] [Indexed: 12/31/2022]
Abstract
By fusing multiple features and using deep learning algorithms, a lncRNA-MFDL predictor was developed to identify lncRNAs, which is much more effective and robust.
Affiliation(s)
- Xiao-Nan Fan
  - Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, China
- Shao-Wu Zhang
  - Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, China
6
Prediction of protein-protein interaction with pairwise kernel support vector machine. Int J Mol Sci 2014; 15:3220-33. [PMID: 24566145 PMCID: PMC3958907 DOI: 10.3390/ijms15023220] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Received: 01/01/2014] [Revised: 01/27/2014] [Accepted: 01/29/2014] [Indexed: 11/17/2022] Open
Abstract
Protein–protein interactions (PPIs) play a key role in many cellular processes. Unfortunately, the experimental methods currently used to identify PPIs are both time-consuming and expensive. These obstacles could be overcome by developing computational approaches to predict PPIs. Here, we report two amino acid feature extraction methods for representing protein sequences: (i) distance frequency with PCA-based dimension reduction (DFPCA) and (ii) amino acid index distribution (AAID). To obtain robust and reliable PPI predictions, a pairwise kernel function and support vector machines (SVM) were employed so that the result does not depend on the order in which the feature vectors of the two proteins are concatenated. The highest prediction accuracies of AAID and DFPCA were 94% and 93.96%, respectively, in the 10-fold cross-validation (10 CV) test, and the pairwise radial basis kernel function gave considerably better results than the plain radial basis kernel function. Overall, the PPI prediction tool, termed PPI-PKSVM, which is freely available at http://159.226.118.31/PPI/index.html, promises to become useful in such areas as bio-analysis and drug development.
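To show why a pairwise kernel removes the dependence on concatenation order, the sketch below uses the widely used symmetric product form K((a, b), (c, d)) = k(a, c)·k(b, d) + k(a, d)·k(b, c) with a radial basis base kernel. Whether PPI-PKSVM uses exactly this form is an assumption, since the abstract does not give the formula; the function names and the `gamma` value are hypothetical.

```python
import math


def rbf(x, y, gamma=0.5):
    """Radial basis function kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def pairwise_rbf(pair1, pair2, gamma=0.5):
    """Symmetric pairwise kernel between two protein pairs.

    Summing over both ways of matching the proteins makes the value
    independent of the order inside each pair.
    """
    (a, b), (c, d) = pair1, pair2
    return rbf(a, c, gamma) * rbf(b, d, gamma) + rbf(a, d, gamma) * rbf(b, c, gamma)


def concat_rbf(pair1, pair2, gamma=0.5):
    """Plain RBF kernel on concatenated vectors, which is order-dependent."""
    (a, b), (c, d) = pair1, pair2
    return rbf(list(a) + list(b), list(c) + list(d), gamma)


# Toy feature vectors for two proteins.
p = [1.0, 0.0]
q = [0.0, 1.0]
same_order = pairwise_rbf((p, q), (p, q))
swapped = pairwise_rbf((q, p), (p, q))
concat_same = concat_rbf((p, q), (p, q))
concat_swapped = concat_rbf((q, p), (p, q))
```

Here `same_order` and `swapped` agree, whereas `concat_same` and `concat_swapped` differ, so an SVM built on the concatenated representation would treat (p, q) and (q, p) as different examples.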
7
Walia RR, Caragea C, Lewis BA, Towfic F, Terribilini M, El-Manzalawy Y, Dobbs D, Honavar V. Protein-RNA interface residue prediction using machine learning: an assessment of the state of the art. BMC Bioinformatics 2012; 13:89. [PMID: 22574904 PMCID: PMC3490755 DOI: 10.1186/1471-2105-13-89] [Citation(s) in RCA: 63] [Impact Index Per Article: 4.8] [Received: 10/14/2011] [Accepted: 05/10/2012] [Indexed: 11/15/2022] Open
Abstract
BACKGROUND RNA molecules play diverse functional and structural roles in cells. They function as messengers for transferring genetic information from DNA to proteins, as the primary genetic material in many viruses, as catalysts (ribozymes) important for protein synthesis and RNA processing, and as essential and ubiquitous regulators of gene expression in living organisms. Many of these functions depend on precisely orchestrated interactions between RNA molecules and specific proteins in cells. Understanding the molecular mechanisms by which proteins recognize and bind RNA is essential for comprehending the functional implications of these interactions, but the recognition 'code' that mediates interactions between proteins and RNA is not yet understood. Success in deciphering this code would dramatically impact the development of new therapeutic strategies for intervening in devastating diseases such as AIDS and cancer. Because of the high cost of experimental determination of protein-RNA interfaces, there is an increasing reliance on statistical machine learning methods for training predictors of RNA-binding residues in proteins. However, because of differences in the choice of datasets, performance measures, and data representations used, it has been difficult to obtain an accurate assessment of the current state of the art in protein-RNA interface prediction.
RESULTS We provide a review of published approaches for predicting RNA-binding residues in proteins and a systematic comparison and critical assessment of protein-RNA interface residue predictors trained using these approaches on three carefully curated non-redundant datasets. We directly compare two widely used machine learning algorithms (Naïve Bayes (NB) and Support Vector Machine (SVM)) using three different data representations in which features are encoded using either sequence- or structure-based windows. Our results show that (i) Sequence-based classifiers that use a position-specific scoring matrix (PSSM)-based representation (PSSMSeq) outperform those that use an amino acid identity based representation (IDSeq) or a smoothed PSSM (SmoPSSMSeq); (ii) Structure-based classifiers that use smoothed PSSM representation (SmoPSSMStr) outperform those that use PSSM (PSSMStr) as well as sequence identity based representation (IDStr). PSSMSeq classifiers, when tested on an independent test set of 44 proteins, achieve performance that is comparable to that of three state-of-the-art structure-based predictors (including those that exploit geometric features) in terms of Matthews Correlation Coefficient (MCC), although the structure-based methods achieve substantially higher Specificity (albeit at the expense of Sensitivity) compared to sequence-based methods. We also find that the expected performance of the classifiers on a residue level can be markedly different from that on a protein level. Our experiments show that the classifiers trained on three different non-redundant protein-RNA interface datasets achieve comparable cross-validation performance. However, we find that the results are significantly affected by differences in the distance threshold used to define interface residues.
CONCLUSIONS Our results demonstrate that protein-RNA interface residue predictors that use a PSSM-based encoding of sequence windows outperform classifiers that use other encodings of sequence windows. While structure-based methods that exploit geometric features can yield significant increases in the Specificity of protein-RNA interface residue predictions, such increases are offset by decreases in Sensitivity. These results underscore the importance of comparing alternative methods using rigorous statistical procedures, multiple performance measures, and datasets that are constructed based on several alternative definitions of interface residues and redundancy cutoffs as well as including evaluations on independent test sets into the comparisons.
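The performance measures named in this abstract can be computed from residue-level confusion counts, as in the minimal sketch below. The abstract does not define "Specificity"; some papers in this area use the term for TP/(TP+FP), which is more often called precision, so the sketch reports both that quantity and the true negative rate. The function name and the toy labels are hypothetical.

```python
import math


def residue_metrics(y_true, y_pred):
    """Compute residue-level metrics from binary labels (1 = interface residue)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "true_negative_rate": tn / (tn + fp) if tn + fp else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # MCC is set to 0 when any marginal count is zero (a common convention).
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Toy usage: six residues, three true interface residues.
labels = [1, 1, 0, 0, 0, 1]
predictions = [1, 0, 0, 0, 1, 1]
metrics = residue_metrics(labels, predictions)
```

Computing these values per protein and then averaging, rather than pooling all residues as above, is one way the residue-level and protein-level views mentioned in the abstract can diverge.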
Affiliation(s)
- Rasna R Walia
  - Bioinformatics and Computational Biology Program, Iowa State University, USA
  - Department of Computer Science, Iowa State University, USA
- Cornelia Caragea
  - Center for Computational Intelligence, Learning and Discovery, Iowa State University, USA
  - College of Information Sciences & Technology, The Pennsylvania State University, University Park, USA
- Benjamin A Lewis
  - Bioinformatics and Computational Biology Program, Iowa State University, USA
  - Department of Genetics, Development and Cell Biology, , USA
- Fadi Towfic
  - Center for Computational Intelligence, Learning and Discovery, Iowa State University, USA
  - The Broad Institute, USA
- Yasser El-Manzalawy
  - Department of Computer Science, Iowa State University, USA
  - Center for Computational Intelligence, Learning and Discovery, Iowa State University, USA
  - Department of Systems & Computer Engineering, Al-Azhar University, Egypt
- Drena Dobbs
  - Bioinformatics and Computational Biology Program, Iowa State University, USA
  - Department of Genetics, Development and Cell Biology, , USA
- Vasant Honavar
  - Bioinformatics and Computational Biology Program, Iowa State University, USA
  - Department of Computer Science, Iowa State University, USA
  - Center for Computational Intelligence, Learning and Discovery, Iowa State University, USA