1
|
Zhang J, Basu S, Zhang F, Kurgan L. MERIT: Accurate Prediction of Multi Ligand-binding Residues with Hybrid Deep Transformer Network, Evolutionary Couplings and Transfer Learning. J Mol Biol 2024:168872. [PMID: 40133785 DOI: 10.1016/j.jmb.2024.168872] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2024] [Revised: 10/30/2024] [Accepted: 11/15/2024] [Indexed: 03/27/2025]
Abstract
Multi-ligand binding residues (MLBRs) are amino acids in protein sequences that interact with multiple different ligands that include proteins, peptides, nucleic acids, and a variety of small molecules. MLBRs are implicated in a number of cellular functions and targeted in a context of multiple human diseases. There are many sequence-based predictors of residues that interact with specific ligand types and they can be collectively used to identify MLBRs. However, there are no methods that directly predict MLBRs. To this end, we conceptualize, design, evaluate and release MERIT (Multi-binding rEsidues pRedIcTor). This tool relies on a custom-crafted deep neural network that implements a number of innovative features, such as a multi-layered/step architecture with transformer modules that we train using a custom-designed loss function, computation of evolutionary couplings, and application of transfer learning. These innovations boost predictive performance, which we demonstrate using an ablation analysis. In particular, they reduce the number of cross-predictions, defined as residues that interact with a single ligand type that are incorrectly predicted as MLBRs. We compare MERIT against a representative selection of current and popular ligand-specific predictors, meta-predictors that combine their results to identify MLBRs, and a baseline regression-based predictor. These tests reveal that MERIT provides accurate predictions and statistically outperforms these alternatives. Moreover, using two test datasets, one with MLBRs and another with only the single ligand binding residues, we show that MERIT consistently produces relatively low false positive rates, including low rates of cross-predictions. The web server and datasets from this study are freely available at http://biomine.cs.vcu.edu/servers/MERIT/.
Collapse
Affiliation(s)
- Jian Zhang
- School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, China; Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China.
| | - Sushmita Basu
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - Fuhao Zhang
- College of Information Engineering, Northwest A&F University, Yangling, Shaanxi 712100, China
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA.
| |
Collapse
|
2
|
Ge F, Li CF, Zhang CM, Zhang M, Yu DJ. PRITrans: A Transformer-Based Approach for the Prediction of the Effects of Missense Mutation on Protein-RNA Interactions. Int J Mol Sci 2024; 25:12348. [PMID: 39596413 PMCID: PMC11594650 DOI: 10.3390/ijms252212348] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2024] [Revised: 11/13/2024] [Accepted: 11/15/2024] [Indexed: 11/28/2024] Open
Abstract
Protein-RNA interactions are essential to many cellular functions, and missense mutations in RNA-binding proteins can disrupt these interactions, often leading to disease. To address this, we developed PRITrans, a specialized computational method aimed at predicting the effects of missense mutations on protein-RNA interactions, which is vital for understanding disease mechanisms and advancing molecular biology research. PRITrans is a novel deep learning model designed to predict the effects of missense mutations on protein-RNA interactions, which employs a Transformer architecture enhanced with multiscale convolution modules for comprehensive feature extraction. Its primary innovation lies in integrating protein language model embeddings with a deep feature fusion strategy, effectively handling high-dimensional feature representations. By utilizing multi-layer self-attention mechanisms, PRITrans captures nuanced, high-level sequence information, while multiscale convolutions extract features across various depths, thereby enhancing predictive accuracy. Consequently, this architecture enables significant improvements in ΔΔG prediction compared to traditional approaches. We validated PRITrans using three different cross-validation strategies on two newly reconstructed mutation datasets, S315 and S630 (containing 315 forward and 315 reverse mutations). The results consistently demonstrated PRITrans's strong performance on both datasets. PRITrans demonstrated strong predictive capability, achieving a Pearson correlation coefficient of 0.741 and a root mean square error (RMSE) of 1.168 kcal/mol on the S630 dataset. Moreover, its robust performance extended to independent test sets, achieving a Pearson correlation of 0.699 and an RMSE of 1.592 kcal/mol. These results underscore PRITrans's potential as a powerful tool for protein-RNA interaction studies. Moreover, when tested against existing prediction methods on an independent dataset, PRITrans showed improved predictive accuracy and robustness.
Collapse
Affiliation(s)
- Fang Ge
- State Key Laboratory of Organic Electronics and Information Displays & Institute of Advanced Materials (IAM), Nanjing University of Posts & Telecommunications, 9 Wenyuan, Nanjing 210023, China;
| | - Cui-Feng Li
- School of Computer, Jiangsu University of Science and Technology, 666 Changhui Road, Zhenjiang 212100, China; (C.-F.L.); (C.-M.Z.); (M.Z.)
| | - Chao-Ming Zhang
- School of Computer, Jiangsu University of Science and Technology, 666 Changhui Road, Zhenjiang 212100, China; (C.-F.L.); (C.-M.Z.); (M.Z.)
| | - Ming Zhang
- School of Computer, Jiangsu University of Science and Technology, 666 Changhui Road, Zhenjiang 212100, China; (C.-F.L.); (C.-M.Z.); (M.Z.)
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| |
Collapse
|
3
|
Xia Y, Zhang Y, Liu D, Zhu YH, Wang Z, Song J, Yu DJ. BLAM6A-Merge: Leveraging Attention Mechanisms and Feature Fusion Strategies to Improve the Identification of RNA N6-Methyladenosine Sites. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:1803-1815. [PMID: 38913512 DOI: 10.1109/tcbb.2024.3418490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/26/2024]
Abstract
RNA N6-methyladenosine is a prevalent and abundant type of RNA modification that exerts significant influence on diverse biological processes. To date, numerous computational approaches have been developed for predicting methylation, with most of them ignoring the correlations of different encoding strategies and failing to explore the adaptability of various attention mechanisms for methylation identification. To solve the above issues, we proposed an innovative framework for predicting RNA m6A modification site, termed BLAM6A-Merge. Specifically, it utilized a multimodal feature fusion strategy to combine the classification results of four features and Blastn tool. Apart from this, different attention mechanisms were employed for extracting higher-level features on specific features after the screening process. Extensive experiments on 12 benchmarking datasets demonstrated that BLAM6A-Merge achieved superior performance (average AUC: 0.849 for the full transcript mode and 0.784 for the mature mRNA mode). Notably, the Blastn tool was employed for the first time in the identification of methylation sites.
Collapse
|
4
|
Yan Z, Ge F, Liu Y, Zhang Y, Li F, Song J, Yu DJ. TransEFVP: A Two-Stage Approach for the Prediction of Human Pathogenic Variants Based on Protein Sequence Embedding Fusion. J Chem Inf Model 2024; 64:1407-1418. [PMID: 38334115 DOI: 10.1021/acs.jcim.3c02019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/10/2024]
Abstract
Studying the effect of single amino acid variations (SAVs) on protein structure and function is integral to advancing our understanding of molecular processes, evolutionary biology, and disease mechanisms. Screening for deleterious variants is one of the crucial issues in precision medicine. Here, we propose a novel computational approach, TransEFVP, based on large-scale protein language model embeddings and a transformer-based neural network to predict disease-associated SAVs. The model adopts a two-stage architecture: the first stage is designed to fuse different feature embeddings through a transformer encoder. In the second stage, a support vector machine model is employed to quantify the pathogenicity of SAVs after dimensionality reduction. The prediction performance of TransEFVP on blind test data achieves a Matthews correlation coefficient of 0.751, an F1-score of 0.846, and an area under the receiver operating characteristic curve of 0.871, higher than the existing state-of-the-art methods. The benchmark results demonstrate that TransEFVP can be explored as an accurate and effective SAV pathogenicity prediction method. The data and codes for TransEFVP are available at https://github.com/yzh9607/TransEFVP/tree/master for academic use.
Collapse
Affiliation(s)
- Zihao Yan
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, PR China
| | - Fang Ge
- State Key Laboratory of Organic Electronics and lnformation Displays & lnstitute of Advanced Materials (IAM), Nanjing University of Posts & Telecommunications, 9 Wenyuan Road, Nanjing 210023, PR China
| | - Yan Liu
- Department of Computer Science, Yangzhou University, Yangzhou 225100, PR China
| | - Yumeng Zhang
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, PR China
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
| | - Fuyi Li
- South Australian immunoGENomics Cancer Institute (SAiGENCI), Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, South Australia 5005, Australia
- The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, Victoria 3000, Australia
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, PR China
| |
Collapse
|
5
|
Zhang M, Gong C, Ge F, Yu DJ. FCMSTrans: Accurate Prediction of Disease-Associated nsSNPs by Utilizing Multiscale Convolution and Deep Feature Combination within a Transformer Framework. J Chem Inf Model 2024; 64:1394-1406. [PMID: 38349747 DOI: 10.1021/acs.jcim.3c02025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/27/2024]
Abstract
Nonsynonymous single-nucleotide polymorphisms (nsSNPs), implicated in over 6000 diseases, necessitate accurate prediction for expedited drug discovery and improved disease diagnosis. In this study, we propose FCMSTrans, a novel nsSNP predictor that innovatively combines the transformer framework and multiscale modules for comprehensive feature extraction. The distinctive attribute of FCMSTrans resides in a deep feature combination strategy. This strategy amalgamates evolutionary-scale modeling (ESM) and ProtTrans (PT) features, providing an understanding of protein biochemical properties, and position-specific scoring matrix, secondary structure, predicted relative solvent accessibility, and predicted disorder (PSPP) features, which are derived from four protein sequences and structure-oriented characteristics. This feature combination offers a comprehensive view of the molecular dynamics involving nsSNPs. Our model employs the transformer's self-attention mechanisms across multiple layers, extracting higher-level and abstract representations. Simultaneously, varied-level features are captured by multiscale convolutions, enriching feature abstraction at multiple echelons. Our comparative analyses with existing methodologies highlight significant improvements made possible by the integrated feature fusion approach adopted in FCMSTrans. This is further substantiated by performance assessments based on diverse data sets, such as PredictSNP, MMP, and PMD, with areas under the curve (AUCs) of 0.869, 0.819, and 0.693, respectively. Furthermore, FCMSTrans shows robustness and superiority by outperforming the current best predictor, PROVEAN, in a blind test conducted on a third-party data set, achieving an impressive AUC score of 0.7838. The Python code of FCMSTrans is available at https://github.com/gc212/FCMSTrans for academic usage.
Collapse
Affiliation(s)
- Ming Zhang
- School of Computer, Jiangsu University of Science and Technology, 666 Changhui Road, Zhenjiang 212100, China
| | - Chao Gong
- School of Computer, Jiangsu University of Science and Technology, 666 Changhui Road, Zhenjiang 212100, China
| | - Fang Ge
- State Key Laboratory of Organic Electronics and Information Displays & Institute of Advanced Materials (IAM), Nanjing University of Posts & Telecommunications, 9 Wenyuan Road, Nanjing 210023, China
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| |
Collapse
|
6
|
Cai J, Wang T, Deng X, Tang L, Liu L. GM-lncLoc: LncRNAs subcellular localization prediction based on graph neural network with meta-learning. BMC Genomics 2023; 24:52. [PMID: 36709266 PMCID: PMC9883864 DOI: 10.1186/s12864-022-09034-1] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2022] [Accepted: 11/21/2022] [Indexed: 01/29/2023] Open
Abstract
In recent years, a large number of studies have shown that the subcellular localization of long non-coding RNAs (lncRNAs) can bring crucial information to the recognition of lncRNAs function. Therefore, it is of great significance to establish a computational method to accurately predict the subcellular localization of lncRNA. Previous prediction models are based on low-level sequences information and are troubled by the few samples problem. In this study, we propose a new prediction model, GM-lncLoc, which is based on the initial information extracted from the lncRNA sequence, and also combines the graph structure information to extract high level features of lncRNA. In addition, the training mode of meta-learning is introduced to obtain meta-parameters by training a series of tasks. With the meta-parameters, the final parameters of other similar tasks can be learned quickly, so as to solve the problem of few samples in lncRNA subcellular localization. Compared with the previous methods, GM-lncLoc achieved the best results with an accuracy of 93.4 and 94.2% in the benchmark datasets of 5 and 4 subcellular compartments, respectively. Furthermore, the prediction performance of GM-lncLoc was also better on the independent dataset. It shows the effectiveness and great potential of our proposed method for lncRNA subcellular localization prediction. The datasets and source code are freely available at https://github.com/JunzheCai/GM-lncLoc .
Collapse
Affiliation(s)
- Junzhe Cai
- grid.410739.80000 0001 0723 6903School of Information, Yunnan Normal University, Kunming, Yunnan China
| | - Ting Wang
- grid.410739.80000 0001 0723 6903School of Information, Yunnan Normal University, Kunming, Yunnan China
| | - Xi Deng
- grid.410739.80000 0001 0723 6903School of Information, Yunnan Normal University, Kunming, Yunnan China
| | - Lin Tang
- grid.410739.80000 0001 0723 6903Key Laboratory of Educational Information for Nationalities Ministry of Education, Yunnan Normal University, Kunming, Yunnan China
| | - Lin Liu
- grid.410739.80000 0001 0723 6903School of Information, Yunnan Normal University, Kunming, Yunnan China
| |
Collapse
|
7
|
Hu J, Bai YS, Zheng LL, Jia NX, Yu DJ, Zhang GJ. Protein-DNA Binding Residue Prediction via Bagging Strategy and Sequence-Based Cube-Format Feature. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3635-3645. [PMID: 34714748 DOI: 10.1109/tcbb.2021.3123828] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Protein-DNA interactions play an important role in diverse biological processes. Accurately identifying protein-DNA binding residues is a critical but challenging task for protein function annotations and drug design. Although wet-lab experimental methods are the most accurate way to identify protein-DNA binding residues, they are time consuming and labor intensive. There is an urgent need to develop computational methods to rapidly and accurately predict protein-DNA binding residues. In this study, we propose a novel sequence-based method, named PredDBR, for predicting DNA-binding residues. In PredDBR, for each query protein, its position-specific frequency matrix (PSFM), predicted secondary structure (PSS), and predicted probabilities of ligand-binding residues (PPLBR) are first generated as three feature sources. Secondly, for each feature source, the sliding window technique is employed to extract the matrix-format feature of each residue. Then, we design two strategies, i.e., square root (SR) and average (AVE), to separately transform PSFM-based and two predicted feature source-based, i.e., PSS-based and PPLBR-based, matrix-format features of each residue into three corresponding cube-format features. Finally, after serially combining the three cube-format features, the ensemble classifier is generated via applying bagging strategy to multiple base classifiers built by the framework of 2D convolutional neural network. The computational experimental results demonstrate that the proposed PredDBR achieves an average overall accuracy of 93.7% and a Mathew's correlation coefficient of 0.405 on two independent validation datasets and outperforms several state-of-the-art sequenced-based protein-DNA binding residue predictors. The PredDBR web-server is available at https://jun-csbio.github.io/PredDBR/.
Collapse
|
8
|
Ge F, Zhang Y, Xu J, Muhammad A, Song J, Yu DJ. Prediction of disease-associated nsSNPs by integrating multi-scale ResNet models with deep feature fusion. Brief Bioinform 2022; 23:bbab530. [PMID: 34953462 PMCID: PMC8769912 DOI: 10.1093/bib/bbab530] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2021] [Revised: 11/13/2021] [Accepted: 11/16/2021] [Indexed: 11/13/2022] Open
Abstract
More than 6000 human diseases have been recorded to be caused by non-synonymous single nucleotide polymorphisms (nsSNPs). Rapid and accurate prediction of pathogenic nsSNPs can improve our understanding of the principle and design of new drugs, which remains an unresolved challenge. In the present work, a new computational approach, termed MSRes-MutP, is proposed based on ResNet blocks with multi-scale kernel size to predict disease-associated nsSNPs. By feeding the serial concatenation of the extracted four types of features, the performance of MSRes-MutP does not obviously improve. To address this, a second model FFMSRes-MutP is developed, which utilizes deep feature fusion strategy and multi-scale 2D-ResNet and 1D-ResNet blocks to extract relevant two-dimensional features and physicochemical properties. FFMSRes-MutP with the concatenated features achieves a better performance than that with individual features. The performance of FFMSRes-MutP is benchmarked on five different datasets. It achieves the Matthew's correlation coefficient (MCC) of 0.593 and 0.618 on the PredictSNP and MMP datasets, which are 0.101 and 0.210 higher than that of the existing best method PredictSNP1. When tested on the HumDiv and HumVar datasets, it achieves MCC of 0.9605 and 0.9507, and area under curve (AUC) of 0.9796 and 0.9748, which are 0.1747 and 0.2669, 0.0853 and 0.1335, respectively, higher than the existing best methods PolyPhen-2 and FATHMM (weighted). In addition, on blind test using a third-party dataset, FFMSRes-MutP performs as the second-best predictor (with MCC and AUC of 0.5215 and 0.7633, respectively), when compared with the other four predictors. Extensive benchmarking experiments demonstrate that FFMSRes-MutP achieves effective feature fusion and can be explored as a useful approach for predicting disease-associated nsSNPs. The webserver is freely available at http://csbio.njust.edu.cn/bioinf/ffmsresmutp/ for academic use.
Collapse
Affiliation(s)
- Fang Ge
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
| | - Ying Zhang
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
| | - Jian Xu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
| | - Arif Muhammad
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
| |
Collapse
|
9
|
Ge F, Hu J, Zhu YH, Arif M, Yu DJ. TargetMM: Accurate Missense Mutation Prediction by Utilizing Local and Global Sequence Information with Classifier Ensemble. Comb Chem High Throughput Screen 2022; 25:38-52. [PMID: 33280588 DOI: 10.2174/1386207323666201204140438] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/01/2020] [Revised: 10/22/2020] [Accepted: 10/26/2020] [Indexed: 11/22/2022]
Abstract
AIM AND OBJECTIVE Missense mutation (MM) may lead to various human diseases by disabling proteins. Accurate prediction of MM is important and challenging for both protein function annotation and drug design. Although several computational methods yielded acceptable success rates, there is still room for further enhancing the prediction performance of MM. MATERIALS AND METHODS In the present study, we designed a new feature extracting method, which considers the impact degree of residues in the microenvironment range to the mutation site. Stringent cross-validation and independent test on benchmark datasets were performed to evaluate the efficacy of the proposed feature extracting method. Furthermore, three heterogeneous prediction models were trained and then ensembled for the final prediction. By combining the feature representation method and classifier ensemble technique, we reported a novel MM predictor called TargetMM for identifying the pathogenic mutations from the neutral ones. RESULTS Comparison outcomes based on statistical evaluation demonstrate that TargetMM outperforms the prior advanced methods on the independent test data. The source codes and benchmark datasets of TargetMM are freely available at https://github.com/sera616/TargetMM.git for academic use.
Collapse
Affiliation(s)
- Fang Ge
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094,China
| | - Jun Hu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023,China
| | - Yi-Heng Zhu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094,China
| | - Muhammad Arif
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094,China
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094,China
| |
Collapse
|
10
|
Ge F, Muhammad A, Yu DJ. DeepnsSNPs: Accurate prediction of non-synonymous single-nucleotide polymorphisms by combining multi-scale convolutional neural network and residue environment information. CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS 2021; 215:104326. [DOI: 10.1016/j.chemolab.2021.104326] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/17/2024]
|
11
|
Wang X, Zhang Y, Yu B, Salhi A, Chen R, Wang L, Liu Z. Prediction of protein-protein interaction sites through eXtreme gradient boosting with kernel principal component analysis. Comput Biol Med 2021; 134:104516. [PMID: 34119922 DOI: 10.1016/j.compbiomed.2021.104516] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2020] [Revised: 05/24/2021] [Accepted: 05/24/2021] [Indexed: 12/22/2022]
Abstract
Predicting protein-protein interaction sites (PPI sites) can provide important clues for understanding biological activity. Using machine learning to predict PPI sites can mitigate the cost of running expensive and time-consuming biological experiments. Here we propose PPISP-XGBoost, a novel PPI sites prediction method based on eXtreme gradient boosting (XGBoost). First, the characteristic information of protein is extracted through the pseudo-position specific scoring matrix (PsePSSM), pseudo-amino acid composition (PseAAC), hydropathy index and solvent accessible surface area (ASA) under the sliding window. Next, these raw features are preprocessed to obtain more optimal representations in order to achieve better prediction. In particular, the synthetic minority oversampling technique (SMOTE) is used to circumvent class imbalance, and the kernel principal component analysis (KPCA) is applied to remove redundant characteristics. Finally, these optimal features are fed to the XGBoost classifier to identify PPI sites. Using PPISP-XGBoost, the prediction accuracy on the training dataset Dset186 reaches 85.4%, and the accuracy on the independent validation datasets Dtestset72, PDBtestset164, Dset_448 and Dset_355 reaches 85.3%, 83.9%, 85.8% and 85.4%, respectively, which all show an increase in accuracy against existing PPI sites prediction methods. These results demonstrate that the PPISP-XGBoost method can further enhance the prediction of PPI sites.
Collapse
Affiliation(s)
- Xue Wang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Yaqun Zhang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Bin Yu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China; Key Laboratory of Computational Science and Application of Hainan Province, Haikou, 571158, China.
| | - Adil Salhi
- Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
| | - Ruixin Chen
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Lin Wang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Zengfeng Liu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| |
Collapse
|
12
|
Sak J, Suchodolska M. Artificial Intelligence in Nutrients Science Research: A Review. Nutrients 2021; 13:322. [PMID: 33499405 PMCID: PMC7911928 DOI: 10.3390/nu13020322] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2020] [Revised: 01/12/2021] [Accepted: 01/18/2021] [Indexed: 12/13/2022] Open
Abstract
Artificial intelligence (AI) as a branch of computer science, the purpose of which is to imitate thought processes, learning abilities and knowledge management, finds more and more applications in experimental and clinical medicine. In recent decades, there has been an expansion of AI applications in biomedical sciences. The possibilities of artificial intelligence in the field of medical diagnostics, risk prediction and support of therapeutic techniques are growing rapidly. The aim of the article is to analyze the current use of AI in nutrients science research. The literature review was conducted in PubMed. A total of 399 records published between 1987 and 2020 were obtained, of which, after analyzing the titles and abstracts, 261 were rejected. In the next stages, the remaining records were analyzed using the full-text versions and, finally, 55 papers were selected. These papers were divided into three areas: AI in biomedical nutrients research (20 studies), AI in clinical nutrients research (22 studies) and AI in nutritional epidemiology (13 studies). It was found that the artificial neural network (ANN) methodology was dominant in the group of research on food composition study and production of nutrients. However, machine learning (ML) algorithms were widely used in studies on the influence of nutrients on the functioning of the human body in health and disease and in studies on the gut microbiota. Deep learning (DL) algorithms prevailed in a group of research works on clinical nutrients intake. The development of dietary systems using AI technology may lead to the creation of a global network that will be able to both actively support and monitor the personalized supply of nutrients.
Collapse
Affiliation(s)
- Jarosław Sak
- Chair and Department of Humanities and Social Medicine, Medical University of Lublin, 20-093 Lublin, Poland
- BioMolecular Resources Research Infrastructure Poland (BBMRI.pl), Poland
| | | |
Collapse
|
13
|
Zhu YH, Hu J, Qi Y, Song XN, Yu DJ. Boosting Granular Support Vector Machines for the Accurate Prediction of Protein-Nucleotide Binding Sites. Comb Chem High Throughput Screen 2020; 22:455-469. [PMID: 31553288 DOI: 10.2174/1386207322666190925125524] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2019] [Revised: 06/21/2019] [Accepted: 08/23/2019] [Indexed: 11/22/2022]
Abstract
AIM AND OBJECTIVE The accurate identification of protein-ligand binding sites helps elucidate protein function and facilitate the design of new drugs. Machine-learning-based methods have been widely used for the prediction of protein-ligand binding sites. Nevertheless, the severe class imbalance phenomenon, where the number of nonbinding (majority) residues is far greater than that of binding (minority) residues, has a negative impact on the performance of such machine-learning-based predictors. MATERIALS AND METHODS In this study, we aim to relieve the negative impact of class imbalance by Boosting Multiple Granular Support Vector Machines (BGSVM). In BGSVM, each base SVM is trained on a granular training subset consisting of all minority samples and some reasonably selected majority samples. The efficacy of BGSVM for dealing with class imbalance was validated by benchmarking it with several typical imbalance learning algorithms. We further implemented a protein-nucleotide binding site predictor, called BGSVM-NUC, with the BGSVM algorithm. RESULTS Rigorous cross-validation and independent validation tests for five types of proteinnucleotide interactions demonstrated that the proposed BGSVM-NUC achieves promising prediction performance and outperforms several popular sequence-based protein-nucleotide binding site predictors. The BGSVM-NUC web server is freely available at http://csbio.njust.edu.cn/bioinf/BGSVM-NUC/ for academic use.
Collapse
Affiliation(s)
- Yi-Heng Zhu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Jun Hu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Yong Qi
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Xiao-Ning Song
- School of Internet of Things, Jiangnan University, Wuxi 214122, China
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| |
Collapse
|
14
|
Agrawal P, Mishra G, Raghava GPS. SAMbinder: A Web Server for Predicting S-Adenosyl-L-Methionine Binding Residues of a Protein From Its Amino Acid Sequence. Front Pharmacol 2020; 10:1690. [PMID: 32082172 PMCID: PMC7002541 DOI: 10.3389/fphar.2019.01690] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2019] [Accepted: 12/24/2019] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION S-adenosyl-L-methionine (SAM) is an essential cofactor present in the biological system and plays a key role in many diseases. There is a need to develop a method for predicting SAM binding sites in a protein for designing drugs against SAM associated disease. To the best of our knowledge, there is no method that can predict the binding site of SAM in a given protein sequence. RESULT This manuscript describes a method SAMbinder, developed for predicting SAM interacting residue in a protein from its primary sequence. All models were trained, tested, and evaluated on 145 SAM binding protein chains where no two chains have more than 40% sequence similarity. Firstly, models were developed using different machine learning techniques on a balanced data set containing 2,188 SAM interacting and an equal number of non-interacting residues. Our random forest based model developed using binary profile feature got maximum Matthews Correlation Coefficient (MCC) 0.42 with area under receiver operating characteristics (AUROC) 0.79 on the validation data set. The performance of our models improved significantly from MCC 0.42 to 0.61, when evolutionary information in the form of the position-specific scoring matrix (PSSM) profile is used as a feature. We also developed models on a realistic data set containing 2,188 SAM interacting and 40,029 non-interacting residues and got maximum MCC 0.61 with AUROC of 0.89. In order to evaluate the performance of our models, we used internal as well as external cross-validation technique. AVAILABILITY AND IMPLEMENTATION https://webs.iiitd.edu.in/raghava/sambinder/.
Collapse
Affiliation(s)
- Piyush Agrawal
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Bioinformatics Center, CSIR-Institute of Microbial Technology, Chandigarh, India
| | - Gaurav Mishra
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Department of Electrical Engineering, Shiv Nadar University, Greater Noida, India
| | - Gajendra P. S. Raghava
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
| |
Collapse
|
15
|
Patiyal S, Agrawal P, Kumar V, Dhall A, Kumar R, Mishra G, Raghava GP. NAGbinder: An approach for identifying N-acetylglucosamine interacting residues of a protein from its primary sequence. Protein Sci 2020; 29:201-210. [PMID: 31654438 PMCID: PMC6933864 DOI: 10.1002/pro.3761] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2019] [Revised: 10/24/2019] [Accepted: 10/24/2019] [Indexed: 12/14/2022]
Abstract
N-acetylglucosamine (NAG) belongs to the eight essential saccharides that are required to maintain the optimal health and precise functioning of systems ranging from bacteria to human. In the present study, we have developed a method, NAGbinder, which predicts the NAG-interacting residues in a protein from its primary sequence information. We extracted 231 NAG-interacting nonredundant protein chains from Protein Data Bank, where no two sequences share more than 40% sequence identity. All prediction models were trained, validated, and evaluated on these 231 protein chains. At first, prediction models were developed on balanced data consisting of 1,335 NAG-interacting and noninteracting residues, using various window size. The model developed by implementing Random Forest using binary profiles as the main principle for identifying NAG-interacting residue with window size 9, performed best among other models. It achieved highest Matthews Correlation Coefficient (MCC) of 0.31 and 0.25, and Area Under Receiver Operating Curve (AUROC) of 0.73 and 0.70 on training and validation data set, respectively. We also developed prediction models on realistic data set (1,335 NAG-interacting and 47,198 noninteracting residues) using the same principle, where the model achieved MCC of 0.26 and 0.27, and AUROC of 0.70 and 0.71, on training and validation data set, respectively. The success of our method can be appraised by the fact that, if a sequence of 1,000 amino acids is analyzed with our approach, 10 residues will be predicted as NAG-interacting, out of which five are correct. Best models were incorporated in the standalone version and in the webserver available at https://webs.iiitd.edu.in/raghava/nagbinder/.
Collapse
Affiliation(s)
- Sumeet Patiyal
- Department of Computational BiologyIndraprastha Institute of Information TechnologyDelhiIndia
| | - Piyush Agrawal
- Department of Computational BiologyIndraprastha Institute of Information TechnologyDelhiIndia
- Bioinformatics CentreCSIR‐Institute of Microbial TechnologyChandigarhIndia
| | - Vinod Kumar
- Department of Computational BiologyIndraprastha Institute of Information TechnologyDelhiIndia
- Bioinformatics CentreCSIR‐Institute of Microbial TechnologyChandigarhIndia
| | - Anjali Dhall
- Department of Computational BiologyIndraprastha Institute of Information TechnologyDelhiIndia
| | - Rajesh Kumar
- Department of Computational BiologyIndraprastha Institute of Information TechnologyDelhiIndia
- Bioinformatics CentreCSIR‐Institute of Microbial TechnologyChandigarhIndia
| | - Gaurav Mishra
- Department of Electrical EngineeringShiv Nadar University, Greater NoidaGautam Buddha NagarIndia
| | - Gajendra P.S. Raghava
- Department of Computational BiologyIndraprastha Institute of Information TechnologyDelhiIndia
| |
Collapse
|
16
|
Cao Z, Pan X, Yang Y, Huang Y, Shen HB. The lncLocator: a subcellular localization predictor for long non-coding RNAs based on a stacked ensemble classifier. Bioinformatics 2019; 34:2185-2194. [PMID: 29462250 DOI: 10.1093/bioinformatics/bty085] [Citation(s) in RCA: 287] [Impact Index Per Article: 47.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2017] [Accepted: 02/14/2018] [Indexed: 01/01/2023] Open
Abstract
Motivation The long non-coding RNA (lncRNA) studies have been hot topics in the field of RNA biology. Recent studies have shown that their subcellular localizations carry important information for understanding their complex biological functions. Considering the costly and time-consuming experiments for identifying subcellular localization of lncRNAs, computational methods are urgently desired. However, to the best of our knowledge, there are no computational tools for predicting the lncRNA subcellular locations to date. Results In this study, we report an ensemble classifier-based predictor, lncLocator, for predicting the lncRNA subcellular localizations. To fully exploit lncRNA sequence information, we adopt both k-mer features and high-level abstraction features generated by unsupervised deep models, and construct four classifiers by feeding these two types of features to support vector machine (SVM) and random forest (RF), respectively. Then we use a stacked ensemble strategy to combine the four classifiers and get the final prediction results. The current lncLocator can predict five subcellular localizations of lncRNAs, including cytoplasm, nucleus, cytosol, ribosome and exosome, and yield an overall accuracy of 0.59 on the constructed benchmark dataset. Availability and implementation The lncLocator is available at www.csbio.sjtu.edu.cn/bioinf/lncLocator. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Zhen Cao
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
| | - Xiaoyong Pan
- Department of Medical Informatics, Erasmus MC, Rotterdam, The Netherlands
| | - Yang Yang
- Department of Computer Science, Shanghai Jiao Tong University, and Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, China
| | - Yan Huang
- State Key Laboratory of Infrared Physics, Shanghai Institute of Technical Physics, Chinese Academy of Sciences, Shanghai, China
| | - Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
| |
Collapse
|
17
|
Wang X, Yu B, Ma A, Chen C, Liu B, Ma Q. Protein-protein interaction sites prediction by ensemble random forests with synthetic minority oversampling technique. Bioinformatics 2019; 35:2395-2402. [PMID: 30520961 PMCID: PMC6612859 DOI: 10.1093/bioinformatics/bty995] [Citation(s) in RCA: 93] [Impact Index Per Article: 15.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/24/2018] [Revised: 11/19/2018] [Accepted: 12/03/2018] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION The prediction of protein-protein interaction (PPI) sites is a key to mutation design, catalytic reaction and the reconstruction of PPI networks. It is a challenging task considering the significant abundant sequences and the imbalance issue in samples. RESULTS A new ensemble learning-based method, Ensemble Learning of synthetic minority oversampling technique (SMOTE) for Unbalancing samples and RF algorithm (EL-SMURF), was proposed for PPI sites prediction in this study. The sequence profile feature and the residue evolution rates were combined for feature extraction of neighboring residues using a sliding window, and the SMOTE was applied to oversample interface residues in the feature space for the imbalance problem. The Multi-dimensional Scaling feature selection method was implemented to reduce feature redundancy and subset selection. Finally, the Random Forest classifiers were applied to build the ensemble learning model, and the optimal feature vectors were inserted into EL-SMURF to predict PPI sites. The performance validation of EL-SMURF on two independent validation datasets showed 77.1% and 77.7% accuracy, which were 6.2-15.7% and 6.1-18.9% higher than the other existing tools, respectively. AVAILABILITY AND IMPLEMENTATION The source codes and data used in this study are publicly available at http://github.com/QUST-AIBBDRC/EL-SMURF/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Xiaoying Wang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, China
- School of Mathematics, Shandong University, Jinan, China
- Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, China
| | - Bin Yu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, China
- Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, China
- School of Life Sciences, University of Science and Technology of China, Hefei, China
| | - Anjun Ma
- Bioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture and Plant Science, South Dakota State University, Brookings, SD, USA
- Department Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, USA
| | - Cheng Chen
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, China
- Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, China
| | - Bingqiang Liu
- School of Mathematics, Shandong University, Jinan, China
| | - Qin Ma
- Bioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture and Plant Science, South Dakota State University, Brookings, SD, USA
- Department Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, USA
| |
Collapse
|
18
|
Zhang J, Ma Z, Kurgan L. Comprehensive review and empirical analysis of hallmarks of DNA-, RNA- and protein-binding residues in protein chains. Brief Bioinform 2017; 20:1250-1268. [DOI: 10.1093/bib/bbx168] [Citation(s) in RCA: 60] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2017] [Revised: 11/15/2017] [Indexed: 11/13/2022] Open
Abstract
Abstract
Proteins interact with a variety of molecules including proteins and nucleic acids. We review a comprehensive collection of over 50 studies that analyze and/or predict these interactions. While majority of these studies address either solely protein–DNA or protein–RNA binding, only a few have a wider scope that covers both protein–protein and protein–nucleic acid binding. Our analysis reveals that binding residues are typically characterized with three hallmarks: relative solvent accessibility (RSA), evolutionary conservation and propensity of amino acids (AAs) for binding. Motivated by drawbacks of the prior studies, we perform a large-scale analysis to quantify and contrast the three hallmarks for residues that bind DNA-, RNA-, protein- and (for the first time) multi-ligand-binding residues that interact with DNA and proteins, and with RNA and proteins. Results generated on a well-annotated data set of over 23 000 proteins show that conservation of binding residues is higher for nucleic acid- than protein-binding residues. Multi-ligand-binding residues are more conserved and have higher RSA than single-ligand-binding residues. We empirically show that each hallmark discriminates between binding and nonbinding residues, even predicted RSA, and that combining them improves discriminatory power for each of the five types of interactions. Linear scoring functions that combine these hallmarks offer good predictive performance of residue-level propensity for binding and provide intuitive interpretation of predictions. Better understanding of these residue-level interactions will facilitate development of methods that accurately predict binding in the exponentially growing databases of protein sequences.
Collapse
|
19
|
Agudelo D, Bourassa P, Bariyanga J, Tajmir-Riahi H. Loading efficacy and binding analysis of retinoids with milk proteins: a short review. J Biomol Struct Dyn 2017; 36:4246-4254. [DOI: 10.1080/07391102.2017.1411833] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/27/2023]
Affiliation(s)
- D. Agudelo
- Département de Chemistry-Biochemistry, Physics, Université du Québec à Trois-Rivières, C. P. 500, Trois-Rivières G9A 5H7, Québec, Canada
| | - P. Bourassa
- Département de Chemistry-Biochemistry, Physics, Université du Québec à Trois-Rivières, C. P. 500, Trois-Rivières G9A 5H7, Québec, Canada
| | - J. Bariyanga
- Department of Chemistry, University of Hawaii-West O’ahu, 96-129 Ala Ike, Pearl City 96782, HI, USA
| | - H.A. Tajmir-Riahi
- Département de Chemistry-Biochemistry, Physics, Université du Québec à Trois-Rivières, C. P. 500, Trois-Rivières G9A 5H7, Québec, Canada
| |
Collapse
|
20
|
Pai PP, Dattatreya RK, Mondal S. Ensemble Architecture for Prediction of Enzyme‐ligand Binding Residues Using Evolutionary Information. Mol Inform 2017. [DOI: 10.1002/minf.201700021] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Affiliation(s)
- Priyadarshini P. Pai
- Department of Biological SciencesBirla Institute of Technology and Science-Pilani, K.K. Birla Goa Campus. Near NH17 Bypass Road Zuarinagar, Goa India
| | - Rohit Kadam Dattatreya
- Department of EconomicsBirla Institute of Technology and Science-Pilani, K.K. Birla Goa Campus. Near NH17 Bypass Road Zuarinagar, Goa India, PIN: 403726
| | - Sukanta Mondal
- Department of Biological SciencesBirla Institute of Technology and Science-Pilani, K.K. Birla Goa Campus. Near NH17 Bypass Road Zuarinagar, Goa India
| |
Collapse
|
21
|
Zhang J, Kurgan L. Review and comparative assessment of sequence-based predictors of protein-binding residues. Brief Bioinform 2017; 19:821-837. [DOI: 10.1093/bib/bbx022] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2016] [Indexed: 12/31/2022] Open
Affiliation(s)
- Jian Zhang
- School of Computer and Information Technology, Xinyang Normal University
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
| |
Collapse
|
22
|
Li GQ, Liu Z, Shen HB, Yu DJ. TargetM6A: Identifying N6-Methyladenosine Sites From RNA Sequences via Position-Specific Nucleotide Propensities and a Support Vector Machine. IEEE Trans Nanobioscience 2016; 15:674-682. [DOI: 10.1109/tnb.2016.2599115] [Citation(s) in RCA: 57] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]
|
23
|
Wei ZS, Han K, Yang JY, Shen HB, Yu DJ. Protein–protein interaction sites prediction by ensembling SVM and sample-weighted random forests. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2016.02.022] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023]
|
24
|
Xia G, Yan H, Yang J. Sparse preserving feature weights learning. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2015.12.020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|
25
|
Cai B, Jiang X. Computational methods for ubiquitination site prediction using physicochemical properties of protein sequences. BMC Bioinformatics 2016; 17:116. [PMID: 26940649 PMCID: PMC4778322 DOI: 10.1186/s12859-016-0959-z] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2015] [Accepted: 02/19/2016] [Indexed: 11/10/2022] Open
Abstract
Background Ubiquitination is a very important process in protein post-translational modification, which has been widely investigated by biology scientists and researchers. Different experimental and computational methods have been developed to identify the ubiquitination sites in protein sequences. This paper aims at exploring computational machine learning methods for the prediction of ubiquitination sites using the physicochemical properties (PCPs) of amino acids in the protein sequences. Results We first establish six different ubiquitination data sets, whose records contain both ubiquitination sites and non-ubiquitination sites in variant numbers of protein sequence segments. In particular, to establish such data sets, protein sequence segments are extracted from the original protein sequences used in four published papers on ubiquitination, while 531 PCP features of each extracted protein sequence segment are calculated based on PCP values from AAindex (Amino Acid index database) by averaging PCP values of all amino acids on each segment. Various computational machine-learning methods, including four Bayesian network methods (i.e., Naïve Bayes (NB), Feature Selection NB (FSNB), Model Averaged NB (MANB), and Efficient Bayesian Multivariate Classifier (EBMC)) and three regression methods (i.e., Support Vector Machine (SVM), Logistic Regression (LR), and Least Absolute Shrinkage and Selection Operator (LASSO)), are then applied to the six established segment-PCP data sets. Five-fold cross-validation and the Area Under Receiver Operating Characteristic Curve (AUROC) are employed to evaluate the ubiquitination prediction performance of each method. Results demonstrate that the PCP data of protein sequences contain information that could be mined by machine learning methods for ubiquitination site prediction. The comparative results show that EBMC, SVM and LR perform better than other methods, and EBMC is the only method that can get AUCs greater than or equal to 0.6 for the six established data sets. Results also show EBMC tends to perform better for larger data. Conclusions Machine learning methods have been employed for the ubiquitination site prediction based on physicochemical properties of amino acids on protein sequences. Results demonstrate the effectiveness of using machine learning methodology to mine information from PCP data concerning protein sequences, as well as the superiority of EBMC, SVM and LR (especially EBMC) for the ubiquitination prediction compared to other methods. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-0959-z) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Binghuang Cai
- Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15206-3701, USA.
| | - Xia Jiang
- Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15206-3701, USA.
| |
Collapse
|
26
|
Wei ZS, Yang JY, Shen HB, Yu DJ. A Cascade Random Forests Algorithm for Predicting Protein-Protein Interaction Sites. IEEE Trans Nanobioscience 2015; 14:746-60. [DOI: 10.1109/tnb.2015.2475359] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/25/2023]
|
27
|
Yan H, Yang J. Locality preserving score for joint feature weights learning. Neural Netw 2015; 69:126-34. [DOI: 10.1016/j.neunet.2015.06.001] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2015] [Revised: 05/06/2015] [Accepted: 06/03/2015] [Indexed: 10/23/2022]
|