1
|
Karim A, Alkhalifah T, Alturise F, Khan YD. PADG-Pred: Exploring Ensemble Approaches for Identifying Parkinson's Disease Associated Biomarkers Using Genomic Sequences Analysis. IET Syst Biol 2025; 19:e70006. [PMID: 40088455 PMCID: PMC11910177 DOI: 10.1049/syb2.70006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2024] [Revised: 01/04/2025] [Accepted: 02/12/2025] [Indexed: 03/17/2025] Open
Abstract
Parkinson's disease (PD), a degenerative disorder affecting the nervous system, manifests as unbalanced movements, stiffness, tremors, and coordination difficulties. Its cause, believed to involve genetic and environmental factors, underscores the critical need for prompt diagnosis and intervention to enhance treatment effectiveness. Despite the array of available diagnostics, their reliability remains a challenge. In this study, an innovative predictor PADG-Pred is proposed for the identification of Parkinson's associated biomarkers, utilising a genomic profile. In this study, a novel predictor, PADG-Pred, which not only identifies Parkinson's associated biomarkers through genomic profiling but also uniquely integrates multiple statistical feature extraction techniques with ensemble-based classification frameworks, thereby providing a more robust and interpretable decision-making process than existing tools. The processed dataset was utilised for feature extraction through multiple statistical moments and it is further involved in extensive training of the model using diverse classification techniques, encompassing Ensemble methods; XGBoost, Random Forest, Light Gradient Boosting Machine, Bagging, ExtraTrees, and Stacking. State-of-the-art validation procedures are applied, assessing key metrics such as specificity, accuracy, sensitivity/recall, and Mathew's correlation coefficient. The outcomes demonstrate the outstanding performance of PADG-RF, showcasing accuracy metrics consistently achieving ∼91% for the independent set, ∼94% for 5-fold, and ∼96% for 10-fold in cross-validation.
Collapse
Affiliation(s)
- Ayesha Karim
- Department of Computer ScienceSchool of Systems and Technology University of Management and TechnologyLahorePakistan
| | - Tamim Alkhalifah
- Department of Computer Engineering, College of ComputerQassim UniversityBuraydahSaudi Arabia
| | - Fahad Alturise
- Department of CybersecurityCollege of ComputerQassim UniversityBuraydahSaudi Arabia
| | - Yaser Daanial Khan
- Department of Computer ScienceSchool of Systems and Technology University of Management and TechnologyLahorePakistan
| |
Collapse
|
2
|
Naseem A, Alturise F, Alkhalifah T, Khan YD. ESM-BBB-Pred: a fine-tuned ESM 2.0 and deep neural networks for the identification of blood-brain barrier peptides. Brief Bioinform 2024; 26:bbaf066. [PMID: 39987496 DOI: 10.1093/bib/bbaf066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2024] [Revised: 01/24/2025] [Accepted: 02/07/2025] [Indexed: 02/25/2025] Open
Abstract
Blood-brain barrier peptides (BBBP) could significantly improve the delivery of drugs to the brain, paving the way for new treatments for central nervous system (CNS) disorders. The primary challenge in treating CNS disorders lies in the difficulty pharmaceutical agent's face in crossing the BBB. Almost 98% of small molecule drugs and nearly all large molecule drugs fail to penetrate the BBB effectively. Thus, identifying these peptides is vital for advancements in healthcare. This study introduces an enhanced intelligent computational model called BBB-PEP- Evolutionary Scale Modeling (ESM), designed to identify BBBP. The relative positions, reverse position and statistical moment-based features have been utilized on the existing benchmark dataset. For classification purpose, six deep classifiers such as fully connected networks, convolutional neural network, simple recurrent neural networks, long short-term memory (LSTM), bidirectional LSTM, and gated recurrent unit have been utilized. In addition to harnessing the effectiveness of the pre-trained model, a protein language model ESM 2.0 has been fine-tuned on a benchmark dataset for BBBP classification. Three tests such as self-consistency, independent set testing, and five-fold cross-validation have been utilized for evaluation purposes with evaluation metrics includes accuracy, specificity, sensitivity, and Matthews correlation coefficient. The fine-tuned model ESM 2.0 has shown superior results as compared to employed classifiers and surpasses the existing benchmark studies. This system will support future research and the scientific community in the computational identification of BBBP.
Collapse
Affiliation(s)
- Ansar Naseem
- Department of Software Engineering, Superior University, 17 KM Raiwind Road Lahore, Punjab 55150, Pakistan
| | - Fahad Alturise
- Department of Cybersecurity, College of Computer, Qassim University, Buraydah, Saudi Arabia
| | - Tamim Alkhalifah
- Department of Computer Engineering, College of Computer, Qassim University, Buraydah, Saudi Arabia
| | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
| |
Collapse
|
3
|
Zhao MX, Ding RF, Chen Q, Meng J, Li F, Fu S, Huang B, Liu Y, Ji ZL, Zhao Y. Nphos: Database and Predictor of Protein N-phosphorylation. GENOMICS, PROTEOMICS & BIOINFORMATICS 2024; 22:qzae032. [PMID: 39380205 DOI: 10.1093/gpbjnl/qzae032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/28/2022] [Revised: 03/03/2024] [Accepted: 04/01/2024] [Indexed: 10/10/2024]
Abstract
Protein N-phosphorylation is widely present in nature and participates in various biological processes. However, current knowledge on N-phosphorylation is extremely limited compared to that on O-phosphorylation. In this study, we collected 11,710 experimentally verified N-phosphosites of 7344 proteins from 39 species and subsequently constructed the database Nphos to share up-to-date information on protein N-phosphorylation. Upon these substantial data, we characterized the sequential and structural features of protein N-phosphorylation. Moreover, after comparing hundreds of learning models, we chose and optimized gradient boosting decision tree (GBDT) models to predict three types of human N-phosphorylation, achieving mean area under the receiver operating characteristic curve (AUC) values of 90.56%, 91.24%, and 92.01% for pHis, pLys, and pArg, respectively. Meanwhile, we discovered 488,825 distinct N-phosphosites in the human proteome. The models were also deployed in Nphos for interactive N-phosphosite prediction. In summary, this work provides new insights and points for both flexible and focused investigations of N-phosphorylation. It will also facilitate a deeper and more systematic understanding of protein N-phosphorylation modification by providing a data and technical foundation. Nphos is freely available at http://www.bio-add.org/Nphos/ and http://ppodd.org.cn/Nphos/.
Collapse
Affiliation(s)
- Ming-Xiao Zhao
- Institute of Drug Discovery Technology, Ningbo University, Ningbo 315211, China
- Department of Chemical Biology, Key Laboratory for Chemical Biology of Fujian Province, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
| | - Ruo-Fan Ding
- State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Faculty of Medicine and Life Sciences, Xiamen University, Xiamen 361102, China
| | - Qiang Chen
- Zhejiang Key Laboratory of Pathophysiology, Department of Biochemistry and Molecular Biology, Health Science Center, Ningbo University, Ningbo 315211, China
| | - Junhua Meng
- BGI Genomics, BGI-Shenzhen, Shenzhen 518083, China
| | - Fulai Li
- Institute of Drug Discovery Technology, Ningbo University, Ningbo 315211, China
| | - Songsen Fu
- Institute of Drug Discovery Technology, Ningbo University, Ningbo 315211, China
| | - Biling Huang
- Institute of Drug Discovery Technology, Ningbo University, Ningbo 315211, China
| | - Yan Liu
- Department of Chemical Biology, Key Laboratory for Chemical Biology of Fujian Province, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
| | - Zhi-Liang Ji
- State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Faculty of Medicine and Life Sciences, Xiamen University, Xiamen 361102, China
| | - Yufen Zhao
- Institute of Drug Discovery Technology, Ningbo University, Ningbo 315211, China
- Department of Chemical Biology, Key Laboratory for Chemical Biology of Fujian Province, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
- Key Laboratory of Bioorganic Phosphorus Chemistry & Chemical Biology, Department of Chemistry, Tsinghua University, Beijing 100084, China
| |
Collapse
|
4
|
Ansar Khawaja S, Alturise F, Alkhalifah T, Khan SA, Khan YD. Gluconeogenesis unraveled: A proteomic Odyssey with machine learning. Methods 2024; 232:29-42. [PMID: 39276958 DOI: 10.1016/j.ymeth.2024.09.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2024] [Revised: 08/05/2024] [Accepted: 09/01/2024] [Indexed: 09/17/2024] Open
Abstract
The metabolic pathway known as gluconeogenesis, which produces glucose from non-carbohydrate substrates, is essential for maintaining balanced blood sugar levels while fasting. It's extremely important to anticipate gluconeogenesis rates accurately to recognize metabolic disorders and create efficient treatment strategies. The implementation of deep learning and machine learning methods to forecast complex biological processes has been gaining popularity in recent years. The recognition of both the regulation of the pathway and possible therapeutic applications of proteins depends on accurate identification associated with their gluconeogenesis patterns. This article analyzes the uses of machine learning and deep learning models, to predict gluconeogenesis efficiency. The study also discusses the challenges that come with restricted data availability and model interpretability, as well as possible applications in personalized healthcare, metabolic disease treatment, and the discovery of drugs. The predictor utilizes statistics moments on the structures of gluconeogenesis and their enzymes, while Random Forest is utilized as a classifier to ensure the accuracy of this model in identifying the best outcomes. The method was validated utilizing the independent test, self-consistency, 10k fold cross-validations, and jackknife test which achieved 92.33 %, 91.87%, 87.88%, and 87.02%. An accurate prediction of gluconeogenesis has significant implications for understanding metabolic disorders and developing targeted therapies. This study contributes to the rising field of predictive biology by mixing algorithms for deep learning, and machine learning, with metabolic pathways.
Collapse
Affiliation(s)
- Seher Ansar Khawaja
- Department of Computer Science, University of Management and Technology, Lahore, Paksistan
| | - Fahad Alturise
- Department of Cybersecurity, College of Computer, Qassim University, Buraydah, Saudi Arabia.
| | - Tamim Alkhalifah
- Deparment of Computer Engineering, College of Computer, Qassim University, Buraydah, Saudi Arabia.
| | - Sher Afzal Khan
- Deparment of Computer Sciences, Abdul Wali Khan University, Mardan, Pakistan.
| | - Yaser Daanial Khan
- Department of Computer Science, University of Management and Technology, Lahore, Paksistan.
| |
Collapse
|
5
|
Sabir MJ, Kamli MR, Atef A, Alhibshi AM, Edris S, Hajarah NH, Bahieldin A, Manavalan B, Sabir JSM. Computational prediction of phosphorylation sites of SARS-CoV-2 infection using feature fusion and optimization strategies. Methods 2024; 229:1-8. [PMID: 38768932 DOI: 10.1016/j.ymeth.2024.04.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/10/2023] [Revised: 03/15/2024] [Accepted: 04/30/2024] [Indexed: 05/22/2024] Open
Abstract
SARS-CoV-2's global spread has instigated a critical health and economic emergency, impacting countless individuals. Understanding the virus's phosphorylation sites is vital to unravel the molecular intricacies of the infection and subsequent changes in host cellular processes. Several computational methods have been proposed to identify phosphorylation sites, typically focusing on specific residue (S/T) or Y phosphorylation sites. Unfortunately, current predictive tools perform best on these specific residues and may not extend their efficacy to other residues, emphasizing the urgent need for enhanced methodologies. In this study, we developed a novel predictor that integrated all the residues (STY) phosphorylation sites information. We extracted ten different feature descriptors, primarily derived from composition, evolutionary, and position-specific information, and assessed their discriminative power through five classifiers. Our results indicated that Light Gradient Boosting (LGB) showed superior performance, and five descriptors displayed excellent discriminative capabilities. Subsequently, we identified the top two integrated features have high discriminative capability and trained with LGB to develop the final prediction model, LGB-IPs. The proposed approach shows an excellent performance on 10-fold cross-validation with an ACC, MCC, and AUC values of 0.831, 0.662, 0.907, respectively. Notably, these performances are replicated in the independent evaluation. Consequently, our approach may provide valuable insights into the phosphorylation mechanisms in SARS-CoV-2 infection for biomedical researchers.
Collapse
Affiliation(s)
- Mumdooh J Sabir
- Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
| | - Majid Rasool Kamli
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Ahmed Atef
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Alawiah M Alhibshi
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Sherif Edris
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Nahid H Hajarah
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Ahmed Bahieldin
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea.
| | - Jamal S M Sabir
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia.
| |
Collapse
|
6
|
Naseem A, Khan YD. An intelligent model for prediction of abiotic stress-responsive microRNAs in plants using statistical moments based features and ensemble approaches. Methods 2024; 228:65-79. [PMID: 38768931 DOI: 10.1016/j.ymeth.2024.05.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2024] [Revised: 04/30/2024] [Accepted: 05/10/2024] [Indexed: 05/22/2024] Open
Abstract
This study proposed an intelligent model for predicting abiotic stress-responsive microRNAs in plants. MicroRNAs (miRNAs) are short RNA molecules regulates the stress in genes. Experimental methods are costly and time-consuming, as compare to in-silico prediction. Addressing this gap, the study seeks to develop an efficient computational model for plant stress response prediction. The two benchmark datasets for MiRNA and Pre-MiRNA dataset have been acquired in this study. Four ensemble approaches such as bagging, boosting, stacking, and blending have been employed. Classifiers such as Random Forest (RF), Extra Trees (ET), Ada Boost (ADB), Light Gradient Boosting Machine (LGBM), and Support Vector Machine (SVM). Stacking and Blending employed all stated classifiers as base learners and Logistic Regression (LR) as Meta Classifier. There have been a total of four types of testing used, including independent set, self-consistency, cross-validation with 5 and 10 folds, and jackknife. This study has utilized evaluation metrics such as accuracy score, specificity, sensitivity, Mathew's correlation coefficient (MCC), and AUC. Our proposed methodology has outperformed existing state of the art study in both datasets based on independent set testing. The SVM-based approach has exhibited accuracy score of 0.659 for the MiRNA dataset, which is better than the previous study. The ET classifier has surpassed the accuracy of Pre-MiRNA dataset as compared to the existing benchmark study, achieving an impressive score of 0.67. The proposed method can be used in future research to predict abiotic stresses in plants.
Collapse
Affiliation(s)
- Ansar Naseem
- Department of Artificial Intelligence, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
| | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan.
| |
Collapse
|
7
|
Chang X, Zhu Y, Chen Y, Li L. DeepNphos: A deep-learning architecture for prediction of N-phosphorylation sites. Comput Biol Med 2024; 170:108079. [PMID: 38295472 DOI: 10.1016/j.compbiomed.2024.108079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2023] [Revised: 01/25/2024] [Accepted: 01/27/2024] [Indexed: 02/02/2024]
Abstract
MOTIVATION Phosphorylation, a prevalent post-translational modification, plays a crucial role in regulating cellular activities. This process encompasses O-phosphorylation (e.g., phosphoserine) and N-phosphorylation (e.g., phospho-lysine (pK), phospho-arginine (pR), and phospho-histidine (pH)). While significant research has focused on O-phosphorylation, resulting in the development of various algorithms for predicting O-phosphorylation sites with commendable performance, there has been a notable absence of models designed to predict N-phosphorylation sites. This study introduces an integrated model named DeepNphos, designed to predict N-phosphorylation sites. This model is developed based on the analysis of thousands of experimentally identified pK, pR and pH sites. RESULTS Observing that the Convolutional Neural Network (CNN) model, incorporating the One-Hot encoding feature, demonstrates favorable performance in comparison to other models when predicting pK, pR, and pH sites. Additionally, pK exhibits similarities to other lysine modification types, and integrating the CNN model with a deep-transfer learning (DTL) strategy based on tens of thousands of known lysine modification sites could enhance pK prediction performance. In contrast, pR exhibits little similarity to other arginine modification types, and the integration of DTL has minimal impact on pR prediction performance. Furthermore, the decision was made to refrain from incorporating the DTL strategy in predicting pH sites, given the scarcity of histidine modification sites beyond those associated with pH. The final classifiers for predicting pK, pR, and pH sites achieve AUC values of 0.856, 0.805 and 0.802 for ten-fold cross-validation, respectively. Overall, DeepNphos is the first classifier for predicting N-phosphorylation sites, accessible at https://github.com/ChangXulinmessi/DeepNPhos.
Collapse
Affiliation(s)
- Xulin Chang
- College of Computer Science and Technology, Qingdao University, Qingdao, 266071, China
| | - Yafei Zhu
- College of Computer Science and Technology, Qingdao University, Qingdao, 266071, China
| | - Yu Chen
- College of Computer Science and Technology, Qingdao University, Qingdao, 266071, China
| | - Lei Li
- School of Health and Life Sciences, University of Health and Rehabilitation Sciences, Qingdao, 266000, China.
| |
Collapse
|
8
|
Bao W, Liu Y, Chen B. Oral_voting_transfer: classification of oral microorganisms' function proteins with voting transfer model. Front Microbiol 2024; 14:1277121. [PMID: 38384719 PMCID: PMC10879614 DOI: 10.3389/fmicb.2023.1277121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2023] [Accepted: 12/19/2023] [Indexed: 02/23/2024] Open
Abstract
Introduction The oral microbial group typically represents the human body's highly complex microbial group ecosystem. Oral microorganisms take part in human diseases, including Oral cavity inflammation, mucosal disease, periodontal disease, tooth decay, and oral cancer. On the other hand, oral microbes can also cause endocrine disorders, digestive function, and nerve function disorders, such as diabetes, digestive system diseases, and Alzheimer's disease. It was noted that the proteins of oral microbes play significant roles in these serious diseases. Having a good knowledge of oral microbes can be helpful in analyzing the procession of related diseases. Moreover, the high-dimensional features and imbalanced data lead to the complexity of oral microbial issues, which can hardly be solved with traditional experimental methods. Methods To deal with these challenges, we proposed a novel method, which is oral_voting_transfer, to deal with such classification issues in the field of oral microorganisms. Such a method employed three features to classify the five oral microorganisms, including Streptococcus mutans, Staphylococcus aureus, abiotrophy adjacent, bifidobacterial, and Capnocytophaga. Firstly, we utilized the highly effective model, which successfully classifies the organelle's proteins and transfers to deal with the oral microorganisms. And then, some classification methods can be treated as the local classifiers in this work. Finally, the results are voting from the transfer classifiers and the voting ones. Results and discussion The proposed method achieved the well performances in the five oral microorganisms. The oral_voting_transfer is a standalone tool, and all its source codes are publicly available at https://github.com/baowz12345/voting_transfer.
Collapse
Affiliation(s)
- Wenzheng Bao
- School of Information Engineering, Xuzhou University of Technology, Xuzhou, China
| | - Yujun Liu
- School of Information Engineering, Xuzhou University of Technology, Xuzhou, China
| | - Baitong Chen
- The Affiliated Xuzhou Municipal Hospital of Xuzhou Medical University, Xuzhou, China
- Department of Stomatology, Xuzhou First People’s Hospital, Xuzhou, China
| |
Collapse
|
9
|
Ali Shah A, Shaker ASA, Jabbar S, Abbas Q, Al-Balawi TS, Celebi ME. An ensemble-based deep learning model for detection of mutation causing cutaneous melanoma. Sci Rep 2023; 13:22251. [PMID: 38097641 PMCID: PMC10721601 DOI: 10.1038/s41598-023-49075-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2023] [Accepted: 12/04/2023] [Indexed: 12/17/2023] Open
Abstract
When the mutation affects the melanocytes of the body, a condition called melanoma results which is one of the deadliest skin cancers. Early detection of cutaneous melanoma is vital for raising the chances of survival. Melanoma can be due to inherited defective genes or due to environmental factors such as excessive sun exposure. The accuracy of the state-of-the-art computer-aided diagnosis systems is unsatisfactory. Moreover, the major drawback of medical imaging is the shortage of labeled data. Generalized classifiers are required to diagnose melanoma to avoid overfitting the dataset. To address these issues, blending ensemble-based deep learning (BEDLM-CMS) model is proposed to detect mutation of cutaneous melanoma by integrating long short-term memory (LSTM), Bi-directional LSTM (BLSTM) and gated recurrent unit (GRU) architectures. The dataset used in the proposed study contains 2608 human samples and 6778 mutations in total along with 75 types of genes. The most prominent genes that function as biomarkers for early diagnosis and prognosis are utilized. Multiple extraction techniques are used in this study to extract the most-prominent features. Afterwards, we applied different DL models optimized through grid search technique to diagnose melanoma. The validity of the results is confirmed using several techniques, including tenfold cross validation (10-FCVT), independent set (IST), and self-consistency (SCT). For validation of the results multiple metrics are used which include accuracy, specificity, sensitivity, and Matthews's correlation coefficient. BEDLM gives the highest accuracy of 97% in the independent set test whereas in self-consistency test and tenfold cross validation test it gives 94% and 93% accuracy, respectively. Accuracy of in self-consistency test, independent set test, and tenfold cross validation test is LSTM (96%, 94%, 92%), GRU (93%, 94%, 91%), and BLSTM (99%, 98%, 93%), respectively. The findings demonstrate that the proposed BEDLM-CMS can be used effectively applied for early diagnosis and treatment efficacy evaluation of cutaneous melanoma.
Collapse
Affiliation(s)
- Asghar Ali Shah
- Department of Computer Science, Bahria University, Islamabad, Pakistan
| | | | - Sohail Jabbar
- College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University (IMSIU), 11432, Riyadh, Saudi Arabia
| | - Qaisar Abbas
- College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University (IMSIU), 11432, Riyadh, Saudi Arabia.
| | - Talal Saad Al-Balawi
- College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University (IMSIU), 11432, Riyadh, Saudi Arabia
| | - M Emre Celebi
- Department of Computer Science and Engineering, University of Central Arkansas, 201 Donaghey Ave., Conway, AR, 72035, USA
| |
Collapse
|
10
|
Esmaili F, Pourmirzaei M, Ramazi S, Shojaeilangari S, Yavari E. A Review of Machine Learning and Algorithmic Methods for Protein Phosphorylation Site Prediction. GENOMICS, PROTEOMICS & BIOINFORMATICS 2023; 21:1266-1285. [PMID: 37863385 PMCID: PMC11082408 DOI: 10.1016/j.gpb.2023.03.007] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/16/2022] [Revised: 01/16/2023] [Accepted: 03/23/2023] [Indexed: 10/22/2023]
Abstract
Post-translational modifications (PTMs) have key roles in extending the functional diversity of proteins and, as a result, regulating diverse cellular processes in prokaryotic and eukaryotic organisms. Phosphorylation modification is a vital PTM that occurs in most proteins and plays a significant role in many biological processes. Disorders in the phosphorylation process lead to multiple diseases, including neurological disorders and cancers. The purpose of this review is to organize this body of knowledge associated with phosphorylation site (p-site) prediction to facilitate future research in this field. At first, we comprehensively review all related databases and introduce all steps regarding dataset creation, data preprocessing, and method evaluation in p-site prediction. Next, we investigate p-site prediction methods, which are divided into two computational groups: algorithmic and machine learning (ML). Additionally, it is shown that there are basically two main approaches for p-site prediction by ML: conventional and end-to-end deep learning methods, both of which are given an overview. Moreover, this review introduces the most important feature extraction techniques, which have mostly been used in p-site prediction. Finally, we create three test sets from new proteins related to the released version of the database of protein post-translational modifications (dbPTM) in 2022 based on general and human species. Evaluating online p-site prediction tools on newly added proteins introduced in the dbPTM 2022 release, distinct from those in the dbPTM 2019 release, reveals their limitations. In other words, the actual performance of these online p-site prediction tools on unseen proteins is notably lower than the results reported in their respective research papers.
Collapse
Affiliation(s)
- Farzaneh Esmaili
- Department of Information Technology, Tarbiat Modares University, Tehran 14115-111, Iran
| | - Mahdi Pourmirzaei
- Department of Information Technology, Tarbiat Modares University, Tehran 14115-111, Iran
| | - Shahin Ramazi
- Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran 14115-111, Iran.
| | - Seyedehsamaneh Shojaeilangari
- Biomedical Engineering Group, Department of Electrical Engineering and Information Technology, Iranian Research Organization for Science and Technology (IROST), Tehran 33535-111, Iran
| | - Elham Yavari
- Department of Information Technology, Tarbiat Modares University, Tehran 14115-111, Iran
| |
Collapse
|
11
|
Naseem A, Alturise F, Alkhalifah T, Khan YD. BBB-PEP-prediction: improved computational model for identification of blood-brain barrier peptides using blending position relative composition specific features and ensemble modeling. J Cheminform 2023; 15:110. [PMID: 37980534 PMCID: PMC10656963 DOI: 10.1186/s13321-023-00773-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2023] [Accepted: 10/21/2023] [Indexed: 11/20/2023] Open
Abstract
BBPs have the potential to facilitate the delivery of drugs to the brain, opening up new avenues for the development of treatments targeting diseases of the central nervous system (CNS). The obstacle faced in central nervous system disorders stems from the formidable task of traversing the blood-brain barrier (BBB) for pharmaceutical agents. Nearly 98% of small molecule-based drugs and nearly 100% of large molecule-based drugs encounter difficulties in successfully penetrating the BBB. This importance leads to identification of these peptides, can help in healthcare systems. In this study, we proposed an improved intelligent computational model BBB-PEP-Prediction for identification of BBB peptides. Position and statistical moments based features have been computed for acquired benchmark dataset. Four types of ensembles such as bagging, boosting, stacking and blending have been utilized in the methodology section. Bagging employed Random Forest (RF) and Extra Trees (ET), Boosting utilizes XGBoost (XGB) and Light Gradient Boosting Machine (LGBM). Stacking uses ET and XGB as base learners, blending exploited LGBM and RF as base learners, while Logistic Regression (LR) has been applied as Meta learner for stacking and blending. Three classifiers such as LGBM, XGB and ET have been optimized by using Randomized search CV. Four types of testing such as self-consistency, independent set, cross-validation with 5 and 10 folds and jackknife test have been employed. Evaluation metrics such as Accuracy (ACC), Specificity (SPE), Sensitivity (SEN), Mathew's correlation coefficient (MCC) have been utilized. The stacking of classifiers has shown best results in almost each testing. The stacking results for independent set testing exhibits accuracy, specificity, sensitivity and MCC score of 0.824, 0.911, 0.831 and 0.663 respectively. The proposed model BBB-PEP-Prediction shown superlative performance as compared to previous benchmark studies. The proposed system helps in future research and research community for in-silico identification of BBB peptides.
Collapse
Affiliation(s)
- Ansar Naseem
- Department of Artificial Intelligence, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
| | - Fahad Alturise
- Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass, Saudi Arabia.
| | - Tamim Alkhalifah
- Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass, Saudi Arabia
| | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
| |
Collapse
|
12
|
Alotaibi FM, Khan YD. A Framework for Prediction of Oncogenomic Progression Aiding Personalized Treatment of Gastric Cancer. Diagnostics (Basel) 2023; 13:2291. [PMID: 37443684 DOI: 10.3390/diagnostics13132291] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2023] [Revised: 06/05/2023] [Accepted: 06/13/2023] [Indexed: 07/15/2023] Open
Abstract
Mutations in genes can alter their DNA patterns, and by recognizing these mutations, many carcinomas can be diagnosed in the progression stages. The human body contains many hidden and enigmatic features that humankind has not yet fully understood. A total of 7539 neoplasm cases were reported from 1 January 2021 to 31 December 2021. Of these, 3156 were seen in males (41.9%) and 4383 (58.1%) in female patients. Several machine learning and deep learning frameworks are already implemented to detect mutations, but these techniques lack generalized datasets and need to be optimized for better results. Deep learning-based neural networks provide the computational power to calculate the complex structures of gastric carcinoma-driven gene mutations. This study proposes deep learning approaches such as long and short-term memory, gated recurrent units and bi-LSTM to help in identifying the progression of gastric carcinoma in an optimized manner. This study includes 61 carcinogenic driver genes whose mutations can cause gastric cancer. The mutation information was downloaded from intOGen.org and normal gene sequences were downloaded from asia.ensembl.org, as explained in the data collection section. The proposed deep learning models are validated using the self-consistency test (SCT), 10-fold cross-validation test (FCVT), and independent set test (IST); the IST prediction metrics of accuracy, sensitivity, specificity, MCC and AUC of LSTM, Bi-LSTM, and GRU are 97.18%, 98.35%, 96.01%, 0.94, 0.98; 99.46%, 98.93%, 100%, 0.989, 1.00; 99.46%, 98.93%, 100%, 0.989 and 1.00, respectively.
Collapse
Affiliation(s)
- Fahad M Alotaibi
- Department of Information System, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
| | - Yaser Daanial Khan
- Department of Computer Science, University of Management and Technology, Lahore 54770, Pakistan
| |
Collapse
|
13
|
Butt AH, Alkhalifah T, Alturise F, Khan YD. Ensemble Learning for Hormone Binding Protein Prediction: A Promising Approach for Early Diagnosis of Thyroid Hormone Disorders in Serum. Diagnostics (Basel) 2023; 13:diagnostics13111940. [PMID: 37296792 DOI: 10.3390/diagnostics13111940] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2023] [Revised: 05/20/2023] [Accepted: 05/22/2023] [Indexed: 06/12/2023] Open
Abstract
Hormone-binding proteins (HBPs) are specific carrier proteins that bind to a given hormone. A soluble carrier hormone binding protein (HBP), which can interact non-covalently and specifically with growth hormone, modulates or inhibits hormone signaling. HBP is essential for the growth of life, despite still being poorly understood. Several diseases, according to some data, are caused by HBPs that express themselves abnormally. Accurate identification of these molecules is the first step in investigating the roles of HBPs and understanding their biological mechanisms. For a better understanding of cell development and cellular mechanisms, accurate HBP determination from a given protein sequence is essential. Using traditional biochemical experiments, it is difficult to correctly separate HBPs from an increasing number of proteins because of the high experimental costs and lengthy experiment periods. The abundance of protein sequence data that has been gathered in the post-genomic era necessitates a computational method that is automated and enables quick and accurate identification of putative HBPs within a large number of candidate proteins. A brand-new machine-learning-based predictor is suggested as the HBP identification method. To produce the desirable feature set for the method proposed, statistical moment-based features and amino acids were combined, and the random forest was used to train the feature set. During 5-fold cross validation experiments, the suggested method achieved 94.37% accuracy and 0.9438 F1-scores, respectively, demonstrating the importance of the Hahn moment-based features.
Collapse
Affiliation(s)
- Ahmad Hassan Butt
- Department of Computer Science, Faculty of Computing & Information Technology, University of the Punjab, Lahore 54000, Pakistan
| | - Tamim Alkhalifah
- Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass 51921, Qassim, Saudi Arabia
| | - Fahad Alturise
- Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass 51921, Qassim, Saudi Arabia
| | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore 54770, Pakistan
| |
Collapse
|
14
|
Zhang Y, Li Z. RF_phage virion: Classification of phage virion proteins with a random forest model. Front Genet 2023; 13:1103783. [PMID: 36846294 PMCID: PMC9945117 DOI: 10.3389/fgene.2022.1103783] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2022] [Accepted: 12/30/2022] [Indexed: 02/10/2023] Open
Abstract
Introduction: Phages play essential roles in biological procession, and the virion proteins encoded by the phage genome constitute critical elements of the assembled phage particle. Methods: This study uses machine learning methods to classify phage virion proteins. We proposed a novel approach, RF_phage virion, for the effective classification of the virion and non-virion proteins. The model uses four protein sequence coding methods as features, and the random forest algorithm was employed to solve the classification problem. Results: The performance of the RF_phage virion model was analyzed by comparing the performance of this algorithm with that of classical machine learning methods. The proposed method achieved a specificity (Sp) of 93.37%%, sensitivity (Sn) of 90.30%, accuracy (Acc) of 91.84%, Matthews correlation coefficient (MCC) of .8371, and an F1 score of .9196.
Collapse
Affiliation(s)
- Yanqing Zhang
- School of Finance, Xuzhou University of Technology, Xuzhou, China
| | - Zhiyuan Li
- School of Artificial Intelligence and Software College, Jiangsu Normal University Kewen College, Xuzhou, China,*Correspondence: Zhiyuan Li,
| |
Collapse
|
15
|
Perveen G, Alturise F, Alkhalifah T, Daanial Khan Y. Hemolytic-Pred: A machine learning-based predictor for hemolytic proteins using position and composition-based features. Digit Health 2023; 9:20552076231180739. [PMID: 37434723 PMCID: PMC10331097 DOI: 10.1177/20552076231180739] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2022] [Accepted: 05/22/2023] [Indexed: 07/13/2023] Open
Abstract
Objective The objective of this study is to propose a novel in-silico method called Hemolytic-Pred for identifying hemolytic proteins based on their sequences, using statistical moment-based features, along with position-relative and frequency-relative information. Methods Primary sequences were transformed into feature vectors using statistical and position-relative moment-based features. Varying machine learning algorithms were employed for classification. Computational models were rigorously evaluated using four different validation. The Hemolytic-Pred webserver is available for further analysis at http://ec2-54-160-229-10.compute-1.amazonaws.com/. Results XGBoost outperformed the other six classifiers with an accuracy value of 0.99, 0.98, 0.97, and 0.98 for self-consistency test, 10-fold cross-validation, Jackknife test, and independent set test, respectively. The proposed method with the XGBoost classifier is a workable and robust solution for predicting hemolytic proteins efficiently and accurately. Conclusions The proposed method of Hemolytic-Pred with XGBoost classifier is a reliable tool for the timely identification of hemolytic cells and diagnosis of various related severe disorders. The application of Hemolytic-Pred can yield profound benefits in the medical field.
Collapse
Affiliation(s)
- Gulnaz Perveen
- Department of Computer Science, School
of Systems and Technology, University of Management and Technology, Lahore, Punjab,
Pakistan
| | - Fahad Alturise
- Department of Computer, College of
Science and Arts in Ar Rass Qassim University, Buraidah, Qassim, Saudi Arabia
| | - Tamim Alkhalifah
- Department of Computer, College of
Science and Arts in Ar Rass Qassim University, Buraidah, Qassim, Saudi Arabia
| | - Yaser Daanial Khan
- Department of Computer Science, School
of Systems and Technology, University of Management and Technology, Lahore, Punjab,
Pakistan
| |
Collapse
|
16
|
Zeng Y, Liu D, Wang Y. Identification of phosphorylation site using S-padding strategy based convolutional neural network. Health Inf Sci Syst 2022; 10:29. [PMID: 36124094 PMCID: PMC9481819 DOI: 10.1007/s13755-022-00196-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2022] [Accepted: 08/25/2022] [Indexed: 10/14/2022] Open
Abstract
Purpose Abnormal phosphorylation has been proved to associate with a variety of human diseases, and the identification of phosphorylation sites is one of the research hotspots in healthcare. The study of phosphorylation site prediction in deep learning models often introduces a variety of information, and the utilization of complex models limits the usage scenarios of the models. Methods An enhanced deep learning method with S-padding strategy based on convolutional neural network is proposed in this paper. The S-padding strategy forms a three-dimensional matrix with extension information from original amino acid sequences, and a corresponding 2D-CNN model is designed to abstract the comprehensive features of phosphorylation site area in protein sequences. Results The fivefold cross-validation experiments are conducted, and the results show the performance of the proposed method on human dataset can achieve an accuracy of 89.68 % on serine/threonine sites and 88.16 % on tyrosine sites, respectively. Furthermore, phosphorylation site prediction on different organisms obtains the accuracy, sensitivity, and specificity of over 0.85, indicating a potential capability on phosphorylation site prediction task. Comparison result with existing models shows that the proposed method obtains better performance on both accuracy and AUC value, and the proposed method can further improve performance with sufficient training data. Conclusion This method enables proteome-wide predictions via models trained on a large amount of phosphorylation data, further exploiting the potential of protein phosphorylation site identification, and helping to provide insights into phosphorylation mechanisms.
Collapse
Affiliation(s)
- Yanjiao Zeng
- School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, 510006 Guangdong China
| | - Dongning Liu
- School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, 510006 Guangdong China
| | - Yang Wang
- School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, 510006 Guangdong China
| |
Collapse
|
17
|
Zhao J, Zhuang M, Liu J, Zhang M, Zeng C, Jiang B, Wu J, Song X. pHisPred: a tool for the identification of histidine phosphorylation sites by integrating amino acid patterns and properties. BMC Bioinformatics 2022; 23:399. [PMID: 36171552 PMCID: PMC9520798 DOI: 10.1186/s12859-022-04938-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2022] [Accepted: 09/16/2022] [Indexed: 11/17/2022] Open
Abstract
Background Protein histidine phosphorylation (pHis) plays critical roles in prokaryotic signal transduction pathways and various eukaryotic cellular processes. It is estimated to account for 6–10% of the phosphoproteome, however only hundreds of pHis sites have been discovered to date. Due to the inherent disadvantages of experimental methods, it is an urgent task for developing efficient computational approaches to identify pHis sites. Results Here, we present a novel tool, pHisPred, for accurately identifying pHis sites from protein sequences. We manually collected the largest number of experimental validated pHis sites to build benchmark datasets. Using randomized tenfold CV, the weighted SVM-RBF model shows the best performance than other four commonly used classification models (LR, KNN, RF, and MLP). From ten thousands of features, 140 and 150 most informative features were individually selected out for eukaryotic and prokaryotic models. The average AUC and F1-score values of pHisPred were (0.81, 0.40) and (0.78, 0.46) for tenfold CV on the eukaryotic and prokaryotic training datasets, respectively. In addition, pHisPred significantly outperforms other tools on testing datasets, in particular on the eukaryotic one. Conclusion We implemented a python program of pHisPred, which is freely available for non-commercial use at https://github.com/xiaofengsong/pHisPred. Moreover, users can use it to train new models with their own data. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04938-x.
Collapse
Affiliation(s)
- Jian Zhao
- Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China
| | - Minhui Zhuang
- Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China
| | - Jingjing Liu
- Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China
| | - Meng Zhang
- Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China
| | - Cong Zeng
- Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China
| | - Bin Jiang
- College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 211106, China
| | - Jing Wu
- School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing, 211166, China.
| | - Xiaofeng Song
- Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China.
| |
Collapse
|
18
|
Butt AH, Alkhalifah T, Alturise F, Khan YD. A machine learning technique for identifying DNA enhancer regions utilizing CIS-regulatory element patterns. Sci Rep 2022; 12:15183. [PMID: 36071071 PMCID: PMC9452539 DOI: 10.1038/s41598-022-19099-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2022] [Accepted: 08/24/2022] [Indexed: 11/26/2022] Open
Abstract
Enhancers regulate gene expression, by playing a crucial role in the synthesis of RNAs and proteins. They do not directly encode proteins or RNA molecules. In order to control gene expression, it is important to predict enhancers and their potency. Given their distance from the target gene, lack of common motifs, and tissue/cell specificity, enhancer regions are thought to be difficult to predict in DNA sequences. Recently, a number of bioinformatics tools were created to distinguish enhancers from other regulatory components and to pinpoint their advantages. However, because the quality of its prediction method needs to be improved, its practical application value must also be improved. Based on nucleotide composition and statistical moment-based features, the current study suggests a novel method for identifying enhancers and non-enhancers and evaluating their strength. The proposed study outperformed state-of-the-art techniques using fivefold and tenfold cross-validation in terms of accuracy. The accuracy from the current study results in 86.5% and 72.3% in enhancer site and its strength prediction respectively. The results of the suggested methodology point to the potential for more efficient and successful outcomes when statistical moment-based features are used. The current study's source code is available to the research community at https://github.com/csbioinfopk/enpred.
Collapse
Affiliation(s)
- Ahmad Hassan Butt
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
| | - Tamim Alkhalifah
- Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass, Saudi Arabia.
| | - Fahad Alturise
- Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass, Saudi Arabia
| | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
| |
Collapse
|
19
|
Chen Y, Li S, Guo J. A method for identifying moonlighting proteins based on linear discriminant analysis and bagging-SVM. Front Genet 2022; 13:963349. [PMID: 36046247 PMCID: PMC9420859 DOI: 10.3389/fgene.2022.963349] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2022] [Accepted: 07/18/2022] [Indexed: 11/13/2022] Open
Abstract
Moonlighting proteins have at least two independent functions and are widely found in animals, plants and microorganisms. Moonlighting proteins play important roles in signal transduction, cell growth and movement, tumor inhibition, DNA synthesis and repair, and metabolism of biological macromolecules. Moonlighting proteins are difficult to find through biological experiments, so many researchers identify moonlighting proteins through bioinformatics methods, but their accuracies are relatively low. Therefore, we propose a new method. In this study, we select SVMProt-188D as the feature input, and apply a model combining linear discriminant analysis and basic classifiers in machine learning to study moonlighting proteins, and perform bagging ensemble on the best-performing support vector machine. They are identified accurately and efficiently. The model achieves an accuracy of 93.26% and an F-sorce of 0.946 on the MPFit dataset, which is better than the existing MEL-MP model. Meanwhile, it also achieves good results on the other two moonlighting protein datasets.
Collapse
|
20
|
Naseer S, Hussain W, Khan YD, Rasool N. iPhosS(Deep)-PseAAC: Identification of Phosphoserine Sites in Proteins Using Deep Learning on General Pseudo Amino Acid Compositions. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1703-1714. [PMID: 33242308 DOI: 10.1109/tcbb.2020.3040747] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Among all the PTMs, the protein phosphorylation is pivotal for various pathological and physiological processes. About 30 percent of eukaryotic proteins undergo the phosphorylation modification, leading to various changes in conformation, function, stability, localization, and so forth. In eukaryotic proteins, phosphorylation occurs on serine (S), Threonine (T) and Tyrosine (Y) residues. Among these all, serine phosphorylation has its own importance as it is associated with various importance biological processes, including energy metabolism, signal transduction pathways, cell cycling, and apoptosis. Thus, its identification is important, however, the in vitro, ex vivo and in vivo identification can be laborious, time-taking and costly. There is a dire need of an efficient and accurate computational model to help researchers and biologists identifying these sites, in an easy manner. Herein, we propose a novel predictor for identification of Phosphoserine sites (PhosS) in proteins, by integrating the Chou's Pseudo Amino Acid Composition (PseAAC) with deep features. We used well-known DNNs for both the tasks of learning a feature representation of peptide sequences and performing classifications. Among different DNNs, the best score is shown by Covolutional Neural Network based model which renders CNN based prediction model the best for Phosphoserine prediction. Based on these results, it is concluded that the proposed model can help to identify PhosS sites in a very efficient and accurate manner which can help scientists understand the mechanism of this modification in proteins.
Collapse
|
21
|
Malebary S, Rahman S, Barukab O, Ash’ari R, Khan SA. iAcety–SmRF: Identification of Acetylation Protein by Using Statistical Moments and Random Forest. MEMBRANES 2022; 12:membranes12030265. [PMID: 35323738 PMCID: PMC8955084 DOI: 10.3390/membranes12030265] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/10/2021] [Revised: 01/25/2022] [Accepted: 02/01/2022] [Indexed: 12/21/2022]
Abstract
Acetylation is the most important post-translation modification (PTM) in eukaryotes; it has manifold effects on the level of protein that transform an acetyl group from an acetyl coenzyme to a specific site on a polypeptide chain. Acetylation sites play many important roles, including regulating membrane protein functions and strongly affecting the membrane interaction of proteins and membrane remodeling. Because of these properties, its correct identification is essential to understand its mechanism in biological systems. As such, some traditional methods, such as mass spectrometry and site-directed mutagenesis, are used, but they are tedious and time-consuming. To overcome such limitations, many computer models are being developed to correctly identify their sequences from non-acetyl sequences, but they have poor efficiency in terms of accuracy, sensitivity, and specificity. This work proposes an efficient and accurate computational model for predicting Acetylation using machine learning approaches. The proposed model achieved an accuracy of 100 percent with the 10-fold cross-validation test based on the Random Forest classifier, along with a feature extraction approach using statistical moments. The model is also validated by the jackknife, self-consistency, and independent test, which achieved an accuracy of 100, 100, and 97, respectively, results far better as compared to the already existing models available in the literature.
Collapse
Affiliation(s)
- Sharaf Malebary
- Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21911, Saudi Arabia; (S.M.); (O.B.); (R.A.)
| | - Shaista Rahman
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan 23200, Pakistan;
- Correspondence:
| | - Omar Barukab
- Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21911, Saudi Arabia; (S.M.); (O.B.); (R.A.)
| | - Rehab Ash’ari
- Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21911, Saudi Arabia; (S.M.); (O.B.); (R.A.)
| | - Sher Afzal Khan
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan 23200, Pakistan;
| |
Collapse
|
22
|
Naseer S, Ali RF, Fati SM, Muneer A. Computational identification of 4-carboxyglutamate sites to supplement physiological studies using deep learning. Sci Rep 2022; 12:128. [PMID: 34996975 PMCID: PMC8741832 DOI: 10.1038/s41598-021-03895-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2021] [Accepted: 12/03/2021] [Indexed: 01/23/2023] Open
Abstract
In biological systems, Glutamic acid is a crucial amino acid which is used in protein biosynthesis. Carboxylation of glutamic acid is a significant post-translational modification which plays important role in blood coagulation by activating prothrombin to thrombin. Contrariwise, 4-carboxy-glutamate is also found to be involved in diseases including plaque atherosclerosis, osteoporosis, mineralized heart valves, bone resorption and serves as biomarker for onset of these diseases. Owing to the pathophysiological significance of 4-carboxyglutamate, its identification is important to better understand pathophysiological systems. The wet lab identification of prospective 4-carboxyglutamate sites is costly, laborious and time consuming due to inherent difficulties of in-vivo, ex-vivo and in vitro experiments. To supplement these experiments, we proposed, implemented, and evaluated a different approach to develop 4-carboxyglutamate site predictors using pseudo amino acid compositions (PseAAC) and deep neural networks (DNNs). Our approach does not require any feature extraction and employs deep neural networks to learn feature representation of peptide sequences and performing classification thereof. Proposed approach is validated using standard performance evaluation metrics. Among different deep neural networks, convolutional neural network-based predictor achieved best scores on independent dataset with accuracy of 94.7%, AuC score of 0.91 and F1-score of 0.874 which shows the promise of proposed approach. The iCarboxE-Deep server is deployed at https://share.streamlit.io/sheraz-n/carboxyglutamate/app.py .
Collapse
Affiliation(s)
- Sheraz Naseer
- Department of Computer Science, University of Management and Technology, Lahore, 54770, Pakistan
| | - Rao Faizan Ali
- Department of Computer Science, University of Management and Technology, Lahore, 54770, Pakistan.
- Computer and Information Sciences Department, Universiti Teknologi PETRONAS, 32610, Seri Iskandar, Malaysia.
| | - Suliman Mohamed Fati
- College of Computer and Information Sciences, Prince Sultan University, Riyadh, 11586, Saudi Arabia
| | - Amgad Muneer
- Computer and Information Sciences Department, Universiti Teknologi PETRONAS, 32610, Seri Iskandar, Malaysia
| |
Collapse
|
23
|
Hussain W. sAMP-PFPDeep: Improving accuracy of short antimicrobial peptides prediction using three different sequence encodings and deep neural networks. Brief Bioinform 2021; 23:6445107. [PMID: 34849586 DOI: 10.1093/bib/bbab487] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2021] [Revised: 10/06/2021] [Accepted: 10/23/2021] [Indexed: 12/15/2022] Open
Abstract
Short antimicrobial peptides (sAMPs) belong to a significant repertoire of antimicrobial agents and are known to possess enhanced antimicrobial activity, higher stability and less toxicity to human cells, as well as less complex than other large biological drugs. As these molecules are significantly important, herein, a prediction method for sAMPs (with a sequence length ≤ 30 residues) is proposed for accurate and efficient prediction of sAMPs instead of laborious and costly experimental approaches. Benchmark dataset was collected from a recently reported study and sequences were converted into three channel images comprising information related to the position, frequency and sum of 12 physiochemical features as the first, second and third channels, respectively. Two image-based deep neural networks (DNNs), i.e. RESNET-50 and VGG-16 were trained and evaluated using various metrics while a comparative analysis with previous techniques was also performed. Validation of sAMP-PFPDeep was also performed by using molecular docking based analysis. The results showed that VGG-16 provided more accurate results, i.e. 98.30% training accuracy and 87.37% testing accuracy for predicting sAMPs as compared to those of RESNET-50 having 96.14% training accuracy and 83.87% testing accuracy. However, the comparative analysis revealed that both these models outperformed previously reported state-of-the-art methods. Based on the results, it is concluded that sAMP-PFPDeep can help identify antimicrobial peptides with promising accuracy and efficiency. It can help biologists and scientists to identify antimicrobial peptides, by further aiding the computer-aided drug design and discovery, as well as virtual screening protocols against various pathologies. sAMP-PFPDeep is available at (https://github.com/WaqarHusain/sAMP-PFPDeep).
Collapse
Affiliation(s)
- Waqar Hussain
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore-54770, Pakistan
| |
Collapse
|
24
|
Malebary SJ, Alzahrani E, Khan YD. A comprehensive tool for accurate identification of methyl-Glutamine sites. J Mol Graph Model 2021; 110:108074. [PMID: 34768228 DOI: 10.1016/j.jmgm.2021.108074] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2021] [Revised: 10/15/2021] [Accepted: 11/02/2021] [Indexed: 11/16/2022]
Abstract
Methylation is a biochemical process involved in nearly all of the human body functions. Glutamine is considered an indispensable amino acid that is susceptible to methylation via post-translational modification (PTM). Modern research has proved that methylation plays a momentous role in the progression of most types of cancers. Therefore, there is a need for an effective method to predict glutamine sites vulnerable to methylation accurately and inexpensively. The motive of this study is the formulation of an accurate method that could predict such sites with high accuracy. Various computationally intelligent classifiers were employed for their formulation and evaluation. Rigorous validations prove that deep learning performs best as compared to other classifiers. The accuracy (ACC) and the area under the receiver operating curve (AUC) obtained by 10-fold cross-validation was 0.962 and 0.981, while with the jackknife testing, it was 0.968 and 0.980, respectively. From these results, it is concluded that the proposed methodology works sufficiently well for the prediction of methyl-glutamine sites. The webserver's code, developed for the prediction of methyl-glutamine sites, is freely available at https://github.com/s20181080001/WebServer.git. The code can easily be set up by any intermediate-level Python user.
Collapse
Affiliation(s)
- Sharaf J Malebary
- Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, P.O. Box 344, Rabigh, 21911, Saudi Arabia.
| | - Ebraheem Alzahrani
- Department of Mathematics, Faculty of Science, King Abdulaziz University, P. O. Box 80203, Jeddah, 21589, Saudi Arabia.
| | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan.
| |
Collapse
|
25
|
Alzahrani E, Alghamdi W, Ullah MZ, Khan YD. Identification of stress response proteins through fusion of machine learning models and statistical paradigms. Sci Rep 2021; 11:21767. [PMID: 34741132 PMCID: PMC8571424 DOI: 10.1038/s41598-021-99083-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2021] [Accepted: 09/13/2021] [Indexed: 11/08/2022] Open
Abstract
Proteins are a vital component of cells that perform physiological functions to ensure smooth operations of bodily functions. Identification of a protein's function involves a detailed understanding of the structure of proteins. Stress proteins are essential mediators of several responses to cellular stress and are categorized based on their structural characteristics. These proteins are found to be conserved across many eukaryotic and prokaryotic linkages and demonstrate varied crucial functional activities inside a cell. The in-vivo, ex vivo, and in-vitro identification of stress proteins are a time-consuming and costly task. This study is aimed at the identification of stress protein sequences with the aid of mathematical modelling and machine learning methods to supplement the aforementioned wet lab methods. The model developed using Random Forest showed remarkable results with 91.1% accuracy while models based on neural network and support vector machine showed 87.7% and 47.0% accuracy, respectively. Based on evaluation results it was concluded that random-forest based classifier surpassed all other predictors and is suitable for use in practical applications for the identification of stress proteins. Live web server is available at http://biopred.org/stressprotiens , while the webserver code available is at https://github.com/abdullah5naveed/SRP_WebServer.git.
Collapse
Affiliation(s)
- Ebraheem Alzahrani
- Department of Mathematics, Faculty of Science, King Abdulaziz University, P. O. Box 80203, Jeddah, 21589, Saudi Arabia
| | - Wajdi Alghamdi
- Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, P. O. Box 80221, Jeddah, 21589, Saudi Arabia
| | - Malik Zaka Ullah
- Department of Mathematics, Faculty of Science, King Abdulaziz University, P. O. Box 80203, Jeddah, 21589, Saudi Arabia
| | - Yaser Daanial Khan
- Department of Computer Science, University of Management and Technology, Lahore, 54770, Pakistan.
| |
Collapse
|
26
|
The many ways that nature has exploited the unusual structural and chemical properties of phosphohistidine for use in proteins. Biochem J 2021; 478:3575-3596. [PMID: 34624072 DOI: 10.1042/bcj20210533] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/13/2021] [Revised: 09/15/2021] [Accepted: 09/22/2021] [Indexed: 01/12/2023]
Abstract
Histidine phosphorylation is an important and ubiquitous post-translational modification. Histidine undergoes phosphorylation on either of the nitrogens in its imidazole side chain, giving rise to 1- and 3- phosphohistidine (pHis) isomers, each having a phosphoramidate linkage that is labile at high temperatures and low pH, in contrast with stable phosphomonoester protein modifications. While all organisms routinely use pHis as an enzyme intermediate, prokaryotes, lower eukaryotes and plants also use it for signal transduction. However, research to uncover additional roles for pHis in higher eukaryotes is still at a nascent stage. Since the discovery of pHis in 1962, progress in this field has been relatively slow, in part due to a lack of the tools and techniques necessary to study this labile modification. However, in the past ten years the development of phosphoproteomic techniques to detect phosphohistidine (pHis), and methods to synthesize stable pHis analogues, which enabled the development of anti-phosphohistidine (pHis) antibodies, have accelerated our understanding. Recent studies that employed anti-pHis antibodies and other advanced techniques have contributed to a rapid expansion in our knowledge of histidine phosphorylation. In this review, we examine the varied roles of pHis-containing proteins from a chemical and structural perspective, and present an overview of recent developments in pHis proteomics and antibody development.
Collapse
|
27
|
Akmal MA, Hussain W, Rasool N, Khan YD, Khan SA, Chou KC. Using CHOU'S 5-Steps Rule to Predict O-Linked Serine Glycosylation Sites by Blending Position Relative Features and Statistical Moment. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2045-2056. [PMID: 31985438 DOI: 10.1109/tcbb.2020.2968441] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Glycosylation of proteins in eukaryote cells is an important and complicated post-translation modification due to its pivotal role and association with crucial physiological functions within most of the proteins. Identification of glycosylation sites in a polypeptide chain is not an easy task due to multiple impediments. Analytical identification of these sites is expensive and laborious. There is a dire need to develop a reliable computational method for precise determination of such sites which can help researchers to save time and effort. Herein, we propose a novel predictor namely iGlycoS-PseAAC by integrating the Chou's Pseudo Amino Acid Composition (PseAAC) and relative/absolute position-based features. The self-consistency results show that the accuracy revealed by the model using the benchmark dataset for prediction of O-linked glycosylation having serine sites is 98.8 percent. The overall accuracy of predictor achieved through 10-fold cross validation by combining the positive and negative results is 97.2 percent. The overall accuracy achieved through Jackknife test is 96.195 percent by aggregating of all the prediction results. Thus the proposed predictor can help in predicting the O-linked glycosylated serine sites in an efficient and accurate way. The overall results show that the accuracy of the iGlycoS-PseAAC is higher than the existing tools.
Collapse
|
28
|
Malebary SJ, Khan YD. Evaluating machine learning methodologies for identification of cancer driver genes. Sci Rep 2021; 11:12281. [PMID: 34112883 PMCID: PMC8192921 DOI: 10.1038/s41598-021-91656-8] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2021] [Accepted: 05/19/2021] [Indexed: 02/06/2023] Open
Abstract
Cancer is driven by distinctive sorts of changes and basic variations in genes. Recognizing cancer driver genes is basic for accurate oncological analysis. Numerous methodologies to distinguish and identify drivers presently exist, but efficient tools to combine and optimize them on huge datasets are few. Most strategies for prioritizing transformations depend basically on frequency-based criteria. Strategies are required to dependably prioritize organically dynamic driver changes over inert passengers in high-throughput sequencing cancer information sets. This study proposes a model namely PCDG-Pred which works as a utility capable of distinguishing cancer driver and passenger attributes of genes based on sequencing data. Keeping in view the significance of the cancer driver genes an efficient method is proposed to identify the cancer driver genes. Further, various validation techniques are applied at different levels to establish the effectiveness of the model and to obtain metrics like accuracy, Mathew's correlation coefficient, sensitivity, and specificity. The results of the study strongly indicate that the proposed strategy provides a fundamental functional advantage over other existing strategies for cancer driver genes identification. Subsequently, careful experiments exhibit that the accuracy metrics obtained for self-consistency, independent set, and cross-validation tests are 91.08%., 87.26%, and 92.48% respectively.
Collapse
Affiliation(s)
- Sharaf J Malebary
- Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, P.O. Box 344, Rabigh, 21911, Saudi Arabia
| | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan.
| |
Collapse
|
29
|
Naseer S, Hussain W, Khan YD, Rasool N. NPalmitoylDeep-PseAAC: A Predictor of N-Palmitoylation Sites in Proteins Using Deep Representations of Proteins and PseAAC via Modified 5-Steps Rule. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200605142828] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Background:
Among all the major Post-translational modification, lipid modifications
possess special significance due to their widespread functional importance in eukaryotic cells. There
exist multiple types of lipid modifications and Palmitoylation, among them, is one of the broader
types of modification, having three different types. The N-Palmitoylation is carried out by
attachment of palmitic acid to an N-terminal cysteine. Due to the association of N-Palmitoylation
with various biological functions and diseases such as Alzheimer’s and other neurodegenerative
diseases, its identification is very important.
Objective:
The in vitro, ex vivo and in vivo identification of Palmitoylation is laborious, time-taking
and costly. There is a dire need for an efficient and accurate computational model to help researchers
and biologists identify these sites, in an easy manner. Herein, we propose a novel prediction model
for the identification of N-Palmitoylation sites in proteins.
Method:
The proposed prediction model is developed by combining the Chou’s Pseudo Amino
Acid Composition (PseAAC) with deep neural networks. We used well-known deep neural
networks (DNNs) for both the tasks of learning a feature representation of peptide sequences and
developing a prediction model to perform classification.
Results:
Among different DNNs, Gated Recurrent Unit (GRU) based RNN model showed the
highest scores in terms of accuracy, and all other computed measures, and outperforms all the
previously reported predictors.
Conclusion:
The proposed GRU based RNN model can help to identify N-Palmitoylation in a very
efficient and accurate manner which can help scientists understand the mechanism of this
modification in proteins.
Collapse
Affiliation(s)
- Sheraz Naseer
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, P.O. Box 10033, C-II, Johar Town, Lahore 54770, Pakistan
| | - Waqar Hussain
- National Center of Artificial Intelligence, Punjab University College of Information Technology, University of the Punjab, Lahore, Pakistan
| | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, P.O. Box 10033, C-II, Johar Town, Lahore 54770, Pakistan
| | - Nouman Rasool
- Dr Panjwani Center for Molecular Medicine and Drug Research, International Center for Chemical and Biological Sciences, University of Karachi, Karachi, 75270, Pakistan
| |
Collapse
|
30
|
iDRP-PseAAC: Identification of DNA Replication Proteins Using General PseAAC and Position Dependent Features. Int J Pept Res Ther 2021; 27:1315-1329. [PMID: 33584161 PMCID: PMC7869428 DOI: 10.1007/s10989-021-10170-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 01/18/2021] [Indexed: 10/25/2022]
Abstract
DNA replication is one of the specific processes to be considered in all the living organisms, specifically eukaryotes. The prevalence of DNA replication is significant for an evolutionary transition at the beginning of life. DNA replication proteins are those proteins which support the process of replication and are also reported to be important in drug design and discovery. This information depicts that DNA replication proteins have a very important role in human bodies, however, to study their mechanism, their identification is necessary. Thus, it is a very important task but, in any case, an experimental identification is time-consuming, highly-costly and laborious. To cope with this issue, a computational methodology is required for prediction of these proteins, however, no prior method exists. This study comprehends the construction of novel prediction model to serve the proposed purpose. The prediction model is developed based on the artificial neural network by integrating the position relative features and sequence statistical moments in PseAAC for training neural networks. Highest overall accuracy has been achieved through tenfold cross-validation and Jackknife testing that was computed to be 96.22% and 98.56%, respectively. Our astonishing experimental results demonstrated that the proposed predictor surpass the existing models that can be served as a time and cost-effective stratagem for designing novel drugs to strike the contemporary bacterial infection.
Collapse
|
31
|
Khan YD, Alzahrani E, Alghamdi W, Ullah MZ. Sequence-based Identification of Allergen Proteins Developed by Integration of PseAAC and Statistical Moments via 5-Step Rule. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200424085947] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/24/2023]
Abstract
Background:
Allergens are antigens that can stimulate an atopic type I human
hypersensitivity reaction by an immunoglobulin E (IgE) reaction. Some proteins are naturally
allergenic than others. The challenge for toxicologists is to identify properties that allow proteins
to cause allergic sensitization and allergic diseases. The identification of allergen proteins is a very
critical and pivotal task. The experimental identification of protein functions is a hectic, laborious
and costly task; therefore, computer scientists have proposed various methods in the field of
computational biology and bioinformatics using various data science approaches. Objectives:
Herein, we report a novel predictor for the identification of allergen proteins.
Methods:
For feature extraction, statistical moments and various position-based features have been
incorporated into Chou’s pseudo amino acid composition (PseAAC), and are used for training of a
neural network.
Results:
The predictor is validated through 10-fold cross-validation and Jackknife testing, which
gave 99.43% and 99.87% accurate results.
Conclusions:
Thus, the proposed predictor can help in predicting the Allergen proteins in an
efficient and accurate way and can provide baseline data for the discovery of new drugs and
biomarkers.
Collapse
Affiliation(s)
- Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, C II Johar Town, Lahore 54770, Pakistan
| | - Ebraheem Alzahrani
- Department of Mathematics, Faculty of Science, King Abdulaziz University, P.O. Box 80203, Jeddah 21589, Saudi Arabia
| | - Wajdi Alghamdi
- Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, P.O. Box 80221, Jeddah, Saudi Arabia
| | - Malik Zaka Ullah
- Department of Mathematics, Faculty of Science, King Abdulaziz University, P.O. Box 80203, Jeddah 21589, Saudi Arabia
| |
Collapse
|
32
|
Naseer S, Hussain W, Khan YD, Rasool N. Optimization of serine phosphorylation prediction in proteins by comparing human engineered features and deep representations. Anal Biochem 2020; 615:114069. [PMID: 33340540 DOI: 10.1016/j.ab.2020.114069] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2020] [Revised: 11/15/2020] [Accepted: 12/14/2020] [Indexed: 02/01/2023]
Abstract
Deep representations can be used to replace human-engineered representations, as such features are constrained by certain limitations. For the prediction of protein post-translation modifications (PTMs) sites, research community uses different feature extraction techniques applied on Pseudo amino acid compositions (PseAAC). Serine phosphorylation is one of the most important PTM as it is the most occurring, and is important for various biological functions. Creating efficient representations from large protein sequences, to predict PTM sites, is a time and resource intensive task. In this study we propose, implement and evaluate use of Deep learning to learn effective protein data representations from PseAAC to develop data driven PTM detection systems and compare the same with two human representations.. The comparisons are performed by training an xgboost based classifier using each representation. The best scores were achieved by RNN-LSTM based deep representation and CNN based representation with an accuracy score of 81.1% and 78.3% respectively. Human engineered representations scored 77.3% and 74.9% respectively. Based on these results, it is concluded that the deep features are promising feature engineering replacement to identify PhosS sites in a very efficient and accurate manner which can help scientists understand the mechanism of this modification in proteins.
Collapse
Affiliation(s)
- Sheraz Naseer
- Department of Computer Science, University of Management and Technology, Lahore, Pakistan.
| | - Waqar Hussain
- National Center of Artificial Intelligence, Punjab University College of Information Technology, University of the Punjab, Lahore, Pakistan; Center for Professional & Applied Studies, Lahore, Pakistan
| | - Yaser Daanial Khan
- Department of Computer Science, University of Management and Technology, Lahore, Pakistan
| | - Nouman Rasool
- Center for Professional & Applied Studies, Lahore, Pakistan
| |
Collapse
|
33
|
Patient Management in Aortic Stenosis: Towards Precision Medicine Through Protein Analysis, Imaging and Diagnostic Tests. J Clin Med 2020; 9:jcm9082421. [PMID: 32731585 PMCID: PMC7463596 DOI: 10.3390/jcm9082421] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/10/2020] [Revised: 07/22/2020] [Accepted: 07/24/2020] [Indexed: 01/12/2023] Open
Abstract
Aortic stenosis is the most frequent valvular disease in developed countries. It progresses from mild fibrocalcific leaflet changes to a more severe leaflet calcification at the end stages of the disease. Unfortunately, symptoms of aortic stenosis are unspecific and only appear when it is too late, complicating patients' management. The global impact of aortic stenosis is increasing due to the growing elderly population. The disease supposes a great challenge because of the multiple comorbidities of these patients. Nowadays, the only effective treatment is valve replacement, which has a high cost in both social and economic terms. For that reason, it is crucial to find potential diagnostic, prognostic and therapeutic indicators that could help us to detect this disease in its earliest stages. In this article, we comprehensively review several key observations and translational studies related to protein markers that are promising for being implemented in the clinical field as well as a discussion about the role of precision medicine in aortic stenosis.
Collapse
|
34
|
Abstract
During the last three decades or so, many efforts have been made to study the protein cleavage
sites by some disease-causing enzyme, such as HIV (Human Immunodeficiency Virus) protease
and SARS (Severe Acute Respiratory Syndrome) coronavirus main proteinase. It has become increasingly
clear <i>via</i> this mini-review that the motivation driving the aforementioned studies is quite wise,
and that the results acquired through these studies are very rewarding, particularly for developing peptide
drugs.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
Collapse
|
35
|
Chou KC. An Insightful 10-year Recollection Since the Emergence of the 5-steps Rule. Curr Pharm Des 2020; 25:4223-4234. [PMID: 31782354 DOI: 10.2174/1381612825666191129164042] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2019] [Accepted: 11/25/2019] [Indexed: 11/22/2022]
Abstract
OBJECTIVE One of the most challenging and also the most difficult problems is how to formulate a biological sequence with a vector but considerably keep its sequence order information. METHODS To address such a problem, the approach of Pseudo Amino Acid Components or PseAAC has been developed. RESULTS AND CONCLUSION It has become increasingly clear via the 10-year recollection that the aforementioned proposal has been indeed very powerful.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, Massachusetts 02478, United States.,Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
Collapse
|
36
|
Chen Z, Zhao P, Li F, Leier A, Marquez-Lago TT, Webb GI, Baggag A, Bensmail H, Song J. PROSPECT: A web server for predicting protein histidine phosphorylation sites. J Bioinform Comput Biol 2020; 18:2050018. [DOI: 10.1142/s0219720020500183] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Background: Phosphorylation of histidine residues plays crucial roles in signaling pathways and cell metabolism in prokaryotes such as bacteria. While evidence has emerged that protein histidine phosphorylation also occurs in more complex organisms, its role in mammalian cells has remained largely uncharted. Thus, it is highly desirable to develop computational tools that are able to identify histidine phosphorylation sites. Result: Here, we introduce PROSPECT that enables fast and accurate prediction of proteome-wide histidine phosphorylation substrates and sites. Our tool is based on a hybrid method that integrates the outputs of two convolutional neural network (CNN)-based classifiers and a random forest-based classifier. Three features, including the one-of-K coding, enhanced grouped amino acids content (EGAAC) and composition of k-spaced amino acid group pairs (CKSAAGP) encoding, were taken as the input to three classifiers, respectively. Our results show that it is able to accurately predict histidine phosphorylation sites from sequence information. Our PROSPECT web server is user-friendly and publicly available at http://PROSPECT.erc.monash.edu/ . Conclusions: PROSPECT is superior than other pHis predictors in both the running speed and prediction accuracy and we anticipate that the PROSPECT webserver will become a popular tool for identifying the pHis sites in bacteria.
Collapse
Affiliation(s)
- Zhen Chen
- School of Basic Medical Science, Qingdao University, Qingdao, P. R. China
- State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese, Academy of Agricultural Sciences (CAAS), Anyang, P. R. China
| | - Pei Zhao
- State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese, Academy of Agricultural Sciences (CAAS), Anyang, P. R. China
| | - Fuyi Li
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Australia
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, VIC 3800, Australia
| | - André Leier
- Department of Genetics, School of Medicine, University of Alabama at Birmingham, USA
- Informatics Institute, School of Medicine, University of Alabama at Birmingham, USA
| | - Tatiana T. Marquez-Lago
- School of Basic Medical Science, Qingdao University, Qingdao, P. R. China
- Informatics Institute, School of Medicine, University of Alabama at Birmingham, USA
| | - Geoffrey I. Webb
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, VIC 3800, Australia
| | - Abdelkader Baggag
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
| | - Halima Bensmail
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Australia
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, VIC 3800, Australia
- ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
| |
Collapse
|
37
|
|
38
|
Hussain W, Rasool N, Khan YD. A Sequence-Based Predictor of Zika Virus Proteins Developed by Integration of PseAAC and Statistical Moments. Comb Chem High Throughput Screen 2020; 23:797-804. [PMID: 32342804 DOI: 10.2174/1386207323666200428115449] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2019] [Revised: 03/17/2020] [Accepted: 03/19/2020] [Indexed: 12/20/2022]
Abstract
BACKGROUND ZIKV has been a well-known global threat, which hits almost all of the American countries and posed a serious threat to the entire globe in 2016. The first outbreak of ZIKV was reported in 2007 in the Pacific area, followed by another severe outbreak, which occurred in 2013/2014 and subsequently, ZIKV spread to all other Pacific islands. A broad spectrum of ZIKV associated neurological malformations in neonates and adults has driven this deadly virus into the limelight. Though tremendous efforts have been focused on understanding the molecular basis of ZIKV, the viral proteins of ZIKV have still not been studied extensively. OBJECTIVES Herein, we report the first and the novel predictor for the identification of ZIKV proteins. METHODS We have employed Chou's pseudo amino acid composition (PseAAC), statistical moments and various position-based features. RESULTS The predictor is validated through 10-fold cross-validation and Jackknife testing. In 10- fold cross-validation, 94.09% accuracy, 93.48% specificity, 94.20% sensitivity and 0.80 MCC were achieved while in Jackknife testing, 96.62% accuracy, 94.57% specificity, 97.00% sensitivity and 0.88 MCC were achieved. CONCLUSION Thus, ZIKVPred-PseAAC can help in predicting the ZIKV proteins efficiently and accurately and can provide baseline data for the discovery of new drugs and biomarkers against ZIKV.
Collapse
Affiliation(s)
- Waqar Hussain
- National Center of Artificial Intelligence, Punjab University College of Information Technology, University of the
Punjab, Lahore, Pakistan,Center for Professional Studies, Lahore, Pakistan
| | | | - Yaser D Khan
- Department of Computer Science, University of Management and Technology, Lahore, Pakistan
| |
Collapse
|
39
|
Some illuminating remarks on molecular genetics and genomics as well as drug development. Mol Genet Genomics 2020; 295:261-274. [PMID: 31894399 DOI: 10.1007/s00438-019-01634-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2019] [Accepted: 12/05/2019] [Indexed: 02/07/2023]
Abstract
Facing the explosive growth of biological sequences unearthed in the post-genomic age, one of the most important but also most difficult problems in computational biology is how to express a biological sequence with a discrete model or a vector, but still keep it with considerable sequence-order information or its special pattern. To deal with such a challenging problem, the ideas of "pseudo amino acid components" and "pseudo K-tuple nucleotide composition" have been proposed. The ideas and their approaches have further stimulated the birth for "distorted key theory", "wenxing diagram", and substantially strengthening the power in treating the multi-label systems, as well as the establishment of the famous "5-steps rule". All these logic developments are quite natural that are very useful not only for theoretical scientists but also for experimental scientists in conducting genetics/genomics analysis and drug development. Presented in this review paper are also their future perspectives; i.e., their impacts will become even more significant and propounding.
Collapse
|
40
|
Shao YT, Liu XX, Lu Z, Chou KC. pLoc_Deep-mHum: Predict Subcellular Localization of Human Proteins by Deep Learning. ACTA ACUST UNITED AC 2020. [DOI: 10.4236/ns.2020.127042] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
41
|
Shao Y, Chou KC. pLoc_Deep-mEuk: Predict Subcellular Localization of Eukaryotic Proteins by Deep Learning. ACTA ACUST UNITED AC 2020. [DOI: 10.4236/ns.2020.126034] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
42
|
iQSP: A Sequence-Based Tool for the Prediction and Analysis of Quorum Sensing Peptides via Chou's 5-Steps Rule and Informative Physicochemical Properties. Int J Mol Sci 2019; 21:ijms21010075. [PMID: 31861928 PMCID: PMC6981611 DOI: 10.3390/ijms21010075] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2019] [Revised: 12/13/2019] [Accepted: 12/18/2019] [Indexed: 01/18/2023] Open
Abstract
Understanding of quorum-sensing peptides (QSPs) in their functional mechanism plays an essential role in finding new opportunities to combat bacterial infections by designing drugs. With the avalanche of the newly available peptide sequences in the post-genomic age, it is highly desirable to develop a computational model for efficient, rapid and high-throughput QSP identification purely based on the peptide sequence information alone. Although, few methods have been developed for predicting QSPs, their prediction accuracy and interpretability still requires further improvements. Thus, in this work, we proposed an accurate sequence-based predictor (called iQSP) and a set of interpretable rules (called IR-QSP) for predicting and analyzing QSPs. In iQSP, we utilized a powerful support vector machine (SVM) cooperating with 18 informative features from physicochemical properties (PCPs). Rigorous independent validation test showed that iQSP achieved maximum accuracy and MCC of 93.00% and 0.86, respectively. Furthermore, a set of interpretable rules IR-QSP was extracted by using random forest model and the 18 informative PCPs. Finally, for the convenience of experimental scientists, the iQSP web server was established and made freely available online. It is anticipated that iQSP will become a useful tool or at least as a complementary existing method for predicting and analyzing QSPs.
Collapse
|
43
|
Chou KC. Impacts of Pseudo Amino Acid Components and 5-steps Rule to Proteomics and Proteome Analysis. Curr Top Med Chem 2019; 19:2283-2300. [DOI: 10.2174/1568026619666191018100141] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2019] [Revised: 08/18/2019] [Accepted: 08/26/2019] [Indexed: 01/27/2023]
Abstract
Stimulated by the 5-steps rule during the last decade or so, computational proteomics has achieved remarkable progresses in the following three areas: (1) protein structural class prediction; (2) protein subcellular location prediction; (3) post-translational modification (PTM) site prediction. The results obtained by these predictions are very useful not only for an in-depth study of the functions of proteins and their biological processes in a cell, but also for developing novel drugs against major diseases such as cancers, Alzheimer’s, and Parkinson’s. Moreover, since the targets to be predicted may have the multi-label feature, two sets of metrics are introduced: one is for inspecting the global prediction quality, while the other for the local prediction quality. All the predictors covered in this review have a userfriendly web-server, through which the majority of experimental scientists can easily obtain their desired data without the need to go through the complicated mathematics.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China
| |
Collapse
|
44
|
Malebary SJ, Rehman MSU, Khan YD. iCrotoK-PseAAC: Identify lysine crotonylation sites by blending position relative statistical features according to the Chou's 5-step rule. PLoS One 2019; 14:e0223993. [PMID: 31751380 PMCID: PMC6874067 DOI: 10.1371/journal.pone.0223993] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2019] [Accepted: 10/02/2019] [Indexed: 01/22/2023] Open
Abstract
Among different post-translational modifications (PTMs), one of the most important one is the lysine crotonylation in proteins. Its importance cannot be undermined related to different diseases and essential biological practice. The key step for finding the hidden mechanisms of crotonylation along with their occurrence sites is to completely apprehend the mechanism behind this biological process. In previously reported studies, researchers have used different techniques, like position weighted matrix (PWM), support vector machine (SVM), k nearest neighbors (KNN), and many others. However, the maximum prediction accuracy achieved was not such high. To address this, herein, we propose an improved predictor for lysine crotonylation sites named iCrotoK-PseAAC, in which we have incorporated various position and composition relative features along with statistical moments into PseAAC. The results of self-consistency testing were 100% accurate, while the 10-fold cross validation gave 99.0% accuracy. Based on the validation and comparison of model, it is concluded that the iCrotoK-PseAAC is more accurate than the previously proposed models.
Collapse
Affiliation(s)
- Sharaf Jameel Malebary
- Department of Information Technology, King Abdul Aziz University, Rabigh, Kingdom of Saudi Arabia
| | - Muhammad Safi ur Rehman
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
| | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
| |
Collapse
|
45
|
Xuan P, Cui H, Shen T, Sheng N, Zhang T. HeteroDualNet: A Dual Convolutional Neural Network With Heterogeneous Layers for Drug-Disease Association Prediction via Chou's Five-Step Rule. Front Pharmacol 2019; 10:1301. [PMID: 31780934 PMCID: PMC6856670 DOI: 10.3389/fphar.2019.01301] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2019] [Accepted: 10/11/2019] [Indexed: 11/14/2022] Open
Abstract
Identifying new treatments for existing drugs can help reduce drug development costs and explore novel indications of drugs. The prediction of associations between drugs and diseases is challenging because their similarities and relations are complicated and non-linear. We propose a HeteroDualNet model to address this issue. Firstly, three types of matrices are extracted to represent intra-drug similarities, intra-disease similarity and drug-disease associations. The intra-drug similarities consider three drug features and a newly introduced drug-related disease correlation. Secondly, an embedding mechanism is proposed to integrate these matrices in a heterogenous drug-disease association layer (hetero-layer). Further, a neighbouring heterogeneous layer (hetero-layer-N) is constructed to incorporate the biological premise that similar drugs can often treat related diseases. Finally, a dual convolutional neural network is built with hetero-layer and hetero-layer-N as two branches to learn from characteristics of drug-disease and the relations of their neighbours simultaneously. HeteroDualNet outperformed the other four methods in comparison over a public dataset of 763 drugs and 681 diseases in terms of Areas Under the Curves of Receiver Operating Characteristics and Precision-Recall, and recall rate at top k. Case study of five drugs further proved the capacity of HeteroDualNet in finding reliable disease candidates of drugs as validated by database records or literature. Our findings show that the embedded heterogenous layers of original and neighbouring drug-disease representations in a dual neural network improved the association prediction performance.
Collapse
Affiliation(s)
- Ping Xuan
- School of Computer Science and Technology, Heilongjiang University, Harbin, China
| | - Hui Cui
- Department of Computer Science and Information Technology, La Trobe University, Bundoora, VIC, Australia
| | - Tonghui Shen
- School of Computer Science and Technology, Heilongjiang University, Harbin, China
| | - Nan Sheng
- School of Computer Science and Technology, Heilongjiang University, Harbin, China
| | - Tiangang Zhang
- School of Mathematical Science, Heilongjiang University, Harbin, China
| |
Collapse
|
46
|
Prediction of S-Sulfenylation Sites Using Statistical Moments Based Features via CHOU’S 5-Step Rule. Int J Pept Res Ther 2019. [DOI: 10.1007/s10989-019-09931-2] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022]
|
47
|
FKRR-MVSF: A Fuzzy Kernel Ridge Regression Model for Identifying DNA-Binding Proteins by Multi-View Sequence Features via Chou's Five-Step Rule. Int J Mol Sci 2019; 20:ijms20174175. [PMID: 31454964 PMCID: PMC6747228 DOI: 10.3390/ijms20174175] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2019] [Revised: 08/10/2019] [Accepted: 08/19/2019] [Indexed: 12/22/2022] Open
Abstract
DNA-binding proteins play an important role in cell metabolism. In biological laboratories, the detection methods of DNA-binding proteins includes yeast one-hybrid methods, bacterial singles and X-ray crystallography methods and others, but these methods involve a lot of labor, material and time. In recent years, many computation-based approachs have been proposed to detect DNA-binding proteins. In this paper, a machine learning-based method, which is called the Fuzzy Kernel Ridge Regression model based on Multi-View Sequence Features (FKRR-MVSF), is proposed to identifying DNA-binding proteins. First of all, multi-view sequence features are extracted from protein sequences. Next, a Multiple Kernel Learning (MKL) algorithm is employed to combine multiple features. Finally, a Fuzzy Kernel Ridge Regression (FKRR) model is built to detect DNA-binding proteins. Compared with other methods, our model achieves good results. Our method obtains an accuracy of 83.26% and 81.72% on two benchmark datasets (PDB1075 and compared with PDB186), respectively.
Collapse
|
48
|
Chou KC. Proposing Pseudo Amino Acid Components is an Important Milestone for Proteome and Genome Analyses. Int J Pept Res Ther 2019. [DOI: 10.1007/s10989-019-09910-7] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022]
|
49
|
|
50
|
Niu B, Liang C, Lu Y, Zhao M, Chen Q, Zhang Y, Zheng L, Chou KC. Glioma stages prediction based on machine learning algorithm combined with protein-protein interaction networks. Genomics 2019; 112:837-847. [PMID: 31150762 DOI: 10.1016/j.ygeno.2019.05.024] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2019] [Accepted: 05/25/2019] [Indexed: 12/18/2022]
Abstract
BACKGROUND Glioma is the most lethal nervous system cancer. Recent studies have made great efforts to study the occurrence and development of glioma, but the molecular mechanisms are still unclear. This study was designed to reveal the molecular mechanisms of glioma based on protein-protein interaction network combined with machine learning methods. Key differentially expressed genes (DEGs) were screened and selected by using the protein-protein interaction (PPI) networks. RESULTS As a result, 19 genes between grade I and grade II, 21 genes between grade II and grade III, and 20 genes between grade III and grade IV. Then, five machine learning methods were employed to predict the gliomas stages based on the selected key genes. After comparison, Complement Naive Bayes classifier was employed to build the prediction model for grade II-III with accuracy 72.8%. And Random forest was employed to build the prediction model for grade I-II and grade III-VI with accuracy 97.1% and 83.2%, respectively. Finally, the selected genes were analyzed by PPI networks, Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, and the results improve our understanding of the biological functions of select DEGs involved in glioma growth. We expect that the key genes expressed have a guiding significance for the occurrence of gliomas or, at the very least, that they are useful for tumor researchers. CONCLUSION Machine learning combined with PPI networks, GO and KEGG analyses of selected DEGs improve our understanding of the biological functions involved in glioma growth.
Collapse
Affiliation(s)
- Bing Niu
- School of Life Sciences, Shanghai University, Shanghai 200444, China; Gordon Life Science Institute, Boston, MA 02478, USA.
| | - Chaofeng Liang
- Department of Neurosurgery, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
| | - Yi Lu
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| | - Manman Zhao
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| | - Qin Chen
- School of Life Sciences, Shanghai University, Shanghai 200444, China.
| | - Yuhui Zhang
- Renji Hospital, Medical School, Shanghai Jiaotong University, 160 Pujian Rd, New Pudong District, Shanghai 200127, China; Changhai Hospital, Second Military Medical University, Shanghai 200433, China.
| | - Linfeng Zheng
- Department of Radiology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200080, China; Department of Radiology, Shanghai First People's Hospital, Baoshan Branch, Shanghai 200940, China.
| | - Kuo-Chen Chou
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; Gordon Life Science Institute, Boston, MA 02478, USA.
| |
Collapse
|