1
|
Lai H, Zhu T, Xie S, Luo X, Hong F, Luo D, Dao F, Lin H, Shu K, Lv H. Empirical Comparison and Analysis of Artificial Intelligence-Based Methods for Identifying Phosphorylation Sites of SARS-CoV-2 Infection. Int J Mol Sci 2024; 25:13674. [PMID: 39769436 PMCID: PMC11678915 DOI: 10.3390/ijms252413674] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2024] [Revised: 12/18/2024] [Accepted: 12/19/2024] [Indexed: 01/11/2025] Open
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a member of the large coronavirus family with high infectivity and pathogenicity and is the primary pathogen causing the global pandemic of coronavirus disease 2019 (COVID-19). Phosphorylation is a major type of protein post-translational modification that plays an essential role in the process of SARS-CoV-2-host interactions. The precise identification of phosphorylation sites in host cells infected with SARS-CoV-2 will be of great importance to investigate potential antiviral responses and mechanisms and exploit novel targets for therapeutic development. Numerous computational tools have been developed on the basis of phosphoproteomic data generated by mass spectrometry-based experimental techniques, with which phosphorylation sites can be accurately ascertained across the whole SARS-CoV-2-infected proteomes. In this work, we have comprehensively reviewed several major aspects of the construction strategies and availability of these predictors, including benchmark dataset preparation, feature extraction and refinement methods, machine learning algorithms and deep learning architectures, model evaluation approaches and metrics, and publicly available web servers and packages. We have highlighted and compared the prediction performance of each tool on the independent serine/threonine (S/T) and tyrosine (Y) phosphorylation datasets and discussed the overall limitations of current existing predictors. In summary, this review would provide pertinent insights into the exploitation of new powerful phosphorylation site identification tools, facilitate the localization of more suitable target molecules for experimental verification, and contribute to the development of antiviral therapies.
Collapse
Affiliation(s)
- Hongyan Lai
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; (H.L.); (T.Z.); (D.L.)
| | - Tao Zhu
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; (H.L.); (T.Z.); (D.L.)
| | - Sijia Xie
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China; (S.X.); (X.L.); (F.H.); (H.L.)
| | - Xinwei Luo
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China; (S.X.); (X.L.); (F.H.); (H.L.)
| | - Feitong Hong
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China; (S.X.); (X.L.); (F.H.); (H.L.)
| | - Diyu Luo
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; (H.L.); (T.Z.); (D.L.)
| | - Fuying Dao
- School of Biological Sciences, Nanyang Technological University, Singapore 639798, Singapore;
| | - Hao Lin
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China; (S.X.); (X.L.); (F.H.); (H.L.)
| | - Kunxian Shu
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; (H.L.); (T.Z.); (D.L.)
| | - Hao Lv
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China; (S.X.); (X.L.); (F.H.); (H.L.)
| |
Collapse
|
2
|
Almuflih AS, Abdullayev I, Bakhvalov S, Shichiyakh R, Dash BB, Rao KBVB, Bansal K. Securing IoT devices with zero day intrusion detection system using binary snake optimization and attention based bidirectional gated recurrent classifier. Sci Rep 2024; 14:29238. [PMID: 39587157 PMCID: PMC11589145 DOI: 10.1038/s41598-024-80255-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2024] [Accepted: 11/18/2024] [Indexed: 11/27/2024] Open
Abstract
The fast improvement of cyberattacks in the area of the Internet of Things (IoT) presents novel safety challenges to zero-day attacks. Intrusion detection systems (IDS) are generally focused on exact attacks to defend the use of IoT. However, the attacks were unidentified, for IDS still signifies tasks and concerns about consumers' data privacy and safety. Anomaly-detection models are generally based on machine learning (ML) models. Conventional ML-based models have been recognized to have low estimate excellence and recognition rates. DL-based models, particularly convolutional neural networks (CNN) with regularization techniques, direct this problem, offer a superior prediction value with unidentified data, and prevent over-fitting. This manuscript presents a Binary Snake Optimizer with DL-Enabled Zero-Day Attack Detection and Classification (BSODL-ZDADC) method. The objective of the BSODL-ZDADC method is to employ metaheuristics with the DL method for enhanced recognition and classification of zero-day attacks. For data normalization, the BSODL-ZDADC method uses a Z-score normalization approach. To reduce the high dimensionality issue and improve the classification results, the BSODL-ZDADC technique designs a BSO method to choose a set of related features. Besides, the attention-based bidirectional gated recurrent unit (ABi-GRU) method helps recognize zero-day attacks. Since the hyperparameters play a vital part in the classification performance, the BSODL-ZDADC technique employs an improved sparrow search algorithm (ISSA). The experimental validation of the BSODL-ZDADC technique is verified by utilizing the ToN-IoT dataset. The performance validation of the BSODL-ZDADC technique portrayed a superior accuracy value of 98.28% over other models.
Collapse
Affiliation(s)
- Ali Saeed Almuflih
- Department of Industrial Engineering, College of Engineering, King Khalid University, P.O. Box 394, Abha, 61421, Saudi Arabia
- Center for Engineering and Technology Innovations, King Khalid University, Abha, 61421, Saudi Arabia
| | - Ilyos Abdullayev
- Department of Business and Management, Urgench State University, Urgench, 220100, Uzbekistan
| | - Sergey Bakhvalov
- Department of Economics and Management of Elabuga Institute, Kazan Federal University, 420008, Krasnodar, Kazan, Russia
- Department of Economics and Management, Khorezm University of Economics, Urgench, 220100, Uzbekistan
| | - Rustem Shichiyakh
- Department of Management, Kuban State Agrarian University named after I.T. Trubilin, 350044, Krasnodar, Russia
| | - Bibhuti Bhusan Dash
- School of Computer Applications, KIIT Deemed to be University, Bhubaneswar, Odisha, India
| | - K B V Brahma Rao
- Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram, 522302, Andhra Pradesh, India
| | - Kritika Bansal
- School of Electronics Engineering(SENSE), VIT-AP University, Amaravati, Andhra Pradesh, India.
| |
Collapse
|
3
|
Huang G, Xiao R, Chen W, Dai Q. GBMPhos: A Gating Mechanism and Bi-GRU-Based Method for Identifying Phosphorylation Sites of SARS-CoV-2 Infection. BIOLOGY 2024; 13:798. [PMID: 39452107 PMCID: PMC11505089 DOI: 10.3390/biology13100798] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/17/2024] [Revised: 10/03/2024] [Accepted: 10/03/2024] [Indexed: 10/26/2024]
Abstract
Phosphorylation, a reversible and widespread post-translational modification of proteins, is essential for numerous cellular processes. However, due to technical limitations, large-scale detection of phosphorylation sites, especially those infected by SARS-CoV-2, remains a challenging task. To address this gap, we propose a method called GBMPhos, a novel method that combines convolutional neural networks (CNNs) for extracting local features, gating mechanisms to selectively focus on relevant information, and a bi-directional gated recurrent unit (Bi-GRU) to capture long-range dependencies within protein sequences. GBMPhos leverages a comprehensive set of features, including sequence encoding, physicochemical properties, and structural information, to provide an in-depth analysis of phosphorylation sites. We conducted an extensive comparison of GBMPhos with traditional machine learning algorithms and state-of-the-art methods. Experimental results demonstrate the superiority of GBMPhos over existing methods. The visualization analysis further highlights its effectiveness and efficiency. Additionally, we have established a free web server platform to help researchers explore phosphorylation in SARS-CoV-2 infections. The source code of GBMPhos is publicly available on GitHub.
Collapse
Affiliation(s)
- Guohua Huang
- College of Information Science and Engineering, Shaoyang University, Shaoyang 422000, China; (G.H.); (R.X.)
- School of Information Technology and Administration, Hunan University of Finance and Economics, Changsha 410205, China
| | - Runjuan Xiao
- College of Information Science and Engineering, Shaoyang University, Shaoyang 422000, China; (G.H.); (R.X.)
| | - Weihong Chen
- School of Information Technology and Administration, Hunan University of Finance and Economics, Changsha 410205, China
| | - Qi Dai
- College of Life Science and Medicine, Zhejiang Sci-Tech University, Hangzhou 310018, China;
| |
Collapse
|
4
|
Li Y, Gao R, Liu S, Zhang H, Lv H, Lai H. PhosBERT: A self-supervised learning model for identifying phosphorylation sites in SARS-CoV-2-infected human cells. Methods 2024; 230:140-146. [PMID: 39179191 DOI: 10.1016/j.ymeth.2024.08.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2024] [Revised: 08/03/2024] [Accepted: 08/17/2024] [Indexed: 08/26/2024] Open
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a single-stranded RNA virus, which mainly causes respiratory and enteric diseases and is responsible for the outbreak of coronavirus disease 19 (COVID-19). Numerous studies have demonstrated that SARS-CoV-2 infection will lead to a significant dysregulation of protein post-translational modification profile in human cells. The accurate recognition of phosphorylation sites in host cells will contribute to a deep understanding of the pathogenic mechanisms of SARS-CoV-2 and also help to screen drugs and compounds with antiviral potential. Therefore, there is a need to develop cost-effective and high-precision computational strategies for specifically identifying SARS-CoV-2-infected phosphorylation sites. In this work, we first implemented a custom neural network model (named PhosBERT) on the basis of a pre-trained protein language model of ProtBert, which was a self-supervised learning approach developed on the Bidirectional Encoder Representation from Transformers (BERT) architecture. PhosBERT was then trained and validated on serine (S) and threonine (T) phosphorylation dataset and tyrosine (Y) phosphorylation dataset with 5-fold cross-validation, respectively. Independent validation results showed that PhosBERT could identify S/T phosphorylation sites with high accuracy and AUC (area under the receiver operating characteristic) value of 81.9% and 0.896. The prediction accuracy and AUC value of Y phosphorylation sites reached up to 87.1% and 0.902. It indicated that the proposed model was of good prediction ability and stability and would provide a new approach for studying SARS-CoV-2 phosphorylation sites.
Collapse
Affiliation(s)
- Yong Li
- Sichuan Vocational College of Health and Rehabilitation, Zigong 643000, Sichuan, China
| | - Ru Gao
- The People's Hospital of Ya 'an, Ya'an 625000, Sichuan, China; The People's Hospital of Wenjiang Chengdu, Chengdu 611130, Sichuan, China
| | - Shan Liu
- The People's Hospital of Wenjiang Chengdu, Chengdu 611130, Sichuan, China
| | - Hongqi Zhang
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Hao Lv
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China.
| | - Hongyan Lai
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China.
| |
Collapse
|
5
|
Sabir MJ, Kamli MR, Atef A, Alhibshi AM, Edris S, Hajarah NH, Bahieldin A, Manavalan B, Sabir JSM. Computational prediction of phosphorylation sites of SARS-CoV-2 infection using feature fusion and optimization strategies. Methods 2024; 229:1-8. [PMID: 38768932 DOI: 10.1016/j.ymeth.2024.04.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/10/2023] [Revised: 03/15/2024] [Accepted: 04/30/2024] [Indexed: 05/22/2024] Open
Abstract
SARS-CoV-2's global spread has instigated a critical health and economic emergency, impacting countless individuals. Understanding the virus's phosphorylation sites is vital to unravel the molecular intricacies of the infection and subsequent changes in host cellular processes. Several computational methods have been proposed to identify phosphorylation sites, typically focusing on specific residue (S/T) or Y phosphorylation sites. Unfortunately, current predictive tools perform best on these specific residues and may not extend their efficacy to other residues, emphasizing the urgent need for enhanced methodologies. In this study, we developed a novel predictor that integrated all the residues (STY) phosphorylation sites information. We extracted ten different feature descriptors, primarily derived from composition, evolutionary, and position-specific information, and assessed their discriminative power through five classifiers. Our results indicated that Light Gradient Boosting (LGB) showed superior performance, and five descriptors displayed excellent discriminative capabilities. Subsequently, we identified the top two integrated features have high discriminative capability and trained with LGB to develop the final prediction model, LGB-IPs. The proposed approach shows an excellent performance on 10-fold cross-validation with an ACC, MCC, and AUC values of 0.831, 0.662, 0.907, respectively. Notably, these performances are replicated in the independent evaluation. Consequently, our approach may provide valuable insights into the phosphorylation mechanisms in SARS-CoV-2 infection for biomedical researchers.
Collapse
Affiliation(s)
- Mumdooh J Sabir
- Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
| | - Majid Rasool Kamli
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Ahmed Atef
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Alawiah M Alhibshi
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Sherif Edris
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Nahid H Hajarah
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Ahmed Bahieldin
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea.
| | - Jamal S M Sabir
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia.
| |
Collapse
|
6
|
Pham NT, Zhang Y, Rakkiyappan R, Manavalan B. HOTGpred: Enhancing human O-linked threonine glycosylation prediction using integrated pretrained protein language model-based features and multi-stage feature selection approach. Comput Biol Med 2024; 179:108859. [PMID: 39029431 DOI: 10.1016/j.compbiomed.2024.108859] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2024] [Revised: 06/19/2024] [Accepted: 07/06/2024] [Indexed: 07/21/2024]
Abstract
O-linked glycosylation is a complex post-translational modification (PTM) in human proteins that plays a critical role in regulating various cellular metabolic and signaling pathways. In contrast to N-linked glycosylation, O-linked glycosylation lacks specific sequence features and maintains an unstable core structure. Identifying O-linked threonine glycosylation sites (OTGs) remains challenging, requiring extensive experimental tests. While bioinformatics tools have emerged for predicting OTGs, their reliance on limited conventional features and absence of well-defined feature selection strategies limit their effectiveness. To address these limitations, we introduced HOTGpred (Human O-linked Threonine Glycosylation predictor), employing a multi-stage feature selection process to identify the optimal feature set for accurately identifying OTGs. Initially, we assessed 25 different feature sets derived from various pretrained protein language model (PLM)-based embeddings and conventional feature descriptors using nine classifiers. Subsequently, we integrated the top five embeddings linearly and determined the most effective scoring function for ranking hybrid features, identifying the optimal feature set through a process of sequential forward search. Among the classifiers, the extreme gradient boosting (XGBT)-based model, using the optimal feature set (HOTGpred), achieved 92.03 % accuracy on the training dataset and 88.25 % on the balanced independent dataset. Notably, HOTGpred significantly outperformed the current state-of-the-art methods on both the balanced and imbalanced independent datasets, demonstrating its superior prediction capabilities. Additionally, SHapley Additive exPlanations (SHAP) and ablation analyses were conducted to identify the features contributing most significantly to HOTGpred. Finally, we developed an easy-to-navigate web server, accessible at https://balalab-skku.org/HOTGpred/, to support glycobiologists in their research on glycosylation structure and function.
Collapse
Affiliation(s)
- Nhat Truong Pham
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Gyeonggi-do, Republic of Korea
| | - Ying Zhang
- Beidahuang Industry Group General Hospital, Harbin, China
| | - Rajan Rakkiyappan
- Department of Mathematics, Bharathiar University, Coimbatore, 641046, Tamil Nadu, India.
| | - Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Gyeonggi-do, Republic of Korea.
| |
Collapse
|
7
|
Wang C, Wang Y, Ding P, Li S, Yu X, Yu B. ML-FGAT: Identification of multi-label protein subcellular localization by interpretable graph attention networks and feature-generative adversarial networks. Comput Biol Med 2024; 170:107944. [PMID: 38215617 DOI: 10.1016/j.compbiomed.2024.107944] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2023] [Revised: 12/08/2023] [Accepted: 01/01/2024] [Indexed: 01/14/2024]
Abstract
The prediction of multi-label protein subcellular localization (SCL) is a pivotal area in bioinformatics research. Recent advancements in protein structure research have facilitated the application of graph neural networks. This paper introduces a novel approach termed ML-FGAT. The approach begins by extracting node information of proteins from sequence data, physical-chemical properties, evolutionary insights, and structural details. Subsequently, various evolutionary techniques are integrated to consolidate multi-view information. A linear discriminant analysis framework, grounded on entropy weight, is then employed to reduce the dimensionality of the merged features. To enhance the robustness of the model, the training dataset is augmented using feature-generative adversarial networks. For the primary prediction step, graph attention networks are employed to determine multi-label protein SCL, leveraging both node and neighboring information. The interpretability is enhanced by analyzing the attention weight parameters. The training is based on the Gram-positive bacteria dataset, while validation employs newly constructed datasets: human, virus, Gram-negative bacteria, plant, and SARS-CoV-2. Following a leave-one-out cross-validation procedure, ML-FGAT demonstrates noteworthy superiority in this domain.
Collapse
Affiliation(s)
- Congjing Wang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Yifei Wang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Pengju Ding
- College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Shan Li
- School of Mathematics and Statistics, Central South University, Changsha, 410083, China
| | - Xu Yu
- Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum, Qingdao, 266580, China
| | - Bin Yu
- School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Data Science, University of Science and Technology of China, Hefei, 230027, China.
| |
Collapse
|
8
|
Zhang G, Li M, Tang Q, Meng F, Feng P, Chen W. MulCNN-HSP: A multi-scale convolutional neural networks-based deep learning method for classification of heat shock proteins. Int J Biol Macromol 2024; 257:128802. [PMID: 38101670 DOI: 10.1016/j.ijbiomac.2023.128802] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2023] [Revised: 12/04/2023] [Accepted: 12/12/2023] [Indexed: 12/17/2023]
Abstract
Heat shock proteins (HSPs) are crucial cellular stress proteins that react to environmental cues, ensuring the preservation of cellular functions. They also play pivotal roles in orchestrating the immune response and participating in processes associated with cancer. Consequently, the classification of HSPs holds immense significance in enhancing our understanding of their biological functions and in various diseases. However, the use of computational methods for identifying and classifying HSPs still faces challenges related to accuracy and interpretability. In this study, we introduced MulCNN-HSP, a novel deep learning model based on multi-scale convolutional neural networks, for identifying and classifying of HSPs. Comparative results showed that MulCNN-HSP outperforms or matches existing models in the identification and classification of HSPs. Furthermore, MulCNN-HSP can extract and analyze essential features for the prediction task, enhancing its interpretability. To facilitate its accessibility, we have made MulCNN-HSP available at http://cbcb.cdutcm.edu.cn/HSP/. We hope that MulCNN-HSP will contribute to advancing the study of HSPs and their roles in various biological processes and diseases.
Collapse
Affiliation(s)
- Guiyang Zhang
- State Key Laboratory of Southwestern Chinese Medicine Resources, Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
| | - Mingrui Li
- State Key Laboratory of Southwestern Chinese Medicine Resources, School of Basic Medicine, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
| | - Qiang Tang
- State Key Laboratory of Southwestern Chinese Medicine Resources, School of Basic Medicine, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
| | - Fanbo Meng
- State Key Laboratory of Southwestern Chinese Medicine Resources, School of Basic Medicine, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
| | - Pengmian Feng
- State Key Laboratory of Southwestern Chinese Medicine Resources, School of Basic Medicine, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China.
| | - Wei Chen
- State Key Laboratory of Southwestern Chinese Medicine Resources, Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China; State Key Laboratory of Southwestern Chinese Medicine Resources, School of Basic Medicine, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China.
| |
Collapse
|
9
|
Pham NT, Rakkiyapan R, Park J, Malik A, Manavalan B. H2Opred: a robust and efficient hybrid deep learning model for predicting 2'-O-methylation sites in human RNA. Brief Bioinform 2023; 25:bbad476. [PMID: 38180830 PMCID: PMC10768780 DOI: 10.1093/bib/bbad476] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2023] [Revised: 11/22/2023] [Accepted: 11/28/2023] [Indexed: 01/07/2024] Open
Abstract
2'-O-methylation (2OM) is the most common post-transcriptional modification of RNA. It plays a crucial role in RNA splicing, RNA stability and innate immunity. Despite advances in high-throughput detection, the chemical stability of 2OM makes it difficult to detect and map in messenger RNA. Therefore, bioinformatics tools have been developed using machine learning (ML) algorithms to identify 2OM sites. These tools have made significant progress, but their performances remain unsatisfactory and need further improvement. In this study, we introduced H2Opred, a novel hybrid deep learning (HDL) model for accurately identifying 2OM sites in human RNA. Notably, this is the first application of HDL in developing four nucleotide-specific models [adenine (A2OM), cytosine (C2OM), guanine (G2OM) and uracil (U2OM)] as well as a generic model (N2OM). H2Opred incorporated both stacked 1D convolutional neural network (1D-CNN) blocks and stacked attention-based bidirectional gated recurrent unit (Bi-GRU-Att) blocks. 1D-CNN blocks learned effective feature representations from 14 conventional descriptors, while Bi-GRU-Att blocks learned feature representations from five natural language processing-based embeddings extracted from RNA sequences. H2Opred integrated these feature representations to make the final prediction. Rigorous cross-validation analysis demonstrated that H2Opred consistently outperforms conventional ML-based single-feature models on five different datasets. Moreover, the generic model of H2Opred demonstrated a remarkable performance on both training and testing datasets, significantly outperforming the existing predictor and other four nucleotide-specific H2Opred models. To enhance accessibility and usability, we have deployed a user-friendly web server for H2Opred, accessible at https://balalab-skku.org/H2Opred/. This platform will serve as an invaluable tool for accurately predicting 2OM sites within human RNA, thereby facilitating broader applications in relevant research endeavors.
Collapse
Affiliation(s)
- Nhat Truong Pham
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Republic of Korea
| | - Rajan Rakkiyapan
- Department of Mathematics, Bharathiar University, Coimbatore - 641046, Tamil Nadu, India
| | - Jongsun Park
- InfoBoss inc. and InfoBoss Research Center, Gangnam-gu, Seoul 06278, Republic of Korea
| | - Adeel Malik
- Institute of Intelligence Informatics Technology, Sangmyung University, Seoul, 03016, Republic of Korea
| | - Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Republic of Korea
| |
Collapse
|
10
|
Pham NT, Phan LT, Seo J, Kim Y, Song M, Lee S, Jeon YJ, Manavalan B. Advancing the accuracy of SARS-CoV-2 phosphorylation site detection via meta-learning approach. Brief Bioinform 2023; 25:bbad433. [PMID: 38058187 PMCID: PMC10753650 DOI: 10.1093/bib/bbad433] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2023] [Revised: 10/30/2023] [Accepted: 11/05/2023] [Indexed: 12/08/2023] Open
Abstract
The worldwide appearance of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has generated significant concern and posed a considerable challenge to global health. Phosphorylation is a common post-translational modification that affects many vital cellular functions and is closely associated with SARS-CoV-2 infection. Precise identification of phosphorylation sites could provide more in-depth insight into the processes underlying SARS-CoV-2 infection and help alleviate the continuing COVID-19 crisis. Currently, available computational tools for predicting these sites lack accuracy and effectiveness. In this study, we designed an innovative meta-learning model, Meta-Learning for Serine/Threonine Phosphorylation (MeL-STPhos), to precisely identify protein phosphorylation sites. We initially performed a comprehensive assessment of 29 unique sequence-derived features, establishing prediction models for each using 14 renowned machine learning methods, ranging from traditional classifiers to advanced deep learning algorithms. We then selected the most effective model for each feature by integrating the predicted values. Rigorous feature selection strategies were employed to identify the optimal base models and classifier(s) for each cell-specific dataset. To the best of our knowledge, this is the first study to report two cell-specific models and a generic model for phosphorylation site prediction by utilizing an extensive range of sequence-derived features and machine learning algorithms. Extensive cross-validation and independent testing revealed that MeL-STPhos surpasses existing state-of-the-art tools for phosphorylation site prediction. We also developed a publicly accessible platform at https://balalab-skku.org/MeL-STPhos. We believe that MeL-STPhos will serve as a valuable tool for accelerating the discovery of serine/threonine phosphorylation sites and elucidating their role in post-translational regulation.
Collapse
Affiliation(s)
- Nhat Truong Pham
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Le Thi Phan
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Jimin Seo
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Yeonwoo Kim
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Minkyung Song
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Sukchan Lee
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Young-Jun Jeon
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Balachandran Manavalan
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| |
Collapse
|
11
|
Xu Z, Wang X, Meng J, Zhang L, Song B. m5U-GEPred: prediction of RNA 5-methyluridine sites based on sequence-derived and graph embedding features. Front Microbiol 2023; 14:1277099. [PMID: 37937221 PMCID: PMC10627201 DOI: 10.3389/fmicb.2023.1277099] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2023] [Accepted: 10/02/2023] [Indexed: 11/09/2023] Open
Abstract
5-Methyluridine (m5U) is one of the most common post-transcriptional RNA modifications, which is involved in a variety of important biological processes and disease development. The precise identification of the m5U sites allows for a better understanding of the biological processes of RNA and contributes to the discovery of new RNA functional and therapeutic targets. Here, we present m5U-GEPred, a prediction framework, to combine sequence characteristics and graph embedding-based information for m5U identification. The graph embedding approach was introduced to extract the global information of training data that complemented the local information represented by conventional sequence features, thereby enhancing the prediction performance of m5U identification. m5U-GEPred outperformed the state-of-the-art m5U predictors built on two independent species, with an average AUROC of 0.984 and 0.985 tested on human and yeast transcriptomes, respectively. To further validate the performance of our newly proposed framework, the experimentally validated m5U sites identified from Oxford Nanopore Technology (ONT) were collected as independent testing data, and in this project, m5U-GEPred achieved reasonable prediction performance with ACC of 91.84%. We hope that m5U-GEPred should make a useful computational alternative for m5U identification.
Collapse
Affiliation(s)
- Zhongxing Xu
- Department of Public Health, School of Medicine and Holistic Integrative Medicine, Nanjing University of Chinese Medicine, Nanjing, China
- School of AI and Advanced Computing, Xi'an Jiaotong-Liverpool University, Suzhou, China
| | - Xuan Wang
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, China
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, United Kingdom
| | - Jia Meng
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, China
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, United Kingdom
- AI University Research Centre, Xi'an Jiaotong-Liverpool University, Suzhou, China
| | - Lin Zhang
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
| | - Bowen Song
- Department of Public Health, School of Medicine and Holistic Integrative Medicine, Nanjing University of Chinese Medicine, Nanjing, China
| |
Collapse
|