1
Pratyush P, Carrier C, Pokharel S, Ismail HD, Chaudhari M, KC DB. CaLMPhosKAN: prediction of general phosphorylation sites in proteins via fusion of codon aware embeddings with amino acid aware embeddings and wavelet-based Kolmogorov-Arnold network. Bioinformatics 2025; 41:btaf124. [PMID: 40116777 PMCID: PMC11972116 DOI: 10.1093/bioinformatics/btaf124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2024] [Revised: 02/19/2025] [Accepted: 03/17/2025] [Indexed: 03/23/2025] [Open Access]
Abstract
MOTIVATION: The mapping from codon to amino acid is surjective due to codon degeneracy, suggesting that codon space might harbor higher information content. Embeddings from the codon language model have recently demonstrated success in various protein downstream tasks. However, predictive models for residue-level tasks such as the prediction of phosphorylation sites, arguably the most studied post-translational modification (PTM), and PTM site prediction in general, have predominantly relied on representations in amino acid space. RESULTS: We introduce a novel approach for predicting phosphorylation sites by utilizing codon-level information through embeddings from the codon adaptation language model (CaLM), trained on protein-coding DNA sequences. Protein sequences are first reverse-translated into reliable coding sequences by mapping UniProt sequences to their corresponding NCBI reference sequences and extracting the exact coding sequences from their GenBank format using a dynamic programming-based global pairwise alignment. The resulting coding sequences are encoded using the CaLM encoder to generate codon-aware embeddings, which are subsequently integrated with amino acid-aware embeddings obtained from a protein language model through an early fusion strategy. Next, a window-level representation of the site of interest, retaining the full sequence context, is constructed from the fused embeddings. A ConvBiGRU network extracts feature maps that capture spatiotemporal correlations between proximal residues within the window. This is followed by a prediction head based on a Kolmogorov-Arnold network (KAN) using the derivative of Gaussian wavelet transform to generate the inference for the site. The overall model, dubbed CaLMPhosKAN, outperforms existing approaches across multiple datasets. AVAILABILITY AND IMPLEMENTATION: CaLMPhosKAN is publicly available at https://github.com/KCLabMTU/CaLMPhosKAN.
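As a rough illustration of the fusion and windowing steps described in this abstract, the following Python sketch concatenates per-residue codon-aware and amino acid-aware vectors and then slices a fixed-size window around a candidate site. Concatenation as the fusion operator, zero padding at the sequence ends, and the names `early_fuse` and `site_window` are assumptions made for illustration. The abstract does not specify them, and the toy vectors stand in for real CaLM and protein language model embeddings.

```python
from typing import List

Vector = List[float]


def early_fuse(codon_emb: List[Vector], aa_emb: List[Vector]) -> List[Vector]:
    """Concatenate codon and amino acid embeddings residue by residue.

    Assumes one codon embedding per residue (hypothetical layout).
    """
    if len(codon_emb) != len(aa_emb):
        raise ValueError("expected one codon embedding per residue")
    return [c + a for c, a in zip(codon_emb, aa_emb)]


def site_window(fused: List[Vector], site: int, half_width: int) -> List[Vector]:
    """Return 2 * half_width + 1 rows centred on `site`.

    The embeddings were computed on the full sequence, so each row still
    carries sequence-wide context; positions past the ends are zero-padded
    (an assumption, not taken from the abstract).
    """
    pad = [0.0] * len(fused[0])
    return [
        fused[i] if 0 <= i < len(fused) else pad
        for i in range(site - half_width, site + half_width + 1)
    ]


# Toy example: 3 residues, 2-dim codon vectors, 1-dim amino acid vectors.
codon = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
amino = [[1.0], [2.0], [3.0]]
fused = early_fuse(codon, amino)
window = site_window(fused, site=0, half_width=1)
```

In this toy case the window for the first residue has three rows: one zero-padded row, the fused vector of the site itself, and the fused vector of its right-hand neighbour.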
Affiliation(s)
- Pawel Pratyush
- Golisano College of Computing and Information Sciences, Rochester Institute of Technology, Rochester, NY 14623, United States
- Callen Carrier
- College of Computing, Michigan Technological University, Houghton, MI 49931, United States
- Suresh Pokharel
- Golisano College of Computing and Information Sciences, Rochester Institute of Technology, Rochester, NY 14623, United States
- Hamid D Ismail
- College of Engineering, North Carolina Agricultural and Technical State University, Greensboro, NC 27411, United States
- Meenal Chaudhari
- College of Applied Sciences and Technology, Illinois State University, Normal, IL 61761, United States
- Dukka B KC
- Golisano College of Computing and Information Sciences, Rochester Institute of Technology, Rochester, NY 14623, United States
2
Chakraborty C, Bhattacharya M, Pal S, Chatterjee S, Das A, Lee SS. AI-enabled language models (LMs) to large language models (LLMs) and multimodal large language models (MLLMs) in drug discovery and development. J Adv Res 2025:S2090-1232(25)00109-2. [PMID: 39952319 DOI: 10.1016/j.jare.2025.02.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2024] [Revised: 01/03/2025] [Accepted: 02/08/2025] [Indexed: 02/17/2025] [Open Access]
Abstract
BACKGROUND: Due to the recent revolution of artificial intelligence (AI), AI-enabled large language models (LLMs) have flourished and started to be applied in various sectors of science and medicine. Drug discovery and development are time-consuming, complex processes that require high investment. The conventional method of drug discovery is costly and has a high failure rate. AI-enabled LLMs are used in various steps of drug discovery to solve the challenges of time and cost. AIM OF REVIEW: The article aims to provide a comprehensive understanding of AI-enabled LLMs and their use in various steps of drug discovery to ease the challenges. KEY SCIENTIFIC CONCEPTS OF REVIEW: The review provides an overview of LLMs and their current state-of-the-art applications in structure-based drug molecule design and de novo drug design. The different applications of AI-enabled LLMs have been illustrated, such as drug target identification, validation, interaction, and ADME/ADMET. Several domain-specific LLMs have been developed in this direction and applied in drug discovery and development to speed up the process. We discuss all these domain-specific LLMs and their applications in this field. Finally, we illustrate the challenges and future perspectives on the applications of AI-enabled LLMs in drug discovery and development.
Affiliation(s)
- Chiranjib Chakraborty
- Department of Biotechnology, School of Life Science and Biotechnology, Adamas University, Kolkata, West Bengal 700126, India
- Manojit Bhattacharya
- Department of Zoology, Fakir Mohan University, Vyasa Vihar, Balasore 756020, Odisha, India
- Soumen Pal
- School of Mechanical Engineering, Vellore Institute of Technology, Vellore 632014, Tamil Nadu, India
- Srijan Chatterjee
- Institute for Skeletal Aging & Orthopedic Surgery, Hallym University-Chuncheon Sacred Heart Hospital, Chuncheon, Gangwon-Do, 24252, Republic of Korea
- Arpita Das
- Department of Biotechnology, School of Life Science and Biotechnology, Adamas University, Kolkata, West Bengal 700126, India
- Sang-Soo Lee
- Institute for Skeletal Aging & Orthopedic Surgery, Hallym University-Chuncheon Sacred Heart Hospital, Chuncheon, Gangwon-Do, 24252, Republic of Korea
3
Hu X, Li J, Liu T. Alg-MFDL: A multi-feature deep learning framework for allergenic proteins prediction. Anal Biochem 2025; 697:115701. [PMID: 39481588 DOI: 10.1016/j.ab.2024.115701] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2024] [Revised: 10/26/2024] [Accepted: 10/28/2024] [Indexed: 11/02/2024]
Abstract
The escalating global incidence of allergy patients illustrates the growing impact of allergic issues on global health. Allergens are small molecule antigens that trigger allergic reactions. A widely recognized strategy for allergy prevention involves identifying allergens and avoiding re-exposure. However, laboratory methods for identifying allergenic proteins are often time-consuming and resource-intensive. There is a crucial need to establish efficient and reliable computational approaches for the identification of allergenic proteins. In this study, we developed a novel allergenic protein predictor named Alg-MFDL, which integrates pre-trained protein language models (PLMs) and traditional handcrafted features to achieve a more complete protein representation. First, we compared the performance of eight pre-trained PLMs from ProtTrans and ESM-2 and selected the best-performing one from each of the two groups. In addition, we evaluated the performance of three handcrafted features and different combinations of them to select the optimal feature or feature combination. Then, these three protein representations were fused and used as inputs to train a convolutional neural network (CNN). Finally, independent validation was performed on benchmark datasets to evaluate the performance of Alg-MFDL. As a result, Alg-MFDL achieved an accuracy of 0.973, a precision of 0.996, a sensitivity of 0.951, and an F1 value of 0.973, outperforming most current state-of-the-art (SOTA) methods across all key metrics. We anticipate that the proposed model could be a useful tool for predicting allergenic proteins.
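The reported F1 value can be cross-checked against the reported precision and sensitivity, since F1 is the harmonic mean of precision and recall (sensitivity). The short Python sketch below only reuses the numbers quoted in the abstract; the helper name `f1_score` is chosen here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (recall is the same as sensitivity)."""
    return 2 * precision * recall / (precision + recall)


# Values quoted in the Alg-MFDL abstract.
f1 = f1_score(precision=0.996, recall=0.951)
print(round(f1, 3))  # 0.973
```

The result agrees with the F1 value of 0.973 given in the abstract, so the three reported figures are mutually consistent.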
Affiliation(s)
- Xiang Hu
- College of Information Technology, Shanghai Ocean University, Shanghai, 201306, China
- Jingyi Li
- AIEN Institute, Shanghai Ocean University, Shanghai, 201306, China
- Taigang Liu
- College of Information Technology, Shanghai Ocean University, Shanghai, 201306, China
4
Luo Z, Wang Q, Xia Y, Zhu X, Yang S, Xu Z, Gu L. DLBWE-Cys: a deep-learning-based tool for identifying cysteine S-carboxyethylation sites using binary-weight encoding. Front Genet 2025; 15:1464976. [PMID: 39845187 PMCID: PMC11751040 DOI: 10.3389/fgene.2024.1464976] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2024] [Accepted: 12/23/2024] [Indexed: 01/24/2025] [Open Access]
Abstract
Cysteine S-carboxyethylation, a novel post-translational modification (PTM), plays a critical role in the pathogenesis of autoimmune diseases, particularly ankylosing spondylitis. Accurate identification of S-carboxyethylation modification sites is essential for elucidating their functional mechanisms. Unfortunately, there are currently no computational tools that can accurately predict these sites, posing a significant challenge to this area of research. In this study, we developed a new deep learning model, DLBWE-Cys, which integrates CNN, BiLSTM, Bahdanau attention mechanisms, and a fully connected neural network (FNN), using Binary-Weight encoding specifically designed for the accurate identification of cysteine S-carboxyethylation sites. Our experimental results show that our model architecture outperforms other machine learning and deep learning models in 5-fold cross-validation and independent testing. Feature comparison experiments confirmed the superiority of our proposed Binary-Weight encoding method over other encoding techniques. t-SNE visualization further validated the model's effective classification capabilities. Additionally, we confirmed the similarity between the distribution of positional weights in our Binary-Weight encoding and the allocation of weights in attention mechanisms. Further experiments proved the effectiveness of our Binary-Weight encoding approach. Thus, this model paves the way for predicting cysteine S-carboxyethylation modification sites in protein sequences. The source code of DLBWE-Cys and the experimental data are available at: https://github.com/ztLuo-bioinfo/DLBWE-Cys.
Affiliation(s)
- Zhengtao Luo
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, China
- Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Hefei, Anhui, China
- Anhui Provincial Engineering Research Center for Agricultural Information Perception and Intelligent Computing, Anhui Agricultural University, Hefei, Anhui, China
- Qingyong Wang
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, China
- Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Hefei, Anhui, China
- Anhui Provincial Engineering Research Center for Agricultural Information Perception and Intelligent Computing, Anhui Agricultural University, Hefei, Anhui, China
- Yingchun Xia
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, China
- Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Hefei, Anhui, China
- Anhui Provincial Engineering Research Center for Agricultural Information Perception and Intelligent Computing, Anhui Agricultural University, Hefei, Anhui, China
- Xiaolei Zhu
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, China
- Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Hefei, Anhui, China
- Anhui Provincial Engineering Research Center for Agricultural Information Perception and Intelligent Computing, Anhui Agricultural University, Hefei, Anhui, China
- Shuai Yang
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, China
- Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Hefei, Anhui, China
- Anhui Provincial Engineering Research Center for Agricultural Information Perception and Intelligent Computing, Anhui Agricultural University, Hefei, Anhui, China
- Zhaochun Xu
- Computer Department, Jingdezhen Ceramic University, Jingdezhen, China
- School for Interdisciplinary Medicine and Engineering, Harbin Medical University, Harbin, China
- Lichuan Gu
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, China
- Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Hefei, Anhui, China
- Anhui Provincial Engineering Research Center for Agricultural Information Perception and Intelligent Computing, Anhui Agricultural University, Hefei, Anhui, China
5
Pratyush P, Pokharel S, Ismail HD, Bahmani S, Kc DB. LMPTMSite: A Platform for PTM Site Prediction in Proteins Leveraging Transformer-Based Protein Language Models. Methods Mol Biol 2025; 2867:261-297. [PMID: 39576587 DOI: 10.1007/978-1-0716-4196-5_16] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2024]
Abstract
Protein post-translational modifications (PTMs) introduce new functionalities and play a critical role in the regulation of protein functions. Characterizing these modifications, especially PTM sites, is essential for unraveling complex biological systems. However, traditional experimental approaches, such as mass spectrometry, are time-consuming and expensive. Machine learning and deep learning techniques offer promising alternatives for predicting PTM sites. In this chapter, we introduce our LMPTMSite (language model-based post-translational modification site predictor) platform, which emphasizes two transformer-based protein language model (pLM) approaches: pLMSNOSite and LMSuccSite, for the prediction of S-nitrosylation sites and succinylation sites in proteins, respectively. We highlight the various methods of using pLM-based sequence encoding, explain the underlying deep learning architectures, and discuss the superior efficacy of these tools compared to other state-of-the-art tools. Subsequently, we present an analysis of runtime and memory usage for pLMSNOSite, with a focus on CPU and RAM usage as the input sequence length is scaled up. Finally, we showcase a case study predicting succinylation sites in proteins active within the tricarboxylic acid (TCA) cycle pathway using LMSuccSite, demonstrating its potential utility and efficiency in real-world biological contexts. The LMPTMSite platform, inclusive of pLMSNOSite and LMSuccSite, is freely available both as a web server ( http://kcdukkalab.org/pLMSNOSite/ and http://kcdukkalab.org/LMSuccSite/ ) and as standalone packages ( https://github.com/KCLabMTU/pLMSNOSite and https://github.com/KCLabMTU/LMSuccSite ), providing valuable tools for researchers in the field.
Affiliation(s)
- Pawel Pratyush
- Computer Science Department, Rochester Institute of Technology, Rochester, NY, USA
- Suresh Pokharel
- Computer Science Department, Rochester Institute of Technology, Rochester, NY, USA
- Hamid D Ismail
- Computer Science Department, Rochester Institute of Technology, Rochester, NY, USA
- North Carolina A&T State University, Computational Data Science and Engineering, Greensboro, NC, USA
- Soufia Bahmani
- Computer Science Department, Rochester Institute of Technology, Rochester, NY, USA
- Michigan Technological University, Computer Science Department, Houghton, MI, USA
- Dukka B Kc
- Computer Science Department, Rochester Institute of Technology, Rochester, NY, USA
6
Pratyush P, Kc DB. Advances in Prediction of Posttranslational Modification Sites Known to Localize in Protein Supersecondary Structures. Methods Mol Biol 2025; 2870:117-151. [PMID: 39543034 DOI: 10.1007/978-1-0716-4213-9_8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2024]
Abstract
Posttranslational modifications (PTMs) play a crucial role in modulating the structure, function, localization, and interactions of proteins, with many PTMs being localized within supersecondary structures, such as helical pairs. These modifications can significantly influence the conformation and stability of these structures. For instance, phosphorylation introduces negative charges that alter electrostatic interactions, while acetylation or methylation of lysine residues affects the stability and interactions of alpha helices or beta strands. Given the pivotal role of supersecondary structures in the overall protein architecture, their modulation by PTMs is essential for protein functionality. This chapter explores the latest advancements in predicting sites for the five PTMs (phosphorylation, acetylation, glycosylation, methylation, and ubiquitination) known to be localized within supersecondary structures. The chapter highlights the recent advances in the prediction of these PTM sites, including the use of global contextualized embeddings from protein language models, integration of structural information, utilization of reliable positive and negative sites, and application of contrastive learning. These methodologies and emerging trends offer a roadmap for novel innovations in addressing PTM prediction challenges, particularly those linked to supersecondary structures.
Affiliation(s)
- Pawel Pratyush
- Computer Science Department, Michigan Technological University, Houghton, MI, USA
- Computer Science Department, Rochester Institute of Technology, Henrietta, NY, USA
- Dukka B Kc
- Computer Science Department, Michigan Technological University, Houghton, MI, USA
- Computer Science Department, Rochester Institute of Technology, Henrietta, NY, USA
7
Kim DN, Yin T, Zhang T, Im AK, Cort JR, Rozum JC, Pollock D, Qian WJ, Feng S. Artificial Intelligence Transforming Post-Translational Modification Research. Bioengineering (Basel) 2024; 12:26. [PMID: 39851300 PMCID: PMC11762806 DOI: 10.3390/bioengineering12010026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2024] [Revised: 12/16/2024] [Accepted: 12/29/2024] [Indexed: 01/26/2025] [Open Access]
Abstract
Post-translational modifications (PTMs) are covalent changes to amino acids that occur after protein synthesis, including covalent modifications on side chains and peptide backbones. Many PTMs profoundly impact cellular and molecular functions and structures, and their significance extends to evolutionary studies as well. In light of these implications, we have explored how artificial intelligence (AI) can be utilized in researching PTMs. Initially, rationales for adopting AI and its advantages in understanding the functions of PTMs are discussed. Then, various deep learning architectures and programs, including recent applications of language models, for predicting PTM sites on proteins and the regulatory functions of these PTMs are compared. Finally, our high-throughput PTM-data-generation pipeline, which formats data suitably for AI training and predictions, is described. We hope this review illuminates areas where future AI models on PTMs can be improved, thereby contributing to the field of PTM bioengineering.
Affiliation(s)
- Doo Nam Kim
- Biological Sciences Division, Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, WA 99352, USA
- Tianzhixi Yin
- National Security Directorate, Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, WA 99352, USA
- Tong Zhang
- Biological Sciences Division, Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, WA 99352, USA
- Alexandria K. Im
- Biological Sciences Division, Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, WA 99352, USA
- John R. Cort
- Biological Sciences Division, Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, WA 99352, USA
- Jordan C. Rozum
- Biological Sciences Division, Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, WA 99352, USA
- David Pollock
- Biological Sciences Division, Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, WA 99352, USA
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO 80045, USA
- Wei-Jun Qian
- Biological Sciences Division, Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, WA 99352, USA
- Song Feng
- Biological Sciences Division, Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, WA 99352, USA
8
Wang M, Wang J, Ji J, Ma C, Wang H, He J, Song Y, Zhang X, Cao Y, Dai Y, Hua M, Qin R, Li K, Cao L. Improving compound-protein interaction prediction by focusing on intra-modality and inter-modality dynamics with a multimodal tensor fusion strategy. Comput Struct Biotechnol J 2024; 23:3714-3729. [PMID: 39525082 PMCID: PMC11544084 DOI: 10.1016/j.csbj.2024.10.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2024] [Revised: 10/01/2024] [Accepted: 10/01/2024] [Indexed: 11/16/2024] [Open Access]
Abstract
Identifying novel compound-protein interactions (CPIs) plays a pivotal role in target identification and drug discovery. Although the recent multimodal methods have achieved outstanding advances in CPI prediction, they fail to effectively learn both intra-modality and inter-modality dynamics, which limits their prediction performance. To address the limitation, we propose a novel multimodal tensor fusion CPI prediction framework, named MMTF-CPI, which contains three unimodal learning modules for structure, heterogeneous network and transcriptional profiling modalities, a tensor fusion module and a prediction module. MMTF-CPI is capable of focusing on both intra-modality and inter-modality dynamics with the tensor fusion module. We demonstrated that MMTF-CPI is superior to multiple state-of-the-art multimodal methods across seven datasets. The prediction performance of MMTF-CPI is significantly improved with the tensor fusion module compared to other fusion methods. Moreover, our case studies confirmed the practical value of MMTF-CPI in target identification. Via MMTF-CPI, we also discovered several candidate compounds for the therapy of breast cancer and non-small cell lung cancer.
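The abstract does not spell out the operation inside the tensor fusion module. One widely used formulation of tensor fusion takes the outer product of the modality vectors after appending a constant 1 to each, so the result holds the unimodal terms (intra-modality) alongside every cross-modal product (inter-modality). The Python sketch below illustrates that generic formulation only; whether MMTF-CPI uses exactly this operation is an assumption, and the function name `tensor_fusion` is hypothetical.

```python
from itertools import product
from typing import List


def tensor_fusion(*modalities: List[float]) -> List[float]:
    """Flattened outer product of modality vectors, each prefixed with 1.0.

    The constant 1.0 keeps lower-order terms: entries where all but one
    factor is 1.0 reproduce the original unimodal features.
    """
    extended = [[1.0] + list(m) for m in modalities]
    fused = []
    for combo in product(*extended):
        value = 1.0
        for x in combo:
            value *= x
        fused.append(value)
    return fused


# Three toy one-dimensional modality embeddings (placeholders for the
# structure, heterogeneous network and transcriptional profiling modules).
print(tensor_fusion([2.0], [3.0], [5.0]))
# [1.0, 5.0, 3.0, 15.0, 2.0, 10.0, 6.0, 30.0]
```

The output size grows as the product of (dimension + 1) over the modalities, which is why such modules are usually applied to compact unimodal embeddings.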
Affiliation(s)
- Meng Wang
- Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Jianmin Wang
- Department of Integrative Biotechnology, Yonsei University, Incheon 21983, South Korea
- Jianxin Ji
- Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Chenjing Ma
- Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Hesong Wang
- Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Jia He
- Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Yongzhen Song
- Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Xuan Zhang
- Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Yong Cao
- Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Yanyan Dai
- Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Menglei Hua
- Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Ruihao Qin
- Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Kang Li
- Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Lei Cao
- Department of Biostatistics, Harbin Medical University, Harbin 150081, China
9
Ullah M, Akbar S, Raza A, Khan KA, Zou Q. TargetCLP: clathrin proteins prediction combining transformed and evolutionary scale modeling-based multi-view features via weighted feature integration approach. Brief Bioinform 2024; 26:bbaf026. [PMID: 39844339 PMCID: PMC11753890 DOI: 10.1093/bib/bbaf026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2024] [Revised: 12/31/2024] [Accepted: 01/09/2025] [Indexed: 01/24/2025] [Open Access]
Abstract
Clathrin proteins, key elements of the vesicle coat, play a crucial role in various cellular processes, including neural function, signal transduction, and endocytosis. Disruptions in clathrin protein functions have been associated with a wide range of diseases, such as Alzheimer's, neurodegeneration, viral infection, and cancer. Therefore, correctly identifying clathrin protein functions is critical for unraveling the mechanisms of these fatal diseases and designing drug targets. This paper presents a novel computational method, named TargetCLP, to precisely identify clathrin proteins. TargetCLP leverages four single-view feature representation methods, including two transformed feature sets (PSSM-CLBP and RECM-CLBP), one qualitative characteristics feature, and one deep-learning-based embedding using ESM. The single-view features are integrated based on their weights using differential evolution, and the BTG feature selection algorithm is utilized to generate a more optimal and reduced subset. The model is trained using various classifiers, among which the proposed SnBiLSTM achieved remarkable performance. Experimental and comparative results on both training and independent datasets show that the proposed TargetCLP offers significant improvements in terms of both prediction accuracy and generalization to unseen data, furthering advancements in the research field.
Affiliation(s)
- Matee Ullah
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China
- Shahid Akbar
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan 23200, Pakistan
- Ali Raza
- Department of Computer Science, MY University, Islamabad 45750, Pakistan
- Kashif Ahmad Khan
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan 23200, Pakistan
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang 324003, China
10
Zhang C, Tang D, Han C, Gou Y, Chen M, Huang X, Liu D, Zhao M, Xiao L, Xiao Q, Peng D, Xue Y. GPS-pPLM: A Language Model for Prediction of Prokaryotic Phosphorylation Sites. Cells 2024; 13:1854. [PMID: 39594603 PMCID: PMC11593113 DOI: 10.3390/cells13221854] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2024] [Revised: 11/06/2024] [Accepted: 11/07/2024] [Indexed: 11/28/2024] [Open Access]
Abstract
In the prokaryotic kingdom, protein phosphorylation serves as one of the most important posttranslational modifications (PTMs) and is involved in orchestrating a broad spectrum of biological processes. Here, we report an updated online server named the group-based prediction system for prokaryotic phosphorylation language model (GPS-pPLM), used for predicting phosphorylation sites (p-sites) in prokaryotes. For model training, two deep learning methods, a transformer and a deep neural network, were employed, and a total of 10 sequence features and contextual features were integrated. Using 44,839 nonredundant p-sites in 16,041 proteins from 95 prokaryotes, two general models for the prediction of O-phosphorylation and N-phosphorylation were first pretrained and then fine-tuned to construct 6 predictors specific for each phosphorylatable residue type as well as 134 species-specific predictors. Compared with other existing tools, the GPS-pPLM exhibits higher accuracy in predicting prokaryotic O-phosphorylation p-sites. Protein sequences in FASTA format or UniProt accession numbers can be submitted by users, and the predicted results are displayed in tabular form. In addition, we annotate the predicted p-sites with knowledge from 22 public resources, including experimental evidence, 3D structures, and disorder tendencies. The online service of the GPS-pPLM is freely accessible for academic research.
Affiliation(s)
- Chi Zhang
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
| | - Dachao Tang
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
| | - Cheng Han
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
| | - Yujie Gou
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
| | - Miaomiao Chen
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
| | - Xinhe Huang
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
| | - Dan Liu
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
| | - Miaoying Zhao
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
| | - Leming Xiao
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
| | - Qiang Xiao
- School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China
| | - Di Peng
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
| | - Yu Xue
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
| |
|
11
|
Pakhrin SC, Chauhan N, Khan S, Upadhyaya J, Beck MR, Blanco E. Prediction of human O-linked glycosylation sites using stacked generalization and embeddings from pre-trained protein language model. Bioinformatics 2024; 40:btae643. [PMID: 39447059 PMCID: PMC11552629 DOI: 10.1093/bioinformatics/btae643] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2024] [Revised: 10/02/2024] [Accepted: 10/23/2024] [Indexed: 10/26/2024]
Abstract
MOTIVATION O-linked glycosylation, an essential post-translational modification process in Homo sapiens, involves attaching sugar moieties to the oxygen atoms of serine and/or threonine residues. It influences various biological and cellular functions. While threonine or serine residues within protein sequences are potential sites for O-linked glycosylation, not all serine and/or threonine residues undergo this modification, underscoring the importance of characterizing its occurrence. This study presents a novel approach for predicting intracellular and extracellular O-linked glycosylation events on proteins, which are crucial for comprehending cellular processes. Two base multi-layer perceptron models were trained by leveraging a stacked generalization framework. These base models respectively use ProtT5 and Ankh O-linked glycosylation site-specific embeddings whose combined predictions are used to train the meta-multi-layer perceptron model. Trained on extensive O-linked glycosylation datasets, the stacked-generalization model demonstrated high predictive performance on independent test datasets. Furthermore, the study emphasizes the distinction between nucleocytoplasmic and extracellular O-linked glycosylation, offering insights into their functional implications that were overlooked in previous studies. By integrating the protein language model's embedding with stacked generalization techniques, this approach enhances predictive accuracy of O-linked glycosylation events and illuminates the intricate roles of O-linked glycosylation in proteomics, potentially accelerating the discovery of novel glycosylation sites. RESULTS Stack-OglyPred-PLM produces Sensitivity, Specificity, Matthews Correlation Coefficient, and Accuracy of 90.50%, 89.60%, 0.464, and 89.70%, respectively on a benchmark NetOGlyc-4.0 independent test dataset. These results demonstrate that Stack-OglyPred-PLM is a robust computational tool to predict O-linked glycosylation sites in proteins. 
AVAILABILITY AND IMPLEMENTATION The developed tool, programs, training, and test dataset are available at https://github.com/PakhrinLab/Stack-OglyPred-PLM.
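The combination of roughly 90% sensitivity and specificity with a Matthews Correlation Coefficient of 0.464 is consistent with a test set in which non-glycosylated sites greatly outnumber glycosylated ones. A minimal sketch of how these four metrics follow from a confusion matrix is given below; the function name `site_metrics` and all counts are illustrative assumptions, not values taken from the study.

```python
import math


def site_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, accuracy and MCC from confusion-matrix counts.

    Assumes at least one positive and one negative example are present.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is a correlation in [-1, 1], not a percentage.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts: identical sensitivity and specificity (0.9 each),
# but a 1:1 versus a 1:10 ratio of positive to negative sites.
balanced = site_metrics(tp=90, fp=10, tn=90, fn=10)
imbalanced = site_metrics(tp=90, fp=100, tn=900, fn=10)

print(balanced)
print(imbalanced)
```

With the per-class rates held fixed, the MCC of the imbalanced case is lower than that of the balanced case because false positives from the large negative class dilute precision.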
Affiliation(s)
- Subash Chandra Pakhrin
- Department of Computer Science and Engineering Technology, University of Houston-Downtown, Houston, TX 77002, United States
| | - Neha Chauhan
- School of Computing, Wichita State University, Wichita, KS 67260, United States
| | - Salman Khan
- Department of Computer Science, The University of Texas at Austin, Austin, TX 78712, United States
| | - Jamie Upadhyaya
- Department of Computer Science and Engineering Technology, University of Houston-Downtown, Houston, TX 77002, United States
| | - Moriah Rene Beck
- Department of Chemistry and Biochemistry, Wichita State University, Wichita, KS 67260, United States
| | - Eduardo Blanco
- Department of Computer Science, University of Arizona, Tucson, AZ 85721, United States
| |
|
12
|
Niu B, Lee B, Wang L, Chen W, Johnson J. The Accurate Prediction of Antibody Deamidations by Combining High-Throughput Automated Peptide Mapping and Protein Language Model-Based Deep Learning. Antibodies (Basel) 2024; 13:74. [PMID: 39311379 PMCID: PMC11417914 DOI: 10.3390/antib13030074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2024] [Revised: 08/30/2024] [Accepted: 09/06/2024] [Indexed: 09/26/2024] Open
Abstract
Therapeutic antibodies such as monoclonal antibodies (mAbs) and bispecific and multispecific antibodies are pivotal in therapeutic protein development and have transformed disease treatments across various therapeutic areas. The integrity of therapeutic antibodies, however, is compromised by sequence liabilities, notably deamidation, where asparagine (N) and glutamine (Q) residues undergo chemical degradation. Deamidation negatively impacts the efficacy, stability, and safety of diverse classes of antibodies, making the early and accurate identification of vulnerable sites critical. In this article, a comprehensive antibody deamidation-specific dataset (n = 2285) of varied modalities was created by using high-throughput automated peptide mapping, followed by supervised machine learning to predict the deamidation propensities, as well as the extents, throughout the entire antibody sequences. We propose a novel chimeric deep learning model, integrating protein language model (pLM)-derived embeddings with local sequence information for enhanced deamidation predictions. Remarkably, this model requires only sequence inputs, eliminating the need for laborious feature engineering. Our approach demonstrates state-of-the-art performance, offering a streamlined workflow for high-throughput automated peptide mapping and deamidation prediction, with the potential of broader applicability to other antibody sequence liabilities.
Affiliation(s)
- Ben Niu
- Discovery Biotherapeutics, Bristol Myers Squibb, San Diego, CA 92121, USA
| | - Benjamin Lee
- Discovery Biotherapeutics, Bristol Myers Squibb, San Diego, CA 92121, USA
| | - Lili Wang
- Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720, USA
| | - Wen Chen
- Discovery Biotherapeutics, Bristol Myers Squibb, San Diego, CA 92121, USA
| | - Jeffrey Johnson
- Discovery Biotherapeutics, Bristol Myers Squibb, San Diego, CA 92121, USA
| |
|
13
|
Pratyush P, Bahmani S, Pokharel S, Ismail HD, KC DB. LMCrot: an enhanced protein crotonylation site predictor by leveraging an interpretable window-level embedding from a transformer-based protein language model. Bioinformatics 2024; 40:btae290. [PMID: 38662579 PMCID: PMC11088740 DOI: 10.1093/bioinformatics/btae290] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Revised: 02/13/2024] [Accepted: 04/24/2024] [Indexed: 05/13/2024] Open
Abstract
MOTIVATION Recent advancements in natural language processing have highlighted the effectiveness of global contextualized representations from protein language models (pLMs) in numerous downstream tasks. Nonetheless, strategies to encode the site-of-interest leveraging pLMs for per-residue prediction tasks, such as crotonylation (Kcr) prediction, remain largely uncharted. RESULTS Herein, we adopt a range of approaches for utilizing pLMs by experimenting with different input sequence types (full-length protein sequence versus window sequence), assessing the implications of utilizing the per-residue embedding of the site-of-interest as well as embeddings of window residues centered around it. Building upon these insights, we developed a novel residual ConvBiLSTM network designed to process window-level embeddings of the site-of-interest generated by the ProtT5-XL-UniRef50 pLM using full-length sequences as input. This model, termed T5ResConvBiLSTM, surpasses existing state-of-the-art Kcr predictors in performance across three diverse datasets. To validate our approach of utilizing full sequence-based window-level embeddings, we also delved into the interpretability of ProtT5-derived embedding tensors in two ways: firstly, by scrutinizing the attention weights obtained from the transformer's encoder block; and secondly, by computing SHAP values for these tensors, providing a model-agnostic interpretation of the prediction results. Additionally, we enhance the latent representation of ProtT5 by incorporating two additional local representations, one derived from amino acid properties and the other from a supervised embedding layer, through an intermediate fusion stacked generalization approach, using an n-mer window sequence (or peptide/fragment). The resultant stacked model, dubbed LMCrot, exhibits a more pronounced improvement in predictive performance across the tested datasets.
AVAILABILITY AND IMPLEMENTATION LMCrot is publicly available at https://github.com/KCLabMTU/LMCrot.
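The idea of a window-level embedding taken from a full-sequence embedding can be sketched as follows. The snippet assumes the per-residue vectors have already been produced by running a pLM on the full-length protein; the function name `window_embedding`, the `flank` parameter, and the zero-padding at sequence ends are assumptions of this sketch and are not confirmed by the abstract.

```python
def window_embedding(per_residue, site, flank):
    """Slice a (2 * flank + 1)-row window centred on `site`.

    `per_residue` is a list of equal-length vectors, one per residue, assumed
    to come from a language model run on the full-length sequence. Positions
    outside the sequence are zero-padded (an assumption of this sketch).
    """
    dim = len(per_residue[0])
    window = []
    for pos in range(site - flank, site + flank + 1):
        if 0 <= pos < len(per_residue):
            window.append(list(per_residue[pos]))
        else:
            window.append([0.0] * dim)
    return window


# Toy stand-in for a 5-residue protein with 2-dimensional embeddings.
toy = [[float(i), float(i) * 10] for i in range(1, 6)]

near_start = window_embedding(toy, site=0, flank=2)  # needs left padding
interior = window_embedding(toy, site=2, flank=1)    # fully inside the sequence

print(near_start)
print(interior)
```

Because each row is sliced from an embedding computed over the whole protein, the window rows can carry context from residues outside the window, which is the contrast with embedding the window sequence on its own.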
Affiliation(s)
- Pawel Pratyush
- Department of Computer Science, Michigan Technological University, Houghton, MI 49931, United States
| | - Soufia Bahmani
- Department of Computer Science, Michigan Technological University, Houghton, MI 49931, United States
| | - Suresh Pokharel
- Department of Computer Science, Michigan Technological University, Houghton, MI 49931, United States
| | - Hamid D Ismail
- Department of Computer Science, Michigan Technological University, Houghton, MI 49931, United States
| | - Dukka B KC
- Department of Computer Science, Michigan Technological University, Houghton, MI 49931, United States
| |
|
14
|
Prabhu H, Bhosale H, Sane A, Dhadwal R, Ramakrishnan V, Valadi J. Protein feature engineering framework for AMPylation site prediction. Sci Rep 2024; 14:8695. [PMID: 38622194 PMCID: PMC11369087 DOI: 10.1038/s41598-024-58450-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2023] [Accepted: 03/29/2024] [Indexed: 04/17/2024] Open
Abstract
AMPylation is a biologically significant yet understudied post-translational modification in which an adenosine monophosphate (AMP) group is added primarily to tyrosine and threonine residues. While recent work has illuminated the prevalence and functional impacts of AMPylation, experimental identification of AMPylation sites remains challenging. Computational prediction techniques provide a faster alternative approach. The predictive performance of machine learning models is highly dependent on the features used to represent the raw amino acid sequences. In this work, we introduce a novel feature extraction pipeline to encode the key properties relevant to AMPylation site prediction. We utilize a recently published dataset of curated AMPylation sites to develop our feature generation framework. We demonstrate the utility of our extracted features by training various machine learning classifiers on various numerical representations of the raw sequences extracted with the help of our framework. Tenfold cross-validation is used to evaluate the model's capability to distinguish between AMPylated and non-AMPylated sites. The top-performing set of extracted features achieved an MCC of 0.58, accuracy of 0.8, AUC-ROC of 0.85 and F1 score of 0.73. Further, we elucidate the behaviour of the model on the set of features consisting of monogram and bigram counts for various representations using SHapley Additive exPlanations.
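Monogram and bigram count features of the kind mentioned above can be illustrated with a short sketch. It counts single residues and overlapping residue pairs in a peptide given as a plain string; the function names and the toy peptide are hypothetical, and the abstract does not specify the alternative sequence representations or window sizes used by the framework.

```python
from collections import Counter


def ngram_counts(sequence, n):
    """Count overlapping n-grams in a sequence string."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def monogram_bigram_features(sequence):
    """Return (monogram counts, bigram counts) for one peptide."""
    return ngram_counts(sequence, 1), ngram_counts(sequence, 2)


# Toy peptide, not taken from any dataset.
mono, bi = monogram_bigram_features("AATY")

print(dict(mono))
print(dict(bi))
```

In practice such counts would be laid out in a fixed column order (one column per possible monogram or bigram) so that every peptide maps to a vector of the same length before being passed to a classifier.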
Affiliation(s)
- Hardik Prabhu
- Computing and Data Sciences, FLAME University, Pune, 412115, India
- Robert Bosch Centre for Cyber Physical Systems, Indian Institute of Science, Bengaluru, 560012, India
| | | | - Aamod Sane
- Computing and Data Sciences, FLAME University, Pune, 412115, India
| | - Renu Dhadwal
- Computing and Data Sciences, FLAME University, Pune, 412115, India
| | - Vigneshwar Ramakrishnan
- Bioinformatics Center, School of Chemical and Biotechnology, SASTRA Deemed to be University, Thanjavur, 613401, India
| | - Jayaraman Valadi
- Computing and Data Sciences, FLAME University, Pune, 412115, India.
| |
|
15
|
Palacios A, Acharya P, Peidl A, Beck M, Blanco E, Mishra A, Bawa-Khalfe T, Pakhrin S. SumoPred-PLM: human SUMOylation and SUMO2/3 sites Prediction using Pre-trained Protein Language Model. NAR Genom Bioinform 2024; 6:lqae011. [PMID: 38327870 PMCID: PMC10849187 DOI: 10.1093/nargab/lqae011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2023] [Revised: 11/17/2023] [Accepted: 01/17/2024] [Indexed: 02/09/2024] Open
Abstract
SUMOylation is an essential post-translational modification system with the ability to regulate nearly all aspects of cellular physiology. Three major small ubiquitin-like modifier (SUMO) paralogues, SUMO1, SUMO2 and SUMO3, form covalent bonds with lysine residues at consensus sites in protein substrates. Biochemical studies continue to identify unique biological functions for protein targets conjugated to SUMO1 versus the highly homologous SUMO2 and SUMO3 paralogues. Yet, the field has failed to harness contemporary AI approaches, including pre-trained protein language models, to fully expand and/or recognize the SUMOylated proteome. Herein, we present a novel, deep learning-based approach called SumoPred-PLM for human SUMOylation prediction with sensitivity, specificity, Matthews correlation coefficient, and accuracy of 74.64%, 73.36%, 0.48 and 74.00%, respectively, on the CPLM 4.0 independent test dataset. In addition, this novel platform uses contextualized embeddings obtained from a pre-trained protein language model, ProtT5-XL-UniRef50, to identify SUMO2/3-specific conjugation sites. The results demonstrate that SumoPred-PLM is a powerful and unique computational tool to predict SUMOylation sites in proteins and accelerate discovery.
Affiliation(s)
- Andrew Vargas Palacios
- Department of Computer Science and Engineering Technology, University of Houston-Downtown, 1 Main St., Houston, TX 77002, USA
| | - Pujan Acharya
- Department of Computer Science and Engineering Technology, University of Houston-Downtown, 1 Main St., Houston, TX 77002, USA
| | - Anthony Stephen Peidl
- Department of Biology and Biochemistry, Center for Nuclear Receptors & Cell Signaling, University of Houston, Houston, TX 77204, USA
| | - Moriah Rene Beck
- Department of Chemistry and Biochemistry, Wichita State University, 1845 Fairmount St., Wichita, KS 67260, USA
| | - Eduardo Blanco
- Department of Computer Science, University of Arizona, 1040 4th St., Tucson, AZ 85721, USA
| | - Avdesh Mishra
- Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, Kingsville, TX 78363, USA
| | - Tasneem Bawa-Khalfe
- Department of Biology and Biochemistry, Center for Nuclear Receptors & Cell Signaling, University of Houston, Houston, TX 77204, USA
| | - Subash Chandra Pakhrin
- Department of Computer Science and Engineering Technology, University of Houston-Downtown, 1 Main St., Houston, TX 77002, USA
| |
|