1
|
Chen SF, Steele RJ, Hocky GM, Lemeneh B, Lad SP, Oermann EK. Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions. ARXIV 2025:arXiv:2408.16245v3. [PMID: 40236839 PMCID: PMC11998858] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 04/17/2025]
Abstract
The transformer architecture has revolutionized bioinformatics and driven progress in the understanding and prediction of the properties of biomolecules. Almost all research on large-scale biosequence transformers has focused on one domain at a time (single-omic), usually DNA/RNA or proteins. These models have seen incredible success in downstream tasks in each domain, and have achieved particularly noteworthy breakthroughs in sequence modeling and structural modeling. However, these single-omic models are naturally incapable of efficiently modeling multi-omic tasks, one of the most biologically critical being protein-nucleic acid interactions. We present our work training the largest open-source multi-omic foundation model to date. We show that these multi-omic models (MOMs) can learn joint representations between various single-omic distributions that are emergently consistent with the Central Dogma of molecular biology despite only being trained on unlabeled biosequences. We further demonstrate that MOMs can be fine-tuned to achieve state-of-the-art results on protein-nucleic acid interaction tasks, namely predicting the change in Gibbs free energyΔ G of the binding interaction between a given nucleic acid and protein. Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any a priori structural training, allowing us to predict which protein residues are most involved in the protein-nucleic acid binding interaction. Lastly, we provide evidence that multi-omic biosequence models are in many cases superior to foundation models trained on single-omics distributions, both in performance-per-FLOP and absolute performance, suggesting a more generalized or foundational approach to building these models for biology.
Collapse
Affiliation(s)
- Sully F Chen
- Duke University School of Medicine, Durham, NC 27710, USA
| | | | - Glen M Hocky
- Department of Chemistry and Simons Center for Computational Physical Chemistry, New York University, New York, NY 10012, USA
| | | | - Shivanand P Lad
- Duke University School of Medicine, Department of Neurological Surgery, Durham, NC 27710, USA
| | - Eric K Oermann
- NYU Langone Health, Department of Neurological Surgery, New York, NY 10016, USA
| |
Collapse
|
2
|
Esmaili F, Qin Y, Wang D, Xu D. Kinase-substrate prediction using an autoregressive model. Comput Struct Biotechnol J 2025; 27:1103-1111. [PMID: 40190572 PMCID: PMC11968300 DOI: 10.1016/j.csbj.2025.03.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2024] [Revised: 02/27/2025] [Accepted: 03/01/2025] [Indexed: 04/09/2025] Open
Abstract
Kinase-specific phosphorylation plays a critical role in cellular signaling and various diseases. However, even in model organisms, the substrates of most kinases remain unidentified. Currently, there is no reliable method to predict kinase-substrate relationships. In this study, we introduce an innovative approach leveraging an autoregressive model to predict kinase-substrate pairs. Unlike traditional methods focused on predicting site-specific phosphorylation, our approach addresses kinase-specific protein substrate prediction at the protein level. We redefine this problem as a special type of protein-protein interaction prediction task. Our model integrates protein large language model ESM-2 as the encoder and employs an autoregressive decoder to classify protein-kinase interactions in a binary fashion. We adopted a hard negative strategy, based on kinase embedding distances generated from ESM-2, to compel the model to effectively distinguish positive from negative data. We conducted a top‑k analysis to assess how well our model can prioritize the most likely kinase candidates. Our method is also capable of zero-shot prediction, meaning it can predict substrates for a kinase in case of no known substrates, which cannot be achieved by site-specific prediction methods. Our model's robust generalization to novel kinase and underrepresented groups showcases its versatility and broad utility. Code and data are available at https://github.com/farz1995/substrate_kinase_prediction.
Collapse
Affiliation(s)
- Farzaneh Esmaili
- Data Science and Informatics Institute, Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
| | - Yongfang Qin
- Data Science and Informatics Institute, Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
| | - Duolin Wang
- Data Science and Informatics Institute, Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
| | - Dong Xu
- Data Science and Informatics Institute, Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
| |
Collapse
|
3
|
Kim J, Woo J, Park JY, Kim KJ, Kim D. Deep learning for NAD/NADP cofactor prediction and engineering using transformer attention analysis in enzymes. Metab Eng 2025; 87:86-94. [PMID: 39571721 DOI: 10.1016/j.ymben.2024.11.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2024] [Revised: 09/25/2024] [Accepted: 11/17/2024] [Indexed: 12/13/2024]
Abstract
Understanding and manipulating the cofactor preferences of NAD(P)-dependent oxidoreductases, the most widely distributed enzyme group in nature, is increasingly crucial in bioengineering. However, large-scale identification of the cofactor preferences and the design of mutants to switch cofactor specificity remain as complex tasks. Here, we introduce DISCODE (Deep learning-based Iterative pipeline to analyze Specificity of COfactors and to Design Enzyme), a novel transformer-based deep learning model to predict NAD(P) cofactor preferences. For model training, a total of 7,132 NAD(P)-dependent enzyme sequences were collected. Leveraging whole-length sequence information, DISCODE classifies the cofactor preferences of NAD(P)-dependent oxidoreductase protein sequences without structural or taxonomic limitation. The model showed 97.4% and 97.3% of accuracy and F1 score, respectively. A notable feature of DISCODE is the interpretability of its transformer layers. Analysis of attention layers in the model enables identification of several residues that showed significantly higher attention weights. They were well aligned with structurally important residues that closely interact with NAD(P), facilitating the identification of key residues for determining cofactor specificities. These key residues showed high consistency with verified cofactor switching mutants. Integrated into an enzyme design pipeline, DISCODE coupled with attention analysis, enables a fully automated approach to redesign cofactor specificity.
Collapse
Affiliation(s)
- Jaehyung Kim
- School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea
| | - Jihoon Woo
- School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea
| | - Joon Young Park
- School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea
| | - Kyung-Jin Kim
- School of Life Sciences, BK21 FOUR KNU Creative BioResearch Group, KNU Institute of Microbiology, Kyungpook National University, Daegu, 41566, Republic of Korea
| | - Donghyuk Kim
- School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea.
| |
Collapse
|
4
|
Zhang C, Tang D, Han C, Gou Y, Chen M, Huang X, Liu D, Zhao M, Xiao L, Xiao Q, Peng D, Xue Y. GPS-pPLM: A Language Model for Prediction of Prokaryotic Phosphorylation Sites. Cells 2024; 13:1854. [PMID: 39594603 PMCID: PMC11593113 DOI: 10.3390/cells13221854] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2024] [Revised: 11/06/2024] [Accepted: 11/07/2024] [Indexed: 11/28/2024] Open
Abstract
In the prokaryotic kingdom, protein phosphorylation serves as one of the most important posttranslational modifications (PTMs) and is involved in orchestrating a broad spectrum of biological processes. Here, we report an updated online server named the group-based prediction system for prokaryotic phosphorylation language model (GPS-pPLM), used for predicting phosphorylation sites (p-sites) in prokaryotes. For model training, two deep learning methods, a transformer and a deep neural network, were employed, and a total of 10 sequence features and contextual features were integrated. Using 44,839 nonredundant p-sites in 16,041 proteins from 95 prokaryotes, two general models for the prediction of O-phosphorylation and N-phosphorylation were first pretrained and then fine-tuned to construct 6 predictors specific for each phosphorylatable residue type as well as 134 species-specific predictors. Compared with other existing tools, the GPS-pPLM exhibits higher accuracy in predicting prokaryotic O-phosphorylation p-sites. Protein sequences in FASTA format or UniProt accession numbers can be submitted by users, and the predicted results are displayed in tabular form. In addition, we annotate the predicted p-sites with knowledge from 22 public resources, including experimental evidence, 3D structures, and disorder tendencies. The online service of the GPS-pPLM is freely accessible for academic research.
Collapse
Affiliation(s)
- Chi Zhang
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; (C.Z.); (D.T.); (C.H.); (Y.G.); (M.C.); (X.H.); (D.L.); (M.Z.); (L.X.)
| | - Dachao Tang
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; (C.Z.); (D.T.); (C.H.); (Y.G.); (M.C.); (X.H.); (D.L.); (M.Z.); (L.X.)
| | - Cheng Han
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; (C.Z.); (D.T.); (C.H.); (Y.G.); (M.C.); (X.H.); (D.L.); (M.Z.); (L.X.)
| | - Yujie Gou
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; (C.Z.); (D.T.); (C.H.); (Y.G.); (M.C.); (X.H.); (D.L.); (M.Z.); (L.X.)
| | - Miaomiao Chen
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; (C.Z.); (D.T.); (C.H.); (Y.G.); (M.C.); (X.H.); (D.L.); (M.Z.); (L.X.)
| | - Xinhe Huang
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; (C.Z.); (D.T.); (C.H.); (Y.G.); (M.C.); (X.H.); (D.L.); (M.Z.); (L.X.)
| | - Dan Liu
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; (C.Z.); (D.T.); (C.H.); (Y.G.); (M.C.); (X.H.); (D.L.); (M.Z.); (L.X.)
| | - Miaoying Zhao
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; (C.Z.); (D.T.); (C.H.); (Y.G.); (M.C.); (X.H.); (D.L.); (M.Z.); (L.X.)
| | - Leming Xiao
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; (C.Z.); (D.T.); (C.H.); (Y.G.); (M.C.); (X.H.); (D.L.); (M.Z.); (L.X.)
| | - Qiang Xiao
- School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China;
| | - Di Peng
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; (C.Z.); (D.T.); (C.H.); (Y.G.); (M.C.); (X.H.); (D.L.); (M.Z.); (L.X.)
| | - Yu Xue
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; (C.Z.); (D.T.); (C.H.); (Y.G.); (M.C.); (X.H.); (D.L.); (M.Z.); (L.X.)
| |
Collapse
|
5
|
Deng Q, Zhang J, Liu J, Liu Y, Dai Z, Zou X, Li Z. Identifying Protein Phosphorylation Site-Disease Associations Based on Multi-Similarity Fusion and Negative Sample Selection by Convolutional Neural Network. Interdiscip Sci 2024; 16:649-664. [PMID: 38457108 DOI: 10.1007/s12539-024-00615-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2023] [Revised: 01/26/2024] [Accepted: 01/29/2024] [Indexed: 03/09/2024]
Abstract
As one of the most important post-translational modifications (PTMs), protein phosphorylation plays a key role in a variety of biological processes. Many studies have shown that protein phosphorylation is associated with various human diseases. Therefore, identifying protein phosphorylation site-disease associations can help to elucidate the pathogenesis of disease and discover new drug targets. Networks of sequence similarity and Gaussian interaction profile kernel similarity were constructed for phosphorylation sites, as well as networks of disease semantic similarity, disease symptom similarity and Gaussian interaction profile kernel similarity were constructed for diseases. To effectively combine different phosphorylation sites and disease similarity information, random walk with restart algorithm was used to obtain the topology information of the network. Then, the diffusion component analysis method was utilized to obtain the comprehensive phosphorylation site similarity and disease similarity. Meanwhile, the reliable negative samples were screened based on the Euclidean distance method. Finally, a convolutional neural network (CNN) model was constructed to identify potential associations between phosphorylation sites and diseases. Based on tenfold cross-validation, the evaluation indicators were obtained including accuracy of 93.48%, specificity of 96.82%, sensitivity of 90.15%, precision of 96.62%, Matthew's correlation coefficient of 0.8719, area under the receiver operating characteristic curve of 0.9786 and area under the precision-recall curve of 0.9836. Additionally, most of the top 20 predicted disease-related phosphorylation sites (19/20 for Alzheimer's disease; 20/16 for neuroblastoma) were verified by literatures and databases. These results show that the proposed method has an outstanding prediction performance and a high practical value.
Collapse
Affiliation(s)
- Qian Deng
- School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, China
| | - Jing Zhang
- School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, China
| | - Jie Liu
- School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, China
| | - Yuqi Liu
- School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, China
| | - Zong Dai
- School of Biomedical Engineering, Sun Yat-Sen University, Guangzhou, 510275, China
| | - Xiaoyong Zou
- School of Chemistry, Sun Yat-Sen University, Guangzhou, 510275, China.
| | - Zhanchao Li
- School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, China.
| |
Collapse
|
6
|
Wenzel M, Grüner E, Strodthoff N. Insights into the inner workings of transformer models for protein function prediction. Bioinformatics 2024; 40:btae031. [PMID: 38244570 PMCID: PMC10950482 DOI: 10.1093/bioinformatics/btae031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2023] [Revised: 12/14/2023] [Accepted: 01/16/2024] [Indexed: 01/22/2024] Open
Abstract
MOTIVATION We explored how explainable artificial intelligence (XAI) can help to shed light into the inner workings of neural networks for protein function prediction, by extending the widely used XAI method of integrated gradients such that latent representations inside of transformer models, which were finetuned to Gene Ontology term and Enzyme Commission number prediction, can be inspected too. RESULTS The approach enabled us to identify amino acids in the sequences that the transformers pay particular attention to, and to show that these relevant sequence parts reflect expectations from biology and chemistry, both in the embedding layer and inside of the model, where we identified transformer heads with a statistically significant correspondence of attribution maps with ground truth sequence annotations (e.g. transmembrane regions, active sites) across many proteins. AVAILABILITY AND IMPLEMENTATION Source code can be accessed at https://github.com/markuswenzel/xai-proteins.
Collapse
Affiliation(s)
- Markus Wenzel
- Department of Artificial Intelligence, Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut, HHI, Einsteinufer 37, 10587 Berlin, Germany
| | - Erik Grüner
- Department of Artificial Intelligence, Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut, HHI, Einsteinufer 37, 10587 Berlin, Germany
| | - Nils Strodthoff
- School VI - Medicine and Health Services, Carl von Ossietzky University of Oldenburg, Ammerländer Heerstr. 114-118, 26129 Oldenburg, Germany
| |
Collapse
|
7
|
Taujale R, Gravel N, Zhou Z, Yeung W, Kochut K, Kannan N. Informatic challenges and advances in illuminating the druggable proteome. Drug Discov Today 2024; 29:103894. [PMID: 38266979 DOI: 10.1016/j.drudis.2024.103894] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2023] [Revised: 01/08/2024] [Accepted: 01/17/2024] [Indexed: 01/26/2024]
Abstract
The understudied members of the druggable proteomes offer promising prospects for drug discovery efforts. While large-scale initiatives have generated valuable functional information on understudied members of the druggable gene families, translating this information into actionable knowledge for drug discovery requires specialized informatics tools and resources. Here, we review the unique informatics challenges and advances in annotating understudied members of the druggable proteome. We demonstrate the application of statistical evolutionary inference tools, knowledge graph mining approaches, and protein language models in illuminating understudied protein kinases, pseudokinases, and ion channels.
Collapse
Affiliation(s)
- Rahil Taujale
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA, USA
| | - Nathan Gravel
- Institute of Bioinformatics, University of Georgia, Athens, GA, USA
| | | | - Wayland Yeung
- Institute of Bioinformatics, University of Georgia, Athens, GA, USA
| | - Krystof Kochut
- School of Computing, University of Georgia, Athens, GA, USA
| | - Natarajan Kannan
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA, USA; Institute of Bioinformatics, University of Georgia, Athens, GA, USA.
| |
Collapse
|
8
|
Zhou Z, Yeung W, Soleymani S, Gravel N, Salcedo M, Li S, Kannan N. Using explainable machine learning to uncover the kinase-substrate interaction landscape. Bioinformatics 2024; 40:btae033. [PMID: 38244571 PMCID: PMC10868336 DOI: 10.1093/bioinformatics/btae033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2023] [Revised: 12/09/2023] [Accepted: 01/17/2024] [Indexed: 01/22/2024] Open
Abstract
MOTIVATION Phosphorylation, a post-translational modification regulated by protein kinase enzymes, plays an essential role in almost all cellular processes. Understanding how each of the nearly 500 human protein kinases selectively phosphorylates their substrates is a foundational challenge in bioinformatics and cell signaling. Although deep learning models have been a popular means to predict kinase-substrate relationships, existing models often lack interpretability and are trained on datasets skewed toward a subset of well-studied kinases. RESULTS Here we leverage recent peptide library datasets generated to determine substrate specificity profiles of 300 serine/threonine kinases to develop an explainable Transformer model for kinase-peptide interaction prediction. The model, trained solely on primary sequences, achieved state-of-the-art performance. Its unique multitask learning paradigm built within the model enables predictions on virtually any kinase-peptide pair, including predictions on 139 kinases not used in peptide library screens. Furthermore, we employed explainable machine learning methods to elucidate the model's inner workings. Through analysis of learned embeddings at different training stages, we demonstrate that the model employs a unique strategy of substrate prediction considering both substrate motif patterns and kinase evolutionary features. SHapley Additive exPlanation (SHAP) analysis reveals key specificity determining residues in the peptide sequence. Finally, we provide a web interface for predicting kinase-substrate associations for user-defined sequences and a resource for visualizing the learned kinase-substrate associations. AVAILABILITY AND IMPLEMENTATION All code and data are available at https://github.com/esbgkannan/Phosformer-ST. Web server is available at https://phosformer.netlify.app.
Collapse
Affiliation(s)
- Zhongliang Zhou
- School of Computing, University of Georgia, Athens, GA 30602, United States
| | - Wayland Yeung
- Institute of Bioinformatics, University of Georgia, Athens, GA 30602, United States
| | - Saber Soleymani
- School of Computing, University of Georgia, Athens, GA 30602, United States
| | - Nathan Gravel
- Institute of Bioinformatics, University of Georgia, Athens, GA 30602, United States
| | - Mariah Salcedo
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA 30602, United States
| | - Sheng Li
- School of Data Science, University of Virginia, Charlottesville, VA 22903, United States
| | - Natarajan Kannan
- Institute of Bioinformatics, University of Georgia, Athens, GA 30602, United States
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA 30602, United States
| |
Collapse
|
9
|
Poretsky E, Andorf CM, Sen TZ. PhosBoost: Improved phosphorylation prediction recall using gradient boosting and protein language models. PLANT DIRECT 2023; 7:e554. [PMID: 38124705 PMCID: PMC10732782 DOI: 10.1002/pld3.554] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 08/10/2023] [Revised: 11/20/2023] [Accepted: 11/26/2023] [Indexed: 12/23/2023]
Abstract
Protein phosphorylation is a dynamic and reversible post-translational modification that regulates a variety of essential biological processes. The regulatory role of phosphorylation in cellular signaling pathways, protein-protein interactions, and enzymatic activities has motivated extensive research efforts to understand its functional implications. Experimental protein phosphorylation data in plants remains limited to a few species, necessitating a scalable and accurate prediction method. Here, we present PhosBoost, a machine-learning approach that leverages protein language models and gradient-boosting trees to predict protein phosphorylation from experimentally derived data. Trained on data obtained from a comprehensive plant phosphorylation database, qPTMplants, we compared the performance of PhosBoost to existing protein phosphorylation prediction methods, PhosphoLingo and DeepPhos. For serine and threonine prediction, PhosBoost achieved higher recall than PhosphoLingo and DeepPhos (.78, .56, and .14, respectively) while maintaining a competitive area under the precision-recall curve (.54, .56, and .42, respectively). PhosphoLingo and DeepPhos failed to predict any tyrosine phosphorylation sites, while PhosBoost achieved a recall score of .6. Despite the precision-recall tradeoff, PhosBoost offers improved performance when recall is prioritized while consistently providing more confident probability scores. A sequence-based pairwise alignment step improved prediction results for all classifiers by effectively increasing the number of inferred positive phosphosites. We provide evidence to show that PhosBoost models are transferable across species and scalable for genome-wide protein phosphorylation predictions. PhosBoost is freely and publicly available on GitHub.
Collapse
Affiliation(s)
- Elly Poretsky
- Agricultural Research Service, Crop Improvement and Genetics Research UnitU.S. Department of AgricultureAlbanyCAUnited States
| | - Carson M. Andorf
- Agricultural Research Service, Corn Insects and Crop Genetics ResearchU.S. Department of AgricultureAmesIAUnited States
- Department of Computer ScienceIowa State UniversityAmesIAUnited States
| | - Taner Z. Sen
- Agricultural Research Service, Crop Improvement and Genetics Research UnitU.S. Department of AgricultureAlbanyCAUnited States
- Department of BioengineeringUniversity of CaliforniaBerkeleyCAUnited States
| |
Collapse
|
10
|
Valentini G, Malchiodi D, Gliozzo J, Mesiti M, Soto-Gomez M, Cabri A, Reese J, Casiraghi E, Robinson PN. The promises of large language models for protein design and modeling. FRONTIERS IN BIOINFORMATICS 2023; 3:1304099. [PMID: 38076030 PMCID: PMC10701588 DOI: 10.3389/fbinf.2023.1304099] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2023] [Accepted: 11/07/2023] [Indexed: 10/16/2024] Open
Abstract
The recent breakthroughs of Large Language Models (LLMs) in the context of natural language processing have opened the way to significant advances in protein research. Indeed, the relationships between human natural language and the "language of proteins" invite the application and adaptation of LLMs to protein modelling and design. Considering the impressive results of GPT-4 and other recently developed LLMs in processing, generating and translating human languages, we anticipate analogous results with the language of proteins. Indeed, protein language models have been already trained to accurately predict protein properties, generate novel functionally characterized proteins, achieving state-of-the-art results. In this paper we discuss the promises and the open challenges raised by this novel and exciting research area, and we propose our perspective on how LLMs will affect protein modeling and design.
Collapse
Affiliation(s)
- Giorgio Valentini
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
- ELLIS, European Laboratory for Learning and Intelligent Systems, Milan, Italy
| | - Dario Malchiodi
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
| | - Jessica Gliozzo
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
- European Commission, Joint Research Centre (JRC), Ispra, Italy
| | - Marco Mesiti
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
| | - Mauricio Soto-Gomez
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
| | - Alberto Cabri
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
| | - Justin Reese
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, United States
| | - Elena Casiraghi
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
- ELLIS, European Laboratory for Learning and Intelligent Systems, Milan, Italy
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, United States
| | | |
Collapse
|
11
|
Liang Z, Liu T, Li Q, Zhang G, Zhang B, Du X, Liu J, Chen Z, Ding H, Hu G, Lin H, Zhu F, Luo C. Deciphering the functional landscape of phosphosites with deep neural network. Cell Rep 2023; 42:113048. [PMID: 37659078 DOI: 10.1016/j.celrep.2023.113048] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2023] [Revised: 07/11/2023] [Accepted: 08/11/2023] [Indexed: 09/04/2023] Open
Abstract
Current biochemical approaches have only identified the most well-characterized kinases for a tiny fraction of the phosphoproteome, and the functional assignments of phosphosites are almost negligible. Herein, we analyze the substrate preference catalyzed by a specific kinase and present a novel integrated deep neural network model named FuncPhos-SEQ for functional assignment of human proteome-level phosphosites. FuncPhos-SEQ incorporates phosphosite motif information from a protein sequence using multiple convolutional neural network (CNN) channels and network features from protein-protein interactions (PPIs) using network embedding and deep neural network (DNN) channels. These concatenated features are jointly fed into a heterogeneous feature network to prioritize functional phosphosites. Combined with a series of in vitro and cellular biochemical assays, we confirm that NADK-S48/50 phosphorylation could activate its enzymatic activity. In addition, ERK1/2 are discovered as the primary kinases responsible for NADK-S48/50 phosphorylation. Moreover, FuncPhos-SEQ is developed as an online server.
Collapse
Affiliation(s)
- Zhongjie Liang
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou 215123, China; Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou 215123, China
| | - Tonghai Liu
- Zhongshan Institute for Drug Discovery, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Zhongshan 528437, China; State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
| | - Qi Li
- Zhongshan Institute for Drug Discovery, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Zhongshan 528437, China; State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
| | - Guangyu Zhang
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
| | - Bei Zhang
- State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
| | - Xikun Du
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou 215123, China
| | - Jingqiu Liu
- State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
| | - Zhifeng Chen
- State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
| | - Hong Ding
- State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
| | - Guang Hu
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou 215123, China; Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou 215123, China
| | - Hao Lin
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
| | - Fei Zhu
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China.
| | - Cheng Luo
- Zhongshan Institute for Drug Discovery, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Zhongshan 528437, China; State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China; School of Pharmaceutical Science and Technology, Hangzhou Institute for Advanced Study, UCAS, Hangzhou 310024, China; School of Life Science and Technology, Shanghai Tech University, 100 Haike Road, Shanghai 201210, China; School of Pharmacy, Fujian Medical University, Fuzhou 350122, China.
| |
Collapse
|