1
Ma T, Jiang M, Pang S, Zhang Z, Hang H, Zhou W, Zhang Y. SeqMG-RPI: A Sequence-Based Framework Integrating Multi-Scale RNA Features and Protein Graphs for RNA-Protein Interaction Prediction. J Chem Inf Model 2025; 65:4698-4713. [PMID: 40262169 DOI: 10.1021/acs.jcim.5c00176] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/24/2025]
Abstract
RNA-protein interaction (RPI) plays a crucial role in cell biology, and accurate prediction of RPI is essential to understand molecular mechanisms and advance disease research. Some existing RPI prediction methods rely on a single feature, leaving significant room for improvement. In this paper, we propose a novel sequence-based RPI prediction method, called SeqMG-RPI. For RNA, SeqMG-RPI introduces an innovative multi-scale RNA feature that integrates three sequence-based representations: a multi-channel RNA feature, a k-mer frequency feature, and a k-mer sparse matrix feature. For protein, SeqMG-RPI utilizes a graph-based protein feature to capture protein information. Moreover, a novel neural network architecture is constructed for feature extraction and RPI prediction. Experiments from multiple perspectives across various datasets demonstrate that the proposed method outperforms existing methods in both performance and generalization.
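As an illustration of what a k-mer frequency feature generally looks like, the sketch below counts overlapping k-mers in an RNA sequence and normalizes them to frequencies. It is a generic example, not the SeqMG-RPI implementation: the function name `kmer_frequencies`, the default `k=3`, and the handling of `T` and ambiguous bases are all assumptions made for this sketch.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq: str, k: int = 3, alphabet: str = "ACGU") -> dict[str, float]:
    """Return the normalized frequency of every possible k-mer in `seq`.

    Hypothetical helper for illustration only; windows containing
    characters outside `alphabet` (e.g. N) are ignored.
    """
    # Assumption: treat DNA-style input as RNA by mapping T to U.
    seq = seq.upper().replace("T", "U")
    # Enumerate all len(alphabet)**k k-mers so the vector has a fixed length.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count overlapping windows of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


freqs = kmer_frequencies("AUGAUG", k=2)
print(freqs["AU"], freqs["UG"], freqs["GA"])
```

The fixed-length output (16 entries for k=2, 64 for k=3) is what makes this kind of feature convenient as input to a downstream model.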
Affiliation(s)
- Teng Ma
- School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266525, China
- Mingjian Jiang
- School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266525, China
- Shunpeng Pang
- School of Computer Engineering, Weifang University, Weifang 261061, China
- Zhi Zhang
- School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266525, China
- Huaibin Hang
- School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266525, China
- Wei Zhou
- School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266525, China
- Yuanyuan Zhang
- School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266525, China
2
Li S, Li W, Shao Y, Wang M, Yin C, Xin Z. Development of DeepPQK and DeepQK sequence-based deep learning models to predict protein-ligand affinity and application in the directed evolution of ferulic esterase DLfae4. Int J Biol Macromol 2025; 307:141790. [PMID: 40054795 DOI: 10.1016/j.ijbiomac.2025.141790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2024] [Revised: 03/04/2025] [Accepted: 03/04/2025] [Indexed: 05/07/2025]
Abstract
Affinity plays an essential role in the rate and stability of enzyme-catalyzed reactions, thus directly impacting the catalytic activity. In general, the predictive method for protein-ligand binding affinity mainly relies on high-resolution protein crystal structure data; however, some protein crystals are difficult to culture, time-consuming, and expensive to obtain. In this study, two sequence-based deep learning models, DeepPQK and DeepQK, were constructed to predict the protein-ligand binding affinity. DeepPQK was developed by integrating local and global contextual features using convolutional neural networks (CNN) with protein sequences, pocket amino acids, and ligands as input. In particular, the protein-binding pocket, which possesses special properties for directly binding the ligand, was used as the local input feature for predicting protein-ligand binding affinity. DeepQK, consisting of a protein sequence module and a ligand module, utilizes these features for its predictions, enabling the identification of the intrinsic relationship between protein sequence and affinity. Specifically, dilated convolution was used to capture multiscale long-range interactions and the special sequence-level features of a protein and ligand. When tested on the 2016 core dataset, the Pearson correlation coefficients of DeepPQK and DeepQK reached 0.805 and 0.804, respectively, which is a significant accuracy improvement compared with recent state-of-the-art methods. Both models, once trained, can learn the two- and three-dimensional structural properties of proteins, and the relative position relationship between proteins and ligands. Based on the results, a series of variants of feruloyl esterase DLFae4 were designed using DeepPQK and DeepQK, and the enzyme activity of these mutants was verified by experiments, among which the optimal mutant I149G/W237H/M297C showed 5.6-fold higher enzyme activity and 10.1-fold higher catalytic efficiency than the wild-type enzyme. In conclusion, the DeepPQK and DeepQK deep learning models overcome the limitations of traditional methods that depend on protein crystal structures and have been successfully applied to guide the directed evolution of enzymes, providing a new approach to studying enzyme-directed evolution. The source code is available at https://github.com/KK-SW1207/DeepPQK_QK.
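For readers unfamiliar with the metric reported above, the following minimal sketch computes a Pearson correlation coefficient between predicted and measured affinities. The function name `pearson_r` and the example values are hypothetical and chosen only for illustration; they are not data from the paper.

```python
import math


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of co-deviations and the two sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical measured vs. predicted affinities (e.g. pKd values).
measured = [4.2, 5.1, 6.3, 7.8, 9.0]
predicted = [4.5, 5.0, 6.9, 7.1, 8.8]
print(round(pearson_r(measured, predicted), 3))
```

A value near 1 indicates that predictions rank and scale with the measurements; it does not by itself measure absolute error.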
Affiliation(s)
- Siwei Li
- Key Laboratory of Food Processing and Quality Control, College of Food Science and Technology, Nanjing Agricultural University, Nanjing 210095, PR China
- Wenqing Li
- Key Laboratory of Food Processing and Quality Control, College of Food Science and Technology, Nanjing Agricultural University, Nanjing 210095, PR China
- Yuting Shao
- Key Laboratory of Food Processing and Quality Control, College of Food Science and Technology, Nanjing Agricultural University, Nanjing 210095, PR China
- Mengxi Wang
- Key Laboratory of Food Processing and Quality Control, College of Food Science and Technology, Nanjing Agricultural University, Nanjing 210095, PR China
- Chenyue Yin
- Key Laboratory of Food Processing and Quality Control, College of Food Science and Technology, Nanjing Agricultural University, Nanjing 210095, PR China
- Zhihong Xin
- Key Laboratory of Food Processing and Quality Control, College of Food Science and Technology, Nanjing Agricultural University, Nanjing 210095, PR China
3
Jiang J, Chen L, Zhu Y, Shi Y, Qiu H, Zhang B, Zhou T, Wei GW. Proteomic Learning of Gamma-Aminobutyric Acid (GABA) Receptor-Mediated Anesthesia. J Chem Inf Model 2025; 65:3655-3668. [PMID: 40094320 PMCID: PMC12004937 DOI: 10.1021/acs.jcim.5c00114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2025] [Revised: 02/27/2025] [Accepted: 03/04/2025] [Indexed: 03/19/2025]
Abstract
Anesthetics are crucial in surgical procedures and therapeutic interventions, but they come with side effects and varying levels of effectiveness, calling for novel anesthetic agents that offer more precise and controllable effects. Targeting gamma-aminobutyric acid (GABA) receptors, the primary inhibitory receptors in the central nervous system, could enhance their inhibitory action, potentially reducing side effects while improving the potency of anesthetics. In this study, we introduce a proteomic learning of GABA receptor-mediated anesthesia based on 24 GABA receptor subtypes by considering over 4000 proteins in protein-protein interaction (PPI) networks and over 1.5 million known binding compounds. We develop a corresponding drug-target interaction network to identify potential lead compounds for novel anesthetic design. To ensure robust proteomic learning predictions, we curated a data set comprising 136 targets from a pool of 980 targets within the PPI networks. We employed three machine learning algorithms, integrating advanced natural language processing (NLP) models such as pretrained transformers and autoencoder embeddings. Through a comprehensive screening process, we evaluated the side effects and repurposing potential of over 180,000 drug candidates targeting the GABRA5 receptor. Additionally, we assessed the ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties of these candidates to identify those with near-optimal characteristics. This approach also involved optimizing the structures of existing anesthetics. Our work presents an innovative strategy for the development of new anesthetic drugs, optimization of anesthetic use, and a deeper understanding of potential anesthesia-related side effects.
Affiliation(s)
- Jian Jiang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Department of Mathematics, Michigan State University, East Lansing 48824, Michigan, United States
- Long Chen
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Yueying Zhu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Yazhou Shi
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Huahai Qiu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Bengong Zhang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Tianshou Zhou
- Key Laboratory of Computational Mathematics, Guangdong Province, and School of Mathematics, Sun Yat-sen University, Guangzhou 510006, P. R. China
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing 48824, Michigan, United States
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing 48824, Michigan, United States
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing 48824, Michigan, United States
4
Michels J, Bandarupalli R, Ahangar Akbari A, Le T, Xiao H, Li J, Hom EFY. Natural Language Processing Methods for the Study of Protein-Ligand Interactions. J Chem Inf Model 2025; 65:2191-2213. [PMID: 39993834 PMCID: PMC11898065 DOI: 10.1021/acs.jcim.4c01907] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2024] [Revised: 02/05/2025] [Accepted: 02/06/2025] [Indexed: 02/26/2025]
Abstract
Natural Language Processing (NLP) has revolutionized the way computers are used to study and interact with human languages and is increasingly influential in the study of protein and ligand binding, which is critical for drug discovery and development. This review examines how NLP techniques have been adapted to decode the "language" of proteins and small molecule ligands to predict protein-ligand interactions (PLIs). We discuss how methods such as long short-term memory (LSTM) networks, transformers, and attention mechanisms can leverage different protein and ligand data types to identify potential interaction patterns. Significant challenges are highlighted including the scarcity of high-quality negative data, difficulties in interpreting model decisions, and sampling biases in existing data sets. We argue that focusing on improving data quality, enhancing model robustness, and fostering both collaboration and competition could catalyze future advances in machine-learning-based predictions of PLIs.
Affiliation(s)
- James Michels
- Department of Computer and Information Science, University of Mississippi, University, Mississippi 38677, United States
- Ramya Bandarupalli
- Department of BioMolecular Sciences, School of Pharmacy, University of Mississippi, University, Mississippi 38677, United States
- Amin Ahangar Akbari
- Department of BioMolecular Sciences, School of Pharmacy, University of Mississippi, University, Mississippi 38677, United States
- Thai Le
- Department of Computer Science, Indiana University, Bloomington, Indiana 47408, United States
- Hong Xiao
- Department of Computer and Information Science and Institute for Data Science, University of Mississippi, University, Mississippi 38677, United States
- Jing Li
- Department of BioMolecular Sciences, School of Pharmacy, University of Mississippi, University, Mississippi 38677, United States
- Erik F. Y. Hom
- Department of Biology and Center for Biodiversity and Conservation Research, University of Mississippi, University, Mississippi 38677, United States
5
Li J, Gong X. Harnessing pre-trained models for accurate prediction of protein-ligand binding affinity. BMC Bioinformatics 2025; 26:55. [PMID: 39962390 PMCID: PMC11834573 DOI: 10.1186/s12859-025-06064-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2024] [Accepted: 01/22/2025] [Indexed: 02/20/2025] Open
Abstract
BACKGROUND The binding between proteins and ligands plays a crucial role in the field of drug discovery. However, this area currently faces numerous challenges. On one hand, existing methods are constrained by the limited availability of labeled data, often performing inadequately when addressing complex protein-ligand interactions. On the other hand, many models struggle to effectively capture the flexible variations and relative spatial relationships between proteins and ligands. These issues not only significantly hinder the advancement of protein-ligand binding research but also adversely affect the accuracy and efficiency of drug discovery. Therefore, in response to these challenges, our study aims to enhance predictive capabilities through innovative approaches, providing more reliable support for drug discovery efforts. METHODS This study leverages a pre-trained model with spatial awareness to enhance the prediction of protein-ligand binding affinity. By perturbing the structures of small molecules in a manner consistent with physical constraints and employing self-supervised tasks, we improve the representation of small molecule structures, allowing for better adaptation to affinity predictions. Meanwhile, our approach enables the identification of potential binding sites on proteins. RESULTS Our model demonstrates a significantly higher correlation coefficient in binding affinity predictions. Extensive evaluation on the PDBBind v2019 refined set, CASF, and Merck FEP benchmarks confirms the model's robustness and strong generalization across diverse datasets. Additionally, the model achieves over 95% in classification ROC for binding site identification, underscoring its high accuracy in pinpointing protein-ligand interaction regions. 
CONCLUSION This research presents a novel approach that not only enhances the accuracy of binding affinity predictions but also facilitates the identification of binding sites, showcasing the potential of pre-trained models in computational drug design. Data and code are available at https://github.com/MIALAB-RUC/SableBind .
Affiliation(s)
- Jiashan Li
- Institute for Mathematical Sciences, School of Mathematics, Renmin University of China, 59 Zhongguancun Street, Beijing, 100872, China
- Xinqi Gong
- Institute for Mathematical Sciences, School of Mathematics, Renmin University of China, 59 Zhongguancun Street, Beijing, 100872, China
6
Yue Y, Cheng Y, Marquet C, Xiao C, Guo J, Li S, He S. Meta-Learning Enables Complex Cluster-Specific Few-Shot Binding Affinity Prediction for Protein-Protein Interactions. J Chem Inf Model 2025; 65:580-588. [PMID: 39772708 DOI: 10.1021/acs.jcim.4c01607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/11/2025]
Abstract
Predicting protein-protein interaction (PPI) binding affinities in unseen protein complex clusters is essential for elucidating complex protein interactions and for the targeted screening of peptide- or protein-based drugs. We introduce MCGLPPI++, a meta-learning framework designed to improve the adaptability of pretrained geometric models in such scenarios. To effectively boost the meta-learning optimization by injecting prior intersample distribution knowledge, three specially designed training sample cluster splitting patterns based on protein interaction interfaces are introduced. Additionally, MCGLPPI++ is equipped with an independent energy component which explicitly models interface nonbonded interaction energies closely related to the strengths of PPIs. To validate our approach, we curate a new data set featuring a challenging test cluster of T-cell receptors binding to antigenic peptide-MHC molecules (TCR-pMHC). Experimental results show that geometric models enhanced by the MCGLPPI++ framework achieve significantly more robust binding affinity predictions after fine-tuning on a few samples from this novel cluster compared to their vanilla counterparts, which demonstrates the effectiveness of the framework.
Affiliation(s)
- Yang Yue
- School of Computer Science, The University of Birmingham, Edgbaston, Birmingham B15 2TT, U.K.
- Yihua Cheng
- School of Computer Science, The University of Birmingham, Edgbaston, Birmingham B15 2TT, U.K.
- Céline Marquet
- Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching 85748, Munich, Germany
- Chenguang Xiao
- School of Computer Science, The University of Birmingham, Edgbaston, Birmingham B15 2TT, U.K.
- Jingjing Guo
- Centre of Artificial Intelligence Driven Drug Discovery, Faculty of Applied Science, Macao Polytechnic University, Macao SAR 999078, China
- Shu Li
- Centre of Artificial Intelligence Driven Drug Discovery, Faculty of Applied Science, Macao Polytechnic University, Macao SAR 999078, China
- Shan He
- School of Computer Science, The University of Birmingham, Edgbaston, Birmingham B15 2TT, U.K.
7
Wang J, Mao J, Li C, Xiang H, Wang X, Wang S, Wang Z, Chen Y, Li Y, No KT, Song T, Zeng X. Interface-aware molecular generative framework for protein-protein interaction modulators. J Cheminform 2024; 16:142. [PMID: 39707457 DOI: 10.1186/s13321-024-00930-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2024] [Accepted: 11/11/2024] [Indexed: 12/23/2024] Open
Abstract
Protein-protein interactions (PPIs) play a crucial role in numerous biochemical and biological processes. Although several structure-based molecular generative models have been developed, PPI interfaces and compounds targeting PPIs exhibit distinct physicochemical properties compared to traditional binding pockets and small-molecule drugs. As a result, generating compounds that effectively target PPIs, particularly by considering PPI complexes or interface hotspot residues, remains a significant challenge. In this work, we constructed a comprehensive dataset of PPI interfaces with active and inactive compound pairs. Based on this, we propose a novel molecular generative framework tailored to PPI interfaces, named GENiPPI. Our evaluation demonstrates that GENiPPI captures the implicit relationships between the PPI interfaces and the active molecules, and can generate novel compounds that target these interfaces. Moreover, GENiPPI can generate structurally diverse novel compounds with limited PPI interface modulators. To the best of our knowledge, this is the first exploration of a structure-based molecular generative model focused on PPI interfaces, which could facilitate the design of PPI modulators. The PPI interface-based molecular generative model enriches the existing landscape of structure-based (pocket/interface) molecular generative model. SCIENTIFIC CONTRIBUTION: This study introduces GENiPPI, a protein-protein interaction (PPI) interface-aware molecular generative framework. The framework first employs Graph Attention Networks to capture atomic-level interaction features at the protein complex interface. Subsequently, Convolutional Neural Networks extract compound representations in voxel and electron density spaces. These features are integrated into a Conditional Wasserstein Generative Adversarial Network, which trains the model to generate compound representations targeting PPI interfaces. 
GENiPPI effectively captures the relationship between PPI interfaces and active/inactive compounds. Furthermore, in few-shot molecular generation, GENiPPI successfully generates compounds comparable to known disruptors. GENiPPI provides an efficient tool for structure-based design of PPI modulators.
Affiliation(s)
- Jianmin Wang
- Department of Integrative Biotechnology, Yonsei University, Incheon, 21983, Republic of Korea
- Jiashun Mao
- Department of Integrative Biotechnology, Yonsei University, Incheon, 21983, Republic of Korea
- Chunyan Li
- School of Informatics, Yunnan Normal University, Kunming, China
- Hongxin Xiang
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, Hunan, China
- Xun Wang
- School of Computer Science and Technology, China University of Petroleum, Qingdao, 266580, Shandong, China
- High Performance Computer Research Center, University of Chinese Academy of Sciences, Beijing, 100190, China
- Shuang Wang
- School of Computer Science and Technology, China University of Petroleum, Qingdao, 266580, Shandong, China
- Zixu Wang
- Department of Computer Science, University of Tsukuba, Tsukuba, 3058577, Japan
- Yangyang Chen
- Department of Computer Science, University of Tsukuba, Tsukuba, 3058577, Japan
- Yuquan Li
- College of Chemistry and Chemical Engineering, Lanzhou University, Lanzhou, China
- Kyoung Tai No
- Department of Integrative Biotechnology, Yonsei University, Incheon, 21983, Republic of Korea
- Tao Song
- School of Computer Science and Technology, China University of Petroleum, Qingdao, 266580, Shandong, China
- Xiangxiang Zeng
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, Hunan, China
8
Zhou Z, Yin Y, Han H, Jia Y, Koh JH, Kong AWK, Mu Y. ProAffinity-GNN: A Novel Approach to Structure-Based Protein-Protein Binding Affinity Prediction via a Curated Data Set and Graph Neural Networks. J Chem Inf Model 2024; 64:8796-8808. [PMID: 39558674 DOI: 10.1021/acs.jcim.4c01850] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2024]
Abstract
Protein-protein interactions (PPIs) are crucial for understanding biological processes and disease mechanisms, contributing significantly to advances in protein engineering and drug discovery. The accurate determination of binding affinities, essential for decoding PPIs, faces challenges due to the substantial time and financial costs involved in experimental and theoretical methods. This situation underscores the urgent need for more effective and precise methodologies for predicting binding affinity. Despite the abundance of research on PPI modeling, the field of quantitative binding affinity prediction remains underexplored, mainly due to a lack of comprehensive data. This study seeks to address these needs by manually curating pairwise interaction labels on available 3D structures of protein complexes, with experimentally determined binding affinities, creating the largest data set for structure-based pairwise protein interaction with binding affinity to date. Subsequently, we introduce ProAffinity-GNN, a novel deep learning framework using protein language model and graph neural network (GNN) to improve the accuracy of prediction of structure-based protein-protein binding affinities. The evaluation results across several benchmark test sets and an additional case study demonstrate that ProAffinity-GNN not only outperforms existing models in terms of accuracy but also shows strong generalization capabilities.
Affiliation(s)
- Zhiyuan Zhou
- School of Biological Sciences, Nanyang Technological University, 637551, Singapore
- Yueming Yin
- Institute for Digital Molecular Analytics and Science (IDMxS), Nanyang Technological University, 636921, Singapore
- Hao Han
- School of Biological Sciences, Nanyang Technological University, 637551, Singapore
- Yiping Jia
- School of Pharmacy, Shanghai Jiao Tong University, 200240, Shanghai, China
- Jun Hong Koh
- School of Biological Sciences, Nanyang Technological University, 637551, Singapore
- Adams Wai-Kin Kong
- College of Computing and Data Science, Nanyang Technological University, 639798, Singapore
- Yuguang Mu
- School of Biological Sciences, Nanyang Technological University, 637551, Singapore
9
Son A, Park J, Kim W, Yoon Y, Lee S, Ji J, Kim H. Recent Advances in Omics, Computational Models, and Advanced Screening Methods for Drug Safety and Efficacy. Toxics 2024; 12:822. [PMID: 39591001 PMCID: PMC11598288 DOI: 10.3390/toxics12110822] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2024] [Revised: 11/10/2024] [Accepted: 11/14/2024] [Indexed: 11/28/2024]
Abstract
It is imperative to comprehend the mechanisms that underlie drug toxicity in order to enhance the efficacy and safety of novel therapeutic agents. The capacity to identify molecular pathways that contribute to drug-induced toxicity has been significantly enhanced by recent developments in omics technologies, such as transcriptomics, proteomics, and metabolomics. This has enabled the early identification of potential adverse effects. These insights are further enhanced by computational tools, including quantitative structure-activity relationship (QSAR) analyses and machine learning models, which accurately predict toxicity endpoints. Additionally, technologies such as physiologically based pharmacokinetic (PBPK) modeling and micro-physiological systems (MPS) provide more precise preclinical-to-clinical translation, thereby improving drug safety assessments. This review emphasizes the synergy between sophisticated screening technologies, in silico modeling, and omics data, emphasizing their roles in reducing late-stage drug development failures. Challenges persist in the integration of a variety of data types and the interpretation of intricate biological interactions, despite the progress that has been made. The development of standardized methodologies that further enhance predictive toxicology is contingent upon the ongoing collaboration between researchers, clinicians, and regulatory bodies. This collaboration ensures the development of therapeutic pharmaceuticals that are more effective and safer.
Affiliation(s)
- Ahrum Son
- Department of Molecular Medicine, Scripps Research, San Diego, CA 92037, USA
- Jongham Park
- Department of Bio-AI Convergence, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- Woojin Kim
- Department of Bio-AI Convergence, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- Yoonki Yoon
- Department of Bio-AI Convergence, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- Sangwoon Lee
- Department of Bio-AI Convergence, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- Jaeho Ji
- Department of Convergent Bioscience and Informatics, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- Hyunsoo Kim
- Department of Bio-AI Convergence, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- Department of Convergent Bioscience and Informatics, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- Protein AI Design Institute, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- SCICS, Prove Beyond AI, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
10
Ugurlu SY, McDonald D, He S. MEF-AlloSite: an accurate and robust Multimodel Ensemble Feature selection for the Allosteric Site identification model. J Cheminform 2024; 16:116. [PMID: 39444016 PMCID: PMC11515501 DOI: 10.1186/s13321-024-00882-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2024] [Accepted: 07/09/2024] [Indexed: 10/25/2024] Open
Abstract
A crucial mechanism for controlling the actions of proteins is allostery. Allosteric modulators have the potential to provide many benefits compared to orthosteric ligands, such as increased selectivity and saturability of their effect. The identification of new allosteric sites presents prospects for the creation of innovative medications and enhances our comprehension of fundamental biological mechanisms. Allosteric sites are increasingly found in different protein families through various techniques, such as machine learning applications, which opens up possibilities for creating completely novel medications with a diverse variety of chemical structures. Machine learning methods, such as PASSer, exhibit limited efficacy in accurately finding allosteric binding sites when relying solely on 3D structural information. Scientific Contribution: Prior to conducting feature selection for allosteric binding site identification, integration of supporting amino-acid-based information to 3D structural knowledge is advantageous. This approach can enhance performance by ensuring accuracy and robustness. Therefore, we have developed an accurate and robust model called Multimodel Ensemble Feature Selection for Allosteric Site Identification (MEF-AlloSite) after collecting 9460 relevant and diverse features from the literature to characterise pockets. The model employs an accurate and robust multimodal feature selection technique for the small training set size of only 90 proteins to improve predictive performance. This state-of-the-art technique increased the performance in allosteric binding site identification by selecting promising features from 9460 features. Also, analysing the selected features and their relationship to allosteric binding sites improved the understanding of complex allostery in proteins. MEF-AlloSite and state-of-the-art allosteric site identification methods such as PASSer2.0 and PASSerRank were tested on three test cases 51 times, each with a different split of the training set. Student's t test and Cohen's d were used to evaluate the distributions of the average precision and ROC AUC scores. On the three test cases, most of the p-values (< 0.05) and the majority of Cohen's d values (> 0.5) showed that the 1-6% higher mean average precision and ROC AUC of MEF-AlloSite, relative to state-of-the-art allosteric site identification methods, are statistically significant.
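The comparison described above relies on a t statistic and Cohen's d. A minimal sketch of both follows, under the assumption of an independent two-sample design with pooled variance; the abstract does not say whether the test was paired. The function names and score lists are hypothetical. A p-value would additionally need the t distribution, for example via `scipy.stats.ttest_ind`.

```python
import math
from statistics import mean, variance


def pooled_variance(a: list[float], b: list[float]) -> float:
    """Pooled sample variance of two independent groups."""
    na, nb = len(a), len(b)
    return ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)


def cohens_d(a: list[float], b: list[float]) -> float:
    """Effect size: difference in means in units of the pooled standard deviation."""
    return (mean(a) - mean(b)) / math.sqrt(pooled_variance(a, b))


def student_t(a: list[float], b: list[float]) -> float:
    """Two-sample Student's t statistic (equal-variance assumption)."""
    na, nb = len(a), len(b)
    return (mean(a) - mean(b)) / math.sqrt(pooled_variance(a, b) * (1 / na + 1 / nb))


# Hypothetical ROC AUC scores from repeated training-set splits.
model_a = [0.91, 0.93, 0.90, 0.94, 0.92]
model_b = [0.88, 0.90, 0.89, 0.91, 0.87]
print(round(cohens_d(model_a, model_b), 2), round(student_t(model_a, model_b), 2))
```

By the common convention, d above 0.5 is read as at least a medium effect, which matches the threshold quoted in the abstract.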
Affiliation(s)
- Sadettin Y Ugurlu
- School of Computer Science, University of Birmingham, Edgbaston, Birmingham, B15 2TT, UK
- Shan He
- School of Computer Science, University of Birmingham, Edgbaston, Birmingham, B15 2TT, UK.
- AIA Insights Ltd, Birmingham, UK.
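The MEF-AlloSite abstract above reports a Student's t test and Cohen's D over repeated training splits. As a rough illustration of how those two quantities relate, here is a minimal sketch using only the Python standard library. It assumes two independent samples with equal variances and a pooled standard deviation; the abstract does not say which variant the authors used, and the function names and scores below are made up for illustration, not taken from the paper.

```python
import math
import statistics


def pooled_variance(a, b):
    """Pooled sample variance of two independent samples."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)


def cohens_d(a, b):
    """Cohen's d: difference in means in units of the pooled standard deviation."""
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_variance(a, b))


def student_t(a, b):
    """Two-sample Student's t statistic (equal variances assumed) and degrees of freedom."""
    na, nb = len(a), len(b)
    standard_error = math.sqrt(pooled_variance(a, b) * (1 / na + 1 / nb))
    t = (statistics.mean(a) - statistics.mean(b)) / standard_error
    return t, na + nb - 2


# Hypothetical average-precision scores from five splits per method (not real results).
method_a = [0.71, 0.74, 0.69, 0.73, 0.72]
method_b = [0.68, 0.70, 0.66, 0.69, 0.67]

d = cohens_d(method_a, method_b)
t, dof = student_t(method_a, method_b)
print(f"Cohen's d = {d:.3f}, t = {t:.3f}, degrees of freedom = {dof}")
```

Turning the t statistic into a p-value requires the t distribution's cumulative distribution function, which the standard library does not provide, so that step is left out of the sketch.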
11
Michels J, Bandarupalli R, Akbari AA, Le T, Xiao H, Li J, Hom EFY. Natural Language Processing Methods for the Study of Protein-Ligand Interactions. ARXIV 2024:arXiv:2409.13057v2. [PMID: 39483353 PMCID: PMC11527106] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2024]
Abstract
Natural Language Processing (NLP) has revolutionized the way computers are used to study and interact with human languages and is increasingly influential in the study of protein and ligand binding, which is critical for drug discovery and development. This review examines how NLP techniques have been adapted to decode the "language" of proteins and small molecule ligands to predict protein-ligand interactions (PLIs). We discuss how methods such as long short-term memory (LSTM) networks, transformers, and attention mechanisms can leverage different protein and ligand data types to identify potential interaction patterns. Significant challenges are highlighted, including the scarcity of high-quality negative data, difficulties in interpreting model decisions, and sampling biases of existing datasets. We argue that focusing on improving data quality, enhancing model robustness, and fostering both collaboration and competition could catalyze future advances in machine-learning-based predictions of PLIs.
Affiliation(s)
- James Michels
- Department of Computer Science, University of Mississippi, University, MS
- Ramya Bandarupalli
- Department of BioMolecular Sciences, School of Pharmacy, University of Mississippi, University, MS
- Amin Ahangar Akbari
- Department of BioMolecular Sciences, School of Pharmacy, University of Mississippi, University, MS
- Thai Le
- Department of Computer Science, Indiana University, Bloomington, IN
- Hong Xiao
- Department of Computer Science, University of Mississippi, University, MS
- Jing Li
- Department of BioMolecular Sciences, School of Pharmacy, University of Mississippi, University, MS
- Erik F Y Hom
- Department of Biology and Center for Biodiversity and Conservation Research, University of Mississippi, University, MS
12
Hozumi Y, Wei GW. Analyzing Single Cell RNA Sequencing with Topological Nonnegative Matrix Factorization. JOURNAL OF COMPUTATIONAL AND APPLIED MATHEMATICS 2024; 445:115842. [PMID: 38464901 PMCID: PMC10919214 DOI: 10.1016/j.cam.2024.115842] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Indexed: 03/12/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) is a relatively new technology that has stimulated enormous interest in statistics, data science, and computational biology due to the high dimensionality, complexity, and large scale associated with scRNA-seq data. Nonnegative matrix factorization (NMF) offers a unique approach due to its meta-gene interpretation of resulting low-dimensional components. However, NMF approaches suffer from the lack of multiscale analysis. This work introduces two persistent Laplacian regularized NMF methods, namely, topological NMF (TNMF) and robust topological NMF (rTNMF). By employing a total of 12 datasets, we demonstrate that the proposed TNMF and rTNMF significantly outperform all other NMF-based methods. We have also utilized TNMF and rTNMF for the visualization of popular Uniform Manifold Approximation and Projection (UMAP) and t-distributed stochastic neighbor embedding (t-SNE).
Affiliation(s)
- Yuta Hozumi
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
13
Zhao S, Cui Z, Zhang G, Gong Y, Su L. MGPPI: multiscale graph neural networks for explainable protein-protein interaction prediction. Front Genet 2024; 15:1440448. [PMID: 39076171 PMCID: PMC11284081 DOI: 10.3389/fgene.2024.1440448] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2024] [Accepted: 06/24/2024] [Indexed: 07/31/2024] Open
Abstract
Protein-protein interactions (PPIs) are involved in various biological processes and are of significant importance in cancer diagnosis and drug development. Computation-based PPI prediction methods are often preferred due to their low cost and high accuracy. However, existing protein-structure-based methods are insufficient in extracting protein structural information. Furthermore, most methods have limited interpretability, which hinders their practical application in the biomedical field. In this paper, we propose MGPPI, a multiscale graph convolutional neural network model for PPI prediction. By incorporating a multiscale module into the Graph Neural Network (GNN) and constructing multiple convolutional layers, MGPPI can effectively capture both local and global protein structure information. For model interpretability, we introduce a novel visual explanation method named Gradient Weighted interaction Activation Mapping (Grad-WAM), which can highlight key binding residue sites. We evaluate the performance of MGPPI by comparing it with state-of-the-art methods on various datasets. Results show that MGPPI outperforms other methods significantly and exhibits strong generalization capabilities on the multi-species dataset. As a practical case study, we predicted the binding affinity between the spike (S) protein of SARS-CoV-2 and the human ACE2 receptor protein, and successfully identified key binding sites with known binding functions. Mutations at key binding sites in PPIs can affect cancer patient survival status. Therefore, we further verified the ability of the residue sites highlighted by Grad-WAM to separate patient survival groups in several datasets of different cancer types. According to our results, some of the highlighted residues can be used as biomarkers for predicting patient survival probability. All these results together demonstrate the high accuracy and practical application value of MGPPI.
Our method not only addresses the limitations of existing approaches but can also assist researchers in identifying crucial drug targets and help guide personalized cancer treatment.
Affiliation(s)
- Lingtao Su
- College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, China
14
Kim H, Lee K, Kim C, Lim J, Kim WY. DFRscore: Deep Learning-Based Scoring of Synthetic Complexity with Drug-Focused Retrosynthetic Analysis for High-Throughput Virtual Screening. J Chem Inf Model 2024; 64:2432-2444. [PMID: 37651152 DOI: 10.1021/acs.jcim.3c01134] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Indexed: 09/01/2023]
Abstract
Recently emerging generative AI models enable us to produce a vast number of compounds for potential applications. While they can provide novel molecular structures, the synthetic feasibility of the generated molecules is often questioned. To address this issue, a few recent studies have attempted to use deep learning models to estimate the synthetic accessibility of many molecules rapidly. However, retrosynthetic analysis tools used to train the models rely on reaction templates automatically extracted from a large reaction database that are not domain-specific and may exhibit low chemical correctness. To overcome this limitation, we introduce DFRscore (Drug-Focused Retrosynthetic score), a deep learning-based approach for a more practical assessment of synthetic accessibility in drug discovery. The DFRscore model is trained exclusively on drug-focused reactions, providing a predicted number of minimally required synthetic steps for each compound. This approach enables practitioners to filter out compounds that do not meet their desired level of synthetic accessibility at an early stage of high-throughput virtual screening for accelerated drug discovery. The proposed strategy can be easily adapted to other domains by adjusting the synthesis planning setup of the reaction templates and starting materials.
Affiliation(s)
- Hyeongwoo Kim
- Department of Chemistry, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
- Kyunghoon Lee
- Department of Chemistry, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
- Chansu Kim
- Department of Chemistry, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
- Jaechang Lim
- HITS Incorporation, 124 Teheran-ro, Gangnam-gu, Seoul 06234, Republic of Korea
- Woo Youn Kim
- Department of Chemistry, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
- HITS Incorporation, 124 Teheran-ro, Gangnam-gu, Seoul 06234, Republic of Korea
- AI Institute, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
15
Zhao D, Tu S, Xu L. Efficient retrosynthetic planning with MCTS exploration enhanced A* search. Commun Chem 2024; 7:52. [PMID: 38454002 PMCID: PMC10920677 DOI: 10.1038/s42004-024-01133-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/19/2023] [Accepted: 02/20/2024] [Indexed: 03/09/2024] Open
Abstract
Retrosynthetic planning, which aims to identify synthetic pathways for target molecules from starting materials, is a fundamental problem in synthetic chemistry. Computer-aided retrosynthesis has made significant progress, in which heuristic search algorithms, including Monte Carlo Tree Search (MCTS) and A* search, have played a crucial role. However, unreliable guiding heuristics often cause search failure due to insufficient exploration. Conversely, excessive exploration also prevents the search from reaching the optimal solution. In this paper, MCTS exploration enhanced A* (MEEA*) search is proposed to incorporate the exploratory behavior of MCTS into A* by providing a look-ahead search. Path consistency is adopted as a regularization to improve the generalization performance of heuristics. Extensive experimental results on 10 molecule datasets demonstrate the effectiveness of MEEA*. Notably, on the widely used United States Patent and Trademark Office (USPTO) benchmark, MEEA* achieves a 100.0% success rate. Moreover, for natural products, MEEA* successfully identifies bio-retrosynthetic pathways for 97.68% of test compounds.
Affiliation(s)
- Dengwei Zhao
- Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
- Shikui Tu
- Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
- Lei Xu
- Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
- Guangdong Institute of Intelligence Science and Technology, Zhuhai, China
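The MEEA* abstract above builds on A* search, which expands nodes in order of path cost so far plus a heuristic estimate of the remaining cost. For readers unfamiliar with the baseline, the following is a minimal sketch of plain A* on a toy weighted graph. It is not MEEA*: it has no MCTS look-ahead and no path-consistency regularization, and it ignores the AND-OR structure of real retrosynthesis trees. The graph, the heuristic table, and all names are hypothetical.

```python
import heapq


def a_star(start, goal, neighbors, heuristic):
    """Return (cost, path) for the cheapest path from start to goal, or None if unreachable."""
    # Queue entries: (cost so far + heuristic estimate, cost so far, node, path to node).
    frontier = [(heuristic(start), 0, start, [start])]
    best_cost = {start: 0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best_cost.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was found later
        for nxt, step_cost in neighbors(node):
            new_cost = cost + step_cost
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + heuristic(nxt), new_cost, nxt, path + [nxt]),
                )
    return None


# Toy graph: node -> list of (neighbor, edge cost). Purely illustrative.
graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 1), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}
# Hypothetical heuristic that never overestimates the remaining cost to "D".
estimate = {"A": 2, "B": 2, "C": 1, "D": 0}

result = a_star("A", "D", lambda n: graph[n], lambda n: estimate[n])
print(result)  # (3, ['A', 'B', 'C', 'D'])
```

With a heuristic that never overestimates, A* returns an optimal path. The abstract's point is that learned heuristics in retrosynthesis are often unreliable, which is what motivates mixing in MCTS-style exploration.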
16
Qiu Y, Wei GW. Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models. Brief Bioinform 2023; 24:bbad289. [PMID: 37580175 PMCID: PMC10516362 DOI: 10.1093/bib/bbad289] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 05/08/2023] [Revised: 07/14/2023] [Accepted: 07/26/2023] [Indexed: 08/16/2023] Open
Abstract
Protein engineering is an emerging field in biotechnology that has the potential to revolutionize various areas, such as antibody design, drug discovery, food security, ecology, and more. However, the mutational space involved is too vast to be handled through experimental means alone. Leveraging accumulative protein databases, machine learning (ML) models, particularly those based on natural language processing (NLP), have considerably expedited protein engineering. Moreover, advances in topological data analysis (TDA) and artificial intelligence-based protein structure prediction, such as AlphaFold2, have made more powerful structure-based ML-assisted protein engineering strategies possible. This review aims to offer a comprehensive, systematic, and indispensable set of methodological components, including TDA and NLP, for protein engineering and to facilitate their future development.
Affiliation(s)
- Yuchi Qiu
- Department of Mathematics, Michigan State University, East Lansing, 48824 MI, USA
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, 48824 MI, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, 48824 MI, USA
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, 48824 MI, USA
17
Qiu Y, Wei GW. Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models. ARXIV 2023:arXiv:2307.14587v1. [PMID: 37547662 PMCID: PMC10402185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/08/2023]
Abstract
Protein engineering is an emerging field in biotechnology that has the potential to revolutionize various areas, such as antibody design, drug discovery, food security, ecology, and more. However, the mutational space involved is too vast to be handled through experimental means alone. Leveraging accumulative protein databases, machine learning (ML) models, particularly those based on natural language processing (NLP), have considerably expedited protein engineering. Moreover, advances in topological data analysis (TDA) and artificial intelligence-based protein structure prediction, such as AlphaFold2, have made more powerful structure-based ML-assisted protein engineering strategies possible. This review aims to offer a comprehensive, systematic, and indispensable set of methodological components, including TDA and NLP, for protein engineering and to facilitate their future development.
Affiliation(s)
- Yuchi Qiu
- Department of Mathematics, Michigan State University, East Lansing, 48824, MI, USA
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, 48824, MI, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, 48824, MI, USA
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, 48824, MI, USA