1
Li X, Shen C, Zhu H, Yang Y, Wang Q, Yang J, Huang N. A High-Quality Data Set of Protein-Ligand Binding Interactions Via Comparative Complex Structure Modeling. J Chem Inf Model 2024; 64:2454-2466. [PMID: 38181418 DOI: 10.1021/acs.jcim.3c01170] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/07/2024]
Abstract
High-quality protein-ligand complex structures provide the basis for understanding the nature of noncovalent binding interactions at the atomic level and enable structure-based drug design. However, experimentally determined complex structures are scarce compared with the vast chemical space. In this study, we addressed this issue by constructing the BindingNet data set via comparative complex structure modeling, which contains 69,816 modeled high-quality protein-ligand complex structures with experimental binding affinity data. BindingNet provides valuable insights into investigating protein-ligand interactions, allowing visual inspection and interpretation of structural analogues' structure-activity relationships. It can also be used for evaluating machine-learning-based scoring functions. Our results indicate that machine learning models trained on BindingNet could reduce the bias caused by buried solvent-accessible surface area, as we previously found for models trained on the PDBbind data set. We also discussed strategies to improve BindingNet and its potential utilization for benchmarking the molecular docking methods and ligand binding free energy calculation approaches. The BindingNet complements PDBbind in constructing a sufficient and unbiased protein-ligand binding data set and is freely available at http://bindingnet.huanglab.org.cn.
Affiliation(s)
- Xuelian Li
- National Institute of Biological Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100730, China
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
- Cheng Shen
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
- Hui Zhu
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
- Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, Beijing 102206, China
- Yujian Yang
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
- Qing Wang
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
- Jincai Yang
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
- Niu Huang
- National Institute of Biological Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100730, China
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
- Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, Beijing 102206, China
2
Zeng X, Li SJ, Lv SQ, Wen ML, Li Y. A comprehensive review of the recent advances on predicting drug-target affinity based on deep learning. Front Pharmacol 2024; 15:1375522. [PMID: 38628639 PMCID: PMC11019008 DOI: 10.3389/fphar.2024.1375522] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2024] [Accepted: 03/21/2024] [Indexed: 04/19/2024] Open
Abstract
Accurate calculation of drug-target affinity (DTA) is crucial for various applications in the pharmaceutical industry, including drug screening, design, and repurposing. However, traditional machine learning methods for calculating DTA often lack accuracy, which makes reliable DTA prediction a significant challenge. Fortunately, deep learning has emerged as a promising approach in computational biology, leading to the development of various deep learning-based methods for DTA prediction. To support researchers in developing novel, high-precision methods, we have provided a comprehensive review of recent advances in predicting DTA using deep learning. We first conducted a statistical analysis of commonly used public datasets, providing essential information and introducing the fields in which these datasets are used. We further explored the common representations of sequences and structures of drugs and targets. These analyses served as the foundation for constructing DTA prediction methods based on deep learning. Next, we focused on explaining how deep learning models, such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Transformers, and Graph Neural Networks (GNNs), were effectively employed in specific DTA prediction methods. We highlighted the unique advantages and applications of these models in the context of DTA prediction. Finally, we conducted a performance analysis of multiple state-of-the-art methods for predicting DTA based on deep learning. The comprehensive review aimed to help researchers understand the shortcomings and advantages of existing methods, and further develop high-precision DTA prediction tools to promote the development of drug discovery.
Affiliation(s)
- Xin Zeng
- College of Mathematics and Computer Science, Dali University, Dali, China
- Shu-Juan Li
- Yunnan Institute of Endemic Diseases Control and Prevention, Dali, China
- Shuang-Qing Lv
- Institute of Surveying and Information Engineering West Yunnan University of Applied Science, Dali, China
- Meng-Liang Wen
- State Key Laboratory for Conservation and Utilization of Bio-Resources in Yunnan, Yunnan University, Kunming, China
- Yi Li
- College of Mathematics and Computer Science, Dali University, Dali, China
3
Chen D, Liu J, Wei GW. TopoFormer: Multiscale Topology-enabled Structure-to-Sequence Transformer for Protein-Ligand Interaction Predictions. RESEARCH SQUARE 2024:rs.3.rs-3640878. [PMID: 38405777 PMCID: PMC10889053 DOI: 10.21203/rs.3.rs-3640878/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/27/2024]
Abstract
Pre-trained deep Transformers have had tremendous success in a wide variety of disciplines. However, in computational biology, essentially all Transformers are built upon biological sequences, which ignores vital stereochemical information and may result in crucial errors in downstream predictions. On the other hand, three-dimensional (3D) molecular structures are incompatible with the sequential architecture of Transformer and natural language processing (NLP) models in general. This work addresses this foundational challenge with a topological Transformer (TopoFormer). TopoFormer is built by integrating NLP and a multiscale topology technique, the persistent topological hyperdigraph Laplacian (PTHL), which systematically converts intricate 3D protein-ligand complexes at various spatial scales into an NLP-admissible sequence of topological invariants and homotopic shapes. Element-specific PTHLs are further developed to embed crucial physical, chemical, and biological interactions into topological sequences. TopoFormer surges ahead of conventional algorithms and recent deep learning variants and gives rise to exemplary scoring accuracy and superior performance in ranking, docking, and screening tasks in a number of benchmark datasets. The proposed topological sequences can be extracted from all kinds of structural data in data science to facilitate various NLP models, heralding a new era in AI-driven discovery.
Affiliation(s)
- Dong Chen
- Department of Mathematics, Michigan State University, MI, 48824, USA
- Jian Liu
- Department of Mathematics, Michigan State University, MI, 48824, USA
- Mathematical Science Research Center, Chongqing University of Technology, Chongqing 400054, China
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, MI, 48824, USA
- Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA
4
Lee J, Jun DW, Song I, Kim Y. DLM-DTI: a dual language model for the prediction of drug-target interaction with hint-based learning. J Cheminform 2024; 16:14. [PMID: 38297330 PMCID: PMC10832108 DOI: 10.1186/s13321-024-00808-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2023] [Accepted: 01/22/2024] [Indexed: 02/02/2024] Open
Abstract
The drug discovery process is demanding and time-consuming, and machine learning-based research is increasingly proposed to enhance efficiency. A significant challenge in this field is predicting whether a drug molecule's structure will interact with a target protein. A recent study attempted to address this challenge by utilizing an encoder that leverages prior knowledge of molecular and protein structures, resulting in notable improvements in the prediction performance of the drug-target interactions task. Nonetheless, the target encoders employed in previous studies exhibit computational complexity that increases quadratically with the input length, thereby limiting their practical utility. To overcome this challenge, we adopt a hint-based learning strategy to develop a compact and efficient target encoder. With the adaptation parameter, our model can blend general knowledge and target-oriented knowledge to build features of the protein sequences. This approach yielded considerable performance enhancements and improved learning efficiency on three benchmark datasets: BIOSNAP, DAVIS, and Binding DB. Furthermore, our methodology boasts the merit of necessitating only a minimal Video RAM (VRAM) allocation, specifically 7.7GB, during the training phase (16.24% of the previous state-of-the-art model). This ensures the feasibility of training and inference even with constrained computational resources.
Affiliation(s)
- Jonghyun Lee
- Department of Medical and Digital Engineering, Hanyang University College of Engineering, 222, Wangsimni-ro, Seongdong-gu, Seoul, 04763, Korea
- Dae Won Jun
- Department of Medical and Digital Engineering, Hanyang University College of Engineering, 222, Wangsimni-ro, Seongdong-gu, Seoul, 04763, Korea
- Department of Internal Medicine, Hanyang University College of Medicine, 222, Wangsimni-ro, Seongdong-gu, Seoul, 04763, Korea
- Ildae Song
- Department of Pharmaceutical Science and Technology, Kyungsung University, 309, Suyeong-ro, Nam-gu, Busan, 48434, Korea
- Yun Kim
- College of Pharmacy, Daegu Catholic University, 13-13, Hayang-ro, Hayang-eup, Gyeongsan-si, 38430, Gyeongsangbuk-do, Korea
5
Guo J. Improving structure-based protein-ligand affinity prediction by graph representation learning and ensemble learning. PLoS One 2024; 19:e0296676. [PMID: 38232063 DOI: 10.1371/journal.pone.0296676] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2023] [Accepted: 12/15/2023] [Indexed: 01/19/2024] Open
Abstract
Predicting protein-ligand binding affinity presents a viable solution for accelerating the discovery of new lead compounds. The recent widespread application of machine learning approaches, especially graph neural networks, has brought new advancements in this field. However, some existing structure-based methods treat protein macromolecules and ligand small molecules in the same way and ignore the data heterogeneity, potentially leading to incomplete exploration of the biochemical information of ligands. In this work, we propose LGN, a graph neural network-based fusion model with extra ligand feature extraction to effectively capture local features and global features within the protein-ligand complex, and make use of interaction fingerprints. By combining the ligand-based features and interaction fingerprints, LGN achieves Pearson correlation coefficients of up to 0.842 on the PDBbind 2016 core set, compared to 0.807 when using the features of complex graphs alone. Finally, we verify the rationalization and generalization of our model through comprehensive experiments. We also compare our model with state-of-the-art baseline methods, which validates the superiority of our model. To reduce the impact of data similarity, we increase the robustness of the model by incorporating ensemble learning.
Affiliation(s)
- Jia Guo
- Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Beijing, P.R. China
- Chongqing School, University of Chinese Academy of Sciences, Chongqing, China
6
Ugurlu SY, McDonald D, Lei H, Jones AM, Li S, Tong HY, Butler MS, He S. Cobdock: an accurate and practical machine learning-based consensus blind docking method. J Cheminform 2024; 16:5. [PMID: 38212855 PMCID: PMC10785400 DOI: 10.1186/s13321-023-00793-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2023] [Accepted: 12/10/2023] [Indexed: 01/13/2024] Open
Abstract
Probing the surface of proteins to predict the binding site and binding affinity for a given small molecule is a critical but challenging task in drug discovery. Blind docking addresses this issue by performing docking on binding regions randomly sampled from the entire protein surface. However, compared with local docking, blind docking is less accurate and reliable because the docking space to be sampled is too large. Cavity detection-guided blind docking methods improved the accuracy by using cavity detection (also known as binding site detection) tools to guide the docking procedure. However, the performance of these methods relies heavily on the quality of the cavity detection tool. This constraint, namely the dependence on a single cavity detection tool, significantly impacts the overall performance of cavity detection-guided methods. To overcome this limitation, we proposed Consensus Blind Dock (CoBDock), a novel blind, parallel docking method that uses machine learning algorithms to integrate docking and cavity detection results to improve not only binding site identification but also pose prediction accuracy. Our experiments on several datasets, including PDBBind 2020, ADS, MTi, DUD-E, and CASF-2016, showed that CoBDock has better binding site and binding mode performance than other state-of-the-art cavity detection tools and blind docking methods.
Affiliation(s)
- Sadettin Y Ugurlu
- School of Computer Science, University of Birmingham, Edgbaston, Birmingham, B15 2TT, UK
- Huangshu Lei
- YaoPharma Co. Ltd., 100 Xingguang Avenue, Renhe Town, Yubei District, Chongqing, 401121, People's Republic of China
- Alan M Jones
- School of Pharmacy, University of Birmingham, Edgbaston, Birmingham, B15 2TT, UK
- Shu Li
- Centre for Artificial Intelligence Driven Drug Discovery, Macao Polytechnic University, R. de Luís Gonzaga Gomes, Macao, 5HV2+CP8, China
- Henry Y Tong
- Centre for Artificial Intelligence Driven Drug Discovery, Macao Polytechnic University, R. de Luís Gonzaga Gomes, Macao, 5HV2+CP8, China
- Shan He
- School of Computer Science, University of Birmingham, Edgbaston, Birmingham, B15 2TT, UK
- AIA Insights Ltd, Birmingham, UK
7
Zhang L, Ouyang C, Liu Y, Liao Y, Gao Z. Multimodal contrastive representation learning for drug-target binding affinity prediction. Methods 2023; 220:126-133. [PMID: 37952703 DOI: 10.1016/j.ymeth.2023.11.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2023] [Revised: 10/28/2023] [Accepted: 11/06/2023] [Indexed: 11/14/2023] Open
Abstract
In the biomedical field, the efficacy of most drugs is demonstrated by their interactions with targets, so accurate prediction of the strength of drug-target binding is extremely important for drug development efforts. Traditional bioassay-based drug-target binding affinity (DTA) prediction methods cannot meet the needs of drug R&D in the era of big data. In recent years, deep learning-based models have achieved significant success on the drug-target binding affinity prediction task. However, these models considered only a single modality of drug and target information, and some valuable information was not fully utilized. In fact, the information of different modalities of drug and target can complement each other, and more valuable information can be obtained by fusing the information of different modalities. In this paper, we introduce a multimodal information fusion model for DTA prediction called FMDTA, which fully considers drug/target information in both string and graph modalities and balances the feature representations of different modalities by a contrastive learning approach. In addition, we exploited the alignment information of drug atoms and target residues to capture the positional information of string patterns, which can extract more useful feature information from SMILES and target sequences. Experimental results on two benchmark datasets show that FMDTA outperforms the state-of-the-art model, demonstrating the feasibility and excellent feature capture capability of FMDTA. The code of FMDTA and the data are available at: https://github.com/bestdoubleLin/FMDTA.
Affiliation(s)
- Linlin Zhang
- School of Computer, University of South China, Hengyang, China
- Chunping Ouyang
- School of Computer, University of South China, Hengyang, China
- Yongbin Liu
- School of Computer, University of South China, Hengyang, China
- Yiming Liao
- The Second Affiliated Hospital, Hengyang Medical School, University of South China, Hengyang, China
- Zheng Gao
- Department of Information and Library Science, Indiana University Bloomington, Bloomington, United States
8
Wang J, Xiao Y, Shang X, Peng J. Predicting drug-target binding affinity with cross-scale graph contrastive learning. Brief Bioinform 2023; 25:bbad516. [PMID: 38221904 PMCID: PMC10788681 DOI: 10.1093/bib/bbad516] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2023] [Revised: 12/04/2023] [Accepted: 12/07/2023] [Indexed: 01/16/2024] Open
Abstract
Identifying the binding affinity between a drug and its target is essential in drug discovery and repurposing. Numerous computational approaches have been proposed for understanding these interactions. However, most existing methods utilize either the molecular structure information of drugs and targets or the interaction information of drug-target bipartite networks, but not both. They may therefore fail to combine molecule-scale and network-scale features to obtain high-quality representations. In this study, we propose CSCo-DTA, a novel cross-scale graph contrastive learning approach for drug-target binding affinity prediction. The proposed model combines features learned from the molecular scale and the network scale to capture information from both local and global perspectives. We conducted experiments on two benchmark datasets, and the proposed model outperformed existing state-of-the-art methods. The ablation experiment demonstrated the significance and efficacy of multi-scale features and cross-scale contrastive learning modules in improving the prediction performance. Moreover, we applied CSCo-DTA to predict novel potential targets for Erlotinib and validated the predicted targets with molecular docking analysis.
Affiliation(s)
- Jingru Wang
- School of Computer Science, Northwestern Polytechnical University, Xi’an, 710072, China
- Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology, Xi’an, 710072, China
- The National Engineering Laboratory for Integrated Aerospace-Ground-Ocean Big Data Application Technology, Xi’an, 710072, China
- Yihang Xiao
- School of Computer Science, Northwestern Polytechnical University, Xi’an, 710072, China
- Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology, Xi’an, 710072, China
- Xuequn Shang
- School of Computer Science, Northwestern Polytechnical University, Xi’an, 710072, China
- Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology, Xi’an, 710072, China
- The National Engineering Laboratory for Integrated Aerospace-Ground-Ocean Big Data Application Technology, Xi’an, 710072, China
- Jiajie Peng
- School of Computer Science, Northwestern Polytechnical University, Xi’an, 710072, China
- Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology, Xi’an, 710072, China
- The National Engineering Laboratory for Integrated Aerospace-Ground-Ocean Big Data Application Technology, Xi’an, 710072, China
- Research and Development Institute of Northwestern Polytechnical University in Shenzhen, Shenzhen, 518000, China
9
Wang Y, Jiao Q, Wang J, Cai X, Zhao W, Cui X. Prediction of protein-ligand binding affinity with deep learning. Comput Struct Biotechnol J 2023; 21:5796-5806. [PMID: 38213884 PMCID: PMC10782002 DOI: 10.1016/j.csbj.2023.11.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2023] [Revised: 11/03/2023] [Accepted: 11/03/2023] [Indexed: 01/13/2024] Open
Abstract
The prediction of binding affinities between target proteins and small molecule drugs is essential for speeding up the drug research and design process. To attain precise and effective affinity prediction, computer-aided methods are employed in the drug discovery pipeline. In the last decade, a variety of computational methods has been developed, with deep learning being the most commonly used approach. We have gathered several deep learning methods and classified them into convolutional neural networks (CNNs), graph neural networks (GNNs), and Transformers for analysis and discussion. Initially, we conducted an analysis of the different deep learning methods, focusing on their feature construction and model architecture. We discussed the advantages and disadvantages of each model. Subsequently, we conducted experiments using four deep learning methods on the PDBbind v.2016 core set. We evaluated their prediction capabilities in various affinity intervals and statistically and visually analyzed the samples of correct and incorrect predictions for each model. Through visual analysis, we attempted to combine the strengths of the four models to improve the Root Mean Square Error (RMSE) of predicted affinities by 1.6% (reducing the absolute value to 1.101) and the Pearson Correlation Coefficient (R) by 2.9% (increasing the absolute value to 0.894) compared to the current state-of-the-art method. Lastly, we discussed the challenges faced by current deep learning methods in affinity prediction and proposed potential solutions to address these issues.
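For readers unfamiliar with the two metrics quoted in this abstract, the following is a minimal sketch of how RMSE and the Pearson correlation coefficient (R) are typically computed for predicted versus experimental affinities. It is not taken from the paper; the function names and the sample affinity values are hypothetical and purely illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between experimental and predicted affinities."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_t * var_p)


if __name__ == "__main__":
    # Hypothetical affinities (for example pKd values); not data from the paper.
    experimental = [4.2, 6.8, 7.5, 5.1, 9.0]
    predicted = [4.9, 6.1, 7.9, 5.6, 8.2]
    print("RMSE:", rmse(experimental, predicted))
    print("Pearson R:", pearson_r(experimental, predicted))
```

A lower RMSE and a higher R both indicate better agreement, which is why the abstract reports a reduction for the former and an increase for the latter.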
Affiliation(s)
- Yuxiao Wang
- School of Computer Science and Technology, Shandong University, Qingdao 266237, Shandong, China
- Qihong Jiao
- School of Computer Science and Technology, Shandong University, Qingdao 266237, Shandong, China
- Jingxuan Wang
- School of Computer Science and Technology, Shandong University, Qingdao 266237, Shandong, China
- Xiaojun Cai
- School of Computer Science and Technology, Shandong University, Qingdao 266237, Shandong, China
- Wei Zhao
- State Key Laboratory of Microbial Technology, Shandong University, Qingdao 266237, Shandong, China
- Xuefeng Cui
- School of Computer Science and Technology, Shandong University, Qingdao 266237, Shandong, China
10
DoubleSG-DTA: Deep Learning for Drug Discovery: Case Study on the Non-Small Cell Lung Cancer with EGFR T790M Mutation. Pharmaceutics 2023; 15:pharmaceutics15020675. [PMID: 36839996 PMCID: PMC9965659 DOI: 10.3390/pharmaceutics15020675] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2023] [Revised: 02/05/2023] [Accepted: 02/14/2023] [Indexed: 02/19/2023] Open
Abstract
Drug-targeted therapies are promising approaches to treating tumors, and research on receptor-ligand interactions for discovering high-affinity targeted drugs has been accelerating drug development. This study presents a mechanism-driven deep learning-based computational model to learn double drug sequences, protein sequences, and drug graphs to project drug-target affinities (DTAs), which was termed the DoubleSG-DTA. We deployed lightweight graph isomorphism networks to aggregate drug graph representations and discriminate between molecular structures, and stacked multilayer squeeze-and-excitation networks to selectively enhance spatial features of drug and protein sequences. What is more, cross-multi-head attentions were constructed to further model the non-covalent molecular docking behavior. The multiple cross-validation experimental evaluations on various datasets indicated that DoubleSG-DTA consistently outperformed all previously reported works. To showcase the value of DoubleSG-DTA, we applied it to generate promising hit compounds of Non-Small Cell Lung Cancer harboring EGFR T790M mutation from natural products, which were consistent with reported laboratory studies. Afterward, we further investigated the interpretability of the graph-based "black box" model and highlighted the active structures that contributed the most. DoubleSG-DTA thus provides a powerful and interpretable framework that extrapolates for potential chemicals to modulate the systemic response to disease.
11
Boyles F, Deane CM, Morris GM. Learning from Docked Ligands: Ligand-Based Features Rescue Structure-Based Scoring Functions When Trained on Docked Poses. J Chem Inf Model 2022; 62:5329-5341. [PMID: 34469150 DOI: 10.1021/acs.jcim.1c00096] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 12/13/2022]
Abstract
Machine learning scoring functions for protein-ligand binding affinity have been found to consistently outperform classical scoring functions when trained and tested on crystal structures of bound protein-ligand complexes. However, it is less clear how these methods perform when applied to docked poses of complexes. We explore how the use of docked rather than crystallographic poses for both training and testing affects the performance of machine learning scoring functions. Using the PDBbind Core Sets as benchmarks, we show that the performance of a structure-based machine learning scoring function trained and tested on docked poses is lower than that of the same scoring function trained and tested on crystallographic poses. We construct a hybrid scoring function by combining both structure-based and ligand-based features, and show that its ability to predict binding affinity using docked poses is comparable to that of purely structure-based scoring functions trained and tested on crystal poses. We also present a new, freely available validation set (the Updated DUD-E Diverse Subset) for binding affinity prediction using data from DUD-E and ChEMBL. Despite strong performance on docked poses of the PDBbind Core Sets, we find that our hybrid scoring function sometimes generalizes poorly to a protein target not represented in the training set, demonstrating the need for improved scoring functions and additional validation benchmarks.
Affiliation(s)
- Fergus Boyles
- Department of Statistics, University of Oxford, 24-29 St Giles', Oxford, OX1 3LB, United Kingdom
- Charlotte M Deane
- Department of Statistics, University of Oxford, 24-29 St Giles', Oxford, OX1 3LB, United Kingdom
- Garrett M Morris
- Department of Statistics, University of Oxford, 24-29 St Giles', Oxford, OX1 3LB, United Kingdom
12
Wei B, Zhang Y, Gong X. DeepLPI: a novel deep learning-based model for protein-ligand interaction prediction for drug repurposing. Sci Rep 2022; 12:18200. [PMID: 36307509 PMCID: PMC9616420 DOI: 10.1038/s41598-022-23014-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 07/31/2022] [Accepted: 10/21/2022] [Indexed: 12/31/2022] Open
Abstract
The substantial cost of new drug research and development has consistently posed a huge burden for both pharmaceutical companies and patients. In order to lower the expenditure and development failure rate, repurposing existing and approved drugs by identifying interactions between drug molecules and target proteins based on computational methods has gained growing attention. Here, we propose DeepLPI, a novel deep learning-based model that mainly consists of a ResNet-based 1-dimensional convolutional neural network (1D CNN) and a bi-directional long short term memory network (biLSTM), to establish an end-to-end framework for protein-ligand interaction prediction. We first encode the raw drug molecular sequences and target protein sequences into dense vector representations, which go through two separate ResNet-based 1D CNN modules to derive features. The extracted feature vectors are concatenated and fed into the biLSTM network, followed by an MLP module that predicts the protein-ligand interaction. We downloaded the well-known BindingDB and Davis datasets for training and testing our DeepLPI model. We also applied DeepLPI to a COVID-19 dataset to externally evaluate its prediction ability. To benchmark our model, we compared DeepLPI with the baseline methods DeepCDA and DeepDTA, and observed that DeepLPI outperformed these methods, suggesting the high accuracy of DeepLPI in protein-ligand interaction prediction. The high prediction performance of DeepLPI on the different datasets showed its strong generalization capability in protein-ligand interaction prediction, demonstrating that DeepLPI has the potential to pinpoint new drug-target interactions and to find better destinations for proven drugs.
Affiliation(s)
- Bomin Wei
- Princeton International School of Mathematics and Science, 19 Lambert Drive, Princeton, NJ 08540 USA
- Yue Zhang
- Department of Internal Medicine, University of Utah, Salt Lake City, UT 84132 USA
- Division of Epidemiology, University of Utah, Salt Lake City, UT 84132 USA
- Xiang Gong
- Princeton International School of Mathematics and Science, 19 Lambert Drive, Princeton, NJ 08540 USA
13
|
Yan X, Liu Y. Graph-sequence attention and transformer for predicting drug-target affinity. RSC Adv 2022; 12:29525-29534. [PMID: 36320763 PMCID: PMC9562047 DOI: 10.1039/d2ra05566j] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 09/04/2022] [Accepted: 10/04/2022] [Indexed: 11/30/2022] Open
Abstract
Drug-target binding affinity (DTA) prediction has drawn increasing interest because of its important role in the drug discovery process. The development of new drugs is costly, time-consuming, and often accompanied by safety issues. Drug repurposing can avoid the expensive and lengthy process of drug development by finding new uses for already approved drugs. Therefore, it is of great significance to develop effective computational methods to predict DTAs. Attention mechanisms allow a computational method to focus on the most relevant parts of the input and have proven useful for various tasks. In this study, we propose a novel model based on self-attention, called GSATDTA, to predict the binding affinity between drugs and targets. For the representation of drugs, we use Bi-directional Gated Recurrent Units (BiGRU) to extract a sequence representation from the SMILES strings, and graph neural networks to extract a graph representation from the molecular graphs. We then use an attention mechanism to fuse the two representations of the drug. For the target protein, we use an efficient transformer to learn a representation that can capture long-distance relationships in the amino acid sequence. We conduct extensive experiments to compare our model with state-of-the-art models. Experimental results show that our model outperforms the current state-of-the-art methods on two independent datasets.
Affiliation(s)
- Xiangfeng Yan
- School of Computer Science and Technology, Heilongjiang University, Harbin, China
- Yong Liu
- School of Computer Science and Technology, Heilongjiang University, Harbin, China
|
14
|
McGibbon M, Money-Kyrle S, Blay V, Houston DR. SCORCH: Improving structure-based virtual screening with machine learning classifiers, data augmentation, and uncertainty estimation. J Adv Res 2022; 46:135-147. [PMID: 35901959 PMCID: PMC10105235 DOI: 10.1016/j.jare.2022.07.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/05/2022] [Revised: 07/08/2022] [Accepted: 07/09/2022] [Indexed: 11/17/2022] Open
Abstract
INTRODUCTION: The discovery of a new drug is a costly and lengthy endeavour. The computational prediction of which small molecules can bind to a protein target can accelerate this process if the predictions are fast and accurate enough. Recent machine-learning scoring functions re-evaluate the output of molecular docking to achieve more accurate predictions. However, previous scoring functions were trained on crystallised protein-ligand complexes and datasets of decoys. The limited availability of crystal structures and biases in the decoy datasets can lower the performance of scoring functions.
OBJECTIVES: To address key limitations of previous scoring functions and thus improve the predictive performance of structure-based virtual screening.
METHODS: A novel machine-learning scoring function was created, named SCORCH (Scoring COnsensus for RMSD-based Classification of Hits). To develop SCORCH, training data is augmented by considering multiple ligand poses and labelling poses based on their RMSD from the native pose. Decoy bias is addressed by generating property-matched decoys for each ligand and using the same methodology for preparing and docking decoys and ligands. A consensus of 3 different machine learning approaches is also used to improve performance.
RESULTS: We find that multi-pose augmentation in SCORCH improves its docking power and screening power on independent benchmark datasets. SCORCH outperforms an equivalent scoring function trained on single poses, with an enrichment factor at 1% (EF) of 13.78 vs. 10.86 on 18 DEKOIS 2.0 targets and a mean native pose rank of 5.9 vs. 30.4 on CSAR 2014. Additionally, SCORCH outperforms widely used scoring functions in virtual screening and pose prediction on independent benchmark datasets.
CONCLUSION: By rationally addressing key limitations of previous scoring functions, SCORCH improves the performance of virtual screening. SCORCH also provides an estimate of its uncertainty, which can help reduce the cost and time required for drug discovery.
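The enrichment factor quoted in the abstract above can be illustrated with a short sketch. This is not the SCORCH implementation; it assumes the common definition of EF at a top fraction (hit rate among the top-ranked compounds divided by the hit rate in the whole library), and the function name `enrichment_factor` and its parameters are hypothetical.

```python
def enrichment_factor(scores, is_active, fraction=0.01):
    """Enrichment factor at the given top fraction (assumed common definition).

    scores    -- predicted scores, higher meaning "more likely active"
    is_active -- 0/1 labels aligned with scores
    fraction  -- top fraction of the ranked library to inspect (0.01 = 1%)
    """
    if len(scores) != len(is_active) or not scores:
        raise ValueError("scores and is_active must be non-empty and aligned")
    total_actives = sum(is_active)
    if total_actives == 0:
        raise ValueError("at least one active is required")

    # Number of compounds in the inspected top slice (at least one).
    n_top = max(1, round(len(scores) * fraction))

    # Rank compounds from best to worst predicted score.
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(active for _, active in ranked[:n_top])

    # Hit rate in the top slice relative to the hit rate in the full library.
    return (hits_top / n_top) / (total_actives / len(scores))


if __name__ == "__main__":
    # Toy library: 1000 compounds, 10 actives, 5 of them ranked in the top 1%.
    scores = [1000 - i for i in range(1000)]
    labels = [1 if i < 5 or 500 <= i < 505 else 0 for i in range(1000)]
    print(enrichment_factor(scores, labels, fraction=0.01))
```

Under this definition, an EF of 1 corresponds to random ranking, so the reported values of 13.78 and 10.86 both indicate strong early enrichment.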
Affiliation(s)
- Miles McGibbon
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, UK
- Sam Money-Kyrle
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, UK
- Vincent Blay
- Department of Microbiology and Environmental Toxicology, University of California at Santa Cruz, Santa Cruz, CA 95064, USA; Institute for Integrative Systems Biology (I2SysBio), Universitat de València and Spanish Research Council (CSIC), 46980 Valencia, Spain
- Douglas R Houston
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, UK
|
15
|
Meli R, Morris GM, Biggin PC. Scoring Functions for Protein-Ligand Binding Affinity Prediction using Structure-Based Deep Learning: A Review. FRONTIERS IN BIOINFORMATICS 2022; 2:885983. [PMID: 36187180 PMCID: PMC7613667 DOI: 10.3389/fbinf.2022.885983] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 02/28/2022] [Accepted: 05/11/2022] [Indexed: 01/01/2023] Open
Abstract
The rapid and accurate in silico prediction of protein-ligand binding free energies or binding affinities has the potential to transform drug discovery. In recent years, there has been a rapid growth of interest in deep learning methods for the prediction of protein-ligand binding affinities based on the structural information of protein-ligand complexes. These structure-based scoring functions often obtain better results than classical scoring functions when applied within their applicability domain. Here we review structure-based scoring functions for binding affinity prediction based on deep learning, focussing on different types of architectures, featurization strategies, data sets, methods for training and evaluation, and the role of explainable artificial intelligence in building useful models for real drug-discovery applications.
Affiliation(s)
- Rocco Meli
- Department of Biochemistry, University of Oxford, Oxford, United Kingdom
- Garrett M. Morris
- Department of Statistics, University of Oxford, Oxford, United Kingdom
- Philip C. Biggin
- Department of Biochemistry, University of Oxford, Oxford, United Kingdom
|
16
|
Multi-TransDTI: Transformer for Drug–Target Interaction Prediction Based on Simple Universal Dictionaries with Multi-View Strategy. Biomolecules 2022; 12:biom12050644. [PMID: 35625572 PMCID: PMC9138327 DOI: 10.3390/biom12050644] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 03/31/2022] [Revised: 04/19/2022] [Accepted: 04/25/2022] [Indexed: 01/03/2023] Open
Abstract
Prediction of drug–target interactions has always been a crucial link in drug discovery and repositioning, and has seen tremendous progress in recent years. Despite these efforts, existing representation learning and feature generation approaches for both drugs and proteins remain complicated and high-dimensional. In addition, it is difficult for current methods to extract locally important residues from sequence information while remaining focused on global structure. At the same time, massive data is not always easily accessible, which makes learning from small datasets an urgent need. We therefore propose an end-to-end learning model that uses the SUPD and SUDD methods to encode drugs and proteins, which not only avoids a complicated feature extraction process but also greatly reduces the dimension of the embedding matrix. We also use a multi-view strategy with a transformer to extract locally important residues of proteins for better representation learning. Finally, we evaluate our model on the BindingDB dataset against different state-of-the-art models using a comprehensive set of indicators. On the full (100%) BindingDB dataset, our AUC, AUPR, ACC, and F1-score reached 90.9%, 89.8%, 84.2%, and 84.3%, respectively, exceeding the average values of the other models by 2.2%, 2.3%, 2.6%, and 2.6%. Moreover, our model also generally surpasses their performance on the 30% and 50% BindingDB datasets.
|
17
|
Li M, Lu Z, Wu Y, Li Y. BACPI: a bi-directional attention neural network for compound-protein interaction and binding affinity prediction. Bioinformatics 2022; 38:1995-2002. [PMID: 35043942 DOI: 10.1093/bioinformatics/btac035] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 08/09/2021] [Revised: 12/06/2021] [Accepted: 01/14/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION: The identification of compound-protein interactions (CPIs) is an essential step in the process of drug discovery. The experimental determination of CPIs is known to be costly and time-consuming. Computational models have therefore become a promising and efficient alternative for predicting novel interactions between compounds and proteins on a large scale. Most supervised machine learning models treat the task as a binary classification problem, aiming to predict whether or not there is an interaction between the compound and the protein. However, a CPI is not a simple binary on-off relationship, but a continuous value that reflects how tightly the compound binds to a particular target protein, also called binding affinity.
RESULTS: In this study, we propose an end-to-end neural network model, called BACPI, to predict CPI and binding affinity. We employ a graph attention network and a convolutional neural network (CNN) to learn the representations of compounds and proteins, and develop a bi-directional attention neural network model to integrate the representations. To evaluate the performance of BACPI, we use three CPI datasets and four binding affinity datasets in our experiments. The results show that, when predicting CPIs, BACPI significantly outperforms other available machine learning methods on both balanced and unbalanced datasets. This suggests that an end-to-end neural network model that predicts CPIs directly from low-level representations is more robust than traditional machine learning-based methods. When predicting binding affinities, BACPI achieves higher performance on large datasets than other state-of-the-art deep learning methods. This comparison suggests that the proposed bi-directional attention neural network can capture the regions of compounds and proteins that are important for binding affinity prediction.
AVAILABILITY AND IMPLEMENTATION: Data and source codes are available at https://github.com/CSUBioGroup/BACPI.
Affiliation(s)
- Min Li
- School of Computer Science and Engineering, Central South University, Changsha, Hunan, 410083, China
- Zhangli Lu
- School of Computer Science and Engineering, Central South University, Changsha, Hunan, 410083, China
- Yifan Wu
- School of Computer Science and Engineering, Central South University, Changsha, Hunan, 410083, China
- YaoHang Li
- Department of Computer Science, Old Dominion University, Norfolk, VA, USA
|
18
|
Rezaei MA, Li Y, Wu D, Li X, Li C. Deep Learning in Drug Design: Protein-Ligand Binding Affinity Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:407-417. [PMID: 33360998 PMCID: PMC8942327 DOI: 10.1109/tcbb.2020.3046945] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Indexed: 06/12/2023]
Abstract
Computational drug design relies on calculating the binding strength between two biological counterparts, especially between a chemical compound (i.e., a ligand) and a protein. Predicting the affinity of protein-ligand binding with reasonable accuracy is crucial for drug discovery, and enables the optimization of compounds to achieve better interaction with their target protein. In this paper, we propose a data-driven framework named DeepAtom to accurately predict protein-ligand binding affinity. With a 3D Convolutional Neural Network (3D-CNN) architecture, DeepAtom can automatically extract binding-related atomic interaction patterns from the voxelized complex structure. Compared with other CNN-based approaches, our light-weight model design effectively improves the model's representational capacity, even with the limited available training data. We carried out validation experiments on the PDBbind v.2016 benchmark and the independent Astex Diverse Set. We demonstrate that the DeepAtom approach, which depends less on feature engineering, consistently outperforms the other baseline scoring methods. We also compile and propose a new benchmark dataset to further improve model performance. With the new dataset as training input, DeepAtom achieves Pearson's R=0.83 and RMSE=1.23 pK units on the PDBbind v.2016 core set. The promising results demonstrate that DeepAtom models could be adopted in computational drug development protocols such as molecular docking and virtual screening.
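The two metrics reported for the core set, Pearson's R and RMSE, can be computed as in the following minimal sketch. It is not taken from the DeepAtom code; the function names and the toy pK values are hypothetical and only illustrate the standard definitions.

```python
import math


def pearson_r(predicted, observed):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    covariance = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    spread_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    spread_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed))
    return covariance / (spread_p * spread_o)


def rmse(predicted, observed):
    """Root-mean-square error, in the same units as the inputs (here pK units)."""
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Hypothetical predicted and experimental pK values for five complexes.
    predicted_pk = [6.1, 7.4, 5.0, 8.2, 4.9]
    experimental_pk = [6.5, 7.0, 4.2, 9.0, 5.5]
    print(f"Pearson R = {pearson_r(predicted_pk, experimental_pk):.3f}")
    print(f"RMSE      = {rmse(predicted_pk, experimental_pk):.3f} pK units")
```

Pearson's R measures how well the predictions track the ranking and linear trend of the experimental affinities, while RMSE measures the absolute error, so the two are usually reported together.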
Affiliation(s)
- Mohammad A. Rezaei
- Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development (CNPD3), University of Florida
- Yanjun Li
- Large-scale Intelligent Systems Laboratory, NSF Center for Big Learning, University of Florida, Gainesville, FL, USA
- Dapeng Wu
- Large-scale Intelligent Systems Laboratory, NSF Center for Big Learning, University of Florida, Gainesville, FL, USA
- Xiaolin Li
- Cognization Lab, Palo Alto, California, USA
- Chenglong Li
- Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development (CNPD3), University of Florida
- Large-scale Intelligent Systems Laboratory, NSF Center for Big Learning, University of Florida, Gainesville, FL, USA
|
19
|
Li H, Lu G, Sze KH, Su X, Chan WY, Leung KS. Machine-learning scoring functions trained on complexes dissimilar to the test set already outperform classical counterparts on a blind benchmark. Brief Bioinform 2021; 22:bbab225. [PMID: 34169324 PMCID: PMC8575004 DOI: 10.1093/bib/bbab225] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/03/2021] [Revised: 04/27/2021] [Accepted: 05/23/2021] [Indexed: 11/12/2022] Open
Abstract
The superior performance of machine-learning scoring functions for docking has caused a series of debates on whether it is due to learning knowledge from training data that are similar in some sense to the test data. With a systematically revised methodology and a blind benchmark realistically mimicking the process of prospective prediction of binding affinity, we have evaluated three broadly used classical scoring functions and five machine-learning counterparts calibrated with both random forest and extreme gradient boosting using both solo and hybrid features. We show for the first time that machine-learning scoring functions trained exclusively on a proportion as low as 8% of the complexes dissimilar to the test set already outperform classical scoring functions, a percentage far lower than what has been recently reported on all three CASF benchmarks. The performance of machine-learning scoring functions is underestimated due to the absence of similar samples in some artificially created training sets that discard the full spectrum of complexes to be found in a prospective environment. Given that some degree of similarity is inevitable in a large dataset, the criterion for selecting a scoring function is which one can make the best use of all available materials. Software code and data are provided at https://github.com/cusdulab/MLSF for interested readers to rapidly rebuild the scoring functions and reproduce our results, or to make extended analyses on their own benchmarks.
Affiliation(s)
- Gang Lu
- School of Biomedical Sciences, Chinese University of Hong Kong, Hong Kong
- Kam-Heung Sze
- Bioinformatics Unit, Hong Kong Medical Technology Institute, Hong Kong
- Xianwei Su
- Chinese University of Hong Kong, Hong Kong
- Wai-Yee Chan
- CUHK-SDU Joint Laboratory on Reproductive Genetics, School of Biomedical Sciences, Chinese University of Hong Kong, Hong Kong
- Kwong-Sak Leung
- Computer Science and Engineering in the Chinese University of Hong Kong, Hong Kong
|
20
|
Lennox M, Robertson N, Devereux B. Modelling Drug-Target Binding Affinity using a BERT based Graph Neural network. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2021; 2021:4348-4353. [PMID: 34892183 DOI: 10.1109/embc46164.2021.9629695] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Understanding the interactions between novel drugs and target proteins is fundamentally important in disease research, as discovering drug-protein interactions can be an exceptionally time-consuming and expensive process. Alternatively, this process can be simulated using modern deep learning methods that have the potential to utilise vast quantities of data to reduce the cost and time required to provide accurate predictions. We seek to leverage a set of BERT-style models that have been pre-trained on vast quantities of both protein and drug data. The encodings produced by each model are then utilised as node representations for a graph convolutional neural network, which in turn are used to model the interactions without the need to simultaneously fine-tune both protein and drug BERT models to the task. We evaluate the performance of our approach on two drug-target interaction datasets that were previously used as benchmarks in recent work. Our results significantly improve upon a vanilla BERT baseline approach as well as the former state-of-the-art methods for each task dataset. Our approach builds upon past work in two key areas. Firstly, we take full advantage of two large pre-trained BERT models that provide improved representations of task-relevant properties of both drugs and proteins. Secondly, inspired by work in natural language processing that investigates how linguistic structure is represented in such models, we perform interpretability analyses that allow us to locate functionally-relevant areas of interest within each drug and protein. By modelling the drug-target interactions as a graph as opposed to a set of isolated interactions, we demonstrate the benefits of combining large pre-trained models and a graph neural network to make state-of-the-art predictions on drug-target binding affinity.
|
21
|
Using diverse potentials and scoring functions for the development of improved machine-learned models for protein-ligand affinity and docking pose prediction. J Comput Aided Mol Des 2021; 35:1095-1123. [PMID: 34708263 DOI: 10.1007/s10822-021-00423-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/03/2021] [Accepted: 10/11/2021] [Indexed: 10/20/2022]
Abstract
The advent of computational drug discovery holds the promise of significantly reducing the effort of experimentalists, along with monetary cost. More generally, predicting the binding of small organic molecules to biological macromolecules has far-reaching implications for a range of problems, including metabolomics. However, problems such as predicting the bound structure of a protein-ligand complex along with its affinity have proven to be an enormous challenge. In recent years, machine learning-based methods have proven to be more accurate than older methods, many of which were based on simple linear regression. Nonetheless, there remains room for improvement, as these methods are often trained on a small set of features, with a single functional form for any given physical effect, and often with little mention of the rationale behind choosing one functional form over another. Moreover, it is not entirely clear why one machine learning method is favored over another. In this work, we undertake a comprehensive effort towards developing high-accuracy, machine-learned scoring functions, systematically investigating the effects of machine learning method and choice of features, and, when possible, providing insights into the relevant physics using methods that assess feature importance. Here, we show synergism among disparate features, yielding adjusted R² with experimental binding affinities of up to 0.871 on an independent test set and enrichment for native bound structures of up to 0.913. When purely physical terms that model enthalpic and entropic effects are used in the training, we use feature importance assessments to probe the relevant physics and hopefully guide future investigators working on this and other computational chemistry problems.
|
22
|
Di Filippo JI, Cavasotto CN. Guided structure-based ligand identification and design via artificial intelligence modeling. Expert Opin Drug Discov 2021; 17:71-78. [PMID: 34544293 DOI: 10.1080/17460441.2021.1979514] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/20/2022]
Abstract
INTRODUCTION: The implementation of Artificial Intelligence (AI) methodologies in drug discovery (DD) is on the rise. Several applications have been developed for structure-based DD, where AI methods provide an alternative framework for the identification of ligands for validated therapeutic targets, as well as the de novo design of ligands through generative models.
AREAS COVERED: Herein, the authors review contributions from 2019 to the present regarding the application of AI methods to structure-based virtual screening (SBVS), which encompasses mainly molecular docking applications (binding pose prediction and binary classification for ligand or hit identification), as well as de novo drug design driven by machine learning (ML) generative models, and the validation of AI models in structure-based screening. Studies are reviewed in terms of their main objective, databases used, implemented methodology, input and output, and key results.
EXPERT OPINION: More profound analyses regarding the validity and applicability of AI methods in DD have begun to appear. In the near future, we expect to see more structure-based generative models (which are scarce in comparison to ligand-based generative models), the implementation of standard guidelines for validating the generated structures, and more analyses regarding the validation of AI methods in structure-based DD.
Affiliation(s)
- Juan I Di Filippo
- Computational Drug Design and Biomedical Informatics Laboratory, Instituto de Investigaciones en Medicina Traslacional (IIMT), CONICET-Universidad Austral, Pilar, Buenos Aires, Argentina; Facultad de Ciencias Biomédicas, and Facultad de Ingeniería, Universidad Austral, Pilar, Buenos Aires, Argentina; Austral Institute for Applied Artificial Intelligence, Universidad Austral, Pilar, Buenos Aires, Argentina
- Claudio N Cavasotto
- Computational Drug Design and Biomedical Informatics Laboratory, Instituto de Investigaciones en Medicina Traslacional (IIMT), CONICET-Universidad Austral, Pilar, Buenos Aires, Argentina; Facultad de Ciencias Biomédicas, and Facultad de Ingeniería, Universidad Austral, Pilar, Buenos Aires, Argentina; Austral Institute for Applied Artificial Intelligence, Universidad Austral, Pilar, Buenos Aires, Argentina
|
23
|
Ahmed A, Mam B, Sowdhamini R. DEELIG: A Deep Learning Approach to Predict Protein-Ligand Binding Affinity. Bioinform Biol Insights 2021; 15:11779322211030364. [PMID: 34290496 PMCID: PMC8274096 DOI: 10.1177/11779322211030364] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 12/08/2020] [Accepted: 06/05/2021] [Indexed: 12/03/2022] Open
Abstract
Protein-ligand binding prediction has extensive biological significance. Binding affinity helps in understanding the degree of protein-ligand interactions and is a useful measure in drug design. Protein-ligand docking using virtual screening and molecular dynamics simulations is required to predict the binding affinity of a ligand to its cognate receptor. Performing such analyses to cover the entire chemical space of small molecules requires intense computational power. Recent developments using deep learning have enabled us to make sense of massive amounts of complex data sets, where the ability of the model to “learn” intrinsic patterns in a complex plane of data is the strength of the approach. Here, we have incorporated convolutional neural networks to find spatial relationships among data to help us predict the binding affinity of proteins in whole superfamilies toward a diverse set of ligands without the need for a docked pose or complex as user input. The models were trained and validated using a stringent methodology for feature extraction. Our model performs better than some widely used existing methods and is suitable for predictions with a high-resolution protein crystal structure (≤2.5 Å) and a nonpeptide ligand as individual inputs. Our approach to network construction and training on a protein-ligand data set prepared in-house has yielded significant insights. We have also tested DEELIG on a few COVID-19 main protease-inhibitor complexes relevant to the current public health scenario. DEELIG-based predictions can be incorporated in existing databases including RCSB PDB, PDBMoad, and PDBbind to fill in missing binding affinity data for protein-ligand complexes.
Affiliation(s)
- Asad Ahmed
- National Institute of Technology Warangal, Warangal, India
- Bhavika Mam
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, Bangalore, India
- The University of Trans-Disciplinary Health Sciences and Technology (TDU), Bangalore, India
- Ramanathan Sowdhamini
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, Bangalore, India
- Ramanathan Sowdhamini, National Centre for Biological Sciences, Tata Institute of Fundamental Research, GKVK Campus, Bangalore 560065, Karnataka, India.
|
24
|
Sánchez-Cruz N, Medina-Franco JL, Mestres J, Barril X. Extended connectivity interaction features: improving binding affinity prediction through chemical description. Bioinformatics 2021; 37:1376-1382. [PMID: 33226061 DOI: 10.1093/bioinformatics/btaa982] [Citation(s) in RCA: 48] [Impact Index Per Article: 16.0] [Received: 09/18/2020] [Revised: 10/27/2020] [Accepted: 11/10/2020] [Indexed: 12/22/2022] Open
Abstract
MOTIVATION: Machine-learning scoring functions (SFs) have been found to outperform standard SFs for binding affinity prediction of protein-ligand complexes. A plethora of reports focus on the implementation of increasingly complex algorithms, while the chemical description of the system has not been fully exploited.
RESULTS: Herein, we introduce Extended Connectivity Interaction Features (ECIF) to describe protein-ligand complexes and build machine-learning SFs with improved predictions of binding affinity. ECIF are a set of protein-ligand atom-type pair counts that take into account each atom's connectivity to describe it and thus define the pair types. ECIF were used to build different machine-learning models to predict protein-ligand affinities (pKd/pKi). The models were evaluated in terms of 'scoring power' on the Comparative Assessment of Scoring Functions 2016. The best models built on ECIF achieved Pearson correlation coefficients of 0.857 when used on its own, and 0.866 when used in combination with ligand descriptors, demonstrating ECIF descriptive power.
AVAILABILITY AND IMPLEMENTATION: Data and code to reproduce all the results are freely available at https://github.com/DIFACQUIM/ECIF.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
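The idea of protein-ligand atom-type pair counts can be sketched as follows. This is only an illustration of pair counting within a distance cutoff, not the ECIF code from the repository above: the atom-type labels, the 6.0 Å cutoff, and the function name `pair_counts` are assumptions, and the connectivity-aware typing of each atom is taken as already done.

```python
import math
from collections import Counter


def pair_counts(protein_atoms, ligand_atoms, cutoff=6.0):
    """Count protein-ligand atom-type pairs that lie within a distance cutoff.

    Each atom is a (type_label, (x, y, z)) tuple. The type labels are assumed
    to already encode the atom's connectivity; the cutoff (in angstroms) is an
    illustrative choice, not a value taken from the abstract.
    """
    counts = Counter()
    for protein_type, protein_xyz in protein_atoms:
        for ligand_type, ligand_xyz in ligand_atoms:
            if math.dist(protein_xyz, ligand_xyz) <= cutoff:
                counts[(protein_type, ligand_type)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical connectivity-aware type labels and coordinates.
    protein = [
        ("C;4;3;1", (0.0, 0.0, 0.0)),
        ("O;2;1;0", (10.0, 0.0, 0.0)),
    ]
    ligand = [
        ("N;3;2;1", (3.0, 0.0, 0.0)),
    ]
    for pair, count in pair_counts(protein, ligand).items():
        print(pair, count)
```

The resulting counts form a fixed-length feature vector once a vocabulary of pair types is chosen, which is what makes them usable as input to standard machine-learning regressors.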
Affiliation(s)
- Norberto Sánchez-Cruz
- Department of Pharmacy, School of Chemistry, Universidad Nacional Autónoma de México, Mexico City 04510, Mexico
- José L Medina-Franco
- Department of Pharmacy, School of Chemistry, Universidad Nacional Autónoma de México, Mexico City 04510, Mexico
- Jordi Mestres
- Research Group on Systems Pharmacology, Research Program on Biomedical Informatics (GRIB), IMIM Hospital del Mar Medical Research Institute and University Pompeu Fabra, Parc de Recerca Biomedica (PRBB), 08003 Barcelona, Catalonia, Spain
- Chemotargets SL, Parc Cientific de Barcelona (PCB), 08028 Barcelona, Catalonia, Spain
- Xavier Barril
- Institut de Biomedicina de la Universitat de Barcelona (IBUB) and Facultat de Farmacia, Universitat de Barcelona, 08028 Barcelona, Spain
- Catalan Institution for Research and Advanced Studies (ICREA), 08010 Barcelona, Spain
|
25
|
Zhu J, Jiang Y, Jia L, Xu L, Cai Y, Chen Y, Zhu N, Li H, Jin J. A multi-conformational virtual screening approach based on machine learning targeting PI3Kγ. Mol Divers 2021; 25:1271-1282. [PMID: 34160714 DOI: 10.1007/s11030-021-10243-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 03/14/2021] [Accepted: 06/03/2021] [Indexed: 12/13/2022]
Abstract
The development of selective PI3Kγ inhibitors has attracted growing attention, but the unique structural features of the PI3Kγ protein make it a major challenge. In the present study, a virtual screening strategy based on machine learning with multiple PI3Kγ protein structures was developed to screen for novel PI3Kγ inhibitors. First, six mainstream docking programs were evaluated for their scoring power and screening power; CDOCKER and Glide showed satisfactory reliability and accuracy against the PI3Kγ system. Next, virtual screening integrating multiple PI3Kγ protein structures was shown to significantly improve the screening enrichment rate compared with that obtained with an individual protein structure. Last, a multi-conformational Naïve Bayesian Classification model with the optimal docking programs was constructed, and it showed real capability in the screening of PI3Kγ inhibitors. Taken together, the current study could provide some guidance for docking-based virtual screening to discover novel PI3Kγ inhibitors.
Affiliation(s)
- Jingyu Zhu
- School of Pharmaceutical Sciences, Jiangnan University, Wuxi, 214122, Jiangsu, China.
| | - Yingmin Jiang
- School of Pharmaceutical Sciences, Jiangnan University, Wuxi, 214122, Jiangsu, China
| | - Lei Jia
- School of Pharmaceutical Sciences, Jiangnan University, Wuxi, 214122, Jiangsu, China
| | - Lei Xu
- Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou, 213001, China
| | - Yanfei Cai
- School of Pharmaceutical Sciences, Jiangnan University, Wuxi, 214122, Jiangsu, China
| | - Yun Chen
- School of Pharmaceutical Sciences, Jiangnan University, Wuxi, 214122, Jiangsu, China
| | - Nannan Zhu
- School of Pharmaceutical Sciences, Jiangnan University, Wuxi, 214122, Jiangsu, China
| | - Huazhong Li
- School of Biotechnology, Jiangnan University, Wuxi, 214122, Jiangsu, China
| | - Jian Jin
- School of Pharmaceutical Sciences, Jiangnan University, Wuxi, 214122, Jiangsu, China.
| |
|
26
|
Zeng Y, Chen X, Luo Y, Li X, Peng D. Deep drug-target binding affinity prediction with multiple attention blocks. Brief Bioinform 2021; 22:6231754. [PMID: 33866349 PMCID: PMC8083346 DOI: 10.1093/bib/bbab117] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Received: 12/16/2021] [Revised: 02/12/2021] [Accepted: 03/13/2021] [Indexed: 11/23/2022] Open
Abstract
Drug-target interaction (DTI) prediction has drawn increasing interest due to its substantial position in the drug discovery process. Many studies have introduced computational models to treat DTI prediction as a regression task, which directly predict the binding affinity of drug-target pairs. However, existing studies (i) ignore the essential correlations between atoms when encoding drug compounds and (ii) model the interaction of drug-target pairs simply by concatenation. Based on those observations, in this study, we propose an end-to-end model with multiple attention blocks to predict the binding affinity scores of drug-target pairs. Our proposed model offers the abilities to (i) encode the correlations between atoms by a relation-aware self-attention block and (ii) model the interaction of drug representations and target representations by the multi-head attention block. Experimental results of DTI prediction on two benchmark datasets show that our approach outperforms existing methods, benefiting from the correlation information encoded by the relation-aware self-attention block and the interaction information extracted by the multi-head attention block. Moreover, we conduct experiments on the effects of the max relative position length and find the best max relative position length value to be k ∈ {3, 5}. Furthermore, we apply our model to predict the binding affinity of Corona Virus Disease 2019 (COVID-19)-related genome sequences and 3137 FDA-approved drugs.
Affiliation(s)
- Yuni Zeng
- College of Computer Science, Sichuan University, Chengdu, Sichuan, 610065, China
| | - Xiangru Chen
- College of Computer Science, Sichuan University, Chengdu, Sichuan, 610065, China
| | - Yujie Luo
- Shenzhen Peng Cheng Laboratory, Shenzhen, 518052, China
| | - Xuedong Li
- Chengdu Sobey Digital Technology Co., Ltd, Chengdu, 610041, China
| | - Dezhong Peng
- College of Computer Science, Sichuan University, Chengdu, Sichuan, 610065, China
| |
|
27
|
Shim J, Hong ZY, Sohn I, Hwang C. Prediction of drug-target binding affinity using similarity-based convolutional neural network. Sci Rep 2021; 11:4416. [PMID: 33627791 PMCID: PMC7904939 DOI: 10.1038/s41598-021-83679-y] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Received: 10/20/2020] [Accepted: 01/18/2021] [Indexed: 12/02/2022] Open
Abstract
Identifying novel drug–target interactions (DTIs) plays an important role in drug discovery. Most of the computational methods developed for predicting DTIs use binary classification, whose goal is to determine whether or not a drug–target (DT) pair interacts. However, it is more meaningful but also more challenging to predict the binding affinity that describes the strength of the interaction between a DT pair. If the binding affinity is not sufficiently large, the drug may not be useful. Therefore, methods for predicting DT binding affinities are very valuable. The increase in novel public affinity data available in the DT-related databases enables advanced deep learning techniques to be used to predict binding affinities. In this paper, we propose a similarity-based model that applies a 2-dimensional (2D) convolutional neural network (CNN) to the outer products between column vectors of two similarity matrices for the drugs and targets to predict DT binding affinities. To the best of our knowledge, this is the first application of a 2D CNN in similarity-based DT binding affinity prediction. The validation results on multiple public datasets show that the proposed model is an effective approach for DT binding affinity prediction and can be quite helpful in the drug development process.
Affiliation(s)
- Jooyong Shim
- Department of Statistics, Institute of Statistical Information, Inje University, Gimhae, Gyeongsangnamdo, South Korea
| | - Changha Hwang
- Department of Applied Statistics, Dankook University, Yongin, Gyeonggido, 16890, South Korea.
| |
|
28
|
Bitencourt-Ferreira G, Duarte da Silva A, Filgueira de Azevedo W. Application of Machine Learning Techniques to Predict Binding Affinity for Drug Targets: A Study of Cyclin-Dependent Kinase 2. Curr Med Chem 2021; 28:253-265. [PMID: 31729287 DOI: 10.2174/2213275912666191102162959] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 06/17/2019] [Revised: 08/22/2019] [Accepted: 09/24/2019] [Indexed: 11/22/2022]
Abstract
BACKGROUND The elucidation of the structure of cyclin-dependent kinase 2 (CDK2) made it possible to develop targeted scoring functions for virtual screening aimed to identify new inhibitors for this enzyme. CDK2 is a protein target for the development of drugs intended to modulate cell-cycle progression and control. Such drugs have potential anticancer activities. OBJECTIVE Our goal here is to review recent applications of machine learning methods to predict ligand-binding affinity for protein targets. To assess the predictive performance of classical scoring functions and targeted scoring functions, we focused our analysis on CDK2 structures. METHODS We have experimental structural data for hundreds of binary complexes of CDK2 with different ligands, many of them with inhibition constant information. We investigate here computational methods to calculate the binding affinity of CDK2 through classical scoring functions and machine-learning models. RESULTS Analysis of the predictive performance of classical scoring functions available in docking programs such as Molegro Virtual Docker, AutoDock4, and AutoDock Vina indicated that these methods failed to predict binding affinity with significant correlation with experimental data. Targeted scoring functions developed through supervised machine learning techniques showed a significant correlation with experimental data. CONCLUSION Here, we described the application of supervised machine learning techniques to generate a scoring function to predict binding affinity. Machine learning models showed superior predictive performance when compared with classical scoring functions. Analysis of the computational models obtained through machine learning could capture essential structural features responsible for binding affinity against CDK2.
Affiliation(s)
- Gabriela Bitencourt-Ferreira
- Laboratory of Computational Systems Biology, Pontifical Catholic University of Rio Grande do Sul (PUCRS), Av. Ipiranga, 6681 Porto Alegre/RS 90619-900, Brazil
| | - Amauri Duarte da Silva
- Specialization Program in Bioinformatics, Pontifical Catholic University of Rio Grande do Sul (PUCRS), Av. Ipiranga, 6681 Porto Alegre/RS 90619-900, Brazil
| | - Walter Filgueira de Azevedo
- Laboratory of Computational Systems Biology, Pontifical Catholic University of Rio Grande do Sul (PUCRS), Av. Ipiranga, 6681 Porto Alegre/RS 90619-900, Brazil
| |
|
29
|
Guedes IA, Barreto AMS, Marinho D, Krempser E, Kuenemann MA, Sperandio O, Dardenne LE, Miteva MA. New machine learning and physics-based scoring functions for drug discovery. Sci Rep 2021; 11:3198. [PMID: 33542326 PMCID: PMC7862620 DOI: 10.1038/s41598-021-82410-1] [Citation(s) in RCA: 66] [Impact Index Per Article: 22.0] [Received: 11/02/2020] [Accepted: 01/20/2021] [Indexed: 12/11/2022] Open
Abstract
Scoring functions are essential for modern in silico drug discovery. However, the accurate prediction of binding affinity by scoring functions remains a challenging task. The performance of scoring functions is very heterogeneous across different target classes. Scoring functions based on precise physics-based descriptors that better represent the protein–ligand recognition process are strongly needed. We developed a set of new empirical scoring functions, named DockTScore, by explicitly accounting for physics-based terms combined with machine learning. Target-specific scoring functions were developed for two important drug targets, proteases and protein–protein interactions, representing an original class of molecules for drug discovery. Multiple linear regression (MLR), support vector machine and random forest algorithms were employed to derive general and target-specific scoring functions involving optimized MMFF94S force-field terms, solvation and lipophilic interaction terms, and an improved term accounting for the ligand torsional entropy contribution to ligand binding. DockTScore scoring functions proved to be competitive with the current best-evaluated scoring functions in terms of binding energy prediction and ranking on four DUD-E datasets and will be useful for in silico drug design for diverse proteins as well as for specific targets such as proteases and protein–protein interactions. Currently, the MLR DockTScore is available at www.dockthor.lncc.br.
Affiliation(s)
- Isabella A Guedes
- Laboratório Nacional de Computação Científica, Petrópolis, 25651-075, Brazil
- Inserm U973, Université Paris Diderot, Paris, France
| | - André M S Barreto
- Laboratório Nacional de Computação Científica, Petrópolis, 25651-075, Brazil
| | - Diogo Marinho
- Laboratório Nacional de Computação Científica, Petrópolis, 25651-075, Brazil
| | - Olivier Sperandio
- Inserm U973, Université Paris Diderot, Paris, France
- Structural Bioinformatics Unit, CNRS UMR3528, Institut Pasteur, 75015, Paris, France
| | - Laurent E Dardenne
- Laboratório Nacional de Computação Científica, Petrópolis, 25651-075, Brazil.
| | - Maria A Miteva
- Inserm U973, Université Paris Diderot, Paris, France
- Inserm U1268 "Medicinal Chemistry and Translational Research", CiTCoM, UMR 8038, CNRS, Université de Paris, 75006, Paris, France.
| |
|
30
|
Bao J, He X, Zhang JZ. Development of a New Scoring Function for Virtual Screening: APBScore. J Chem Inf Model 2020; 60:6355-6365. [DOI: 10.1021/acs.jcim.0c00474] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/23/2022]
Affiliation(s)
- Jingxiao Bao
- Shanghai Engineering Research Center of Molecular Therapeutics and New Drug Development, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai 200062, China
| | - Xiao He
- Shanghai Engineering Research Center of Molecular Therapeutics and New Drug Development, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai 200062, China
- NYU-ECNU Center for Computational Chemistry, NYU Shanghai, Shanghai 200062, China
| | - John Z.H. Zhang
- Shanghai Engineering Research Center of Molecular Therapeutics and New Drug Development, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai 200062, China
- NYU-ECNU Center for Computational Chemistry, NYU Shanghai, Shanghai 200062, China
- Department of Chemistry, New York University, New York, New York 10003, United States
- Collaborative Innovation Center of Extreme Optics, Shanxi University, Taiyuan, Shanxi 030006, China
| |
|
31
|
Francoeur PG, Masuda T, Sunseri J, Jia A, Iovanisci RB, Snyder I, Koes DR. Three-Dimensional Convolutional Neural Networks and a Cross-Docked Data Set for Structure-Based Drug Design. J Chem Inf Model 2020; 60:4200-4215. [DOI: 10.1021/acs.jcim.0c00411] [Citation(s) in RCA: 38] [Impact Index Per Article: 9.5] [Indexed: 12/14/2022]
Affiliation(s)
- Paul G. Francoeur
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
| | - Tomohide Masuda
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
| | - Jocelyn Sunseri
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
| | - Andrew Jia
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
| | - Richard B. Iovanisci
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
| | - Ian Snyder
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
| | - David R. Koes
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
| |
|
32
|
Li H, Sze K, Lu G, Ballester PJ. Machine‐learning scoring functions for structure‐based virtual screening. Wiley Interdisciplinary Reviews: Computational Molecular Science 2020. [DOI: 10.1002/wcms.1478] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Indexed: 12/14/2022]
Affiliation(s)
- Hongjian Li
- Cancer Research Center of Marseille (INSERM U1068, Institut Paoli‐Calmettes, Aix‐Marseille Université UM105, CNRS UMR7258), Marseille, France
- CUHK‐SDU Joint Laboratory on Reproductive Genetics, School of Biomedical Sciences, Chinese University of Hong Kong, Shatin, Hong Kong
| | - Kam‐Heung Sze
- CUHK‐SDU Joint Laboratory on Reproductive Genetics, School of Biomedical Sciences, Chinese University of Hong Kong, Shatin, Hong Kong
| | - Gang Lu
- CUHK‐SDU Joint Laboratory on Reproductive Genetics, School of Biomedical Sciences, Chinese University of Hong Kong, Shatin, Hong Kong
| | - Pedro J. Ballester
- Cancer Research Center of Marseille (INSERM U1068, Institut Paoli‐Calmettes, Aix‐Marseille Université UM105, CNRS UMR7258), Marseille, France
| |
|
33
|
Karlov D, Sosnin S, Fedorov MV, Popov P. graphDelta: MPNN Scoring Function for the Affinity Prediction of Protein-Ligand Complexes. ACS Omega 2020; 5:5150-5159. [PMID: 32201802 PMCID: PMC7081425 DOI: 10.1021/acsomega.9b04162] [Citation(s) in RCA: 47] [Impact Index Per Article: 11.8] [Received: 12/06/2019] [Accepted: 02/21/2020] [Indexed: 06/04/2023]
Abstract
In this work, we present graph-convolutional neural networks for the prediction of binding constants of protein-ligand complexes. We derived the model using multitask learning, where the target variables are the dissociation constant (Kd), inhibition constant (Ki), and half maximal inhibitory concentration (IC50). Being rigorously trained on the PDBbind dataset, the model achieves the Pearson correlation coefficient of 0.87 and the RMSE value of 1.05 in pK units, outperforming recently developed 3D convolutional neural network model Kdeep.
Affiliation(s)
- Dmitry S. Karlov
- Skolkovo Institute of Science and Technology, Moscow 143026, Russia
| | - Sergey Sosnin
- Skolkovo Institute of Science and Technology, Moscow 143026, Russia
- Skolkovo Innovation Center, Syntelly LLC, 42 Bolshoy Boulevard, Moscow 143026, Russia
| | - Maxim V. Fedorov
- Skolkovo Institute of Science and Technology, Moscow 143026, Russia
- Skolkovo Innovation Center, Syntelly LLC, 42 Bolshoy Boulevard, Moscow 143026, Russia
- University of Strathclyde, Physics, John Anderson Building, 107 Rottenrow East, Glasgow UK G4 0NG, U.K.
| | - Petr Popov
- Skolkovo Institute of Science and Technology, Moscow 143026, Russia
- Moscow Institute of Physics and Technology, Dolgoprudny 141701, Russia
| |
|
34
|
Abstract
Recently, machine learning (ML) has established itself in various worldwide benchmarking competitions in computational biology, including Critical Assessment of Structure Prediction (CASP) and Drug Design Data Resource (D3R) Grand Challenges. However, the intricate structural complexity and high ML dimensionality of biomolecular datasets obstruct the efficient application of ML algorithms in the field. In addition to data and algorithm, an efficient ML machinery for biomolecular predictions must include structural representation as an indispensable component. Mathematical representations that simplify the biomolecular structural complexity and reduce ML dimensionality have emerged as a prime winner in D3R Grand Challenges. This review is devoted to the recent advances in developing low-dimensional and scalable mathematical representations of biomolecules in our laboratory. We discuss three classes of mathematical approaches, including algebraic topology, differential geometry, and graph theory. We elucidate how the physical and biological challenges have guided the evolution and development of these mathematical apparatuses for massive and diverse biomolecular data. We focus the performance analysis on protein-ligand binding predictions in this review although these methods have had tremendous success in many other applications, such as protein classification, virtual screening, and the predictions of solubility, solvation free energies, toxicity, partition coefficients, protein folding stability changes upon mutation, etc.
Affiliation(s)
- Duc Duy Nguyen
- Department of Mathematics, Michigan State University, MI 48824, USA.
| | - Zixuan Cang
- Department of Mathematics, Michigan State University, MI 48824, USA.
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, MI 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA
- Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA
| |
|
35
|
Su M, Feng G, Liu Z, Li Y, Wang R. Tapping on the Black Box: How Is the Scoring Power of a Machine-Learning Scoring Function Dependent on the Training Set? J Chem Inf Model 2020; 60:1122-1136. [DOI: 10.1021/acs.jcim.9b00714] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Indexed: 11/30/2022]
Affiliation(s)
- Minyi Su
- State Key Laboratory of Bioorganic and Natural Products Chemistry, Center for Excellence in Molecular Synthesis, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, 345 Lingling Road, Shanghai 200032, People’s Republic of China
- University of Chinese Academy of Sciences, Beijing 100049, People’s Republic of China
| | - Guoqin Feng
- State Key Laboratory of Bioorganic and Natural Products Chemistry, Center for Excellence in Molecular Synthesis, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, 345 Lingling Road, Shanghai 200032, People’s Republic of China
- University of Chinese Academy of Sciences, Beijing 100049, People’s Republic of China
| | - Zhihai Liu
- State Key Laboratory of Bioorganic and Natural Products Chemistry, Center for Excellence in Molecular Synthesis, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, 345 Lingling Road, Shanghai 200032, People’s Republic of China
| | - Yan Li
- State Key Laboratory of Bioorganic and Natural Products Chemistry, Center for Excellence in Molecular Synthesis, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, 345 Lingling Road, Shanghai 200032, People’s Republic of China
- Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, People’s Republic of China
| | - Renxiao Wang
- State Key Laboratory of Bioorganic and Natural Products Chemistry, Center for Excellence in Molecular Synthesis, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, 345 Lingling Road, Shanghai 200032, People’s Republic of China
- Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, People’s Republic of China
- Shanxi Key Laboratory of Innovative Drugs for the Treatment of Serious Diseases Basing on Chronic Inflammation, College of Traditional Chinese Medicines, Shanxi University of Chinese Medicine, Taiyuan, Shanxi 030619, People’s Republic of China
| |
|
36
|
Mahmoud AH, Masters MR, Yang Y, Lill MA. Elucidating the multiple roles of hydration for accurate protein-ligand binding prediction via deep learning. Commun Chem 2020; 3:19. [PMID: 36703428 PMCID: PMC9814895 DOI: 10.1038/s42004-020-0261-x] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 10/21/2019] [Accepted: 01/16/2020] [Indexed: 01/29/2023] Open
Abstract
Accurate and efficient prediction of protein-ligand interactions has been a long-lasting dream of practitioners in drug discovery. The insufficient treatment of hydration is widely recognized to be a major limitation for accurate protein-ligand scoring. Using an integration of molecular dynamics simulations on thousands of protein structures with novel big-data analytics based on convolutional neural networks and deep Taylor decomposition, we consistently identify here three different patterns of hydration to be essential for protein-ligand interactions. In addition to desolvation and water-mediated interactions, the formation of enthalpically favorable networks of first-shell water molecules around solvent-exposed ligand moieties is identified to be essential for protein-ligand binding. Despite being currently neglected in drug discovery, this hydration phenomenon could lead to new avenues in optimizing the free energy of ligand binding. Application of deep neural networks incorporating hydration to docking provides 89% accuracy in binding pose ranking, an essential step for rational structure-based drug design.
Affiliation(s)
- Amr H. Mahmoud
- Department of Medicinal Chemistry and Molecular Pharmacology, College of Pharmacy, Purdue University, 575 Stadium Mall Drive, West Lafayette, IN 47906 USA
| | - Matthew R. Masters
- Department of Medicinal Chemistry and Molecular Pharmacology, College of Pharmacy, Purdue University, 575 Stadium Mall Drive, West Lafayette, IN 47906 USA
| | - Ying Yang
- Department of Medicinal Chemistry and Molecular Pharmacology, College of Pharmacy, Purdue University, 575 Stadium Mall Drive, West Lafayette, IN 47906 USA
| | - Markus A. Lill
- Department of Medicinal Chemistry and Molecular Pharmacology, College of Pharmacy, Purdue University, 575 Stadium Mall Drive, West Lafayette, IN 47906 USA
- Department of Pharmaceutical Sciences, University of Basel, Klingelbergstrasse 50, 4056 Basel, Switzerland
| |
|
37
|
Li H, Sze K, Lu G, Ballester PJ. Machine‐learning scoring functions for structure‐based drug lead optimization. Wiley Interdisciplinary Reviews: Computational Molecular Science 2020. [DOI: 10.1002/wcms.1465] [Citation(s) in RCA: 53] [Impact Index Per Article: 13.3] [Indexed: 12/17/2022]
Affiliation(s)
- Hongjian Li
- CUHK‐SDU Joint Laboratory on Reproductive Genetics, School of Biomedical Sciences, Chinese University of Hong Kong, Shatin, Hong Kong
| | - Kam‐Heung Sze
- CUHK‐SDU Joint Laboratory on Reproductive Genetics, School of Biomedical Sciences, Chinese University of Hong Kong, Shatin, Hong Kong
| | - Gang Lu
- CUHK‐SDU Joint Laboratory on Reproductive Genetics, School of Biomedical Sciences, Chinese University of Hong Kong, Shatin, Hong Kong
| | - Pedro J. Ballester
- Cancer Research Center of Marseille (INSERM U1068, Institut Paoli‐Calmettes, Aix‐Marseille Université UM105, CNRS UMR7258), Marseille, France
| |
|
38
|
Boyles F, Deane CM, Morris GM. Learning from the ligand: using ligand-based features to improve binding affinity prediction. Bioinformatics 2019; 36:758-764. [DOI: 10.1093/bioinformatics/btz665] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 05/22/2019] [Revised: 08/14/2019] [Accepted: 08/21/2019] [Indexed: 12/27/2022] Open
Abstract
Motivation
Machine learning scoring functions for protein–ligand binding affinity prediction have been found to consistently outperform classical scoring functions. Structure-based scoring functions for universal affinity prediction typically use features describing interactions derived from the protein–ligand complex, with limited information about the chemical or topological properties of the ligand itself.
Results
We demonstrate that the performance of machine learning scoring functions is consistently improved by the inclusion of diverse ligand-based features. For example, a Random Forest (RF) combining the features of RF-Score v3 with RDKit molecular descriptors achieved Pearson correlation coefficients of up to 0.836, 0.780 and 0.821 on the PDBbind 2007, 2013 and 2016 core sets, respectively, compared to 0.790, 0.746 and 0.814 when using the features of RF-Score v3 alone. Excluding proteins and/or ligands that are similar to those in the test sets from the training set has a significant effect on scoring function performance, but does not remove the predictive power of ligand-based features. Furthermore, an RF using only ligand-based features is predictive at a level similar to classical scoring functions, and it appears to be predicting the mean binding affinity of a ligand for its protein targets.
Availability and implementation
Data and code to reproduce all the results are freely available at http://opig.stats.ox.ac.uk/resources.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Fergus Boyles
- Department of Statistics, University of Oxford, Oxford, UK
| |
|
39
|
Nguyen DD, Wei GW. AGL-Score: Algebraic Graph Learning Score for Protein-Ligand Binding Scoring, Ranking, Docking, and Screening. J Chem Inf Model 2019; 59:3291-3304. [PMID: 31257871 PMCID: PMC6664294 DOI: 10.1021/acs.jcim.9b00334] [Citation(s) in RCA: 121] [Impact Index Per Article: 24.2] [Indexed: 12/13/2022]
Abstract
Although algebraic graph theory-based models have been widely applied in physical modeling and molecular studies, they are typically incompetent in the analysis and prediction of biomolecular properties, confirming the common belief that "one cannot hear the shape of a drum". A new development in the century-old question about the spectrum-geometry relationship is provided. Novel algebraic graph learning score (AGL-Score) models are proposed to encode high-dimensional physical and biological information into intrinsically low-dimensional representations. The proposed AGL-Score models employ multiscale weighted colored subgraphs to describe crucial molecular and biomolecular interactions in terms of graph invariants derived from graph Laplacian, its pseudo-inverse, and adjacency matrices. Additionally, AGL-Score models are integrated with an advanced machine learning algorithm to predict biomolecular macroscopic properties from the low-dimensional graph representation of biomolecular structures. The proposed AGL-Score models are extensively validated for their scoring power, ranking power, docking power, and screening power via a number of benchmark datasets, namely CASF-2007, CASF-2013, and CASF-2016. Numerical results indicate that the proposed AGL-Score models are able to outperform other state-of-the-art scoring functions in protein-ligand binding scoring, ranking, docking, and screening. This study indicates that machine learning methods are powerful tools for molecular docking and virtual screening. It also indicates that spectral geometry or spectral graph theory has the ability to infer geometric properties.
Affiliation(s)
- Duc Duy Nguyen
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States
| |
|
40
|
Yang X, Wang Y, Byrne R, Schneider G, Yang S. Concepts of Artificial Intelligence for Computer-Assisted Drug Discovery. Chem Rev 2019; 119:10520-10594. [PMID: 31294972 DOI: 10.1021/acs.chemrev.8b00728] [Citation(s) in RCA: 342] [Impact Index Per Article: 68.4] [Indexed: 02/05/2023]
Abstract
Artificial intelligence (AI), and, in particular, deep learning as a subcategory of AI, provides opportunities for the discovery and development of innovative drugs. Various machine learning approaches have recently (re)emerged, some of which may be considered instances of domain-specific AI which have been successfully employed for drug discovery and design. This review provides a comprehensive portrayal of these machine learning techniques and of their applications in medicinal chemistry. After introducing the basic principles, alongside some application notes, of the various machine learning algorithms, the current state of the art of AI-assisted pharmaceutical discovery is discussed, including applications in structure- and ligand-based virtual screening, de novo drug design, physicochemical and pharmacokinetic property prediction, drug repurposing, and related aspects. Finally, several challenges and limitations of the current methods are summarized, with a view to potential future directions for AI-assisted drug discovery and design.
Affiliation(s)
- Xin Yang
- State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Yifei Wang
- State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Ryan Byrne
- ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 4, CH-8093 Zurich, Switzerland
- Gisbert Schneider
- ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 4, CH-8093 Zurich, Switzerland
- Shengyong Yang
- State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
41
Li H, Peng J, Sidorov P, Leung Y, Leung KS, Wong MH, Lu G, Ballester PJ. Classical scoring functions for docking are unable to exploit large volumes of structural and interaction data. Bioinformatics 2019; 35:3989-3995. [DOI: 10.1093/bioinformatics/btz183] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Received: 11/07/2018] [Revised: 02/04/2019] [Accepted: 03/13/2019] [Indexed: 12/15/2022] Open
Abstract
Motivation
Studies have shown that the accuracy of random forest (RF)-based scoring functions (SFs), such as RF-Score-v3, increases with more training samples, whereas that of classical SFs, such as X-Score, does not. Nevertheless, the impact of the similarity between training and test samples on this matter has not been studied in a systematic manner. It is therefore unclear how these SFs would perform when only trained on protein-ligand complexes that are highly dissimilar or highly similar to the test set. It is also unclear whether SFs based on machine learning algorithms other than RF can also improve accuracy with increasing training set size and to what extent they learn from dissimilar or similar training complexes.
Results
We present a systematic study to investigate how the accuracy of classical and machine-learning SFs varies with protein-ligand complex similarities between training and test sets. We considered three types of similarity metrics, based on the comparison of either protein structures, protein sequences or ligand structures. Regardless of the similarity metric, we found that incorporating a larger proportion of similar complexes to the training set did not make classical SFs more accurate. In contrast, RF-Score-v3 was able to outperform X-Score even when trained on just 32% of the most dissimilar complexes, showing that its superior performance owes considerably to learning from dissimilar training complexes to those in the test set. In addition, we generated the first SF employing Extreme Gradient Boosting (XGBoost), XGB-Score, and observed that it also improves with training set size while outperforming the rest of SFs. Given the continuous growth of training datasets, the development of machine-learning SFs has become very appealing.
Availability and implementation
https://github.com/HongjianLi/MLSF
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hongjian Li
- SDIVF R&D Centre, Hong Kong Science Park, Sha Tin, New Territories, Hong Kong
- CUHK-SDU Joint Laboratory on Reproductive Genetics, School of Biomedical Sciences, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong
- Jiangjun Peng
- School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China
- Pavel Sidorov
- Cancer Research Center of Marseille CRCM, INSERM, Institut Paoli-Calmettes, Aix-Marseille University, CNRS, F-13009 Marseille, France
- Kwong-Sak Leung
- Institute of Future Cities
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong
- Man-Hon Wong
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong
- Gang Lu
- CUHK-SDU Joint Laboratory on Reproductive Genetics, School of Biomedical Sciences, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong
- Pedro J Ballester
- Cancer Research Center of Marseille CRCM, INSERM, Institut Paoli-Calmettes, Aix-Marseille University, CNRS, F-13009 Marseille, France
42
Nguyen DD, Wei GW. DG-GL: Differential geometry-based geometric learning of molecular datasets. International Journal for Numerical Methods in Biomedical Engineering 2019; 35:e3179. [PMID: 30693661 PMCID: PMC6598676 DOI: 10.1002/cnm.3179] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 09/25/2018] [Revised: 11/21/2018] [Accepted: 12/06/2018] [Indexed: 05/11/2023]
Abstract
Motivation
Despite its great success in various physical modeling, differential geometry (DG) has rarely been devised as a versatile tool for analyzing large, diverse, and complex molecular and biomolecular datasets because of the limited understanding of its potential power in dimensionality reduction and its ability to encode essential chemical and biological information in differentiable manifolds.
Results
We put forward a differential geometry-based geometric learning (DG-GL) hypothesis that the intrinsic physics of three-dimensional (3D) molecular structures lies on a family of low-dimensional manifolds embedded in a high-dimensional data space. We encode crucial chemical, physical, and biological information into 2D element interactive manifolds, extracted from a high-dimensional structural data space via a multiscale discrete-to-continuum mapping using differentiable density estimators. Differential geometry apparatuses are utilized to construct element interactive curvatures in analytical forms for certain analytically differentiable density estimators. These low-dimensional differential geometry representations are paired with a robust machine learning algorithm to showcase their descriptive and predictive powers for large, diverse, and complex molecular and biomolecular datasets. Extensive numerical experiments are carried out to demonstrate that the proposed DG-GL strategy outperforms other advanced methods in the predictions of drug discovery-related protein-ligand binding affinity, drug toxicity, and molecular solvation free energy.
Availability and implementation
http://weilab.math.msu.edu/DG-GL/
Contact: wei@math.msu.edu.
Affiliation(s)
- Duc Duy Nguyen
- Department of Mathematics, Michigan State University, East Lansing, 48824, Michigan
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, 48824, Michigan
- Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, Michigan
43
Wójcikowski M, Siedlecki P, Ballester PJ. Building Machine-Learning Scoring Functions for Structure-Based Prediction of Intermolecular Binding Affinity. Methods Mol Biol 2019; 2053:1-12. [PMID: 31452095 DOI: 10.1007/978-1-4939-9752-7_1] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 06/10/2023]
Abstract
Molecular docking enables large-scale prediction of whether and how small molecules bind to a macromolecular target. Machine-learning scoring functions are particularly well suited to predict the strength of this interaction. Here we describe how to build RF-Score, a scoring function utilizing the machine-learning technique known as Random Forest (RF). We also point out how to use different data, features, and regression models using either R or Python programming languages.
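As a rough illustration of the kind of input a scoring function like RF-Score is built on, the sketch below counts protein-ligand atom pairs per element pair within a distance cutoff. It is not the chapter's code. The function name `contact_counts`, the toy coordinates, and the 12 Å default cutoff are assumptions chosen for illustration, and the Random Forest regression step that would consume these counts is omitted.

```python
from collections import Counter
from math import dist


def contact_counts(protein_atoms, ligand_atoms, cutoff=12.0):
    """Count protein-ligand atom pairs per (protein element, ligand element).

    Each atom is a tuple (element, (x, y, z)) with coordinates in angstroms.
    The cutoff value is an assumption for this sketch.
    """
    counts = Counter()
    for p_elem, p_xyz in protein_atoms:
        for l_elem, l_xyz in ligand_atoms:
            if dist(p_xyz, l_xyz) < cutoff:
                counts[(p_elem, l_elem)] += 1
    return counts


# Hypothetical toy complex: three protein atoms and two ligand atoms.
protein = [("C", (0.0, 0.0, 0.0)), ("N", (3.0, 0.0, 0.0)), ("O", (20.0, 0.0, 0.0))]
ligand = [("C", (1.0, 0.0, 0.0)), ("O", (2.0, 1.0, 0.0))]
features = contact_counts(protein, ligand)
```

In a full workflow, one such count vector per complex would form a row of the feature matrix, paired with the measured binding affinity as the regression target.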
Affiliation(s)
- Pawel Siedlecki
- Institute of Biochemistry and Biophysics PAS, Warsaw, Poland
- Department of Systems Biology, Institute of Experimental Plant Biology and Biotechnology, University of Warsaw, Warsaw, Poland
- Pedro J Ballester
- Cancer Research Center of Marseille, INSERM U1068, Marseille, France
- Institut Paoli-Calmettes, Marseille, France
- Aix-Marseille Université, Marseille, France
- CNRS UMR7258, Marseille, France
44
Abstract
Recent progress in the development of scientific libraries with machine-learning techniques paved the way for the implementation of integrated computational tools to predict ligand-binding affinity. The prediction of binding affinity uses the atomic coordinates of protein-ligand complexes. These new computational tools made application of a broad spectrum of machine-learning techniques to study protein-ligand interactions possible. The essential aspect of these machine-learning approaches is to train a new computational model by using technologies such as supervised machine-learning techniques, convolutional neural network, and random forest to mention the most commonly applied methods. In this chapter, we focus on supervised machine-learning techniques and their applications in the development of protein-targeted scoring functions for the prediction of binding affinity. We discuss the development of the program SAnDReS and its application to the creation of machine-learning models to predict inhibition of cyclin-dependent kinase and HIV-1 protease. Moreover, we describe the scoring function space, and how to use it to explain the development of targeted scoring functions.
Affiliation(s)
- Gabriela Bitencourt-Ferreira
- Escola de Ciências da Saúde, Pontifícia Universidade Católica do Rio Grande do Sul-PUCRS, Porto Alegre, RS, Brazil
- Walter Filgueira de Azevedo
- Escola de Ciências da Saúde, Pontifícia Universidade Católica do Rio Grande do Sul-PUCRS, Porto Alegre, RS, Brazil
45
Guedes IA, Pereira FSS, Dardenne LE. Empirical Scoring Functions for Structure-Based Virtual Screening: Applications, Critical Aspects, and Challenges. Front Pharmacol 2018; 9:1089. [PMID: 30319422 PMCID: PMC6165880 DOI: 10.3389/fphar.2018.01089] [Citation(s) in RCA: 134] [Impact Index Per Article: 22.3] [Received: 07/01/2018] [Accepted: 09/07/2018] [Indexed: 12/19/2022] Open
Abstract
Structure-based virtual screening (VS) is a widely used approach that employs the knowledge of the three-dimensional structure of the target of interest in the design of new lead compounds from large-scale molecular docking experiments. Through the prediction of the binding mode and affinity of a small molecule within the binding site of the target of interest, it is possible to understand important properties related to the binding process. Empirical scoring functions are widely used for pose and affinity prediction. Although pose prediction is performed with satisfactory accuracy, the correct prediction of binding affinity is still a challenging task and crucial for the success of structure-based VS experiments. There are several efforts in distinct fronts to develop even more sophisticated and accurate models for filtering and ranking large libraries of compounds. This paper will cover some recent successful applications and methodological advances, including strategies to explore the ligand entropy and solvent effects, training with sophisticated machine-learning techniques, and the use of quantum mechanics. Particular emphasis will be given to the discussion of critical aspects and further directions for the development of more accurate empirical scoring functions.
Affiliation(s)
- Isabella A Guedes
- Grupo de Modelagem Molecular em Sistemas Biológicos, Laboratório Nacional de Computação Científica, Petrópolis, Brazil
- Felipe S S Pereira
- Grupo de Modelagem Molecular em Sistemas Biológicos, Laboratório Nacional de Computação Científica, Petrópolis, Brazil
- Laurent E Dardenne
- Grupo de Modelagem Molecular em Sistemas Biológicos, Laboratório Nacional de Computação Científica, Petrópolis, Brazil
46
Abstract
Motivation
The identification of novel drug-target (DT) interactions is a substantial part of the drug discovery process. Most of the computational methods that have been proposed to predict DT interactions have focused on binary classification, where the goal is to determine whether a DT pair interacts or not. However, protein-ligand interactions assume a continuum of binding strength values, also called binding affinity and predicting this value still remains a challenge. The increase in the affinity data available in DT knowledge-bases allows the use of advanced learning techniques such as deep learning architectures in the prediction of binding affinities. In this study, we propose a deep-learning based model that uses only sequence information of both targets and drugs to predict DT interaction binding affinities. The few studies that focus on DT binding affinity prediction use either 3D structures of protein-ligand complexes or 2D features of compounds. One novel approach used in this work is the modeling of protein sequences and compound 1D representations with convolutional neural networks (CNNs).
Results
The results show that the proposed deep learning based model that uses the 1D representations of targets and drugs is an effective approach for drug target binding affinity prediction. The model in which high-level representations of a drug and a target are constructed via CNNs achieved the best Concordance Index (CI) performance in one of our larger benchmark datasets, outperforming the KronRLS algorithm and SimBoost, a state-of-the-art method for DT binding affinity prediction.
Availability and implementation
https://github.com/hkmztrk/DeepDTA
Supplementary information
Supplementary data are available at Bioinformatics online.
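The Concordance Index (CI) reported in this abstract measures how often a model orders pairs of samples the same way as their measured affinities. The sketch below is a minimal pairwise implementation under common conventions (pairs with equal true affinity are skipped, tied predictions earn half credit). The function name and example values are illustrative assumptions, not taken from the DeepDTA repository.

```python
def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order.

    Pairs with identical true values are not comparable and are skipped.
    A tie in the predictions counts as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue
            comparable += 1
            d_true = y_true[i] - y_true[j]
            d_pred = y_pred[i] - y_pred[j]
            if d_pred == 0:
                concordant += 0.5
            elif (d_true > 0) == (d_pred > 0):
                concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical affinities and predictions for four drug-target pairs.
affinities = [5.0, 6.2, 7.1, 8.4]
predictions = [5.5, 6.0, 6.0, 9.0]
ci = concordance_index(affinities, predictions)
```

A CI of 1.0 means every comparable pair is ranked correctly, and 0.5 is what a constant predictor would score. This quadratic-time loop is fine for small sets; large benchmarks would usually call for a sorted or vectorized variant.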
Affiliation(s)
- Hakime Öztürk
- Department of Computer Engineering, Bogazici University, Istanbul, Turkey
- Arzucan Özgür
- Department of Computer Engineering, Bogazici University, Istanbul, Turkey
- Elif Ozkirimli
- Department of Chemical Engineering, Bogazici University, Istanbul, Turkey
47
Gaillard T. Evaluation of AutoDock and AutoDock Vina on the CASF-2013 Benchmark. J Chem Inf Model 2018; 58:1697-1706. [DOI: 10.1021/acs.jcim.8b00312] [Citation(s) in RCA: 138] [Impact Index Per Article: 23.0] [Indexed: 12/19/2022]
Affiliation(s)
- Thomas Gaillard
- Laboratoire de Biochimie (CNRS UMR7654), Department of Biology, Ecole Polytechnique, 91128 Palaiseau, France
48
Li H, Peng J, Leung Y, Leung KS, Wong MH, Lu G, Ballester PJ. The Impact of Protein Structure and Sequence Similarity on the Accuracy of Machine-Learning Scoring Functions for Binding Affinity Prediction. Biomolecules 2018. [PMID: 29538331 PMCID: PMC5871981 DOI: 10.3390/biom8010012] [Citation(s) in RCA: 39] [Impact Index Per Article: 6.5] [Indexed: 12/22/2022] Open
Abstract
It has recently been claimed that the outstanding performance of machine-learning scoring functions (SFs) is exclusively due to the presence of training complexes with highly similar proteins to those in the test set. Here, we revisit this question using 24 similarity-based training sets, a widely used test set, and four SFs. Three of these SFs employ machine learning instead of the classical linear regression approach of the fourth SF (X-Score which has the best test set performance out of 16 classical SFs). We have found that random forest (RF)-based RF-Score-v3 outperforms X-Score even when 68% of the most similar proteins are removed from the training set. In addition, unlike X-Score, RF-Score-v3 is able to keep learning with an increasing training set size, becoming substantially more predictive than X-Score when the full 1105 complexes are used for training. These results show that machine-learning SFs owe a substantial part of their performance to training on complexes with dissimilar proteins to those in the test set, against what has been previously concluded using the same data. Given that a growing amount of structural and interaction data will be available from academic and industrial sources, this performance gap between machine-learning SFs and classical SFs is expected to enlarge in the future.
Affiliation(s)
- Hongjian Li
- SDIVF R&D Centre, Hong Kong Science Park, Sha Tin, New Territories, Hong Kong, China
- Institute of Future Cities, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong, China
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong, China
- Jiangjun Peng
- Institute of Future Cities, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong, China
- School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an 710049, China
- Yee Leung
- Institute of Future Cities, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong, China
- Kwong-Sak Leung
- Institute of Future Cities, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong, China
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong, China
- Man-Hon Wong
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong, China
- Gang Lu
- School of Biomedical Sciences, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong, China
- Pedro J Ballester
- Cancer Research Center of Marseille, INSERM U1068, F-13009 Marseille, France
- Institut Paoli-Calmettes, F-13009 Marseille, France
- Aix-Marseille Université, F-13284 Marseille, France
- CNRS UMR7258, F-13009 Marseille, France
49
Cang Z, Wei GW. Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction. International Journal for Numerical Methods in Biomedical Engineering 2018; 34. [PMID: 28677268 DOI: 10.1002/cnm.2914] [Citation(s) in RCA: 93] [Impact Index Per Article: 15.5] [Received: 05/14/2017] [Revised: 06/27/2017] [Accepted: 06/29/2017] [Indexed: 05/17/2023]
Abstract
Protein-ligand binding is a fundamental biological process that is paramount to many other biological processes, such as signal transduction, metabolic pathways, enzyme construction, cell secretion, and gene expression. Accurate prediction of protein-ligand binding affinities is vital to rational drug design and the understanding of protein-ligand binding and binding induced function. Existing binding affinity prediction methods are inundated with geometric detail and involve excessively high dimensions, which undermines their predictive power for massive binding data. Topology provides the ultimate level of abstraction and thus incurs too much reduction in geometric information. Persistent homology embeds geometric information into topological invariants and bridges the gap between complex geometry and abstract topology. However, it oversimplifies biological information. This work introduces element specific persistent homology (ESPH) or multicomponent persistent homology to retain crucial biological information during topological simplification. The combination of ESPH and machine learning gives rise to a powerful paradigm for macromolecular analysis. Tests on 2 large data sets indicate that the proposed topology-based machine-learning paradigm outperforms other existing methods in protein-ligand binding affinity predictions. ESPH reveals protein-ligand binding mechanisms that cannot be attained from other conventional techniques. The present approach reveals that protein-ligand hydrophobic interactions are extended to 40 Å away from the binding site, which has a significant ramification to drug and protein design.
Affiliation(s)
- Zixuan Cang
- Department of Mathematics, Michigan State University, MI 48824, USA
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, MI 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA
- Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA
50
Jiménez J, Škalič M, Martínez-Rosell G, De Fabritiis G. KDEEP: Protein–Ligand Absolute Binding Affinity Prediction via 3D-Convolutional Neural Networks. J Chem Inf Model 2018; 58:287-296. [DOI: 10.1021/acs.jcim.7b00650] [Citation(s) in RCA: 389] [Impact Index Per Article: 64.8] [Indexed: 01/02/2023]
Affiliation(s)
- José Jiménez
- Computational Biophysics Laboratory, Universitat Pompeu Fabra, Parc de Recerca Biomèdica de Barcelona, Carrer del Dr. Aiguader 88, Barcelona 08003, Spain
- Miha Škalič
- Computational Biophysics Laboratory, Universitat Pompeu Fabra, Parc de Recerca Biomèdica de Barcelona, Carrer del Dr. Aiguader 88, Barcelona 08003, Spain
- Gerard Martínez-Rosell
- Computational Biophysics Laboratory, Universitat Pompeu Fabra, Parc de Recerca Biomèdica de Barcelona, Carrer del Dr. Aiguader 88, Barcelona 08003, Spain
- Gianni De Fabritiis
- Computational Biophysics Laboratory, Universitat Pompeu Fabra, Parc de Recerca Biomèdica de Barcelona, Carrer del Dr. Aiguader 88, Barcelona 08003, Spain
- Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig Lluis Companys 23, 08010 Barcelona, Spain