1
|
Ferraz-Caetano J, Teixeira F, Cordeiro MNDS. Data-driven, explainable machine learning model for predicting volatile organic compounds' standard vaporization enthalpy. CHEMOSPHERE 2024; 359:142257. [PMID: 38719116 DOI: 10.1016/j.chemosphere.2024.142257] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/23/2024] [Revised: 04/18/2024] [Accepted: 05/04/2024] [Indexed: 05/21/2024]
Abstract
The accurate prediction of standard vaporization enthalpy (ΔvapHm°) for volatile organic compounds (VOCs) is of paramount importance in environmental chemistry, industrial applications and regulatory compliance. To overcome traditional experimental methods for predicting ΔvapHm° of VOCs, machine learning (ML) models enable a high-throughput, cost-effective property estimation. But despite a rising momentum, existing ML algorithms still present limitations in prediction accuracy and broad chemical applications. In this work, we present a data driven, explainable supervised ML model to predict ΔvapHm° of VOCs. The model was built on an established experimental database of 2410 unique molecules and 223 VOCs categorized by chemical groups. Using supervised ML regression algorithms, the Random Forest successfully predicted VOCs' ΔvapHm° with a mean absolute error of 3.02 kJ mol-1 and a 95% test score. The model was successfully validated through the prediction of ΔvapHm° for a known database of VOCs and through molecular group hold-out tests. Through chemical feature importance analysis, this explainable model revealed that VOC polarizability, connectivity indexes and electrotopological state are key for the model's prediction accuracy. We thus present a replicable and explainable model, which can be further expanded towards the prediction of other thermodynamic properties of VOCs.
Collapse
Affiliation(s)
- José Ferraz-Caetano
- LAQV-REQUIMTE - Department of Chemistry and Biochemistry - Faculty of Sciences, University of Porto - Rua do Campo Alegre, S/N, 4169-007, Porto, Portugal.
| | - Filipe Teixeira
- CQUM - Centre of Chemistry, University of Minho, Campus de Gualtar, 4710-057, Braga, Portugal
| | - M Natália D S Cordeiro
- LAQV-REQUIMTE - Department of Chemistry and Biochemistry - Faculty of Sciences, University of Porto - Rua do Campo Alegre, S/N, 4169-007, Porto, Portugal.
| |
Collapse
|
2
|
Duan Y, Yang X, Zeng X, Wang W, Deng Y, Cao D. Enhancing Molecular Property Prediction through Task-Oriented Transfer Learning: Integrating Universal Structural Insights and Domain-Specific Knowledge. J Med Chem 2024; 67:9575-9586. [PMID: 38748846 DOI: 10.1021/acs.jmedchem.4c00692] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/14/2024]
Abstract
Precisely predicting molecular properties is crucial in drug discovery, but the scarcity of labeled data poses a challenge for applying deep learning methods. While large-scale self-supervised pretraining has proven an effective solution, it often neglects domain-specific knowledge. To tackle this issue, we introduce Task-Oriented Multilevel Learning based on BERT (TOML-BERT), a dual-level pretraining framework that considers both structural patterns and domain knowledge of molecules. TOML-BERT achieved state-of-the-art prediction performance on 10 pharmaceutical datasets. It has the capability to mine contextual information within molecular structures and extract domain knowledge from massive pseudo-labeled data. The dual-level pretraining accomplished significant positive transfer, with its two components making complementary contributions. Interpretive analysis elucidated that the effectiveness of the dual-level pretraining lies in the prior learning of a task-related molecular representation. Overall, TOML-BERT demonstrates the potential of combining multiple pretraining tasks to extract task-oriented knowledge, advancing molecular property prediction in drug discovery.
Collapse
Affiliation(s)
- Yanjing Duan
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha Hunan 410013, P. R. China
| | - Xixi Yang
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410013, P. R. China
| | - Xiangxiang Zeng
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410013, P. R. China
| | - Wenxuan Wang
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha Hunan 410013, P. R. China
| | - Youchao Deng
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha Hunan 410013, P. R. China
| | - Dongsheng Cao
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha Hunan 410013, P. R. China
- Institute for Advancing Translational Medicine in Bone & Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR 999077, P. R. China
| |
Collapse
|
3
|
Xia X, Liu Y, Zheng C, Zhang X, Wu Q, Gao X, Zeng X, Su Y. Evolutionary Multiobjective Molecule Optimization in an Implicit Chemical Space. J Chem Inf Model 2024. [PMID: 38870455 DOI: 10.1021/acs.jcim.4c00031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/15/2024]
Abstract
Optimization techniques play a pivotal role in advancing drug development, serving as the foundation of numerous generative methods tailored to efficiently design optimized molecules derived from existing lead compounds. However, existing methods often encounter difficulties in generating diverse, novel, and high-property molecules that simultaneously optimize multiple drug properties. To overcome this bottleneck, we propose a multiobjective molecule optimization framework (MOMO). MOMO employs a specially designed Pareto-based multiproperty evaluation strategy at the molecular sequence level to guide the evolutionary search in an implicit chemical space. A comparative analysis of MOMO with five state-of-the-art methods across two benchmark multiproperty molecule optimization tasks reveals that MOMO markedly outperforms them in terms of diversity, novelty, and optimized properties. The practical applicability of MOMO in drug discovery has also been validated on four challenging tasks in the real-world discovery problem. These results suggest that MOMO can provide a useful tool to facilitate molecule optimization problems with multiple properties.
Collapse
Affiliation(s)
- Xin Xia
- The Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Artificial Intelligence, Anhui University, Hefei 230601, China
- Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, 5089 Wangjiang West Road, Hefei 230088, AnhuiChina
| | - Yiping Liu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410012, China
| | - Chunhou Zheng
- The Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Artificial Intelligence, Anhui University, Hefei 230601, China
| | - Xingyi Zhang
- The Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Artificial Intelligence, Anhui University, Hefei 230601, China
| | - Qingwen Wu
- The Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Artificial Intelligence, Anhui University, Hefei 230601, China
| | - Xin Gao
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Xiangxiang Zeng
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410012, China
| | - Yansen Su
- The Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Artificial Intelligence, Anhui University, Hefei 230601, China
- Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, 5089 Wangjiang West Road, Hefei 230088, AnhuiChina
| |
Collapse
|
4
|
Aksamit N, Hou J, Li Y, Ombuki-Berman B. Integrating transformers and many-objective optimization for drug design. BMC Bioinformatics 2024; 25:208. [PMID: 38849719 PMCID: PMC11161990 DOI: 10.1186/s12859-024-05822-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/07/2024] [Accepted: 05/30/2024] [Indexed: 06/09/2024] Open
Abstract
BACKGROUND Drug design is a challenging and important task that requires the generation of novel and effective molecules that can bind to specific protein targets. Artificial intelligence algorithms have recently showed promising potential to expedite the drug design process. However, existing methods adopt multi-objective approaches which limits the number of objectives. RESULTS In this paper, we expand this thread of research from the many-objective perspective, by proposing a novel framework that integrates a latent Transformer-based model for molecular generation, with a drug design system that incorporates absorption, distribution, metabolism, excretion, and toxicity prediction, molecular docking, and many-objective metaheuristics. We compared the performance of two latent Transformer models (ReLSO and FragNet) on a molecular generation task and show that ReLSO outperforms FragNet in terms of reconstruction and latent space organization. We then explored six different many-objective metaheuristics based on evolutionary algorithms and particle swarm optimization on a drug design task involving potential drug candidates to human lysophosphatidic acid receptor 1, a cancer-related protein target. CONCLUSION We show that multi-objective evolutionary algorithm based on dominance and decomposition performs the best in terms of finding molecules that satisfy many objectives, such as high binding affinity and low toxicity, and high drug-likeness. Our framework demonstrates the potential of combining Transformers and many-objective computational intelligence for drug design.
Collapse
Affiliation(s)
- Nicholas Aksamit
- Department of Computer Science, Brock University, 1812 Sir Isaac Brock Way, St. Catharines, ON, L2S 3A1, Canada
| | - Jinqiang Hou
- Department of Chemistry, Lakehead University, 955 Oliver Road, Thunder Bay, ON, P7B 5E1, Canada
- Thunder Bay Regional Health Research Institute, 980 Oliver Road, Thunder Bay, ON, P7B 6V4, Canada
| | - Yifeng Li
- Department of Computer Science, Brock University, 1812 Sir Isaac Brock Way, St. Catharines, ON, L2S 3A1, Canada.
- Department of Biological Sciences, Brock University, 1812 Sir Isaac Brock Way, St. Catharines, ON, L2S 3A1, Canada.
| | - Beatrice Ombuki-Berman
- Department of Computer Science, Brock University, 1812 Sir Isaac Brock Way, St. Catharines, ON, L2S 3A1, Canada.
| |
Collapse
|
5
|
Tran TTV, Tayara H, Chong KT. AMPred-CNN: Ames mutagenicity prediction model based on convolutional neural networks. Comput Biol Med 2024; 176:108560. [PMID: 38754218 DOI: 10.1016/j.compbiomed.2024.108560] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/19/2024] [Revised: 04/15/2024] [Accepted: 05/05/2024] [Indexed: 05/18/2024]
Abstract
Mutagenicity assessment plays a pivotal role in the safety evaluation of chemicals, pharmaceuticals, and environmental compounds. In recent years, the development of robust computational models for predicting chemical mutagenicity has gained significant attention, driven by the need for efficient and cost-effective toxicity assessments. In this paper, we proposed AMPred-CNN, an innovative Ames mutagenicity prediction model based on Convolutional Neural Networks (CNNs), uniquely employing molecular structures as images to leverage CNNs' powerful feature extraction capabilities. The study employs the widely used benchmark mutagenicity dataset from Hansen et al. for model development and evaluation. Comparative analyses with traditional ML models on different molecular features reveal substantial performance enhancements. AMPred-CNN outshines these models, demonstrating superior accuracy, AUC, F1 score, MCC, sensitivity, and specificity on the test set. Notably, AMPred-CNN is further benchmarked against seven recent ML and DL models, consistently showcasing superior performance with an impressive AUC of 0.954. Our study highlights the effectiveness of CNNs in advancing mutagenicity prediction, paving the way for broader applications in toxicology and drug development.
Collapse
Affiliation(s)
- Thi Tuyet Van Tran
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea; Faculty of Information Technology, An Giang University, Long Xuyen 880000, Viet Nam; Vietnam National University-Ho Chi Minh City, Ho Chi Minh 700000, Viet Nam.
| | - Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, Republic of Korea.
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea; Advances Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, Republic of Korea.
| |
Collapse
|
6
|
Zhang VY, O'Connor SL, Welsh WJ, James MH. Machine learning models to predict ligand binding affinity for the orexin 1 receptor. ARTIFICIAL INTELLIGENCE CHEMISTRY 2024; 2:100040. [PMID: 38476266 PMCID: PMC10927255 DOI: 10.1016/j.aichem.2023.100040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/14/2024]
Abstract
The orexin 1 receptor (OX1R) is a G-protein coupled receptor that regulates a variety of physiological processes through interactions with the neuropeptides orexin A and B. Selective OX1R antagonists exhibit therapeutic effects in preclinical models of several behavioral disorders, including drug seeking and overeating. However, currently there are no selective OX1R antagonists approved for clinical use, fueling demand for novel compounds that act at this target. In this study, we meticulously curated a dataset comprising over 1300 OX1R ligands using a stringent filter and criteria cascade. Subsequently, we developed highly predictive quantitative structure-activity relationship (QSAR) models employing the optimized hyper-parameters for the random forest machine learning algorithm and twelve 2D molecular descriptors selected by recursive feature elimination with a 5-fold cross-validation process. The predictive capacity of the QSAR model was further assessed using an external test set and enrichment study, confirming its high predictivity. The practical applicability of our final QSAR model was demonstrated through virtual screening of the DrugBank database. This revealed two FDA-approved drugs (isavuconazole and cabozantinib) as potential OX1R ligands, confirmed by radiolabeled OX1R binding assays. To our best knowledge, this study represents the first report of highly predictive QSAR models on a large comprehensive dataset of diverse OX1R ligands, which should prove useful for the discovery and design of new compounds targeting this receptor.
Collapse
Affiliation(s)
- Vanessa Y Zhang
- Department of Psychiatry, Robert Wood Johnson Medical School, Rutgers University and Rutgers Biomedical Health Sciences, Piscataway, NJ, USA
- Brain Health Institute, Rutgers University and Rutgers Biomedical and Health Sciences, Piscataway, NJ, USA
- West Windsor-Plainsboro High School South, West Windsor, NJ, USA
| | - Shayna L O'Connor
- Department of Psychiatry, Robert Wood Johnson Medical School, Rutgers University and Rutgers Biomedical Health Sciences, Piscataway, NJ, USA
- Brain Health Institute, Rutgers University and Rutgers Biomedical and Health Sciences, Piscataway, NJ, USA
| | - William J Welsh
- Department of Pharmacology, Robert Wood Johnson Medical School, Rutgers University and Rutgers Biomedical Health Sciences, Piscataway, NJ, USA
| | - Morgan H James
- Department of Psychiatry, Robert Wood Johnson Medical School, Rutgers University and Rutgers Biomedical Health Sciences, Piscataway, NJ, USA
- Brain Health Institute, Rutgers University and Rutgers Biomedical and Health Sciences, Piscataway, NJ, USA
| |
Collapse
|
7
|
Das M, Ghosh A, Sunoj RB. Advances in machine learning with chemical language models in molecular property and reaction outcome predictions. J Comput Chem 2024; 45:1160-1176. [PMID: 38299229 DOI: 10.1002/jcc.27315] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2023] [Revised: 01/06/2024] [Accepted: 01/09/2024] [Indexed: 02/02/2024]
Abstract
Molecular properties and reactions form the foundation of chemical space. Over the years, innumerable molecules have been synthesized, a smaller fraction of them found immediate applications, while a larger proportion served as a testimony to creative and empirical nature of the domain of chemical science. With increasing emphasis on sustainable practices, it is desirable that a target set of molecules are synthesized preferably through a fewer empirical attempts instead of a larger library, to realize an active candidate. In this front, predictive endeavors using machine learning (ML) models built on available data acquire high timely significance. Prediction of molecular property and reaction outcome remain one of the burgeoning applications of ML in chemical science. Among several methods of encoding molecular samples for ML models, the ones that employ language like representations are gaining steady popularity. Such representations would additionally help adopt well-developed natural language processing (NLP) models for chemical applications. Given this advantageous background, herein we describe several successful chemical applications of NLP focusing on molecular property and reaction outcome predictions. From relatively simpler recurrent neural networks (RNNs) to complex models like transformers, different network architecture have been leveraged for tasks such as de novo drug design, catalyst generation, forward and retro-synthesis predictions. The chemical language model (CLM) provides promising avenues toward a broad range of applications in a time and cost-effective manner. While we showcase an optimistic outlook of CLMs, attention is also placed on the persisting challenges in reaction domain, which would optimistically be addressed by advanced algorithms tailored to chemical language and with increased availability of high-quality datasets.
Collapse
Affiliation(s)
- Manajit Das
- Department of Chemistry, Indian Institute of Technology Bombay, Mumbai, India
| | - Ankit Ghosh
- Department of Chemistry, Indian Institute of Technology Bombay, Mumbai, India
| | - Raghavan B Sunoj
- Department of Chemistry, Indian Institute of Technology Bombay, Mumbai, India
- Centre for Machine Intelligence and Data Science, Indian Institute of Technology Bombay, Mumbai, India
| |
Collapse
|
8
|
Du W, Zhao L, Wu R, Huang B, Liu S, Liu Y, Huang H, Shi G. Predicting drug-Protein interaction with deep learning framework for molecular graphs and sequences: Potential candidates against SAR-CoV-2. PLoS One 2024; 19:e0299696. [PMID: 38728335 PMCID: PMC11086825 DOI: 10.1371/journal.pone.0299696] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2023] [Accepted: 02/14/2024] [Indexed: 05/12/2024] Open
Abstract
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) caused the COVID-19 disease, which represents a new life-threatening disaster. Regarding viral infection, many therapeutics have been investigated to alleviate the epidemiology such as vaccines and receptor decoys. However, the continuous mutating coronavirus, especially the variants of Delta and Omicron, are tended to invalidate the therapeutic biological product. Thus, it is necessary to develop molecular entities as broad-spectrum antiviral drugs. Coronavirus replication is controlled by the viral 3-chymotrypsin-like cysteine protease (3CLpro) enzyme, which is required for the virus's life cycle. In the cases of severe acute respiratory syndrome coronavirus (SARS-CoV) and middle east respiratory syndrome coronavirus (MERS-CoV), 3CLpro has been shown to be a promising therapeutic development target. Here we proposed an attention-based deep learning framework for molecular graphs and sequences, training from the BindingDB 3CLpro dataset (114,555 compounds). After construction of such model, we conducted large-scale screening the in vivo/vitro dataset (276,003 compounds) from Zinc Database and visualize the candidate compounds with attention score. geometric-based affinity prediction was employed for validation. Finally, we established a 3CLpro-specific deep learning framework, namely GraphDPI-3CL (AUROC: 0.958) achieved superior performance beyond the existing state of the art model and discovered 10 molecules with a high binding affinity of 3CLpro and superior binding mode.
Collapse
Affiliation(s)
- Weian Du
- Department of Dermatology, Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
| | - Liang Zhao
- Shenzhen Health Development Research and Data Management Center, Shenzhen, China
| | - Rong Wu
- Department of Dermatology, Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
| | - Boning Huang
- School of Finance, Shanghai University of Finance and Economics, Shanghai, China
| | - Si Liu
- Department of Cosmetic and Plastic Surgery, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
| | - Yufeng Liu
- Department of Cosmetic and Plastic Surgery, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
| | - Huaiqiu Huang
- Department of Dermatology, Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
| | - Ge Shi
- Department of Cosmetic and Plastic Surgery, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
| |
Collapse
|
9
|
Qiu X, Wang H, Tan X, Fang Z. G-K BertDTA: A graph representation learning and semantic embedding-based framework for drug-target affinity prediction. Comput Biol Med 2024; 173:108376. [PMID: 38552281 DOI: 10.1016/j.compbiomed.2024.108376] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2023] [Revised: 03/21/2024] [Accepted: 03/24/2024] [Indexed: 04/17/2024]
Abstract
Developing new drugs is costly, time-consuming, and risky. Drug-target affinity (DTA), indicating the binding capability between drugs and target proteins, is a crucial indicator for drug development. Accurately predicting interaction strength between new drug-target pairs by analyzing previous experiments aids in screening potential drug molecules, repurposing them, and developing safe and effective medicines. Existing computational models for DTA prediction rely on strings or single-graph neural networks, lacking consideration of protein structure and molecular semantic information, leading to limited accuracy. Our experiments demonstrate that string-based methods may overlook protein conformations, causing a high root mean square error (RMSE) of 3.584 in affinity due to a lack of spatial context. Single graph networks also underperform on topology features, with a 6% lower confidence interval (CI) for activity classification. Absent semantic information also limits generalization across diverse compounds, resulting in 18% increment in RMSE and 5% in misclassifications within quantifications study, restricting potential drug discovery. To address these limitations, we propose G-K BertDTA, a novel framework for accurate DTA prediction incorporating protein features, molecular semantic features, and molecular structural information. In this proposed model, we represent drugs as graphs, with a GIN employed to learn the molecular topological information. For the extraction of protein structural features, we utilize a DenseNet architecture. A knowledge-based BERT semantic model is incorporated to obtain rich pre-trained semantic embeddings, thereby enhancing the feature information. We extensively evaluated our proposed approach on the publicly available benchmark datasets (i.e., KIBA and Davis), and experimental results demonstrate the promising performance of our method, which consistently outperforms previous state-of-the-art approaches. Code is available at https://github.com/AmbitYuki/G-K-BertDTA.
Collapse
Affiliation(s)
- Xihe Qiu
- School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai, China
| | - Haoyu Wang
- School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai, China
| | - Xiaoyu Tan
- INF Technology (Shanghai) Co., Ltd., Shanghai, China
| | - Zhijun Fang
- School of Computer Science and Technology, Donghua University, Shanghai, China.
| |
Collapse
|
10
|
Schieferdecker S, Rottach F, Vock E. In Silico Prediction of Oral Acute Rodent Toxicity Using Consensus Machine Learning. J Chem Inf Model 2024; 64:3114-3122. [PMID: 38498695 DOI: 10.1021/acs.jcim.4c00056] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/20/2024]
Abstract
Acute oral toxicity (AOT) is required for the classification and labeling of chemicals according to the global harmonized system (GHS). Acute oral toxicity studies are optimized to minimize the use of animals. However, with the advent of the three Rs principles and machine learning in toxicology, alternative in silico methods became a reasonable alternative approach for addressing the AOT of new chemical matter. Here, we describe the compilation of AOT data from a commercial database and the development of a consensus classification model after evaluating different combinations of molecular representations and machine learning algorithms. The model shows significantly better performance compared to publicly available AOT models. Its performance was evaluated on an external validation data set, which was compiled from the literature, and an applicability domain was deduced.
Collapse
Affiliation(s)
| | - Florian Rottach
- Boehringer Ingelheim Pharma GmbH & Co. KG, 88397 Biberach, Germany
| | - Esther Vock
- Boehringer Ingelheim Pharma GmbH & Co. KG, 88397 Biberach, Germany
| |
Collapse
|
11
|
Lin J, He Y, Ru C, Long W, Li M, Wen Z. Advancing Adverse Drug Reaction Prediction with Deep Chemical Language Model for Drug Safety Evaluation. Int J Mol Sci 2024; 25:4516. [PMID: 38674100 PMCID: PMC11050562 DOI: 10.3390/ijms25084516] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2024] [Revised: 04/15/2024] [Accepted: 04/16/2024] [Indexed: 04/28/2024] Open
Abstract
The accurate prediction of adverse drug reactions (ADRs) is essential for comprehensive drug safety evaluation. Pre-trained deep chemical language models have emerged as powerful tools capable of automatically learning molecular structural features from large-scale datasets, showing promising capabilities for the downstream prediction of molecular properties. However, the performance of pre-trained chemical language models in predicting ADRs, especially idiosyncratic ADRs induced by marketed drugs, remains largely unexplored. In this study, we propose MoLFormer-XL, a pre-trained model for encoding molecular features from canonical SMILES, in conjunction with a CNN-based model to predict drug-induced QT interval prolongation (DIQT), drug-induced teratogenicity (DIT), and drug-induced rhabdomyolysis (DIR). Our results demonstrate that the proposed model outperforms conventional models applied in previous studies for predicting DIQT, DIT, and DIR. Notably, an analysis of the learned linear attention maps highlights amines, alcohol, ethers, and aromatic halogen compounds as strongly associated with the three types of ADRs. These findings hold promise for enhancing drug discovery pipelines and reducing the drug attrition rate due to safety concerns.
Collapse
Affiliation(s)
- Jinzhu Lin
- College of Chemistry, Sichuan University, Chengdu 610064, China
| | - Yujie He
- College of Chemistry, Sichuan University, Chengdu 610064, China
| | - Chengxiang Ru
- College of Chemistry, Sichuan University, Chengdu 610064, China
| | - Wulin Long
- College of Chemistry, Sichuan University, Chengdu 610064, China
| | - Menglong Li
- College of Chemistry, Sichuan University, Chengdu 610064, China
| | - Zhining Wen
- College of Chemistry, Sichuan University, Chengdu 610064, China
- Medical Big Data Center, Sichuan University, Chengdu 610064, China
| |
Collapse
|
12
|
Heyndrickx W, Mervin L, Morawietz T, Sturm N, Friedrich L, Zalewski A, Pentina A, Humbeck L, Oldenhof M, Niwayama R, Schmidtke P, Fechner N, Simm J, Arany A, Drizard N, Jabal R, Afanasyeva A, Loeb R, Verma S, Harnqvist S, Holmes M, Pejo B, Telenczuk M, Holway N, Dieckmann A, Rieke N, Zumsande F, Clevert DA, Krug M, Luscombe C, Green D, Ertl P, Antal P, Marcus D, Do Huu N, Fuji H, Pickett S, Acs G, Boniface E, Beck B, Sun Y, Gohier A, Rippmann F, Engkvist O, Göller AH, Moreau Y, Galtier MN, Schuffenhauer A, Ceulemans H. MELLODDY: Cross-pharma Federated Learning at Unprecedented Scale Unlocks Benefits in QSAR without Compromising Proprietary Information. J Chem Inf Model 2024; 64:2331-2344. [PMID: 37642660 PMCID: PMC11005050 DOI: 10.1021/acs.jcim.3c00799] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2023] [Indexed: 08/31/2023]
Abstract
Federated multipartner machine learning has been touted as an appealing and efficient method to increase the effective training data volume and thereby the predictivity of models, particularly when the generation of training data is resource-intensive. In the landmark MELLODDY project, indeed, each of ten pharmaceutical companies realized aggregated improvements on its own classification or regression models through federated learning. To this end, they leveraged a novel implementation extending multitask learning across partners, on a platform audited for privacy and security. The experiments involved an unprecedented cross-pharma data set of 2.6+ billion confidential experimental activity data points, documenting 21+ million physical small molecules and 40+ thousand assays in on-target and secondary pharmacodynamics and pharmacokinetics. Appropriate complementary metrics were developed to evaluate the predictive performance in the federated setting. In addition to predictive performance increases in labeled space, the results point toward an extended applicability domain in federated learning. Increases in collective training data volume, including by means of auxiliary data resulting from single concentration high-throughput and imaging assays, continued to boost predictive performance, albeit with a saturating return. Markedly higher improvements were observed for the pharmacokinetics and safety panel assay-based task subsets.
Collapse
Affiliation(s)
| | - Lewis Mervin
- AstraZeneca
R&D, Biomedical Campus, 1 Francis Crick Ave, Cambridge CB2 0SL, U.K.
| | - Tobias Morawietz
- Bayer
Pharma
AG, Global Drug Discovery, Chemical Research,
Computational Chemistry, Aprather Weg 18 a, Wuppertal 42096, Germany
| | - Noé Sturm
- Novartis
Institutes for BioMedical Research, Novartis Campus, Basel 4002, Switzerland
| | - Lukas Friedrich
- Merck KGaA, Global Research & Development, Frankfurter Strasse 250, Darmstadt 64293, Germany
| | - Adam Zalewski
- Amgen Research
(Munich) GmbH, Staffelseestraße
2, Munich 81477, Germany
| | - Anastasia Pentina
- Bayer AG, Machine Learning Research, Research & Development,
Pharmaceuticals, Berlin 10117, Germany
| | - Lina Humbeck
- BI Medicinal
Chemistry Department, Boehringer Ingelheim
Pharma GmbH & Co. KG, Birkendorfer Str. 65, Biberach an der Riss 88397, Germany
| | - Martijn Oldenhof
- KU
Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, Heverlee 3001, Belgium
| | - Ritsuya Niwayama
- Institut
de recherches Servier, 125 chemin de ronde Croissy-sur-Seine, Île-de-France 78290, France
| | | | - Nikolas Fechner
- Novartis
Institutes for BioMedical Research, Novartis Campus, Basel 4002, Switzerland
| | - Jaak Simm
- KU
Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, Heverlee 3001, Belgium
| | - Adam Arany
- KU
Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, Heverlee 3001, Belgium
| | | | - Rama Jabal
- Iktos, 65 rue de Prony, Paris 75017, France
| | - Arina Afanasyeva
- Modality
Informatics Group, Digital Research Solutions, Advanced Informatics
& Analytics, Astellas Pharma Inc., 21 Miyukigaoka, Tsukuba-shi, Ibaraki 305-8585, Japan
| | - Regis Loeb
- KU
Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, Heverlee 3001, Belgium
| | - Shlok Verma
- GlaxoSmithKline, Computational Sciences, Gunnels Wood Road Stevenage, Herts SG1 2NY, U.K.
| | - Simon Harnqvist
- GlaxoSmithKline, Computational Sciences, Gunnels Wood Road Stevenage, Herts SG1 2NY, U.K.
| | - Matthew Holmes
- GlaxoSmithKline, Computational Sciences, Gunnels Wood Road Stevenage, Herts SG1 2NY, U.K.
| | - Balazs Pejo
- Budapest
University of Technology and Economics, Department of Networked Systems and Services, Műegyetem rkp. 3, Budapest 1111, Hungary
| | | | - Nicholas Holway
- Novartis
Institutes for BioMedical Research, Novartis Campus, Basel 4002, Switzerland
| | - Arne Dieckmann
- Bayer
AG, API Production, Product Supply, Pharmaceuticals, Ernst-Schering-Straße 14, Bergkamen 59192, Germany
| | - Nicola Rieke
- NVIDIA
GmbH, Floessergasse 2, Munich 81369, Germany
| | | | - Djork-Arné Clevert
- Bayer AG, Machine Learning Research, Research & Development,
Pharmaceuticals, Berlin 10117, Germany
| | - Michael Krug
- Merck KGaA, Global Research & Development, Frankfurter Strasse 250, Darmstadt 64293, Germany
| | - Christopher Luscombe
- GlaxoSmithKline, Computational Sciences, Gunnels Wood Road Stevenage, Herts SG1 2NY, U.K.
| | - Darren Green
- GlaxoSmithKline, Computational Sciences, Gunnels Wood Road Stevenage, Herts SG1 2NY, U.K.
| | - Peter Ertl
- Novartis
Institutes for BioMedical Research, Novartis Campus, Basel 4002, Switzerland
| | - Peter Antal
- Budapest
University of Technology and Economics, Department of Measurement and Information Systems, Műegyetem rkp. 3, Budapest 1111, Hungary
| | - David Marcus
- GlaxoSmithKline, Computational Sciences, Gunnels Wood Road Stevenage, Herts SG1 2NY, U.K.
| | | | - Hideyoshi Fuji
- Modality
Informatics Group, Digital Research Solutions, Advanced Informatics
& Analytics, Astellas Pharma Inc., 21 Miyukigaoka, Tsukuba-shi, Ibaraki 305-8585, Japan
| | - Stephen Pickett
- GlaxoSmithKline, Computational Sciences, Gunnels Wood Road Stevenage, Herts SG1 2NY, U.K.
| | - Gergely Acs
- Budapest
University of Technology and Economics, Department of Networked Systems and Services, Műegyetem rkp. 3, Budapest 1111, Hungary
| | - Eric Boniface
- Substra
Foundation - Labelia Labs, 4 rue Voltaire, Nantes 44000, France
| | - Bernd Beck
- BI Medicinal
Chemistry Department, Boehringer Ingelheim
Pharma GmbH & Co. KG, Birkendorfer Str. 65, Biberach an der Riss 88397, Germany
| | - Yax Sun
- Amgen
Research, 1 Amgen Center
Drive, Thousand Oaks, California 92130, United States
| | - Arnaud Gohier
- Institut
de recherches Servier, 125 chemin de ronde Croissy-sur-Seine, Île-de-France 78290, France
| | - Friedrich Rippmann
- Merck KGaA, Global Research & Development, Frankfurter Strasse 250, Darmstadt 64293, Germany
| | - Ola Engkvist
- AstraZeneca, Molecular AI, Discovery Sciences,
R&D, Pepparedsleden
1, Mölndal 431 50, Sweden
| | - Andreas H. Göller
- Bayer
Pharma
AG, Global Drug Discovery, Chemical Research,
Computational Chemistry, Aprather Weg 18 a, Wuppertal 42096, Germany
| | - Yves Moreau
- KU
Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, Heverlee 3001, Belgium
| | | | - Ansgar Schuffenhauer
- Novartis
Institutes for BioMedical Research, Novartis Campus, Basel 4002, Switzerland
| | - Hugo Ceulemans
- Janssen
Pharmaceutica NV, Turnhoutseweg 30, Beerse 2340, Belgium
| |
Collapse
|
13
|
Matsukiyo Y, Yamanaka C, Yamanishi Y. De Novo Generation of Chemical Structures of Inhibitor and Activator Candidates for Therapeutic Target Proteins by a Transformer-Based Variational Autoencoder and Bayesian Optimization. J Chem Inf Model 2024; 64:2345-2355. [PMID: 37768595 DOI: 10.1021/acs.jcim.3c00824] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/29/2023]
Abstract
Deep generative models for molecular generation have been gaining much attention as structure generators to accelerate drug discovery. However, most previously developed methods are chemistry-centric approaches, and comprehensive biological responses in the cell have not been taken into account. In this study, we propose a novel computational method, TRIOMPHE-BOA (transcriptome-based inference and generation of molecules with desired phenotypes using the Bayesian optimization algorithm), to generate new chemical structures of inhibitor or activator candidates for therapeutic target proteins by integrating chemically and genetically perturbed transcriptome profiles. In the algorithm, the substructures of multiple molecules that were selected based on the transcriptome analysis are fused in the design of new chemical structures by exploring the latent space of a Transformer-based variational autoencoder using Bayesian optimization. Our results demonstrate the usefulness of the proposed method in terms of having high reproducibility of existing ligands for 10 therapeutic target proteins when compared with previous methods. Moreover, this method can be applied to proteins without detailed 3D structures or known ligands and is expected to become a powerful tool for more efficient hit identification.
Collapse
Affiliation(s)
- Yuki Matsukiyo
- Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
| | - Chikashige Yamanaka
- Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
| | - Yoshihiro Yamanishi
- Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
- Graduate School of Informatics, Nagoya University, Chikusa, Nagoya, Aichi 464-8601, Japan
| |
Collapse
|
14
|
Ferraz-Caetano J, Teixeira F, Cordeiro MNDS. Explainable Supervised Machine Learning Model To Predict Solvation Gibbs Energy. J Chem Inf Model 2024; 64:2250-2262. [PMID: 37603608 PMCID: PMC11005042 DOI: 10.1021/acs.jcim.3c00544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2023] [Indexed: 08/23/2023]
Abstract
Many challenges persist in developing accurate computational models for predicting solvation free energy (ΔGsol). Despite recent developments in Machine Learning (ML) methodologies that outperformed traditional quantum mechanical models, several issues remain concerning explanatory insights for broad chemical predictions with an acceptable speed-accuracy trade-off. To overcome this, we present a novel supervised ML model to predict the ΔGsol for an array of solvent-solute pairs. Using two different ensemble regressor algorithms, we made fast and accurate property predictions using open-source chemical features, encoding complex electronic, structural, and surface area descriptors for every solvent and solute. By integrating molecular properties and chemical interaction features, we have analyzed individual descriptor importance and optimized our model though explanatory information form feature groups. On aqueous and organic solvent databases, ML models revealed the predictive relevance of solutes with increasing polar surface area and decreasing polarizability, yielding better results than state-of-the-art benchmark Neural Network methods (without complex quantum mechanical or molecular dynamic simulations). Both algorithms successfully outperformed previous ΔGsol predictions methods, with a maximum absolute error of 0.22 ± 0.02 kcal mol-1, further validated in an external benchmark database and with solvent hold-out tests. With these explanatory and statistical insights, they allow a thoughtful application of this method for predicting other thermodynamic properties, stressing the relevance of ML modeling for further complex computational chemistry problems.
Collapse
Affiliation(s)
- José Ferraz-Caetano
- Department
of Chemistry and Biochemistry − Faculty of Sciences, University of Porto - Rua do Campo Alegre, S/N, 4169-007 Porto, Portugal
| | - Filipe Teixeira
- Centre
of Chemistry, University of Minho, Campus
de Gualtar, 4710-057 Braga, Portugal
| | - M. Natália D. S. Cordeiro
- Department
of Chemistry and Biochemistry − Faculty of Sciences, University of Porto - Rua do Campo Alegre, S/N, 4169-007 Porto, Portugal
| |
Collapse
|
15
|
Svensson E, Hoedt PJ, Hochreiter S, Klambauer G. HyperPCM: Robust Task-Conditioned Modeling of Drug-Target Interactions. J Chem Inf Model 2024; 64:2539-2553. [PMID: 38185877 PMCID: PMC11005051 DOI: 10.1021/acs.jcim.3c01417] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/04/2023] [Revised: 11/27/2023] [Accepted: 11/27/2023] [Indexed: 01/09/2024]
Abstract
A central problem in drug discovery is to identify the interactions between drug-like compounds and protein targets. Over the past few decades, various quantitative structure-activity relationship (QSAR) and proteo-chemometric (PCM) approaches have been developed to model and predict these interactions. While QSAR approaches solely utilize representations of the drug compound, PCM methods incorporate both representations of the protein target and the drug compound, enabling them to achieve above-chance predictive accuracy on previously unseen protein targets. Both QSAR and PCM approaches have recently been improved by machine learning and deep neural networks, that allow the development of drug-target interaction prediction models from measurement data. However, deep neural networks typically require large amounts of training data and cannot robustly adapt to new tasks, such as predicting interaction for unseen protein targets at inference time. In this work, we propose to use HyperNetworks to efficiently transfer information between tasks during inference and thus to accurately predict drug-target interactions on unseen protein targets. Our HyperPCM method reaches state-of-the-art performance compared to previous methods on multiple well-known benchmarks, including Davis, DUD-E, and a ChEMBL derived data set, and particularly excels at zero-shot inference involving unseen protein targets. Our method, as well as reproducible data preparation, is available at https://github.com/ml-jku/hyper-dti.
Collapse
Affiliation(s)
- Emma Svensson
- ELLIS
Unit Linz & Institute for Machine Learning, Johannes Kepler University, Linz 4040, Austria
- Molecular
AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, 431 83, Sweden
| | - Pieter-Jan Hoedt
- ELLIS
Unit Linz & Institute for Machine Learning, Johannes Kepler University, Linz 4040, Austria
| | - Sepp Hochreiter
- ELLIS
Unit Linz & Institute for Machine Learning, Johannes Kepler University, Linz 4040, Austria
- Institute
of Advanced Research in Artificial Intelligence (IARAI), Vienna 1030, Austria
| | - Günter Klambauer
- ELLIS
Unit Linz & Institute for Machine Learning, Johannes Kepler University, Linz 4040, Austria
| |
Collapse
|
16
|
Pang C, Qiao J, Zeng X, Zou Q, Wei L. Deep Generative Models in De Novo Drug Molecule Generation. J Chem Inf Model 2024; 64:2174-2194. [PMID: 37934070 DOI: 10.1021/acs.jcim.3c01496] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2023]
Abstract
The discovery of new drugs has important implications for human health. Traditional methods for drug discovery rely on experiments to optimize the structure of lead molecules, which are time-consuming and high-cost. Recently, artificial intelligence has exhibited promising and efficient performance for drug-like molecule generation. In particular, deep generative models achieve great success in de novo generation of drug-like molecules with desired properties, showing massive potential for novel drug discovery. In this study, we review the recent progress of molecule generation using deep generative models, mainly focusing on molecule representations, public databases, data processing tools, and advanced artificial intelligence based molecule generation frameworks. In particular, we present a comprehensive comparison of state-of-the-art deep generative models for molecule generation and a summary of commonly used molecular design strategies. We identify research gaps and challenges of molecule generation such as the need for better databases, missing 3D information in molecular representation, and the lack of high-precision evaluation metrics. We suggest future directions for molecular generation and drug discovery.
Collapse
Affiliation(s)
- Chao Pang
- School of Software, Shandong University, Jinan 250100, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250100, China
| | - Jianbo Qiao
- School of Software, Shandong University, Jinan 250100, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250100, China
| | - Xiangxiang Zeng
- College of Information Science and Engineering, Hunan University, Changsha 410082, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Leyi Wei
- School of Software, Shandong University, Jinan 250100, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250100, China
| |
Collapse
|
17
|
Yi J, Shi S, Fu L, Yang Z, Nie P, Lu A, Wu C, Deng Y, Hsieh C, Zeng X, Hou T, Cao D. OptADMET: a web-based tool for substructure modifications to improve ADMET properties of lead compounds. Nat Protoc 2024; 19:1105-1121. [PMID: 38263521 DOI: 10.1038/s41596-023-00942-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2022] [Accepted: 10/27/2023] [Indexed: 01/25/2024]
Abstract
Lead optimization is a crucial step in the drug discovery process, which aims to design potential drug candidates from biologically active hits. During lead optimization, active hits undergo modifications to improve their absorption, distribution, metabolism, excretion and toxicity (ADMET) profiles. Medicinal chemists face key questions regarding which compound(s) should be synthesized next and how to balance multiple ADMET properties. Reliable transformation rules from multiple experimental analyses are critical to improve this decision-making process. We developed OptADMET ( https://cadd.nscc-tj.cn/deploy/optadmet/ ), an integrated web-based platform that provides chemical transformation rules for 32 ADMET properties and leverages prior experimental data for lead optimization. The multiproperty transformation rule database contains a total of 41,779 validated transformation rules generated from the analysis of 177,191 reliable experimental datasets. Additionally, 146,450 rules were generated by analyzing 239,194 molecular data predictions. OptADMET provides the ADMET profiles of all optimized molecules from the queried molecule and enables the prediction of desirable substructure transformations and subsequent validation of drug candidates. OptADMET is based on matched molecular pairs analysis derived from synthetic chemistry, thus providing improved practicality over other methods. OptADMET is designed for use by both experimental and computational scientists.
Collapse
Affiliation(s)
- Jiacai Yi
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, China
- School of Computer Science, National University of Defense Technology, Changsha, China
| | - Shaohua Shi
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, China
- School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR, China
| | - Li Fu
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, China
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, China
| | - Ziyi Yang
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, China
| | - Pengfei Nie
- National Supercomputer Center in Tianjin, Tianjin, China
| | - Aiping Lu
- School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR, China
- Guangdong-Hong Kong-Macau Joint Lab on Chinese Medicine and Immune Disease Research, Guangzhou, China
| | - Chengkun Wu
- School of Computer Science, National University of Defense Technology, Changsha, China
| | - Yafeng Deng
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, China
| | - Changyu Hsieh
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, China
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Xiangxiang Zeng
- Deparment of Computer Science, Hunan University, Changsha, China
| | - Tingjun Hou
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, China.
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China.
| | - Dongsheng Cao
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, China.
- School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR, China.
| |
Collapse
|
18
|
Chen L, Jiang J, Dou B, Feng H, Liu J, Zhu Y, Zhang B, Zhou T, Wei GW. Machine learning study of the extended drug-target interaction network informed by pain related voltage-gated sodium channels. Pain 2024; 165:908-921. [PMID: 37851391 PMCID: PMC11021136 DOI: 10.1097/j.pain.0000000000003089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2023] [Accepted: 09/09/2023] [Indexed: 10/19/2023]
Abstract
ABSTRACT Pain is a significant global health issue, and the current treatment options for pain management have limitations in terms of effectiveness, side effects, and potential for addiction. There is a pressing need for improved pain treatments and the development of new drugs. Voltage-gated sodium channels, particularly Nav1.3, Nav1.7, Nav1.8, and Nav1.9, play a crucial role in neuronal excitability and are predominantly expressed in the peripheral nervous system. Targeting these channels may provide a means to treat pain while minimizing central and cardiac adverse effects. In this study, we construct protein-protein interaction (PPI) networks based on pain-related sodium channels and develop a corresponding drug-target interaction network to identify potential lead compounds for pain management. To ensure reliable machine learning predictions, we carefully select 111 inhibitor data sets from a pool of more than 1000 targets in the PPI network. We employ 3 distinct machine learning algorithms combined with advanced natural language processing (NLP)-based embeddings, specifically pretrained transformer and autoencoder representations. Through a systematic screening process, we evaluate the side effects and repurposing potential of more than 150,000 drug candidates targeting Nav1.7 and Nav1.8 sodium channels. In addition, we assess the ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties of these candidates to identify leads with near-optimal characteristics. Our strategy provides an innovative platform for the pharmacological development of pain treatments, offering the potential for improved efficacy and reduced side effects.
Collapse
Affiliation(s)
- Long Chen
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan, P R. China
| | - Jian Jiang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan, P R. China
- Department of Mathematics, Michigan State University, East Lansing, MI, United States
| | - Bozheng Dou
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan, P R. China
| | - Hongsong Feng
- Department of Mathematics, Michigan State University, East Lansing, MI, United States
| | - Jie Liu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan, P R. China
| | - Yueying Zhu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan, P R. China
| | - Bengong Zhang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan, P R. China
| | - Tianshou Zhou
- Key Laboratory of Computational Mathematics, Guangdong Province, and School of Mathematics, Sun Yat-sen University, Guangzhou, P R. China
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, MI, United States
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI, United States
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, United States
| |
Collapse
|
19
|
Wang C, Ong HH, Chiba S, Rajapakse JC. GLDM: hit molecule generation with constrained graph latent diffusion model. Brief Bioinform 2024; 25:bbae142. [PMID: 38581415 PMCID: PMC10998532 DOI: 10.1093/bib/bbae142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2023] [Revised: 03/08/2024] [Accepted: 03/03/2024] [Indexed: 04/08/2024] Open
Abstract
Discovering hit molecules with desired biological activity in a directed manner is a promising but profound task in computer-aided drug discovery. Inspired by recent generative AI approaches, particularly Diffusion Models (DM), we propose Graph Latent Diffusion Model (GLDM)-a latent DM that preserves both the effectiveness of autoencoders of compressing complex chemical data and the DM's capabilities of generating novel molecules. Specifically, we first develop an autoencoder to encode the molecular data into low-dimensional latent representations and then train the DM on the latent space to generate molecules inducing targeted biological activity defined by gene expression profiles. Manipulating DM in the latent space rather than the input space avoids complicated operations to map molecule decomposition and reconstruction to diffusion processes, and thus improves training efficiency. Experiments show that GLDM not only achieves outstanding performances on molecular generation benchmarks, but also generates samples with optimal chemical properties and potentials to induce desired biological activity.
Collapse
Affiliation(s)
- Conghao Wang
- School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Ave, 639798, Singapore
| | - Hiok Hian Ong
- School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Ave, 639798, Singapore
| | - Shunsuke Chiba
- School of Chemistry, Chemical Engineering and Biotechnology, Nanyang Technological University, 21 Nanyang Link, 637371, Singapore
| | - Jagath C Rajapakse
- School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Ave, 639798, Singapore
| |
Collapse
|
20
|
Chang J, Ye JC. Bidirectional generation of structure and properties through a single molecular foundation model. Nat Commun 2024; 15:2323. [PMID: 38485914 PMCID: PMC10940637 DOI: 10.1038/s41467-024-46440-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/12/2023] [Accepted: 02/27/2024] [Indexed: 03/18/2024] Open
Abstract
Recent successes of foundation models in artificial intelligence have prompted the emergence of large-scale chemical pre-trained models. Despite the growing interest in large molecular pre-trained models that provide informative representations for downstream tasks, attempts for multimodal pre-training approaches on the molecule domain were limited. To address this, here we present a multimodal molecular pre-trained model that incorporates the modalities of structure and biochemical properties, drawing inspiration from recent advances in multimodal learning techniques. Our proposed model pipeline of data handling and training objectives aligns the structure/property features in a common embedding space, which enables the model to regard bidirectional information between the molecules' structure and properties. These contributions emerge synergistic knowledge, allowing us to tackle both multimodal and unimodal downstream tasks through a single model. Through extensive experiments, we demonstrate that our model has the capabilities to solve various meaningful chemical challenges, including conditional molecule generation, property prediction, molecule classification, and reaction prediction.
Collapse
Affiliation(s)
- Jinho Chang
- Graduate School of AI, KAIST, Daejeon, South Korea
| | - Jong Chul Ye
- Graduate School of AI, KAIST, Daejeon, South Korea.
| |
Collapse
|
21
|
Hunklinger A, Hartog P, Šícho M, Godin G, Tetko IV. The openOCHEM consensus model is the best-performing open-source predictive model in the First EUOS/SLAS joint compound solubility challenge. SLAS DISCOVERY : ADVANCING LIFE SCIENCES R & D 2024; 29:100144. [PMID: 38316342 DOI: 10.1016/j.slasd.2024.01.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/18/2023] [Revised: 01/06/2024] [Accepted: 01/22/2024] [Indexed: 02/07/2024]
Abstract
The EUOS/SLAS challenge aimed to facilitate the development of reliable algorithms to predict the aqueous solubility of small molecules using experimental data from 100 K compounds. In total, hundred teams took part in the challenge to predict low, medium and highly soluble compounds as measured by the nephelometry assay. This article describes the winning model, which was developed using the publicly available Online CHEmical database and Modeling environment (OCHEM) available on the website https://ochem.eu/article/27. We describe in detail the assumptions and steps used to select methods, descriptors and strategy which contributed to the winning solution. In particular we show that consensus based on 28 models calculated using descriptor-based and representation learning methods allowed us to obtain the best score, which was higher than those based on individual approaches or consensus models developed using each individual approach. A combination of diverse models allowed us to decrease both bias and variance of individual models and to calculate the highest score. The model based on Transformer CNN contributed the best individual score thus highlighting the power of Natural Language Processing (NLP) methods. The inclusion of information about aleatoric uncertainty would be important to better understand and use the challenge data by the contestants.
Collapse
Affiliation(s)
- Andrea Hunklinger
- Institute of Structural Biology, Molecular Targets and Therapeutics Center, Helmholtz Munich-Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), DE-85764 Neuherberg, Germany
| | - Peter Hartog
- Institute of Structural Biology, Molecular Targets and Therapeutics Center, Helmholtz Munich-Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), DE-85764 Neuherberg, Germany
| | - Martin Šícho
- Leiden Academic Centre for Drug Research, Leiden University, 55 Einsteinweg, 2333 CC Leiden, the Netherlands; CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, Technická 5, 166 28, Prague, Czech Republic
| | - Guillaume Godin
- dsm-firmenich SA, Rue de la Bergère 7, CH-1242 Satigny, Switzerland
| | - Igor V Tetko
- Institute of Structural Biology, Molecular Targets and Therapeutics Center, Helmholtz Munich-Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), DE-85764 Neuherberg, Germany; BIGCHEM GmbH, Valerystr. 49, DE-85716 Unterschleißheim, Germany.
| |
Collapse
|
22
|
Kaufman B, Williams EC, Underkoffler C, Pederson R, Mardirossian N, Watson I, Parkhill J. COATI: Multimodal Contrastive Pretraining for Representing and Traversing Chemical Space. J Chem Inf Model 2024; 64:1145-1157. [PMID: 38316665 DOI: 10.1021/acs.jcim.3c01753] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2024]
Abstract
Creating a successful small molecule drug is a challenging multiparameter optimization problem in an effectively infinite space of possible molecules. Generative models have emerged as powerful tools for traversing data manifolds composed of images, sounds, and text and offer an opportunity to dramatically improve the drug discovery and design process. To create generative optimization methods that are more useful than brute-force molecular generation and filtering via virtual screening, we propose that four integrated features are necessary: large, quantitative data sets of molecular structure and activity, an invertible vector representation of realistic accessible molecules, smooth and differentiable regressors that quantify uncertainty, and algorithms to simultaneously optimize properties of interest. Over the course of 12 months, Terray Therapeutics has collected a data set of 2 billion quantitative binding measurements of small molecules to therapeutic targets, which directly motivates multiparameter generative optimization of molecules conditioned on these data. To this end, we present contrastive optimization for accelerated therapeutic inference (COATI), a pretrained, multimodal encoder-decoder model of druglike chemical space. COATI is constructed without any human biasing of features, using contrastive learning from text and 3D representations of molecules to allow for downstream use with structural models. We demonstrate that COATI possesses many of the desired properties of universal molecular embedding: fixed-dimension, invertibility, autoencoding, accurate regression, and low computation cost. Finally, we present a novel metadynamics algorithm for generative optimization using a small subset of our proprietary data collected for a model protein, carbonic anhydrase, designing molecules that satisfy the multiparameter optimization task of potency, solubility, and drug likeness. This work sets the stage for fully integrated generative molecular design and optimization for small molecules.
Collapse
Affiliation(s)
- Benjamin Kaufman
- Terray Therapeutics, Inc., 800 Royal Oaks Dr, Monrovia, California 91016, United States
| | - Edward C Williams
- Terray Therapeutics, Inc., 800 Royal Oaks Dr, Monrovia, California 91016, United States
| | - Carl Underkoffler
- Terray Therapeutics, Inc., 800 Royal Oaks Dr, Monrovia, California 91016, United States
| | - Ryan Pederson
- Terray Therapeutics, Inc., 800 Royal Oaks Dr, Monrovia, California 91016, United States
| | - Narbe Mardirossian
- Terray Therapeutics, Inc., 800 Royal Oaks Dr, Monrovia, California 91016, United States
| | - Ian Watson
- Terray Therapeutics, Inc., 800 Royal Oaks Dr, Monrovia, California 91016, United States
| | - John Parkhill
- Terray Therapeutics, Inc., 800 Royal Oaks Dr, Monrovia, California 91016, United States
| |
Collapse
|
23
|
Yoshikai Y, Mizuno T, Nemoto S, Kusuhara H. Difficulty in chirality recognition for Transformer architectures learning chemical structures from string representations. Nat Commun 2024; 15:1197. [PMID: 38365821 PMCID: PMC10873378 DOI: 10.1038/s41467-024-45102-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2023] [Accepted: 01/11/2024] [Indexed: 02/18/2024] Open
Abstract
Recent years have seen rapid development of descriptor generation based on representation learning of extremely diverse molecules, especially those that apply natural language processing (NLP) models to SMILES, a literal representation of molecular structure. However, little research has been done on how these models understand chemical structure. To address this black box, we investigated the relationship between the learning progress of SMILES and chemical structure using a representative NLP model, the Transformer. We show that while the Transformer learns partial structures of molecules quickly, it requires extended training to understand overall structures. Consistently, the accuracy of molecular property predictions using descriptors generated from models at different learning steps was similar from the beginning to the end of training. Furthermore, we found that the Transformer requires particularly long training to learn chirality and sometimes stagnates with low performance due to misunderstanding of enantiomers. These findings are expected to deepen the understanding of NLP models in chemistry.
Collapse
Affiliation(s)
- Yasuhiro Yoshikai
- Laboratory of Molecular Pharmacokinetics, Graduate School of Pharmaceutical Sciences, The University of Tokyo, 7-3-1 Hongo, Bunkyo, Tokyo, Japan
| | - Tadahaya Mizuno
- Laboratory of Molecular Pharmacokinetics, Graduate School of Pharmaceutical Sciences, The University of Tokyo, 7-3-1 Hongo, Bunkyo, Tokyo, Japan.
| | - Shumpei Nemoto
- Laboratory of Molecular Pharmacokinetics, Graduate School of Pharmaceutical Sciences, The University of Tokyo, 7-3-1 Hongo, Bunkyo, Tokyo, Japan
| | - Hiroyuki Kusuhara
- Laboratory of Molecular Pharmacokinetics, Graduate School of Pharmaceutical Sciences, The University of Tokyo, 7-3-1 Hongo, Bunkyo, Tokyo, Japan
| |
Collapse
|
24
|
Johnson TA, Abrahamsson DP. Quantification of chemicals in non-targeted analysis without analytical standards - Understanding the mechanism of electrospray ionization and making predictions. CURRENT OPINION IN ENVIRONMENTAL SCIENCE & HEALTH 2024; 37:100529. [PMID: 38312491 PMCID: PMC10836048 DOI: 10.1016/j.coesh.2023.100529] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/06/2024]
Abstract
The constant creation and release of new chemicals to the environment is forming an ever-widening gap between available analytical standards and known chemicals. Developing non-targeted analysis (NTA) methods that have the ability to detect a broad spectrum of compounds is critical for research and analysis of emerging contaminants. There is a need to develop methods that make it possible to identify compound structures from their MS and MS/MS information and quantify them without analytical standards. Method refinements that utilize machine learning algorithms and chemical descriptors to estimate the instrument response of particular compounds have made progress in recent years. This narrative review seeks to summarize the current state of the field of non-targeted analysis (NTA) toward quantification of unknowns without the use of analytical standards. Despite the limited accumulation of validation studies on real samples, the ongoing enhancement in data processing and refinement of machine learning tools could lead to more comprehensive chemical coverage of NTA and validated quantitative NTA methods, thus boosting confidence in their usage and enhancing the utility of quantitative NTA.
Collapse
Affiliation(s)
- Trevor A Johnson
- Division of Environmental Pediatrics, Department of Pediatrics, Grossman School of Medicine, New York University
| | - Dimitri P Abrahamsson
- Division of Environmental Pediatrics, Department of Pediatrics, Grossman School of Medicine, New York University
| |
Collapse
|
25
|
Führer F, Gruber A, Diedam H, Göller AH, Menz S, Schneckener S. A deep neural network: mechanistic hybrid model to predict pharmacokinetics in rat. J Comput Aided Mol Des 2024; 38:7. [PMID: 38294570 DOI: 10.1007/s10822-023-00547-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2023] [Accepted: 12/21/2023] [Indexed: 02/01/2024]
Abstract
An important aspect in the development of small molecules as drugs or agrochemicals is their systemic availability after intravenous and oral administration. The prediction of the systemic availability from the chemical structure of a potential candidate is highly desirable, as it allows to focus the drug or agrochemical development on compounds with a favorable kinetic profile. However, such predictions are challenging as the availability is the result of the complex interplay between molecular properties, biology and physiology and training data is rare. In this work we improve the hybrid model developed earlier (Schneckener in J Chem Inf Model 59:4893-4905, 2019). We reduce the median fold change error for the total oral exposure from 2.85 to 2.35 and for intravenous administration from 1.95 to 1.62. This is achieved by training on a larger data set, improving the neural network architecture as well as the parametrization of mechanistic model. Further, we extend our approach to predict additional endpoints and to handle different covariates, like sex and dosage form. In contrast to a pure machine learning model, our model is able to predict new end points on which it has not been trained. We demonstrate this feature by predicting the exposure over the first 24 h, while the model has only been trained on the total exposure.
Collapse
Affiliation(s)
- Florian Führer
- Engineering & Technology, Applied Mathematics, Bayer AG, 51368, Leverkusen, Germany.
| | - Andrea Gruber
- Pharmaceuticals, R&D, Preclinical Modeling & Simulation, Bayer AG, 13353, Berlin, Germany
| | - Holger Diedam
- Crop Science, Product Supply, SC Simulation & Analysis, Bayer AG, 40789, Monheim, Germany
| | - Andreas H Göller
- Pharmaceuticals, R&D, Molecular Design, Bayer AG, 42096, Wuppertal, Germany
| | - Stephan Menz
- Pharmaceuticals, R&D, Preclinical Modeling & Simulation, Bayer AG, 13353, Berlin, Germany
| | | |
Collapse
|
26
|
Du H, Wei GW, Hou T. Multiscale topology in interactomic network: from transcriptome to antiaddiction drug repurposing. Brief Bioinform 2024; 25:bbae054. [PMID: 38499497 PMCID: PMC10948341 DOI: 10.1093/bib/bbae054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2023] [Revised: 01/05/2024] [Accepted: 01/25/2024] [Indexed: 03/20/2024] Open
Abstract
The escalating drug addiction crisis in the United States underscores the urgent need for innovative therapeutic strategies. This study embarked on an innovative and rigorous strategy to unearth potential drug repurposing candidates for opioid and cocaine addiction treatment, bridging the gap between transcriptomic data analysis and drug discovery. We initiated our approach by conducting differential gene expression analysis on addiction-related transcriptomic data to identify key genes. We propose a novel topological differentiation to identify key genes from a protein-protein interaction network derived from DEGs. This method utilizes persistent Laplacians to accurately single out pivotal nodes within the network, conducting this analysis in a multiscale manner to ensure high reliability. Through rigorous literature validation, pathway analysis and data-availability scrutiny, we identified three pivotal molecular targets, mTOR, mGluR5 and NMDAR, for drug repurposing from DrugBank. We crafted machine learning models employing two natural language processing (NLP)-based embeddings and a traditional 2D fingerprint, which demonstrated robust predictive ability in gauging binding affinities of DrugBank compounds to selected targets. Furthermore, we elucidated the interactions of promising drugs with the targets and evaluated their drug-likeness. This study delineates a multi-faceted and comprehensive analytical framework, amalgamating bioinformatics, topological data analysis and machine learning, for drug repurposing in addiction treatment, setting the stage for subsequent experimental validation. The versatility of the methods we developed allows for applications across a range of diseases and transcriptomic datasets.
Collapse
Affiliation(s)
- Hongyan Du
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
- Department of Mathematics, Michigan State University, MI 48824, USA
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, MI 48824, USA
- Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA
| | - Tingjun Hou
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| |
Collapse
|
27
|
Yi JC, Yang ZY, Zhao WT, Yang ZJ, Zhang XC, Wu CK, Lu AP, Cao DS. ChemMORT: an automatic ADMET optimization platform using deep learning and multi-objective particle swarm optimization. Brief Bioinform 2024; 25:bbae008. [PMID: 38385872 PMCID: PMC10883642 DOI: 10.1093/bib/bbae008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2023] [Revised: 12/17/2023] [Accepted: 01/02/2024] [Indexed: 02/23/2024] Open
Abstract
Drug discovery and development constitute a laborious and costly undertaking. The success of a drug hinges not only good efficacy but also acceptable absorption, distribution, metabolism, elimination, and toxicity (ADMET) properties. Overall, up to 50% of drug development failures have been contributed from undesirable ADMET profiles. As a multiple parameter objective, the optimization of the ADMET properties is extremely challenging owing to the vast chemical space and limited human expert knowledge. In this study, a freely available platform called Chemical Molecular Optimization, Representation and Translation (ChemMORT) is developed for the optimization of multiple ADMET endpoints without the loss of potency (https://cadd.nscc-tj.cn/deploy/chemmort/). ChemMORT contains three modules: Simplified Molecular Input Line Entry System (SMILES) Encoder, Descriptor Decoder and Molecular Optimizer. The SMILES Encoder can generate the molecular representation with a 512-dimensional vector, and the Descriptor Decoder is able to translate the above representation to the corresponding molecular structure with high accuracy. Based on reversible molecular representation and particle swarm optimization strategy, the Molecular Optimizer can be used to effectively optimize undesirable ADMET properties without the loss of bioactivity, which essentially accomplishes the design of inverse QSAR. The constrained multi-objective optimization of the poly (ADP-ribose) polymerase-1 inhibitor is provided as the case to explore the utility of ChemMORT.
Collapse
Affiliation(s)
- Jia-Cai Yi
- School of Computer Science, National University of Defense Technology, Changsha 410073, Hunan, PR China
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410013, Hunan, P. R. China
| | - Zi-Yi Yang
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410013, Hunan, P. R. China
| | - Wen-Tao Zhao
- School of Computer Science, National University of Defense Technology, Changsha 410073, Hunan, PR China
| | - Zhi-Jiang Yang
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410013, Hunan, P. R. China
| | - Xiao-Chen Zhang
- School of Computer Science, National University of Defense Technology, Changsha 410073, Hunan, PR China
| | - Cheng-Kun Wu
- State Key Laboratory of High-Performance Computing, Changsha 410073, Hunan, PR China
| | - Ai-Ping Lu
- Institute for Advancing Translational Medicine in Bone and Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR, P. R. China
| | - Dong-Sheng Cao
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410013, Hunan, P. R. China
- Institute for Advancing Translational Medicine in Bone and Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR, P. R. China
| |
Collapse
|
28
|
Viganò EL, Ballabio D, Roncaglioni A. Artificial Intelligence and Machine Learning Methods to Evaluate Cardiotoxicity following the Adverse Outcome Pathway Frameworks. TOXICS 2024; 12:87. [PMID: 38276722 PMCID: PMC10820364 DOI: 10.3390/toxics12010087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/30/2023] [Revised: 01/15/2024] [Accepted: 01/17/2024] [Indexed: 01/27/2024]
Abstract
Cardiovascular disease is a leading global cause of mortality. The potential cardiotoxic effects of chemicals from different classes, such as environmental contaminants, pesticides, and drugs can significantly contribute to effects on health. The same chemical can induce cardiotoxicity in different ways, following various Adverse Outcome Pathways (AOPs). In addition, the potential synergistic effects between chemicals further complicate the issue. In silico methods have become essential for tackling the problem from different perspectives, reducing the need for traditional in vivo testing, and saving valuable resources in terms of time and money. Artificial intelligence (AI) and machine learning (ML) are among today's advanced approaches for evaluating chemical hazards. They can serve, for instance, as a first-tier component of Integrated Approaches to Testing and Assessment (IATA). This study employed ML and AI to assess interactions between chemicals and specific biological targets within the AOP networks for cardiotoxicity, starting with molecular initiating events (MIEs) and progressing through key events (KEs). We explored methods to encode chemical information in a suitable way for ML and AI. We started with commonly used approaches in Quantitative Structure-Activity Relationship (QSAR) methods, such as molecular descriptors and different types of fingerprint. We then increased the complexity of encoders, incorporating graph-based methods, auto-encoders, and character embeddings employed in neural language processing. We also developed a multimodal neural network architecture, capable of considering the complementary nature of different chemical representations simultaneously. The potential of this approach, compared to more conventional architectures designed to handle a single encoder, becomes apparent when the amount of data increases.
Collapse
Affiliation(s)
- Edoardo Luca Viganò
- Laboratory of Environmental Toxicology and Chemistry, Department of Environmental Health Sciences, Istituto di Ricerche Farmacologiche Mario Negri IRCSS, 20156 Milan, Italy;
| | - Davide Ballabio
- Milano Chemometrics and QSAR Research Group, Department of Earth and Environmental Sciences, University of Milano-Bicocca, 20126 Milan, Italy;
| | - Alessandra Roncaglioni
- Laboratory of Environmental Toxicology and Chemistry, Department of Environmental Health Sciences, Istituto di Ricerche Farmacologiche Mario Negri IRCSS, 20156 Milan, Italy;
| |
Collapse
|
29
|
Colliandre L, Muller C. Bayesian Optimization in Drug Discovery. Methods Mol Biol 2024; 2716:101-136. [PMID: 37702937 DOI: 10.1007/978-1-0716-3449-3_5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/14/2023]
Abstract
Drug discovery deals with the search for initial hits and their optimization toward a targeted clinical profile. Throughout the discovery pipeline, the candidate profile will evolve, but the optimization will mainly stay a trial-and-error approach. Tons of in silico methods have been developed to improve and fasten this pipeline. Bayesian optimization (BO) is a well-known method for the determination of the global optimum of a function. In the last decade, BO has gained popularity in the early drug design phase. This chapter starts with the concept of black box optimization applied to drug design and presents some approaches to tackle it. Then it focuses on BO and explains its principle and all the algorithmic building blocks needed to implement it. This explanation aims to be accessible to people involved in drug discovery projects. A strong emphasis is made on the solutions to deal with the specific constraints of drug discovery. Finally, a large set of practical applications of BO is highlighted.
Collapse
|
30
|
Olmedo DA, Durant-Archibold AA, López-Pérez JL, Medina-Franco JL. Design and Diversity Analysis of Chemical Libraries in Drug Discovery. Comb Chem High Throughput Screen 2024; 27:502-515. [PMID: 37409545 DOI: 10.2174/1386207326666230705150110] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/05/2023] [Revised: 05/30/2023] [Accepted: 05/30/2023] [Indexed: 07/07/2023]
Abstract
Chemical libraries and compound data sets are among the main inputs to start the drug discovery process at universities, research institutes, and the pharmaceutical industry. The approach used in the design of compound libraries, the chemical information they possess, and the representation of structures, play a fundamental role in the development of studies: chemoinformatics, food informatics, in silico pharmacokinetics, computational toxicology, bioinformatics, and molecular modeling to generate computational hits that will continue the optimization process of drug candidates. The prospects for growth in drug discovery and development processes in chemical, biotechnological, and pharmaceutical companies began a few years ago by integrating computational tools with artificial intelligence methodologies. It is anticipated that it will increase the number of drugs approved by regulatory agencies shortly.
Collapse
Affiliation(s)
- Dionisio A Olmedo
- Centro de Investigaciones Farmacognósticas de la Flora Panameña (CIFLORPAN), Facultad de Farmacia, Universidad de Panamá, Ciudad de Panamá, Apartado, 0824-00178, Panamá
- Sistema Nacional de Investigación (SNI), Secretaria Nacional de Ciencia, Tecnología e Innovación (SENACYT), Ciudad del Saber, Clayton, Panamá
| | - Armando A Durant-Archibold
- Centro de Biodiversidad y Descubrimiento de Drogas, Instituto de Investigaciones Científicas y Servicios de Alta Tecnología (INDICASAT AIP), Apartado, 0843-01103, Panamá
- Departamento de Bioquímica, Facultad de Ciencias Naturales, Exactas y Tecnología, Universidad de Panamá, Ciudad de Panamá, Panamá
| | - José Luis López-Pérez
- CESIFAR, Departamento de Farmacología, Facultad de Medicina, Universidad de Panamá, Ciudad de Panamá, Panamá
- Departamento de Ciencias Farmacéuticas, Facultad de Farmacia, Universidad de Salamanca, Avda. Campo Charro s/n, 37071 Salamanca, España
| | - José Luis Medina-Franco
- DIFACQUIM Grupo de Investigación, Departamento de Farmacia, Escuela de Química, Universidad Nacional Autónoma de México, Ciudad de México, Apartado, 04510, México
| |
Collapse
|
31
|
Schimunek J, Seidl P, Elez K, Hempel T, Le T, Noé F, Olsson S, Raich L, Winter R, Gokcan H, Gusev F, Gutkin EM, Isayev O, Kurnikova MG, Narangoda CH, Zubatyuk R, Bosko IP, Furs KV, Karpenko AD, Kornoushenko YV, Shuldau M, Yushkevich A, Benabderrahmane MB, Bousquet-Melou P, Bureau R, Charton B, Cirou BC, Gil G, Allen WJ, Sirimulla S, Watowich S, Antonopoulos N, Epitropakis N, Krasoulis A, Itsikalis V, Theodorakis S, Kozlovskii I, Maliutin A, Medvedev A, Popov P, Zaretckii M, Eghbal-Zadeh H, Halmich C, Hochreiter S, Mayr A, Ruch P, Widrich M, Berenger F, Kumar A, Yamanishi Y, Zhang KYJ, Bengio E, Bengio Y, Jain MJ, Korablyov M, Liu CH, Marcou G, Glaab E, Barnsley K, Iyengar SM, Ondrechen MJ, Haupt VJ, Kaiser F, Schroeder M, Pugliese L, Albani S, Athanasiou C, Beccari A, Carloni P, D'Arrigo G, Gianquinto E, Goßen J, Hanke A, Joseph BP, Kokh DB, Kovachka S, Manelfi C, Mukherjee G, Muñiz-Chicharro A, Musiani F, Nunes-Alves A, Paiardi G, Rossetti G, Sadiq SK, Spyrakis F, Talarico C, Tsengenes A, Wade RC, Copeland C, Gaiser J, Olson DR, Roy A, Venkatraman V, Wheeler TJ, Arthanari H, Blaschitz K, Cespugli M, Durmaz V, Fackeldey K, Fischer PD, Gorgulla C, Gruber C, Gruber K, Hetmann M, Kinney JE, Padmanabha Das KM, Pandita S, Singh A, Steinkellner G, Tesseyre G, Wagner G, Wang ZF, Yust RJ, Druzhilovskiy DS, Filimonov DA, Pogodin PV, Poroikov V, Rudik AV, Stolbov LA, Veselovsky AV, De Rosa M, De Simone G, Gulotta MR, Lombino J, Mekni N, Perricone U, Casini A, Embree A, Gordon DB, Lei D, Pratt K, Voigt CA, Chen KY, Jacob Y, Krischuns T, Lafaye P, Zettor A, Rodríguez ML, White KM, Fearon D, Von Delft F, Walsh MA, Horvath D, Brooks CL, Falsafi B, Ford B, García-Sastre A, Yup Lee S, Naffakh N, Varnek A, Klambauer G, Hermans TM. A community effort in SARS-CoV-2 drug discovery. Mol Inform 2024; 43:e202300262. [PMID: 37833243 DOI: 10.1002/minf.202300262] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2023] [Revised: 10/13/2023] [Accepted: 10/13/2023] [Indexed: 10/15/2023]
Abstract
The COVID-19 pandemic continues to pose a substantial threat to human lives and is likely to do so for years to come. Despite the availability of vaccines, searching for efficient small-molecule drugs that are widely available, including in low- and middle-income countries, is an ongoing challenge. In this work, we report the results of an open science community effort, the "Billion molecules against COVID-19 challenge", to identify small-molecule inhibitors against SARS-CoV-2 or relevant human receptors. Participating teams used a wide variety of computational methods to screen a minimum of 1 billion virtual molecules against 6 protein targets. Overall, 31 teams participated, and they suggested a total of 639,024 molecules, which were subsequently ranked to find 'consensus compounds'. The organizing team coordinated with various contract research organizations (CROs) and collaborating institutions to synthesize and test 878 compounds for biological activity against proteases (Nsp5, Nsp3, TMPRSS2), nucleocapsid N, RdRP (only the Nsp12 domain), and (alpha) spike protein S. Overall, 27 compounds with weak inhibition/binding were experimentally identified by binding-, cleavage-, and/or viral suppression assays and are presented here. Open science approaches such as the one presented here contribute to the knowledge base of future drug discovery efforts in finding better SARS-CoV-2 treatments.
Collapse
|
32
|
Pérez-Correa I, Giunta PD, Mariño FJ, Francesconi JA. Transformer-Based Representation of Organic Molecules for Potential Modeling of Physicochemical Properties. J Chem Inf Model 2023; 63:7676-7688. [PMID: 38062559 DOI: 10.1021/acs.jcim.3c01548] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/26/2023]
Abstract
In this work, we study the use of three configurations of an autoencoder neural network to process organic substances with the aim of generating meaningful molecular descriptors that can be employed to develop property prediction models. A total of 18,322,500 compounds represented as SMILES strings were used to train the model, demonstrating that a latent space of 24 units is able to adequately reconstruct the data. After AE training, an analysis of the latent space properties in terms of compound similarity was carried out, indicating that this space possesses desired properties for the potential development of models for forecasting physical properties of organic compounds. As a final step, a QSPR model was developed to predict the boiling point of chemical substances based on the AE descriptors. 5276 substances were used for the regression task, and the predictive ability was compared with models available in the literature evaluated on the same database. The final AE model has an overall error of 1.40% (1.39% with augmented SMILES) in the prediction of the boiling temperature, while other models have errors between 2.0 and 3.2%. This shows that the SMILES representation is comparable and even outperforms the state-of-the-art representations widely used in the literature.
Collapse
Affiliation(s)
- Ignacio Pérez-Correa
- Instituto de Tecnologías del Hidrógeno y Energías Sostenibles (ITHES), UBA-CONICET, Ciudad Universitaria, Intendente Güiraldes 2160, Ciudad de Buenos Aires C1428EGA, Argentina
| | - Pablo D Giunta
- Instituto de Tecnologías del Hidrógeno y Energías Sostenibles (ITHES), UBA-CONICET, Ciudad Universitaria, Intendente Güiraldes 2160, Ciudad de Buenos Aires C1428EGA, Argentina
| | - Fernando J Mariño
- Instituto de Tecnologías del Hidrógeno y Energías Sostenibles (ITHES), UBA-CONICET, Ciudad Universitaria, Intendente Güiraldes 2160, Ciudad de Buenos Aires C1428EGA, Argentina
| | - Javier A Francesconi
- Centro de Investigación y Desarrollo en Tecnología de Alimentos (CIDTA), UTN-FRRo, Estanislao Zeballos 1341, Rosario S2000BQA, Argentina
| |
Collapse
|
33
|
Wang F, Pasin D, Skinnider MA, Liigand J, Kleis JN, Brown D, Oler E, Sajed T, Gautam V, Harrison S, Greiner R, Foster LJ, Dalsgaard PW, Wishart DS. Deep Learning-Enabled MS/MS Spectrum Prediction Facilitates Automated Identification Of Novel Psychoactive Substances. Anal Chem 2023; 95:18326-18334. [PMID: 38048435 PMCID: PMC10733899 DOI: 10.1021/acs.analchem.3c02413] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/02/2023] [Revised: 11/10/2023] [Accepted: 11/13/2023] [Indexed: 12/06/2023]
Abstract
The market for illicit drugs has been reshaped by the emergence of more than 1100 new psychoactive substances (NPS) over the past decade, posing a major challenge to the forensic and toxicological laboratories tasked with detecting and identifying them. Tandem mass spectrometry (MS/MS) is the primary method used to screen for NPS within seized materials or biological samples. The most contemporary workflows necessitate labor-intensive and expensive MS/MS reference standards, which may not be available for recently emerged NPS on the illicit market. Here, we present NPS-MS, a deep learning method capable of accurately predicting the MS/MS spectra of known and hypothesized NPS from their chemical structures alone. NPS-MS is trained by transfer learning from a generic MS/MS prediction model on a large data set of MS/MS spectra. We show that this approach enables a more accurate identification of NPS from experimentally acquired MS/MS spectra than any existing method. We demonstrate the application of NPS-MS to identify a novel derivative of phencyclidine (PCP) within an unknown powder seized in Denmark without the use of any reference standards. We anticipate that NPS-MS will allow forensic laboratories to identify more rapidly both known and newly emerging NPS. NPS-MS is available as a web server at https://nps-ms.ca/, which provides MS/MS spectra prediction capabilities for given NPS compounds. Additionally, it offers MS/MS spectra identification against a vast database comprising approximately 8.7 million predicted NPS compounds from DarkNPS and 24.5 million predicted ESI-QToF-MS/MS spectra for these compounds.
Collapse
Affiliation(s)
- Fei Wang
- Department
of Computing Science, University of Alberta, Edmonton, Alberta T6G 2E8, Canada
- Alberta
Machine Intelligence Institute, Edmonton, Alberta T5J
3B1, Canada
| | - Daniel Pasin
- Section
of Forensic Chemistry, Department of Forensic Medicine, University of Copenhagen, Copenhagen 2100, Denmark
| | - Michael A. Skinnider
- Michael
Smith Laboratories, University of British
Columbia, Vancouver, British Columbia V6T 1Z4, Canada
- Lewis-Sigler
Institute for Integrative Genomics, Princeton
University, Princeton, New Jersey 08544, United States
- Ludwig Institute
for Cancer Research, Princeton University, Princeton, New Jersey 08544, United States
| | - Jaanus Liigand
- Department
of Biological Sciences, University of Alberta, Edmonton, Alberta T6G 2E9, Canada
- Institute
of Chemistry, University of Tartu, Tartu 50411, Estonia
| | - Jan-Niklas Kleis
- Institute
of Forensic Medicine, Forensic Toxicology, Johannes Gutenberg University Mainz, Mainz 55131, Germany
| | - David Brown
- Forensic
Science Laboratory, ChemCentre, Bentley, Western Australia 6102, Australia
- School of Molecular and Life Sciences, Curtin University, Bentley, Western Australia 6009, Australia
| | - Eponine Oler
- Department
of Biological Sciences, University of Alberta, Edmonton, Alberta T6G 2E9, Canada
| | - Tanvir Sajed
- Department
of Biological Sciences, University of Alberta, Edmonton, Alberta T6G 2E9, Canada
| | - Vasuk Gautam
- Department
of Biological Sciences, University of Alberta, Edmonton, Alberta T6G 2E9, Canada
| | - Stephen Harrison
- Forensic
Science Laboratory, ChemCentre, Bentley, Western Australia 6102, Australia
| | - Russell Greiner
- Department
of Computing Science, University of Alberta, Edmonton, Alberta T6G 2E8, Canada
- Alberta
Machine Intelligence Institute, Edmonton, Alberta T5J
3B1, Canada
| | - Leonard J. Foster
- Michael
Smith Laboratories, University of British
Columbia, Vancouver, British Columbia V6T 1Z4, Canada
- Department
of Biochemistry and Molecular Biology, University
of British Columbia, Vancouver, British Columbia V6T 2A1, Canada
| | - Petur Weihe Dalsgaard
- Section
of Forensic Chemistry, Department of Forensic Medicine, University of Copenhagen, Copenhagen 2100, Denmark
| | - David S. Wishart
- Department
of Computing Science, University of Alberta, Edmonton, Alberta T6G 2E8, Canada
- Department
of Biological Sciences, University of Alberta, Edmonton, Alberta T6G 2E9, Canada
- Department of Laboratory
Medicine and Pathology, University of Alberta, Edmonton, Alberta T6G 1C9, Canada
- Faculty of Pharmacy and Pharmaceutical Sciences, University of Alberta, Edmonton, Alberta T6G 2C8, Canada
- Biological Sciences Division, Pacific Northwest
National Laboratory, Richland, Washington 99354, United States
| |
Collapse
|
34
|
Wang R, Feng H, Wei GW. ChatGPT in Drug Discovery: A Case Study on Anticocaine Addiction Drug Development with Chatbots. J Chem Inf Model 2023; 63:7189-7209. [PMID: 37956228 PMCID: PMC11021135 DOI: 10.1021/acs.jcim.3c01429] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2023]
Abstract
The birth of ChatGPT, a cutting-edge language model-based chatbot developed by OpenAI, ushered in a new era in AI. However, due to potential pitfalls, its role in rigorous scientific research is not clear yet. This paper vividly showcases its innovative application within the field of drug discovery. Focused specifically on developing anticocaine addiction drugs, the study employs GPT-4 as a virtual guide, offering strategic and methodological insights to researchers working on generative models for drug candidates. The primary objective is to generate optimal drug-like molecules with desired properties. By leveraging the capabilities of ChatGPT, the study introduces a novel approach to the drug discovery process. This symbiotic partnership between AI and researchers transforms how drug development is approached. Chatbots become facilitators, steering researchers toward innovative methodologies and productive paths for creating effective drug candidates. This research sheds light on the collaborative synergy between human expertise and AI assistance, wherein ChatGPT's cognitive abilities enhance the design and development of pharmaceutical solutions. This paper not only explores the integration of advanced AI in drug discovery but also reimagines the landscape by advocating for AI-powered chatbots as trailblazers in revolutionizing therapeutic innovation.
Collapse
Affiliation(s)
- Rui Wang
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
| | - Hongsong Feng
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States
| |
Collapse
|
35
|
Wang Z, Feng Z, Li Y, Li B, Wang Y, Sha C, He M, Li X. BatmanNet: bi-branch masked graph transformer autoencoder for molecular representation. Brief Bioinform 2023; 25:bbad400. [PMID: 38033291 PMCID: PMC10783874 DOI: 10.1093/bib/bbad400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2023] [Revised: 10/02/2023] [Accepted: 10/17/2023] [Indexed: 12/02/2023] Open
Abstract
Although substantial efforts have been made using graph neural networks (GNNs) for artificial intelligence (AI)-driven drug discovery, effective molecular representation learning remains an open challenge, especially in the case of insufficient labeled molecules. Recent studies suggest that big GNN models pre-trained by self-supervised learning on unlabeled datasets enable better transfer performance in downstream molecular property prediction tasks. However, the approaches in these studies require multiple complex self-supervised tasks and large-scale datasets , which are time-consuming, computationally expensive and difficult to pre-train end-to-end. Here, we design a simple yet effective self-supervised strategy to simultaneously learn local and global information about molecules, and further propose a novel bi-branch masked graph transformer autoencoder (BatmanNet) to learn molecular representations. BatmanNet features two tailored complementary and asymmetric graph autoencoders to reconstruct the missing nodes and edges, respectively, from a masked molecular graph. With this design, BatmanNet can effectively capture the underlying structure and semantic information of molecules, thus improving the performance of molecular representation. BatmanNet achieves state-of-the-art results for multiple drug discovery tasks, including molecular properties prediction, drug-drug interaction and drug-target interaction, on 13 benchmark datasets, demonstrating its great potential and superiority in molecular representation learning.
Collapse
Affiliation(s)
- Zhen Wang
- College of Electrical and Information Engineering, Hunan University, Changsha, 410082, Hunan, China
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, 310018, Zhejiang, China
| | - Zheng Feng
- Department of Health Outcomes & Biomedical Informatics, College of Medecine, University of Florida, Gainesville, 32611, FL, USA
| | - Yanjun Li
- Department of Medicinal Chemistry, College of Pharmacy, University of Florida, Gainesville, 32610, FL, USA
- Center for Natural Products, Drug Discovery and Development, University of Florida, Gainesville, 32610, FL, USA
| | - Bowen Li
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, 310018, Zhejiang, China
| | - Yongrui Wang
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, 310018, Zhejiang, China
| | - Chulin Sha
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, 310018, Zhejiang, China
| | - Min He
- College of Electrical and Information Engineering, Hunan University, Changsha, 410082, Hunan, China
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, 310018, Zhejiang, China
| | - Xiaolin Li
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, 310018, Zhejiang, China
- ElasticMind Inc, Hangzhou, 310018, Zhejiang, China
| |
Collapse
|
36
|
Shen C, Luo J, Xia K. Molecular geometric deep learning. CELL REPORTS METHODS 2023; 3:100621. [PMID: 37875121 PMCID: PMC10694498 DOI: 10.1016/j.crmeth.2023.100621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/19/2023] [Revised: 06/16/2023] [Accepted: 09/28/2023] [Indexed: 10/26/2023]
Abstract
Molecular representation learning plays an important role in molecular property prediction. Existing molecular property prediction models rely on the de facto standard of covalent-bond-based molecular graphs for representing molecular topology at the atomic level and totally ignore the non-covalent interactions within the molecule. In this study, we propose a molecular geometric deep learning model to predict the properties of molecules that aims to comprehensively consider the information of covalent and non-covalent interactions of molecules. The essential idea is to incorporate a more general molecular representation into geometric deep learning (GDL) models. We systematically test molecular GDL (Mol-GDL) on fourteen commonly used benchmark datasets. The results show that Mol-GDL can achieve a better performance than state-of-the-art (SOTA) methods. Extensive tests have demonstrated the important role of non-covalent interactions in molecular property prediction and the effectiveness of Mol-GDL models.
Collapse
Affiliation(s)
- Cong Shen
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410000, China; School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore
| | - Jiawei Luo
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410000, China.
| | - Kelin Xia
- School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore.
| |
Collapse
|
37
|
Yu L, He X, Fang X, Liu L, Liu J. Deep Learning with Geometry-Enhanced Molecular Representation for Augmentation of Large-Scale Docking-Based Virtual Screening. J Chem Inf Model 2023; 63:6501-6514. [PMID: 37882338 DOI: 10.1021/acs.jcim.3c01371] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2023]
Abstract
Structure-based virtual screening has been a crucial tool in drug discovery for decades. However, as the chemical space expands, the existing structure-based virtual screening techniques based on molecular docking and scoring struggle to handle billion-entry ultralarge libraries due to the high computational cost. To address this challenge, people have resorted to machine learning techniques to enhance structure-based virtual screening for efficiently exploring the vast chemical space. In those cases, compounds are usually treated as sequential strings or two-dimensional topology graphs, limiting their ability to incorporate three-dimensional structural information for downstream tasks. We herein propose a novel deep learning protocol, GEM-Screen, which utilizes the geometry-enhanced molecular representation of the compounds docking to a specific target and is trained on docking scores of a small fraction of a library through an active learning strategy to approximate the docking outcome for yet nontraining entries. This protocol is applied to virtual screening campaigns against the AmpC and D4 targets, demonstrating that GEM-Screen enriches more than 90% of the hit scaffolds for AmpC in the top 4% of model predictions and more than 80% of the hit scaffolds for D4 in the same top-ranking size of library. GEM-Screen can be used in conjunction with traditional docking programs for docking of only the top-ranked compounds to avoid the exhaustive docking of the whole library, thus allowing for discovering top-scoring compounds from billion-entry libraries in a rapid yet accurate fashion.
Collapse
Affiliation(s)
- Lan Yu
- School of Science, China Pharmaceutical University, Nanjing 210009, China
| | - Xiao He
- Shanghai Engineering Research Center of Molecular Therapeutics and New Drug Development, Shanghai Frontiers Science Center of Molecule Intelligent Syntheses, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai 200062, China
- New York University-East China Normal University Center for Computational Chemistry, New York University Shanghai, Shanghai 200062, China
| | - Xiaomin Fang
- Baidu International Technology (Shenzhen) Co., Ltd., Shenzhen 518063, China
| | - Lihang Liu
- Baidu International Technology (Shenzhen) Co., Ltd., Shenzhen 518063, China
| | - Jinfeng Liu
- School of Science, China Pharmaceutical University, Nanjing 210009, China
- School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing 210009, China
| |
Collapse
|
38
|
Ilnicka A, Schneider G. Designing molecules with autoencoder networks. NATURE COMPUTATIONAL SCIENCE 2023; 3:922-933. [PMID: 38177601 DOI: 10.1038/s43588-023-00548-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/01/2023] [Accepted: 10/03/2023] [Indexed: 01/06/2024]
Abstract
Autoencoders are versatile tools in molecular informatics. These unsupervised neural networks serve diverse tasks such as data-driven molecular representation and constructive molecular design. This Review explores their algorithmic foundations and applications in drug discovery, highlighting the most active areas of development and the contributions autoencoder networks have made in advancing this field. We also explore the challenges and prospects concerning the utilization of autoencoders and the various adaptations of this neural network architecture in molecular design.
Collapse
Affiliation(s)
- Agnieszka Ilnicka
- Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich, Switzerland
| | - Gisbert Schneider
- Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich, Switzerland.
| |
Collapse
|
39
|
Mullowney MW, Duncan KR, Elsayed SS, Garg N, van der Hooft JJJ, Martin NI, Meijer D, Terlouw BR, Biermann F, Blin K, Durairaj J, Gorostiola González M, Helfrich EJN, Huber F, Leopold-Messer S, Rajan K, de Rond T, van Santen JA, Sorokina M, Balunas MJ, Beniddir MA, van Bergeijk DA, Carroll LM, Clark CM, Clevert DA, Dejong CA, Du C, Ferrinho S, Grisoni F, Hofstetter A, Jespers W, Kalinina OV, Kautsar SA, Kim H, Leao TF, Masschelein J, Rees ER, Reher R, Reker D, Schwaller P, Segler M, Skinnider MA, Walker AS, Willighagen EL, Zdrazil B, Ziemert N, Goss RJM, Guyomard P, Volkamer A, Gerwick WH, Kim HU, Müller R, van Wezel GP, van Westen GJP, Hirsch AKH, Linington RG, Robinson SL, Medema MH. Artificial intelligence for natural product drug discovery. Nat Rev Drug Discov 2023; 22:895-916. [PMID: 37697042 DOI: 10.1038/s41573-023-00774-7] [Citation(s) in RCA: 16] [Impact Index Per Article: 16.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/20/2023] [Indexed: 09/13/2023]
Abstract
Developments in computational omics technologies have provided new means to access the hidden diversity of natural products, unearthing new potential for drug discovery. In parallel, artificial intelligence approaches such as machine learning have led to exciting developments in the computational drug design field, facilitating biological activity prediction and de novo drug design for molecular targets of interest. Here, we describe current and future synergies between these developments to effectively identify drug candidates from the plethora of molecules produced by nature. We also discuss how to address key challenges in realizing the potential of these synergies, such as the need for high-quality datasets to train deep learning algorithms and appropriate strategies for algorithm validation.
Collapse
Affiliation(s)
| | - Katherine R Duncan
- Strathclyde Institute of Pharmacy and Biomedical Sciences, University of Strathclyde, Glasgow, UK
| | - Somayah S Elsayed
- Department of Molecular Biotechnology, Institute of Biology, Leiden University, Leiden, The Netherlands
| | - Neha Garg
- School of Chemistry and Biochemistry, Center for Microbial Dynamics and Infection, Georgia Institute of Technology, Atlanta, GA, USA
| | - Justin J J van der Hooft
- Bioinformatics Group, Wageningen University, Wageningen, The Netherlands
- Department of Biochemistry, University of Johannesburg, Johannesburg, South Africa
| | - Nathaniel I Martin
- Biological Chemistry Group, Institute of Biology, Leiden University, Leiden, The Netherlands
| | - David Meijer
- Bioinformatics Group, Wageningen University, Wageningen, The Netherlands
| | - Barbara R Terlouw
- Bioinformatics Group, Wageningen University, Wageningen, The Netherlands
| | - Friederike Biermann
- Bioinformatics Group, Wageningen University, Wageningen, The Netherlands
- Institute of Molecular Bio Science, Goethe-University Frankfurt, Frankfurt am Main, Germany
- LOEWE Center for Translational Biodiversity Genomics (TBG), Frankfurt am Main, Germany
| | - Kai Blin
- The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark
| | | | - Marina Gorostiola González
- Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden, The Netherlands
- ONCODE institute, Leiden, The Netherlands
| | - Eric J N Helfrich
- Institute of Molecular Bio Science, Goethe-University Frankfurt, Frankfurt am Main, Germany
- LOEWE Center for Translational Biodiversity Genomics (TBG), Frankfurt am Main, Germany
| | - Florian Huber
- Center for Digitalization and Digitality, Hochschule Düsseldorf, Düsseldorf, Germany
| | - Stefan Leopold-Messer
- Institut für Mikrobiologie, Eidgenössische Technische Hochschule (ETH) Zürich, Zürich, Switzerland
| | - Kohulan Rajan
- Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena, Jena, Germany
| | - Tristan de Rond
- School of Chemical Sciences, University of Auckland, Auckland, New Zealand
| | - Jeffrey A van Santen
- Department of Chemistry, Simon Fraser University, Burnaby, British Columbia, Canada
| | - Maria Sorokina
- Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller University, Jena, Germany
- Pharmaceuticals R&D, Bayer AG, Berlin, Germany
| | - Marcy J Balunas
- Department of Microbiology and Immunology, University of Michigan, Ann Arbor, MI, USA
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Mehdi A Beniddir
- Équipe "Chimie des Substances Naturelles", Université Paris-Saclay, CNRS, BioCIS, Orsay, France
| | - Doris A van Bergeijk
- Department of Molecular Biotechnology, Institute of Biology, Leiden University, Leiden, The Netherlands
| | - Laura M Carroll
- Structural and Computational Biology Unit, EMBL, Heidelberg, Germany
| | - Chase M Clark
- Division of Pharmaceutical Sciences, School of Pharmacy, University of Wisconsin-Madison, Madison, WI, USA
| | | | | | - Chao Du
- Department of Molecular Biotechnology, Institute of Biology, Leiden University, Leiden, The Netherlands
| | | | - Francesca Grisoni
- Institute for Complex Molecular Systems, Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands
- Centre for Living Technologies, Alliance TU/e, WUR, UU, UMC Utrecht, Utrecht, The Netherlands
| | | | - Willem Jespers
- Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden, The Netherlands
| | - Olga V Kalinina
- Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Saarbrücken, Germany
- Drug Bioinformatics, Medical Faculty, Saarland University, Homburg, Germany
- Center for Bioinformatics, Saarland University, Saarbrücken, Germany
| | | | - Hyunwoo Kim
- College of Pharmacy and Integrated Research Institute for Drug Development, Dongguk University Seoul, Goyang-si, Republic of Korea
| | - Tiago F Leao
- Center for Nuclear Energy in Agriculture, University of São Paulo, Piracicaba, Brazil
| | - Joleen Masschelein
- Center for Microbiology, VIB-KU Leuven, Heverlee, Belgium
- Department of Biology, KU Leuven, Heverlee, Belgium
| | - Evan R Rees
- Division of Pharmaceutical Sciences, School of Pharmacy, University of Wisconsin-Madison, Madison, WI, USA
| | - Raphael Reher
- Institute of Pharmaceutical Biology and Biotechnology, University of Marburg, Marburg, Germany
- Institute of Pharmacy, Martin-Luther-University Halle-Wittenberg, Halle (Saale), Germany
| | - Daniel Reker
- Department of Biomedical Engineering, Duke University, Durham, NC, USA
- Duke Microbiome Center, Duke University, Durham, NC, USA
| | - Philippe Schwaller
- Laboratory of Artificial Chemical Intelligence, Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
| | | | - Michael A Skinnider
- Adapsyn Bioscience, Hamilton, Ontario, Canada
- Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia, Canada
| | - Allison S Walker
- Department of Chemistry, Vanderbilt University, Nashville, TN, USA
- Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA
| | - Egon L Willighagen
- Department of Bioinformatics - BiGCaT, NUTRIM, Maastricht University, Maastricht, The Netherlands
| | - Barbara Zdrazil
- European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridgeshire, UK
| | - Nadine Ziemert
- Interfaculty Institute for Microbiology and Infection Medicine Tuebingen (IMIT), Institute for Bioinformatics and Medical Informatics (IBMI), University of Tuebingen, Tuebingen, Germany
| | | | - Pierre Guyomard
- Bonsai team, CRIStAL - Centre de Recherche en Informatique Signal et Automatique de Lille, Université de Lille, Villeneuve d'Ascq Cedex, France
| | - Andrea Volkamer
- Center for Bioinformatics, Saarland University, Saarbrücken, Germany
- In silico Toxicology and Structural Bioinformatics, Institute of Physiology, Charité - Universitätsmedizin Berlin, Berlin, Germany
| | - William H Gerwick
- Scripps Institution of Oceanography, University of California San Diego, La Jolla, CA, USA
| | - Hyun Uk Kim
- Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea
| | - Rolf Müller
- Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Saarbrücken, Germany
- Department of Pharmacy, Saarland University, Saarbrücken, Germany
- German Center for infection research (DZIF), Braunschweig, Germany
- Helmholtz International Lab for Anti-Infectives, Saarbrücken, Germany
| | - Gilles P van Wezel
- Department of Molecular Biotechnology, Institute of Biology, Leiden University, Leiden, The Netherlands
- Netherlands Institute of Ecology, NIOO-KNAW, Wageningen, The Netherlands
| | - Gerard J P van Westen
- Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden, The Netherlands.
| | - Anna K H Hirsch
- Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Saarbrücken, Germany.
- Department of Pharmacy, Saarland University, Saarbrücken, Germany.
- German Center for infection research (DZIF), Braunschweig, Germany.
- Helmholtz International Lab for Anti-Infectives, Saarbrücken, Germany.
| | - Roger G Linington
- Department of Chemistry, Simon Fraser University, Burnaby, British Columbia, Canada.
| | - Serina L Robinson
- Department of Environmental Microbiology, Eawag: Swiss Federal Institute for Aquatic Science and Technology, Dübendorf, Switzerland.
| | - Marnix H Medema
- Bioinformatics Group, Wageningen University, Wageningen, The Netherlands.
- Institute of Biology, Leiden University, Leiden, The Netherlands.
| |
Collapse
|
40
|
Wang R, Feng H, Wei GW. ChatGPT in Drug Discovery: A Case Study on Anti-Cocaine Addiction Drug Development with Chatbots. ARXIV 2023:arXiv:2308.06920v2. [PMID: 37645039 PMCID: PMC10462169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Subscribe] [Scholar Register] [Indexed: 08/31/2023]
Abstract
The birth of ChatGPT, a cutting-edge language model-based chatbot developed by OpenAI, ushered in a new era in AI. However, due to potential pitfalls, its role in rigorous scientific research is not clear yet. This paper vividly showcases its innovative application within the field of drug discovery. Focused specifically on developing anti-cocaine addiction drugs, the study employs GPT-4 as a virtual guide, offering strategic and methodological insights to researchers working on generative models for drug candidates. The primary objective is to generate optimal drug-like molecules with desired properties. By leveraging the capabilities of ChatGPT, the study introduces a novel approach to the drug discovery process. This symbiotic partnership between AI and researchers transforms how drug development is approached. Chatbots become facilitators, steering researchers towards innovative methodologies and productive paths for creating effective drug candidates. This research sheds light on the collaborative synergy between human expertise and AI assistance, wherein ChatGPT's cognitive abilities enhance the design and development of potential pharmaceutical solutions. This paper not only explores the integration of advanced AI in drug discovery but also reimagines the landscape by advocating for AI-powered chatbots as trailblazers in revolutionizing therapeutic innovation.
Collapse
Affiliation(s)
- Rui Wang
- Department of Mathematics,Michigan State University, MI 48824, USA
| | - Hongsong Feng
- Department of Mathematics,Michigan State University, MI 48824, USA
| | - Guo-Wei Wei
- Department of Mathematics,Michigan State University, MI 48824, USA
- Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA
| |
Collapse
|
41
|
Dollar O, Joshi N, Pfaendtner J, Beck DAC. Efficient 3D Molecular Design with an E(3) Invariant Transformer VAE. J Phys Chem A 2023; 127:7844-7852. [PMID: 37670244 DOI: 10.1021/acs.jpca.3c04188] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/07/2023]
Abstract
This work introduces a three-dimensional (3D) invariant graph-to-string transformer variational autoencoders (VAE) (Vagrant) for generating molecules with accurate density functional theory (DFT)-level properties. Vagrant learns to model the joint probability distribution of a 3D molecular structure and its properties by encoding molecular structures into a 3D-aware latent space. Directed navigation through this latent space implicitly optimizes the 3D structure of a molecule, and the latent embedding can be used to condition a generative transformer to predict the candidate structure as a one-dimensional (1D) sequence. Additionally, we introduce two novel sampling methods that exploit the latent characteristics of a VAE to improve performance. We show that our method outperforms comparable 3D autoregressive and diffusion methods for predicting quantum chemical property values of novel molecules in terms of both sample quality and computational efficiency.
Collapse
Affiliation(s)
- Orion Dollar
- Department of Chemical Engineering, University of Washington, Seattle, Washington 98195, United States
| | - Nisarg Joshi
- Department of Chemical Engineering, University of Washington, Seattle, Washington 98195, United States
| | - Jim Pfaendtner
- Department of Chemical Engineering, University of Washington, Seattle, Washington 98195, United States
| | - David A C Beck
- Department of Chemical Engineering, University of Washington, Seattle, Washington 98195, United States
- escience Institute, University of Washington, Seattle, Washington 98195, United States
| |
Collapse
|
42
|
Feng H, Wang R, Zhan CG, Wei GW. Multiobjective Molecular Optimization for Opioid Use Disorder Treatment Using Generative Network Complex. J Med Chem 2023; 66:12479-12498. [PMID: 37623046 PMCID: PMC11037444 DOI: 10.1021/acs.jmedchem.3c01053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/26/2023]
Abstract
Opioid use disorder (OUD) has emerged as a significant global public health issue, necessitating the discovery of new medications. In this study, we propose a deep generative model that combines a stochastic differential equation (SDE)-based diffusion model with a pretrained autoencoder. The molecular generator enables efficient generation of molecules that target multiple opioid receptors, including mu, kappa, and delta. Additionally, we assess the ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties of the generated molecules to identify druglike compounds. We develop a molecular optimization approach to enhance the pharmacokinetic properties of some lead compounds. Advanced binding affinity predictors were built using molecular fingerprints, including autoencoder embeddings, transformer embeddings, and topological Laplacians. Our process yields druglike molecules that can be used in highly focused experimental studies to further evaluate their pharmacological effects. Our machine learning platform serves as a valuable tool for designing effective molecules to address OUD.
Collapse
Affiliation(s)
- Hongsong Feng
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
| | - Rui Wang
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
| | - Chang-Guo Zhan
- Department of Pharmaceutical Sciences, University of Kentucky, Lexington, Kentucky 40506, United States
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
| |
Collapse
|
43
|
Cremer J, Medrano Sandonas L, Tkatchenko A, Clevert DA, De Fabritiis G. Equivariant Graph Neural Networks for Toxicity Prediction. Chem Res Toxicol 2023; 36. [PMID: 37690056 PMCID: PMC10583285 DOI: 10.1021/acs.chemrestox.3c00032] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/03/2023] [Indexed: 09/12/2023]
Abstract
Predictive modeling of toxicity is a crucial step in the drug discovery pipeline. It can help filter out molecules with a high probability of failing in the early stages of de novo drug design. Thus, several machine learning (ML) models have been developed to predict the toxicity of molecules by combining classical ML techniques or deep neural networks with well-known molecular representations such as fingerprints or 2D graphs. But the more natural, accurate representation of molecules is expected to be defined in physical 3D space like in ab initio methods. Recent studies successfully used equivariant graph neural networks (EGNNs) for representation learning based on 3D structures to predict quantum-mechanical properties of molecules. Inspired by this, we investigated the performance of EGNNs to construct reliable ML models for toxicity prediction. We used the equivariant transformer (ET) model in TorchMD-NET for this. Eleven toxicity data sets taken from MoleculeNet, TDCommons, and ToxBenchmark have been considered to evaluate the capability of ET for toxicity prediction. Our results show that ET adequately learns 3D representations of molecules that can successfully correlate with toxicity activity, achieving good accuracies on most data sets comparable to state-of-the-art models. We also test a physicochemical property, namely, the total energy of a molecule, to inform the toxicity prediction with a physical prior. However, our work suggests that these two properties can not be related. We also provide an attention weight analysis for helping to understand the toxicity prediction in 3D space and thus increase the explainability of the ML model. In summary, our findings offer promising insights considering 3D geometry information via EGNNs and provide a straightforward way to integrate molecular conformers into ML-based pipelines for predicting and investigating toxicity prediction in physical space. We expect that in the future, especially for larger, more diverse data sets, EGNNs will be an essential tool in this domain.
Collapse
Affiliation(s)
- Julian Cremer
- Computational
Science Laboratory, Universitat Pompeu Fabra,
Barcelona Biomedical Research Park (PRBB), Carrer Dr. Aiguader 88, 08003 Barcelona, Spain
- Machine
Learning Research, Pfizer Worldwide Research
Development and Medical, Linkstr. 10, 10785 Berlin, Germany
| | - Leonardo Medrano Sandonas
- Department
of Physics and Materials Science, University
of Luxembourg, L-1511 Luxembourg City, Luxembourg
| | - Alexandre Tkatchenko
- Department
of Physics and Materials Science, University
of Luxembourg, L-1511 Luxembourg City, Luxembourg
| | - Djork-Arné Clevert
- Machine
Learning Research, Pfizer Worldwide Research
Development and Medical, Linkstr. 10, 10785 Berlin, Germany
| | - Gianni De Fabritiis
- Computational
Science Laboratory, Universitat Pompeu Fabra,
Barcelona Biomedical Research Park (PRBB), Carrer Dr. Aiguader 88, 08003 Barcelona, Spain
- ICREA, Passeig Lluis Companys 23, 08010 Barcelona, Spain
| |
Collapse
|
44
|
Zhang Y, Menke J, He J, Nittinger E, Tyrchan C, Koch O, Zhao H. Similarity-based pairing improves efficiency of siamese neural networks for regression tasks and uncertainty quantification. J Cheminform 2023; 15:75. [PMID: 37649050 PMCID: PMC10469421 DOI: 10.1186/s13321-023-00744-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2022] [Accepted: 08/10/2023] [Indexed: 09/01/2023] Open
Abstract
Siamese networks, representing a novel class of neural networks, consist of two identical subnetworks sharing weights but receiving different inputs. Here we present a similarity-based pairing method for generating compound pairs to train Siamese neural networks for regression tasks. In comparison with the conventional exhaustive pairing, it reduces the algorithm complexity from O(n2) to O(n). It also results in a better prediction performance consistently on the three physicochemical datasets, using a multilayer perceptron with the circular fingerprint as a proof of concept. We further include into a Siamese neural network the transformer-based Chemformer, which extracts task-specific features from the simplified molecular-input line-entry system representation of compounds. Additionally, we propose a means to measure the prediction uncertainty by utilizing the variance in predictions from a set of reference compounds. Our results demonstrate that the high prediction accuracy correlates with the high confidence. Finally, we investigate implications of the similarity property principle in machine learning.
Collapse
Affiliation(s)
- Yumeng Zhang
- Medicinal Chemistry, Research and Early Development, Respiratory and Immunology (R&I), BioPharmaceuticals R&D, AstraZeneca, 43183, Gothenburg, Sweden
- Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden
| | - Janosch Menke
- Medicinal Chemistry, Research and Early Development, Respiratory and Immunology (R&I), BioPharmaceuticals R&D, AstraZeneca, 43183, Gothenburg, Sweden.
- Institute of Pharmaceutical and Medicinal Chemistry, Westfälische Wilhelms-Universität Münster, 48149, Münster, Germany.
| | - Jiazhen He
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, 43183, Gothenburg, Sweden
| | - Eva Nittinger
- Medicinal Chemistry, Research and Early Development, Respiratory and Immunology (R&I), BioPharmaceuticals R&D, AstraZeneca, 43183, Gothenburg, Sweden
| | - Christian Tyrchan
- Medicinal Chemistry, Research and Early Development, Respiratory and Immunology (R&I), BioPharmaceuticals R&D, AstraZeneca, 43183, Gothenburg, Sweden
| | - Oliver Koch
- Institute of Pharmaceutical and Medicinal Chemistry, Westfälische Wilhelms-Universität Münster, 48149, Münster, Germany
| | - Hongtao Zhao
- Medicinal Chemistry, Research and Early Development, Respiratory and Immunology (R&I), BioPharmaceuticals R&D, AstraZeneca, 43183, Gothenburg, Sweden.
| |
Collapse
|
45
|
Boldini D, Grisoni F, Kuhn D, Friedrich L, Sieber SA. Practical guidelines for the use of gradient boosting for molecular property prediction. J Cheminform 2023; 15:73. [PMID: 37641120 PMCID: PMC10464382 DOI: 10.1186/s13321-023-00743-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2023] [Accepted: 08/09/2023] [Indexed: 08/31/2023] Open
Abstract
Decision tree ensembles are among the most robust, high-performing and computationally efficient machine learning approaches for quantitative structure-activity relationship (QSAR) modeling. Among them, gradient boosting has recently garnered particular attention, for its performance in data science competitions, virtual screening campaigns, and bioactivity prediction. However, different variants of gradient boosting exist, the most popular being XGBoost, LightGBM and CatBoost. Our study provides the first comprehensive comparison of these approaches for QSAR. To this end, we trained 157,590 gradient boosting models, which were evaluated on 16 datasets and 94 endpoints, comprising 1.4 million compounds in total. Our results show that XGBoost generally achieves the best predictive performance, while LightGBM requires the least training time, especially for larger datasets. In terms of feature importance, the models surprisingly rank molecular features differently, reflecting differences in regularization techniques and decision tree structures. Thus, expert knowledge must always be employed when evaluating data-driven explanations of bioactivity. Furthermore, our results show that the relevance of each hyperparameter varies greatly across datasets and that it is crucial to optimize as many hyperparameters as possible to maximize the predictive performance. In conclusion, our study provides the first set of guidelines for cheminformatics practitioners to effectively train, optimize and evaluate gradient boosting models for virtual screening and QSAR applications.
Collapse
Affiliation(s)
- Davide Boldini
- Department of Bioscience, Center for Functional Protein Assemblies (CPA), Technical University of Munich, Garching bei Munich, Germany
| | - Francesca Grisoni
- Department of Biomedical Engineering, Institute for Complex Molecular Sciences, Eindhoven University of Technology, Eindhoven, The Netherlands
- Centre for Living Technologies, Alliance TU/E, WUR, UU, UMC Utrecht, Utrecht, The Netherlands
| | | | | | - Stephan A Sieber
- Department of Bioscience, Center for Functional Protein Assemblies (CPA), Technical University of Munich, Garching bei Munich, Germany.
| |
Collapse
|
46
|
Lanini J, Santarossa G, Sirockin F, Lewis R, Fechner N, Misztela H, Lewis S, Maziarz K, Stanley M, Segler M, Stiefl N, Schneider N. PREFER: A New Predictive Modeling Framework for Molecular Discovery. J Chem Inf Model 2023; 63:4497-4504. [PMID: 37487018 DOI: 10.1021/acs.jcim.3c00523] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/26/2023]
Abstract
Machine-learning and deep-learning models have been extensively used in cheminformatics to predict molecular properties, to reduce the need for direct measurements, and to accelerate compound prioritization. However, different setups and frameworks and the large number of molecular representations make it difficult to properly evaluate, reproduce, and compare them. Here we present a new PREdictive modeling FramEwoRk for molecular discovery (PREFER), written in Python (version 3.7.7) and based on AutoSklearn (version 0.14.7), that allows comparison between different molecular representations and common machine-learning models. We provide an overview of the design of our framework and show exemplary use cases and results of several representation-model combinations on diverse data sets, both public and in-house. Finally, we discuss the use of PREFER on small data sets. The code of the framework is freely available on GitHub.
Collapse
Affiliation(s)
- Jessica Lanini
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| | - Gianluca Santarossa
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| | - Finton Sirockin
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| | - Richard Lewis
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| | - Nikolas Fechner
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| | | | - Sarah Lewis
- Microsoft Research AI4Science, Cambridge CB1 2FB, U.K
| | | | - Megan Stanley
- Microsoft Research AI4Science, Cambridge CB1 2FB, U.K
| | - Marwin Segler
- Microsoft Research AI4Science, Cambridge CB1 2FB, U.K
| | - Nikolaus Stiefl
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| | - Nadine Schneider
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| |
Collapse
|
47
|
Gu Y, Li J, Kang H, Zhang B, Zheng S. Employing Molecular Conformations for Ligand-Based Virtual Screening with Equivariant Graph Neural Network and Deep Multiple Instance Learning. Molecules 2023; 28:5982. [PMID: 37630234 PMCID: PMC10459669 DOI: 10.3390/molecules28165982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2023] [Revised: 07/27/2023] [Accepted: 08/03/2023] [Indexed: 08/27/2023] Open
Abstract
Ligand-based virtual screening (LBVS) is a promising approach for rapid and low-cost screening of potentially bioactive molecules in the early stage of drug discovery. Compared with traditional similarity-based machine learning methods, deep learning frameworks for LBVS can more effectively extract high-order molecule structure representations from molecular fingerprints or structures. However, the 3D conformation of a molecule largely influences its bioactivity and physical properties, and has rarely been considered in previous deep learning-based LBVS methods. Moreover, the relative bioactivity benchmark dataset is still lacking. To address these issues, we introduce a novel end-to-end deep learning architecture trained from molecular conformers for LBVS. We first extracted molecule conformers from multiple public molecular bioactivity data and consolidated them into a large-scale bioactivity benchmark dataset, which totally includes millions of endpoints and molecules corresponding to 954 targets. Then, we devised a deep learning-based LBVS called EquiVS to learn molecule representations from conformers for bioactivity prediction. Specifically, graph convolutional network (GCN) and equivariant graph neural network (EGNN) are sequentially stacked to learn high-order molecule-level and conformer-level representations, followed with attention-based deep multiple-instance learning (MIL) to aggregate these representations and then predict the potential bioactivity for the query molecule on a given target. We conducted various experiments to validate the data quality of our benchmark dataset, and confirmed EquiVS achieved better performance compared with 10 traditional machine learning or deep learning-based LBVS methods. Further ablation studies demonstrate the significant contribution of molecular conformation for bioactivity prediction, as well as the reasonability and non-redundancy of deep learning architecture in EquiVS. Finally, a model interpretation case study on CDK2 shows the potential of EquiVS in optimal conformer discovery. The overall study shows that our proposed benchmark dataset and EquiVS method have promising prospects in virtual screening applications.
Collapse
Affiliation(s)
- Yaowen Gu
- Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing 100020, China; (Y.G.); (J.L.); (H.K.)
- Department of Chemistry, New York University, New York, NY 10027, USA
| | - Jiao Li
- Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing 100020, China; (Y.G.); (J.L.); (H.K.)
| | - Hongyu Kang
- Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing 100020, China; (Y.G.); (J.L.); (H.K.)
- Department of Biomedical Engineering, School of Life Science, Beijing Institute of Technology, Beijing 100081, China
| | - Bowen Zhang
- Beijing StoneWise Technology Co., Ltd., Beijing 100080, China;
| | - Si Zheng
- Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing 100020, China; (Y.G.); (J.L.); (H.K.)
- Institute for Artificial Intelligence, Department of Computer Science and Technology, BNRist, Tsinghua University, Beijing 100084, China
| |
Collapse
|
48
|
Yamanaka C, Uki S, Kaitoh K, Iwata M, Yamanishi Y. De novo drug design based on patient gene expression profiles via deep learning. Mol Inform 2023; 42:e2300064. [PMID: 37475603 DOI: 10.1002/minf.202300064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2023] [Revised: 05/25/2023] [Accepted: 07/20/2023] [Indexed: 07/22/2023]
Abstract
Computational de novo drug design is a challenging issue in medicine, and it is desirable to consider all of the relevant information of the biological systems in a disease state. Here, we propose a novel computational method to generate drug candidate molecular structures from patient gene expression profiles via deep learning, which we call DRAGONET. Our model can generate new molecules that are likely to counteract disease-specific gene expression patterns in patients, which is made possible by exploring the latent space constructed by a transformer-based variational autoencoder and integrating the substructures of disease-correlated molecules. We applied DRAGONET to generate drug candidate molecules for gastric cancer, atopic dermatitis, and Alzheimer's disease, and demonstrated that the newly generated molecules were chemically similar to registered drugs for each disease. This approach is applicable to diseases with unknown therapeutic target proteins and will make a significant contribution to the field of precision medicine.
Collapse
Affiliation(s)
- Chikashige Yamanaka
- Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka, 820-8502, Japan
| | - Shunya Uki
- Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka, 820-8502, Japan
| | - Kazuma Kaitoh
- Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka, 820-8502, Japan
- Graduate School of Informatics, Nagoya University, Chikusa, Nagoya, 464-8602, Japan
| | - Michio Iwata
- Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka, 820-8502, Japan
| | - Yoshihiro Yamanishi
- Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka, 820-8502, Japan
- Graduate School of Informatics, Nagoya University, Chikusa, Nagoya, 464-8602, Japan
| |
Collapse
|
49
|
Litsa EE, Chenthamarakshan V, Das P, Kavraki LE. An end-to-end deep learning framework for translating mass spectra to de-novo molecules. Commun Chem 2023; 6:132. [PMID: 37353554 DOI: 10.1038/s42004-023-00932-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2021] [Accepted: 06/13/2023] [Indexed: 06/25/2023] Open
Abstract
Elucidating the structure of a chemical compound is a fundamental task in chemistry with applications in multiple domains including drug discovery, precision medicine, and biomarker discovery. The common practice for elucidating the structure of a compound is to obtain a mass spectrum and subsequently retrieve its structure from spectral databases. However, these methods fail for novel molecules that are not present in the reference database. We propose Spec2Mol, a deep learning architecture for molecular structure recommendation given mass spectra alone. Spec2Mol is inspired by the Speech2Text deep learning architectures for translating audio signals into text. Our approach is based on an encoder-decoder architecture. The encoder learns the spectra embeddings, while the decoder, pre-trained on a massive dataset of chemical structures for translating between different molecular representations, reconstructs SMILES sequences of the recommended chemical structures. We have evaluated Spec2Mol by assessing the molecular similarity between the recommended structures and the original structure. Our analysis showed that Spec2Mol is able to identify the presence of key molecular substructures from its mass spectrum, and shows on par performance, when compared to existing fragmentation tree methods particularly when test structure information is not available during training or present in the reference database.
Collapse
Affiliation(s)
- Eleni E Litsa
- Department of Computer Science, Rice University, Houston, TX, USA
| | | | - Payel Das
- IBM Research, IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA.
| | - Lydia E Kavraki
- Department of Computer Science, Rice University, Houston, TX, USA.
| |
Collapse
|
50
|
Dost K, Pullar-Strecker Z, Brydon L, Zhang K, Hafner J, Riddle PJ, Wicker JS. Combatting over-specialization bias in growing chemical databases. J Cheminform 2023; 15:53. [PMID: 37208694 DOI: 10.1186/s13321-023-00716-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2022] [Accepted: 03/25/2023] [Indexed: 05/21/2023] Open
Abstract
BACKGROUND Predicting in advance the behavior of new chemical compounds can support the design process of new products by directing the research toward the most promising candidates and ruling out others. Such predictive models can be data-driven using Machine Learning or based on researchers' experience and depend on the collection of past results. In either case: models (or researchers) can only make reliable assumptions about compounds that are similar to what they have seen before. Therefore, consequent usage of these predictive models shapes the dataset and causes a continuous specialization shrinking the applicability domain of all trained models on this dataset in the future, and increasingly harming model-based exploration of the space. PROPOSED SOLUTION In this paper, we propose CANCELS (CounterActiNg Compound spEciaLization biaS), a technique that helps to break the dataset specialization spiral. Aiming for a smooth distribution of the compounds in the dataset, we identify areas in the space that fall short and suggest additional experiments that help bridge the gap. Thereby, we generally improve the dataset quality in an entirely unsupervised manner and create awareness of potential flaws in the data. CANCELS does not aim to cover the entire compound space and hence retains a desirable degree of specialization to a specified research domain. RESULTS An extensive set of experiments on the use-case of biodegradation pathway prediction not only reveals that the bias spiral can indeed be observed but also that CANCELS produces meaningful results. Additionally, we demonstrate that mitigating the observed bias is crucial as it cannot only intervene with the continuous specialization process, but also significantly improves a predictor's performance while reducing the number of required experiments. Overall, we believe that CANCELS can support researchers in their experimentation process to not only better understand their data and potential flaws, but also to grow the dataset in a sustainable way. All code is available under github.com/KatDost/Cancels .
Collapse
Affiliation(s)
- Katharina Dost
- School of Computer Science, University of Auckland, 38 Princes Street, 1010, Auckland, New Zealand.
- enviPath UG & Co. KG, In den Graswiesen 13, 55437, Ockenheim, Germany.
| | - Zac Pullar-Strecker
- School of Computer Science, University of Auckland, 38 Princes Street, 1010, Auckland, New Zealand
| | - Liam Brydon
- School of Computer Science, University of Auckland, 38 Princes Street, 1010, Auckland, New Zealand
| | - Kunyang Zhang
- Eawag-Swiss Federal Institute of Aquatic Science and Technology, Überlandstrasse 133, 8600, Dübendorf, Switzerland
| | - Jasmin Hafner
- Eawag-Swiss Federal Institute of Aquatic Science and Technology, Überlandstrasse 133, 8600, Dübendorf, Switzerland
| | - Patricia J Riddle
- School of Computer Science, University of Auckland, 38 Princes Street, 1010, Auckland, New Zealand
| | - Jörg S Wicker
- School of Computer Science, University of Auckland, 38 Princes Street, 1010, Auckland, New Zealand
- enviPath UG & Co. KG, In den Graswiesen 13, 55437, Ockenheim, Germany
| |
Collapse
|