1
Vieira Wyzykowski A, Niazi FF, Dickson A. AGDIFF: Attention-Enhanced Diffusion for Molecular Geometry Prediction. J Chem Inf Model 2025; 65:1798-1811. [PMID: 39933880 PMCID: PMC11863375 DOI: 10.1021/acs.jcim.4c01896] [Received: 10/28/2024] [Revised: 01/30/2025] [Accepted: 02/03/2025] [Indexed: 02/13/2025]
Abstract
Accurate prediction of molecular geometries is crucial for drug discovery and materials science. Existing fast conformer prediction algorithms often rely on approximate empirical energy functions, resulting in low accuracy. More accurate methods like ab initio molecular dynamics and Markov chain Monte Carlo can be computationally expensive due to the need for evaluating quantum mechanical energy functions. To address this, we introduce AGDIFF, a novel machine learning framework that utilizes diffusion models for efficient and accurate molecular structure prediction. AGDIFF extends previous models (such as GeoDiff) by enhancing the global, local, and edge encoders with attention mechanisms, an improved SchNet architecture, batch normalization, and feature expansion techniques. AGDIFF outperforms GeoDiff on both the GEOM-QM9 and GEOM-Drugs data sets. For GEOM-QM9, with a threshold (δ) of 0.5 Å, AGDIFF achieves a mean COV-R of 93.08% and a mean MAT-R of 0.1965 Å. On the more complex GEOM-Drugs data set, using δ = 1.25 Å, AGDIFF attains a median COV-R of 100.00% and a mean MAT-R of 0.8237 Å. These findings demonstrate AGDIFF's potential to advance molecular modeling techniques, enabling more efficient and accurate prediction of molecular geometries, thus contributing to computational chemistry, drug discovery, and materials design. https://github.com/ADicksonLab/AGDIFF.
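The COV-R and MAT-R figures quoted in the abstract are recall-style metrics computed from pairwise RMSDs between reference and generated conformers. The following is a minimal sketch of how they are commonly defined for a single molecule, assuming the RMSD matrix has already been computed after alignment. The function name, the toy matrix, and the use of `<=` at the threshold are illustrative assumptions, not taken from the AGDIFF code.

```python
import numpy as np

def coverage_and_matching_recall(rmsd, delta):
    """Return (COV-R, MAT-R) for one molecule.

    rmsd[i, j] is the RMSD in angstroms between reference conformer i
    and generated conformer j (assumed precomputed).
    """
    rmsd = np.asarray(rmsd, dtype=float)
    # For each reference conformer, the RMSD to its closest generated conformer.
    best = rmsd.min(axis=1)
    cov_r = float((best <= delta).mean())  # fraction of references matched within delta
    mat_r = float(best.mean())             # average closest-match RMSD
    return cov_r, mat_r

# Made-up RMSD matrix: 3 reference conformers x 4 generated conformers.
toy_rmsd = np.array([
    [0.30, 0.90, 1.20, 0.70],
    [0.80, 0.45, 1.10, 0.95],
    [1.40, 0.75, 0.60, 1.30],
])
cov_r, mat_r = coverage_and_matching_recall(toy_rmsd, delta=0.5)
```

In this toy case two of the three reference conformers have a generated match within 0.5 Å. Benchmark tables then aggregate such per-molecule values, typically as a mean or median over the test set.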
Affiliation(s)
- Fatemeh Fathi Niazi
- Department of Computational Mathematics, Science & Engineering, Michigan State University, East Lansing, Michigan 48824, United States
- Alex Dickson
- Department of Biochemistry & Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Computational Mathematics, Science & Engineering, Michigan State University, East Lansing, Michigan 48824, United States
2
Chang HC, Tsai MH, Li YP. Enhancing Activation Energy Predictions under Data Constraints Using Graph Neural Networks. J Chem Inf Model 2025; 65:1367-1377. [PMID: 39862160 PMCID: PMC11815826 DOI: 10.1021/acs.jcim.4c02319] [Received: 12/11/2024] [Revised: 01/14/2025] [Accepted: 01/14/2025] [Indexed: 01/27/2025]
Abstract
Accurately predicting activation energies is crucial for understanding chemical reactions and modeling complex reaction systems. However, the high computational cost of quantum chemistry methods often limits the feasibility of large-scale studies, leading to a scarcity of high-quality activation energy data. In this work, we explore and compare three innovative approaches (transfer learning, delta learning, and feature engineering) to enhance the accuracy of activation energy predictions using graph neural networks, specifically focusing on methods that incorporate low-cost, low-level computational data. Using the Chemprop model, we systematically evaluated how these methods leverage data from semiempirical quantum mechanics (SQM) calculations to improve predictions. Delta learning, which adjusts low-level SQM activation energies to align with high-level CCSD(T)-F12a targets, emerged as the most effective method, achieving high accuracy with substantially reduced data requirements. Notably, delta learning trained with just 20-30% of high-level data matched or exceeded the performance of other methods trained with full data sets, making it advantageous in data-scarce scenarios. However, its reliance on transition state searches imposes significant computational demands during model application. Transfer learning, which pretrains models on large data sets of low-level data, provided mixed results, particularly when there was a mismatch in the reaction distributions between the training and target data sets. Feature engineering, which involves adding computed molecular properties as input features, showed modest gains, particularly in thermodynamic properties. Our study highlights the trade-offs between accuracy and computational demand in selecting the best approach for enhancing activation energy predictions. 
These insights provide valuable guidelines for researchers aiming to apply machine learning in chemical reaction engineering, helping to balance accuracy with resource constraints.
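The delta-learning idea described in the abstract can be illustrated with a small sketch: a model is fitted to the difference between high-level and low-level activation energies, and the prediction is the low-level value plus the learned correction. Everything below is an assumption made for illustration. The data are synthetic, the variable names are hypothetical, and an ordinary least-squares fit stands in for the Chemprop graph neural network used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 reactions with 5 descriptor features each.
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
e_low = rng.normal(loc=50.0, scale=10.0, size=n)  # low-level (SQM-like) barriers
# High-level targets = low-level value + systematic correction + small noise.
e_high = e_low + X @ w_true + 3.0 + rng.normal(scale=0.5, size=n)

# Fit the correction (delta) on a small labelled subset only.
n_train = 50
A_train = np.column_stack([X[:n_train], np.ones(n_train)])
delta_train = e_high[:n_train] - e_low[:n_train]
coef, *_ = np.linalg.lstsq(A_train, delta_train, rcond=None)

# Predict: low-level energy plus the learned correction.
A_test = np.column_stack([X[n_train:], np.ones(n - n_train)])
e_pred = e_low[n_train:] + A_test @ coef

mae_low = np.mean(np.abs(e_low[n_train:] - e_high[n_train:]))
mae_delta = np.mean(np.abs(e_pred - e_high[n_train:]))
```

Note that the prediction step still needs `e_low` for every new reaction, which is why the abstract points out that delta learning requires a low-level transition state search at application time.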
Affiliation(s)
- Han-Chung Chang
- Department of Chemical Engineering, National Taiwan University, No. 1, Section 4, Roosevelt Road, Taipei 10617, Taiwan
- Ming-Hsuan Tsai
- Department of Chemical Engineering, National Taiwan University, No. 1, Section 4, Roosevelt Road, Taipei 10617, Taiwan
- Yi-Pei Li
- Department of Chemical Engineering, National Taiwan University, No. 1, Section 4, Roosevelt Road, Taipei 10617, Taiwan
- Taiwan International Graduate Program on Sustainable Chemical Science and Technology (TIGP-SCST), No. 128, Section 2, Academia Road, Taipei 11529, Taiwan
3
Reidenbach D, Krishnapriyan AS. CoarsenConf: Equivariant Coarsening with Aggregated Attention for Molecular Conformer Generation. J Chem Inf Model 2025; 65:22-30. [PMID: 39688534 PMCID: PMC11733938 DOI: 10.1021/acs.jcim.4c01001] [Received: 06/08/2024] [Revised: 10/24/2024] [Accepted: 11/11/2024] [Indexed: 12/18/2024]
Abstract
Molecular conformer generation (MCG) is an important task in cheminformatics and drug discovery. The ability to efficiently generate low-energy 3D structures can avoid expensive quantum mechanical simulations, leading to accelerated virtual screenings and enhanced structural exploration. Several generative models have been developed for MCG, but many struggle to consistently produce high-quality conformers for meaningful downstream applications. To address these issues, we introduce CoarsenConf, which coarse-grains molecular graphs based on torsional angles and integrates them into an SE(3)-equivariant hierarchical variational autoencoder. Through equivariant coarse-graining, we aggregate the fine-grained atomic coordinates of subgraphs connected via rotatable bonds, creating a variable-length coarse-grained latent representation. Our model uses a novel aggregated attention mechanism to restore fine-grained coordinates from the coarse-grained latent representation, enabling efficient generation of accurate conformers. Furthermore, we evaluate the chemical and biochemical quality of our generated conformers on multiple downstream applications, including property prediction and large-scale oracle-based protein docking. Overall, CoarsenConf generates more accurate conformer ensembles compared to prior generative models.
Affiliation(s)
- Danny Reidenbach
- Department of Chemical Engineering, Department of Computer Science, University of California Berkeley, Berkeley, California 94720, United States
- NVIDIA, Santa Clara, California 95051, United States
- Aditi S. Krishnapriyan
- Department of Chemical Engineering, Department of Computer Science, University of California Berkeley, Berkeley, California 94720, United States
4
Xia Q, Fu Q, Shen C, Brenk R, Huang N. Assessing small molecule conformational sampling methods in molecular docking. J Comput Chem 2025; 46:e27516. [PMID: 39476310 DOI: 10.1002/jcc.27516] [Received: 06/07/2024] [Revised: 09/05/2024] [Accepted: 10/13/2024] [Indexed: 01/01/2025]
Abstract
Small molecule conformational sampling plays a pivotal role in molecular docking. Recent advancements have led to the emergence of various conformational sampling methods, each employing distinct algorithms. This study investigates the impact of different small molecule conformational sampling methods in molecular docking using UCSF DOCK 3.7. Specifically, six traditional sampling methods (Omega, BCL::Conf, CCDC Conformer Generator, ConfGenX, Conformator, RDKit ETKDGv3) and a deep learning-based model (Torsional Diffusion) for generating conformational ensembles are evaluated. These ensembles are subsequently docked against the Platinum Diverse Dataset, the PoseBusters dataset and the DUDE-Z dataset to assess binding pose reproducibility and screening power. Notably, different sampling methods exhibit varying performance due to their unique preferences, such as dihedral angle sampling ranges on rotatable bonds. Combining complementary methods may lead to further improvements in docking performance.
Affiliation(s)
- Qiancheng Xia
- Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, Beijing, China
- National Institute of Biological Sciences, Beijing, China
- Qiuyu Fu
- National Institute of Biological Sciences, Beijing, China
- Cheng Shen
- National Institute of Biological Sciences, Beijing, China
- Ruth Brenk
- Department of Biomedicine, University of Bergen, Bergen, Norway
- Niu Huang
- Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, Beijing, China
- National Institute of Biological Sciences, Beijing, China
5
Liu H, Qin Y, Niu Z, Xu M, Wu J, Xiao X, Lei J, Ran T, Chen H. How Good are Current Pocket-Based 3D Generative Models?: The Benchmark Set and Evaluation of Protein Pocket-Based 3D Molecular Generative Models. J Chem Inf Model 2024; 64:9260-9275. [PMID: 39629985 DOI: 10.1021/acs.jcim.4c01598] [Indexed: 12/24/2024]
Abstract
The development of a three-dimensional (3D) molecular generative model based on protein pockets has recently attracted a lot of attention. This type of model aims to achieve the simultaneous generation of molecular graphs and 3D binding conformation under the constraint of protein binding. Various pocket-based generative models have been proposed; however, currently, there is a lack of systematic and objective evaluation metrics for these models. To address this issue, a comprehensive benchmark data set, named POKMOL-3D, is proposed to evaluate protein pocket-based 3D molecular generative models. It includes 32 protein targets together with their known active compounds as a test set to evaluate the versatility of generation models to mimic the real-world scenario. Additionally, a series of two-dimensional (2D) and 3D evaluation metrics with some newly created ones was integrated to assess the quality of generated molecular structures and their binding conformations. It is expected that this work can enhance our comprehension of the effectiveness and weakness of current 3D generative models and stimulate the discussion on challenges and useful guidance for developing the next wave of molecular generative models.
Affiliation(s)
- Haoyang Liu
- State Key Laboratory of Medicinal Chemical Biology and College of Life Sciences, Nankai University, 94 Weijin Road, Tianjin 300071, China
- Division of Drug and Vaccine Research, Guangzhou National Laboratory, Guangzhou 510005, Guangdong, China
- Yifei Qin
- School of Pharmacy and Food Engineering, Wuyi University, Jiangmen 529020, Guangdong, China
- Zhangming Niu
- National Heart and Lung Institute, Imperial College London, London SW7 2AZ, U.K
- MindRank AI, Hangzhou 311113, Zhejiang, China
- AI Research Center, MindRank Technologies Limited, London EC2N 2AX, U.K
- Mingyuan Xu
- Division of Drug and Vaccine Research, Guangzhou National Laboratory, Guangzhou 510005, Guangdong, China
- Jiaqiang Wu
- School of Pharmacy and Food Engineering, Wuyi University, Jiangmen 529020, Guangdong, China
- Xianglu Xiao
- MindRank AI, Hangzhou 311113, Zhejiang, China
- AI Research Center, MindRank Technologies Limited, London EC2N 2AX, U.K
- Bioengineering Department and Imperial-X, Imperial College London, London W12 7SL, U.K
- Jinping Lei
- School of Pharmaceutical Science, Sun Yat-Sen University, Guangzhou 510006, China
- Ting Ran
- Division of Drug and Vaccine Research, Guangzhou National Laboratory, Guangzhou 510005, Guangdong, China
- Hongming Chen
- Division of Drug and Vaccine Research, Guangzhou National Laboratory, Guangzhou 510005, Guangdong, China
- School of Basic Medical Sciences, Guangzhou Laboratory, Guangzhou Medical University, Guangzhou 511436, China
6
Wang D, Dong X, Zhang X, Hu L. GADIFF: a transferable graph attention diffusion model for generating molecular conformations. Brief Bioinform 2024; 26:bbae676. [PMID: 39737569 DOI: 10.1093/bib/bbae676] [Received: 06/26/2024] [Revised: 11/04/2024] [Accepted: 12/15/2024] [Indexed: 01/01/2025]
Abstract
The diffusion generative model has achieved remarkable performance across various research fields. In this study, we propose a transferable graph attention diffusion model, GADIFF, for the molecular conformation generation task. Adopting multiple equivariant networks in the Markov chain, GADIFF adds GIN (Graph Isomorphism Network) to acquire local information of subgraphs with different edge types (atomic bonds, bond angle interactions, torsion angle interactions, long-range interactions) and applies MSA (Multi-head Self-attention) as a noise attention mechanism to capture global molecular information, which improves the representativeness of the features. In addition, we utilize MSA to calculate dynamic noise weights to boost molecular conformation noise prediction. With these improvements, GADIFF achieves competitive performance compared with recently reported state-of-the-art models in terms of generation diversity (COV-R, COV-P), accuracy (MAT-R, MAT-P), and property prediction for the GEOM-QM9 and GEOM-Drugs datasets. In particular, on the GEOM-Drugs dataset, the average COV-R is improved by 3.75% compared with the best baseline model at a threshold of 1.25 Å. Furthermore, a transfer model named GADIFF-NCI, based on GADIFF, is developed to generate conformations for noncovalent interaction (NCI) molecular systems. It takes GADIFF trained on the GEOM-QM9 dataset as a pre-trained model and incorporates a graph encoder for learning molecular vectors at the NCI molecular level. The resulting NCI molecular conformations are reasonable, as assessed by the evaluation of conformation and property predictions. This suggests that the proposed transferable model may hold noteworthy value for the study of multi-molecular conformations. The code and data of GADIFF are freely available at https://github.com/WangDHg/GADIFF.
Affiliation(s)
- Donghan Wang
- School of Information Science and Technology, Northeast Normal University, 130117 Changchun, China
- Xu Dong
- School of Information Science and Technology, Northeast Normal University, 130117 Changchun, China
- Xueyou Zhang
- School of Information Science and Technology, Northeast Normal University, 130117 Changchun, China
- LiHong Hu
- School of Information Science and Technology, Northeast Normal University, 130117 Changchun, China
7
Luo Y, Fang J, Li S, Liu Z, Wu J, Zhang A, Du W, Wang X. Text-guided small molecule generation via diffusion model. iScience 2024; 27:110992. [PMID: 39759073 PMCID: PMC11700631 DOI: 10.1016/j.isci.2024.110992] [Received: 03/08/2024] [Revised: 06/23/2024] [Accepted: 09/16/2024] [Indexed: 01/07/2025]
Abstract
The de novo generation of molecules with targeted properties is crucial in biology, chemistry, and drug discovery. Current generative models are limited to using single property values as conditions, struggling with complex customizations described in detailed human language. To address this, we propose the text guidance instead, and introduce TextSMOG, a new text-guided small molecule generation approach via diffusion model, which integrates language and diffusion models for text-guided small molecule generation. This method uses textual conditions to guide molecule generation, enhancing both stability and diversity. Experimental results show TextSMOG's proficiency in capturing and utilizing information from textual descriptions, making it a powerful tool for generating 3D molecular structures in response to complex textual customizations.
Affiliation(s)
- Yanchen Luo
- University of Science and Technology of China, Hefei, Anhui, China
- Junfeng Fang
- University of Science and Technology of China, Hefei, Anhui, China
- Sihang Li
- University of Science and Technology of China, Hefei, Anhui, China
- Zhiyuan Liu
- National University of Singapore, Singapore, Singapore
- Jiancan Wu
- University of Science and Technology of China, Hefei, Anhui, China
- An Zhang
- National University of Singapore, Singapore, Singapore
- Wenjie Du
- University of Science and Technology of China, Hefei, Anhui, China
- Xiang Wang
- University of Science and Technology of China, Hefei, Anhui, China
8
Fan Z, Yang Y, Xu M, Chen H. EC-Conf: A ultra-fast diffusion model for molecular conformation generation with equivariant consistency. J Cheminform 2024; 16:107. [PMID: 39228003 PMCID: PMC11373173 DOI: 10.1186/s13321-024-00893-2] [Received: 03/02/2024] [Accepted: 08/06/2024] [Indexed: 09/05/2024]
Abstract
Despite recent advances in 3D molecular conformation generation driven by diffusion models, the high computational cost of the iterative diffusion/denoising process limits their application. Here, an equivariant consistency model (EC-Conf) is proposed as a fast diffusion method for low-energy conformation generation. In EC-Conf, a modified SE(3)-equivariant transformer model is used directly to encode the Cartesian molecular conformations, and a highly efficient consistency diffusion process is carried out to generate molecular conformations. It is demonstrated that, with only one sampling step, EC-Conf can already achieve quality comparable to other diffusion-based models running with thousands of denoising steps. Its performance can be further improved with a few more sampling iterations. The performance of EC-Conf is evaluated on both the GEOM-QM9 and GEOM-Drugs sets. Our results demonstrate that the efficiency of EC-Conf for learning the distribution of low-energy molecular conformations is at least two orders of magnitude higher than that of current SOTA diffusion models, and that it could potentially become a useful tool for conformation generation and sampling. SCIENTIFIC CONTRIBUTIONS: In this work, we proposed an equivariant consistency model that significantly improves the efficiency of conformation generation in diffusion-based models while maintaining high structural quality. This method serves as a general framework and can be further extended to more complex structure generation and prediction tasks, including those involving proteins, in future steps.
Affiliation(s)
- Zhiguang Fan
- School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, 510006, China
- Guangzhou National Laboratory, Guangzhou, 510005, China
- Yuedong Yang
- School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, 510006, China
- Mingyuan Xu
- Guangzhou National Laboratory, Guangzhou, 510005, China
- Hongming Chen
- Guangzhou National Laboratory, Guangzhou, 510005, China
- Guangzhou Medical University, Guangzhou, 511495, China
9
Grambow CA, Weir H, Cunningham CN, Biancalani T, Chuang KV. CREMP: Conformer-rotamer ensembles of macrocyclic peptides for machine learning. Sci Data 2024; 11:859. [PMID: 39122750 PMCID: PMC11316032 DOI: 10.1038/s41597-024-03698-y] [Received: 03/11/2024] [Accepted: 07/29/2024] [Indexed: 08/12/2024]
Abstract
Computational and machine learning approaches to model the conformational landscape of macrocyclic peptides have the potential to enable rational design and optimization. However, accurate, fast, and scalable methods for modeling macrocycle geometries remain elusive. Recent deep learning approaches have significantly accelerated protein structure prediction and the generation of small-molecule conformational ensembles, yet similar progress has not been made for macrocyclic peptides due to their unique properties. Here, we introduce CREMP, a resource generated for the rapid development and evaluation of machine learning models for macrocyclic peptides. CREMP contains 36,198 unique macrocyclic peptides and their high-quality structural ensembles generated using the Conformer-Rotamer Ensemble Sampling Tool (CREST). Altogether, this new dataset contains nearly 31.3 million unique macrocycle geometries, each annotated with energies derived from semi-empirical extended tight-binding (xTB) DFT calculations. Additionally, we include 3,258 macrocycles with reported passive permeability data to couple conformational ensembles to experiment. We anticipate that this dataset will enable the development of machine learning models that can improve peptide design and optimization for novel therapeutics.
Affiliation(s)
- Colin A Grambow
- Prescient Design, Genentech, 1 DNA Way, South San Francisco, CA, 94080, USA
- Hayley Weir
- Prescient Design, Genentech, 1 DNA Way, South San Francisco, CA, 94080, USA
- Christian N Cunningham
- Department of Peptide Therapeutics, Genentech, 1 DNA Way, South San Francisco, CA, 94080, USA
- Tommaso Biancalani
- Biology Research | Development, Genentech, 1 DNA Way, South San Francisco, CA, 94080, USA
- Kangway V Chuang
- Prescient Design, Genentech, 1 DNA Way, South San Francisco, CA, 94080, USA
10
Guzman-Pando A, Ramirez-Alonso G, Arzate-Quintana C, Camarillo-Cisneros J. Deep learning algorithms applied to computational chemistry. Mol Divers 2024; 28:2375-2410. [PMID: 38151697 DOI: 10.1007/s11030-023-10771-y] [Received: 09/20/2023] [Accepted: 11/14/2023] [Indexed: 12/29/2023]
Abstract
Recently, there has been a significant increase in the use of deep learning techniques in the molecular sciences, which have shown high performance on datasets and the ability to generalize across data. However, no model has achieved perfect performance in solving all problems, and the pros and cons of each approach remain unclear to those new to the field. Therefore, this paper aims to review deep learning algorithms that have been applied to solve molecular challenges in computational chemistry. We proposed a comprehensive categorization that encompasses two primary approaches; conventional deep learning and geometric deep learning models. This classification takes into account the distinct techniques employed by the algorithms within each approach. We present an up-to-date analysis of these algorithms, emphasizing their key features and open issues. This includes details of input descriptors, datasets used, open-source code availability, task solutions, and actual research applications, focusing on general applications rather than specific ones such as drug discovery. Furthermore, our report discusses trends and future directions in molecular algorithm design, including the input descriptors used for each deep learning model, GPU usage, training and forward processing time, model parameters, the most commonly used datasets, libraries, and optimization schemes. This information aids in identifying the most suitable algorithms for a given task. It also serves as a reference for the datasets and input data frequently used for each algorithm technique. In addition, it provides insights into the benefits and open issues of each technique, and supports the development of novel computational chemistry systems.
Affiliation(s)
- Abimael Guzman-Pando
- Computational Chemistry Physics Laboratory, Facultad de Medicina y Ciencias Biomédicas, Universidad Autónoma de Chihuahua, Campus II, 31125, Chihuahua, Mexico
- Graciela Ramirez-Alonso
- Faculty of Engineering, Universidad Autónoma de Chihuahua, Campus II, 31125, Chihuahua, Mexico
- Carlos Arzate-Quintana
- Computational Chemistry Physics Laboratory, Facultad de Medicina y Ciencias Biomédicas, Universidad Autónoma de Chihuahua, Campus II, 31125, Chihuahua, Mexico
- Javier Camarillo-Cisneros
- Computational Chemistry Physics Laboratory, Facultad de Medicina y Ciencias Biomédicas, Universidad Autónoma de Chihuahua, Campus II, 31125, Chihuahua, Mexico
11
Ai C, Yang H, Liu X, Dong R, Ding Y, Guo F. MTMol-GPT: De novo multi-target molecular generation with transformer-based generative adversarial imitation learning. PLoS Comput Biol 2024; 20:e1012229. [PMID: 38924082 PMCID: PMC11233020 DOI: 10.1371/journal.pcbi.1012229] [Received: 07/04/2023] [Revised: 07/09/2024] [Accepted: 06/03/2024] [Indexed: 06/28/2024]
Abstract
De novo drug design, which aims to generate new drugs with specific pharmacological properties, is crucial in advancing drug discovery. Recently, deep generative models have achieved inspiring progress in generating drug-like compounds. However, these models prioritize single-target drug generation for pharmacological intervention, neglecting the complicated inherent mechanisms of diseases, which are influenced by multiple factors. Consequently, developing novel multi-target drugs that act on several specific targets simultaneously can enhance anti-tumor efficacy and address issues related to resistance mechanisms. To address this issue, and inspired by Generative Pre-trained Transformer (GPT) models, we propose an upgraded GPT model with generative adversarial imitation learning for multi-target molecular generation, called MTMol-GPT. The multi-target molecular generator employs a dual discriminator model using the Inverse Reinforcement Learning (IRL) method for concurrent multi-target molecular generation. Extensive results show that MTMol-GPT generates various valid, novel, and effective multi-target molecules for various complex diseases, demonstrating robustness and generalization capability. In addition, molecular docking and pharmacophore mapping experiments demonstrate the drug-likeness and effectiveness of the generated molecules, which could potentially improve neuropsychiatric interventions. Furthermore, our model's generalizability is exemplified by a case study focusing on multi-targeted drug design for breast cancer. As a broadly applicable solution for multiple targets, MTMol-GPT provides new insight into future directions to enhance potential complex disease therapeutics by generating high-quality multi-target molecules in drug discovery.
Affiliation(s)
- Chengwei Ai
- School of computer science and engineering, Central South University, Changsha, China
- Hongpeng Yang
- Department of computer science and engineering, University of South Carolina, Columbia, South Carolina, United States of America
- Xiaoyi Liu
- School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, China
- Ministry of Education, Engineering Research Center for Pharmaceutics of Chinese Materia Medica and New Drug Development, Beijing, China
- Ruihan Dong
- Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
- Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Fei Guo
- School of computer science and engineering, Central South University, Changsha, China
12
Kuznetsov M, Ryabov F, Schutski R, Shayakhmetov R, Lin YC, Aliper A, Polykovskiy D. COSMIC: Molecular Conformation Space Modeling in Internal Coordinates with an Adversarial Framework. J Chem Inf Model 2024; 64:3610-3620. [PMID: 38668753 PMCID: PMC11094738 DOI: 10.1021/acs.jcim.3c00989] [Received: 10/27/2023] [Revised: 03/29/2024] [Accepted: 04/02/2024] [Indexed: 05/14/2024]
Abstract
Fast and accurate conformation space modeling is an essential part of computational approaches for solving ligand- and structure-based drug discovery problems. Recent state-of-the-art diffusion models for molecular conformation generation show promising distribution coverage and physical plausibility metrics but suffer from a slow sampling procedure. We propose a novel adversarial generative framework, COSMIC, that shows comparable generative performance but provides a time-efficient sampling and training procedure. Given a molecular graph and random noise, the generator produces a conformation in two stages. In the first stage, it constructs a conformation in a rotation- and translation-invariant representation: internal coordinates. In the second stage, the model predicts the distances between neighboring atoms and performs a few fast optimization steps to refine the initial conformation. The proposed model takes conformation energy into account and achieves comparable space coverage and diversity metrics.
Affiliation(s)
- Maksim Kuznetsov
- Insilico Medicine Canada Inc., 1250 René-Lévesque Ouest, Suite 3710, Montréal, Québec H3B 4W8, Canada
- Fedor Ryabov
- Insilico Medicine Hong Kong Ltd., Unit 310, 3/F, Building 8W, Phase 2, Hong Kong Science Park, Pak Shek Kok, New Territories, Hong Kong 999077, China
- Roman Schutski
- Insilico Medicine Hong Kong Ltd., Unit 310, 3/F, Building 8W, Phase 2, Hong Kong Science Park, Pak Shek Kok, New Territories, Hong Kong 999077, China
- Rim Shayakhmetov
- Insilico Medicine Canada Inc., 1250 René-Lévesque Ouest, Suite 3710, Montréal, Québec H3B 4W8, Canada
- Yen-Chu Lin
- Insilico Medicine Taiwan Ltd., Taipei City 110208, Taiwan
- Alex Aliper
- Insilico Medicine Hong Kong Ltd., Unit 310, 3/F, Building 8W, Phase 2, Hong Kong Science Park, Pak Shek Kok, New Territories, Hong Kong 999077, China
- Daniil Polykovskiy
- Insilico Medicine Canada Inc., 1250 René-Lévesque Ouest, Suite 3710, Montréal, Québec H3B 4W8, Canada
13
Ju W, Fang Z, Gu Y, Liu Z, Long Q, Qiao Z, Qin Y, Shen J, Sun F, Xiao Z, Yang J, Yuan J, Zhao Y, Wang Y, Luo X, Zhang M. A Comprehensive Survey on Deep Graph Representation Learning. Neural Netw 2024; 173:106207. [PMID: 38442651 DOI: 10.1016/j.neunet.2024.106207] [Received: 08/28/2023] [Revised: 01/23/2024] [Accepted: 02/21/2024] [Indexed: 03/07/2024]
Abstract
Graph representation learning aims to effectively encode high-dimensional sparse graph-structured data into low-dimensional dense vectors, which is a fundamental task that has been widely studied in a range of fields, including machine learning and data mining. Classic graph embedding methods follow the basic idea that the embedding vectors of interconnected nodes in the graph should remain relatively close, thereby preserving the structural information between the nodes in the graph. However, this is sub-optimal because: (i) traditional methods have limited model capacity, which limits learning performance; (ii) existing techniques typically rely on unsupervised learning strategies and fail to couple with the latest learning paradigms; (iii) representation learning and downstream tasks depend on each other and should be jointly enhanced. With the remarkable success of deep learning, deep graph representation learning has shown great potential and advantages over shallow (traditional) methods, and a large number of deep graph representation learning techniques, especially graph neural networks, have been proposed in the past decade. In this survey, we comprehensively review current deep graph representation learning algorithms by proposing a new taxonomy of existing state-of-the-art literature. Specifically, we systematically summarize the essential components of graph representation learning and categorize existing approaches by graph neural network architecture and by the most recent advanced learning paradigms. Moreover, this survey also presents practical and promising applications of deep graph representation learning. Last but not least, we state new perspectives and suggest challenging directions that deserve further investigation in the future.
Affiliation(s)
- Wei Ju
- School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University, Beijing, 100871, China
| | - Zheng Fang
- School of Intelligence Science and Technology, Peking University, Beijing, 100871, China
| | - Yiyang Gu
- School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University, Beijing, 100871, China
| | - Zequn Liu
- School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University, Beijing, 100871, China
| | - Qingqing Long
- Computer Network Information Center, Chinese Academy of Sciences, Beijing, 100086, China
| | - Ziyue Qiao
- Artificial Intelligence Thrust, The Hong Kong University of Science and Technology, Guangzhou, 511453, China
| | - Yifang Qin
- School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University, Beijing, 100871, China
| | - Jianhao Shen
- School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University, Beijing, 100871, China
| | - Fang Sun
- Department of Computer Science, University of California, Los Angeles, 90095, USA
| | - Zhiping Xiao
- Department of Computer Science, University of California, Los Angeles, 90095, USA
| | - Junwei Yang
- School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University, Beijing, 100871, China
| | - Jingyang Yuan
- School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University, Beijing, 100871, China
| | - Yusheng Zhao
- School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University, Beijing, 100871, China
| | - Yifan Wang
- School of Information Technology & Management, University of International Business and Economics, Beijing, 100029, China
| | - Xiao Luo
- Department of Computer Science, University of California, Los Angeles, 90095, USA.
| | - Ming Zhang
- School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University, Beijing, 100871, China.
| |
|
14
|
Wan F, Wong F, Collins JJ, de la Fuente-Nunez C. Machine learning for antimicrobial peptide identification and design. NATURE REVIEWS BIOENGINEERING 2024; 2:392-407. [PMID: 39850516 PMCID: PMC11756916 DOI: 10.1038/s44222-024-00152-x] [Citation(s) in RCA: 25] [Impact Index Per Article: 25.0] [Indexed: 01/25/2025]
Abstract
Artificial intelligence (AI) and machine learning (ML) models are being deployed in many domains of society and have recently reached the field of drug discovery. Given the increasing prevalence of antimicrobial resistance, as well as the challenges intrinsic to antibiotic development, there is an urgent need to accelerate the design of new antimicrobial therapies. Antimicrobial peptides (AMPs) are therapeutic agents for treating bacterial infections, but their translation into the clinic has been slow owing to toxicity, poor stability, limited cellular penetration and high cost, among other issues. Recent advances in AI and ML have led to breakthroughs in our abilities to predict biomolecular properties and structures and to generate new molecules. The ML-based modelling of peptides may overcome some of the disadvantages associated with traditional drug discovery and aid the rapid development and translation of AMPs. Here, we provide an introduction to this emerging field and survey ML approaches that can be used to address issues currently hindering AMP development. We also outline important limitations that can be addressed for the broader adoption of AMPs in clinical practice, as well as new opportunities in data-driven peptide design.
Affiliation(s)
- Fangping Wan
- Machine Biology Group, Departments of Psychiatry and Microbiology, Institute for Biomedical Informatics, Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Department of Bioengineering, University of Pennsylvania, Philadelphia, PA, USA
- Department of Chemical and Biomolecular Engineering, University of Pennsylvania, Philadelphia, PA, USA
- Department of Chemistry, School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA, USA
- Penn Institute for Computational Science, University of Pennsylvania, Philadelphia, PA, USA
- These authors contributed equally: Fangping Wan, Felix Wong
| | - Felix Wong
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- These authors contributed equally: Fangping Wan, Felix Wong
| | - James J. Collins
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, USA
- These authors jointly supervised this work: James J. Collins, Cesar de la Fuente-Nunez
| | - Cesar de la Fuente-Nunez
- Machine Biology Group, Departments of Psychiatry and Microbiology, Institute for Biomedical Informatics, Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Department of Bioengineering, University of Pennsylvania, Philadelphia, PA, USA
- Department of Chemical and Biomolecular Engineering, University of Pennsylvania, Philadelphia, PA, USA
- Department of Chemistry, School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA, USA
- Penn Institute for Computational Science, University of Pennsylvania, Philadelphia, PA, USA
- These authors jointly supervised this work: James J. Collins, Cesar de la Fuente-Nunez
| |
|
15
|
Ding Y, Qiang B, Chen Q, Liu Y, Zhang L, Liu Z. Exploring Chemical Reaction Space with Machine Learning Models: Representation and Feature Perspective. J Chem Inf Model 2024; 64:2955-2970. [PMID: 38489239 DOI: 10.1021/acs.jcim.4c00004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/17/2024]
Abstract
Chemical reactions serve as foundational building blocks for organic chemistry and drug design. In the era of large AI models, data-driven approaches have emerged to innovate the design of novel reactions, optimize existing ones for higher yields, and discover new pathways for synthesizing chemical structures comprehensively. To effectively address these challenges with machine learning models, it is imperative to derive robust and informative representations or engage in feature engineering using extensive data sets of reactions. This work aims to provide a comprehensive review of established reaction featurization approaches, offering insights into the selection of representations and the design of features for a wide array of tasks. The advantages and limitations of employing SMILES, molecular fingerprints, molecular graphs, and physics-based properties are meticulously elaborated. Solutions to bridge the gap between different representations will also be critically evaluated. Additionally, we introduce a new frontier in chemical reaction pretraining, holding promise as an innovative yet unexplored avenue.
Affiliation(s)
- Yuheng Ding
- Department of Pharmaceutical Science, Peking University, Beijing 100191, China
| | - Bo Qiang
- Department of Pharmaceutical Science, Peking University, Beijing 100191, China
| | - Qixuan Chen
- Department of Pharmaceutical Science, Peking University, Beijing 100191, China
| | - Yiqiao Liu
- Department of Pharmaceutical Science, Peking University, Beijing 100191, China
| | - Liangren Zhang
- Department of Pharmaceutical Science, Peking University, Beijing 100191, China
| | - Zhenming Liu
- Department of Pharmaceutical Science, Peking University, Beijing 100191, China
| |
|
16
|
Williams DC, Inala N. Physics-Informed Generative Model for Drug-like Molecule Conformers. J Chem Inf Model 2024; 64:2988-3007. [PMID: 38486425 DOI: 10.1021/acs.jcim.3c01816] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/23/2024]
Abstract
We present a diffusion-based generative model for conformer generation. Our model is focused on the reproduction of the bonded structure and is constructed from the associated terms traditionally found in classical force fields to ensure a physically relevant representation. Techniques in deep learning are used to infer atom typing and geometric parameters from a training set. Conformer sampling is achieved by taking advantage of recent advancements in diffusion-based generation. By training on large, synthetic data sets of diverse, drug-like molecules optimized with the semiempirical GFN2-xTB method, high accuracy is achieved for bonded parameters, exceeding that of conventional, knowledge-based methods. Results are also compared to experimental structures from the Protein Data Bank and the Cambridge Structural Database.
Affiliation(s)
- David C Williams
- Nobias Therapeutics, Inc., 144 S Whisman Rd, Suite C, Mountain View, California 94041, United States
| | - Neil Inala
- Nobias Therapeutics, Inc., 144 S Whisman Rd, Suite C, Mountain View, California 94041, United States
| |
|
17
|
Guo Z, Liu J, Wang Y, Chen M, Wang D, Xu D, Cheng J. Diffusion models in bioinformatics and computational biology. NATURE REVIEWS BIOENGINEERING 2024; 2:136-154. [PMID: 38576453 PMCID: PMC10994218 DOI: 10.1038/s44222-023-00114-9] [Citation(s) in RCA: 16] [Impact Index Per Article: 16.0] [Accepted: 08/25/2023] [Indexed: 04/06/2024]
Abstract
Denoising diffusion models embody a type of generative artificial intelligence that can be applied in computer vision, natural language processing and bioinformatics. In this Review, we introduce the key concepts and theoretical foundations of three diffusion modelling frameworks (denoising diffusion probabilistic models, noise-conditioned scoring networks and score stochastic differential equations). We then explore their applications in bioinformatics and computational biology, including protein design and generation, drug and small-molecule design, protein-ligand interaction modelling, cryo-electron microscopy image data analysis and single-cell data analysis. Finally, we highlight open-source diffusion model tools and consider the future applications of diffusion models in bioinformatics.
Affiliation(s)
- Zhiye Guo
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
- NextGen Precision Health, University of Missouri, Columbia, MO, USA
| | - Jian Liu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
- NextGen Precision Health, University of Missouri, Columbia, MO, USA
| | - Yanli Wang
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
- NextGen Precision Health, University of Missouri, Columbia, MO, USA
| | - Mengrui Chen
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
- NextGen Precision Health, University of Missouri, Columbia, MO, USA
| | - Duolin Wang
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
- NextGen Precision Health, University of Missouri, Columbia, MO, USA
| | - Dong Xu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
- NextGen Precision Health, University of Missouri, Columbia, MO, USA
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
- NextGen Precision Health, University of Missouri, Columbia, MO, USA
| |
|
18
|
Stylianakis I, Zervos N, Lii JH, Pantazis DA, Kolocouris A. Conformational energies of reference organic molecules: benchmarking of common efficient computational methods against coupled cluster theory. J Comput Aided Mol Des 2023; 37:607-656. [PMID: 37597063 PMCID: PMC10618395 DOI: 10.1007/s10822-023-00513-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2023] [Accepted: 06/03/2023] [Indexed: 08/21/2023]
Abstract
We selected 145 reference organic molecules that include model fragments used in computer-aided drug design. We calculated 158 conformational energies and barriers using force fields that are widely available in commercial and free software and extensively applied to the calculation of conformational energies of organic molecules, e.g. the UFF and DREIDING force fields, Allinger's force fields MM3-96, MM3-00, MM4-8, the MM2-91 clones MMX and MM+, the MMFF94 force field, MM4, ab initio Hartree-Fock (HF) theory with different basis sets, the standard density functional theory B3LYP, the second-order post-HF MP2 theory and the Domain-based Local Pair Natural Orbital Coupled Cluster DLPNO-CCSD(T) theory, with the latter used for accurate reference values. The data set of organic molecules includes hydrocarbons, haloalkanes, conjugated compounds, and oxygen-, nitrogen-, phosphorus- and sulphur-containing compounds. We reviewed in detail the conformational aspects of these model organic molecules, providing the current understanding of the steric and electronic factors that determine the stability of low-energy conformers, together with the literature, including previous experimental observations and calculated findings. While progress in computer hardware allows the calculation of thousands of conformations for later use in drug design projects, this study is an update of previous classical studies that used, as reference values, experimental ones obtained using a variety of methods and different environments. The lowest mean error against the DLPNO-CCSD(T) reference was calculated for MP2 (0.35 kcal mol⁻¹), followed by B3LYP (0.69 kcal mol⁻¹) and the HF theories (0.81-1.0 kcal mol⁻¹). As regards the force fields, the lowest errors were observed for Allinger's force fields MM3-00 (1.28 kcal mol⁻¹) and MM3-96 (1.40 kcal mol⁻¹) and Halgren's MMFF94 force field (1.30 kcal mol⁻¹), and then for the MM2-91 clones MMX (1.77 kcal mol⁻¹) and MM+ (2.01 kcal mol⁻¹) and MM4 (2.05 kcal mol⁻¹). The DREIDING (3.63 kcal mol⁻¹) and UFF (3.77 kcal mol⁻¹) force fields showed the poorest performance. The model organic molecules we used are often present as fragments in drug-like molecules. The values calculated using DLPNO-CCSD(T) make up a valuable data set for further comparisons and for improved force field parameterization.
Affiliation(s)
- Ioannis Stylianakis
- Department of Medicinal Chemistry, Faculty of Pharmacy, National and Kapodistrian University of Athens, Panepistimioupolis Zografou, 15771, Athens, Greece
| | - Nikolaos Zervos
- Department of Medicinal Chemistry, Faculty of Pharmacy, National and Kapodistrian University of Athens, Panepistimioupolis Zografou, 15771, Athens, Greece
| | - Jenn-Huei Lii
- Department of Chemistry, National Changhua University of Education, Changhua City, Taiwan
| | - Dimitrios A Pantazis
- Max-Planck-Institut für Kohlenforschung, Kaiser-Wilhelm-Platz 1, 45470, Mülheim an der Ruhr, Germany
| | - Antonios Kolocouris
- Department of Medicinal Chemistry, Faculty of Pharmacy, National and Kapodistrian University of Athens, Panepistimioupolis Zografou, 15771, Athens, Greece.
- Laboratory of Medicinal Chemistry, Section of Pharmaceutical Chemistry, Department of Pharmacy, National and Kapodistrian University of Athens, Panepistimiopolis-Zografou, 15771, Athens, Greece.
| |
|
19
|
Park YJ, Kim H, Jo J, Yoon S. Deep contrastive learning of molecular conformation for efficient property prediction. NATURE COMPUTATIONAL SCIENCE 2023; 3:1015-1022. [PMID: 38177719 DOI: 10.1038/s43588-023-00560-w] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 03/20/2023] [Accepted: 10/31/2023] [Indexed: 01/06/2024]
Abstract
Data-driven deep learning algorithms provide accurate prediction of high-level quantum-chemical molecular properties. However, their inputs must be constrained to the same quantum-chemical level of geometric relaxation as the training dataset, limiting their flexibility. Adopting alternative cost-effective conformation generative methods introduces domain-shift problems, deteriorating prediction accuracy. Here we propose a deep contrastive learning-based domain-adaptation method called Local Atomic environment Contrastive Learning (LACL). LACL learns to alleviate the disparities in distribution between the two geometric conformations by comparing different conformation-generation methods. We found that LACL forms a domain-agnostic latent space that encapsulates the semantics of an atom's local atomic environment. LACL achieves quantum-chemical accuracy while circumventing the geometric relaxation bottleneck and could enable future application scenarios such as inverse molecular engineering and large-scale screening. Our approach is also generalizable from small organic molecules to long chains of biological and pharmacological molecules.
Affiliation(s)
- Yang Jeong Park
- Department of Electrical and Computer Engineering, Seoul National University, Seoul, Republic of Korea.
- Institute of New Media and Communications, Seoul National University, Seoul, Republic of Korea.
- Department of Nuclear Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA.
| | - HyunGi Kim
- Department of Electrical and Computer Engineering, Seoul National University, Seoul, Republic of Korea
| | - Jeonghee Jo
- Department of Electrical and Computer Engineering, Seoul National University, Seoul, Republic of Korea
- Institute of New Media and Communications, Seoul National University, Seoul, Republic of Korea
| | - Sungroh Yoon
- Department of Electrical and Computer Engineering, Seoul National University, Seoul, Republic of Korea.
- Institute of New Media and Communications, Seoul National University, Seoul, Republic of Korea.
- Interdisciplinary Program in Artificial Intelligence, Seoul National University, Seoul, Republic of Korea.
| |
|
20
|
McNutt A, Bisiriyu F, Song S, Vyas A, Hutchison GR, Koes DR. Conformer Generation for Structure-Based Drug Design: How Many and How Good? J Chem Inf Model 2023; 63:6598-6607. [PMID: 37903507 PMCID: PMC10647020 DOI: 10.1021/acs.jcim.3c01245] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Revised: 10/18/2023] [Accepted: 10/19/2023] [Indexed: 11/01/2023]
Abstract
Conformer generation, the assignment of realistic 3D coordinates to a small molecule, is fundamental to structure-based drug design. Conformational ensembles are required for rigid-body matching algorithms, such as shape-based or pharmacophore approaches, and even methods that treat the ligand flexibly, such as docking, are dependent on the quality of the provided conformations due to not sampling all degrees of freedom (e.g., only sampling torsions). Here, we empirically elucidate some general principles about the size, diversity, and quality of the conformational ensembles needed to get the best performance in common structure-based drug discovery tasks. In many cases, our findings may parallel "common knowledge" well-known to practitioners of the field. Nonetheless, we feel that it is valuable to quantify these conformational effects while reproducing and expanding upon previous studies. Specifically, we investigate the performance of a state-of-the-art generative deep learning approach versus a more classical geometry-based approach, the effect of energy minimization as a postprocessing step, the effect of ensemble size (maximum number of conformers), and construction (filtering by root-mean-square deviation for diversity) and how these choices influence the ability to recapitulate bioactive conformations and perform pharmacophore screening and molecular docking.
Affiliation(s)
- Andrew T. McNutt
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15213, United States
| | - Fatimah Bisiriyu
- The Neighborhood Academy, Pittsburgh, Pennsylvania 15206, United States
| | - Sophia Song
- Upper St. Clair High School, Pittsburgh, Pennsylvania 15241, United States
| | - Ananya Vyas
- Taylor Allderdice High School, Pittsburgh, Pennsylvania 15217, United States
| | - Geoffrey R. Hutchison
- Department of Chemistry, University of Pittsburgh, Pittsburgh, Pennsylvania 15213, United States
- Department of Chemical and Petroleum Engineering, University of Pittsburgh, Pittsburgh, Pennsylvania 15213, United States
| | - David Ryan Koes
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15213, United States
| |
|
21
|
Wang Z, Zhong H, Zhang J, Pan P, Wang D, Liu H, Yao X, Hou T, Kang Y. Small-Molecule Conformer Generators: Evaluation of Traditional Methods and AI Models on High-Quality Data Sets. J Chem Inf Model 2023; 63:6525-6536. [PMID: 37883143 DOI: 10.1021/acs.jcim.3c01519] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2023]
Abstract
Small-molecule conformer generation (SMCG) is an extremely important task in both ligand- and structure-based computer-aided drug design, especially during the hit discovery phase. Recently, a multitude of artificial intelligence (AI) models tailored for SMCG have emerged. Despite developers typically furnishing performance evaluation data upon releasing their AI models, a comprehensive and equitable performance comparison between AI models and conventional methods is still lacking. In this study, we curated a new benchmarking data set comprising 3354 high-quality ligand bioactive conformations. Subsequently, we conducted a systematic assessment of the performance of four widely adopted traditional methods (i.e., ConfGenX, Conformator, OMEGA, and RDKit ETKDG) and five AI models (i.e., ConfGF, DMCG, GeoDiff, GeoMol, and torsional diffusion) in the tasks of reproducing bioactive and low-energy conformations of small molecules. In the former task, the AI models have no advantage, particularly with a maximum ensemble size of 1. Even the best-performing AI model GeoMol is still worse than any of the tested traditional methods. Conversely, in the latter task, the torsional diffusion model shows obvious advantages, surpassing the best-performing traditional method ConfGenX by 26.09 and 12.97% on the COV-R and COV-P metrics, respectively. Furthermore, the influence of force field-based fine-tuning on the quality of the generated conformers was also discussed. Finally, a user-friendly Web server called fastSMCG was developed to enable researchers to rapidly and flexibly generate small-molecule conformers using both traditional and AI methods. We anticipate that our work will offer valuable practical assistance to the scientific community in this field.
Affiliation(s)
- Zhe Wang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Haiyang Zhong
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Jintu Zhang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Peichen Pan
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Dong Wang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Huanxiang Liu
- Faculty of Applied Science, Macao Polytechnic University, Macao SAR 999078, China
| | - Xiaojun Yao
- State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao SAR 999078, China
| | - Tingjun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Yu Kang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| |
|
22
|
Ilnicka A, Schneider G. Designing molecules with autoencoder networks. NATURE COMPUTATIONAL SCIENCE 2023; 3:922-933. [PMID: 38177601 DOI: 10.1038/s43588-023-00548-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2023] [Accepted: 10/03/2023] [Indexed: 01/06/2024]
Abstract
Autoencoders are versatile tools in molecular informatics. These unsupervised neural networks serve diverse tasks such as data-driven molecular representation and constructive molecular design. This Review explores their algorithmic foundations and applications in drug discovery, highlighting the most active areas of development and the contributions autoencoder networks have made in advancing this field. We also explore the challenges and prospects concerning the utilization of autoencoders and the various adaptations of this neural network architecture in molecular design.
Affiliation(s)
- Agnieszka Ilnicka
- Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich, Switzerland
| | - Gisbert Schneider
- Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich, Switzerland.
| |
|
23
|
Hagg A, Kirschner KN. Open-Source Machine Learning in Computational Chemistry. J Chem Inf Model 2023; 63:4505-4532. [PMID: 37466636 PMCID: PMC10430767 DOI: 10.1021/acs.jcim.3c00643] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 05/04/2023] [Indexed: 07/20/2023]
Abstract
The field of computational chemistry has seen a significant increase in the integration of machine learning concepts and algorithms. In this Perspective, we surveyed 179 open-source software projects, with corresponding peer-reviewed papers published within the last 5 years, to better understand the topics within the field being investigated by machine learning approaches. For each project, we provide a short description, the link to the code, the accompanying license type, and whether the training data and resulting models are made publicly available. Based on those deposited in GitHub repositories, the most popular employed Python libraries are identified. We hope that this survey will serve as a resource to learn about machine learning or specific architectures thereof by identifying accessible codes with accompanying papers on a topic basis. To this end, we also include computational chemistry open-source software for generating training data and fundamental Python libraries for machine learning. Based on our observations and considering the three pillars of collaborative machine learning work, open data, open source (code), and open models, we provide some suggestions to the community.
Affiliation(s)
- Alexander Hagg
- Institute of Technology, Resource and Energy-Efficient Engineering (TREE), University of Applied Sciences Bonn-Rhein-Sieg, 53757 Sankt Augustin, Germany
- Department of Electrical Engineering, Mechanical Engineering and Technical Journalism, University of Applied Sciences Bonn-Rhein-Sieg, 53757 Sankt Augustin, Germany
| | - Karl N. Kirschner
- Institute of Technology, Resource and Energy-Efficient Engineering (TREE), University of Applied Sciences Bonn-Rhein-Sieg, 53757 Sankt Augustin, Germany
- Department of Computer Science, University of Applied Sciences Bonn-Rhein-Sieg, 53757 Sankt Augustin, Germany
| |
|
24
|
Tang M, Li B, Chen H. Application of message passing neural networks for molecular property prediction. Curr Opin Struct Biol 2023; 81:102616. [PMID: 37267824 DOI: 10.1016/j.sbi.2023.102616] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 09/23/2022] [Revised: 04/28/2023] [Accepted: 05/04/2023] [Indexed: 06/04/2023]
Abstract
Accurate molecular property prediction, one of the classical cheminformatics topics, plays a prominent role in the field of computer-aided drug design. For instance, property prediction models can be used to quickly screen large molecular libraries to find lead compounds. Message-passing neural networks (MPNNs), a sub-class of graph neural networks (GNNs), have recently been demonstrated to outperform other deep learning methods on a variety of tasks, including the prediction of molecular characteristics. In this survey, we provide a brief review of MPNN models and their applications to molecular property prediction.
Affiliation(s)
- Miru Tang
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong Province, China; Bioland Laboratory (Guangzhou Regenerative Medicine and Health-Guangdong Laboratory), Guangzhou, 510530, China; State Key Laboratory of Respiratory Disease, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences, Guangzhou, 510530, China
| | - Baiqing Li
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong Province, China
| | - Hongming Chen
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong Province, China.
| |
|
25
|
Lungu CN, Putz MV. SARS-CoV-2 Spike Protein Interaction Space. Int J Mol Sci 2023; 24:12058. [PMID: 37569436 PMCID: PMC10418891 DOI: 10.3390/ijms241512058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Revised: 07/10/2023] [Accepted: 07/12/2023] [Indexed: 08/13/2023]
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a positive-sense single-stranded RNA virus. The virus has four major surface proteins: spike (S), envelope (E), membrane (M), and nucleocapsid (N). The constitutive proteins present a high grade of symmetry. Identifying a binding site is difficult. The virion is approximately 50-200 nm in diameter. Angiotensin-converting enzyme 2 (ACE2) acts as the cell receptor for the virus. SARS-CoV-2 has an increased affinity to human ACE2 compared with the original SARS strain. Topological space, and its symmetry, is a critical component in molecular interactions. By exploring this space, a suitable ligand space can be characterized accordingly. A spike protein (S) computational model in a complex with ACE2 was generated using in silico methods. Topological spaces were probed using high-throughput computational screening techniques to identify and characterize the topological space of both SARS and SARS-CoV-2 spike protein and its ligand space. In order to identify the symmetry clusters, computational analysis techniques, together with statistical analysis, were utilized. The computations are based on crystallographic Protein Data Bank (PDB)-based models of constitutive proteins. Cartesian coordinates of component atoms and some cluster maps were generated and analyzed. Dihedral angles were used in order to compute a topological receptor space. This computational study uses a multimodal representation of spike protein interactions with some fragment proteins. The chemical space of the receptors (a dimensional volume) suggests the relevance of the receptor as a drug target. The spike protein S of SARS and SARS-CoV-2 is analyzed and compared. The results suggest a mirror symmetry of SARS and SARS-CoV-2 spike proteins. The results show that SARS-CoV-2 space is variable and has a distinct topology.
In conclusion, surface proteins grant virion variability and symmetry in interactions with a potential complementary target (protein, antibody, ligand). The mirror symmetry of dihedral angle clusters determines a high specificity of the receptor space.
Affiliation(s)
- Claudiu N. Lungu
- Department of Morphological and Functional Science, University of Medicine and Pharmacy Dunarea de Jos, Str. Alexandru Ioan Cuza No. 36, 800017 Galati, Romania;
| | - Mihai V. Putz
- Laboratory of Structural and Computational Physical-Chemistry for Nanosciences and QSAR, Biology-Chemistry Department, Faculty of Chemistry, Biology, Geography, West University of Timisoara, Str. Pestalozzi No. 16, 300115 Timisoara, Romania
| |
|
26
|
Tran T, Ekenna C. Molecular Descriptors Property Prediction Using Transformer-Based Approach. Int J Mol Sci 2023; 24:11948. [PMID: 37569322 PMCID: PMC10419034 DOI: 10.3390/ijms241511948] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2023] [Revised: 07/21/2023] [Accepted: 07/24/2023] [Indexed: 08/13/2023]
Abstract
In this study, we introduce semi-supervised machine learning models designed to predict molecular properties. Our model employs a two-stage approach, involving pre-training and fine-tuning. Particularly, our model leverages a substantial amount of labeled and unlabeled data consisting of SMILES strings, a text representation system for molecules. During the pre-training stage, our model capitalizes on the Masked Language Model, which is widely used in natural language processing, for learning molecular chemical space representations. During the fine-tuning stage, our model is trained on a smaller labeled dataset to tackle specific downstream tasks, such as classification or regression. Preliminary results indicate that our model demonstrates comparable performance to state-of-the-art models on the chosen downstream tasks from MoleculeNet. Additionally, to reduce the computational overhead, we propose a new approach taking advantage of 3D compound structures for calculating the attention score used in the end-to-end transformer model to predict anti-malaria drug candidates. The results show that using the proposed attention score, our end-to-end model is able to have comparable performance with pre-trained models.
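The masked-language-model pre-training step mentioned in this abstract can be illustrated with a toy sketch. The regex tokenizer, the 15% mask rate, and the `[MASK]` placeholder are assumptions borrowed from common NLP practice; none of them is taken from the paper, and the function names are hypothetical.

```python
import random
import re

# Hypothetical, simplified SMILES tokenizer: bracket atoms first, then the
# two-letter halogens, then any single character. Real tokenizers cover more.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_RE.findall(smiles)

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens; return (masked sequence, hidden labels)."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = ["[MASK]" if i in positions else t for i, t in enumerate(tokens)]
    labels = {i: tokens[i] for i in positions}  # what the model must recover
    return masked, labels

tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
masked, labels = mask_tokens(tokens)
```

During pre-training, a model would be asked to predict the entries of `labels` from `masked`; no property labels are needed for this step, which is why unlabeled SMILES can be used.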
|
27
|
Zhang Z, Wang G, Li R, Ni L, Zhang R, Cheng K, Ren Q, Kong X, Ni S, Tong X, Luo L, Wang D, Lu X, Zheng M, Li X. Tora3D: an autoregressive torsion angle prediction model for molecular 3D conformation generation. J Cheminform 2023; 15:57. [PMID: 37287071 DOI: 10.1186/s13321-023-00726-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2023] [Accepted: 05/20/2023] [Indexed: 06/09/2023]
Abstract
Three-dimensional (3D) conformations of a small molecule profoundly affect its binding to the target of interest, the resulting biological effects, and its disposition in living organisms, but it is challenging to accurately characterize the conformational ensemble experimentally. Here, we proposed an autoregressive torsion angle prediction model Tora3D for molecular 3D conformer generation. Rather than directly predicting the conformations in an end-to-end way, Tora3D predicts a set of torsion angles of rotatable bonds by an interpretable autoregressive method and reconstructs the 3D conformations from them, which keeps structural validity during reconstruction. Another advancement of our method over other conformational generation methods is the ability to use energy to guide the conformation generation. In addition, we propose a new message-passing mechanism that applies the Transformer to the graph to solve the difficulty of remote message passing. Tora3D shows superior performance to prior computational models in the trade-off between accuracy and efficiency, and ensures conformational validity, accuracy, and diversity in an interpretable way. Overall, Tora3D can be used for the quick generation of diverse molecular conformations and 3D-based molecular representation, contributing to a wide range of downstream drug design tasks.
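The quantity this abstract revolves around, the torsion (dihedral) angle of a rotatable bond, can be computed from four atom positions. The following is a minimal sketch using the standard atan2 formulation; it is not Tora3D code, the function name is hypothetical, and the sign convention is an assumption that may differ between toolkits.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def torsion_angle(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)          # the rotatable bond
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)        # normal of the plane p0-p1-p2
    n2 = _cross(b1, b2)        # normal of the plane p1-p2-p3
    b1_len = math.sqrt(_dot(b1, b1))
    m1 = _cross(n1, tuple(x / b1_len for x in b1))
    return math.degrees(math.atan2(_dot(m1, n2), _dot(n1, n2)))

cis = torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
```

Here `cis` describes an eclipsed arrangement and `trans` an anti arrangement of the two outer atoms. Reconstructing coordinates from predicted torsions is the inverse of this calculation, with bond lengths and bond angles held fixed.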
Affiliation(s)
- Zimei Zhang
- Division of Life Science and Medicine, University of Science and Technology of China, Hefei, 230026, Anhui, China
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
| | - Gang Wang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China
| | - Rui Li
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- School of Pharmacy, China Pharmaceutical University, 639 Longmian Road, Nanjing, 211198, China
| | - Lin Ni
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Nanjing University of Chinese Medicine, 138 Xianlin Road, Nanjing, 210023, China
| | - RunZe Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China
| | - Kaiyang Cheng
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Nanjing University of Chinese Medicine, 138 Xianlin Road, Nanjing, 210023, China
| | - Qun Ren
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Nanjing University of Chinese Medicine, 138 Xianlin Road, Nanjing, 210023, China
| | - Xiangtai Kong
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China
| | - Shengkun Ni
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China
| | - Xiaochu Tong
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China
| | - Li Luo
- Precision Pharmacy & Drug Development Center, Department of Pharmacy, Tangdu Hospital, Fourth Military Medical University, Xi'an, 710038, China
| | | | - Xiaojie Lu
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China
| | - Mingyue Zheng
- Division of Life Science and Medicine, University of Science and Technology of China, Hefei, 230026, Anhui, China.
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China.
- University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China.
- Nanjing University of Chinese Medicine, 138 Xianlin Road, Nanjing, 210023, China.
| | - Xutong Li
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China.
- University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China.
| |
|
28
|
Kubečka J, Knattrup Y, Engsvang M, Jensen AB, Ayoubi D, Wu H, Christiansen O, Elm J. Current and future machine learning approaches for modeling atmospheric cluster formation. Nature Computational Science 2023; 3:495-503. [PMID: 38177415 DOI: 10.1038/s43588-023-00435-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/21/2022] [Accepted: 03/16/2023] [Indexed: 01/06/2024]
Abstract
The formation of strongly bound atmospheric molecular clusters is the first step towards forming new aerosol particles. Recent advances in the application of machine learning models open an enormous opportunity for complementing expensive quantum chemical calculations with efficient machine learning predictions. In this Perspective, we present how data-driven approaches can be applied to accelerate cluster configurational sampling, thereby greatly increasing the number of chemically relevant systems that can be covered.
Affiliation(s)
- Jakub Kubečka
- Department of Chemistry, Aarhus University, Aarhus, Denmark
| | - Yosef Knattrup
- Department of Chemistry, Aarhus University, Aarhus, Denmark
| | | | | | - Daniel Ayoubi
- Department of Chemistry, Aarhus University, Aarhus, Denmark
| | - Haide Wu
- Department of Chemistry, Aarhus University, Aarhus, Denmark
| | | | - Jonas Elm
- Department of Chemistry, Aarhus University, Aarhus, Denmark.
- iCLIMATE Aarhus University Interdisciplinary Centre for Climate Change, Aarhus, Denmark.
| |
|
29
|
Zaripova K, Cosmo L, Kazi A, Ahmadi SA, Bronstein MM, Navab N. Graph-in-Graph (GiG): Learning interpretable latent graphs in non-Euclidean domain for biological and healthcare applications. Med Image Anal 2023; 88:102839. [PMID: 37263109 DOI: 10.1016/j.media.2023.102839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2022] [Revised: 04/26/2023] [Accepted: 05/06/2023] [Indexed: 06/03/2023]
Abstract
Graphs are a powerful tool for representing and analyzing unstructured, non-Euclidean data ubiquitous in the healthcare domain. Two prominent examples are molecule property prediction and brain connectome analysis. Importantly, recent works have shown that considering relationships between input data samples has a positive regularizing effect on the downstream task in healthcare applications. These relationships are naturally modeled by a (possibly unknown) graph structure between input samples. In this work, we propose Graph-in-Graph (GiG), a neural network architecture for protein classification and brain imaging applications that exploits the graph representation of the input data samples and their latent relation. We assume an initially unknown latent-graph structure between graph-valued input data and propose to learn a parametric model for message passing within and across input graph samples, end-to-end along with the latent structure connecting the input graphs. Further, we introduce a Node Degree Distribution Loss (NDDL) that regularizes the predicted latent relationships structure. This regularization can significantly improve the downstream task. Moreover, the obtained latent graph can represent patient population models or networks of molecule clusters, providing a level of interpretability and knowledge discovery in the input domain, which is of particular value in healthcare.
Affiliation(s)
- Kamilia Zaripova
- Department of Computer Science, Technical University of Munich, Munich, Germany.
| | - Luca Cosmo
- Department of Environmental Sciences, Informatics and Statistics, Ca' Foscari University of Venice, Venice, Italy; Informatics Department, USI University of Lugano, Lugano, Switzerland
| | - Anees Kazi
- Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Harvard Medical School, Boston, USA
| | | | | | - Nassir Navab
- Department of Computer Science, Technical University of Munich, Munich, Germany; Whiting School of Engineering, Johns Hopkins University, Baltimore, USA
| |
|
30
|
Jun Yim S, Gyak KW, Kawale SA, Mottafegh A, Park CH, Ko Y, Kim I, Soo Jee S, Kim DP. One-flow Multi-step Synthesis of a Monomer as a Precursor of Thermal-Conductive Semiconductor Packaging Polymer via Multi-phasic Separation. J Ind Eng Chem 2023. [DOI: 10.1016/j.jiec.2023.03.018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/05/2023]
|
31
|
Ma S, Liu JW. Self-supervised contrastive learning for heterogeneous graph based on multi-pretext tasks. Neural Comput Appl 2023. [DOI: 10.1007/s00521-023-08234-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/19/2023]
|
32
|
Zhang H, Li S, Zhang J, Wang Z, Wang J, Jiang D, Bian Z, Zhang Y, Deng Y, Song J, Kang Y, Hou T. SDEGen: learning to evolve molecular conformations from thermodynamic noise for conformation generation. Chem Sci 2023; 14:1557-1568. [PMID: 36794194 PMCID: PMC9906649 DOI: 10.1039/d2sc04429c] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 08/09/2022] [Accepted: 01/11/2023] [Indexed: 01/13/2023]
Abstract
Generation of representative conformations for small molecules is a fundamental task in cheminformatics and computer-aided drug discovery, but capturing the complex distribution of conformations that contains multiple low energy minima is still a great challenge. Deep generative modeling, aiming to learn complex data distributions, is a promising approach to tackle the conformation generation problem. Here, inspired by stochastic dynamics and recent advances in generative modeling, we developed SDEGen, a novel conformation generation model based on stochastic differential equations. Compared with existing conformation generation methods, it enjoys the following advantages: (1) high model capacity to capture multimodal conformation distribution, thereby searching for multiple low-energy conformations of a molecule quickly, (2) higher conformation generation efficiency, almost ten times faster than the state-of-the-art score-based model, ConfGF, and (3) a clear physical interpretation to learn how a molecule evolves in a stochastic dynamics system starting from noise and eventually relaxing to the conformation that falls in low energy minima. Extensive experiments demonstrate that SDEGen has surpassed existing methods in different tasks for conformation generation, interatomic distance distribution prediction, and thermodynamic property estimation, showing great potential for real-world applications.
Affiliation(s)
- Haotian Zhang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
| | - Shengming Li
- College of Computer Science and Technology, Zhejiang University Hangzhou 310058 Zhejiang China
| | - Jintu Zhang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
- State Key Lab of CAD&CG, Zhejiang University Hangzhou 310058 Zhejiang China
| | - Zhe Wang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
| | - Jike Wang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
- School of Computer Science, Wuhan University Wuhan 430072 Hubei China
| | - Dejun Jiang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
| | - Zhiwen Bian
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
| | - Yixue Zhang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
| | - Yafeng Deng
- Hangzhou Carbonsilicon AI Technology Co., Ltd Hangzhou 310018 Zhejiang China
| | - Jianfei Song
- Hangzhou Carbonsilicon AI Technology Co., Ltd Hangzhou 310018 Zhejiang China
| | - Yu Kang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
| | - Tingjun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
- State Key Lab of CAD&CG, Zhejiang University Hangzhou 310058 Zhejiang China
| |
|
33
|
Pang S, Zhang K, Wang G, Lin JCW, Wang F, Meng X, Wang S, Zhang Y. AF-GCN: Completing Various Graph Tasks Efficiently via Adaptive Quadratic Frequency Response Function in Graph Spectral Domain. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.12.054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2022]
|
34
|
Weinreich J, Lemm D, von Rudorff GF, von Lilienfeld OA. Ab initio machine learning of phase space averages. J Chem Phys 2022; 157:024303. [PMID: 35840379 DOI: 10.1063/5.0095674] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2022]
Abstract
Equilibrium structures determine material properties and biochemical functions. We here propose to machine learn phase space averages, conventionally obtained by ab initio or force-field-based molecular dynamics (MD) or Monte Carlo (MC) simulations. In analogy to ab initio MD, our ab initio machine learning (AIML) model does not require bond topologies and, therefore, enables a general machine learning pathway to obtain ensemble properties throughout the chemical compound space. We demonstrate AIML for predicting Boltzmann averaged structures after training on hundreds of MD trajectories. The AIML output is subsequently used to train machine learning models of free energies of solvation using experimental data and to reach competitive prediction errors (mean absolute error ∼ 0.8 kcal/mol) for out-of-sample molecules, within milliseconds. As such, AIML effectively bypasses the need for MD or MC-based phase space sampling, enabling exploration campaigns of Boltzmann averages throughout the chemical compound space at a much accelerated pace. We contextualize our findings by comparison to state-of-the-art methods resulting in a Pareto plot for the free energy of solvation predictions in terms of accuracy and time.
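For readers unfamiliar with the term, a Boltzmann average weights each sampled configuration by exp(-E/kT). The sketch below shows that conventional calculation for a scalar property; it is not the AIML model, which learns to predict such averages directly. The function names and the kcal/mol energy unit are assumptions made for illustration.

```python
import math

KB_KCAL_PER_MOL_K = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def boltzmann_weights(energies, temperature=298.15):
    """Normalized weights exp(-E/kT) / Z for energies given in kcal/mol."""
    beta = 1.0 / (KB_KCAL_PER_MOL_K * temperature)
    e_min = min(energies)  # shift by the minimum for numerical stability
    factors = [math.exp(-beta * (e - e_min)) for e in energies]
    z = sum(factors)
    return [f / z for f in factors]

def boltzmann_average(values, energies, temperature=298.15):
    """Boltzmann-weighted mean of a per-configuration property."""
    weights = boltzmann_weights(energies, temperature)
    return sum(w * v for w, v in zip(weights, values))

# Three hypothetical conformers: relative energies (kcal/mol) and a property.
energies = [0.0, 0.5, 2.0]
values = [1.0, 2.0, 3.0]
average = boltzmann_average(values, energies)
```

Because the lowest-energy conformer carries the largest weight, `average` lies closer to its property value than the plain arithmetic mean does.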
Affiliation(s)
- Jan Weinreich
- Faculty of Physics, University of Vienna, Kolingasse 14-16, AT-1090 Wien, Austria
| | - Dominik Lemm
- Faculty of Physics, University of Vienna, Kolingasse 14-16, AT-1090 Wien, Austria
| | | | | |
|
35
|
Xu Z, Escalera S, Pavão A, Richard M, Tu WW, Yao Q, Zhao H, Guyon I. Codabench: Flexible, easy-to-use, and reproducible meta-benchmark platform. Patterns 2022; 3:100543. [PMID: 35845844 PMCID: PMC9278500 DOI: 10.1016/j.patter.2022.100543] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/25/2022] [Revised: 03/21/2022] [Accepted: 06/03/2022] [Indexed: 11/29/2022]
Abstract
Obtaining a standardized benchmark of computational methods is a major issue in data-science communities. Dedicated frameworks enabling fair benchmarking in a unified environment are yet to be developed. Here, we introduce Codabench, a meta-benchmark platform that is open sourced and community driven for benchmarking algorithms or software agents versus datasets or tasks. A public instance of Codabench is open to everyone free of charge and allows benchmark organizers to fairly compare submissions under the same setting (software, hardware, data, algorithms), with custom protocols and data formats. Codabench has unique features facilitating easy organization of flexible and reproducible benchmarks, such as the possibility of reusing templates of benchmarks and supplying compute resources on demand. Codabench has been used internally and externally on various applications, receiving more than 130 users and 2,500 submissions. As illustrative use cases, we introduce four diverse benchmarks covering graph machine learning, cancer heterogeneity, clinical diagnosis, and reinforcement learning.
- Codabench facilitates flexible, easy, and reproducible benchmarking
- Organizers can customize benchmark design and submission format
- Organizers may host their own platform instance or use the public instance
- Four use cases in diverse domains are introduced to demonstrate the key features
In almost all communities working on data science, researchers face increasingly severe issues of reproducibility and fair comparison. Researchers work on their own version of hardware/software environment, code, and data, and consequently, the published results are hardly comparable. We introduce Codabench, a meta-benchmark platform, that is capable of flexible and easy benchmarking and supports reproducibility. Codabench is an important step toward benchmarking and reproducible research. It has been used in various communities including graph machine learning, cancer heterogeneity, clinical diagnosis, and reinforcement learning. Codabench is ready to help trendy research, e.g., artificial intelligence (AI) for science and data-centric AI.
Affiliation(s)
- Zhen Xu
- 4Paradigm, Beijing 100085, China
- Corresponding author
| | - Sergio Escalera
- Computer Vision Center, Universitat de Barcelona, 08007 Barcelona, Spain
| | - Adrien Pavão
- LISN/CNRS/INRIA, University Paris-Saclay, 91190 Gif-sur-Yvette, France
| | - Magali Richard
- University Grenoble Alpes, CNRS, UMR 5525, VetAgro Sup, Grenoble INP, TIMC, 38000 Grenoble, France
| | | | | | | | - Isabelle Guyon
- LISN/CNRS/INRIA, University Paris-Saclay, 91190 Gif-sur-Yvette, France
- ChaLearn, Berkeley, CA, USA
- Corresponding author
| |
|
36
|
Spiekermann KA, Pattanaik L, Green WH. Fast Predictions of Reaction Barrier Heights: Toward Coupled-Cluster Accuracy. J Phys Chem A 2022; 126:3976-3986. [PMID: 35727075 DOI: 10.1021/acs.jpca.2c02614] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Indexed: 12/16/2022]
Abstract
Quantitative estimates of reaction barriers are essential for developing kinetic mechanisms and predicting reaction outcomes. However, the lack of experimental data and the steep scaling of accurate quantum calculations often hinder the ability to obtain reliable kinetic values. Here, we train a directed message passing neural network on nearly 24,000 diverse gas-phase reactions calculated at CCSD(T)-F12a/cc-pVDZ-F12//ωB97X-D3/def2-TZVP. Our model uses 75% fewer parameters than previous studies, an improved reaction representation, and proper data splits to accurately estimate performance on unseen reactions. Using information from only the reactant and product, our model quickly predicts barrier heights with a testing MAE of 2.6 kcal mol⁻¹ relative to the coupled-cluster data, making it more accurate than a good density functional theory calculation. Furthermore, our results show that future modeling efforts to estimate reaction properties would significantly benefit from fine-tuning calibration using a transfer learning technique. We anticipate this model will accelerate and improve kinetic predictions for small molecule chemistry.
Affiliation(s)
- Kevin A Spiekermann
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Lagnajit Pattanaik
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - William H Green
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| |
|
37
|
Lin X, Jiang Y, Yang Y. Molecular distance matrix prediction based on graph convolutional networks. J Mol Struct 2022. [DOI: 10.1016/j.molstruc.2022.132540] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
|
38
|
GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Sci Data 2022; 9:185. [PMID: 35449137 PMCID: PMC9023519 DOI: 10.1038/s41597-022-01288-4] [Citation(s) in RCA: 69] [Impact Index Per Article: 23.0] [Received: 02/12/2021] [Accepted: 03/04/2022] [Indexed: 12/23/2022]
Abstract
Machine learning (ML) outperforms traditional approaches in many molecular design tasks. ML models usually predict molecular properties from a 2D chemical graph or a single 3D structure, but neither of these representations accounts for the ensemble of 3D conformers that are accessible to a molecule. Property prediction could be improved by using conformer ensembles as input, but there is no large-scale dataset that contains graphs annotated with accurate conformers and experimental data. Here we use advanced sampling and semi-empirical density functional theory (DFT) to generate 37 million molecular conformations for over 450,000 molecules. The Geometric Ensemble Of Molecules (GEOM) dataset contains conformers for 133,000 species from QM9, and 317,000 species with experimental data related to biophysics, physiology, and physical chemistry. Ensembles of 1,511 species with BACE-1 inhibition data are also labeled with high-quality DFT free energies in an implicit water solvent, and 534 ensembles are further optimized with DFT. GEOM will assist in the development of models that predict properties from conformer ensembles, and generative models that sample 3D conformations.
Measurement(s): Conformer geometries and properties
Technology Type(s): Computational Chemistry
|
39
|
Ragoza M, Masuda T, Koes DR. Generating 3D molecules conditional on receptor binding sites with deep generative models. Chem Sci 2022; 13:2701-2713. [PMID: 35356675 PMCID: PMC8890264 DOI: 10.1039/d1sc05976a] [Citation(s) in RCA: 57] [Impact Index Per Article: 19.0] [Received: 10/28/2021] [Accepted: 02/06/2022] [Indexed: 11/22/2022]
Abstract
The goal of structure-based drug discovery is to find small molecules that bind to a given target protein. Deep learning has been used to generate drug-like molecules with certain cheminformatic properties, but has not yet been applied to generating 3D molecules predicted to bind to proteins by sampling the conditional distribution of protein-ligand binding interactions. In this work, we describe for the first time a deep learning system for generating 3D molecular structures conditioned on a receptor binding site. We approach the problem using a conditional variational autoencoder trained on an atomic density grid representation of cross-docked protein-ligand structures. We apply atom fitting and bond inference procedures to construct valid molecular conformations from generated atomic densities. We evaluate the properties of the generated molecules and demonstrate that they change significantly when conditioned on mutated receptors. We also explore the latent space learned by our generative model using sampling and interpolation techniques. This work opens the door for end-to-end prediction of stable bioactive molecules from protein structures with deep learning.
Affiliation(s)
- Matthew Ragoza
- Intelligent Systems Program, University of Pittsburgh Pittsburgh PA 15213 USA
| | - Tomohide Masuda
- Department of Computational and Systems Biology, University of Pittsburgh Pittsburgh PA 15213 USA
| | - David Ryan Koes
- Department of Computational and Systems Biology, University of Pittsburgh Pittsburgh PA 15213 USA
| |
|
40
|
Ager Meldgaard S, Köhler J, Lund Mortensen H, Christiansen MPV, Noé F, Hammer B. Generating stable molecules using imitation and reinforcement learning. Machine Learning: Science and Technology 2022. [DOI: 10.1088/2632-2153/ac3eb4] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/11/2022] Open
Abstract
Chemical space is routinely explored by machine learning methods to discover interesting molecules, before time-consuming experimental synthesizing is attempted. However, these methods often rely on a graph representation, ignoring 3D information necessary for determining the stability of the molecules. We propose a reinforcement learning (RL) approach for generating molecules in Cartesian coordinates allowing for quantum chemical prediction of the stability. To improve sample-efficiency we learn basic chemical rules from imitation learning (IL) on the GDB-11 database to create an initial model applicable for all stoichiometries. We then deploy multiple copies of the model conditioned on a specific stoichiometry in a RL setting. The models correctly identify low energy molecules in the database and produce novel isomers not found in the training set. Finally, we apply the model to larger molecules to show how RL further refines the IL model in domains far from the training data.
|
41
|
Chen BP, Chen Y, Zeng GQ, She Q. Fractional-order convolutional neural networks with population extremal optimization. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.01.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
|
42
|
Gebauer NWA, Gastegger M, Hessmann SSP, Müller KR, Schütt KT. Inverse design of 3d molecular structures with conditional generative neural networks. Nat Commun 2022; 13:973. [PMID: 35190542 PMCID: PMC8861047 DOI: 10.1038/s41467-022-28526-y] [Citation(s) in RCA: 53] [Impact Index Per Article: 17.7] [Received: 09/10/2021] [Accepted: 01/28/2022] [Indexed: 11/09/2022] Open
Abstract
The rational design of molecules with desired properties is a long-standing challenge in chemistry. Generative neural networks have emerged as a powerful approach to sample novel molecules from a learned distribution. Here, we propose a conditional generative neural network for 3d molecular structures with specified chemical and structural properties. This approach is agnostic to chemical bonding and enables targeted sampling of novel molecules from conditional distributions, even in domains where reference calculations are sparse. We demonstrate the utility of our method for inverse design by generating molecules with specified motifs or composition, discovering particularly stable molecules, and jointly targeting multiple electronic properties beyond the training regime.
Affiliation(s)
- Niklas W A Gebauer
- Machine Learning Group, Technische Universität Berlin, 10587, Berlin, Germany.
- Berlin Institute for the Foundations of Learning and Data, 10587, Berlin, Germany.
- BASLEARN-TU Berlin/BASF Joint Lab for Machine Learning, Technische Universität Berlin, 10587, Berlin, Germany.
| | - Michael Gastegger
- Machine Learning Group, Technische Universität Berlin, 10587, Berlin, Germany
- BASLEARN-TU Berlin/BASF Joint Lab for Machine Learning, Technische Universität Berlin, 10587, Berlin, Germany
| | - Stefaan S P Hessmann
- Machine Learning Group, Technische Universität Berlin, 10587, Berlin, Germany
- Berlin Institute for the Foundations of Learning and Data, 10587, Berlin, Germany
| | - Klaus-Robert Müller
- Machine Learning Group, Technische Universität Berlin, 10587, Berlin, Germany
- Berlin Institute for the Foundations of Learning and Data, 10587, Berlin, Germany
- Department of Artificial Intelligence, Korea University, Anam-dong, Seongbuk-gu, Seoul, 02841, Korea
- Max-Planck-Institut für Informatik, 66123, Saarbrücken, Germany
| | - Kristof T Schütt
- Machine Learning Group, Technische Universität Berlin, 10587, Berlin, Germany.
- Berlin Institute for the Foundations of Learning and Data, 10587, Berlin, Germany.
| |
|
43
|
Steiner M, Reiher M. Autonomous Reaction Network Exploration in Homogeneous and Heterogeneous Catalysis. Top Catal 2022; 65:6-39. [PMID: 35185305 PMCID: PMC8816766 DOI: 10.1007/s11244-021-01543-9] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Accepted: 11/17/2021] [Indexed: 12/11/2022]
Abstract
Autonomous computations that rely on automated reaction network elucidation algorithms may pave the way to make computational catalysis on a par with experimental research in the field. Several advantages of this approach are key to catalysis: (i) automation allows one to consider orders of magnitude more structures in a systematic and open-ended fashion than what would be accessible by manual inspection. Eventually, full resolution in terms of structural varieties and conformations as well as with respect to the type and number of potentially important elementary reaction steps (including decomposition reactions that determine turnover numbers) may be achieved. (ii) Fast electronic structure methods with uncertainty quantification warrant high efficiency and reliability in order to not only deliver results quickly, but also to allow for predictive work. (iii) A high degree of autonomy reduces the amount of manual human work, processing errors, and human bias. Although being inherently unbiased, it is still steerable with respect to specific regions of an emerging network and with respect to the addition of new reactant species. This allows for a high fidelity of the formalization of some catalytic process and for surprising in silico discoveries. In this work, we first review the state of the art in computational catalysis to embed autonomous explorations into the general field from which it draws its ingredients. We then elaborate on the specific conceptual issues that arise in the context of autonomous computational procedures, some of which we discuss at an example catalytic system. Supplementary information: The online version contains supplementary material available at 10.1007/s11244-021-01543-9.
Affiliation(s)
- Miguel Steiner
- Laboratory of Physical Chemistry, ETH Zurich, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
| | - Markus Reiher
- Laboratory of Physical Chemistry, ETH Zurich, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
| |
|
44
|
Abstract
Computational methods play an increasingly important role in drug discovery. Structure-based drug design (SBDD), in particular, includes techniques that take into account the structure of the macromolecular target to predict compounds that are likely to establish optimal interactions with the binding site. The current interest in machine learning algorithms based on deep neural networks encouraged the application of deep learning to SBDD related problems. This chapter covers selected works in this active area of research.
|
45
|
Tong X, Liu X, Tan X, Li X, Jiang J, Xiong Z, Xu T, Jiang H, Qiao N, Zheng M. Generative Models for De Novo Drug Design. J Med Chem 2021; 64:14011-14027. [PMID: 34533311 DOI: 10.1021/acs.jmedchem.1c00927] [Citation(s) in RCA: 65] [Impact Index Per Article: 16.3] [Indexed: 01/10/2023]
Abstract
Artificial intelligence (AI) is booming. Among various AI approaches, generative models have received much attention in recent years. Inspired by these successes, researchers are now applying generative model techniques to de novo drug design, which has been considered as the "holy grail" of drug discovery. In this Perspective, we first focus on describing models such as recurrent neural network, autoencoder, generative adversarial network, transformer, and hybrid models with reinforcement learning. Next, we summarize the applications of generative models to drug design, including generating various compounds to expand the compound library and designing compounds with specific properties, and we also list a few publicly available molecular design tools based on generative models which can be used directly to generate molecules. In addition, we also introduce current benchmarks and metrics frequently used for generative models. Finally, we discuss the challenges and prospects of using generative models to aid drug design.
Affiliation(s)
- Xiaochu Tong
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China.,University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
| | - Xiaohong Liu
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China.,University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
| | - Xiaoqin Tan
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China.,University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
| | - Xutong Li
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China.,University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
| | - Jiaxin Jiang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
| | - Zhaoping Xiong
- Laboratory of Health Intelligence, Huawei Technologies Co., Ltd, Shenzhen 518100, China
| | | | - Hualiang Jiang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China.,University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
| | - Nan Qiao
- Laboratory of Health Intelligence, Huawei Technologies Co., Ltd, Shenzhen 518100, China
| | - Mingyue Zheng
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China.,University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
| |
|
46
|
Keith JA, Vassilev-Galindo V, Cheng B, Chmiela S, Gastegger M, Müller KR, Tkatchenko A. Combining Machine Learning and Computational Chemistry for Predictive Insights Into Chemical Systems. Chem Rev 2021; 121:9816-9872. [PMID: 34232033 PMCID: PMC8391798 DOI: 10.1021/acs.chemrev.1c00107] [Citation(s) in RCA: 270] [Impact Index Per Article: 67.5] [Received: 02/04/2021] [Indexed: 12/23/2022]
Abstract
Machine learning models are poised to make a transformative impact on chemical sciences by dramatically accelerating computational algorithms and amplifying insights available from computational chemistry methods. However, achieving this requires a confluence and coaction of expertise in computer science and physical sciences. This Review is written for new and experienced researchers working at the intersection of both fields. We first provide concise tutorials of computational chemistry and machine learning methods, showing how insights involving both can be achieved. We follow with a critical review of noteworthy applications that demonstrate how computational chemistry and machine learning can be used together to provide insightful (and useful) predictions in molecular and materials modeling, retrosyntheses, catalysis, and drug design.
Affiliation(s)
- John A. Keith
- Department of Chemical and Petroleum Engineering Swanson School of Engineering, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, United States
| | - Valentin Vassilev-Galindo
- Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg
| | - Bingqing Cheng
- Accelerate Programme for Scientific Discovery, Department of Computer Science and Technology, 15 J. J. Thomson Avenue, Cambridge CB3 0FD, United Kingdom
| | - Stefan Chmiela
- Department of Software Engineering and Theoretical Computer Science, Technische Universität Berlin, 10587, Berlin, Germany
| | - Michael Gastegger
- Department of Software Engineering and Theoretical Computer Science, Technische Universität Berlin, 10587, Berlin, Germany
| | - Klaus-Robert Müller
- Machine Learning Group, Technische Universität Berlin, 10587, Berlin, Germany
- Department of Artificial Intelligence, Korea University, Anam-dong, Seongbuk-gu, Seoul, 02841, Korea
- Max-Planck-Institut für Informatik, 66123 Saarbrücken, Germany
- Google Research, Brain Team, 10117 Berlin, Germany
| | - Alexandre Tkatchenko
- Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg
| |
|
47
|
Lemm D, von Rudorff GF, von Lilienfeld OA. Machine learning based energy-free structure predictions of molecules, transition states, and solids. Nat Commun 2021; 12:4468. [PMID: 34294693 PMCID: PMC8298673 DOI: 10.1038/s41467-021-24525-7] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 03/24/2021] [Accepted: 06/22/2021] [Indexed: 02/06/2023] Open
Abstract
The computational prediction of atomistic structure is a long-standing problem in physics, chemistry, materials, and biology. Conventionally, force-fields or ab initio methods determine structure through energy minimization, which is either approximate or computationally demanding. This accuracy/cost trade-off prohibits the generation of synthetic big data sets accounting for chemical space with atomistic detail. Exploiting implicit correlations among relaxed structures in training data sets, our machine learning model Graph-To-Structure (G2S) generalizes across compound space in order to infer interatomic distances for out-of-sample compounds, effectively enabling the direct reconstruction of coordinates, and thereby bypassing the conventional energy optimization task. The numerical evidence collected includes 3D coordinate predictions for organic molecules, transition states, and crystalline solids. G2S improves systematically with training set size, reaching mean absolute interatomic distance prediction errors of less than 0.2 Å for less than eight thousand training structures - on par or better than conventional structure generators. Applicability tests of G2S include successful predictions for systems which typically require manual intervention, improved initial guesses for subsequent conventional ab initio based relaxation, and input generation for subsequent use of structure based quantum machine learning models.
Affiliation(s)
- Dominik Lemm
- Faculty of Physics, University of Vienna, Vienna, Austria
| | | | - O Anatole von Lilienfeld
- Faculty of Physics, University of Vienna, Vienna, Austria.
- Institute of Physical Chemistry and National Center for Computational Design and Discovery of Novel Materials (MARVEL), Department of Chemistry, University of Basel, Basel, Switzerland.
| |
|
48
|
Terayama K, Sumita M, Katouda M, Tsuda K, Okuno Y. Efficient Search for Energetically Favorable Molecular Conformations against Metastable States via Gray-Box Optimization. J Chem Theory Comput 2021; 17:5419-5427. [PMID: 34261321 DOI: 10.1021/acs.jctc.1c00301] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/28/2022]
Abstract
In order to accurately understand and estimate molecular properties, finding energetically favorable molecular conformations is the most fundamental task for atomistic computational research on molecules and materials. Geometry optimization based on quantum chemical calculations has enabled the conformation prediction of arbitrary molecules, including de novo ones. However, it is computationally expensive to perform geometry optimizations for enormous conformers. In this study, we introduce the gray-box optimization (GBO) framework, which enables optimal control over the entire geometry optimization process, among multiple conformers. Algorithms designed for GBO roughly estimate energetically preferable conformers during their geometry optimization iterations. They then preferentially compute promising conformers. To evaluate the performance of the GBO framework, we applied it to a test set consisting of seven dipeptides and mycophenolic acid to determine their stable conformations at the density functional theory level. We thus preferentially obtained energetically favorable conformations. Furthermore, the computational costs required to find the most stable conformation were significantly reduced (approximately 1% on average, compared to the naive approach for the dipeptides).
Affiliation(s)
- Kei Terayama
- Graduate School of Medical Life Science, Yokohama City University, Tsurumi-ku, Yokohama 230-0045, Japan.,RIKEN Center for Advanced Intelligence Project, Tokyo 103-0027, Japan.,Medical Sciences Innovation Hub Program, RIKEN, Yokohama 230-0045, Japan.,Graduate School of Medicine, Kyoto University, Sakyo-ku, Kyoto 606-8507, Japan
| | - Masato Sumita
- RIKEN Center for Advanced Intelligence Project, Tokyo 103-0027, Japan.,International Center for Materials Nanoarchitectonics(WPI-MANA), National Institute for Materials Science, Tsukuba 305-0044, Japan
| | - Michio Katouda
- Department of Computational Science and Technology, Research Organization for Information Science and Technology, Minato-ku, Tokyo 105-0013, Japan.,Waseda Research Institute for Science and Engineering, Waseda University, Sinjuku-ku, Tokyo 169-8555, Japan
| | - Koji Tsuda
- RIKEN Center for Advanced Intelligence Project, Tokyo 103-0027, Japan.,Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa 277-8561, Japan.,Research and Services Division of Materials Data and Integrated System, National Institute for Materials Science, Tsukuba 305-0047, Japan
| | - Yasushi Okuno
- Medical Sciences Innovation Hub Program, RIKEN, Yokohama 230-0045, Japan.,Graduate School of Medicine, Kyoto University, Sakyo-ku, Kyoto 606-8507, Japan
| |
|
49
|
Moskal M, Beker W, Szymkuć S, Grzybowski BA. Scaffold‐Directed Face Selectivity Machine‐Learned from Vectors of Non‐covalent Interactions. Angew Chem Int Ed Engl 2021. [DOI: 10.1002/ange.202101986] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 02/05/2023]
Affiliation(s)
- Martyna Moskal
- Institute of Organic Chemistry Polish Academy of Sciences Ul. Kasprzaka 44/52 01-224 Warsaw Poland
- Allchemy, Inc. Highland IN USA
| | - Wiktor Beker
- Institute of Organic Chemistry Polish Academy of Sciences Ul. Kasprzaka 44/52 01-224 Warsaw Poland
- Allchemy, Inc. Highland IN USA
| | - Sara Szymkuć
- Institute of Organic Chemistry Polish Academy of Sciences Ul. Kasprzaka 44/52 01-224 Warsaw Poland
- Allchemy, Inc. Highland IN USA
| | - Bartosz A. Grzybowski
- Institute of Organic Chemistry Polish Academy of Sciences Ul. Kasprzaka 44/52 01-224 Warsaw Poland
- Allchemy, Inc. Highland IN USA
- IBS Center for Soft and Living Matter and Department of Chemistry UNIST 50, UNIST-gil, Eonyang-eup, Ulju-gun Ulsan South Korea
| |
|
50
|
Moskal M, Beker W, Szymkuć S, Grzybowski BA. Scaffold-Directed Face Selectivity Machine-Learned from Vectors of Non-covalent Interactions. Angew Chem Int Ed Engl 2021; 60:15230-15235. [PMID: 33876554 DOI: 10.1002/anie.202101986] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 02/08/2021] [Revised: 03/29/2021] [Indexed: 11/06/2022]
Abstract
This work describes a method to vectorize and Machine-Learn, ML, non-covalent interactions responsible for scaffold-directed reactions important in synthetic chemistry. Models trained on this representation predict correct face of approach in ca. 90 % of Michael additions or Diels-Alder cycloadditions. These accuracies are significantly higher than those based on traditional ML descriptors, energetic calculations, or intuition of experienced synthetic chemists. Our results also emphasize the importance of ML models being provided with relevant mechanistic knowledge; without such knowledge, these models cannot easily "transfer-learn" and extrapolate to previously unseen reaction mechanisms.
Affiliation(s)
- Martyna Moskal
- Institute of Organic Chemistry, Polish Academy of Sciences, Ul. Kasprzaka 44/52, 01-224, Warsaw, Poland.,Allchemy, Inc., Highland, IN, USA
| | - Wiktor Beker
- Institute of Organic Chemistry, Polish Academy of Sciences, Ul. Kasprzaka 44/52, 01-224, Warsaw, Poland.,Allchemy, Inc., Highland, IN, USA
| | - Sara Szymkuć
- Institute of Organic Chemistry, Polish Academy of Sciences, Ul. Kasprzaka 44/52, 01-224, Warsaw, Poland.,Allchemy, Inc., Highland, IN, USA
| | - Bartosz A Grzybowski
- Institute of Organic Chemistry, Polish Academy of Sciences, Ul. Kasprzaka 44/52, 01-224, Warsaw, Poland.,Allchemy, Inc., Highland, IN, USA.,IBS Center for Soft and Living Matter and Department of Chemistry, UNIST, 50, UNIST-gil, Eonyang-eup, Ulju-gun, Ulsan, South Korea
| |
|