1
Gangwal A, Lavecchia A. Unleashing the power of generative AI in drug discovery. Drug Discov Today 2024; 29:103992. [PMID: 38663579 DOI: 10.1016/j.drudis.2024.103992] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2024] [Revised: 03/22/2024] [Accepted: 04/18/2024] [Indexed: 05/04/2024]
Abstract
Artificial intelligence (AI) is revolutionizing drug discovery by enhancing precision, reducing timelines and costs, and enabling AI-driven computer-aided drug design. This review focuses on recent advancements in deep generative models (DGMs) for de novo drug design, exploring diverse algorithms and their profound impact. It critically analyses the challenges that are intricately interwoven into these technologies, proposing strategies to unlock their full potential. It features case studies of both successes and failures in advancing drugs to clinical trials with AI assistance. Last, it outlines a forward-looking plan for optimizing DGMs in de novo drug design, thereby fostering faster and more cost-effective drug development.
Affiliation(s)
- Amit Gangwal
  - Department of Natural Product Chemistry, Shri Vile Parle Kelavani Mandal's Institute of Pharmacy, Dhule 424001, Maharashtra, India
- Antonio Lavecchia
  - "Drug Discovery" Laboratory, Department of Pharmacy, University of Naples Federico II, I-80131 Naples, Italy.
2
Dutschmann TM, Schlenker V, Baumann K. Chemoinformatic regression methods and their applicability domain. Mol Inform 2024:e202400018. [PMID: 38803302 DOI: 10.1002/minf.202400018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2024] [Revised: 03/24/2024] [Accepted: 03/25/2024] [Indexed: 05/29/2024]
Abstract
The growing interest in chemoinformatic model uncertainty calls for a summary of the most widely used regression techniques and how to estimate their reliability. Regression models learn a mapping from the space of explanatory variables to the space of continuous output values. Among other limitations, the predictive performance of the model is restricted by the training data used for model fitting. Identification of unusual objects by outlier detection methods can improve model performance. Additionally, proper model evaluation necessitates defining the limitations of the model, often called the applicability domain. Comparable to certain classifiers, some regression techniques come with built-in methods or augmentations to quantify their (un)certainty, while others rely on generic procedures. The theoretical background of their working principles and how to deduce specific and general definitions for their domain of applicability shall be explained.
Affiliation(s)
- Thomas-Martin Dutschmann
  - Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, 38106, Braunschweig, Germany
- Valerie Schlenker
  - Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, 38106, Braunschweig, Germany
- Knut Baumann
  - Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, 38106, Braunschweig, Germany
3
Wossnig L, Furtmann N, Buchanan A, Kumar S, Greiff V. Best practices for machine learning in antibody discovery and development. Drug Discov Today 2024; 29:104025. [PMID: 38762089 DOI: 10.1016/j.drudis.2024.104025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Revised: 04/25/2024] [Accepted: 05/13/2024] [Indexed: 05/20/2024]
Abstract
In the past 40 years, therapeutic antibody discovery and development have advanced considerably, with machine learning (ML) offering a promising way to speed up the process by reducing costs and the number of experiments required. Recent progress in ML-guided antibody design and development (D&D) has been hindered by the diversity of data sets and evaluation methods, which makes it difficult to conduct comparisons and assess utility. Establishing standards and guidelines will be crucial for the wider adoption of ML and the advancement of the field. This perspective critically reviews current practices, highlights common pitfalls and proposes method development and evaluation guidelines for various ML-based techniques in therapeutic antibody D&D. Addressing challenges across the ML process, best practices are recommended for each stage to enhance reproducibility and progress.
Affiliation(s)
- Leonard Wossnig
  - LabGenius Ltd, The Biscuit Factory, 100 Drummond Road, London SE16 4DG, UK; Department of Computer Science, University College London, 66-72 Gower St, London WC1E 6EA, UK.
- Norbert Furtmann
  - R&D Large Molecules Research Platform, Sanofi Deutschland GmbH, Industriepark Höchst, Frankfurt Am Main, Germany
- Andrew Buchanan
  - Biologics Engineering, R&D, AstraZeneca, Cambridge CB2 0AA, UK
- Sandeep Kumar
  - Computational Protein Design and Modeling Group, Computational Science, Moderna Therapeutics, 200 Technology Square, Cambridge, MA 02139, USA
- Victor Greiff
  - Department of Immunology and Oslo University Hospital, University of Oslo, Oslo, Norway
4
Tan L, Hirte S, Palmacci V, Stork C, Kirchmair J. Tackling assay interference associated with small molecules. Nat Rev Chem 2024; 8:319-339. [PMID: 38622244 DOI: 10.1038/s41570-024-00593-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 02/29/2024] [Indexed: 04/17/2024]
Abstract
Biochemical and cell-based assays are essential to discovering and optimizing efficacious and safe drugs, agrochemicals and cosmetics. However, false assay readouts stemming from colloidal aggregation, chemical reactivity, chelation, light signal attenuation and emission, membrane disruption, and other interference mechanisms remain a considerable challenge in screening synthetic compounds and natural products. To address assay interference, a range of powerful experimental approaches are available and in silico methods are now gaining traction. This Review begins with an overview of the scope and limitations of experimental approaches for tackling assay interference. It then focuses on theoretical methods, discusses strategies for their integration with experimental approaches, and provides recommendations for best practices. The Review closes with a summary of the critical facts and an outlook on potential future developments.
Affiliation(s)
- Lu Tan
  - Drug Discovery Sciences, Boehringer Ingelheim RCV GmbH & Co KG, Vienna, Austria
- Steffen Hirte
  - Department of Pharmaceutical Sciences, Division of Pharmaceutical Chemistry, Faculty of Life Sciences, University of Vienna, Vienna, Austria
  - Vienna Doctoral School of Pharmaceutical, Nutritional and Sport Sciences (PhaNuSpo), University of Vienna, Vienna, Austria
- Vincenzo Palmacci
  - Department of Pharmaceutical Sciences, Division of Pharmaceutical Chemistry, Faculty of Life Sciences, University of Vienna, Vienna, Austria
  - Vienna Doctoral School of Pharmaceutical, Nutritional and Sport Sciences (PhaNuSpo), University of Vienna, Vienna, Austria
- Conrad Stork
  - Department of Informatics, Center for Bioinformatics, Faculty of Mathematics, Informatics and Natural Sciences, Universität Hamburg, Hamburg, Germany
  - BASF SE, Ludwigshafen am Rhein, Germany
- Johannes Kirchmair
  - Department of Pharmaceutical Sciences, Division of Pharmaceutical Chemistry, Faculty of Life Sciences, University of Vienna, Vienna, Austria.
  - Christian Doppler Laboratory for Molecular Informatics in the Biosciences, Department for Pharmaceutical Sciences, University of Vienna, Vienna, Austria.
5
Sigmund LM, S SS, Albers A, Erdmann P, Paton RS, Greb L. Predicting Lewis Acidity: Machine Learning the Fluoride Ion Affinity of p-Block-Atom-Based Molecules. Angew Chem Int Ed Engl 2024; 63:e202401084. [PMID: 38452299 DOI: 10.1002/anie.202401084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2024] [Revised: 03/01/2024] [Accepted: 03/04/2024] [Indexed: 03/09/2024]
Abstract
"How strong is this Lewis acid?" is a question researchers often approach by calculating its fluoride ion affinity (FIA) with quantum chemistry. Here, we present FIA49k, an extensive FIA dataset with 48,986 data points calculated at the RI-DSD-BLYP-D3(BJ)/def2-QZVPP//PBEh-3c level of theory, including 13 different p-block atoms as the fluoride accepting site. The FIA49k dataset was used to train FIA-GNN, two message-passing graph neural networks, which predict gas and solution phase FIA values of molecules excluded from training with a mean absolute error of 14 kJ mol⁻¹ (r² = 0.93) from the SMILES string of the Lewis acid as the only input. The level of accuracy is notable, given the wide energetic range of 750 kJ mol⁻¹ spanned by FIA49k. The model's value was demonstrated with four case studies, including predictions for molecules extracted from the Cambridge Structural Database and by reproducing results from catalysis research available in the literature. Weaknesses of the model are evaluated and interpreted chemically. FIA-GNN and the FIA49k dataset can be reached via a free web app (www.grebgroup.de/fia-gnn).
Affiliation(s)
- Lukas M Sigmund
  - Anorganisch-Chemisches Institut, Ruprecht-Karls-Universität Heidelberg, Im Neuenheimer Feld 270, 69120, Heidelberg, Germany
  - Department of Chemistry, Colorado State University, 1301 Center Avenue, Fort Collins, CO, 80523, USA
- Shree Sowndarya S
  - Department of Chemistry, Colorado State University, 1301 Center Avenue, Fort Collins, CO, 80523, USA
- Andreas Albers
  - Anorganisch-Chemisches Institut, Ruprecht-Karls-Universität Heidelberg, Im Neuenheimer Feld 270, 69120, Heidelberg, Germany
- Philipp Erdmann
  - Anorganisch-Chemisches Institut, Ruprecht-Karls-Universität Heidelberg, Im Neuenheimer Feld 270, 69120, Heidelberg, Germany
- Robert S Paton
  - Department of Chemistry, Colorado State University, 1301 Center Avenue, Fort Collins, CO, 80523, USA
- Lutz Greb
  - Anorganisch-Chemisches Institut, Ruprecht-Karls-Universität Heidelberg, Im Neuenheimer Feld 270, 69120, Heidelberg, Germany
6
Darshan BSD, Sampathila N, Bairy MG, Belurkar S, Prabhu S, Chadaga K. Detection of anemic condition in patients from clinical markers and explainable artificial intelligence. Technol Health Care 2024:THC231207. [PMID: 38339945 DOI: 10.3233/thc-231207] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/12/2024]
Abstract
BACKGROUND: Anaemia is a common blood disorder worldwide. It can be described as an insufficient red blood cell (RBC) count or an insufficient oxygen-carrying capability. This disorder has an impact on the quality of life. If anaemia is detected at an early stage, appropriate care can be taken to prevent further harm. OBJECTIVE: This study proposes a machine learning approach to identify anaemia from clinical markers, with the aim of supporting clinical practice. METHODS: The models are developed with a dataset of 364 samples and 12 blood test attributes. The developed algorithm is expected to provide decision support to clinicians based on blood markers. Each model is trained and validated on several performance metrics. RESULTS: The accuracies obtained by the random forest, K-nearest neighbour, support vector machine, naive Bayes, XGBoost, and CatBoost models are 97%, 98%, 95%, 95%, 98% and 97%, respectively. Four explainers, namely Shapley Additive Explanations (SHAP), QLattice, Eli5 and local interpretable model-agnostic explanations (LIME), are explored for interpreting the model predictions. CONCLUSION: The study provides insights into the potential of machine learning algorithms for classification and may help in the development of automated and accurate diagnostic tools for anaemia.
Affiliation(s)
- B S Dhruva Darshan
  - Department of Biomedical Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Karnataka, India
- Niranjana Sampathila
  - Department of Biomedical Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Karnataka, India
- Muralidhar G Bairy
  - Department of Biomedical Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Karnataka, India
- Sushma Belurkar
  - Haematology and Clinical Pathology Lab, Kasturba Medical College, Manipal Academy of Higher Education, Karnataka, India
- Srikanth Prabhu
  - Department of Computer Science and Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Karnataka, India
- Krishnaraj Chadaga
  - Department of Computer Science and Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Karnataka, India
7
Cichońska A, Ravikumar B, Rahman R. AI for targeted polypharmacology: The next frontier in drug discovery. Curr Opin Struct Biol 2024; 84:102771. [PMID: 38215530 DOI: 10.1016/j.sbi.2023.102771] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2023] [Revised: 11/30/2023] [Accepted: 12/20/2023] [Indexed: 01/14/2024]
Abstract
In drug discovery, targeted polypharmacology, i.e., targeting multiple molecular targets with a single drug, is redefining therapeutic design to address complex diseases. Pre-selected pharmacological profiles, as exemplified in kinase drugs, promise enhanced efficacy and reduced toxicity. Historically, many such drugs were discovered serendipitously, limiting predictability and efficacy, but artificial intelligence (AI) now offers a transformative solution. Machine learning and deep learning techniques enable modeling protein structures, generating novel compounds, and decoding their polypharmacological effects, opening an avenue for more systematic and predictive multi-target drug design. This review explores the use of AI in identifying synergistic co-targets and delineating them from anti-targets that lead to adverse effects, and then discusses advances in AI-enabled docking, generative chemistry, and proteochemometric modeling of proteome-wide compound interactions, in the context of polypharmacology. We also provide insights into challenges ahead.
8
Escayola S, Bahri-Laleh N, Poater A. %VBur index and steric maps: from predictive catalysis to machine learning. Chem Soc Rev 2024; 53:853-882. [PMID: 38113051 DOI: 10.1039/d3cs00725a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/21/2023]
Abstract
Steric indices are parameters used in chemistry to describe the spatial arrangement of atoms or groups of atoms in molecules. They are important in determining the reactivity, stability, and physical properties of chemical compounds. One commonly used steric index is the steric hindrance, which refers to the obstruction or hindrance of movement in a molecule caused by bulky substituents or functional groups. Steric hindrance can affect the reactivity of a molecule by altering the accessibility of its reactive sites and influencing the geometry of its transition states. Notably, the Tolman cone angle and %VBur are prominent among these indices. Actually, steric effects can also be described using the concept of steric bulk, which refers to the space occupied by a molecule or functional group. Steric bulk can affect the solubility, melting point, boiling point, and viscosity of a substance. Even though electronic indices are more widely used, they have certain drawbacks that might shift preferences towards others. They present a higher computational cost, and often, the weight of electronics in correlation with chemical properties, e.g. binding energies, falls short in comparison to %VBur. However, it is worth noting that this may be because the steric index inherently captures part of the electronic content. Overall, steric indices play an important role in understanding the behaviour of chemical compounds and can be used to predict their reactivity, stability, and physical properties. Predictive chemistry is an approach to chemical research that uses computational methods to anticipate the properties and behaviour of these compounds and reactions, facilitating the design of new compounds and reactivities. Within this domain, predictive catalysis specifically targets the prediction of the performance and behaviour of catalysts. 
Ultimately, the goal is to identify new catalysts with optimal properties, leading to chemical processes that are both more efficient and sustainable. In this framework, %VBur can be a key metric for deepening our understanding of catalysis, emphasizing predictive catalysis and sustainability. Those latter concepts are needed to direct our efforts toward identifying the optimal catalyst for any reaction, minimizing waste, and reducing experimental efforts while maximizing the efficacy of the computational methods.
Affiliation(s)
- Sílvia Escayola
  - Institut de Química Computacional i Catàlisi and Departament de Química, Universitat de Girona, c/Mª Aurèlia Capmany 69, 17003 Girona, Catalonia, Spain.
  - Donostia International Physics Center (DIPC), 20018 Donostia, Euskadi, Spain
- Naeimeh Bahri-Laleh
  - Iran Polymer and Petrochemical Institute (IPPI), P.O. Box 14965/115, Tehran, Iran
  - Institute for Sustainability with Knotted Chiral Meta Matter (WPI-SKCM), Hiroshima University, Hiroshima, 739-8526, Japan
- Albert Poater
  - Institut de Química Computacional i Catàlisi and Departament de Química, Universitat de Girona, c/Mª Aurèlia Capmany 69, 17003 Girona, Catalonia, Spain.
9
Mikutis S, Lawrinowitz S, Kretzer C, Dunsmore L, Sketeris L, Rodrigues T, Werz O, Bernardes GJL. Machine Learning Uncovers Natural Product Modulators of the 5-Lipoxygenase Pathway and Facilitates the Elucidation of Their Biological Mechanisms. ACS Chem Biol 2024; 19:217-229. [PMID: 38149598 PMCID: PMC10804367 DOI: 10.1021/acschembio.3c00725] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Revised: 12/10/2023] [Accepted: 12/12/2023] [Indexed: 12/28/2023]
Abstract
Machine learning (ML) models have made inroads into chemical sciences, with optimization of chemical reactions and prediction of biologically active molecules being prime examples thereof. These models excel where physical experiments are expensive or time-consuming, for example, due to large scales or the need for materials that are difficult to obtain. Studies of natural products suffer from these issues: this class of small molecules is known for its wealth of structural diversity and wide-ranging biological activities, but their investigation is hindered by poor synthetic accessibility and lack of scalability. To facilitate the evaluation of these molecules, we designed ML models that predict which natural products can interact with a particular target or a relevant pathway. Here, we focused on discovering natural products that are capable of modulating the 5-lipoxygenase (5-LO) pathway that plays key roles in lipid signaling and inflammation. These computational approaches led to the identification of nine natural products that either directly inhibit the activity of the 5-LO enzyme or affect the cellular 5-LO pathway. Further investigation of one of these molecules, deltonin, led us to discover a new cell-type-selective mechanism of action. Our ML approach helped deorphanize natural products as well as shed light on their mechanisms and can be broadly applied to other use cases in chemical biology.
Affiliation(s)
- Sigitas Mikutis
  - Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K.
- Stefanie Lawrinowitz
  - Department of Pharmaceutical/Medicinal Chemistry, Institute of Pharmacy, Friedrich Schiller University Jena, Philosophenweg 14, 07743 Jena, Germany
- Christian Kretzer
  - Department of Pharmaceutical/Medicinal Chemistry, Institute of Pharmacy, Friedrich Schiller University Jena, Philosophenweg 14, 07743 Jena, Germany
- Lavinia Dunsmore
  - Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K.
- Laurynas Sketeris
  - Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K.
- Tiago Rodrigues
  - Instituto de Investigação do Medicamento (iMed), Faculdade de Farmácia, Universidade de Lisboa, Av. Prof. Gama Pinto, 1649-003 Lisbon, Portugal
- Oliver Werz
  - Department of Pharmaceutical/Medicinal Chemistry, Institute of Pharmacy, Friedrich Schiller University Jena, Philosophenweg 14, 07743 Jena, Germany
- Gonçalo J. L. Bernardes
  - Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K.
  - Instituto de Medicina Molecular João Lobo Antunes, Faculdade de Medicina, Universidade de Lisboa, Avenida Professor Egas Moniz, 1649-028 Lisboa, Portugal
10
King-Smith E, Faber FA, Reilly U, Sinitskiy AV, Yang Q, Liu B, Hyek D, Lee AA. Predictive Minisci late stage functionalization with transfer learning. Nat Commun 2024; 15:426. [PMID: 38225239 PMCID: PMC10789750 DOI: 10.1038/s41467-023-42145-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2023] [Accepted: 10/01/2023] [Indexed: 01/17/2024] Open
Abstract
Structural diversification of lead molecules is a key component of drug discovery to explore chemical space. Late-stage functionalizations (LSFs) are versatile methodologies capable of installing functional handles on richly decorated intermediates to deliver numerous diverse products in a single reaction. Predicting the regioselectivity of LSF is still an open challenge in the field. Numerous efforts from chemoinformatics and machine learning (ML) groups have made strides in this area. However, it is arduous to isolate and characterize the multitude of LSF products generated, limiting available data and hindering pure ML approaches. We report the development of an approach that combines a message passing neural network and 13C NMR-based transfer learning to predict the atom-wise probabilities of functionalization for Minisci and P450-based functionalizations. We validated our model both retrospectively and with a series of prospective experiments, showing that it accurately predicts the outcomes of Minisci-type and P450 transformations and outperforms the well-established Fukui-based reactivity indices and other machine learning reactivity-based algorithms.
Affiliation(s)
- Emma King-Smith
  - Cavendish Laboratory, University of Cambridge, Cambridge, UK
- Felix A Faber
  - Cavendish Laboratory, University of Cambridge, Cambridge, UK
- Usa Reilly
  - Development & Medical, Pfizer Worldwide Research, Groton, CT, USA
- Anton V Sinitskiy
  - Machine Learning Computational Sciences, Pfizer Worldwide Research, Cambridge, MA, USA
- Qingyi Yang
  - Development & Medical, Pfizer Worldwide Research, Cambridge, MA, USA
- Bo Liu
  - Spectrix Analytic Services, LLC., North Haven, CT, USA
- Dennis Hyek
  - Spectrix Analytic Services, LLC., North Haven, CT, USA
- Alpha A Lee
  - Cavendish Laboratory, University of Cambridge, Cambridge, UK.
11
Dong W, Tian H, Zhang W, Zhou JJ, Pang X. Development of NaCl-MgCl₂-CaCl₂ Ternary Salt for High-Temperature Thermal Energy Storage Using Machine Learning. ACS Appl Mater Interfaces 2024; 16:530-539. [PMID: 38126774 DOI: 10.1021/acsami.3c13412] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2023]
Abstract
NaCl-MgCl₂-CaCl₂ eutectic ternary chloride salts are potential heat transfer and storage materials for high-temperature thermal energy storage. In this study, first-principles molecular dynamics simulation results were used as a data set to develop an interatomic potential for ternary chloride salts using a neural network machine learning method. Deep potential molecular dynamics (DPMD) simulations were performed to predict the microstructure and thermophysical properties of the NaCl-MgCl₂-CaCl₂ ternary salt. This work reveals that DPMD simulations can accurately calculate the microstructure and thermophysical properties of ternary chloride salts. The association strength of chloride ions and cations follows the order of Mg²⁺ > Ca²⁺ > Na⁺, and the coordination number decreases gradually with increasing temperature, indicating a progressively looser and more disordered molten structure. Furthermore, thermophysical properties, such as density, specific heat capacity, thermal conductivity, and viscosity, are in good agreement with the experimental measurements. Machine learning molecular dynamics will provide a feasible multivariate molten salt exploration method for the design of next-generation solar power plants and thermal energy storage systems.
Affiliation(s)
- Wenhao Dong
  - School of Mechanical and Power Engineering, Zhengzhou University, Zhengzhou 450001, P. R. China
- Heqing Tian
  - School of Mechanical and Power Engineering, Zhengzhou University, Zhengzhou 450001, P. R. China
- Wenguang Zhang
  - School of Mechanical and Power Engineering, Zhengzhou University, Zhengzhou 450001, P. R. China
- Jun-Jie Zhou
  - School of Mechanical and Power Engineering, Zhengzhou University, Zhengzhou 450001, P. R. China
- Xinchang Pang
  - School of Materials Science and Engineering, Zhengzhou University, Zhengzhou 450001, P. R. China
12
Handa K, Thomas MC, Kageyama M, Iijima T, Bender A. On the difficulty of validating molecular generative models realistically: a case study on public and proprietary data. J Cheminform 2023; 15:112. [PMID: 37990215 PMCID: PMC10664602 DOI: 10.1186/s13321-023-00781-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2023] [Accepted: 11/10/2023] [Indexed: 11/23/2023] Open
Abstract
While a multitude of deep generative models have recently emerged there exists no best practice for their practically relevant validation. On the one hand, novel de novo-generated molecules cannot be refuted by retrospective validation (so that this type of validation is biased); but on the other hand prospective validation is expensive and then often biased by the human selection process. In this case study, we frame retrospective validation as the ability to mimic human drug design, by answering the following question: Can a generative model trained on early-stage project compounds generate middle/late-stage compounds de novo? To this end, we used experimental data that contains the elapsed time of a synthetic expansion following hit identification from five public (where the time series was pre-processed to better reflect realistic synthetic expansions) and six in-house project datasets, and used REINVENT as a widely adopted RNN-based generative model. After splitting the dataset and training REINVENT on early-stage compounds, we found that rediscovery of middle/late-stage compounds was much higher in public projects (at 1.60%, 0.64%, and 0.21% of the top 100, 500, and 5000 scored generated compounds) than in in-house projects (where the values were 0.00%, 0.03%, and 0.04%, respectively). Similarly, average single nearest neighbour similarity between early- and middle/late-stage compounds in public projects was higher between active compounds than inactive compounds; however, for in-house projects the converse was true, which makes rediscovery (if so desired) more difficult. We hence show that the generative model recovers very few middle/late-stage compounds from real-world drug discovery projects, highlighting the fundamental difference between purely algorithmic design and drug discovery as a real-world process. 
Evaluating de novo compound design approaches appears, based on the current study, difficult or even impossible to do retrospectively. Scientific Contribution: This contribution hence illustrates aspects of evaluating the performance of generative models in a real-world setting which have not been extensively described previously and which hopefully contribute to their further future development.
Affiliation(s)
- Koichi Handa
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK.
  - Toxicology & DMPK Research Department, Teijin Institute for Bio-Medical Research, Teijin Pharma Limited, 4-3-2 Asahigaoka, Hino-Shi, Tokyo, 191-8512, Japan.
- Morgan C Thomas
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
- Michiharu Kageyama
  - Toxicology & DMPK Research Department, Teijin Institute for Bio-Medical Research, Teijin Pharma Limited, 4-3-2 Asahigaoka, Hino-Shi, Tokyo, 191-8512, Japan
- Takeshi Iijima
  - Toxicology & DMPK Research Department, Teijin Institute for Bio-Medical Research, Teijin Pharma Limited, 4-3-2 Asahigaoka, Hino-Shi, Tokyo, 191-8512, Japan
- Andreas Bender
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK.
13
Deng J, Yang Z, Wang H, Ojima I, Samaras D, Wang F. A systematic study of key elements underlying molecular property prediction. Nat Commun 2023; 14:6395. [PMID: 37833262 PMCID: PMC10575948 DOI: 10.1038/s41467-023-41948-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 10/24/2022] [Accepted: 09/18/2023] [Indexed: 10/15/2023] Open
Abstract
Artificial intelligence (AI) has been widely applied in drug discovery, with molecular property prediction as a major task. Despite booming techniques in molecular representation learning, key elements underlying molecular property prediction remain largely unexplored, which impedes further advancements in this field. Herein, we conduct an extensive evaluation of representative models using various representations on the MoleculeNet datasets, a suite of opioids-related datasets and two additional activity datasets from the literature. To investigate the predictive power in low-data and high-data space, a series of descriptor datasets of varying sizes are also assembled to evaluate the models. In total, we have trained 62,820 models, including 50,220 models on fixed representations, 4200 models on SMILES sequences and 8400 models on molecular graphs. Based on extensive experimentation and rigorous comparison, we show that representation learning models exhibit limited performance in molecular property prediction in most datasets. Besides, multiple key elements underlying molecular property prediction can affect the evaluation results. Furthermore, we show that activity cliffs can significantly impact model prediction. Finally, we explore potential reasons why representation learning models can fail and show that dataset size is essential for representation learning models to excel.
Affiliation(s)
- Jianyuan Deng
- Stony Brook University, Department of Biomedical Informatics, Stony Brook, NY, 11794, USA
| | - Zhibo Yang
- Stony Brook University, Department of Computer Science, Stony Brook, NY, 11794, USA
| | - Hehe Wang
- Stony Brook University, Department of Chemistry, Stony Brook, NY, 11794, USA
| | - Iwao Ojima
- Stony Brook University, Department of Chemistry, Stony Brook, NY, 11794, USA
| | - Dimitris Samaras
- Stony Brook University, Department of Computer Science, Stony Brook, NY, 11794, USA
| | - Fusheng Wang
- Stony Brook University, Department of Biomedical Informatics, Stony Brook, NY, 11794, USA.
- Stony Brook University, Department of Computer Science, Stony Brook, NY, 11794, USA.
| |
|
14
|
Dias AL, Bustillo L, Rodrigues T. Limitations of representation learning in small molecule property prediction. Nat Commun 2023; 14:6394. [PMID: 37833279 PMCID: PMC10575963 DOI: 10.1038/s41467-023-41967-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2023] [Accepted: 09/18/2023] [Indexed: 10/15/2023] Open
Abstract
Machine learning is a powerful tool for the study and design of molecules. Here the authors comment on a recent publication in Nature Communications that highlights the challenges of different molecular representations for data-driven property predictions.
Affiliation(s)
- Ana Laura Dias
- Research Institute for Medicines (iMed), Faculdade de Farmácia, Universidade de Lisboa, Lisbon, Portugal
| | - Latimah Bustillo
- Research Institute for Medicines (iMed), Faculdade de Farmácia, Universidade de Lisboa, Lisbon, Portugal
| | - Tiago Rodrigues
- Research Institute for Medicines (iMed), Faculdade de Farmácia, Universidade de Lisboa, Lisbon, Portugal.
| |
|
15
|
Li J, Wu N, Zhang J, Wu HH, Pan K, Wang Y, Liu G, Liu X, Yao Z, Zhang Q. Machine Learning-Assisted Low-Dimensional Electrocatalysts Design for Hydrogen Evolution Reaction. NANO-MICRO LETTERS 2023; 15:227. [PMID: 37831203 PMCID: PMC10575847 DOI: 10.1007/s40820-023-01192-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/26/2023] [Accepted: 08/10/2023] [Indexed: 10/14/2023]
Abstract
Efficient electrocatalysts are crucial for hydrogen generation from electrolyzing water. Nevertheless, the conventional "trial and error" method for producing advanced electrocatalysts is not only cost-ineffective but also time-consuming and labor-intensive. Fortunately, the advancement of machine learning brings new opportunities for electrocatalyst discovery and design. By analyzing experimental and theoretical data, machine learning can effectively predict the hydrogen evolution reaction (HER) performance of electrocatalysts. This review summarizes recent developments in machine learning for low-dimensional electrocatalysts, including zero-dimensional nanoparticles and nanoclusters, one-dimensional nanotubes and nanowires, two-dimensional nanosheets, as well as other electrocatalysts. In particular, the effects of descriptors and algorithms on screening low-dimensional electrocatalysts and investigating their HER performance are highlighted. Finally, the future directions and perspectives for machine learning in electrocatalysis are discussed, emphasizing the potential for machine learning to accelerate electrocatalyst discovery, optimize their performance, and provide new insights into electrocatalytic mechanisms. Overall, this work offers an in-depth understanding of the current state of machine learning in electrocatalysis and its potential for future research.
Affiliation(s)
- Jin Li
- College of Chemistry and Chemical Engineering, and Henan Key Laboratory of Function-Oriented Porous Materials, Luoyang Normal University, Luoyang, 471934, People's Republic of China
| | - Naiteng Wu
- College of Chemistry and Chemical Engineering, and Henan Key Laboratory of Function-Oriented Porous Materials, Luoyang Normal University, Luoyang, 471934, People's Republic of China
| | - Jian Zhang
- New Energy Technology Engineering Lab of Jiangsu Province, College of Science, Nanjing University of Posts and Telecommunications (NUPT), Nanjing, 210023, People's Republic of China
| | - Hong-Hui Wu
- School of Materials Science and Engineering, University of Science and Technology Beijing, Beijing, 100083, People's Republic of China.
- Department of Chemistry, University of Nebraska-Lincoln, Lincoln, NE, 8588, USA.
| | - Kunming Pan
- Henan Key Laboratory of High-Temperature Structural and Functional Materials, National Joint Engineering Research Center for Abrasion Control and Molding of Metal Materials, Henan University of Science and Technology, Luoyang, 471003, People's Republic of China
| | - Yingxue Wang
- National Engineering Laboratory for Risk Perception and Prevention, Beijing, 100041, People's Republic of China.
| | - Guilong Liu
- College of Chemistry and Chemical Engineering, and Henan Key Laboratory of Function-Oriented Porous Materials, Luoyang Normal University, Luoyang, 471934, People's Republic of China
| | - Xianming Liu
- College of Chemistry and Chemical Engineering, and Henan Key Laboratory of Function-Oriented Porous Materials, Luoyang Normal University, Luoyang, 471934, People's Republic of China.
| | - Zhenpeng Yao
- Center of Hydrogen Science, Shanghai Jiao Tong University, Shanghai, 200000, People's Republic of China
- State Key Laboratory of Metal Matrix Composites, School of Materials Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200000, People's Republic of China
| | - Qiaobao Zhang
- State Key Laboratory of Physical Chemistry of Solid Surfaces, College of Materials, Xiamen University, Xiamen, 361005, People's Republic of China.
| |
|
16
|
Shilpa S, Kashyap G, Sunoj RB. Recent Applications of Machine Learning in Molecular Property and Chemical Reaction Outcome Predictions. J Phys Chem A 2023; 127:8253-8271. [PMID: 37769193 DOI: 10.1021/acs.jpca.3c04779] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Indexed: 09/30/2023]
Abstract
Burgeoning developments in machine learning (ML) and its rapidly growing adaptations in chemistry are noteworthy. Motivated by the successful deployments of ML in the realm of molecular property prediction (MPP) and chemical reaction prediction (CRP), herein we highlight some of its most recent applications in predictive chemistry. We present a nonmathematical and concise overview of the progression of ML implementations, ranging from an ensemble-based random forest model to advanced graph neural network algorithms. Similarly, the prospects of various feature engineering and feature learning approaches that work in conjunction with ML models are described. Highly accurate predictions reported in MPP tasks (e.g., lipophilicity, solubility, distribution coefficient), using methods such as D-MPNN, MolCLR, SMILES-BERT, and MolBERT, offer promising avenues in molecular design and drug discovery. Whereas MPP pertains to a given molecule, ML applications in chemical reactions present a different level of challenge, primarily arising from the simultaneous involvement of multiple molecules and their diverse roles in a reaction setting. The reported RMSEs in MPP tasks range from 0.287 to 2.20, while those for yield predictions are well over 4.9 in the lower end, reaching thresholds of >10.0 in several examples. Our Review concludes with a set of persisting challenges in dealing with reaction data sets and an overall optimistic outlook on benefits of ML-driven workflows for various MPP as well as CRP tasks.
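For readers comparing the error ranges quoted in this abstract, the root-mean-square error is conventionally defined as below. The symbols (n predictions, measured values y_i, predicted values \hat{y}_i) are notation introduced here for illustration and are not taken from the review. RMSE carries the units of the predicted quantity, so the property-prediction and yield-prediction ranges are presumably not on a common scale.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}
```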
Affiliation(s)
- Shilpa Shilpa
- Department of Chemistry, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
| | - Gargee Kashyap
- Department of Chemistry, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
| | - Raghavan B Sunoj
- Department of Chemistry, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
- Centre for Machine Intelligence and Data Science, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
| |
|
17
|
Bustillo L, Laino T, Rodrigues T. The rise of automated curiosity-driven discoveries in chemistry. Chem Sci 2023; 14:10378-10384. [PMID: 37799997 PMCID: PMC10548516 DOI: 10.1039/d3sc03367h] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2023] [Accepted: 09/07/2023] [Indexed: 10/07/2023] Open
Abstract
The quest for generating novel chemistry knowledge is critical in scientific advancement, and machine learning (ML) has emerged as an asset in this pursuit. Through interpolation among learned patterns, ML can tackle tasks that were previously deemed demanding to machines. This distinctive capacity of ML provides invaluable aid to bench chemists in their daily work. However, current ML tools are typically designed to prioritize experiments with the highest likelihood of success, i.e., higher predictive confidence. In this perspective, we build on current trends that suggest a future in which ML could be just as beneficial in exploring uncharted search spaces through simulated curiosity. We discuss how low and 'negative' data can catalyse one-/few-shot learning, and how the broader use of curious ML and novelty detection algorithms can propel the next wave of chemical discoveries. We anticipate that ML for curiosity-driven research will help the community overcome potentially biased assumptions and uncover unexpected findings in the chemical sciences at an accelerated pace.
Affiliation(s)
- Latimah Bustillo
- Research Institute for Medicines (iMed), Faculdade de Farmácia, Universidade de Lisboa Lisbon Portugal
| | - Teodoro Laino
- IBM Research Europe Säumerstrasse 4 8803 Rüschlikon Switzerland
- National Center for Competence in Research-Catalysis (NCCR-Catalysis) Zurich Switzerland
| | - Tiago Rodrigues
- Research Institute for Medicines (iMed), Faculdade de Farmácia, Universidade de Lisboa Lisbon Portugal
| |
|
18
|
Smajić A, Rami I, Sosnin S, Ecker GF. Identifying Differences in the Performance of Machine Learning Models for Off-Targets Trained on Publicly Available and Proprietary Data Sets. Chem Res Toxicol 2023; 36:1300-1312. [PMID: 37439496 PMCID: PMC10445286 DOI: 10.1021/acs.chemrestox.3c00042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2023] [Indexed: 07/14/2023]
Abstract
Each year, publicly available databases are updated with new compounds from different research institutions. Positive experimental outcomes are more likely to be reported; therefore, they account for a considerable fraction of these entries. Established publicly available databases such as ChEMBL allow researchers to use information without restrictions and create predictive tools for a broad spectrum of applications in the field of toxicology. Therefore, we investigated the distribution of positive and nonpositive entries within ChEMBL for a set of off-targets and its impact on the performance of classification models when applied to pharmaceutical industry data sets. Results indicate that models trained on publicly available data tend to overpredict positives, and models based on industry data sets predict negatives more often than those built using publicly available data sets. This is strengthened even further by the visualization of the prediction space for a set of 10,000 compounds, which makes it possible to identify regions in the chemical space where predictions converge. Finally, we highlight the utilization of these models for consensus modeling for potential adverse events prediction.
Affiliation(s)
- Aljoša Smajić
- Department of Pharmaceutical Sciences, University of Vienna, Josef-Holaubek-Platz 2, 1090 Vienna, Austria
| | - Iris Rami
- Department of Pharmaceutical Sciences, University of Vienna, Josef-Holaubek-Platz 2, 1090 Vienna, Austria
| | - Sergey Sosnin
- Department of Pharmaceutical Sciences, University of Vienna, Josef-Holaubek-Platz 2, 1090 Vienna, Austria
| | - Gerhard F. Ecker
- Department of Pharmaceutical Sciences, University of Vienna, Josef-Holaubek-Platz 2, 1090 Vienna, Austria
| |
|
19
|
Lanini J, Santarossa G, Sirockin F, Lewis R, Fechner N, Misztela H, Lewis S, Maziarz K, Stanley M, Segler M, Stiefl N, Schneider N. PREFER: A New Predictive Modeling Framework for Molecular Discovery. J Chem Inf Model 2023; 63:4497-4504. [PMID: 37487018 DOI: 10.1021/acs.jcim.3c00523] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/26/2023]
Abstract
Machine-learning and deep-learning models have been extensively used in cheminformatics to predict molecular properties, to reduce the need for direct measurements, and to accelerate compound prioritization. However, different setups and frameworks and the large number of molecular representations make it difficult to properly evaluate, reproduce, and compare them. Here we present a new PREdictive modeling FramEwoRk for molecular discovery (PREFER), written in Python (version 3.7.7) and based on AutoSklearn (version 0.14.7), that allows comparison between different molecular representations and common machine-learning models. We provide an overview of the design of our framework and show exemplary use cases and results of several representation-model combinations on diverse data sets, both public and in-house. Finally, we discuss the use of PREFER on small data sets. The code of the framework is freely available on GitHub.
Affiliation(s)
- Jessica Lanini
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| | - Gianluca Santarossa
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| | - Finton Sirockin
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| | - Richard Lewis
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| | - Nikolas Fechner
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| | | | - Sarah Lewis
- Microsoft Research AI4Science, Cambridge CB1 2FB, U.K
| | | | - Megan Stanley
- Microsoft Research AI4Science, Cambridge CB1 2FB, U.K
| | - Marwin Segler
- Microsoft Research AI4Science, Cambridge CB1 2FB, U.K
| | - Nikolaus Stiefl
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| | - Nadine Schneider
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| |
|
20
|
Knattrup Y, Kubečka J, Ayoubi D, Elm J. Clusterome: A Comprehensive Data Set of Atmospheric Molecular Clusters for Machine Learning Applications. ACS OMEGA 2023; 8:25155-25164. [PMID: 37483242 PMCID: PMC10357536 DOI: 10.1021/acsomega.3c02203] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/02/2023] [Accepted: 06/16/2023] [Indexed: 07/25/2023]
Abstract
Formation and growth of atmospheric molecular clusters into aerosol particles impact the global climate and contribute to the high uncertainty in modern climate models. Cluster formation is usually studied using quantum chemical methods, which quickly become computationally expensive as system sizes grow. In this work, we present a large database of ∼250k atmospherically relevant cluster structures, which can be applied for developing machine learning (ML) models. The database is used to train the ML model kernel ridge regression (KRR) with the FCHL19 representation. We test the ability of the model to extrapolate from smaller clusters to larger clusters, between different molecules, between equilibrium structures and out-of-equilibrium structures, and the transferability onto systems with new interactions. We show that KRR models can extrapolate to larger sizes and transfer acid and base interactions with mean absolute errors below 1 kcal/mol. We suggest introducing an iterative ML step in configurational sampling processes, which can reduce the computational expense. Such an approach would allow us to study significantly more cluster systems at higher accuracy than previously possible and thereby allow us to cover a much larger part of relevant atmospheric compounds.
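As a rough illustration of the kernel ridge regression technique named in this abstract, the following is a minimal sketch and not the authors' pipeline. It assumes a plain Gaussian kernel on generic feature vectors and a one-dimensional toy target, whereas the paper uses the FCHL19 representation of cluster structures; the function names and the parameters `sigma` and `lam` are hypothetical and chosen only for this example.

```python
import numpy as np


def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise squared Euclidean distances between rows of A and rows of B.
    d2 = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))


def krr_fit(X, y, sigma=1.0, lam=1e-6):
    # Closed-form KRR solution: alpha = (K + lam * I)^{-1} y.
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)


def krr_predict(X_train, alpha, X_new, sigma=1.0):
    # A prediction is a kernel-weighted sum over the training points.
    return gaussian_kernel(X_new, X_train, sigma) @ alpha


# Toy stand-in for (representation, energy) pairs: learn y = sin(x).
X_train = np.linspace(-3.0, 3.0, 40)[:, None]
y_train = np.sin(X_train[:, 0])
alpha = krr_fit(X_train, y_train)

X_test = np.linspace(-2.5, 2.5, 20)[:, None]
y_pred = krr_predict(X_train, alpha, X_test)
mae = np.mean(np.abs(y_pred - np.sin(X_test[:, 0])))
print(f"toy mean absolute error: {mae:.2e}")
```

The regularization strength `lam` and kernel width `sigma` would normally be tuned by cross-validation; the fixed values here are placeholders.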
Affiliation(s)
- Yosef Knattrup
- Department of Chemistry, Aarhus University, Langelandsgade 140, 8000 Aarhus C, Denmark
| | - Jakub Kubečka
- Department of Chemistry, Aarhus University, Langelandsgade 140, 8000 Aarhus C, Denmark
| | - Daniel Ayoubi
- Department of Chemistry, Aarhus University, Langelandsgade 140, 8000 Aarhus C, Denmark
| | - Jonas Elm
- Department of Chemistry, iClimate, Aarhus University, Langelandsgade 140, 8000 Aarhus C, Denmark
| |
|
21
|
Wang X, Huang Y, Xie X, Liu Y, Huo Z, Lin M, Xin H, Tong R. Bayesian-optimization-assisted discovery of stereoselective aluminum complexes for ring-opening polymerization of racemic lactide. Nat Commun 2023; 14:3647. [PMID: 37339991 DOI: 10.1038/s41467-023-39405-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 12/28/2022] [Accepted: 06/12/2023] [Indexed: 06/22/2023] Open
Abstract
Stereoselective ring-opening polymerization catalysts are used to produce degradable stereoregular poly(lactic acids) with thermal and mechanical properties that are superior to those of atactic polymers. However, the process of discovering highly stereoselective catalysts is still largely empirical. We aim to develop an integrated computational and experimental framework for efficient, predictive catalyst selection and optimization. As a proof of principle, we have developed a Bayesian optimization workflow on a subset of literature results for stereoselective lactide ring-opening polymerization, and using the algorithm, we identify multiple new Al complexes that catalyze either isoselective or heteroselective polymerization. In addition, feature attribution analysis uncovers mechanistically meaningful ligand descriptors, such as percent buried volume (%Vbur) and the highest occupied molecular orbital energy (EHOMO), that can access quantitative and predictive models for catalyst development.
Affiliation(s)
- Xiaoqian Wang
- Department of Chemical Engineering, Virginia Polytechnic Institute and State University, 635 Prices Fork Road, Blacksburg, VA, 24061, USA
| | - Yang Huang
- Department of Chemical Engineering, Virginia Polytechnic Institute and State University, 635 Prices Fork Road, Blacksburg, VA, 24061, USA
| | - Xiaoyu Xie
- Department of Chemical Engineering, Virginia Polytechnic Institute and State University, 635 Prices Fork Road, Blacksburg, VA, 24061, USA
| | - Yan Liu
- Department of Chemical Engineering, Virginia Polytechnic Institute and State University, 635 Prices Fork Road, Blacksburg, VA, 24061, USA
| | - Ziyu Huo
- Department of Chemical Engineering, Virginia Polytechnic Institute and State University, 635 Prices Fork Road, Blacksburg, VA, 24061, USA
| | - Maverick Lin
- Department of Chemical Engineering, Virginia Polytechnic Institute and State University, 635 Prices Fork Road, Blacksburg, VA, 24061, USA
| | - Hongliang Xin
- Department of Chemical Engineering, Virginia Polytechnic Institute and State University, 635 Prices Fork Road, Blacksburg, VA, 24061, USA.
| | - Rong Tong
- Department of Chemical Engineering, Virginia Polytechnic Institute and State University, 635 Prices Fork Road, Blacksburg, VA, 24061, USA.
| |
|
22
|
Kubečka J, Knattrup Y, Engsvang M, Jensen AB, Ayoubi D, Wu H, Christiansen O, Elm J. Current and future machine learning approaches for modeling atmospheric cluster formation. NATURE COMPUTATIONAL SCIENCE 2023; 3:495-503. [PMID: 38177415 DOI: 10.1038/s43588-023-00435-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 12/21/2022] [Accepted: 03/16/2023] [Indexed: 01/06/2024]
Abstract
The formation of strongly bound atmospheric molecular clusters is the first step towards forming new aerosol particles. Recent advances in the application of machine learning models open an enormous opportunity for complementing expensive quantum chemical calculations with efficient machine learning predictions. In this Perspective, we present how data-driven approaches can be applied to accelerate cluster configurational sampling, thereby greatly increasing the number of chemically relevant systems that can be covered.
Affiliation(s)
- Jakub Kubečka
- Department of Chemistry, Aarhus University, Aarhus, Denmark
| | - Yosef Knattrup
- Department of Chemistry, Aarhus University, Aarhus, Denmark
| | | | | | - Daniel Ayoubi
- Department of Chemistry, Aarhus University, Aarhus, Denmark
| | - Haide Wu
- Department of Chemistry, Aarhus University, Aarhus, Denmark
| | | | - Jonas Elm
- Department of Chemistry, Aarhus University, Aarhus, Denmark.
- iCLIMATE Aarhus University Interdisciplinary Centre for Climate Change, Aarhus, Denmark.
| |
|
23
|
Saebi M, Nan B, Herr JE, Wahlers J, Guo Z, Zurański AM, Kogej T, Norrby PO, Doyle AG, Chawla NV, Wiest O. On the use of real-world datasets for reaction yield prediction. Chem Sci 2023; 14:4997-5005. [PMID: 37206399 PMCID: PMC10189898 DOI: 10.1039/d2sc06041h] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Received: 11/01/2022] [Accepted: 03/09/2023] [Indexed: 09/30/2023] Open
Abstract
The lack of publicly available, large, and unbiased datasets is a key bottleneck for the application of machine learning (ML) methods in synthetic chemistry. Data from electronic laboratory notebooks (ELNs) could provide less biased, large datasets, but no such datasets have been made publicly available. The first real-world dataset from the ELNs of a large pharmaceutical company is disclosed and its relationship to high-throughput experimentation (HTE) datasets is described. For chemical yield predictions, a key task in chemical synthesis, an attributed graph neural network (AGNN) performs as well as or better than the best previous models on two HTE datasets for the Suzuki-Miyaura and Buchwald-Hartwig reactions. However, training the AGNN on an ELN dataset does not lead to a predictive model. The implications of using ELN data for training ML-based models are discussed in the context of yield predictions.
Affiliation(s)
- Mandana Saebi
- Department of Computer Science and Engineering and Lucy Family Institute for Data and Society, University of Notre Dame Notre Dame IN 46556 USA
| | - Bozhao Nan
- Department of Chemistry and Biochemistry, University of Notre Dame Notre Dame IN 46556 USA
| | - John E Herr
- Department of Chemistry and Biochemistry, University of Notre Dame Notre Dame IN 46556 USA
| | - Jessica Wahlers
- Department of Chemistry and Biochemistry, University of Notre Dame Notre Dame IN 46556 USA
| | - Zhichun Guo
- Department of Computer Science and Engineering and Lucy Family Institute for Data and Society, University of Notre Dame Notre Dame IN 46556 USA
| | - Andrzej M Zurański
- Department of Chemistry, Princeton University Princeton New Jersey 08544 USA
| | - Thierry Kogej
- Molecular AI, Discovery Sciences, R&D, AstraZeneca Pepparedsleden 1, SE-431 83 Mölndal Gothenburg Sweden
| | - Per-Ola Norrby
- Data Science and Modelling, Pharmaceutical Sciences, R&D, AstraZeneca Pepparedsleden 1, SE-431 83 Mölndal Gothenburg Sweden
| | - Abigail G Doyle
- Department of Chemistry, Princeton University Princeton New Jersey 08544 USA
- Department of Chemistry and Biochemistry, University of California Los Angeles California 90095 USA
| | - Nitesh V Chawla
- Department of Computer Science and Engineering and Lucy Family Institute for Data and Society, University of Notre Dame Notre Dame IN 46556 USA
| | - Olaf Wiest
- Department of Chemistry and Biochemistry, University of Notre Dame Notre Dame IN 46556 USA
| |
|
24
|
Bustillo L, Rodrigues T. A focus on the use of real-world datasets for yield prediction. Chem Sci 2023; 14:4958-4960. [PMID: 37206402 PMCID: PMC10189867 DOI: 10.1039/d3sc90069j] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/21/2023] Open
Abstract
The prediction of reaction yields remains a challenging task for machine learning (ML), given the vast search spaces and absence of robust training data. Wiest, Chawla et al. (https://doi.org/10.1039/D2SC06041H) show that a deep learning algorithm performs well on high-throughput experimentation data but surprisingly poorly on real-world, historical data from a pharmaceutical company. The result suggests that there is considerable room for improvement when coupling ML to electronic laboratory notebook data.
Affiliation(s)
- Latimah Bustillo
- Research Institute for Medicines (iMed), Faculty of Pharmacy, University of Lisbon Av Prof Gama Pinto 1649-003 Lisbon Portugal
| | - Tiago Rodrigues
- Research Institute for Medicines (iMed), Faculty of Pharmacy, University of Lisbon Av Prof Gama Pinto 1649-003 Lisbon Portugal
| |
|
25
|
Jablonka K, Rosen AS, Krishnapriyan AS, Smit B. An Ecosystem for Digital Reticular Chemistry. ACS CENTRAL SCIENCE 2023; 9:563-581. [PMID: 37122448 PMCID: PMC10141625 DOI: 10.1021/acscentsci.2c01177] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/03/2023]
Abstract
The vastness of the materials design space makes it impractical to explore using traditional brute-force methods, particularly in reticular chemistry. However, machine learning has shown promise in expediting and guiding materials design. Despite numerous successful applications of machine learning to reticular materials, progress in the field has stagnated, possibly because digital chemistry is more an art than a science and has limited accessibility for inexperienced researchers. To address this issue, we present mofdscribe, a software ecosystem tailored to novice and seasoned digital chemists that streamlines the ideation, modeling, and publication process. Though optimized for reticular chemistry, our tools are versatile and can be used in nonreticular materials research. We believe that mofdscribe will enable a more reliable, efficient, and comparable field of digital chemistry.
Affiliation(s)
- Kevin Maik Jablonka
- Laboratory of molecular simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL), Rue de l’Industrie 17, CH-1951 Sion, Switzerland
| | - Andrew S. Rosen
- Department of Materials Science and Engineering, University of California, Berkeley, California 94720, United States
- Miller Institute for Basic Research in Science, University of California, Berkeley, California 94720, United States
- Materials Science Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, United States
| | - Aditi S. Krishnapriyan
- Department of Chemical and Biomolecular Engineering, University of California, Berkeley, California 94720, United States
- Department of Electrical Engineering and Computer Science, University of California, Berkeley, California 94720, United States
- Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, United States
| | - Berend Smit
- Laboratory of molecular simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL), Rue de l’Industrie 17, CH-1951 Sion, Switzerland
| |
|
26
|
Kee CW. Molecular Understanding and Practical In Silico Catalyst Design in Computational Organocatalysis and Phase Transfer Catalysis-Challenges and Opportunities. Molecules 2023; 28:1715. [PMID: 36838703 PMCID: PMC9966076 DOI: 10.3390/molecules28041715] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 12/31/2022] [Revised: 02/03/2023] [Accepted: 02/05/2023] [Indexed: 02/25/2023] Open
Abstract
Through the lens of organocatalysis and phase transfer catalysis, we will examine the key components to calculate or predict catalysis-performance metrics, such as turnover frequency and measurement of stereoselectivity, via computational chemistry. The state-of-the-art tools available to calculate potential energy and, consequently, free energy, together with their caveats, will be discussed via examples from the literature. Through various examples from organocatalysis and phase transfer catalysis, we will highlight the challenges related to the mechanism, transition state theory, and solvation involved in translating calculated barriers to the turnover frequency or a metric of stereoselectivity. Examples in the literature that validated their theoretical models will be showcased. Lastly, the relevance and opportunity afforded by machine learning will be discussed.
Affiliation(s)
- Choon Wee Kee
- Institute of Sustainability for Chemicals, Energy and Environment (ISCE2), Agency for Science, Technology and Research (A*STAR), 1 Pesek Road, Jurong Island, Singapore 627833, Republic of Singapore
| |
|
27
|
Angello NH, Rathore V, Beker W, Wołos A, Jira ER, Roszak R, Wu TC, Schroeder CM, Aspuru-Guzik A, Grzybowski BA, Burke MD. Closed-loop optimization of general reaction conditions for heteroaryl Suzuki-Miyaura coupling. Science 2022; 378:399-405. [DOI: 10.1126/science.adc8743] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
Abstract
General conditions for organic reactions are important but rare, and efforts to identify them usually consider only narrow regions of chemical space. Discovering more general reaction conditions requires considering vast regions of chemical space derived from a large matrix of substrates crossed with a high-dimensional matrix of reaction conditions, rendering exhaustive experimentation impractical. Here, we report a simple closed-loop workflow that leverages data-guided matrix down-selection, uncertainty-minimizing machine learning, and robotic experimentation to discover general reaction conditions. Application to the challenging and consequential problem of heteroaryl Suzuki-Miyaura cross-coupling identified conditions that double the average yield relative to a widely used benchmark that was previously developed using traditional approaches. This study provides a practical road map for solving multidimensional chemical optimization problems with large search spaces.
Affiliation(s)
- Nicholas H. Angello
  - Department of Chemistry, University of Illinois at Urbana-Champaign, Urbana, IL, USA
  - Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Vandana Rathore
  - Department of Chemistry, University of Illinois at Urbana-Champaign, Urbana, IL, USA
  - Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Agnieszka Wołos
  - Allchemy, Inc., Highland, IN, USA
  - Institute of Organic Chemistry, Polish Academy of Sciences, Warsaw, Poland
- Edward R. Jira
  - Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
  - Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Rafał Roszak
  - Allchemy, Inc., Highland, IN, USA
  - Institute of Organic Chemistry, Polish Academy of Sciences, Warsaw, Poland
- Tony C. Wu
  - Department of Chemistry, University of Toronto, Toronto, ON, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Charles M. Schroeder
  - Department of Chemistry, University of Illinois at Urbana-Champaign, Urbana, IL, USA
  - Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
  - Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA
  - Department of Materials Science and Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Alán Aspuru-Guzik
  - Department of Chemistry, University of Toronto, Toronto, ON, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
  - Vector Institute for Artificial Intelligence, Toronto, ON, Canada
  - Canadian Institute for Advanced Research, Toronto, ON, Canada
  - Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, ON, Canada
- Bartosz A. Grzybowski
  - Allchemy, Inc., Highland, IN, USA
  - Institute of Organic Chemistry, Polish Academy of Sciences, Warsaw, Poland
  - Center for Soft and Living Matter, Institute for Basic Science, Ulsan, Republic of Korea
  - Department of Chemistry, Ulsan Institute of Science and Technology, Ulsan, Republic of Korea
- Martin D. Burke
  - Department of Chemistry, University of Illinois at Urbana-Champaign, Urbana, IL, USA
  - Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
  - Cancer Center at Illinois, University of Illinois at Urbana-Champaign, Urbana, IL, USA
  - Carle Illinois College of Medicine, University of Illinois at Urbana-Champaign, Urbana, IL, USA
|
28
|
Rodrigues T. A special issue on artificial intelligence for drug discovery. Bioorg Med Chem 2022; 70:116939. [PMID: 35853808 DOI: 10.1016/j.bmc.2022.116939] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
Affiliation(s)
- Tiago Rodrigues
  - Faculty of Pharmacy, University of Lisbon, Av Prof Gama Pinto, 1649-003 Lisbon, Portugal.
|