1
Bloodworth S, Willoughby C, Coles SJ. Data accessibility in the chemical sciences: an analysis of recent practice in organic chemistry journals. Beilstein J Org Chem 2025; 21:864-876. [PMID: 40331050 PMCID: PMC12051459 DOI: 10.3762/bjoc.21.70] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2025] [Accepted: 04/23/2025] [Indexed: 05/08/2025] Open
Abstract
The discoverability and reusability of data is critical for machine learning to drive new discovery in the chemical sciences, and the 'FAIR Guiding Principles for scientific data management and stewardship' provide a measurable set of guidelines that can be used to ensure the accessibility of reusable data. We investigate the data practice of researchers publishing in specialist organic chemistry journals, by analyzing the outputs of 240 randomly selected research papers from 12 top-ranked journals published in early 2023. We investigate compliance with recommended (but not compulsory) data policies, assess the accessibility and reusability of data, and examine whether the existence of specific recommendations for publishing NMR data by some journals supports author compliance. We find that, although authors meet mandated requirements, there is very limited compliance with data sharing policies that are only recommended by journals. Overall, there is little evidence to suggest that authors' publishing practice meets FAIR data guidance. We suggest first steps that researchers can take to move towards a positive culture of data sharing in organic chemistry. Routine actions that we encourage as standard practice include deposition of raw data and metadata to open repositories, and inclusion of machine-readable structure identifiers for all reported compounds.
Affiliation(s)
- Sally Bloodworth
- School of Chemistry and Chemical Engineering, University of Southampton, Highfield, Southampton SO17 1BJ, UK
- Cerys Willoughby
- School of Chemistry and Chemical Engineering, University of Southampton, Highfield, Southampton SO17 1BJ, UK
- Simon J Coles
- School of Chemistry and Chemical Engineering, University of Southampton, Highfield, Southampton SO17 1BJ, UK
2
Mroz AM, Basford AR, Hastedt F, Jayasekera IS, Mosquera-Lois I, Sedgwick R, Ballester PJ, Bocarsly JD, Antonio Del Río Chanona E, Evans ML, Frost JM, Ganose AM, Greenaway RL, Kuok Mimi Hii K, Li Y, Misener R, Walsh A, Zhang D, Jelfs KE. Cross-disciplinary perspectives on the potential for artificial intelligence across chemistry. Chem Soc Rev 2025. [PMID: 40278836 PMCID: PMC12024683 DOI: 10.1039/d5cs00146c] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2025] [Indexed: 04/26/2025]
Abstract
From accelerating simulations and exploring chemical space, to experimental planning and integrating automation within experimental labs, artificial intelligence (AI) is changing the landscape of chemistry. We are seeing a significant increase in the number of publications leveraging these powerful data-driven insights and models to accelerate all aspects of chemical research. For example, how we represent molecules and materials to computer algorithms for predictive and generative models, as well as the physical mechanisms by which we perform experiments in the lab for automation. Here, we present ten diverse perspectives on the impact of AI coming from those with a range of backgrounds from experimental chemistry, computational chemistry, computer science, engineering and across different areas of chemistry, including drug discovery, catalysis, chemical automation, chemical physics, materials chemistry. The ten perspectives presented here cover a range of themes, including AI for computation, facilitating discovery, supporting experiments, and enabling technologies for transformation. We highlight and discuss imminent challenges and ways in which we are redefining problems to accelerate the impact of chemical research via AI.
Affiliation(s)
- Austin M Mroz
- Department of Chemistry, Imperial College London, London W12 0BZ, UK.
- I-X Centre for AI in Science, Imperial College London, London W12 0BZ, UK
- Annabel R Basford
- Department of Chemistry, Imperial College London, London W12 0BZ, UK.
- Friedrich Hastedt
- Department of Chemical Engineering, Imperial College London, London SW7 2AZ, UK
- Ruby Sedgwick
- Department of Computing, Imperial College London, London SW7 2AZ, UK
- Pedro J Ballester
- Department of Bioengineering, Imperial College London, London SW7 2AZ, UK
- Joshua D Bocarsly
- Department of Chemistry and Texas Center for Superconductivity, University of Houston, Houston, USA
- Matthew L Evans
- UCLouvain, Institute of Condensed Matter and Nanosciences (IMCN), Chemin des Étoiles 8, Louvain-la-Neuve 1348, Belgium
- Matgenix SRL, A6K Advanced Engineering Center, Charleroi, Belgium
- Datalab Industries Ltd, King's Lynn, Norfolk, UK
- Jarvist M Frost
- Department of Chemistry, Imperial College London, London W12 0BZ, UK.
- Alex M Ganose
- Department of Chemistry, Imperial College London, London W12 0BZ, UK.
- Yingzhen Li
- Department of Computing, Imperial College London, London SW7 2AZ, UK
- Ruth Misener
- Department of Computing, Imperial College London, London SW7 2AZ, UK
- Aron Walsh
- Department of Materials, Imperial College London, London SW7 2AZ, UK
- Dandan Zhang
- I-X Centre for AI in Science, Imperial College London, London W12 0BZ, UK
- Department of Bioengineering, Imperial College London, London SW7 2AZ, UK
- Kim E Jelfs
- Department of Chemistry, Imperial College London, London W12 0BZ, UK.
3
Song W, Sun H. Local reaction condition optimization via machine learning. J Mol Model 2025; 31:143. [PMID: 40266356 DOI: 10.1007/s00894-025-06365-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2025] [Accepted: 03/31/2025] [Indexed: 04/24/2025]
Abstract
CONTEXT Reaction condition optimization addresses shared requirements across academia and industry, particularly in chemistry, pharmaceutical development, and fine chemical engineering. This review examines recent progress and persistent challenges in machine learning-guided optimization of localized reaction conditions, with an emphasis on three core aspects: dataset, condition representation, and optimization methods, as well as the main issues in each related stage. The review explores challenges such as dataset scarcity, data quality, and the "completeness trap" in the dataset preparation stage, summarizes the limitations of current molecular representation techniques in the condition representation stage, and discusses the search efficiency challenges of optimization methods in the optimization stage. METHODS The review analyzes the molecular representation techniques and identifies them as the primary bottleneck in advancing localized reaction condition optimization. It further examines existing optimization methodologies. Among them, Bayesian optimization and active learning emerge as the most commonly applied approaches in this field, utilizing incremental learning mechanisms and human-in-the-loop strategies to minimize experimental data requirements while mitigating molecular representation limitations. The review concludes that advancements in molecular representation techniques are essential for developing more efficient optimization methods in the future.
Affiliation(s)
- Wenhuan Song
- School of Mechanical, Electrical & Information Engineering, Shandong University, Weihai, 264209, China.
- Honggang Sun
- School of Mechanical, Electrical & Information Engineering, Shandong University, Weihai, 264209, China
4
Singh S, Hernández-Lobato JM. A meta-learning approach for selectivity prediction in asymmetric catalysis. Nat Commun 2025; 16:3599. [PMID: 40234410 PMCID: PMC12000603 DOI: 10.1038/s41467-025-58854-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/01/2024] [Accepted: 03/31/2025] [Indexed: 04/17/2025] Open
Abstract
Transition metal-catalyzed asymmetric reactions are of high contemporary importance in organic synthesis. Recently, machine learning (ML) has shown promise in accelerating the development of newer catalytic protocols. However, the need for large amounts of experimental data can present a bottleneck for implementing ML models. Here, we propose a meta-learning workflow that can harness the literature-derived data to extract shared reaction features and requires only a few examples to predict the outcome of new reactions. Prototypical networks are used as a meta-learning method to predict the enantioselectivity of asymmetric hydrogenation of olefins. This meta-learning model consistently provides significant performance improvement over other popular ML methods such as random forests and graph neural networks. The performance of our meta-model is analyzed with varying sizes of training examples to demonstrate its utility even with limited data. A good model performance on an out-of-sample test set further indicates the general applicability of our approach. We believe this work will provide a leap forward in identifying promising reactions in the early phases of reaction development when minimal data is available.
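The prototypical-network idea named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the task is cast as classification, that reaction embeddings have already been produced by some trained encoder (here they are hand-written two-dimensional vectors), and the labels `high_ee` and `low_ee` are hypothetical. Each class prototype is the mean of its few support embeddings, and a query is assigned to the nearest prototype.

```python
from math import dist  # Euclidean distance, Python 3.8+


def class_prototypes(support):
    """Return one prototype per class: the mean of its support embeddings.

    support maps a class label to a list of embedding vectors.
    """
    prototypes = {}
    for label, vectors in support.items():
        count = len(vectors)
        # zip(*vectors) iterates over embedding dimensions.
        prototypes[label] = [sum(column) / count for column in zip(*vectors)]
    return prototypes


def predict(query, prototypes):
    """Assign the query embedding to the class with the nearest prototype."""
    return min(prototypes, key=lambda label: dist(query, prototypes[label]))


# Hypothetical few-shot support set: two examples per class.
support = {
    "high_ee": [[1.0, 1.0], [1.2, 0.8]],
    "low_ee": [[-1.0, -1.0], [-0.8, -1.2]],
}
prototypes = class_prototypes(support)
label = predict([0.9, 1.1], prototypes)
print(label)
```

In a real prototypical network the encoder that produces the embeddings is trained episodically so that this nearest-prototype rule works well on new tasks; the sketch only shows the inference step.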
Affiliation(s)
- Sukriti Singh
- Department of Engineering, University of Cambridge, Cambridge, UK.
5
Kwon Y, Jeon H, Choi J, Choi YS, Kang S. Enhancing chemical reaction search through contrastive representation learning and human-in-the-loop. J Cheminform 2025; 17:51. [PMID: 40211385 PMCID: PMC11987336 DOI: 10.1186/s13321-025-00987-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2025] [Accepted: 03/15/2025] [Indexed: 04/13/2025] Open
Abstract
In synthesis planning, identifying and optimizing chemical reactions are important for the successful design of synthetic pathways to target substances. Chemical reaction databases assist chemists in gaining insights into this process. Traditionally, searching for relevant records from a reaction database has relied on the manual formulation of queries by chemists based on their search purposes, which is challenging without explicit knowledge of what they are searching for. In this study, we propose an intelligent chemical reaction search system that simplifies the process of enhancing the search results. When a user submits a query, a list of relevant records is retrieved from the reaction database. Users can express their preferences and requirements by providing binary ratings for the individual retrieved records. The search results are refined based on the user feedback. To implement this system effectively, we incorporate and adapt contrastive representation learning, dimensionality reduction, and human-in-the-loop techniques. Contrastive learning is used to train a representation model that embeds records in the reaction database as numerical vectors suitable for chemical reaction searches. Dimensionality reduction is applied to compress these vectors, thereby enhancing the search efficiency. Human-in-the-loop is integrated to iteratively update the representation model by reflecting user feedback. Through experimental investigations, we demonstrate that the proposed method effectively improves the chemical reaction search towards better alignment with user preferences and requirements. Scientific contribution This study seeks to enhance the search functionality of chemical reaction databases by drawing inspiration from recommender systems. The proposed method simplifies the search process, offering an alternative to the complexity of formulating explicit query rules. 
We believe that the proposed method can assist users in efficiently discovering records relevant to target reactions, especially when they encounter difficulties in crafting detailed queries due to limited knowledge.
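The search-and-refine loop described in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' system: reaction records are represented by hypothetical hand-written vectors rather than learned embeddings, and the feedback step uses classic Rocchio relevance feedback, which moves the query vector, whereas the abstract describes updating the representation model itself. The record names and the weights `alpha`, `beta`, and `gamma` are illustrative.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, records, k=2):
    """Return the ids of the k records most similar to the query vector."""
    ranked = sorted(records, key=lambda rid: cosine(query, records[rid]), reverse=True)
    return ranked[:k]


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def refine(query, records, ratings, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio update: pull the query toward liked records, away from disliked ones.

    ratings maps a record id to True (relevant) or False (not relevant).
    """
    liked = [records[rid] for rid, relevant in ratings.items() if relevant]
    disliked = [records[rid] for rid, relevant in ratings.items() if not relevant]
    updated = [alpha * q for q in query]
    if liked:
        updated = [u + beta * m for u, m in zip(updated, mean_vector(liked))]
    if disliked:
        updated = [u - gamma * m for u, m in zip(updated, mean_vector(disliked))]
    return updated


# Hypothetical embedded reaction records.
records = {"rxn_a": [1.0, 0.0], "rxn_b": [0.7, 0.7], "rxn_c": [0.0, 1.0]}
query = [1.0, 0.2]

first_results = search(query, records)
# The user gives binary ratings on the retrieved records.
ratings = {"rxn_a": False, "rxn_b": True}
second_results = search(refine(query, records, ratings), records)
print(first_results, second_results)
```

After one round of feedback the record rated as relevant moves to the top of the ranking, which is the behaviour the abstract attributes to its human-in-the-loop step.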
Affiliation(s)
- Youngchun Kwon
- Samsung Advanced Institute of Technology, Samsung Electronics Co. Ltd., 130 Samsung-ro, Yeongtong-gu, Suwon, Republic of Korea
- Hyunjeong Jeon
- Samsung Advanced Institute of Technology, Samsung Electronics Co. Ltd., 130 Samsung-ro, Yeongtong-gu, Suwon, Republic of Korea
- Joonhyuk Choi
- Samsung Advanced Institute of Technology, Samsung Electronics Co. Ltd., 130 Samsung-ro, Yeongtong-gu, Suwon, Republic of Korea
- Youn-Suk Choi
- Samsung Advanced Institute of Technology, Samsung Electronics Co. Ltd., 130 Samsung-ro, Yeongtong-gu, Suwon, Republic of Korea.
- Seokho Kang
- Department of Industrial Engineering, Sungkyunkwan University, 2066 Seobu-ro, Jangan-gu, Suwon, Republic of Korea.
6
Sletten ET, Wolf JB, Danglad-Flores J, Seeberger PH. Carbohydrate Synthesis is Entering the Data-Driven Digital Era. Chemistry 2025:e202500289. [PMID: 40178205 DOI: 10.1002/chem.202500289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2025] [Revised: 03/27/2025] [Accepted: 03/28/2025] [Indexed: 04/05/2025]
Abstract
Glycans are vital in biological processes, but their nontemplated, heterogeneous structures complicate structure-function analyses. Glycosylation, the key reaction in synthetic glycochemistry, remains not entirely predictable due to its complex mechanism and the need for protecting groups that impact reaction outcomes. This concept highlights recent advancements in glycochemistry and emphasizes the integration of digital tools, including automation, computational modelling, and data management, to improve carbohydrate synthesis and support further progress in the field.
Affiliation(s)
- Eric T Sletten
- Max Planck Institute of Colloids and Interfaces, Potsdam Science Park, Am Mühlenberg 1, 14476, Potsdam, Germany
- Jakob B Wolf
- Max Planck Institute of Colloids and Interfaces, Potsdam Science Park, Am Mühlenberg 1, 14476, Potsdam, Germany
- Institut für Chemie, Biochemie und Pharmazie, Freie Universität Berlin, Takusstraße 3, 14195, Berlin, Germany
- José Danglad-Flores
- Max Planck Institute of Colloids and Interfaces, Potsdam Science Park, Am Mühlenberg 1, 14476, Potsdam, Germany
- Peter H Seeberger
- Max Planck Institute of Colloids and Interfaces, Potsdam Science Park, Am Mühlenberg 1, 14476, Potsdam, Germany
- Institut für Chemie, Biochemie und Pharmazie, Freie Universität Berlin, Takusstraße 3, 14195, Berlin, Germany
7
Sigmund LM, Assante M, Johansson MJ, Norrby PO, Jorner K, Kabeshov M. Computational tools for the prediction of site- and regioselectivity of organic reactions. Chem Sci 2025; 16:5383-5412. [PMID: 40070469 PMCID: PMC11891785 DOI: 10.1039/d5sc00541h] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2025] [Accepted: 03/03/2025] [Indexed: 03/14/2025] Open
Abstract
The regio- and site-selectivity of organic reactions is one of the most important aspects when it comes to synthesis planning. Due to that, massive research efforts were invested into computational models for regio- and site-selectivity prediction, and the introduction of machine learning to the chemical sciences within the past decade has added a whole new dimension to these endeavors. This review article walks through the currently available predictive tools for regio- and site-selectivity with a particular focus on machine learning models while being organized along the individual reaction classes of organic chemistry. Respective featurization techniques and model architectures are described and compared to each other; applications of the tools to critical real-world examples are highlighted. This paper aims to serve as an overview of the field's status quo for both the intended users of the tools, that is synthetic chemists, as well as for developers to find potential new research avenues.
Affiliation(s)
- Lukas M Sigmund
- Molecular AI, Discovery Sciences, R&D, AstraZeneca Gothenburg Pepparedsleden 1 43183 Mölndal Sweden
- Michele Assante
- Innovation Centre in Digital Molecular Technologies, Department of Chemistry, University of Cambridge Lensfield Rd Cambridge CB2 1EW UK
- Compound Synthesis & Management, The Discovery Centre, AstraZeneca Cambridge Cambridge Biomedical Campus, 1 Francis Crick Avenue CB2 0AA Cambridge UK
- Magnus J Johansson
- Medicinal Chemistry, Research and Early Development, Cardiovascular, Renal and Metabolism (CVRM), BioPharmaceuticals, R&D, AstraZeneca Gothenburg Pepparedsleden 1 43183 Mölndal Sweden
- Per-Ola Norrby
- Data Science & Modelling, Pharmaceutical Sciences, R&D, AstraZeneca Gothenburg Pepparedsleden 1 43183 Mölndal Sweden
- Kjell Jorner
- ETH Zürich, Institute of Chemical and Bioengineering, Department of Chemistry and Applied Biosciences Vladimir-Prelog-Weg 1 CH-8093 Zürich Switzerland
- National Centre of Competence in Research (NCCR) Catalysis, ETH Zurich Zurich Switzerland
- Mikhail Kabeshov
- Molecular AI, Discovery Sciences, R&D, AstraZeneca Gothenburg Pepparedsleden 1 43183 Mölndal Sweden
8
Kozlov KS, Boiko DA, Burykina JV, Ilyushenkova VV, Kostyukovich AY, Patil ED, Ananikov VP. Discovering organic reactions with a machine-learning-powered deciphering of tera-scale mass spectrometry data. Nat Commun 2025; 16:2587. [PMID: 40090941 PMCID: PMC11911446 DOI: 10.1038/s41467-025-56905-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2024] [Accepted: 01/30/2025] [Indexed: 03/19/2025] Open
Abstract
The accumulation of large datasets by the scientific community has surpassed the capacity of traditional processing methods, underscoring the critical need for innovative and efficient algorithms capable of navigating through extensive existing experimental data. Addressing this challenge, our study introduces a machine learning (ML)-powered search engine specifically tailored for analyzing tera-scale high-resolution mass spectrometry (HRMS) data. This engine harnesses a novel isotope-distribution-centric search algorithm augmented by two synergistic ML models, assisting with the discovery of hitherto unknown chemical reactions. This methodology enables the rigorous investigation of existing data, thus providing efficient support for chemical hypotheses while reducing the need for conducting additional experiments. Moreover, we extend this approach with baseline methods for automated reaction hypothesis generation. In its practical validation, our approach successfully identified several reactions, unveiling previously undescribed transformations. Among these, the heterocycle-vinyl coupling process within the Mizoroki-Heck reaction stands out, highlighting the capability of the engine to elucidate complex chemical phenomena.
Affiliation(s)
- Konstantin S Kozlov
- Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, Leninsky Prospekt 47, Moscow, Russia
- Daniil A Boiko
- Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, Leninsky Prospekt 47, Moscow, Russia
- Julia V Burykina
- Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, Leninsky Prospekt 47, Moscow, Russia
- Valentina V Ilyushenkova
- Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, Leninsky Prospekt 47, Moscow, Russia
- Center for Energy Science and Technology, Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30, bld. 1, Moscow, Russia
- Alexander Y Kostyukovich
- Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, Leninsky Prospekt 47, Moscow, Russia
- Ekaterina D Patil
- Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, Leninsky Prospekt 47, Moscow, Russia
- Center for Energy Science and Technology, Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30, bld. 1, Moscow, Russia
- Valentine P Ananikov
- Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, Leninsky Prospekt 47, Moscow, Russia.
9
Long L, Li R, Zhang J. Artificial Intelligence in Retrosynthesis Prediction and its Applications in Medicinal Chemistry. J Med Chem 2025; 68:2333-2355. [PMID: 39883477 DOI: 10.1021/acs.jmedchem.4c02749] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/31/2025]
Abstract
Retrosynthesis is a strategy to analyze the synthetic routes for target molecules in medicinal chemistry. However, traditional retrosynthesis predictions performed by chemists and rule-based expert systems struggle to adapt to the vast chemical space of real-world scenarios. Artificial intelligence (AI) has revolutionized retrosynthesis prediction in recent decades, significantly increasing the accuracy and diversity of predictions for target compounds. Single-step AI-driven retrosynthesis models can be generalized into three types based on their dependence on predefined reaction templates (template-based, semitemplate-based methods, template-free models), with respective advantages and limitations, and common challenges that limit their medicinal chemistry applications. Moreover, there are relatively inadequate multi-step retrosynthesis methods, which lack strong links with single-step methods. Herein, we review the recent advancements in AI applications for retrosynthesis prediction by summarizing related techniques and the landscape of current representative retrosynthesis models and propose feasible solutions to tackle existing problems and outline future directions in this field.
Affiliation(s)
- Lanxin Long
- Medicinal Chemistry and Bioinformatics Center, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
- Rui Li
- Medicinal Chemistry and Bioinformatics Center, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
- State Key Laboratory of Medical Genomics, National Research Center for Translational Medicine at Shanghai, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
- Jian Zhang
- Medicinal Chemistry and Bioinformatics Center, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
- State Key Laboratory of Medical Genomics, National Research Center for Translational Medicine at Shanghai, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
- Key Laboratory of Protection, Development, and Utilization of Medicinal Resources in Liupanshan Area, Ministry of Education, Peptides & Protein Drug Research Center, School of Pharmacy, Ningxia Medical University, Yinchuan 750004, China
10
Mulka R, Su D, Huang WS, Zhang L, Huang H, Lai X, Li Y, Xue XS. FluoBase: a fluorinated agents database. J Cheminform 2025; 17:19. [PMID: 39934826 DOI: 10.1186/s13321-025-00949-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2024] [Accepted: 01/06/2025] [Indexed: 02/13/2025] Open
Abstract
Organofluorine compounds, owing to their unique physicochemical properties, play an increasingly crucial role in fields such as medicine, pesticides, and advanced materials. Fluorinated reagents are indispensable for developing efficient synthetic methods for organofluorine compounds and serve as the cornerstone of organofluorine chemistry. Equally important are fluorinated functional molecules, which contribute specific properties necessary for applications in pharmaceuticals, agrochemicals, and materials science. However, information about these agents' structure, properties, and functions is scattered throughout vast literature, making it inconvenient for synthetic chemists to access and utilize them effectively. Recognizing the need for a dedicated and organized resource, we present FluoBase, a comprehensive fluorinated agents database designed to streamline access to key information about fluorinated agents. FluoBase aims to become the premier resource for information related to fluorine chemistry, serving the scientific community and anyone interested in the applications of fluorine chemistry and machine learning for property predictions. FluoBase is freely available at https://fluobase.siochemdb.com. Scientific contribution FluoBase is a database designed to provide comprehensive information on the structures, properties, and functions of fluorinated agents and functional molecules. FluoBase aims to become the premier resource for fluorine chemistry, serving the scientific community and anyone interested in the applications of fluorine chemistry and machine learning for property predictions.
Affiliation(s)
- Rafal Mulka
- State Key Laboratory of Fluorine and Nitrogen Chemistry and Advanced Materials, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, University of Chinese Academy of Sciences, 345 Lingling Road, Shanghai, 200032, China
- Dan Su
- School of Chemistry and Material Sciences, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, 1 Sub-Lane Xiangshan, Hangzhou, 310024, China
- Wen-Shuo Huang
- School of Chemistry and Material Sciences, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, 1 Sub-Lane Xiangshan, Hangzhou, 310024, China
- Li Zhang
- School of Chemistry and Material Sciences, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, 1 Sub-Lane Xiangshan, Hangzhou, 310024, China
- Huaihai Huang
- School of Chemistry and Material Sciences, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, 1 Sub-Lane Xiangshan, Hangzhou, 310024, China
- Xiaoyu Lai
- State Key Laboratory of Fluorine and Nitrogen Chemistry and Advanced Materials, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, University of Chinese Academy of Sciences, 345 Lingling Road, Shanghai, 200032, China
- Yao Li
- State Key Laboratory of Fluorine and Nitrogen Chemistry and Advanced Materials, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, University of Chinese Academy of Sciences, 345 Lingling Road, Shanghai, 200032, China.
- Xiao-Song Xue
- State Key Laboratory of Fluorine and Nitrogen Chemistry and Advanced Materials, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, University of Chinese Academy of Sciences, 345 Lingling Road, Shanghai, 200032, China.
- School of Chemistry and Material Sciences, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, 1 Sub-Lane Xiangshan, Hangzhou, 310024, China.
11
Pržulj N, Malod-Dognin N. Simplicity within biological complexity. BIOINFORMATICS ADVANCES 2025; 5:vbae164. [PMID: 39927291 PMCID: PMC11805345 DOI: 10.1093/bioadv/vbae164] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2024] [Revised: 10/01/2024] [Accepted: 10/23/2024] [Indexed: 02/11/2025]
Abstract
Motivation Heterogeneous, interconnected, systems-level, molecular (multi-omic) data have become increasingly available and key in precision medicine. We need to utilize them to better stratify patients into risk groups, discover new biomarkers and targets, repurpose known and discover new drugs to personalize medical treatment. Existing methodologies are limited and a paradigm shift is needed to achieve quantitative and qualitative breakthroughs. Results In this perspective paper, we survey the literature and argue for the development of a comprehensive, general framework for embedding of multi-scale molecular network data that would enable their explainable exploitation in precision medicine in linear time. Network embedding methods (also called graph representation learning) map nodes to points in low-dimensional space, so that proximity in the learned space reflects the network's topology-function relationships. They have recently achieved unprecedented performance on hard problems of utilizing few omic data in various biomedical applications. However, research thus far has been limited to special variants of the problems and data, with the performance depending on the underlying topology-function network biology hypotheses, the biomedical applications, and evaluation metrics. The availability of multi-omic data, modern graph embedding paradigms and compute power call for a creation and training of efficient, explainable and controllable models, having no potentially dangerous, unexpected behaviour, that make a qualitative breakthrough. We propose to develop a general, comprehensive embedding framework for multi-omic network data, from models to efficient and scalable software implementation, and to apply it to biomedical informatics, focusing on precision medicine and personalized drug discovery. 
It will lead to a paradigm shift in the computational and biomedical understanding of data and diseases that will open up ways to solve some of the major bottlenecks in precision medicine and other domains.
Affiliation(s)
- Nataša Pržulj
- Computational Biology Department, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, 00000, United Arab Emirates
- Barcelona Supercomputing Center, Barcelona 08034, Spain
- Department of Computer Science, University College London, London WC1E6BT, United Kingdom
- ICREA, Pg. Lluís Companys 23, Barcelona 08010, Spain
12
Schilling-Wilhelmi M, Ríos-García M, Shabih S, Gil MV, Miret S, Koch CT, Márquez JA, Jablonka KM. From text to insight: large language models for chemical data extraction. Chem Soc Rev 2025; 54:1125-1150. [PMID: 39703015 DOI: 10.1039/d4cs00913d] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/21/2024]
Abstract
The vast majority of chemical knowledge exists in unstructured natural language, yet structured data is crucial for innovative and systematic materials design. Traditionally, the field has relied on manual curation and partial automation for data extraction for specific use cases. The advent of large language models (LLMs) represents a significant shift, potentially enabling non-experts to extract structured, actionable data from unstructured text efficiently. While applying LLMs to chemical and materials science data extraction presents unique challenges, domain knowledge offers opportunities to guide and validate LLM outputs. This tutorial review provides a comprehensive overview of LLM-based structured data extraction in chemistry, synthesizing current knowledge and outlining future directions. We address the lack of standardized guidelines and present frameworks for leveraging the synergy between LLMs and chemical expertise. This work serves as a foundational resource for researchers aiming to harness LLMs for data-driven chemical research. The insights presented here could significantly enhance how researchers across chemical disciplines access and utilize scientific information, potentially accelerating the development of novel compounds and materials for critical societal needs.
Affiliation(s)
- Mara Schilling-Wilhelmi
- Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena, Humboldtstrasse 10, 07743 Jena, Germany.
- Martiño Ríos-García
- Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena, Humboldtstrasse 10, 07743 Jena, Germany.
- Institute of Carbon Science and Technology (INCAR), CSIC, Francisco Pintado Fe 26, 33011 Oviedo, Spain
- Sherjeel Shabih
- Department of Physics and CSMB, Humboldt-Universität zu Berlin, Berlin, Germany
- María Victoria Gil
- Institute of Carbon Science and Technology (INCAR), CSIC, Francisco Pintado Fe 26, 33011 Oviedo, Spain
- Christoph T Koch
- Department of Physics and CSMB, Humboldt-Universität zu Berlin, Berlin, Germany
- José A Márquez
- Department of Physics and CSMB, Humboldt-Universität zu Berlin, Berlin, Germany
- Kevin Maik Jablonka
- Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena, Humboldtstrasse 10, 07743 Jena, Germany.
- Center for Energy and Environmental Chemistry Jena (CEEC Jena), Friedrich Schiller University Jena, Philosophenweg 7a, 07743 Jena, Germany
- Helmholtz Institute for Polymers in Energy Applications Jena (HIPOLE Jena), Lessingstrasse 12-14, 07743 Jena, Germany
13
Jin D, Liang Y, Xiong Z, Yang X, Wang H, Zeng J, Gu S. Application of Transformers to Chemical Synthesis. Molecules 2025; 30:493. [PMID: 39942600 PMCID: PMC11821105 DOI: 10.3390/molecules30030493] [Received: 12/08/2024] [Revised: 01/09/2025] [Accepted: 01/10/2025] [Indexed: 02/16/2025] Open
Abstract
Efficient chemical synthesis is critical for the production of organic chemicals, particularly in the pharmaceutical industry. Leveraging machine learning to predict chemical synthesis and improve the development efficiency has become a significant research focus in modern chemistry. Among various machine learning models, the Transformer, a leading model in natural language processing, has revolutionized numerous fields due to its powerful feature-extraction and representation-learning capabilities. Recent applications demonstrated that Transformer models can also significantly enhance the performance in chemical synthesis tasks, particularly in reaction prediction and retrosynthetic planning. This article provides a comprehensive review of the applications and innovations of Transformer models in the qualitative prediction tasks of chemical synthesis, with a focus on technical approaches, performance advantages, and the challenges associated with applying the Transformer architecture to chemical reactions. Furthermore, we discuss the future directions for improving the applications of Transformer models in chemical synthesis.
Affiliation(s)
- Dong Jin
- School of Chemical Engineering & Pharmacy, Pharmaceutical Research Institute, Wuhan Institute of Technology, Wuhan 430205, China
- Yuli Liang
- School of Chemical Engineering & Pharmacy, Pharmaceutical Research Institute, Wuhan Institute of Technology, Wuhan 430205, China
- Zihao Xiong
- School of Chemical Engineering & Pharmacy, Pharmaceutical Research Institute, Wuhan Institute of Technology, Wuhan 430205, China
- Xiaojie Yang
- Hubei Key Laboratory of Radiation Chemistry and Functional Materials, School of Nuclear Technology and Chemistry & Biology, Hubei University of Science and Technology, Xianning 437100, China
- Haifeng Wang
- School of Chemical Engineering & Pharmacy, Pharmaceutical Research Institute, Wuhan Institute of Technology, Wuhan 430205, China
- Jie Zeng
- School of Chemical Engineering & Pharmacy, Pharmaceutical Research Institute, Wuhan Institute of Technology, Wuhan 430205, China
- Shuangxi Gu
- School of Chemical Engineering & Pharmacy, Pharmaceutical Research Institute, Wuhan Institute of Technology, Wuhan 430205, China
14
Sommer T, Clarke C, García-Melchor M. Beyond chemical structures: lessons and guiding principles for the next generation of molecular databases. Chem Sci 2025; 16:1002-1016. [PMID: 39660292 PMCID: PMC11626465 DOI: 10.1039/d4sc04064c] [Received: 06/20/2024] [Accepted: 11/28/2024] [Indexed: 12/12/2024] Open
Abstract
Databases of molecules and materials are indispensable for advancing chemical research, especially when enriched with electronic structure information from quantum chemistry methods like density functional theory. In this perspective, we review and analyze the current landscape of materials and molecular databases containing quantum chemical data. Our analysis reveals that the materials community has significantly benefited from data platforms such as the Materials Project, which seamlessly integrate chemical structures, electronic structure data, and open-source software. Conversely, quantum chemical data for molecular systems remains largely fragmented across individual datasets, lacking the comprehensive framework of a unified database. We distilled insights from these existing data resources into seven guiding principles termed QUANTUM, which build upon the foundational FAIR principles of data sharing (Findable, Accessible, Interoperable, and Reusable). These principles are aimed at advancing the development of molecular databases into robust, integrated data platforms. We conclude with an outlook on both short- and long-term objectives, guided by these QUANTUM principles, to foster future advancements in molecular quantum databases and enhance their utility for the research community.
Affiliation(s)
- Timo Sommer
- School of Chemistry, CRANN and AMBER Research Centres, Trinity College Dublin, College Green Dublin 2 Ireland
- Cian Clarke
- School of Chemistry, CRANN and AMBER Research Centres, Trinity College Dublin, College Green Dublin 2 Ireland
- Max García-Melchor
- School of Chemistry, CRANN and AMBER Research Centres, Trinity College Dublin, College Green Dublin 2 Ireland
- Center for Cooperative Research on Alternative Energy (CIC EnergiGUNE), Basque Research and Technology Alliance (BRTA), Alava Technology Park Albert Einstein 48 01510 Vitoria-Gasteiz Spain
- IKERBASQUE, Basque Foundation for Science Plaza de Euskadi 5 48009 Bilbao Spain
15
Maziarz K, Tripp A, Liu G, Stanley M, Xie S, Gaiński P, Seidl P, Segler MHS. Re-evaluating retrosynthesis algorithms with Syntheseus. Faraday Discuss 2025; 256:568-586. [PMID: 39485491 DOI: 10.1039/d4fd00093e] [Indexed: 11/03/2024]
Abstract
Automated synthesis planning has recently re-emerged as a research area at the intersection of chemistry and machine learning. Despite the appearance of steady progress, we argue that imperfect benchmarks and inconsistent comparisons mask systematic shortcomings of existing techniques, and unnecessarily hamper progress. To remedy this, we present a synthesis planning library with an extensive benchmarking framework, called SYNTHESEUS, which promotes best practice by default, enabling consistent meaningful evaluation of single-step and multi-step synthesis planning algorithms. We demonstrate the capabilities of SYNTHESEUS by re-evaluating several previous retrosynthesis algorithms, and find that the ranking of state-of-the-art models changes in controlled evaluation experiments. We end with guidance for future works in this area, and call on the community to engage in the discussion on how to improve benchmarks for synthesis planning.
16
Kevlishvili I, St Michel RG, Garrison AG, Toney JW, Adamji H, Jia H, Román-Leshkov Y, Kulik HJ. Leveraging natural language processing to curate the tmCAT, tmPHOTO, tmBIO, and tmSCO datasets of functional transition metal complexes. Faraday Discuss 2025; 256:275-303. [PMID: 39301698 DOI: 10.1039/d4fd00087k] [Indexed: 09/22/2024]
Abstract
The breadth of transition metal chemical space covered by databases such as the Cambridge Structural Database and the derived computational database tmQM is not conducive to application-specific modeling and the development of structure-property relationships. Here, we employ both supervised and unsupervised natural language processing (NLP) techniques to link experimentally synthesized compounds in the tmQM database to their respective applications. Leveraging NLP models, we curate four distinct datasets: tmCAT for catalysis, tmPHOTO for photophysical activity, tmBIO for biological relevance, and tmSCO for magnetism. Analyzing the chemical substructures within each dataset reveals common chemical motifs in each of the designated applications. We then use these common chemical structures to augment our initial datasets for each application, yielding a total of 21 631 compounds in tmCAT, 4599 in tmPHOTO, 2782 in tmBIO, and 983 in tmSCO. These datasets are expected to accelerate the more targeted computational screening and development of refined structure-property relationships with machine learning.
Affiliation(s)
- Ilia Kevlishvili
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
- Roland G St Michel
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
- Department of Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Aaron G Garrison
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
- Jacob W Toney
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
- Husain Adamji
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
- Haojun Jia
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Yuriy Román-Leshkov
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Heather J Kulik
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
17
Cheng AH, Ser CT, Skreta M, Guzmán-Cordero A, Thiede L, Burger A, Aldossary A, Leong SX, Pablo-García S, Strieth-Kalthoff F, Aspuru-Guzik A. Spiers Memorial Lecture: How to do impactful research in artificial intelligence for chemistry and materials science. Faraday Discuss 2025; 256:10-60. [PMID: 39400305 DOI: 10.1039/d4fd00153b] [Indexed: 10/15/2024]
Abstract
Machine learning has been pervasively touching many fields of science. Chemistry and materials science are no exception. While machine learning has been making a great impact, it is still not reaching its full potential or maturity. In this perspective, we first outline current applications across a diversity of problems in chemistry. Then, we discuss how machine learning researchers view and approach problems in the field. Finally, we provide our considerations for maximizing impact when researching machine learning for chemistry.
Affiliation(s)
- Austin H Cheng
- Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada.
- Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
- Cher Tian Ser
- Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada.
- Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
- Marta Skreta
- Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
- Andrés Guzmán-Cordero
- Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
- Tinbergen Institute, University of Amsterdam, Amsterdam, Netherlands
- Luca Thiede
- Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
- Andreas Burger
- Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
- Shi Xuan Leong
- Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada.
- School of Chemistry, Chemical Engineering and Biotechnology, Nanyang Technological University, Singapore 63737, Singapore
- Alán Aspuru-Guzik
- Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada.
- Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
- Acceleration Consortium, Toronto, Ontario M5G 1X6, Canada
- Department of Chemical Engineering and Applied Chemistry, University of Toronto, Canada
- Department of Materials Science and Engineering, University of Toronto, Canada
- Lebovic Fellow, Canadian Institute for Advanced Research (CIFAR), Canada
18
Li X, Meyer MP. Concerted or Stepwise? An Experimental and Computational Study to Reveal the Mechanistic Change as a Result of the Substituent Effects. J Org Chem 2025. [PMID: 39757808 DOI: 10.1021/acs.joc.4c02197] [Indexed: 01/07/2025]
Abstract
This study investigates the Cope elimination reaction, focusing on the mechanistic shift between concerted and stepwise pathways influenced by substituent effects. Experimental approaches, including kinetic isotope effects (KIEs) and linear free energy relationships (LFERs), alongside density functional theory (DFT) computations, were employed to explore the influence of substituents on the reaction kinetics and pathways. Our findings reveal temperature- and substituent-dependent partitioning between the concerted syn-β elimination and the stepwise E1cB mechanism, providing deeper insights into the mechanistic diversity of elimination reactions.
Affiliation(s)
- Xiao Li
- College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, China
- Matthew P Meyer
- Department of Chemistry and Chemical Biology, University of California, Merced, California 95343, United States
19
Nippa DF, Müller AT, Atz K, Konrad DB, Grether U, Martin RE, Schneider G. Simple User-Friendly Reaction Format. Mol Inform 2025; 44:e202400361. [PMID: 39846425 PMCID: PMC11755691 DOI: 10.1002/minf.202400361] [Received: 11/29/2024] [Revised: 01/03/2025] [Accepted: 01/06/2025] [Indexed: 01/24/2025]
Abstract
Utilizing the growing wealth of chemical reaction data can boost synthesis planning and increase success rates. Yet, the effectiveness of machine learning tools for retrosynthesis planning and forward reaction prediction relies on accessible, well-curated data presented in a structured format. Although some public and licensed reaction databases exist, they often lack essential information about reaction conditions. To address this issue and promote the principles of findable, accessible, interoperable, and reusable (FAIR) data reporting and sharing, we introduce the Simple User-Friendly Reaction Format (SURF). SURF standardizes the documentation of reaction data through a structured tabular format, requiring only a basic understanding of spreadsheets. This format enables chemists to record the synthesis of molecules in a format that is understandable by both humans and machines, which facilitates seamless sharing and integration directly into machine learning pipelines. SURF files are designed to be interoperable, easily imported into relational databases, and convertible into other formats. This complements existing initiatives like the Open Reaction Database (ORD) and Unified Data Model (UDM). At Roche, SURF plays a crucial role in democratizing FAIR reaction data sharing and expediting the chemical synthesis process.
Affiliation(s)
- David F. Nippa
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Grenzacherstrasse 124, 4070 Basel, Switzerland
- Department of Pharmacy, Ludwig-Maximilians-Universität München, Butenandtstrasse 5, 81377 Munich, Germany
- Alex T. Müller
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Grenzacherstrasse 124, 4070 Basel, Switzerland
- Kenneth Atz
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Grenzacherstrasse 124, 4070 Basel, Switzerland
- David B. Konrad
- Department of Pharmacy, Ludwig-Maximilians-Universität München, Butenandtstrasse 5, 81377 Munich, Germany
- Uwe Grether
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Grenzacherstrasse 124, 4070 Basel, Switzerland
- Rainer E. Martin
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Grenzacherstrasse 124, 4070 Basel, Switzerland
- Gisbert Schneider
- Department of Biosystems Science and Engineering, ETH Zurich, Klingelbergstrasse 48, 4056 Basel, Switzerland
20
Vangala SR, Krishnan SR, Bung N, Nandagopal D, Ramasamy G, Kumar S, Sankaran S, Srinivasan R, Roy A. Suitability of large language models for extraction of high-quality chemical reaction dataset from patent literature. J Cheminform 2024; 16:131. [PMID: 39593165 PMCID: PMC11590295 DOI: 10.1186/s13321-024-00928-8] [Received: 08/12/2024] [Accepted: 11/10/2024] [Indexed: 11/28/2024] Open
Abstract
With the advent of artificial intelligence (AI), it is now possible to design diverse and novel molecules from previously unexplored chemical space. However, a challenge for chemists is the synthesis of such molecules. Recently, there have been attempts to develop AI models for retrosynthesis prediction, which rely on the availability of a high-quality training dataset. In this work, we explore the suitability of large language models (LLMs) for extraction of high-quality chemical reaction data from patent documents. A comparative study on the same set of patents from an earlier study showed that the proposed automated approach can enhance the current datasets by addition of 26% new reactions. Several challenges were identified during reaction mining, and for some of them alternative solutions were proposed. A detailed analysis was also performed wherein several wrong entries were identified in the previously curated dataset. Reactions extracted using the proposed pipeline over a larger patent dataset can improve the accuracy and efficiency of synthesis prediction models in future. Scientific contribution: In this work we evaluated the suitability of large language models for mining a high-quality chemical reaction dataset from patent literature. We showed that the proposed approach can significantly improve the quantity of the reaction database by identifying more chemical reactions and improve the quality of the reaction database by correcting previous errors/false positives.
Affiliation(s)
- Sarveswara Rao Vangala
- TCS Research (Life Sciences Division), Tata Consultancy Services Limited, Hyderabad, 500081, India
- Navneet Bung
- TCS Research (Life Sciences Division), Tata Consultancy Services Limited, Hyderabad, 500081, India
- Dhandapani Nandagopal
- TCS Research (Life Sciences Division), Tata Consultancy Services Limited, Hyderabad, 500081, India
- Gomathi Ramasamy
- TCS Research (Life Sciences Division), Tata Consultancy Services Limited, Hyderabad, 500081, India
- Satyam Kumar
- TCS Research (Life Sciences Division), Tata Consultancy Services Limited, Hyderabad, 500081, India
- Sridharan Sankaran
- TCS Research (Life Sciences Division), Tata Consultancy Services Limited, Hyderabad, 500081, India
- Rajgopal Srinivasan
- TCS Research (Life Sciences Division), Tata Consultancy Services Limited, Hyderabad, 500081, India
- Arijit Roy
- TCS Research (Life Sciences Division), Tata Consultancy Services Limited, Hyderabad, 500081, India.
21
Chen LY, Li YP. Machine learning-guided strategies for reaction conditions design and optimization. Beilstein J Org Chem 2024; 20:2476-2492. [PMID: 39376489 PMCID: PMC11457048 DOI: 10.3762/bjoc.20.212] [Received: 06/28/2024] [Accepted: 09/19/2024] [Indexed: 10/09/2024] Open
Abstract
This review surveys the recent advances and challenges in predicting and optimizing reaction conditions using machine learning techniques. The paper emphasizes the importance of acquiring and processing large and diverse datasets of chemical reactions, and the use of both global and local models to guide the design of synthetic processes. Global models exploit the information from comprehensive databases to suggest general reaction conditions for new reactions, while local models fine-tune the specific parameters for a given reaction family to improve yield and selectivity. The paper also identifies the current limitations and opportunities in this field, such as the data quality and availability, and the integration of high-throughput experimentation. The paper demonstrates how the combination of chemical engineering, data science, and ML algorithms can enhance the efficiency and effectiveness of reaction conditions design, and enable novel discoveries in synthetic chemistry.
Affiliation(s)
- Lung-Yi Chen
- Department of Chemical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan
- Yi-Pei Li
- Department of Chemical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan
- Taiwan International Graduate Program on Sustainable Chemical Science and Technology (TIGP-SCST), No. 128, Sec. 2, Academia Road, Taipei 11529, Taiwan
22
Han Y, Deng M, Liu K, Chen J, Wang Y, Xu YN, Dian L. Computer-Aided Synthesis Planning (CASP) and Machine Learning: Optimizing Chemical Reaction Conditions. Chemistry 2024; 30:e202401626. [PMID: 39083362 DOI: 10.1002/chem.202401626] [Received: 04/26/2024] [Revised: 07/27/2024] [Accepted: 07/28/2024] [Indexed: 08/02/2024]
Abstract
Computer-aided synthesis planning (CASP) has garnered increasing attention in light of recent advancements in machine learning models. While the focus is on reverse synthesis or forward outcome prediction, optimizing reaction conditions remains a significant challenge. For datasets with multiple variables, the choice of descriptors and models is pivotal. This selection dictates the effective extraction of conditional features and the achievement of higher prediction accuracy. This review delineates the origins of data in conditional optimization, the criteria for descriptor selection, the response models, and the metrics for outcome evaluation, aiming to acquaint readers with the latest research trends and facilitate more informed research in this domain.
Affiliation(s)
- Yu Han
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
- Mingjing Deng
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
- Ke Liu
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
- Jia Chen
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
- Yuting Wang
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
- Yu-Ning Xu
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
- Longyang Dian
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
- Suzhou Institute of Shandong University, No. 388 Ruoshui Road, Suzhou Industrial Park, Suzhou, 215123, P. R. China
23
Ai Q, Meng F, Shi J, Pelkie B, Coley CW. Extracting structured data from organic synthesis procedures using a fine-tuned large language model. Digital Discovery 2024; 3:1822-1831. [PMID: 39157760 PMCID: PMC11322921 DOI: 10.1039/d4dd00091a] [Received: 04/06/2024] [Accepted: 07/30/2024] [Indexed: 08/20/2024]
Abstract
The popularity of data-driven approaches and machine learning (ML) techniques in the field of organic chemistry and its various subfields has increased the value of structured reaction data. Most data in chemistry is represented by unstructured text, and despite the vastness of the organic chemistry literature (papers, patents), conversion from unstructured text to structured data remains a largely manual endeavor. Software tools for this task would facilitate downstream applications such as reaction prediction and condition recommendation. In this study, we fine-tune a large language model (LLM) to extract reaction information from organic synthesis procedure text into structured data following the Open Reaction Database (ORD) schema, a comprehensive data structure designed for organic reactions. The fine-tuned model produces syntactically correct ORD records with an average accuracy of 91.25% for ORD "messages" (e.g., full compound, workups, or condition definitions) and 92.25% for individual data fields (e.g., compound identifiers, mass quantities), with the ability to recognize compound-referencing tokens and to infer reaction roles. We investigate its failure modes and evaluate performance on specific subtasks such as reaction role classification.
Affiliation(s)
- Qianxiang Ai
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge MA USA
- Fanwang Meng
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge MA USA
- Jiale Shi
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge MA USA
- Brenden Pelkie
- Department of Chemical Engineering, University of Washington Seattle WA USA
- Connor W Coley
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge MA USA
24
Schmid SP, Schlosser L, Glorius F, Jorner K. Catalysing (organo-)catalysis: Trends in the application of machine learning to enantioselective organocatalysis. Beilstein J Org Chem 2024; 20:2280-2304. [PMID: 39290209 PMCID: PMC11406055 DOI: 10.3762/bjoc.20.196] [Received: 05/09/2024] [Accepted: 08/09/2024] [Indexed: 09/19/2024] Open
Abstract
Organocatalysis has established itself as a third pillar of homogeneous catalysis, besides transition metal catalysis and biocatalysis, as its use for enantioselective reactions has gathered significant interest over the last decades. Concurrent to this development, machine learning (ML) has been increasingly applied in the chemical domain to efficiently uncover hidden patterns in data and accelerate scientific discovery. While the uptake of ML in organocatalysis has been comparably slow, the last two decades have shown an increased interest from the community. This review gives an overview of the work in the field of ML in organocatalysis. The review starts by giving a short primer on ML for experimental chemists, before discussing its application for predicting the selectivity of organocatalytic transformations. Subsequently, we review ML employed for privileged catalysts, before focusing on its application for catalyst and reaction design. Concluding, we give our view on current challenges and future directions for this field, drawing inspiration from the application of ML to other scientific domains.
Affiliation(s)
- Stefan P Schmid
- Institute of Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich CH-8093, Switzerland
- Leon Schlosser
- Organisch-Chemisches Institut, Universität Münster, 48149 Münster, Germany
- Frank Glorius
- Organisch-Chemisches Institut, Universität Münster, 48149 Münster, Germany
- Kjell Jorner
- Institute of Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich CH-8093, Switzerland
- National Centre of Competence in Research (NCCR) Catalysis, ETH Zurich, Zurich CH-8093, Switzerland
25
Zhang X, Li Y, Li C, Zhu J, Gan Z, Wang L, Sun X, You H. A chemical reaction entity recognition method based on a natural language data augmentation strategy. Chem Commun (Camb) 2024; 60:9610-9613. [PMID: 39148332 DOI: 10.1039/d4cc01471e] [Indexed: 08/17/2024]
Abstract
Impressive applications of artificial intelligence in the field of chemical reaction prediction heavily depend on abundant reliable datasets. The automated extraction of reaction procedures to build structured chemical databases is of growing importance. Here, we propose a novel model named DACRER for large-scale reaction extraction, in which transfer learning and a data augmentation strategy were employed. This model was evaluated for chemical datasets and shows good performance in identifying and processing chemical texts.
Affiliation(s)
- Xiaowen Zhang
- School of Science, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, Guangdong, China.
| | - Yang Li
- School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, Anhui, China
| | - Chaoyi Li
- School of Science, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, Guangdong, China.
| | - Jingyuan Zhu
- School of Science, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, Guangdong, China.
| | - Zhiqiang Gan
- School of Science, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, Guangdong, China.
| | - Lei Wang
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang 277160, Shandong, China
| | - Xiaofei Sun
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang 277160, Shandong, China
| | - Hengzhi You
- School of Science, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, Guangdong, China.
- Green Pharmaceutical Engineering Research Center, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, Guangdong, China
26
|
Meza-González B, Ramírez-Palma DI, Carpio-Martínez P, Vázquez-Cuevas D, Martínez-Mayorga K, Cortés-Guzmán F. Quantum Topological Atomic Properties of 44K molecules. Sci Data 2024; 11:945. [PMID: 39209874 PMCID: PMC11362522 DOI: 10.1038/s41597-024-03723-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2024] [Accepted: 07/29/2024] [Indexed: 09/04/2024] Open
Abstract
We present a data set of quantum topological properties of atoms of 44K randomly selected molecules from the GDB-9 data set. These atomic properties were obtained as defined by the quantum theory of atoms in molecules (QTAIM) within an atomic basin, a region of real space bounded by zero-flux surfaces in the electron density gradient vector field. The wave function files were generated through DFT static calculations (B3LYP/6-31G), and the atomic properties were calculated using QTAIM. The calculated atomic properties include the energy of the atomic basin, the electronic population, the magnitude of the total dipole moment, and the magnitude of the total quadrupole moment. The atomic properties allow one to understand the chemical structure, reactivity, and molecular recognition. They can be incorporated into force fields for molecular dynamics or for predicting reactive sites. We believe that this data set could facilitate new studies in chemical informatics, machine learning applied to chemistry, and computational molecular design.
Affiliation(s)
- Brandon Meza-González
- Facultad de Química, Universidad Nacional Autónoma de México, Ciudad de México, Mexico City, Mexico
| | - David I Ramírez-Palma
- Instituto de Química, Unidad Mérida, Universidad Nacional Autónoma de México, Mérida, Yucatán, Mexico
| | - Pablo Carpio-Martínez
- Centro Conjunto de Investigación en Química Sustentable UAEM-UNAM, Carretera Toluca-Atlacomulco, km. 14.5, Toluca, Estado de México, C.P. 50200, Mexico
| | - David Vázquez-Cuevas
- Instituto de Química, Unidad Mérida, Universidad Nacional Autónoma de México, Mérida, Yucatán, Mexico
| | - Karina Martínez-Mayorga
- Instituto de Química, Unidad Mérida, Universidad Nacional Autónoma de México, Mérida, Yucatán, Mexico
| | - Fernando Cortés-Guzmán
- Facultad de Química, Universidad Nacional Autónoma de México, Ciudad de México, Mexico City, Mexico.
27
|
Tom G, Schmid SP, Baird SG, Cao Y, Darvish K, Hao H, Lo S, Pablo-García S, Rajaonson EM, Skreta M, Yoshikawa N, Corapi S, Akkoc GD, Strieth-Kalthoff F, Seifrid M, Aspuru-Guzik A. Self-Driving Laboratories for Chemistry and Materials Science. Chem Rev 2024; 124:9633-9732. [PMID: 39137296 PMCID: PMC11363023 DOI: 10.1021/acs.chemrev.4c00055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/15/2024]
Abstract
Self-driving laboratories (SDLs) promise an accelerated application of the scientific method. Through the automation of experimental workflows, along with autonomous experimental planning, SDLs hold the potential to greatly accelerate research in chemistry and materials discovery. This review provides an in-depth analysis of the state-of-the-art in SDL technology, its applications across various scientific disciplines, and the potential implications for research and industry. This review additionally provides an overview of the enabling technologies for SDLs, including their hardware, software, and integration with laboratory infrastructure. Most importantly, this review explores the diverse range of scientific domains where SDLs have made significant contributions, from drug discovery and materials science to genomics and chemistry. We provide a comprehensive review of existing real-world examples of SDLs, their different levels of automation, and the challenges and limitations associated with each domain.
Affiliation(s)
- Gary Tom
- Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
| | - Stefan P. Schmid
- Department of Chemistry and Applied Biosciences, ETH Zurich, Vladimir-Prelog-Weg 1, CH-8093 Zurich, Switzerland
| | - Sterling G. Baird
- Acceleration Consortium, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
| | - Yang Cao
- Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Acceleration Consortium, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
| | - Kourosh Darvish
- Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
- Acceleration Consortium, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
| | - Han Hao
- Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Acceleration Consortium, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
| | - Stanley Lo
- Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
| | - Sergio Pablo-García
- Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
| | - Ella M. Rajaonson
- Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Vector Institute for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
| | - Marta Skreta
- Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
| | - Naruki Yoshikawa
- Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
| | - Samantha Corapi
- Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
| | - Gun Deniz Akkoc
- Forschungszentrum Jülich GmbH, Helmholtz Institute for Renewable Energy Erlangen-Nürnberg, Cauerstr. 1, 91058 Erlangen, Germany
- Department of Chemical and Biological Engineering, Friedrich-Alexander Universität Erlangen-Nürnberg, Egerlandstr. 3, 91058 Erlangen, Germany
| | - Felix Strieth-Kalthoff
- Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- School of Mathematics and Natural Sciences, University of Wuppertal, Gaußstraße 20, 42119 Wuppertal, Germany
| | - Martin Seifrid
- Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Department of Materials Science and Engineering, North Carolina State University, Raleigh, North Carolina 27695, United States of America
| | - Alán Aspuru-Guzik
- Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
- Acceleration Consortium, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department of Chemical Engineering & Applied Chemistry, University of Toronto, Toronto, Ontario M5S 3E5, Canada
- Department of Materials Science & Engineering, University of Toronto, Toronto, Ontario M5S 3E4, Canada
- Lebovic Fellow, Canadian Institute for Advanced Research (CIFAR), 661 University Ave, Toronto, Ontario M5G 1M1, Canada
28
|
Gricourt G, Meyer P, Duigou T, Faulon JL. Artificial Intelligence Methods and Models for Retro-Biosynthesis: A Scoping Review. ACS Synth Biol 2024; 13:2276-2294. [PMID: 39047143 PMCID: PMC11334239 DOI: 10.1021/acssynbio.4c00091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2024] [Revised: 06/14/2024] [Accepted: 06/14/2024] [Indexed: 07/27/2024]
Abstract
Retrosynthesis aims to efficiently plan the synthesis of desirable chemicals by strategically breaking down molecules into readily available building block compounds. Having a long history in chemistry, retro-biosynthesis has also been used in the fields of biocatalysis and synthetic biology. Artificial intelligence (AI) is driving us toward new frontiers in synthesis planning and the exploration of chemical spaces, arriving at an opportune moment for promoting bioproduction that would better align with green chemistry, enhancing environmental practices. In this review, we summarize the recent advancements in the application of AI methods and models for retrosynthetic and retro-biosynthetic pathway design. These techniques can be based either on reaction templates or generative models and require scoring functions and planning strategies to navigate through the retrosynthetic graph of possibilities. We finally discuss limitations and promising research directions in this field.
Affiliation(s)
- Guillaume Gricourt
- Université Paris-Saclay, INRAE, AgroParisTech, Micalis Institute, 78350 Jouy-en-Josas, France
| | - Philippe Meyer
- Université Paris-Saclay, INRAE, AgroParisTech, Micalis Institute, 78350 Jouy-en-Josas, France
| | - Thomas Duigou
- Université Paris-Saclay, INRAE, AgroParisTech, Micalis Institute, 78350 Jouy-en-Josas, France
| | - Jean-Loup Faulon
- Université Paris-Saclay, INRAE, AgroParisTech, Micalis Institute, 78350 Jouy-en-Josas, France
- The University of Manchester, Manchester Institute of Biotechnology, Manchester M1 7DN, U.K.
29
|
Wiest O, Bauer C, Helquist P, Norrby PO, Genheden S. Finding Relevant Retrosynthetic Disconnections for Stereocontrolled Reactions. J Chem Inf Model 2024; 64:5796-5805. [PMID: 38995078 DOI: 10.1021/acs.jcim.4c00370] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/13/2024]
Abstract
Machine learning-driven computer-aided synthesis planning (CASP) tools have become important for idea generation in the design of complex molecule synthesis but do not adequately address the stereochemical features of the target compounds. A novel approach to the automated extraction of CASP templates that retains the stereochemical information contained in US Patent and Trademark Office (USPTO) data and in an internal AstraZeneca database containing reactions from Reaxys, Pistachio, and AstraZeneca electronic lab notebooks is implemented in the freely available AiZynthFinder software. Three hundred sixty-seven templates covering reagent- and substrate-controlled as well as stereospecific reactions were extracted from the USPTO, while 20,724 templates were from the AstraZeneca database. The performance of these templates in multistep CASP is evaluated for 936 targets from the ChEMBL database and an in-house selection of 791 AZ designs. The potential and limitations are discussed for four case studies from ChEMBL and examples of FDA-approved drugs.
Affiliation(s)
- Olaf Wiest
- Department of Chemistry and Biochemistry, University of Notre Dame, Notre Dame, Indiana 46556, United States
| | - Christoph Bauer
- Data Science and Modelling, Pharmaceutical Sciences, R&D, AstraZeneca, Gothenburg, Pepparedsleden 1, SE-431 83 Mölndal, Sweden
| | - Paul Helquist
- Department of Chemistry and Biochemistry, University of Notre Dame, Notre Dame, Indiana 46556, United States
| | - Per-Ola Norrby
- Data Science and Modelling, Pharmaceutical Sciences, R&D, AstraZeneca, Gothenburg, Pepparedsleden 1, SE-431 83 Mölndal, Sweden
| | - Samuel Genheden
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Pepparedsleden 1, SE-431 83 Mölndal, Sweden
30
|
Atz K, Nippa DF, Müller AT, Jost V, Anelli A, Reutlinger M, Kramer C, Martin RE, Grether U, Schneider G, Wuitschik G. Geometric deep learning-guided Suzuki reaction conditions assessment for applications in medicinal chemistry. RSC Med Chem 2024; 15:2310-2321. [PMID: 39026644 PMCID: PMC11253849 DOI: 10.1039/d4md00196f] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2024] [Accepted: 05/25/2024] [Indexed: 07/20/2024] Open
Abstract
Suzuki cross-coupling reactions are considered a valuable tool for constructing carbon-carbon bonds in small molecule drug discovery. However, the synthesis of chemical matter often represents a time-consuming and labour-intensive bottleneck. We demonstrate how machine learning methods trained on high-throughput experimentation (HTE) data can be leveraged to enable fast reaction condition selection for novel coupling partners. We show that the trained models support chemists in determining suitable catalyst-solvent-base combinations for individual transformations including an evaluation of the need for HTE screening. We introduce an algorithm for designing 96-well plates optimized towards reaction yields and discuss the model performance of zero- and few-shot machine learning. The best-performing machine learning model achieved a three-category classification accuracy of 76.3% (±0.2%) and an F1-score for a binary classification of 79.1% (±0.9%). Validation on eight reactions revealed an area under the receiver operating characteristic curve (ROC-AUC) value of 0.82 (±0.07) for few-shot machine learning. In contrast, zero-shot machine learning models achieved a mean ROC-AUC value of 0.63 (±0.16). This study advocates the application of few-shot machine learning-guided reaction condition selection for HTE campaigns in medicinal chemistry and highlights practical applications as well as challenges associated with zero-shot machine learning.
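The two binary-classification metrics quoted in this abstract, the F1-score and the ROC-AUC, can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names `f1_score` and `roc_auc`, the 0.5 decision threshold, and the toy labels and scores are all assumptions made up for illustration, and only the Python standard library is used.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 = 2*TP / (2*TP + FP + FN) for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = reaction judged successful, 0 = unsuccessful.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.7, 0.3, 0.2, 0.6, 0.8, 0.1]  # invented model outputs
y_pred = [1 if s >= 0.5 else 0 for s in scores]    # assumed 0.5 threshold

print(f"F1 = {f1_score(y_true, y_pred):.3f}")
print(f"ROC-AUC = {roc_auc(y_true, scores):.3f}")
```

The pairwise formulation of ROC-AUC is quadratic in the number of examples, which is fine for small validation sets; unlike the F1-score, it does not depend on the choice of decision threshold.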
Affiliation(s)
- Kenneth Atz
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd. Grenzacherstrasse 124 4070 Basel Switzerland
| | - David F Nippa
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd. Grenzacherstrasse 124 4070 Basel Switzerland
| | - Alex T Müller
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd. Grenzacherstrasse 124 4070 Basel Switzerland
| | - Vera Jost
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd. Grenzacherstrasse 124 4070 Basel Switzerland
| | - Andrea Anelli
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd. Grenzacherstrasse 124 4070 Basel Switzerland
| | - Michael Reutlinger
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd. Grenzacherstrasse 124 4070 Basel Switzerland
| | - Christian Kramer
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd. Grenzacherstrasse 124 4070 Basel Switzerland
| | - Rainer E Martin
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd. Grenzacherstrasse 124 4070 Basel Switzerland
| | - Uwe Grether
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd. Grenzacherstrasse 124 4070 Basel Switzerland
| | - Gisbert Schneider
- Department of Chemistry and Applied Biosciences, ETH Zurich Vladimir-Prelog-Weg 4 8093 Zurich Switzerland
| | - Georg Wuitschik
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd. Grenzacherstrasse 124 4070 Basel Switzerland
31
|
Zhang W, Wang Q, Kong X, Xiong J, Ni S, Cao D, Niu B, Chen M, Li Y, Zhang R, Wang Y, Zhang L, Li X, Xiong Z, Shi Q, Huang Z, Fu Z, Zheng M. Fine-tuning large language models for chemical text mining. Chem Sci 2024; 15:10600-10611. [PMID: 38994403 PMCID: PMC11234886 DOI: 10.1039/d4sc00924j] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Accepted: 06/02/2024] [Indexed: 07/13/2024] Open
Abstract
Extracting knowledge from complex and diverse chemical texts is a pivotal task for both experimental and computational chemists. The task is still considered to be extremely challenging due to the complexity of the chemical language and scientific literature. This study explored the power of fine-tuned large language models (LLMs) on five intricate chemical text mining tasks: compound entity recognition, reaction role labelling, metal-organic framework (MOF) synthesis information extraction, nuclear magnetic resonance spectroscopy (NMR) data extraction, and the conversion of reaction paragraphs to action sequences. The fine-tuned LLMs demonstrated impressive performance, significantly reducing the need for repetitive and extensive prompt engineering experiments. For comparison, we guided ChatGPT (GPT-3.5-turbo) and GPT-4 with prompt engineering and fine-tuned GPT-3.5-turbo as well as other open-source LLMs such as Mistral, Llama3, Llama2, T5, and BART. The results showed that the fine-tuned ChatGPT models excelled in all tasks. They achieved exact accuracy levels ranging from 69% to 95% on these tasks with minimal annotated data. They even outperformed those task-adaptive pre-training and fine-tuning models that were based on a significantly larger amount of in-domain data. Notably, fine-tuned Mistral and Llama3 show competitive abilities. Given their versatility, robustness, and low-code capability, leveraging fine-tuned LLMs as flexible and effective toolkits for automated data acquisition could revolutionize chemical knowledge extraction.
Affiliation(s)
- Wei Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
| | - Qinggong Wang
- Nanjing University of Chinese Medicine 138 Xianlin Road Nanjing 210023 China
| | - Xiangtai Kong
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
| | - Jiacheng Xiong
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
| | - Shengkun Ni
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
| | - Duanhua Cao
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou Zhejiang 310058 China
| | - Buying Niu
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
| | - Mingan Chen
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- School of Physical Science and Technology, ShanghaiTech University Shanghai 201210 China
- Lingang Laboratory Shanghai 200031 China
| | - Yameng Li
- ProtonUnfold Technology Co., Ltd Suzhou China
| | - Runze Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
| | - Yitian Wang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
| | - Lehan Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
| | - Xutong Li
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
| | | | - Qian Shi
- Lingang Laboratory Shanghai 200031 China
| | - Ziming Huang
- Medizinische Klinik und Poliklinik I, Klinikum der Universität München, Ludwig-Maximilians-Universität Munich Germany
| | - Zunyun Fu
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
| | - Mingyue Zheng
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
- Nanjing University of Chinese Medicine 138 Xianlin Road Nanjing 210023 China
32
|
Kalikadien AV, Mirza A, Hossaini AN, Sreenithya A, Pidko EA. Paving the road towards automated homogeneous catalyst design. Chempluschem 2024; 89:e202300702. [PMID: 38279609 DOI: 10.1002/cplu.202300702] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2023] [Revised: 12/20/2023] [Indexed: 01/28/2024]
Abstract
In the past decade, computational tools have become integral to catalyst design. They continue to offer significant support to experimental organic synthesis and catalysis researchers aiming for optimal reaction outcomes. More recently, data-driven approaches utilizing machine learning have garnered considerable attention for their expansive capabilities. This Perspective provides an overview of diverse initiatives in the realm of computational catalyst design and introduces our automated tools tailored for high-throughput in silico exploration of the chemical space. While valuable insights are gained through methods for high-throughput in silico exploration and analysis of chemical space, their degree of automation and modularity are key. We argue that the integration of data-driven, automated and modular workflows is key to enhancing homogeneous catalyst design on an unprecedented scale, contributing to the advancement of catalysis research.
Affiliation(s)
- Adarsh V Kalikadien
- Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology, Van der Maasweg 9, 2629 HZ, Delft, The Netherlands
| | - Adrian Mirza
- Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology, Van der Maasweg 9, 2629 HZ, Delft, The Netherlands
| | - Aydin Najl Hossaini
- Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology, Van der Maasweg 9, 2629 HZ, Delft, The Netherlands
| | - Avadakkam Sreenithya
- Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology, Van der Maasweg 9, 2629 HZ, Delft, The Netherlands
| | - Evgeny A Pidko
- Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology, Van der Maasweg 9, 2629 HZ, Delft, The Netherlands
33
|
Chen LY, Li YP. AutoTemplate: enhancing chemical reaction datasets for machine learning applications in organic chemistry. J Cheminform 2024; 16:74. [PMID: 38937840 PMCID: PMC11212196 DOI: 10.1186/s13321-024-00869-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2024] [Accepted: 06/09/2024] [Indexed: 06/29/2024] Open
Abstract
This paper presents AutoTemplate, an innovative data preprocessing protocol, addressing the crucial need for high-quality chemical reaction datasets in the realm of machine learning applications in organic chemistry. Recent advances in artificial intelligence have expanded the application of machine learning in chemistry, particularly in yield prediction, retrosynthesis, and reaction condition prediction. However, the effectiveness of these models hinges on the integrity of chemical reaction datasets, which are often plagued by inconsistencies like missing reactants, incorrect atom mappings, and outright erroneous reactions. AutoTemplate introduces a two-stage approach to refine these datasets. The first stage involves extracting meaningful reaction transformation rules and formulating generic reaction templates using a simplified SMARTS representation. This simplification broadens the applicability of templates across various chemical reactions. The second stage is template-guided reaction curation, where these templates are systematically applied to validate and correct the reaction data. This process effectively amends missing reactant information, rectifies atom-mapping errors, and eliminates incorrect data entries. A standout feature of AutoTemplate is its capability to concurrently identify and correct false chemical reactions. It operates on the premise that most reactions in datasets are accurate, using these as templates to guide the correction of flawed entries. The protocol demonstrates its efficacy across a range of chemical reactions, significantly enhancing dataset quality. This advancement provides a more robust foundation for developing reliable machine learning models in chemistry, thereby improving the accuracy of forward and retrosynthetic predictions. 
AutoTemplate marks a significant progression in the preprocessing of chemical reaction datasets, bridging a vital gap and facilitating more precise and efficient machine learning applications in organic synthesis. SCIENTIFIC CONTRIBUTION: The proposed automated preprocessing tool for chemical reaction data aims to identify errors within chemical databases. Specifically, if the errors involve atom mapping or the absence of reactant types, corrections can be systematically applied using reaction templates, ultimately elevating the overall quality of the database.
Affiliation(s)
- Lung-Yi Chen
- Department of Chemical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, 10617, Taiwan
| | - Yi-Pei Li
- Department of Chemical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, 10617, Taiwan.
- Taiwan International Graduate Program on Sustainable Chemical Science and Technology (TIGP-SCST), No. 128, Sec. 2, Academia Road, Taipei, 11529, Taiwan.
34
|
Raghavan P, Rago AJ, Verma P, Hassan MM, Goshu GM, Dombrowski AW, Pandey A, Coley CW, Wang Y. Incorporating Synthetic Accessibility in Drug Design: Predicting Reaction Yields of Suzuki Cross-Couplings by Leveraging AbbVie's 15-Year Parallel Library Data Set. J Am Chem Soc 2024; 146:15070-15084. [PMID: 38768950 PMCID: PMC11157529 DOI: 10.1021/jacs.4c00098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Revised: 04/24/2024] [Accepted: 04/25/2024] [Indexed: 05/22/2024]
Abstract
Despite the increased use of computational tools to supplement medicinal chemists' expertise and intuition in drug design, predicting synthetic yields in medicinal chemistry endeavors remains an unsolved challenge. Existing design workflows could profoundly benefit from reaction yield prediction, as precious material waste could be reduced, and a greater number of relevant compounds could be delivered to advance the design, make, test, analyze (DMTA) cycle. In this work, we detail the evaluation of AbbVie's medicinal chemistry library data set to build machine learning models for the prediction of Suzuki coupling reaction yields. The combination of density functional theory (DFT)-derived features and Morgan fingerprints was identified to perform better than one-hot encoded baseline modeling, furnishing encouraging results. Overall, we observe modest generalization to unseen reactant structures within the 15-year retrospective library data set. Additionally, we compare predictions made by the model to those made by expert medicinal chemists, finding that the model can often predict both reaction success and reaction yields with greater accuracy. Finally, we demonstrate the application of this approach to suggest structurally and electronically similar building blocks to replace those predicted or observed to be unsuccessful prior to or after synthesis, respectively. The yield prediction model was used to select similar monomers predicted to have higher yields, resulting in greater synthesis efficiency of relevant drug-like molecules.
Affiliation(s)
- Priyanka Raghavan
  - Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, Massachusetts 02139, United States
- Alexander J. Rago
  - Advanced Chemistry Technologies Group, AbbVie, Inc., 1 N Waukegan Rd, North Chicago, Illinois 60064, United States
- Pritha Verma
  - Advanced Chemistry Technologies Group, AbbVie, Inc., 1 N Waukegan Rd, North Chicago, Illinois 60064, United States
- Majdi M. Hassan
  - RAIDERS Group, AbbVie, Inc., 1 N Waukegan Rd, North Chicago, Illinois 60064, United States
- Gashaw M. Goshu
  - Advanced Chemistry Technologies Group, AbbVie, Inc., 1 N Waukegan Rd, North Chicago, Illinois 60064, United States
- Amanda W. Dombrowski
  - Advanced Chemistry Technologies Group, AbbVie, Inc., 1 N Waukegan Rd, North Chicago, Illinois 60064, United States
- Abhishek Pandey
  - RAIDERS Group, AbbVie, Inc., 1 N Waukegan Rd, North Chicago, Illinois 60064, United States
- Connor W. Coley
  - Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, Massachusetts 02139, United States
- Ying Wang
  - Advanced Chemistry Technologies Group, AbbVie, Inc., 1 N Waukegan Rd, North Chicago, Illinois 60064, United States
|
35
|
Wigh D, Arrowsmith J, Pomberger A, Felton KC, Lapkin AA. ORDerly: Data Sets and Benchmarks for Chemical Reaction Data. J Chem Inf Model 2024; 64:3790-3798. [PMID: 38648077 PMCID: PMC11094788 DOI: 10.1021/acs.jcim.4c00292] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Revised: 04/03/2024] [Accepted: 04/04/2024] [Indexed: 04/25/2024]
Abstract
Machine learning has the potential to provide tremendous value to life sciences by providing models that aid in the discovery of new molecules and reduce the time for new products to come to market. Chemical reactions play a significant role in these fields, but there is a lack of high-quality open-source chemical reaction data sets for training machine learning models. Herein, we present ORDerly, an open-source Python package for the customizable and reproducible preparation of reaction data stored in accordance with the increasingly popular Open Reaction Database (ORD) schema. We use ORDerly to clean United States patent data stored in ORD and generate data sets for forward prediction, retrosynthesis, as well as the first benchmark for reaction condition prediction. We train neural networks on data sets generated with ORDerly for condition prediction and show that data sets missing key cleaning steps can lead to silently overinflated performance metrics. Additionally, we train transformers for forward and retrosynthesis prediction and demonstrate how non-patent data can be used to evaluate model generalization. By providing a customizable open-source solution for cleaning and preparing large chemical reaction data, ORDerly is poised to push forward the boundaries of machine learning applications in chemistry.
Affiliation(s)
- Daniel S. Wigh
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge CB3 0AS, U.K.
- Joe Arrowsmith
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge CB3 0AS, U.K.
- Alexander Pomberger
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge CB3 0AS, U.K.
- Kobi C. Felton
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge CB3 0AS, U.K.
- Alexei A. Lapkin
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge CB3 0AS, U.K.
|
36
|
Duke R, McCoy R, Risko C, Bursten JRS. Promises and Perils of Big Data: Philosophical Constraints on Chemical Ontologies. J Am Chem Soc 2024; 146:11579-11591. [PMID: 38640489 DOI: 10.1021/jacs.3c11399] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/21/2024]
Abstract
Chemistry is experiencing a paradigm shift in the way it interacts with data. So-called "big data" are collected and used at unprecedented scales with the idea that algorithms can be designed to aid in chemical discovery. As data-enabled practices become ever more ubiquitous, chemists must consider the organization and curation of their data, especially as it is presented to both humans and increasingly intelligent algorithms. One of the most promising organizational schemes for big data is a construct termed an ontology. In data science, ontologies are systems that represent relations among objects and properties in a domain of discourse. As chemistry encounters larger and larger data sets, the ontologies that support chemical research will likewise increase in complexity, and the future of chemistry will be shaped by the choices made in developing big data chemical ontologies. How such ontologies will work should therefore be a subject of significant attention in the chemical community. Now is the time for chemists to ask questions about ontology design and use: How should chemical data be organized? What can be reasonably expected from an organizational structure? Is a universal ontology tenable? As some of these questions may be new to chemists, we recommend an interdisciplinary approach that draws on the long history of philosophers of science asking questions about the organization of scientific concepts, constructs, models, and theories. This Perspective presents insights from these long-standing studies and initiates new conversations between chemists and philosophers.
Affiliation(s)
- Rebekah Duke
  - Department of Chemistry & Center for Applied Energy Research, University of Kentucky, Lexington, Kentucky 40506, United States
- Ryan McCoy
  - Department of Philosophy, University of Kentucky, Lexington, Kentucky 40508, United States
- Chad Risko
  - Department of Chemistry & Center for Applied Energy Research, University of Kentucky, Lexington, Kentucky 40506, United States
- Julia R S Bursten
  - Department of Philosophy, University of Kentucky, Lexington, Kentucky 40508, United States
|
37
|
Zhang C, Arun A, Lapkin AA. Completing and Balancing Database Excerpted Chemical Reactions with a Hybrid Mechanistic-Machine Learning Approach. ACS Omega 2024; 9:18385-18399. [PMID: 38680356 PMCID: PMC11044172 DOI: 10.1021/acsomega.4c00262] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2024] [Revised: 03/31/2024] [Accepted: 04/03/2024] [Indexed: 05/01/2024]
Abstract
Computer-aided synthesis planning (CASP) development of reaction routes requires an understanding of complete reaction structures. However, most reactions in the current databases are missing reaction coparticipants. Although reaction prediction and atom mapping tools can predict major reaction participants and trace atom rearrangements in reactions, they fail to identify the missing molecules to complete reactions. This is because these approaches are data-driven models trained on the current reaction databases, which comprise incomplete reactions. In this work, a workflow was developed to tackle the reaction completion challenge. This includes a heuristic-based method to identify balanced reactions from reaction databases and complete some imbalanced reactions by adding candidate molecules. A machine learning masked language model (MLM) was trained to learn from simplified molecular input line entry system (SMILES) sentences of these completed reactions. The model predicted missing molecules for the incomplete reactions, a workflow analogous to predicting missing words in sentences. The model is promising for the prediction of small- and middle-sized missing molecules in incomplete reaction records. The workflow combining both the heuristic and machine learning methods completed more than half of the entire reaction space.
Affiliation(s)
- Chonghuan Zhang
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge CB3 0AS, U.K.
- Adarsh Arun
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge CB3 0AS, U.K.
  - Cambridge Centre for Advanced Research and Education in Singapore, CARES Ltd., 1 CREATE Way, CREATE Tower #05-05, Singapore 138602 Singapore
  - Chemical Data Intelligence (CDI) Pte., Ltd., 9 Raffles Place #26-01, Republic Plaza, Singapore 048619 Singapore
- Alexei A. Lapkin
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge CB3 0AS, U.K.
  - Cambridge Centre for Advanced Research and Education in Singapore, CARES Ltd., 1 CREATE Way, CREATE Tower #05-05, Singapore 138602 Singapore
  - Chemical Data Intelligence (CDI) Pte., Ltd., 9 Raffles Place #26-01, Republic Plaza, Singapore 048619 Singapore
|
38
|
Ding Y, Qiang B, Chen Q, Liu Y, Zhang L, Liu Z. Exploring Chemical Reaction Space with Machine Learning Models: Representation and Feature Perspective. J Chem Inf Model 2024; 64:2955-2970. [PMID: 38489239 DOI: 10.1021/acs.jcim.4c00004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/17/2024]
Abstract
Chemical reactions serve as foundational building blocks for organic chemistry and drug design. In the era of large AI models, data-driven approaches have emerged to innovate the design of novel reactions, optimize existing ones for higher yields, and discover new pathways for synthesizing chemical structures comprehensively. To effectively address these challenges with machine learning models, it is imperative to derive robust and informative representations or engage in feature engineering using extensive data sets of reactions. This work aims to provide a comprehensive review of established reaction featurization approaches, offering insights into the selection of representations and the design of features for a wide array of tasks. The advantages and limitations of employing SMILES, molecular fingerprints, molecular graphs, and physics-based properties are meticulously elaborated. Solutions to bridge the gap between different representations will also be critically evaluated. Additionally, we introduce a new frontier in chemical reaction pretraining, holding promise as an innovative yet unexplored avenue.
Affiliation(s)
- Yuheng Ding
  - Department of Pharmaceutical Science, Peking University, Beijing 100191, China
- Bo Qiang
  - Department of Pharmaceutical Science, Peking University, Beijing 100191, China
- Qixuan Chen
  - Department of Pharmaceutical Science, Peking University, Beijing 100191, China
- Yiqiao Liu
  - Department of Pharmaceutical Science, Peking University, Beijing 100191, China
- Liangren Zhang
  - Department of Pharmaceutical Science, Peking University, Beijing 100191, China
- Zhenming Liu
  - Department of Pharmaceutical Science, Peking University, Beijing 100191, China
|
39
|
Strieth-Kalthoff F, Szymkuć S, Molga K, Aspuru-Guzik A, Glorius F, Grzybowski BA. Artificial Intelligence for Retrosynthetic Planning Needs Both Data and Expert Knowledge. J Am Chem Soc 2024. [PMID: 38598363 DOI: 10.1021/jacs.4c00338] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/12/2024]
Abstract
Rapid advancements in artificial intelligence (AI) have enabled breakthroughs across many scientific disciplines. In organic chemistry, the challenge of planning complex multistep chemical syntheses should conceptually be well-suited for AI. Yet, the development of AI synthesis planners trained solely on reaction-example-data has stagnated and is not on par with the performance of "hybrid" algorithms combining AI with expert knowledge. This Perspective examines possible causes of these shortcomings, extending beyond the established reasoning of insufficient quantities of reaction data. Drawing attention to the intricacies and data biases that are specific to the domain of synthetic chemistry, we advocate augmenting the unique capabilities of AI with the knowledge base and the reasoning strategies of domain experts. By actively involving synthetic chemists, who are the end users of any synthesis planning software, into the development process, we envision to bridge the gap between computer algorithms and the intricate nature of chemical synthesis.
Affiliation(s)
- Felix Strieth-Kalthoff
  - University of Toronto, Department of Chemistry and Department of Computer Science, 80 St. George St., Toronto, Ontario M5S 3H6, Canada
  - University of Toronto, Department of Computer Science, 10 King's College Road, Toronto, Ontario M5S 3G4, Canada
- Sara Szymkuć
  - Allchemy, 2145 45th Street #201, Highland, Indiana 46322, United States
  - Institute of Organic Chemistry, Polish Academy of Sciences, ul. Kasprzaka 44/52, Warsaw 01-224, Poland
- Karol Molga
  - Allchemy, 2145 45th Street #201, Highland, Indiana 46322, United States
  - Institute of Organic Chemistry, Polish Academy of Sciences, ul. Kasprzaka 44/52, Warsaw 01-224, Poland
- Alán Aspuru-Guzik
  - University of Toronto, Department of Chemistry and Department of Computer Science, 80 St. George St., Toronto, Ontario M5S 3H6, Canada
  - University of Toronto, Department of Computer Science, 10 King's College Road, Toronto, Ontario M5S 3G4, Canada
  - Vector Institute for Artificial Intelligence, 661 University Ave., Toronto, Ontario M5G 1M1, Canada
  - University of Toronto, Department of Chemical Engineering and Applied Chemistry, 200 College St., Toronto, Ontario M5S 3E5, Canada
  - University of Toronto, Department of Materials Science and Engineering, 184 College St., Toronto, Ontario M5S 3E4, Canada
- Frank Glorius
  - Universität Münster, Organisch-Chemisches Institut, Corrensstr. 36, 48149 Münster, Germany
- Bartosz A Grzybowski
  - Institute of Organic Chemistry, Polish Academy of Sciences, ul. Kasprzaka 44/52, Warsaw 01-224, Poland
  - IBS Center for Algorithmic and Robotized Synthesis, CARS, UNIST 50, UNIST-gil, Eonyang-eup, Ulju-gun, Ulsan 689-798, South Korea
  - Department of Chemistry, UNIST, 50, UNIST-gil, Eonyang-eup, Ulju-gun, Ulsan 689-798, South Korea
|
40
|
Schrader ML, Schäfer FR, Schäfers F, Glorius F. Bridging the information gap in organic chemical reactions. Nat Chem 2024; 16:491-498. [PMID: 38548884 DOI: 10.1038/s41557-024-01470-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 02/03/2023] [Accepted: 02/02/2024] [Indexed: 04/07/2024]
Abstract
The varying quality of scientific reports is a well-recognized problem and often results from a lack of standardization and transparency in scientific publications. This situation ultimately leads to prominent complications such as reproducibility issues and the slow uptake of newly developed synthetic methods for pharmaceutical and agrochemical applications. In recent years, various impactful approaches have been advocated to bridge information gaps and to improve the quality of experimental protocols in synthetic organic publications. Here we provide a critical overview of these strategies and present the reader with a versatile set of tools to augment their standard procedures. We formulate eight principles to improve data management in scientific publications relating to data standardization, reproducibility and evaluation, and encourage scientists to go beyond current publication standards. We are aware that this is a substantial effort, but we are convinced that the resulting improved data situation will greatly benefit the progress of chemistry.
Affiliation(s)
- Malte L Schrader
  - Organisch-Chemisches Institut, Universität Münster, Münster, Germany
- Felix R Schäfer
  - Organisch-Chemisches Institut, Universität Münster, Münster, Germany
- Felix Schäfers
  - Organisch-Chemisches Institut, Universität Münster, Münster, Germany
- Frank Glorius
  - Organisch-Chemisches Institut, Universität Münster, Münster, Germany
|
41
|
King-Smith E, Berritt S, Bernier L, Hou X, Klug-McLeod JL, Mustakis J, Sach NW, Tucker JW, Yang Q, Howard RM, Lee AA. Probing the chemical 'reactome' with high-throughput experimentation data. Nat Chem 2024; 16:633-643. [PMID: 38168924 PMCID: PMC10997498 DOI: 10.1038/s41557-023-01393-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2022] [Accepted: 11/06/2023] [Indexed: 01/05/2024]
Abstract
High-throughput experimentation (HTE) has the potential to improve our understanding of organic chemistry by systematically interrogating reactivity across diverse chemical spaces. Notable bottlenecks include few publicly available large-scale datasets and the need for facile interpretation of these data's hidden chemical insights. Here we report the development of a high-throughput experimentation analyser, a robust and statistically rigorous framework, which is applicable to any HTE dataset regardless of size, scope or target reaction outcome, which yields interpretable correlations between starting material(s), reagents and outcomes. We improve the HTE data landscape with the disclosure of 39,000+ previously proprietary HTE reactions that cover a breadth of chemistry, including cross-coupling reactions and chiral salt resolutions. The high-throughput experimentation analyser was validated on cross-coupling and hydrogenation datasets, showcasing the elucidation of statistically significant hidden relationships between reaction components and outcomes, as well as highlighting areas of dataset bias and the specific reaction spaces that necessitate further investigation.
Affiliation(s)
- Emma King-Smith
  - Cavendish Laboratory, University of Cambridge, Cambridge, UK
- Xinjun Hou
  - Pfizer Research and Development, Cambridge, MA, USA
- Neal W Sach
  - Pfizer Research and Development, La Jolla, CA, USA
- Qingyi Yang
  - Pfizer Research and Development, Cambridge, MA, USA
- Alpha A Lee
  - Cavendish Laboratory, University of Cambridge, Cambridge, UK
|
42
|
Dobbelaere MR, Lengyel I, Stevens CV, Van Geem KM. Rxn-INSIGHT: fast chemical reaction analysis using bond-electron matrices. J Cheminform 2024; 16:37. [PMID: 38553720 PMCID: PMC10980627 DOI: 10.1186/s13321-024-00834-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2023] [Accepted: 03/23/2024] [Indexed: 04/02/2024] Open
Abstract
The challenge of devising pathways for organic synthesis remains a central issue in the field of medicinal chemistry. Over the span of six decades, computer-aided synthesis planning has given rise to a plethora of potent tools for formulating synthetic routes. Nevertheless, a significant expert task still looms: determining the appropriate solvent, catalyst, and reagents when provided with a set of reactants to achieve and optimize the desired product for a specific step in the synthesis process. Typically, chemists identify key functional groups and rings that exert crucial influences at the reaction center, classify reactions into categories, and may assign them names. This research introduces Rxn-INSIGHT, an open-source algorithm based on the bond-electron matrix approach, with the purpose of automating this endeavor. Rxn-INSIGHT not only streamlines the process but also facilitates extensive querying of reaction databases, effectively replicating the thought processes of an organic chemist. The core functions of the algorithm encompass the classification and naming of reactions, extraction of functional groups, rings, and scaffolds from the involved chemical entities. The provision of reaction condition recommendations based on the similarity and prevalence of reactions eventually arises as a side application. The performance of our rule-based model has been rigorously assessed against a carefully curated benchmark dataset, exhibiting an accuracy rate exceeding 90% in reaction classification and surpassing 95% in reaction naming. Notably, it has been discerned that a pivotal factor in selecting analogous reactions lies in the analysis of ring structures participating in the reactions. An examination of ring structures within the USPTO chemical reaction database reveals that with just 35 unique rings, a remarkable 75% of all rings found in nearly 1 million products can be encompassed. 
Furthermore, Rxn-INSIGHT is proficient in suggesting appropriate choices for solvents, catalysts, and reagents in entirely novel reactions, all within the span of a second, utilizing nothing more than an everyday laptop.
Affiliation(s)
- Maarten R Dobbelaere
  - Laboratory for Chemical Technology, Department of Materials, Textiles and Chemical Engineering, Faculty of Engineering and Architecture, Ghent University, Technologiepark 125, 9052, Ghent, Belgium
- István Lengyel
  - Laboratory for Chemical Technology, Department of Materials, Textiles and Chemical Engineering, Faculty of Engineering and Architecture, Ghent University, Technologiepark 125, 9052, Ghent, Belgium
  - ChemInsights LLC, Dover, DE, 19901, USA
- Christian V Stevens
  - SynBioC Research Group, Department of Green Chemistry and Technology, Faculty of Bioscience Engineering, Ghent University, Coupure Links 653, 9000, Ghent, Belgium
- Kevin M Van Geem
  - Laboratory for Chemical Technology, Department of Materials, Textiles and Chemical Engineering, Faculty of Engineering and Architecture, Ghent University, Technologiepark 125, 9052, Ghent, Belgium
|
43
|
Pasquini M, Stenta M. LinChemIn: Route Arithmetic - Operations on Digital Synthetic Routes. J Chem Inf Model 2024; 64:1765-1771. [PMID: 38480486 DOI: 10.1021/acs.jcim.3c01819] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/26/2024]
Abstract
Computational tools are revolutionizing our understanding and prediction of chemical reactivity by combining traditional data analysis techniques with new predictive models. These tools extract additional value from the reaction data corpus, but to effectively convert this value into actionable knowledge, domain specialists need to interact easily with the computer-generated output. In this application note, we demonstrate the capabilities of the open-source Python toolkit LinChemIn, which simplifies the manipulation of reaction networks and provides advanced functionality for working with synthetic routes. LinChemIn ensures chemical consistency when merging, editing, mining, and analyzing reaction networks. Its flexible input interface can process routes from various sources, including predictive models and expert input. The toolkit also efficiently extracts individual routes from the combined synthetic tree, identifying alternative paths and reaction combinations. By reducing the operational barrier to accessing and analyzing synthetic routes from multiple sources, LinChemIn facilitates a constructive interplay between artificial intelligence and human expertise.
Affiliation(s)
- Marta Pasquini
  - Syngenta Crop Protection AG, Schaffhauserstrasse, 4332 Stein, AG, Switzerland
- Marco Stenta
  - Syngenta Crop Protection AG, Schaffhauserstrasse, 4332 Stein, AG, Switzerland
|
44
|
Tavakoli M, Miller RJ, Angel MC, Pfeiffer MA, Gutman ES, Mood AD, Van Vranken D, Baldi P. PMechDB: A Public Database of Elementary Polar Reaction Steps. J Chem Inf Model 2024; 64:1975-1983. [PMID: 38483315 PMCID: PMC10966657 DOI: 10.1021/acs.jcim.3c01810] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Revised: 02/15/2024] [Accepted: 02/16/2024] [Indexed: 03/26/2024]
Abstract
Most online chemical reaction databases are not publicly accessible or fully downloadable. These databases tend to contain reactions in noncanonicalized formats and often lack comprehensive information regarding reaction pathways, intermediates, and byproducts. Within the few publicly available databases, reactions are typically stored in the form of unbalanced, overall transformations with minimal interpretability of the underlying chemistry. These limitations present significant obstacles to data-driven applications including the development of machine learning models. As an effort to overcome these challenges, we introduce PMechDB, a publicly accessible platform designed to curate, aggregate, and share polar chemical reaction data in the form of elementary reaction steps. Our initial version of PMechDB consists of over 100,000 such steps. In PMechDB, all reactions are stored as canonicalized and balanced elementary steps, featuring accurate atom mapping and arrow-pushing mechanisms. As an online interactive database, PMechDB provides multiple interfaces that enable users to search, download, and upload chemical reactions. We anticipate that the public availability of PMechDB and its standardized data representation will prove beneficial for chemoinformatics research and education and the development of data-driven, interpretable models for predicting reactions and pathways. The PMechDB platform is accessible online at https://deeprxn.ics.uci.edu/pmechdb.
Affiliation(s)
- Mohammadamin Tavakoli
  - Department of Computer Science, University of California, Irvine, Irvine, California 92697, United States
- Ryan J. Miller
  - Department of Computer Science, University of California, Irvine, Irvine, California 92697, United States
- Mirana Claire Angel
  - Department of Computer Science, University of California, Irvine, Irvine, California 92697, United States
- Michael A. Pfeiffer
  - Department of Chemistry, University of California, Irvine, Irvine, California 92697, United States
- Eugene S. Gutman
  - Department of Chemistry, University of California, Irvine, Irvine, California 92697, United States
- Aaron D. Mood
  - Department of Chemistry, University of California, Irvine, Irvine, California 92697, United States
- David Van Vranken
  - Department of Chemistry, University of California, Irvine, Irvine, California 92697, United States
- Pierre Baldi
  - Department of Computer Science, University of California, Irvine, Irvine, California 92697, United States
|
45
|
'Bandit' algorithms help chemists to discover generally applicable conditions for reactions. Nature 2024:10.1038/d41586-024-00446-5. [PMID: 38499804 DOI: 10.1038/d41586-024-00446-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/20/2024]
|
46
|
Listgarten J. The perpetual motion machine of AI-generated data and the distraction of ChatGPT as a 'scientist'. Nat Biotechnol 2024; 42:371-373. [PMID: 38273064 DOI: 10.1038/s41587-023-02103-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/27/2024]
Affiliation(s)
- Jennifer Listgarten
- Electrical Engineering and Computer Science, University of California, Berkeley, Berkeley, CA, USA.
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA.
|
47
|
Isbrandt ES, Chapple DE, Tu NTP, Dimakos V, Beardall AMM, Boyle PD, Rowley CN, Blacquiere JM, Newman SG. Controlling Reactivity and Selectivity in the Mizoroki-Heck Reaction: High Throughput Evaluation of 1,5-Diaza-3,7-diphosphacyclooctane Ligands. J Am Chem Soc 2024; 146:5650-5660. [PMID: 38359357 DOI: 10.1021/jacs.3c14612] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/17/2024]
Abstract
We report a high throughput evaluation of the Mizoroki-Heck reaction of diverse olefin coupling partners. Comparison of different ligands revealed the 1,5-diaza-3,7-diphosphacyclooctane (P2N2) scaffold to be more broadly applicable than common "gold standard" ligands, demonstrating that this family of readily accessible diphosphines has unrecognized potential in organic synthesis. In particular, two structurally related P2N2 ligands were identified to enable the regiodivergent arylation of styrenes. By simply altering the phosphorus substituent from a phenyl to tert-butyl group, both the linear and branched Mizoroki-Heck products can be obtained in high regioisomeric ratios. Experimental and computational mechanistic studies were performed to further probe the origin of selectivity, which suggests that both ligands coordinate to the metal in a similar manner but that rigid positioning of the phosphorus substituent forces contact with the incoming olefin in a π-π interaction (for P-Ph ligands) or with steric clash (for P-tBu ligands), dictating the regiocontrol.
Affiliation(s)
- Eric S Isbrandt
  - Centre for Catalysis Research and Innovation, Department of Chemistry and Biomolecular Sciences, University of Ottawa, 10 Marie Curie Private, Ottawa, Ontario K1N 6N5, Canada
- Devon E Chapple
  - Department of Chemistry, Western University, 1151 Richmond Street, London, Ontario N6A 3K7, Canada
- Nguyen Thien Phuc Tu
  - Department of Chemistry, Carleton University, 1125 Colonel By Drive, Ottawa, Ontario K1S 5B6, Canada
- Victoria Dimakos
  - Centre for Catalysis Research and Innovation, Department of Chemistry and Biomolecular Sciences, University of Ottawa, 10 Marie Curie Private, Ottawa, Ontario K1N 6N5, Canada
- Anne Marie M Beardall
  - Department of Chemistry, Western University, 1151 Richmond Street, London, Ontario N6A 3K7, Canada
- Paul D Boyle
  - Department of Chemistry, Western University, 1151 Richmond Street, London, Ontario N6A 3K7, Canada
- Christopher N Rowley
  - Department of Chemistry, Carleton University, 1125 Colonel By Drive, Ottawa, Ontario K1S 5B6, Canada
- Johanna M Blacquiere
  - Department of Chemistry, Western University, 1151 Richmond Street, London, Ontario N6A 3K7, Canada
- Stephen G Newman
  - Centre for Catalysis Research and Innovation, Department of Chemistry and Biomolecular Sciences, University of Ottawa, 10 Marie Curie Private, Ottawa, Ontario K1N 6N5, Canada
|
48
|
Nippa DF, Atz K, Hohler R, Müller AT, Marx A, Bartelmus C, Wuitschik G, Marzuoli I, Jost V, Wolfard J, Binder M, Stepan AF, Konrad DB, Grether U, Martin RE, Schneider G. Enabling late-stage drug diversification by high-throughput experimentation with geometric deep learning. Nat Chem 2024; 16:239-248. [PMID: 37996732 PMCID: PMC10849962 DOI: 10.1038/s41557-023-01360-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 10/21/2022] [Accepted: 10/03/2023] [Indexed: 11/25/2023]
Abstract
Late-stage functionalization is an economical approach to optimize the properties of drug candidates. However, the chemical complexity of drug molecules often makes late-stage diversification challenging. To address this problem, a late-stage functionalization platform based on geometric deep learning and high-throughput reaction screening was developed. Considering borylation as a critical step in late-stage functionalization, the computational model predicted reaction yields for diverse reaction conditions with a mean absolute error margin of 4-5%, while the reactivity of novel reactions with known and unknown substrates was classified with a balanced accuracy of 92% and 67%, respectively. The regioselectivity of the major products was accurately captured with a classifier F-score of 67%. When applied to 23 diverse commercial drug molecules, the platform successfully identified numerous opportunities for structural diversification. The influence of steric and electronic information on model performance was quantified, and a comprehensive simple user-friendly reaction format was introduced that proved to be a key enabler for seamlessly integrating deep learning and high-throughput experimentation for late-stage functionalization.
Affiliation(s)
- David F Nippa
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
- Department of Pharmacy, Ludwig-Maximilians-Universität München, Munich, Germany
| | - Kenneth Atz
- Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich, Switzerland
| | - Remo Hohler
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Alex T Müller
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Andreas Marx
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Christian Bartelmus
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Georg Wuitschik
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Irene Marzuoli
- Process Chemistry and Catalysis (PCC), F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Vera Jost
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Jens Wolfard
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Martin Binder
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Antonia F Stepan
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - David B Konrad
- Department of Pharmacy, Ludwig-Maximilians-Universität München, Munich, Germany
| | - Uwe Grether
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Rainer E Martin
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Gisbert Schneider
- Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich, Switzerland
- ETH Singapore SEC Ltd, Singapore, Singapore
|
49
|
Chen LY, Li YP. Enhancing chemical synthesis: a two-stage deep neural network for predicting feasible reaction conditions. J Cheminform 2024; 16:11. [PMID: 38268009 PMCID: PMC11301986 DOI: 10.1186/s13321-024-00805-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2023] [Accepted: 01/14/2024] [Indexed: 01/26/2024] Open
Abstract
In the field of chemical synthesis planning, the accurate recommendation of reaction conditions is essential for achieving successful outcomes. This work introduces an innovative deep learning approach designed to address the complex task of predicting appropriate reagents, solvents, and reaction temperatures for chemical reactions. Our proposed methodology combines a multi-label classification model with a ranking model to offer tailored reaction condition recommendations based on relevance scores derived from anticipated product yields. To tackle the challenge of limited data for unfavorable reaction contexts, we employed the technique of hard negative sampling to generate reaction conditions that might be mistakenly classified as suitable, forcing the model to refine its decision boundaries, especially in challenging cases. Our developed model excels in proposing conditions where an exact match to the recorded solvents and reagents is found within the top-10 predictions 73% of the time. It also predicts temperatures within ±20 °C of the recorded temperature in 89% of test cases. Notably, the model demonstrates its capacity to recommend multiple viable reaction conditions, with accuracy varying based on the availability of condition records associated with each reaction. What sets this model apart is its ability to suggest alternative reaction conditions beyond the constraints of the dataset. This underscores its potential to inspire innovative approaches in chemical research, presenting a compelling opportunity for advancing chemical synthesis planning and elevating the field of reaction engineering. Scientific contribution: The combination of multi-label classification and ranking models provides tailored recommendations for reaction conditions based on the reaction yields. A novel approach is presented to address the issue of data scarcity in negative reaction conditions through data augmentation.
Affiliation(s)
- Lung-Yi Chen
- Department of Chemical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, 10617, Taiwan
| | - Yi-Pei Li
- Department of Chemical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, 10617, Taiwan
- Taiwan International Graduate Program on Sustainable Chemical Science and Technology (TIGP-SCST), No. 128, Sec. 2, Academia Road, Taipei, 11529, Taiwan
|
50
|
Bai J, Mosbach S, Taylor CJ, Karan D, Lee KF, Rihm SD, Akroyd J, Lapkin AA, Kraft M. A dynamic knowledge graph approach to distributed self-driving laboratories. Nat Commun 2024; 15:462. [PMID: 38263405 PMCID: PMC10805810 DOI: 10.1038/s41467-023-44599-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2023] [Accepted: 12/21/2023] [Indexed: 01/25/2024] Open
Abstract
The ability to integrate resources and share knowledge across organisations empowers scientists to expedite the scientific discovery process. This is especially crucial in addressing emerging global challenges that require global solutions. In this work, we develop an architecture for distributed self-driving laboratories within The World Avatar project, which seeks to create an all-encompassing digital twin based on a dynamic knowledge graph. We employ ontologies to capture data and material flows in design-make-test-analyse cycles, utilising autonomous agents as executable knowledge components to carry out the experimentation workflow. Data provenance is recorded to ensure its findability, accessibility, interoperability, and reusability. We demonstrate the practical application of our framework by linking two robots in Cambridge and Singapore for a collaborative closed-loop optimisation for a pharmaceutically-relevant aldol condensation reaction in real-time. The knowledge graph autonomously evolves toward the scientist's research goals, with the two robots effectively generating a Pareto front for cost-yield optimisation in three days.
Affiliation(s)
- Jiaru Bai
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS, UK
| | - Sebastian Mosbach
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS, UK
- Cambridge Centre for Advanced Research and Education in Singapore (CARES), 1 Create Way, CREATE Tower, #05-05, Singapore, 138602, Singapore
| | - Connor J Taylor
- Astex Pharmaceuticals, 436 Cambridge Science Park Milton Road, Cambridge, CB4 0QA, UK
- Innovation Centre in Digital Molecular Technologies, Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
- Faculty of Engineering, University of Nottingham, University Park, Nottingham, NG7 2RD, UK
| | - Dogancan Karan
- Cambridge Centre for Advanced Research and Education in Singapore (CARES), 1 Create Way, CREATE Tower, #05-05, Singapore, 138602, Singapore
| | - Kok Foong Lee
- CMCL Innovations, Sheraton House, Cambridge, CB3 0AX, UK
| | - Simon D Rihm
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS, UK
- Cambridge Centre for Advanced Research and Education in Singapore (CARES), 1 Create Way, CREATE Tower, #05-05, Singapore, 138602, Singapore
| | - Jethro Akroyd
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS, UK
- Cambridge Centre for Advanced Research and Education in Singapore (CARES), 1 Create Way, CREATE Tower, #05-05, Singapore, 138602, Singapore
| | - Alexei A Lapkin
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS, UK
- Cambridge Centre for Advanced Research and Education in Singapore (CARES), 1 Create Way, CREATE Tower, #05-05, Singapore, 138602, Singapore
- Innovation Centre in Digital Molecular Technologies, Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
| | - Markus Kraft
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS, UK
- Cambridge Centre for Advanced Research and Education in Singapore (CARES), 1 Create Way, CREATE Tower, #05-05, Singapore, 138602, Singapore
- School of Chemical and Biomedical Engineering, Nanyang Technological University, 62 Nanyang Drive, 637459, Singapore, Singapore
- The Alan Turing Institute, London, NW1 2DB, UK
|