1. Götz J, Richards E, Stepek IA, Takahashi Y, Huang YL, Bertschi L, Rubi B, Bode JW. Predicting three-component reaction outcomes from ~40,000 miniaturized reactant combinations. Sci Adv 2025; 11:eadw6047. [PMID: 40435244 PMCID: PMC12118581 DOI: 10.1126/sciadv.adw6047] [Received: 02/08/2025] [Accepted: 04/25/2025] [Indexed: 06/01/2025]
Abstract
Efficient drug discovery depends on reliable synthetic access to candidate molecules, but emerging machine learning approaches to predicting reaction outcomes are hampered by poor availability of high-quality data. Here, we demonstrate an on-demand synthesis platform based on a three-component reaction that delivers drug-like molecules. Miniaturization and automation enable the execution and analysis of 50,000 distinct reactions on a 3-microliter scale from 193 different substrates, producing the largest public reaction outcome dataset. With machine learning, we accurately predict the result of unknown reactions and analyze the impact of dataset size on model training, both enabling accurate outcome predictions even for unseen reactants and providing a sufficiently large dataset to critically evaluate emerging machine learning approaches to chemical reactivity.
Affiliation(s)
- Julian Götz
  - Laboratory for Organic Chemistry, Department of Chemistry and Applied Biosciences, ETH Zürich, 8093 Zürich, Switzerland
- Euan Richards
  - Laboratory for Organic Chemistry, Department of Chemistry and Applied Biosciences, ETH Zürich, 8093 Zürich, Switzerland
- Iain A. Stepek
  - Laboratory for Organic Chemistry, Department of Chemistry and Applied Biosciences, ETH Zürich, 8093 Zürich, Switzerland
- Yu Takahashi
  - Laboratory for Organic Chemistry, Department of Chemistry and Applied Biosciences, ETH Zürich, 8093 Zürich, Switzerland
- Yi-Lin Huang
  - Laboratory for Organic Chemistry, Department of Chemistry and Applied Biosciences, ETH Zürich, 8093 Zürich, Switzerland
- Louis Bertschi
  - Molecular and Biomolecular Analysis Service (MoBiAS), Department of Chemistry and Applied Biosciences, ETH Zürich, 8093 Zürich, Switzerland
- Bertran Rubi
  - Molecular and Biomolecular Analysis Service (MoBiAS), Department of Chemistry and Applied Biosciences, ETH Zürich, 8093 Zürich, Switzerland
- Jeffrey W. Bode
  - Laboratory for Organic Chemistry, Department of Chemistry and Applied Biosciences, ETH Zürich, 8093 Zürich, Switzerland
2. Souza LW, Ricke ND, Chaffin BC, Fortunato ME, Jiang S, Soylu C, Caya TC, Lau SH, Wieser KA, Doyle AG, Tan KL. Applying Active Learning toward Building a Generalizable Model for Ni-Photoredox Cross-Electrophile Coupling of Aryl and Alkyl Bromides. J Am Chem Soc 2025. [PMID: 40401689 DOI: 10.1021/jacs.5c02218] [Indexed: 05/23/2025]
Abstract
When developing machine learning models for yield prediction, the two main challenges are effectively exploring condition space and substrate space. In this article, we disclose an approach for mapping the substrate space for Ni/photoredox-catalyzed cross-electrophile coupling of alkyl bromides and aryl bromides in a high-throughput experimentation (HTE) context. This model employs active learning (in particular, uncertainty querying) as a strategy to rapidly construct a yield model. Given the vastness of substrate space, we focused on an approach that builds an initial model and then uses a minimal data set to expand into new chemical spaces. In particular, we built a model for a virtual space of 22,240 compounds using fewer than 400 data points. We demonstrated that the model can be expanded to 33,312 compounds by adding information around 24 building blocks (<100 additional reactions). Comparing the active learning-based model to one constructed on randomly selected data showed that the active learning model was significantly better at predicting which reactions will be successful. A combination of density functional theory (DFT) and difference Morgan fingerprints was employed to construct the random forest model. Feature importance analysis indicates that key DFT features that are related to the reaction mechanism (e.g., alkyl radical LUMO energy) were crucial for model performance and predictions on aryl bromides outside the training set. We anticipate that combining DFT featurization and uncertainty-based querying will help the synthetic organic community build predictive models in a data-efficient manner for other chemical reactions that feature large and diverse scopes.
Affiliation(s)
- Lucas W Souza
  - Global Discovery Chemistry, Novartis, Cambridge, Massachusetts 02139, United States
- Nathan D Ricke
  - Global Discovery Chemistry, Novartis, Cambridge, Massachusetts 02139, United States
- Braden C Chaffin
  - Department of Chemistry & Biochemistry, University of California, Los Angeles, California 90095, United States
- Mike E Fortunato
  - Global Discovery Chemistry, Novartis, Cambridge, Massachusetts 02139, United States
- Shutian Jiang
  - Department of Chemistry & Biochemistry, University of California, Los Angeles, California 90095, United States
- Cihan Soylu
  - Global Discovery Chemistry, Novartis, Cambridge, Massachusetts 02139, United States
- Thomas C Caya
  - Global Discovery Chemistry, Novartis, Cambridge, Massachusetts 02139, United States
- Sii Hong Lau
  - Global Discovery Chemistry, Novartis, Cambridge, Massachusetts 02139, United States
- Katherine A Wieser
  - Global Discovery Chemistry, Novartis, Cambridge, Massachusetts 02139, United States
- Abigail G Doyle
  - Department of Chemistry & Biochemistry, University of California, Los Angeles, California 90095, United States
- Kian L Tan
  - Global Discovery Chemistry, Novartis, Cambridge, Massachusetts 02139, United States
3. Sletten ET, Wolf JB, Danglad‐Flores J, Seeberger PH. Carbohydrate Synthesis is Entering the Data-Driven Digital Era. Chemistry 2025; 31:e202500289. [PMID: 40178205 PMCID: PMC12080308 DOI: 10.1002/chem.202500289] [Received: 01/23/2025] [Revised: 03/27/2025] [Accepted: 03/28/2025] [Indexed: 04/05/2025]
Abstract
Glycans are vital in biological processes, but their nontemplated, heterogeneous structures complicate structure-function analyses. Glycosylation, the key reaction in synthetic glycochemistry, remains not entirely predictable due to its complex mechanism and the need for protecting groups that impact reaction outcomes. This concept highlights recent advancements in glycochemistry and emphasizes the integration of digital tools, including automation, computational modelling, and data management, to improve carbohydrate synthesis and support further progress in the field.
Affiliation(s)
- Eric T. Sletten
  - Max Planck Institute of Colloids and Interfaces, Potsdam Science Park, Am Mühlenberg 1, 14476 Potsdam, Germany
- Jakob B. Wolf
  - Max Planck Institute of Colloids and Interfaces, Potsdam Science Park, Am Mühlenberg 1, 14476 Potsdam, Germany
  - Institut für Chemie, Biochemie und Pharmazie, Freie Universität Berlin, Takusstraße 3, 14195 Berlin, Germany
- José Danglad‐Flores
  - Max Planck Institute of Colloids and Interfaces, Potsdam Science Park, Am Mühlenberg 1, 14476 Potsdam, Germany
- Peter H. Seeberger
  - Max Planck Institute of Colloids and Interfaces, Potsdam Science Park, Am Mühlenberg 1, 14476 Potsdam, Germany
  - Institut für Chemie, Biochemie und Pharmazie, Freie Universität Berlin, Takusstraße 3, 14195 Berlin, Germany
4. Krzyzanowski A, Pickett SD, Pogány P. Exploring BERT for Reaction Yield Prediction: Evaluating the Impact of Tokenization, Molecular Representation, and Pretraining Data Augmentation. J Chem Inf Model 2025; 65:4381-4402. [PMID: 40311104 DOI: 10.1021/acs.jcim.5c00359] [Indexed: 05/03/2025]
Abstract
Predicting reaction yields in synthetic chemistry remains a significant challenge. This study systematically evaluates the impact of tokenization, molecular representation, pretraining data, and adversarial training on a BERT-based model for yield prediction of Buchwald-Hartwig and Suzuki-Miyaura coupling reactions using publicly available HTE data sets. We demonstrate that molecular representation choice (SMILES, DeepSMILES, SELFIES, Morgan fingerprint-based notation, IUPAC names) has minimal impact on model performance, while typically BPE and SentencePiece tokenization outperform other methods. WordPiece is strongly discouraged for SELFIES and fingerprint-based notation. Furthermore, pretraining with relatively small data sets (<100 K reactions) achieves comparable performance to larger data sets containing millions of examples. The use of artificially generated domain-specific pretraining data is proposed. The artificially generated sets prove to be a good surrogate for the reaction schemes extracted from reaction data sets such as Pistachio or Reaxys. The best performance was observed for hybrid pretraining sets combining the real and the domain-specific, artificial data. Finally, we show that a novel adversarial training approach, perturbing input embeddings dynamically, improves model robustness and generalizability for yield and reaction success prediction. These findings provide valuable insights for developing robust and practical machine learning models for yield prediction in synthetic chemistry. GSK's BERT training code base is made available to the community with this work.
Affiliation(s)
- Stephen D Pickett
  - GSK Medicines Research Centre, Gunnels Wood Road, Stevenage SG1 2NY, U.K.
- Peter Pogány
  - GSK Medicines Research Centre, Gunnels Wood Road, Stevenage SG1 2NY, U.K.
5. Stephens S, Lambert KM. The Importance of Atomic Charges for Predicting Site-Selective Ir-, Ru-, and Rh-Catalyzed C-H Borylations. J Org Chem 2025; 90:6000-6012. [PMID: 40268690 PMCID: PMC12053941 DOI: 10.1021/acs.joc.5c00343] [Received: 02/13/2025] [Revised: 04/04/2025] [Accepted: 04/15/2025] [Indexed: 04/25/2025]
Abstract
A supervised machine learning model has been developed that allows for the prediction of site selectivity in late-stage C-H borylations. Model development was accomplished using literature data for the site-selective (≥95%) C-H borylation of 189 unique arene, heteroarene, and aliphatic substrates that feature a total of 971 possible sp² or sp³ C-H borylation sites. The reported experimental data was supplemented with additional chemoinformatic descriptors, computed atomic charges at the C-H borylation sites, and data from parameterization of catalytically active tris-boryl complexes resulting from the combination of seven different Ir-, Ru-, and Rh-based precatalysts with eight different ligands. Of the over 1600 parameters investigated, the computed atomic charges (e.g., Hirshfeld, ChelpG, and Mulliken charges) on the hydrogen and carbon atoms at the site of borylation were identified as the most important features that allow for the successful prediction of whether a particular C-H bond will undergo a site-selective borylation. The overall accuracy of the developed model was 88.9% ± 2.5% with precision, recall, and F1 scores of 92-95% for the nonborylating sites and 65-75% for the sites of borylation. The model was demonstrated to be generalizable to molecules outside of the training/test sets with an additional validation set of 12 electronically and structurally diverse systems.
Affiliation(s)
- Shannon M. Stephens
  - Department of Chemistry and Biochemistry, Old Dominion University, 4501 Elkhorn Ave, Norfolk, Virginia 23529, United States
- Kyle M. Lambert
  - Department of Chemistry and Biochemistry, Old Dominion University, 4501 Elkhorn Ave, Norfolk, Virginia 23529, United States
6. He Y, Lubchenko V. Knowledge as a Breaking of Ergodicity. Neural Comput 2025; 37:742-792. [PMID: 40030134 DOI: 10.1162/neco_a_01741] [Received: 06/06/2024] [Accepted: 11/21/2024] [Indexed: 03/19/2025]
Abstract
We construct a thermodynamic potential that can guide training of a generative model defined on a set of binary degrees of freedom. We argue that upon reduction in description, so as to make the generative model computationally manageable, the potential develops multiple minima. This is mirrored by the emergence of multiple minima in the free energy proper of the generative model itself. The variety of training samples that employ N binary degrees of freedom is ordinarily much lower than the size 2^N of the full phase space. The nonrepresented configurations, we argue, should be thought of as comprising a high-temperature phase separated by an extensive energy gap from the configurations composing the training set. Thus, training amounts to sampling a free energy surface in the form of a library of distinct bound states, each of which breaks ergodicity. The ergodicity breaking prevents escape into the near continuum of states comprising the high-temperature phase; thus, it is necessary for proper functionality. It may, however, have the side effect of limiting access to patterns that were underrepresented in the training set. At the same time, the ergodicity breaking within the library complicates both learning and retrieval. As a remedy, one may concurrently employ multiple generative models, up to one model per free energy minimum.
Affiliation(s)
- Yang He
  - Department of Chemistry, University of Houston, Houston, TX 77204-5003, U.S.A.
- Vassiliy Lubchenko
  - Department of Chemistry, University of Houston, Houston, TX 77204-5003, U.S.A.
  - Department of Physics, University of Houston, Houston, TX 77204-5005, U.S.A.
  - Texas Center for Superconductivity, University of Houston, Houston, TX 77204-5002, U.S.A.
7. Shim E, Tewari A, Cernak T, Zimmerman PM. Recommending reaction conditions with label ranking. Chem Sci 2025; 16:4109-4118. [PMID: 39906388 PMCID: PMC11788591 DOI: 10.1039/d4sc06728b] [Received: 10/04/2024] [Accepted: 01/24/2025] [Indexed: 02/06/2025]
Abstract
Pinpointing effective reaction conditions can be challenging, even for reactions with significant precedent. Herein, models that rank reaction conditions are introduced as a conceptually new means for prioritizing experiments, distinct from the mainstream approach of yield regression. Specifically, label ranking, which operates using input features only from substrates, will be shown to better generalize to new substrates than prior models. Evaluation on practical reaction condition selection scenarios (choosing from either 4 or 18 conditions, and datasets with or without missing reactions) demonstrates label ranking's utility. Ranking aggregation through Borda's method and relative simplicity are key features that let label ranking achieve consistently high performance.
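The abstract names Borda's method for ranking aggregation without describing how the authors apply it. As a minimal sketch of a generic Borda count (not the paper's implementation), the following assumes each ranking lists the same condition labels from best to worst; the function name `borda_aggregate` and the labels `cond_A` to `cond_D` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def borda_aggregate(rankings: Sequence[Sequence[str]]) -> List[str]:
    """Aggregate several best-to-worst rankings with a Borda count.

    An item at position p in a ranking of n items earns n - 1 - p points.
    Items are returned by descending total; ties are broken alphabetically.
    """
    scores: Dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - 1 - position
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical example: three rankings over four made-up condition labels.
rankings = [
    ["cond_A", "cond_B", "cond_C", "cond_D"],
    ["cond_B", "cond_A", "cond_D", "cond_C"],
    ["cond_A", "cond_C", "cond_B", "cond_D"],
]
consensus = borda_aggregate(rankings)
```

Here `cond_A` collects 3 + 2 + 3 = 8 points and therefore heads the consensus ranking. The alphabetical tie-break is an arbitrary choice made only to keep the output deterministic.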
Affiliation(s)
- Eunjae Shim
  - Department of Chemistry, University of Michigan, Ann Arbor, MI, USA
- Ambuj Tewari
  - Department of Statistics, University of Michigan, Ann Arbor, MI, USA
  - Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA
- Tim Cernak
  - Department of Chemistry, University of Michigan, Ann Arbor, MI, USA
  - Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
- Paul M Zimmerman
  - Department of Chemistry, University of Michigan, Ann Arbor, MI, USA
8. Felten S, He CQ, Emmert MH. C-H Aminoalkylation of 5-Membered Heterocycles: Influence of Descriptors, Data Set Size, and Data Quality on the Predictiveness of Machine Learning Models and Expansion of the Substrate Space Beyond 1,3-Azoles. J Org Chem 2025; 90:2613-2625. [PMID: 39933045 DOI: 10.1021/acs.joc.4c02574] [Indexed: 02/13/2025]
Abstract
We report a general C-H aminoalkylation of 5-membered heterocycles through a combined machine learning/experimental workflow. Our work describes previously unknown C-H functionalization reactivity and creates a predictive machine learning (ML) model through iterative refinement over 6 rounds of active learning. The initial model established with 1,3-azoles predicts the reactivities of N-aryl indazoles, 1,2,4-triazolopyrazines, 1,2,3-thiadiazoles, and 1,3,4-oxadiazoles, while other substrate classes (e.g., pyrazoles and 1,2,4-triazoles) are not predicted well. The final model includes the reactivities of additional heterocyclic scaffolds in the training data, which results in high predictive accuracy across all of the tested cores. The high prediction performance is shown both within the training set via cross-validation (CV R² = 0.81) and when predicting unseen substrates of diverse molecular weight and structure (Test R² = 0.95). The concept of feature engineering is discussed, and we benchmark mechanistically related DFT-based features that are more time-intensive and laborious in comparison with molecular descriptors and fingerprints. Importantly, this work establishes novel reactivity for heterocycles for which C-H functionalization methods are underdeveloped. Since such heterocycles are key motifs in drug discovery and development, we expect this work to be of significant use to the synthetic and synthesis-oriented ML communities.
Affiliation(s)
- Stephanie Felten
  - Process Research and Development, MRL, Merck & Co., Inc., 126 E Lincoln Ave, Rahway, New Jersey 07065, United States
- Cyndi Qixin He
  - Computational and Structural Chemistry, MRL, Merck & Co., Inc., 126 E Lincoln Ave, Rahway, New Jersey 07065, United States
- Marion H Emmert
  - Process Research and Development, MRL, Merck & Co., Inc., 126 E Lincoln Ave, Rahway, New Jersey 07065, United States
9. Nakajima H, Murata C, Noto N, Saito S. Database Construction for the Virtual Screening of the Ruthenium-Catalyzed Hydrogenation of Ketones. J Org Chem 2025; 90:1054-1060. [PMID: 39762115 DOI: 10.1021/acs.joc.4c02347] [Indexed: 01/18/2025]
Abstract
During the recent development of machine-learning (ML) methods for organic synthesis, the value of "failed experiments" has increasingly been acknowledged. Accordingly, we have developed an exhaustive database comprising 300 entries of experimental data obtained by performing ruthenium-catalyzed hydrogenation reactions using 10 ketones as substrates and 30 phosphine ligands. After evaluating the predictive performance of ML models using the constructed database, we conducted a virtual screening of commercially available phosphine ligands. For the virtual screening, we utilized several models, such as histogram-based gradient boosting and Ridge regression, combined with the Mordred descriptors and MACCSKeys, respectively. The disclosed approach resulted in the identification of high-performance phosphine ligands, and the rationale behind the predictions in the virtual screening was analyzed using SHAP.
Affiliation(s)
- Haruno Nakajima
  - Graduate School of Science, Nagoya University, Nagoya 464-8602, Japan
- Chihaya Murata
  - Graduate School of Science, Nagoya University, Nagoya 464-8602, Japan
- Naoki Noto
  - Integrated Research Consortium on Chemical Sciences (IRCCS), Nagoya University, Nagoya 464-8602, Japan
- Susumu Saito
  - Graduate School of Science, Nagoya University, Nagoya 464-8602, Japan
  - Integrated Research Consortium on Chemical Sciences (IRCCS), Nagoya University, Nagoya 464-8602, Japan
10. Cheng AH, Ser CT, Skreta M, Guzmán-Cordero A, Thiede L, Burger A, Aldossary A, Leong SX, Pablo-García S, Strieth-Kalthoff F, Aspuru-Guzik A. Spiers Memorial Lecture: How to do impactful research in artificial intelligence for chemistry and materials science. Faraday Discuss 2025; 256:10-60. [PMID: 39400305 DOI: 10.1039/d4fd00153b] [Indexed: 10/15/2024]
Abstract
Machine learning has been pervasively touching many fields of science. Chemistry and materials science are no exception. While machine learning has been making a great impact, it is still not reaching its full potential or maturity. In this perspective, we first outline current applications across a diversity of problems in chemistry. Then, we discuss how machine learning researchers view and approach problems in the field. Finally, we provide our considerations for maximizing impact when researching machine learning for chemistry.
Affiliation(s)
- Austin H Cheng
  - Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
  - Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
- Cher Tian Ser
  - Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
  - Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
- Marta Skreta
  - Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
  - Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
- Andrés Guzmán-Cordero
  - Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
  - Tinbergen Institute, University of Amsterdam, Amsterdam, Netherlands
- Luca Thiede
  - Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
  - Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
- Andreas Burger
  - Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
  - Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
- Shi Xuan Leong
  - Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada
  - School of Chemistry, Chemical Engineering and Biotechnology, Nanyang Technological University, Singapore 63737, Singapore
- Alán Aspuru-Guzik
  - Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
  - Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
  - Acceleration Consortium, Toronto, Ontario M5G 1X6, Canada
  - Department of Chemical Engineering and Applied Chemistry, University of Toronto, Canada
  - Department of Materials Science and Engineering, University of Toronto, Canada
  - Lebovic Fellow, Canadian Institute for Advanced Research (CIFAR), Canada
11. Wang Z, Lin K, Pei J, Lai L. Reacon: a template- and cluster-based framework for reaction condition prediction. Chem Sci 2025; 16:854-866. [PMID: 39650221 PMCID: PMC11622862 DOI: 10.1039/d4sc05946h] [Received: 09/03/2024] [Accepted: 11/27/2024] [Indexed: 12/11/2024]
Abstract
Computer-assisted synthesis planning has emerged as a valuable tool for organic synthesis. Prediction of reaction conditions is crucial for applying the planned synthesis routes. However, achieving diverse suggestions while ensuring the reasonableness of predictions remains an underexplored challenge. In this study, we introduce an innovative method for forecasting reaction conditions using a combination of graph neural networks, reaction templates, and a clustering algorithm. Our method, trained on the refined USPTO dataset, achieves a top-3 accuracy of 63.48% in recalling the recorded conditions. Moreover, when focusing solely on recalling reactions within the same cluster, the top-3 accuracy increases to 85.65%. Finally, by applying the method to recently published molecule synthesis routes and achieving an 85.00% top-3 accuracy at the cluster level, we demonstrate our approach's capability to deliver reliable and diverse condition predictions.
Affiliation(s)
- Zihan Wang
  - BNLMS, Peking-Tsinghua Center for Life Sciences, College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, China
- Kangjie Lin
  - BNLMS, Peking-Tsinghua Center for Life Sciences, College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, China
- Jianfeng Pei
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
- Luhua Lai
  - BNLMS, Peking-Tsinghua Center for Life Sciences, College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, China
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
12. Xu L, Zhu J, Shen X, Chai J, Shi L, Wu B, Li W, Ma D. 6-Hydroxy Picolinohydrazides Promoted Cu(I)-Catalyzed Hydroxylation Reaction in Water: Machine-Learning Accelerated Ligands Design and Reaction Optimization. Angew Chem Int Ed Engl 2024; 63:e202412552. [PMID: 39189301 DOI: 10.1002/anie.202412552] [Received: 07/03/2024] [Revised: 08/19/2024] [Accepted: 08/25/2024] [Indexed: 08/28/2024]
Abstract
Hydroxylated (hetero)arenes are privileged motifs in natural products, materials, and small-molecule pharmaceuticals, and serve as versatile intermediates in synthetic organic chemistry. Herein, we report an efficient Cu(I)/6-hydroxy picolinohydrazide-catalyzed hydroxylation reaction of (hetero)aryl halides (Br, Cl) in water. By establishing machine learning (ML) models, the design of ligands and optimization of reaction conditions were effectively accelerated. The N-(1,3-dimethyl-9H-carbazol-9-yl)-6-hydroxypicolinamide (L32, 6-HPA-DMCA) demonstrated high efficiency for (hetero)aryl bromides, promoting hydroxylation reactions with a minimal catalyst loading of 0.01 mol % (100 ppm) at 80 °C to reach 10000 TON; for substrates containing sensitive functional groups, the catalyst loading needs to be increased to 3.0 mol % under near-room temperature conditions. N-(2,7-Di-tert-butyl-9H-carbazol-9-yl)-6-hydroxypicolinamide (L42, 6-HPA-DTBCA) displayed superior reaction activity for chloride substrates, enabling hydroxylation reactions at 100 °C with 2-3 mol % catalyst loading. These represent the state of the art for both lowest catalyst loading and temperature in copper-catalyzed hydroxylation reactions. Furthermore, this method features a sustainable and environmentally friendly solvent system, accommodates a wide range of substrates, and shows potential for developing robust and scalable synthesis processes for key pharmaceutical intermediates.
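The loading and turnover figures quoted above are mutually consistent under the usual definition of turnover number. As a worked check, the following assumes TON is defined as moles of product per mole of catalyst and that the yield y is essentially quantitative; the symbols y and x_cat are introduced here for illustration and do not appear in the abstract.

```latex
\mathrm{TON} = \frac{n_{\text{product}}}{n_{\text{catalyst}}} = \frac{y}{x_{\text{cat}}},
\qquad
x_{\text{cat}} = 0.01\ \text{mol\,\%} = 1 \times 10^{-4}
\;\Rightarrow\;
\mathrm{TON} = \frac{1}{1 \times 10^{-4}} = 10^{4} \quad (y = 1)
```

By the same relation, a loading of 3.0 mol % caps the turnover number at roughly 33 even at full conversion.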
Affiliation(s)
- Lanting Xu
  - State Key Laboratory of Chemical Biology, Shanghai Institute of Organic Chemistry, University of Chinese Academy of Sciences, Chinese Academy of Sciences, 345 Lingling Lu, Shanghai, 200032, China
- Jiazhou Zhu
  - Suzhou Novartis Technical Development Co., Ltd., #18-1, Tonglian Road, Bixi Subdistrict, Changshu, Jiangsu, 215537, China
- Xiaodong Shen
  - Suzhou Novartis Technical Development Co., Ltd., #18-1, Tonglian Road, Bixi Subdistrict, Changshu, Jiangsu, 215537, China
- Jiashuang Chai
  - Chang-Kung Chuang Institute, School of Chemistry and Molecular Engineering, East China Normal University, 500 Dongchuang Lu, Shanghai, 200062, China
- Lei Shi
  - Suzhou Novartis Technical Development Co., Ltd., #18-1, Tonglian Road, Bixi Subdistrict, Changshu, Jiangsu, 215537, China
- Bin Wu
  - Suzhou Novartis Technical Development Co., Ltd., #18-1, Tonglian Road, Bixi Subdistrict, Changshu, Jiangsu, 215537, China
- Wei Li
  - Suzhou Novartis Technical Development Co., Ltd., #18-1, Tonglian Road, Bixi Subdistrict, Changshu, Jiangsu, 215537, China
- Dawei Ma
  - State Key Laboratory of Chemical Biology, Shanghai Institute of Organic Chemistry, University of Chinese Academy of Sciences, Chinese Academy of Sciences, 345 Lingling Lu, Shanghai, 200032, China
13. Li DZ, Gong XQ. Challenges with Literature-Derived Data in Machine Learning for Yield Prediction: A Case Study on Pd-Catalyzed Carbonylation Reactions. J Phys Chem A 2024; 128:10423-10430. [PMID: 39565904 DOI: 10.1021/acs.jpca.4c05489] [Indexed: 11/22/2024]
Abstract
The application of machine learning (ML) to predict reaction yields has shown remarkable accuracy when based on high-throughput computational and experimental data. However, the accuracy significantly diminishes when leveraging literature-derived data, highlighting a gap in the predictive capability of the current ML models. This study, focusing on Pd-catalyzed carbonylation reactions, reveals that even with a data set of 2512 reactions, the best-performing model reaches only an R² of 0.51. Further investigations show that the models' effectiveness is predominantly confined to predictions within narrow subsets of data, closely related and from the same literature sources, rather than across the broader, heterogeneous data sets available in the literature. The reliance on data similarity, coupled with small sample sizes from the same sources, makes the model highly sensitive to inherent fluctuations typical of small data sets, adversely impacting stability, accuracy, and generalizability. The findings underscore the inherent limitations of current ML techniques in leveraging literature-derived data for predicting chemical reaction yields, highlighting the need for more sophisticated approaches to handle the complexity and diversity of chemical data.
Affiliation(s)
- Dong-Zhi Li
  - Centre for Computational Chemistry, School of Chemistry and Molecular Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Xue-Qing Gong
  - Centre for Computational Chemistry, School of Chemistry and Molecular Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
  - School of Chemistry and Chemical Engineering, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China
14. Uceda RG, Gijón A, Míguez‐Lago S, Cruz CM, Blanco V, Fernández‐Álvarez F, Álvarez de Cienfuegos L, Molina‐Solana M, Gómez‐Romero J, Miguel D, Mota AJ, Cuerva JM. Can Deep Learning Search for Exceptional Chiroptical Properties? The Halogenated [6]Helicene Case. Angew Chem Int Ed Engl 2024; 63:e202409998. [PMID: 39329214 PMCID: PMC11586703 DOI: 10.1002/anie.202409998] [Received: 05/27/2024] [Revised: 09/11/2024] [Accepted: 09/24/2024] [Indexed: 09/28/2024]
Abstract
The relationship between chemical structure and chiroptical properties is not always clearly understood. Nowadays, efforts to develop new systems with enhanced optical properties follow the trial-error method. A large number of data would allow us to obtain more robust conclusions and guide research toward molecules with practical applications. In this sense, in this work we predict the chiroptical properties of millions of halogenated [6]helicenes in terms of the rotatory strength (R). We have used DFT calculations to randomly create derivatives including from 1 to 16 halogen atoms, that were then used as a data set to train different deep neural network models. These models allow us to i) predict the Rmax for any halogenated [6]helicene with a very low computational cost, and ii) to understand the physical reasons that favour some substitutions over others. Finally, we synthesized derivatives with higher predicted Rmax obtaining excellent correlation among the values obtained experimentally and the predicted ones.
Affiliation(s)
- Rafael G. Uceda
- Departamento de Química Orgánica, Unidad de Excelencia de Química Aplicada a la Biomedicina y Medioambiente (UEQ), Universidad de Granada (UGR), Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
| | - Alfonso Gijón
- Departamento de Ciencias de la Computación e Inteligencia Artificial, UGR, E.T.S. de Ingenierías Informática y de Telecomunicación, C/ Periodista Daniel Saucedo Aranda S/N, 18071 Granada, Spain
| | - Sandra Míguez‐Lago
- Departamento de Química Orgánica, Unidad de Excelencia de Química Aplicada a la Biomedicina y Medioambiente (UEQ), Universidad de Granada (UGR), Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
| | - Carlos M. Cruz
- Departamento de Química Orgánica, Unidad de Excelencia de Química Aplicada a la Biomedicina y Medioambiente (UEQ), Universidad de Granada (UGR), Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
| | - Víctor Blanco
- Departamento de Química Orgánica, Unidad de Excelencia de Química Aplicada a la Biomedicina y Medioambiente (UEQ), Universidad de Granada (UGR), Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
| | - Fátima Fernández‐Álvarez
- Departamento de Química Orgánica, Unidad de Excelencia de Química Aplicada a la Biomedicina y Medioambiente (UEQ), Universidad de Granada (UGR), Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
| | - Luis Álvarez de Cienfuegos
- Departamento de Química Orgánica, Unidad de Excelencia de Química Aplicada a la Biomedicina y Medioambiente (UEQ), Universidad de Granada (UGR), Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
- Instituto de Investigación Biosanitaria, Avda. Madrid, 15, 18016 Granada, Spain
| | - Miguel Molina‐Solana
- Departamento de Ciencias de la Computación e Inteligencia Artificial, UGR, E.T.S. de Ingenierías Informática y de Telecomunicación, C/ Periodista Daniel Saucedo Aranda S/N, 18071 Granada, Spain
| | - Juan Gómez‐Romero
- Departamento de Ciencias de la Computación e Inteligencia Artificial, UGR, E.T.S. de Ingenierías Informática y de Telecomunicación, C/ Periodista Daniel Saucedo Aranda S/N, 18071 Granada, Spain
| | - Delia Miguel
- Departamento de Fisicoquímica, UEQ, UGR, Facultad de Farmacia, Avda. Profesor Clavera s/n, C. U. Cartuja, 18071 Granada, Spain
| | - Antonio J. Mota
- Departamento de Química Inorgánica, UEQ, UGR, Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
| | - Juan M. Cuerva
- Departamento de Química Orgánica, Unidad de Excelencia de Química Aplicada a la Biomedicina y Medioambiente (UEQ), Universidad de Granada (UGR), Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
15
|
Roszak R, Gadina L, Wołos A, Makkawi A, Mikulak-Klucznik B, Bilgi Y, Molga K, Gołębiowska P, Popik O, Klucznik T, Szymkuć S, Moskal M, Baś S, Frydrych R, Mlynarski J, Vakuliuk O, Gryko DT, Grzybowski BA. Systematic, computational discovery of multicomponent and one-pot reactions. Nat Commun 2024; 15:10285. [PMID: 39604395 PMCID: PMC11603032 DOI: 10.1038/s41467-024-54611-5] [Received: 11/15/2024] [Accepted: 11/18/2024] [Indexed: 11/29/2024]
Abstract
Discovery of new types of reactions is essential to organic chemistry because it expands the scope of accessible molecular scaffolds and can enable more economical syntheses of existing structures. In this context, the so-called multicomponent reactions, MCRs, are of particular interest because they can build complex scaffolds from multiple starting materials in just one step, without purification of intermediates. However, for over a century of active research, MCRs have been discovered rather than designed, and their number remains limited to only several hundred. This work demonstrates that computers taught the essential knowledge of reaction mechanisms and rules of physical-organic chemistry can design - completely autonomously and in large numbers - mechanistically distinct MCRs. Moreover, when supplemented by models to approximate kinetic rates, the algorithm can predict reaction yields and identify reactions that have potential for organocatalysis. These predictions are validated by experiments spanning different modes of reactivity and diverse product scaffolds.
Affiliation(s)
| | - Louis Gadina
- Institute of Organic Chemistry, Polish Academy of Sciences, Warsaw, Poland
- Center for Algorithmic and Robotized Synthesis (CARS), Institute for Basic Science (IBS), Ulsan, 44919, Republic of Korea
| | | | - Ahmad Makkawi
- Institute of Organic Chemistry, Polish Academy of Sciences, Warsaw, Poland
| | | | - Yasemin Bilgi
- Institute of Organic Chemistry, Polish Academy of Sciences, Warsaw, Poland
- Center for Algorithmic and Robotized Synthesis (CARS), Institute for Basic Science (IBS), Ulsan, 44919, Republic of Korea
| | - Karol Molga
- Allchemy Inc., Highland, IN, USA
- Institute of Organic Chemistry, Polish Academy of Sciences, Warsaw, Poland
| | | | - Oskar Popik
- Institute of Organic Chemistry, Polish Academy of Sciences, Warsaw, Poland
| | | | | | | | - Sebastian Baś
- Institute of Organic Chemistry, Polish Academy of Sciences, Warsaw, Poland
- Jagiellonian University, Krakow, Poland
| | - Rafał Frydrych
- Institute of Organic Chemistry, Polish Academy of Sciences, Warsaw, Poland
- Center for Algorithmic and Robotized Synthesis (CARS), Institute for Basic Science (IBS), Ulsan, 44919, Republic of Korea
| | - Jacek Mlynarski
- Institute of Organic Chemistry, Polish Academy of Sciences, Warsaw, Poland
| | - Olena Vakuliuk
- Institute of Organic Chemistry, Polish Academy of Sciences, Warsaw, Poland
| | - Daniel T Gryko
- Institute of Organic Chemistry, Polish Academy of Sciences, Warsaw, Poland
| | - Bartosz A Grzybowski
- Institute of Organic Chemistry, Polish Academy of Sciences, Warsaw, Poland
- Center for Algorithmic and Robotized Synthesis (CARS), Institute for Basic Science (IBS), Ulsan, 44919, Republic of Korea
- Department of Chemistry, Ulsan Institute of Science and Technology, UNIST, Ulsan, 44919, Republic of Korea
16
|
Szymkuć S, Wołos A, Roszak R, Grzybowski BA. Estimation of multicomponent reactions' yields from networks of mechanistic steps. Nat Commun 2024; 15:10286. [PMID: 39604372 PMCID: PMC11603315 DOI: 10.1038/s41467-024-54550-1] [Received: 04/12/2024] [Accepted: 11/14/2024] [Indexed: 11/29/2024]
Abstract
This work describes estimation of yields of complex, multicomponent reactions (MCRs) based on the modeled networks of mechanistic steps spanning both the main reaction pathway as well as immediate and downstream side reactions. Because experimental values of the kinetic rate constants for individual mechanistic transforms are extremely sparse, these constants are approximated here using Mayr's nucleophilicity and electrophilicity parameters fine-tuned by correction terms grounded in linear free-energy relationships. With this formalism, the model trained on the mechanistic networks of only 20 - but mechanistically- and yield-diverse MCRs - transfers well to newly discovered MCRs that are based on markedly different mechanisms and types of individual mechanistic transforms. These results suggest that mechanistic-level approach to yield estimation may be a useful alternative to models that are derived from full-reaction data and lack information about yield-lowering side reactions.
Affiliation(s)
| | - Agnieszka Wołos
- Allchemy, Inc., Highland, IN, USA
- Institute of Organic Chemistry, Polish Academy of Sciences, Warsaw, Poland
| | - Rafał Roszak
- Allchemy, Inc., Highland, IN, USA
- Institute of Organic Chemistry, Polish Academy of Sciences, Warsaw, Poland
| | - Bartosz A Grzybowski
- Institute of Organic Chemistry, Polish Academy of Sciences, Warsaw, Poland
- Center for Algorithmic and Robotized Synthesis (CARS), Institute for Basic Science (IBS), Ulsan, Republic of Korea
- Department of Chemistry, Ulsan Institute of Science and Technology, UNIST, Ulsan, Republic of Korea
17
|
Chen LY, Li YP. Machine learning-guided strategies for reaction conditions design and optimization. Beilstein J Org Chem 2024; 20:2476-2492. [PMID: 39376489 PMCID: PMC11457048 DOI: 10.3762/bjoc.20.212] [Received: 06/28/2024] [Accepted: 09/19/2024] [Indexed: 10/09/2024]
Abstract
This review surveys the recent advances and challenges in predicting and optimizing reaction conditions using machine learning techniques. The paper emphasizes the importance of acquiring and processing large and diverse datasets of chemical reactions, and the use of both global and local models to guide the design of synthetic processes. Global models exploit the information from comprehensive databases to suggest general reaction conditions for new reactions, while local models fine-tune the specific parameters for a given reaction family to improve yield and selectivity. The paper also identifies the current limitations and opportunities in this field, such as the data quality and availability, and the integration of high-throughput experimentation. The paper demonstrates how the combination of chemical engineering, data science, and ML algorithms can enhance the efficiency and effectiveness of reaction conditions design, and enable novel discoveries in synthetic chemistry.
Affiliation(s)
- Lung-Yi Chen
- Department of Chemical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan
| | - Yi-Pei Li
- Department of Chemical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan
- Taiwan International Graduate Program on Sustainable Chemical Science and Technology (TIGP-SCST), No. 128, Sec. 2, Academia Road, Taipei 11529, Taiwan
18
|
Han Y, Deng M, Liu K, Chen J, Wang Y, Xu YN, Dian L. Computer-Aided Synthesis Planning (CASP) and Machine Learning: Optimizing Chemical Reaction Conditions. Chemistry 2024; 30:e202401626. [PMID: 39083362 DOI: 10.1002/chem.202401626] [Received: 04/26/2024] [Revised: 07/27/2024] [Accepted: 07/28/2024] [Indexed: 08/02/2024]
Abstract
Computer-aided synthesis planning (CASP) has garnered increasing attention in light of recent advancements in machine learning models. While the focus is on reverse synthesis or forward outcome prediction, optimizing reaction conditions remains a significant challenge. For datasets with multiple variables, the choice of descriptors and models is pivotal. This selection dictates the effective extraction of conditional features and the achievement of higher prediction accuracy. This review delineates the origins of data in conditional optimization, the criteria for descriptor selection, the response models, and the metrics for outcome evaluation, aiming to acquaint readers with the latest research trends and facilitate more informed research in this domain.
Affiliation(s)
- Yu Han
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
| | - Mingjing Deng
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
| | - Ke Liu
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
| | - Jia Chen
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
| | - Yuting Wang
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
| | - Yu-Ning Xu
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
| | - Longyang Dian
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
- Suzhou Institute of Shandong University, No. 388 Ruoshui Road, Suzhou Industrial Park, Suzhou, 215123, P. R. China
19
|
Baczewska P, Kulczykowski M, Zambroń B, Jaszczewska-Adamczak J, Pakulski Z, Roszak R, Grzybowski BA, Mlynarski J. Machine Learning Algorithm Guides Catalyst Choices for Magnesium-Catalyzed Asymmetric Reactions. Angew Chem Int Ed Engl 2024; 63:e202318487. [PMID: 38878001 DOI: 10.1002/anie.202318487] [Received: 12/04/2023] [Indexed: 08/13/2024]
Abstract
Organic-chemical literature encompasses large numbers of catalysts and reactions they can effect. Many of these examples are published merely to document the catalysts' scope but do not necessarily guarantee that a given catalyst is "optimal" (in terms of yield or enantiomeric excess) for a particular reaction. This paper describes a Machine Learning model that aims to improve such catalyst-reaction assignments based on the carefully curated literature data. As we show here for the case of asymmetric magnesium catalysis, this model achieves relatively high accuracy and offers out-of-the-box predictions successfully validated by experiment, e.g., in synthetically demanding asymmetric reductions or Michael additions.
Affiliation(s)
- Paulina Baczewska
- Institute of Organic Chemistry, Polish Academy of Sciences, Kasprzaka 44/52, 02-224, Warsaw, Poland
| | - Michał Kulczykowski
- Institute of Organic Chemistry, Polish Academy of Sciences, Kasprzaka 44/52, 02-224, Warsaw, Poland
| | - Bartosz Zambroń
- Institute of Organic Chemistry, Polish Academy of Sciences, Kasprzaka 44/52, 02-224, Warsaw, Poland
| | | | - Zbigniew Pakulski
- Institute of Organic Chemistry, Polish Academy of Sciences, Kasprzaka 44/52, 02-224, Warsaw, Poland
| | - Rafał Roszak
- Institute of Organic Chemistry, Polish Academy of Sciences, Kasprzaka 44/52, 02-224, Warsaw, Poland
| | - Bartosz A Grzybowski
- Institute of Organic Chemistry, Polish Academy of Sciences, Kasprzaka 44/52, 02-224, Warsaw, Poland
- Center for Algorithmic and Robotized Synthesis (CARS) of Korea's Institute for Basic Science (IBS) and Department of Chemistry, Ulsan National Institute of Science and Technology 50, UNIST-gil, Eonyang-eup, Ulju-gun, Ulsan, 44919, South Korea
| | - Jacek Mlynarski
- Institute of Organic Chemistry, Polish Academy of Sciences, Kasprzaka 44/52, 02-224, Warsaw, Poland
20
|
Singh S, Hernández-Lobato JM. Data-Driven Insights into the Transition-Metal-Catalyzed Asymmetric Hydrogenation of Olefins. J Org Chem 2024; 89:12467-12478. [PMID: 39149801 PMCID: PMC11382158 DOI: 10.1021/acs.joc.4c01396] [Indexed: 08/17/2024]
Abstract
The transition-metal-catalyzed asymmetric hydrogenation of olefins is one of the key transformations with great utility in various industrial applications. The field has been dominated by the use of noble metal catalysts, such as iridium and rhodium. The reactions with the earth-abundant cobalt metal have increased only in recent years. In this work, we analyze the large amount of literature data available on iridium- and rhodium-catalyzed asymmetric hydrogenation. The limited data on reactions using Co catalysts are then examined in the context of Ir and Rh to obtain a better understanding of the reactivity pattern. A detailed data-driven study of the types of olefins, ligands, and reaction conditions such as solvent, temperature, and pressure is carried out. Our analysis provides an understanding of the literature trends and demonstrates that only a few olefin-ligand combinations or reaction conditions are frequently used. The knowledge of this bias in the literature data toward a certain group of substrates or reaction conditions can be useful for practitioners to design new reaction data sets that are suitable to obtain meaningful predictions from machine-learning models.
Affiliation(s)
- Sukriti Singh
- Department of Engineering, University of Cambridge, Cambridge CB2 1PZ, U.K.
21
|
Schäfer F, Lückemeier L, Glorius F. Improving reproducibility through condition-based sensitivity assessments: application, advancement and prospect. Chem Sci 2024:d4sc03017f. [PMID: 39263664 PMCID: PMC11382186 DOI: 10.1039/d4sc03017f] [Received: 05/08/2024] [Accepted: 08/29/2024] [Indexed: 09/13/2024]
Abstract
The fluctuating reproducibility of scientific reports presents a well-recognised issue, frequently stemming from insufficient standardisation, transparency and a lack of information in scientific publications. Consequently, the incorporation of newly developed synthetic methods into practical applications often occurs at a slow rate. In recent years, various efforts have been made to analyse the sensitivity of chemical methodologies and the variation in quantitative outcome observed across different laboratory environments. For today's chemists, determining the key factors that really matter for a reaction's outcome from all the different aspects of chemical methodology can be a challenging task. In response, we provide a detailed examination and customised recommendations surrounding the sensitivity screen, offering a comprehensive assessment of various strategies and exploring their diverse applications by research groups to improve the practicality of their methodologies.
Affiliation(s)
- Felix Schäfer
- Universität Münster, Organisch-Chemisches Institut, Corrensstraße 36, 48149 Münster, Germany
| | - Lukas Lückemeier
- Universität Münster, Organisch-Chemisches Institut, Corrensstraße 36, 48149 Münster, Germany
| | - Frank Glorius
- Universität Münster, Organisch-Chemisches Institut, Corrensstraße 36, 48149 Münster, Germany
22
|
Kalikadien AV, Valsecchi C, van Putten R, Maes T, Muuronen M, Dyubankova N, Lefort L, Pidko EA. Probing machine learning models based on high throughput experimentation data for the discovery of asymmetric hydrogenation catalysts. Chem Sci 2024; 15:13618-13630. [PMID: 39211503 PMCID: PMC11352728 DOI: 10.1039/d4sc03647f] [Received: 06/03/2024] [Accepted: 07/15/2024] [Indexed: 09/04/2024]
Abstract
Enantioselective hydrogenation of olefins by Rh-based chiral catalysts has been extensively studied for more than 50 years. Naively, one would expect that everything about this transformation is known and that selecting a catalyst that induces the desired reactivity or selectivity is a trivial task. Nonetheless, ligand engineering or selection for any new prochiral olefin remains an empirical trial-error exercise. In this study, we investigated whether machine learning techniques could be used to accelerate the identification of the most efficient chiral ligand. For this purpose, we used high throughput experimentation to build a large dataset consisting of results for Rh-catalyzed asymmetric olefin hydrogenation, specially designed for applications in machine learning. We showcased its alignment with existing literature while addressing observed discrepancies. Additionally, a computational framework for the automated and reproducible quantum-chemistry based featurization of catalyst structures was created. Together with less computationally demanding representations, these descriptors were fed into our machine learning pipeline for both out-of-domain and in-domain prediction tasks of selectivity and reactivity. For out-of-domain purposes, our models provided limited efficacy. It was found that even the most expensive descriptors do not impart significant meaning to the model predictions. The in-domain application, while partly successful for predictions of conversion, emphasizes the need for evaluating the cost-benefit ratio of computationally intensive descriptors and for tailored descriptor design. Challenges persist in predicting enantioselectivity, calling for caution in interpreting results from small datasets. Our insights underscore the importance of dataset diversity with broad substrate inclusion and suggest that mechanistic considerations could improve the accuracy of statistical models.
Affiliation(s)
- Adarsh V Kalikadien
- Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology, Van der Maasweg 9, 2629 HZ Delft, The Netherlands
| | - Cecile Valsecchi
- Discovery, Product Development and Supply, Janssen Cilag S.p.A., Viale Fulvio Testi 280/6, 20126 Milano, Italy
| | - Robbert van Putten
- Discovery, Product Development and Supply, Janssen Pharmaceutica N.V., Turnhoutseweg 30, 2340 Beerse, Belgium
| | - Tor Maes
- Discovery, Product Development and Supply, Janssen Pharmaceutica N.V., Turnhoutseweg 30, 2340 Beerse, Belgium
| | - Mikko Muuronen
- Discovery, Product Development and Supply, Janssen Pharmaceutica N.V., Turnhoutseweg 30, 2340 Beerse, Belgium
| | - Natalia Dyubankova
- Discovery, Product Development and Supply, Janssen Pharmaceutica N.V., Turnhoutseweg 30, 2340 Beerse, Belgium
| | - Laurent Lefort
- Discovery, Product Development and Supply, Janssen Pharmaceutica N.V., Turnhoutseweg 30, 2340 Beerse, Belgium
| | - Evgeny A Pidko
- Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology, Van der Maasweg 9, 2629 HZ Delft, The Netherlands
23
|
Tom G, Schmid SP, Baird SG, Cao Y, Darvish K, Hao H, Lo S, Pablo-García S, Rajaonson EM, Skreta M, Yoshikawa N, Corapi S, Akkoc GD, Strieth-Kalthoff F, Seifrid M, Aspuru-Guzik A. Self-Driving Laboratories for Chemistry and Materials Science. Chem Rev 2024; 124:9633-9732. [PMID: 39137296 PMCID: PMC11363023 DOI: 10.1021/acs.chemrev.4c00055] [Indexed: 08/15/2024]
Abstract
Self-driving laboratories (SDLs) promise an accelerated application of the scientific method. Through the automation of experimental workflows, along with autonomous experimental planning, SDLs hold the potential to greatly accelerate research in chemistry and materials discovery. This review provides an in-depth analysis of the state-of-the-art in SDL technology, its applications across various scientific disciplines, and the potential implications for research and industry. This review additionally provides an overview of the enabling technologies for SDLs, including their hardware, software, and integration with laboratory infrastructure. Most importantly, this review explores the diverse range of scientific domains where SDLs have made significant contributions, from drug discovery and materials science to genomics and chemistry. We provide a comprehensive review of existing real-world examples of SDLs, their different levels of automation, and the challenges and limitations associated with each domain.
Affiliation(s)
- Gary Tom
- Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
| | - Stefan P. Schmid
- Department of Chemistry and Applied Biosciences, ETH Zurich, Vladimir-Prelog-Weg 1, CH-8093 Zurich, Switzerland
| | - Sterling G. Baird
- Acceleration Consortium, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
| | - Yang Cao
- Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Acceleration Consortium, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
| | - Kourosh Darvish
- Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
- Acceleration Consortium, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
| | - Han Hao
- Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Acceleration Consortium, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
| | - Stanley Lo
- Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
| | - Sergio Pablo-García
- Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
| | - Ella M. Rajaonson
- Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Vector Institute for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
| | - Marta Skreta
- Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
| | - Naruki Yoshikawa
- Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
| | - Samantha Corapi
- Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
| | - Gun Deniz Akkoc
- Forschungszentrum Jülich GmbH, Helmholtz Institute for Renewable Energy Erlangen-Nürnberg, Cauerstr. 1, 91058 Erlangen, Germany
- Department of Chemical and Biological Engineering, Friedrich-Alexander Universität Erlangen-Nürnberg, Egerlandstr. 3, 91058 Erlangen, Germany
| | - Felix Strieth-Kalthoff
- Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- School of Mathematics and Natural Sciences, University of Wuppertal, Gaußstraße 20, 42119 Wuppertal, Germany
| | - Martin Seifrid
- Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Department of Materials Science and Engineering, North Carolina State University, Raleigh, North Carolina 27695, United States of America
| | - Alán Aspuru-Guzik
- Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
- Acceleration Consortium, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department of Chemical Engineering & Applied Chemistry, University of Toronto, Toronto, Ontario M5S 3E5, Canada
- Department of Materials Science & Engineering, University of Toronto, Toronto, Ontario M5S 3E4, Canada
- Lebovic Fellow, Canadian Institute for Advanced Research (CIFAR), 661 University Ave, Toronto, Ontario M5G 1M1, Canada
24
|
Huang Y, Zhang L, Deng H, Mao J. NJmat: Data-Driven Machine Learning Interface to Accelerate Material Design. J Chem Inf Model 2024; 64:6477-6491. [PMID: 39133673 DOI: 10.1021/acs.jcim.4c00493] [Indexed: 08/27/2024]
Abstract
Machine learning techniques have significantly transformed the way materials scientists conduct research. However, the widespread deployment of machine learning software in daily experimental and simulation research for materials and chemical design has been limited. This is partly due to the substantial time investment and learning curve associated with mastering the necessary codes and computational environments. In this paper, we introduce a user-friendly, data-driven machine learning interface featuring multiple "button-clicking" functionalities to streamline the design of materials and chemicals. This interface automates the processes of transforming materials and molecules, performing feature selection, constructing machine learning models, making virtual predictions, and visualizing results. Such automation accelerates materials prediction and analysis in the inverse design process, aligning with the time criteria outlined by the Materials Genome Initiative. With simple button clicks, researchers can build machine learning models and predict new materials once they have gathered experimental or simulation data. Beyond the ease of use, NJmat offers three additional features for data-driven materials design: (1) automatic feature generation for both inorganic materials (from chemical formulas) and organic molecules (from SMILES), (2) automatic generation of Shapley plots, and (3) automatic construction of "white-box" genetic models and decision trees to provide scientific insights. We present case studies on surface design for halide perovskite materials encompassing both inorganic and organic species. These case studies illustrate general machine learning models for virtual predictions as well as the automatic featurization and Shapley/genetic model construction capabilities. We anticipate that this software tool will expedite materials and molecular design within the scope of the Materials Genome Initiative, particularly benefiting experimentalists.
Affiliation(s)
- Yiru Huang
- Department of Materials Physics, School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
| | - Lei Zhang
- Department of Materials Physics, School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
| | - Hangyuan Deng
- Department of Materials Physics, School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
| | - Junfei Mao
- Department of Materials Physics, School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
| |
|
25
|
Abstract
Retrosynthetic simplicity is introduced as a metric by which methods can be evaluated. An argument in favor of reactions which are retrosynthetically simple is put forward, and recent examples in the context of skeletal editing from my own laboratory as well as contributions from others are analyzed critically through this lens.
Affiliation(s)
- Mark D Levin
- Department of Chemistry, University of Chicago, Chicago, IL 60637, USA
| |
|
26
|
Kim S, Jung Y, Schrier J. Large Language Models for Inorganic Synthesis Predictions. J Am Chem Soc 2024; 146:19654-19659. [PMID: 38991051 DOI: 10.1021/jacs.4c05840] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/13/2024]
Abstract
We evaluate the effectiveness of pretrained and fine-tuned large language models (LLMs) for predicting the synthesizability of inorganic compounds and the selection of precursors needed to perform inorganic synthesis. The predictions of fine-tuned LLMs are comparable to, and sometimes better than, recent bespoke machine learning models for these tasks but require only minimal user expertise, cost, and time to develop. Therefore, this strategy can serve both as an effective and strong baseline for future machine learning studies of various chemical applications and as a practical tool for experimental chemists.
Affiliation(s)
- Seongmin Kim
- Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291, Daehak-ro, Yuseong-gu, Daejeon 34141, Korea
| | - Yousung Jung
- Department of Chemical and Biological Engineering (BK21 four), Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Korea
- Institute of Chemical Processes, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Korea
- Interdisciplinary Program in Artificial Intelligence, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Korea
- Institute of Engineering Research, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Korea
| | - Joshua Schrier
- Department of Chemistry and Biochemistry, Fordham University, 441 East Fordham Road, The Bronx, New York 10458, United States
| |
|
27
|
Howard JR, Shuluk JR, Bhakare A, Anslyn EV. Data-science-guided calibration curve prediction of an MLCT-based ee determination assay for chiral amines. Chem 2024; 10:2074-2088. [PMID: 39006239 PMCID: PMC11243635 DOI: 10.1016/j.chempr.2024.05.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/16/2024]
Abstract
Circular dichroism (CD) based enantiomeric excess (ee) determination assays are optical alternatives to chromatographic ee determination in high-throughput screening (HTS) applications. However, the implementation of these assays requires calibration experiments using enantioenriched materials. We present a data-driven approach that circumvents the need for chiral resolution and calibration experiments for an octahedral Fe(II) complex (1) used for the ee determination of α-chiral primary amines. By computationally parameterizing the imine ligands formed in the assay conditions, a model of the circular dichroism (CD) response of the Fe(II) assembly was developed. Using this model, calibration curves were generated for four analytes and compared to experimentally generated curves. In a single-blind ee determination study, the ee values of unknown samples were determined within 9% mean absolute error, which rivals the error using experimentally generated calibration curves.
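The error figure quoted in this abstract can be made concrete with a minimal sketch, assuming the usual definition of signed ee as the normalized difference between the two enantiomer amounts. The function names and the sample values are hypothetical and are not taken from the paper.

```python
def enantiomeric_excess(r_amount: float, s_amount: float) -> float:
    """Signed ee in percent: positive when R dominates, negative when S does."""
    total = r_amount + s_amount
    if total <= 0:
        raise ValueError("total amount must be positive")
    return 100.0 * (r_amount - s_amount) / total


def mean_absolute_error(predicted, reference):
    """Average absolute difference between paired values."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Hypothetical ee values in percent, for illustration only.
predicted_ee = [80.0, -40.0, 10.0]   # e.g. read off a predicted calibration curve
reference_ee = [90.0, -50.0, 15.0]   # e.g. known values of the blind samples
mae = mean_absolute_error(predicted_ee, reference_ee)
```

With these made-up numbers the absolute errors are 10, 10, and 5 percentage points, so `mae` is their average.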
Affiliation(s)
- James R. Howard
- Department of Chemistry, The University of Texas at Austin, Austin, TX 78705 (USA)
| | - Julia R. Shuluk
- Department of Chemistry, The University of Texas at Austin, Austin, TX 78705 (USA)
| | - Arya Bhakare
- Department of Chemistry, The University of Texas at Austin, Austin, TX 78705 (USA)
| | - Eric V. Anslyn
- Department of Chemistry, The University of Texas at Austin, Austin, TX 78705 (USA)
- Lead contact
| |
|
28
|
Rost NCV, Said M, Gharib M, Lévy R, Boem F. Better nanoscience through open, collaborative, and critical discussions. MATERIALS HORIZONS 2024; 11:3005-3010. [PMID: 38578130 PMCID: PMC11216032 DOI: 10.1039/d3mh01781h] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2023] [Accepted: 02/19/2024] [Indexed: 04/06/2024]
Abstract
We aim to foster a discussion of science correction and of how individual researchers can improve the quality and control of scientific production. This is crucial because although the maintenance of rigorous standards and the scrupulous control of research findings and methods are sometimes taken for granted, in practice, we are routinely confronted with articles that contain errors.
Affiliation(s)
| | - Maha Said
- Université Sorbonne Paris Nord and Université Paris Cité, INSERM, LVTS, F-75018 Paris, France
| | - Mustafa Gharib
- Université Sorbonne Paris Nord and Université Paris Cité, INSERM, LVTS, F-75018 Paris, France
| | - Raphaël Lévy
- Université Sorbonne Paris Nord and Université Paris Cité, INSERM, LVTS, F-75018 Paris, France
| | - Federico Boem
- University of Twente, Philosophy Section, Drienerlolaan 5, 7522 NB Enschede, The Netherlands.
| |
|
29
|
Shi Y, Derasp JS, Maschmeyer T, Hein JE. Phase transfer catalysts shift the pathway to transmetalation in biphasic Suzuki-Miyaura cross-couplings. Nat Commun 2024; 15:5436. [PMID: 38937470 PMCID: PMC11211432 DOI: 10.1038/s41467-024-49681-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2024] [Accepted: 06/14/2024] [Indexed: 06/29/2024] Open
Abstract
The Suzuki-Miyaura coupling is a widely used C-C bond forming reaction. Numerous mechanistic studies have enabled the use of low catalyst loadings and broad functional group tolerance. However, the dominant mode of transmetalation remains controversial and likely depends on the conditions employed. Herein we detail a mechanistic study of the palladium-catalyzed Suzuki-Miyaura coupling under biphasic conditions. The use of phase transfer catalysts results in a remarkable 12-fold rate enhancement in the targeted system. A shift from an oxo-palladium based transmetalation to a boronate-based pathway lies at the root of this activity. Furthermore, a study of the impact of different water loadings reveals that reducing the proportion of the aqueous phase increases the reaction rate, contrary to the reaction conditions typically employed in the literature. The importance of these findings is highlighted by achieving an exceptionally broad substrate scope with benzylic electrophiles using a 10-fold reduction in catalyst loading relative to literature precedent.
Affiliation(s)
- Yao Shi
- Department of Chemistry, University of British Columbia, Vancouver, BC, V6T 1Z1, Canada
| | - Joshua S Derasp
- Department of Chemistry, University of British Columbia, Vancouver, BC, V6T 1Z1, Canada.
| | - Tristan Maschmeyer
- Department of Chemistry, University of British Columbia, Vancouver, BC, V6T 1Z1, Canada
| | - Jason E Hein
- Department of Chemistry, University of British Columbia, Vancouver, BC, V6T 1Z1, Canada.
- Department of Chemistry, University of Bergen, Bergen, Norway.
- Acceleration Consortium, University of Toronto, Toronto, ON, Canada.
| |
|
30
|
Raghavan P, Rago AJ, Verma P, Hassan MM, Goshu GM, Dombrowski AW, Pandey A, Coley CW, Wang Y. Incorporating Synthetic Accessibility in Drug Design: Predicting Reaction Yields of Suzuki Cross-Couplings by Leveraging AbbVie's 15-Year Parallel Library Data Set. J Am Chem Soc 2024; 146:15070-15084. [PMID: 38768950 PMCID: PMC11157529 DOI: 10.1021/jacs.4c00098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Revised: 04/24/2024] [Accepted: 04/25/2024] [Indexed: 05/22/2024]
Abstract
Despite the increased use of computational tools to supplement medicinal chemists' expertise and intuition in drug design, predicting synthetic yields in medicinal chemistry endeavors remains an unsolved challenge. Existing design workflows could profoundly benefit from reaction yield prediction, as precious material waste could be reduced, and a greater number of relevant compounds could be delivered to advance the design, make, test, analyze (DMTA) cycle. In this work, we detail the evaluation of AbbVie's medicinal chemistry library data set to build machine learning models for the prediction of Suzuki coupling reaction yields. The combination of density functional theory (DFT)-derived features and Morgan fingerprints was identified to perform better than one-hot encoded baseline modeling, furnishing encouraging results. Overall, we observe modest generalization to unseen reactant structures within the 15-year retrospective library data set. Additionally, we compare predictions made by the model to those made by expert medicinal chemists, finding that the model can often predict both reaction success and reaction yields with greater accuracy. Finally, we demonstrate the application of this approach to suggest structurally and electronically similar building blocks to replace those predicted or observed to be unsuccessful prior to or after synthesis, respectively. The yield prediction model was used to select similar monomers predicted to have higher yields, resulting in greater synthesis efficiency of relevant drug-like molecules.
Affiliation(s)
- Priyanka Raghavan
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, Massachusetts 02139, United States
| | - Alexander J. Rago
- Advanced Chemistry Technologies Group, AbbVie, Inc., 1 N Waukegan Rd, North Chicago, Illinois 60064, United States
| | - Pritha Verma
- Advanced Chemistry Technologies Group, AbbVie, Inc., 1 N Waukegan Rd, North Chicago, Illinois 60064, United States
| | - Majdi M. Hassan
- RAIDERS Group, AbbVie, Inc., 1 N Waukegan Rd, North Chicago, Illinois 60064, United States
| | - Gashaw M. Goshu
- Advanced Chemistry Technologies Group, AbbVie, Inc., 1 N Waukegan Rd, North Chicago, Illinois 60064, United States
| | - Amanda W. Dombrowski
- Advanced Chemistry Technologies Group, AbbVie, Inc., 1 N Waukegan Rd, North Chicago, Illinois 60064, United States
| | - Abhishek Pandey
- RAIDERS Group, AbbVie, Inc., 1 N Waukegan Rd, North Chicago, Illinois 60064, United States
| | - Connor W. Coley
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, Massachusetts 02139, United States
| | - Ying Wang
- Advanced Chemistry Technologies Group, AbbVie, Inc., 1 N Waukegan Rd, North Chicago, Illinois 60064, United States
| |
|
31
|
Chen L, Gao Z, Zhang Y, Dai X, Meng F, Guo Y. A green, facile, and practical preparation of capsaicin derivatives with thiourea structure. Sci Rep 2024; 14:10576. [PMID: 38719947 PMCID: PMC11078945 DOI: 10.1038/s41598-024-61014-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2024] [Accepted: 04/30/2024] [Indexed: 05/12/2024] Open
Abstract
Capsaicin derivatives with a thiourea structure (CDTS) are highly noteworthy owing to their higher analgesic potency in rodent models and higher agonism in vitro. However, the direct synthesis of CDTS still has one or more shortcomings. In this study, a green, facile, and practical synthetic method for capsaicin derivatives with a thiourea structure is developed using an automated synthetic system, leading to a series of capsaicin derivatives with various electronic properties and functionalities in good to excellent yields.
Affiliation(s)
- Lina Chen
- State Key Laboratory of NBC Protection for Civilian, Beijing, People's Republic of China
| | - Zhenhua Gao
- State Key Laboratory of NBC Protection for Civilian, Beijing, People's Republic of China
| | - Ye Zhang
- Sichuan University of Science and Engineering, Zigong, People's Republic of China
| | - Xiandong Dai
- State Key Laboratory of NBC Protection for Civilian, Beijing, People's Republic of China
| | - Fanhua Meng
- State Key Laboratory of NBC Protection for Civilian, Beijing, People's Republic of China
| | - Yongbiao Guo
- State Key Laboratory of NBC Protection for Civilian, Beijing, People's Republic of China.
| |
|
32
|
King-Smith E. Transfer learning for a foundational chemistry model. Chem Sci 2024; 15:5143-5151. [PMID: 38577363 PMCID: PMC10988575 DOI: 10.1039/d3sc04928k] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 09/19/2023] [Accepted: 11/15/2023] [Indexed: 04/06/2024] Open
Abstract
Data-driven chemistry has garnered much interest concurrent with improvements in hardware and the development of new machine learning models. However, obtaining sufficiently large, accurate datasets of a desired chemical outcome for data-driven chemistry remains a challenge. The community has made significant efforts to democratize and curate available information for more facile machine learning applications, but the limiting factor is usually the laborious nature of generating large-scale data. Transfer learning has been noted in certain applications to alleviate some of the data burden, but this protocol is typically carried out on a case-by-case basis, with the transfer learning task expertly chosen to fit the finetuning. Herein, I develop a machine learning framework capable of accurate chemistry-relevant prediction amid general sources of low data. First, a chemical "foundational model" is trained using a dataset of ∼1 million experimental organic crystal structures. A task specific module is then stacked atop this foundational model and subjected to finetuning. This approach achieves state-of-the-art performance on a diverse set of tasks: toxicity prediction, yield prediction, and odor prediction.
|
33
|
Dobbelaere MR, Lengyel I, Stevens CV, Van Geem KM. Rxn-INSIGHT: fast chemical reaction analysis using bond-electron matrices. J Cheminform 2024; 16:37. [PMID: 38553720 PMCID: PMC10980627 DOI: 10.1186/s13321-024-00834-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2023] [Accepted: 03/23/2024] [Indexed: 04/02/2024] Open
Abstract
The challenge of devising pathways for organic synthesis remains a central issue in the field of medicinal chemistry. Over the span of six decades, computer-aided synthesis planning has given rise to a plethora of potent tools for formulating synthetic routes. Nevertheless, a significant expert task still looms: determining the appropriate solvent, catalyst, and reagents when provided with a set of reactants to achieve and optimize the desired product for a specific step in the synthesis process. Typically, chemists identify key functional groups and rings that exert crucial influences at the reaction center, classify reactions into categories, and may assign them names. This research introduces Rxn-INSIGHT, an open-source algorithm based on the bond-electron matrix approach, with the purpose of automating this endeavor. Rxn-INSIGHT not only streamlines the process but also facilitates extensive querying of reaction databases, effectively replicating the thought processes of an organic chemist. The core functions of the algorithm encompass the classification and naming of reactions, extraction of functional groups, rings, and scaffolds from the involved chemical entities. The provision of reaction condition recommendations based on the similarity and prevalence of reactions eventually arises as a side application. The performance of our rule-based model has been rigorously assessed against a carefully curated benchmark dataset, exhibiting an accuracy rate exceeding 90% in reaction classification and surpassing 95% in reaction naming. Notably, it has been discerned that a pivotal factor in selecting analogous reactions lies in the analysis of ring structures participating in the reactions. An examination of ring structures within the USPTO chemical reaction database reveals that with just 35 unique rings, a remarkable 75% of all rings found in nearly 1 million products can be encompassed. Furthermore, Rxn-INSIGHT is proficient in suggesting appropriate choices for solvents, catalysts, and reagents in entirely novel reactions, all within the span of a second, utilizing nothing more than an everyday laptop.
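The ring-coverage statistic in this abstract (a small number of unique rings accounting for most ring occurrences) is a cumulative-frequency calculation. The sketch below illustrates that calculation only; it is not Rxn-INSIGHT code, and the function name and the toy ring list are hypothetical.

```python
from collections import Counter


def rings_needed_for_coverage(ring_occurrences, target_fraction):
    """Return how many of the most frequent rings are needed so that they
    account for at least `target_fraction` of all ring occurrences."""
    counts = Counter(ring_occurrences)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no ring occurrences given")
    covered = 0
    # Walk the rings from most to least frequent, accumulating coverage.
    for n_rings, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= target_fraction:
            return n_rings
    return len(counts)


# Toy data: one label per ring occurrence found in a set of products.
toy_rings = ["benzene"] * 6 + ["pyridine"] * 2 + ["piperidine", "furan"]
n_for_75_percent = rings_needed_for_coverage(toy_rings, 0.75)
```

In the toy list, benzene alone covers 60% of occurrences and benzene plus pyridine covers 80%, so two rings are enough to pass the 75% threshold.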
Affiliation(s)
- Maarten R Dobbelaere
- Laboratory for Chemical Technology, Department of Materials, Textiles and Chemical Engineering, Faculty of Engineering and Architecture, Ghent University, Technologiepark 125, 9052, Ghent, Belgium
| | - István Lengyel
- Laboratory for Chemical Technology, Department of Materials, Textiles and Chemical Engineering, Faculty of Engineering and Architecture, Ghent University, Technologiepark 125, 9052, Ghent, Belgium
- ChemInsights LLC, Dover, DE, 19901, USA
| | - Christian V Stevens
- SynBioC Research Group, Department of Green Chemistry and Technology, Faculty of Bioscience Engineering, Ghent University, Coupure Links 653, 9000, Ghent, Belgium
| | - Kevin M Van Geem
- Laboratory for Chemical Technology, Department of Materials, Textiles and Chemical Engineering, Faculty of Engineering and Architecture, Ghent University, Technologiepark 125, 9052, Ghent, Belgium.
| |
|
34
|
Rao A, Grzelczak M. Revisiting El-Sayed Synthesis: Bayesian Optimization for Revealing New Insights during the Growth of Gold Nanorods. CHEMISTRY OF MATERIALS : A PUBLICATION OF THE AMERICAN CHEMICAL SOCIETY 2024; 36:2577-2587. [PMID: 38680830 PMCID: PMC11049742 DOI: 10.1021/acs.chemmater.4c00271] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2024] [Revised: 02/16/2024] [Accepted: 02/16/2024] [Indexed: 05/01/2024]
Abstract
In diverse fields, machine learning (ML) has sparked transformative changes, primarily driven by the wealth of big data. However, an alternative approach seeks to mine insights from "precious data", offering the possibility to reveal missed knowledge and escape potential knowledge traps. In this context, Bayesian optimization (BO) protocols have emerged as crucial tools for optimizing the synthesis and discovery of a broad spectrum of compounds including nanoparticles. In our work, we aimed to go beyond the commonly explored experimental conditions and showcase a workflow capable of unearthing fresh insights, even in well-studied research domains. The growth of gold nanorods (AuNRs) is a nonequilibrium process that remains poorly understood despite the presence of well-established seeded growth protocols. Traditional research aimed at understanding the mechanism of AuNR growth has primarily relied on altering one reaction condition at a time. While these studies are undeniably valuable, they often fail to capture the synergies between different reaction conditions, thus constraining the depth of insights they can offer. In the present study, we exploit BO to identify diverse experimental conditions yielding AuNRs with similar spectroscopic characteristics. Notably, we identify viable and accelerated synthesis conditions involving elevated temperatures (36-40 °C) as well as high ascorbic acid concentrations. More importantly, we note that ascorbic acid and temperature can modulate each other's undesirable influences on the growth of AuNRs. Finally, by harnessing the power of interpretable ML algorithms, complemented by our deep chemical understanding, we revisited the established hierarchical relationships among reaction conditions that impact the El-Sayed-based growth of AuNRs.
Affiliation(s)
- Anish Rao
- Centro de Física de Materiales CSIC-UPV/EHU, Paseo Manuel de Lardizabal 5, 20018 Donostia-San Sebastián, Spain
| | - Marek Grzelczak
- Centro de Física de Materiales CSIC-UPV/EHU, Paseo Manuel de Lardizabal 5, 20018 Donostia-San Sebastián, Spain
- Donostia International Physics Center (DIPC), Paseo Manuel de Lardizabal 4, 20018 Donostia-San Sebastián, Spain
| |
|
35
|
Gallarati S, van Gerwen P, Laplaza R, Brey L, Makaveev A, Corminboeuf C. A genetic optimization strategy with generality in asymmetric organocatalysis as a primary target. Chem Sci 2024; 15:3640-3660. [PMID: 38455002 PMCID: PMC10915838 DOI: 10.1039/d3sc06208b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2023] [Accepted: 01/30/2024] [Indexed: 03/09/2024] Open
Abstract
A catalyst possessing a broad substrate scope, in terms of both turnover and enantioselectivity, is sometimes called "general". Despite their great utility in asymmetric synthesis, truly general catalysts are difficult or expensive to discover via traditional high-throughput screening and are, therefore, rare. Existing computational tools accelerate the evaluation of reaction conditions from a pre-defined set of experiments to identify the most general ones, but cannot generate entirely new catalysts with enhanced substrate breadth. For these reasons, we report an inverse design strategy based on the open-source genetic algorithm NaviCatGA and on the OSCAR database of organocatalysts to simultaneously probe the catalyst and substrate scope and optimize generality as a primary target. We apply this strategy to the Pictet-Spengler condensation, for which we curate a database of 820 reactions, used to train statistical models of selectivity and activity. Starting from OSCAR, we define a combinatorial space of millions of catalyst possibilities, and perform evolutionary experiments on a diverse substrate scope that is representative of the whole chemical space of tetrahydro-β-carboline products. While privileged catalysts emerge, we show how genetic optimization can address the broader question of generality in asymmetric synthesis, extracting structure-performance relationships from the challenging areas of chemical space.
Affiliation(s)
- Simone Gallarati
- Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne Switzerland
| | - Puck van Gerwen
- Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne Switzerland
- National Center for Competence in Research - Catalysis (NCCR-Catalysis), Ecole Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne Switzerland
| | - Ruben Laplaza
- Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne Switzerland
- National Center for Competence in Research - Catalysis (NCCR-Catalysis), Ecole Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne Switzerland
| | - Lucien Brey
- Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne Switzerland
| | - Alexander Makaveev
- Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne Switzerland
| | - Clemence Corminboeuf
- Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne Switzerland
- National Center for Competence in Research - Catalysis (NCCR-Catalysis), Ecole Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne Switzerland
- National Center for Computational Design and Discovery of Novel Materials (MARVEL), Ecole Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne Switzerland
| |
|
36
|
Shi R, Yu G, Huo X, Yang Y. Prediction of chemical reaction yields with large-scale multi-view pre-training. J Cheminform 2024; 16:22. [PMID: 38403627 PMCID: PMC10895839 DOI: 10.1186/s13321-024-00815-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2023] [Accepted: 02/14/2024] [Indexed: 02/27/2024] Open
Abstract
Developing machine learning models with high generalization capability for predicting chemical reaction yields is of significant interest and importance. The efficacy of such models depends heavily on the representation of chemical reactions, which has commonly been learned from SMILES or graphs of molecules using deep neural networks. However, the progression of chemical reactions is inherently determined by the molecular 3D geometric properties, which have been recently highlighted as crucial features in accurately predicting molecular properties and chemical reactions. Additionally, large-scale pre-training has been shown to be essential in enhancing the generalization capability of complex deep learning models. Based on these considerations, we propose the Reaction Multi-View Pre-training (ReaMVP) framework, which leverages self-supervised learning techniques and a two-stage pre-training strategy to predict chemical reaction yields. By incorporating multi-view learning with 3D geometric information, ReaMVP achieves state-of-the-art performance on two benchmark datasets. Notably, the experimental results indicate that ReaMVP has a significant advantage in predicting out-of-sample data, suggesting an enhanced generalization ability to predict new reactions. Scientific Contribution: This study presents the ReaMVP framework, which improves the generalization capability of machine learning models for predicting chemical reaction yields. By integrating sequential and geometric views and leveraging self-supervised learning techniques with a two-stage pre-training strategy, ReaMVP achieves state-of-the-art performance on benchmark datasets. The framework demonstrates superior predictive ability for out-of-sample data and enhances the prediction of new reactions.
Affiliation(s)
- Runhan Shi
- Department of Computer Science and Engineering, and Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Gufeng Yu
- Department of Computer Science and Engineering, and Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Xiaohong Huo
- Shanghai Key Laboratory for Molecular Engineering of Chiral Drugs, Frontiers Science Center for Transformative Molecules, School of Chemistry and Chemical Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Yang Yang
- Department of Computer Science and Engineering, and Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China.
| |
|
37
|
Kim S, Noh J, Gu GH, Chen S, Jung Y. Predicting synthesis recipes of inorganic crystal materials using elementwise template formulation. Chem Sci 2024; 15:1039-1045. [PMID: 38239693 PMCID: PMC10793203 DOI: 10.1039/d3sc03538g] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2023] [Accepted: 12/05/2023] [Indexed: 01/22/2024] Open
Abstract
While advances in computational techniques have accelerated virtual materials design, the actual synthesis of predicted candidate materials is still an expensive and slow process. While a few initial studies attempted to predict the synthesis routes for inorganic crystals, the existing models do not yield the priority of predictions and could produce thermodynamically unrealistic precursor chemicals. Here, we propose an element-wise graph neural network to predict inorganic synthesis recipes. The trained model outperforms the popularity-based statistical baseline model for the top-k exact match accuracy test, showing the validity of our approach for inorganic solid-state synthesis. We further validate our model by the publication-year-split test, where the model trained based on the materials data until the year 2016 is shown to successfully predict synthetic precursors for the materials synthesized after 2016. The high correlation between the probability score and prediction accuracy suggests that the probability score can be interpreted as a measure of confidence levels, which can offer the priority of the predictions.
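The two evaluation ideas named in this abstract, a publication-year split and top-k exact match accuracy, can be illustrated with a minimal sketch. It assumes each record carries a target, a precursor set, and a year, and that a model returns ranked candidate precursor sets. The record layout, function names, and example data are all hypothetical and are not drawn from the paper.

```python
def split_by_year(records, cutoff_year):
    """Train on records published up to and including the cutoff year,
    test on records published after it."""
    train = [r for r in records if r["year"] <= cutoff_year]
    test = [r for r in records if r["year"] > cutoff_year]
    return train, test


def top_k_exact_match(ranked_predictions, true_precursors, k):
    """Fraction of targets whose true precursor set appears, as an exact
    set match, among the first k ranked candidate sets."""
    if not true_precursors:
        raise ValueError("no targets to evaluate")
    hits = 0
    for candidates, truth in zip(ranked_predictions, true_precursors):
        truth_set = frozenset(truth)
        if any(frozenset(c) == truth_set for c in candidates[:k]):
            hits += 1
    return hits / len(true_precursors)


# Illustrative records only.
records = [
    {"target": "BaTiO3", "precursors": ("BaCO3", "TiO2"), "year": 2012},
    {"target": "LiCoO2", "precursors": ("Li2CO3", "Co3O4"), "year": 2019},
]
train, test = split_by_year(records, cutoff_year=2016)

# Made-up ranked outputs for two targets (best candidate first).
ranked = [
    [("BaO", "TiO2"), ("BaCO3", "TiO2")],
    [("LiOH", "CoO"), ("Li2CO3", "CoO")],
]
truth = [("TiO2", "BaCO3"), ("Li2CO3", "Co3O4")]
top1 = top_k_exact_match(ranked, truth, k=1)
top2 = top_k_exact_match(ranked, truth, k=2)
```

Comparing precursor sets as `frozenset` objects makes the match independent of the order in which precursors are listed, which is one reasonable reading of "exact match" for a recipe.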
Affiliation(s)
- Seongmin Kim
- Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology (KAIST) 291, Daehak-ro, Yuseong-gu Daejeon 34141 South Korea
| | - Juhwan Noh
- Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology (KAIST) 291, Daehak-ro, Yuseong-gu Daejeon 34141 South Korea
| | - Geun Ho Gu
- School of Energy Technology, Korea Institute of Energy Technology 200 Hyuksin-ro Naju 58330 South Korea
| | - Shuan Chen
- Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology (KAIST) 291, Daehak-ro, Yuseong-gu Daejeon 34141 South Korea
| | - Yousung Jung
- Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology (KAIST) 291, Daehak-ro, Yuseong-gu Daejeon 34141 South Korea
- School of Chemical and Biological Engineering, Institute of Chemical Processes, Seoul National University 1, Gwanak-ro, Gwanak-gu Seoul 08826 South Korea
| |
|
38
|
Taniike T, Fujiwara A, Nakanowatari S, García-Escobar F, Takahashi K. Automatic feature engineering for catalyst design using small data without prior knowledge of target catalysis. Commun Chem 2024; 7:11. [PMID: 38216711 PMCID: PMC10786848 DOI: 10.1038/s42004-023-01086-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 08/08/2023] [Accepted: 12/08/2023] [Indexed: 01/14/2024] Open
Abstract
The empirical aspect of descriptor design in catalyst informatics, particularly when confronted with limited data, necessitates adequate prior knowledge for delving into unknown territories, thus presenting a logical contradiction. This study introduces a technique for automatic feature engineering (AFE) that works on small catalyst datasets, without reliance on specific assumptions or pre-existing knowledge about the target catalysis when designing descriptors and building machine-learning models. This technique generates numerous features through mathematical operations on general physicochemical features of catalytic components and extracts relevant features for the desired catalysis, essentially screening numerous hypotheses on a machine. AFE yields reasonable regression results for three types of heterogeneous catalysis: oxidative coupling of methane (OCM), conversion of ethanol to butadiene, and three-way catalysis, where only the training set is swapped. Moreover, through the application of active learning that combines AFE and high-throughput experimentation for OCM, we successfully visualize the machine's process of acquiring precise recognition of the catalyst design. Thus, AFE is a versatile technique for data-driven catalysis research and a key step towards fully automated catalyst discoveries.
Affiliation(s)
- Toshiaki Taniike
- Graduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa, 923-1292, Japan.
| | - Aya Fujiwara
- Graduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa, 923-1292, Japan
| | - Sunao Nakanowatari
- Graduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa, 923-1292, Japan
| | | | - Keisuke Takahashi
- Department of Chemistry, Hokkaido University, North 10, West 8, Sapporo, 060-0810, Japan
| |
|
39
|
Voinarovska V, Kabeshov M, Dudenko D, Genheden S, Tetko IV. When Yield Prediction Does Not Yield Prediction: An Overview of the Current Challenges. J Chem Inf Model 2024; 64:42-56. [PMID: 38116926 PMCID: PMC10778086 DOI: 10.1021/acs.jcim.3c01524] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Revised: 11/29/2023] [Accepted: 11/30/2023] [Indexed: 12/21/2023]
Abstract
Machine Learning (ML) techniques face significant challenges when predicting advanced chemical properties, such as yield, feasibility of chemical synthesis, and optimal reaction conditions. These challenges stem from the high-dimensional nature of the prediction task and the myriad essential variables involved, ranging from reactants and reagents to catalysts, temperature, and purification processes. Successfully developing a reliable predictive model not only holds the potential for optimizing high-throughput experiments but can also elevate existing retrosynthetic predictive approaches and bolster a plethora of applications within the field. In this review, we systematically evaluate the efficacy of current ML methodologies in chemoinformatics, shedding light on their milestones and inherent limitations. Additionally, a detailed examination of a representative case study provides insights into the prevailing issues related to data availability and transferability in the discipline.
Affiliation(s)
- Varvara Voinarovska
- Molecular AI, Discovery Sciences R&D, AstraZeneca, 431 83 Gothenburg, Sweden
- TUM Graduate School, Faculty of Chemistry, Technical University of Munich, 85748 Garching, Germany
| | - Mikhail Kabeshov
- Molecular AI, Discovery Sciences R&D, AstraZeneca, 431 83 Gothenburg, Sweden
| | - Dmytro Dudenko
- Enamine Ltd., 78 Chervonotkatska str., 02094 Kyiv, Ukraine
| | - Samuel Genheden
- Molecular AI, Discovery Sciences R&D, AstraZeneca, 431 83 Gothenburg, Sweden
| | - Igor V. Tetko
- Molecular Targets and Therapeutics Center, Helmholtz Munich − Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), Institute of Structural Biology, 85764 Neuherberg, Germany
| |
|
40
|
Suvarna M, Vaucher AC, Mitchell S, Laino T, Pérez-Ramírez J. Language models and protocol standardization guidelines for accelerating synthesis planning in heterogeneous catalysis. Nat Commun 2023; 14:7964. [PMID: 38042926 PMCID: PMC10693572 DOI: 10.1038/s41467-023-43836-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2023] [Accepted: 11/22/2023] [Indexed: 12/04/2023] Open
Abstract
Synthesis protocol exploration is paramount in catalyst discovery, yet keeping pace with rapid literature advances is increasingly time intensive. Automated synthesis protocol analysis is attractive for swiftly identifying opportunities and informing predictive models, however such applications in heterogeneous catalysis remain limited. In this proof-of-concept, we introduce a transformer model for this task, exemplified using single-atom heterogeneous catalysts (SACs), a rapidly expanding catalyst family. Our model adeptly converts SAC protocols into action sequences, and we use this output to facilitate statistical inference of their synthesis trends and applications, potentially expediting literature review and analysis. We demonstrate the model's adaptability across distinct heterogeneous catalyst families, underscoring its versatility. Finally, our study highlights a critical issue: the lack of standardization in reporting protocols hampers machine-reading capabilities. Embracing digital advances in catalysis demands a shift in data reporting norms, and to this end, we offer guidelines for writing protocols, significantly improving machine-readability. We release our model as an open-source web application, inviting a fresh approach to accelerate heterogeneous catalysis synthesis planning.
Affiliation(s)
- Manu Suvarna
- Institute for Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zurich, Vladimir-Prelog-Weg 1, 8093, Zurich, Switzerland
| | | | - Sharon Mitchell
- Institute for Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zurich, Vladimir-Prelog-Weg 1, 8093, Zurich, Switzerland
| | - Teodoro Laino
- IBM Research Europe, Säumerstrasse 4, 8803, Rüschlikon, Switzerland.
| | - Javier Pérez-Ramírez
- Institute for Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zurich, Vladimir-Prelog-Weg 1, 8093, Zurich, Switzerland.
| |
|
41
|
Makarov DM, Lukanov MM, Rusanov AI, Mamardashvili NZ, Ksenofontov AA. Machine learning approach for predicting the yield of pyrroles and dipyrromethanes condensation reactions with aldehydes. JOURNAL OF COMPUTATIONAL SCIENCE 2023; 74:102173. [DOI: 10.1016/j.jocs.2023.102173] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/16/2024]
|
42
|
Liu HW, He P, Li WT, Sun W, Shi K, Wang YQ, Mo QK, Zhang XY, Zhu SF. Catalyst-Oriented Design Based on Elementary Reactions (CODER) for Triarylamine Synthesis. Angew Chem Int Ed Engl 2023; 62:e202309111. [PMID: 37698233 DOI: 10.1002/anie.202309111] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2023] [Revised: 09/09/2023] [Accepted: 09/12/2023] [Indexed: 09/13/2023]
Abstract
Recently, the application of computational tools to the rational design of catalysts has received considerable attention, but progress has been limited by the reliance on databases and because mechanistic data have been almost neglected. Herein, we report a new strategy for catalyst design, designated catalyst-oriented design based on elementary reactions (CODER), which fully utilizes mechanistic data and combines the strengths of computational tools and researcher experience. CODER enabled the development of extremely efficient Pd catalysts for C-N coupling, which markedly improved the efficiency of the synthesis of widely used triarylamine optoelectronic materials by raising the turnover numbers (up to 340,000) by 1-3 orders of magnitude relative to literature values.
Affiliation(s)
- Hua-Wei Liu
- Frontiers Science Center for New Organic Matters, State Key Laboratory and Institute of Elemento-Organic Chemistry, College of Chemistry, Nankai University, 94th Weijin Road, Tianjin, 300071, China
| | - Peng He
- Frontiers Science Center for New Organic Matters, State Key Laboratory and Institute of Elemento-Organic Chemistry, College of Chemistry, Nankai University, 94th Weijin Road, Tianjin, 300071, China
| | - Wen-Tao Li
- Frontiers Science Center for New Organic Matters, State Key Laboratory and Institute of Elemento-Organic Chemistry, College of Chemistry, Nankai University, 94th Weijin Road, Tianjin, 300071, China
| | - Wei Sun
- Frontiers Science Center for New Organic Matters, State Key Laboratory and Institute of Elemento-Organic Chemistry, College of Chemistry, Nankai University, 94th Weijin Road, Tianjin, 300071, China
| | - Kai Shi
- Frontiers Science Center for New Organic Matters, State Key Laboratory and Institute of Elemento-Organic Chemistry, College of Chemistry, Nankai University, 94th Weijin Road, Tianjin, 300071, China
| | - You-Qin Wang
- Frontiers Science Center for New Organic Matters, State Key Laboratory and Institute of Elemento-Organic Chemistry, College of Chemistry, Nankai University, 94th Weijin Road, Tianjin, 300071, China
| | - Qian-Kun Mo
- Frontiers Science Center for New Organic Matters, State Key Laboratory and Institute of Elemento-Organic Chemistry, College of Chemistry, Nankai University, 94th Weijin Road, Tianjin, 300071, China
| | - Xin-Yu Zhang
- Frontiers Science Center for New Organic Matters, State Key Laboratory and Institute of Elemento-Organic Chemistry, College of Chemistry, Nankai University, 94th Weijin Road, Tianjin, 300071, China
| | - Shou-Fei Zhu
- Frontiers Science Center for New Organic Matters, State Key Laboratory and Institute of Elemento-Organic Chemistry, College of Chemistry, Nankai University, 94th Weijin Road, Tianjin, 300071, China
| |
|
43
|
Nguyen TH, Le KM, Nguyen LH, Truong TN. Atom-Based Machine Learning Model for Quantitative Property-Structure Relationship of Electronic Properties of Fusenes and Substituted Fusenes. ACS OMEGA 2023; 8:38441-38451. [PMID: 37867641 PMCID: PMC10586267 DOI: 10.1021/acsomega.3c05212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2023] [Accepted: 09/15/2023] [Indexed: 10/24/2023]
Abstract
This study presents the development of machine-learning-based quantitative structure-property relationship (QSPR) models for predicting electron affinity, ionization potential, and band gap of fusenes from different chemical classes. Three variants of the atom-based Weisfeiler-Lehman (WL) graph kernel method and the machine learning model Gaussian process regressor (GPR) were used. The data pool comprises polycyclic aromatic hydrocarbons (PAHs), thienoacenes, cyano-substituted PAHs, and nitro-substituted PAHs computed with density functional theory (DFT) at the B3LYP-D3/6-31+G(d) level of theory. The results demonstrate that the GPR/WL kernel methods can accurately predict the electronic properties of PAHs and their derivatives with root-mean-square deviations of 0.15 eV. Additionally, we also demonstrate the effectiveness of the active learning protocol for the GPR/WL kernel methods pipeline, particularly for data sets with greater diversity. The interpretation of the model for contributions of individual atoms to the predicted electronic properties provides reasons for the success of our previous degree of π-orbital overlap model.
Affiliation(s)
- Tuan H. Nguyen
- Faculty of Chemical Engineering, Ho Chi Minh City University of Technology, 268 Ly Thuong Kiet Street, District 10, Ho Chi Minh City 7000000, Vietnam
| | - Khang M. Le
- Faculty of Chemistry, VNUHCM-University of Science, 227 Nguyen Van Cu Street, Ho Chi Minh City 700000, Vietnam
| | - Lam H. Nguyen
- Faculty of Chemistry, VNUHCM-University of Science, 227 Nguyen Van Cu Street, Ho Chi Minh City 700000, Vietnam
- Institute for Computational Science and Technology, Ho Chi Minh City 700000, Vietnam
| | - Thanh N. Truong
- Department of Chemistry, University of Utah, Salt Lake City, Utah 84112, United States
| |
|
44
|
Wang X, Hsieh CY, Yin X, Wang J, Li Y, Deng Y, Jiang D, Wu Z, Du H, Chen H, Li Y, Liu H, Wang Y, Luo P, Hou T, Yao X. Generic Interpretable Reaction Condition Predictions with Open Reaction Condition Datasets and Unsupervised Learning of Reaction Center. RESEARCH (WASHINGTON, D.C.) 2023; 6:0231. [PMID: 37849643 PMCID: PMC10578430 DOI: 10.34133/research.0231] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/01/2023] [Accepted: 08/29/2023] [Indexed: 10/19/2023]
Abstract
Effective synthesis planning powered by deep learning (DL) can significantly accelerate the discovery of new drugs and materials. However, most DL-assisted synthesis planning methods offer little or no capability to recommend suitable reaction conditions (RCs) for their reaction predictions. Currently, the prediction of RCs with a DL framework is hindered by several factors, including: (a) lack of a standardized dataset for benchmarking, (b) lack of a general prediction model with powerful representation, and (c) lack of interpretability. To address these issues, we first created 2 standardized RC datasets covering a broad range of reaction classes and then proposed a powerful and interpretable Transformer-based RC predictor named Parrot. Through careful design of the model architecture, pretraining method, and training strategy, Parrot improved the overall top-3 prediction accuracy on catalysis, solvents, and other reagents by as much as 13.44%, compared to the best previous model on a newly curated dataset. Additionally, the mean absolute error of the predicted temperatures was reduced by about 4 °C. Furthermore, Parrot manifests strong generalization capacity with superior cross-chemical-space prediction accuracy. Attention analysis indicates that Parrot effectively captures crucial chemical information and exhibits a high level of interpretability in the prediction of RCs. The proposed model Parrot exemplifies how a modern neural network architecture, when appropriately pretrained, can be versatile in making reliable, generalizable, and interpretable recommendations for RCs even when the underlying training dataset may still be limited in diversity.
Affiliation(s)
- Xiaorui Wang
- Dr. Neher’s Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao, 999078, China
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, Zhejiang 310018, China
| | - Chang-Yu Hsieh
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
| | - Xiaodan Yin
- Dr. Neher’s Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao, 999078, China
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, Zhejiang 310018, China
| | - Jike Wang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, Zhejiang 310018, China
| | - Yuquan Li
- College of Chemistry and Chemical Engineering, Lanzhou University, Lanzhou, 730000, China
| | - Yafeng Deng
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, Zhejiang 310018, China
| | - Dejun Jiang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, Zhejiang 310018, China
| | - Zhenxing Wu
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, Zhejiang 310018, China
| | - Hongyan Du
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
| | - Hongming Chen
- Center of Chemistry and Chemical Biology, Guangzhou Regenerative Medicine and Health Guangdong Laboratory, Guangzhou 510530, China
| | - Yun Li
- College of Chemistry and Chemical Engineering, Lanzhou University, Lanzhou, 730000, China
| | - Huanxiang Liu
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, 999078, China
| | - Yuwei Wang
- College of Pharmacy, Shaanxi University of Chinese Medicine, Xianyang, Shaanxi, 712044, China
| | - Pei Luo
- Dr. Neher’s Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao, 999078, China
| | - Tingjun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
| | - Xiaojun Yao
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, 999078, China
| |
|
45
|
Liu Z, Moroz YS, Isayev O. The challenge of balancing model sensitivity and robustness in predicting yields: a benchmarking study of amide coupling reactions. Chem Sci 2023; 14:10835-10846. [PMID: 37829036 PMCID: PMC10566507 DOI: 10.1039/d3sc03902a] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 07/27/2023] [Accepted: 09/12/2023] [Indexed: 10/14/2023] Open
Abstract
Accurate prediction of reaction yield is the holy grail for computer-assisted synthesis prediction, but current models have failed to generalize to large literature datasets. To understand the causes and inspire future design, we systematically benchmarked the yield prediction task. We carefully curated and augmented a literature dataset of 41 239 amide coupling reactions, each with information on reactants, products, intermediates, yields, and reaction contexts, and provided 3D structures for the molecules. We calculated molecular features related to 2D and 3D structure information, as well as physical and electronic properties. These descriptors were paired with 4 categories of machine learning methods (linear, kernel, ensemble, and neural network), yielding valuable benchmarks about feature and model performance. Despite the excellent performance on a high-throughput experiment (HTE) dataset (R² around 0.9), no method gave satisfactory results on the literature data. The best performance was an R² of 0.395 ± 0.020 using the stack technique. Error analysis revealed that reactivity cliff and yield uncertainty are among the main reasons for incorrect predictions. Removing reactivity cliffs and uncertain reactions boosted the R² to 0.457 ± 0.006. These results highlight that yield prediction models must be sensitive to the reactivity change due to the subtle structure variance, as well as be robust to the uncertainty associated with yield measurements.
Affiliation(s)
- Zhen Liu
- Department of Chemistry, Mellon College of Science, Carnegie Mellon University Pittsburgh PA 15213 USA
| | - Yurii S Moroz
- Enamine Ltd Kyïv 02660 Ukraine
- Chemspace LLC Kyïv 02094 Ukraine
- Taras Shevchenko National University of Kyïv Kyïv 01601 Ukraine
| | - Olexandr Isayev
- Department of Chemistry, Mellon College of Science, Carnegie Mellon University Pittsburgh PA 15213 USA
| |
|
46
|
Schrier J, Norquist AJ, Buonassisi T, Brgoch J. In Pursuit of the Exceptional: Research Directions for Machine Learning in Chemical and Materials Science. J Am Chem Soc 2023; 145:21699-21716. [PMID: 37754929 DOI: 10.1021/jacs.3c04783] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Indexed: 09/28/2023]
Abstract
Exceptional molecules and materials with one or more extraordinary properties are both technologically valuable and fundamentally interesting, because they often involve new physical phenomena or new compositions that defy expectations. Historically, exceptionality has been achieved through serendipity, but recently, machine learning (ML) and automated experimentation have been widely proposed to accelerate target identification and synthesis planning. In this Perspective, we argue that the data-driven methods commonly used today are well-suited for optimization but not for the realization of new exceptional materials or molecules. Finding such outliers should be possible using ML, but only by shifting away from using traditional ML approaches that tweak the composition, crystal structure, or reaction pathway. We highlight case studies of high-Tc oxide superconductors and superhard materials to demonstrate the challenges of ML-guided discovery and discuss the limitations of automation for this task. We then provide six recommendations for the development of ML methods capable of exceptional materials discovery: (i) Avoid the tyranny of the middle and focus on extrema; (ii) When data are limited, qualitative predictions that provide direction are more valuable than interpolative accuracy; (iii) Sample what can be made and how to make it and defer optimization; (iv) Create room (and look) for the unexpected while pursuing your goal; (v) Try to fill-in-the-blanks of input and output space; (vi) Do not confuse human understanding with model interpretability. We conclude with a description of how these recommendations can be integrated into automated discovery workflows, which should enable the discovery of exceptional molecules and materials.
Affiliation(s)
- Joshua Schrier
- Department of Chemistry, Fordham University, The Bronx, New York 10458, United States
| | - Alexander J Norquist
- Department of Chemistry, Haverford College, Haverford, Pennsylvania 19041, United States
| | - Tonio Buonassisi
- Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Jakoah Brgoch
- Department of Chemistry and Texas Center for Superconductivity, University of Houston, Houston, Texas 77204, United States
| |
|
47
|
Cuomo A, Ibarraran S, Sreekumar S, Li H, Eun J, Menzel JP, Zhang P, Buono F, Song JJ, Crabtree RH, Batista VS, Newhouse TR. Feed-Forward Neural Network for Predicting Enantioselectivity of the Asymmetric Negishi Reaction. ACS CENTRAL SCIENCE 2023; 9:1768-1774. [PMID: 37780365 PMCID: PMC10540279 DOI: 10.1021/acscentsci.3c00512] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2023] [Indexed: 10/03/2023]
Abstract
Density functional theory (DFT) is a powerful tool to model transition state (TS) energies to predict selectivity in chemical synthesis. However, a successful multistep synthesis campaign must navigate energetically narrow differences in pathways that create some limits to rapid and unambiguous application of DFT to these problems. While powerful data science techniques may provide a complementary approach to overcome this problem, doing so with the relatively small data sets that are widespread in organic synthesis presents a significant challenge. Herein, we show that a small data set can be labeled with features from DFT TS calculations to train a feed-forward neural network for predicting enantioselectivity of a Negishi cross-coupling reaction with P-chiral hindered phosphines. This approach to modeling enantioselectivity is compared with conventional approaches, including exclusive use of DFT energies and data science approaches, using features from ligands or ground states with neural network architectures.
Affiliation(s)
- Abbigayle E. Cuomo
- Department of Chemistry, Yale University, New Haven, Connecticut 06511, United States
| | - Sebastian Ibarraran
- Department of Chemistry, Yale University, New Haven, Connecticut 06511, United States
| | - Sanil Sreekumar
- Chemical Development, Boehringer Ingelheim Pharmaceuticals Inc, 900 Ridgebury Road, Ridgefield, Connecticut 06877, United States
| | - Haote Li
- Department of Chemistry, Yale University, New Haven, Connecticut 06511, United States
| | - Jungmin Eun
- Department of Chemistry, Yale University, New Haven, Connecticut 06511, United States
| | - Jan Paul Menzel
- Department of Chemistry, Yale University, New Haven, Connecticut 06511, United States
| | - Pengpeng Zhang
- Department of Chemistry, Yale University, New Haven, Connecticut 06511, United States
| | - Frederic Buono
- Chemical Development, Boehringer Ingelheim Pharmaceuticals Inc, 900 Ridgebury Road, Ridgefield, Connecticut 06877, United States
| | - Jinhua J. Song
- Chemical Development, Boehringer Ingelheim Pharmaceuticals Inc, 900 Ridgebury Road, Ridgefield, Connecticut 06877, United States
| | - Robert H. Crabtree
- Department of Chemistry, Yale University, New Haven, Connecticut 06511, United States
| | - Victor S. Batista
- Department of Chemistry, Yale University, New Haven, Connecticut 06511, United States
| | - Timothy R. Newhouse
- Department of Chemistry, Yale University, New Haven, Connecticut 06511, United States
| |
|
48
|
Behnoudfar D, Simon CM, Schrier J. Data-Driven Imputation of Miscibility of Aqueous Solutions via Graph-Regularized Logistic Matrix Factorization. J Phys Chem B 2023; 127:7964-7973. [PMID: 37682958 DOI: 10.1021/acs.jpcb.3c03789] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/10/2023]
Abstract
Aqueous, two-phase systems (ATPSs) may form upon mixing two solutions of independently water-soluble compounds. Many separation, purification, and extraction processes rely on ATPSs. Predicting the miscibility of solutions can accelerate and reduce the cost of the discovery of new ATPSs for these applications. Whereas previous machine learning approaches to ATPS prediction used physicochemical properties of each solute as a descriptor, in this work, we show how to impute missing miscibility outcomes directly from an incomplete collection of pairwise miscibility experiments. We use graph-regularized logistic matrix factorization (GR-LMF) to learn a latent vector of each solution from (i) the observed entries in the pairwise miscibility matrix and (ii) a graph where each node is a solution and edges are relationships indicating the general category of the solute (i.e., polymer, surfactant, salt, protein). For an experimental data set of the pairwise miscibility of 68 solutions from Peacock et al. [ACS Appl. Mater. Interfaces 2021, 13, 11449-11460], we find that GR-LMF more accurately predicts missing (im)miscibility outcomes of pairs of solutions than ordinary logistic matrix factorization and random forest classifiers that use physicochemical features of the solutes. GR-LMF obviates the need for features of the solutions and solutions to impute missing miscibility outcomes, but it cannot predict the miscibility of a new solution without some observations of its miscibility with other solutions in the training data set.
Affiliation(s)
- Diba Behnoudfar
- School of Chemical, Biological, and Environmental Engineering, Oregon State University, Corvallis, Oregon 97331, United States
| | - Cory M Simon
- School of Chemical, Biological, and Environmental Engineering, Oregon State University, Corvallis, Oregon 97331, United States
| | - Joshua Schrier
- Department of Chemistry, Fordham University, The Bronx, New York 10458, United States
| |
|
49
|
Li B, Su S, Zhu C, Lin J, Hu X, Su L, Yu Z, Liao K, Chen H. A deep learning framework for accurate reaction prediction and its application on high-throughput experimentation data. J Cheminform 2023; 15:72. [PMID: 37568183 PMCID: PMC10422736 DOI: 10.1186/s13321-023-00732-w] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 07/20/2022] [Accepted: 06/30/2023] [Indexed: 08/13/2023] Open
Abstract
In recent years, artificial intelligence (AI) has begun to bring revolutionary changes to chemical synthesis. However, the lack of suitable ways of representing chemical reactions and the scarcity of reaction data have limited the wider application of AI to reaction prediction. Here, we introduce a novel reaction representation, GraphRXN, for reaction prediction. It utilizes a universal graph-based neural network framework to encode chemical reactions by directly taking two-dimensional reaction structures as inputs. The GraphRXN model was evaluated on three publicly available chemical reaction datasets and gave on-par or superior results compared with other baseline models. To further evaluate the effectiveness of GraphRXN, wet-lab experiments were carried out to generate reaction data. A GraphRXN model was then built on the high-throughput experimentation data, and a decent accuracy (R² of 0.712) was obtained on our in-house data. This highlights that the GraphRXN model can be deployed in an integrated workflow that combines robotics and AI technologies for forward reaction prediction.
Affiliation(s)
- Baiqing Li
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China
| | - Shimin Su
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China
| | - Chan Zhu
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China
| | - Jie Lin
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China
| | - Xinyue Hu
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China
| | - Lebin Su
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China
| | - Zhunzhun Yu
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China
| | - Kuangbiao Liao
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China.
| | - Hongming Chen
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China.
| |
|
50
|
Mahjour B, Zhang R, Shen Y, McGrath A, Zhao R, Mohamed OG, Lin Y, Zhang Z, Douthwaite JL, Tripathi A, Cernak T. Rapid planning and analysis of high-throughput experiment arrays for reaction discovery. Nat Commun 2023; 14:3924. [PMID: 37400469 DOI: 10.1038/s41467-023-39531-0] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 02/22/2023] [Accepted: 06/13/2023] [Indexed: 07/05/2023] Open
Abstract
High-throughput experimentation (HTE) is an increasingly important tool in reaction discovery. While the hardware for running HTE in the chemical laboratory has evolved significantly in recent years, there remains a need for software solutions to navigate data-rich experiments. Here we have developed phactor™, a software that facilitates the performance and analysis of HTE in a chemical laboratory. phactor™ allows experimentalists to rapidly design arrays of chemical reactions or direct-to-biology experiments in 24, 96, 384, or 1,536 wellplates. Users can access online reagent data, such as a chemical inventory, to virtually populate wells with experiments and produce instructions to perform the reaction array manually, or with the assistance of a liquid handling robot. After completion of the reaction array, analytical results can be uploaded for facile evaluation, and to guide the next series of experiments. All chemical data, metadata, and results are stored in machine-readable formats that are readily translatable to various software. We also demonstrate the use of phactor™ in the discovery of several chemistries, including the identification of a low micromolar inhibitor of the SARS-CoV-2 main protease. Furthermore, phactor™ has been made available for free academic use in 24- and 96-well formats via an online interface.
Affiliation(s)
- Babak Mahjour
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Rui Zhang
- Department of Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Yuning Shen
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Andrew McGrath
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Ruheng Zhao
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Osama G Mohamed
- Natural Products Discovery Core, Life Sciences Institute, University of Michigan, Ann Arbor, MI, USA
| | - Yingfu Lin
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Zirong Zhang
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - James L Douthwaite
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Ashootosh Tripathi
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
- Natural Products Discovery Core, Life Sciences Institute, University of Michigan, Ann Arbor, MI, USA
| | - Tim Cernak
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA.
- Department of Chemistry, University of Michigan, Ann Arbor, MI, USA.
| |
|