1
Strandgaard M, Linjordet T, Kneiding H, Burnage AL, Nova A, Jensen JH, Balcells D. A Deep Generative Model for the Inverse Design of Transition Metal Ligands and Complexes. JACS Au 2025; 5:2294-2308. [PMID: 40443902 PMCID: PMC12117439 DOI: 10.1021/jacsau.5c00242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2025] [Revised: 04/15/2025] [Accepted: 04/15/2025] [Indexed: 06/02/2025]
Abstract
Deep generative models yielding transition metal complexes (TMCs) remain scarce despite the key role of these compounds in industrial catalytic processes, anticancer therapies, and the energy transition. Compared to drug discovery within the chemical space of organic molecules, TMCs pose further challenges, including the encoding of chemical bonds of higher complexity and the need to optimize multiple properties. In this work, we developed a generative model for the inverse design of transition metal ligands and complexes, based on the junction tree variational autoencoder (JT-VAE). After implementing a SMILES-based encoding of the metal-ligand bonds, the model was trained with the tmQMg-L ligand library, allowing for the generation of thousands of novel, highly diverse monodentate (κ1) and bidentate (κ2) ligands, including imines, phosphines, and carbenes. Further, the generated ligands were labeled with two target properties reflecting the stability and electron density of the associated homoleptic iridium TMCs: the HOMO-LUMO gap (ϵ) and the charge of the metal center (q_Ir). This data was used to implement a conditional model that generated ligands from a prompt, with the single- or dual-objective of optimizing either or both the ϵ and q_Ir properties and allowing for chemical interpretation based on the optimization trajectories. The optimizations also had an impact on other chemical properties, including ligand dissociation energies and oxidative addition barriers. A similar model was implemented to condition ligand generation by solubility and steric bulk.
Affiliation(s)
- Magnus Strandgaard
  - Hylleraas Centre for Quantum Molecular Sciences, Department of Chemistry, University of Oslo, P.O. Box 1033, Blindern, Oslo 0315, Norway
  - Department of Chemistry, University of Copenhagen, Copenhagen 2100, Denmark
- Trond Linjordet
  - Hylleraas Centre for Quantum Molecular Sciences, Department of Chemistry, University of Oslo, P.O. Box 1033, Blindern, Oslo 0315, Norway
- Hannes Kneiding
  - Hylleraas Centre for Quantum Molecular Sciences, Department of Chemistry, University of Oslo, P.O. Box 1033, Blindern, Oslo 0315, Norway
- Arron L. Burnage
  - Hylleraas Centre for Quantum Molecular Sciences, Department of Chemistry, University of Oslo, P.O. Box 1033, Blindern, Oslo 0315, Norway
- Ainara Nova
  - Hylleraas Centre for Quantum Molecular Sciences, Department of Chemistry, University of Oslo, P.O. Box 1033, Blindern, Oslo 0315, Norway
  - Centre for Materials Science and Nanotechnology, Department of Chemistry, University of Oslo, Oslo N-0315, Norway
- Jan Halborg Jensen
  - Department of Chemistry, University of Copenhagen, Copenhagen 2100, Denmark
- David Balcells
  - Hylleraas Centre for Quantum Molecular Sciences, Department of Chemistry, University of Oslo, P.O. Box 1033, Blindern, Oslo 0315, Norway
2
Fan LY, Li XT, Luo XX, Zhu B, Guan W. Data-Driven Prediction of Reactivity and Additive Selection for C(sp2)-(Hetero)Atom Bond Couplings in an Adaptive Dynamic Homogeneous Catalysis. Chemistry 2025; 31:e202500935. [PMID: 40261203 DOI: 10.1002/chem.202500935] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2025] [Revised: 04/17/2025] [Accepted: 04/22/2025] [Indexed: 04/24/2025]
Abstract
Under visible light-driven redox conditions, employing transition-metal catalysis provides a powerful platform for constructing C(sp2)-(hetero)atom bonds. Although these reactions are highly significant, they require precise optimization of reaction parameters. König, Ghosh, and colleagues introduced an adaptive dynamic homogeneous catalysis (AD-HoC) platform that furnishes robust, high-yield conditions for photocatalyzed cross-coupling reactions. The AD-HoC system eliminates the need to optimize catalysts, ligands, and bases; instead, it achieves C(sp2)-(hetero)atom bond coupling by merely altering additives and substrate molecules. Leveraging the predictability of reaction conditions within the AD-HoC system, machine learning offers a method to evaluate the reactivity of substrate combinations and the categories of additives. Our research integrates high-throughput quantum mechanical calculations with cheminformatics approaches to explore the reactivity of substrate combinations and the selection of additives within the AD-HoC system. Further data-driven analysis reveals that the electronic characteristics of electrophiles and the geometric characteristics of nucleophiles are key factors regulating reactivity within the AD-HoC system. Herein, we present an end-to-end tool for prediction starting from the SMILES (Simplified Molecular-Input Line-Entry System) representation. This work demonstrates the collaborative use of computational statistics and machine learning to predict the reactivity and reaction conditions of substrate combinations, thereby enhancing the precision and efficiency of synthetic processes.
Affiliation(s)
- Li-Yang Fan
  - Institute of Functional Material Chemistry, Faculty of Chemistry, Northeast Normal University, Changchun, 130024, P. R. China
- Xue-Tao Li
  - Institute of Functional Material Chemistry, Faculty of Chemistry, Northeast Normal University, Changchun, 130024, P. R. China
- Xi-Xi Luo
  - Institute of Functional Material Chemistry, Faculty of Chemistry, Northeast Normal University, Changchun, 130024, P. R. China
- Bo Zhu
  - Institute of Functional Material Chemistry, Faculty of Chemistry, Northeast Normal University, Changchun, 130024, P. R. China
- Wei Guan
  - Institute of Functional Material Chemistry, Faculty of Chemistry, Northeast Normal University, Changchun, 130024, P. R. China
3
Harnik Y, Shalit Peleg H, Bermano AH, Milo A. Data efficient molecular image representation learning using foundation models. Chem Sci 2025:d5sc00907c. [PMID: 40417293 PMCID: PMC12100517 DOI: 10.1039/d5sc00907c] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2025] [Accepted: 05/13/2025] [Indexed: 05/27/2025] Open
Abstract
Deep learning (DL) in chemistry has seen significant progress, yet its applicability is limited by the scarcity of large, labeled datasets and the difficulty of extracting meaningful molecular features. Molecular representation learning (MRL) has emerged as a powerful approach to address these challenges by decoupling feature extraction and property prediction. In MRL, a deep learning network is first trained to learn molecular features from large, unlabeled datasets and then finetuned for property prediction on smaller specialized data. Whereas MRL methods have been widely applied across chemical applications, these models are typically trained from scratch. Herein, we propose that foundation models can serve as an advantageous starting point for developing MRL models. Foundation models are large models trained on diverse datasets capable of addressing various downstream tasks. For example, large language models like OpenAI's GPT-4 can be finetuned with minimal additional data for tasks considerably different from their training. Based on this premise we leveraged OpenAI's vision foundation model, CLIP, as the backbone for developing MoleCLIP, a molecular image representation learning framework. MoleCLIP requires significantly less molecular pretraining data to match the performance of state-of-the-art models on standard benchmarks. Furthermore, MoleCLIP outperformed existing models on homogeneous catalysis datasets, emphasizing its robustness to distribution shifts, which allows it to adapt effectively to varied tasks and datasets. This successful application of a general foundation model to chemical tasks highlights the potential of innovations in DL research to advance synthetic chemistry and, more broadly, any field where molecular property description is central to discovery.
Affiliation(s)
- Yonatan Harnik
  - Department of Chemistry, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Hadas Shalit Peleg
  - Department of Chemistry, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Amit H Bermano
  - School of Computer Science, Tel Aviv University, Tel Aviv, Israel
- Anat Milo
  - Department of Chemistry, Ben-Gurion University of the Negev, Beer Sheva, Israel
4
Wang L, Tricard N, Chen Z, Deng S. Progress in computational methods and mechanistic insights on the growth of carbon nanotubes. Nanoscale 2025; 17:11812-11863. [PMID: 40275725 DOI: 10.1039/d4nr05487c] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/26/2025]
Abstract
Carbon nanotubes (CNTs), as a promising nanomaterial with broad applications across various fields, are continuously attracting significant research attention. Despite substantial progress in understanding their growth mechanisms, synthesis methods, and post-processing techniques, two major goals remain challenging: achieving property-targeted growth and efficient mass production. Recent advancements in computational methods driven by increased computational resources, the development of platforms, and the refinement of theoretical models, have significantly deepened our understanding of the mechanisms underlying CNT growth. This review aims to comprehensively examine the latest computational techniques that shed light on various aspects of CNT synthesis. The first part of this review focuses on progress in computational methods. Beginning with atomistic simulation approaches, we introduce the fundamentals and advancements in density functional theory (DFT), molecular dynamics (MD) simulations, and kinetic Monte Carlo (kMC) simulations. We discuss the applicability and limitations of each method in studying mechanisms of CNT growth. Then, the focus shifts to multiscale modeling approaches, where we demonstrate the coupling of atomic-scale simulations with reactor-scale multiphase flow models. Given that CNT growth inherently spans multiple temporal and spatial scales, the development and application of multiscale modeling techniques are poised to become a central focus of future computational research in this field. Furthermore, this review emphasizes the growing role played by machine learning in CNT growth research. Compared with traditional physics-based simulation methods, data-driven machine learning approaches have rapidly emerged in recent years, revolutionizing research paradigms from molecular simulation to experimental design. 
In the second part of this review, we highlight the latest advancements in CNT growth mechanisms and synthesis methods achieved through computational techniques. These include novel findings across fundamental growth stages, i.e., from nucleation to elongation and ultimately termination. We also examine the dynamic behaviors of catalyst nanoparticles and chirality-controlled growth processes, emphasizing how these insights contribute to advancing the field. Finally, in the concluding section, we propose future directions for advancements of computational approaches toward deeper understanding of CNT growth mechanisms and better support of CNT manufacturing.
Affiliation(s)
- Linzheng Wang
  - Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, 02139, MA, USA
- Nicolas Tricard
  - Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, 02139, MA, USA
- Zituo Chen
  - Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, 02139, MA, USA
- Sili Deng
  - Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, 02139, MA, USA
5
Kartha A, Ajayakumar DP, Idris M, Ragupathy G. Unlocking the Potential of Machine Learning in Enhancing Quantum Chemical Calculations for Infrared Spectral Prediction. ACS Omega 2025; 10:19224-19234. [PMID: 40385139 PMCID: PMC12079248 DOI: 10.1021/acsomega.5c02405] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2025] [Revised: 03/21/2025] [Accepted: 03/27/2025] [Indexed: 05/20/2025]
Abstract
Infrared (IR) spectroscopy is a fundamental tool for analyzing molecular structures and chemical interactions by identifying the vibrational modes of molecules. Traditional quantum mechanical methods, such as density functional theory, are highly accurate but computationally expensive and impractical for large-scale molecular systems. This project investigates the integration of machine learning (ML) techniques to predict IR spectra, offering a promising alternative that significantly reduces computational costs while maintaining high accuracy. Additionally, the project explores the utilization of IR spectra for molecular identification and classification into molecular families, enhancing the practical utility of spectral data in various scientific applications. Using TensorFlow-based ML frameworks, models were developed and trained on a data set derived from high-quality computational chemistry analyzers. These data sets, sourced from computationally optimized geometry and IR spectrum from the Gaussian 16 Program Suite, include extensive molecular geometry data, bond lengths, vibrational modes, and other quantum mechanical properties. The models aim to predict key IR spectral features, such as vibrational frequencies and intensities, while maintaining interpretability by linking chemical and quantum mechanical principles to predictions. The integration of ML with IR spectroscopy provides a scalable as well as accelerated solution for analyzing complex molecular systems. This approach holds potential in fields such as drug discovery, materials science, and chemical engineering, where rapid and accurate spectral predictions are critical. This perspective highlights the advancements achieved, the current challenges, and the future potential of ML in the context of IR spectroscopy, providing a solid foundation for further exploration at the intersection of chemistry and data science.
Affiliation(s)
- Adithya Ranjith Kartha
  - School of Computer Science and Engineering (SCOPE), Vellore Institute of Technology, Vellore, Tamil Nadu 632014, India
- Dhanush P. Ajayakumar
  - School of Computer Science and Engineering (SCOPE), Vellore Institute of Technology, Vellore, Tamil Nadu 632014, India
- Muhammad Idris
  - School of Computer Science and Engineering (SCOPE), Vellore Institute of Technology, Vellore, Tamil Nadu 632014, India
- Gopi Ragupathy
  - Department of Chemistry, School of Advanced Sciences, Vellore Institute of Technology, Vellore 632014, India
6
Wang Z, You F. Leveraging generative models with periodicity-aware, invertible and invariant representations for crystalline materials design. Nature Computational Science 2025; 5:365-376. [PMID: 40346195 DOI: 10.1038/s43588-025-00797-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/03/2024] [Accepted: 03/25/2025] [Indexed: 05/11/2025]
Abstract
Designing periodicity-aware, invariant and invertible representations provides an opportunity for the inverse design of crystalline materials with desired properties by generative models. This objective requires optimizing representations and refining the architecture of generative models, yet its feasibility remains uncertain, given current progress in molecular inverse generation. In this Perspective, we highlight the progress of various methods for designing representations and generative schemes for crystalline materials, discuss the challenges in the field and propose a roadmap for future developments.
Affiliation(s)
- Zhilong Wang
  - Cornell University AI for Science Institute, Cornell University, Ithaca, NY, USA
  - College of Engineering, Cornell University, Ithaca, NY, USA
  - Robert Frederick Smith School of Chemical and Biomolecular Engineering, Cornell University, Ithaca, NY, USA
- Fengqi You
  - Cornell University AI for Science Institute, Cornell University, Ithaca, NY, USA
  - College of Engineering, Cornell University, Ithaca, NY, USA
  - Robert Frederick Smith School of Chemical and Biomolecular Engineering, Cornell University, Ithaca, NY, USA
7
Sigmund LM, Assante M, Johansson MJ, Norrby PO, Jorner K, Kabeshov M. Computational tools for the prediction of site- and regioselectivity of organic reactions. Chem Sci 2025; 16:5383-5412. [PMID: 40070469 PMCID: PMC11891785 DOI: 10.1039/d5sc00541h] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2025] [Accepted: 03/03/2025] [Indexed: 03/14/2025] Open
Abstract
The regio- and site-selectivity of organic reactions is one of the most important aspects when it comes to synthesis planning. Due to that, massive research efforts were invested into computational models for regio- and site-selectivity prediction, and the introduction of machine learning to the chemical sciences within the past decade has added a whole new dimension to these endeavors. This review article walks through the currently available predictive tools for regio- and site-selectivity with a particular focus on machine learning models while being organized along the individual reaction classes of organic chemistry. Respective featurization techniques and model architectures are described and compared to each other; applications of the tools to critical real-world examples are highlighted. This paper aims to serve as an overview of the field's status quo for both the intended users of the tools, that is synthetic chemists, as well as for developers to find potential new research avenues.
Affiliation(s)
- Lukas M Sigmund
  - Molecular AI, Discovery Sciences, R&D, AstraZeneca Gothenburg, Pepparedsleden 1, 43183 Mölndal, Sweden
- Michele Assante
  - Innovation Centre in Digital Molecular Technologies, Department of Chemistry, University of Cambridge, Lensfield Rd, Cambridge CB2 1EW, UK
  - Compound Synthesis & Management, The Discovery Centre, AstraZeneca Cambridge, Cambridge Biomedical Campus, 1 Francis Crick Avenue, CB2 0AA Cambridge, UK
- Magnus J Johansson
  - Medicinal Chemistry, Research and Early Development, Cardiovascular, Renal and Metabolism (CVRM), BioPharmaceuticals, R&D, AstraZeneca Gothenburg, Pepparedsleden 1, 43183 Mölndal, Sweden
- Per-Ola Norrby
  - Data Science & Modelling, Pharmaceutical Sciences, R&D, AstraZeneca Gothenburg, Pepparedsleden 1, 43183 Mölndal, Sweden
- Kjell Jorner
  - ETH Zürich, Institute of Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 1, CH-8093 Zürich, Switzerland
  - National Centre of Competence in Research (NCCR) Catalysis, ETH Zurich, Zurich, Switzerland
- Mikhail Kabeshov
  - Molecular AI, Discovery Sciences, R&D, AstraZeneca Gothenburg, Pepparedsleden 1, 43183 Mölndal, Sweden
8
Jiao Z, Mao Y, Lu R, Liu Y, Guo L, Wang Z. Fine-Tuning Graph Neural Networks via Active Learning: Unlocking the Potential of Graph Neural Networks Trained on Nonaqueous Systems for Aqueous CO2 Reduction. J Chem Theory Comput 2025; 21:3176-3186. [PMID: 40084714 DOI: 10.1021/acs.jctc.5c00089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/16/2025]
Abstract
Graph neural networks (GNNs) have revolutionized catalysis research with their efficiency and accuracy in modeling complex chemical interactions. However, adapting GNNs trained on nonaqueous data sets to aqueous systems poses notable challenges due to intricate water interactions. In this study, we proposed an active learning-based fine-tuning approach to extend the applicability of GNNs to aqueous environments. The geometry optimization and transition state search workflows are designed to reduce computational costs while maintaining DFT-level accuracy. Applied to the CO2 reduction reaction, the workflow delivers a 2-3-fold acceleration in geometry optimization through a relaxed force threshold combined with DFT refinement. The versatility of the transition state search algorithm was demonstrated on key C-C coupling pathways, pinpointing *CO-*COH as the most energetically favorable pathway in aqueous systems of Cu and Cu-based Ag, Au, and Zn alloys. The Brønsted-Evans-Polanyi relationship remains robust under water-induced fluctuations, with alloyed metals such as Al, Ga, and Pd, along with Ag, Au, and Zn, exhibiting coupling efficiency comparable to that of Cu. Additionally, perturbation-based training on forces and energies extends the application of GNNs to aqueous ab initio molecular dynamics simulations, enabling efficient modeling of dynamical trajectories. This work presents novel approaches to adapting nonaqueous models for application in aqueous systems, highlighting GNNs' potential in solvated environments and laying a foundation for accelerating predictions of catalytic mechanisms under realistic conditions.
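For readers unfamiliar with the Brønsted-Evans-Polanyi (BEP) relationship mentioned in this abstract, it is conventionally written as a linear correlation between the activation energy of an elementary step and its reaction energy within a family of related reactions. The symbols below are the usual textbook ones and are not taken from the abstract; the fitted values of the slope and intercept for the C-C coupling steps are not given here.

```latex
% Textbook form of a BEP relation (notation assumed, not from the abstract):
%   E_a                     activation energy of the elementary step
%   \Delta E_{\mathrm{rxn}} reaction energy of the same step
%   \alpha, \beta           slope and intercept fitted for one reaction family
\[
  E_a = \alpha \, \Delta E_{\mathrm{rxn}} + \beta
\]
```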
Affiliation(s)
- Zihao Jiao
  - International Research Center for Renewable Energy, State Key Laboratory of Multiphase Flow in Power Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
  - School of Chemical Sciences, University of Auckland, Auckland 1010, New Zealand
- Yu Mao
  - School of Chemical Sciences, University of Auckland, Auckland 1010, New Zealand
- Ruihu Lu
  - School of Chemical Sciences, University of Auckland, Auckland 1010, New Zealand
- Ya Liu
  - International Research Center for Renewable Energy, State Key Laboratory of Multiphase Flow in Power Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
- Liejin Guo
  - International Research Center for Renewable Energy, State Key Laboratory of Multiphase Flow in Power Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
- Ziyun Wang
  - School of Chemical Sciences, University of Auckland, Auckland 1010, New Zealand
9
Chen J, Gu Y, Zhu Q, Gu Y, Liang X, Ma J. Automated Machine Learning of Interfacial Interaction Descriptors and Energies in Metal-Catalyzed N2 and CO2 Reduction Reactions. Langmuir: the ACS Journal of Surfaces and Colloids 2025; 41:3490-3502. [PMID: 39885810 DOI: 10.1021/acs.langmuir.4c04638] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2025]
Abstract
The applications of machine learning (ML) in complex interfacial interactions are hindered by the time-consuming process of manual feature selection and model construction. An automated ML program was implemented with four subsequent steps: data distribution analysis, dimensionality reduction and clustering, feature selection, and model optimization. Without the need of manual intervention, the descriptors of metal charge variance (ΔQCT) and electronegativity of substrate (χsub) and metal (δχM) were raised up with good performance in predicting electrochemical reaction energies for both nitrogen reduction reaction (NRR) and CO2 reduction reaction (CO2RR) on metal-zeolites and MoS2 surfaces. The important role of interfacial interactions in tuning the catalytic reactivity in NRR and CO2RR was highlighted from SHAP analysis. It was proposed that Fe-, Cr-, Zn-, Nb-, and Ta-zeolites are favorable catalysts for NRR, while Ni-zeolite showed a preference for CO2RR. An elongated bond of N2 or a bent configuration of CO2 was shown in V-, Co-, and Mo-zeolites, indicating that the molecule could be activated after the adsorption in both NRR and CO2RR pathways. The generalizability of the automatically built ML model is demonstrated from applications to other catalytic systems such as metal-organic frameworks and SiO2 surfaces. The automated ML program is a useful tool to accelerate the data-driven exploration of relationship between structures and material properties without the need of manual feature selection.
Affiliation(s)
- Jiawei Chen
  - State Key Laboratory of Coordination Chemistry, Key Laboratory of Mesoscopic Chemistry of Ministry of Education, School of Chemistry and Chemical Engineering, Nanjing University, Nanjing 210023, P. R. China
- Yuming Gu
  - State Key Laboratory of Coordination Chemistry, Key Laboratory of Mesoscopic Chemistry of Ministry of Education, School of Chemistry and Chemical Engineering, Nanjing University, Nanjing 210023, P. R. China
- Qin Zhu
  - State Key Laboratory of Coordination Chemistry, Key Laboratory of Mesoscopic Chemistry of Ministry of Education, School of Chemistry and Chemical Engineering, Nanjing University, Nanjing 210023, P. R. China
- Yating Gu
  - State Key Laboratory of Coordination Chemistry, Key Laboratory of Mesoscopic Chemistry of Ministry of Education, School of Chemistry and Chemical Engineering, Nanjing University, Nanjing 210023, P. R. China
- Xinyi Liang
  - State Key Laboratory of Coordination Chemistry, Key Laboratory of Mesoscopic Chemistry of Ministry of Education, School of Chemistry and Chemical Engineering, Nanjing University, Nanjing 210023, P. R. China
- Jing Ma
  - State Key Laboratory of Coordination Chemistry, Key Laboratory of Mesoscopic Chemistry of Ministry of Education, School of Chemistry and Chemical Engineering, Nanjing University, Nanjing 210023, P. R. China
10
Ramos MC, Collison CJ, White AD. A review of large language models and autonomous agents in chemistry. Chem Sci 2025; 16:2514-2572. [PMID: 39829984 PMCID: PMC11739813 DOI: 10.1039/d4sc03921a] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/15/2024] [Accepted: 12/03/2024] [Indexed: 01/22/2025] Open
Abstract
Large language models (LLMs) have emerged as powerful tools in chemistry, significantly impacting molecule design, property prediction, and synthesis optimization. This review highlights LLM capabilities in these domains and their potential to accelerate scientific discovery through automation. We also review LLM-based autonomous agents: LLMs with a broader set of tools to interact with their surrounding environment. These agents perform diverse tasks such as paper scraping, interfacing with automated laboratories, and synthesis planning. As agents are an emerging topic, we extend the scope of our review of agents beyond chemistry and discuss across any scientific domains. This review covers the recent history, current capabilities, and design of LLMs and autonomous agents, addressing specific challenges, opportunities, and future directions in chemistry. Key challenges include data quality and integration, model interpretability, and the need for standard benchmarks, while future directions point towards more sophisticated multi-modal agents and enhanced collaboration between agents and experimental methods. Due to the quick pace of this field, a repository has been built to keep track of the latest studies: https://github.com/ur-whitelab/LLMs-in-science.
Affiliation(s)
- Mayk Caldas Ramos
  - FutureHouse Inc., San Francisco, CA, USA
  - Department of Chemical Engineering, University of Rochester, Rochester, NY, USA
- Christopher J Collison
  - School of Chemistry and Materials Science, Rochester Institute of Technology, Rochester, NY, USA
- Andrew D White
  - FutureHouse Inc., San Francisco, CA, USA
  - Department of Chemical Engineering, University of Rochester, Rochester, NY, USA
11
Rani N, Kumar R, Mazumder S. AI-Driven Discovery of Asymmetric Pauson-Khand Reactions: A New Toolbox in a Synthetic Chemist's Treasure. J Phys Chem A 2024; 128:10452-10463. [PMID: 39570149 DOI: 10.1021/acs.jpca.4c06701] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2024]
Abstract
Enantioselective catalytic reactions have a significant impact on chemical synthesis, and they are important components in an experimental chemist's toolbox. However, development of asymmetric catalysts often relies on the chemical intuition and experience of a synthetic chemist, making the process both time-consuming and resource-intensive. The machine-learning-assisted reaction discovery can serve as a very efficient platform for obtaining high-performing catalysts in a time-economical manner without extensive experimentation. Herein, we report a data-driven and machine learning method for reliably predicting enantiomeric excess (%ee) of 211 asymmetric Pauson-Khand reactions (PKR 1-PKR 211) between a variety of 45 unique 1,6-enyne substrates and 12 unique axially chiral biaryl ligands in the presence of different reaction conditions like varying CO gas pressure, temperature, and solvent polarity. Four different machine learning algorithms have been studied: extreme gradient boosting (XGBoost), random forest (RF), light gradient boosting machine (LGBM), and neural network (NN). A fivefold cross validation method was applied to our k-means SMOTE-augmented data set to obtain the optimized hyperparameters for the training set, and subsequently, these parameters were used in the test data set. In the case of the out-of-box set, the XGBoost method is found to be superior among all four machine learning methods investigated. Our out-of-box samples contain a total of 12 unique asymmetric Pauson-Khand reactions (PKR 212-PKR 223) arising from three new 1,3-benzodioxole-based SEGPHOS catalysts, which were never included in the training set. The XGBoost algorithm shows an impressive root mean square error (RMSE) of 7.06 (±1.11) in predicting %ee. The XGBoost-predicted %ee values match reasonably well with the experimental results. 
The absolute difference between the experimental and XGBoost-calculated %ee values ranges from 0.9 to 7.6 for the majority of the out-of-box Pauson-Khand reactions. The reactions with fluoro-substituted-SEGPHOS ligand L14 shows smaller deviations from the experimental %ee values compared to the reactions with L13 and L15 catalysts where the benzodioxole units do not have fluorine atoms. Finally, we have discovered a library of 3357 lead reactions with excellent %ee (≥99) by engaging the experimentally unknown combinations of the catalysts, substrates, and reaction conditions. The axially chiral biaryl catalysts and enyne substrates present in the library are synthetically accessible. The ligand space in the library is dominated by the presence of tol-BINAP and the DTBM-OMe-BIPHEP ligands. The substrate space is predominantly occupied by NTs-tethered, O-tethered, NBn-tethered, and C(CO2Me)2-tethered 1,6-enynes that have an H or methyl functional group present in the alkyne unit. Our newly discovered library assists a synthetic chemist to develop a highly enantioselective PKR by starting with a priori knowledge without extensive trial-and-error experimentation.
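The two evaluation ideas named in this abstract, the root mean square error on %ee and fivefold cross-validation, can be illustrated with a minimal standard-library sketch. Everything concrete below is an assumption made for illustration: the function names `rmse` and `k_fold_indices` are hypothetical, the %ee values are invented, and the folds are contiguous and unshuffled, which may differ from the splitting the authors actually used.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds.

    The first n_samples % k folds receive one extra sample so that every
    index appears in exactly one test fold.
    """
    if k < 2 or k > n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    start = 0
    for fold in range(k):
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Invented %ee values, for illustration only (not data from the paper).
experimental_ee = [92.0, 85.0, 97.0, 78.0]
predicted_ee = [90.0, 88.0, 95.0, 81.0]
print("RMSE in %ee units:", rmse(experimental_ee, predicted_ee))

# Fivefold split of a hypothetical 10-sample data set.
for train_idx, test_idx in k_fold_indices(10, k=5):
    print("test fold:", test_idx)
```

In practice the folds would normally be shuffled, and any oversampling step such as SMOTE is usually fitted on the training folds only, so that synthetic points do not leak into the held-out fold.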
Affiliation(s)
- Neha Rani
  - Department of Chemistry, Indian Institute of Technology Jammu, Jammu 181221, India
- Rohit Kumar
  - Novartis, HITEC City, Hyderabad, Telangana 500081, India
- Shivnath Mazumder
  - Department of Chemistry, Indian Institute of Technology Jammu, Jammu 181221, India
|
12
|
Uceda RG, Gijón A, Míguez‐Lago S, Cruz CM, Blanco V, Fernández‐Álvarez F, Álvarez de Cienfuegos L, Molina‐Solana M, Gómez‐Romero J, Miguel D, Mota AJ, Cuerva JM. Can Deep Learning Search for Exceptional Chiroptical Properties? The Halogenated [6]Helicene Case. Angew Chem Int Ed Engl 2024; 63:e202409998. [PMID: 39329214 PMCID: PMC11586703 DOI: 10.1002/anie.202409998] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2024] [Revised: 09/11/2024] [Accepted: 09/24/2024] [Indexed: 09/28/2024]
Abstract
The relationship between chemical structure and chiroptical properties is not always clearly understood. At present, efforts to develop new systems with enhanced optical properties follow a trial-and-error approach. A large amount of data would allow us to draw more robust conclusions and guide research toward molecules with practical applications. To this end, in this work we predict the chiroptical properties of millions of halogenated [6]helicenes in terms of the rotatory strength (R). We used DFT calculations on randomly created derivatives containing from 1 to 16 halogen atoms, which were then used as a data set to train different deep neural network models. These models allow us to i) predict the Rmax for any halogenated [6]helicene at a very low computational cost, and ii) understand the physical reasons that favour some substitutions over others. Finally, we synthesized derivatives with higher predicted Rmax, obtaining excellent correlation between the experimental and predicted values.
Affiliation(s)
- Rafael G. Uceda
  - Departamento de Química Orgánica, Unidad de Excelencia de Química Aplicada a la Biomedicina y Medioambiente (UEQ), Universidad de Granada (UGR), Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
- Alfonso Gijón
  - Departamento de Ciencias de la Computación e Inteligencia Artificial, UGR, E.T.S. de Ingenierías Informática y de Telecomunicación, C/ Periodista Daniel Saucedo Aranda S/N, 18071 Granada, Spain
- Sandra Míguez‐Lago
  - Departamento de Química Orgánica, Unidad de Excelencia de Química Aplicada a la Biomedicina y Medioambiente (UEQ), Universidad de Granada (UGR), Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
- Carlos M. Cruz
  - Departamento de Química Orgánica, Unidad de Excelencia de Química Aplicada a la Biomedicina y Medioambiente (UEQ), Universidad de Granada (UGR), Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
- Víctor Blanco
  - Departamento de Química Orgánica, Unidad de Excelencia de Química Aplicada a la Biomedicina y Medioambiente (UEQ), Universidad de Granada (UGR), Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
- Fátima Fernández‐Álvarez
  - Departamento de Química Orgánica, Unidad de Excelencia de Química Aplicada a la Biomedicina y Medioambiente (UEQ), Universidad de Granada (UGR), Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
- Luis Álvarez de Cienfuegos
  - Departamento de Química Orgánica, Unidad de Excelencia de Química Aplicada a la Biomedicina y Medioambiente (UEQ), Universidad de Granada (UGR), Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
  - Instituto de Investigación Biosanitaria, Avda. Madrid, 15, 18016 Granada, Spain
- Miguel Molina‐Solana
  - Departamento de Ciencias de la Computación e Inteligencia Artificial, UGR, E.T.S. de Ingenierías Informática y de Telecomunicación, C/ Periodista Daniel Saucedo Aranda S/N, 18071 Granada, Spain
- Juan Gómez‐Romero
  - Departamento de Ciencias de la Computación e Inteligencia Artificial, UGR, E.T.S. de Ingenierías Informática y de Telecomunicación, C/ Periodista Daniel Saucedo Aranda S/N, 18071 Granada, Spain
- Delia Miguel
  - Departamento de Fisicoquímica, UEQ, UGR, Facultad de Farmacia, Avda. Profesor Clavera s/n, C. U. Cartuja, 18071 Granada, Spain
- Antonio J. Mota
  - Departamento de Química Inorgánica, UEQ, UGR, Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
- Juan M. Cuerva
  - Departamento de Química Orgánica, Unidad de Excelencia de Química Aplicada a la Biomedicina y Medioambiente (UEQ), Universidad de Granada (UGR), Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
|
13
|
Yang B, Schaefer AJ, Small BL, Leseberg JA, Bischof SM, Webster-Gardiner MS, Ess DH. Experimentally-based Fe-catalyzed ethene oligomerization machine learning model provides highly accurate prediction of propagation/termination selectivity. Chem Sci 2024:d4sc03433c. [PMID: 39449687 PMCID: PMC11495513 DOI: 10.1039/d4sc03433c] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2024] [Accepted: 10/09/2024] [Indexed: 10/26/2024] Open
Abstract
Linear α-olefins (1-alkenes) are critical comonomers for ethene copolymerization. A major impediment in the development of new homogeneous Fe catalysts for ethene oligomerization to produce comonomers and other important commercial products is the prediction of the propagation versus termination rates that control the α-olefin distribution (e.g., 1-butene through 1-decene), which is often expressed as a K-value. Because the transition states for propagation and termination are generally separated by less than 1 kcal mol⁻¹ in energy, this selectivity cannot be accurately predicted by either DFT or wavefunction methods (even DLPNO-CCSD(T)). Therefore, we developed a machine learning model with sub-kcal mol⁻¹ accuracy, based on several hundred experimental selectivity values and straightforward 2D chemical and physical features, that enables the prediction of α-olefin distribution K-values. As part of our model, we developed a new ad hoc feature that boosted the model performance. This machine learning model captures the effects of a broad range of ligand architectures and chemically nonintuitive trends in oligomerization selectivity. Our machine learning model was experimentally validated by predicting a K-value for a new Fe phosphaneyl-pyridinyl-quinoline catalyst, followed by an experimental measurement that showed precise agreement. In addition to quantitative predictions, we demonstrate how this machine learning model can provide qualitative catalyst design guidance using a proximity-of-pairs type analysis.
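To see why a sub-kcal mol⁻¹ energy difference matters for this selectivity, the following minimal Python sketch converts a free-energy gap between the termination and propagation transition states into a rate ratio and a K-value. It rests on assumptions that are not stated in the abstract: the common Schulz-Flory convention K = k_prop / (k_prop + k_term), a simple transition-state-theory ratio exp(ΔΔG‡/RT) with identical prefactors and concentration dependence for both pathways, and a temperature of 298.15 K. The function names are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1


def propagation_termination_ratio(ddg_kcal, temperature_k=298.15):
    """k_prop / k_term for ddG = G(termination TS) - G(propagation TS).

    Assumes both pathways share the same prefactor, so only the
    free-energy gap (in kcal/mol) controls the ratio.
    """
    return math.exp(ddg_kcal / (R_KCAL * temperature_k))


def k_value(ddg_kcal, temperature_k=298.15):
    """Schulz-Flory-style K = k_prop / (k_prop + k_term) (assumed convention)."""
    ratio = propagation_termination_ratio(ddg_kcal, temperature_k)
    return ratio / (1.0 + ratio)


# Compare energy gaps that differ by only 0.5 kcal/mol.
for gap in (0.0, 0.5, 1.0):
    print(f"ddG = {gap:.1f} kcal/mol -> K = {k_value(gap):.2f}")
```

Under these assumptions, moving the gap from 0.5 to 1.0 kcal mol⁻¹ shifts K from roughly 0.70 to roughly 0.84, so an error of a few tenths of a kcal mol⁻¹ in a computed barrier difference is enough to change the predicted olefin distribution substantially.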
Affiliation(s)
- Bo Yang
  - Department of Chemistry and Biochemistry, Brigham Young University Provo Utah 84602 USA
- Anthony J Schaefer
  - Department of Chemistry and Biochemistry, Brigham Young University Provo Utah 84602 USA
- Brooke L Small
  - Research & Technology, Chevron Phillips Chemical 1862 Kingwood Drive Kingwood Texas 77339 USA
- Julie A Leseberg
  - Research & Technology, Chevron Phillips Chemical 1862 Kingwood Drive Kingwood Texas 77339 USA
- Steven M Bischof
  - Research & Technology, Chevron Phillips Chemical 1862 Kingwood Drive Kingwood Texas 77339 USA
- Daniel H Ess
  - Department of Chemistry and Biochemistry, Brigham Young University Provo Utah 84602 USA
|
14
|
Singh S, Hernández-Lobato JM. Data-Driven Insights into the Transition-Metal-Catalyzed Asymmetric Hydrogenation of Olefins. J Org Chem 2024; 89:12467-12478. [PMID: 39149801 PMCID: PMC11382158 DOI: 10.1021/acs.joc.4c01396] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/17/2024]
Abstract
The transition-metal-catalyzed asymmetric hydrogenation of olefins is one of the key transformations with great utility in various industrial applications. The field has been dominated by the use of noble metal catalysts, such as iridium and rhodium. The reactions with the earth-abundant cobalt metal have increased only in recent years. In this work, we analyze the large amount of literature data available on iridium- and rhodium-catalyzed asymmetric hydrogenation. The limited data on reactions using Co catalysts are then examined in the context of Ir and Rh to obtain a better understanding of the reactivity pattern. A detailed data-driven study of the types of olefins, ligands, and reaction conditions such as solvent, temperature, and pressure is carried out. Our analysis provides an understanding of the literature trends and demonstrates that only a few olefin-ligand combinations or reaction conditions are frequently used. The knowledge of this bias in the literature data toward a certain group of substrates or reaction conditions can be useful for practitioners to design new reaction data sets that are suitable to obtain meaningful predictions from machine-learning models.
Affiliation(s)
- Sukriti Singh
  - Department of Engineering, University of Cambridge, Cambridge CB2 1PZ, U.K.
|
15
|
Kalikadien AV, Valsecchi C, van Putten R, Maes T, Muuronen M, Dyubankova N, Lefort L, Pidko EA. Probing machine learning models based on high throughput experimentation data for the discovery of asymmetric hydrogenation catalysts. Chem Sci 2024; 15:13618-13630. [PMID: 39211503 PMCID: PMC11352728 DOI: 10.1039/d4sc03647f] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/03/2024] [Accepted: 07/15/2024] [Indexed: 09/04/2024] Open
Abstract
Enantioselective hydrogenation of olefins by Rh-based chiral catalysts has been extensively studied for more than 50 years. Naively, one would expect that everything about this transformation is known and that selecting a catalyst that induces the desired reactivity or selectivity is a trivial task. Nonetheless, ligand engineering or selection for any new prochiral olefin remains an empirical trial-and-error exercise. In this study, we investigated whether machine learning techniques could be used to accelerate the identification of the most efficient chiral ligand. For this purpose, we used high throughput experimentation to build a large dataset consisting of results for Rh-catalyzed asymmetric olefin hydrogenation, specially designed for applications in machine learning. We showcased its alignment with existing literature while addressing observed discrepancies. Additionally, a computational framework for the automated and reproducible quantum-chemistry based featurization of catalyst structures was created. Together with less computationally demanding representations, these descriptors were fed into our machine learning pipeline for both out-of-domain and in-domain prediction tasks of selectivity and reactivity. For out-of-domain purposes, our models showed limited efficacy. It was found that even the most expensive descriptors do not impart significant meaning to the model predictions. The in-domain application, while partly successful for predictions of conversion, emphasizes the need for evaluating the cost-benefit ratio of computationally intensive descriptors and for tailored descriptor design. Challenges persist in predicting enantioselectivity, calling for caution in interpreting results from small datasets. Our insights underscore the importance of dataset diversity with broad substrate inclusion and suggest that mechanistic considerations could improve the accuracy of statistical models.
Affiliation(s)
- Adarsh V Kalikadien
  - Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology Van der Maasweg 9, 2629 HZ Delft The Netherlands
- Cecile Valsecchi
  - Discovery, Product Development and Supply, Janssen Cilag S.p.A. Viale Fulvio Testi, 280/6 20126 Milano Italy
- Robbert van Putten
  - Discovery, Product Development and Supply, Janssen Pharmaceutica N.V. Turnhoutseweg 30 2340 Beerse Belgium
- Tor Maes
  - Discovery, Product Development and Supply, Janssen Pharmaceutica N.V. Turnhoutseweg 30 2340 Beerse Belgium
- Mikko Muuronen
  - Discovery, Product Development and Supply, Janssen Pharmaceutica N.V. Turnhoutseweg 30 2340 Beerse Belgium
- Natalia Dyubankova
  - Discovery, Product Development and Supply, Janssen Pharmaceutica N.V. Turnhoutseweg 30 2340 Beerse Belgium
- Laurent Lefort
  - Discovery, Product Development and Supply, Janssen Pharmaceutica N.V. Turnhoutseweg 30 2340 Beerse Belgium
- Evgeny A Pidko
  - Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology Van der Maasweg 9, 2629 HZ Delft The Netherlands
|
16
|
Li X, Zhong H, Yang H, Li L, Wang Q. High-Throughput Screening and Prediction of Nucleophilicity of Amines Using Machine Learning and DFT Calculations. J Chem Inf Model 2024; 64:6361-6368. [PMID: 39116323 DOI: 10.1021/acs.jcim.4c00724] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/10/2024]
Abstract
The nucleophilic index (NNu) is a key parameter in the screening of amine catalysts. Although the number and variety of amines are extensive, only a limited number exhibit an NNu value exceeding 4.0 eV, which renders them potential nucleophiles in chemical reactions. To address this issue, we propose a computational method to quickly identify amines with high NNu values using machine learning (ML) and high-throughput density functional theory (DFT) calculations. Our approach began with training ML models, exploring molecular fingerprint methods, and developing quantitative structure-activity relationship (QSAR) models for well-known amines based on NNu values derived from DFT calculations. Using explainable Shapley Additive Explanation plots, we determined the five critical substructures that significantly impact the NNu values of amines. These findings were applied to generate 4920 novel hypothetical amines with high NNu values. The QSAR models were employed to predict the NNu values of 259 well-known and 4920 hypothetical amines, resulting in the identification of five novel hypothetical amines with exceptional NNu values (>4.55 eV). The enhanced NNu values of these novel amines were validated by DFT calculations. One novel hypothetical amine, H1, exhibits an unprecedentedly high NNu value of 5.36 eV, surpassing the maximum value (5.35 eV) observed in well-established amines. Our research strategy efficiently accelerates the discovery of highly nucleophilic amines using ML predictions together with DFT calculations.
Affiliation(s)
- Xu Li
  - Laboratory of Electrochemical Energy Storage and Energy Conversion of Hainan Province, School of Chemistry and Chemical Engineering, Hainan Normal University, Haikou 571158, China
  - School of Chemical Engineering and Light Industry, Guangdong University of Technology, Guangzhou 510006, Guangdong, China
- Haoliang Zhong
  - School of Chemical Engineering and Light Industry, Guangdong University of Technology, Guangzhou 510006, Guangdong, China
- Haoyu Yang
  - College of Information and Communication Engineering, Hainan University, Haikou 570228, China
- Lin Li
  - State Key Laboratory of Inorganic Synthesis and Preparative Chemistry, Jilin University, Qianjin Street 2699, Changchun 130012, China
- Qingji Wang
  - College of Information and Communication Engineering, Hainan University, Haikou 570228, China
|
17
|
Su Y, Wang X, Ye Y, Xie Y, Xu Y, Jiang Y, Wang C. Automation and machine learning augmented by large language models in a catalysis study. Chem Sci 2024; 15:12200-12233. [PMID: 39118602 PMCID: PMC11304797 DOI: 10.1039/d3sc07012c] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2023] [Accepted: 06/21/2024] [Indexed: 08/10/2024] Open
Abstract
Recent advancements in artificial intelligence and automation are transforming catalyst discovery and design from a traditional, manual trial-and-error mode into intelligent, high-throughput digital methodologies. This transformation is driven by four key components: high-throughput information extraction, automated robotic experimentation, real-time feedback for iterative optimization, and interpretable machine learning for generating new knowledge. These innovations have given rise to self-driving labs and significantly accelerated materials research. Over the past two years, the emergence of large language models (LLMs) has added a new dimension to this field, providing unprecedented flexibility in information integration, decision-making, and interaction with human researchers. This review explores how LLMs are reshaping catalyst design, heralding a revolutionary change in the field.
Affiliation(s)
- Yuming Su
  - iChem, State Key Laboratory of Physical Chemistry of Solid Surfaces, College of Chemistry and Chemical Engineering, Xiamen University Xiamen 361005 P. R. China
  - Innovation Laboratory for Sciences and Technologies of Energy Materials of Fujian Province (IKKEM) Xiamen 361005 P. R. China
- Xue Wang
  - iChem, State Key Laboratory of Physical Chemistry of Solid Surfaces, College of Chemistry and Chemical Engineering, Xiamen University Xiamen 361005 P. R. China
- Yuanxiang Ye
  - Institute of Artificial Intelligence, Xiamen University Xiamen 361005 P. R. China
- Yibo Xie
  - Institute of Artificial Intelligence, Xiamen University Xiamen 361005 P. R. China
- Yujing Xu
  - iChem, State Key Laboratory of Physical Chemistry of Solid Surfaces, College of Chemistry and Chemical Engineering, Xiamen University Xiamen 361005 P. R. China
- Yibin Jiang
  - Innovation Laboratory for Sciences and Technologies of Energy Materials of Fujian Province (IKKEM) Xiamen 361005 P. R. China
- Cheng Wang
  - iChem, State Key Laboratory of Physical Chemistry of Solid Surfaces, College of Chemistry and Chemical Engineering, Xiamen University Xiamen 361005 P. R. China
  - Innovation Laboratory for Sciences and Technologies of Energy Materials of Fujian Province (IKKEM) Xiamen 361005 P. R. China
|
18
|
Singh S, Hernández-Lobato JM. Deep Kernel learning for reaction outcome prediction and optimization. Commun Chem 2024; 7:136. [PMID: 38877182 PMCID: PMC11178803 DOI: 10.1038/s42004-024-01219-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2024] [Accepted: 06/05/2024] [Indexed: 06/16/2024] Open
Abstract
Recent years have seen rapid growth in the application of various machine learning methods for reaction outcome prediction. Deep learning models have gained popularity due to their ability to learn representations directly from the molecular structure. Gaussian processes (GPs), on the other hand, provide reliable uncertainty estimates but are unable to learn representations from the data. We combine the feature learning ability of neural networks (NNs) with the uncertainty quantification of GPs in a deep kernel learning (DKL) framework to predict the reaction outcome. The DKL model is observed to obtain very good predictive performance across different input representations. It significantly outperforms standard GPs and provides performance comparable to graph neural networks, but with uncertainty estimation. Additionally, the uncertainty estimates on predictions provided by the DKL model facilitated its incorporation as a surrogate model for Bayesian optimization (BO). The proposed method, therefore, has great potential to accelerate reaction discovery by integrating accurate predictive models that provide reliable uncertainty estimates with BO.
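The combination of a neural network with a GP described in this abstract can be written compactly in the usual deep kernel learning notation. The abstract does not specify the base kernel, network architecture, or training objective, so the symbols below (feature extractor g with weights w, base kernel k_base with hyperparameters θ, noise variance σ²) follow the generic textbook formulation and are assumptions rather than the authors' exact model.

```latex
% Deep kernel: a base kernel evaluated on learned neural-network features
k_{\mathrm{DKL}}(x, x') = k_{\mathrm{base}}\bigl(g(x; w),\, g(x'; w) \mid \theta\bigr)

% GP prior over the reaction outcome f, with noisy observations y
f \sim \mathcal{GP}\bigl(0,\, k_{\mathrm{DKL}}\bigr), \qquad
y_i = f(x_i) + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)

% Network weights w and kernel hyperparameters \theta are typically learned
% jointly by maximizing the log marginal likelihood of the n training points
\log p(\mathbf{y} \mid X) =
  -\tfrac{1}{2}\, \mathbf{y}^{\top} \bigl(K + \sigma^2 I\bigr)^{-1} \mathbf{y}
  -\tfrac{1}{2}\, \log \bigl| K + \sigma^2 I \bigr|
  -\tfrac{n}{2}\, \log 2\pi,
\qquad K_{ij} = k_{\mathrm{DKL}}(x_i, x_j)
```

Because the predictive distribution remains Gaussian, it supplies both a mean and a variance for each candidate reaction, which is what allows such a model to act as a surrogate in Bayesian optimization.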
Affiliation(s)
- Sukriti Singh
  - Department of Engineering, University of Cambridge, Cambridge, UK.
|
19
|
Das M, Ghosh A, Sunoj RB. Advances in machine learning with chemical language models in molecular property and reaction outcome predictions. J Comput Chem 2024; 45:1160-1176. [PMID: 38299229 DOI: 10.1002/jcc.27315] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2023] [Revised: 01/06/2024] [Accepted: 01/09/2024] [Indexed: 02/02/2024]
Abstract
Molecular properties and reactions form the foundation of chemical space. Over the years, innumerable molecules have been synthesized; a smaller fraction of them found immediate applications, while a larger proportion served as a testimony to the creative and empirical nature of chemical science. With increasing emphasis on sustainable practices, it is desirable that a target set of molecules be synthesized through fewer empirical attempts, rather than through a larger library, to realize an active candidate. On this front, predictive endeavors using machine learning (ML) models built on available data are of timely significance. Prediction of molecular properties and reaction outcomes remains one of the burgeoning applications of ML in chemical science. Among several methods of encoding molecular samples for ML models, those that employ language-like representations are gaining steady popularity. Such representations additionally help in adopting well-developed natural language processing (NLP) models for chemical applications. Given this advantageous background, herein we describe several successful chemical applications of NLP, focusing on molecular property and reaction outcome predictions. From relatively simple recurrent neural networks (RNNs) to complex models like transformers, different network architectures have been leveraged for tasks such as de novo drug design, catalyst generation, and forward and retrosynthesis predictions. The chemical language model (CLM) provides promising avenues toward a broad range of applications in a time- and cost-effective manner. While we present an optimistic outlook on CLMs, attention is also given to the persisting challenges in the reaction domain, which could be addressed by advanced algorithms tailored to chemical language and by the increased availability of high-quality datasets.
Affiliation(s)
- Manajit Das
  - Department of Chemistry, Indian Institute of Technology Bombay, Mumbai, India
- Ankit Ghosh
  - Department of Chemistry, Indian Institute of Technology Bombay, Mumbai, India
- Raghavan B Sunoj
  - Department of Chemistry, Indian Institute of Technology Bombay, Mumbai, India
  - Centre for Machine Intelligence and Data Science, Indian Institute of Technology Bombay, Mumbai, India
|
20
|
Bi H, Jiang J, Chen J, Kuang X, Zhang J. Machine Learning Prediction of Quantum Yields and Wavelengths of Aggregation-Induced Emission Molecules. Materials (Basel, Switzerland) 2024; 17:1664. [PMID: 38612177 PMCID: PMC11012915 DOI: 10.3390/ma17071664] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2024] [Revised: 03/27/2024] [Accepted: 04/02/2024] [Indexed: 04/14/2024]
Abstract
The aggregation-induced emission (AIE) effect exhibits a significant influence on the development of luminescent materials and has made remarkable progress over the past decades. The advancement of high-performance AIE materials requires fast and accurate predictions of their photophysical properties, which is impeded by the inherent limitations of quantum chemical calculations. In this work, we present an accurate machine learning approach for the fast predictions of quantum yields and wavelengths to screen out AIE molecules. A database of about 563 organic luminescent molecules with quantum yields and wavelengths in the monomeric/aggregated states was established. Individual/combined molecular fingerprints were selected and compared elaborately to attain appropriate molecular descriptors. Different machine learning algorithms combined with favorable molecular fingerprints were further screened to achieve more accurate prediction models. The simulation results indicate that combined molecular fingerprints yield more accurate predictions in the aggregated states, and random forest and gradient boosting regression algorithms show the best predictions in quantum yields and wavelengths, respectively. Given the successful applications of machine learning in quantum yields and wavelengths, it is reasonable to anticipate that machine learning can serve as a complementary strategy to traditional experimental/theoretical methods in the investigation of aggregation-induced luminescent molecules to facilitate the discovery of luminescent materials.
Affiliation(s)
- Jinxiao Zhang
  - College of Chemistry and Bioengineering, Guilin University of Technology, Guilin 541006, China; (H.B.)
|
21
|
Harnik Y, Milo A. A focus on molecular representation learning for the prediction of chemical properties. Chem Sci 2024; 15:5052-5055. [PMID: 38577350 PMCID: PMC10988574 DOI: 10.1039/d4sc90043j] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/06/2024] Open
Abstract
Molecular representation learning (MRL) is a specialized field in which deep-learning models condense essential molecular information into a vectorized form. Whereas recent research has predominantly emphasized drug discovery and bioactivity applications, MRL holds significant potential for diverse chemical properties beyond these contexts. The recently published study by King-Smith introduces a novel application of molecular representation training and compellingly demonstrates its value in predicting molecular properties (E. King-Smith, Chem. Sci., 2024, https://doi.org/10.1039/D3SC04928K). In this focus article, we will briefly delve into MRL in chemistry and the significance of King-Smith's work within the dynamic landscape of this evolving field.
Affiliation(s)
- Yonatan Harnik
  - Department of Chemistry, Ben-Gurion University of the Negev Beer Sheva 84105 Israel
- Anat Milo
  - Department of Chemistry, Ben-Gurion University of the Negev Beer Sheva 84105 Israel
|
22
|
Karsakov GV, Shirobokov VP, Kulakova A, Milichko VA. Prediction of Metal-Organic Frameworks with Phase Transition via Machine Learning. J Phys Chem Lett 2024; 15:3089-3095. [PMID: 38470071 DOI: 10.1021/acs.jpclett.3c03639] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/13/2024]
Abstract
Metal-organic frameworks (MOFs) possess a virtually unlimited number of potential structures. Although this enables an efficient route to control the structure-related functional properties of MOFs, it also complicates the prediction of and search for an optimal structure for a specific application. Alongside the prediction of MOFs for gas sorption/separation and catalysis via machine learning (ML), we report on the use of ML to find MOFs that demonstrate a phase transition (PT). On the basis of the available QMOF database (7463 frameworks), we create and train an autoencoder and then train a classifier of MOFs on a unique database with experimentally confirmed PT. This makes it possible to identify MOFs with a high potential for PT and evaluate the most likely stimulus for it (guest molecules or temperature/pressure). The resulting list of available MOFs for PT allows us to discuss their structural features and opens an opportunity to search for phase-change MOFs for diverse physical/chemical applications.
Affiliation(s)
- Grigory V Karsakov
  - School of Physics and Engineering, ITMO University, St. Petersburg 197101, Russia
- Alena Kulakova
  - School of Physics and Engineering, ITMO University, St. Petersburg 197101, Russia
- Valentin A Milichko
  - School of Physics and Engineering, ITMO University, St. Petersburg 197101, Russia
  - Institut Jean Lamour, Université de Lorraine, Centre National de la Recherche Scientifique (CNRS), F-54000 Nancy, France
|
23
|
Ahmed M, Wang C, Zhao Y, Sathish CI, Lei Z, Qiao L, Sun C, Wang S, Kennedy JV, Vinu A, Yi J. Bridging Together Theoretical and Experimental Perspectives in Single-Atom Alloys for Electrochemical Ammonia Production. Small (Weinheim an der Bergstrasse, Germany) 2024:e2308084. [PMID: 38243883 DOI: 10.1002/smll.202308084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2023] [Revised: 10/26/2023] [Indexed: 01/22/2024]
Abstract
Ammonia is an essential commodity in the food and chemical industries. Despite its energy-intensive nature, the Haber-Bosch process remains the only route to ammonia production at large scale. Developing other strategies is highly desirable, as sustainable and decentralized ammonia production is crucial. Electrochemical ammonia production by directly reducing nitrogen and nitrogen-based moieties, powered by renewable energy sources, holds great potential. However, low ammonia production rates and selectivity hamper its utilization as a large-scale ammonia production process. Creating effective and selective catalysts for the electrochemical generation of ammonia is critical for long-term nitrogen fixation. Single-atom alloys (SAAs) have become a new class of materials with distinctive features that may be able to solve some of the problems with conventional heterogeneous catalysts. The design and optimization of SAAs for electrochemical ammonia generation have recently advanced significantly. This comprehensive review discusses these advancements from theoretical and experimental research perspectives, offering a fundamental understanding of the development of SAAs for ammonia production.
Affiliation(s)
- MuhammadIbrar Ahmed
  - Global Innovative Center of Advanced Nanomaterials, School of Engineering, College of Engineering, Science, and Environment, University of Newcastle, Callaghan, NSW, 2308, Australia
- Cheng Wang
  - CSIRO Energy Centre, 10 Murray Dwyer Circuit, Mayfield West, NSW, 2304, Australia
- Yong Zhao
  - CSIRO Energy Centre, 10 Murray Dwyer Circuit, Mayfield West, NSW, 2304, Australia
- C I Sathish
  - Global Innovative Center of Advanced Nanomaterials, School of Engineering, College of Engineering, Science, and Environment, University of Newcastle, Callaghan, NSW, 2308, Australia
- Zhihao Lei
  - Global Innovative Center of Advanced Nanomaterials, School of Engineering, College of Engineering, Science, and Environment, University of Newcastle, Callaghan, NSW, 2308, Australia
- Liang Qiao
  - University of Electronic Science and Technology of China, Chengdu, 610054, China
- Chenghua Sun
  - Centre for Translational Atomaterials, Faculty of Science, Engineering and Technology, Swinburne University of Technology, Hawthorn, Victoria, 3122, Australia
- Shaobin Wang
  - School of Chemical Engineering and Advanced Materials, The University of Adelaide, Adelaide, SA, 5005, Australia
- John V Kennedy
  - National Isotope Centre, GNS Science, P.O. Box 31312, Lower Hutt, 5010, New Zealand
- Ajayan Vinu
  - Global Innovative Center of Advanced Nanomaterials, School of Engineering, College of Engineering, Science, and Environment, University of Newcastle, Callaghan, NSW, 2308, Australia
- Jiabao Yi
  - Global Innovative Center of Advanced Nanomaterials, School of Engineering, College of Engineering, Science, and Environment, University of Newcastle, Callaghan, NSW, 2308, Australia
24
Lu H, Kang X, Yu H, Zhang W, Luo Y. Using a single complex to predict the reaction energy profile: a case study of Pd/Ni-catalyzed ethylene polymerization. Dalton Trans 2023; 52:14790-14796. [PMID: 37807861 DOI: 10.1039/d3dt02745g] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2023]
Abstract
Mechanism-driven catalyst screening could be greatly accelerated by quantitative prediction models of the reaction energy profile. Here, we propose a novel method for molecular representation, taking palladium- and nickel-catalyzed ethylene polymerization as model reactions. The geometric parameters (GPfra) and electron occupancies (EOfra) from the non-ligand fragment of the η3-complex were extracted as the molecular descriptors, followed by constructing the reaction energy profile prediction models on the basis of various regression algorithms. The models showed great accuracy with respect to both theoretical and experimental data. More importantly, the models are convenient for training and utilization. On one hand, all the features were easily captured from the single η3-complex. On the other hand, further investigation also demonstrated that the models could be constructed with a small training sample size. We believe that our featurization method could possibly be generalized to more organometallic reactions and paves the way to efficient catalyst design.
Affiliation(s)
- Han Lu
  - State Key Laboratory of Fine Chemicals, School of Chemical Engineering, Dalian University of Technology, Dalian 116024, China
- Xiaohui Kang
  - College of Pharmacy, Dalian Medical University, Dalian 116044, China
- Hang Yu
  - Liaoning Key Laboratory of Clean Energy, Shenyang Aerospace University, Shenyang 110136, China
- Wenzhen Zhang
  - State Key Laboratory of Fine Chemicals, School of Chemical Engineering, Dalian University of Technology, Dalian 116024, China
- Yi Luo
  - State Key Laboratory of Fine Chemicals, School of Chemical Engineering, Dalian University of Technology, Dalian 116024, China
  - PetroChina Petrochemical Research Institute, Beijing 102206, China
25
Shilpa S, Kashyap G, Sunoj RB. Recent Applications of Machine Learning in Molecular Property and Chemical Reaction Outcome Predictions. J Phys Chem A 2023; 127:8253-8271. [PMID: 37769193 DOI: 10.1021/acs.jpca.3c04779] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Indexed: 09/30/2023]
Abstract
Burgeoning developments in machine learning (ML) and its rapidly growing adaptations in chemistry are noteworthy. Motivated by the successful deployments of ML in the realm of molecular property prediction (MPP) and chemical reaction prediction (CRP), herein we highlight some of its most recent applications in predictive chemistry. We present a nonmathematical and concise overview of the progression of ML implementations, ranging from an ensemble-based random forest model to advanced graph neural network algorithms. Similarly, the prospects of various feature engineering and feature learning approaches that work in conjunction with ML models are described. Highly accurate predictions reported in MPP tasks (e.g., lipophilicity, solubility, distribution coefficient), using methods such as D-MPNN, MolCLR, SMILES-BERT, and MolBERT, offer promising avenues in molecular design and drug discovery. Whereas MPP pertains to a given molecule, ML applications in chemical reactions present a different level of challenge, primarily arising from the simultaneous involvement of multiple molecules and their diverse roles in a reaction setting. The reported RMSEs in MPP tasks range from 0.287 to 2.20, while those for yield predictions are well over 4.9 in the lower end, reaching thresholds of >10.0 in several examples. Our Review concludes with a set of persisting challenges in dealing with reaction data sets and an overall optimistic outlook on benefits of ML-driven workflows for various MPP as well as CRP tasks.
Affiliation(s)
- Shilpa Shilpa
  - Department of Chemistry, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
- Gargee Kashyap
  - Department of Chemistry, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
- Raghavan B Sunoj
  - Department of Chemistry, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
  - Centre for Machine Intelligence and Data Science, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India