1
Zhang Y, Vlachos DG, Liu D, Fang H. Rapid Adaptation of Chemical Named Entity Recognition Using Few-Shot Learning and LLM Distillation. J Chem Inf Model 2025; 65:4334-4345. [PMID: 40310732 DOI: 10.1021/acs.jcim.5c00248] [Indexed: 05/03/2025]
Abstract
Named entity recognition (NER) has been widely used in chemical text mining for the automatic identification and extraction of chemical entities. However, existing chemical NER systems primarily focus on scenarios with abundant training data, requiring significant human effort on annotations. This poses challenges for applications in the chemical field, such as catalysis, where many advancements have traditionally relied on trial-and-error investigations and incremental adjustment of variables. This hinders catalysis science and technology progress in addressing emerging energy and environmental crises. In this work, we propose a few-shot NER model that can quickly adapt to extract new types of chemical entities by using only a limited number of annotated examples. Our model employs a metric-learning approach to transfer entity similarity knowledge from high-resource chemical domains (with abundant annotations) to enable effective entity recognition in low-resource specialized domains (limited annotation). We validate the effectiveness of our model on a few-shot chemical NER benchmark built based on six existing chemical NER data sets. Experiments show that the proposed few-shot NER model can achieve reasonable performance with only 5 examples per entity type and shows consistent improvement as the number of examples increases. Furthermore, we demonstrate how the proposed model can be trained with large language model (LLM) annotated data, opening a new pathway for rapid adaptation of NER systems. Our approach leverages the knowledge broadness of large language models for chemistry while distilling this knowledge into a lightweight model suitable for efficient and in-house use.
Affiliation(s)
- Yue Zhang
  - Center for Plastics Innovation, University of Delaware, Newark, Delaware 19716, United States
  - Department of Electrical and Computer Engineering, University of Delaware, Newark, Delaware 19716, United States
- Dionisios G Vlachos
  - Center for Plastics Innovation, University of Delaware, Newark, Delaware 19716, United States
  - Department of Chemical and Biomolecular Engineering, University of Delaware, Newark, Delaware 19711, United States
- Dongxia Liu
  - Center for Plastics Innovation, University of Delaware, Newark, Delaware 19716, United States
  - Department of Chemical and Biomolecular Engineering, University of Delaware, Newark, Delaware 19711, United States
- Hui Fang
  - Center for Plastics Innovation, University of Delaware, Newark, Delaware 19716, United States
  - Department of Electrical and Computer Engineering, University of Delaware, Newark, Delaware 19716, United States
2
Zaza L, Ranković B, Schwaller P, Buonsanti R. A Holistic Data-Driven Approach to Synthesis Predictions of Colloidal Nanocrystal Shapes. J Am Chem Soc 2025; 147:6116-6125. [PMID: 39916674 PMCID: PMC11848920 DOI: 10.1021/jacs.4c17283] [Received: 12/04/2024] [Revised: 01/30/2025] [Accepted: 02/03/2025] [Indexed: 02/20/2025]
Abstract
The ability to precisely design colloidal nanocrystals (NCs) has far-reaching implications in optoelectronics, catalysis, biomedicine, and beyond. Achieving such control is generally based on a trial-and-error approach. Data-driven synthesis holds promise to advance both discovery and mechanistic knowledge. Herein, we contribute to advancing the current state of the art in the chemical synthesis of colloidal NCs by proposing a machine-learning toolbox that operates in a low-data regime, yet comprehensive of the most typical parameters relevant for colloidal NC synthesis. The developed toolbox predicts the NC shape given the reaction conditions and proposes reaction conditions given a target NC shape using Cu NCs as the model system. By classifying NC shapes on a continuous energy scale, we synthesize an unreported shape, which is the Cu rhombic dodecahedron. This holistic approach integrates data-driven and computational tools with materials chemistry. Such development is promising to greatly accelerate materials discovery and mechanistic understanding, thus advancing the field of tailored materials with atomic-scale precision tunability.
Affiliation(s)
- Ludovic Zaza
  - Laboratory of Nanochemistry for Energy (LNCE), Department of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne, Sion CH-1950, Switzerland
- Bojana Ranković
  - Laboratory of Artificial Chemical Intelligence (LIAC), Department of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland
- Philippe Schwaller
  - Laboratory of Artificial Chemical Intelligence (LIAC), Department of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland
- Raffaella Buonsanti
  - Laboratory of Nanochemistry for Energy (LNCE), Department of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne, Sion CH-1950, Switzerland
3
Schilling-Wilhelmi M, Ríos-García M, Shabih S, Gil MV, Miret S, Koch CT, Márquez JA, Jablonka KM. From text to insight: large language models for chemical data extraction. Chem Soc Rev 2025; 54:1125-1150. [PMID: 39703015 DOI: 10.1039/d4cs00913d] [Indexed: 12/21/2024]
Abstract
The vast majority of chemical knowledge exists in unstructured natural language, yet structured data is crucial for innovative and systematic materials design. Traditionally, the field has relied on manual curation and partial automation for data extraction for specific use cases. The advent of large language models (LLMs) represents a significant shift, potentially enabling non-experts to extract structured, actionable data from unstructured text efficiently. While applying LLMs to chemical and materials science data extraction presents unique challenges, domain knowledge offers opportunities to guide and validate LLM outputs. This tutorial review provides a comprehensive overview of LLM-based structured data extraction in chemistry, synthesizing current knowledge and outlining future directions. We address the lack of standardized guidelines and present frameworks for leveraging the synergy between LLMs and chemical expertise. This work serves as a foundational resource for researchers aiming to harness LLMs for data-driven chemical research. The insights presented here could significantly enhance how researchers across chemical disciplines access and utilize scientific information, potentially accelerating the development of novel compounds and materials for critical societal needs.
Affiliation(s)
- Mara Schilling-Wilhelmi
  - Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena, Humboldtstrasse 10, 07743 Jena, Germany
- Martiño Ríos-García
  - Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena, Humboldtstrasse 10, 07743 Jena, Germany
  - Institute of Carbon Science and Technology (INCAR), CSIC, Francisco Pintado Fe 26, 33011 Oviedo, Spain
- Sherjeel Shabih
  - Department of Physics and CSMB, Humboldt-Universität zu Berlin, Berlin, Germany
- María Victoria Gil
  - Institute of Carbon Science and Technology (INCAR), CSIC, Francisco Pintado Fe 26, 33011 Oviedo, Spain
- Christoph T Koch
  - Department of Physics and CSMB, Humboldt-Universität zu Berlin, Berlin, Germany
- José A Márquez
  - Department of Physics and CSMB, Humboldt-Universität zu Berlin, Berlin, Germany
- Kevin Maik Jablonka
  - Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena, Humboldtstrasse 10, 07743 Jena, Germany
  - Center for Energy and Environmental Chemistry Jena (CEEC Jena), Friedrich Schiller University Jena, Philosophenweg 7a, 07743 Jena, Germany
  - Helmholtz Institute for Polymers in Energy Applications Jena (HIPOLE Jena), Lessingstrasse 12-14, 07743 Jena, Germany
4
Dangayach R, Jeong N, Demirel E, Uzal N, Fung V, Chen Y. Machine Learning-Aided Inverse Design and Discovery of Novel Polymeric Materials for Membrane Separation. Environ Sci Technol 2025; 59:993-1012. [PMID: 39680111 PMCID: PMC11755723 DOI: 10.1021/acs.est.4c08298] [Received: 08/10/2024] [Revised: 12/03/2024] [Accepted: 12/04/2024] [Indexed: 12/17/2024]
Abstract
Polymeric membranes have been widely used for liquid and gas separation in various industrial applications over the past few decades because of their exceptional versatility and high tunability. Traditional trial-and-error methods for material synthesis are inadequate to meet the growing demands for high-performance membranes. Machine learning (ML) has demonstrated huge potential to accelerate design and discovery of membrane materials. In this review, we cover strengths and weaknesses of the traditional methods, followed by a discussion on the emergence of ML for developing advanced polymeric membranes. We describe methodologies for data collection, data preparation, the commonly used ML models, and the explainable artificial intelligence (XAI) tools implemented in membrane research. Furthermore, we explain the experimental and computational validation steps to verify the results provided by these ML models. Subsequently, we showcase successful case studies of polymeric membranes and emphasize inverse design methodology within a ML-driven structured framework. Finally, we conclude by highlighting the recent progress, challenges, and future research directions to advance ML research for next generation polymeric membranes. With this review, we aim to provide a comprehensive guideline to researchers, scientists, and engineers assisting in the implementation of ML to membrane research and to accelerate the membrane design and material discovery process.
Affiliation(s)
- Raghav Dangayach
  - School of Civil & Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
- Nohyeong Jeong
  - School of Civil & Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
- Elif Demirel
  - School of Civil & Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
- Nigmet Uzal
  - School of Civil & Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
  - Department of Civil Engineering, Abdullah Gul University, 38039 Kayseri, Turkey
- Victor Fung
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
- Yongsheng Chen
  - School of Civil & Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
5
Van Herck J, Gil MV, Jablonka KM, Abrudan A, Anker AS, Asgari M, Blaiszik B, Buffo A, Choudhury L, Corminboeuf C, Daglar H, Elahi AM, Foster IT, Garcia S, Garvin M, Godin G, Good LL, Gu J, Xiao Hu N, Jin X, Junkers T, Keskin S, Knowles TPJ, Laplaza R, Lessona M, Majumdar S, Mashhadimoslem H, McIntosh RD, Moosavi SM, Mouriño B, Nerli F, Pevida C, Poudineh N, Rajabi-Kochi M, Saar KL, Hooriabad Saboor F, Sagharichiha M, Schmidt KJ, Shi J, Simone E, Svatunek D, Taddei M, Tetko I, Tolnai D, Vahdatifar S, Whitmer J, Wieland DCF, Willumeit-Römer R, Züttel A, Smit B. Assessment of fine-tuned large language models for real-world chemistry and material science applications. Chem Sci 2025; 16:670-684. [PMID: 39664810 PMCID: PMC11629507 DOI: 10.1039/d4sc04401k] [Received: 07/03/2024] [Accepted: 11/12/2024] [Indexed: 12/13/2024]
Abstract
The current generation of large language models (LLMs) has limited chemical knowledge. Recently, it has been shown that these LLMs can learn and predict chemical properties through fine-tuning. Using natural language to train machine learning models opens doors to a wider chemical audience, as field-specific featurization techniques can be omitted. In this work, we explore the potential and limitations of this approach. We studied the performance of fine-tuning three open-source LLMs (GPT-J-6B, Llama-3.1-8B, and Mistral-7B) for a range of different chemical questions. We benchmark their performances against "traditional" machine learning models and find that, in most cases, the fine-tuning approach is superior for a simple classification problem. Depending on the size of the dataset and the type of questions, we also successfully address more sophisticated problems. The most important conclusions of this work are that, for all datasets considered, their conversion into an LLM fine-tuning training set is straightforward and that fine-tuning with even relatively small datasets leads to predictive models. These results suggest that the systematic use of LLMs to guide experiments and simulations will be a powerful technique in any research study, significantly reducing unnecessary experiments or computations.
Affiliation(s)
- Joren Van Herck
  - Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL) Rue de l'Industrie 17 CH-1951 Sion Switzerland
- María Victoria Gil
  - Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL) Rue de l'Industrie 17 CH-1951 Sion Switzerland
  - Instituto de Ciencia y Tecnología del Carbono (INCAR), CSIC Francisco Pintado Fe 26 33011 Oviedo Spain
- Kevin Maik Jablonka
  - Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL) Rue de l'Industrie 17 CH-1951 Sion Switzerland
  - Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena Humboldtstrasse 10 07743 Jena Germany
  - Helmholtz Institute for Polymers in Energy Applications Jena (HIPOLE Jena) Lessingstrasse 12-14 07743 Jena Germany
- Alex Abrudan
  - Yusuf Hamied Department of Chemistry, University of Cambridge Cambridge CB2 1EW UK
- Andy S Anker
  - Department of Energy Conversion and Storage, Technical University of Denmark DK-2800 Kgs. Lyngby Denmark
  - Department of Chemistry, University of Oxford Oxford OX1 3TA UK
- Mehrdad Asgari
  - Department of Chemical Engineering & Biotechnology, University of Cambridge Philippa Fawcett Drive Cambridge CB3 0AS UK
- Ben Blaiszik
  - Department of Computer Science, University of Chicago Chicago IL 60637 USA
  - Data Science and Learning Division, Argonne National Laboratory Lemont IL 60439 USA
- Antonio Buffo
  - Department of Applied Science and Technology (DISAT), Politecnico di Torino 10129 Torino Italy
- Leander Choudhury
  - Laboratory of Catalysis and Organic Synthesis (LCSO), Institute of Chemical Sciences and Engineering (ISIC), École Polytechnique Fédérale de Lausanne (EPFL) CH-1015 Lausanne Switzerland
- Clemence Corminboeuf
  - Laboratory for Computational Molecular Design (LCMD), Institute of Chemical Sciences and Engineering (ISIC), École Polytechnique Fédérale de Lausanne (EPFL) CH-1015 Lausanne Switzerland
- Hilal Daglar
  - Department of Chemical and Biological Engineering, Koç University Rumelifeneri Yolu, Sariyer 34450 Istanbul Turkey
- Amir Mohammad Elahi
  - Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL) Rue de l'Industrie 17 CH-1951 Sion Switzerland
- Ian T Foster
  - Department of Computer Science, University of Chicago Chicago IL 60637 USA
  - Data Science and Learning Division, Argonne National Laboratory Lemont IL 60439 USA
- Susana Garcia
  - The Research Centre for Carbon Solutions (RCCS), School of Engineering and Physical Sciences, Heriot-Watt University Edinburgh EH14 4AS UK
- Matthew Garvin
  - The Research Centre for Carbon Solutions (RCCS), School of Engineering and Physical Sciences, Heriot-Watt University Edinburgh EH14 4AS UK
- Lydia L Good
  - Yusuf Hamied Department of Chemistry, University of Cambridge Cambridge CB2 1EW UK
  - Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health Bethesda Maryland 20892 USA
- Jianan Gu
  - Institute of Metallic Biomaterials, Helmholtz Zentrum Hereon Geesthacht Germany
- Noémie Xiao Hu
  - Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL) Rue de l'Industrie 17 CH-1951 Sion Switzerland
- Xin Jin
  - Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL) Rue de l'Industrie 17 CH-1951 Sion Switzerland
- Tanja Junkers
  - Polymer Reaction Design Group, School of Chemistry, Monash University Clayton VIC 3800 Australia
- Seda Keskin
  - Department of Chemical and Biological Engineering, Koç University Rumelifeneri Yolu, Sariyer 34450 Istanbul Turkey
- Tuomas P J Knowles
  - Yusuf Hamied Department of Chemistry, University of Cambridge Cambridge CB2 1EW UK
  - Cavendish Laboratory, Department of Physics, University of Cambridge Cambridge CB3 0HE UK
- Ruben Laplaza
  - Laboratory for Computational Molecular Design (LCMD), Institute of Chemical Sciences and Engineering (ISIC), École Polytechnique Fédérale de Lausanne (EPFL) CH-1015 Lausanne Switzerland
- Michele Lessona
  - Department of Applied Science and Technology (DISAT), Politecnico di Torino 10129 Torino Italy
- Sauradeep Majumdar
  - Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL) Rue de l'Industrie 17 CH-1951 Sion Switzerland
- Ruaraidh D McIntosh
  - Institute of Chemical Sciences, School of Engineering and Physical Sciences, Heriot-Watt University Edinburgh EH14 4AS UK
- Seyed Mohamad Moosavi
  - Chemical Engineering & Applied Chemistry, University of Toronto Toronto Ontario M5S 3E5 Canada
- Beatriz Mouriño
  - Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL) Rue de l'Industrie 17 CH-1951 Sion Switzerland
- Francesca Nerli
  - Dipartimento di Chimica e Chimica Industriale, Unità di Ricerca INSTM, Università di Pisa Via Giuseppe Moruzzi 13 56124 Pisa Italy
- Covadonga Pevida
  - Instituto de Ciencia y Tecnología del Carbono (INCAR), CSIC Francisco Pintado Fe 26 33011 Oviedo Spain
- Neda Poudineh
  - The Research Centre for Carbon Solutions (RCCS), School of Engineering and Physical Sciences, Heriot-Watt University Edinburgh EH14 4AS UK
- Mahyar Rajabi-Kochi
  - Chemical Engineering & Applied Chemistry, University of Toronto Toronto Ontario M5S 3E5 Canada
- Kadi L Saar
  - Yusuf Hamied Department of Chemistry, University of Cambridge Cambridge CB2 1EW UK
- Morteza Sagharichiha
  - Department of Chemical Engineering, College of Engineering, University of Tehran Tehran Iran
- K J Schmidt
  - Department of Computer Science, University of Chicago Chicago IL 60637 USA
- Jiale Shi
  - Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge MA 02139 USA
  - Department of Chemical and Biomolecular Engineering, University of Notre Dame Notre Dame Indiana 46556 USA
- Elena Simone
  - Department of Applied Science and Technology (DISAT), Politecnico di Torino 10129 Torino Italy
- Dennis Svatunek
  - Institute of Applied Synthetic Chemistry, TU Wien Getreidemarkt 9 1060 Vienna Austria
- Marco Taddei
  - Dipartimento di Chimica e Chimica Industriale, Unità di Ricerca INSTM, Università di Pisa Via Giuseppe Moruzzi 13 56124 Pisa Italy
- Igor Tetko
  - BIGCHEM GmbH Valerystraße 49 85716 Unterschleißheim Germany
  - Institute of Structural Biology, Molecular Targets and Therapeutics Center, Helmholtz Munich - Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH) Ingolstädter Landstraße 1 85764 Neuherberg Germany
- Domonkos Tolnai
  - Institute of Metallic Biomaterials, Helmholtz Zentrum Hereon Geesthacht Germany
- Sahar Vahdatifar
  - Department of Chemical Engineering, College of Engineering, University of Tehran Tehran Iran
- Jonathan Whitmer
  - Department of Chemical and Biomolecular Engineering, University of Notre Dame Notre Dame Indiana 46556 USA
  - Department of Chemistry and Biochemistry, University of Notre Dame Notre Dame Indiana 46556 USA
- D C Florian Wieland
  - Institute of Metallic Biomaterials, Helmholtz Zentrum Hereon Geesthacht Germany
- Andreas Züttel
  - Laboratory of Materials for Renewable Energy (LMER), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL) Rue de l'Industrie 17 CH-1951 Sion Switzerland
- Berend Smit
  - Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL) Rue de l'Industrie 17 CH-1951 Sion Switzerland
6
Baskin I, Ein-Eli Y. Chemoinformatics for corrosion science: Data-driven modeling of corrosion inhibition by organic molecules. Mol Inform 2024; 43:e202400082. [PMID: 39404187 DOI: 10.1002/minf.202400082] [Received: 03/06/2024] [Revised: 07/16/2024] [Accepted: 07/17/2024] [Indexed: 11/14/2024]
Abstract
This paper reviews the application of machine learning to the inhibition of corrosion by organic molecules. The methodologies considered include quantitative structure-property relationships (QSPR) and related data-driven approaches. The characteristic features of their key components are considered as applied to corrosion inhibition, including datasets, response properties, molecular descriptors, machine learning methods, and structure-property models. It is shown that the most important factors determining their choice and application features are: (1) the small or very small size of datasets, (2) the mechanism of corrosion inhibition associated with the adsorption of inhibitor molecules on the metal surface, and (3) multifactorial conditioning and noisiness of response property. On this basis, the application of machine learning to the inhibition of corrosion of materials based on iron, aluminum, and magnesium is considered. The main trends in the development of QSPR and related data-driven modeling of corrosion inhibition are discussed, the shortcomings and common errors are considered, and the prospects for their further development are outlined.
Affiliation(s)
- Igor Baskin
  - Department of Materials Science and Engineering, Technion-Israel Institute of Technology, Haifa, 3200003, Israel
- Yair Ein-Eli
  - Department of Materials Science and Engineering, Technion-Israel Institute of Technology, Haifa, 3200003, Israel
  - Grand Technion Energy Program (GTEP), Technion-Israel Institute of Technology, Haifa, 3200003, Israel
7
Wang X, Huang L, Xu S, Lu K. How Does a Generative Large Language Model Perform on Domain-Specific Information Extraction?─A Comparison between GPT-4 and a Rule-Based Method on Band Gap Extraction. J Chem Inf Model 2024; 64:7895-7904. [PMID: 39375999 DOI: 10.1021/acs.jcim.4c00882] [Indexed: 10/09/2024]
Abstract
The advent of generative Large Language Models (LLMs) has greatly impacted the field of Natural Language Processing. However, it is inconclusive how generative LLMs perform on domain-specific information extraction tasks. This study compares the performance of GPT-4 and a rule-based information extraction method based on ChemDataExtractor on band gap information extraction, a task that has important implications for the materials science domain. No training data is required for either method, which is desirable because there is a lack of training data in the materials science domain compared with a variety of material information that is of interest. Manual evaluation on 415 randomly selected articles showed that the GPT-4 model achieved a higher level of accuracy in extracting materials' band gap information than the rule-based method (Correctness 87.95% vs 51.08%, Partial correctness 11.33% vs 36.87%, incorrectness 0.72% vs 12.05%). Further analysis of the errors reveals the strengths and weaknesses of the GPT-4 model compared to the rule-based method. The GPT-4 model shows stronger performance in interdependency resolution and complicated material name recognition, while it also has weaknesses in hallucination, identifying band gap values, and identifying band gap types. Revised prompt based on the error analysis leads to improved accuracy for GPT-4. To the best of our knowledge, this study is the first to compare the GPT-4 model and ChemDataExtractor for the band gap extraction task. This study provides evidence to support using generative LLMs for domain-specific information extraction tasks.
Affiliation(s)
- Xin Wang
  - School of Library and Information Studies, The University of Oklahoma, 401 West Brooks, Norman, Oklahoma 73019, United States
- Liangliang Huang
  - School of Sustainable Chemical, Biological and Materials Engineering, The University of Oklahoma, 100 E. Boyd St., Norman, Oklahoma 73019, United States
- Shuozhi Xu
  - School of Aerospace and Mechanical Engineering, The University of Oklahoma, 865 Asp Ave., Norman, Oklahoma 73019, United States
- Kun Lu
  - School of Library and Information Studies, The University of Oklahoma, 401 West Brooks, Norman, Oklahoma 73019, United States
8
Leong SX, Pablo-García S, Zhang Z, Aspuru-Guzik A. Automated electrosynthesis reaction mining with multimodal large language models (MLLMs). Chem Sci 2024; 15:d4sc04630g. [PMID: 39397816 PMCID: PMC11462585 DOI: 10.1039/d4sc04630g] [Received: 07/12/2024] [Accepted: 09/13/2024] [Indexed: 10/15/2024]
Abstract
Leveraging the chemical data available in legacy formats such as publications and patents is a significant challenge for the community. Automated reaction mining offers a promising solution to unleash this knowledge into a learnable digital form and therefore help expedite materials and reaction discovery. However, existing reaction mining toolkits are limited to single input modalities (text or images) and cannot effectively integrate heterogeneous data that is scattered across text, tables, and figures. In this work, we go beyond single input modalities and explore multimodal large language models (MLLMs) for the analysis of diverse data inputs for automated electrosynthesis reaction mining. We compiled a test dataset of 65 articles (MERMES-T24 set) and employed it to benchmark five prominent MLLMs against two critical tasks: (i) reaction diagram parsing and (ii) resolving cross-modality data interdependencies. The frontrunner MLLM achieved ≥96% accuracy in both tasks, with the strategic integration of single-shot visual prompts and image pre-processing techniques. We integrate this capability into a toolkit named MERMES (multimodal reaction mining pipeline for electrosynthesis). Our toolkit functions as an end-to-end MLLM-powered pipeline that integrates article retrieval, information extraction and multimodal analysis for streamlining and automating knowledge extraction. This work lays the groundwork for the increased utilization of MLLMs to accelerate the digitization of chemistry knowledge for data-driven research.
Affiliation(s)
- Shi Xuan Leong
  - Department of Chemistry, University of Toronto, Lash Miller Chemical Laboratories 80 St. George Street ON M5S 3H6 Toronto Canada
  - Division of Chemistry and Biological Chemistry, School of Chemistry, Chemical Engineering and Biotechnology, Nanyang Technological University 21 Nanyang Link Singapore 637371
- Sergio Pablo-García
  - Department of Chemistry, University of Toronto, Lash Miller Chemical Laboratories 80 St. George Street ON M5S 3H6 Toronto Canada
  - Department of Computer Science, University of Toronto Sandford Fleming Building, 10 King's College Road ON M5S 3G4 Toronto Canada
  - Vector Institute for Artificial Intelligence 661 University Ave. Suite 710 ON M5G 1M1 Toronto Canada
- Zijian Zhang
  - Department of Computer Science, University of Toronto Sandford Fleming Building, 10 King's College Road ON M5S 3G4 Toronto Canada
  - Vector Institute for Artificial Intelligence 661 University Ave. Suite 710 ON M5G 1M1 Toronto Canada
- Alán Aspuru-Guzik
  - Department of Chemistry, University of Toronto, Lash Miller Chemical Laboratories 80 St. George Street ON M5S 3H6 Toronto Canada
  - Department of Computer Science, University of Toronto Sandford Fleming Building, 10 King's College Road ON M5S 3G4 Toronto Canada
  - Vector Institute for Artificial Intelligence 661 University Ave. Suite 710 ON M5G 1M1 Toronto Canada
  - Acceleration Consortium 80 St. George St. M5S 3H6 Toronto Canada
  - Department of Materials Science & Engineering, University of Toronto 184 College St. M5S 3E4 Toronto Canada
  - Department of Chemical Engineering & Applied Chemistry, University of Toronto 200 College St. ON M5S 3E5 Toronto Canada
  - Lebovic Fellow, Canadian Institute for Advanced Research (CIFAR) 661 University Ave. M5G 1M1 Toronto Canada
9
Huang Z, Li X, Li A, Yang Y, He L, Zhang Z, Wu S, Wang Y, Cai S, He Y, Liu X. MPNTEXT: An Interactive Platform for Automatically Extracting Metal-Polyphenol Networks and Their Applications from Scientific Literature. J Chem Inf Model 2024. [PMID: 39258795 DOI: 10.1021/acs.jcim.4c01093] [Indexed: 09/12/2024]
Abstract
In recent years, metal-polyphenol networks (MPNs) have gained significant attention due to their unique properties and broad applications across various fields. However, the burgeoning volume of MPN literature necessitates the automation of chemical information extraction from the extensive corpus of unstructured data, including scientific publications. To address this challenge, we proposed a platform named MPNTEXT, which utilized natural language processing techniques and machine learning algorithms to efficiently identify and extract pertinent information, thereby assisting users in comprehending complex MPNs and their textual descriptions of applications. Users can enter keywords, such as "Fe", "drug delivery", or "tannic acid", to retrieve relevant information, which is then presented in a structured format. This study aims to provide a user-friendly tool for collecting and retrieving MPN data and promotes data-driven material design. The platform offers researchers a more convenient and efficient way to design versatile MPNs and explore their applications.
Affiliation(s)
- Zihui Huang
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Xinyi Li
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Andi Li
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Yuhang Yang
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Liqiang He
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Zhiwen Zhang
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Siwei Wu
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Yang Wang
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Shuting Cai
- School of Integrated Circuits, Guangdong University of Technology, Guangzhou 510006, China
- Yan He
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Xujie Liu
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China

10
Alshehri AS, Horstmann KA, You F. Versatile Deep Learning Pipeline for Transferable Chemical Data Extraction. J Chem Inf Model 2024; 64:5888-5899. [PMID: 39009039 DOI: 10.1021/acs.jcim.4c00816] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/17/2024]
Abstract
Chemical information disseminated in scientific documents offers an untapped potential for deep learning-assisted insights and breakthroughs. Automated extraction efforts have shifted from resource-intensive manual extraction toward applying machine learning methods to streamline chemical data extraction. While current extraction models and pipelines have ushered in notable efficiency improvements, they often exhibit modest performance, compromising the accuracy of predictive models trained on extracted data. Further, current chemical pipelines lack both transferability (where a model trained on one task can be adapted to another relevant task with limited examples) and extensibility, which enables seamless adaptability for new extraction tasks. Addressing these gaps, we present ChemREL, a versatile chemical data extraction pipeline emphasizing performance, transferability, and extensibility. ChemREL utilizes a custom, diverse data set of chemical documents, labeled through an active learning strategy to extract two properties: normal melting point and lethal dose 50 (LD50). The normal melting point is selected for its prevalence in diverse contexts and wider literature, serving as the foundation for pipeline training. In contrast, LD50 evaluates the pipeline's transferability to an unrelated property, underscoring variance in its biological nature, toxicological context, and units, among other differences. With pretraining and fine-tuning, our pipeline outperforms existing methods and GPT-4, achieving F1-scores of 96.1% for entity identification and 97.0% for relation mapping, culminating in an overall F1-score of 95.4%. More importantly, ChemREL displays high transferability, effectively transitioning from melting point extraction to LD50 extraction with 10 randomly selected training documents. Released as an open-source package, ChemREL aims to broaden access to chemical data extraction, enabling the construction of expansive relational data sets that propel discovery.
Affiliation(s)
- Abdulelah S Alshehri
- Robert Frederick Smith School of Chemical and Biomolecular Engineering, Cornell University, Ithaca, New York 14853, United States
- Department of Chemical Engineering, College of Engineering, King Saud University, Riyadh 11421, Saudi Arabia
- Kai A Horstmann
- Department of Computer Science, Cornell University, Ithaca, New York 14853, United States
- Fengqi You
- Robert Frederick Smith School of Chemical and Biomolecular Engineering, Cornell University, Ithaca, New York 14853, United States

11
Wang X, Zhang W, Zhang W. Dielectric Ceramics Database Automatically Constructed by Data Mining in the Literature. J Chem Inf Model 2024; 64:5931-5943. [PMID: 39042485 DOI: 10.1021/acs.jcim.4c00282] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/25/2024]
Abstract
The vast published literature on dielectric ceramics is a natural database for big-data analysis, discovery of structure-property relationships, and property prediction. We constructed a data-mining pipeline based on natural language processing (NLP) to extract property information from about 12,900 published dielectric ceramics articles and normalized more than 20 properties. The micro-F1 scores for sentence classification, named entity recognition, relation extraction (related), and relation extraction (same) are 91.6, 82.4, 91.4, and 88.3%, respectively. We show the distribution of some essential properties by publication year to reveal trends. To test the reliability of the data extraction, we trained an XGBoost model to predict the dielectric constant and used the SHAP module to interpret the contribution of each feature and identify some of the factors that determine the dielectric properties. The result shows that including Q × f in the model can increase the dielectric constant prediction accuracy. Our work can give some hints to experimentalists on their way to improving the performance of cutting-edge materials.
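For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a micro-averaged F1 score is computed: the true-positive, false-positive, and false-negative counts are pooled over all classes before a single F1 is calculated. The `Counts` container, the class labels, and the numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Per-class confusion counts (hypothetical container for this sketch)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def micro_f1(per_class: dict[str, Counts]) -> float:
    """Pool counts over all classes, then compute one F1 score."""
    tp = sum(c.tp for c in per_class.values())
    fp = sum(c.fp for c in per_class.values())
    fn = sum(c.fn for c in per_class.values())
    denom = 2 * tp + fp + fn
    # F1 = 2*TP / (2*TP + FP + FN); return 0.0 when nothing was counted.
    return 2 * tp / denom if denom else 0.0


# Illustrative counts only; these are not results from the paper.
example = {
    "MATERIAL": Counts(tp=80, fp=10, fn=10),
    "PROPERTY": Counts(tp=40, fp=20, fn=10),
}
score = micro_f1(example)
```

Because the counts are pooled, frequent classes dominate the micro-averaged score, which is one reason it can differ from a per-class (macro) average.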
Affiliation(s)
- Xiaochao Wang
- School of Integrated Circuits Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, P.R. China
- Wanli Zhang
- School of Integrated Circuits Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, P.R. China
- Wenxu Zhang
- School of Integrated Circuits Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, P.R. China

12
Tannir S, Pan Y, Josephs N, Cunningham C, Hendrick NR, Beckett A, McNeely J, Beeler A, Jeffries-El M, Kolaczyk ED. Predicting Emission Wavelengths in Benzobisoxazole-Based OLEDs with Gradient Boosted Ensemble Models. J Phys Chem A 2024; 128:6116-6123. [PMID: 39008894 DOI: 10.1021/acs.jpca.4c00077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/17/2024]
Abstract
We demonstrate the use of gradient-boosted ensemble models that accurately predict emission wavelengths in benzobis[1,2-d:4,5-d']oxazole (BBO) based fluorescent emitters. We have curated a database of 50 molecules from previously published data by the Jeffries-El group using density functional theory (DFT) computed ground and excited state features. We consider two machine learning (ML) models based on (i) whole cruciform molecules and (ii) their constituent fragment molecules. Both ML models provide accurate predictions with root-mean-square errors between 30 and 36 nm, competitive with state-of-the-art deep learning models trained on orders of magnitude more molecules, and this accuracy holds even when tested on four new BBO emitters unseen by the models. We also provide an interpretable feature importance analysis and discuss the relevant relationships between DFT and changes in predicted emission wavelength.
Affiliation(s)
- Shambhavi Tannir
- Department of Chemistry, Boston University, Boston, Massachusetts 02215, United States
- Yuning Pan
- Department of Mathematics and Statistics, Boston University, Boston, Massachusetts 02215, United States
- Nathaniel Josephs
- Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, United States
- Nathan R Hendrick
- Department of Chemistry, Boston University, Boston, Massachusetts 02215, United States
- Annie Beckett
- Department of Chemistry, Boston University, Boston, Massachusetts 02215, United States
- James McNeely
- Department of Chemistry, Boston University, Boston, Massachusetts 02215, United States
- Aaron Beeler
- Department of Chemistry, Boston University, Boston, Massachusetts 02215, United States
- Malika Jeffries-El
- Department of Chemistry, Boston University, Boston, Massachusetts 02215, United States
- Division of Material Science and Engineering, Boston University, Boston, Massachusetts 02215, United States
- Eric D Kolaczyk
- Department of Mathematics and Statistics, Boston University, Boston, Massachusetts 02215, United States
- Department of Mathematics and Statistics, McGill University, Montreal, QC H3A 0G4, Canada

13
Gou Y, Zhang Y, Zhu J, Shu Y. A document-level information extraction pipeline for layered cathode materials for sodium-ion batteries. Sci Data 2024; 11:372. [PMID: 38605057 PMCID: PMC11009284 DOI: 10.1038/s41597-024-03196-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Accepted: 03/28/2024] [Indexed: 04/13/2024]
Abstract
Natural language processing techniques enable extraction of valuable information from large amounts of published literature for the application of data science and technology, i.e. machine learning in the field of materials science. Nevertheless, the automated extraction of data from full-text documents remains a complex task. We propose a document-level natural language processing pipeline for literature extraction of comprehensive information on layered cathode materials for sodium-ion batteries. The pipeline enhances entity recognition with contextual supplementary information while capturing the article structure. Finally, a heuristic multi-level relationship extraction algorithm is employed in relation extraction to extract experimental parameters and complex performance relationships respectively. We successfully extracted a comprehensive dataset containing 5265 records from 1747 documents, encompassing essential information such as chemical composition, synthesis parameters, and electrochemical properties. By implementing our pipeline, we have made significant progress in overcoming the challenges associated with data scarcity in battery informatics. The extracted datasets provide a valuable resource for further research and development in the field of layered cathode materials.
Affiliation(s)
- Yuxiao Gou
- School of Materials Science and Engineering, Sun Yat-sen University, Guangdong, China
- Yiping Zhang
- School of Materials Science and Engineering, Sun Yat-sen University, Guangdong, China
- Jian Zhu
- School of Materials Science and Engineering, Sun Yat-sen University, Guangdong, China
- Yidan Shu
- School of Materials Science and Engineering, Sun Yat-sen University, Guangdong, China
- The Key Laboratory of Low-carbon Chemistry & Energy Conservation of Guangdong Province, Guangdong, China

14
Chen L, Wang B, Zhang W, Zheng S, Chen Z, Zhang M, Dong C, Pan F, Li S. Crystal Structure Assignment for Unknown Compounds from X-ray Diffraction Patterns with Deep Learning. J Am Chem Soc 2024; 146:8098-8109. [PMID: 38477574 DOI: 10.1021/jacs.3c11852] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/14/2024]
Abstract
Determining the structures of previously unseen compounds from experimental characterizations is a crucial part of materials science. It requires a step of searching for the structure type that conforms to the lattice of the unknown compound, which enables the pattern matching process for characterization data, such as X-ray diffraction (XRD) patterns. However, this procedure typically places a high demand on domain expertise, thus creating an obstacle for computer-driven automation. Here, we address this challenge by leveraging a deep-learning model composed of a union of convolutional residual neural networks. The accuracy of the model is demonstrated on a dataset of over 60,000 different compounds for 100 structure types, and additional categories can be integrated without the need to retrain the existing networks. We also unravel the operation of the deep-learning black box and highlight the way in which the resemblance between the unknown compound and a structure type is quantified based on both local and global characteristics in XRD patterns. This computational tool opens new avenues for automating structure analysis on materials unearthed in high-throughput experimentation.
Affiliation(s)
- Litao Chen
- School of Advanced Materials, Peking University, Shenzhen Graduate School, Shenzhen 518055, People's Republic of China
- Bingxu Wang
- School of Advanced Materials, Peking University, Shenzhen Graduate School, Shenzhen 518055, People's Republic of China
- Wentao Zhang
- School of Advanced Materials, Peking University, Shenzhen Graduate School, Shenzhen 518055, People's Republic of China
- Shisheng Zheng
- School of Advanced Materials, Peking University, Shenzhen Graduate School, Shenzhen 518055, People's Republic of China
- Zhefeng Chen
- School of Advanced Materials, Peking University, Shenzhen Graduate School, Shenzhen 518055, People's Republic of China
- Mingzheng Zhang
- School of Advanced Materials, Peking University, Shenzhen Graduate School, Shenzhen 518055, People's Republic of China
- Cheng Dong
- School of Advanced Materials, Peking University, Shenzhen Graduate School, Shenzhen 518055, People's Republic of China
- Feng Pan
- School of Advanced Materials, Peking University, Shenzhen Graduate School, Shenzhen 518055, People's Republic of China
- Shunning Li
- School of Advanced Materials, Peking University, Shenzhen Graduate School, Shenzhen 518055, People's Republic of China

15
Polak MP, Morgan D. Extracting accurate materials data from research papers with conversational language models and prompt engineering. Nat Commun 2024; 15:1569. [PMID: 38383556 PMCID: PMC10882009 DOI: 10.1038/s41467-024-45914-8] [Citation(s) in RCA: 27] [Impact Index Per Article: 27.0] [Received: 06/27/2023] [Accepted: 02/05/2024] [Indexed: 02/23/2024]
Abstract
There has been a growing effort to replace manual extraction of data from research papers with automated data extraction based on natural language processing, language models, and recently, large language models (LLMs). Although these methods enable efficient extraction of data from large sets of research papers, they require a significant amount of up-front effort, expertise, and coding. In this work, we propose the ChatExtract method that can fully automate very accurate data extraction with minimal initial effort and background, using an advanced conversational LLM. ChatExtract consists of a set of engineered prompts applied to a conversational LLM that both identify sentences with data, extract that data, and assure the data's correctness through a series of follow-up questions. These follow-up questions largely overcome known issues with LLMs providing factually inaccurate responses. ChatExtract can be applied with any conversational LLMs and yields very high quality data extraction. In tests on materials data, we find precision and recall both close to 90% from the best conversational LLMs, like GPT-4. We demonstrate that the exceptional performance is enabled by the information retention in a conversational model combined with purposeful redundancy and introducing uncertainty through follow-up prompts. These results suggest that approaches similar to ChatExtract, due to their simplicity, transferability, and accuracy are likely to become powerful tools for data extraction in the near future. Finally, databases for critical cooling rates of metallic glasses and yield strengths of high entropy alloys are developed using ChatExtract.
Affiliation(s)
- Maciej P Polak
- Department of Materials Science and Engineering, University of Wisconsin-Madison, Madison, WI, 53706-1595, USA
- Dane Morgan
- Department of Materials Science and Engineering, University of Wisconsin-Madison, Madison, WI, 53706-1595, USA

16
Knoll P, Ouyang B, Steinbock O. Patterns Lead the Way to Far-from-Equilibrium Materials. ACS Physical Chemistry Au 2024; 4:19-30. [PMID: 38283788 PMCID: PMC10811769 DOI: 10.1021/acsphyschemau.3c00050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2023] [Revised: 10/14/2023] [Accepted: 10/19/2023] [Indexed: 01/30/2024]
Abstract
The universe is a complex fabric of repeating patterns that unfold their beauty in system-specific diversity. The periodic table, crystallography, and the genetic code are classic examples that illustrate how even a small number of rules generate a vast range of shapes and structures. Today, we are on the brink of an AI-driven revolution that will reveal an unprecedented number of novel patterns, many of which will escape human intuition and expertise. We suggest that in the second half of the 21st century, the challenge for Physical Chemistry will be to guide and interpret these advances in the broader context of physical sciences and materials-related engineering. If we succeed in this role, Physical Chemistry will be able to extend to new horizons. In this article, we will discuss examples that strike us as particularly promising, specifically the discovery of high-entropy and far-from-equilibrium materials as well as applications to origins-of-life research and the search for life on other planets.
Affiliation(s)
- Pamela Knoll
- School of Physics and Astronomy, Institute for Condensed Matter and Complex Systems, University of Edinburgh, Edinburgh EH9 3FD, U.K.
- Bin Ouyang
- Department of Chemistry and Biochemistry, Florida State University, Tallahassee, Florida 32306-4390, United States
- Oliver Steinbock
- Department of Chemistry and Biochemistry, Florida State University, Tallahassee, Florida 32306-4390, United States

17
Cruse K, Baibakova V, Abdelsamie M, Hong K, Bartel CJ, Trewartha A, Jain A, Sutter-Fella CM, Ceder G. Text Mining the Literature to Inform Experiments and Rationalize Impurity Phase Formation for BiFeO3. Chemistry of Materials: A Publication of the American Chemical Society 2024; 36:772-785. [PMID: 38282687 PMCID: PMC10809418 DOI: 10.1021/acs.chemmater.3c02203] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2023] [Revised: 12/08/2023] [Accepted: 12/08/2023] [Indexed: 01/30/2024]
Abstract
We used data-driven methods to understand the formation of impurity phases in BiFeO3 thin-film synthesis through the sol-gel technique. Using a high-quality dataset of 331 synthesis procedures and outcomes extracted manually from 177 scientific articles, we trained decision tree models that reinforce important experimental heuristics for the avoidance of phase impurities but ultimately show limited predictive capability. We find that several important synthesis features, identified by our model, are often not reported in the literature. To test our ability to correctly impute missing synthesis parameters, we attempted to reproduce nine syntheses from the literature with varying degrees of "missingness". We demonstrate how a text-mined dataset can be made useful by informing new controlled experiments and forming a better understanding for impurity phase formation in this complex oxide system.
Affiliation(s)
- Kevin Cruse
- Department of Materials Science & Engineering, University of California, Berkeley, California 94720, United States
- Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, United States
- Viktoriia Baibakova
- Energy Technologies Area, Lawrence Berkeley National Laboratory, Berkeley, California 94720, United States
- Maged Abdelsamie
- Material Science and Engineering Department, King Fahd University of Petroleum and Minerals (KFUPM), Dhahran 31261, Saudi Arabia
- Interdisciplinary Research Center for Intelligent Manufacturing and Robotics, KFUPM, Dhahran 31261, Saudi Arabia
- Kootak Hong
- Chemical Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, United States
- Department of Materials Science and Engineering, Chonnam National University, Gwangju 61186, Republic of Korea
- Christopher J. Bartel
- Department of Materials Science & Engineering, University of California, Berkeley, California 94720, United States
- Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, United States
- Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, Minnesota 55455, United States
- Amalie Trewartha
- Department of Materials Science & Engineering, University of California, Berkeley, California 94720, United States
- Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, United States
- Energy and Materials, Toyota Research Institute, Los Altos, California 94022, United States
- Anubhav Jain
- Energy Technologies Area, Lawrence Berkeley National Laboratory, Berkeley, California 94720, United States
- Carolin M. Sutter-Fella
- Molecular Foundry Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, United States
- Gerbrand Ceder
- Department of Materials Science & Engineering, University of California, Berkeley, California 94720, United States
- Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, United States

18
Choi J, Lee B. Quantitative Topic Analysis of Materials Science Literature Using Natural Language Processing. ACS Applied Materials & Interfaces 2024; 16:1957-1968. [PMID: 38059688 DOI: 10.1021/acsami.3c12301] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/08/2023]
Abstract
Materials science research has garnered extensive attention from industry, society, policy, and academia. However, understanding the research landscape and extracting strategic insights are challenging due to the increasing diversity and volume of publications. This study proposes a natural language processing-based protocol for extracting text-encoded topics from a large volume of scientific literature, uncovering research interests of scientific communities, as well as convergence trends. We report a topic map, representing the materials science research landscape with text-mined 257 topics regarding biocompatible materials, structural materials, electrochemistry, or photonics. We analyze the topic map in terms of national research interests in materials science, revealing competitive positions and strategies of active nations. For example, it is found that the increasing trend of research interest in machine learning topic was captured in the United States earlier than other nations. Similarly, our journal-level analyses serve as reference information for journal recommendations and trend guidance, showing that the main topics and research interests of materials science journals slightly changed over time. Moreover, we build the topic association network which can highlight the status and future potential of interdisciplinary research, revealing research fields with high centrality in the network such as machine learning-enabled composite modeling, energy policy, or wearable electronics. This study offers insightful results on current and near-future materials science research landscapes, facilitating the understanding of stakeholders, amidst the fast-evolving and diverse knowledge of materials science.
Affiliation(s)
- Jaewoong Choi
- Computational Science Research Center, Korea Institute of Science and Technology, Seoul 02792, Republic of Korea
- Byungju Lee
- Computational Science Research Center, Korea Institute of Science and Technology, Seoul 02792, Republic of Korea

19
Bi X, Lin L, Chen Z, Ye J. Artificial Intelligence for Surface-Enhanced Raman Spectroscopy. Small Methods 2024; 8:e2301243. [PMID: 37888799 DOI: 10.1002/smtd.202301243] [Citation(s) in RCA: 31] [Impact Index Per Article: 31.0] [Received: 09/15/2023] [Revised: 10/11/2023] [Indexed: 10/28/2023]
Abstract
Surface-enhanced Raman spectroscopy (SERS), well acknowledged as a fingerprinting and sensitive analytical technique, has exerted high applicational value in a broad range of fields including biomedicine, environmental protection, food safety among the others. In the endless pursuit of ever-sensitive, robust, and comprehensive sensing and imaging, advancements keep emerging in the whole pipeline of SERS, from the design of SERS substrates and reporter molecules, synthetic route planning, instrument refinement, to data preprocessing and analysis methods. Artificial intelligence (AI), which is created to imitate and eventually exceed human behaviors, has exhibited its power in learning high-level representations and recognizing complicated patterns with exceptional automaticity. Therefore, facing up with the intertwining influential factors and explosive data size, AI has been increasingly leveraged in all the above-mentioned aspects in SERS, presenting elite efficiency in accelerating systematic optimization and deepening understanding about the fundamental physics and spectral data, which far transcends human labors and conventional computations. In this review, the recent progresses in SERS are summarized through the integration of AI, and new insights of the challenges and perspectives are provided in aim to better gear SERS toward the fast track.
Affiliation(s)
- Xinyuan Bi
- State Key Laboratory of Systems Medicine for Cancer, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030, P. R. China
- Li Lin
- State Key Laboratory of Systems Medicine for Cancer, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030, P. R. China
- Zhou Chen
- State Key Laboratory of Systems Medicine for Cancer, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030, P. R. China
- Jian Ye
- State Key Laboratory of Systems Medicine for Cancer, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030, P. R. China
- Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai, 200127, P. R. China
- Shanghai Key Laboratory of Gynecologic Oncology, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, P. R. China

20
Zhang X, Zhou Z, Ming C, Sun YY. GPT-Assisted Learning of Structure-Property Relationships by Graph Neural Networks: Application to Rare-Earth-Doped Phosphors. J Phys Chem Lett 2023; 14:11342-11349. [PMID: 38064589 DOI: 10.1021/acs.jpclett.3c02848] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/22/2023]
Abstract
Two challenges facing machine learning tasks in materials science are data set construction and descriptor design. Graph neural networks circumvent the need for empirical descriptors by encoding geometric information in graphs. Large language models have shown promise for database construction via text extraction. Here, we apply OpenAI's Generative Pre-trained Transformer 4 (GPT-4) and the Crystal Graph Convolutional Neural Network (CGCNN) to the problem of discovering rare-earth-doped phosphors for solid-state lighting. We used GPT-4 to datamine the chemical formulas and emission wavelengths of 264 Eu2+-doped phosphors from 274 articles. A CGCNN model was trained on the acquired data set, achieving a test R2 of 0.77. Using this model, we predicted the emission wavelengths of over 40 000 inorganic materials. We also used transfer learning to fine-tune a bandgap-predicting CGCNN model for emission wavelength prediction. The workflow requires minimal human supervision and is generalizable to other fields.
Affiliation(s)
- Xiang Zhang
- State Key Laboratory of High Performance Ceramics and Superfine Microstructure, Shanghai Institute of Ceramics, Chinese Academy of Sciences, Shanghai 201899, People's Republic of China
- Zichun Zhou
- State Key Laboratory of High Performance Ceramics and Superfine Microstructure, Shanghai Institute of Ceramics, Chinese Academy of Sciences, Shanghai 201899, People's Republic of China
- Chen Ming
- State Key Laboratory of High Performance Ceramics and Superfine Microstructure, Shanghai Institute of Ceramics, Chinese Academy of Sciences, Shanghai 201899, People's Republic of China
- Yi-Yang Sun
- State Key Laboratory of High Performance Ceramics and Superfine Microstructure, Shanghai Institute of Ceramics, Chinese Academy of Sciences, Shanghai 201899, People's Republic of China

21
Suvarna M, Vaucher AC, Mitchell S, Laino T, Pérez-Ramírez J. Language models and protocol standardization guidelines for accelerating synthesis planning in heterogeneous catalysis. Nat Commun 2023; 14:7964. [PMID: 38042926 PMCID: PMC10693572 DOI: 10.1038/s41467-023-43836-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2023] [Accepted: 11/22/2023] [Indexed: 12/04/2023]
Abstract
Synthesis protocol exploration is paramount in catalyst discovery, yet keeping pace with rapid literature advances is increasingly time intensive. Automated synthesis protocol analysis is attractive for swiftly identifying opportunities and informing predictive models, however such applications in heterogeneous catalysis remain limited. In this proof-of-concept, we introduce a transformer model for this task, exemplified using single-atom heterogeneous catalysts (SACs), a rapidly expanding catalyst family. Our model adeptly converts SAC protocols into action sequences, and we use this output to facilitate statistical inference of their synthesis trends and applications, potentially expediting literature review and analysis. We demonstrate the model's adaptability across distinct heterogeneous catalyst families, underscoring its versatility. Finally, our study highlights a critical issue: the lack of standardization in reporting protocols hampers machine-reading capabilities. Embracing digital advances in catalysis demands a shift in data reporting norms, and to this end, we offer guidelines for writing protocols, significantly improving machine-readability. We release our model as an open-source web application, inviting a fresh approach to accelerate heterogeneous catalysis synthesis planning.
Affiliation(s)
- Manu Suvarna
- Institute for Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zurich, Vladimir-Prelog-Weg 1, 8093, Zurich, Switzerland
- Sharon Mitchell
- Institute for Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zurich, Vladimir-Prelog-Weg 1, 8093, Zurich, Switzerland
- Teodoro Laino
- IBM Research Europe, Säumerstrasse 4, 8803, Rüschlikon, Switzerland.
- Javier Pérez-Ramírez
- Institute for Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zurich, Vladimir-Prelog-Weg 1, 8093, Zurich, Switzerland.
22
Li S, Zhang Y, Fang Z, Meng K, Tian R, He H, Sun S. Extracting the Synthetic Route of Pd-Based Catalysts in Methanol Steam Reforming from the Scientific Literature. J Chem Inf Model 2023; 63:6249-6260. [PMID: 37807535 DOI: 10.1021/acs.jcim.3c01442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2023]
Abstract
The structured material synthesis route is crucial for chemists in performing experiments and modern applications such as machine learning material design. With the exponential growth of the chemical literature in recent years, manual extraction from the published literature is time-consuming and labor-intensive. This study focuses on developing an automated method for extracting Pd-based catalyst synthesis routes from the chemical literature. First, a paragraph classification model based on regular expressions is employed to identify paragraphs that contain material synthesis processes. The identified paragraphs are verified using machine learning techniques. Second, natural language processing techniques are applied to automatically parse the material synthesis routes from the identified paragraphs, generate regularized flowcharts, and output structured data. Lastly, we utilized the structured data of the synthesis routes to train machine learning models and predict the performance of the materials. The extracted material entities include the product, preparation method, precursor, support, loading, synthesis operation, and operation condition. This method avoids extensive manual data annotation and improves the scientific literature information acquisition efficiency. The accuracy of the 11 material entities exceeds 80%, and the accuracy of the method, support, precursor, drying time, and reduction time exceeds 90%.
Affiliation(s)
- Shuyuan Li
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
- Yunjiang Zhang
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
- Zhaolin Fang
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
- Kong Meng
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
- Rui Tian
- Beijing Engineering Research Center for IoT Software and Systems, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
- Hong He
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
- Shaorui Sun
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
23
Wang J, He T, Yang X, Cai Z, Wang Y, Lacivita V, Kim H, Ouyang B, Ceder G. Design principles for NASICON super-ionic conductors. Nat Commun 2023; 14:5210. [PMID: 37626068 PMCID: PMC10457403 DOI: 10.1038/s41467-023-40669-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2022] [Accepted: 08/03/2023] [Indexed: 08/27/2023] Open
Abstract
Na Super Ionic Conductor (NASICON) materials are an important class of solid-state electrolytes owing to their high ionic conductivity and superior chemical and electrochemical stability. In this paper, we combine first-principles calculations, experimental synthesis and testing, and natural language-driven text-mined historical data on NASICON ionic conductivity to achieve clear insights into how chemical composition influences the Na-ion conductivity. These insights, together with a high-throughput first-principles analysis of the compositional space over which NASICONs are expected to be stable, lead to the successful synthesis and electrochemical investigation of several new NASICON solid-state conductors. Among these, a high ionic conductivity of 1.2 mS cm⁻¹ could be achieved at 25 °C. We find that the ionic conductivity increases with average metal size up to a certain value and that the substitution of PO₄ polyanions by SiO₄ also enhances the ionic conductivity. While optimal ionic conductivity is found near a Na content of 3 per formula unit, the exact optimum depends on other compositional variables. Surprisingly, the Na content enhances the ionic conductivity mostly through its effect on the activation barrier, rather than through the carrier concentration. These deconvoluted design criteria may provide guidelines for the design of optimized NASICON conductors.
Affiliation(s)
- Jingyang Wang
- Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Department of Materials Science and Engineering, University of California, Berkeley, CA, 94720, USA
- School of Sustainable Energy and Resources, School of Materials Science and Intelligent Engineering, Nanjing University, Suzhou, China
- Tanjin He
- Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Department of Materials Science and Engineering, University of California, Berkeley, CA, 94720, USA
- Xiaochen Yang
- Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Department of Materials Science and Engineering, University of California, Berkeley, CA, 94720, USA
- Zijian Cai
- Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Department of Materials Science and Engineering, University of California, Berkeley, CA, 94720, USA
- Yan Wang
- Advanced Materials Lab, Samsung Advanced Institute of Technology and Samsung Semiconductor, Inc, Cambridge, MA, 02138, USA
- Valentina Lacivita
- Advanced Materials Lab, Samsung Advanced Institute of Technology and Samsung Semiconductor, Inc, Cambridge, MA, 02138, USA
- Haegyeom Kim
- Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Bin Ouyang
- Department of Chemistry and Biochemistry, Florida State University, Tallahassee, FL, 32306, USA.
- Gerbrand Ceder
- Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA.
- Department of Materials Science and Engineering, University of California, Berkeley, CA, 94720, USA.
24
Lambor SM, Kasiraju S, Vlachos DG. CKineticsDB─An Extensible and FAIR Data Management Framework and Datahub for Multiscale Modeling in Heterogeneous Catalysis. J Chem Inf Model 2023; 63:4342-4354. [PMID: 37436913 DOI: 10.1021/acs.jcim.3c00123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/14/2023]
Abstract
A great advantage of computational research is its reproducibility and reusability. However, an enormous amount of computational research data in heterogeneous catalysis is barricaded due to logistical limitations. Sufficient provenance and characterization of data and computational environment, with uniform organization and easy accessibility, can allow the development of software tools for integration across the multiscale modeling workflow. Here, we develop the Chemical Kinetics Database, CKineticsDB, a state-of-the-art datahub for multiscale modeling, designed to be compliant with the FAIR guiding principles for scientific data management. CKineticsDB utilizes a MongoDB back-end for extensibility and adaptation to varying data formats, with a referencing-based data model to reduce redundancy in storage. We have developed a Python software program for data processing operations and with built-in features to extract data for common applications. CKineticsDB evaluates the incoming data for quality and uniformity, retains curated information from simulations, enables accurate regeneration of publication results, optimizes storage, and allows the selective retrieval of files based on domain-relevant catalyst and simulation parameters. CKineticsDB provides data from multiple scales of theory (ab initio calculations, thermochemistry, and microkinetic models) to accelerate the development of new reaction pathways, kinetic analysis of reaction mechanisms, and catalysis discovery, along with several data-driven applications.
Affiliation(s)
- Siddhant M Lambor
- RAPID Manufacturing Institute, Delaware Energy Institute, University of Delaware, Newark, Delaware 19716, United States
- Sashank Kasiraju
- RAPID Manufacturing Institute, Delaware Energy Institute, University of Delaware, Newark, Delaware 19716, United States
- Dionisios G Vlachos
- RAPID Manufacturing Institute, Delaware Energy Institute, University of Delaware, Newark, Delaware 19716, United States
- Department of Chemical and Biomolecular Engineering and Catalysis Center for Energy Innovation (CCEI), University of Delaware, Newark, Delaware 19716, United States
25
Liu Y, Yang Z, Zou X, Ma S, Liu D, Avdeev M, Shi S. Data quantity governance for machine learning in materials science. Natl Sci Rev 2023; 10:nwad125. [PMID: 37323811 PMCID: PMC10265966 DOI: 10.1093/nsr/nwad125] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 02/27/2023] [Revised: 04/14/2023] [Accepted: 04/26/2023] [Indexed: 06/17/2023] Open
Abstract
Data-driven machine learning (ML) is widely employed in the analysis of materials structure-activity relationships, performance optimization and materials design due to its superior ability to reveal latent data patterns and make accurate prediction. However, because of the laborious process of materials data acquisition, ML models encounter the issue of the mismatch between a high dimension of feature space and a small sample size (for traditional ML models) or the mismatch between model parameters and sample size (for deep-learning models), usually resulting in terrible performance. Here, we review the efforts for tackling this issue via feature reduction, sample augmentation and specific ML approaches, and show that the balance between the number of samples and features or model parameters should attract great attention during data quantity governance. Following this, we propose a synergistic data quantity governance flow with the incorporation of materials domain knowledge. After summarizing the approaches to incorporating materials domain knowledge into the process of ML, we provide examples of incorporating domain knowledge into governance schemes to demonstrate the advantages of the approach and applications. The work paves the way for obtaining the required high-quality data to accelerate materials design and discovery based on ML.
Affiliation(s)
- Yue Liu
- School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
- Shanghai Engineering Research Center of Intelligent Computing System, Shanghai 200444, China
- Zhengwei Yang
- School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
- Xinxin Zou
- School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
- Shuchang Ma
- School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
- Dahui Liu
- School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
- Maxim Avdeev
- Australian Nuclear Science and Technology Organisation, Sydney 2232, Australia
- School of Chemistry, The University of Sydney, Sydney 2006, Australia
- Siqi Shi
- State Key Laboratory of Advanced Special Steel, School of Materials Science and Engineering, Shanghai University, Shanghai 200444, China
- Materials Genome Institute, Shanghai University, Shanghai 200444, China
26
Zhang Z, Xu Z. Fatigue database of additively manufactured alloys. Sci Data 2023; 10:249. [PMID: 37127747 PMCID: PMC10151339 DOI: 10.1038/s41597-023-02150-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/04/2023] [Accepted: 04/12/2023] [Indexed: 05/03/2023] Open
Abstract
Fatigue is a process of mechanical degradation that is usually assessed based on empirical rules and experimental data obtained from standardized tests. Fatigue data of engineering materials are commonly reported in S-N (the stress-life relation), ε-N (the strain-life relation), and da/dN-ΔK (the relation between the fatigue crack growth rate and the stress intensity factor range) data. Fatigue and static mechanical properties of additively manufactured (AM) alloys, as well as the types of materials, parameters of AM, processing, and testing are collected from thousands of scientific articles till the end of 2022 using natural language processing, machine learning, and computer vision techniques. The results show that the performance of AM alloys could reach that of conventional alloys although data dispersion and system deviation are present. The database (FatigueData-AM2022) is formatted in compact structures, hosted in an open repository, and analyzed to show their patterns and statistics. The quality of data collected from the literature is measured by defining rating scores for datasets reported in individual studies and through the fill rates of data entries across all the datasets. The database also serves as a high-quality training set for data processing using machine learning models. The procedures of data extraction and analysis are outlined and the tools are publicly released. A unified language of fatigue data is suggested to regulate data reporting for the fatigue performance of materials to facilitate data sharing and the development of open science.
Affiliation(s)
- Zian Zhang
- Tsinghua University, Applied Mechanics Laboratory and Department of Engineering Mechanics, Beijing, 100084, China
- Zhiping Xu
- Tsinghua University, Applied Mechanics Laboratory and Department of Engineering Mechanics, Beijing, 100084, China.
27
Huang S, Cole JM. BatteryBERT: A Pretrained Language Model for Battery Database Enhancement. J Chem Inf Model 2022; 62:6365-6377. [PMID: 35533012 PMCID: PMC9795558 DOI: 10.1021/acs.jcim.2c00035] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Indexed: 01/07/2023]
Abstract
A great number of scientific papers are published every year in the field of battery research, which forms a huge textual data source. However, it is difficult to explore and retrieve useful information efficiently from these large unstructured sets of text. The Bidirectional Encoder Representations from Transformers (BERT) model, trained on a large data set in an unsupervised way, provides a route to process the scientific text automatically with minimal human effort. To this end, we realized six battery-related BERT models, namely, BatteryBERT, BatteryOnlyBERT, and BatterySciBERT, each of which consists of both cased and uncased models. They have been trained specifically on a corpus of battery research papers. The pretrained BatteryBERT models were then fine-tuned on downstream tasks, including battery paper classification and extractive question-answering for battery device component classification that distinguishes anode, cathode, and electrolyte materials. Our BatteryBERT models were found to outperform the original BERT models on the specific battery tasks. The fine-tuned BatteryBERT was then used to perform battery database enhancement. We also provide a website application for its interactive use and visualization.
Affiliation(s)
- Shu Huang
- Cavendish Laboratory, Department of Physics, University of Cambridge, J.J. Thomson Avenue, Cambridge CB3 0HE, U.K.
- Jacqueline M. Cole
- Cavendish Laboratory, Department of Physics, University of Cambridge, J.J. Thomson Avenue, Cambridge CB3 0HE, U.K.
- ISIS Neutron and Muon Source, Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.
28
Shen S, Liu J, Lin L, Huang Y, Zhang L, Liu C, Feng Y, Wang D. SsciBERT: a pre-trained language model for social science texts. Scientometrics 2022. [DOI: 10.1007/s11192-022-04602-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]
29
Abstract
Computational modeling is increasingly used to assist in the discovery of supramolecular materials. Supramolecular materials are typically primarily built from organic components that are self-assembled through noncovalent bonding and have potential applications, including in selective binding, sorption, molecular separations, catalysis, optoelectronics, sensing, and as molecular machines. In this review, the key areas where computational prediction can assist in the discovery of supramolecular materials, including in structure prediction, property prediction, and the prediction of how to synthesize a hypothetical material are discussed, before exploring the potential impact of artificial intelligence techniques on the field. Throughout, the importance of close integration with experimental materials discovery programs will be highlighted. A series of case studies from the author's work across some different supramolecular material classes will be discussed, before finishing with a discussion of the outlook for the field.
Affiliation(s)
- Kim E. Jelfs
- Department of Chemistry, Molecular Sciences Research Hub, Imperial College London, London, UK
30
DATa: Domain Adaptation-aided deep Table detection using visual–lexical representations. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109946] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
31
Ghosh S, Lu K. Band gap information extraction from materials science literature – a pilot study. Aslib J Inf Manag 2022. [DOI: 10.1108/ajim-03-2022-0141] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
Purpose: The purpose of this paper is to present a preliminary work on extracting band gap information of materials from academic papers. With increasing demand for renewable energy, band gap information will help material scientists design and implement novel photovoltaic (PV) cells.
Design/methodology/approach: The authors collected 1.44 million titles and abstracts of scholarly articles related to materials science, and then filtered the collection to 11,939 articles that potentially contain relevant information about materials and their band gap values. ChemDataExtractor was extended to extract information about PV materials and their band gap information. Evaluation was performed on randomly sampled information records of 415 papers.
Findings: The findings of this study show that the current system is able to correctly extract information for 51.32% articles, with partially correct extraction for 36.62% articles and incorrect for 12.04%. The authors have also identified the errors belonging to three main categories pertaining to chemical entity identification, band gap information and interdependency resolution. Future work will focus on addressing these errors to improve the performance of the system.
Originality/value: The authors did not find any literature to date on band gap information extraction from academic text using automated methods. This work is unique and original. Band gap information is of importance to materials scientists in applications such as solar cells, light emitting diodes and laser diodes.
32
Huo H, Bartel CJ, He T, Trewartha A, Dunn A, Ouyang B, Jain A, Ceder G. Machine-Learning Rationalization and Prediction of Solid-State Synthesis Conditions. Chem Mater 2022; 34:7323-7336. [PMID: 36032555 PMCID: PMC9407029 DOI: 10.1021/acs.chemmater.2c01293] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 04/28/2022] [Revised: 07/19/2022] [Indexed: 06/02/2023]
Abstract
There currently exist no quantitative methods to determine the appropriate conditions for solid-state synthesis. This not only hinders the experimental realization of novel materials but also complicates the interpretation and understanding of solid-state reaction mechanisms. Here, we demonstrate a machine-learning approach that predicts synthesis conditions using large solid-state synthesis data sets text-mined from scientific journal articles. Using feature importance ranking analysis, we discovered that optimal heating temperatures have strong correlations with the stability of precursor materials quantified using melting points and formation energies (ΔG_f, ΔH_f). In contrast, features derived from the thermodynamics of synthesis-related reactions did not directly correlate to the chosen heating temperatures. This correlation between optimal solid-state heating temperature and precursor stability extends Tammann's rule from intermetallics to oxide systems, suggesting the importance of reaction kinetics in determining synthesis conditions. Heating times are shown to be strongly correlated with the chosen experimental procedures and instrument setups, which may be indicative of human bias in the data set. Using these predictive features, we constructed machine-learning models with good performance and general applicability to predict the conditions required to synthesize diverse chemical systems.
Affiliation(s)
- Haoyan Huo
- Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, California 94720, United States
- Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, California 94720, United States
- Christopher J. Bartel
- Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, California 94720, United States
- Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, California 94720, United States
- Tanjin He
- Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, California 94720, United States
- Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, California 94720, United States
- Amalie Trewartha
- Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, California 94720, United States
- Alexander Dunn
- Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, California 94720, United States
- Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, California 94720, United States
- Bin Ouyang
- Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, California 94720, United States
- Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, California 94720, United States
- Anubhav Jain
- Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, California 94720, United States
- Gerbrand Ceder
- Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, California 94720, United States
- Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, California 94720, United States
33
Zhang Y, Wang C, Soukaseum M, Vlachos DG, Fang H. Unleashing the Power of Knowledge Extraction from Scientific Literature in Catalysis. J Chem Inf Model 2022; 62:3316-3330. [PMID: 35772028 DOI: 10.1021/acs.jcim.2c00359] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/28/2022]
Abstract
Valuable knowledge of catalysis is often hidden in a large amount of scientific literature. There is an urgent need to extract useful knowledge to facilitate scientific discovery. This work takes the first step toward the goal in the field of catalysis. Specifically, we construct the first information extraction benchmark data set that covers the field of catalysis and also develop a general extraction framework that can accurately extract catalysis-related entities from scientific literature with 90% extraction accuracy. We further demonstrate the feasibility of leveraging the extracted knowledge to help users better access relevant information in catalysis through an entity-aware search engine and a correlation analysis system.
Affiliation(s)
- Yue Zhang
- Department of Electrical and Computer Engineering, University of Delaware, Newark, Delaware 19711, United States
- Center for Plastics Innovation, University of Delaware, Newark, Delaware 19711, United States
- Cong Wang
- Center for Plastics Innovation, University of Delaware, Newark, Delaware 19711, United States
- Department of Chemical and Biomolecular Engineering, University of Delaware, Newark, Delaware 19711, United States
- Mya Soukaseum
- Center for Plastics Innovation, University of Delaware, Newark, Delaware 19711, United States
- Department of Chemical and Biological Engineering, Drexel University, Philadelphia, Pennsylvania 19104, United States
- Dionisios G Vlachos
- Center for Plastics Innovation, University of Delaware, Newark, Delaware 19711, United States
- Department of Chemical and Biomolecular Engineering, University of Delaware, Newark, Delaware 19711, United States
- Hui Fang
- Department of Electrical and Computer Engineering, University of Delaware, Newark, Delaware 19711, United States
- Center for Plastics Innovation, University of Delaware, Newark, Delaware 19711, United States
34
Nandy A, Duan C, Kulik HJ. Audacity of huge: overcoming challenges of data scarcity and data quality for machine learning in computational materials discovery. Curr Opin Chem Eng 2022. [DOI: 10.1016/j.coche.2021.100778] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 01/22/2023]
35
Text-mined dataset of gold nanoparticle synthesis procedures, morphologies, and size entities. Sci Data 2022; 9:234. [PMID: 35618761 PMCID: PMC9135747 DOI: 10.1038/s41597-022-01321-6] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 09/30/2021] [Accepted: 04/08/2022] [Indexed: 12/13/2022] Open
Abstract
Gold nanoparticles are highly desired for a range of technological applications due to their tunable properties, which are dictated by the size and shape of the constituent particles. Many heuristic methods for controlling the morphological characteristics of gold nanoparticles are well known. However, the underlying mechanisms controlling their size and shape remain poorly understood, partly due to the immense range of possible combinations of synthesis parameters. Data-driven methods can offer insight to help guide understanding of these underlying mechanisms, so long as sufficient synthesis data are available. To facilitate data mining in this direction, we have constructed and made publicly available a dataset of codified gold nanoparticle synthesis protocols and outcomes extracted directly from the nanoparticle materials science literature using natural language processing and text-mining techniques. This dataset contains 5,154 data records, each representing a single gold nanoparticle synthesis article, filtered from a database of 4,973,165 publications. Each record contains codified synthesis protocols and extracted morphological information from a total of 7,608 experimental and 12,519 characterization paragraphs.
Measurement(s): gold nanoparticle morphology; gold nanoparticle size; gold nanoparticle synthesis data
Technology Type(s): natural language processing
36
Dataset of solution-based inorganic materials synthesis procedures extracted from the scientific literature. Sci Data 2022; 9:231. [PMID: 35614129 PMCID: PMC9132903 DOI: 10.1038/s41597-022-01317-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 10/07/2021] [Accepted: 04/05/2022] [Indexed: 11/10/2022] Open
Abstract
The development of a materials synthesis route is usually based on heuristics and experience. A possible new approach would be to apply data-driven approaches to learn the patterns of synthesis from past experience and use them to predict the syntheses of novel materials. However, this route is impeded by the lack of a large-scale database of synthesis formulations. In this work, we applied advanced machine learning and natural language processing techniques to construct a dataset of 35,675 solution-based synthesis procedures extracted from the scientific literature. Each procedure contains essential synthesis information including the precursors and target materials, their quantities, and the synthesis actions and corresponding attributes. Every procedure is also augmented with the reaction formula. Through this work, we are making freely available the first large dataset of solution-based inorganic materials synthesis procedures.
Measurement(s): solution-based inorganic synthesis data
Technology Type(s): natural language processing
37
Trewartha A, Walker N, Huo H, Lee S, Cruse K, Dagdelen J, Dunn A, Persson KA, Ceder G, Jain A. Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science. Patterns (N Y) 2022; 3:100488. [PMID: 35465225 PMCID: PMC9024010 DOI: 10.1016/j.patter.2022.100488] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 10/24/2021] [Revised: 01/21/2022] [Accepted: 03/15/2022] [Indexed: 11/03/2022]
Abstract
A bottleneck in efficiently connecting new materials discoveries to established literature has arisen due to an increase in publications. This problem may be addressed by using named entity recognition (NER) to extract structured summary-level data from unstructured materials science text. We compare the performance of four NER models on three materials science datasets. The four models include a bidirectional long short-term memory (BiLSTM) and three transformer models (BERT, SciBERT, and MatBERT) with increasing degrees of domain-specific materials science pre-training. MatBERT improves over the other two BERT-BASE-based models by 1%–12%, implying that domain-specific pre-training provides measurable advantages. Despite relative architectural simplicity, the BiLSTM model consistently outperforms BERT, perhaps due to its domain-specific pre-trained word embeddings. Furthermore, MatBERT and SciBERT models outperform the original BERT model to a greater extent in the small data limit. MatBERT’s higher-quality predictions should accelerate the extraction of structured data from materials science literature.
Efficient extraction of information from materials science literature is needed. Domain-specific materials science pre-training improves results. Even simpler domain-specific models can outperform more complex general models.
A bottleneck in efficiently connecting new materials discoveries to established literature has arisen due to a massive increase in publications. Four different language models are trained to automatically collect important information from materials science articles. We compare a simple model (BiLSTM) with materials science knowledge to three variants of a more complex model: one with general knowledge (BERT), one with general scientific knowledge (SciBERT), and one with materials science knowledge (MatBERT). We find that MatBERT performs the best overall. This implies that language models with greater extents of materials science knowledge will perform better on materials science-related tasks. The simpler model even consistently outperforms BERT. Furthermore, the performance gaps grow when the models are given fewer examples of information extraction to learn from. MatBERT’s higher-quality results should accelerate the collection of information from materials science literature.
Affiliation(s)
- Amalie Trewartha
  - Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
- Nicholas Walker
  - Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
- Haoyan Huo
  - Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA; Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
- Sanghoon Lee
  - Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA; Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
- Kevin Cruse
  - Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA; Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
- John Dagdelen
  - Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA; Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
- Alexander Dunn
  - Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA; Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
- Kristin A Persson
  - Molecular Foundry, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA; Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
- Gerbrand Ceder
  - Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA; Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
- Anubhav Jain
  - Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
|
38
|
Mannodi-Kanakkithodi A, Xiang X, Jacoby L, Biegaj R, Dunham ST, Gamelin DR, Chan MKY. Universal machine learning framework for defect predictions in zinc blende semiconductors. Patterns (New York, N.Y.) 2022; 3:100450. [PMID: 35510195 PMCID: PMC9058924 DOI: 10.1016/j.patter.2022.100450] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 08/11/2021] [Revised: 12/06/2021] [Accepted: 01/20/2022] [Indexed: 11/27/2022]
Abstract
We develop a framework powered by machine learning (ML) and high-throughput density functional theory (DFT) computations for the prediction and screening of functional impurities in groups IV, III–V, and II–VI zinc blende semiconductors. Elements spanning the length and breadth of the periodic table are considered as impurity atoms at the cation, anion, or interstitial sites in supercells of 34 candidate semiconductors, leading to a chemical space of approximately 12,000 points, 10% of which are used to generate a DFT dataset of charge dependent defect formation energies. Descriptors based on tabulated elemental properties, defect coordination environment, and relevant semiconductor properties are used to train ML regression models for the DFT computed neutral state formation energies and charge transition levels of impurities. Optimized kernel ridge, Gaussian process, random forest, and neural network regression models are applied to screen impurities with lower formation energy than dominant native defects in all compounds.
Large computational dataset of defect properties in semiconductors is developed. Regression algorithms are used to train predictive models for defect properties. Best models are used for high-throughput prediction and screening. Lists of low energy “dominating” impurities are generated.
Our article introduces a universal predictive framework for point defect formation energies and charge transition levels in a wide chemical space of zinc blende semiconductors and possible impurity atoms selected from across the periodic table. This framework was developed by leveraging high-throughput quantum mechanical simulations benchmarked using some experimental data from the literature, as well as machine learning (ML)-based regression techniques that map unique materials descriptors to computed defect properties and yield optimized and generalizable models. The power and utility of these models are revealed through quick predictions for thousands of new defects and screening of low-energy impurities, which may tune the equilibrium conductivity in the semiconductor. This work presents, to our knowledge, the largest density functional theory (DFT) dataset of defect properties in semiconductors and the largest DFT+ML-based screening of point defects in semiconductors to date.
Affiliation(s)
- Arun Mannodi-Kanakkithodi
  - Center for Nanoscale Materials, Argonne National Laboratory, Argonne, IL 60439, USA; School of Materials Engineering, Purdue University, West Lafayette, IN 47907, USA
- Xiaofeng Xiang
  - Molecular Engineering & Sciences Institute, University of Washington, Seattle, WA 98195, USA
- Laura Jacoby
  - Department of Chemistry, University of Washington, Seattle, WA 98195, USA
- Robert Biegaj
  - Materials Science & Engineering, University of Washington, Seattle, WA 98195, USA
- Scott T Dunham
  - Department of Electrical and Computer Engineering, University of Washington, Seattle, WA 98195, USA
- Daniel R Gamelin
  - Department of Chemistry, University of Washington, Seattle, WA 98195, USA
- Maria K Y Chan
  - Center for Nanoscale Materials, Argonne National Laboratory, Argonne, IL 60439, USA
|
39
|
Duan C, Nandy A, Kulik HJ. Machine Learning for the Discovery, Design, and Engineering of Materials. Annu Rev Chem Biomol Eng 2022; 13:405-429. [PMID: 35320698 DOI: 10.1146/annurev-chembioeng-092320-120230] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
Machine learning (ML) has become a part of the fabric of high-throughput screening and computational discovery of materials. Despite its increasingly central role, challenges remain in fully realizing the promise of ML. This is especially true for the practical acceleration of the engineering of robust materials and the development of design strategies that surpass trial and error or high-throughput screening alone. Depending on the quantity being predicted and the experimental data available, ML can either outperform physics-based models, be used to accelerate such models, or be integrated with them to improve their performance. We cover recent advances in algorithms and in their application that are starting to make inroads toward (a) the discovery of new materials through large-scale enumerative screening, (b) the design of materials through identification of rules and principles that govern materials properties, and (c) the engineering of practical materials by satisfying multiple objectives. We conclude with opportunities for further advancement to realize ML as a widespread tool for practical computational materials design. Expected final online publication date for the Annual Review of Chemical and Biomolecular Engineering, Volume 13 is October 2022. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.
Affiliation(s)
- Chenru Duan
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA; Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Aditya Nandy
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA; Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Heather J Kulik
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
|
40
|
Smart Materials Prediction: Applying Machine Learning to Lithium Solid-State Electrolyte. Materials 2022; 15:ma15031157. [PMID: 35161101 PMCID: PMC8840428 DOI: 10.3390/ma15031157] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/30/2021] [Revised: 01/23/2022] [Accepted: 01/31/2022] [Indexed: 11/24/2022]
Abstract
Traditionally, the discovery of new materials has often depended on scholars’ computational and experimental experience. The traditional trial-and-error methods require many resources and computing time. Due to new materials’ properties becoming more complex, it is difficult to predict and identify new materials only by general knowledge and experience. Material prediction tools based on machine learning (ML) have been successfully applied to various materials fields; they are beneficial for modeling and accelerating the prediction process for materials that cannot be accurately predicted. However, the obstacles of disciplinary span led to many scholars in materials not having complete knowledge of data-driven materials science methods. This paper provides an overview of the general process of ML applied to materials prediction and uses solid-state electrolytes (SSE) as an example. Recent approaches and specific applications to ML in the materials field and the requirements for building ML models for predicting lithium SSE are reviewed. Finally, some current obstacles to applying ML in materials prediction and prospects are described with the expectation that more materials scholars will be aware of the application of ML in materials prediction.
|
41
|
Pei Z, Rozman KA, Doğan ÖN, Wen Y, Gao N, Holm EA, Hawk JA, Alman DE, Gao MC. Machine-Learning Microstructure for Inverse Material Design. Advanced Science (Weinheim, Baden-Wurttemberg, Germany) 2021; 8:e2101207. [PMID: 34716677 PMCID: PMC8655171 DOI: 10.1002/advs.202101207] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 03/24/2021] [Revised: 09/10/2021] [Indexed: 05/06/2023]
Abstract
Metallurgy and material design have thousands of years' history and have played a critical role in the civilization process of humankind. The traditional trial-and-error method has been unprecedentedly challenged in the modern era when the number of components and phases in novel alloys keeps increasing, with high-entropy alloys as the representative. New opportunities emerge for alloy design in the artificial intelligence era. Here, a successful machine-learning (ML) method has been developed to identify the microstructure images with eye-challenging morphology for a number of martensitic and ferritic steels. Assisted by it, a new neural-network method is proposed for the inverse design of alloys with 20 components, which can accelerate the design process based on microstructure. The method is also readily applied to other material systems given sufficient microstructure images. This work lays the foundation for inverse alloy design based on microstructure images with extremely similar features.
Affiliation(s)
- Zongrui Pei
  - Materials Engineering and Manufacturing Directorate, National Energy Technology Laboratory, 1450 Queen Ave SW, Albany, OR 97321, USA
  - ORISE, 100 ORAU Way, Oak Ridge, TN 37830, USA
- Kyle A. Rozman
  - Materials Engineering and Manufacturing Directorate, National Energy Technology Laboratory, 1450 Queen Ave SW, Albany, OR 97321, USA
  - LRST, 1450 Queen Ave SW, Albany, OR 97321, USA
- Ömer N. Doğan
  - Materials Engineering and Manufacturing Directorate, National Energy Technology Laboratory, 1450 Queen Ave SW, Albany, OR 97321, USA
- Youhai Wen
  - Materials Engineering and Manufacturing Directorate, National Energy Technology Laboratory, 1450 Queen Ave SW, Albany, OR 97321, USA
- Nan Gao
  - Department of Materials Science and Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Elizabeth A. Holm
  - Department of Materials Science and Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Jeffrey A. Hawk
  - Materials Engineering and Manufacturing Directorate, National Energy Technology Laboratory, 1450 Queen Ave SW, Albany, OR 97321, USA
- David E. Alman
  - Materials Engineering and Manufacturing Directorate, National Energy Technology Laboratory, 1450 Queen Ave SW, Albany, OR 97321, USA
- Michael C. Gao
  - Materials Engineering and Manufacturing Directorate, National Energy Technology Laboratory, 1450 Queen Ave SW, Albany, OR 97321, USA
|
42
|
Nandy A, Duan C, Taylor MG, Liu F, Steeves AH, Kulik HJ. Computational Discovery of Transition-metal Complexes: From High-throughput Screening to Machine Learning. Chem Rev 2021; 121:9927-10000. [PMID: 34260198 DOI: 10.1021/acs.chemrev.1c00347] [Citation(s) in RCA: 101] [Impact Index Per Article: 25.3] [Indexed: 12/13/2022]
Abstract
Transition-metal complexes are attractive targets for the design of catalysts and functional materials. The behavior of the metal-organic bond, while very tunable for achieving target properties, is challenging to predict and necessitates searching a wide and complex space to identify needles in haystacks for target applications. This review will focus on the techniques that make high-throughput search of transition-metal chemical space feasible for the discovery of complexes with desirable properties. The review will cover the development, promise, and limitations of "traditional" computational chemistry (i.e., force field, semiempirical, and density functional theory methods) as it pertains to data generation for inorganic molecular discovery. The review will also discuss the opportunities and limitations in leveraging experimental data sources. We will focus on how advances in statistical modeling, artificial intelligence, multiobjective optimization, and automation accelerate discovery of lead compounds and design rules. The overall objective of this review is to showcase how bringing together advances from diverse areas of computational chemistry and computer science have enabled the rapid uncovering of structure-property relationships in transition-metal chemistry. We aim to highlight how unique considerations in motifs of metal-organic bonding (e.g., variable spin and oxidation state, and bonding strength/nature) set them and their discovery apart from more commonly considered organic molecules. We will also highlight how uncertainty and relative data scarcity in transition-metal chemistry motivate specific developments in machine learning representations, model training, and in computational chemistry. Finally, we will conclude with an outlook of areas of opportunity for the accelerated discovery of transition-metal complexes.
Affiliation(s)
- Aditya Nandy
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States; Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Chenru Duan
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States; Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Michael G Taylor
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Fang Liu
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Adam H Steeves
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Heather J Kulik
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
|
43
|
Rincón-López J, Almanza-Arjona YC, Riascos AP, Rojas-Aguirre Y. When Cyclodextrins Met Data Science: Unveiling Their Pharmaceutical Applications through Network Science and Text-Mining. Pharmaceutics 2021; 13:1297. [PMID: 34452258 PMCID: PMC8399453 DOI: 10.3390/pharmaceutics13081297] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 07/19/2021] [Revised: 08/14/2021] [Accepted: 08/16/2021] [Indexed: 12/21/2022] Open
Abstract
We present a data-driven approach to unveil the pharmaceutical technologies of cyclodextrins (CDs) by analyzing a dataset of CD pharmaceutical patents. First, we implemented network science techniques to represent CD patents as a single structure and provide a framework for unsupervised detection of keywords in the patent dataset. Guided by those keywords, we further mined the dataset to examine the patenting trends according to CD-based dosage forms. CD patents formed complex networks, evidencing the supremacy of CDs for solubility enhancement and how this has triggered cutting-edge applications based on or beyond the solubility improvement. The networks exposed the significance of CDs to formulate aqueous solutions, tablets, and powders. Additionally, they highlighted the role of CDs in formulations of anti-inflammatory drugs, cancer therapies, and antiviral strategies. Text-mining showed that the trends in CDs for aqueous solutions, tablets, and powders are going upward. Gels seem to be promising, while patches and fibers are emerging. Cyclodextrins' potential in suspensions and emulsions is yet to be recognized and can become an opportunity area. This is the first unsupervised/supervised data-mining approach aimed at depicting a landscape of CDs to identify trending and emerging technologies and uncover opportunity areas in CD pharmaceutical research.
Affiliation(s)
- Juliana Rincón-López
  - Instituto de Investigaciones en Materiales, Universidad Nacional Autónoma de México, Ciudad Universitaria, Mexico City 04510, Mexico
- Yara C. Almanza-Arjona
  - Instituto de Ciencias Aplicadas y Tecnología, Universidad Nacional Autónoma de México, Ciudad Universitaria, Mexico City 04510, Mexico
- Alejandro P. Riascos
  - Instituto de Física, Universidad Nacional Autónoma de México, Ciudad Universitaria, Mexico City 04510, Mexico
- Yareli Rojas-Aguirre
  - Instituto de Investigaciones en Materiales, Universidad Nacional Autónoma de México, Ciudad Universitaria, Mexico City 04510, Mexico
|
44
|
Szymanski NJ, Zeng Y, Huo H, Bartel CJ, Kim H, Ceder G. Toward autonomous design and synthesis of novel inorganic materials. Materials Horizons 2021; 8:2169-2198. [PMID: 34846423 DOI: 10.1039/d1mh00495f] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Indexed: 06/13/2023]
Abstract
Autonomous experimentation driven by artificial intelligence (AI) provides an exciting opportunity to revolutionize inorganic materials discovery and development. Herein, we review recent progress in the design of self-driving laboratories, including robotics to automate materials synthesis and characterization, in conjunction with AI to interpret experimental outcomes and propose new experimental procedures. We focus on efforts to automate inorganic synthesis through solution-based routes, solid-state reactions, and thin film deposition. In each case, connections are made to relevant work in organic chemistry, where automation is more common. Characterization techniques are primarily discussed in the context of phase identification, as this task is critical to understand what products have formed during synthesis. The application of deep learning to analyze multivariate characterization data and perform phase identification is examined. To achieve "closed-loop" materials synthesis and design, we further provide a detailed overview of optimization algorithms that use active learning to rationally guide experimental iterations. Finally, we highlight several key opportunities and challenges for the future development of self-driving inorganic materials synthesis platforms.
Affiliation(s)
- Nathan J Szymanski
  - Department of Materials Science & Engineering, UC Berkeley, Berkeley, CA 94720, USA
|
45
|
IP Analytics and Machine Learning Applied to Create Process Visualization Graphs for Chemical Utility Patents. Processes (Basel) 2021. [DOI: 10.3390/pr9081342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
Abstract
Researchers must read and understand a large volume of technical papers, including patent documents, to fully grasp the state-of-the-art technological progress in a given domain. Chemical research is particularly challenging with the fast growth of newly registered utility patents (also known as intellectual property or IP) that provide detailed descriptions of the processes used to create a new chemical or a new process to manufacture a known chemical. The researcher must be able to understand the latest patents and literature in order to develop new chemicals and processes that do not infringe on existing claims and processes. This research uses text mining, integrated machine learning, and knowledge visualization techniques to effectively and accurately support the extraction and graphical presentation of chemical processes disclosed in patent documents. The computer framework trains a machine learning model called ALBERT for automatic paragraph text classification. ALBERT separates chemical and non-chemical descriptive paragraphs from a patent for effective chemical term extraction. The ChemDataExtractor is used to classify chemical terms, such as inputs, units, and reactions from the chemical paragraphs. A computer-supported graph-based knowledge representation interface is developed to plot the extracted chemical terms and their chemical process links as a network of nodes with connecting arcs. The computer-supported chemical knowledge visualization approach helps researchers to quickly understand the innovative and unique chemicals or processes of any chemical patent of interest.
|