1
Ho SS, Mills RE. Domain-specific embeddings uncover latent genetics knowledge. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.03.17.643817. [PMID: 40166296 PMCID: PMC11957060 DOI: 10.1101/2025.03.17.643817] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/02/2025]
Abstract
The inundating rate of scientific publishing means every researcher will miss new discoveries from overwhelming saturation. To address this limitation, we employ natural language processing to overcome human limitations in reading, curation, and knowledge synthesis, with domain-specific applications to genetics and genomics. We construct a corpus of 3.5 million normalized genetics and genomics abstracts and implement both semantic and network-based embedding models. Our methods not only capture broad biological concepts and relationships but also predict complex phenomena such as gene expression. Through a rigorous temporal validation framework, we demonstrate that our embeddings successfully predict gene-disease associations, cancer driver genes, and experimentally-verified protein interactions years before their formal documentation in literature. Additionally, our embeddings successfully predict experimentally verified gene-gene interactions absent from the literature. These findings demonstrate that substantial undiscovered knowledge exists within the collective scientific literature and that computational approaches can accelerate biological discovery by identifying hidden connections across the fragmented landscape of scientific publishing.
Affiliation(s)
- S. S. Ho
  - Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
- R. E. Mills
  - Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
2
Ge W, De Silva R, Fan Y, Sisson SA, Stenzel MH. Machine Learning in Polymer Research. ADVANCED MATERIALS (DEERFIELD BEACH, FLA.) 2025; 37:e2413695. [PMID: 39924835 PMCID: PMC11923530 DOI: 10.1002/adma.202413695] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2024] [Revised: 12/21/2024] [Indexed: 02/11/2025]
Abstract
Machine learning is increasingly being applied in polymer chemistry to link chemical structures to macroscopic properties of polymers and to identify chemical patterns in the polymer structures that help improve specific properties. To facilitate this, a chemical dataset needs to be translated into machine-readable descriptors. However, limited and inadequately curated datasets, broad molecular weight distributions, and irregular polymer configurations pose significant challenges. Most off-the-shelf mathematical models need refinement for specific applications. Addressing these challenges demands a close collaboration between chemists and mathematicians, as chemists must formulate research questions in mathematical terms while mathematicians are required to refine models for specific applications. This review unites both disciplines to address dataset curation hurdles and highlight advances in polymer synthesis and modeling that enhance data availability. It then surveys ML approaches used to predict solid-state properties, solution behavior, composite performance, and emerging applications such as drug delivery and the polymer-biology interface. The review concludes with a perspective on the field, discussing the importance of FAIR (findability, accessibility, interoperability, and reusability) data and the integration of polymer theory and data, and sharing thoughts on the machine-human interface.
Affiliation(s)
- Wei Ge
  - School of Chemistry, University of New South Wales, Sydney, 2052, Australia
  - School of Mathematics and Statistics and UNSW Data Science Hub, University of New South Wales, Sydney, 2052, Australia
- Ramindu De Silva
  - School of Chemistry, University of New South Wales, Sydney, 2052, Australia
  - School of Mathematics and Statistics and UNSW Data Science Hub, University of New South Wales, Sydney, 2052, Australia
  - Data61, CSIRO, Sydney, NSW, 2015, Australia
- Yanan Fan
  - School of Mathematics and Statistics and UNSW Data Science Hub, University of New South Wales, Sydney, 2052, Australia
  - Data61, CSIRO, Sydney, NSW, 2015, Australia
- Scott A Sisson
  - School of Mathematics and Statistics and UNSW Data Science Hub, University of New South Wales, Sydney, 2052, Australia
- Martina H Stenzel
  - School of Chemistry, University of New South Wales, Sydney, 2052, Australia
3
Dangayach R, Jeong N, Demirel E, Uzal N, Fung V, Chen Y. Machine Learning-Aided Inverse Design and Discovery of Novel Polymeric Materials for Membrane Separation. ENVIRONMENTAL SCIENCE & TECHNOLOGY 2025; 59:993-1012. [PMID: 39680111 PMCID: PMC11755723 DOI: 10.1021/acs.est.4c08298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2024] [Revised: 12/03/2024] [Accepted: 12/04/2024] [Indexed: 12/17/2024]
Abstract
Polymeric membranes have been widely used for liquid and gas separation in various industrial applications over the past few decades because of their exceptional versatility and high tunability. Traditional trial-and-error methods for material synthesis are inadequate to meet the growing demands for high-performance membranes. Machine learning (ML) has demonstrated huge potential to accelerate design and discovery of membrane materials. In this review, we cover strengths and weaknesses of the traditional methods, followed by a discussion on the emergence of ML for developing advanced polymeric membranes. We describe methodologies for data collection, data preparation, the commonly used ML models, and the explainable artificial intelligence (XAI) tools implemented in membrane research. Furthermore, we explain the experimental and computational validation steps to verify the results provided by these ML models. Subsequently, we showcase successful case studies of polymeric membranes and emphasize inverse design methodology within a ML-driven structured framework. Finally, we conclude by highlighting the recent progress, challenges, and future research directions to advance ML research for next generation polymeric membranes. With this review, we aim to provide a comprehensive guideline to researchers, scientists, and engineers assisting in the implementation of ML to membrane research and to accelerate the membrane design and material discovery process.
Affiliation(s)
- Raghav Dangayach
  - School of Civil & Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
- Nohyeong Jeong
  - School of Civil & Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
- Elif Demirel
  - School of Civil & Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
- Nigmet Uzal
  - School of Civil & Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
  - Department of Civil Engineering, Abdullah Gul University, 38039 Kayseri, Turkey
- Victor Fung
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
- Yongsheng Chen
  - School of Civil & Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
4
Lalonde JN, Pilania G, Marrone BL. Materials designed to degrade: structure, properties, processing, and performance relationships in polyhydroxyalkanoate biopolymers. Polym Chem 2025; 16:235-265. [PMID: 39464417 PMCID: PMC11498330 DOI: 10.1039/d4py00623b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2024] [Accepted: 10/05/2024] [Indexed: 10/29/2024]
Abstract
Conventional plastics pose significant environmental and health risks across their life cycle, driving intense interest in sustainable alternatives. Among these, polyhydroxyalkanoates (PHAs) stand out for their biocompatibility, degradation characteristics, and diverse applications. Yet, challenges like production cost, scalability, and limited chemical variety hinder their widespread adoption, impacting material selection and design. This review examines PHA research through the lens of the classical materials tetrahedron, exploring property-structure-processing-performance (PSPP) relationships. By analyzing recent literature and addressing current limitations, we gain valuable insights into PHA development. Despite challenges, we remain optimistic about the role of PHAs in transitioning towards a circular plastic economy, emphasizing the need for further research to unlock their full potential.
Affiliation(s)
- Jessica N Lalonde
  - Department of Mechanical Engineering and Materials Science, Duke University Durham NC 27708 USA
  - Bioscience Division, Los Alamos National Laboratory Los Alamos NM 87545 USA
- Babetta L Marrone
  - Bioscience Division, Los Alamos National Laboratory Los Alamos NM 87545 USA
5
Huang Y, Zhang L, Deng H, Mao J. NJmat: Data-Driven Machine Learning Interface to Accelerate Material Design. J Chem Inf Model 2024; 64:6477-6491. [PMID: 39133673 DOI: 10.1021/acs.jcim.4c00493] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/27/2024]
Abstract
Machine learning techniques have significantly transformed the way materials scientists conduct research. However, the widespread deployment of machine learning software in daily experimental and simulation research for materials and chemical design has been limited. This is partly due to the substantial time investment and learning curve associated with mastering the necessary codes and computational environments. In this paper, we introduce a user-friendly, data-driven machine learning interface featuring multiple "button-clicking" functionalities to streamline the design of materials and chemicals. This interface automates the processes of transforming materials and molecules, performing feature selection, constructing machine learning models, making virtual predictions, and visualizing results. Such automation accelerates materials prediction and analysis in the inverse design process, aligning with the time criteria outlined by the Materials Genome Initiative. With simple button clicks, researchers can build machine learning models and predict new materials once they have gathered experimental or simulation data. Beyond the ease of use, NJmat offers three additional features for data-driven materials design: (1) automatic feature generation for both inorganic materials (from chemical formulas) and organic molecules (from SMILES), (2) automatic generation of Shapley plots, and (3) automatic construction of "white-box" genetic models and decision trees to provide scientific insights. We present case studies on surface design for halide perovskite materials encompassing both inorganic and organic species. These case studies illustrate general machine learning models for virtual predictions as well as the automatic featurization and Shapley/genetic model construction capabilities. We anticipate that this software tool will expedite materials and molecular design within the scope of the Materials Genome Initiative, particularly benefiting experimentalists.
Affiliation(s)
- Yiru Huang
  - Department of Materials Physics, School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
- Lei Zhang
  - Department of Materials Physics, School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
- Hangyuan Deng
  - Department of Materials Physics, School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
- Junfei Mao
  - Department of Materials Physics, School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
6
Bonatti AF, Chiarello F, Vozzi G, De Maria C. AI-Based Knowledge Extraction from the Bioprinting Literature for Identifying Technology Trends. 3D PRINTING AND ADDITIVE MANUFACTURING 2024; 11:1495-1509. [PMID: 39360130 PMCID: PMC11443122 DOI: 10.1089/3dp.2022.0316] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 10/04/2024]
Abstract
Bioprinting is a rapidly evolving field, as represented by the exponential growth of articles and reviews published each year on the topic. As the number of publications increases, there is a need for an automatic tool that can help researchers do more comprehensive literature analysis, standardize the nomenclature, and so accelerate the development of novel manufacturing techniques and materials for the field. In this context, we propose an automatic keyword annotation model, based on Natural Language Processing (NLP) techniques, that can be used to find insights in the bioprinting scientific literature. The approach is based on two main data sources, the abstracts and related author keywords, which are used to train a composite model based on (i) an embeddings part (using the FastText algorithm), which generates word vectors for an input keyword, and (ii) a classifier part (using the Support Vector Machine algorithm), to label the keyword based on its word vector into a manufacturing technique, employed material, or application of the bioprinted product. The composite model was trained and optimized based on a two-stage optimization procedure to yield the best classification performance. The annotated author keywords were then reprojected on the abstract collection to both generate a lexicon of the bioprinting field and extract relevant information, like technology trends and the relationship between manufacturing-material-application. The proposed approach can serve as a basis for more complex NLP-related analysis toward the automated analysis of the bioprinting literature.
Affiliation(s)
- Amedeo Franco Bonatti
  - Department of Information Engineering and Research Center “Enrico Piaggio,”, Systems, Territory and Construction Engineering, University of Pisa, Pisa, Italy
- Filippo Chiarello
  - Department of Energy, Systems, Territory and Construction Engineering, University of Pisa, Pisa, Italy
- Giovanni Vozzi
  - Department of Information Engineering and Research Center “Enrico Piaggio,”, Systems, Territory and Construction Engineering, University of Pisa, Pisa, Italy
- Carmelo De Maria
  - Department of Information Engineering and Research Center “Enrico Piaggio,”, Systems, Territory and Construction Engineering, University of Pisa, Pisa, Italy
7
Ting JM, Tamayo-Mendoza T, Petersen SR, Van Reet J, Ahmed UA, Snell NJ, Fisher JD, Stern M, Oviedo F. Frontiers in nonviral delivery of small molecule and genetic drugs, driven by polymer chemistry and machine learning for materials informatics. Chem Commun (Camb) 2023; 59:14197-14209. [PMID: 37955165 DOI: 10.1039/d3cc04705a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2023]
Abstract
Materials informatics (MI) has immense potential to accelerate the pace of innovation and new product development in biotechnology. Close collaborations between skilled physical and life scientists with data scientists are being established in pursuit of leveraging MI tools in automation and artificial intelligence (AI) to predict material properties in vitro and in vivo. However, the scarcity of large, standardized, and labeled materials data for connecting structure-function relationships represents one of the largest hurdles to overcome. In this Highlight, focus is brought to emerging developments in polymer-based therapeutic delivery platforms, where teams generate large experimental datasets around specific therapeutics and successfully establish a design-to-deployment cycle of specialized nanocarriers. Three select collaborations demonstrate how custom-built polymers protect and deliver small molecules, nucleic acids, and proteins, representing ideal use-cases for machine learning to understand how molecular-level interactions impact drug stabilization and release. We conclude with our perspectives on how MI innovations in automation efficiencies and digitalization of data, coupled with fundamental insight and creativity from the polymer science community, can accelerate translation of more gene therapies into lifesaving medicines.
8
Martin TB, Audus DJ. Emerging Trends in Machine Learning: A Polymer Perspective. ACS POLYMERS AU 2023; 3:239-258. [PMID: 37334191 PMCID: PMC10273415 DOI: 10.1021/acspolymersau.2c00053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2022] [Revised: 12/20/2022] [Accepted: 12/21/2022] [Indexed: 01/19/2023]
Abstract
In the last five years, there has been tremendous growth in machine learning and artificial intelligence as applied to polymer science. Here, we highlight the unique challenges presented by polymers and how the field is addressing them. We focus on emerging trends with an emphasis on topics that have received less attention in the review literature. Finally, we provide an outlook for the field, outline important growth areas in machine learning and artificial intelligence for polymer science and discuss important advances from the greater material science community.
Affiliation(s)
- Tyler B. Martin
  - National Institute of Standards and Technology, Gaithersburg, Maryland 20899, United States
- Debra J. Audus
  - National Institute of Standards and Technology, Gaithersburg, Maryland 20899, United States
9
Shetty P, Rajan AC, Kuenneth C, Gupta S, Panchumarti LP, Holm L, Zhang C, Ramprasad R. A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing. NPJ COMPUTATIONAL MATERIALS 2023; 9:52. [PMID: 37033291 PMCID: PMC10073792 DOI: 10.1038/s41524-023-01003-w] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 09/16/2022] [Accepted: 03/16/2023] [Indexed: 06/19/2023]
Abstract
The ever-increasing number of materials science articles makes it hard to infer chemistry-structure-property relations from literature. We used natural language processing methods to automatically extract material property data from the abstracts of polymer literature. As a component of our pipeline, we trained MaterialsBERT, a language model, using 2.4 million materials science abstracts, which outperforms other baseline models in three out of five named entity recognition datasets. Using this pipeline, we obtained ~300,000 material property records from ~130,000 abstracts in 60 hours. The extracted data was analyzed for a diverse range of applications such as fuel cells, supercapacitors, and polymer solar cells to recover non-trivial insights. The data extracted through our pipeline is made available at polymerscholar.org which can be used to locate material property data recorded in abstracts. This work demonstrates the feasibility of an automatic pipeline that starts from published literature and ends with extracted material property information.
Affiliation(s)
- Pranav Shetty
  - School of Computational Science & Engineering, Atlanta, GA USA
- Arunkumar Chitteth Rajan
  - School of Materials Science and Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, 30332 GA USA
- Chris Kuenneth
  - School of Materials Science and Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, 30332 GA USA
- Sonakshi Gupta
  - Department of Metallurgy Engineering and Materials Science, Indian Institute of Technology, Indore, Madhya Pradesh India
- Lakshmi Prerana Panchumarti
  - School of Materials Science and Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, 30332 GA USA
- Lauren Holm
  - School of Materials Science and Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, 30332 GA USA
- Chao Zhang
  - School of Computational Science & Engineering, Atlanta, GA USA
- Rampi Ramprasad
  - School of Materials Science and Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, 30332 GA USA
10
Steinmann SN, Wang Q, Seh ZW. How machine learning can accelerate electrocatalysis discovery and optimization. MATERIALS HORIZONS 2023; 10:393-406. [PMID: 36541226 DOI: 10.1039/d2mh01279k] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Indexed: 06/17/2023]
Abstract
Advances in machine learning (ML) provide the means to bypass bottlenecks in the discovery of new electrocatalysts using traditional approaches. In this review, we highlight the currently achieved work in ML-accelerated discovery and optimization of electrocatalysts via a tight collaboration between computational models and experiments. First, the applicability of available methods for constructing machine-learned potentials (MLPs), which provide accurate energies and forces for atomistic simulations, are discussed. Meanwhile, the current challenges for MLPs in the context of electrocatalysis are highlighted. Then, we review the recent progress in predicting catalytic activities using surrogate models, including microkinetic simulations and more global proxies thereof. Several typical applications of using ML to rationalize thermodynamic proxies and predict the adsorption and activation energies are also discussed. Next, recent developments of ML-assisted experiments for catalyst characterization, synthesis optimization and reaction condition optimization are illustrated. In particular, the applications in ML-enhanced spectra analysis and the use of ML to interpret experimental kinetic data are highlighted. Additionally, we also show how robotics are applied to high-throughput synthesis, characterization and testing of electrocatalysts to accelerate the materials exploration process and how this equipment can be assembled into self-driven laboratories.
Affiliation(s)
- Qing Wang
  - Univ Lyon, ENS de Lyon, CNRS, Laboratoire de Chimie UMR 5182, Lyon, France.
- Zhi Wei Seh
  - Institute of Materials Research and Engineering, Agency for Science, Technology and Research (A*STAR), 2 Fusionopolis Way, Innovis, 138634, Singapore.
11
Huang S, Cole JM. BatteryBERT: A Pretrained Language Model for Battery Database Enhancement. J Chem Inf Model 2022; 62:6365-6377. [PMID: 35533012 PMCID: PMC9795558 DOI: 10.1021/acs.jcim.2c00035] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Indexed: 01/07/2023]
Abstract
A great number of scientific papers are published every year in the field of battery research, which forms a huge textual data source. However, it is difficult to explore and retrieve useful information efficiently from these large unstructured sets of text. The Bidirectional Encoder Representations from Transformers (BERT) model, trained on a large data set in an unsupervised way, provides a route to process the scientific text automatically with minimal human effort. To this end, we realized six battery-related BERT models, namely, BatteryBERT, BatteryOnlyBERT, and BatterySciBERT, each of which consists of both cased and uncased models. They have been trained specifically on a corpus of battery research papers. The pretrained BatteryBERT models were then fine-tuned on downstream tasks, including battery paper classification and extractive question-answering for battery device component classification that distinguishes anode, cathode, and electrolyte materials. Our BatteryBERT models were found to outperform the original BERT models on the specific battery tasks. The fine-tuned BatteryBERT was then used to perform battery database enhancement. We also provide a website application for its interactive use and visualization.
Affiliation(s)
- Shu Huang
  - Cavendish Laboratory, Department of Physics, University of Cambridge, J.J. Thomson Avenue, Cambridge CB3 0HE, U.K.
- Jacqueline M. Cole
  - Cavendish Laboratory, Department of Physics, University of Cambridge, J.J. Thomson Avenue, Cambridge CB3 0HE, U.K.
  - ISIS Neutron and Muon Source, Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.
12
Abstract
The application of machine learning to the materials domain has traditionally struggled with two major challenges: a lack of large, curated data sets and the need to understand the physics behind the machine-learning prediction. The former problem is particularly acute in the polymers domain. Here we aim to simultaneously tackle these challenges through the incorporation of scientific knowledge, thus, providing improved predictions for smaller data sets, both under interpolation and extrapolation, and a degree of explainability. We focus on imperfect theories, as they are often readily available and easier to interpret. Using a system of a polymer in different solvent qualities, we explore numerous methods for incorporating theory into machine learning using different machine-learning models, including Gaussian process regression. Ultimately, we find that encoding the functional form of the theory performs best followed by an encoding of the numeric values of the theory.
Affiliation(s)
- Debra J Audus
  - Materials Science and Engineering Division, National Institute of Standards and Technology, Gaithersburg, Maryland 20899, United States
- Austin McDannald
  - Materials Measurement Science Division, National Institute of Standards and Technology, Gaithersburg, Maryland 20899, United States
- Brian DeCost
  - Materials Measurement Science Division, National Institute of Standards and Technology, Gaithersburg, Maryland 20899, United States
13
Kumar R. Materiomically Designed Polymeric Vehicles for Nucleic Acids: Quo Vadis? ACS APPLIED BIO MATERIALS 2022; 5:2507-2535. [PMID: 35642794 DOI: 10.1021/acsabm.2c00346] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Abstract
Despite rapid advances in molecular biology, particularly in site-specific genome editing technologies, such as CRISPR/Cas9 and base editing, financial and logistical challenges hinder a broad population from accessing and benefiting from gene therapy. To improve the affordability and scalability of gene therapy, we need to deploy chemically defined, economical, and scalable materials, such as synthetic polymers. For polymers to deliver nucleic acids efficaciously to targeted cells, they must optimally combine design attributes, such as architecture, length, composition, spatial distribution of monomers, basicity, hydrophilic-hydrophobic phase balance, or protonation degree. Designing polymeric vectors for specific nucleic acid payloads is a multivariate optimization problem wherein even minuscule deviations from the optimum are poorly tolerated. To explore the multivariate polymer design space rapidly, efficiently, and fruitfully, we must integrate parallelized polymer synthesis, high-throughput biological screening, and statistical modeling. Although materiomics approaches promise to streamline polymeric vector development, several methodological ambiguities must be resolved. For instance, establishing a flexible polymer ontology that accommodates recent synthetic advances, enforcing uniform polymer characterization and data reporting standards, and implementing multiplexed in vitro and in vivo screening studies require considerable planning, coordination, and effort. This contribution will acquaint readers with the challenges associated with materiomics approaches to polymeric gene delivery and offers guidelines for overcoming these challenges. 
Here, we summarize recent developments in combinatorial polymer synthesis, high-throughput screening of polymeric vectors, omics-based approaches to polymer design, barcoding schemes for pooled in vitro and in vivo screening, and identify materiomics-inspired research directions that will realize the long-unfulfilled clinical potential of polymeric carriers in gene therapy.
Affiliation(s)
- Ramya Kumar
  - Department of Chemical & Biological Engineering, Colorado School of Mines, 1613 Illinois St, Golden, Colorado 80401, United States
14
Teruya E, Takeuchi T, Morita H, Hayashi T, Ono K. ARTS: autonomous research topic selection system using word embeddings and network analysis. MACHINE LEARNING: SCIENCE AND TECHNOLOGY 2022. [DOI: 10.1088/2632-2153/ac61eb] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
Abstract
The materials science research process has become increasingly autonomous due to the remarkable progress in artificial intelligence. However, autonomous research topic selection (ARTS) has not yet been fully explored due to the difficulty of estimating its promise and the lack of previous research. This paper introduces an ARTS system that autonomously selects potential research topics that are likely to reveal new scientific facts yet have not been the subject of much previous research by analyzing vast numbers of articles. Potential research topics are selected by analyzing the difference between two research concept networks constructed from research information in articles: one that represents the promise of research topics and is constructed from word embeddings, and one that represents known facts and past research activities and is constructed from statistical information on the appearance patterns of research concepts. The ARTS system is also equipped with functions to search and visualize information about selected research topics to assist in the final determination of a research topic by a scientist. We developed the ARTS system using approximately 100 00 articles published in the Computational Materials Science journal. The results of our evaluation demonstrated that research topics studied after 2016 could be generated autonomously from an analysis of the articles published before 2015. This suggests that potential research topics can be effectively selected by using the ARTS system.
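The network comparison described in this abstract can be illustrated with a minimal sketch. It assumes one simple reading of "the difference between two research concept networks": keep concept pairs that are close in embedding space but rarely co-occur in published articles. The concept names, vectors, co-occurrence counts, thresholds, and the function `candidate_topics` are all hypothetical and are not taken from the ARTS system.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def candidate_topics(embeddings, cooccurrence, min_similarity, max_cooccurrence):
    """Return concept pairs linked in the embedding network but not in the
    co-occurrence network (similar concepts that are rarely studied together)."""
    candidates = []
    # Pairs are generated in sorted name order, so co-occurrence keys
    # must use the same (alphabetical) ordering.
    for a, b in combinations(sorted(embeddings), 2):
        similar = cosine_similarity(embeddings[a], embeddings[b]) >= min_similarity
        rarely_studied = cooccurrence.get((a, b), 0) <= max_cooccurrence
        if similar and rarely_studied:
            candidates.append((a, b))
    return candidates


# Hypothetical toy data: 3-dimensional "embeddings" and article co-occurrence counts.
TOY_EMBEDDINGS = {
    "graphene": [0.8, 0.6, 0.0],
    "phonon": [0.7, 0.7, 0.1],
    "thermal conductivity": [0.6, 0.8, 0.0],
    "corrosion": [0.0, 0.1, 1.0],
}
TOY_COOCCURRENCE = {
    ("graphene", "phonon"): 40,
    ("phonon", "thermal conductivity"): 55,
    ("graphene", "thermal conductivity"): 1,
}

topics = candidate_topics(
    TOY_EMBEDDINGS, TOY_COOCCURRENCE, min_similarity=0.9, max_cooccurrence=5
)
print(topics)
```

With these toy inputs, only the pair that is similar in embedding space yet almost never co-occurs survives the filter. A real system would presumably derive both networks from a large article corpus and rank candidates rather than apply hard thresholds.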
15
Shetty P, Ramprasad R. Machine-Guided Polymer Knowledge Extraction Using Natural Language Processing: The Example of Named Entity Normalization. J Chem Inf Model 2021; 61:5377-5385. [PMID: 34752101 DOI: 10.1021/acs.jcim.1c00554] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/28/2022]
Abstract
A rich body of literature has emerged in recent years that discusses the extraction of structured information from materials science text through named entity recognition models. Relatively little work has been done to address the "normalization" of extracted entities, that is, recognizing that two or more seemingly different entities actually refer to the same entity in reality. In this work, we address the normalization of polymer named entities, polymers being a class of materials that often have a variety of common names for the same material in addition to the IUPAC name. We have trained supervised clustering models using Word2Vec and fastText word embeddings reported in previous work so that named entities referring to the same polymer are categorized within the same cluster in the word embedding space. We report the use of parameterized cosine distance functions to cluster and normalize textually derived entities, achieving an F1 score of 0.85. Furthermore, a labeled data set of polymer names was utilized to train our model and to infer the true total number of unique polymers that are actively reported in the literature. For ∼15,500 polymer named entities extracted from our corpus of 0.5 million papers, we detected 6734 unique clusters (i.e., unique polymers), 632 of which were manually curated to train the normalization model. This work will serve as a critical ingredient in a natural language processing-based pipeline for the automatic and efficient extraction of knowledge from the polymer literature.
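As a rough illustration of normalizing entity names by clustering them in embedding space, the sketch below groups names whose vectors lie within a cosine-distance threshold of each other. It is not the authors' method: their models are supervised and use parameterized cosine distance functions that the abstract does not describe, whereas this sketch assumes plain cosine distance, a fixed hypothetical threshold, single-linkage grouping, and invented two-dimensional vectors.

```python
from math import sqrt


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cluster_names(vectors, threshold=0.1):
    """Single-linkage clustering: names closer than `threshold` share a cluster.

    vectors: dict mapping entity name -> embedding vector (toy values here).
    Returns a sorted list of clusters, each a list of names.
    """
    names = sorted(vectors)
    parent = {name: name for name in names}

    def find(name):
        # Union-find root lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine_distance(vectors[a], vectors[b]) <= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Invented vectors; real Word2Vec or fastText embeddings have hundreds of dimensions.
vectors = {
    "polystyrene": [1.0, 0.0],
    "PS": [0.98, 0.05],
    "poly(methyl methacrylate)": [0.0, 1.0],
    "PMMA": [0.03, 0.99],
}

for cluster in cluster_names(vectors):
    print(cluster)
```

With these toy vectors the abbreviation and full name of each polymer end up in the same cluster. The threshold plays the role of a tunable parameter; in a supervised setting it would be fit against labeled name pairs rather than fixed by hand.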
Affiliation(s)
- Pranav Shetty
- School of Computational Science & Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, Georgia 30332, United States
| | - Rampi Ramprasad
- School of Materials Science and Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, Georgia 30332, United States
| |
|
16
|
IP Analytics and Machine Learning Applied to Create Process Visualization Graphs for Chemical Utility Patents. Processes (Basel) 2021. [DOI: 10.3390/pr9081342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Researchers must read and understand a large volume of technical papers, including patent documents, to fully grasp the state-of-the-art technological progress in a given domain. Chemical research is particularly challenging with the fast growth of newly registered utility patents (also known as intellectual property or IP) that provide detailed descriptions of the processes used to create a new chemical or a new process to manufacture a known chemical. The researcher must be able to understand the latest patents and literature in order to develop new chemicals and processes that do not infringe on existing claims and processes. This research uses text mining, integrated machine learning, and knowledge visualization techniques to effectively and accurately support the extraction and graphical presentation of chemical processes disclosed in patent documents. The computer framework trains a machine learning model called ALBERT for automatic paragraph text classification. ALBERT separates chemical and non-chemical descriptive paragraphs from a patent for effective chemical term extraction. The ChemDataExtractor is used to classify chemical terms, such as inputs, units, and reactions from the chemical paragraphs. A computer-supported graph-based knowledge representation interface is developed to plot the extracted chemical terms and their chemical process links as a network of nodes with connecting arcs. The computer-supported chemical knowledge visualization approach helps researchers to quickly understand the innovative and unique chemical or processes of any chemical patent of interest.
|
17
|
Kuenneth C, Schertzer W, Ramprasad R. Copolymer Informatics with Multitask Deep Neural Networks. Macromolecules 2021. [DOI: 10.1021/acs.macromol.1c00728] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 12/19/2022]
Affiliation(s)
- Christopher Kuenneth
- School of Materials Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
| | - William Schertzer
- School of Materials Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
| | - Rampi Ramprasad
- School of Materials Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
| |
|
18
|
Siraj A, Lim DY, Tayara H, Chong KT. UbiComb: A Hybrid Deep Learning Model for Predicting Plant-Specific Protein Ubiquitylation Sites. Genes (Basel) 2021; 12:genes12050717. [PMID: 34064731 PMCID: PMC8151217 DOI: 10.3390/genes12050717] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 04/15/2021] [Revised: 05/06/2021] [Accepted: 05/07/2021] [Indexed: 12/11/2022]
Abstract
Protein ubiquitylation is an essential post-translational modification process that performs a critical role in a wide range of biological functions, even a degenerative role in certain diseases, and is consequently used as a promising target for the treatment of various diseases. Owing to the significant role of protein ubiquitylation, these sites can be identified by enzymatic approaches, mass spectrometry analysis, and combinations of multidimensional liquid chromatography and tandem mass spectrometry. However, these large-scale experimental screening techniques are time consuming, expensive, and laborious. To overcome the drawbacks of experimental methods, machine learning and deep learning-based predictors were considered for prediction in a timely and cost-effective manner. In the literature, several computational predictors have been published across species; however, predictors are species-specific because of the unclear patterns in different species. In this study, we proposed a novel approach for predicting plant ubiquitylation sites using a hybrid deep learning model by utilizing convolutional neural network and long short-term memory. The proposed method uses the actual protein sequence and physicochemical properties as inputs to the model and provides more robust predictions. The proposed predictor achieved the best result with accuracy values of 80% and 81% and F-scores of 79% and 82% on the 10-fold cross-validation and an independent dataset, respectively. Moreover, we also compared the testing of the independent dataset with popular ubiquitylation predictors; the results demonstrate that our model significantly outperforms the other methods in prediction classification results.
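For readers who want to see how metrics of the kind reported above are computed, the sketch below derives accuracy and an F-score from binary predictions. It assumes the F-score is the balanced F1 score (the abstract does not say which variant was used), and the labels are made-up values, not data from the study.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels, where 1 marks a predicted site."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Made-up labels: 1 = ubiquitylation site, 0 = non-site.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, F1={f1:.2f}")
```

In a 10-fold cross-validation these values would be computed on each held-out fold and then averaged, whereas the independent-dataset figures come from a single evaluation on data never used in training.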
Affiliation(s)
- Arslan Siraj
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea
| | - Dae Yeong Lim
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea
| | - Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, Korea
- Correspondence: (H.T.); (K.T.C.)
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea
- Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, Korea
- Correspondence: (H.T.); (K.T.C.)
| |
|