1
|
Shetty P, Ramprasad R. Automated knowledge extraction from polymer literature using natural language processing. iScience 2020; 24:101922. [PMID: 33458607 PMCID: PMC7797509 DOI: 10.1016/j.isci.2020.101922] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2020] [Revised: 11/12/2020] [Accepted: 12/05/2020] [Indexed: 12/21/2022] Open
Abstract
Materials science literature has grown exponentially in recent years making it difficult for individuals to master all of this information. This constrains the formulation of new hypotheses that scientists can come up with. In this work, we explore whether materials science knowledge can be automatically inferred from textual information contained in journal papers. Using a data set of 0.5 million polymer papers, we show, using natural language processing methods that vector representations trained for every word in our corpus can indeed capture this knowledge in a completely unsupervised manner. We perform time-based studies through which we track popularity of various polymers for different applications and predict new polymers for novel applications based solely on the domain knowledge contained in our data set. Using co-relations detected automatically from literature in this manner thus, opens up a new paradigm for materials discovery. Word embeddings trained on a corpus of polymer papers Common polymers and their corresponding applications analyzed over time Polymer domain knowledge encoded in word vectors Novel polymers for certain applications predicted and validated using word embeddings
Collapse
Affiliation(s)
- Pranav Shetty
- School of Computational Science and Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, GA 30332, USA
| | - Rampi Ramprasad
- School of Materials Science and Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, GA 30332, USA
| |
Collapse
|
2
|
Krallinger M, Rabal O, Lourenço A, Oyarzabal J, Valencia A. Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017; 117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 129] [Impact Index Per Article: 16.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/03/2023]
Abstract
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
Collapse
Affiliation(s)
- Martin Krallinger
- Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre , C/Melchor Fernández Almagro 3, Madrid E-28029, Spain
| | - Obdulia Rabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra , Avenida Pio XII 55, Pamplona E-31008, Spain
| | - Anália Lourenço
- ESEI - Department of Computer Science, University of Vigo , Edificio Politécnico, Campus Universitario As Lagoas s/n, Ourense E-32004, Spain.,Centro de Investigaciones Biomédicas (Centro Singular de Investigación de Galicia) , Campus Universitario Lagoas-Marcosende, Vigo E-36310, Spain.,CEB-Centre of Biological Engineering, University of Minho , Campus de Gualtar, Braga 4710-057, Portugal
| | - Julen Oyarzabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra , Avenida Pio XII 55, Pamplona E-31008, Spain
| | - Alfonso Valencia
- Life Science Department, Barcelona Supercomputing Centre (BSC-CNS) , C/Jordi Girona, 29-31, Barcelona E-08034, Spain.,Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona , C/ Baldiri Reixac 10, Barcelona E-08028, Spain.,Institució Catalana de Recerca i Estudis Avançats (ICREA) , Passeig de Lluís Companys 23, Barcelona E-08010, Spain
| |
Collapse
|
3
|
Difficulties and Challenges Associated with Literature Searches in Operating Room Management, Complete with Recommendations. Anesth Analg 2013; 117:1460-79. [DOI: 10.1213/ane.0b013e3182a6d33b] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/15/2023]
|
4
|
Hahn U, Cohen KB, Garten Y, Shah NH. Mining the pharmacogenomics literature--a survey of the state of the art. Brief Bioinform 2012; 13:460-94. [PMID: 22833496 PMCID: PMC3404399 DOI: 10.1093/bib/bbs018] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2011] [Accepted: 03/23/2012] [Indexed: 01/05/2023] Open
Abstract
This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research.
Collapse
Affiliation(s)
- Udo Hahn
- Jena University Language and Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Jena, Germany.
| | | | | | | |
Collapse
|
5
|
Nadkarni PM, Parikh CR. An eUtils toolset and its use for creating a pipeline to link genomics and proteomics analyses to domain-specific biomedical literature. J Clin Bioinforma 2012; 2:9. [PMID: 22507626 PMCID: PMC3422171 DOI: 10.1186/2043-9113-2-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2012] [Accepted: 04/16/2012] [Indexed: 12/03/2022] Open
Abstract
Background Numerous biomedical software applications access databases maintained by the US National Center for Biotechnology Information (NCBI). To ease software automation, NCBI provides a powerful but complex Web-service-based programming interface, eUtils. This paper describes a toolset that simplifies eUtils use through a graphical front-end that can be used by non-programmers to construct data-extraction pipelines. The front-end relies on a code library that provides high-level wrappers around eUtils functions, and which is distributed as open-source, allowing customization and enhancement by individuals with programming skills. Methods We initially created an application that queried eUtils to retrieve nephrology-specific biomedical literature citations for a user-definable set of genes. We later augmented the application code to create a general-purpose library that accesses eUtils capability as individual functions that could be combined into user-defined pipelines. Results The toolset’s use is illustrated with an application that serves as a front-end to the library and can be used by non-programmers to construct user-defined pipelines. The operation of the library is illustrated for the literature-surveillance application, which serves as a case-study. An overview of the library is also provided. Conclusions The library simplifies use of the eUtils service by operating at a higher level, and also transparently addresses robustness issues that would need to be individually implemented otherwise, such as error recovery and prevention of overloading of the eUtils service.
Collapse
Affiliation(s)
- Prakash M Nadkarni
- Department of Medicine, Program of Applied Translational Research, Yale University School of Medicine, New Haven, CT, USA.
| | | |
Collapse
|