Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Corbett P, Copestake A. Cascaded classifiers for confidence-based chemical named entity recognition. BMC Bioinformatics 2008;9 Suppl 11:S4. [PMID: 19025690 PMCID: PMC2586753 DOI: 10.1186/1471-2105-9-s11-s4] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.8] [Indexed: 11/26/2022] Open


Cited by Other Article(s)

Lai PT, Coudert E, Aimo L, Axelsen K, Breuza L, de Castro E, Feuermann M, Morgat A, Pourcel L, Pedruzzi I, Poux S, Redaschi N, Rivoire C, Sveshnikova A, Wei CH, Leaman R, Luo L, Lu Z, Bridge A. EnzChemRED, a rich enzyme chemistry relation extraction dataset. Sci Data 2024;11:982. [PMID: 39251610 PMCID: PMC11384730 DOI: 10.1038/s41597-024-03835-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2024] [Accepted: 08/23/2024] [Indexed: 09/11/2024] Open

Affiliation(s)

Po-Ting Lai National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA
Elisabeth Coudert Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Lucila Aimo Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Kristian Axelsen Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Lionel Breuza Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Edouard de Castro Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Marc Feuermann Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Anne Morgat Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Lucille Pourcel Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Ivo Pedruzzi Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Sylvain Poux Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Nicole Redaschi Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Catherine Rivoire Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Anastasia Sveshnikova Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
Chih-Hsuan Wei National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA
Robert Leaman National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA
Ling Luo School of Computer Science and Technology, Dalian University of Technology, 116024, Dalian, China
Zhiyong Lu National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA
Alan Bridge Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland

Thompson P, Ananiadou S, Basinas I, Brinchmann BC, Cramer C, Galea KS, Ge C, Georgiadis P, Kirkeleit J, Kuijpers E, Nguyen N, Nuñez R, Schlünssen V, Stokholm ZA, Taher EA, Tinnerberg H, Van Tongeren M, Xie Q. Supporting the working life exposome: Annotating occupational exposure for enhanced literature search. PLoS One 2024;19:e0307844. [PMID: 39146349 PMCID: PMC11326626 DOI: 10.1371/journal.pone.0307844] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Accepted: 07/12/2024] [Indexed: 08/17/2024] Open

Abstract

An individual's likelihood of developing non-communicable diseases is often influenced by the types, intensities and duration of exposures at work. Job exposure matrices provide exposure estimates associated with different occupations. However, due to their time-consuming expert curation process, job exposure matrices currently cover only a subset of possible workplace exposures and may not be regularly updated. Scientific literature articles describing exposure studies provide important supporting evidence for developing and updating job exposure matrices, since they report on exposures in a variety of occupational scenarios. However, the constant growth of scientific literature is increasing the challenges of efficiently identifying relevant articles and important content within them. Natural language processing methods emulate the human process of reading and understanding texts, but in a fraction of the time. Such methods can increase the efficiency of both finding relevant documents and pinpointing specific information within them, which could streamline the process of developing and updating job exposure matrices. Named entity recognition is a fundamental natural language processing method for language understanding, which automatically identifies mentions of domain-specific concepts (named entities) in documents, e.g., exposures, occupations and job tasks. State-of-the-art machine learning models typically use evidence from an annotated corpus, i.e., a set of documents in which named entities are manually marked up (annotated) by experts, to learn how to detect named entities automatically in new documents. We have developed a novel annotated corpus of scientific articles to support machine learning based named entity recognition relevant to occupational substance exposures. 
Through incremental refinements to the annotation process, we demonstrate that expert annotators can attain high levels of agreement, and that the corpus can be used to train high-performance named entity recognition models. The corpus thus constitutes an important foundation for the wider development of natural language processing tools to support the study of occupational exposures.

Affiliation(s)

Paul Thompson Department of Computer Science, National Centre for Text Mining, University of Manchester, Manchester, United Kingdom
Sophia Ananiadou Department of Computer Science, National Centre for Text Mining, University of Manchester, Manchester, United Kingdom
Ioannis Basinas Centre for Occupational and Environmental Health, School of Health Sciences, University of Manchester, Manchester, United Kingdom
Bendik C Brinchmann Federation of Norwegian Industries, Oslo, Norway; Department of Occupational Medicine and Epidemiology, National Institute of Occupational Health, Oslo, Norway
Christine Cramer Department of Public Health, Research Unit for Environment, Occupation and Health, Danish Ramazzini Centre, Aarhus University, Aarhus, Denmark; Department of Occupational Medicine, Danish Ramazzini Centre, Aarhus University Hospital, Aarhus, Denmark
Karen S Galea Institute of Occupational Medicine, Edinburgh, United Kingdom
Calvin Ge Netherlands Organisation for Applied Scientific Research, Utrecht, Netherlands
Panagiotis Georgiadis Department of Computer Science, National Centre for Text Mining, University of Manchester, Manchester, United Kingdom
Jorunn Kirkeleit Federation of Norwegian Industries, Oslo, Norway; Department of Global Public Health and Primary Care, University of Bergen, Bergen, Norway
Eelco Kuijpers Netherlands Organisation for Applied Scientific Research, Utrecht, Netherlands
Nhung Nguyen Department of Computer Science, National Centre for Text Mining, University of Manchester, Manchester, United Kingdom
Roberto Nuñez Occupational Health Group, Institute for Risk Assessment Sciences, Utrecht University, Utrecht, Netherlands
Vivi Schlünssen Department of Public Health, Research Unit for Environment, Occupation and Health, Danish Ramazzini Centre, Aarhus University, Aarhus, Denmark
Zara Ann Stokholm Department of Occupational Medicine, Danish Ramazzini Centre, Aarhus University Hospital, Aarhus, Denmark
Evana Amir Taher Center for Occupational and Environmental Medicine, Stockholm, Sweden
Håkan Tinnerberg School of Public Health and Community Medicine, University of Gothenburg, Gothenburg, Sweden; Institute of Environmental Medicine, Karolinska Institutet, Stockholm, Sweden
Martie Van Tongeren Centre for Occupational and Environmental Health, School of Health Sciences, University of Manchester, Manchester, United Kingdom
Qianqian Xie Department of Computer Science, National Centre for Text Mining, University of Manchester, Manchester, United Kingdom

A Narrative Literature Review of Natural Language Processing Applied to the Occupational Exposome. International Journal of Environmental Research and Public Health 2022;19:ijerph19148544. [PMID: 35886395 PMCID: PMC9316260 DOI: 10.3390/ijerph19148544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2022] [Revised: 07/07/2022] [Accepted: 07/11/2022] [Indexed: 02/05/2023]

Trewartha A, Walker N, Huo H, Lee S, Cruse K, Dagdelen J, Dunn A, Persson KA, Ceder G, Jain A. Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science. Patterns (New York, N.Y.) 2022;3:100488. [PMID: 35465225 PMCID: PMC9024010 DOI: 10.1016/j.patter.2022.100488] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 10/24/2021] [Revised: 01/21/2022] [Accepted: 03/15/2022] [Indexed: 11/03/2022]

Abstract

A bottleneck in efficiently connecting new materials discoveries to established literature has arisen due to an increase in publications. This problem may be addressed by using named entity recognition (NER) to extract structured summary-level data from unstructured materials science text. We compare the performance of four NER models on three materials science datasets. The four models include a bidirectional long short-term memory (BiLSTM) and three transformer models (BERT, SciBERT, and MatBERT) with increasing degrees of domain-specific materials science pre-training. MatBERT improves over the other two BERT_BASE-based models by 1%∼12%, implying that domain-specific pre-training provides measurable advantages. Despite relative architectural simplicity, the BiLSTM model consistently outperforms BERT, perhaps due to its domain-specific pre-trained word embeddings. Furthermore, MatBERT and SciBERT models outperform the original BERT model to a greater extent in the small data limit. MatBERT’s higher-quality predictions should accelerate the extraction of structured data from materials science literature.

• Efficient extraction of information from materials science literature is needed
• Domain-specific materials science pre-training improves results
• Even simpler domain-specific models can outperform more complex general models

A bottleneck in efficiently connecting new materials discoveries to established literature has arisen due to a massive increase in publications. Four different language models are trained to automatically collect important information from materials science articles. We compare a simple model (BiLSTM) with materials science knowledge to three variants of a more complex model: one with general knowledge (BERT), one with general scientific knowledge (SciBERT), and one with materials science knowledge (MatBERT). We find that MatBERT performs the best overall. This implies that language models with greater extents of materials science knowledge will perform better on materials science-related tasks. The simpler model even consistently outperforms BERT. Furthermore, the performance gaps grow when the models are given fewer examples of information extraction to learn from. MatBERT’s higher-quality results should accelerate the collection of information from materials science literature.

Affiliation(s)

Amalie Trewartha Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
Nicholas Walker Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
Haoyan Huo Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA; Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
Sanghoon Lee Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA; Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
Kevin Cruse Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA; Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
John Dagdelen Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA; Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
Alexander Dunn Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA; Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
Kristin A Persson Molecular Foundry, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA; Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
Gerbrand Ceder Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA; Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
Anubhav Jain Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA

Kononova O, He T, Huo H, Trewartha A, Olivetti EA, Ceder G. Opportunities and challenges of text mining in materials research. iScience 2021;24:102155. [PMID: 33665573 PMCID: PMC7905448 DOI: 10.1016/j.isci.2021.102155] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Indexed: 12/19/2022] Open

Dai HJ, Su CH, Wu CS. Adverse drug event and medication extraction in electronic health records via a cascading architecture with different sequence labeling models and word embeddings. J Am Med Inform Assoc 2021;27:47-55. [PMID: 31334805 DOI: 10.1093/jamia/ocz120] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 06/14/2019] [Revised: 05/11/2019] [Accepted: 06/14/2019] [Indexed: 11/12/2022] Open

Corbett P, Boyle J. Chemlistem: chemical named entity recognition using recurrent neural networks. J Cheminform 2018;10:59. [PMID: 30523437 PMCID: PMC6755713 DOI: 10.1186/s13321-018-0313-8] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 07/30/2018] [Accepted: 11/30/2018] [Indexed: 11/30/2022] Open

Krallinger M, Rabal O, Lourenço A, Oyarzabal J, Valencia A. Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017;117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 111] [Impact Index Per Article: 15.9] [Indexed: 01/03/2023]

Asiaee AH, Minning T, Doshi P, Tarleton RL. A framework for ontology-based question answering with application to parasite immunology. J Biomed Semantics 2015;6:31. [PMID: 26185615 PMCID: PMC4504081 DOI: 10.1186/s13326-015-0029-x] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 12/15/2013] [Accepted: 06/19/2015] [Indexed: 11/15/2022] Open

Abstract

Background

Large quantities of biomedical data are being produced at a rapid pace for a variety of organisms. With ontologies proliferating, data is increasingly being stored using the RDF data model and queried using RDF based querying languages. While existing systems facilitate the querying in various ways, the scientist must map the question in his or her mind to the interface used by the systems. The field of natural language processing has long investigated the challenges of designing natural language based retrieval systems. Recent efforts seek to bring the ability to pose natural language questions to RDF data querying systems while leveraging the associated ontologies. These analyze the input question and extract triples (subject, relationship, object), if possible, mapping them to RDF triples in the data. However, in the biomedical context, relationships between entities are not always explicit in the question and these are often complex involving many intermediate concepts.

Results

We present a new framework, OntoNLQA, for querying RDF data annotated using ontologies which allows posing questions in natural language. OntoNLQA offers five steps in order to answer natural language questions. In comparison to previous systems, OntoNLQA differs in how some of the methods are realized. In particular, it introduces a novel approach for discovering the sophisticated semantic associations that may exist between the key terms of a natural language question, in order to build an intuitive query and retrieve precise answers. We apply this framework to the context of parasite immunology data, leading to a system called AskCuebee that allows parasitologists to pose genomic, proteomic and pathway questions in natural language related to the parasite, Trypanosoma cruzi. We separately evaluate the accuracy of each component of OntoNLQA as implemented in AskCuebee and the accuracy of the whole system. AskCuebee answers 68 % of the questions in a corpus of 125 questions, and 60 % of the questions in a new previously unseen corpus. If we allow simple corrections by the scientists, this proportion increases to 92 %.

Conclusions

We introduce a novel framework for question answering and apply it to parasite immunology data. Evaluations of translating the questions to RDF triple queries by combining machine learning, lexical similarity matching with ontology classes, properties and instances for specificity, and discovering associations between them demonstrate that the approach performs well and improves on previous systems. Subsequently, OntoNLQA offers a viable framework for building question answering systems in other biomedical domains.

Electronic supplementary material

The online version of this article (doi:10.1186/s13326-015-0029-x) contains supplementary material, which is available to authorized users.

Pyysalo S, Ohta T, Rak R, Rowley A, Chun HW, Jung SJ, Choi SP, Tsujii J, Ananiadou S. Overview of the Cancer Genetics and Pathway Curation tasks of BioNLP Shared Task 2013. BMC Bioinformatics 2015;16 Suppl 10:S2. [PMID: 26202570 PMCID: PMC4511510 DOI: 10.1186/1471-2105-16-s10-s2] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Indexed: 12/24/2022] Open

Hsu YY, Kao HY. Curatable Named-Entity Recognition Using Semantic Relations. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2015;12:785-792. [PMID: 26357317 DOI: 10.1109/tcbb.2014.2366770] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 06/05/2023]

Application of text mining in the biomedical domain. Methods 2015;74:97-106. [PMID: 25641519 DOI: 10.1016/j.ymeth.2015.01.015] [Citation(s) in RCA: 76] [Impact Index Per Article: 8.4] [Received: 05/08/2014] [Revised: 01/21/2015] [Accepted: 01/23/2015] [Indexed: 12/12/2022] Open

Dai HJ, Lai PT, Chang YC, Tsai RTH. Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization. J Cheminform 2015;7:S14. [PMID: 25810771 PMCID: PMC4331690 DOI: 10.1186/1758-2946-7-s1-s14] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Indexed: 11/25/2022] Open

Abstract

Background

The functions of chemical compounds and drugs that affect biological processes and their particular effect on the onset and treatment of diseases have attracted increasing interest with the advancement of research in the life sciences. To extract knowledge from the extensive literature on such compounds and drugs, the organizers of BioCreative IV administered the CHEMical Compound and Drug Named Entity Recognition (CHEMDNER) task to establish a standard dataset for evaluating state-of-the-art chemical entity recognition methods.

Methods

This study introduces the approach of our CHEMDNER system. Instead of emphasizing the development of novel feature sets for machine learning, this study investigates the effect of various tag schemes on the recognition of the names of chemicals and drugs by using conditional random fields. Experiments were conducted using combinations of different tokenization strategies and tag schemes to investigate the effects of tag set selection and tokenization method on the CHEMDNER task.

Results

This study presents the performance on the CHEMDNER task of three more representative tag schemes (IOBE, IOBES, and IOB₁₂E) when applied to a widely utilized IOB tag set and combined with the coarse-/fine-grained tokenization methods. The experimental results reveal that the fine-grained tokenization strategy performs best in terms of precision, recall and F-scores when the IOBES tag set was utilized. The IOBES model with fine-grained tokenization yielded the best F-scores in the six chemical entity categories other than the "Multiple" entity category. Nonetheless, no significant improvement was observed when a more representative tag scheme was used with the coarse or fine-grained tokenization rules. The best F-scores that were achieved using the developed system on the test dataset of the CHEMDNER task were 0.833 and 0.815 for the chemical documents indexing and the chemical entity mention recognition tasks, respectively.

Conclusions

The results herein highlight the importance of tag set selection and the use of different tokenization strategies. Fine-grained tokenization combined with the tag set IOBES most effectively recognizes chemical and drug names. To the best of the authors' knowledge, this is the first comprehensive investigation of the use of various tag set schemes combined with different tokenization strategies for the recognition of chemical entities.
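
As a rough illustration of the tag schemes this abstract compares, the sketch below converts a sequence of IOB tags into IOBES tags, which add an explicit end tag (E) and a single-token tag (S). It assumes the common IOB2 convention (every entity starts with B) and uses a hypothetical `CHEM` label and example sentence; the abstract does not give the paper's exact tag definitions, so this is not the authors' implementation.

```python
from typing import List


def iob_to_iobes(tags: List[str]) -> List[str]:
    """Convert IOB2 tags (B-X, I-X, O) to IOBES by marking entity ends and singletons."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        # Look ahead: the entity continues only if the next tag is I- with the same label.
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == f"I-{label}"
        if prefix == "B":
            out.append(tag if continues else f"S-{label}")
        else:  # prefix == "I"
            out.append(tag if continues else f"E-{label}")
    return out


# Hypothetical example: one three-token mention and one single-token mention.
tokens = ["Treated", "with", "acetyl", "salicylic", "acid", "and", "aspirin"]
iob = ["O", "O", "B-CHEM", "I-CHEM", "I-CHEM", "O", "B-CHEM"]
iobes = iob_to_iobes(iob)
for token, tag in zip(tokens, iobes):
    print(f"{token}\t{tag}")
```

The richer scheme gives a sequence labeler separate classes for the boundaries of a mention, which is the kind of "more representative" tagging the study evaluates.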

Batista-Navarro R, Rak R, Ananiadou S. Optimising chemical named entity recognition with pre-processing analytics, knowledge-rich features and heuristics. J Cheminform 2015;7:S6. [PMID: 25810777 PMCID: PMC4331696 DOI: 10.1186/1758-2946-7-s1-s6] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Indexed: 01/19/2023] Open

Abstract

Background

The development of robust methods for chemical named entity recognition, a challenging natural language processing task, was previously hindered by the lack of publicly available, large-scale, gold standard corpora. The recent public release of a large chemical entity-annotated corpus as a resource for the CHEMDNER track of the Fourth BioCreative Challenge Evaluation (BioCreative IV) workshop greatly alleviated this problem and allowed us to develop a conditional random fields-based chemical entity recogniser. In order to optimise its performance, we introduced customisations in various aspects of our solution. These include the selection of specialised pre-processing analytics, the incorporation of chemistry knowledge-rich features in the training and application of the statistical model, and the addition of post-processing rules.

Results

Our evaluation shows that optimal performance is obtained when our customisations are integrated into the chemical entity recogniser. When its performance is compared with that of state-of-the-art methods, under comparable experimental settings, our solution achieves competitive advantage. We also show that our recogniser that uses a model trained on the CHEMDNER corpus is suitable for recognising names in a wide range of corpora, consistently outperforming two popular chemical NER tools.

Conclusion

The contributions resulting from this work are two-fold. Firstly, we present the details of a chemical entity recognition methodology that has demonstrated performance at a level competitive with, if not superior to, that of state-of-the-art methods. Secondly, the developed suite of solutions has been made publicly available as a configurable workflow in the interoperable text mining workbench Argo. This allows interested users to conveniently apply and evaluate our solutions in the context of other chemical text mining tasks.

Krallinger M, Rabal O, Leitner F, Vazquez M, Salgado D, Lu Z, Leaman R, Lu Y, Ji D, Lowe DM, Sayle RA, Batista-Navarro RT, Rak R, Huber T, Rocktäschel T, Matos S, Campos D, Tang B, Xu H, Munkhdalai T, Ryu KH, Ramanan SV, Nathan S, Žitnik S, Bajec M, Weber L, Irmer M, Akhondi SA, Kors JA, Xu S, An X, Sikdar UK, Ekbal A, Yoshioka M, Dieb TM, Choi M, Verspoor K, Khabsa M, Giles CL, Liu H, Ravikumar KE, Lamurias A, Couto FM, Dai HJ, Tsai RTH, Ata C, Can T, Usié A, Alves R, Segura-Bedmar I, Martínez P, Oyarzabal J, Valencia A. The CHEMDNER corpus of chemicals and drugs and its annotation principles. J Cheminform 2015;7:S2. [PMID: 25810773 PMCID: PMC4331692 DOI: 10.1186/1758-2946-7-s1-s2] [Citation(s) in RCA: 112] [Impact Index Per Article: 12.4] [Indexed: 11/24/2022] Open

Abstract

The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative of all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text was measured using an agreement study between annotators, obtaining a percentage agreement of 91%. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/.
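
To make the idea of a percentage agreement between annotators concrete, here is a minimal sketch under one possible definition: the share of distinct mention spans that both annotators marked identically. The abstract does not state how its agreement figure was computed, so the definition, the `(start, end)` span representation and the toy offsets below are all assumptions made for illustration.

```python
from typing import Set, Tuple

# Hypothetical representation: a mention is a (start_offset, end_offset) pair.
Span = Tuple[int, int]


def percentage_agreement(a: Set[Span], b: Set[Span]) -> float:
    """Exact-match spans shared by both annotators, as a percentage of all distinct spans."""
    union = a | b
    if not union:
        return 100.0  # nothing annotated by either side counts as full agreement
    return 100.0 * len(a & b) / len(union)


# Toy data: two spans match exactly, one differs in its boundary, one is extra.
annotator_a = {(0, 7), (15, 22), (30, 41)}
annotator_b = {(0, 7), (15, 22), (30, 38), (50, 55)}
agreement = percentage_agreement(annotator_a, annotator_b)
print(f"{agreement:.1f}%")
```

Exact-match scoring is strict: the boundary disagreement on the third mention counts as two different spans, so looser criteria such as overlap matching would report a higher figure on the same data.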

Affiliation(s)

Martin Krallinger Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, Madrid, Spain
Obdulia Rabal Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Pamplona, Spain
Florian Leitner Computational Intelligence Group, Department of Artificial Intelligence, Universidad Politecnica de Madrid, Madrid, Spain
Miguel Vazquez Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, Madrid, Spain
David Salgado Faculte de Medecine La Timone, Marseille, Marseille, France
Zhiyong Lu National Center for Biotechnology Information (NCBI), National Institutes of Health, Bethesda, USA
Robert Leaman National Center for Biotechnology Information (NCBI), National Institutes of Health, Bethesda, USA
Yanan Lu Natural Language Processing Lab, Wuhan University, Wuhan, Hubei, PR China
Donghong Ji Natural Language Processing Lab, Wuhan University, Wuhan, Hubei, PR China
Daniel M Lowe NextMove Software Ltd, Innovation Centre, Unit 23, Science Park, Milton Road, Cambridge, UK
Roger A Sayle NextMove Software Ltd, Innovation Centre, Unit 23, Science Park, Milton Road, Cambridge, UK
Riza Theresa Batista-Navarro National Centre for Text Mining, Manchester Institute of Biotechnology, Manchester, UK
Rafal Rak National Centre for Text Mining, Manchester Institute of Biotechnology, Manchester, UK
Torsten Huber Humboldt-Universität zu Berlin, Knowledge Management in Bioinformatics, Berlin, Germany
Tim Rocktäschel Department of Computer Science, University College London, London, UK
Sérgio Matos IEETA/DETI, University of Aveiro, Campus Universitario de Santiago, Aveiro, Portugal
David Campos IEETA/DETI, University of Aveiro, Campus Universitario de Santiago, Aveiro, Portugal
Buzhou Tang Department of Computer Science, Harbin Institute of Technology, Shenzhen Graduate School, Shenzhen, Guangdong, PR China
Hua Xu School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, USA
Tsendsuren Munkhdalai Database/Bioinformatics Laboratory, School of Electrical and Computer Engineering, Chungbuk National University, Cheongju, South Korea
Keun Ho Ryu Database/Bioinformatics Laboratory, School of Electrical and Computer Engineering, Chungbuk National University, Cheongju, South Korea
SV Ramanan RelAgent Pvt Ltd, IIT Madras Research Park, Taramani, Chennai, India
Senthil Nathan RelAgent Pvt Ltd, IIT Madras Research Park, Taramani, Chennai, India
Slavko Žitnik Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
Marko Bajec Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
Lutz Weber OntoChem GmbH, Halle, Germany
Matthias Irmer OntoChem GmbH, Halle, Germany
Saber A Akhondi Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
Jan A Kors Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
Shuo Xu Information Technology Supporting Center, Institute of Scientific and Technical Information of China, Beijing, PR China
Xin An School of Economics and Management, Beijing Forestry University, Beijing, PR China
Utpal Kumar Sikdar Department of Computer Science and Engineering, Indian Institute of Technology, Patna, Bihar, India
Asif Ekbal Department of Computer Science and Engineering, Indian Institute of Technology, Patna, Bihar, India
Masaharu Yoshioka Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan
Thaer M Dieb Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan
Miji Choi Department of Computing and Information Systems, University of Melbourne, Melbourne, Australia
Karin Verspoor Department of Computing and Information Systems, University of Melbourne, Melbourne, Australia National ICT Australia Victoria Research Laboratory, West Melbourne, Australia
Madian Khabsa Computer Science and Engineering, The Pennsylvania State University, Pennsylvania, USA
C Lee Giles Computer Science and Engineering, The Pennsylvania State University, Pennsylvania, USA Information Sciences and Technology, The Pennsylvania State University, Pennsylvania, USA
Hongfang Liu Department of Health Sciences Research, Mayo College of Medicine, Rochester, USA
Komandur Elayavilli Ravikumar Department of Health Sciences Research, Mayo College of Medicine, Rochester, USA
Andre Lamurias LaSIGE, Department of Informatics, Faculty of Sciences, University of Lisbon, Lisbon, Portugal
Francisco M Couto LaSIGE, Department of Informatics, Faculty of Sciences, University of Lisbon, Lisbon, Portugal
Hong-Jie Dai Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan
Richard Tzong-Han Tsai Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
Caglar Ata Department of Computer Engineering, Middle East Technical University, Ankara, Turkey
Tolga Can Department of Computer Engineering, Middle East Technical University, Ankara, Turkey
Anabel Usié Departament Ciències Mèdiques Bàsiques, Universitat de Lleida, Lleida, Spain Departament d'Informatica i Enginyeria Industrial, Universitat de Lleida, Lleida, Spain
Rui Alves Departament Ciències Mèdiques Bàsiques, Universitat de Lleida, Lleida, Spain
Isabel Segura-Bedmar Computer Science Department, Universidad Carlos III de Madrid, Madrid, Spain
Paloma Martínez Computer Science Department, Universidad Carlos III de Madrid, Madrid, Spain
Julen Oyarzabal Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Pamplona, Spain
Alfonso Valencia Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, Madrid, Spain

Campos D, Matos S, Oliveira JL. A document processing pipeline for annotating chemical entities in scientific documents. J Cheminform 2015;7:S7. [PMID: 25810778 PMCID: PMC4331697 DOI: 10.1186/1758-2946-7-s1-s7] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Indexed: 11/29/2022] Open

Munkhdalai T, Li M, Batsuren K, Park HA, Choi NH, Ryu KH. Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations. J Cheminform 2015;7:S9. [PMID: 25810780 PMCID: PMC4331699 DOI: 10.1186/1758-2946-7-s1-s9] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Indexed: 11/30/2022] Open

Abstract

Background

Chemical and biomedical Named Entity Recognition (NER) is an essential prerequisite task before effective text mining can begin for biochemical-text data. Exploiting unlabeled text data to improve system performance has been an active and challenging research topic in text mining due to the recent growth in the amount of biomedical literature.

We present a semi-supervised learning method that efficiently exploits unlabeled data in order to incorporate domain knowledge into a named entity recognition model and to improve system performance. The proposed method includes Natural Language Processing (NLP) tasks for text preprocessing, learning word representation features from a large amount of text data for feature extraction, and conditional random fields for token classification. Other than the free text in the domain, the proposed method does not rely on any lexicon or dictionary, in order to keep the system applicable to other NER tasks in bio-text data.

Results

We extended BANNER, a biomedical NER system, with the proposed method. This yields an integrated system that can be applied to chemical and drug NER or biomedical NER. We call our branch of the BANNER system BANNER-CHEMDNER, which is scalable over millions of documents, processing about 530 documents per minute, is configurable via XML, and can be plugged into other systems by using the BANNER Unstructured Information Management Architecture (UIMA) interface.

BANNER-CHEMDNER achieved an 85.68% and an 86.47% F-measure on the testing sets of the CHEMDNER Chemical Entity Mention (CEM) and Chemical Document Indexing (CDI) subtasks, respectively, and achieved an 87.04% F-measure on the official testing set of the BioCreative II gene mention task, showing remarkable performance in both chemical and biomedical NER. The BANNER-CHEMDNER system is available at: https://bitbucket.org/tsendeemts/banner-chemdner.

Rak R, Batista-Navarro RT, Rowley A, Carter J, Ananiadou S. Text-mining-assisted biocuration workflows in Argo. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2014;2014:bau070. [PMID: 25037308 PMCID: PMC4103424 DOI: 10.1093/database/bau070] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Indexed: 12/22/2022]

Abstract

Biocuration activities have been broadly categorized into the selection of relevant documents, the annotation of biological concepts of interest and the identification of interactions between the concepts. Text mining has been shown to have the potential to significantly reduce the effort of biocurators in all three activities, and various semi-automatic methodologies have been integrated into curation pipelines to support them. We investigate the suitability of Argo, a workbench for building text-mining solutions with the use of a rich graphical user interface, for the process of biocuration. Central to Argo are customizable workflows that users compose by arranging available elementary analytics to form task-specific processing units. A built-in manual annotation editor is the single most used biocuration tool of the workbench, as it allows users to create annotations directly in text, as well as modify or delete annotations created by automatic processing components. Apart from syntactic and semantic analytics, the ever-growing library of components includes several data readers and consumers that support well-established as well as emerging data interchange formats such as XMI, RDF and BioC, which facilitate the interoperability of Argo with other platforms or resources. To validate the suitability of Argo for curation activities, we participated in the BioCreative IV challenge whose purpose was to evaluate Web-based systems addressing user-defined biocuration tasks. Argo proved to have the edge over other systems in terms of flexibility of defining biocuration tasks. As expected, the versatility of the workbench inevitably lengthened the time the curators spent on learning the system before taking on the task, which may have affected the usability of Argo. The participation in the challenge gave us an opportunity to gather valuable feedback and identify areas of improvement, some of which have already been introduced. Database URL: http://argo.nactem.ac.uk.

He L, Yang Z, Lin H, Li Y. Drug name recognition in biomedical texts: a machine-learning-based method. Drug Discov Today 2014;19:610-7. [DOI: 10.1016/j.drudis.2013.10.006] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 04/14/2013] [Revised: 09/01/2013] [Accepted: 10/08/2013] [Indexed: 10/26/2022]

Eltyeb S, Salim N. Chemical named entities recognition: a review on approaches and applications. J Cheminform 2014;6:17. [PMID: 24834132 PMCID: PMC4022577 DOI: 10.1186/1758-2946-6-17] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.7] [Received: 11/15/2013] [Accepted: 03/25/2014] [Indexed: 12/03/2022] Open

Davis AP, Wiegers TC, Johnson RJ, Lay JM, Lennon-Hopkins K, Saraceni-Richards C, Sciaky D, Murphy CG, Mattingly CJ. Text mining effectively scores and ranks the literature for improving chemical-gene-disease curation at the comparative toxicogenomics database. PLoS One 2013;8:e58201. [PMID: 23613709 PMCID: PMC3629079 DOI: 10.1371/journal.pone.0058201] [Citation(s) in RCA: 54] [Impact Index Per Article: 4.9] [Received: 10/26/2012] [Accepted: 01/31/2013] [Indexed: 11/30/2022] Open

Tharatipyakul A, Numnark S, Wichadakul D, Ingsriswang S. ChemEx: information extraction system for chemical data curation. BMC Bioinformatics 2012;13 Suppl 17:S9. [PMID: 23282330 PMCID: PMC3521388 DOI: 10.1186/1471-2105-13-s17-s9] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Indexed: 11/20/2022] Open

Wiegers TC, Davis AP, Mattingly CJ. Collaborative biocuration--text-mining development task for document prioritization for curation. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2012. [PMID: 23180769 PMCID: PMC3504477 DOI: 10.1093/database/bas037] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Indexed: 01/19/2023]

Hahn U, Cohen KB, Garten Y, Shah NH. Mining the pharmacogenomics literature--a survey of the state of the art. Brief Bioinform 2012;13:460-94. [PMID: 22833496 PMCID: PMC3404399 DOI: 10.1093/bib/bbs018] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Received: 11/18/2011] [Accepted: 03/23/2012] [Indexed: 01/05/2023] Open

Rocktäschel T, Weidlich M, Leser U. ChemSpot: a hybrid system for chemical named entity recognition. Bioinformatics 2012;28:1633-40. [PMID: 22500000 DOI: 10.1093/bioinformatics/bts183] [Citation(s) in RCA: 172] [Impact Index Per Article: 14.3] [Indexed: 11/13/2022]

Grego T, Pesquita C, Bastos HP, Couto FM. Chemical Entity Recognition and Resolution to ChEBI. ISRN BIOINFORMATICS 2012;2012:619427. [PMID: 25937941 PMCID: PMC4393067 DOI: 10.5402/2012/619427] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Received: 10/17/2011] [Accepted: 11/23/2011] [Indexed: 11/23/2022]

Jessop DM, Adams SE, Willighagen EL, Hawizy L, Murray-Rust P. OSCAR4: a flexible architecture for chemical text-mining. J Cheminform 2011;3:41. [PMID: 21999457 PMCID: PMC3205045 DOI: 10.1186/1758-2946-3-41] [Citation(s) in RCA: 89] [Impact Index Per Article: 6.8] [Received: 06/23/2011] [Accepted: 10/14/2011] [Indexed: 11/10/2022] Open

Mining chemical information from open patents. J Cheminform 2011;3:40. [PMID: 21999425 PMCID: PMC3205044 DOI: 10.1186/1758-2946-3-40] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Received: 06/28/2011] [Accepted: 10/14/2011] [Indexed: 11/24/2022] Open

Vazquez M, Krallinger M, Leitner F, Valencia A. Text Mining for Drugs and Chemical Compounds: Methods, Tools and Applications. Mol Inform 2011;30:506-19. [PMID: 27467152 DOI: 10.1002/minf.201100005] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.3] [Received: 01/10/2011] [Accepted: 06/07/2011] [Indexed: 11/10/2022]

Using workflows to explore and optimise named entity recognition for chemistry. PLoS One 2011;6:e20181. [PMID: 21633495 PMCID: PMC3102085 DOI: 10.1371/journal.pone.0020181] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 09/22/2010] [Accepted: 04/27/2011] [Indexed: 11/30/2022] Open

Hawizy L, Jessop DM, Adams N, Murray-Rust P. ChemicalTagger: A tool for semantic text-mining in chemistry. J Cheminform 2011;3:17. [PMID: 21575201 PMCID: PMC3117806 DOI: 10.1186/1758-2946-3-17] [Citation(s) in RCA: 69] [Impact Index Per Article: 5.3] [Received: 11/29/2010] [Accepted: 05/16/2011] [Indexed: 11/10/2022] Open

Sun B, Mitra P, Lee Giles C, Mueller KT. Identifying, Indexing, and Ranking Chemical Formulae and Chemical Names in Digital Documents. ACM T INFORM SYST 2011. [DOI: 10.1145/1961209.1961215] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 10/18/2022]

A systematic review of named entity recognition in biomedical texts. JOURNAL OF THE BRAZILIAN COMPUTER SOCIETY 2011. [DOI: 10.1007/s13173-011-0031-9] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Indexed: 11/26/2022]

Nobata C, Dobson PD, Iqbal SA, Mendes P, Tsujii J, Kell DB, Ananiadou S. Mining metabolites: extracting the yeast metabolome from the literature. Metabolomics 2011;7:94-101. [PMID: 21687783 PMCID: PMC3111869 DOI: 10.1007/s11306-010-0251-6] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.4] [Received: 06/18/2010] [Accepted: 10/12/2010] [Indexed: 12/01/2022]

Hettne KM, Williams AJ, van Mulligen EM, Kleinjans J, Tkachenko V, Kors JA. Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining. J Cheminform 2010;2:3. [PMID: 20331846 PMCID: PMC2848622 DOI: 10.1186/1758-2946-2-3] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Received: 11/23/2009] [Accepted: 03/23/2010] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text based on a number of publicly available databases and tested it on an annotated corpus. To achieve an acceptable recall and precision we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, it remained to be investigated what impact extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships.

RESULTS

We acquired the component of ChemSpider containing only manually curated names and synonyms. Rule-based term filtering, semi-automatic manual curation, and disambiguation rules were applied. We tested the dictionary from ChemSpider on an annotated corpus and compared the results with those for the Chemlist dictionary. The ChemSpider dictionary of ca. 80 k names was only one third to one quarter the size of Chemlist, at around 300 k. The ChemSpider dictionary had a precision of 0.43 and a recall of 0.19 before the application of filtering and disambiguation and a precision of 0.87 and a recall of 0.19 after filtering and disambiguation. The Chemlist dictionary had a precision of 0.20 and a recall of 0.47 before the application of filtering and disambiguation and a precision of 0.67 and a recall of 0.40 after filtering and disambiguation.

CONCLUSIONS

We conclude the following: (1) The ChemSpider dictionary achieved the best precision but the Chemlist dictionary had a higher recall and the best F-score; (2) Rule-based filtering and disambiguation are necessary to achieve a high precision for both the automatically generated and the manually curated dictionary. ChemSpider is available as a web service at http://www.chemspider.com/ and the Chemlist dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web at http://www.biosemantics.org/chemlist.
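
The abstract reports precision and recall but not the F-score values behind conclusion (1). As a rough check, the sketch below recomputes them from the post-filtering figures quoted above. It assumes the balanced F-score (F1, the harmonic mean of precision and recall), which the abstract does not specify; the function name `f_score` and the dictionary layout are illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall.

    Assumption: the abstract says only "F-score"; F1 is used here.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) after filtering and disambiguation, as quoted in the abstract.
reported = {
    "ChemSpider": (0.87, 0.19),
    "Chemlist": (0.67, 0.40),
}

scores = {name: f_score(p, r) for name, (p, r) in reported.items()}

for name, score in scores.items():
    print(f"{name}: F1 = {score:.2f}")
# ChemSpider: F1 = 0.31
# Chemlist: F1 = 0.50
```

Under this assumption Chemlist comes out ahead (about 0.50 against 0.31), which agrees with the stated conclusion: ChemSpider's higher precision does not compensate for its much lower recall.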

Downing J, Harvey MJ, Morgan PB, Murray-Rust P, Rzepa HS, Stewart DC, Tonge AP, Townsend JA. SPECTRa-T: Machine-Based Data Extraction and Semantic Searching of Chemistry e-Theses. J Chem Inf Model 2010;50:251-61. [DOI: 10.1021/ci9003688] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/28/2022]

Affiliation(s)

Jim Downing Unilever Centre for Molecular Informatics, Department of Chemistry, Lensfield Rd., Cambridge CB2 1EW, U.K., Cambridge University Library, West Rd., Cambridge CB3 9DR, U.K., and Department of Chemistry and High Performance Computing Unit, ICT, Imperial College London, Exhibition Rd., London SW7 2AZ, U.K
Matt J. Harvey Unilever Centre for Molecular Informatics, Department of Chemistry, Lensfield Rd., Cambridge CB2 1EW, U.K., Cambridge University Library, West Rd., Cambridge CB3 9DR, U.K., and Department of Chemistry and High Performance Computing Unit, ICT, Imperial College London, Exhibition Rd., London SW7 2AZ, U.K
Peter B. Morgan Unilever Centre for Molecular Informatics, Department of Chemistry, Lensfield Rd., Cambridge CB2 1EW, U.K., Cambridge University Library, West Rd., Cambridge CB3 9DR, U.K., and Department of Chemistry and High Performance Computing Unit, ICT, Imperial College London, Exhibition Rd., London SW7 2AZ, U.K
Peter Murray-Rust Unilever Centre for Molecular Informatics, Department of Chemistry, Lensfield Rd., Cambridge CB2 1EW, U.K., Cambridge University Library, West Rd., Cambridge CB3 9DR, U.K., and Department of Chemistry and High Performance Computing Unit, ICT, Imperial College London, Exhibition Rd., London SW7 2AZ, U.K
Henry S. Rzepa Unilever Centre for Molecular Informatics, Department of Chemistry, Lensfield Rd., Cambridge CB2 1EW, U.K., Cambridge University Library, West Rd., Cambridge CB3 9DR, U.K., and Department of Chemistry and High Performance Computing Unit, ICT, Imperial College London, Exhibition Rd., London SW7 2AZ, U.K
Diana C. Stewart Unilever Centre for Molecular Informatics, Department of Chemistry, Lensfield Rd., Cambridge CB2 1EW, U.K., Cambridge University Library, West Rd., Cambridge CB3 9DR, U.K., and Department of Chemistry and High Performance Computing Unit, ICT, Imperial College London, Exhibition Rd., London SW7 2AZ, U.K
Alan P. Tonge Unilever Centre for Molecular Informatics, Department of Chemistry, Lensfield Rd., Cambridge CB2 1EW, U.K., Cambridge University Library, West Rd., Cambridge CB3 9DR, U.K., and Department of Chemistry and High Performance Computing Unit, ICT, Imperial College London, Exhibition Rd., London SW7 2AZ, U.K
Joe A. Townsend Unilever Centre for Molecular Informatics, Department of Chemistry, Lensfield Rd., Cambridge CB2 1EW, U.K., Cambridge University Library, West Rd., Cambridge CB3 9DR, U.K., and Department of Chemistry and High Performance Computing Unit, ICT, Imperial College London, Exhibition Rd., London SW7 2AZ, U.K

Wiegers TC, Davis AP, Cohen KB, Hirschman L, Mattingly CJ. Text mining and manual curation of chemical-gene-disease networks for the comparative toxicogenomics database (CTD). BMC Bioinformatics 2009;10:326. [PMID: 19814812 PMCID: PMC2768719 DOI: 10.1186/1471-2105-10-326] [Citation(s) in RCA: 91] [Impact Index Per Article: 6.1] [Received: 05/26/2009] [Accepted: 10/08/2009] [Indexed: 11/11/2022] Open

A dictionary to identify small molecules and drugs in free text. Bioinformatics 2009;25:2983-91. [DOI: 10.1093/bioinformatics/btp535] [Citation(s) in RCA: 102] [Impact Index Per Article: 6.8] [Indexed: 11/15/2022] Open

Grego T, Pęzik P, Couto FM, Rebholz-Schuhmann D. Identification of Chemical Entities in Patent Documents. DISTRIBUTED COMPUTING, ARTIFICIAL INTELLIGENCE, BIOINFORMATICS, SOFT COMPUTING, AND AMBIENT ASSISTED LIVING 2009. [DOI: 10.1007/978-3-642-02481-8_144] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 01/15/2023]

Demner-Fushman D, Ananiadou S, Cohen KB, Pestian J, Tsujii J, Webber B. Themes in biomedical natural language processing: BioNLP08. BMC Bioinformatics 2008;9 Suppl 11:S1. [PMID: 19025685 PMCID: PMC2586759 DOI: 10.1186/1471-2105-9-s11-s1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/23/2022] Open