1
|
Ivanisenko TV, Demenkov PS, Kolchanov NA, Ivanisenko VA. The New Version of the ANDDigest Tool with Improved AI-Based Short Names Recognition. Int J Mol Sci 2022; 23:ijms232314934. [PMID: 36499269 PMCID: PMC9738852 DOI: 10.3390/ijms232314934] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2022] [Revised: 11/19/2022] [Accepted: 11/22/2022] [Indexed: 12/05/2022] Open
Abstract
The body of scientific literature continues to grow annually. Over 1.5 million abstracts of biomedical publications were added to the PubMed database in 2021. Therefore, developing cognitive systems that provide a specialized search for information in scientific publications based on subject area ontology and modern artificial intelligence methods is urgently needed. We previously developed a web-based information retrieval system, ANDDigest, designed to search and analyze information in the PubMed database using a customized domain ontology. This paper presents an improved ANDDigest version that uses fine-tuned PubMedBERT classifiers to enhance the quality of short name recognition for molecular-genetics entities in PubMed abstracts on eight biological object types: cell components, diseases, side effects, genes, proteins, pathways, drugs, and metabolites. This approach increased average short name recognition accuracy by 13%.
Collapse
Affiliation(s)
- Timofey V. Ivanisenko
- Kurchatov Genomics Center, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Correspondence:
| | - Pavel S. Demenkov
- Kurchatov Genomics Center, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
| | - Nikolay A. Kolchanov
- Kurchatov Genomics Center, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Faculty of Natural Sciences, Novosibirsk State University, St. Pirogova 1, Novosibirsk 630090, Russia
| | - Vladimir A. Ivanisenko
- Kurchatov Genomics Center, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Faculty of Natural Sciences, Novosibirsk State University, St. Pirogova 1, Novosibirsk 630090, Russia
| |
Collapse
|
2
|
Turki H, Hadj Taieb MA, Ben Aouicha M. Enhancing filter-based parenthetic abbreviation extraction methods. J Am Med Inform Assoc 2021; 28:668-669. [PMID: 33355359 DOI: 10.1093/jamia/ocaa314] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2020] [Accepted: 11/24/2020] [Indexed: 11/12/2022] Open
Abstract
This letter discusses the limitations of the use of filters to enhance the accuracy of the extraction of parenthetic abbreviations from scholarly publications and proposes the usage of the parentheses level count algorithm to efficiently extract entities between parentheses from raw texts as well as of machine learning-based supervised classification techniques for the identification of biomedical abbreviations to significantly reduce the removal of acronyms including disallowed punctuations.
Collapse
Affiliation(s)
- Houcemeddine Turki
- Faculty of Medicine of Sfax, University of Sfax, Tunisia.,Data Engineering and Semantics Research Unit, Faculty of Sciences of Sfax, University of Sfax, Tunisia
| | - Mohamed Ali Hadj Taieb
- Data Engineering and Semantics Research Unit, Faculty of Sciences of Sfax, University of Sfax, Tunisia
| | - Mohamed Ben Aouicha
- Data Engineering and Semantics Research Unit, Faculty of Sciences of Sfax, University of Sfax, Tunisia
| |
Collapse
|
3
|
Shardlow M, Ju M, Li M, O'Reilly C, Iavarone E, McNaught J, Ananiadou S. A Text Mining Pipeline Using Active and Deep Learning Aimed at Curating Information in Computational Neuroscience. Neuroinformatics 2020; 17:391-406. [PMID: 30443819 PMCID: PMC6594987 DOI: 10.1007/s12021-018-9404-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022]
Abstract
The curation of neuroscience entities is crucial to ongoing efforts in neuroinformatics and computational neuroscience, such as those being deployed in the context of continuing large-scale brain modelling projects. However, manually sifting through thousands of articles for new information about modelled entities is a painstaking and low-reward task. Text mining can be used to help a curator extract relevant information from this literature in a systematic way. We propose the application of text mining methods for the neuroscience literature. Specifically, two computational neuroscientists annotated a corpus of entities pertinent to neuroscience using active learning techniques to enable swift, targeted annotation. We then trained machine learning models to recognise the entities that have been identified. The entities covered are Neuron Types, Brain Regions, Experimental Values, Units, Ion Currents, Channels, and Conductances and Model organisms. We tested a traditional rule-based approach, a conditional random field and a model using deep learning named entity recognition, finding that the deep learning model was superior. Our final results show that we can detect a range of named entities of interest to the neuroscientist with a macro average precision, recall and F1 score of 0.866, 0.817 and 0.837 respectively. The contributions of this work are as follows: 1) We provide a set of Named Entity Recognition (NER) tools that are capable of detecting neuroscience entities with performance above or similar to prior work. 2) We propose a methodology for training NER tools for neuroscience that requires very little training data to get strong performance. This can be adapted for any sub-domain within neuroscience. 3) We provide a small corpus with annotations for multiple entity types, as well as annotation guidelines to help others reproduce our experiments.
Collapse
Affiliation(s)
- Matthew Shardlow
- The National Centre for Text Mining, School of Computer Science, University of Manchester, Road, M13 9PL, Oxford, UK
| | - Meizhi Ju
- The National Centre for Text Mining, School of Computer Science, University of Manchester, Road, M13 9PL, Oxford, UK
| | - Maolin Li
- The National Centre for Text Mining, School of Computer Science, University of Manchester, Road, M13 9PL, Oxford, UK
| | - Christian O'Reilly
- Blue Brain Project, EPFL, Campus Biotech, Ch. des Mines 9, CH-1202, Geneva, Switzerland
| | - Elisabetta Iavarone
- Blue Brain Project, EPFL, Campus Biotech, Ch. des Mines 9, CH-1202, Geneva, Switzerland
| | - John McNaught
- The National Centre for Text Mining, School of Computer Science, University of Manchester, Road, M13 9PL, Oxford, UK
| | - Sophia Ananiadou
- The National Centre for Text Mining, School of Computer Science, University of Manchester, Road, M13 9PL, Oxford, UK.
| |
Collapse
|
4
|
Steppi A, Gyori BM, Bachman JA. Adeft: Acromine-based Disambiguation of Entities from Text with applications to the biomedical literature. JOURNAL OF OPEN SOURCE SOFTWARE 2020; 5:1708. [PMID: 32337477 PMCID: PMC7182313 DOI: 10.21105/joss.01708] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/16/2023]
Affiliation(s)
- Albert Steppi
- Laboratory of Systems Pharmacology, Harvard Medical School
| | | | - John A Bachman
- Laboratory of Systems Pharmacology, Harvard Medical School
| |
Collapse
|
5
|
Bachman JA, Gyori BM, Sorger PK. FamPlex: a resource for entity recognition and relationship resolution of human protein families and complexes in biomedical text mining. BMC Bioinformatics 2018; 19:248. [PMID: 29954318 PMCID: PMC6022344 DOI: 10.1186/s12859-018-2211-5] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2017] [Accepted: 05/17/2018] [Indexed: 11/29/2022] Open
Abstract
Background For automated reading of scientific publications to extract useful information about molecular mechanisms it is critical that genes, proteins and other entities be correctly associated with uniform identifiers, a process known as named entity linking or “grounding.” Correct grounding is essential for resolving relationships among mined information, curated interaction databases, and biological datasets. The accuracy of this process is largely dependent on the availability of machine-readable resources associating synonyms and abbreviations commonly found in biomedical literature with uniform identifiers. Results In a task involving automated reading of ∼215,000 articles using the REACH event extraction software we found that grounding was disproportionately inaccurate for multi-protein families (e.g., “AKT”) and complexes with multiple subunits (e.g.“NF- κB”). To address this problem we constructed FamPlex, a manually curated resource defining protein families and complexes as they are commonly encountered in biomedical text. In FamPlex the gene-level constituents of families and complexes are defined in a flexible format allowing for multi-level, hierarchical membership. To create FamPlex, text strings corresponding to entities were identified empirically from literature and linked manually to uniform identifiers; these identifiers were also mapped to equivalent entries in multiple related databases. FamPlex also includes curated prefix and suffix patterns that improve named entity recognition and event extraction. Evaluation of REACH extractions on a test corpus of ∼54,000 articles showed that FamPlex significantly increased grounding accuracy for families and complexes (from 15 to 71%). The hierarchical organization of entities in FamPlex also made it possible to integrate otherwise unconnected mechanistic information across families, subfamilies, and individual proteins. Applications of FamPlex to the TRIPS/DRUM reading system and the Biocreative VI Bioentity Normalization Task dataset demonstrated the utility of FamPlex in other settings. Conclusion FamPlex is an effective resource for improving named entity recognition, grounding, and relationship resolution in automated reading of biomedical text. The content in FamPlex is available in both tabular and Open Biomedical Ontology formats at https://github.com/sorgerlab/famplex
under the Creative Commons CC0 license and has been integrated into the TRIPS/DRUM and REACH reading systems.
Collapse
Affiliation(s)
- John A Bachman
- Laboratory of Systems Pharmacology, Harvard Medical School, 200 Longwood Ave, Boston, MA, 02115, USA
| | - Benjamin M Gyori
- Laboratory of Systems Pharmacology, Harvard Medical School, 200 Longwood Ave, Boston, MA, 02115, USA
| | - Peter K Sorger
- Laboratory of Systems Pharmacology, Harvard Medical School, 200 Longwood Ave, Boston, MA, 02115, USA.
| |
Collapse
|
6
|
Dai L, Fang R, Li H, Hou X, Sheng B, Wu Q, Jia W. Clinical Report Guided Retinal Microaneurysm Detection With Multi-Sieving Deep Learning. IEEE TRANSACTIONS ON MEDICAL IMAGING 2018; 37:1149-1161. [PMID: 29727278 DOI: 10.1109/tmi.2018.2794988] [Citation(s) in RCA: 37] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/11/2023]
Abstract
Timely detection and treatment of microaneurysms is a critical step to prevent the development of vision-threatening eye diseases such as diabetic retinopathy. However, detecting microaneurysms in fundus images is a highly challenging task due to the low image contrast, misleading cues of other red lesions, and the large variation of imaging conditions. Existing methods tend to fail in face of the large intra-class variation and small inter-class variations for microaneurysm detection in fundus images. Recently, hybrid text/image mining computer-aided diagnosis systems have emerged to offer a promise of bridging the semantic gap between images and diagnostic information. In this paper, we focus on developing an interleaved deep mining technique to cope intelligently with the unbalanced microaneurysm detection problem. Specifically, we present a clinical report guided multi-sieving convolutional neural network, which leverages a small amount of supervised information in clinical reports to identify the potential microaneurysm regions via the image-to-text mapping in the feature space. These potential microaneurysm regions are then interleaved with fundus image information for multi-sieving deep mining in a highly unbalanced classification problem. Critically, the clinical reports are employed to bridge the semantic gap between low-level image features and high-level diagnostic information. We build an efficient microaneurysm detection framework based on the hybrid text/image interleaving and validate its performance on challenging clinical data sets acquired from diabetic retinopathy patients. Extensive evaluations are carried out in terms of fundus detection and classification. Experimental results show that our framework achieves 99.7% precision and 87.8% recall, comparing favorably with the state-of-the-art algorithms. Integration of expert domain knowledge and image information demonstrates the feasibility of reducing the difficulty of training classifiers under extremely unbalanced data distributions.
Collapse
|
7
|
Dai HJ, Singh O, Jonnagaddala J, Su ECY. NTTMUNSW BioC modules for recognizing and normalizing species and gene/protein mentions. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw111. [PMID: 27465130 PMCID: PMC4962763 DOI: 10.1093/database/baw111] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/03/2015] [Accepted: 07/05/2016] [Indexed: 11/13/2022]
Abstract
In recent years, the number of published biomedical articles has increased as researchers have focused on biological domains to investigate the functions of biological objects, such as genes and proteins. However, the ambiguous nature of genes and their products have rendered the literature more complex for readers and curators of molecular interaction databases. To address this challenge, a normalization technique that can link variants of biological objects to a single, standardized form was applied. In this work, we developed a species normalization module, which recognizes species names and normalizes them to NCBI Taxonomy IDs. Unlike most previous work, which ignored the prefix of a gene name that represents an abbreviation of the species name to which the gene belongs, the recognition results of our module include the prefixed species. The developed species normalization module achieved an overall F-score of 0.954 on an instance-level species normalization corpus. For gene normalization, two separate modules were respectively employed to recognize gene mentions and normalize those mentions to their Entrez Gene IDs by utilizing a multistage normalization algorithm developed for processing full-text articles. All of the developed modules are BioC-compatible .NET framework libraries and are publicly available from the NuGet gallery.Database URL: https://sites.google.com/site/hjdairesearch/Projects/isn-corpus.
Collapse
Affiliation(s)
- Hong-Jie Dai
- Department of Computer Science and Information Engineering, National Taitung University, Taitung, Taiwan Interdisciplinary Program of Green and Information Technology, National Taitung University, Taitung, Taiwan
| | - Onkar Singh
- Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan
| | - Jitendra Jonnagaddala
- School of Public Health and Community Medicine, the University of New South Wales, Sydney, Australia Prince of Wales Clinical School, the University of New South Wales, Australia
| | - Emily Chia-Yu Su
- Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan
| |
Collapse
|
8
|
Xie B, Ding Q, Wu D. Text Mining on Big and Complex Biomedical Literature. BIG DATA ANALYTICS IN BIOINFORMATICS AND HEALTHCARE 2015. [DOI: 10.4018/978-1-4666-6611-5.ch002] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/15/2022]
Abstract
Driven by the rapidly advancing techniques and increasing interests in biology and medicine, about 2,000 to 4,000 references are added daily to MEDLINE, the US national biomedical bibliographic database. Even for a specific research topic, extracting useful and comprehensive information out of the huge literature data pool is challenging. Text mining techniques become extremely useful when dealing with the abundant biomedical information and they have been applied to various areas in the realm of biomedical research. Instead of providing a brief overview of all text mining techniques and every major biomedical text mining application, this chapter explores in-depth the microRNA profiling area and related text mining tools. As an illustrative example, one rule-based text mining system developed by the authors is discussed in detail. This chapter also includes the discussion of the challenges and potential research areas in biomedical text mining.
Collapse
|
9
|
Kim S, Yoon J. Link-topic model for biomedical abbreviation disambiguation. J Biomed Inform 2014; 53:367-80. [PMID: 25554684 DOI: 10.1016/j.jbi.2014.12.013] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2014] [Revised: 12/19/2014] [Accepted: 12/20/2014] [Indexed: 10/24/2022]
Abstract
INTRODUCTION The ambiguity of biomedical abbreviations is one of the challenges in biomedical text mining systems. In particular, the handling of term variants and abbreviations without nearby definitions is a critical issue. In this study, we adopt the concepts of topic of document and word link to disambiguate biomedical abbreviations. METHODS We newly suggest the link topic model inspired by the latent Dirichlet allocation model, in which each document is perceived as a random mixture of topics, where each topic is characterized by a distribution over words. Thus, the most probable expansions with respect to abbreviations of a given abstract are determined by word-topic, document-topic, and word-link distributions estimated from a document collection through the link topic model. The model allows two distinct modes of word generation to incorporate semantic dependencies among words, particularly long form words of abbreviations and their sentential co-occurring words; a word can be generated either dependently on the long form of the abbreviation or independently. The semantic dependency between two words is defined as a link and a new random parameter for the link is assigned to each word as well as a topic parameter. Because the link status indicates whether the word constitutes a link with a given specific long form, it has the effect of determining whether a word forms a unigram or a skipping/consecutive bigram with respect to the long form. Furthermore, we place a constraint on the model so that a word has the same topic as a specific long form if it is generated in reference to the long form. Consequently, documents are generated from the two hidden parameters, i.e. topic and link, and the most probable expansion of a specific abbreviation is estimated from the parameters. RESULTS Our model relaxes the bag-of-words assumption of the standard topic model in which the word order is neglected, and it captures a richer structure of text than does the standard topic model by considering unigrams and semantically associated bigrams simultaneously. The addition of semantic links improves the disambiguation accuracy without removing irrelevant contextual words and reduces the parameter space of massive skipping or consecutive bigrams. The link topic model achieves 98.42% disambiguation accuracy on 73,505 MEDLINE abstracts with respect to 21 three letter abbreviations and their 139 distinct long forms.
Collapse
Affiliation(s)
- Seonho Kim
- Department of Computer Science, Sogang University, 35 Baekbeom-ro, Mapo-gu, Seoul, Republic of Korea.
| | - Juntae Yoon
- Daumsoft Inc., Hannam-dong 635-1, Yongsan-gu, Seoul, Republic of Korea.
| |
Collapse
|
10
|
Tripathy SJ, Savitskaya J, Burton SD, Urban NN, Gerkin RC. NeuroElectro: a window to the world's neuron electrophysiology data. Front Neuroinform 2014; 8:40. [PMID: 24808858 PMCID: PMC4010726 DOI: 10.3389/fninf.2014.00040] [Citation(s) in RCA: 65] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2013] [Accepted: 03/27/2014] [Indexed: 11/25/2022] Open
Abstract
The behavior of neural circuits is determined largely by the electrophysiological properties of the neurons they contain. Understanding the relationships of these properties requires the ability to first identify and catalog each property. However, information about such properties is largely locked away in decades of closed-access journal articles with heterogeneous conventions for reporting results, making it difficult to utilize the underlying data. We solve this problem through the NeuroElectro project: a Python library, RESTful API, and web application (at http://neuroelectro.org) for the extraction, visualization, and summarization of published data on neurons' electrophysiological properties. Information is organized both by neuron type (using neuron definitions provided by NeuroLex) and by electrophysiological property (using a newly developed ontology). We describe the techniques and challenges associated with the automated extraction of tabular electrophysiological data and methodological metadata from journal articles. We further discuss strategies for how to best combine, normalize and organize data across these heterogeneous sources. NeuroElectro is a valuable resource for experimental physiologists attempting to supplement their own data, for computational modelers looking to constrain their model parameters, and for theoreticians searching for undiscovered relationships among neurons and their properties.
Collapse
Affiliation(s)
- Shreejoy J. Tripathy
- Department of Biological Sciences, Carnegie Mellon UniversityPittsburgh, PA, USA
- Center for the Neural Basis of Cognition, Carnegie Mellon UniversityPittsburgh, PA, USA
| | - Judith Savitskaya
- Department of Biological Sciences, Carnegie Mellon UniversityPittsburgh, PA, USA
| | - Shawn D. Burton
- Department of Biological Sciences, Carnegie Mellon UniversityPittsburgh, PA, USA
- Center for the Neural Basis of Cognition, Carnegie Mellon UniversityPittsburgh, PA, USA
| | - Nathaniel N. Urban
- Department of Biological Sciences, Carnegie Mellon UniversityPittsburgh, PA, USA
- Center for the Neural Basis of Cognition, Carnegie Mellon UniversityPittsburgh, PA, USA
| | - Richard C. Gerkin
- Department of Biological Sciences, Carnegie Mellon UniversityPittsburgh, PA, USA
- Center for the Neural Basis of Cognition, Carnegie Mellon UniversityPittsburgh, PA, USA
| |
Collapse
|
11
|
Piedra D, Ferrer A, Gea J. Text mining and medicine: usefulness in respiratory diseases. Arch Bronconeumol 2014; 50:113-9. [PMID: 24507559 DOI: 10.1016/j.arbres.2013.04.009] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2013] [Revised: 04/12/2013] [Accepted: 04/18/2013] [Indexed: 12/24/2022]
Abstract
It is increasingly common to have medical information in electronic format. This includes scientific articles as well as clinical management reviews, and even records from health institutions with patient data. However, traditional instruments, both individual and institutional, are of little use for selecting the most appropriate information in each case, either in the clinical or research field. So-called text or data «mining» enables this huge amount of information to be managed, extracting it from various sources using processing systems (filtration and curation), integrating it and permitting the generation of new knowledge. This review aims to provide an overview of text and data mining, and of the potential usefulness of this bioinformatic technique in the exercise of care in respiratory medicine and in research in the same field.
Collapse
Affiliation(s)
- David Piedra
- Instituto de Investigación del Hospital del Mar (IMIM), Barcelona, España.
| | - Antoni Ferrer
- Instituto de Investigación del Hospital del Mar (IMIM), Barcelona, España; Servicio de Neumología, Hospital del Mar, Barcelona, España; Facultat de Ciències de la Salut i de la Vida, Universitat Pompeu Fabra, Barcelona, España; CIBERES, ISC III, Bunyola, Mallorca, España
| | - Joaquim Gea
- Instituto de Investigación del Hospital del Mar (IMIM), Barcelona, España; Servicio de Neumología, Hospital del Mar, Barcelona, España; Facultat de Ciències de la Salut i de la Vida, Universitat Pompeu Fabra, Barcelona, España; CIBERES, ISC III, Bunyola, Mallorca, España
| |
Collapse
|
12
|
Miwa M, Ohta T, Rak R, Rowley A, Kell DB, Pyysalo S, Ananiadou S. A method for integrating and ranking the evidence for biochemical pathways by mining reactions from text. Bioinformatics 2013; 29:i44-52. [PMID: 23813008 PMCID: PMC3694679 DOI: 10.1093/bioinformatics/btt227] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023] Open
Abstract
Motivation: To create, verify and maintain pathway models, curators must discover and assess knowledge distributed over the vast body of biological literature. Methods supporting these tasks must understand both the pathway model representations and the natural language in the literature. These methods should identify and order documents by relevance to any given pathway reaction. No existing system has addressed all aspects of this challenge. Method: We present novel methods for associating pathway model reactions with relevant publications. Our approach extracts the reactions directly from the models and then turns them into queries for three text mining-based MEDLINE literature search systems. These queries are executed, and the resulting documents are combined and ranked according to their relevance to the reactions of interest. We manually annotate document-reaction pairs with the relevance of the document to the reaction and use this annotation to study several ranking methods, using various heuristic and machine-learning approaches. Results: Our evaluation shows that the annotated document-reaction pairs can be used to create a rule-based document ranking system, and that machine learning can be used to rank documents by their relevance to pathway reactions. We find that a Support Vector Machine-based system outperforms several baselines and matches the performance of the rule-based system. The success of the query extraction and ranking methods are used to update our existing pathway search system, PathText. Availability: An online demonstration of PathText 2 and the annotated corpus are available for research purposes at http://www.nactem.ac.uk/pathtext2/. Contact:makoto.miwa@manchester.ac.uk Supplementary information:Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Makoto Miwa
- The National Centre for Text Mining and School of Computer Science and School of Chemistry and the Manchester Institute of Biotechnology, The University of Manchester, Manchester M1 7DN, UK.
| | | | | | | | | | | | | |
Collapse
|
13
|
Biomedical text mining and its applications in cancer research. J Biomed Inform 2013; 46:200-11. [DOI: 10.1016/j.jbi.2012.10.007] [Citation(s) in RCA: 159] [Impact Index Per Article: 14.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2012] [Revised: 10/30/2012] [Accepted: 10/30/2012] [Indexed: 11/21/2022]
|
14
|
Ananiadou S, Ohta T, Rutter MK. Text Mining Supporting Search for Knowledge Discovery in Diabetes. CURRENT CARDIOVASCULAR RISK REPORTS 2012. [DOI: 10.1007/s12170-012-0288-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
|
15
|
Shinohara EY, Aramaki E, Imai T, Miura Y, Tonoike M, Ohkuma T, Masuichi H, Ohe K. An easily implemented method for abbreviation expansion for the medical domain in Japanese text. A preliminary study. Methods Inf Med 2012; 52:51-61. [PMID: 23223786 DOI: 10.3414/me12-01-0040] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2012] [Accepted: 10/28/2012] [Indexed: 11/09/2022]
Abstract
BACKGROUND One of the barriers for the effective use of computerized health-care related text is the ambiguity of abbreviations. To date, the task of disambiguating abbreviations has been treated as a classification task based on surrounding words. Application of this framework for languages that have no word boundaries requires pre-processing to segment a sentence into separate word sequences. While the segmentation processing is often a source of problem, it is unknown whether word information is really requisite for abbreviation expansion. OBJECTIVES The present study examined and compared abbreviation expansion methods with and without the incorporation of word information as a preliminary study. METHODS We implemented two abbreviation expansion methods: 1) a morpheme-based method that relied on word information and therefore required pre-processing, and 2) a character-based method that relied on simple character information. We compared the expansion accuracies for these two methods using eight medical abbreviations. Experimental data were automatically built as a pseudo-annotated corpus using the Internet. RESULTS As a result of the experiment, accuracies for the character-based method were from 0.890 to 0.942 while accuracies for the morpheme-based method were from 0.796 to 0.932. The character-based method significantly outperformed the morpheme-based method for three of the eight abbreviations (p < 0.05). For the remaining five abbreviations, no significant differences were found between the two methods. CONCLUSIONS Character information may be a good alternative in terms of simplicity to morphological information for abbreviation expansion in English medical abbreviations appeared in Japanese texts on the Internet.
Collapse
Affiliation(s)
- E Y Shinohara
- Department of Planning, Information and Management, The University of Tokyo Hospital, Tokyo, Japan.
| | | | | | | | | | | | | | | |
Collapse
|
16
|
Akella LM, Norton CN, Miller H. NetiNeti: discovery of scientific names from text using machine learning methods. BMC Bioinformatics 2012; 13:211. [PMID: 22913485 PMCID: PMC3542245 DOI: 10.1186/1471-2105-13-211] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2010] [Accepted: 08/06/2012] [Indexed: 12/12/2022] Open
Abstract
Background A scientific name for an organism can be associated with almost all biological data. Name identification is an important step in many text mining tasks aiming to extract useful information from biological, biomedical and biodiversity text sources. A scientific name acts as an important metadata element to link biological information. Results We present NetiNeti (Name Extraction from Textual Information-Name Extraction for Taxonomic Indexing), a machine learning based approach for recognition of scientific names including the discovery of new species names from text that will also handle misspellings, OCR errors and other variations in names. The system generates candidate names using rules for scientific names and applies probabilistic machine learning methods to classify names based on structural features of candidate names and features derived from their contexts. NetiNeti can also disambiguate scientific names from other names using the contextual information. We evaluated NetiNeti on legacy biodiversity texts and biomedical literature (MEDLINE). NetiNeti performs better (precision = 98.9% and recall = 70.5%) compared to a popular dictionary based approach (precision = 97.5% and recall = 54.3%) on a 600-page biodiversity book that was manually marked by an annotator. On a small set of PubMed Central’s full text articles annotated with scientific names, the precision and recall values are 98.5% and 96.2% respectively. NetiNeti found more than 190,000 unique binomial and trinomial names in more than 1,880,000 PubMed records when used on the full MEDLINE database. NetiNeti also successfully identifies almost all of the new species names mentioned within web pages. Conclusions We present NetiNeti, a machine learning based approach for identification and discovery of scientific names. The system implementing the approach can be accessed at
http://namefinding.ubio.org.
Collapse
|
17
|
Yamaguchi A, Yamamoto Y, Kim JD, Takagi T, Yonezawa A. Discriminative application of string similarity methods to chemical and non-chemical names for biomedical abbreviation clustering. BMC Genomics 2012; 13 Suppl 3:S8. [PMID: 22759617 PMCID: PMC3394426 DOI: 10.1186/1471-2164-13-s3-s8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/01/2022] Open
Abstract
Background Term clustering, by measuring the string similarities between terms, is known within the natural language processing community to be an effective method for improving the quality of texts and dictionaries. However, we have observed that chemical names are difficult to cluster using string similarity measures. In order to clearly demonstrate this difficulty, we compared the string similarities determined using the edit distance, the Monge-Elkan score, SoftTFIDF, and the bigram Dice coefficient for chemical names with those for non-chemical names. Results Our experimental results revealed the following: (1) The edit distance had the best performance in the matching of full forms, whereas Cohen et al. reported that SoftTFIDF with the Jaro-Winkler distance would yield the best measure for matching pairs of terms for their experiments. (2) For each of the string similarity measures above, the best threshold for term matching differs for chemical names and for non-chemical names; the difference is especially large for the edit distance. (3) Although the matching results obtained for chemical names using the edit distance, Monge-Elkan scores, or the bigram Dice coefficients are better than the result obtained for non-chemical names, the results were contrary when using SoftTFIDF. (4) A suitable weight for chemical names varies substantially from one for non-chemical names. In particular, a weight vector that has been optimized for non-chemical names is not suitable for chemical names. (5) The matching results using the edit distances improve further by dividing a set of full forms into two subsets, according to whether a full form is a chemical name or not. These results show that our hypothesis is acceptable, and that we can significantly improve the performance of abbreviation-full form clustering by computing chemical names and non-chemical names separately. Conclusions In conclusion, the discriminative application of string similarity methods to chemical and non-chemical names may be a simple yet effective way to improve the performance of term clustering.
Collapse
|
18
|
Applications of natural language processing in biodiversity science. Adv Bioinformatics 2012; 2012:391574. [PMID: 22685456 PMCID: PMC3364545 DOI: 10.1155/2012/391574] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/04/2011] [Accepted: 02/15/2012] [Indexed: 12/11/2022] Open
Abstract
Centuries of biological knowledge are contained in the massive body of scientific literature, written for human-readability but too big for any one person to consume. Large-scale mining of information from the literature is necessary if biology is to transform into a data-driven science.
A computer can handle the volume but cannot make sense of the language. This paper reviews and discusses the use of natural language processing (NLP) and machine-learning algorithms to extract information from systematic literature. NLP algorithms have been used for decades, but require special development for application in the biological realm due to the special nature of the language. Many tools exist for biological information extraction (cellular processes, taxonomic names, and morphological characters), but none have been applied life wide and most still require testing and development. Progress has been made in developing algorithms for automated annotation of taxonomic text, identification of taxonomic names in text, and extraction of morphological character information from taxonomic descriptions. This manuscript will briefly discuss the key steps in applying information extraction tools to enhance biodiversity science.
Collapse
|
19
|
Sasaki Y, Wang X, Ananiadou S. EXTRACTING SECONDARY BIO-EVENT ARGUMENTS WITH EXTRACTION CONSTRAINTS. Comput Intell 2011. [DOI: 10.1111/j.1467-8640.2011.00406.x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
|
20
|
Yamamoto Y, Yamaguchi A, Bono H, Takagi T. Allie: a database and a search service of abbreviations and long forms. Database (Oxford) 2011; 2011:bar013. [PMID: 21498548 PMCID: PMC3077826 DOI: 10.1093/database/bar013] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/25/2010] [Revised: 03/25/2011] [Accepted: 03/28/2011] [Indexed: 11/17/2022]
Abstract
Many abbreviations are used in the literature especially in the life sciences, and polysemous abbreviations appear frequently, making it difficult to read and understand scientific papers that are outside of a reader's expertise. Thus, we have developed Allie, a database and a search service of abbreviations and their long forms (a.k.a. full forms or definitions). Allie searches for abbreviations and their corresponding long forms in a database that we have generated based on all titles and abstracts in MEDLINE. When a user query matches an abbreviation, Allie returns all potential long forms of the query along with their bibliographic data (i.e. title and publication year). In addition, for each candidate, co-occurring abbreviations and a research field in which it frequently appears in the MEDLINE data are displayed. This function helps users learn about the context in which an abbreviation appears. To deal with synonymous long forms, we use a dictionary called GENA that contains domain-specific terms such as gene, protein or disease names along with their synonymic information. Conceptually identical domain-specific terms are regarded as one term, and then conceptually identical abbreviation-long form pairs are grouped taking into account their appearance in MEDLINE. To keep up with new abbreviations that are continuously introduced, Allie has an automatic update system. In addition, the database of abbreviations and their long forms with their corresponding PubMed IDs is constructed and updated weekly. Database URL: The Allie service is available at http://allie.dbcls.jp/.
Collapse
|
21
|
Applications of text mining within systematic reviews. Res Synth Methods 2011; 2:1-14. [DOI: 10.1002/jrsm.27] [Citation(s) in RCA: 120] [Impact Index Per Article: 9.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2010] [Revised: 01/24/2011] [Accepted: 01/28/2011] [Indexed: 11/07/2022]
|
22
|
Ananiadou S, Sullivan D, Black W, Levow GA, Gillespie JJ, Mao C, Pyysalo S, Kolluru B, Tsujii J, Sobral B. Named entity recognition for bacterial Type IV secretion systems. PLoS One 2011; 6:e14780. [PMID: 21468321 PMCID: PMC3066171 DOI: 10.1371/journal.pone.0014780] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2010] [Accepted: 02/16/2011] [Indexed: 11/18/2022] Open
Abstract
Research on specialized biological systems is often hampered by a lack of consistent terminology, especially across species. In bacterial Type IV secretion systems genes within one set of orthologs may have over a dozen different names. Classifying research publications based on biological processes, cellular components, molecular functions, and microorganism species should improve the precision and recall of literature searches allowing researchers to keep up with the exponentially growing literature, through resources such as the Pathosystems Resource Integration Center (PATRIC, patricbrc.org). We developed named entity recognition (NER) tools for four entities related to Type IV secretion systems: 1) bacteria names, 2) biological processes, 3) molecular functions, and 4) cellular components. These four entities are important to pathogenesis and virulence research but have received less attention than other entities, e.g., genes and proteins. Based on an annotated corpus, large domain terminological resources, and machine learning techniques, we developed recognizers for these entities. High accuracy rates (>80%) are achieved for bacteria, biological processes, and molecular function. Contrastive experiments highlighted the effectiveness of alternate recognition strategies; results of term extraction on contrasting document sets demonstrated the utility of these classes for identifying T4SS-related documents.
Collapse
Affiliation(s)
- Sophia Ananiadou
- School of Computer Science, University of Manchester, Manchester, United Kingdom
- National Centre for Text Mining, Manchester Interdisciplinary Biocentre, University of Manchester, Manchester, United Kingdom
| | - Dan Sullivan
- Virginia Bioinformatics Institute at Virginia Tech, Blacksburg, Virginia, United States of America
| | - William Black
- National Centre for Text Mining, Manchester Interdisciplinary Biocentre, University of Manchester, Manchester, United Kingdom
| | - Gina-Anne Levow
- School of Computer Science, University of Manchester, Manchester, United Kingdom
| | - Joseph J. Gillespie
- Virginia Bioinformatics Institute at Virginia Tech, Blacksburg, Virginia, United States of America
| | - Chunhong Mao
- Virginia Bioinformatics Institute at Virginia Tech, Blacksburg, Virginia, United States of America
| | - Sampo Pyysalo
- Department of Computer Science, School of Information Science and Technology, University of Tokyo, Tokyo, Japan
| | - BalaKrishna Kolluru
- School of Computer Science, University of Manchester, Manchester, United Kingdom
- National Centre for Text Mining, Manchester Interdisciplinary Biocentre, University of Manchester, Manchester, United Kingdom
| | - Junichi Tsujii
- School of Computer Science, University of Manchester, Manchester, United Kingdom
- National Centre for Text Mining, Manchester Interdisciplinary Biocentre, University of Manchester, Manchester, United Kingdom
- Department of Computer Science, School of Information Science and Technology, University of Tokyo, Tokyo, Japan
| | - Bruno Sobral
- Virginia Bioinformatics Institute at Virginia Tech, Blacksburg, Virginia, United States of America
| |
Collapse
|
23
|
Kemper B, Matsuzaki T, Matsuoka Y, Tsuruoka Y, Kitano H, Ananiadou S, Tsujii J. PathText: a text mining integrator for biological pathway visualizations. Bioinformatics 2010; 26:i374-81. [PMID: 20529930 PMCID: PMC2881405 DOI: 10.1093/bioinformatics/btq221] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/03/2022] Open
Abstract
Motivation: Metabolic and signaling pathways are an increasingly important part of organizing knowledge in systems biology. They serve to integrate collective interpretations of facts scattered throughout literature. Biologists construct a pathway by reading a large number of articles and interpreting them as a consistent network, but most of the models constructed currently lack direct links to those articles. Biologists who want to check the original articles have to spend substantial amounts of time to collect relevant articles and identify the sections relevant to the pathway. Furthermore, with the scientific literature expanding by several thousand papers per week, keeping a model relevant requires a continuous curation effort. In this article, we present a system designed to integrate a pathway visualizer, text mining systems and annotation tools into a seamless environment. This will enable biologists to freely move between parts of a pathway and relevant sections of articles, as well as identify relevant papers from large text bases. The system, PathText, is developed by Systems Biology Institute, Okinawa Institute of Science and Technology, National Centre for Text Mining (University of Manchester) and the University of Tokyo, and is being used by groups of biologists from these locations. Contact:brian@monrovian.com.
Collapse
Affiliation(s)
- Brian Kemper
- Department of Computer Science, University of Tokyo, Tokyo, Japan.
| | | | | | | | | | | | | |
Collapse
|
24
|
Harmston N, Filsell W, Stumpf MPH. What the papers say: text mining for genomics and systems biology. Hum Genomics 2010; 5:17-29. [PMID: 21106487 PMCID: PMC3500154 DOI: 10.1186/1479-7364-5-1-17] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2010] [Accepted: 08/06/2010] [Indexed: 12/11/2022] Open
Abstract
Keeping up with the rapidly growing literature has become virtually impossible for most scientists. This can have dire consequences. First, we may waste research time and resources on reinventing the wheel simply because we can no longer maintain a reliable grasp on the published literature. Second, and perhaps more detrimental, judicious (or serendipitous) combination of knowledge from different scientific disciplines, which would require following disparate and distinct research literatures, is rapidly becoming impossible for even the most ardent readers of research publications. Text mining - the automated extraction of information from (electronically) published sources - could potentially fulfil an important role - but only if we know how to harness its strengths and overcome its weaknesses. As we do not expect that the rate at which scientific results are published will decrease, text mining tools are now becoming essential in order to cope with, and derive maximum benefit from, this information explosion. In genomics, this is particularly pressing as more and more rare disease-causing variants are found and need to be understood. Not being conversant with this technology may put scientists and biomedical regulators at a severe disadvantage. In this review, we introduce the basic concepts underlying modern text mining and its applications in genomics and systems biology. We hope that this review will serve three purposes: (i) to provide a timely and useful overview of the current status of this field, including a survey of present challenges; (ii) to enable researchers to decide how and when to apply text mining tools in their own research; and (iii) to highlight how the research communities in genomics and systems biology can help to make text mining from biomedical abstracts and texts more straightforward.
Collapse
Affiliation(s)
- Nathan Harmston
- Division of Molecular Biosciences, Centre for Bioinformatics, Imperial College London, 303, Wolfson Building, South Kensington Campus, London, SW7 2AZ, UK
| | - Wendy Filsell
- Unilever R&D, Colworth Science Park, Sharnbrook, Bedford MK44 1 LQ, UK
| | - Michael PH Stumpf
- Division of Molecular Biosciences, Centre for Bioinformatics, Imperial College London, 303, Wolfson Building, South Kensington Campus, London, SW7 2AZ, UK
| |
Collapse
|
25
|
Ananiadou S, Pyysalo S, Tsujii J, Kell DB. Event extraction for systems biology by text mining the literature. Trends Biotechnol 2010; 28:381-90. [PMID: 20570001 DOI: 10.1016/j.tibtech.2010.04.005] [Citation(s) in RCA: 140] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2010] [Revised: 04/20/2010] [Accepted: 04/26/2010] [Indexed: 01/08/2023]
Abstract
Systems biology recognizes in particular the importance of interactions between biological components and the consequences of these interactions. Such interactions and their downstream effects are known as events. To computationally mine the literature for such events, text mining methods that can detect, extract and annotate them are required. This review summarizes the methods that are currently available, with a specific focus on protein-protein interactions and pathway or network reconstruction. The approaches described will be of considerable value in associating particular pathways and their components with higher-order physiological properties, including disease states.
Collapse
|
26
|
Okazaki N, Ananiadou S, Tsujii J. Building a high-quality sense inventory for improved abbreviation disambiguation. Bioinformatics 2010; 26:1246-53. [PMID: 20360059 PMCID: PMC2859134 DOI: 10.1093/bioinformatics/btq129] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/04/2022] Open
Abstract
Motivation: The ultimate goal of abbreviation management is to disambiguate every occurrence of an abbreviation into its expanded form (concept or sense). To collect expanded forms for abbreviations, previous studies have recognized abbreviations and their expanded forms in parenthetical expressions of bio-medical texts. However, expanded forms extracted by abbreviation recognition are mixtures of concepts/senses and their term variations. Consequently, a list of expanded forms should be structured into a sense inventory, which provides possible concepts or senses for abbreviation disambiguation. Results: A sense inventory is a key to robust management of abbreviations. Therefore, we present a supervised approach for clustering expanded forms. The experimental result reports 0.915 F1 score in clustering expanded forms. We then investigate the possibility of conflicts of protein and gene names with abbreviations. Finally, an experiment of abbreviation disambiguation on the sense inventory yielded 0.984 accuracy and 0.986 F1 score using the dataset obtained from MEDLINE abstracts. Availability: The sense inventory and disambiguator of abbreviations are accessible at http://www.nactem.ac.uk/software/acromine/ and http://www.nactem.ac.uk/software/acromine_disambiguation/ Contact:okazaki@chokkan.org
Collapse
Affiliation(s)
- Naoaki Okazaki
- Graduate School of Information Science and Technology, University of Tokyo, Tokyo, Japan.
| | | | | |
Collapse
|
27
|
Fink JL, Fernicola P, Chandran R, Parastatidis S, Wade A, Naim O, Quinn GB, Bourne PE. Word add-in for ontology recognition: semantic enrichment of scientific literature. BMC Bioinformatics 2010; 11:103. [PMID: 20181245 PMCID: PMC2837026 DOI: 10.1186/1471-2105-11-103] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2009] [Accepted: 02/24/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In the current era of scientific research, efficient communication of information is paramount. As such, the nature of scholarly and scientific communication is changing; cyberinfrastructure is now absolutely necessary and new media are allowing information and knowledge to be more interactive and immediate. One approach to making knowledge more accessible is the addition of machine-readable semantic data to scholarly articles. RESULTS The Word add-in presented here will assist authors in this effort by automatically recognizing and highlighting words or phrases that are likely information-rich, allowing authors to associate semantic data with those words or phrases, and to embed that data in the document as XML. The add-in and source code are publicly available at http://www.codeplex.com/UCSDBioLit. CONCLUSIONS The Word add-in for ontology term recognition makes it possible for an author to add semantic data to a document as it is being written and it encodes these data using XML tags that are effectively a standard in life sciences literature. Allowing authors to mark-up their own work will help increase the amount and quality of machine-readable literature metadata.
Collapse
Affiliation(s)
- J Lynn Fink
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, CA, 92093-0444 USA
| | - Pablo Fernicola
- External Research, MS 99/4618, Microsoft Corporation, 1 Microsoft Way, Redmond, WA, 98052 USA
| | - Rahul Chandran
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, CA, 92093-0444 USA
| | - Savas Parastatidis
- External Research, MS 99/4618, Microsoft Corporation, 1 Microsoft Way, Redmond, WA, 98052 USA
| | - Alex Wade
- External Research, MS 99/4618, Microsoft Corporation, 1 Microsoft Way, Redmond, WA, 98052 USA
| | - Oscar Naim
- External Research, MS 99/4618, Microsoft Corporation, 1 Microsoft Way, Redmond, WA, 98052 USA
| | - Gregory B Quinn
- San Diego Supercomputer Center, 10100 Hopkins Dr., San Diego, CA, 92093-0743 USA
| | - Philip E Bourne
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, CA, 92093-0444 USA
| |
Collapse
|
28
|
Gerner M, Nenadic G, Bergman CM. LINNAEUS: a species name identification system for biomedical literature. BMC Bioinformatics 2010; 11:85. [PMID: 20149233 PMCID: PMC2836304 DOI: 10.1186/1471-2105-11-85] [Citation(s) in RCA: 153] [Impact Index Per Article: 10.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2009] [Accepted: 02/11/2010] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND The task of recognizing and identifying species names in biomedical literature has recently been regarded as critical for a number of applications in text and data mining, including gene name recognition, species-specific document retrieval, and semantic enrichment of biomedical articles. RESULTS In this paper we describe an open-source species name recognition and normalization software system, LINNAEUS, and evaluate its performance relative to several automatically generated biomedical corpora, as well as a novel corpus of full-text documents manually annotated for species mentions. LINNAEUS uses a dictionary-based approach (implemented as an efficient deterministic finite-state automaton) to identify species names and a set of heuristics to resolve ambiguous mentions. When compared against our manually annotated corpus, LINNAEUS performs with 94% recall and 97% precision at the mention level, and 98% recall and 90% precision at the document level. Our system successfully solves the problem of disambiguating uncertain species mentions, with 97% of all mentions in PubMed Central full-text documents resolved to unambiguous NCBI taxonomy identifiers. CONCLUSIONS LINNAEUS is an open source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy, and can therefore be integrated into a range of bioinformatics and text-mining applications. The software and manually annotated corpus can be downloaded freely at http://linnaeus.sourceforge.net/.
Collapse
Affiliation(s)
- Martin Gerner
- Faculty of Life Sciences, University of Manchester, Manchester, M13 9PT, UK
| | - Goran Nenadic
- School of Computer Science, University of Manchester, Manchester, M13 9PL, UK
| | - Casey M Bergman
- Faculty of Life Sciences, University of Manchester, Manchester, M13 9PT, UK
| |
Collapse
|
29
|
Krallinger M, Leitner F, Valencia A. Analysis of biological processes and diseases using text mining approaches. Methods Mol Biol 2010; 593:341-382. [PMID: 19957157 DOI: 10.1007/978-1-60327-194-3_16] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/28/2023]
Abstract
A number of biomedical text mining systems have been developed to extract biologically relevant information directly from the literature, complementing bioinformatics methods in the analysis of experimentally generated data. We provide a short overview of the general characteristics of natural language data, existing biomedical literature databases, and lexical resources relevant in the context of biomedical text mining. A selected number of practically useful systems are introduced together with the type of user queries supported and the results they generate. The extraction of biological relationships, such as protein-protein interactions as well as metabolic and signaling pathways using information extraction systems, will be discussed through example cases of cancer-relevant proteins. Basic strategies for detecting associations of genes to diseases together with literature mining of mutations, SNPs, and epigenetic information (methylation) are described. We provide an overview of disease-centric and gene-centric literature mining methods for linking genes to phenotypic and genotypic aspects. Moreover, we discuss recent efforts for finding biomarkers through text mining and for gene list analysis and prioritization. Some relevant issues for implementing a customized biomedical text mining system will be pointed out. To demonstrate the usefulness of literature mining for the molecular oncology domain, we implemented two cancer-related applications. The first tool consists of a literature mining system for retrieving human mutations together with supporting articles. Specific gene mutations are linked to a set of predefined cancer types. The second application consists of a text categorization system supporting breast cancer-specific literature search and document-based breast cancer gene ranking. Future trends in text mining emphasize the importance of community efforts such as the BioCreative challenge for the development and integration of multiple systems into a common platform provided by the BioCreative Metaserver.
Collapse
|
30
|
Attwood TK, Kell DB, McDermott P, Marsh J, Pettifer SR, Thorne D. Calling International Rescue: knowledge lost in literature and data landslide! Biochem J 2009; 424:317-33. [PMID: 19929850 PMCID: PMC2805925 DOI: 10.1042/bj20091474] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2009] [Accepted: 09/29/2009] [Indexed: 11/17/2022]
Abstract
We live in interesting times. Portents of impending catastrophe pervade the literature, calling us to action in the face of unmanageable volumes of scientific data. But it isn't so much data generation per se, but the systematic burial of the knowledge embodied in those data that poses the problem: there is so much information available that we simply no longer know what we know, and finding what we want is hard - too hard. The knowledge we seek is often fragmentary and disconnected, spread thinly across thousands of databases and millions of articles in thousands of journals. The intellectual energy required to search this array of data-archives, and the time and money this wastes, has led several researchers to challenge the methods by which we traditionally commit newly acquired facts and knowledge to the scientific record. We present some of these initiatives here - a whirlwind tour of recent projects to transform scholarly publishing paradigms, culminating in Utopia and the Semantic Biochemical Journal experiment. With their promises to provide new ways of interacting with the literature, and new and more powerful tools to access and extract the knowledge sequestered within it, we ask what advances they make and what obstacles to progress still exist? We explore these questions, and, as you read on, we invite you to engage in an experiment with us, a real-time test of a new technology to rescue data from the dormant pages of published documents. We ask you, please, to read the instructions carefully. The time has come: you may turn over your papers...
Collapse
Key Words
- dynamic document content
- interactive pdf
- linking documents with research data
- manuscript mark-up
- mark-up standards
- semantic publishing
- bj, biochemical journal
- cohse, conceptual open hypermedia services environment
- doi, digital object identifier
- go, gene ontology
- gpcr, g protein-coupled receptor
- html, hypertext mark-up language
- iupac, international union of pure and applied chemistry
- ntd, neglected tropical diseases
- obo, open biomedical ontologies
- pdb, protein data bank
- pdf, portable document format
- plos, public library of science
- pmc, pubmed central
- ptm, post-translational modification
- rsc, royal society of chemistry
- sda, structured digital abstract
- stm, scientific, technical and medical
- ud, utopia documents
- xml, extensible mark-up language
- xmp, extensible metadata platform
Collapse
Affiliation(s)
- Teresa K Attwood
- School of Chemistry, The University of Manchester, Oxford Road, Manchester M13 9PL, UK.
| | | | | | | | | | | |
Collapse
|
31
|
Wakoh M, Nishimoto N, Uesugi M, Terashita T, Ogasawara K. [Developing and evaluating an auto-retrieval algorithm for abbreviations in academic articles]. Nihon Hoshasen Gijutsu Gakkai Zasshi 2009; 65:1025-1031. [PMID: 19721310 DOI: 10.6009/jjrt.65.1025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/28/2023]
Abstract
The purpose of this study was to develop an automated acquisition support for the semantics of abbreviations and evaluate its performance with respect to academic articles. Our goal was to support the maintenance of an institution-specific semantics inventory for abbreviations on a continuous basis. We retrieved articles from MEDLINE with the keyword "Liver [MeSH]," and 100 abstracts were randomly selected. Abbreviations and their full forms were retrieved using original Java software based on the following rules. (1) Searching the parentheses in the abstracts, the words inside the parentheses were retrieved as "INNER" and the words in front of the parentheses were retrieved as "OUTER." (2) Matching rules, such as whether the first characters of INNER and OUTER were the same. (3) If the words satisfied the conditions stated at (2), INNER was saved as the abbreviation and OUTER as the full form. Performance was manually evaluated by two graduate students and a radiologist. Of the 165 pairs of abbreviations and full forms that were obtained, 145 (87.9%) constituted correct matches.
Collapse
Affiliation(s)
- Michika Wakoh
- Department of Health Sciences, School of Medicine, Hokkaido University
| | | | | | | | | |
Collapse
|
32
|
Xu Y, Wang Z, Lei Y, Zhao Y, Xue Y. MBA: a literature mining system for extracting biomedical abbreviations. BMC Bioinformatics 2009; 10:14. [PMID: 19134199 PMCID: PMC2639376 DOI: 10.1186/1471-2105-10-14] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2008] [Accepted: 01/09/2009] [Indexed: 12/05/2022] Open
Abstract
BACKGROUND The exploding growth of the biomedical literature presents many challenges for biological researchers. One such challenge is from the use of a great deal of abbreviations. Extracting abbreviations and their definitions accurately is very helpful to biologists and also facilitates biomedical text analysis. Existing approaches fall into four broad categories: rule based, machine learning based, text alignment based and statistically based. State of the art methods either focus exclusively on acronym-type abbreviations, or could not recognize rare abbreviations. We propose a systematic method to extract abbreviations effectively. At first a scoring method is used to classify the abbreviations into acronym-type and non-acronym-type abbreviations, and then their corresponding definitions are identified by two different methods: text alignment algorithm for the former, statistical method for the latter. RESULTS A literature mining system MBA was constructed to extract both acronym-type and non-acronym-type abbreviations. An abbreviation-tagged literature corpus, called Medstract gold standard corpus, was used to evaluate the system. MBA achieved a recall of 88% at the precision of 91% on the Medstract gold-standard EVALUATION Corpus. CONCLUSION We present a new literature mining system MBA for extracting biomedical abbreviations. Our evaluation demonstrates that the MBA system performs better than the others. It can identify the definition of not only acronym-type abbreviations including a little irregular acronym-type abbreviations (e.g., ), but also non-acronym-type abbreviations (e.g., ).
Collapse
Affiliation(s)
- Yun Xu
- Department of Computer Science and Technology, University of Science and Technology of China Hefei, Anhui 230027, PR China
- Anhui Province-MOST Co-Key Laboratory of High Performance Computing and Its Application Hefei, Anhui 230027, PR China
| | - ZhiHao Wang
- Department of Computer Science and Technology, University of Science and Technology of China Hefei, Anhui 230027, PR China
- Anhui Province-MOST Co-Key Laboratory of High Performance Computing and Its Application Hefei, Anhui 230027, PR China
| | - YiMing Lei
- Department of Computer Science and Technology, University of Science and Technology of China Hefei, Anhui 230027, PR China
- Anhui Province-MOST Co-Key Laboratory of High Performance Computing and Its Application Hefei, Anhui 230027, PR China
| | - YuZhong Zhao
- Department of Computer Science and Technology, University of Science and Technology of China Hefei, Anhui 230027, PR China
- Anhui Province-MOST Co-Key Laboratory of High Performance Computing and Its Application Hefei, Anhui 230027, PR China
| | - Yu Xue
- School of Life Science, University of Science and Technology of China Hefei, Anhui 230027, PR China
| |
Collapse
|
33
|
Krallinger M, Valencia A, Hirschman L. Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol 2008; 9 Suppl 2:S8. [PMID: 18834499 PMCID: PMC2559992 DOI: 10.1186/gb-2008-9-s2-s8] [Citation(s) in RCA: 145] [Impact Index Per Article: 9.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Efficient access to information contained in online scientific literature collections is essential for life science research, playing a crucial role from the initial stage of experiment planning to the final interpretation and communication of the results. The biological literature also constitutes the main information source for manual literature curation used by expert-curated databases. Following the increasing popularity of web-based applications for analyzing biological data, new text-mining and information extraction strategies are being implemented. These systems exploit existing regularities in natural language to extract biologically relevant information from electronic texts automatically. The aim of the BioCreative challenge is to promote the development of such tools and to provide insight into their performance. This review presents a general introduction to the main characteristics and applications of currently available text-mining systems for life sciences in terms of the following: the type of biological information demands being addressed; the level of information granularity of both user queries and results; and the features and methods commonly exploited by these applications. The current trend in biomedical text mining points toward an increasing diversification in terms of application types and techniques, together with integration of domain-specific resources such as ontologies. Additional descriptions of some of the systems discussed here are available on the internet .
Collapse
Affiliation(s)
- Martin Krallinger
- Structural Biology and BioComputing Programme, Spanish Nacional Cancer Research Centre (CNIO), Madrid, Spain.
| | | | | |
Collapse
|
34
|
Winnenburg R, Wachter T, Plake C, Doms A, Schroeder M. Facts from text: can text mining help to scale-up high-quality manual curation of gene products with ontologies? Brief Bioinform 2008; 9:466-78. [DOI: 10.1093/bib/bbn043] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
|
35
|
Dobson PD, Kell DB. Carrier-mediated cellular uptake of pharmaceutical drugs: an exception or the rule? Nat Rev Drug Discov 2008; 7:205-20. [PMID: 18309312 DOI: 10.1038/nrd2438] [Citation(s) in RCA: 325] [Impact Index Per Article: 20.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/18/2022]
Abstract
It is generally thought that many drug molecules are transported across biological membranes via passive diffusion at a rate related to their lipophilicity. However, the types of biophysical forces involved in the interaction of drugs with lipid membranes are no different from those involved in their interaction with proteins, and so arguments based on lipophilicity could also be applied to drug uptake by membrane transporters or carriers. In this article, we discuss the evidence supporting the idea that rather than being an exception, carrier-mediated and active uptake of drugs may be more common than is usually assumed - including a summary of specific cases in which drugs are known to be taken up into cells via defined carriers - and consider the implications for drug discovery and development.
Collapse
Affiliation(s)
- Paul D Dobson
- School of Chemistry and Manchester Interdisciplinary Biocentre, University of Manchester, 131 Princess Street, Manchester M1 7DN, UK
| | | |
Collapse
|
36
|
Kim JD, Ohta T, Tsujii J. Corpus annotation for mining biomedical events from literature. BMC Bioinformatics 2008; 9:10. [PMID: 18182099 PMCID: PMC2267702 DOI: 10.1186/1471-2105-9-10] [Citation(s) in RCA: 163] [Impact Index Per Article: 10.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2007] [Accepted: 01/08/2008] [Indexed: 11/24/2022] Open
Abstract
Background Advanced Text Mining (TM) such as semantic enrichment of papers, event or relation extraction, and intelligent Question Answering have increasingly attracted attention in the bio-medical domain. For such attempts to succeed, text annotation from the biological point of view is indispensable. However, due to the complexity of the task, semantic annotation has never been tried on a large scale, apart from relatively simple term annotation. Results We have completed a new type of semantic annotation, event annotation, which is an addition to the existing annotations in the GENIA corpus. The corpus has already been annotated with POS (Parts of Speech), syntactic trees, terms, etc. The new annotation was made on half of the GENIA corpus, consisting of 1,000 Medline abstracts. It contains 9,372 sentences in which 36,114 events are identified. The major challenges during event annotation were (1) to design a scheme of annotation which meets specific requirements of text annotation, (2) to achieve biology-oriented annotation which reflect biologists' interpretation of text, and (3) to ensure the homogeneity of annotation quality across annotators. To meet these challenges, we introduced new concepts such as Single-facet Annotation and Semantic Typing, which have collectively contributed to successful completion of a large scale annotation. Conclusion The resulting event-annotated corpus is the largest and one of the best in quality among similar annotation efforts. We expect it to become a valuable resource for NLP (Natural Language Processing)-based TM in the bio-medical domain.
Collapse
Affiliation(s)
- Jin-Dong Kim
- Department of Computer Science, School of Information Science and Technology, University of Tokyo, Tokyo, Japan.
| | | | | |
Collapse
|
37
|
Torii M, Hu ZZ, Song M, Wu CH, Liu H. A comparison study on algorithms of detecting long forms for short forms in biomedical text. BMC Bioinformatics 2007; 8 Suppl 9:S5. [PMID: 18047706 PMCID: PMC2217663 DOI: 10.1186/1471-2105-8-s9-s5] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
MOTIVATION With more and more research dedicated to literature mining in the biomedical domain, more and more systems are available for people to choose from when building literature mining applications. In this study, we focus on one specific kind of literature mining task, i.e., detecting definitions of acronyms, abbreviations, and symbols in biomedical text. We denote acronyms, abbreviations, and symbols as short forms (SFs) and their corresponding definitions as long forms (LFs). The study was designed to answer the following questions; i) how well a system performs in detecting LFs from novel text, ii) what the coverage is for various terminological knowledge bases in including SFs as synonyms of their LFs, and iii) how to combine results from various SF knowledge bases. METHOD We evaluated the following three publicly available detection systems in detecting LFs for SFs: i) a handcrafted pattern/rule based system by Ao and Takagi, ALICE, ii) a machine learning system by Chang et al., and iii) a simple alignment-based program by Schwartz and Hearst. In addition, we investigated the conceptual coverage of two terminological knowledge bases: i) the UMLS (the Unified Medical Language System), and ii) the BioThesaurus (a thesaurus of names for all UniProt protein records). We also implemented a web interface that provides a virtual integration of various SF knowledge bases. RESULTS We found that detection systems agree with each other on most cases, and the existing terminological knowledge bases have a good coverage of synonymous relationship for frequently defined LFs. The web interface allows people to detect SF definitions from text and to search several SF knowledge bases. AVAILABILITY The web site is http://gauss.dbb.georgetown.edu/liblab/SFThesaurus.
Collapse
Affiliation(s)
- Manabu Torii
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, 4000 Resevoir Rd, NW, Washington, DC 20057, USA.
| | | | | | | | | |
Collapse
|
38
|
Crangle CE, Cherry JM, Hong EL, Zbyslaw A. Mining experimental evidence of molecular function claims from the literature. Bioinformatics 2007; 23:3232-40. [PMID: 17942445 DOI: 10.1093/bioinformatics/btm495] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION The rate at which gene-related findings appear in the scientific literature makes it difficult if not impossible for biomedical scientists to keep fully informed and up to date. The importance of these findings argues for the development of automated methods that can find, extract and summarize this information. This article reports on methods for determining the molecular function claims that are being made in a scientific article, specifically those that are backed by experimental evidence. RESULTS The most significant result is that for molecular function claims based on direct assays, our methods achieved recall of 70.7% and precision of 65.7%. Furthermore, our methods correctly identified in the text 44.6% of the specific molecular function claims backed up by direct assays, but with a precision of only 0.92%, a disappointing outcome that led to an examination of the different kinds of errors. These results were based on an analysis of 1823 articles from the literature of Saccharomyces cerevisiae (budding yeast). AVAILABILITY The annotation files for S.cerevisiae are available from ftp://genome-ftp.stanford.edu/pub/yeast/data_download/literature_curation/gene_association.sgd.gz. The draft protocol vocabulary is available by request from the first author.
Collapse
|