101
|
Wang X, Tsujii J, Ananiadou S. Disambiguating the species of biomedical named entities using natural language parsers. Bioinformatics 2010; 26:661-7. [PMID: 20053840 PMCID: PMC2828111 DOI: 10.1093/bioinformatics/btq002] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Motivation: Text mining technologies have been shown to reduce the laborious work involved in organizing the vast amount of information hidden in the literature. One challenge in text mining is linking ambiguous word forms to unambiguous biological concepts. This article reports on a comprehensive study on resolving the ambiguity in mentions of biomedical named entities with respect to model organisms and presents an array of approaches, with focus on methods utilizing natural language parsers. Results: We build a corpus for organism disambiguation where every occurrence of protein/gene entity is manually tagged with a species ID, and evaluate a number of methods on it. Promising results are obtained by training a machine learning model on syntactic parse trees, which is then used to decide whether an entity belongs to the model organism denoted by a neighbouring species-indicating word (e.g. yeast). The parser-based approaches are also compared with a supervised classification method and results indicate that the former are a more favorable choice when domain portability is of concern. The best overall performance is obtained by combining the strengths of syntactic features and supervised classification. Availability: The corpus and demo are available at http://www.nactem.ac.uk/deca_details/start.cgi, and the software is freely available as U-Compare components (Kano et al., 2009): NaCTeM Species Word Detector and NaCTeM Species Disambiguator. U-Compare is available at http://-compare.org/ Contact:xinglong.wang@manchester.ac.uk
Collapse
Affiliation(s)
- Xinglong Wang
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK.
| | | | | |
Collapse
|
102
|
|
103
|
Islamaj Dogan R, Murray GC, Névéol A, Lu Z. Understanding PubMed user search behavior through log analysis. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2009; 2009:bap018. [PMID: 20157491 PMCID: PMC2797455 DOI: 10.1093/database/bap018] [Citation(s) in RCA: 131] [Impact Index Per Article: 8.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/20/2009] [Revised: 10/05/2009] [Accepted: 10/06/2009] [Indexed: 11/20/2022]
Abstract
This article reports on a detailed investigation of PubMed users’ needs and behavior as a step toward improving biomedical information retrieval. PubMed is providing free service to researchers with access to more than 19 million citations for biomedical articles from MEDLINE and life science journals. It is accessed by millions of users each day. Efficient search tools are crucial for biomedical researchers to keep abreast of the biomedical literature relating to their own research. This study provides insight into PubMed users’ needs and their behavior. This investigation was conducted through the analysis of one month of log data, consisting of more than 23 million user sessions and more than 58 million user queries. Multiple aspects of users’ interactions with PubMed are characterized in detail with evidence from these logs. Despite having many features in common with general Web searches, biomedical information searches have unique characteristics that are made evident in this study. PubMed users are more persistent in seeking information and they reformulate queries often. The three most frequent types of search are search by author name, search by gene/protein, and search by disease. Use of abbreviation in queries is very frequent. Factors such as result set size influence users’ decisions. Analysis of characteristics such as these plays a critical role in identifying users’ information needs and their search habits. In turn, such an analysis also provides useful insight for improving biomedical information retrieval. Database URL:http://www.ncbi.nlm.nih.gov/PubMed
Collapse
Affiliation(s)
- Rezarta Islamaj Dogan
- National Center for Biotechnology Information, US National Library of Medicine, Bethesda, MD 20894, USA
| | | | | | | |
Collapse
|
104
|
Duchrow T, Shtatland T, Guettler D, Pivovarov M, Kramer S, Weissleder R. Enhancing navigation in biomedical databases by community voting and database-driven text classification. BMC Bioinformatics 2009; 10:317. [PMID: 19799796 PMCID: PMC2768718 DOI: 10.1186/1471-2105-10-317] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2009] [Accepted: 10/03/2009] [Indexed: 11/29/2022] Open
Abstract
Background The breadth of biological databases and their information content continues to increase exponentially. Unfortunately, our ability to query such sources is still often suboptimal. Here, we introduce and apply community voting, database-driven text classification, and visual aids as a means to incorporate distributed expert knowledge, to automatically classify database entries and to efficiently retrieve them. Results Using a previously developed peptide database as an example, we compared several machine learning algorithms in their ability to classify abstracts of published literature results into categories relevant to peptide research, such as related or not related to cancer, angiogenesis, molecular imaging, etc. Ensembles of bagged decision trees met the requirements of our application best. No other algorithm consistently performed better in comparative testing. Moreover, we show that the algorithm produces meaningful class probability estimates, which can be used to visualize the confidence of automatic classification during the retrieval process. To allow viewing long lists of search results enriched by automatic classifications, we added a dynamic heat map to the web interface. We take advantage of community knowledge by enabling users to cast votes in Web 2.0 style in order to correct automated classification errors, which triggers reclassification of all entries. We used a novel framework in which the database "drives" the entire vote aggregation and reclassification process to increase speed while conserving computational resources and keeping the method scalable. In our experiments, we simulate community voting by adding various levels of noise to nearly perfectly labelled instances, and show that, under such conditions, classification can be improved significantly. Conclusion Using PepBank as a model database, we show how to build a classification-aided retrieval system that gathers training data from the community, is completely controlled by the database, scales well with concurrent change events, and can be adapted to add text classification capability to other biomedical databases. The system can be accessed at .
Collapse
Affiliation(s)
- Timo Duchrow
- Center for Molecular Imaging Research, Massachusetts General Hospital, Harvard Medical School, Charlestown, MA, USA.
| | | | | | | | | | | |
Collapse
|
105
|
Chung GYC. Towards identifying intervention arms in randomized controlled trials: Extracting coordinating constructions. J Biomed Inform 2009; 42:790-800. [DOI: 10.1016/j.jbi.2008.12.011] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2008] [Revised: 11/10/2008] [Accepted: 12/19/2008] [Indexed: 11/25/2022]
|
106
|
Garten Y, Tatonetti NP, Altman RB. Improving the prediction of pharmacogenes using text-derived drug-gene relationships. PACIFIC SYMPOSIUM ON BIOCOMPUTING. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2009:305-14. [PMID: 19908383 DOI: 10.1142/9789814295291_0033] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Abstract
A critical goal of pharmacogenomics research is to identify genes that can explain variation in drug response. We have previously reported a method that creates a genome-scale ranking of genes likely to interact with a drug. The algorithm uses information about drug structure and indications of use to rank the genes. Although the algorithm has good performance, its performance depends on a curated set of drug-gene relationships that is expensive to create and difficult to maintain. In this work, we assess the utility of text mining in extracting a network of drug-gene relationships automatically. This provides a valuable aggregate source of knowledge, subsequently used as input into the algorithm that ranks potential pharmacogenes. Using a drug-gene network created from sentence-level co-occurrence in the full text of scientific articles, we compared the performance to that of a network created by manual curation of those articles. Under a wide range of conditions, we show that a knowledge base derived from text-mining the literature performs as well as, and sometimes better than, a high-quality, manually curated knowledge base. We conclude that we can use relationships mined automatically from the literature as a knowledgebase for pharmacogenomics relationships. Additionally, when relationships are missed by text mining, our system can accurately extrapolate new relationships with 77.4% precision.
Collapse
Affiliation(s)
- Yael Garten
- Stanford Biomedical Informatics Training Program, Stanford University, Stanford, CA 94305, USA
| | | | | |
Collapse
|
107
|
Cano C, Monaghan T, Blanco A, Wall DP, Peshkin L. Collaborative text-annotation resource for disease-centered relation extraction from biomedical text. J Biomed Inform 2009; 42:967-77. [PMID: 19232400 PMCID: PMC2757509 DOI: 10.1016/j.jbi.2009.02.001] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2008] [Revised: 12/04/2008] [Accepted: 02/04/2009] [Indexed: 11/30/2022]
Abstract
Agglomerating results from studies of individual biological components has shown the potential to produce biomedical discovery and the promise of therapeutic development. Such knowledge integration could be tremendously facilitated by automated text mining for relation extraction in the biomedical literature. Relation extraction systems cannot be developed without substantial datasets annotated with ground truth for benchmarking and training. The creation of such datasets is hampered by the absence of a resource for launching a distributed annotation effort, as well as by the lack of a standardized annotation schema. We have developed an annotation schema and an annotation tool which can be widely adopted so that the resulting annotated corpora from a multitude of disease studies could be assembled into a unified benchmark dataset. The contribution of this paper is threefold. First, we provide an overview of available benchmark corpora and derive a simple annotation schema for specific binary relation extraction problems such as protein-protein and gene-disease relation extraction. Second, we present BioNotate: an open source annotation resource for the distributed creation of a large corpus. Third, we present and make available the results of a pilot annotation effort of the autism disease network.
Collapse
Affiliation(s)
- C Cano
- Department of Computer Science and Artificial Intelligence, University of Granada, 18071 Granada, Spain.
| | | | | | | | | |
Collapse
|
108
|
Korhonen A, Silins I, Sun L, Stenius U. The first step in the development of Text Mining technology for Cancer Risk Assessment: identifying and organizing scientific evidence in risk assessment literature. BMC Bioinformatics 2009; 10:303. [PMID: 19772619 PMCID: PMC2759963 DOI: 10.1186/1471-2105-10-303] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2009] [Accepted: 09/22/2009] [Indexed: 12/03/2022] Open
Abstract
BACKGROUND One of the most neglected areas of biomedical Text Mining (TM) is the development of systems based on carefully assessed user needs. We have recently investigated the user needs of an important task yet to be tackled by TM -- Cancer Risk Assessment (CRA). Here we take the first step towards the development of TM technology for the task: identifying and organizing the scientific evidence required for CRA in a taxonomy which is capable of supporting extensive data gathering from biomedical literature. RESULTS The taxonomy is based on expert annotation of 1297 abstracts downloaded from relevant PubMed journals. It classifies 1742 unique keywords found in the corpus to 48 classes which specify core evidence required for CRA. We report promising results with inter-annotator agreement tests and automatic classification of PubMed abstracts to taxonomy classes. A simple user test is also reported in a near real-world CRA scenario which demonstrates along with other evaluation that the resources we have built are well-defined, accurate, and applicable in practice. CONCLUSION We present our annotation guidelines and a tool which we have designed for expert annotation of PubMed abstracts. A corpus annotated for keywords and document relevance is also presented, along with the taxonomy which organizes the keywords into classes defining core evidence for CRA. As demonstrated by the evaluation, the materials we have constructed provide a good basis for classification of CRA literature along multiple dimensions. They can support current manual CRA as well as facilitate the development of an approach based on TM. We discuss extending the taxonomy further via manual and machine learning approaches and the subsequent steps required to develop TM technology for the needs of CRA.
Collapse
Affiliation(s)
- Anna Korhonen
- Computer Laboratory, University of Cambridge, 15 JJ Thomson Avenue, Cambridge CB3 0FD, UK
| | - Ilona Silins
- Institute of Environmental Medicine, Karolinska Institutet, S-17177, Stockholm, Sweden
| | - Lin Sun
- Computer Laboratory, University of Cambridge, 15 JJ Thomson Avenue, Cambridge CB3 0FD, UK
| | - Ulla Stenius
- Institute of Environmental Medicine, Karolinska Institutet, S-17177, Stockholm, Sweden
| |
Collapse
|
109
|
Aljaber B, Stokes N, Bailey J, Pei J. Document clustering of scientific texts using citation contexts. INFORM RETRIEVAL J 2009. [DOI: 10.1007/s10791-009-9108-x] [Citation(s) in RCA: 52] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/27/2022]
|
110
|
Smalheiser NR, Torvik VI, Zhou W. Arrowsmith two-node search interface: a tutorial on finding meaningful links between two disparate sets of articles in MEDLINE. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2009; 94:190-7. [PMID: 19185946 PMCID: PMC2693227 DOI: 10.1016/j.cmpb.2008.12.006] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Received: 09/07/2008] [Revised: 10/17/2008] [Accepted: 12/12/2008] [Indexed: 05/27/2023]
Abstract
The Arrowsmith two-node search is a strategy that is designed to assist biomedical investigators in formulating and assessing scientific hypotheses. More generally, it allows users to identify biologically meaningful links between any two sets of articles A and C in PubMed, even when these share no articles or authors in common and represent disparate topics or disciplines. The key idea is to relate the two sets of articles via title words and phrases (B-terms) that they share. We have created a free, public web-based version of the two-node search tool (http://arrowsmith.psych.uic.edu), have described its development and implementation, and have presented analyses of individual two-node searches. In this paper, we provide an updated tutorial intended for end-users, that covers the use of the tool for a variety of potential scientific use case scenarios. For example, one can assess a recent experimental, clinical or epidemiologic finding that connects two disparate fields of inquiry--identifying likely mechanisms to explain the finding, and choosing promising follow-up lines of investigation. Alternatively, one can assess whether the existing scientific literature lends indirect support to a hypothesis posed by the user that has not yet been investigated. One can also employ two-node searches to search for novel hypotheses. Arrowsmith provides a service that cannot be carried out feasibly via standard PubMed searches or by other available text mining tools.
Collapse
Affiliation(s)
- Neil R Smalheiser
- Department of Psychiatry and Psychiatric Institute, MC912, University of Illinois at Chicago, 1601W. Taylor Street, Chicago, IL 60612, USA.
| | | | | |
Collapse
|
111
|
Leach SM, Tipney H, Feng W, Baumgartner WA, Kasliwal P, Schuyler RP, Williams T, Spritz RA, Hunter L. Biomedical discovery acceleration, with applications to craniofacial development. PLoS Comput Biol 2009; 5:e1000215. [PMID: 19325874 PMCID: PMC2653649 DOI: 10.1371/journal.pcbi.1000215] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2008] [Accepted: 02/12/2009] [Indexed: 01/17/2023] Open
Abstract
The profusion of high-throughput instruments and the explosion of new results in the scientific literature, particularly in molecular biomedicine, is both a blessing and a curse to the bench researcher. Even knowledgeable and experienced scientists can benefit from computational tools that help navigate this vast and rapidly evolving terrain. In this paper, we describe a novel computational approach to this challenge, a knowledge-based system that combines reading, reasoning, and reporting methods to facilitate analysis of experimental data. Reading methods extract information from external resources, either by parsing structured data or using biomedical language processing to extract information from unstructured data, and track knowledge provenance. Reasoning methods enrich the knowledge that results from reading by, for example, noting two genes that are annotated to the same ontology term or database entry. Reasoning is also used to combine all sources into a knowledge network that represents the integration of all sorts of relationships between a pair of genes, and to calculate a combined reliability score. Reporting methods combine the knowledge network with a congruent network constructed from experimental data and visualize the combined network in a tool that facilitates the knowledge-based analysis of that data. An implementation of this approach, called the Hanalyzer, is demonstrated on a large-scale gene expression array dataset relevant to craniofacial development. The use of the tool was critical in the creation of hypotheses regarding the roles of four genes never previously characterized as involved in craniofacial development; each of these hypotheses was validated by further experimental work.
Collapse
Affiliation(s)
- Sonia M. Leach
- Center for Computational Pharmacology, University of Colorado at Denver, Denver, Colorado, United States of America
| | - Hannah Tipney
- Center for Computational Pharmacology, University of Colorado at Denver, Denver, Colorado, United States of America
| | - Weiguo Feng
- Department of Craniofacial Biology, University of Colorado at Denver, Denver, Colorado, United States of America
| | - William A. Baumgartner
- Center for Computational Pharmacology, University of Colorado at Denver, Denver, Colorado, United States of America
| | - Priyanka Kasliwal
- Center for Computational Pharmacology, University of Colorado at Denver, Denver, Colorado, United States of America
| | - Ronald P. Schuyler
- Center for Computational Pharmacology, University of Colorado at Denver, Denver, Colorado, United States of America
| | - Trevor Williams
- Department of Craniofacial Biology, University of Colorado at Denver, Denver, Colorado, United States of America
| | - Richard A. Spritz
- Human Medical Genetics Program, University of Colorado at Denver, Denver, Colorado, United States of America
| | - Lawrence Hunter
- Center for Computational Pharmacology, University of Colorado at Denver, Denver, Colorado, United States of America
- * E-mail:
| |
Collapse
|
112
|
Lin J. Is searching full text more effective than searching abstracts? BMC Bioinformatics 2009; 10:46. [PMID: 19192280 PMCID: PMC2695361 DOI: 10.1186/1471-2105-10-46] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2008] [Accepted: 02/03/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND With the growing availability of full-text articles online, scientists and other consumers of the life sciences literature now have the ability to go beyond searching bibliographic records (title, abstract, metadata) to directly access full-text content. Motivated by this emerging trend, I posed the following question: is searching full text more effective than searching abstracts? This question is answered by comparing text retrieval algorithms on MEDLINE abstracts, full-text articles, and spans (paragraphs) within full-text articles using data from the TREC 2007 genomics track evaluation. Two retrieval models are examined: bm25 and the ranking algorithm implemented in the open-source Lucene search engine. RESULTS Experiments show that treating an entire article as an indexing unit does not consistently yield higher effectiveness compared to abstract-only search. However, retrieval based on spans, or paragraphs-sized segments of full-text articles, consistently outperforms abstract-only search. Results suggest that highest overall effectiveness may be achieved by combining evidence from spans and full articles. CONCLUSION Users searching full text are more likely to find relevant articles than searching only abstracts. This finding affirms the value of full text collections for text retrieval and provides a starting point for future work in exploring algorithms that take advantage of rapidly-growing digital archives. Experimental results also highlight the need to develop distributed text retrieval algorithms, since full-text articles are significantly longer than abstracts and may require the computational resources of multiple machines in a cluster. The MapReduce programming model provides a convenient framework for organizing such computations.
Collapse
Affiliation(s)
- Jimmy Lin
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland, USA.
| |
Collapse
|
113
|
Tüchler T, Velez G, Graf A, Kreil DP. BibGlimpse: the case for a light-weight reprint manager in distributed literature research. BMC Bioinformatics 2008; 9:406. [PMID: 18828894 PMCID: PMC2571996 DOI: 10.1186/1471-2105-9-406] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2008] [Accepted: 10/01/2008] [Indexed: 12/02/2022] Open
Abstract
Background While text-mining and distributed annotation systems both aim at capturing knowledge and presenting it in a standardized form, there have been few attempts to investigate potential synergies between these two fields. For instance, distributed annotation would be very well suited for providing topic focussed, expert knowledge enriched text corpora. A key limitation for this approach is the availability of literature annotation systems that can be routinely used by groups of collaborating researchers on a day to day basis, not distracting from the main focus of their work. Results For this purpose, we have designed BibGlimpse. Features like drop-to-file, SVM based automated retrieval of PubMed bibliography for PDF reprints, and annotation support make BibGlimpse an efficient, light-weight reprint manager that facilitates distributed literature research for work groups. Building on an established open search engine, full-text search and structured queries are supported, while at the same time making shared collections of annotated reprints accessible to literature classification and text-mining tools. Conclusion BibGlimpse offers scientists a tool that enhances their own literature management. Moreover, it may be used to create content enriched, annotated text corpora for research in text-mining.
Collapse
Affiliation(s)
- Thomas Tüchler
- Chair of Bioinformatics, Boku University, Vienna, Austria.
| | | | | | | |
Collapse
|
114
|
Hull D, Pettifer SR, Kell DB. Defrosting the digital library: bibliographic tools for the next generation web. PLoS Comput Biol 2008; 4:e1000204. [PMID: 18974831 PMCID: PMC2568856 DOI: 10.1371/journal.pcbi.1000204] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2022] Open
Abstract
Many scientists now manage the bulk of their bibliographic information electronically, thereby organizing their publications and citation material from digital libraries. However, a library has been described as "thought in cold storage," and unfortunately many digital libraries can be cold, impersonal, isolated, and inaccessible places. In this Review, we discuss the current chilly state of digital libraries for the computational biologist, including PubMed, IEEE Xplore, the ACM digital library, ISI Web of Knowledge, Scopus, Citeseer, arXiv, DBLP, and Google Scholar. We illustrate the current process of using these libraries with a typical workflow, and highlight problems with managing data and metadata using URIs. We then examine a range of new applications such as Zotero, Mendeley, Mekentosj Papers, MyNCBI, CiteULike, Connotea, and HubMed that exploit the Web to make these digital libraries more personal, sociable, integrated, and accessible places. We conclude with how these applications may begin to help achieve a digital defrost, and discuss some of the issues that will help or hinder this in terms of making libraries on the Web warmer places in the future, becoming resources that are considerably more useful to both humans and machines.
Collapse
Affiliation(s)
- Duncan Hull
- School of Chemistry, The University of Manchester, Manchester, UK.
| | | | | |
Collapse
|
115
|
Müller HM, Rangarajan A, Teal TK, Sternberg PW. Textpresso for neuroscience: searching the full text of thousands of neuroscience research papers. Neuroinformatics 2008; 6:195-204. [PMID: 18949581 PMCID: PMC2666735 DOI: 10.1007/s12021-008-9031-0] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2008] [Accepted: 09/22/2008] [Indexed: 10/21/2022]
Abstract
Textpresso is a text-mining system for scientific literature. Its two major features are access to the full text of research papers and the development and use of categories of biological concepts as well as categories that describe or relate objects. A search engine enables the user to search for one or a combination of these categories and/or keywords within an entire literature. Here we describe Textpresso for Neuroscience, part of the core Neuroscience Information Framework (NIF). The Textpresso site currently consists of 67,500 full text papers and 131,300 abstracts. We show that using categories in literature can make a pure keyword query more refined and meaningful. We also show how semantic queries can be formulated with categories only. We explain the build and content of the database and describe the main features of the web pages and the advanced search options. We also give detailed illustrations of the web service developed to provide programmatic access to Textpresso. This web service is used by the NIF interface to access Textpresso. The standalone website of Textpresso for Neuroscience can be accessed at http://www.textpresso.org/neuroscience/.
Collapse
Affiliation(s)
- Hans-Michael Müller
- Division of Biology and Howard Hughes Medical Institute, California Institute of Technology, Pasadena, CA, USA, e-mail:
| | - Arun Rangarajan
- Division of Biology and Howard Hughes Medical Institute, California Institute of Technology, Pasadena, CA, USA, e-mail:
| | - Tracy K. Teal
- Microbiology and Molecular Genetics, Michigan State University, East Lansing, MI, USA
| | - Paul W. Sternberg
- Division of Biology and Howard Hughes Medical Institute, California Institute of Technology, Pasadena, CA, USA, e-mail:
| |
Collapse
|
116
|
Krallinger M, Valencia A, Hirschman L. Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol 2008; 9 Suppl 2:S8. [PMID: 18834499 PMCID: PMC2559992 DOI: 10.1186/gb-2008-9-s2-s8] [Citation(s) in RCA: 105] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Efficient access to information contained in online scientific literature collections is essential for life science research, playing a crucial role from the initial stage of experiment planning to the final interpretation and communication of the results. The biological literature also constitutes the main information source for manual literature curation used by expert-curated databases. Following the increasing popularity of web-based applications for analyzing biological data, new text-mining and information extraction strategies are being implemented. These systems exploit existing regularities in natural language to extract biologically relevant information from electronic texts automatically. The aim of the BioCreative challenge is to promote the development of such tools and to provide insight into their performance. This review presents a general introduction to the main characteristics and applications of currently available text-mining systems for life sciences in terms of the following: the type of biological information demands being addressed; the level of information granularity of both user queries and results; and the features and methods commonly exploited by these applications. The current trend in biomedical text mining points toward an increasing diversification in terms of application types and techniques, together with integration of domain-specific resources such as ontologies. Additional descriptions of some of the systems discussed here are available on the internet .
Collapse
Affiliation(s)
- Martin Krallinger
- Structural Biology and BioComputing Programme, Spanish Nacional Cancer Research Centre (CNIO), Madrid, Spain.
| | | | | |
Collapse
|
117
|
Agarwal P, Searls DB. Literature mining in support of drug discovery. Brief Bioinform 2008; 9:479-92. [DOI: 10.1093/bib/bbn035] [Citation(s) in RCA: 56] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open
|
118
|
Gabow AP, Leach SM, Baumgartner WA, Hunter LE, Goldberg DS. Improving protein function prediction methods with integrated literature data. BMC Bioinformatics 2008; 9:198. [PMID: 18412966 PMCID: PMC2375131 DOI: 10.1186/1471-2105-9-198] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2007] [Accepted: 04/15/2008] [Indexed: 11/22/2022] Open
Abstract
Background Determining the function of uncharacterized proteins is a major challenge in the post-genomic era due to the problem's complexity and scale. Identifying a protein's function contributes to an understanding of its role in the involved pathways, its suitability as a drug target, and its potential for protein modifications. Several graph-theoretic approaches predict unidentified functions of proteins by using the functional annotations of better-characterized proteins in protein-protein interaction networks. We systematically consider the use of literature co-occurrence data, introduce a new method for quantifying the reliability of co-occurrence and test how performance differs across species. We also quantify changes in performance as the prediction algorithms annotate with increased specificity. Results We find that including information on the co-occurrence of proteins within an abstract greatly boosts performance in the Functional Flow graph-theoretic function prediction algorithm in yeast, fly and worm. This increase in performance is not simply due to the presence of additional edges since supplementing protein-protein interactions with co-occurrence data outperforms supplementing with a comparably-sized genetic interaction dataset. Through the combination of protein-protein interactions and co-occurrence data, the neighborhood around unknown proteins is quickly connected to well-characterized nodes which global prediction algorithms can exploit. Our method for quantifying co-occurrence reliability shows superior performance to the other methods, particularly at threshold values around 10% which yield the best trade off between coverage and accuracy. In contrast, the traditional way of asserting co-occurrence when at least one abstract mentions both proteins proves to be the worst method for generating co-occurrence data, introducing too many false positives. Annotating the functions with greater specificity is harder, but co-occurrence data still proves beneficial. Conclusion Co-occurrence data is a valuable supplemental source for graph-theoretic function prediction algorithms. A rapidly growing literature corpus ensures that co-occurrence data is a readily-available resource for nearly every studied organism, particularly those with small protein interaction databases. Though arguably biased toward known genes, co-occurrence data provides critical additional links to well-studied regions in the interaction network that graph-theoretic function prediction algorithms can exploit.
Collapse
Affiliation(s)
- Aaron P Gabow
- Department of Pharmacology, University of Colorado at Denver and Health Sciences Center, MS 8303, RC-1 South, 12801 East 17th Avenue, L18-6101, PO Box 6511, Aurora, CO 80045, USA.
| | | | | | | | | |
Collapse
|
119
|
Caporaso JG, Baumgartner WA, Randolph DA, Cohen KB, Hunter L. Rapid pattern development for concept recognition systems: application to point mutations. J Bioinform Comput Biol 2008; 5:1233-59. [PMID: 18172927 DOI: 10.1142/s0219720007003144] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2007] [Revised: 07/30/2007] [Accepted: 08/23/2007] [Indexed: 11/18/2022]
Abstract
The primary biomedical literature is being generated at an unprecedented rate, and researchers cannot keep abreast of new developments in their fields. Biomedical natural language processing is being developed to address this issue, but building reliable systems often requires many expert-hours. We present an approach for automatically developing collections of regular expressions to drive high-performance concept recognition systems with minimal human interaction. We applied our approach to develop MutationFinder, a system for automatically extracting mentions of point mutations from the text. MutationFinder achieves performance equivalent to or better than manually developed mutation recognition systems, but the generation of its 759 patterns has required only 5.5 expert-hours. We also discuss the development and evaluation of our recently published high-quality, human-annotated gold standard corpus, which contains 1,515 complete point mutation mentions annotated in 813 abstracts. Both MutationFinder and the complete corpus are publicly available at (http://mutationfinder.sourceforge.net/).
Collapse
Affiliation(s)
- J Gregory Caporaso
- Department of Biochemistry and Molecular Genetics, Center for Computational Pharmacology, University of Colorado Health Sciences Center, Aurora, CO, USA.
| | | | | | | | | |
Collapse
|
120
|
Hunter L, Lu Z, Firby J, Baumgartner WA, Johnson HL, Ogren PV, Cohen KB. OpenDMAP: an open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression. BMC Bioinformatics 2008; 9:78. [PMID: 18237434 PMCID: PMC2275248 DOI: 10.1186/1471-2105-9-78] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2007] [Accepted: 01/31/2008] [Indexed: 12/03/2022] Open
Abstract
Background Information extraction (IE) efforts are widely acknowledged to be important in harnessing the rapid advance of biomedical knowledge, particularly in areas where important factual information is published in a diverse literature. Here we report on the design, implementation and several evaluations of OpenDMAP, an ontology-driven, integrated concept analysis system. It significantly advances the state of the art in information extraction by leveraging knowledge in ontological resources, integrating diverse text processing applications, and using an expanded pattern language that allows the mixing of syntactic and semantic elements and variable ordering. Results OpenDMAP information extraction systems were produced for extracting protein transport assertions (transport), protein-protein interaction assertions (interaction) and assertions that a gene is expressed in a cell type (expression). Evaluations were performed on each system, resulting in F-scores ranging from .26 – .72 (precision .39 – .85, recall .16 – .85). Additionally, each of these systems was run over all abstracts in MEDLINE, producing a total of 72,460 transport instances, 265,795 interaction instances and 176,153 expression instances. Conclusion OpenDMAP advances the performance standards for extracting protein-protein interaction predications from the full texts of biomedical research articles. Furthermore, this level of performance appears to generalize to other information extraction tasks, including extracting information about predicates of more than two arguments. The output of the information extraction system is always constructed from elements of an ontology, ensuring that the knowledge representation is grounded with respect to a carefully constructed model of reality. The results of these efforts can be used to increase the efficiency of manual curation efforts and to provide additional features in systems that integrate multiple sources for information extraction. The open source OpenDMAP code library is freely available at
Collapse
Affiliation(s)
- Lawrence Hunter
- Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, CO 80045, USA.
| | | | | | | | | | | | | |
Collapse
|
121
|
Affiliation(s)
- K Bretonnel Cohen
- University of Colorado School of Medicine, Center for Computational Pharmacology, UCHSC at Fitzsimons, Department of Pharmacology, Aurora, Colorado, United States of America.
| | | |
Collapse
|
122
|
Caporaso JG, Deshpande N, Fink JL, Bourne PE, Cohen KB, Hunter L. Intrinsic evaluation of text mining tools may not predict performance on realistic tasks. PACIFIC SYMPOSIUM ON BIOCOMPUTING. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2008:640-51. [PMID: 18229722 PMCID: PMC2517250] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 05/25/2023]
Abstract
Biomedical text mining and other automated techniques are beginning to achieve performance which suggests that they could be applied to aid database curators. However, few studies have evaluated how these systems might work in practice. In this article we focus on the problem of annotating mutations in Protein Data Bank (PDB) entries, and evaluate the relationship between performance of two automated techniques, a text-mining-based approach (MutationFinder) and an alignment-based approach, in intrinsic versus extrinsic evaluations. We find that high performance on gold standard data (an intrinsic evaluation) does not necessarily translate to high performance for database annotation (an extrinsic evaluation). We show that this is in part a result of lack of access to the full text of journal articles, which appears to be critical for comprehensive database annotation by text mining. Additionally, we evaluate the accuracy and completeness of manually annotated mutation data in the PDB, and find that it is far from perfect. We conclude that currently the most cost-effective and reliable approach for database annotation might incorporate manual and automatic annotation methods.
Collapse
Affiliation(s)
- J Gregory Caporaso
- Center for Computational Pharmacology, University of Colorado Health Sciences Center, Aurora, CO, USA
| | | | | | | | | | | |
Collapse
|
123
|
Zhou D, He Y. Extracting interactions between proteins from the literature. J Biomed Inform 2007; 41:393-407. [PMID: 18207462 DOI: 10.1016/j.jbi.2007.11.008] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2007] [Revised: 11/21/2007] [Accepted: 11/28/2007] [Indexed: 11/29/2022]
Abstract
During the last decade, biomedicine has witnessed a tremendous development. Large amounts of experimental and computational biomedical data have been generated along with new discoveries, which are accompanied by an exponential increase in the number of biomedical publications describing these discoveries. In the meantime, there has been a great interest with scientific communities in text mining tools to find knowledge such as protein-protein interactions, which is most relevant and useful for specific analysis tasks. This paper provides a outline of the various information extraction methods in biomedical domain, especially for discovery of protein-protein interactions. It surveys methodologies involved in plain texts analyzing and processing, categorizes current work in biomedical information extraction, and provides examples of these methods. Challenges in the field are also presented and possible solutions are discussed.
Collapse
Affiliation(s)
- Deyu Zhou
- Informatics Research Centre, The University of Reading, Reading, RG6 6BX, UK.
| | | |
Collapse
|
124
|
Torii M, Hu ZZ, Song M, Wu CH, Liu H. A comparison study on algorithms of detecting long forms for short forms in biomedical text. BMC Bioinformatics 2007; 8 Suppl 9:S5. [PMID: 18047706 PMCID: PMC2217663 DOI: 10.1186/1471-2105-8-s9-s5] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
MOTIVATION With more and more research dedicated to literature mining in the biomedical domain, more and more systems are available for people to choose from when building literature mining applications. In this study, we focus on one specific kind of literature mining task, i.e., detecting definitions of acronyms, abbreviations, and symbols in biomedical text. We denote acronyms, abbreviations, and symbols as short forms (SFs) and their corresponding definitions as long forms (LFs). The study was designed to answer the following questions; i) how well a system performs in detecting LFs from novel text, ii) what the coverage is for various terminological knowledge bases in including SFs as synonyms of their LFs, and iii) how to combine results from various SF knowledge bases. METHOD We evaluated the following three publicly available detection systems in detecting LFs for SFs: i) a handcrafted pattern/rule based system by Ao and Takagi, ALICE, ii) a machine learning system by Chang et al., and iii) a simple alignment-based program by Schwartz and Hearst. In addition, we investigated the conceptual coverage of two terminological knowledge bases: i) the UMLS (the Unified Medical Language System), and ii) the BioThesaurus (a thesaurus of names for all UniProt protein records). We also implemented a web interface that provides a virtual integration of various SF knowledge bases. RESULTS We found that detection systems agree with each other on most cases, and the existing terminological knowledge bases have a good coverage of synonymous relationship for frequently defined LFs. The web interface allows people to detect SF definitions from text and to search several SF knowledge bases. AVAILABILITY The web site is http://gauss.dbb.georgetown.edu/liblab/SFThesaurus.
Collapse
Affiliation(s)
- Manabu Torii
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, 4000 Resevoir Rd, NW, Washington, DC 20057, USA.
| | | | | | | | | |
Collapse
|
125
|
Abstract
It is now almost 15 years since the publication of the first paper on text mining in the genomics domain, and decades since the first paper on text mining in the medical domain. Enormous progress has been made in the areas of information retrieval, evaluation methodologies and resource construction. Some problems, such as abbreviation-handling, can essentially be considered solved problems, and others, such as identification of gene mentions in text, seem likely to be solved soon. However, a number of problems at the frontiers of biomedical text mining continue to present interesting challenges and opportunities for great improvements and interesting research. In this article we review the current state of the art in biomedical text mining or 'BioNLP' in general, focusing primarily on papers published within the past year.
Collapse
|
126
|
Torvik VI, Smalheiser NR. A quantitative model for linking two disparate sets of articles in MEDLINE. ACTA ACUST UNITED AC 2007; 23:1658-65. [PMID: 17463015 DOI: 10.1093/bioinformatics/btm161] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
BACKGROUND Identifying information that implicitly links two disparate sets of articles is a fundamental and intuitive data mining strategy that can help investigators address real scientific questions. The Arrowsmith two-node search finds title words and phrases (so-called B-terms) that are shared across two sets of articles within MEDLINE and displays them in a manner that facilitates human assessment. A serious stumbling-block has been the lack of a quantitative model for predicting which of the hundreds if not thousands of B-terms computed for a given search are most likely to be relevant to the investigator. METHODOLOGY/PRINCIPAL FINDINGS Using a public two-node search interface, field testers devised a set of two-node searches under real life conditions and a certain number of B-terms were marked relevant. These were employed as 'gold standards;' each B-term was characterized according to eight complementary features that were strongly correlated with relevance. A logistic regression model was developed that permits one to estimate the probability of relevance for each B-term, to rank B-terms according to their likely relevance, and to estimate the overall number of relevant B-terms inherent in a given two-node search. CONCLUSIONS/SIGNIFICANCE The model greatly simplifies and streamlines the process of carrying out a two-node search, and may be applicable to a number of other literature-based discovery applications, including the so-called one-node search and related gene-centric strategies that incorporate implicit links to predict how genes may be related to each other and to human diseases. This should encourage much wider exploration of text mining for implicit information among the general scientific community. AVAILABILITY Two-node searches can be carried out freely at http://arrowsmith.psych.uic.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Vetle I Torvik
- Department of Psychiatry and Psychiatric Institute (MC912), University of Illinois-Chicago, Chicago, IL 60612, USA
| | | |
Collapse
|
127
|
Schulz S, Hahn U. Towards the ontological foundations of symbolic biological theories. Artif Intell Med 2007; 39:237-50. [PMID: 17321118 DOI: 10.1016/j.artmed.2006.12.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2004] [Revised: 12/05/2006] [Accepted: 12/06/2006] [Indexed: 10/23/2022]
Abstract
OBJECTIVE Support for the symbolic representation of the physical structure of living organisms by an ontologically solid and logically sound foundation as a basis for formal reasoning. METHODS A set of canonical relations and attributes necessary for empirically adequate descriptions of biological entities is proposed. RESULTS It is shown how a broad range of biological organisms and their parts can be represented by cascading theories which are ordered by the dimensions of granularity, development, species, and canonicity. CONCLUSION The proposed representation of biological objects is non-redundant and compatible with inter- and intra-species similarities, developmental stages and pathological deviations.
Collapse
Affiliation(s)
- Stefan Schulz
- Department of Medical Informatics, Freiburg University Hospital, Stefan-Meier-Strasse 26, 79104 Freiburg, Germany.
| | | |
Collapse
|
128
|
Ananiadou S, Fluck J. Proceedings of the Second International Symposium for Semantic Mining in Biomedicine. BMC Bioinformatics 2006. [PMCID: PMC1764445 DOI: 10.1186/1471-2105-7-s3-s1] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/03/2022] Open
|
129
|
Ananiadou S, Kell DB, Tsujii JI. Text mining and its potential applications in systems biology. Trends Biotechnol 2006; 24:571-9. [PMID: 17045684 DOI: 10.1016/j.tibtech.2006.10.002] [Citation(s) in RCA: 164] [Impact Index Per Article: 8.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2006] [Revised: 08/21/2006] [Accepted: 10/02/2006] [Indexed: 11/15/2022]
Abstract
With biomedical literature increasing at a rate of several thousand papers per week, it is impossible to keep abreast of all developments; therefore, automated means to manage the information overload are required. Text mining techniques, which involve the processes of information retrieval, information extraction and data mining, provide a means of solving this. By adding meaning to text, these techniques produce a more structured analysis of textual knowledge than simple word searches, and can provide powerful tools for the production and analysis of systems biology models.
Collapse
Affiliation(s)
- Sophia Ananiadou
- School of Computer Science, National Centre for Text Mining, The Manchester Interdisciplinary Biocentre, The University of Manchester, 131 Princess Street, Manchester M1 7ND, UK.
| | | | | |
Collapse
|