1
Sousa DF, Couto FM. K-RET: knowledgeable biomedical relation extraction system. Bioinformatics (Oxford, England) 2023; 39:7108769. [PMID: 37018156 PMCID: PMC10112952 DOI: 10.1093/bioinformatics/btad174] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2022] [Revised: 02/25/2023] [Accepted: 03/29/2023] [Indexed: 04/20/2023]
Abstract
MOTIVATION Relation extraction (RE) is a crucial process to deal with the amount of text published daily, e.g. to find missing associations in a database. RE is a text mining task for which the state-of-the-art approaches use bidirectional encoders, namely, BERT. However, state-of-the-art performance may be limited by the lack of efficient external knowledge injection approaches, with a larger impact in the biomedical area given the widespread usage and high quality of biomedical ontologies. This knowledge can propel these systems forward by aiding them in predicting more explainable biomedical associations. With this in mind, we developed K-RET, a novel, knowledgeable biomedical RE system that, for the first time, injects knowledge by handling different types of associations, multiple sources and where to apply it, and multi-token entities. RESULTS We tested K-RET on three independent and open-access corpora (DDI, BC5CDR, and PGR) using four biomedical ontologies handling different entities. K-RET improved state-of-the-art results by 2.68% on average, with the DDI Corpus yielding the most significant boost in performance, from 79.30% to 87.19% in F-measure, representing a P-value of 2.91×10⁻¹². AVAILABILITY AND IMPLEMENTATION https://github.com/lasigeBioTM/K-RET.
Affiliation(s)
- Diana F Sousa
- Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal
- Francisco M Couto
- Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal
2
Sosa DN, Altman RB. Contexts and contradictions: a roadmap for computational drug repurposing with knowledge inference. Brief Bioinform 2022; 23:bbac268. [PMID: 35817308 PMCID: PMC9294417 DOI: 10.1093/bib/bbac268] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 03/28/2022] [Revised: 05/25/2022] [Accepted: 06/07/2022] [Indexed: 11/30/2022]
Abstract
The cost of drug development continues to rise and may be prohibitive in cases of unmet clinical need, particularly for rare diseases. Artificial intelligence-based methods are promising in their potential to discover new treatment options. The task of drug repurposing hypothesis generation is well-posed as a link prediction problem in a knowledge graph (KG) of interacting drugs, proteins, genes and disease phenotypes. KGs derived from biomedical literature are semantically rich and up-to-date representations of scientific knowledge. Inference methods on scientific KGs can be confounded by unspecified contexts and contradictions. Extracting context enables incorporation of relevant pharmacokinetic and pharmacodynamic detail, such as tissue specificity of interactions. Contradictions in biomedical KGs may arise when contexts are omitted or due to contradicting research claims. In this review, we describe challenges to creating literature-scale representations of pharmacological knowledge and survey current approaches toward incorporating context and resolving contradictions.
Affiliation(s)
- Daniel N Sosa
- Department of Biomedical Data Science, Stanford University, 443 Via Ortega, 94305, California, USA
- Russ B Altman
- Department of Biological Engineering; Department of Genetics; Department of Biomedical Data Science, Stanford University, 443 Via Ortega, 94305, California, USA
3
Rosemblat G, Fiszman M, Shin D, Kilicoglu H. Towards a characterization of apparent contradictions in the biomedical literature using context analysis. J Biomed Inform 2019; 98:103275. [PMID: 31473364 DOI: 10.1016/j.jbi.2019.103275] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 03/07/2019] [Revised: 08/26/2019] [Accepted: 08/28/2019] [Indexed: 11/19/2022]
Abstract
BACKGROUND With the substantial growth in the biomedical research literature, a larger number of claims are published daily, some of which seemingly disagree with or contradict prior claims on the same topics. Resolving such contradictions is critical to advancing our understanding of human disease and developing effective treatments. Automated text analysis techniques can facilitate such analysis by extracting claims from the literature, flagging those that are potentially contradictory, and identifying any study characteristics that may explain such contradictions. METHODS Using SemMedDB, our own PubMed-scale repository of semantic predications (subject-relation-object triples), we identified apparent contradictions in the biomedical research literature and developed a categorization of contextual characteristics that explain such contradictions. Clinically relevant semantic predications relating to 20 diseases and involving opposing predicate pairs (e.g., an intervention treats or causes a disease) were retrieved from SemMedDB. After addressing inference, uncertainty, generic concepts, and NLP errors through automatic and manual filtering steps, a set of apparent contradictions were identified and characterized. RESULTS We retrieved 117,676 predication instances from 62,360 PubMed abstracts (Jan 1980-Dec 2016). From these instances, automatic filtering steps generated 2236 candidate contradictory pairs. Through manual analysis, we determined that 58 of these pairs (2.6%) were apparent contradictions. We identified five main categories of contextual characteristics that explain these contradictions: (a) internal to the patient, (b) external to the patient, (c) endogenous/exogenous, (d) known controversy, and (e) contradictions in literature. Categories (a) and (b) were subcategorized further (e.g., species, dosage) and accounted for the bulk of the contradictory information. 
CONCLUSIONS Semantic predications, by accounting for lexical variability, and SemMedDB, owing to its literature scale, can support identification and elucidation of potentially contradictory claims across the biomedical domain. Further filtering and classification steps are needed to distinguish among them the true contradictory claims. The ability to detect contradictions automatically can facilitate important biomedical knowledge management tasks, such as tracking and verifying scientific claims, summarizing research on a given topic, identifying knowledge gaps, and assessing evidence for systematic reviews, with potential benefits to the scientific community. Future work will focus on automating these steps for fully automatic recognition of contradictions from the biomedical research literature.
Affiliation(s)
- Graciela Rosemblat
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.
- Marcelo Fiszman
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.
- Dongwook Shin
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.
- Halil Kilicoglu
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.
4
Ahmed Z, Zeeshan S, Dandekar T. Mining biomedical images towards valuable information retrieval in biomedical and life sciences. Database (Oxford) 2016; 2016:baw118. [PMID: 27538578 PMCID: PMC4990152 DOI: 10.1093/database/baw118] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 12/14/2015] [Revised: 06/07/2016] [Accepted: 07/19/2016] [Indexed: 12/22/2022]
Abstract
Biomedical images are helpful sources for scientists and practitioners in drawing significant hypotheses, exemplifying approaches and describing experimental results in published biomedical literature. In the last decades, there has been an enormous increase in the amount of heterogeneous biomedical image production and publication, which results in a need for bioimaging platforms for feature extraction and analysis of text and content in biomedical images, to be used in implementing effective information retrieval systems. In this review, we summarize technologies related to data mining of figures. We describe and compare the potential of different approaches in terms of their developmental aspects, used methodologies, produced results, achieved accuracies and limitations. Our comparative conclusions include current challenges for bioimaging software with selective image mining, embedded text extraction and processing of complex natural language queries.
Affiliation(s)
- Zeeshan Ahmed
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Saman Zeeshan
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Thomas Dandekar
- Department of Bioinformatics, Biocenter, University of Wuerzburg, Wuerzburg, Germany; EMBL, Computational Biology and Structures Program, Heidelberg, Germany
5
Wagner M, Vicinus B, Muthra ST, Richards TA, Linder R, Frick VO, Groh A, Rubie C, Weichert F. Text mining, a race against time? An attempt to quantify possible variations in text corpora of medical publications throughout the years. Comput Biol Med 2016; 73:173-85. [PMID: 27208610 DOI: 10.1016/j.compbiomed.2016.03.016] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 12/03/2015] [Revised: 03/19/2016] [Accepted: 03/21/2016] [Indexed: 11/29/2022]
Abstract
BACKGROUND The continuous growth of medical sciences literature indicates the need for automated text analysis. Scientific writing which is neither unitary, transcending social situation nor defined by a timeless idea is subject to constant change as it develops in response to evolving knowledge, aims at different goals, and embodies different assumptions about nature and communication. The objective of this study was to evaluate whether publication dates should be considered when performing text mining. METHODS A search of PUBMED for combined references to chemokine identifiers and particular cancer related terms was conducted to detect changes over the past 36 years. Text analyses were performed using freeware available from the World Wide Web. TOEFL Scores of territories hosting institutional affiliations as well as various readability indices were investigated. Further assessment was conducted using Principal Component Analysis. Laboratory examination was performed to evaluate the quality of attempts to extract content from the examined linguistic features. RESULTS The PUBMED search yielded a total of 14,420 abstracts (3,190,219 words). The range of findings in laboratory experimentation were coherent with the variability of the results described in the analyzed body of literature. Increased concurrence of chemokine identifiers together with cancer related terms was found at the abstract and sentence level, whereas complexity of sentences remained fairly stable. CONCLUSIONS The findings of the present study indicate that concurrent references to chemokines and cancer increased over time whereas text complexity remained stable.
Affiliation(s)
- Mathias Wagner
- Department of Pathology, University of Saarland, Homburg Saar Campus, Homburg Saar, Germany
- Benjamin Vicinus
- Department of General, Visceral, Vascular and Pediatric Surgery, University of Saarland, Homburg Saar Campus, Homburg Saar, Germany; Institute of Virology, University of Saarland, Homburg Saar Campus, Homburg Saar, Germany
- Sherieda T Muthra
- Lombardi Comprehensive Cancer Center, Georgetown University, 37th & O St NW, Washington, DC 20057, United States of America.
- Tereza A Richards
- The Medical Library, University of the West Indies, Mona, Kingston, Jamaica
- Roland Linder
- Institute of Medical Informatics, University of Luebeck, Luebeck, Germany
- Vilma Oliveira Frick
- Department of General, Visceral, Vascular and Pediatric Surgery, University of Saarland, Homburg Saar Campus, Homburg Saar, Germany
- Andreas Groh
- Department of Mathematics, University of Saarland, Saarbrücken Campus, Saarbrücken, Germany
- Claudia Rubie
- Department of General, Visceral, Vascular and Pediatric Surgery, University of Saarland, Homburg Saar Campus, Homburg Saar, Germany
- Frank Weichert
- Department of Computer Science VII, Technical University of Dortmund, Dortmund, Germany
6
A Relation Extraction Framework for Biomedical Text Using Hybrid Feature Set. Computational and Mathematical Methods in Medicine 2015; 2015:910423. [PMID: 26347797 PMCID: PMC4546954 DOI: 10.1155/2015/910423] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Received: 05/05/2015] [Revised: 06/17/2015] [Accepted: 06/29/2015] [Indexed: 11/27/2022]
Abstract
The information extraction from unstructured text segments is a complex task. Although manual information extraction often produces the best results, it is harder to manage biomedical data extraction manually because of the exponential increase in data size. Thus, there is a need for automatic tools and techniques for information extraction in biomedical text mining. Relation extraction is a significant area under biomedical information extraction that has gained much importance in the last two decades. A lot of work has been done on biomedical relation extraction focusing on rule-based and machine learning techniques. In the last decade, the focus has changed to hybrid approaches showing better results. This research presents a hybrid feature set for classification of relations between biomedical entities. The main contribution of this research is done in the semantic feature set where verb phrases are ranked using Unified Medical Language System (UMLS) and a ranking algorithm. Support Vector Machine and Naïve Bayes, the two effective machine learning techniques, are used to classify these relations. Our approach has been validated on the standard biomedical text corpus obtained from MEDLINE 2001. Conclusively, it can be articulated that our framework outperforms all state-of-the-art approaches used for relation extraction on the same corpus.
7
Rebholz-Schuhmann D, Oellrich A, Hoehndorf R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 2012; 13:829-39. [DOI: 10.1038/nrg3337] [Citation(s) in RCA: 170] [Impact Index Per Article: 13.1] [Indexed: 12/24/2022]
8
Agarwal S, Yu H, Kohane I. BioNØT: a searchable database of biomedical negated sentences. BMC Bioinformatics 2011; 12:420. [PMID: 22032181 PMCID: PMC3225379 DOI: 10.1186/1471-2105-12-420] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 02/24/2011] [Accepted: 10/27/2011] [Indexed: 11/10/2022]
Abstract
BACKGROUND Negated biomedical events are often ignored by text-mining applications; however, such events carry scientific significance. We report on the development of BioNØT, a database of negated sentences that can be used to extract such negated events. DESCRIPTION Currently BioNØT incorporates ≈32 million negated sentences, extracted from over 336 million biomedical sentences from three resources: ≈2 million full-text biomedical articles in Elsevier and the PubMed Central, as well as ≈20 million abstracts in PubMed. We evaluated BioNØT on three important genetic disorders: autism, Alzheimer's disease and Parkinson's disease, and found that BioNØT is able to capture negated events that may be ignored by experts. CONCLUSIONS The BioNØT database can be a useful resource for biomedical researchers. BioNØT is freely available at http://bionot.askhermes.org/. In future work, we will develop semantic web related technologies to enrich BioNØT.
Affiliation(s)
- Shashank Agarwal
- Medical Informatics, College of Engineering and Applied Sciences, University of Wisconsin-Milwaukee, 3200 N, Cramer St, Milwaukee, WI 53201-0784, USA
9
Faro A, Giordano D, Spampinato C. Combining literature text mining with microarray data: advances for system biology modeling. Brief Bioinform 2011; 13:61-82. [PMID: 21677032 DOI: 10.1093/bib/bbr018] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.2] [Indexed: 01/10/2023]
Abstract
A huge amount of important biomedical information is hidden in the bulk of research articles in biomedical fields. At the same time, the publication of databases of biological information and of experimental datasets generated by high-throughput methods is in great expansion, and a wealth of annotated gene databases, chemical, genomic (including microarray datasets), clinical and other types of data repositories are now available on the Web. Thus a current challenge of bioinformatics is to develop targeted methods and tools that integrate scientific literature, biological databases and experimental data for reducing the time of database curation and for accessing evidence, either in the literature or in the datasets, useful for the analysis at hand. Under this scenario, this article reviews the knowledge discovery systems that fuse information from the literature, gathered by text mining, with microarray data for enriching the lists of down- and upregulated genes with elements for biological understanding and for generating and validating new biological hypotheses. Finally, an easy-to-use and freely accessible tool, GeneWizard, that exploits text mining and microarray data fusion for supporting researchers in discovering gene-disease relationships is described.
Affiliation(s)
- Alberto Faro
- Department of Informatics and Telecommunication Engineering-University of Catania, Catania, Italy
10
Prasad R, McRoy S, Frid N, Joshi A, Yu H. The biomedical discourse relation bank. BMC Bioinformatics 2011; 12:188. [PMID: 21605399 PMCID: PMC3130691 DOI: 10.1186/1471-2105-12-188] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Received: 10/14/2010] [Accepted: 05/23/2011] [Indexed: 12/17/2022]
Abstract
Background Identification of discourse relations, such as causal and contrastive relations, between situations mentioned in text is an important task for biomedical text-mining. A biomedical text corpus annotated with discourse relations would be very useful for developing and evaluating methods for biomedical discourse processing. However, little effort has been made to develop such an annotated resource. Results We have developed the Biomedical Discourse Relation Bank (BioDRB), in which we have annotated explicit and implicit discourse relations in 24 open-access full-text biomedical articles from the GENIA corpus. Guidelines for the annotation were adapted from the Penn Discourse TreeBank (PDTB), which has discourse relations annotated over open-domain news articles. We introduced new conventions and modifications to the sense classification. We report reliable inter-annotator agreement of over 80% for all sub-tasks. Experiments for identifying the sense of explicit discourse connectives show the connective itself as a highly reliable indicator for coarse sense classification (accuracy 90.9% and F1 score 0.89). These results are comparable to results obtained with the same classifier on the PDTB data. With more refined sense classification, there is degradation in performance (accuracy 69.2% and F1 score 0.28), mainly due to sparsity in the data. The size of the corpus was found to be sufficient for identifying the sense of explicit connectives, with classifier performance stabilizing at about 1900 training instances. Finally, the classifier performs poorly when trained on PDTB and tested on BioDRB (accuracy 54.5% and F1 score 0.57). Conclusion Our work shows that discourse relations can be reliably annotated in biomedical text. Coarse sense disambiguation of explicit connectives can be done with high reliability by using just the connective as a feature, but more refined sense classification requires either richer features or more annotated data. 
The poor performance of a classifier trained in the open domain and tested in the biomedical domain suggests significant differences in the semantic usage of connectives across these domains, and provides robust evidence for a biomedical sublanguage for discourse and the need to develop a specialized biomedical discourse annotated corpus. The results of our cross-domain experiments are consistent with related work on identifying connectives in BioDRB.
Affiliation(s)
- Rashmi Prasad
- Institute for Research in Cognitive Science, University of Pennsylvania, 3401 Walnut Street, Philadelphia, PA 19104, USA
11
Yang H, Swaminathan R, Sharma A, Ketkar V, D'Silva J. Mining Biomedical Text towards Building a Quantitative Food-Disease-Gene Network. Learning Structure and Schemas from Documents 2011. [DOI: 10.1007/978-3-642-22913-8_10] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Indexed: 01/07/2023]
12
Cano C, Monaghan T, Blanco A, Wall DP, Peshkin L. Collaborative text-annotation resource for disease-centered relation extraction from biomedical text. J Biomed Inform 2009; 42:967-77. [PMID: 19232400 PMCID: PMC2757509 DOI: 10.1016/j.jbi.2009.02.001] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 05/28/2008] [Revised: 12/04/2008] [Accepted: 02/04/2009] [Indexed: 11/30/2022]
Abstract
Agglomerating results from studies of individual biological components has shown the potential to produce biomedical discovery and the promise of therapeutic development. Such knowledge integration could be tremendously facilitated by automated text mining for relation extraction in the biomedical literature. Relation extraction systems cannot be developed without substantial datasets annotated with ground truth for benchmarking and training. The creation of such datasets is hampered by the absence of a resource for launching a distributed annotation effort, as well as by the lack of a standardized annotation schema. We have developed an annotation schema and an annotation tool which can be widely adopted so that the resulting annotated corpora from a multitude of disease studies could be assembled into a unified benchmark dataset. The contribution of this paper is threefold. First, we provide an overview of available benchmark corpora and derive a simple annotation schema for specific binary relation extraction problems such as protein-protein and gene-disease relation extraction. Second, we present BioNotate: an open source annotation resource for the distributed creation of a large corpus. Third, we present and make available the results of a pilot annotation effort of the autism disease network.
Affiliation(s)
- C Cano
- Department of Computer Science and Artificial Intelligence, University of Granada, 18071 Granada, Spain.
13
Abstract
It is now almost 15 years since the publication of the first paper on text mining in the genomics domain, and decades since the first paper on text mining in the medical domain. Enormous progress has been made in the areas of information retrieval, evaluation methodologies and resource construction. Some problems, such as abbreviation-handling, can essentially be considered solved problems, and others, such as identification of gene mentions in text, seem likely to be solved soon. However, a number of problems at the frontiers of biomedical text mining continue to present interesting challenges and opportunities for great improvements and interesting research. In this article we review the current state of the art in biomedical text mining or 'BioNLP' in general, focusing primarily on papers published within the past year.
14
Sanchez-Graillet O, Poesio M. Negation of protein-protein interactions: analysis and extraction. Bioinformatics 2007; 23:i424-32. [PMID: 17646327 DOI: 10.1093/bioinformatics/btm184] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Indexed: 11/13/2022]
Abstract
MOTIVATION Negative information about protein-protein interactions, from uncertainty about the occurrence of an interaction to knowledge that it did not occur, is often of great use to biologists and could lead to important discoveries. Yet, to our knowledge, no approaches focusing on extracting such information have been proposed in the text mining literature. RESULTS In this work, we present an analysis of the types of negative information that are reported, and a heuristic-based system using a full dependency parser to extract such information. We performed a preliminary evaluation study that shows encouraging results for our system. Finally, we have obtained an initial corpus of negative protein-protein interactions as a basis for the construction of larger ones. AVAILABILITY The corpus is available by request from the authors.