1. Aftab W, Apostolou Z, Bouazoune K, Straub T. Optimizing biomedical information retrieval with a keyword frequency-driven prompt enhancement strategy. BMC Bioinformatics 2024; 25:281. [PMID: 39192204] [DOI: 10.1186/s12859-024-05902-7]
Abstract
BACKGROUND Mining the vast pool of biomedical literature to extract accurate responses and relevant references is challenging due to the domain's interdisciplinary nature, specialized jargon, and continuous evolution. Early natural language processing (NLP) approaches often led to incorrect answers as they failed to comprehend the nuances of natural language. However, transformer models have significantly advanced the field by enabling the creation of large language models (LLMs), enhancing question-answering (QA) tasks. Despite these advances, current LLM-based solutions for specialized domains like biology and biomedicine still struggle to generate up-to-date responses while avoiding "hallucination", that is, generating plausible but factually incorrect responses. RESULTS Our work focuses on enhancing prompts using a retrieval-augmented architecture to guide LLMs in generating meaningful responses for biomedical QA tasks. We evaluated two approaches: one relying on text embedding and vector similarity in a high-dimensional space, and our proposed method, which uses explicit signals in user queries to extract meaningful contexts. For robust evaluation, we tested these methods on 50 specific and challenging questions from diverse biomedical topics, comparing their performance against a baseline model, BM25. Retrieval performance of our method was significantly better than that of the others, achieving a median Precision@10 of 0.95, i.e., the fraction of the top 10 retrieved chunks that are relevant. We used GPT-4, OpenAI's most advanced LLM, to maximize answer quality and manually assessed the LLM-generated responses. Our method achieved a median answer quality score of 2.5, surpassing both the baseline model and the text embedding-based approach. We developed a QA bot, WeiseEule (https://github.com/wasimaftab/WeiseEule-LocalHost), which utilizes these methods for comparative analysis and also offers advanced features for review writing and identifying relevant articles for citation. CONCLUSIONS Our findings highlight the importance of prompt enhancement methods that utilize explicit signals in user queries, over traditional text embedding-based approaches, to improve LLM-generated responses for queries in specialized domains such as biology and biomedicine. By giving users complete control over the information fed into the LLM, our approach addresses some of the major drawbacks of existing web-based chatbots and LLM-based QA systems, including hallucinations and the generation of irrelevant or outdated responses.
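For readers unfamiliar with the retrieval metric quoted above, the following is a minimal sketch (mine, not code from the paper) of how Precision@10 can be computed from a ranked list of retrieved chunks and a set of relevance judgments:

```python
# Minimal sketch (not code from the paper): Precision@10 as defined above, i.e. the
# fraction of the top 10 retrieved chunks that are judged relevant.
from typing import List, Set

def precision_at_k(ranked_chunk_ids: List[str], relevant_ids: Set[str], k: int = 10) -> float:
    top_k = ranked_chunk_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for chunk_id in top_k if chunk_id in relevant_ids) / len(top_k)

# Example: 9 of the top 10 retrieved chunks are relevant -> 0.9.
print(precision_at_k([f"chunk{i}" for i in range(10)], {f"chunk{i}" for i in range(9)}))
```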
Affiliation(s)
- Wasim Aftab
- Core Facility Bioinformatics, Biomedical Center, LMU Munich, Grosshaderner Str. 9, 82152 Martinsried, Germany
- Zivkos Apostolou
- Molecular Biology Division, Biomedical Center, LMU Munich, Grosshaderner Str. 9, 82152 Martinsried, Germany
- Karim Bouazoune
- Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA 16802, USA
- Tobias Straub
- Core Facility Bioinformatics, Biomedical Center, LMU Munich, Grosshaderner Str. 9, 82152 Martinsried, Germany
2. Kell G, Roberts A, Umansky S, Qian L, Ferrari D, Soboczenski F, Wallace BC, Patel N, Marshall IJ. Question answering systems for health professionals at the point of care - a systematic review. J Am Med Inform Assoc 2024; 31:1009-1024. [PMID: 38366879] [PMCID: PMC10990539] [DOI: 10.1093/jamia/ocae015]
Abstract
OBJECTIVES Question answering (QA) systems have the potential to improve the quality of clinical care by providing health professionals with the latest and most relevant evidence. However, QA systems have not been widely adopted. This systematic review aims to characterize current medical QA systems, assess their suitability for healthcare, and identify areas of improvement. MATERIALS AND METHODS We searched PubMed, IEEE Xplore, ACM Digital Library, ACL Anthology, and forward and backward citations on February 7, 2023. We included peer-reviewed journal and conference papers describing the design and evaluation of biomedical QA systems. Two reviewers screened titles, abstracts, and full-text articles. We conducted a narrative synthesis and risk of bias assessment for each study. We assessed the utility of biomedical QA systems. RESULTS We included 79 studies and identified themes, including question realism, answer reliability, answer utility, clinical specialism, systems, usability, and evaluation methods. Clinicians' questions used to train and evaluate QA systems were restricted to certain sources, types and complexity levels. No system communicated confidence levels in the answers or sources. Many studies suffered from high risks of bias and applicability concerns. Only 8 studies completely satisfied any criterion for clinical utility, and only 7 reported user evaluations. Most systems were built with limited input from clinicians. DISCUSSION While machine learning methods have led to increased accuracy, most studies imperfectly reflected real-world healthcare information needs. Key research priorities include developing more realistic healthcare QA datasets and considering the reliability of answer sources, rather than merely focusing on accuracy.
Affiliation(s)
- Gregory Kell
- Department of Population Health Sciences, King’s College London, London, Greater London, SE1 1UL, United Kingdom
- Angus Roberts
- Department of Biostatistics and Health Informatics, King’s College London, London, Greater London, SE5 8AB, United Kingdom
- Serge Umansky
- Metadvice Ltd, London, Greater London, SW1Y 5JG, United Kingdom
- Linglong Qian
- Department of Biostatistics and Health Informatics, King’s College London, London, Greater London, SE5 8AB, United Kingdom
- Davide Ferrari
- Department of Population Health Sciences, King’s College London, London, Greater London, SE1 1UL, United Kingdom
- Frank Soboczenski
- Department of Population Health Sciences, King’s College London, London, Greater London, SE1 1UL, United Kingdom
- Byron C Wallace
- Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, United States
- Nikhil Patel
- Department of Population Health Sciences, King’s College London, London, Greater London, SE1 1UL, United Kingdom
- Iain J Marshall
- Department of Population Health Sciences, King’s College London, London, Greater London, SE1 1UL, United Kingdom
3.
Abstract
Smart healthcare has achieved significant progress in recent years. Emerging artificial intelligence (AI) technologies enable a variety of smart applications across diverse healthcare scenarios. As an essential technology powered by AI, natural language processing (NLP) plays a key role in smart healthcare due to its capability of analysing and understanding human language. In this work, we review existing studies on NLP for smart healthcare from the perspectives of technique and application. We first elaborate on different NLP approaches and the NLP pipeline for smart healthcare from the technical point of view. Then, in the context of smart healthcare applications employing NLP techniques, we introduce representative smart healthcare scenarios, including clinical practice, hospital management, personal care, public health, and drug development. We further discuss two specific medical issues, i.e., the coronavirus disease 2019 (COVID-19) pandemic and mental health, in which NLP-driven smart healthcare plays an important role. Finally, we discuss the limitations of current work and identify directions for future work.
4. Arabzadeh N, Bagheri E. A self-supervised language model selection strategy for biomedical question answering. J Biomed Inform 2023; 146:104486. [PMID: 37722445] [DOI: 10.1016/j.jbi.2023.104486]
Abstract
Large neural-based Pre-trained Language Models (PLM) have recently gained much attention due to their noteworthy performance in many downstream Information Retrieval (IR) and Natural Language Processing (NLP) tasks. PLMs can be categorized as either general-purpose, which are trained on resources such as large-scale Web corpora, or domain-specific, which are trained on in-domain or mixed-domain corpora. While domain-specific PLMs have shown promising performance on domain-specific tasks, they are significantly more computationally expensive than general-purpose PLMs as they have to be either retrained or trained from scratch. The objective of our work in this paper is to explore whether general-purpose PLMs can be leveraged to achieve performance competitive with domain-specific PLMs without the need for expensive retraining for domain-specific tasks. Focusing specifically on the recent BioASQ Biomedical Question Answering task, we show how different general-purpose PLMs exhibit synergistic behaviour in terms of performance, which can lead to notable overall performance improvements when they are used in tandem. More concretely, given a set of general-purpose PLMs, we propose a self-supervised method for training a classifier that systematically selects the PLM most likely to answer the question correctly on a per-input basis. We show that through such a selection strategy, the performance of general-purpose PLMs can become competitive with domain-specific PLMs while remaining computationally light, since there is no need to retrain the large language model itself. We run experiments on the BioASQ dataset, a large-scale biomedical question-answering benchmark, and show that utilizing our proposed selection strategy yields statistically significant performance improvements for general-purpose language models, with an average of 16.7% when using only lighter models such as DistilBERT and DistilRoBERTa, and 14.2% when using relatively larger models such as BERT and RoBERTa, so that their performance becomes competitive with domain-specific large language models such as PubMedBERT.
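As a rough illustration of the selection idea described in this abstract, the sketch below (my assumptions, not the authors' code) derives training labels from which PLM answered each question correctly and trains a simple router; the TF-IDF features and the `plm_answer_fns` callables are hypothetical stand-ins:

```python
# Rough sketch (assumptions, not the authors' code) of self-supervised PLM selection:
# labels come from observing which candidate PLM answered each training question
# correctly, and a light-weight classifier then routes new questions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def build_selector(questions, gold_answers, plm_answer_fns):
    """plm_answer_fns: one callable per candidate PLM (hypothetical), question -> answer."""
    labels = []
    for question, gold in zip(questions, gold_answers):
        correct = [i for i, fn in enumerate(plm_answer_fns) if fn(question) == gold]
        labels.append(correct[0] if correct else 0)  # self-supervised label
    selector = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    selector.fit(questions, labels)
    return selector

# At inference time: answer = plm_answer_fns[selector.predict([q])[0]](q)
```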
5. Sarker A, Yang YC, Al-Garadi MA, Abbas A. A Light-Weight Text Summarization System for Fast Access to Medical Evidence. Front Digit Health 2021; 2:585559. [PMID: 34713057] [PMCID: PMC8521877] [DOI: 10.3389/fdgth.2020.585559]
Abstract
As the volume of published medical research continues to grow rapidly, staying up-to-date with the best-available research evidence regarding specific topics is becoming an increasingly challenging problem for medical experts and researchers. The current COVID-19 pandemic is a good example of a topic on which research evidence is rapidly evolving. Automatic query-focused text summarization approaches may help researchers to swiftly review research evidence by presenting salient and query-relevant information from newly-published articles in a condensed manner. Typical medical text summarization approaches require domain knowledge, and the performance of such systems relies on resource-heavy medical domain-specific knowledge sources and pre-processing methods (e.g., text classification) for deriving semantic information. Consequently, these systems are often difficult to speedily customize, extend, or deploy in low-resource settings, and they are often operationally slow. In this paper, we propose a fast and simple extractive summarization approach that can be easily deployed and run, and may thus help medical experts and researchers obtain fast access to the latest research evidence. At runtime, our system utilizes similarity measurements derived from pre-trained medical domain-specific word embeddings in addition to simple features, rather than computationally-expensive pre-processing and resource-heavy knowledge bases. Automatic evaluation using ROUGE, a summary evaluation tool, on a public dataset for evidence-based medicine shows that our system's performance, despite the simple implementation, is statistically comparable with the state of the art. Extrinsic manual evaluation based on recently-released COVID-19 articles demonstrates that the summarizer's performance is close to human agreement, which is generally low for extractive summarization.
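The following sketch illustrates the general flavour of embedding-similarity extractive summarization described above; it is an assumption-laden stand-in rather than the paper's implementation, and `word_vectors` is a hypothetical dict mapping words to vectors from pretrained medical-domain embeddings:

```python
# Assumption-laden sketch (not the paper's implementation) of query-focused extractive
# summarization: each sentence is scored by cosine similarity between the averaged word
# vectors of the query and of the sentence.
import numpy as np

def avg_vector(text, word_vectors, dim):
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def summarize(query, sentences, word_vectors, dim=200, top_n=3):
    query_vec = avg_vector(query, word_vectors, dim)
    def score(sentence):
        sent_vec = avg_vector(sentence, word_vectors, dim)
        denom = np.linalg.norm(query_vec) * np.linalg.norm(sent_vec)
        return float(query_vec @ sent_vec / denom) if denom else 0.0
    top = set(sorted(sentences, key=score, reverse=True)[:top_n])
    return [s for s in sentences if s in top]  # keep original sentence order
```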
Affiliation(s)
- Abeed Sarker
- Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, United States; Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA, United States
- Yuan-Chi Yang
- Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, United States
- Mohammed Ali Al-Garadi
- Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, United States
- Aamir Abbas
- Heinz College of Information Systems and Public Policy, Carnegie Mellon University, Pittsburgh, PA, United States
6. Queralt-Rosinach N, Stupp GS, Li TS, Mayers M, Hoatlin ME, Might M, Good BM, Su AI. Structured reviews for data and knowledge-driven research. Database (Oxford) 2020; 2020:baaa015. [PMID: 32283553] [PMCID: PMC7153956] [DOI: 10.1093/database/baaa015]
Abstract
Hypothesis generation is a critical step in research and a cornerstone in the rare disease field. Research is most efficient when those hypotheses are based on the entirety of knowledge known to date. Systematic review articles are commonly used in biomedicine to summarize existing knowledge and contextualize experimental data, but the information contained within review articles is typically only expressed as free text, which is difficult to use computationally. Researchers struggle to navigate, collect and remix prior knowledge as it is scattered across several silos without seamless integration and access. This lack of a structured information framework hinders research by both experimental and computational scientists. To better organize knowledge and data, we built a structured review article that is specifically focused on NGLY1 Deficiency, an ultra-rare genetic disease first reported in 2012. We represented this structured review as a knowledge graph and then stored this knowledge graph in a Neo4j database to simplify dissemination, querying and visualization of the network. Relative to free text, this structured review better promotes the principles of findability, accessibility, interoperability and reusability (FAIR). In collaboration with domain experts in NGLY1 Deficiency, we demonstrate how this resource can improve the efficiency and comprehensiveness of hypothesis generation. We also developed a read–write interface that allows domain experts to contribute FAIR structured knowledge to this community resource. In contrast to traditional free-text review articles, this structured review exists as a living knowledge graph that is curated by humans and accessible to computational analyses. Finally, we have generalized this workflow into modular and repurposable components that can be applied to other domain areas. This NGLY1 Deficiency-focused network is publicly available at http://ngly1graph.org/. Availability and implementation: network data files are at https://github.com/SuLab/ngly1-graph and source code at https://github.com/SuLab/bioknowledge-reviewer. Contact: asu@scripps.edu.
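A structured review stored as a knowledge graph can be queried directly; the sketch below (my assumptions, not the authors' code) shows how a Neo4j instance holding such a graph might be asked for short paths linking a gene to phenotypes. The node labels, relationship pattern, credentials and running database are hypothetical:

```python
# Minimal sketch (assumptions only): querying a review-derived knowledge graph in Neo4j
# for paths of up to three hops between a gene and any phenotype node.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

cypher = """
MATCH p = (g:Gene {symbol: $gene})-[*1..3]-(ph:Phenotype)
RETURN [n IN nodes(p) | coalesce(n.name, n.symbol)] AS path
LIMIT 10
"""

with driver.session() as session:
    for record in session.run(cypher, gene="NGLY1"):
        print(" -> ".join(record["path"]))
driver.close()
```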
Affiliation(s)
- Núria Queralt-Rosinach
- Department of Integrative Structural and Computational Biology, Scripps Research, 10550 N Torrey Pines Rd, La Jolla, CA 92037, USA
- Gregory S Stupp
- Department of Integrative Structural and Computational Biology, Scripps Research, 10550 N Torrey Pines Rd, La Jolla, CA 92037, USA
- Tong Shu Li
- Department of Integrative Structural and Computational Biology, Scripps Research, 10550 N Torrey Pines Rd, La Jolla, CA 92037, USA
- Michael Mayers
- Department of Integrative Structural and Computational Biology, Scripps Research, 10550 N Torrey Pines Rd, La Jolla, CA 92037, USA
- Maureen E Hoatlin
- Department of Biochemistry and Molecular Biology, Oregon Health and Science University, 3181 SW Sam Jackson Parkway, Portland, OR 97239, USA
- Matthew Might
- Department of Medicine, Hugh Kaul Precision Medicine Institute, University of Alabama at Birmingham, 510 20th St S, Birmingham, AL 35210, USA
- Benjamin M Good
- Department of Integrative Structural and Computational Biology, Scripps Research, 10550 N Torrey Pines Rd, La Jolla, CA 92037, USA
- Andrew I Su
- Department of Integrative Structural and Computational Biology, Scripps Research, 10550 N Torrey Pines Rd, La Jolla, CA 92037, USA
7. Calijorne Soares MA, Parreiras FS. A literature review on question answering techniques, paradigms and systems. Journal of King Saud University - Computer and Information Sciences 2020. [DOI: 10.1016/j.jksuci.2018.08.005]
8. Kilicoglu H, Rosemblat G, Fiszman M, Shin D. Broad-coverage biomedical relation extraction with SemRep. BMC Bioinformatics 2020; 21:188. [PMID: 32410573] [PMCID: PMC7222583] [DOI: 10.1186/s12859-020-3517-7]
Abstract
BACKGROUND In the era of information overload, natural language processing (NLP) techniques are increasingly needed to support advanced biomedical information management and discovery applications. In this paper, we present an in-depth description of SemRep, an NLP system that extracts semantic relations from PubMed abstracts using linguistic principles and UMLS domain knowledge. We also evaluate SemRep on two datasets. In one evaluation, we use a manually annotated test collection and perform a comprehensive error analysis. In another evaluation, we assess SemRep's performance on the CDR dataset, a standard benchmark corpus annotated with causal chemical-disease relationships. RESULTS A strict evaluation of SemRep on our manually annotated dataset yields 0.55 precision, 0.34 recall, and 0.42 F1 score. A relaxed evaluation, which more accurately characterizes SemRep performance, yields 0.69 precision, 0.42 recall, and 0.52 F1 score. An error analysis reveals named entity recognition/normalization as the largest source of errors (26.9%), followed by argument identification (14%) and trigger detection errors (12.5%). The evaluation on the CDR corpus yields 0.90 precision, 0.24 recall, and 0.38 F1 score. The recall and the F1 score increase to 0.35 and 0.50, respectively, when the evaluation on this corpus is limited to sentence-bound relationships, which represents a fairer evaluation, as SemRep operates at the sentence level. CONCLUSIONS SemRep is a broad-coverage, interpretable, strong baseline system for extracting semantic relations from biomedical text. It also underpins SemMedDB, a literature-scale knowledge graph based on semantic relations. Through SemMedDB, SemRep has had significant impact in the scientific community, supporting a variety of clinical and translational applications, including clinical decision making, medical diagnosis, drug repurposing, literature-based discovery and hypothesis generation, and contributing to improved health outcomes. In ongoing development, we are redesigning SemRep to increase its modularity and flexibility, and addressing weaknesses identified in the error analysis.
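As a quick check of the reported numbers (this worked example is mine, not from the paper), the strict-evaluation F1 follows from the standard definitions of precision, recall, and F1:

```latex
% P, R, and F1 in terms of true positives (TP), false positives (FP) and false negatives (FN);
% the reported strict scores are mutually consistent.
\[
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R} = \frac{2 \cdot 0.55 \cdot 0.34}{0.55 + 0.34} \approx 0.42 .
\]
```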
Affiliation(s)
- Halil Kilicoglu
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
- University of Illinois at Urbana-Champaign, School of Information Sciences, 501 E Daniel Street, Champaign, IL 61820, USA
- Graciela Rosemblat
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Dongwook Shin
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
9. Soni S, Gudala M, Wang DZ, Roberts K. Using FHIR to Construct a Corpus of Clinical Questions Annotated with Logical Forms and Answers. AMIA Annu Symp Proc 2020; 2019:1207-1215. [PMID: 32308918] [PMCID: PMC7153115]
Abstract
This paper describes a novel technique for annotating logical forms and answers for clinical questions by utilizing Fast Healthcare Interoperability Resources (FHIR). Such annotations are widely used in building the semantic parsing models (which aim at understanding the precise meaning of natural language questions by converting them to machine-understandable logical forms). These systems focus on reducing the time it takes for a user to get to information present in electronic health records (EHRs). Directly annotating questions with logical forms is a challenging task and involves a time-consuming step of concept normalization annotation. We aim to automate this step using the normalized codes present in a FHIR resource. Using the proposed approach, two annotators curated an annotated dataset of 1000 questions in less than 1 week. To assess the quality of these annotations, we trained a semantic parsing model which achieved an accuracy of 94.2% on this corpus.
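To make the FHIR-based normalization step concrete, here is a minimal sketch (my assumptions, not the paper's pipeline) of pulling normalized codes out of a FHIR resource so they can seed concept annotations; the Condition resource below is a made-up example:

```python
# Minimal sketch: extracting normalized codes from a FHIR resource so they can be
# reused as concept-normalization annotations for a question's logical form.
condition = {
    "resourceType": "Condition",
    "code": {
        "coding": [
            {"system": "http://snomed.info/sct", "code": "44054006",
             "display": "Diabetes mellitus type 2"}
        ],
        "text": "Type 2 diabetes",
    },
}

def normalized_codes(resource):
    """Return (system, code, display) triples from the resource's code element."""
    return [(c.get("system"), c.get("code"), c.get("display"))
            for c in resource.get("code", {}).get("coding", [])]

print(normalized_codes(condition))
```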
Affiliation(s)
- Sarvesh Soni
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX
- Meghana Gudala
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX
- Daisy Zhe Wang
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL
- Kirk Roberts
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX
10. Sarrouti M, Ouatik El Alaoui S. SemBioNLQA: A semantic biomedical question answering system for retrieving exact and ideal answers to natural language questions. Artif Intell Med 2019; 102:101767. [PMID: 31980104] [DOI: 10.1016/j.artmed.2019.101767]
Abstract
BACKGROUND AND OBJECTIVE Question answering (QA), the identification of short accurate answers to users' questions written in natural language expressions, is a longstanding issue widely studied over the last decades in the open domain. However, it still remains a real challenge in the biomedical domain, as most existing systems support a limited number of question and answer types and still require further efforts to improve their precision on the supported questions. Here, we present a semantic biomedical QA system named SemBioNLQA, which is able to handle yes/no, factoid, list, and summary natural language questions. METHODS This paper describes the system architecture and an evaluation of the developed end-to-end biomedical QA system SemBioNLQA, which consists of question classification, document retrieval, passage retrieval and answer extraction modules. It takes natural language questions as input, and outputs both short precise answers and summaries as results. The SemBioNLQA system, dealing with four types of questions, is based on (1) handcrafted lexico-syntactic patterns and a machine learning algorithm for question classification, (2) the PubMed search engine and UMLS similarity for document retrieval, (3) the BM25 model, stemmed words and UMLS concepts for passage retrieval, and (4) the UMLS Metathesaurus, BioPortal synonyms, sentiment analysis and a term frequency metric for answer extraction. RESULTS AND CONCLUSION Compared with the current state-of-the-art biomedical QA systems, SemBioNLQA, a fully automated system, has the potential to deal with a wide range of question and answer types. SemBioNLQA quickly meets users' information needs by returning exact answers (e.g., "yes", "no", a biomedical entity name, etc.) and ideal answers (i.e., paragraph-sized summaries of relevant information) for yes/no, factoid and list questions, whereas it provides only ideal answers for summary questions. Moreover, experimental evaluations performed on biomedical questions and answers provided by the BioASQ challenge, especially in 2015, 2016 and 2017 (as part of our participation), show that SemBioNLQA achieves good performance compared with the most recent state-of-the-art systems and offers a practical and competitive alternative to help information seekers find exact and ideal answers to their biomedical questions. The SemBioNLQA source code is publicly available at https://github.com/sarrouti/sembionlqa.
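For illustration, BM25-style passage scoring of the kind used in the passage retrieval module might look like the sketch below; it relies on the third-party rank_bm25 package with plain whitespace tokenization as a stand-in for the stemming and UMLS concept handling described above, and is not the SemBioNLQA code itself:

```python
# Illustrative sketch only: rank candidate passages for a question with BM25.
from rank_bm25 import BM25Okapi

passages = [
    "BRCA1 mutations increase the risk of breast and ovarian cancer.",
    "Metformin is a first-line treatment for type 2 diabetes.",
    "The BRCA1 protein participates in DNA double-strand break repair.",
]
bm25 = BM25Okapi([p.lower().split() for p in passages])

query_tokens = "which gene is associated with hereditary breast cancer".lower().split()
scores = bm25.get_scores(query_tokens)
best = max(range(len(passages)), key=lambda i: scores[i])
print(best, passages[best])
```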
Affiliation(s)
- Mourad Sarrouti
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, U.S. National Institutes of Health, Bethesda, MD
- Said Ouatik El Alaoui
- National School of Applied Sciences, Ibn Tofail University, Kenitra, Morocco; Laboratory of Informatics and Modeling, FSDM, Sidi Mohammed Ben Abdellah University, Fez, Morocco
11.
Abstract
The Semantic Web allows knowledge discovery on graph-based data sets and facilitates answering complex queries that are extremely difficult to achieve using traditional database approaches. In particular, the Semantic Web query language (SPARQL) has a ‘property path’ feature that enables knowledge discovery in a knowledgebase using its reasoning engine. In this article, we utilise the property path feature of SPARQL and other Semantic Web technologies to answer sophisticated queries posed over a disease data set. To this aim, we transform data from a disease web portal into a graph-based data set by designing an ontology, present a template to define the queries and provide a set of conjunctive queries on the data set. We illustrate how the reasoning engine behind SPARQL's ‘property path’ feature can retrieve the results from the designed knowledgebase. The results of this study were verified by two domain experts as well as by the authors' manual exploration of the disease web portal.
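A minimal sketch of a SPARQL property path query, run here with rdflib over a toy graph (my example, not the article's data set or ontology): the `+` operator follows one or more edges of the same predicate, which is the kind of transitive traversal the abstract refers to:

```python
# Toy sketch: SPARQL 1.1 property path ('+') over a tiny in-memory graph with rdflib.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.GeneA, EX.associatedWith, EX.PathwayX))
g.add((EX.PathwayX, EX.associatedWith, EX.DiseaseY))

query = """
PREFIX ex: <http://example.org/>
SELECT ?start ?end WHERE { ?start ex:associatedWith+ ?end . }
"""
for row in g.query(query):
    print(row.start, "->", row.end)  # includes GeneA -> DiseaseY via the 2-step path
```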
12. Kilicoglu H. Biomedical text mining for research rigor and integrity: tasks, challenges, directions. Brief Bioinform 2018; 19:1400-1414. [PMID: 28633401] [PMCID: PMC6291799] [DOI: 10.1093/bib/bbx057]
Abstract
An estimated quarter of a trillion US dollars is invested in the biomedical research enterprise annually. There is growing alarm that a significant portion of this investment is wasted because of problems in reproducibility of research findings and in the rigor and integrity of research conduct and reporting. Recent years have seen a flurry of activities focusing on standardization and guideline development to enhance the reproducibility and rigor of biomedical research. Research activity is primarily communicated via textual artifacts, ranging from grant applications to journal publications. These artifacts can be both the source and the manifestation of practices leading to research waste. For example, an article may describe a poorly designed experiment, or the authors may reach conclusions not supported by the evidence presented. In this article, we pose the question of whether biomedical text mining techniques can assist the stakeholders in the biomedical research enterprise in doing their part toward enhancing research integrity and rigor. In particular, we identify four key areas in which text mining techniques can make a significant contribution: plagiarism/fraud detection, ensuring adherence to reporting guidelines, managing information overload and accurate citation/enhanced bibliometrics. We review the existing methods and tools for specific tasks, if they exist, or discuss relevant research that can provide guidance for future work. With the exponential increase in biomedical research output and the ability of text mining approaches to perform automatic tasks at large scale, we propose that such approaches can support tools that promote responsible research practices, providing significant benefits for the biomedical research enterprise.
Affiliation(s)
- Halil Kilicoglu
- Lister Hill National Center for Biomedical Communications, US National Library of Medicine
13.
Abstract
BACKGROUND Health question-answering (QA) systems have become a typical application scenario of artificial intelligence (AI). An annotated question corpus is a prerequisite for training machines to understand the health information needs of users. Thus, we aimed to develop an annotated classification corpus of Chinese health questions (Qcorp) and make it openly accessible. METHODS We developed a two-layered classification schema and corresponding annotation rules on the basis of our previous work. Using the schema, we annotated 5000 questions that were randomly selected from 5 Chinese health websites within 6 broad sections. Eight annotators participated in the annotation task, and the inter-annotator agreement was evaluated to ensure corpus quality. Furthermore, the distribution and relationships of the annotated tags were measured by descriptive statistics and a social network map. RESULTS The questions were annotated with 7101 tags covering 29 topic categories in the two-layered schema. In the released corpus, the distribution of questions over the top-layer categories was: treatment, 64.22%; diagnosis, 37.14%; epidemiology, 14.96%; healthy lifestyle, 10.38%; and health provider choice, 4.54%. Both the annotated health questions and the annotation schema are openly accessible on the Qcorp website, and users can download the annotated Chinese questions in CSV, XML, and HTML formats. CONCLUSIONS We developed a Chinese health question corpus comprising 5000 manually annotated questions. It is openly accessible and will contribute to the development of intelligent health QA systems.
Affiliation(s)
- Haihong Guo
- Institute of Medical Information / Medical Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Xu Na
- Institute of Medical Information / Medical Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Jiao Li
- Institute of Medical Information / Medical Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
14. Kilicoglu H, Ben Abacha A, Mrabet Y, Shooshan SE, Rodriguez L, Masterton K, Demner-Fushman D. Semantic annotation of consumer health questions. BMC Bioinformatics 2018; 19:34. [PMID: 29409442] [PMCID: PMC5802048] [DOI: 10.1186/s12859-018-2045-1]
Abstract
BACKGROUND Consumers increasingly use online resources for their health information needs. While current search engines can address these needs to some extent, they generally do not take into account that most health information needs are complex and can only fully be expressed in natural language. Consumer health question answering (QA) systems aim to fill this gap. A major challenge in developing consumer health QA systems is extracting relevant semantic content from the natural language questions (question understanding). To develop effective question understanding tools, question corpora semantically annotated for relevant question elements are needed. In this paper, we present a two-part consumer health question corpus annotated with several semantic categories: named entities, question triggers/types, question frames, and question topic. The first part (CHQA-email) consists of relatively long email requests received by the U.S. National Library of Medicine (NLM) customer service, while the second part (CHQA-web) consists of shorter questions posed to MedlinePlus search engine as queries. Each question has been annotated by two annotators. The annotation methodology is largely the same between the two parts of the corpus; however, we also explain and justify the differences between them. Additionally, we provide information about corpus characteristics, inter-annotator agreement, and our attempts to measure annotation confidence in the absence of adjudication of annotations. RESULTS The resulting corpus consists of 2614 questions (CHQA-email: 1740, CHQA-web: 874). Problems are the most frequent named entities, while treatment and general information questions are the most common question types. Inter-annotator agreement was generally modest: question types and topics yielded highest agreement, while the agreement for more complex frame annotations was lower. Agreement in CHQA-web was consistently higher than that in CHQA-email. Pairwise inter-annotator agreement proved most useful in estimating annotation confidence. CONCLUSIONS To our knowledge, our corpus is the first focusing on annotation of uncurated consumer health questions. It is currently used to develop machine learning-based methods for question understanding. We make the corpus publicly available to stimulate further research on consumer health QA.
Affiliation(s)
- Halil Kilicoglu
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, USA
- Asma Ben Abacha
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, USA
- Yassine Mrabet
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, USA
- Sonya E. Shooshan
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, USA
- Laritza Rodriguez
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, USA
- Kate Masterton
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, USA
- Dina Demner-Fushman
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, USA
15. El Alaoui SO, Sarrouti M. A Machine Learning-based Method for Question Type Classification in Biomedical Question Answering. Methods Inf Med 2018; 56:209-216. [DOI: 10.3414/me16-01-0116]
Abstract
Background and Objective: Biomedical question type classification is one of the important components of an automatic biomedical question answering system. The performance of the latter depends directly on the performance of its biomedical question type classification system, which consists of assigning a category to each question in order to determine the appropriate answer extraction algorithm. This study aims to automatically classify biomedical questions into one of four categories: (1) yes/no, (2) factoid, (3) list, and (4) summary. Methods: In this paper, we propose a biomedical question type classification method based on machine learning approaches to automatically assign a category to a biomedical question. First, we extract features from biomedical questions using the proposed handcrafted lexico-syntactic patterns. Then, we feed these features to machine learning algorithms. Finally, the class label is predicted using the trained classifiers. Results: Experimental evaluations performed on large standard annotated datasets of biomedical questions, provided by the BioASQ challenge, demonstrated that our method achieves significantly improved performance compared to four baseline systems, with a roughly 10-point increase over the best baseline in terms of accuracy. Moreover, the obtained results show that using the handcrafted lexico-syntactic patterns as features for a support vector machine (SVM) leads to the highest accuracy of 89.40%. Conclusion: The proposed method can automatically classify BioASQ questions into one of four categories: yes/no, factoid, list, and summary. Furthermore, the results demonstrated that our method produced the best classification performance compared to the four baseline systems.
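As an illustration of the classification setup, the sketch below (my assumptions, not the authors' feature set) trains an SVM to assign one of the four question types; TF-IDF n-grams stand in for the handcrafted lexico-syntactic patterns:

```python
# Illustrative sketch: SVM-based biomedical question type classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_questions = [
    "Is aspirin effective for preventing stroke?",
    "Which gene is mutated in Huntington's disease?",
    "List the drugs approved for multiple sclerosis.",
    "Describe the mechanism of action of statins.",
]
train_labels = ["yesno", "factoid", "list", "summary"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(train_questions, train_labels)
print(clf.predict(["Does metformin reduce cancer risk?"]))
```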
16. Elsworth B, Dawe K, Vincent EE, Langdon R, Lynch BM, Martin RM, Relton C, Higgins JPT, Gaunt TR. MELODI: Mining Enriched Literature Objects to Derive Intermediates. Int J Epidemiol 2018; 47:4803214. [PMID: 29342271] [PMCID: PMC5913624] [DOI: 10.1093/ije/dyx251]
Abstract
BACKGROUND The scientific literature contains a wealth of information from different fields on potential disease mechanisms. However, identifying and prioritizing mechanisms for further analytical evaluation presents enormous challenges in terms of the quantity and diversity of published research. The application of data mining approaches to the literature offers the potential to identify and prioritize mechanisms for more focused and detailed analysis. METHODS Here we present MELODI, a literature mining platform that can identify mechanistic pathways between any two biomedical concepts. RESULTS Two case studies demonstrate the potential uses of MELODI and how it can generate hypotheses for further investigation. First, an analysis of the ETS-related gene ERG and prostate cancer derives the intermediate transcription factor SP1, recently confirmed to physically interact with ERG. Second, examining the relationship between a new potential risk factor and pancreatic cancer identifies possible mechanistic insights which can be studied in vitro. CONCLUSIONS We have demonstrated possible applications of MELODI through two case studies. MELODI has been implemented as a Python/Django web application, and is freely available to use at www.melodi.biocompute.org.uk.
Affiliation(s)
- Benjamin Elsworth
- MRC Integrative Epidemiology Unit, University of Bristol, Bristol, UK
- Karen Dawe
- MRC Integrative Epidemiology Unit, University of Bristol, Bristol, UK
- Emma E Vincent
- MRC Integrative Epidemiology Unit, University of Bristol, Bristol, UK
- Ryan Langdon
- MRC Integrative Epidemiology Unit, University of Bristol, Bristol, UK
- Brigid M Lynch
- Cancer Epidemiology and Intelligence Division, Cancer Council Victoria, Melbourne, VIC, Australia
- Centre for Epidemiology and Biostatistics, University of Melbourne, Melbourne, VIC, Australia
- Physical Activity Laboratory, Baker Heart and Diabetes Institute, Melbourne, VIC, Australia
- Richard M Martin
- MRC Integrative Epidemiology Unit, University of Bristol, Bristol, UK
- Caroline Relton
- MRC Integrative Epidemiology Unit, University of Bristol, Bristol, UK
- Tom R Gaunt
- MRC Integrative Epidemiology Unit, University of Bristol, Bristol, UK
17. Kim S, Park D, Choi Y, Lee K, Kim B, Jeon M, Kim J, Tan AC, Kang J. A Pilot Study of Biomedical Text Comprehension using an Attention-Based Deep Neural Reader: Design and Experimental Analysis. JMIR Med Inform 2018; 6:e2. [PMID: 29305341] [PMCID: PMC5783222] [DOI: 10.2196/medinform.8751]
Abstract
BACKGROUND With the development of artificial intelligence (AI) technology centered on deep-learning, the computer has evolved to a point where it can read a given text and answer a question based on the context of the text. Such a specific task is known as the task of machine comprehension. Existing machine comprehension tasks mostly use datasets of general texts, such as news articles or elementary school-level storybooks. However, no attempt has been made to determine whether an up-to-date deep learning-based machine comprehension model can also process scientific literature containing expert-level knowledge, especially in the biomedical domain. OBJECTIVE This study aims to investigate whether a machine comprehension model can process biomedical articles as well as general texts. Since there is no dataset for the biomedical literature comprehension task, our work includes generating a large-scale question answering dataset using PubMed and manually evaluating the generated dataset. METHODS We present an attention-based deep neural model tailored to the biomedical domain. To further enhance the performance of our model, we used a pretrained word vector and biomedical entity type embedding. We also developed an ensemble method of combining the results of several independent models to reduce the variance of the answers from the models. RESULTS The experimental results showed that our proposed deep neural network model outperformed the baseline model by more than 7% on the new dataset. We also evaluated human performance on the new dataset. The human evaluation result showed that our deep neural model outperformed humans in comprehension by 22% on average. CONCLUSIONS In this work, we introduced a new task of machine comprehension in the biomedical domain using a deep neural model. Since there was no large-scale dataset for training deep neural models in the biomedical domain, we created the new cloze-style datasets Biomedical Knowledge Comprehension Title (BMKC_T) and Biomedical Knowledge Comprehension Last Sentence (BMKC_LS) (together referred to as BioMedical Knowledge Comprehension) using the PubMed corpus. The experimental results showed that the performance of our model is much higher than that of humans. We observed that our model performed consistently better regardless of the degree of difficulty of a text, whereas humans have difficulty when performing biomedical literature comprehension tasks that require expert level knowledge.
Affiliation(s)
- Seongsoon Kim
- Department of Computer Science and Engineering, College of Informatics, Korea University, Seoul, Republic of Korea
- Donghyeon Park
- Department of Computer Science and Engineering, College of Informatics, Korea University, Seoul, Republic of Korea
- Yonghwa Choi
- Department of Computer Science and Engineering, College of Informatics, Korea University, Seoul, Republic of Korea
- Kyubum Lee
- Department of Computer Science and Engineering, College of Informatics, Korea University, Seoul, Republic of Korea
- Byounggun Kim
- Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul, Republic of Korea
- Minji Jeon
- Department of Computer Science and Engineering, College of Informatics, Korea University, Seoul, Republic of Korea
- Jihye Kim
- Division of Medical Oncology, Department of Medicine, Translational Bioinformatics and Cancer Systems Biology Laboratory, University of Colorado Anschutz Medical Campus, Aurora, CO, United States
- Aik Choon Tan
- Division of Medical Oncology, Department of Medicine, Translational Bioinformatics and Cancer Systems Biology Laboratory, University of Colorado Anschutz Medical Campus, Aurora, CO, United States
- Jaewoo Kang
- Department of Computer Science and Engineering, College of Informatics, Korea University, Seoul, Republic of Korea
18. Sarrouti M, El Alaoui SO. A Yes/No Answer Generator Based on Sentiment-Word Scores in Biomedical Question Answering. International Journal of Healthcare Information Systems and Informatics 2017. [DOI: 10.4018/ijhisi.2017070104]
Abstract
Background and Objective: Yes/no question answering (QA) in the open domain is a longstanding challenge widely studied over the last decades. However, it still requires further efforts in the biomedical domain. Yes/no QA aims at answering yes/no questions, which seek a clear "yes" or "no" answer. In this paper, we present a novel yes/no answer generator based on sentiment-word scores in biomedical QA. Methods: In the proposed method, we first use Stanford CoreNLP for tokenization and part-of-speech tagging of all passages relevant to a given yes/no question. We then assign a sentiment score based on SentiWordNet to each word of the passages. Finally, the decision between the answers "yes" and "no" is based on the obtained sentiment score of the passages: "yes" for a positive final sentiment score and "no" for a negative one. Results: Experimental evaluations performed on BioASQ collections show that the proposed method is more effective than the current state-of-the-art method, significantly outperforming it by an average of 15.68% in terms of accuracy.
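The decision rule described above can be sketched as follows; the tiny lexicon is a hypothetical stand-in for SentiWordNet scores, and the Stanford CoreNLP tokenization and part-of-speech tagging steps are omitted:

```python
# Simplified sketch of the decision rule (not the published method): sum word-level
# sentiment scores over the relevant passages and answer "yes" when the total is positive.
SENTIMENT = {"effective": 0.5, "improves": 0.4, "beneficial": 0.4, "safe": 0.3,
             "ineffective": -0.7, "fails": -0.6, "harmful": -0.5, "worsens": -0.4}

def yes_no_answer(passages):
    total = sum(SENTIMENT.get(word, 0.0)
                for passage in passages for word in passage.lower().split())
    return "yes" if total > 0 else "no"

print(yes_no_answer(["Aspirin is effective for preventing stroke."]))  # -> yes
```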
Affiliation(s)
- Mourad Sarrouti
- Laboratory of Computer Science and Modeling, FSDM, Sidi Mohammed Ben Abdellah University, Fez, Morocco
- Said Ouatik El Alaoui
- Laboratory of Computer Science and Modeling, FSDM, Sidi Mohammed Ben Abdellah University, Fez, Morocco
19. A passage retrieval method based on probabilistic information retrieval model and UMLS concepts in biomedical question answering. J Biomed Inform 2017; 68:96-103. [DOI: 10.1016/j.jbi.2017.03.001]
20. Roberts K, Rodriguez L, Shooshan SE, Demner-Fushman D. Resource Classification for Medical Questions. AMIA Annu Symp Proc 2017; 2016:1040-1049. [PMID: 28269901] [PMCID: PMC5333297]
Abstract
We present an approach for manually and automatically classifying the resource type of medical questions. Three types of resources are considered: patient-specific, general knowledge, and research. Using this approach, an automatic question answering system could select the best type of resource from which to consider answers. We first describe our methodology for manually annotating resource type on four different question corpora totaling over 5,000 questions. We then describe our approach for automatically identifying the appropriate type of resource. A supervised machine learning approach is used with lexical, syntactic, semantic, and topic-based feature types. This approach is able to achieve accuracies in the range of 80.9% to 92.8% across four datasets. Finally, we discuss the difficulties encountered in both manual and automatic classification of this challenging task.
Affiliation(s)
- Kirk Roberts
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX
- Laritza Rodriguez
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, MD
- Sonya E Shooshan
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, MD
- Dina Demner-Fushman
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, MD
21. Hristovski D, Kastrin A, Dinevski D, Burgun A, Žiberna L, Rindflesch TC. Using Literature-Based Discovery to Explain Adverse Drug Effects. J Med Syst 2016; 40:185. [PMID: 27318993] [DOI: 10.1007/s10916-016-0544-z]
Abstract
We report on our research in using literature-based discovery (LBD) to provide pharmacological and/or pharmacogenomic explanations for reported adverse drug effects. The goal of LBD is to generate novel and potentially useful hypotheses by analyzing the scientific literature and optionally some additional resources. Our assumption is that drugs have effects on some genes or proteins and that these genes or proteins are associated with the observed adverse effects. Therefore, by using LBD we try to find genes or proteins that link the drugs with the reported adverse effects. These genes or proteins can be used to provide insight into the processes causing the adverse effects. Initial results show that our method has the potential to assist in explaining reported adverse drug effects.
Affiliation(s)
- Dimitar Hristovski
- Institute for Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
- Andrej Kastrin
- Faculty of Information Studies, Novo mesto, Ljubljana, Slovenia
- Dejan Dinevski
- Faculty of Medicine, University of Maribor, Maribor, Slovenia
- Anita Burgun
- INSERM UMRS 1138 Eq 22, Paris Descartes University, Georges Pompidou European Hospital, APHP, Paris, France
- Lovro Žiberna
- Institute of Pharmacology and Experimental Toxicology, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
22. Fathiamini S, Johnson AM, Zeng J, Araya A, Holla V, Bailey AM, Litzenburger BC, Sanchez NS, Khotskaya Y, Xu H, Meric-Bernstam F, Bernstam EV, Cohen T. Automated identification of molecular effects of drugs (AIMED). J Am Med Inform Assoc 2016; 23:758-65. [PMID: 27107438] [DOI: 10.1093/jamia/ocw030]
Abstract
INTRODUCTION Genomic profiling information is frequently available to oncologists, enabling targeted cancer therapy. Because clinically relevant information is rapidly emerging in the literature and elsewhere, there is a need for informatics technologies to support targeted therapies. To this end, we have developed a system for Automated Identification of Molecular Effects of Drugs, to help biomedical scientists curate this literature to facilitate decision support. OBJECTIVES To create an automated system to identify assertions in the literature concerning drugs targeting genes with therapeutic implications and to characterize the challenges inherent in automating this process in rapidly evolving domains. METHODS We used subject-predicate-object triples (semantic predications) and co-occurrence relations generated by applying the SemRep Natural Language Processing system to MEDLINE abstracts and ClinicalTrials.gov descriptions. We applied customized semantic queries to find drugs targeting genes of interest. The results were manually reviewed by a team of experts. RESULTS Compared to a manually curated set of relationships, recall, precision, and F2 were 0.39, 0.21, and 0.33, respectively, which represents a 3- to 4-fold improvement over a publicly available set of predications (SemMedDB) alone. Upon review of ostensibly false positive results, 26% were considered relevant additions to the reference set, and an additional 61% were considered to be relevant for review. Adding co-occurrence data improved results for drugs in early development, but not their better-established counterparts. CONCLUSIONS Precision medicine poses unique challenges for biomedical informatics systems that help domain experts find answers to their research questions. Further research is required to improve the performance of such systems, particularly for drugs in development.
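As a quick check of the reported numbers (my worked example, not from the paper), the F2 score weights recall more heavily than precision and is consistent with the reported precision 0.21 and recall 0.39:

```latex
\[
F_2 = \frac{(1 + 2^2)\,PR}{2^2 P + R}
    = \frac{5 \cdot 0.21 \cdot 0.39}{4 \cdot 0.21 + 0.39}
    \approx 0.33 .
\]
```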
Affiliation(s)
- Safa Fathiamini
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, TX, USA
- Amber M Johnson
- Sheikh Khalifa Al Nahyan Ben Zayed Institute for Personalized Cancer Therapy, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Jia Zeng
- Sheikh Khalifa Al Nahyan Ben Zayed Institute for Personalized Cancer Therapy, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Alejandro Araya
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, TX, USA
- Vijaykumar Holla
- Sheikh Khalifa Al Nahyan Ben Zayed Institute for Personalized Cancer Therapy, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Ann M Bailey
- Sheikh Khalifa Al Nahyan Ben Zayed Institute for Personalized Cancer Therapy, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Beate C Litzenburger
- Sheikh Khalifa Al Nahyan Ben Zayed Institute for Personalized Cancer Therapy, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Nora S Sanchez
- Sheikh Khalifa Al Nahyan Ben Zayed Institute for Personalized Cancer Therapy, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Yekaterina Khotskaya
- Sheikh Khalifa Al Nahyan Ben Zayed Institute for Personalized Cancer Therapy, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, TX, USA
- Funda Meric-Bernstam
- Sheikh Khalifa Al Nahyan Ben Zayed Institute for Personalized Cancer Therapy, The University of Texas MD Anderson Cancer Center, Houston, TX, USA; Department of Investigational Cancer Therapeutics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA; Department of Surgical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Elmer V Bernstam
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, TX, USA; Division of General Internal Medicine, Department of Internal Medicine, The University of Texas Health Science Center at Houston, TX, USA
- Trevor Cohen
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, TX, USA
|
23
|
Kilicoglu H, Rosemblat G, Fiszman M, Rindflesch TC. Sortal anaphora resolution to enhance relation extraction from biomedical literature. BMC Bioinformatics 2016; 17:163. [PMID: 27080229 PMCID: PMC4832532 DOI: 10.1186/s12859-016-1009-6] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/04/2015] [Accepted: 04/01/2016] [Indexed: 11/16/2022] Open
Abstract
BACKGROUND Entity coreference is common in biomedical literature and it can affect text understanding systems that rely on accurate identification of named entities, such as relation extraction and automatic summarization. Coreference resolution is a foundational yet challenging natural language processing task which, if performed successfully, is likely to enhance such systems significantly. In this paper, we propose a semantically oriented, rule-based method to resolve sortal anaphora, a specific type of coreference that forms the majority of coreference instances in biomedical literature. The method addresses all entity types and relies on linguistic components of SemRep, a broad-coverage biomedical relation extraction system. It has been incorporated into SemRep, extending its core semantic interpretation capability from the sentence level to the discourse level. RESULTS We evaluated our sortal anaphora resolution method in several ways. The first evaluation focused specifically on sortal anaphora relations. Our method achieved an F1 score of 59.6 on the test portion of a manually annotated corpus of 320 MEDLINE abstracts, a 4-fold improvement over the baseline method. Investigating the impact of sortal anaphora resolution on relation extraction, we found that the overall effect was positive: 50% of the changes replaced uninformative relations with more specific and informative ones, 35% had no effect, and only 15% were negative. We estimate that anaphora resolution changes about 1.5% of the approximately 82 million semantic relations extracted from all of PubMed. CONCLUSIONS Our results demonstrate that a heavily semantic approach to sortal anaphora resolution is largely effective for biomedical literature. Our evaluation and error analysis highlight some areas for further improvement, such as coordination processing and intra-sentential antecedent selection.
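A minimal sketch of the general idea behind rule-based sortal anaphora resolution follows: an anaphoric noun phrase such as "this protein" is linked to the nearest preceding named entity with a compatible semantic type. The entity annotations, semantic-type mappings, and example text are hypothetical; SemRep's actual linguistic components are considerably richer.

```python
# Hedged sketch of sortal anaphora resolution by semantic-type compatibility.
# All data and type codes below are illustrative, not SemRep internals.

from typing import NamedTuple, Optional

class Entity(NamedTuple):
    text: str
    sem_type: str   # UMLS-style semantic type code (illustrative)
    position: int   # token offset within the discourse

# Head nouns of sortal anaphors mapped to semantic types they may refer to.
SORTAL_HEADS = {
    "protein": {"gngm", "aapp"},   # gene/genome, amino acid/protein
    "drug": {"phsu", "orch"},      # pharmacologic substance, organic chemical
    "disease": {"dsyn"},           # disease or syndrome
}

def resolve_sortal_anaphor(head_noun: str, anaphor_pos: int,
                           entities: list[Entity]) -> Optional[Entity]:
    """Return the nearest preceding entity whose type is compatible with the head noun."""
    compatible = SORTAL_HEADS.get(head_noun, set())
    candidates = [e for e in entities
                  if e.position < anaphor_pos and e.sem_type in compatible]
    return max(candidates, key=lambda e: e.position) if candidates else None

entities = [Entity("BRCA1", "gngm", 3), Entity("tamoxifen", "phsu", 12)]
# "... BRCA1 ... tamoxifen ... this protein ..." with the anaphor at token 20:
print(resolve_sortal_anaphor("protein", 20, entities))  # resolves to the BRCA1 entity
```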
Collapse
Affiliation(s)
- Halil Kilicoglu
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Graciela Rosemblat
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Marcelo Fiszman
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Thomas C. Rindflesch
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
| |
Collapse
|
24
|
Gasparyan AY, Yessirkepov M, Voronov AA, Gerasimov AN, Kostyukova EI, Kitas GD. Preserving the Integrity of Citations and References by All Stakeholders of Science Communication. J Korean Med Sci 2015; 30:1545-52. [PMID: 26538996 PMCID: PMC4630468 DOI: 10.3346/jkms.2015.30.11.1545] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 09/01/2015] [Accepted: 09/08/2015] [Indexed: 11/20/2022] Open
Abstract
Citations to scholarly items are the building bricks of multidisciplinary science communication. Citation analyses currently influence individual career advancement and the ranking of academic and research institutions worldwide. This article provides an overview of how scientific authors, reviewers, editors, publishers, indexers, and learned associations contribute to citing and referencing in ways that preserve the integrity of science communication. Authors are responsible for performing thorough bibliographic searches to select relevant references for their articles, comprehending their main points, and citing them ethically. Reviewers and editors may perform additional searches and recommend missing essential references. Publishers, in turn, are in a position to instruct their authors on citations and references, provide tools for validating references, and open access to bibliographies. Publicly available reference lists carry important information about the novelty of scholarly items and their relatedness to the published literature. Few editorial associations have dealt with the issue of citations and properly managed references. As a prime example, the International Committee of Medical Journal Editors (ICMJE) issued an updated set of recommendations in December 2014 on the need to cite primary literature and avoid unethical references, which are applicable to the global scientific community. With the exponential growth of the literature and related references, it is critically important to define the roles of all stakeholders in science communication in curbing irrational and unethical citation and thereby improving the quality and indexability of scholarly journals.
Collapse
Affiliation(s)
- Armen Yuri Gasparyan
- Departments of Rheumatology and Research and Development, Dudley Group NHS Foundation Trust (Teaching Trust of the University of Birmingham, UK), Russells Hall Hospital, Dudley, West Midlands, UK
| | - Marlen Yessirkepov
- Department of Biochemistry, Biology and Microbiology, South Kazakhstan State Pharmaceutical Academy, Shymkent, Kazakhstan
| | - Alexander A Voronov
- Department of Marketing and Trade Deals, Kuban State University, Krasnodar, Russian Federation
| | - Alexey N Gerasimov
- Department of Statistics and Econometrics, Stavropol State Agrarian University, Stavropol, Russian Federation
| | - Elena I Kostyukova
- Faculty of Accounting and Finance, Department of Accounting Management Accounting, Stavropol State Agrarian University, Stavropol, Russian Federation
| | - George D Kitas
- Departments of Rheumatology and Research and Development, Dudley Group NHS Foundation Trust (Teaching Trust of the University of Birmingham, UK), Russells Hall Hospital, Dudley, West Midlands, UK; Arthritis Research UK Epidemiology Unit, University of Manchester, Manchester, UK
| |
Collapse
|
25
|
Disease Related Knowledge Summarization Based on Deep Graph Search. Biomed Res Int 2015; 2015:428195. [PMID: 26413521 PMCID: PMC4561941 DOI: 10.1155/2015/428195] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/24/2014] [Revised: 03/29/2015] [Accepted: 04/25/2015] [Indexed: 11/30/2022]
Abstract
The volume of published biomedical literature on disease-related knowledge is expanding rapidly. Traditional information retrieval (IR) techniques, when applied to large databases such as PubMed, often return large, unmanageable lists of citations that do not fulfill the searcher's information needs. In this paper, we present an approach to automatically construct disease-related knowledge summaries from biomedical literature. In this approach, Kullback-Leibler divergence combined with a mutual information metric is first used to extract disease-salient information. A deep search based on depth-first search (DFS) is then applied to find hidden (indirect) relations between biomedical entities. Finally, a random walk algorithm is used to filter out weak relations. The experimental results show that our approach achieves a precision of 60% and a recall of 61% on salient information extraction for carcinoma of the bladder and outperforms the Combo method.
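A minimal sketch of the depth-first search step described above, assuming a small hypothetical relation graph built from extracted co-occurrences; the KL divergence, mutual information, and random walk components of the cited approach are omitted, and the entities and edges are illustrative only:

```python
# Hedged sketch: DFS over a hypothetical entity relation graph to surface
# hidden (indirect) relations, i.e. entities reachable from the disease node
# only through intermediate entities.

from collections import defaultdict

graph = defaultdict(set)
relations = [                       # hypothetical extracted relations
    ("bladder carcinoma", "FGFR3"),
    ("FGFR3", "erdafitinib"),
    ("bladder carcinoma", "smoking"),
]
for a, b in relations:
    graph[a].add(b)
    graph[b].add(a)

def indirect_relations(source: str, max_depth: int = 2):
    """DFS up to max_depth; entities not directly linked to source count as hidden relations."""
    direct = graph[source]
    found, stack = set(), [(source, 0, [source])]
    while stack:
        node, depth, path = stack.pop()
        if depth >= max_depth:
            continue
        for neighbor in graph[node]:
            if neighbor in path:
                continue
            if neighbor != source and neighbor not in direct:
                found.add((neighbor, tuple(path + [neighbor])))
            stack.append((neighbor, depth + 1, path + [neighbor]))
    return found

print(indirect_relations("bladder carcinoma"))
# e.g. {('erdafitinib', ('bladder carcinoma', 'FGFR3', 'erdafitinib'))}
```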
Collapse
|
26
|
From Literature to Knowledge: Exploiting PubMed to Answer Biomedical Questions in Natural Language. 2015. [DOI: 10.1007/978-3-319-22741-2_1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/19/2023]
|