1
Perez N, Cuadros M, Rigau G. Negation and speculation processing: A study on cue-scope labelling and assertion classification in Spanish clinical text. Artif Intell Med 2023; 145:102682. [PMID: 37925211] [DOI: 10.1016/j.artmed.2023.102682] [Received: 06/10/2022] [Revised: 08/25/2023] [Accepted: 10/06/2023]
Abstract
Natural Language Processing (NLP) based on new deep learning technology is contributing to the emergence of powerful solutions that help healthcare providers and researchers discover valuable patterns within vast volumes of health records and scientific literature. Fundamental to the success of such solutions is the processing of negation and speculation. The article addresses this problem with state-of-the-art deep learning approaches from two perspectives: cue and scope labelling, and assertion classification. In light of the real difficulty of accessing annotated clinical data, the study (a) proposes a methodology to automatically convert cue-scope annotations to assertion annotations; and (b) includes a range of scenarios with varying amounts of training data and adversarial test examples. The results expose the clear advantage of Transformer-based models in this regard, which outperform a series of baselines and related work on NUBes, a public corpus of Spanish clinical text.
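The abstract does not spell out the cue-scope-to-assertion conversion; as a rough illustration of the idea only (the function, labels, and coverage rule below are hypothetical assumptions, not the paper's actual methodology), an entity's assertion label can be derived from whichever annotated scopes cover it:

```python
def scopes_to_assertion(entity_span, negation_scopes, speculation_scopes):
    """Derive an assertion label for an entity from cue-scope annotations.

    entity_span and each scope are (start, end) character offsets.
    A toy reading: an entity fully inside a negation scope is NEGATED,
    inside a speculation scope SPECULATED, otherwise AFFIRMED.
    """
    def inside(span, scope):
        return scope[0] <= span[0] and span[1] <= scope[1]

    if any(inside(entity_span, s) for s in negation_scopes):
        return "NEGATED"
    if any(inside(entity_span, s) for s in speculation_scopes):
        return "SPECULATED"
    return "AFFIRMED"

# Example: in "no evidence of pneumonia", 'pneumonia' (offsets 15-24)
# falls inside a negation scope covering the whole phrase.
print(scopes_to_assertion((15, 24), [(0, 24)], []))  # NEGATED
```

Real guidelines must also handle overlapping scopes and cues that are themselves part of the entity, which this sketch ignores.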
Affiliation(s)
- Naiara Perez
- SNLT group at Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Mikeletegi Pasealekua 57, Donostia/San Sebastián, 20009, Spain
- HiTZ Basque Center for Language Technologies, University of the Basque Country (UPV-EHU), Manuel Lardizabal Ibilbidea 1, Donostia/San Sebastián, 20018, Spain
- Montse Cuadros
- SNLT group at Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Mikeletegi Pasealekua 57, Donostia/San Sebastián, 20009, Spain
- German Rigau
- HiTZ Basque Center for Language Technologies, University of the Basque Country (UPV-EHU), Manuel Lardizabal Ibilbidea 1, Donostia/San Sebastián, 20018, Spain
2
Argüello-González G, Aquino-Esperanza J, Salvador D, Bretón-Romero R, Del Río-Bermudez C, Tello J, Menke S. Negation recognition in clinical natural language processing using a combination of the NegEx algorithm and a convolutional neural network. BMC Med Inform Decis Mak 2023; 23:216. [PMID: 37833661] [PMCID: PMC10576331] [DOI: 10.1186/s12911-023-02301-5] [Received: 03/29/2023] [Accepted: 09/18/2023]
Abstract
BACKGROUND Important clinical information about patients is present in unstructured free-text fields of Electronic Health Records (EHRs). While this information can be extracted using clinical Natural Language Processing (cNLP), the recognition of negation modifiers represents an important challenge. A wide range of cNLP applications have been developed to detect the negation of medical entities in clinical free text; however, effective solutions for languages other than English are scarce. This study aimed to develop a solution for negation recognition in Spanish EHRs based on a combination of a customized rule-based NegEx layer and a convolutional neural network (CNN). METHODS Based on our previous experience in real-world evidence (RWE) studies using information embedded in EHRs, negation recognition was simplified into a binary problem ('affirmative' vs. 'non-affirmative' class). For the NegEx layer, negation rules were obtained from a publicly available Spanish corpus and enriched with custom ones, while the CNN binary classifier was trained on EHRs annotated for clinical named entities (cNEs) and negation markers by medical doctors. RESULTS The proposed negation recognition pipeline obtained precision, recall, and F1-score of 0.93, 0.94, and 0.94 for the 'affirmative' class, and 0.86, 0.84, and 0.85 for the 'non-affirmative' class, respectively. To validate the generalization capabilities of our methodology, we applied the negation recognition pipeline to EHRs (6,710 cNEs) from a different data distribution than the training corpus and obtained consistent performance metrics for the 'affirmative' and 'non-affirmative' classes (0.95, 0.97, and 0.96; and 0.90, 0.83, and 0.86 for precision, recall, and F1-score, respectively). Lastly, we evaluated the pipeline against two publicly available Spanish negation corpora, IULA and NUBes, obtaining state-of-the-art metrics (1.00, 0.99, and 0.99; and 1.00, 0.93, and 0.96 for precision, recall, and F1-score, respectively). CONCLUSION Negation is a source of low precision in the retrieval of cNEs from EHRs' free text. Combining a customized rule-based NegEx layer with a CNN binary classifier outperformed many other current approaches. RWE studies benefit greatly from the correct recognition of negation, as it reduces false-positive detections of cNEs which would otherwise undermine the credibility of cNLP systems.
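As a rough sketch of what a rule-based NegEx-style layer for the binary 'affirmative' vs. 'non-affirmative' decision might look like (the trigger list, window size, and function names below are illustrative assumptions, not the study's curated rules):

```python
import re

# Toy NegEx-style layer: a clinical named entity is 'non-affirmative' when a
# Spanish negation trigger appears in a small character window before it.
# The triggers and the 40-character window are illustrative only.
TRIGGERS = [r"\bno\b", r"\bsin\b", r"\bniega\b", r"\bausencia de\b"]

def classify(text, entity):
    """Return 'affirmative' or 'non-affirmative' for an entity mention."""
    start = text.lower().find(entity.lower())
    if start == -1:
        raise ValueError("entity not found in text")
    window = text.lower()[max(0, start - 40):start]
    if any(re.search(t, window) for t in TRIGGERS):
        return "non-affirmative"
    return "affirmative"

print(classify("El paciente niega dolor torácico", "dolor torácico"))  # non-affirmative
print(classify("Presenta fiebre y tos", "fiebre"))                     # affirmative
```

The study layers a CNN classifier on top of such rules precisely because fixed windows and trigger lists miss pseudo-negations and long-distance scopes.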
Affiliation(s)
- Guillermo Argüello-González
- MedSavana SL, Madrid, 28004, Spain
- Statistics and Operations Research, University of Oviedo, Oviedo, 33003, Spain
- José Aquino-Esperanza
- MedSavana SL, Madrid, 28004, Spain
- Faculty of Medicine and Health Sciences, University of Barcelona, Barcelona, 08007, Spain
3
Scaboro S, Portelli B, Chersoni E, Santus E, Serra G. Increasing adverse drug events extraction robustness on social media: Case study on negation and speculation. Exp Biol Med (Maywood) 2022; 247:2003-2014. [PMID: 36314865] [PMCID: PMC9791307] [DOI: 10.1177/15353702221128577]
Abstract
In the last decade, an increasing number of users have started reporting adverse drug events (ADEs) on social media platforms, blogs, and health forums. Given the large volume of reports, pharmacovigilance has focused on ways to use natural language processing (NLP) techniques to rapidly examine these large collections of text, detecting mentions of drug-related adverse reactions to trigger medical investigations. However, despite the growing interest in the task and the advances in NLP, the robustness of these models in the face of linguistic phenomena such as negation and speculation is an open research question. Negation and speculation are pervasive phenomena in natural language and can severely hamper the ability of an automated system to discriminate between factual and non-factual statements in text. In this article, we consider four state-of-the-art systems for ADE detection on social media texts. We introduce SNAX, a benchmark to test their performance against samples containing negated and speculated ADEs, showing their fragility against these phenomena. We then introduce two possible strategies to increase the robustness of these models, showing that both bring significant increases in performance, lowering the number of spurious entities predicted by the models by 60% for negation and 80% for speculation.
Affiliation(s)
- Simone Scaboro
- Department of Mathematics, Computer Science and Physics, University of Udine, Udine 33100, Italy
- Beatrice Portelli
- Department of Mathematics, Computer Science and Physics, University of Udine, Udine 33100, Italy
- Università degli Studi di Napoli Federico II, Napoli 80138, Italy
- Emmanuele Chersoni
- Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hung Hom 999077, Hong Kong
- Enrico Santus
- Decision Science and Advanced Analytics for MAPV & RA, Bayer Pharmaceuticals, Whippany, NJ 07981-1544, USA
- Giuseppe Serra
- Department of Mathematics, Computer Science and Physics, University of Udine, Udine 33100, Italy
4
Shinohara E, Shibata D, Kawazoe Y. Development of comprehensive annotation criteria for patients' states from clinical texts. J Biomed Inform 2022; 134:104200. [PMID: 36089198] [DOI: 10.1016/j.jbi.2022.104200] [Received: 03/08/2022] [Revised: 08/17/2022] [Accepted: 09/04/2022]
Abstract
In clinical records, much clinical information is recorded as free text, necessitating the use of advanced automatic information extraction technology. The development of practical technologies requires a corpus with finer-granularity annotations describing the information it contains, but such annotation criteria have not yet been sufficiently researched. This study aimed to develop fine-grained annotation criteria that exhaustively cover patients' states in case reports. We collected 362 Japanese-language case reports of intractable diseases that were expected to contain a broad range of patients' states. Criteria were developed by repeatedly revising and annotating the clinical case report text. A set of annotation criteria for patients' states, consisting of 46 entity types, 9 attributes, and 36 relations, was obtained. It allows more detailed information to be expressed than in previous studies through a broader range of concept types, including treatment, and captures clinical information based on a combination of causality and judgment, which could not be expressed before.
Affiliation(s)
- Emiko Shinohara
- Artificial Intelligence in Healthcare, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Daisaku Shibata
- Artificial Intelligence in Healthcare, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Yoshimasa Kawazoe
- Artificial Intelligence in Healthcare, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
5
Fang Y, Idnay B, Sun Y, Liu H, Chen Z, Marder K, Xu H, Schnall R, Weng C. Combining human and machine intelligence for clinical trial eligibility querying. J Am Med Inform Assoc 2022; 29:1161-1171. [PMID: 35426943] [PMCID: PMC9196697] [DOI: 10.1093/jamia/ocac051] [Received: 02/25/2022] [Accepted: 03/29/2022]
Abstract
OBJECTIVE To combine machine efficiency and human intelligence for converting complex clinical trial eligibility criteria text into cohort queries. MATERIALS AND METHODS Criteria2Query (C2Q) 2.0 was developed to enable real-time user intervention for criteria selection and simplification, parsing error correction, and concept mapping. The accuracy, precision, recall, and F1 score of enhanced modules for negation scope detection, temporal normalization, and value normalization were evaluated using a previously curated gold standard, the annotated eligibility criteria of 1010 COVID-19 clinical trials. Usability and usefulness were evaluated by 10 research coordinators in a task-oriented usability evaluation using 5 Alzheimer's disease trials. Data were collected by user interaction logging, a demographic questionnaire, the Health Information Technology Usability Evaluation Scale (Health-ITUES), and a feature-specific questionnaire. RESULTS The accuracies of negation scope detection, temporal normalization, and value normalization were 0.924, 0.916, and 0.966, respectively. C2Q 2.0 achieved a moderate usability score (3.84 out of 5) and a high learnability score (4.54 out of 5). On average, 9.9 modifications were made per clinical study. Experienced researchers made more modifications than novice researchers. The most frequent modification was deletion (5.35 per study). Furthermore, the evaluators favored cohort queries resulting from modifications (score 4.1 out of 5) and the user engagement features (score 4.3 out of 5). DISCUSSION AND CONCLUSION Features that engage domain experts and overcome the limitations of automated machine output are shown to be useful and user-friendly. We conclude that human-computer collaboration is key to improving the adoption and user-friendliness of natural language processing.
Affiliation(s)
- Yilu Fang
- Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Betina Idnay
- School of Nursing, Columbia University, New York, New York, USA
- Department of Neurology, Columbia University, New York, New York, USA
- Yingcheng Sun
- Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Hao Liu
- Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Zhehuan Chen
- Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Karen Marder
- Department of Neurology, Columbia University, New York, New York, USA
- Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Rebecca Schnall
- School of Nursing, Columbia University, New York, New York, USA
- Heilbrunn Department of Population and Family Health, Mailman School of Public Health, Columbia University, New York, New York, USA
- Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, New York, USA
6
Negation and Speculation in NLP: A Survey, Corpora, Methods, and Applications. Applied Sciences (Basel) 2022. [DOI: 10.3390/app12105209]
Abstract
Negation and speculation are universal linguistic phenomena that affect the performance of Natural Language Processing (NLP) applications, such as those for opinion mining and information retrieval, especially on biomedical data. In this article, we review the corpora annotated with negation and speculation in various natural languages and domains. Furthermore, we discuss ongoing research into recent rule-based, supervised, and transfer learning techniques for the detection of negated and speculative content. Many English corpora for various domains are now annotated with negation and speculation; moreover, the availability of annotated corpora in other languages has started to increase. However, this growth is insufficient to address these important phenomena in languages with limited resources. Cross-lingual models and translation from well-resourced languages are acceptable alternatives. We also highlight the lack of consistent annotation guidelines and the shortcomings of existing techniques, and suggest alternatives that may speed up progress in this research direction. Adding more syntactic features may alleviate the limitations of existing techniques, such as cue ambiguity and detecting discontinuous scopes. In some NLP applications, the inclusion of a negation- and speculation-aware system improves performance, yet this aspect is still not addressed or is not considered an essential step.
7
Solarte Pabón O, Montenegro O, Torrente M, Rodríguez González A, Provencio M, Menasalvas E. Negation and uncertainty detection in clinical texts written in Spanish: a deep learning-based approach. PeerJ Comput Sci 2022; 8:e913. [PMID: 35494817] [PMCID: PMC9044225] [DOI: 10.7717/peerj-cs.913] [Received: 10/26/2021] [Accepted: 02/10/2022]
Abstract
Detecting negation and uncertainty is crucial for medical text mining applications; otherwise, extracted information can be incorrectly identified as real or factual events. Although several approaches have been proposed to detect negation and uncertainty in clinical texts, most efforts have focused on the English language. Most proposals developed for Spanish have focused mainly on negation detection and do not deal with uncertainty. In this paper, we propose a deep learning-based approach for both negation and uncertainty detection in clinical texts written in Spanish. The proposed approach explores two deep learning methods to achieve this goal: (i) Bidirectional Long Short-Term Memory with a Conditional Random Field layer (BiLSTM-CRF) and (ii) Bidirectional Encoder Representations from Transformers (BERT). The approach was evaluated using NUBES and IULA, two public corpora for the Spanish language. The results obtained showed an F-score of 92% and 80% in the scope recognition task for negation and uncertainty, respectively. We also present the results of a validation process conducted using a real-life annotated dataset of clinical notes belonging to cancer patients. The proposed approach shows the feasibility of deep learning-based methods for detecting negation and uncertainty in Spanish clinical texts. Experiments also highlighted that this approach improves performance in the scope recognition task compared to other proposals in the biomedical domain.
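Both model families are typically trained as token-level sequence labellers; a small sketch of the post-processing step that turns BIO-style scope labels into spans (the label set and helper below are assumptions for illustration, not the paper's code):

```python
def decode_scopes(tokens, labels):
    """Turn token-level BIO labels (B-SCOPE / I-SCOPE / O) into scope spans.

    Returns token-index (start, end_exclusive) pairs. The BIO scheme is a
    common choice for scope recognition; the papers' exact label sets may differ.
    """
    spans, start = [], None
    for i, label in enumerate(labels):
        if label == "B-SCOPE":
            if start is not None:       # close a scope that is still open
                spans.append((start, i))
            start = i
        elif label == "O":
            if start is not None:
                spans.append((start, i))
                start = None
        # I-SCOPE simply continues the current scope
    if start is not None:
        spans.append((start, len(labels)))
    return spans

tokens = ["sin", "signos", "de", "infección", "aguda", "."]
labels = ["O", "B-SCOPE", "I-SCOPE", "I-SCOPE", "I-SCOPE", "O"]
print(decode_scopes(tokens, labels))  # [(1, 5)]
```

The CRF layer in BiLSTM-CRF exists largely to keep such label sequences well-formed (e.g., no I-SCOPE directly after O), which plain per-token classifiers cannot guarantee.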
Affiliation(s)
- Oswaldo Solarte Pabón
- Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Madrid, Spain
- Escuela de Ingeniería de Sistemas y Computación, Universidad del Valle, Cali, Colombia
- Orlando Montenegro
- Escuela de Ingeniería de Sistemas y Computación, Universidad del Valle, Cali, Colombia
- Ernestina Menasalvas
- Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Madrid, Spain
8
Pezanowski S, Mitra P, MacEachren AM. Exploring Descriptions of Movement Through Geovisual Analytics. KN - Journal of Cartography and Geographic Information 2022; 72:5-27. [PMID: 35229072] [PMCID: PMC8866112] [DOI: 10.1007/s42489-022-00098-3] [Received: 12/15/2021] [Accepted: 01/31/2022]
Abstract
Sensemaking using automatically extracted information from text is a challenging problem. In this paper, we address a specific type of information extraction, namely extracting information related to descriptions of movement. Aggregating and understanding information related to descriptions of movement and lack of movement specified in text can lead to improved understanding and sensemaking of movement phenomena of various types, e.g., migration of people and animals, impediments to travel due to COVID-19, etc. We present GeoMovement, a system based on combining machine learning and rule-based extraction of movement-related information with state-of-the-art visualization techniques. Along with the depiction of movement, our tool can extract and present a lack of movement. Very little prior work exists on automatically extracting descriptions of movement, especially negated movement. Apart from addressing these, GeoMovement also provides a novel integrated framework for combining these extraction modules with visualization. We include two systematic case studies of GeoMovement that show how humans can derive meaningful geographic movement information. GeoMovement can complement precise movement data, e.g., obtained using sensors, or be used by itself when precise data are unavailable.
Affiliation(s)
- Scott Pezanowski
- Information Sciences and Technology, The Pennsylvania State University, Westgate Building, University Park, PA 16802 USA
- Prasenjit Mitra
- Information Sciences and Technology, The Pennsylvania State University, Westgate Building, University Park, PA 16802 USA
- Alan M. MacEachren
- Information Sciences and Technology, The Pennsylvania State University, Westgate Building, University Park, PA 16802 USA
- Department of Geography, The Pennsylvania State University, Walker Building, University Park, PA 16802 USA
9
Boguslav MR, Salem NM, White EK, Leach SM, Hunter LE. Identifying and classifying goals for scientific knowledge. Bioinformatics Advances 2021; 1:vbab012. [PMID: 34661112] [PMCID: PMC8508177] [DOI: 10.1093/bioadv/vbab012] [Received: 05/07/2021] [Revised: 06/17/2021]
Abstract
MOTIVATION Science progresses by posing good questions, yet work in biomedical text mining has not focused much on them. We propose a novel idea for biomedical natural language processing: identifying and characterizing the questions stated in the biomedical literature. Formally, the task is to identify and characterize statements of ignorance, statements where scientific knowledge is missing or incomplete. The creation of such technology could have many significant impacts, from the training of PhD students to ranking publications and prioritizing funding based on particular questions of interest. The work presented here is intended as the first step towards these goals. RESULTS We present a novel ignorance taxonomy driven by the role statements of ignorance play in research, identifying specific goals for future scientific knowledge. Using this taxonomy and reliable annotation guidelines (inter-annotator agreement above 80%), we created a gold-standard ignorance corpus of 60 full-text documents from the prenatal nutrition literature with over 10,000 annotations and used it to train classifiers that achieved F1 scores above 0.80. AVAILABILITY AND IMPLEMENTATION Corpus and source code freely available for download at https://github.com/UCDenver-ccp/Ignorance-Question-Work. The source code is implemented in Python.
Affiliation(s)
- Mayla R Boguslav
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA (corresponding author)
- Nourah M Salem
- Health Informatics Program, College of Health Solutions at Arizona State University, Phoenix, AZ 85004, USA
- Elizabeth K White
- Center for Genes, Environment and Health, National Jewish Health, Denver, CO 80206, USA
- Sonia M Leach
- Center for Genes, Environment and Health, National Jewish Health, Denver, CO 80206, USA
- Lawrence E Hunter
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
10
Hobbs ET, Goralski SM, Mitchell A, Simpson A, Leka D, Kotey E, Sekira M, Munro JB, Nadendla S, Jackson R, Gonzalez-Aguirre A, Krallinger M, Giglio M, Erill I. ECO-CollecTF: A Corpus of Annotated Evidence-Based Assertions in Biomedical Manuscripts. Front Res Metr Anal 2021; 6:674205. [PMID: 34327299] [PMCID: PMC8313968] [DOI: 10.3389/frma.2021.674205] [Received: 02/28/2021] [Accepted: 06/28/2021]
Abstract
Analysis of high-throughput experiments in the life sciences frequently relies upon standardized information about genes, gene products, and other biological entities. To provide this information, expert curators are increasingly relying on text mining tools to identify, extract and harmonize statements from biomedical journal articles that discuss findings of interest. For determining reliability of the statements, curators need the evidence used by the authors to support their assertions. It is important to annotate the evidence directly used by authors to qualify their findings rather than simply annotating mentions of experimental methods without the context of what findings they support. Text mining tools require tuning and adaptation to achieve accurate performance. Many annotated corpora exist to enable developing and tuning text mining tools; however, none currently provides annotations of evidence based on the extensive and widely used Evidence and Conclusion Ontology. We present the ECO-CollecTF corpus, a novel, freely available, biomedical corpus of 84 documents that captures high-quality, evidence-based statements annotated with the Evidence and Conclusion Ontology.
Affiliation(s)
- Elizabeth T Hobbs
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Stephen M Goralski
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Ashley Mitchell
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Andrew Simpson
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Dorjan Leka
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Emmanuel Kotey
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Matt Sekira
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- James B Munro
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, United States
- Suvarna Nadendla
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, United States
- Rebecca Jackson
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, United States
- Martin Krallinger
- Barcelona Supercomputing Center (BSC), Barcelona, Spain
- Centro Nacional de Investigaciones Oncológicas (CNIO), Madrid, Spain
- Michelle Giglio
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, United States
- Ivan Erill
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
11
Sahoo HS, Silverman GM, Ingraham NE, Lupei MI, Puskarich MA, Finzel RL, Sartori J, Zhang R, Knoll BC, Liu S, Liu H, Melton GB, Tignanelli CJ, Pakhomov SVS. A fast, resource efficient, and reliable rule-based system for COVID-19 symptom identification. JAMIA Open 2021; 4:ooab070. [PMID: 34423261] [PMCID: PMC8374371] [DOI: 10.1093/jamiaopen/ooab070] [Received: 03/16/2021] [Revised: 07/16/2021] [Accepted: 08/05/2021]
Abstract
OBJECTIVE With COVID-19, there was a need for a rapidly scalable annotation system that facilitated real-time integration with clinical decision support systems (CDS). Current annotation systems suffer from high resource utilization and poor scalability, limiting real-world integration with CDS. A potential solution to mitigate these issues is the rule-based gazetteer developed at our institution. MATERIALS AND METHODS Performance, resource utilization, and runtime of the rule-based gazetteer were compared with five annotation systems: BioMedICUS, cTAKES, MetaMap, CLAMP, and MedTagger. RESULTS This rule-based gazetteer was the fastest, had a low resource footprint, and had similar performance for weighted microaverage and macroaverage measures of precision, recall, and F1-score compared to the other annotation systems. DISCUSSION Opportunities to increase its performance include fine-tuning lexical rules for symptom identification. Additionally, it could run on multiple compute nodes for faster runtime. CONCLUSION This rule-based gazetteer overcame key technical limitations, facilitating real-time symptomatology identification for COVID-19 and integration of unstructured data elements into our CDS. It is ideal for large-scale deployment across a wide variety of healthcare settings for surveillance of acute COVID-19 symptoms for integration into prognostic modeling. Such a system is currently being leveraged for monitoring of post-acute sequelae of COVID-19 (PASC) progression in COVID-19 survivors. This study conducted the first in-depth analysis of its kind and developed a rule-based gazetteer for COVID-19 symptom extraction with the following key features: low processor and memory utilization, faster runtime, and weighted microaverage and macroaverage measures of precision, recall, and F1-score similar to industry-standard annotation systems.
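The gazetteer itself is institution-specific, but at its core a rule-based gazetteer reduces to longest-match dictionary lookup over a token stream, which explains the low resource footprint relative to full NLP pipelines. A hedged sketch (terms, mappings, and names below are made up for illustration, not the study's lexicon):

```python
# Toy gazetteer: multi-word symptom terms mapped to canonical names,
# matched longest-first over lowercased tokens.
GAZETTEER = {
    ("shortness", "of", "breath"): "dyspnea",
    ("fever",): "fever",
    ("dry", "cough"): "cough",
}
MAX_LEN = max(len(k) for k in GAZETTEER)

def find_symptoms(tokens):
    """Return (start, end_exclusive, canonical_name) hits over token indices."""
    tokens = [t.lower() for t in tokens]
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in GAZETTEER:
                hits.append((i, i + n, GAZETTEER[key]))
                i += n   # skip past the matched phrase
                break
        else:
            i += 1
    return hits

print(find_symptoms("Patient reports fever and shortness of breath".split()))
# [(2, 3, 'fever'), (4, 7, 'dyspnea')]
```

A production system would add negation handling and lexical variants on top of the raw lookup.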
Affiliation(s)
- Himanshu S Sahoo
- Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, Minnesota, USA
- Department of Surgery, University of Minnesota, Minneapolis, Minnesota, USA
- Greg M Silverman
- Department of Surgery, University of Minnesota, Minneapolis, Minnesota, USA
- Nicholas E Ingraham
- Pulmonary Disease and Critical Care Medicine, University of Minnesota, Minneapolis, Minnesota, USA
- Monica I Lupei
- Department of Anesthesiology, University of Minnesota, Minneapolis, Minnesota, USA
- Michael A Puskarich
- Department of Emergency Medicine, University of Minnesota, Minneapolis, Minnesota, USA
- Raymond L Finzel
- Department of Pharmaceutical Care and Health Systems, University of Minnesota, Minneapolis, Minnesota, USA
- John Sartori
- Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, Minnesota, USA
- Rui Zhang
- Department of Pharmaceutical Care and Health Systems, University of Minnesota, Minneapolis, Minnesota, USA
- Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
- Benjamin C Knoll
- Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
- Sijia Liu
- Department of Health Science Research, Mayo Clinic, Rochester, Minnesota, USA
- Hongfang Liu
- Department of Health Science Research, Mayo Clinic, Rochester, Minnesota, USA
- Genevieve B Melton
- Department of Surgery, University of Minnesota, Minneapolis, Minnesota, USA
- Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
- Serguei V S Pakhomov
- Department of Pharmaceutical Care and Health Systems, University of Minnesota, Minneapolis, Minnesota, USA
12
French FastContext: A publicly accessible system for detecting negation, temporality and experiencer in French clinical notes. J Biomed Inform 2021; 117:103733. [PMID: 33737205] [DOI: 10.1016/j.jbi.2021.103733] [Received: 07/13/2020] [Revised: 12/30/2020] [Accepted: 03/01/2021]
Abstract
The context of medical conditions is an important feature to consider when processing clinical narratives. NegEx and its extension ConText have become the best-known rule-based systems for determining whether a medical condition is negated, historical, or experienced by someone other than the patient in English clinical text. In this paper, we present a French adaptation and enrichment of FastContext, the most recent, n-trie engine-based implementation of the ConText algorithm. We compiled an extensive list of French lexical cues by automatic and manual translation and enrichment. To evaluate French FastContext, we manually annotated the context of medical conditions present in two types of clinical narratives: (i) death certificates and (ii) electronic health records. Results show good performance across different context values on both types of clinical notes (on average 0.93 and 0.86 F1, respectively). Furthermore, French FastContext outperforms previously reported French systems for negation detection when compared on the same datasets, and it is the first implementation of contextual temporality and experiencer identification reported for French. Finally, French FastContext has been implemented within the SIFR Annotator, a publicly accessible Web service to annotate French biomedical text data (http://bioportal.lirmm.fr/annotator). To our knowledge, this is the first implementation of a Web-based ConText-like system in a publicly accessible platform, allowing non-natural-language-processing experts to both annotate and contextualize medical conditions in clinical notes.
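For illustration only, a ConText-style pass can be caricatured as trigger words that assign contextual properties to concepts falling within a token window (the triggers, windows, and property names below are toy assumptions; FastContext's actual rule files, directional termination cues, and n-trie matching are far richer):

```python
# Toy ConText-style rules: (trigger word, property it sets, forward window in tokens).
RULES = [
    ("no", "negated", 5),
    ("denies", "negated", 5),
    ("history", "historical", 6),
    ("father", "other_experiencer", 6),
]

def contextualize(tokens, concept_index):
    """Return the contextual properties of the concept at token concept_index."""
    props = {"negated": False, "historical": False, "other_experiencer": False}
    low = [t.lower() for t in tokens]
    for i, tok in enumerate(low):
        for trigger, prop, window in RULES:
            if tok == trigger and i < concept_index <= i + window:
                props[prop] = True
    return props

tokens = "Father has a history of diabetes".split()
print(contextualize(tokens, 5))  # diabetes: historical and other_experiencer
```

The French adaptation's hard part is not this control flow but compiling and validating the cue lexicon, which is why the paper's contribution centers on the translated and enriched rule set.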
|
13
|
Integrating Speculation Detection and Deep Learning to Extract Lung Cancer Diagnosis from Clinical Notes. Appl Sci (Basel) 2021. [DOI: 10.3390/app11020865] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
Abstract
Despite efforts to develop models for extracting medical concepts from clinical notes, some challenges remain, in particular relating concepts to dates. The high number of clinical notes written for each patient, together with the use of negation, speculation, and different date formats, causes ambiguity that has to be resolved to reconstruct the patient’s natural history. In this paper, we concentrate on extracting the cancer diagnosis from clinical narratives and relating it to the diagnosis date. To address this challenge, a hybrid approach that combines deep learning-based and rule-based methods is proposed. The approach integrates three steps: (i) lung cancer named entity recognition, (ii) negation and speculation detection, and (iii) relating the cancer diagnosis to a valid date. In particular, we apply the proposed approach to extract the lung cancer diagnosis and its diagnosis date from clinical narratives written in Spanish. The results show an F-score of 90% in the named entity recognition task and an 89% F-score in the task of relating the cancer diagnosis to the diagnosis date. Our findings suggest that speculation detection, together with negation detection, is a key component for properly extracting cancer diagnoses from clinical notes.
|
14
|
Rivera Zavala R, Martinez P. The Impact of Pretrained Language Models on Negation and Speculation Detection in Cross-Lingual Medical Text: Comparative Study. JMIR Med Inform 2020; 8:e18953. [PMID: 33270027 PMCID: PMC7746498 DOI: 10.2196/18953] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2020] [Revised: 08/25/2020] [Accepted: 10/28/2020] [Indexed: 11/13/2022] Open
Abstract
Background Negation and speculation are critical elements in natural language processing (NLP)-related tasks, such as information extraction, as these phenomena change the truth value of a proposition. In informal clinical narrative, these linguistic phenomena are used extensively to indicate hypotheses, impressions, or negative findings. Previous state-of-the-art approaches addressed negation and speculation detection using rule-based methods, but in the last few years, models based on machine learning and deep learning exploiting morphological, syntactic, and semantic features represented as sparse and dense vectors have emerged. However, although such methods of named entity recognition (NER) employ a broad set of features, they are limited to existing pretrained models for a specific domain or language. Objective As a fundamental subsystem of any information extraction pipeline, a system for cross-lingual and domain-independent negation and speculation detection was introduced, with special focus on the biomedical scientific literature and clinical narrative. In this work, detection of negation and speculation was treated as a sequence-labeling task in which cues and the scopes of both phenomena are recognized as a sequence of nested labels in a single step. Methods We proposed the following two approaches for negation and speculation detection: (1) a bidirectional long short-term memory (Bi-LSTM) and conditional random field model using character, word, and sense embeddings to deal with the extraction of semantic, syntactic, and contextual patterns and (2) bidirectional encoder representations from transformers (BERT) with fine-tuning for NER.
Results The approach was evaluated for English and Spanish on biomedical and review text, particularly the BioScope corpus, the IULA corpus, and the SFU Spanish Review corpus, with F-measures of 86.6%, 85.0%, and 88.1%, respectively, for NeuroNER and 86.4%, 80.8%, and 91.7%, respectively, for BERT. Conclusions These results show that these architectures perform considerably better than previous rule-based and conventional machine learning-based systems. Moreover, our analysis shows that pretrained word embeddings, and particularly contextualized embeddings for biomedical corpora, help to handle the complexities inherent in biomedical text.
Affiliation(s)
- Renzo Rivera Zavala
- Department of Computer Science and Engineering, Carlos III University of Madrid, Madrid, Spain
- Department of Computer Science and Engineering, Universidad Católica de Santa Maria, Arequipa, Peru
| | - Paloma Martinez
- Department of Computer Science and Engineering, Carlos III University of Madrid, Madrid, Spain
| |
|
15
|
Grljević O, Bošnjak Z, Kovačević A. Opinion mining in higher education: a corpus-based approach. Enterp Inf Syst 2020. [DOI: 10.1080/17517575.2020.1773542] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
Affiliation(s)
- Olivera Grljević
- Faculty of Economics in Subotica, University of Novi Sad, Subotica, Serbia
| | - Zita Bošnjak
- Faculty of Economics in Subotica, University of Novi Sad, Subotica, Serbia
| | | |
|
16
|
Omero P, Valotto M, Bellana R, Bongelli R, Riccioni I, Zuczkowski A, Tasso C. Writer’s uncertainty identification in scientific biomedical articles: a tool for automatic if-clause tagging. Lang Resour Eval 2020. [DOI: 10.1007/s10579-020-09491-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/13/2023]
Abstract
In a previous study, we manually identified seven categories (verbs, non-verbs, modal verbs in the simple present, modal verbs in the conditional mood, if, uncertain questions, and epistemic future) of Uncertainty Markers (UMs) in a corpus of 80 articles from the British Medical Journal randomly sampled from a 167-year period (1840–2007). The UMs, detected on the basis of an epistemic stance approach, were those referring only to the authors of the articles and only in the present. We also performed preliminary experiments to assess the manually annotated corpus and to establish a baseline for the automatic detection of UMs. The results showed that most UMs could be recognized with good accuracy, except for the if-category, which includes four subcategories: if-clauses in a narrow sense; if-less clauses; as if/as though; and if and whether introducing embedded questions. The unsatisfactory results for the if-category were probably due both to its complexity and to the inadequacy of the detection rules, which were only lexical, not grammatical. In the current article, we describe a different approach, which combines grammatical and syntactic rules. The experiments performed show that the identification of uncertainty in the if-category has improved markedly, roughly doubling our previous results. The complex overall process of uncertainty detection can greatly profit from a hybrid approach that combines supervised machine learning techniques with a knowledge-based component: a rule-based inference engine devoted to the if-clause case and designed on the basis of the above-mentioned epistemic stance approach.
|
17
|
Prieto M, Deus H, de Waard A, Schultes E, García-Jiménez B, Wilkinson MD. Data-driven classification of the certainty of scholarly assertions. PeerJ 2020; 8:e8871. [PMID: 32341891 PMCID: PMC7182025 DOI: 10.7717/peerj.8871] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/08/2019] [Accepted: 03/09/2020] [Indexed: 01/02/2023] Open
Abstract
The grammatical structures scholars use to express their assertions are intended to convey various degrees of certainty or speculation. Prior studies have suggested a variety of categorization systems for scholarly certainty; however, these have not been objectively tested for their validity, particularly with respect to representing the interpretation by the reader, rather than the intention of the author. In this study, we use a series of questionnaires to determine how researchers classify various scholarly assertions, using three distinct certainty classification systems. We find that there are three distinct categories of certainty along a spectrum from high to low. We show that these categories can be detected in an automated manner, using a machine learning model, with a cross-validation accuracy of 89.2% relative to an author-annotated corpus, and 82.2% accuracy against a publicly annotated corpus. This finding provides an opportunity for contextual metadata related to certainty to be captured as part of text-mining pipelines, which currently miss these subtle linguistic cues. We provide an exemplar machine-accessible representation, a Nanopublication, in which the certainty category is embedded as metadata in a formal, ontology-based manner within text-mined scholarly assertions.
Affiliation(s)
- Mario Prieto
- Departamento de Biotecnología-Biología Vegetal, Escuela Técnica Superior de Ingeniería Agronómica, Alimentaria y de Biosistemas, Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM)- Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Pozuelo de Alarcon, Madrid, Spain
| | - Helena Deus
- Elsevier Inc., Cambridge, MA, United States of America
| | - Anita de Waard
- Elsevier Research Collaborations Unit, Jericho, VT, United States of America
| | - Erik Schultes
- GO FAIR International Support and Coordination Office, Leiden, The Netherlands
| | - Beatriz García-Jiménez
- Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM)- Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Pozuelo de Alarcon, Madrid, Spain
| | - Mark D. Wilkinson
- Departamento de Biotecnología-Biología Vegetal, Escuela Técnica Superior de Ingeniería Agronómica, Alimentaria y de Biosistemas, Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM)- Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Pozuelo de Alarcon, Madrid, Spain
| |
|
18
|
Kolhatkar V, Wu H, Cavasso L, Francis E, Shukla K, Taboada M. The SFU Opinion and Comments Corpus: A Corpus for the Analysis of Online News Comments. Corpus Pragmat 2019; 4:155-190. [PMID: 32685909 PMCID: PMC7357677 DOI: 10.1007/s41701-019-00065-w] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/13/2019] [Accepted: 10/15/2019] [Indexed: 06/02/2023]
Abstract
We present the SFU Opinion and Comments Corpus (SOCC), a collection of opinion articles and the comments posted in response to the articles. The articles include all the opinion pieces published in the Canadian newspaper The Globe and Mail in the 5-year period between 2012 and 2016, a total of 10,339 articles and 663,173 comments. SOCC is part of a project that investigates the linguistic characteristics of online comments. The corpus can be used to study a host of pragmatic phenomena. Among other aspects, researchers can explore: the connections between articles and comments; the connections of comments to each other; the types of topics discussed in comments; the nice (constructive) or mean (toxic) ways in which commenters respond to each other; how language is used to convey very specific types of evaluation; and how negation affects the interpretation of evaluative meaning in discourse. Our current focus is the study of constructiveness and evaluation in the comments. To that end, we have annotated a subset of the large corpus (1043 comments) with four layers of annotations: constructiveness, toxicity, negation and Appraisal (Martin and White, The language of evaluation, Palgrave, New York, 2005). This paper details our corpus, the data collection process, the characteristics of the corpus and describes the annotations. While our focus is comments posted in response to opinion news articles, the phenomena in this corpus are likely to be present in many commenting platforms: other news comments, comments and replies in fora such as Reddit, feedback on blogs, or YouTube comments.
Affiliation(s)
- Varada Kolhatkar
- Department of Computer Science, University of British Columbia, Vancouver, Canada
| | - Hanhan Wu
- Discourse Processing Lab, Department of Linguistics, Simon Fraser University, Burnaby, Canada
| | - Luca Cavasso
- Discourse Processing Lab, Department of Linguistics, Simon Fraser University, Burnaby, Canada
| | - Emilie Francis
- Discourse Processing Lab, Department of Linguistics, Simon Fraser University, Burnaby, Canada
| | - Kavan Shukla
- Discourse Processing Lab, Department of Linguistics, Simon Fraser University, Burnaby, Canada
| | - Maite Taboada
- Discourse Processing Lab, Department of Linguistics, Simon Fraser University, Burnaby, Canada
| |
|
19
|
Bongelli R, Riccioni I, Burro R, Zuczkowski A. Writers' uncertainty in scientific and popular biomedical articles. A comparative analysis of the British Medical Journal and Discover Magazine. PLoS One 2019; 14:e0221933. [PMID: 31487308 PMCID: PMC6728051 DOI: 10.1371/journal.pone.0221933] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2018] [Accepted: 08/19/2019] [Indexed: 12/01/2022] Open
Abstract
Distinguishing certain and uncertain information is of crucial importance both in the scientific field in the strict sense and in the popular scientific domain. In this paper, by adopting an epistemic stance perspective on certainty and uncertainty, and a mixed procedure of analysis that combines a bottom-up and a top-down approach, we perform a comparative study (both qualitative and quantitative) of the uncertainty linguistic markers (verbs, non-verbs, modal verbs, conditional clauses, uncertain questions, epistemic future) and their scope in three different corpora: a historical corpus of 80 biomedical articles from the British Medical Journal (BMJ) 1840–2007; a corpus of 12 biomedical articles from BMJ 2013; and a contemporary corpus of 12 popular science articles from Discover 2013. The variables under observation are time, structure (IMRaD vs no-IMRaD) and genre (scientific vs popular articles). We apply Generalized Linear Models to test whether there are statistically significant differences (1) in the amount of uncertainty among the different corpora, and (2) in the categories of uncertainty markers used by writers. The results of our analysis reveal that (1) in all corpora, the percentages of uncertainty are always much lower than those of certainty; (2) uncertainty progressively diminishes over time in biomedical articles, in conjunction with their structural changes (IMRaD) and with the increase in the BMJ Impact Factor; and (3) uncertainty is slightly higher in popular science articles (Discover 2013) than in the contemporary corpus of scientific articles (BMJ 2013). Nevertheless, in all corpora, modal verbs are the most used uncertainty markers. These results suggest not only that scientific writers prefer to communicate their uncertainty with markers of possibility rather than markers of subjectivity, but also that science journalists prefer a third-person subject followed by a modal verb to a first-person subject followed by a mental verb such as think or believe.
Affiliation(s)
- Ramona Bongelli
- Department of Political Science, Communication and International Relations, University of Macerata, Macerata, Italy
| | - Ilaria Riccioni
- Department of Education, Cultural Heritage and Tourism, University of Macerata, Macerata, Italy
| | - Roberto Burro
- Department of Human Sciences, University of Verona, Verona, Italy
| | - Andrzej Zuczkowski
- Department of Education, Cultural Heritage and Tourism, University of Macerata, Macerata, Italy
| |
|
20
|
Sergeeva E, Zhu H, Prinsen P, Tahmasebi A. Negation Scope Detection in Clinical Notes and Scientific Abstracts: A Feature-enriched LSTM-based Approach. AMIA Jt Summits Transl Sci Proc 2019; 2019:212-221. [PMID: 31258973 PMCID: PMC6568093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Electronic Health Records contain a wealth of clinical information that can potentially be used for a variety of clinical tasks. Clinical narratives contain information about the existence or absence of medical conditions as well as clinical findings. It is essential to be able to distinguish between the two, since negated events and non-negated events often have very different prognostic value. In this paper, we present a feature-enriched neural network-based model for negation scope detection in biomedical texts. The system achieves robustly high performance on two different types of texts, scientific abstracts and radiology reports: without requiring gold cue information, it achieves a new state-of-the-art result on the scientific abstracts part of the BioScope corpus and a competitive result on the radiology report corpus.
Affiliation(s)
- Elena Sergeeva
- Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Henghui Zhu
- Boston University, Systems Engineering, Brookline, MA, USA
| | - Peter Prinsen
- Philips Research Eindhoven, Eindhoven, The Netherlands
| | | |
|
21
|
Kennedy N, Brodbelt DC, Church DB, O’Neill DG. Detecting false-positive disease references in veterinary clinical notes without manual annotations. NPJ Digit Med 2019; 2:33. [PMID: 31304379 PMCID: PMC6550178 DOI: 10.1038/s41746-019-0108-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2018] [Accepted: 04/12/2019] [Indexed: 11/09/2022] Open
Abstract
Clinicians often include references in clinical notes to diseases that have not been diagnosed in their patients. For some disease terms, the majority of disease references written in the patient notes may not refer to a true disease diagnosis. These references occur because clinicians often use their clinical notes to speculate about disease existence (differential diagnosis) or to state that the disease has been ruled out. To train classifiers for disambiguating disease references, previous researchers built training sets by manually annotating sentences. We show how to create very large training sets without the need for manual annotation. We obtain state-of-the-art classification performance with a bidirectional long short-term memory model trained to distinguish disease references between patients with or without the disease diagnosis in veterinary clinical notes.
Affiliation(s)
- Noel Kennedy
- IT Department, The Royal Veterinary College, 4 Royal College St, London, NW1 0TU UK
| | - Dave C. Brodbelt
- Pathobiology and Population Science, The Royal Veterinary College, Hawkshead Lane, North Mymms, Hatfield, Herts AL9 7TA UK
| | - David B. Church
- Clinical Sciences and Services, The Royal Veterinary College, Hawkshead Lane, North Mymms, Hatfield, Herts AL9 7TA UK
| | - Dan G. O’Neill
- Pathobiology and Population Science, The Royal Veterinary College, Hawkshead Lane, North Mymms, Hatfield, Herts AL9 7TA UK
| |
|
22
|
Jagannatha A, Liu F, Liu W, Yu H. Overview of the First Natural Language Processing Challenge for Extracting Medication, Indication, and Adverse Drug Events from Electronic Health Record Notes (MADE 1.0). Drug Saf 2019; 42:99-111. [PMID: 30649735 PMCID: PMC6860017 DOI: 10.1007/s40264-018-0762-z] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/19/2023]
Abstract
INTRODUCTION This work describes the Medication and Adverse Drug Events from Electronic Health Records (MADE 1.0) corpus and provides an overview of the MADE 1.0 2018 challenge for extracting medication, indication, and adverse drug events (ADEs) from electronic health record (EHR) notes. OBJECTIVE The goal of MADE is to provide a set of common evaluation tasks to assess the state of the art for natural language processing (NLP) systems applied to EHRs supporting drug safety surveillance and pharmacovigilance. We also provide benchmarks on the MADE dataset using the system submissions received in the MADE 2018 challenge. METHODS The MADE 1.0 challenge has released an expert-annotated cohort of medication and ADE information comprising 1089 fully de-identified longitudinal EHR notes from 21 randomly selected patients with cancer at the University of Massachusetts Memorial Hospital. Using this cohort as a benchmark, the MADE 1.0 challenge designed three shared NLP tasks. The named entity recognition (NER) task identifies medications and their attributes (dosage, route, duration, and frequency), indications, ADEs, and severity. The relation identification (RI) task identifies relations between the named entities: medication-indication, medication-ADE, and attribute relations. The third shared task (NER-RI) evaluates NLP models that perform the NER and RI tasks jointly. In total, 11 teams from four countries participated in at least one of the three shared tasks, and 41 system submissions were received in total. RESULTS The best systems' F1 scores for NER, RI, and NER-RI were 0.82, 0.86, and 0.61, respectively. Ensemble classifiers built from the team submissions improved performance further, with F1 scores of 0.85, 0.87, and 0.66 for the three tasks, respectively. CONCLUSION The MADE results show that recent progress in NLP has led to remarkable improvements in NER and RI tasks for the clinical domain. However, some room for improvement remains, particularly in the NER-RI task.
Affiliation(s)
- Abhyuday Jagannatha
- College of Information and Computer Sciences, University of Massachusetts, Amherst, MA, USA
| | - Feifan Liu
- Department of Quantitative Health Sciences and Radiology, University of Massachusetts Medical School, Worcester, MA, USA
| | - Weisong Liu
- Department of Computer Science, University of Massachusetts, 220 Pawtucket St., Lowell, MA, 01854-2874, USA
- Department of Medicine, University of Massachusetts Medical School, Worcester, MA, USA
| | - Hong Yu
- College of Information and Computer Sciences, University of Massachusetts, Amherst, MA, USA.
- Department of Computer Science, University of Massachusetts, 220 Pawtucket St., Lowell, MA, 01854-2874, USA.
- Department of Medicine, University of Massachusetts Medical School, Worcester, MA, USA.
- Bedford VAMC, Bedford, MA, USA.
| |
|
23
|
Taylor SJ, Harabagiu SM. The Role of a Deep-Learning Method for Negation Detection in Patient Cohort Identification from Electroencephalography Reports. AMIA Annu Symp Proc 2018; 2018:1018-1027. [PMID: 30815145 PMCID: PMC6371289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Detecting negation in biomedical texts entails the automatic identification of negation cues (e.g. "never", "not", "no longer") as well as the scope of these cues. When medical concepts or terms are identified within the scope of a negation cue, their polarity is inferred as "negative"; all other concepts or words receive a positive polarity. Correctly inferring polarity is essential for patient cohort retrieval systems, as all inclusion criteria need to be automatically assigned positive polarity, whereas exclusion criteria should receive negative polarity. Motivated by recent developments in deep learning, we experimented with a neural negation detection technique and compared it against an existing neural polarity recognition system; both were incorporated into a patient cohort system operating on clinical electroencephalography (EEG) reports. Our experiments indicate that the neural negation detection method produces better patient cohorts than the polarity recognition method.
|
24
|
Kilicoglu H. Biomedical text mining for research rigor and integrity: tasks, challenges, directions. Brief Bioinform 2018; 19:1400-1414. [PMID: 28633401 PMCID: PMC6291799 DOI: 10.1093/bib/bbx057] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/07/2017] [Revised: 04/10/2017] [Indexed: 01/01/2023] Open
Abstract
An estimated quarter of a trillion US dollars is invested in the biomedical research enterprise annually. There is growing alarm that a significant portion of this investment is wasted because of problems in reproducibility of research findings and in the rigor and integrity of research conduct and reporting. Recent years have seen a flurry of activities focusing on standardization and guideline development to enhance the reproducibility and rigor of biomedical research. Research activity is primarily communicated via textual artifacts, ranging from grant applications to journal publications. These artifacts can be both the source and the manifestation of practices leading to research waste. For example, an article may describe a poorly designed experiment, or the authors may reach conclusions not supported by the evidence presented. In this article, we pose the question of whether biomedical text mining techniques can assist the stakeholders in the biomedical research enterprise in doing their part toward enhancing research integrity and rigor. In particular, we identify four key areas in which text mining techniques can make a significant contribution: plagiarism/fraud detection, ensuring adherence to reporting guidelines, managing information overload and accurate citation/enhanced bibliometrics. We review the existing methods and tools for specific tasks, if they exist, or discuss relevant research that can provide guidance for future work. With the exponential increase in biomedical research output and the ability of text mining approaches to perform automatic tasks at large scale, we propose that such approaches can support tools that promote responsible research practices, providing significant benefits for the biomedical research enterprise.
Affiliation(s)
- Halil Kilicoglu
- Lister Hill National Center for Biomedical Communications, US National Library of Medicine
| |
|
25
|
Fabregat H, Araujo L, Martinez-Romo J. Deep neural models for extracting entities and relationships in the new RDD corpus relating disabilities and rare diseases. Comput Methods Programs Biomed 2018; 164:121-129. [PMID: 30195420 DOI: 10.1016/j.cmpb.2018.07.007] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/30/2018] [Revised: 06/20/2018] [Accepted: 07/16/2018] [Indexed: 06/08/2023]
Abstract
BACKGROUND AND OBJECTIVE There are a great many rare diseases, and many of them are associated with significant disabilities. It is paramount to know in advance the evolution of a disease in order to limit and prevent the appearance of disabilities and to prepare the patient to manage future difficulties. Rare disease associations are making an effort to collect this information manually, but it is a long process. A lot of information about the consequences of rare diseases is published in scientific papers, and our goal is to automatically extract disabilities associated with diseases from them. METHODS This work presents a new corpus of abstracts from scientific papers related to rare diseases, which has been manually annotated with disabilities. This corpus makes it possible to train machine and deep learning systems that can automatically process other papers, thus extracting new information about the relations between rare diseases and disabilities. The corpus is also annotated with negation and speculation where they affect disabilities, and has been made publicly accessible. RESULTS We have devised experiments using deep learning techniques to show the usefulness of the developed corpus. Specifically, we have designed a long short-term memory based architecture for disability identification, as well as a convolutional neural network for detecting the relationships of disabilities to diseases. The systems designed do not need any preprocessing of the data, only low-dimensional vectors representing the words. CONCLUSIONS The developed corpus will make it possible to train systems to identify disabilities in biomedical documents, which current annotation systems are not able to detect. Such systems could also be trained to detect relationships between disabilities and diseases, as well as negation and speculation, which can change the meaning of the language. The deep learning models designed for identifying disabilities and their relationships to diseases in new documents obtain an F-measure of around 81% for disability recognition and 75% for relation extraction.
Affiliation(s)
- Hermenegildo Fabregat
- Department of Computer Science, Universidad Nacional de Educación a Distancia (UNED), Juan del Rosal 16, Madrid 28040, Spain.
| | - Lourdes Araujo
- Department of Computer Science, Universidad Nacional de Educación a Distancia (UNED), Juan del Rosal 16, Madrid 28040, Spain; IMIENS: Instituto Mixto de Investigación, Escuela Nacional de Sanidad, Monforte de Lemos 5, Madrid 28019, Spain.
| | - Juan Martinez-Romo
- Department of Computer Science, Universidad Nacional de Educación a Distancia (UNED), Juan del Rosal 16, Madrid 28040, Spain; IMIENS: Instituto Mixto de Investigación, Escuela Nacional de Sanidad, Monforte de Lemos 5, Madrid 28019, Spain.
| |
|
26
|
Shardlow M, Batista-Navarro R, Thompson P, Nawaz R, McNaught J, Ananiadou S. Identification of research hypotheses and new knowledge from scientific literature. BMC Med Inform Decis Mak 2018; 18:46. [PMID: 29940927 PMCID: PMC6019216 DOI: 10.1186/s12911-018-0639-1] [Citation(s) in RCA: 43] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2017] [Accepted: 06/11/2018] [Indexed: 01/05/2023] Open
Abstract
Background: Text mining (TM) methods have been used extensively to extract relations and events from the literature. In addition, TM techniques have been used to extract various types or dimensions of interpretative information, known as Meta-Knowledge (MK), from the context of relations and events, e.g. negation, speculation, certainty and knowledge type. However, most existing methods have focussed on the extraction of individual dimensions of MK, without investigating how they can be combined to obtain even richer contextual information. In this paper, we describe a novel, supervised method to extract new MK dimensions that encode Research Hypotheses (an author's intended knowledge gain) and New Knowledge (an author's findings). The method incorporates various features, including a combination of simple MK dimensions.
Methods: We identify previously explored dimensions and then use a random forest to combine these with linguistic features into a classification model. To facilitate evaluation of the model, we have enriched two existing corpora annotated with relations and events, i.e., a subset of the GENIA-MK corpus and the EU-ADR corpus, by adding attributes to encode whether each relation or event corresponds to Research Hypothesis or New Knowledge. In the GENIA-MK corpus, these new attributes complement simpler MK dimensions that had previously been annotated.
Results: We show that our approach is able to assign different types of MK dimensions to relations and events with a high degree of accuracy. Firstly, our method is able to improve upon the previously reported state-of-the-art performance for an existing dimension, i.e., Knowledge Type. Secondly, we also demonstrate high F1-scores in predicting the new dimensions of Research Hypothesis (GENIA: 0.914, EU-ADR: 0.802) and New Knowledge (GENIA: 0.829, EU-ADR: 0.836).
Conclusion: We have presented a novel approach for predicting New Knowledge and Research Hypothesis, which combines simple MK dimensions to achieve high F1-scores. The extraction of such information is valuable for a number of practical TM applications. The online version of this article (10.1186/s12911-018-0639-1) contains supplementary material, which is available to authorized users.
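The core of the method summarised above is a random forest over feature vectors that combine simple MK dimensions with linguistic features. A minimal sketch of that feature-combination step, with hypothetical dimension names, feature names, and example event (none taken from the paper):

```python
# Illustrative sketch (not the authors' code): build one feature vector per
# event by concatenating simple meta-knowledge (MK) dimensions with surface
# linguistic features, as input for a classifier such as a random forest.
# Dimension names, feature names, and the example event are hypothetical.

MK_DIMENSIONS = ["negation", "speculation", "certainty_level", "knowledge_type"]

def build_feature_vector(event):
    """Combine simple MK dimensions with linguistic features."""
    features = {}
    for dim in MK_DIMENSIONS:                       # simple MK dimensions
        features[f"mk_{dim}"] = event.get(dim, "none")
    features["trigger_lower"] = event.get("trigger", "").lower()
    features["in_results_section"] = event.get("section") == "Results"
    return features

event = {
    "trigger": "suggests",
    "negation": "none",
    "speculation": "speculated",
    "certainty_level": "L2",
    "knowledge_type": "Analysis",
    "section": "Results",
}
vec = build_feature_vector(event)
```

The resulting dictionary would then be vectorised and fed to the classifier alongside vectors for other events.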
Affiliation(s)
- Matthew Shardlow
- National Centre for Text Mining, University of Manchester, Manchester, UK
- Paul Thompson
- National Centre for Text Mining, University of Manchester, Manchester, UK
- Raheel Nawaz
- National Centre for Text Mining, University of Manchester, Manchester, UK
- John McNaught
- National Centre for Text Mining, University of Manchester, Manchester, UK
- Sophia Ananiadou
- National Centre for Text Mining, University of Manchester, Manchester, UK
27
Demner-Fushman D, Rogers WJ, Aronson AR. MetaMap Lite: an evaluation of a new Java implementation of MetaMap. J Am Med Inform Assoc 2018; 24:841-844. [PMID: 28130331; DOI: 10.1093/jamia/ocw177]
Abstract
MetaMap is a widely used named entity recognition tool that identifies concepts from the Unified Medical Language System Metathesaurus in text. This study presents MetaMap Lite, an implementation of some of the basic MetaMap functions in Java. On several collections of biomedical literature and clinical text, MetaMap Lite demonstrated real-time speed and precision, recall, and F1 scores comparable to or exceeding those of MetaMap and other popular biomedical text processing tools, clinical Text Analysis and Knowledge Extraction System (cTAKES) and DNorm.
Affiliation(s)
- Dina Demner-Fushman
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Willie J Rogers
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Alan R Aronson
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
28
Kilicoglu H, Ben Abacha A, Mrabet Y, Shooshan SE, Rodriguez L, Masterton K, Demner-Fushman D. Semantic annotation of consumer health questions. BMC Bioinformatics 2018; 19:34. [PMID: 29409442; PMCID: PMC5802048; DOI: 10.1186/s12859-018-2045-1]
Abstract
BACKGROUND Consumers increasingly use online resources for their health information needs. While current search engines can address these needs to some extent, they generally do not take into account that most health information needs are complex and can only fully be expressed in natural language. Consumer health question answering (QA) systems aim to fill this gap. A major challenge in developing consumer health QA systems is extracting relevant semantic content from the natural language questions (question understanding). To develop effective question understanding tools, question corpora semantically annotated for relevant question elements are needed. In this paper, we present a two-part consumer health question corpus annotated with several semantic categories: named entities, question triggers/types, question frames, and question topic. The first part (CHQA-email) consists of relatively long email requests received by the U.S. National Library of Medicine (NLM) customer service, while the second part (CHQA-web) consists of shorter questions posed to MedlinePlus search engine as queries. Each question has been annotated by two annotators. The annotation methodology is largely the same between the two parts of the corpus; however, we also explain and justify the differences between them. Additionally, we provide information about corpus characteristics, inter-annotator agreement, and our attempts to measure annotation confidence in the absence of adjudication of annotations. RESULTS The resulting corpus consists of 2614 questions (CHQA-email: 1740, CHQA-web: 874). Problems are the most frequent named entities, while treatment and general information questions are the most common question types. Inter-annotator agreement was generally modest: question types and topics yielded highest agreement, while the agreement for more complex frame annotations was lower. Agreement in CHQA-web was consistently higher than that in CHQA-email. 
Pairwise inter-annotator agreement proved most useful in estimating annotation confidence. CONCLUSIONS To our knowledge, our corpus is the first focusing on annotation of uncurated consumer health questions. It is currently used to develop machine learning-based methods for question understanding. We make the corpus publicly available to stimulate further research on consumer health QA.
Affiliation(s)
- Halil Kilicoglu
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, USA
- Asma Ben Abacha
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, USA
- Yassine Mrabet
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, USA
- Sonya E. Shooshan
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, USA
- Laritza Rodriguez
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, USA
- Kate Masterton
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, USA
- Dina Demner-Fushman
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, USA
29
Chen C, Song M, Heo GE. A scalable and adaptive method for finding semantically equivalent cue words of uncertainty. J Informetr 2018. [DOI: 10.1016/j.joi.2017.12.004]
30
Zerva C, Batista-Navarro R, Day P, Ananiadou S. Using uncertainty to link and rank evidence from biomedical literature for model curation. Bioinformatics 2017; 33:3784-3792. [PMID: 29036627; PMCID: PMC5860317; DOI: 10.1093/bioinformatics/btx466]
Abstract
MOTIVATION In recent years, there has been great progress in the field of automated curation of biomedical networks and models, aided by text mining methods that provide evidence from literature. Such methods must not only extract snippets of text that relate to model interactions, but also be able to contextualize the evidence and provide additional confidence scores for the interaction in question. Although various approaches calculating confidence scores have focused primarily on the quality of the extracted information, there has been little work on exploring the textual uncertainty conveyed by the author. Despite textual uncertainty being acknowledged in biomedical text mining as an attribute of text mined interactions (events), it is significantly understudied as a means of providing a confidence measure for interactions in pathways or other biomedical models. In this work, we focus on improving identification of textual uncertainty for events and explore how it can be used as an additional measure of confidence for biomedical models. RESULTS We present a novel method for extracting uncertainty from the literature using a hybrid approach that combines rule induction and machine learning. Variations of this hybrid approach are then discussed, alongside their advantages and disadvantages. We use subjective logic theory to combine multiple uncertainty values extracted from different sources for the same interaction. Our approach achieves F-scores of 0.76 and 0.88 based on the BioNLP-ST and Genia-MK corpora, respectively, making considerable improvements over previously published work. Moreover, we evaluate our proposed system on pathways related to two different areas, namely leukemia and melanoma cancer research. AVAILABILITY AND IMPLEMENTATION The leukemia pathway model used is available in Pathway Studio while the Ras model is available via PathwayCommons. 
Online demonstration of the uncertainty extraction system is available for research purposes at http://argo.nactem.ac.uk/test. The related code is available on https://github.com/c-zrv/uncertainty_components.git. Details on the above are available in the Supplementary Material. CONTACT sophia.ananiadou@manchester.ac.uk. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
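The abstract describes combining multiple uncertainty values with subjective logic theory. As a rough illustration, the standard cumulative fusion operator for binomial opinions can be sketched as follows; this is the generic textbook operator, not necessarily the exact variant used in the paper, and the example numbers are invented:

```python
def cumulative_fusion(o1, o2):
    """Cumulative fusion of two binomial opinions (belief, disbelief,
    uncertainty), each summing to 1 and with uncertainty > 0."""
    b1, d1, u1 = o1
    b2, d2, u2 = o2
    k = u1 + u2 - u1 * u2                  # normalisation term
    return ((b1 * u2 + b2 * u1) / k,
            (d1 * u2 + d2 * u1) / k,
            (u1 * u2) / k)

# Two sources reporting the same interaction with different confidence;
# fusing them yields a single opinion with lower residual uncertainty.
fused = cumulative_fusion((0.7, 0.1, 0.2), (0.5, 0.2, 0.3))
```

The fused components still sum to 1, and the fused uncertainty is lower than that of either source, which matches the intuition of accumulating evidence.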
Affiliation(s)
- Chrysoula Zerva
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
- Riza Batista-Navarro
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
- Philip Day
- Manchester Institute of Biotechnology, The University of Manchester, Manchester, UK
- Sophia Ananiadou
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
31
Kilicoglu H, Rosemblat G, Rindflesch TC. Assigning factuality values to semantic relations extracted from biomedical research literature. PLoS One 2017; 12:e0179926. [PMID: 28678823; PMCID: PMC5497973; DOI: 10.1371/journal.pone.0179926]
Abstract
Biomedical knowledge claims are often expressed as hypotheses, speculations, or opinions, rather than explicit facts (propositions). Much biomedical text mining has focused on extracting propositions from biomedical literature. One such system is SemRep, which extracts propositional content in the form of subject-predicate-object triples called predications. In this study, we investigated the feasibility of assessing the factuality level of SemRep predications to provide more nuanced distinctions between predications for downstream applications. We annotated semantic predications extracted from 500 PubMed abstracts with seven factuality values (fact, probable, possible, doubtful, counterfact, uncommitted, and conditional). We extended a rule-based, compositional approach that uses lexical and syntactic information to predict factuality levels. We compared this approach to a supervised machine learning method that uses a rich feature set based on the annotated corpus. Our results indicate that the compositional approach is more effective than the machine learning method in recognizing the factuality values of predications. The annotated corpus as well as the source code and binaries for factuality assignment are publicly available. We will also incorporate the results of the better performing compositional approach into SemMedDB, a PubMed-scale repository of semantic predications extracted using SemRep.
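A minimal sketch of the cue-driven side of such factuality assignment follows; the cue lexicon and matching logic are hypothetical simplifications (the paper's compositional approach also uses syntactic information), but the seven factuality values are those listed in the abstract:

```python
# Hypothetical cue lexicon mapping lexical triggers to factuality values.
# The seven values (fact, probable, possible, doubtful, counterfact,
# uncommitted, conditional) come from the study; the cues are illustrative.
FACTUALITY_CUES = {
    "may": "possible", "might": "possible", "possibly": "possible",
    "probably": "probable", "likely": "probable",
    "unlikely": "doubtful",
    "not": "counterfact", "no": "counterfact",
    "if": "conditional",
    "investigate": "uncommitted", "examine": "uncommitted",
}

def assign_factuality(sentence_tokens):
    """Return the factuality value triggered by the first matching cue;
    default to 'fact' when no cue is present."""
    for token in sentence_tokens:
        value = FACTUALITY_CUES.get(token.lower())
        if value:
            return value
    return "fact"

assign_factuality("Aspirin may reduce inflammation".split())  # 'possible'
```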
Affiliation(s)
- Halil Kilicoglu
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, Bethesda, MD, 20894, United States of America
- Graciela Rosemblat
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, Bethesda, MD, 20894, United States of America
- Thomas C. Rindflesch
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, Bethesda, MD, 20894, United States of America
32
Bokharaeian B, Diaz A, Taghizadeh N, Chitsaz H, Chavoshinejad R. SNPPhenA: a corpus for extracting ranked associations of single-nucleotide polymorphisms and phenotypes from literature. J Biomed Semantics 2017; 8:14. [PMID: 28388928; PMCID: PMC5383945; DOI: 10.1186/s13326-017-0116-2]
Abstract
Background: Single Nucleotide Polymorphisms (SNPs) are among the most important types of genetic variations influencing common diseases and phenotypes. Recently, some corpora and methods have been developed for extracting mutations and diseases from texts. However, no available corpus for extracting associations from texts is annotated with linguistic-based negation, modality markers, neutral candidates, and the confidence level of associations.
Method: This research presents the steps taken to produce the SNPPhenA corpus: automatic Named Entity Recognition (NER) followed by manual annotation of SNP and phenotype names, annotation of the SNP-phenotype associations and their level of confidence, and annotation of modality markers. Moreover, the corpus was annotated with negation scopes and cues as well as neutral candidates, which play a crucial role in negation- and modality-related extraction tasks.
Result: Agreement between annotators was measured with Cohen's Kappa coefficient, and the resulting scores indicate the reliability of the corpus. The Kappa score was 0.79 for annotating the associations and 0.80 for the confidence degree of associations. Also presented are basic statistics of the annotated features of the corpus, together with the results of our first experiments on extracting ranked SNP-phenotype associations. The prepared guideline documents make the corpus easier to use. The corpus, guidelines and inter-annotator agreement analysis are available on the website of the corpus: http://nil.fdi.ucm.es/?q=node/639.
Conclusion: Specifying the confidence degree of SNP-phenotype associations from articles helps identify the strength of associations, which could in turn assist genomics scientists in determining phenotypic plasticity and the importance of environmental factors. Moreover, our first experiments with the corpus show that linguistic-based confidence, alongside other non-linguistic features, can be used to estimate the strength of observed SNP-phenotype associations. Trial registration: not applicable. The online version of this article (doi:10.1186/s13326-017-0116-2) contains supplementary material, which is available to authorized users.
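The inter-annotator agreement figures reported above are Cohen's Kappa scores. For reference, a small worked implementation of the statistic (the annotation lists are invented):

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(a) == len(b) and a
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    labels = set(a) | set(b)
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)  # chance
    return (po - pe) / (1 - pe)

# Two annotators labelling 8 candidate associations positive (1) / negative (0):
ann1 = [1, 1, 0, 1, 0, 1, 1, 0]
ann2 = [1, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(ann1, ann2)   # agreement above chance, but modest
```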
Affiliation(s)
- Behrouz Bokharaeian
- Facultad de Informática, Complutense University of Madrid, Calle Profesor José García Santesmases, 9, 28040, Madrid, Spain
- Alberto Diaz
- Facultad de Informática, Complutense University of Madrid, Calle Profesor José García Santesmases, 9, 28040, Madrid, Spain
- Nasrin Taghizadeh
- School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
- Hamidreza Chitsaz
- Department of Computer Science, Colorado State University, Fort Collins, CO, 80523, USA
- Ramyar Chavoshinejad
- External Collaborator, Reproductive Biomedicine Research Center, Royan Institute for Reproductive Biomedicine, Tehran, Iran
33
Kang T, Zhang S, Xu N, Wen D, Zhang X, Lei J. Detecting negation and scope in Chinese clinical notes using character and word embedding. Comput Methods Programs Biomed 2017; 140:53-59. [PMID: 28254090; DOI: 10.1016/j.cmpb.2016.11.009]
Abstract
BACKGROUND AND OBJECTIVES Researchers have developed effective methods to index free-text clinical notes into structured database, in which negation detection is a critical but challenging step. In Chinese clinical records, negation detection is particularly challenging because it may depend on upstream Chinese information processing components such as word segmentation [1]. Traditionally, negation detection was carried out mostly using rule-based methods, whose comprehensiveness and portability were usually limited. Our objectives in this paper are to: 1) Construct a large Chinese clinical notes corpus with negation annotated; 2) develop a negation detection tool for Chinese clinical notes; 3) evaluate the performance of character and word embedding features in Chinese clinical natural language processing. METHODS In this paper, we construct a Chinese clinical corpus consisting of admission and discharge summaries, and propose sequence labeling based systems for negation and scope detection. Our systems rely on features from bag of characters, bag of words, character embedding and word embedding. For scopes, we introduce an additional feature to handle nested scopes with multiple negations. RESULTS The two annotators reached an agreement of 0.79 measured by Kappa in manual annotation. In cue detection, our systems are able to achieve a performance as high as 99.0% measured by F score, which significantly outperform its rule-based counterpart (79% F). The best system uses word embedding as features, which yields precision of 99.0% and recall of 99.1%. In scope detection, our system is able to achieve a performance of 94.6% measured by F score. CONCLUSIONS Our study provides a state-of-the-art negation-detecting tool for Chinese clinical free-text notes; Experimental results demonstrate that word embedding is effective in identifying negations, and that nested scopes can be identified effectively by our method.
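The systems above cast negation cue and scope detection as sequence labelling. A minimal sketch of how cue/scope annotations can be encoded as BIO tags for such a system (the tag scheme and toy example are illustrative, not the paper's exact setup):

```python
def bio_encode(tokens, cue_span, scope_span):
    """Encode a negation cue and its scope as BIO tags over a token
    sequence. Spans are (start, end) token indices, end-exclusive.
    Cue tags take precedence where the spans overlap."""
    tags = ["O"] * len(tokens)
    for name, (start, end) in (("CUE", cue_span), ("SCOPE", scope_span)):
        for i in range(start, end):
            if tags[i] == "O":
                tags[i] = ("B-" if i == start else "I-") + name
    return tags

# Toy Chinese example: "无 发热 及 咳嗽" ("no fever or cough"),
# with "无" as the cue and "发热 及 咳嗽" as its scope.
tokens = ["无", "发热", "及", "咳嗽"]
tags = bio_encode(tokens, cue_span=(0, 1), scope_span=(1, 4))
# tags: ['B-CUE', 'B-SCOPE', 'I-SCOPE', 'I-SCOPE']
```

A tagger (e.g. a CRF over character or word embedding features, as in the paper) is then trained to predict these tags for unseen sentences.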
Affiliation(s)
- Tian Kang
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Shaodian Zhang
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Nanfang Xu
- Department of Orthopedic Surgery, Peking University Third Hospital, Beijing, China
- Dong Wen
- Center for Medical Informatics, Peking University, Beijing, China
- Xingting Zhang
- Center for Medical Informatics, Peking University, Beijing, China
- Jianbo Lei
- Center for Medical Informatics, Peking University, Beijing, China; School of Medical Informatics and Engineering, Southwest Medical University, Luzhou, Sichuan, P.R. China
34
Zhang S, Kang T, Zhang X, Wen D, Elhadad N, Lei J. Speculation detection for Chinese clinical notes: Impacts of word segmentation and embedding models. J Biomed Inform 2016; 60:334-41. [PMID: 26923634; DOI: 10.1016/j.jbi.2016.02.011]
Abstract
Speculations represent uncertainty toward certain facts. In clinical texts, identifying speculations is a critical step of natural language processing (NLP). While it is a nontrivial task in many languages, detecting speculations in Chinese clinical notes can be particularly challenging because word segmentation may be necessary as an upstream operation. The objective of this paper is to construct a state-of-the-art speculation detection system for Chinese clinical notes and to investigate whether embedding features and word segmentations are worth exploiting toward this overall task. We propose a sequence labeling based system for speculation detection, which relies on features from bag of characters, bag of words, character embedding, and word embedding. We experiment on a novel dataset of 36,828 clinical notes with 5103 gold-standard speculation annotations on 2000 notes, and compare the systems in which word embeddings are calculated based on word segmentations given by general and by domain specific segmenters respectively. Our systems are able to reach performance as high as 92.2% measured by F score. We demonstrate that word segmentation is critical to produce high quality word embedding to facilitate downstream information extraction applications, and suggest that a domain dependent word segmenter can be vital to such a clinical NLP task in Chinese language.
Affiliation(s)
- Shaodian Zhang
- Department of Biomedical Informatics, Columbia University, New York, USA
- Tian Kang
- Department of Biomedical Informatics, Columbia University, New York, USA
- Xingting Zhang
- Center for Medical Informatics, Peking University, Beijing, China
- Dong Wen
- Center for Medical Informatics, Peking University, Beijing, China
- Noémie Elhadad
- Department of Biomedical Informatics, Columbia University, New York, USA
- Jianbo Lei
- Center for Medical Informatics, Peking University, Beijing, China
35
Thompson P, Nawaz R, McNaught J, Ananiadou S. Enriching news events with meta-knowledge information. Lang Resour Eval 2016. [DOI: 10.1007/s10579-016-9344-9]
36
Weegar R, Kvist M, Sundström K, Brunak S, Dalianis H. Finding Cervical Cancer Symptoms in Swedish Clinical Text using a Machine Learning Approach and NegEx. AMIA Annu Symp Proc 2015; 2015:1296-1305. [PMID: 26958270; PMCID: PMC4765575]
Abstract
Detection of early symptoms in cervical cancer is crucial for early treatment and survival. To find symptoms of cervical cancer in clinical text, Named Entity Recognition is needed. In this paper the Clinical Entity Finder, a machine-learning tool trained on annotated clinical text from a Swedish internal medicine emergency unit, is evaluated on cervical cancer records. The Clinical Entity Finder identifies entities of the types body part, finding and disorder and is extended with negation detection using the rule-based tool NegEx, to distinguish between negated and non-negated entities. To measure the performance of the tools on this new domain, two physicians annotated a set of clinical notes from the health records of cervical cancer patients. The inter-annotator agreement for finding, disorder and body part obtained an average F-score of 0.677 and the Clinical Entity Finder extended with NegEx had an average F-score of 0.667.
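NegEx, used above to separate negated from non-negated entities, is rule-based: roughly, an entity is treated as negated when a trigger phrase occurs shortly before it. A heavily simplified single-token sketch (real NegEx also handles multi-word and post-positioned triggers, pseudo-negations, and scope-terminating terms):

```python
# Simplified NegEx-style check, not the original implementation.
NEG_TRIGGERS = {"no", "not", "without", "denies", "denied"}
WINDOW = 5  # look back at most this many tokens before the entity

def is_negated(tokens, entity_index):
    """True if a negation trigger occurs within WINDOW tokens
    immediately before the entity at entity_index."""
    start = max(0, entity_index - WINDOW)
    return any(t.lower() in NEG_TRIGGERS for t in tokens[start:entity_index])

tokens = "patient denies abnormal bleeding".split()
is_negated(tokens, 3)   # entity 'bleeding' is negated by 'denies'
```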
Affiliation(s)
- Rebecka Weegar
- Department of Computer and Systems Sciences (DSV), Stockholm University, Sweden
- Maria Kvist
- Department of Computer and Systems Sciences (DSV), Stockholm University, Sweden; Department of Learning, Informatics, Management and Ethics (LIME), Karolinska Institutet, Stockholm, Sweden
- Karin Sundström
- Department of Laboratory Medicine (LABMED), Karolinska Institutet, Stockholm, Sweden
- Søren Brunak
- NNF Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
- Hercules Dalianis
- Department of Computer and Systems Sciences (DSV), Stockholm University, Sweden
37
A research framework for pharmacovigilance in health social media: Identification and evaluation of patient adverse drug event reports. J Biomed Inform 2015; 58:268-279. [PMID: 26518315; DOI: 10.1016/j.jbi.2015.10.011]
Abstract
Social media offer insights into patients' medical problems such as drug side effects and treatment failures. Patient reports of adverse drug events from social media have great potential to improve current practice of pharmacovigilance. However, extracting patient adverse drug event reports from social media continues to be an important challenge for health informatics research. In this study, we develop a research framework with advanced natural language processing techniques for integrated and high-performance extraction of patient-reported adverse drug events. The framework consists of medical entity extraction for recognizing patient discussions of drugs and events; adverse drug event extraction with a shortest-dependency-path kernel based statistical learning method and semantic filtering with information from medical knowledge bases; and report source classification to tease out noise. To evaluate the proposed framework, a series of experiments was conducted on a test bed of postings from major diabetes and heart disease forums in the United States. The results reveal that each component of the framework significantly contributes to its overall effectiveness, and that the framework significantly outperforms prior work.
38
Pyysalo S, Ohta T, Rak R, Rowley A, Chun HW, Jung SJ, Choi SP, Tsujii J, Ananiadou S. Overview of the Cancer Genetics and Pathway Curation tasks of BioNLP Shared Task 2013. BMC Bioinformatics 2015; 16 Suppl 10:S2. [PMID: 26202570; PMCID: PMC4511510; DOI: 10.1186/1471-2105-16-s10-s2]
Abstract
BACKGROUND Since their introduction in 2009, the BioNLP Shared Task events have been instrumental in advancing the development of methods and resources for the automatic extraction of information from the biomedical literature. In this paper, we present the Cancer Genetics (CG) and Pathway Curation (PC) tasks, two event extraction tasks introduced in the BioNLP Shared Task 2013. The CG task focuses on cancer, emphasizing the extraction of physiological and pathological processes at various levels of biological organization, and the PC task targets reactions relevant to the development of biomolecular pathway models, defining its extraction targets on the basis of established pathway representations and ontologies. RESULTS Six groups participated in the CG task and two groups in the PC task, together applying a wide range of extraction approaches including both established state-of-the-art systems and newly introduced extraction methods. The best-performing systems achieved F-scores of 55% on the CG task and 53% on the PC task, demonstrating a level of performance comparable to the best results achieved in similar previously proposed tasks. CONCLUSIONS The results indicate that existing event extraction technology can generalize to meet the novel challenges represented by the CG and PC task settings, suggesting that extraction methods are capable of supporting the construction of knowledge bases on the molecular mechanisms of cancer and the curation of biomolecular pathway models. The CG and PC tasks continue as open challenges for all interested parties, with data, tools and resources available from the shared task homepage.
Affiliation(s)
- Sampo Pyysalo
- Department of Information Technology, University of Turku, Turku, Finland
- Rafal Rak
- National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester, UK
- Andrew Rowley
- National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester, UK
- Hong-Woo Chun
- Software Research Center, Korea Institute of Science and Technology Information (KISTI), Daejeon, South Korea
- Sung-Jae Jung
- Software Research Center, Korea Institute of Science and Technology Information (KISTI), Daejeon, South Korea; Department of Applied Information Science, University of Science and Technology (UST), Daejeon, South Korea
- Sung-Pil Choi
- Department of Library and Information Science, Kyonggi University, Suwon, South Korea
- Sophia Ananiadou
- National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester, UK
39
Demner-Fushman D, Kohli MD, Rosenman MB, Shooshan SE, Rodriguez L, Antani S, Thoma GR, McDonald CJ. Preparing a collection of radiology examinations for distribution and retrieval. J Am Med Inform Assoc 2015; 23:304-10. [PMID: 26133894; DOI: 10.1093/jamia/ocv080]
Abstract
OBJECTIVE Clinical documents made available for secondary use play an increasingly important role in discovery of clinical knowledge, development of research methods, and education. An important step in facilitating secondary use of clinical document collections is easy access to descriptions and samples that represent the content of the collections. This paper presents an approach to developing a collection of radiology examinations, including both the images and radiologist narrative reports, and making them publicly available in a searchable database. MATERIALS AND METHODS The authors collected 3996 radiology reports from the Indiana Network for Patient Care and 8121 associated images from the hospitals' picture archiving systems. The images and reports were de-identified automatically and then the automatic de-identification was manually verified. The authors coded the key findings of the reports and empirically assessed the benefits of manual coding on retrieval. RESULTS The automatic de-identification of the narrative was aggressive and achieved 100% precision at the cost of rendering a few findings uninterpretable. Automatic de-identification of images was not quite as perfect. Images for two of 3996 patients (0.05%) showed protected health information. Manual encoding of findings improved retrieval precision. CONCLUSION Stringent de-identification methods can remove all identifiers from text radiology reports. DICOM de-identification of images does not remove all identifying information and needs special attention to images scanned from film. Adding manual coding to the radiologist narrative reports significantly improved relevancy of the retrieved clinical documents. The de-identified Indiana chest X-ray collection is available for searching and downloading from the National Library of Medicine (http://openi.nlm.nih.gov/).
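A hedged sketch of the kind of rule-based text de-identification described above, replacing a few common identifier patterns with placeholders; the patterns below are illustrative only, and the study's pipeline is far more aggressive:

```python
import re

# Illustrative PHI-scrubbing patterns; a real de-identification pipeline
# covers many more identifier types (names, MRNs, addresses, ages > 89, ...).
PATTERNS = [
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),    # dates
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),         # phone numbers
    (re.compile(r"\bDr\.\s+[A-Z][a-z]+\b"), "[PHYSICIAN]"),    # physician names
]

def deidentify(text):
    """Replace each matched identifier pattern with its placeholder."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

deidentify("Seen by Dr. Smith on 03/12/2014, call 555-123-4567.")
# → 'Seen by [PHYSICIAN] on [DATE], call [PHONE].'
```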
Collapse
Affiliation(s)
- Dina Demner-Fushman
- Staff Scientist, Lister Hill National Center for Biomedical Communications National Library of Medicine, National Institutes of Health Bldg. 38A, Room 10S-1022, 8600 Rockville Pike MSC-3824 Bethesda, MD 20894, USA
- Marc D Kohli
- Assistant Professor, Director of Informatics, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, IN, USA
- Marc B Rosenman
- Associate Professor, Children's Health Services Research, Department of Pediatrics, Indiana University School of Medicine, Indianapolis, IN, USA
- Sonya E Shooshan
- Computer Science Branch, Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Laritza Rodriguez
- Computer Science Branch, Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Sameer Antani
- Staff Scientist, Communications Engineering Branch, Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- George R Thoma
- Branch Chief, Communications Engineering Branch, Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Clement J McDonald
- Director, Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
40
Mehrabi S, Krishnan A, Sohn S, Roch AM, Schmidt H, Kesterson J, Beesley C, Dexter P, Max Schmidt C, Liu H, Palakal M. DEEPEN: A negation detection system for clinical text incorporating dependency relation into NegEx. J Biomed Inform 2015; 54:213-9. [PMID: 25791500 PMCID: PMC5863758 DOI: 10.1016/j.jbi.2015.02.010]
Abstract
In Electronic Health Records (EHRs), much valuable information regarding patients' conditions is embedded in free-text format. Natural language processing (NLP) techniques have been developed to extract clinical information from free text. One challenge in clinical NLP is that the meaning of clinical entities is heavily affected by modifiers such as negation. The negation detection algorithm NegEx applies a simple approach that has proven powerful in clinical NLP. However, because it does not consider the contextual relationships between words within a sentence, NegEx fails to correctly capture the negation status of concepts in complex sentences. Incorrect negation assignment can cause inaccurate diagnosis of a patient's condition or contaminated study cohorts. We developed a negation algorithm called DEEPEN that decreases NegEx's false positives by taking into account the dependency relationships between negation words and concepts within a sentence, using the Stanford dependency parser. The system was developed and tested using EHR data from Indiana University (IU) and was further evaluated on a Mayo Clinic dataset to assess its generalizability. The evaluation results demonstrate that DEEPEN, which incorporates dependency parsing into NegEx, reduces the number of incorrect negation assignments for patients with positive findings, and therefore improves the identification of patients with the target clinical findings in EHRs.
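DEEPEN's core idea, keeping a NegEx trigger hit only when the trigger and the concept are connected in the dependency parse, can be sketched as follows. The trigger list and the hand-built, acyclic head map are toy stand-ins for NegEx's lexicon and the Stanford parser output, not the published implementation.

```python
TRIGGERS = {"no", "denies", "without"}  # toy NegEx-style trigger list

def head_chain(token, heads):
    """Walk token -> head -> ... to the root, collecting every ancestor."""
    chain = []
    while token in heads:
        token = heads[token]
        chain.append(token)
    return chain

def negated(concept, trigger, heads):
    """Dependency filter: the trigger must govern or be governed by the concept."""
    return trigger in head_chain(concept, heads) or concept in head_chain(trigger, heads)

def deepen(tokens, concept, heads):
    """NegEx step (a trigger is present) plus DEEPEN's dependency check."""
    return any(negated(concept, t, heads) for t in TRIGGERS if t in tokens)

tokens = "no cough but fever persists".split()
heads = {"no": "cough", "cough": "persists", "but": "persists", "fever": "persists"}
print(deepen(tokens, "cough", heads))  # True: "no" attaches to "cough"
print(deepen(tokens, "fever", heads))  # False: no dependency link to "no"
```

In plain NegEx, "fever" would also fall inside the trigger's window; the dependency check is what prunes that false positive.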
Affiliation(s)
- Saeed Mehrabi
- School of Informatics and Computing, Indiana University, Indianapolis, IN, USA; Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Anand Krishnan
- School of Informatics and Computing, Indiana University, Indianapolis, IN, USA
- Sunghwan Sohn
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Alexandra M Roch
- Department of Surgery, Indiana University, Indianapolis, IN, USA
- Heidi Schmidt
- Department of Surgery, Indiana University, Indianapolis, IN, USA
- Paul Dexter
- Regenstrief Institute, Indianapolis, IN, USA
- C Max Schmidt
- Department of Surgery, Indiana University, Indianapolis, IN, USA
- Hongfang Liu
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Mathew Palakal
- School of Informatics and Computing, Indiana University, Indianapolis, IN, USA
41
Automatic negation detection in narrative pathology reports. Artif Intell Med 2015; 64:41-50. [PMID: 25990897 DOI: 10.1016/j.artmed.2015.03.001]
Abstract
OBJECTIVE To detect negations of medical entities in free-text pathology reports with different approaches, and to evaluate their performance. METHODS AND MATERIAL Three approaches were applied for negation detection: the lexicon-based approach was a rule-based method relying on trigger terms and termination clues; the syntax-based approach was also rule-based, with rules and negation patterns designed using the dependency output of the Stanford parser; the machine-learning-based approach used a support vector machine classifier to build models from a number of features. A total of 284 English pathology reports of lymphoma were used for the study. RESULTS The machine-learning-based approach had the best overall performance on the test set, with a micro-averaged F-score of 82.56%, while the syntax-based approach performed worst, with a 78.62% F-score. The lexicon-based approach attained an overall average precision of 89.74% and recall of 76.09%, significantly better than the results achieved by Negation Tagger with a similar approach. DISCUSSION The lexicon-based approach benefited more than the other two methods from being customized to the corpus. The errors of the worst-performing syntax-based approach were mainly due to poor parsing results, and the errors of the other methods were probably caused by abnormal grammatical structures. CONCLUSIONS A machine-learning-based approach has potential advantages for negation detection and may be preferable for the task. One possible way to improve overall performance is to apply a different approach to each section of the reports.
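The lexicon-based approach described above (trigger terms open a negation scope; termination clues close it) can be sketched in a few lines. The trigger and termination lists here are illustrative, not the customized lexicon the study built.

```python
# Toy lexicons; the study's actual lists were customized to its lymphoma corpus.
NEGATION_TRIGGERS = {"no", "not", "negative", "without"}
TERMINATION_CLUES = {"but", "however", "although", "except"}

def negation_scopes(tokens):
    """Return the set of token positions that fall inside a negation scope."""
    in_scope, negated = False, set()
    for i, tok in enumerate(tokens):
        word = tok.lower().strip(",.;")
        if not word:            # bare punctuation token
            continue
        if word in NEGATION_TRIGGERS:
            in_scope = True     # trigger opens a scope
        elif word in TERMINATION_CLUES:
            in_scope = False    # termination clue closes it
        elif in_scope:
            negated.add(i)
    return negated

tokens = "No evidence of lymphoma , but atypical cells are present".split()
print([tokens[i] for i in sorted(negation_scopes(tokens))])
# -> ['evidence', 'of', 'lymphoma']  (scope stops at "but")
```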
42
Bravo À, Piñero J, Queralt-Rosinach N, Rautschka M, Furlong LI. Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research. BMC Bioinformatics 2015; 16:55. [PMID: 25886734 PMCID: PMC4466840 DOI: 10.1186/s12859-015-0472-9]
Abstract
Background Current biomedical research needs to leverage and exploit the large amount of information reported in scientific publications. Automated text mining approaches, in particular those aimed at finding relationships between entities, are key for the identification of actionable knowledge from free-text repositories. We present the BeFree system, aimed at identifying relationships between biomedical entities with a special focus on genes and their associated diseases. Results By exploiting morpho-syntactic information of the text, BeFree is able to identify gene-disease, drug-disease and drug-target associations with state-of-the-art performance. The application of BeFree to real-case scenarios shows its effectiveness in extracting information relevant for translational research. We show the value of the gene-disease associations extracted by BeFree through a number of analyses and integration with other data sources. BeFree succeeds in identifying genes associated with depression, a major cause of morbidity worldwide, that are not present in other public resources. Moreover, large-scale extraction and analysis of gene-disease associations, and integration with current biomedical knowledge, provided interesting insights on the kind of information that can be found in the literature, and raised challenges regarding data prioritization and curation. We found that only a small proportion of the gene-disease associations discovered by BeFree is collected in expert-curated databases. Thus, there is a pressing need to find alternative strategies to manual curation in order to review, prioritize and curate text-mining data and incorporate it into domain-specific databases. We present our strategy for data prioritization and discuss its implications for supporting biomedical research and applications. Conclusions BeFree is a novel text mining system that performs competitively for the identification of gene-disease, drug-disease and drug-target associations. Our analyses show that mining only a small fraction of MEDLINE results in a large dataset of gene-disease associations, of which only a small proportion (2%) is actually recorded in curated resources, raising several issues on data prioritization and curation. We propose that joint analysis of text-mined data with data curated by experts is a suitable approach to both assess data quality and highlight novel and interesting information. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0472-9) contains supplementary material, which is available to authorized users.
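As a hedged baseline for the kind of association mining BeFree performs, the sketch below extracts sentence-level gene-disease co-occurrences. BeFree itself additionally exploits morpho-syntactic information, and the entity dictionaries here are toy stand-ins, not its actual gene and disease lexicons.

```python
# Toy entity dictionaries; real systems use large curated lexicons.
GENES = {"BDNF", "SLC6A4", "TP53"}
DISEASES = {"depression", "cancer"}

def cooccurrences(text):
    """Collect (gene, disease) pairs that co-occur within a sentence."""
    pairs = set()
    for sentence in text.split("."):            # naive sentence splitting
        words = set(sentence.replace(",", " ").split())
        lowered = {w.lower() for w in words}
        for gene in GENES & words:              # gene symbols are case-sensitive
            for disease in DISEASES & lowered:
                pairs.add((gene, disease))
    return pairs

text = "BDNF and SLC6A4 have been linked to depression. TP53 mutations drive cancer."
print(sorted(cooccurrences(text)))
# -> [('BDNF', 'depression'), ('SLC6A4', 'depression'), ('TP53', 'cancer')]
```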
Affiliation(s)
- Àlex Bravo
- Research Programme on Biomedical Informatics (GRIB), IMIM, DCEXS, Universitat Pompeu Fabra, Barcelona, Spain.
- Janet Piñero
- Research Programme on Biomedical Informatics (GRIB), IMIM, DCEXS, Universitat Pompeu Fabra, Barcelona, Spain.
- Núria Queralt-Rosinach
- Research Programme on Biomedical Informatics (GRIB), IMIM, DCEXS, Universitat Pompeu Fabra, Barcelona, Spain.
- Michael Rautschka
- Research Programme on Biomedical Informatics (GRIB), IMIM, DCEXS, Universitat Pompeu Fabra, Barcelona, Spain.
- Laura I Furlong
- Research Programme on Biomedical Informatics (GRIB), IMIM, DCEXS, Universitat Pompeu Fabra, Barcelona, Spain.
43
Kim Y, Garvin J, Goldstein MK, Meystre SM. Classification of Contextual Use of Left Ventricular Ejection Fraction Assessments. Stud Health Technol Inform 2015; 216:599-603. [PMID: 26262121 PMCID: PMC5055832]
Abstract
Knowledge of the left ventricular ejection fraction is critical for the optimal care of patients with heart failure. When a document contains multiple ejection fraction assessments, accurate classification of their contextual use is necessary to filter out historical findings or recommendations and prioritize the assessments for selection of document level ejection fraction information. We present a natural language processing system that classifies the contextual use of both quantitative and qualitative left ventricular ejection fraction assessments in clinical narrative documents. We created support vector machine classifiers with a variety of features extracted from the target assessment, associated concepts, and document section information. The experimental results showed that our classifiers achieved good performance, reaching 95.6% F1-measure for quantitative assessments and 94.2% F1-measure for qualitative assessments in a five-fold cross-validation evaluation.
Affiliation(s)
- Youngjun Kim
- School of Computing, University of Utah, Salt Lake City, USA
- VA Health Care System, Salt Lake City, Utah, USA
- Jennifer Garvin
- Department of Biomedical Informatics, University of Utah, Salt Lake City, USA
- VA Health Care System, Salt Lake City, Utah, USA
- Mary K. Goldstein
- VA Palo Alto Health Care System, Palo Alto, CA, and Stanford University, Stanford, CA, USA
- Stéphane M. Meystre
- Department of Biomedical Informatics, University of Utah, Salt Lake City, USA
- VA Health Care System, Salt Lake City, Utah, USA
44
Afzal Z, Pons E, Kang N, Sturkenboom MCJM, Schuemie MJ, Kors JA. ContextD: an algorithm to identify contextual properties of medical terms in a Dutch clinical corpus. BMC Bioinformatics 2014; 15:373. [PMID: 25432799 PMCID: PMC4264258 DOI: 10.1186/s12859-014-0373-3]
Abstract
Background In order to extract meaningful information from electronic medical records, such as signs and symptoms, diagnoses, and treatments, it is important to take into account the contextual properties of the identified information: negation, temporality, and experiencer. Most work on automatic identification of these contextual properties has been done on English clinical text. This study presents ContextD, an adaptation of the English ConText algorithm to the Dutch language, and a Dutch clinical corpus. We created a Dutch clinical corpus containing four types of anonymized clinical documents: entries from general practitioners, specialists' letters, radiology reports, and discharge letters. Using a Dutch list of medical terms extracted from the Unified Medical Language System, we identified medical terms in the corpus with exact matching. The identified terms were annotated for the negation, temporality, and experiencer properties. To adapt the ConText algorithm, we translated English trigger terms to Dutch and added several general and document-specific enhancements, such as negation rules for general practitioners' entries and a regular-expression-based temporality module. Results The ContextD algorithm used 41 unique triggers to identify the contextual properties in the clinical corpus. For the negation property, the algorithm obtained an F-score from 87% to 93% for the different document types. For the experiencer property, the F-score was 99% to 100%. For the historical and hypothetical values of the temporality property, F-scores ranged from 26% to 54% and from 13% to 44%, respectively. Conclusions ContextD showed good performance in identifying negation and experiencer property values across all Dutch clinical document types. Accurate identification of the temporality property proved to be difficult and requires further work. The anonymized and annotated Dutch clinical corpus can serve as a useful resource for further algorithm development.
Electronic supplementary material The online version of this article (doi:10.1186/s12859-014-0373-3) contains supplementary material, which is available to authorized users.
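The three contextual properties ContextD assigns can be sketched with ConText-style trigger matching. The English trigger lists below are illustrative stand-ins for the translated Dutch lexicon, and scope handling is omitted: a trigger anywhere in the sentence sets the property.

```python
# Toy trigger lists; ContextD used 41 unique (Dutch) triggers with scope rules.
TRIGGERS = {
    "negation": {"no", "denies", "without"},
    "historical": {"history", "previous", "past"},
    "family": {"mother", "father", "family"},
}

def context_properties(sentence):
    """Assign negation, temporality, and experiencer values at sentence level."""
    words = {w.lower().strip(",.") for w in sentence.split()}
    return {
        "negated": bool(words & TRIGGERS["negation"]),
        "temporality": "historical" if words & TRIGGERS["historical"] else "recent",
        "experiencer": "other" if words & TRIGGERS["family"] else "patient",
    }

print(context_properties("Family history of diabetes in her mother"))
# -> {'negated': False, 'temporality': 'historical', 'experiencer': 'other'}
```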
Affiliation(s)
- Zubair Afzal
- Department of Medical Informatics, Erasmus Medical Center, P.O. Box 2040, Rotterdam, CA, 3000, Netherlands.
- Ewoud Pons
- Department of Medical Informatics, Erasmus Medical Center, P.O. Box 2040, Rotterdam, CA, 3000, Netherlands.
- Ning Kang
- Department of Medical Informatics, Erasmus Medical Center, P.O. Box 2040, Rotterdam, CA, 3000, Netherlands.
- Miriam C J M Sturkenboom
- Department of Medical Informatics, Erasmus Medical Center, P.O. Box 2040, Rotterdam, CA, 3000, Netherlands.
- Jan A Kors
- Department of Medical Informatics, Erasmus Medical Center, P.O. Box 2040, Rotterdam, CA, 3000, Netherlands.
45
Wu S, Miller T, Masanz J, Coarr M, Halgrim S, Carrell D, Clark C. Negation's not solved: generalizability versus optimizability in clinical natural language processing. PLoS One 2014; 9:e112774. [PMID: 25393544 PMCID: PMC4231086 DOI: 10.1371/journal.pone.0112774]
Abstract
A review of published work in clinical natural language processing (NLP) may suggest that the negation detection task has been “solved.” This work proposes that an optimizable solution does not equal a generalizable solution. We introduce a new machine learning-based Polarity Module for detecting negation in clinical text, and extensively compare its performance across domains. Using four manually annotated corpora of clinical text, we show that negation detection performance suffers when there is no in-domain development (for manual methods) or training data (for machine learning-based methods). Various factors (e.g., annotation guidelines, named entity characteristics, the amount of data, and lexical and syntactic context) play a role in making generalizability difficult, but none completely explains the phenomenon. Furthermore, generalizability remains challenging because it is unclear whether to use a single source for accurate data, combine all sources into a single model, or apply domain adaptation methods. The most reliable means to improve negation detection is to manually annotate in-domain training data (or, perhaps, manually modify rules); this is a strategy for optimizing performance, rather than generalizing it. These results suggest a direction for future work in domain-adaptive and task-adaptive methods for clinical NLP.
Affiliation(s)
- Stephen Wu
- Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America; Oregon Health and Science University, Portland, Oregon, United States of America
- Timothy Miller
- Children's Hospital Boston Informatics Program, Harvard Medical School, Boston, Massachusetts, United States of America
- James Masanz
- Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America
- Matt Coarr
- Human Language Technology Department, The MITRE Corporation, Bedford, Massachusetts, United States of America
- Scott Halgrim
- Group Health Research Institute, Seattle, Washington, United States of America
- David Carrell
- Group Health Research Institute, Seattle, Washington, United States of America
- Cheryl Clark
- Human Language Technology Department, The MITRE Corporation, Bedford, Massachusetts, United States of America
46
Velupillai S, Skeppstedt M, Kvist M, Mowery D, Chapman BE, Dalianis H, Chapman WW. Cue-based assertion classification for Swedish clinical text--developing a lexicon for pyConTextSwe. Artif Intell Med 2014; 61:137-44. [PMID: 24556644 PMCID: PMC4104142 DOI: 10.1016/j.artmed.2014.01.001]
Abstract
OBJECTIVE The ability of a cue-based system to accurately assert whether a disorder is affirmed, negated, or uncertain is dependent, in part, on its cue lexicon. In this paper, we continue our study of porting an assertion system (pyConTextNLP) from English to Swedish (pyConTextSwe) by creating an optimized assertion lexicon for clinical Swedish. METHODS AND MATERIAL We integrated cues from four external lexicons, along with generated inflections and combinations. We used subsets of a clinical corpus in Swedish. We applied four assertion classes (definite existence, probable existence, probable negated existence and definite negated existence) and two binary classes (existence yes/no and uncertainty yes/no) to pyConTextSwe. We compared pyConTextSwe's performance with and without the added cues on a development set, and improved the lexicon further after an error analysis. On a separate evaluation set, we calculated the system's final performance. RESULTS Following integration steps, we added 454 cues to pyConTextSwe. The optimized lexicon developed after an error analysis resulted in statistically significant improvements on the development set (83% F-score, overall). The system's final F-scores on an evaluation set were 81% (overall). For the individual assertion classes, F-score results were 88% (definite existence), 81% (probable existence), 55% (probable negated existence), and 63% (definite negated existence). For the binary classifications existence yes/no and uncertainty yes/no, final system performance was 97%/87% and 78%/86% F-score, respectively. CONCLUSIONS We have successfully ported pyConTextNLP to Swedish (pyConTextSwe). We have created an extensive and useful assertion lexicon for Swedish clinical text, which could form a valuable resource for similar studies, and which is publicly available.
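The four assertion classes above are effectively the cross-product of the two binary decisions also evaluated (existence yes/no, uncertainty yes/no). A minimal sketch of that mapping, with toy English cues standing in for the 454-cue Swedish lexicon:

```python
# Toy cue lists; the multiword cue "ruled out" is split into tokens for
# simplicity. The real pyConTextSwe lexicon is Swedish and far larger.
NEGATION_CUES = {"no", "denies", "ruled", "out"}
UNCERTAINTY_CUES = {"possible", "suspected", "cannot", "exclude"}

def assertion_class(words):
    """Map two binary cue decisions onto the four assertion classes."""
    negated = bool(set(words) & NEGATION_CUES)
    uncertain = bool(set(words) & UNCERTAINTY_CUES)
    if negated and uncertain:
        return "probable negated existence"
    if negated:
        return "definite negated existence"
    if uncertain:
        return "probable existence"
    return "definite existence"

print(assertion_class("suspected pneumonia".split()))  # probable existence
print(assertion_class("pneumonia ruled out".split()))  # definite negated existence
```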
Affiliation(s)
- Sumithra Velupillai
- Department of Computer and Systems Sciences (DSV), Stockholm University, Forum 100, 164 40 Kista, Sweden.
- Maria Skeppstedt
- Department of Computer and Systems Sciences (DSV), Stockholm University, Forum 100, 164 40 Kista, Sweden.
- Maria Kvist
- Department of Computer and Systems Sciences (DSV), Stockholm University, Forum 100, 164 40 Kista, Sweden; Department of Learning, Informatics, Management and Ethics (LIME), Karolinska Institutet, Widerström Building, Tomtebodavägen 18A, Solna, Sweden.
- Danielle Mowery
- Department of Biomedical Informatics, University of Pittsburgh, 5607 Baum Boulevard, BAUM 423, Pittsburgh, PA 15206-3701, United States.
- Brian E Chapman
- Department of Radiology, University of Utah, 729 Arapeen Drive, Salt Lake City, UT 84108, United States.
- Hercules Dalianis
- Department of Computer and Systems Sciences (DSV), Stockholm University, Forum 100, 164 40 Kista, Sweden.
- Wendy W Chapman
- Department of Biomedical Informatics, University of Utah, 26 South 2000 East, Room 5775 HSEB, Salt Lake City, UT 84112-5775, United States.
47
Abstract
Collections of documents annotated with semantic entities and relationships are crucial resources to support the development and evaluation of text mining solutions for the biomedical domain. Here I present an overview of 36 corpora and an analysis of the semantic annotations they contain. Annotations for entity types were classified into six semantic groups, and an overview of the semantic entities found in each corpus is given. Results show that while some semantic entities, such as genes, proteins and chemicals, are consistently annotated in many collections, corpora available for diseases, variations and mutations are still few, in spite of their importance in the biological domain.
Affiliation(s)
- Mariana Neves
- Hasso-Plattner-Institut, Potsdam Universität, Potsdam, Germany
48
Styler WF, Bethard S, Finan S, Palmer M, Pradhan S, de Groen PC, Erickson B, Miller T, Lin C, Savova G, Pustejovsky J. Temporal Annotation in the Clinical Domain. Transactions of the Association for Computational Linguistics 2014; 2:143-154. [PMID: 29082229 PMCID: PMC5657277]
Abstract
This article discusses the requirements of a formal specification for the annotation of temporal information in clinical narratives. We discuss the implementation and extension of ISO-TimeML for annotating a corpus of clinical notes, known as the THYME corpus. To reflect the information task and the heavily inference-based reasoning demands in the domain, a new annotation guideline has been developed, "the THYME Guidelines to ISO-TimeML (THYME-TimeML)". To clarify what relations merit annotation, we distinguish between linguistically-derived and inferentially-derived temporal orderings in the text. We also apply a top performing TempEval 2013 system against this new resource to measure the difficulty of adapting systems to the clinical domain. The corpus is available to the community and has been proposed for use in a SemEval 2015 task.
Affiliation(s)
- Steven Bethard
- Department of Computer and Information Sciences, University of Alabama at Birmingham
- Sean Finan
- Children's Hospital Boston Informatics Program and Harvard Medical School
- Martha Palmer
- Department of Linguistics, University of Colorado at Boulder
- Sameer Pradhan
- Children's Hospital Boston Informatics Program and Harvard Medical School
- Brad Erickson
- Mayo Clinic College of Medicine, Mayo Clinic, Rochester, MN
- Timothy Miller
- Children's Hospital Boston Informatics Program and Harvard Medical School
- Chen Lin
- Children's Hospital Boston Informatics Program and Harvard Medical School
- Guergana Savova
- Children's Hospital Boston Informatics Program and Harvard Medical School
49
Liu V, Clark MP, Mendoza M, Saket R, Gardner MN, Turk BJ, Escobar GJ. Automated identification of pneumonia in chest radiograph reports in critically ill patients. BMC Med Inform Decis Mak 2013; 13:90. [PMID: 23947340 PMCID: PMC3765332 DOI: 10.1186/1472-6947-13-90]
Abstract
BACKGROUND Prior studies demonstrate the suitability of natural language processing (NLP) for identifying pneumonia in chest radiograph (CXR) reports; however, few evaluate this approach in intensive care unit (ICU) patients. METHODS From a total of 194,615 ICU reports, we empirically developed a lexicon to categorize pneumonia-relevant terms and uncertainty profiles. We encoded lexicon items into unique queries within an NLP software application and designed an algorithm to assign automated interpretations ('positive', 'possible', or 'negative') based on each report's query profile. We evaluated algorithm performance in a sample of 2,466 CXR reports interpreted by physician consensus and in two ICU patient subgroups, including those admitted for pneumonia and for rheumatologic/endocrine diagnoses. RESULTS Most reports were deemed 'negative' (51.8%) by physician consensus. Many were 'possible' (41.7%); only 6.5% were 'positive' for pneumonia. The lexicon included 105 terms and uncertainty profiles that were encoded into 31 NLP queries. The queries identified 534,322 'hits' in the full sample, with 2.7 ± 2.6 'hits' per report. An algorithm comprising twenty rules and probability steps assigned interpretations to reports based on their query profiles. In the validation set, the algorithm had 92.7% sensitivity, 91.1% specificity, 93.3% positive predictive value, and 90.3% negative predictive value for differentiating 'negative' from 'positive'/'possible' reports. In the ICU subgroups, the algorithm also demonstrated good performance, misclassifying few reports (5.8%). CONCLUSIONS Many CXR reports in ICU patients demonstrate frank uncertainty regarding a pneumonia diagnosis. This electronic tool demonstrates promise for assigning automated interpretations to CXR reports by leveraging both terms and uncertainty profiles.
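The report-level decision step, mapping each report's query-hit profile to 'positive', 'possible', or 'negative', can be sketched as follows. The term lists and rules are invented for illustration and are far simpler than the system's 31 queries and twenty rules.

```python
# Toy term lists; the actual lexicon had 105 terms and uncertainty profiles.
POSITIVE_TERMS = {"pneumonia", "consolidation"}
UNCERTAIN_TERMS = {"possible", "cannot", "exclude", "may", "represent"}
NEGATION_TERMS = {"no", "without", "clear"}

def interpret(report):
    """Count query 'hits' per category, then apply simple profile rules."""
    words = report.lower().replace(",", " ").split()
    pos = sum(w in POSITIVE_TERMS for w in words)
    unc = sum(w in UNCERTAIN_TERMS for w in words)
    neg = sum(w in NEGATION_TERMS for w in words)
    if pos and not neg and not unc:
        return "positive"
    if pos and unc:
        return "possible"
    return "negative"

print(interpret("Focal consolidation, pneumonia likely"))  # positive
print(interpret("Opacity may represent pneumonia"))        # possible
print(interpret("No focal consolidation"))                 # negative
```

The middle case shows why the abstract stresses uncertainty: hedged radiology language ("may represent") is mapped to its own 'possible' class rather than forced into positive or negative.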
Affiliation(s)
- Vincent Liu
- Division of Research and Systems Research Initiative, Kaiser Permanente, 2000 Broadway, Webster Annex CA 94612 Oakland, Northern California
- Santa Clara Medical Center, Kaiser Permanente, Santa Clara, CA, Northern California
- Mark P Clark
- Vallejo Medical Center, Kaiser Permanente, Vallejo, CA, Northern California
- Mark Mendoza
- Santa Clara Medical Center, Kaiser Permanente, Santa Clara, CA, Northern California
- Ramin Saket
- Santa Clara Medical Center, Kaiser Permanente, Santa Clara, CA, Northern California
- Marla N Gardner
- Division of Research and Systems Research Initiative, Kaiser Permanente, 2000 Broadway, Webster Annex CA 94612 Oakland, Northern California
- Benjamin J Turk
- Division of Research and Systems Research Initiative, Kaiser Permanente, 2000 Broadway, Webster Annex CA 94612 Oakland, Northern California
- Gabriel J Escobar
- Division of Research and Systems Research Initiative, Kaiser Permanente, 2000 Broadway, Webster Annex CA 94612 Oakland, Northern California
- Walnut Creek Medical Center, Kaiser Permanente, Oakland, CA, Northern California
50
Friedman C, Rindflesch TC, Corn M. Natural language processing: state of the art and prospects for significant progress, a workshop sponsored by the National Library of Medicine. J Biomed Inform 2013; 46:765-73. [PMID: 23810857 DOI: 10.1016/j.jbi.2013.06.004]
Abstract
Natural language processing (NLP) is crucial for advancing healthcare because it is needed to transform relevant information locked in text into structured data that can be used by computer processes aimed at improving patient care and advancing medicine. In light of the importance of NLP to health, the National Library of Medicine (NLM) recently sponsored a workshop to review the state of the art in NLP focusing on text in English, both in biomedicine and in the general language domain. Specific goals of the NLM-sponsored workshop were to identify the current state of the art, grand challenges and specific roadblocks, and to identify effective use and best practices. This paper reports on the main outcomes of the workshop, including an overview of the state of the art, strategies for advancing the field, and obstacles that need to be addressed, resulting in recommendations for a research agenda intended to advance the field.
Affiliation(s)
- Carol Friedman
- Department of Biomedical Informatics, Columbia University, United States.