1
Perez N, Cuadros M, Rigau G. Negation and speculation processing: A study on cue-scope labelling and assertion classification in Spanish clinical text. Artif Intell Med 2023; 145:102682. [PMID: 37925211] [DOI: 10.1016/j.artmed.2023.102682] [Received: 06/10/2022] [Revised: 08/25/2023] [Accepted: 10/06/2023]
Abstract
Natural Language Processing (NLP) based on new deep learning technology is contributing to the emergence of powerful solutions that help healthcare providers and researchers discover valuable patterns within vast volumes of health records and scientific literature. Fundamental to the success of such solutions is the processing of negation and speculation. The article addresses this problem with state-of-the-art deep learning approaches from two perspectives: cue and scope labelling, and assertion classification. In light of the real difficulty of accessing annotated clinical data, the study (a) proposes a methodology to automatically convert cue-scope annotations to assertion annotations; and (b) includes a range of scenarios with varying amounts of training data and adversarial test examples. The results expose the clear advantage of Transformer-based models in this regard, which outperform a series of baselines and related work on NUBes, a public corpus of Spanish clinical text.
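The abstract does not spell out the cue-scope-to-assertion conversion; as a rough illustration of the idea only (the function, labels, and coverage rule below are hypothetical assumptions, not the paper's actual methodology), an entity's assertion label can be derived from whichever annotated scopes cover it:

```python
def scopes_to_assertion(entity_span, negation_scopes, speculation_scopes):
    """Derive an assertion label for an entity from cue-scope annotations.

    entity_span and each scope are (start, end) character offsets.
    A toy reading: an entity fully inside a negation scope is NEGATED,
    inside a speculation scope SPECULATED, otherwise AFFIRMED.
    """
    def inside(span, scope):
        return scope[0] <= span[0] and span[1] <= scope[1]

    if any(inside(entity_span, s) for s in negation_scopes):
        return "NEGATED"
    if any(inside(entity_span, s) for s in speculation_scopes):
        return "SPECULATED"
    return "AFFIRMED"

# Example: in "no evidence of pneumonia", 'pneumonia' (offsets 15-24)
# falls inside a negation scope covering the whole phrase.
print(scopes_to_assertion((15, 24), [(0, 24)], []))  # NEGATED
```

Real guidelines must also handle overlapping scopes and cues that are themselves part of the entity, which this sketch ignores.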
Affiliation(s)
- Naiara Perez
- SNLT group at Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Mikeletegi Pasealekua 57, Donostia/San Sebastián, 20009, Spain
- HiTZ Basque Center for Language Technologies, University of the Basque Country (UPV-EHU), Manuel Lardizabal Ibilbidea 1, Donostia/San Sebastián, 20018, Spain
- Montse Cuadros
- SNLT group at Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Mikeletegi Pasealekua 57, Donostia/San Sebastián, 20009, Spain
- German Rigau
- HiTZ Basque Center for Language Technologies, University of the Basque Country (UPV-EHU), Manuel Lardizabal Ibilbidea 1, Donostia/San Sebastián, 20018, Spain
2
Argüello-González G, Aquino-Esperanza J, Salvador D, Bretón-Romero R, Del Río-Bermudez C, Tello J, Menke S. Negation recognition in clinical natural language processing using a combination of the NegEx algorithm and a convolutional neural network. BMC Med Inform Decis Mak 2023; 23:216. [PMID: 37833661] [PMCID: PMC10576331] [DOI: 10.1186/s12911-023-02301-5] [Received: 03/29/2023] [Accepted: 09/18/2023]
Abstract
BACKGROUND Important clinical information about patients is present in unstructured free-text fields of Electronic Health Records (EHRs). While this information can be extracted using clinical Natural Language Processing (cNLP), the recognition of negation modifiers represents an important challenge. A wide range of cNLP applications have been developed to detect the negation of medical entities in clinical free text; however, effective solutions for languages other than English are scarce. This study aimed to develop a solution for negation recognition in Spanish EHRs based on a combination of a customized rule-based NegEx layer and a convolutional neural network (CNN). METHODS Based on our previous experience in real-world evidence (RWE) studies using information embedded in EHRs, negation recognition was simplified into a binary problem ('affirmative' vs. 'non-affirmative' class). For the NegEx layer, negation rules were obtained from a publicly available Spanish corpus and enriched with custom ones, while the CNN binary classifier was trained on EHRs annotated for clinical named entities (cNEs) and negation markers by medical doctors. RESULTS The proposed negation recognition pipeline obtained precision, recall, and F1-score of 0.93, 0.94, and 0.94 for the 'affirmative' class, and 0.86, 0.84, and 0.85 for the 'non-affirmative' class, respectively. To validate the generalization capabilities of our methodology, we applied the negation recognition pipeline to EHRs (6,710 cNEs) from a different data distribution than the training corpus and obtained consistent performance metrics for the 'affirmative' and 'non-affirmative' classes (0.95, 0.97, and 0.96; and 0.90, 0.83, and 0.86 for precision, recall, and F1-score, respectively). Lastly, we evaluated the pipeline against two publicly available Spanish negation corpora, IULA and NUBes, obtaining state-of-the-art metrics (1.00, 0.99, and 0.99; and 1.00, 0.93, and 0.96 for precision, recall, and F1-score, respectively). CONCLUSION Negation is a source of low precision in the retrieval of cNEs from EHRs' free text. Combining a customized rule-based NegEx layer with a CNN binary classifier outperformed many other current approaches. RWE studies benefit greatly from the correct recognition of negation, as it reduces false-positive detections of cNEs which would otherwise undermine the credibility of cNLP systems.
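As a rough sketch of what a rule-based NegEx-style layer for the binary 'affirmative' vs. 'non-affirmative' decision might look like (the trigger list, window size, and function names below are illustrative assumptions, not the study's curated rules):

```python
import re

# Toy NegEx-style layer: a clinical named entity is 'non-affirmative' when a
# Spanish negation trigger appears in a small character window before it.
# The triggers and the 40-character window are illustrative only.
TRIGGERS = [r"\bno\b", r"\bsin\b", r"\bniega\b", r"\bausencia de\b"]

def classify(text, entity):
    """Return 'affirmative' or 'non-affirmative' for an entity mention."""
    start = text.lower().find(entity.lower())
    if start == -1:
        raise ValueError("entity not found in text")
    window = text.lower()[max(0, start - 40):start]
    if any(re.search(t, window) for t in TRIGGERS):
        return "non-affirmative"
    return "affirmative"

print(classify("El paciente niega dolor torácico", "dolor torácico"))  # non-affirmative
print(classify("Presenta fiebre y tos", "fiebre"))                     # affirmative
```

The study layers a CNN classifier on top of such rules precisely because fixed windows and trigger lists miss pseudo-negations and long-distance scopes.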
Affiliation(s)
- Guillermo Argüello-González
- MedSavana SL, Madrid, 28004, Spain
- Statistics and Operations Research, University of Oviedo, Oviedo, 33003, Spain
- José Aquino-Esperanza
- MedSavana SL, Madrid, 28004, Spain
- Faculty of Medicine and Health Sciences, University of Barcelona, Barcelona, 08007, Spain
3
Scaboro S, Portelli B, Chersoni E, Santus E, Serra G. Increasing adverse drug events extraction robustness on social media: Case study on negation and speculation. Exp Biol Med (Maywood) 2022; 247:2003-2014. [PMID: 36314865] [PMCID: PMC9791307] [DOI: 10.1177/15353702221128577]
Abstract
In the last decade, an increasing number of users have started reporting adverse drug events (ADEs) on social media platforms, blogs, and health forums. Given the large volume of reports, pharmacovigilance has focused on ways to use natural language processing (NLP) techniques to rapidly examine these large collections of text, detecting mentions of drug-related adverse reactions to trigger medical investigations. However, despite the growing interest in the task and the advances in NLP, the robustness of these models in the face of linguistic phenomena such as negation and speculation is an open research question. Negation and speculation are pervasive phenomena in natural language and can severely hamper the ability of an automated system to discriminate between factual and non-factual statements in text. In this article, we consider four state-of-the-art systems for ADE detection on social media texts. We introduce SNAX, a benchmark to test their performance against samples containing negated and speculated ADEs, showing their fragility against these phenomena. We then introduce two possible strategies to increase the robustness of these models, showing that both bring significant increases in performance, lowering the number of spurious entities predicted by the models by 60% for negation and 80% for speculation.
Affiliation(s)
- Simone Scaboro
- Department of Mathematics, Computer Science and Physics, University of Udine, Udine 33100, Italy
- Beatrice Portelli
- Department of Mathematics, Computer Science and Physics, University of Udine, Udine 33100, Italy
- Università degli Studi di Napoli Federico II, Napoli 80138, Italy
- Emmanuele Chersoni
- Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hung Hom 999077, Hong Kong
- Enrico Santus
- Decision Science and Advanced Analytics for MAPV & RA, Bayer Pharmaceuticals, Whippany, NJ 07981-1544, USA
- Giuseppe Serra
- Department of Mathematics, Computer Science and Physics, University of Udine, Udine 33100, Italy
4
Shinohara E, Shibata D, Kawazoe Y. Development of comprehensive annotation criteria for patients' states from clinical texts. J Biomed Inform 2022; 134:104200. [PMID: 36089198] [DOI: 10.1016/j.jbi.2022.104200] [Received: 03/08/2022] [Revised: 08/17/2022] [Accepted: 09/04/2022]
Abstract
In clinical records, much clinical information is recorded as free text, necessitating the use of advanced automatic information extraction technology. The development of practical technologies requires a corpus with finer-granularity annotations describing the information it contains, but such annotation criteria have not yet been sufficiently researched. This study aimed to develop fine-grained annotation criteria that exhaustively cover patients' states in case reports. We collected 362 Japanese-language case reports of intractable diseases that were expected to contain a broad range of patients' states. Criteria were developed by repeatedly revising and annotating the clinical case report text. A set of annotation criteria for patients' states, consisting of 46 entity types, 9 attributes, and 36 relations, was obtained. It allows more detailed information to be expressed than in previous studies through a broader range of concept types, including treatment, and captures clinical information based on a combination of causality and judgment, which could not be expressed before.
Affiliation(s)
- Emiko Shinohara
- Artificial Intelligence in Healthcare, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Daisaku Shibata
- Artificial Intelligence in Healthcare, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Yoshimasa Kawazoe
- Artificial Intelligence in Healthcare, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
5
Fang Y, Idnay B, Sun Y, Liu H, Chen Z, Marder K, Xu H, Schnall R, Weng C. Combining human and machine intelligence for clinical trial eligibility querying. J Am Med Inform Assoc 2022; 29:1161-1171. [PMID: 35426943] [PMCID: PMC9196697] [DOI: 10.1093/jamia/ocac051] [Received: 02/25/2022] [Accepted: 03/29/2022]
Abstract
OBJECTIVE To combine machine efficiency and human intelligence for converting complex clinical trial eligibility criteria text into cohort queries. MATERIALS AND METHODS Criteria2Query (C2Q) 2.0 was developed to enable real-time user intervention for criteria selection and simplification, parsing error correction, and concept mapping. The accuracy, precision, recall, and F1 score of enhanced modules for negation scope detection, temporal normalization, and value normalization were evaluated using a previously curated gold standard, the annotated eligibility criteria of 1010 COVID-19 clinical trials. Usability and usefulness were evaluated by 10 research coordinators in a task-oriented usability evaluation using 5 Alzheimer's disease trials. Data were collected by user interaction logging, a demographic questionnaire, the Health Information Technology Usability Evaluation Scale (Health-ITUES), and a feature-specific questionnaire. RESULTS The accuracies of negation scope detection, temporal normalization, and value normalization were 0.924, 0.916, and 0.966, respectively. C2Q 2.0 achieved a moderate usability score (3.84 out of 5) and a high learnability score (4.54 out of 5). On average, 9.9 modifications were made per clinical study. Experienced researchers made more modifications than novice researchers. The most frequent modification was deletion (5.35 per study). Furthermore, the evaluators favored cohort queries resulting from modifications (score 4.1 out of 5) and the user engagement features (score 4.3 out of 5). DISCUSSION AND CONCLUSION Features that engage domain experts and overcome the limitations of automated machine output are shown to be useful and user-friendly. We conclude that human-computer collaboration is key to improving the adoption and user-friendliness of natural language processing.
Affiliation(s)
- Yilu Fang
- Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Betina Idnay
- School of Nursing, Columbia University, New York, New York, USA
- Department of Neurology, Columbia University, New York, New York, USA
- Yingcheng Sun
- Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Hao Liu
- Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Zhehuan Chen
- Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Karen Marder
- Department of Neurology, Columbia University, New York, New York, USA
- Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Rebecca Schnall
- School of Nursing, Columbia University, New York, New York, USA
- Heilbrunn Department of Population and Family Health, Mailman School of Public Health, Columbia University, New York, New York, USA
- Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, New York, USA
6
Negation and Speculation in NLP: A Survey, Corpora, Methods, and Applications. Applied Sciences (Basel) 2022. [DOI: 10.3390/app12105209]
Abstract
Negation and speculation are universal linguistic phenomena that affect the performance of Natural Language Processing (NLP) applications, such as those for opinion mining and information retrieval, especially on biomedical data. In this article, we review the corpora annotated with negation and speculation in various natural languages and domains. Furthermore, we discuss ongoing research into recent rule-based, supervised, and transfer learning techniques for the detection of negated and speculative content. Many English corpora for various domains are now annotated with negation and speculation; moreover, the availability of annotated corpora in other languages has started to increase. However, this growth is insufficient to address these important phenomena in languages with limited resources. Cross-lingual models and translation from well-resourced languages are acceptable alternatives. We also highlight the lack of consistent annotation guidelines and the shortcomings of existing techniques, and suggest alternatives that may speed up progress in this research direction. Adding more syntactic features may alleviate the limitations of existing techniques, such as cue ambiguity and detecting discontinuous scopes. In some NLP applications, the inclusion of a negation- and speculation-aware system improves performance, yet this aspect is still not addressed or is not considered an essential step.
7
Solarte Pabón O, Montenegro O, Torrente M, Rodríguez González A, Provencio M, Menasalvas E. Negation and uncertainty detection in clinical texts written in Spanish: a deep learning-based approach. PeerJ Comput Sci 2022; 8:e913. [PMID: 35494817] [PMCID: PMC9044225] [DOI: 10.7717/peerj-cs.913] [Received: 10/26/2021] [Accepted: 02/10/2022]
Abstract
Detecting negation and uncertainty is crucial for medical text mining applications; otherwise, extracted information can be incorrectly identified as real or factual events. Although several approaches have been proposed to detect negation and uncertainty in clinical texts, most efforts have focused on the English language. Most proposals developed for Spanish have focused mainly on negation detection and do not deal with uncertainty. In this paper, we propose a deep learning-based approach for both negation and uncertainty detection in clinical texts written in Spanish. The proposed approach explores two deep learning methods to achieve this goal: (i) Bidirectional Long Short-Term Memory with a Conditional Random Field layer (BiLSTM-CRF) and (ii) Bidirectional Encoder Representations from Transformers (BERT). The approach was evaluated using NUBES and IULA, two public corpora for the Spanish language. The results obtained showed an F-score of 92% and 80% in the scope recognition task for negation and uncertainty, respectively. We also present the results of a validation process conducted using a real-life annotated dataset of clinical notes belonging to cancer patients. The proposed approach shows the feasibility of deep learning-based methods for detecting negation and uncertainty in Spanish clinical texts. Experiments also highlighted that this approach improves performance in the scope recognition task compared to other proposals in the biomedical domain.
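Both model families are typically trained as token-level sequence labellers; a small sketch of the post-processing step that turns BIO-style scope labels into spans (the label set and helper below are assumptions for illustration, not the paper's code):

```python
def decode_scopes(tokens, labels):
    """Turn token-level BIO labels (B-SCOPE / I-SCOPE / O) into scope spans.

    Returns token-index (start, end_exclusive) pairs. The BIO scheme is a
    common choice for scope recognition; the papers' exact label sets may differ.
    """
    spans, start = [], None
    for i, label in enumerate(labels):
        if label == "B-SCOPE":
            if start is not None:       # close a scope that is still open
                spans.append((start, i))
            start = i
        elif label == "O":
            if start is not None:
                spans.append((start, i))
                start = None
        # I-SCOPE simply continues the current scope
    if start is not None:
        spans.append((start, len(labels)))
    return spans

tokens = ["sin", "signos", "de", "infección", "aguda", "."]
labels = ["O", "B-SCOPE", "I-SCOPE", "I-SCOPE", "I-SCOPE", "O"]
print(decode_scopes(tokens, labels))  # [(1, 5)]
```

The CRF layer in BiLSTM-CRF exists largely to keep such label sequences well-formed (e.g., no I-SCOPE directly after O), which plain per-token classifiers cannot guarantee.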
Affiliation(s)
- Oswaldo Solarte Pabón
- Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Madrid, Spain
- Escuela de Ingeniería de Sistemas y Computación, Universidad del Valle, Cali, Colombia
- Orlando Montenegro
- Escuela de Ingeniería de Sistemas y Computación, Universidad del Valle, Cali, Colombia
- Ernestina Menasalvas
- Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Madrid, Spain
8
Pezanowski S, Mitra P, MacEachren AM. Exploring Descriptions of Movement Through Geovisual Analytics. KN - Journal of Cartography and Geographic Information 2022; 72:5-27. [PMID: 35229072] [PMCID: PMC8866112] [DOI: 10.1007/s42489-022-00098-3] [Received: 12/15/2021] [Accepted: 01/31/2022]
Abstract
Sensemaking using automatically extracted information from text is a challenging problem. In this paper, we address a specific type of information extraction, namely extracting information related to descriptions of movement. Aggregating and understanding information related to descriptions of movement and lack of movement specified in text can lead to improved understanding and sensemaking of movement phenomena of various types, e.g., migration of people and animals, impediments to travel due to COVID-19, etc. We present GeoMovement, a system based on combining machine learning and rule-based extraction of movement-related information with state-of-the-art visualization techniques. Along with the depiction of movement, our tool can extract and present a lack of movement. Very little prior work exists on automatically extracting descriptions of movement, especially negated movement. Apart from addressing these, GeoMovement also provides a novel integrated framework for combining these extraction modules with visualization. We include two systematic case studies of GeoMovement that show how humans can derive meaningful geographic movement information. GeoMovement can complement precise movement data, e.g., obtained using sensors, or be used by itself when precise data are unavailable.
Affiliation(s)
- Scott Pezanowski
- Information Sciences and Technology, The Pennsylvania State University, Westgate Building, University Park, PA 16802 USA
- Prasenjit Mitra
- Information Sciences and Technology, The Pennsylvania State University, Westgate Building, University Park, PA 16802 USA
- Alan M. MacEachren
- Information Sciences and Technology, The Pennsylvania State University, Westgate Building, University Park, PA 16802 USA
- Department of Geography, The Pennsylvania State University, Walker Building, University Park, PA 16802 USA
9
Boguslav MR, Salem NM, White EK, Leach SM, Hunter LE. Identifying and classifying goals for scientific knowledge. Bioinformatics Advances 2021; 1:vbab012. [PMID: 34661112] [PMCID: PMC8508177] [DOI: 10.1093/bioadv/vbab012] [Received: 05/07/2021] [Revised: 06/17/2021]
Abstract
MOTIVATION Science progresses by posing good questions, yet work in biomedical text mining has not focused much on them. We propose a novel idea for biomedical natural language processing: identifying and characterizing the questions stated in the biomedical literature. Formally, the task is to identify and characterize statements of ignorance, statements where scientific knowledge is missing or incomplete. The creation of such technology could have many significant impacts, from the training of PhD students to ranking publications and prioritizing funding based on particular questions of interest. The work presented here is intended as the first step towards these goals. RESULTS We present a novel ignorance taxonomy driven by the role statements of ignorance play in research, identifying specific goals for future scientific knowledge. Using this taxonomy and reliable annotation guidelines (inter-annotator agreement above 80%), we created a gold-standard ignorance corpus of 60 full-text documents from the prenatal nutrition literature with over 10,000 annotations and used it to train classifiers that achieved F1 scores above 0.80. AVAILABILITY AND IMPLEMENTATION Corpus and source code freely available for download at https://github.com/UCDenver-ccp/Ignorance-Question-Work. The source code is implemented in Python.
Affiliation(s)
- Mayla R Boguslav
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA (corresponding author)
- Nourah M Salem
- Health Informatics Program, College of Health Solutions at Arizona State University, Phoenix, AZ 85004, USA
- Elizabeth K White
- Center for Genes, Environment and Health, National Jewish Health, Denver, CO 80206, USA
- Sonia M Leach
- Center for Genes, Environment and Health, National Jewish Health, Denver, CO 80206, USA
- Lawrence E Hunter
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
10
Hobbs ET, Goralski SM, Mitchell A, Simpson A, Leka D, Kotey E, Sekira M, Munro JB, Nadendla S, Jackson R, Gonzalez-Aguirre A, Krallinger M, Giglio M, Erill I. ECO-CollecTF: A Corpus of Annotated Evidence-Based Assertions in Biomedical Manuscripts. Front Res Metr Anal 2021; 6:674205. [PMID: 34327299] [PMCID: PMC8313968] [DOI: 10.3389/frma.2021.674205] [Received: 02/28/2021] [Accepted: 06/28/2021]
Abstract
Analysis of high-throughput experiments in the life sciences frequently relies upon standardized information about genes, gene products, and other biological entities. To provide this information, expert curators are increasingly relying on text mining tools to identify, extract and harmonize statements from biomedical journal articles that discuss findings of interest. For determining reliability of the statements, curators need the evidence used by the authors to support their assertions. It is important to annotate the evidence directly used by authors to qualify their findings rather than simply annotating mentions of experimental methods without the context of what findings they support. Text mining tools require tuning and adaptation to achieve accurate performance. Many annotated corpora exist to enable developing and tuning text mining tools; however, none currently provides annotations of evidence based on the extensive and widely used Evidence and Conclusion Ontology. We present the ECO-CollecTF corpus, a novel, freely available, biomedical corpus of 84 documents that captures high-quality, evidence-based statements annotated with the Evidence and Conclusion Ontology.
Affiliation(s)
- Elizabeth T Hobbs
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Stephen M Goralski
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Ashley Mitchell
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Andrew Simpson
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Dorjan Leka
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Emmanuel Kotey
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Matt Sekira
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- James B Munro
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, United States
- Suvarna Nadendla
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, United States
- Rebecca Jackson
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, United States
- Martin Krallinger
- Barcelona Supercomputing Center (BSC), Barcelona, Spain
- Centro Nacional de Investigaciones Oncológicas (CNIO), Madrid, Spain
- Michelle Giglio
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, United States
- Ivan Erill
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
11
Sahoo HS, Silverman GM, Ingraham NE, Lupei MI, Puskarich MA, Finzel RL, Sartori J, Zhang R, Knoll BC, Liu S, Liu H, Melton GB, Tignanelli CJ, Pakhomov SVS. A fast, resource efficient, and reliable rule-based system for COVID-19 symptom identification. JAMIA Open 2021; 4:ooab070. [PMID: 34423261] [PMCID: PMC8374371] [DOI: 10.1093/jamiaopen/ooab070] [Received: 03/16/2021] [Revised: 07/16/2021] [Accepted: 08/05/2021]
Abstract
OBJECTIVE With COVID-19, there was a need for a rapidly scalable annotation system that facilitated real-time integration with clinical decision support systems (CDS). Current annotation systems suffer from high resource utilization and poor scalability, limiting real-world integration with CDS. A potential solution to mitigate these issues is the rule-based gazetteer developed at our institution. MATERIALS AND METHODS Performance, resource utilization, and runtime of the rule-based gazetteer were compared with five annotation systems: BioMedICUS, cTAKES, MetaMap, CLAMP, and MedTagger. RESULTS This rule-based gazetteer was the fastest, had a low resource footprint, and had similar performance for weighted microaverage and macroaverage measures of precision, recall, and F1-score compared to the other annotation systems. DISCUSSION Opportunities to increase its performance include fine-tuning lexical rules for symptom identification. Additionally, it could run on multiple compute nodes for faster runtime. CONCLUSION This rule-based gazetteer overcame key technical limitations, facilitating real-time symptomatology identification for COVID-19 and integration of unstructured data elements into our CDS. It is ideal for large-scale deployment across a wide variety of healthcare settings for surveillance of acute COVID-19 symptoms for integration into prognostic modeling. Such a system is currently being leveraged for monitoring of post-acute sequelae of COVID-19 (PASC) progression in COVID-19 survivors. This study conducted the first in-depth analysis of its kind and developed a rule-based gazetteer for COVID-19 symptom extraction with the following key features: low processor and memory utilization, faster runtime, and weighted microaverage and macroaverage measures of precision, recall, and F1-score similar to industry-standard annotation systems.
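The gazetteer itself is institution-specific, but at its core a rule-based gazetteer reduces to longest-match dictionary lookup over a token stream, which explains the low resource footprint relative to full NLP pipelines. A hedged sketch (terms, mappings, and names below are made up for illustration, not the study's lexicon):

```python
# Toy gazetteer: multi-word symptom terms mapped to canonical names,
# matched longest-first over lowercased tokens.
GAZETTEER = {
    ("shortness", "of", "breath"): "dyspnea",
    ("fever",): "fever",
    ("dry", "cough"): "cough",
}
MAX_LEN = max(len(k) for k in GAZETTEER)

def find_symptoms(tokens):
    """Return (start, end_exclusive, canonical_name) hits over token indices."""
    tokens = [t.lower() for t in tokens]
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in GAZETTEER:
                hits.append((i, i + n, GAZETTEER[key]))
                i += n   # skip past the matched phrase
                break
        else:
            i += 1
    return hits

print(find_symptoms("Patient reports fever and shortness of breath".split()))
# [(2, 3, 'fever'), (4, 7, 'dyspnea')]
```

A production system would add negation handling and lexical variants on top of the raw lookup.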
Affiliation(s)
- Himanshu S Sahoo
- Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, Minnesota, USA
- Department of Surgery, University of Minnesota, Minneapolis, Minnesota, USA
- Greg M Silverman
- Department of Surgery, University of Minnesota, Minneapolis, Minnesota, USA
- Nicholas E Ingraham
- Pulmonary Disease and Critical Care Medicine, University of Minnesota, Minneapolis, Minnesota, USA
- Monica I Lupei
- Department of Anesthesiology, University of Minnesota, Minneapolis, Minnesota, USA
- Michael A Puskarich
- Department of Emergency Medicine, University of Minnesota, Minneapolis, Minnesota, USA
- Raymond L Finzel
- Department of Pharmaceutical Care and Health Systems, University of Minnesota, Minneapolis, Minnesota, USA
- John Sartori
- Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, Minnesota, USA
- Rui Zhang
- Department of Pharmaceutical Care and Health Systems, University of Minnesota, Minneapolis, Minnesota, USA
- Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
- Benjamin C Knoll
- Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
- Sijia Liu
- Department of Health Science Research, Mayo Clinic, Rochester, Minnesota, USA
- Hongfang Liu
- Department of Health Science Research, Mayo Clinic, Rochester, Minnesota, USA
- Genevieve B Melton
- Department of Surgery, University of Minnesota, Minneapolis, Minnesota, USA
- Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
- Serguei V S Pakhomov
- Department of Pharmaceutical Care and Health Systems, University of Minnesota, Minneapolis, Minnesota, USA
12
French FastContext: A publicly accessible system for detecting negation, temporality and experiencer in French clinical notes. J Biomed Inform 2021; 117:103733. [PMID: 33737205] [DOI: 10.1016/j.jbi.2021.103733] [Received: 07/13/2020] [Revised: 12/30/2020] [Accepted: 03/01/2021]
Abstract
The context of medical conditions is an important feature to consider when processing clinical narratives. NegEx and its extension ConText have become the best-known rule-based systems for determining whether a medical condition is negated, historical, or experienced by someone other than the patient in English clinical text. In this paper, we present a French adaptation and enrichment of FastContext, the most recent, n-trie engine-based implementation of the ConText algorithm. We compiled an extensive list of French lexical cues by automatic and manual translation and enrichment. To evaluate French FastContext, we manually annotated the context of medical conditions present in two types of clinical narratives: (i) death certificates and (ii) electronic health records. Results show good performance across different context values on both types of clinical notes (on average 0.93 and 0.86 F1, respectively). Furthermore, French FastContext outperforms previously reported French systems for negation detection when compared on the same datasets, and it is the first implementation of contextual temporality and experiencer identification reported for French. Finally, French FastContext has been implemented within the SIFR Annotator, a publicly accessible Web service to annotate French biomedical text data (http://bioportal.lirmm.fr/annotator). To our knowledge, this is the first implementation of a Web-based ConText-like system in a publicly accessible platform, allowing non-natural-language-processing experts to both annotate and contextualize medical conditions in clinical notes.
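For illustration only, a ConText-style pass can be caricatured as trigger words that assign contextual properties to concepts falling within a token window (the triggers, windows, and property names below are toy assumptions; FastContext's actual rule files, directional termination cues, and n-trie matching are far richer):

```python
# Toy ConText-style rules: (trigger word, property it sets, forward window in tokens).
RULES = [
    ("no", "negated", 5),
    ("denies", "negated", 5),
    ("history", "historical", 6),
    ("father", "other_experiencer", 6),
]

def contextualize(tokens, concept_index):
    """Return the contextual properties of the concept at token concept_index."""
    props = {"negated": False, "historical": False, "other_experiencer": False}
    low = [t.lower() for t in tokens]
    for i, tok in enumerate(low):
        for trigger, prop, window in RULES:
            if tok == trigger and i < concept_index <= i + window:
                props[prop] = True
    return props

tokens = "Father has a history of diabetes".split()
print(contextualize(tokens, 5))  # diabetes: historical and other_experiencer
```

The French adaptation's hard part is not this control flow but compiling and validating the cue lexicon, which is why the paper's contribution centers on the translated and enriched rule set.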
|
13
|
Integrating Speculation Detection and Deep Learning to Extract Lung Cancer Diagnosis from Clinical Notes. Appl Sci (Basel) 2021. [DOI: 10.3390/app11020865] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
Abstract
Despite efforts to develop models for extracting medical concepts from clinical notes, some challenges remain, in particular relating concepts to dates. The high number of clinical notes written for each patient, together with the use of negation, speculation, and different date formats, causes ambiguity that has to be resolved to reconstruct the patient’s natural history. In this paper, we concentrate on extracting the cancer diagnosis from clinical narratives and relating it to the diagnosis date. To address this challenge, a hybrid approach that combines deep learning-based and rule-based methods is proposed. The approach integrates three steps: (i) lung cancer named entity recognition, (ii) negation and speculation detection, and (iii) relating the cancer diagnosis to a valid date. In particular, we apply the proposed approach to extract the lung cancer diagnosis and its diagnosis date from clinical narratives written in Spanish. The results show an F-score of 90% in the named entity recognition task and an 89% F-score in the task of relating the cancer diagnosis to the diagnosis date. Our findings suggest that speculation detection, together with negation detection, is a key component for properly extracting cancer diagnoses from clinical notes.
|
14
|
Rivera Zavala R, Martinez P. The Impact of Pretrained Language Models on Negation and Speculation Detection in Cross-Lingual Medical Text: Comparative Study. JMIR Med Inform 2020; 8:e18953. [PMID: 33270027 PMCID: PMC7746498 DOI: 10.2196/18953] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2020] [Revised: 08/25/2020] [Accepted: 10/28/2020] [Indexed: 11/13/2022] Open
Abstract
Background Negation and speculation are critical elements in natural language processing (NLP)-related tasks, such as information extraction, as these phenomena change the truth value of a proposition. In informal clinical narrative, these linguistic phenomena are used extensively to indicate hypotheses, impressions, or negative findings. Previous state-of-the-art approaches addressed negation and speculation detection using rule-based methods, but in the last few years, models based on machine learning and deep learning exploiting morphological, syntactic, and semantic features represented as sparse and dense vectors have emerged. However, although such methods of named entity recognition (NER) employ a broad set of features, they are limited to existing pretrained models for a specific domain or language. Objective As a fundamental subsystem of any information extraction pipeline, a system for cross-lingual and domain-independent negation and speculation detection was introduced, with special focus on the biomedical scientific literature and clinical narrative. In this work, detection of negation and speculation was treated as a sequence-labeling task in which cues and the scopes of both phenomena are recognized as a sequence of nested labels in a single step. Methods We proposed the following two approaches for negation and speculation detection: (1) a bidirectional long short-term memory (Bi-LSTM) and conditional random field model using character, word, and sense embeddings to deal with the extraction of semantic, syntactic, and contextual patterns and (2) bidirectional encoder representations from transformers (BERT) with fine-tuning for NER.
Results The approach was evaluated for English and Spanish on biomedical and review text, particularly the BioScope corpus, the IULA corpus, and the SFU Spanish Review corpus, with F-measures of 86.6%, 85.0%, and 88.1%, respectively, for NeuroNER and 86.4%, 80.8%, and 91.7%, respectively, for BERT. Conclusions These results show that these architectures perform considerably better than previous rule-based and conventional machine learning-based systems. Moreover, our analysis shows that pretrained word embeddings, and particularly contextualized embeddings for biomedical corpora, help to handle the complexities inherent in biomedical text.
Affiliation(s)
- Renzo Rivera Zavala
- Department of Computer Science and Engineering, Carlos III University of Madrid, Madrid, Spain
- Department of Computer Science and Engineering, Universidad Católica de Santa Maria, Arequipa, Peru
| | - Paloma Martinez
- Department of Computer Science and Engineering, Carlos III University of Madrid, Madrid, Spain
| |
|
15
|
Grljević O, Bošnjak Z, Kovačević A. Opinion mining in higher education: a corpus-based approach. Enterp Inf Syst 2020. [DOI: 10.1080/17517575.2020.1773542] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
Affiliation(s)
- Olivera Grljević
- Faculty of Economics in Subotica, University of Novi Sad, Subotica, Serbia
| | - Zita Bošnjak
- Faculty of Economics in Subotica, University of Novi Sad, Subotica, Serbia
| | | |
|
16
|
Omero P, Valotto M, Bellana R, Bongelli R, Riccioni I, Zuczkowski A, Tasso C. Writer’s uncertainty identification in scientific biomedical articles: a tool for automatic if-clause tagging. Lang Resour Eval 2020. [DOI: 10.1007/s10579-020-09491-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/13/2023]
Abstract
In a previous study, we manually identified seven categories (verbs, non-verbs, modal verbs in the simple present, modal verbs in the conditional mood, if, uncertain questions, and epistemic future) of Uncertainty Markers (UMs) in a corpus of 80 articles from the British Medical Journal randomly sampled from a 167-year period (1840–2007). The UMs, detected on the basis of an epistemic stance approach, were those referring only to the authors of the articles and only in the present. We also performed preliminary experiments to assess the manually annotated corpus and to establish a baseline for the automatic detection of UMs. The results showed that most UMs could be recognized with good accuracy, except for the if-category, which includes four subcategories: if-clauses in a narrow sense; if-less clauses; as if/as though; and if and whether introducing embedded questions. The unsatisfactory results for the if-category were probably due both to its complexity and to the inadequacy of the detection rules, which were only lexical, not grammatical. In the current article, we describe a different approach, which combines grammatical and syntactic rules. The experiments performed show that the identification of uncertainty in the if-category has improved markedly, roughly doubling our previous results. The complex overall process of uncertainty detection can greatly profit from a hybrid approach that combines supervised machine learning techniques with a knowledge-based component: a rule-based inference engine devoted to the if-clause case and designed on the basis of the above-mentioned epistemic stance approach.
|
17
|
Prieto M, Deus H, de Waard A, Schultes E, García-Jiménez B, Wilkinson MD. Data-driven classification of the certainty of scholarly assertions. PeerJ 2020; 8:e8871. [PMID: 32341891 PMCID: PMC7182025 DOI: 10.7717/peerj.8871] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/08/2019] [Accepted: 03/09/2020] [Indexed: 01/02/2023] Open
Abstract
The grammatical structures scholars use to express their assertions are intended to convey various degrees of certainty or speculation. Prior studies have suggested a variety of categorization systems for scholarly certainty; however, these have not been objectively tested for their validity, particularly with respect to representing the interpretation by the reader, rather than the intention of the author. In this study, we use a series of questionnaires to determine how researchers classify various scholarly assertions, using three distinct certainty classification systems. We find that there are three distinct categories of certainty along a spectrum from high to low. We show that these categories can be detected in an automated manner, using a machine learning model, with a cross-validation accuracy of 89.2% relative to an author-annotated corpus, and 82.2% accuracy against a publicly annotated corpus. This finding provides an opportunity for contextual metadata related to certainty to be captured as part of text-mining pipelines, which currently miss these subtle linguistic cues. We provide an exemplar machine-accessible representation, a Nanopublication, in which the certainty category is embedded as metadata in a formal, ontology-based manner within text-mined scholarly assertions.
Affiliation(s)
- Mario Prieto
- Departamento de Biotecnología-Biología Vegetal, Escuela Técnica Superior de Ingeniería Agronómica, Alimentaria y de Biosistemas, Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM)- Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Pozuelo de Alarcon, Madrid, Spain
| | - Helena Deus
- Elsevier Inc., Cambridge, MA, United States of America
| | - Anita de Waard
- Elsevier Research Collaborations Unit, Jericho, VT, United States of America
| | - Erik Schultes
- GO FAIR International Support and Coordination Office, Leiden, The Netherlands
| | - Beatriz García-Jiménez
- Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM)- Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Pozuelo de Alarcon, Madrid, Spain
| | - Mark D. Wilkinson
- Departamento de Biotecnología-Biología Vegetal, Escuela Técnica Superior de Ingeniería Agronómica, Alimentaria y de Biosistemas, Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM)- Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Pozuelo de Alarcon, Madrid, Spain
| |
|
18
|
Kolhatkar V, Wu H, Cavasso L, Francis E, Shukla K, Taboada M. The SFU Opinion and Comments Corpus: A Corpus for the Analysis of Online News Comments. Corpus Pragmat 2019; 4:155-190. [PMID: 32685909 PMCID: PMC7357677 DOI: 10.1007/s41701-019-00065-w] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/13/2019] [Accepted: 10/15/2019] [Indexed: 06/02/2023]
Abstract
We present the SFU Opinion and Comments Corpus (SOCC), a collection of opinion articles and the comments posted in response to the articles. The articles include all the opinion pieces published in the Canadian newspaper The Globe and Mail in the 5-year period between 2012 and 2016, a total of 10,339 articles and 663,173 comments. SOCC is part of a project that investigates the linguistic characteristics of online comments. The corpus can be used to study a host of pragmatic phenomena. Among other aspects, researchers can explore: the connections between articles and comments; the connections of comments to each other; the types of topics discussed in comments; the nice (constructive) or mean (toxic) ways in which commenters respond to each other; how language is used to convey very specific types of evaluation; and how negation affects the interpretation of evaluative meaning in discourse. Our current focus is the study of constructiveness and evaluation in the comments. To that end, we have annotated a subset of the large corpus (1043 comments) with four layers of annotations: constructiveness, toxicity, negation and Appraisal (Martin and White, The language of evaluation, Palgrave, New York, 2005). This paper details our corpus, the data collection process, the characteristics of the corpus and describes the annotations. While our focus is comments posted in response to opinion news articles, the phenomena in this corpus are likely to be present in many commenting platforms: other news comments, comments and replies in fora such as Reddit, feedback on blogs, or YouTube comments.
Affiliation(s)
- Varada Kolhatkar
- Department of Computer Science, University of British Columbia, Vancouver, Canada
| | - Hanhan Wu
- Discourse Processing Lab, Department of Linguistics, Simon Fraser University, Burnaby, Canada
| | - Luca Cavasso
- Discourse Processing Lab, Department of Linguistics, Simon Fraser University, Burnaby, Canada
| | - Emilie Francis
- Discourse Processing Lab, Department of Linguistics, Simon Fraser University, Burnaby, Canada
| | - Kavan Shukla
- Discourse Processing Lab, Department of Linguistics, Simon Fraser University, Burnaby, Canada
| | - Maite Taboada
- Discourse Processing Lab, Department of Linguistics, Simon Fraser University, Burnaby, Canada
| |
|
19
|
Bongelli R, Riccioni I, Burro R, Zuczkowski A. Writers' uncertainty in scientific and popular biomedical articles. A comparative analysis of the British Medical Journal and Discover Magazine. PLoS One 2019; 14:e0221933. [PMID: 31487308 PMCID: PMC6728051 DOI: 10.1371/journal.pone.0221933] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2018] [Accepted: 08/19/2019] [Indexed: 12/01/2022] Open
Abstract
Distinguishing certain and uncertain information is of crucial importance both in the scientific field in the strict sense and in the popular scientific domain. In this paper, by adopting an epistemic stance perspective on certainty and uncertainty, and a mixed procedure of analysis that combines a bottom-up and a top-down approach, we perform a comparative study (both qualitative and quantitative) of the uncertainty linguistic markers (verbs, non-verbs, modal verbs, conditional clauses, uncertain questions, epistemic future) and their scope in three different corpora: a historical corpus of 80 biomedical articles from the British Medical Journal (BMJ) 1840–2007; a corpus of 12 biomedical articles from BMJ 2013; and a contemporary corpus of 12 popular science articles from Discover 2013. The variables under observation are time, structure (IMRaD vs no-IMRaD) and genre (scientific vs popular articles). We apply Generalized Linear Models to test whether there are statistically significant differences (1) in the amount of uncertainty among the different corpora, and (2) in the categories of uncertainty markers used by writers. The results of our analysis reveal that (1) in all corpora, the percentages of uncertainty are always much lower than those of certainty; (2) uncertainty progressively diminishes over time in biomedical articles, in conjunction with their structural changes (IMRaD) and with the increase in the BMJ Impact Factor; and (3) uncertainty is slightly higher in popular science articles (Discover 2013) than in the contemporary corpus of scientific articles (BMJ 2013). Nevertheless, in all corpora, modal verbs are the most used uncertainty markers. These results suggest not only that scientific writers prefer to communicate their uncertainty with markers of possibility rather than markers of subjectivity, but also that science journalists prefer a third-person subject followed by a modal verb to a first-person subject followed by a mental verb such as think or believe.
Affiliation(s)
- Ramona Bongelli
- Department of Political Science, Communication and International Relations, University of Macerata, Macerata, Italy
| | - Ilaria Riccioni
- Department of Education, Cultural Heritage and Tourism, University of Macerata, Macerata, Italy
| | - Roberto Burro
- Department of Human Sciences, University of Verona, Verona, Italy
| | - Andrzej Zuczkowski
- Department of Education, Cultural Heritage and Tourism, University of Macerata, Macerata, Italy
| |
|
20
|
Sergeeva E, Zhu H, Prinsen P, Tahmasebi A. Negation Scope Detection in Clinical Notes and Scientific Abstracts: A Feature-enriched LSTM-based Approach. AMIA Jt Summits Transl Sci Proc 2019; 2019:212-221. [PMID: 31258973 PMCID: PMC6568093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Electronic Health Records contain a wealth of clinical information that can potentially be used for a variety of clinical tasks. Clinical narratives contain information about the existence or absence of medical conditions as well as clinical findings. It is essential to be able to distinguish between the two, since negated events and non-negated events often have very different prognostic value. In this paper, we present a feature-enriched neural network-based model for negation scope detection in biomedical texts. The system achieves robustly high performance on two different types of texts, scientific abstracts and radiology reports: without requiring gold cue information, it achieves a new state-of-the-art result on the scientific abstracts part of the BioScope corpus and a competitive result on the radiology report corpus.
Affiliation(s)
- Elena Sergeeva
- Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Henghui Zhu
- Boston University, Systems Engineering, Brookline, MA, USA
| | - Peter Prinsen
- Philips Research Eindhoven, Eindhoven, The Netherlands
| | | |
|
21
|
Kennedy N, Brodbelt DC, Church DB, O’Neill DG. Detecting false-positive disease references in veterinary clinical notes without manual annotations. NPJ Digit Med 2019; 2:33. [PMID: 31304379 PMCID: PMC6550178 DOI: 10.1038/s41746-019-0108-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2018] [Accepted: 04/12/2019] [Indexed: 11/09/2022] Open
Abstract
Clinicians often include references in clinical notes to diseases that have not been diagnosed in their patients. For some disease terms, the majority of disease references written in the patient notes may not refer to a true disease diagnosis. These references occur because clinicians often use their clinical notes to speculate about disease existence (differential diagnosis) or to state that the disease has been ruled out. To train classifiers for disambiguating disease references, previous researchers built training sets by manually annotating sentences. We show how to create very large training sets without the need for manual annotation. We obtain state-of-the-art classification performance with a bidirectional long short-term memory model trained to distinguish disease references between patients with or without the disease diagnosis in veterinary clinical notes.
Affiliation(s)
- Noel Kennedy
- IT Department, The Royal Veterinary College, 4 Royal College St, London, NW1 0TU UK
| | - Dave C. Brodbelt
- Pathobiology and Population Science, The Royal Veterinary College, Hawkshead Lane, North Mymms, Hatfield, Herts AL9 7TA UK
| | - David B. Church
- Clinical Sciences and Services, The Royal Veterinary College, Hawkshead Lane, North Mymms, Hatfield, Herts AL9 7TA UK
| | - Dan G. O’Neill
- Pathobiology and Population Science, The Royal Veterinary College, Hawkshead Lane, North Mymms, Hatfield, Herts AL9 7TA UK
| |
|
22
|
Jagannatha A, Liu F, Liu W, Yu H. Overview of the First Natural Language Processing Challenge for Extracting Medication, Indication, and Adverse Drug Events from Electronic Health Record Notes (MADE 1.0). Drug Saf 2019; 42:99-111. [PMID: 30649735 PMCID: PMC6860017 DOI: 10.1007/s40264-018-0762-z] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/19/2023]
Abstract
INTRODUCTION This work describes the Medication and Adverse Drug Events from Electronic Health Records (MADE 1.0) corpus and provides an overview of the MADE 1.0 2018 challenge for extracting medication, indication, and adverse drug events (ADEs) from electronic health record (EHR) notes. OBJECTIVE The goal of MADE is to provide a set of common evaluation tasks to assess the state of the art for natural language processing (NLP) systems applied to EHRs supporting drug safety surveillance and pharmacovigilance. We also provide benchmarks on the MADE dataset using the system submissions received in the MADE 2018 challenge. METHODS The MADE 1.0 challenge has released an expert-annotated cohort of medication and ADE information comprising 1089 fully de-identified longitudinal EHR notes from 21 randomly selected patients with cancer at the University of Massachusetts Memorial Hospital. Using this cohort as a benchmark, the MADE 1.0 challenge designed three shared NLP tasks. The named entity recognition (NER) task identifies medications and their attributes (dosage, route, duration, and frequency), indications, ADEs, and severity. The relation identification (RI) task identifies relations between the named entities: medication-indication, medication-ADE, and attribute relations. The third shared task (NER-RI) evaluates NLP models that perform the NER and RI tasks jointly. In total, 11 teams from four countries participated in at least one of the three shared tasks, and 41 system submissions were received in total. RESULTS The best systems' F1 scores for NER, RI, and NER-RI were 0.82, 0.86, and 0.61, respectively. Ensemble classifiers built from the team submissions improved performance further, with F1 scores of 0.85, 0.87, and 0.66 for the three tasks, respectively. CONCLUSION The MADE results show that recent progress in NLP has led to remarkable improvements in NER and RI tasks for the clinical domain. However, some room for improvement remains, particularly in the NER-RI task.
Affiliation(s)
- Abhyuday Jagannatha
- College of Information and Computer Sciences, University of Massachusetts, Amherst, MA, USA
| | - Feifan Liu
- Department of Quantitative Health Sciences and Radiology, University of Massachusetts Medical School, Worcester, MA, USA
| | - Weisong Liu
- Department of Computer Science, University of Massachusetts, 220 Pawtucket St., Lowell, MA, 01854-2874, USA
- Department of Medicine, University of Massachusetts Medical School, Worcester, MA, USA
| | - Hong Yu
- College of Information and Computer Sciences, University of Massachusetts, Amherst, MA, USA.
- Department of Computer Science, University of Massachusetts, 220 Pawtucket St., Lowell, MA, 01854-2874, USA.
- Department of Medicine, University of Massachusetts Medical School, Worcester, MA, USA.
- Bedford VAMC, Bedford, MA, USA.
| |
|
23
|
Taylor SJ, Harabagiu SM. The Role of a Deep-Learning Method for Negation Detection in Patient Cohort Identification from Electroencephalography Reports. AMIA Annu Symp Proc 2018; 2018:1018-1027. [PMID: 30815145 PMCID: PMC6371289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Detecting negation in biomedical texts entails the automatic identification of negation cues (e.g. "never", "not", "no longer") as well as the scope of these cues. When medical concepts or terms are identified within the scope of a negation cue, their polarity is inferred as "negative"; all other concepts or words receive a positive polarity. Correctly inferring polarity is essential for patient cohort retrieval systems, as all inclusion criteria need to be automatically assigned positive polarity, whereas exclusion criteria should receive negative polarity. Motivated by recent developments in deep learning, we experimented with a neural negation detection technique and compared it against an existing neural polarity recognition system; both were incorporated into a patient cohort system operating on clinical electroencephalography (EEG) reports. Our experiments indicate that the neural negation detection method produces better patient cohorts than the polarity recognition method.
|
24
|
Kilicoglu H. Biomedical text mining for research rigor and integrity: tasks, challenges, directions. Brief Bioinform 2018; 19:1400-1414. [PMID: 28633401 PMCID: PMC6291799 DOI: 10.1093/bib/bbx057] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/07/2017] [Revised: 04/10/2017] [Indexed: 01/01/2023] Open
Abstract
An estimated quarter of a trillion US dollars is invested in the biomedical research enterprise annually. There is growing alarm that a significant portion of this investment is wasted because of problems in reproducibility of research findings and in the rigor and integrity of research conduct and reporting. Recent years have seen a flurry of activities focusing on standardization and guideline development to enhance the reproducibility and rigor of biomedical research. Research activity is primarily communicated via textual artifacts, ranging from grant applications to journal publications. These artifacts can be both the source and the manifestation of practices leading to research waste. For example, an article may describe a poorly designed experiment, or the authors may reach conclusions not supported by the evidence presented. In this article, we pose the question of whether biomedical text mining techniques can assist the stakeholders in the biomedical research enterprise in doing their part toward enhancing research integrity and rigor. In particular, we identify four key areas in which text mining techniques can make a significant contribution: plagiarism/fraud detection, ensuring adherence to reporting guidelines, managing information overload and accurate citation/enhanced bibliometrics. We review the existing methods and tools for specific tasks, if they exist, or discuss relevant research that can provide guidance for future work. With the exponential increase in biomedical research output and the ability of text mining approaches to perform automatic tasks at large scale, we propose that such approaches can support tools that promote responsible research practices, providing significant benefits for the biomedical research enterprise.
Affiliation(s)
- Halil Kilicoglu
- Lister Hill National Center for Biomedical Communications, US National Library of Medicine
| |
|
25
|
Fabregat H, Araujo L, Martinez-Romo J. Deep neural models for extracting entities and relationships in the new RDD corpus relating disabilities and rare diseases. Comput Methods Programs Biomed 2018; 164:121-129. [PMID: 30195420 DOI: 10.1016/j.cmpb.2018.07.007] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/30/2018] [Revised: 06/20/2018] [Accepted: 07/16/2018] [Indexed: 06/08/2023]
Abstract
BACKGROUND AND OBJECTIVE There are a great many rare diseases, and many of them are associated with significant disabilities. It is paramount to know in advance the evolution of a disease in order to limit and prevent the appearance of disabilities and to prepare the patient to manage future difficulties. Rare disease associations are making an effort to collect this information manually, but it is a long process. A lot of information about the consequences of rare diseases is published in scientific papers, and our goal is to automatically extract disabilities associated with diseases from them. METHODS This work presents a new corpus of abstracts from scientific papers related to rare diseases, which has been manually annotated with disabilities. This corpus makes it possible to train machine and deep learning systems that can automatically process other papers, thus extracting new information about the relations between rare diseases and disabilities. The corpus is also annotated with negation and speculation where they affect disabilities, and has been made publicly accessible. RESULTS We have devised experiments using deep learning techniques to show the usefulness of the developed corpus. Specifically, we have designed a long short-term memory based architecture for disability identification, as well as a convolutional neural network for detecting the relationships of disabilities to diseases. The systems designed do not need any preprocessing of the data, only low-dimensional vectors representing the words. CONCLUSIONS The developed corpus will make it possible to train systems to identify disabilities in biomedical documents, which current annotation systems are not able to detect. Such systems could also be trained to detect relationships between disabilities and diseases, as well as negation and speculation, which can change the meaning of the language. The deep learning models designed for identifying disabilities and their relationships to diseases in new documents obtain an F-measure of around 81% for disability recognition and 75% for relation extraction.
Affiliation(s)
- Hermenegildo Fabregat
- Department of Computer Science, Universidad Nacional de Educación a Distancia (UNED), Juan del Rosal 16, Madrid 28040, Spain.
| | - Lourdes Araujo
- Department of Computer Science, Universidad Nacional de Educación a Distancia (UNED), Juan del Rosal 16, Madrid 28040, Spain; IMIENS: Instituto Mixto de Investigación, Escuela Nacional de Sanidad, Monforte de Lemos 5, Madrid 28019, Spain.
| | - Juan Martinez-Romo
- Department of Computer Science, Universidad Nacional de Educación a Distancia (UNED), Juan del Rosal 16, Madrid 28040, Spain; IMIENS: Instituto Mixto de Investigación, Escuela Nacional de Sanidad, Monforte de Lemos 5, Madrid 28019, Spain.
| |
|
26
|
Shardlow M, Batista-Navarro R, Thompson P, Nawaz R, McNaught J, Ananiadou S. Identification of research hypotheses and new knowledge from scientific literature. BMC Med Inform Decis Mak 2018; 18:46. [PMID: 29940927 PMCID: PMC6019216 DOI: 10.1186/s12911-018-0639-1] [Citation(s) in RCA: 43] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2017] [Accepted: 06/11/2018] [Indexed: 01/05/2023] Open
Abstract
Background: Text mining (TM) methods have been used extensively to extract relations and events from the literature. In addition, TM techniques have been used to extract various types or dimensions of interpretative information, known as Meta-Knowledge (MK), from the context of relations and events, e.g. negation, speculation, certainty and knowledge type. However, most existing methods have focussed on the extraction of individual dimensions of MK, without investigating how they can be combined to obtain even richer contextual information. In this paper, we describe a novel, supervised method to extract new MK dimensions that encode Research Hypotheses (an author's intended knowledge gain) and New Knowledge (an author's findings). The method incorporates various features, including a combination of simple MK dimensions.
Methods: We identify previously explored dimensions and then use a random forest to combine these with linguistic features into a classification model. To facilitate evaluation of the model, we have enriched two existing corpora annotated with relations and events, i.e., a subset of the GENIA-MK corpus and the EU-ADR corpus, by adding attributes to encode whether each relation or event corresponds to Research Hypothesis or New Knowledge. In the GENIA-MK corpus, these new attributes complement simpler MK dimensions that had previously been annotated.
Results: We show that our approach is able to assign different types of MK dimensions to relations and events with a high degree of accuracy. Firstly, our method is able to improve upon the previously reported state-of-the-art performance for an existing dimension, i.e., Knowledge Type. Secondly, we also demonstrate high F1-scores in predicting the new dimensions of Research Hypothesis (GENIA: 0.914, EU-ADR: 0.802) and New Knowledge (GENIA: 0.829, EU-ADR: 0.836).
Conclusion: We have presented a novel approach for predicting New Knowledge and Research Hypothesis, which combines simple MK dimensions to achieve high F1-scores. The extraction of such information is valuable for a number of practical TM applications. The online version of this article (10.1186/s12911-018-0639-1) contains supplementary material, which is available to authorized users.
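The core of the method summarised above is a random forest over feature vectors that combine simple MK dimensions with linguistic features. A minimal sketch of that feature-combination step, with hypothetical dimension names, feature names, and example event (none taken from the paper):

```python
# Illustrative sketch (not the authors' code): build one feature vector per
# event by concatenating simple meta-knowledge (MK) dimensions with surface
# linguistic features, as input for a classifier such as a random forest.
# Dimension names, feature names, and the example event are hypothetical.

MK_DIMENSIONS = ["negation", "speculation", "certainty_level", "knowledge_type"]

def build_feature_vector(event):
    """Combine simple MK dimensions with linguistic features."""
    features = {}
    for dim in MK_DIMENSIONS:                       # simple MK dimensions
        features[f"mk_{dim}"] = event.get(dim, "none")
    features["trigger_lower"] = event.get("trigger", "").lower()
    features["in_results_section"] = event.get("section") == "Results"
    return features

event = {
    "trigger": "suggests",
    "negation": "none",
    "speculation": "speculated",
    "certainty_level": "L2",
    "knowledge_type": "Analysis",
    "section": "Results",
}
vec = build_feature_vector(event)
```

The resulting dictionary would then be vectorised and fed to the classifier alongside vectors for other events.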
Affiliation(s)
- Matthew Shardlow
- National Centre for Text Mining, University of Manchester, Manchester, UK
- Paul Thompson
- National Centre for Text Mining, University of Manchester, Manchester, UK
- Raheel Nawaz
- National Centre for Text Mining, University of Manchester, Manchester, UK
- John McNaught
- National Centre for Text Mining, University of Manchester, Manchester, UK
- Sophia Ananiadou
- National Centre for Text Mining, University of Manchester, Manchester, UK
27
Demner-Fushman D, Rogers WJ, Aronson AR. MetaMap Lite: an evaluation of a new Java implementation of MetaMap. J Am Med Inform Assoc 2018; 24:841-844. [PMID: 28130331; DOI: 10.1093/jamia/ocw177]
Abstract
MetaMap is a widely used named entity recognition tool that identifies concepts from the Unified Medical Language System Metathesaurus in text. This study presents MetaMap Lite, an implementation of some of the basic MetaMap functions in Java. On several collections of biomedical literature and clinical text, MetaMap Lite demonstrated real-time speed and precision, recall, and F1 scores comparable to or exceeding those of MetaMap and other popular biomedical text processing tools, clinical Text Analysis and Knowledge Extraction System (cTAKES) and DNorm.
Affiliation(s)
- Dina Demner-Fushman
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Willie J Rogers
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Alan R Aronson
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
28
Kilicoglu H, Ben Abacha A, Mrabet Y, Shooshan SE, Rodriguez L, Masterton K, Demner-Fushman D. Semantic annotation of consumer health questions. BMC Bioinformatics 2018; 19:34. [PMID: 29409442; PMCID: PMC5802048; DOI: 10.1186/s12859-018-2045-1]
Abstract
BACKGROUND Consumers increasingly use online resources for their health information needs. While current search engines can address these needs to some extent, they generally do not take into account that most health information needs are complex and can only fully be expressed in natural language. Consumer health question answering (QA) systems aim to fill this gap. A major challenge in developing consumer health QA systems is extracting relevant semantic content from the natural language questions (question understanding). To develop effective question understanding tools, question corpora semantically annotated for relevant question elements are needed. In this paper, we present a two-part consumer health question corpus annotated with several semantic categories: named entities, question triggers/types, question frames, and question topic. The first part (CHQA-email) consists of relatively long email requests received by the U.S. National Library of Medicine (NLM) customer service, while the second part (CHQA-web) consists of shorter questions posed to MedlinePlus search engine as queries. Each question has been annotated by two annotators. The annotation methodology is largely the same between the two parts of the corpus; however, we also explain and justify the differences between them. Additionally, we provide information about corpus characteristics, inter-annotator agreement, and our attempts to measure annotation confidence in the absence of adjudication of annotations. RESULTS The resulting corpus consists of 2614 questions (CHQA-email: 1740, CHQA-web: 874). Problems are the most frequent named entities, while treatment and general information questions are the most common question types. Inter-annotator agreement was generally modest: question types and topics yielded highest agreement, while the agreement for more complex frame annotations was lower. Agreement in CHQA-web was consistently higher than that in CHQA-email. 
Pairwise inter-annotator agreement proved most useful in estimating annotation confidence. CONCLUSIONS To our knowledge, our corpus is the first focusing on annotation of uncurated consumer health questions. It is currently used to develop machine learning-based methods for question understanding. We make the corpus publicly available to stimulate further research on consumer health QA.
Affiliation(s)
- Halil Kilicoglu
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, USA
- Asma Ben Abacha
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, USA
- Yassine Mrabet
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, USA
- Sonya E. Shooshan
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, USA
- Laritza Rodriguez
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, USA
- Kate Masterton
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, USA
- Dina Demner-Fushman
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, USA
29
Chen C, Song M, Heo GE. A scalable and adaptive method for finding semantically equivalent cue words of uncertainty. J Informetr 2018. [DOI: 10.1016/j.joi.2017.12.004]
30
Zerva C, Batista-Navarro R, Day P, Ananiadou S. Using uncertainty to link and rank evidence from biomedical literature for model curation. Bioinformatics 2017; 33:3784-3792. [PMID: 29036627; PMCID: PMC5860317; DOI: 10.1093/bioinformatics/btx466]
Abstract
MOTIVATION In recent years, there has been great progress in the field of automated curation of biomedical networks and models, aided by text mining methods that provide evidence from literature. Such methods must not only extract snippets of text that relate to model interactions, but also be able to contextualize the evidence and provide additional confidence scores for the interaction in question. Although various approaches calculating confidence scores have focused primarily on the quality of the extracted information, there has been little work on exploring the textual uncertainty conveyed by the author. Despite textual uncertainty being acknowledged in biomedical text mining as an attribute of text mined interactions (events), it is significantly understudied as a means of providing a confidence measure for interactions in pathways or other biomedical models. In this work, we focus on improving identification of textual uncertainty for events and explore how it can be used as an additional measure of confidence for biomedical models. RESULTS We present a novel method for extracting uncertainty from the literature using a hybrid approach that combines rule induction and machine learning. Variations of this hybrid approach are then discussed, alongside their advantages and disadvantages. We use subjective logic theory to combine multiple uncertainty values extracted from different sources for the same interaction. Our approach achieves F-scores of 0.76 and 0.88 based on the BioNLP-ST and Genia-MK corpora, respectively, making considerable improvements over previously published work. Moreover, we evaluate our proposed system on pathways related to two different areas, namely leukemia and melanoma cancer research. AVAILABILITY AND IMPLEMENTATION The leukemia pathway model used is available in Pathway Studio while the Ras model is available via PathwayCommons. 
Online demonstration of the uncertainty extraction system is available for research purposes at http://argo.nactem.ac.uk/test. The related code is available on https://github.com/c-zrv/uncertainty_components.git. Details on the above are available in the Supplementary Material. CONTACT sophia.ananiadou@manchester.ac.uk. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
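The abstract describes combining multiple uncertainty values with subjective logic theory. As a rough illustration, the standard cumulative fusion operator for binomial opinions can be sketched as follows; this is the generic textbook operator, not necessarily the exact variant used in the paper, and the example numbers are invented:

```python
def cumulative_fusion(o1, o2):
    """Cumulative fusion of two binomial opinions (belief, disbelief,
    uncertainty), each summing to 1 and with uncertainty > 0."""
    b1, d1, u1 = o1
    b2, d2, u2 = o2
    k = u1 + u2 - u1 * u2                  # normalisation term
    return ((b1 * u2 + b2 * u1) / k,
            (d1 * u2 + d2 * u1) / k,
            (u1 * u2) / k)

# Two sources reporting the same interaction with different confidence;
# fusing them yields a single opinion with lower residual uncertainty.
fused = cumulative_fusion((0.7, 0.1, 0.2), (0.5, 0.2, 0.3))
```

The fused components still sum to 1, and the fused uncertainty is lower than that of either source, which matches the intuition of accumulating evidence.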
Affiliation(s)
- Chrysoula Zerva
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
- Riza Batista-Navarro
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
- Philip Day
- Manchester Institute of Biotechnology, The University of Manchester, Manchester, UK
- Sophia Ananiadou
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
31
Kilicoglu H, Rosemblat G, Rindflesch TC. Assigning factuality values to semantic relations extracted from biomedical research literature. PLoS One 2017; 12:e0179926. [PMID: 28678823; PMCID: PMC5497973; DOI: 10.1371/journal.pone.0179926]
Abstract
Biomedical knowledge claims are often expressed as hypotheses, speculations, or opinions, rather than explicit facts (propositions). Much biomedical text mining has focused on extracting propositions from biomedical literature. One such system is SemRep, which extracts propositional content in the form of subject-predicate-object triples called predications. In this study, we investigated the feasibility of assessing the factuality level of SemRep predications to provide more nuanced distinctions between predications for downstream applications. We annotated semantic predications extracted from 500 PubMed abstracts with seven factuality values (fact, probable, possible, doubtful, counterfact, uncommitted, and conditional). We extended a rule-based, compositional approach that uses lexical and syntactic information to predict factuality levels. We compared this approach to a supervised machine learning method that uses a rich feature set based on the annotated corpus. Our results indicate that the compositional approach is more effective than the machine learning method in recognizing the factuality values of predications. The annotated corpus as well as the source code and binaries for factuality assignment are publicly available. We will also incorporate the results of the better performing compositional approach into SemMedDB, a PubMed-scale repository of semantic predications extracted using SemRep.
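A minimal sketch of the cue-driven side of such factuality assignment follows; the cue lexicon and matching logic are hypothetical simplifications (the paper's compositional approach also uses syntactic information), but the seven factuality values are those listed in the abstract:

```python
# Hypothetical cue lexicon mapping lexical triggers to factuality values.
# The seven values (fact, probable, possible, doubtful, counterfact,
# uncommitted, conditional) come from the study; the cues are illustrative.
FACTUALITY_CUES = {
    "may": "possible", "might": "possible", "possibly": "possible",
    "probably": "probable", "likely": "probable",
    "unlikely": "doubtful",
    "not": "counterfact", "no": "counterfact",
    "if": "conditional",
    "investigate": "uncommitted", "examine": "uncommitted",
}

def assign_factuality(sentence_tokens):
    """Return the factuality value triggered by the first matching cue;
    default to 'fact' when no cue is present."""
    for token in sentence_tokens:
        value = FACTUALITY_CUES.get(token.lower())
        if value:
            return value
    return "fact"

assign_factuality("Aspirin may reduce inflammation".split())  # 'possible'
```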
Affiliation(s)
- Halil Kilicoglu
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, Bethesda, MD, 20894, United States of America
- Graciela Rosemblat
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, Bethesda, MD, 20894, United States of America
- Thomas C. Rindflesch
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, Bethesda, MD, 20894, United States of America
32
Bokharaeian B, Diaz A, Taghizadeh N, Chitsaz H, Chavoshinejad R. SNPPhenA: a corpus for extracting ranked associations of single-nucleotide polymorphisms and phenotypes from literature. J Biomed Semantics 2017; 8:14. [PMID: 28388928; PMCID: PMC5383945; DOI: 10.1186/s13326-017-0116-2]
Abstract
Background: Single Nucleotide Polymorphisms (SNPs) are among the most important types of genetic variations influencing common diseases and phenotypes. Recently, some corpora and methods have been developed for extracting mutations and diseases from texts. However, no available corpus for extracting associations from texts is annotated with linguistic-based negation, modality markers, neutral candidates, and the confidence level of associations.
Method: This research presents the steps taken to produce the SNPPhenA corpus: automatic Named Entity Recognition (NER) followed by manual annotation of SNP and phenotype names, annotation of the SNP-phenotype associations and their level of confidence, and annotation of modality markers. Moreover, the corpus was annotated with negation scopes and cues as well as neutral candidates, which play a crucial role in negation- and modality-related extraction tasks.
Result: Agreement between annotators was measured with Cohen's Kappa coefficient, and the resulting scores indicate the reliability of the corpus. The Kappa score was 0.79 for annotating the associations and 0.80 for the confidence degree of associations. Also presented are basic statistics of the annotated features of the corpus, together with the results of our first experiments on extracting ranked SNP-phenotype associations. The prepared guideline documents make the corpus easier to use. The corpus, guidelines and inter-annotator agreement analysis are available on the website of the corpus: http://nil.fdi.ucm.es/?q=node/639.
Conclusion: Specifying the confidence degree of SNP-phenotype associations from articles helps identify the strength of associations, which could in turn assist genomics scientists in determining phenotypic plasticity and the importance of environmental factors. Moreover, our first experiments with the corpus show that linguistic-based confidence, alongside other non-linguistic features, can be used to estimate the strength of observed SNP-phenotype associations. Trial registration: not applicable. The online version of this article (doi:10.1186/s13326-017-0116-2) contains supplementary material, which is available to authorized users.
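The inter-annotator agreement figures reported above are Cohen's Kappa scores. For reference, a small worked implementation of the statistic (the annotation lists are invented):

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(a) == len(b) and a
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    labels = set(a) | set(b)
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)  # chance
    return (po - pe) / (1 - pe)

# Two annotators labelling 8 candidate associations positive (1) / negative (0):
ann1 = [1, 1, 0, 1, 0, 1, 1, 0]
ann2 = [1, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(ann1, ann2)   # agreement above chance, but modest
```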
Affiliation(s)
- Behrouz Bokharaeian
- Facultad de Informática, Complutense University of Madrid, Calle Profesor José García Santesmases, 9, 28040, Madrid, Spain
- Alberto Diaz
- Facultad de Informática, Complutense University of Madrid, Calle Profesor José García Santesmases, 9, 28040, Madrid, Spain
- Nasrin Taghizadeh
- School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
- Hamidreza Chitsaz
- Department of Computer Science, Colorado State University, Fort Collins, CO, 80523, USA
- Ramyar Chavoshinejad
- External Collaborator, Reproductive Biomedicine Research Center, Royan Institute for Reproductive Biomedicine, Tehran, Iran
33
Kang T, Zhang S, Xu N, Wen D, Zhang X, Lei J. Detecting negation and scope in Chinese clinical notes using character and word embedding. Comput Methods Programs Biomed 2017; 140:53-59. [PMID: 28254090; DOI: 10.1016/j.cmpb.2016.11.009]
Abstract
BACKGROUND AND OBJECTIVES Researchers have developed effective methods to index free-text clinical notes into structured database, in which negation detection is a critical but challenging step. In Chinese clinical records, negation detection is particularly challenging because it may depend on upstream Chinese information processing components such as word segmentation [1]. Traditionally, negation detection was carried out mostly using rule-based methods, whose comprehensiveness and portability were usually limited. Our objectives in this paper are to: 1) Construct a large Chinese clinical notes corpus with negation annotated; 2) develop a negation detection tool for Chinese clinical notes; 3) evaluate the performance of character and word embedding features in Chinese clinical natural language processing. METHODS In this paper, we construct a Chinese clinical corpus consisting of admission and discharge summaries, and propose sequence labeling based systems for negation and scope detection. Our systems rely on features from bag of characters, bag of words, character embedding and word embedding. For scopes, we introduce an additional feature to handle nested scopes with multiple negations. RESULTS The two annotators reached an agreement of 0.79 measured by Kappa in manual annotation. In cue detection, our systems are able to achieve a performance as high as 99.0% measured by F score, which significantly outperform its rule-based counterpart (79% F). The best system uses word embedding as features, which yields precision of 99.0% and recall of 99.1%. In scope detection, our system is able to achieve a performance of 94.6% measured by F score. CONCLUSIONS Our study provides a state-of-the-art negation-detecting tool for Chinese clinical free-text notes; Experimental results demonstrate that word embedding is effective in identifying negations, and that nested scopes can be identified effectively by our method.
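The systems above cast negation cue and scope detection as sequence labelling. A minimal sketch of how cue/scope annotations can be encoded as BIO tags for such a system (the tag scheme and toy example are illustrative, not the paper's exact setup):

```python
def bio_encode(tokens, cue_span, scope_span):
    """Encode a negation cue and its scope as BIO tags over a token
    sequence. Spans are (start, end) token indices, end-exclusive.
    Cue tags take precedence where the spans overlap."""
    tags = ["O"] * len(tokens)
    for name, (start, end) in (("CUE", cue_span), ("SCOPE", scope_span)):
        for i in range(start, end):
            if tags[i] == "O":
                tags[i] = ("B-" if i == start else "I-") + name
    return tags

# Toy Chinese example: "无 发热 及 咳嗽" ("no fever or cough"),
# with "无" as the cue and "发热 及 咳嗽" as its scope.
tokens = ["无", "发热", "及", "咳嗽"]
tags = bio_encode(tokens, cue_span=(0, 1), scope_span=(1, 4))
# tags: ['B-CUE', 'B-SCOPE', 'I-SCOPE', 'I-SCOPE']
```

A tagger (e.g. a CRF over character or word embedding features, as in the paper) is then trained to predict these tags for unseen sentences.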
Affiliation(s)
- Tian Kang
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Shaodian Zhang
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Nanfang Xu
- Department of Orthopedic Surgery, Peking University Third Hospital, Beijing, China
- Dong Wen
- Center for Medical Informatics, Peking University, Beijing, China
- Xingting Zhang
- Center for Medical Informatics, Peking University, Beijing, China
- Jianbo Lei
- Center for Medical Informatics, Peking University, Beijing, China; School of Medical Informatics and Engineering, Southwest Medical University, Luzhou, Sichuan, P.R. China
34
Zhang S, Kang T, Zhang X, Wen D, Elhadad N, Lei J. Speculation detection for Chinese clinical notes: Impacts of word segmentation and embedding models. J Biomed Inform 2016; 60:334-41. [PMID: 26923634; DOI: 10.1016/j.jbi.2016.02.011]
Abstract
Speculations represent uncertainty toward certain facts. In clinical texts, identifying speculations is a critical step of natural language processing (NLP). While it is a nontrivial task in many languages, detecting speculations in Chinese clinical notes can be particularly challenging because word segmentation may be necessary as an upstream operation. The objective of this paper is to construct a state-of-the-art speculation detection system for Chinese clinical notes and to investigate whether embedding features and word segmentations are worth exploiting toward this overall task. We propose a sequence labeling based system for speculation detection, which relies on features from bag of characters, bag of words, character embedding, and word embedding. We experiment on a novel dataset of 36,828 clinical notes with 5103 gold-standard speculation annotations on 2000 notes, and compare the systems in which word embeddings are calculated based on word segmentations given by general and by domain specific segmenters respectively. Our systems are able to reach performance as high as 92.2% measured by F score. We demonstrate that word segmentation is critical to produce high quality word embedding to facilitate downstream information extraction applications, and suggest that a domain dependent word segmenter can be vital to such a clinical NLP task in Chinese language.
Affiliation(s)
- Shaodian Zhang
- Department of Biomedical Informatics, Columbia University, New York, USA
- Tian Kang
- Department of Biomedical Informatics, Columbia University, New York, USA
- Xingting Zhang
- Center for Medical Informatics, Peking University, Beijing, China
- Dong Wen
- Center for Medical Informatics, Peking University, Beijing, China
- Noémie Elhadad
- Department of Biomedical Informatics, Columbia University, New York, USA
- Jianbo Lei
- Center for Medical Informatics, Peking University, Beijing, China
35
Thompson P, Nawaz R, McNaught J, Ananiadou S. Enriching news events with meta-knowledge information. Lang Resour Eval 2016. [DOI: 10.1007/s10579-016-9344-9]
36
Weegar R, Kvist M, Sundström K, Brunak S, Dalianis H. Finding Cervical Cancer Symptoms in Swedish Clinical Text using a Machine Learning Approach and NegEx. AMIA Annu Symp Proc 2015; 2015:1296-1305. [PMID: 26958270; PMCID: PMC4765575]
Abstract
Detection of early symptoms in cervical cancer is crucial for early treatment and survival. To find symptoms of cervical cancer in clinical text, Named Entity Recognition is needed. In this paper the Clinical Entity Finder, a machine-learning tool trained on annotated clinical text from a Swedish internal medicine emergency unit, is evaluated on cervical cancer records. The Clinical Entity Finder identifies entities of the types body part, finding and disorder and is extended with negation detection using the rule-based tool NegEx, to distinguish between negated and non-negated entities. To measure the performance of the tools on this new domain, two physicians annotated a set of clinical notes from the health records of cervical cancer patients. The inter-annotator agreement for finding, disorder and body part obtained an average F-score of 0.677 and the Clinical Entity Finder extended with NegEx had an average F-score of 0.667.
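NegEx, used above to separate negated from non-negated entities, is rule-based: roughly, an entity is treated as negated when a trigger phrase occurs shortly before it. A heavily simplified single-token sketch (real NegEx also handles multi-word and post-positioned triggers, pseudo-negations, and scope-terminating terms):

```python
# Simplified NegEx-style check, not the original implementation.
NEG_TRIGGERS = {"no", "not", "without", "denies", "denied"}
WINDOW = 5  # look back at most this many tokens before the entity

def is_negated(tokens, entity_index):
    """True if a negation trigger occurs within WINDOW tokens
    immediately before the entity at entity_index."""
    start = max(0, entity_index - WINDOW)
    return any(t.lower() in NEG_TRIGGERS for t in tokens[start:entity_index])

tokens = "patient denies abnormal bleeding".split()
is_negated(tokens, 3)   # entity 'bleeding' is negated by 'denies'
```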
Affiliation(s)
- Rebecka Weegar
- Department of Computer and Systems Sciences (DSV), Stockholm University, Sweden
- Maria Kvist
- Department of Computer and Systems Sciences (DSV), Stockholm University, Sweden; Department of Learning, Informatics, Management and Ethics (LIME), Karolinska Institutet, Stockholm, Sweden
- Karin Sundström
- Department of Laboratory Medicine (LABMED), Karolinska Institutet, Stockholm, Sweden
- Søren Brunak
- NNF Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
- Hercules Dalianis
- Department of Computer and Systems Sciences (DSV), Stockholm University, Sweden
37
A research framework for pharmacovigilance in health social media: Identification and evaluation of patient adverse drug event reports. J Biomed Inform 2015; 58:268-279. [PMID: 26518315; DOI: 10.1016/j.jbi.2015.10.011]
Abstract
Social media offer insights into patients' medical problems such as drug side effects and treatment failures. Patient reports of adverse drug events from social media have great potential to improve current practice of pharmacovigilance. However, extracting patient adverse drug event reports from social media continues to be an important challenge for health informatics research. In this study, we develop a research framework with advanced natural language processing techniques for integrated and high-performance extraction of patient-reported adverse drug events. The framework consists of medical entity extraction for recognizing patient discussions of drugs and events; adverse drug event extraction with a shortest-dependency-path kernel based statistical learning method and semantic filtering with information from medical knowledge bases; and report source classification to tease out noise. To evaluate the proposed framework, a series of experiments was conducted on a test bed of postings from major diabetes and heart disease forums in the United States. The results reveal that each component of the framework significantly contributes to its overall effectiveness, and that the framework significantly outperforms prior work.
38
Pyysalo S, Ohta T, Rak R, Rowley A, Chun HW, Jung SJ, Choi SP, Tsujii J, Ananiadou S. Overview of the Cancer Genetics and Pathway Curation tasks of BioNLP Shared Task 2013. BMC Bioinformatics 2015; 16 Suppl 10:S2. [PMID: 26202570; PMCID: PMC4511510; DOI: 10.1186/1471-2105-16-s10-s2]
Abstract
BACKGROUND Since their introduction in 2009, the BioNLP Shared Task events have been instrumental in advancing the development of methods and resources for the automatic extraction of information from the biomedical literature. In this paper, we present the Cancer Genetics (CG) and Pathway Curation (PC) tasks, two event extraction tasks introduced in the BioNLP Shared Task 2013. The CG task focuses on cancer, emphasizing the extraction of physiological and pathological processes at various levels of biological organization, and the PC task targets reactions relevant to the development of biomolecular pathway models, defining its extraction targets on the basis of established pathway representations and ontologies. RESULTS Six groups participated in the CG task and two groups in the PC task, together applying a wide range of extraction approaches including both established state-of-the-art systems and newly introduced extraction methods. The best-performing systems achieved F-scores of 55% on the CG task and 53% on the PC task, demonstrating a level of performance comparable to the best results achieved in similar previously proposed tasks. CONCLUSIONS The results indicate that existing event extraction technology can generalize to meet the novel challenges represented by the CG and PC task settings, suggesting that extraction methods are capable of supporting the construction of knowledge bases on the molecular mechanisms of cancer and the curation of biomolecular pathway models. The CG and PC tasks continue as open challenges for all interested parties, with data, tools and resources available from the shared task homepage.
Affiliation(s)
- Sampo Pyysalo
- Department of Information Technology, University of Turku, Turku, Finland
- Rafal Rak
- National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester, UK
- Andrew Rowley
- National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester, UK
- Hong-Woo Chun
- Software Research Center, Korea Institute of Science and Technology Information (KISTI), Daejeon, South Korea
- Sung-Jae Jung
- Software Research Center, Korea Institute of Science and Technology Information (KISTI), Daejeon, South Korea; Department of Applied Information Science, University of Science and Technology (UST), Daejeon, South Korea
- Sung-Pil Choi
- Department of Library and Information Science, Kyonggi University, Suwon, South Korea
- Sophia Ananiadou
- National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester, UK
39
Demner-Fushman D, Kohli MD, Rosenman MB, Shooshan SE, Rodriguez L, Antani S, Thoma GR, McDonald CJ. Preparing a collection of radiology examinations for distribution and retrieval. J Am Med Inform Assoc 2015; 23:304-10. [PMID: 26133894; DOI: 10.1093/jamia/ocv080]
Abstract
OBJECTIVE Clinical documents made available for secondary use play an increasingly important role in discovery of clinical knowledge, development of research methods, and education. An important step in facilitating secondary use of clinical document collections is easy access to descriptions and samples that represent the content of the collections. This paper presents an approach to developing a collection of radiology examinations, including both the images and radiologist narrative reports, and making them publicly available in a searchable database. MATERIALS AND METHODS The authors collected 3996 radiology reports from the Indiana Network for Patient Care and 8121 associated images from the hospitals' picture archiving systems. The images and reports were de-identified automatically and then the automatic de-identification was manually verified. The authors coded the key findings of the reports and empirically assessed the benefits of manual coding on retrieval. RESULTS The automatic de-identification of the narrative was aggressive and achieved 100% precision at the cost of rendering a few findings uninterpretable. Automatic de-identification of images was not quite as perfect. Images for two of 3996 patients (0.05%) showed protected health information. Manual encoding of findings improved retrieval precision. CONCLUSION Stringent de-identification methods can remove all identifiers from text radiology reports. DICOM de-identification of images does not remove all identifying information and needs special attention to images scanned from film. Adding manual coding to the radiologist narrative reports significantly improved relevancy of the retrieved clinical documents. The de-identified Indiana chest X-ray collection is available for searching and downloading from the National Library of Medicine (http://openi.nlm.nih.gov/).
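A hedged sketch of the kind of rule-based text de-identification described above, replacing a few common identifier patterns with placeholders; the patterns below are illustrative only, and the study's pipeline is far more aggressive:

```python
import re

# Illustrative PHI-scrubbing patterns; a real de-identification pipeline
# covers many more identifier types (names, MRNs, addresses, ages > 89, ...).
PATTERNS = [
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),    # dates
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),         # phone numbers
    (re.compile(r"\bDr\.\s+[A-Z][a-z]+\b"), "[PHYSICIAN]"),    # physician names
]

def deidentify(text):
    """Replace each matched identifier pattern with its placeholder."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

deidentify("Seen by Dr. Smith on 03/12/2014, call 555-123-4567.")
# → 'Seen by [PHYSICIAN] on [DATE], call [PHONE].'
```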
Collapse
Affiliation(s)
- Dina Demner-Fushman
- Staff Scientist, Lister Hill National Center for Biomedical Communications National Library of Medicine, National Institutes of Health Bldg. 38A, Room 10S-1022, 8600 Rockville Pike MSC-3824 Bethesda, MD 20894, USA
- Marc D Kohli
- Assistant Professor, Director of Informatics, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, IN, USA
- Marc B Rosenman
- Associate Professor, Children's Health Services Research, Department of Pediatrics, Indiana University School of Medicine, Indianapolis, IN, USA
- Sonya E Shooshan
- Computer Science Branch, Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Laritza Rodriguez
- Computer Science Branch, Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Sameer Antani
- Staff Scientist, Communications Engineering Branch, Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- George R Thoma
- Branch Chief, Communications Engineering Branch, Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Clement J McDonald
- Director, Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
40
Mehrabi S, Krishnan A, Sohn S, Roch AM, Schmidt H, Kesterson J, Beesley C, Dexter P, Max Schmidt C, Liu H, Palakal M. DEEPEN: A negation detection system for clinical text incorporating dependency relation into NegEx. J Biomed Inform 2015; 54:213-9. [PMID: 25791500 PMCID: PMC5863758 DOI: 10.1016/j.jbi.2015.02.010]
Abstract
In Electronic Health Records (EHRs), much valuable information regarding patients' conditions is embedded in free-text format. Natural language processing (NLP) techniques have been developed to extract clinical information from free text. One challenge in clinical NLP is that the meaning of clinical entities is heavily affected by modifiers such as negation. The negation detection algorithm NegEx applies a simple approach that has proven powerful in clinical NLP. However, because it does not consider the contextual relationships between words within a sentence, NegEx fails to correctly capture the negation status of concepts in complex sentences. Incorrect negation assignment can cause inaccurate diagnosis of a patient's condition or contaminated study cohorts. We developed a negation algorithm called DEEPEN that decreases NegEx's false positives by taking into account the dependency relationships between negation words and concepts within a sentence, using the Stanford dependency parser. The system was developed and tested using EHR data from Indiana University (IU) and was further evaluated on a Mayo Clinic dataset to assess its generalizability. The evaluation results demonstrate that DEEPEN, which incorporates dependency parsing into NegEx, reduces the number of incorrect negation assignments for patients with positive findings, and therefore improves the identification of patients with the target clinical findings in EHRs.
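DEEPEN's core idea, keeping a NegEx trigger hit only when the trigger and the concept are connected in the dependency parse, can be sketched as follows. The trigger list and the hand-built, acyclic head map are toy stand-ins for NegEx's lexicon and the Stanford parser output, not the published implementation.

```python
TRIGGERS = {"no", "denies", "without"}  # toy NegEx-style trigger list

def head_chain(token, heads):
    """Walk token -> head -> ... to the root, collecting every ancestor."""
    chain = []
    while token in heads:
        token = heads[token]
        chain.append(token)
    return chain

def negated(concept, trigger, heads):
    """Dependency filter: the trigger must govern or be governed by the concept."""
    return trigger in head_chain(concept, heads) or concept in head_chain(trigger, heads)

def deepen(tokens, concept, heads):
    """NegEx step (a trigger is present) plus DEEPEN's dependency check."""
    return any(negated(concept, t, heads) for t in TRIGGERS if t in tokens)

tokens = "no cough but fever persists".split()
heads = {"no": "cough", "cough": "persists", "but": "persists", "fever": "persists"}
print(deepen(tokens, "cough", heads))  # True: "no" attaches to "cough"
print(deepen(tokens, "fever", heads))  # False: no dependency link to "no"
```

In plain NegEx, "fever" would also fall inside the trigger's window; the dependency check is what prunes that false positive.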
Affiliation(s)
- Saeed Mehrabi
- School of Informatics and Computing, Indiana University, Indianapolis, IN, USA; Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Anand Krishnan
- School of Informatics and Computing, Indiana University, Indianapolis, IN, USA
- Sunghwan Sohn
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Alexandra M Roch
- Department of Surgery, Indiana University, Indianapolis, IN, USA
- Heidi Schmidt
- Department of Surgery, Indiana University, Indianapolis, IN, USA
- Paul Dexter
- Regenstrief Institute, Indianapolis, IN, USA
- C Max Schmidt
- Department of Surgery, Indiana University, Indianapolis, IN, USA
- Hongfang Liu
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Mathew Palakal
- School of Informatics and Computing, Indiana University, Indianapolis, IN, USA
41
Automatic negation detection in narrative pathology reports. Artif Intell Med 2015; 64:41-50. [PMID: 25990897 DOI: 10.1016/j.artmed.2015.03.001]
Abstract
OBJECTIVE To detect negations of medical entities in free-text pathology reports with different approaches, and to evaluate their performance. METHODS AND MATERIAL Three approaches were applied for negation detection: the lexicon-based approach was a rule-based method relying on trigger terms and termination clues; the syntax-based approach was also rule-based, with rules and negation patterns designed using the dependency output of the Stanford parser; the machine-learning-based approach used a support vector machine classifier to build models from a number of features. A total of 284 English pathology reports of lymphoma were used for the study. RESULTS The machine-learning-based approach had the best overall performance on the test set, with a micro-averaged F-score of 82.56%, while the syntax-based approach performed worst, with a 78.62% F-score. The lexicon-based approach attained an overall average precision of 89.74% and recall of 76.09%, significantly better than the results achieved by Negation Tagger with a similar approach. DISCUSSION The lexicon-based approach benefited more than the other two methods from being customized to the corpus. The errors of the worst-performing syntax-based approach were mainly due to poor parsing results, and the errors of the other methods were probably caused by abnormal grammatical structures. CONCLUSIONS A machine-learning-based approach has potential advantages for negation detection and may be preferable for the task. One possible way to improve overall performance is to apply a different approach to each section of the reports.
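The lexicon-based approach described above (trigger terms open a negation scope; termination clues close it) can be sketched in a few lines. The trigger and termination lists here are illustrative, not the customized lexicon the study built.

```python
# Toy lexicons; the study's actual lists were customized to its lymphoma corpus.
NEGATION_TRIGGERS = {"no", "not", "negative", "without"}
TERMINATION_CLUES = {"but", "however", "although", "except"}

def negation_scopes(tokens):
    """Return the set of token positions that fall inside a negation scope."""
    in_scope, negated = False, set()
    for i, tok in enumerate(tokens):
        word = tok.lower().strip(",.;")
        if not word:            # bare punctuation token
            continue
        if word in NEGATION_TRIGGERS:
            in_scope = True     # trigger opens a scope
        elif word in TERMINATION_CLUES:
            in_scope = False    # termination clue closes it
        elif in_scope:
            negated.add(i)
    return negated

tokens = "No evidence of lymphoma , but atypical cells are present".split()
print([tokens[i] for i in sorted(negation_scopes(tokens))])
# -> ['evidence', 'of', 'lymphoma']  (scope stops at "but")
```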
42
Bravo À, Piñero J, Queralt-Rosinach N, Rautschka M, Furlong LI. Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research. BMC Bioinformatics 2015; 16:55. [PMID: 25886734 PMCID: PMC4466840 DOI: 10.1186/s12859-015-0472-9]
Abstract
Background Current biomedical research needs to leverage and exploit the large amount of information reported in scientific publications. Automated text mining approaches, in particular those aimed at finding relationships between entities, are key for the identification of actionable knowledge from free-text repositories. We present the BeFree system, aimed at identifying relationships between biomedical entities with a special focus on genes and their associated diseases. Results By exploiting morpho-syntactic information of the text, BeFree is able to identify gene-disease, drug-disease and drug-target associations with state-of-the-art performance. The application of BeFree to real-case scenarios shows its effectiveness in extracting information relevant for translational research. We show the value of the gene-disease associations extracted by BeFree through a number of analyses and integration with other data sources. BeFree succeeds in identifying genes associated with depression, a major cause of morbidity worldwide, that are not present in other public resources. Moreover, large-scale extraction and analysis of gene-disease associations, and integration with current biomedical knowledge, provided interesting insights on the kind of information that can be found in the literature, and raised challenges regarding data prioritization and curation. We found that only a small proportion of the gene-disease associations discovered by BeFree is collected in expert-curated databases. Thus, there is a pressing need to find alternative strategies to manual curation in order to review, prioritize and curate text-mining data and incorporate it into domain-specific databases. We present our strategy for data prioritization and discuss its implications for supporting biomedical research and applications. Conclusions BeFree is a novel text mining system that performs competitively for the identification of gene-disease, drug-disease and drug-target associations. Our analyses show that mining only a small fraction of MEDLINE results in a large dataset of gene-disease associations, of which only a small proportion (2%) is actually recorded in curated resources, raising several issues on data prioritization and curation. We propose that joint analysis of text-mined data with data curated by experts is a suitable approach to both assess data quality and highlight novel and interesting information. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0472-9) contains supplementary material, which is available to authorized users.
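As a hedged baseline for the kind of association mining BeFree performs, the sketch below extracts sentence-level gene-disease co-occurrences. BeFree itself additionally exploits morpho-syntactic information, and the entity dictionaries here are toy stand-ins, not its actual gene and disease lexicons.

```python
# Toy entity dictionaries; real systems use large curated lexicons.
GENES = {"BDNF", "SLC6A4", "TP53"}
DISEASES = {"depression", "cancer"}

def cooccurrences(text):
    """Collect (gene, disease) pairs that co-occur within a sentence."""
    pairs = set()
    for sentence in text.split("."):            # naive sentence splitting
        words = set(sentence.replace(",", " ").split())
        lowered = {w.lower() for w in words}
        for gene in GENES & words:              # gene symbols are case-sensitive
            for disease in DISEASES & lowered:
                pairs.add((gene, disease))
    return pairs

text = "BDNF and SLC6A4 have been linked to depression. TP53 mutations drive cancer."
print(sorted(cooccurrences(text)))
# -> [('BDNF', 'depression'), ('SLC6A4', 'depression'), ('TP53', 'cancer')]
```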
Affiliation(s)
- Àlex Bravo
- Research Programme on Biomedical Informatics (GRIB), IMIM, DCEXS, Universitat Pompeu Fabra, Barcelona, Spain.
- Janet Piñero
- Research Programme on Biomedical Informatics (GRIB), IMIM, DCEXS, Universitat Pompeu Fabra, Barcelona, Spain.
- Núria Queralt-Rosinach
- Research Programme on Biomedical Informatics (GRIB), IMIM, DCEXS, Universitat Pompeu Fabra, Barcelona, Spain.
- Michael Rautschka
- Research Programme on Biomedical Informatics (GRIB), IMIM, DCEXS, Universitat Pompeu Fabra, Barcelona, Spain.
- Laura I Furlong
- Research Programme on Biomedical Informatics (GRIB), IMIM, DCEXS, Universitat Pompeu Fabra, Barcelona, Spain.
43
Kim Y, Garvin J, Goldstein MK, Meystre SM. Classification of Contextual Use of Left Ventricular Ejection Fraction Assessments. Stud Health Technol Inform 2015; 216:599-603. [PMID: 26262121 PMCID: PMC5055832]
Abstract
Knowledge of the left ventricular ejection fraction is critical for the optimal care of patients with heart failure. When a document contains multiple ejection fraction assessments, accurate classification of their contextual use is necessary to filter out historical findings or recommendations and prioritize the assessments for selection of document level ejection fraction information. We present a natural language processing system that classifies the contextual use of both quantitative and qualitative left ventricular ejection fraction assessments in clinical narrative documents. We created support vector machine classifiers with a variety of features extracted from the target assessment, associated concepts, and document section information. The experimental results showed that our classifiers achieved good performance, reaching 95.6% F1-measure for quantitative assessments and 94.2% F1-measure for qualitative assessments in a five-fold cross-validation evaluation.
Affiliation(s)
- Youngjun Kim
- School of Computing, University of Utah, Salt Lake City, USA
- VA Health Care System, Salt Lake City, Utah, USA
- Jennifer Garvin
- Department of Biomedical Informatics, University of Utah, Salt Lake City, USA
- VA Health Care System, Salt Lake City, Utah, USA
- Mary K. Goldstein
- VA Palo Alto Health Care System, Palo Alto, CA, and Stanford University, Stanford, CA, USA
- Stéphane M. Meystre
- Department of Biomedical Informatics, University of Utah, Salt Lake City, USA
- VA Health Care System, Salt Lake City, Utah, USA
44
Afzal Z, Pons E, Kang N, Sturkenboom MCJM, Schuemie MJ, Kors JA. ContextD: an algorithm to identify contextual properties of medical terms in a Dutch clinical corpus. BMC Bioinformatics 2014; 15:373. [PMID: 25432799 PMCID: PMC4264258 DOI: 10.1186/s12859-014-0373-3]
Abstract
Background In order to extract meaningful information from electronic medical records, such as signs and symptoms, diagnoses, and treatments, it is important to take into account the contextual properties of the identified information: negation, temporality, and experiencer. Most work on automatic identification of these contextual properties has been done on English clinical text. This study presents ContextD, an adaptation of the English ConText algorithm to the Dutch language, and a Dutch clinical corpus. We created a Dutch clinical corpus containing four types of anonymized clinical documents: entries from general practitioners, specialists' letters, radiology reports, and discharge letters. Using a Dutch list of medical terms extracted from the Unified Medical Language System, we identified medical terms in the corpus with exact matching. The identified terms were annotated for the negation, temporality, and experiencer properties. To adapt the ConText algorithm, we translated English trigger terms to Dutch and added several general and document-specific enhancements, such as negation rules for general practitioners' entries and a regular-expression-based temporality module. Results The ContextD algorithm used 41 unique triggers to identify the contextual properties in the clinical corpus. For the negation property, the algorithm obtained an F-score from 87% to 93% for the different document types. For the experiencer property, the F-score was 99% to 100%. For the historical and hypothetical values of the temporality property, F-scores ranged from 26% to 54% and from 13% to 44%, respectively. Conclusions ContextD showed good performance in identifying negation and experiencer property values across all Dutch clinical document types. Accurate identification of the temporality property proved to be difficult and requires further work. The anonymized and annotated Dutch clinical corpus can serve as a useful resource for further algorithm development.
Electronic supplementary material The online version of this article (doi:10.1186/s12859-014-0373-3) contains supplementary material, which is available to authorized users.
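The three contextual properties ContextD assigns can be sketched with ConText-style trigger matching. The English trigger lists below are illustrative stand-ins for the translated Dutch lexicon, and scope handling is omitted: a trigger anywhere in the sentence sets the property.

```python
# Toy trigger lists; ContextD used 41 unique (Dutch) triggers with scope rules.
TRIGGERS = {
    "negation": {"no", "denies", "without"},
    "historical": {"history", "previous", "past"},
    "family": {"mother", "father", "family"},
}

def context_properties(sentence):
    """Assign negation, temporality, and experiencer values at sentence level."""
    words = {w.lower().strip(",.") for w in sentence.split()}
    return {
        "negated": bool(words & TRIGGERS["negation"]),
        "temporality": "historical" if words & TRIGGERS["historical"] else "recent",
        "experiencer": "other" if words & TRIGGERS["family"] else "patient",
    }

print(context_properties("Family history of diabetes in her mother"))
# -> {'negated': False, 'temporality': 'historical', 'experiencer': 'other'}
```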
Affiliation(s)
- Zubair Afzal
- Department of Medical Informatics, Erasmus Medical Center, P.O. Box 2040, Rotterdam, CA, 3000, Netherlands.
- Ewoud Pons
- Department of Medical Informatics, Erasmus Medical Center, P.O. Box 2040, Rotterdam, CA, 3000, Netherlands.
- Ning Kang
- Department of Medical Informatics, Erasmus Medical Center, P.O. Box 2040, Rotterdam, CA, 3000, Netherlands.
- Miriam C J M Sturkenboom
- Department of Medical Informatics, Erasmus Medical Center, P.O. Box 2040, Rotterdam, CA, 3000, Netherlands.
- Jan A Kors
- Department of Medical Informatics, Erasmus Medical Center, P.O. Box 2040, Rotterdam, CA, 3000, Netherlands.
45
Wu S, Miller T, Masanz J, Coarr M, Halgrim S, Carrell D, Clark C. Negation's not solved: generalizability versus optimizability in clinical natural language processing. PLoS One 2014; 9:e112774. [PMID: 25393544 PMCID: PMC4231086 DOI: 10.1371/journal.pone.0112774]
Abstract
A review of published work in clinical natural language processing (NLP) may suggest that the negation detection task has been “solved.” This work proposes that an optimizable solution does not equal a generalizable solution. We introduce a new machine learning-based Polarity Module for detecting negation in clinical text, and extensively compare its performance across domains. Using four manually annotated corpora of clinical text, we show that negation detection performance suffers when there is no in-domain development (for manual methods) or training data (for machine learning-based methods). Various factors (e.g., annotation guidelines, named entity characteristics, the amount of data, and lexical and syntactic context) play a role in making generalizability difficult, but none completely explains the phenomenon. Furthermore, generalizability remains challenging because it is unclear whether to use a single source for accurate data, combine all sources into a single model, or apply domain adaptation methods. The most reliable means to improve negation detection is to manually annotate in-domain training data (or, perhaps, manually modify rules); this is a strategy for optimizing performance, rather than generalizing it. These results suggest a direction for future work in domain-adaptive and task-adaptive methods for clinical NLP.
Affiliation(s)
- Stephen Wu
- Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America; Oregon Health and Science University, Portland, Oregon, United States of America
- Timothy Miller
- Children's Hospital Boston Informatics Program, Harvard Medical School, Boston, Massachusetts, United States of America
- James Masanz
- Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America
- Matt Coarr
- Human Language Technology Department, The MITRE Corporation, Bedford, Massachusetts, United States of America
- Scott Halgrim
- Group Health Research Institute, Seattle, Washington, United States of America
- David Carrell
- Group Health Research Institute, Seattle, Washington, United States of America
- Cheryl Clark
- Human Language Technology Department, The MITRE Corporation, Bedford, Massachusetts, United States of America
46
Velupillai S, Skeppstedt M, Kvist M, Mowery D, Chapman BE, Dalianis H, Chapman WW. Cue-based assertion classification for Swedish clinical text--developing a lexicon for pyConTextSwe. Artif Intell Med 2014; 61:137-44. [PMID: 24556644 PMCID: PMC4104142 DOI: 10.1016/j.artmed.2014.01.001]
Abstract
OBJECTIVE The ability of a cue-based system to accurately assert whether a disorder is affirmed, negated, or uncertain is dependent, in part, on its cue lexicon. In this paper, we continue our study of porting an assertion system (pyConTextNLP) from English to Swedish (pyConTextSwe) by creating an optimized assertion lexicon for clinical Swedish. METHODS AND MATERIAL We integrated cues from four external lexicons, along with generated inflections and combinations. We used subsets of a clinical corpus in Swedish. We applied four assertion classes (definite existence, probable existence, probable negated existence and definite negated existence) and two binary classes (existence yes/no and uncertainty yes/no) to pyConTextSwe. We compared pyConTextSwe's performance with and without the added cues on a development set, and improved the lexicon further after an error analysis. On a separate evaluation set, we calculated the system's final performance. RESULTS Following integration steps, we added 454 cues to pyConTextSwe. The optimized lexicon developed after an error analysis resulted in statistically significant improvements on the development set (83% F-score, overall). The system's final F-scores on an evaluation set were 81% (overall). For the individual assertion classes, F-score results were 88% (definite existence), 81% (probable existence), 55% (probable negated existence), and 63% (definite negated existence). For the binary classifications existence yes/no and uncertainty yes/no, final system performance was 97%/87% and 78%/86% F-score, respectively. CONCLUSIONS We have successfully ported pyConTextNLP to Swedish (pyConTextSwe). We have created an extensive and useful assertion lexicon for Swedish clinical text, which could form a valuable resource for similar studies, and which is publicly available.
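The four assertion classes above are effectively the cross-product of the two binary decisions also evaluated (existence yes/no, uncertainty yes/no). A minimal sketch of that mapping, with toy English cues standing in for the 454-cue Swedish lexicon:

```python
# Toy cue lists; the multiword cue "ruled out" is split into tokens for
# simplicity. The real pyConTextSwe lexicon is Swedish and far larger.
NEGATION_CUES = {"no", "denies", "ruled", "out"}
UNCERTAINTY_CUES = {"possible", "suspected", "cannot", "exclude"}

def assertion_class(words):
    """Map two binary cue decisions onto the four assertion classes."""
    negated = bool(set(words) & NEGATION_CUES)
    uncertain = bool(set(words) & UNCERTAINTY_CUES)
    if negated and uncertain:
        return "probable negated existence"
    if negated:
        return "definite negated existence"
    if uncertain:
        return "probable existence"
    return "definite existence"

print(assertion_class("suspected pneumonia".split()))  # probable existence
print(assertion_class("pneumonia ruled out".split()))  # definite negated existence
```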
Affiliation(s)
- Sumithra Velupillai
- Department of Computer and Systems Sciences (DSV), Stockholm University, Forum 100, 164 40 Kista, Sweden.
- Maria Skeppstedt
- Department of Computer and Systems Sciences (DSV), Stockholm University, Forum 100, 164 40 Kista, Sweden.
- Maria Kvist
- Department of Computer and Systems Sciences (DSV), Stockholm University, Forum 100, 164 40 Kista, Sweden; Department of Learning, Informatics, Management and Ethics (LIME), Karolinska Institutet, Widerström Building, Tomtebodavägen 18A, Solna, Sweden.
- Danielle Mowery
- Department of Biomedical Informatics, University of Pittsburgh, 5607 Baum Boulevard, BAUM 423, Pittsburgh, PA 15206-3701, United States.
- Brian E Chapman
- Department of Radiology, University of Utah, 729 Arapeen Drive, Salt Lake City, UT 84108, United States.
- Hercules Dalianis
- Department of Computer and Systems Sciences (DSV), Stockholm University, Forum 100, 164 40 Kista, Sweden.
- Wendy W Chapman
- Department of Biomedical Informatics, University of Utah, 26 South 2000 East, Room 5775 HSEB, Salt Lake City, UT 84112-5775, United States.
47
Abstract
Collections of documents annotated with semantic entities and relationships are crucial resources to support the development and evaluation of text mining solutions for the biomedical domain. Here I present an overview of 36 corpora and an analysis of the semantic annotations they contain. Annotations for entity types were classified into six semantic groups, and an overview of the semantic entities found in each corpus is given. Results show that while some semantic entities, such as genes, proteins and chemicals, are consistently annotated in many collections, corpora available for diseases, variations and mutations are still few, in spite of their importance in the biological domain.
Affiliation(s)
- Mariana Neves
- Hasso-Plattner-Institut, Potsdam Universität, Potsdam, Germany
48
Styler WF, Bethard S, Finan S, Palmer M, Pradhan S, de Groen PC, Erickson B, Miller T, Lin C, Savova G, Pustejovsky J. Temporal Annotation in the Clinical Domain. Transactions of the Association for Computational Linguistics 2014; 2:143-154. [PMID: 29082229 PMCID: PMC5657277]
Abstract
This article discusses the requirements of a formal specification for the annotation of temporal information in clinical narratives. We discuss the implementation and extension of ISO-TimeML for annotating a corpus of clinical notes, known as the THYME corpus. To reflect the information task and the heavily inference-based reasoning demands in the domain, a new annotation guideline has been developed, "the THYME Guidelines to ISO-TimeML (THYME-TimeML)". To clarify what relations merit annotation, we distinguish between linguistically-derived and inferentially-derived temporal orderings in the text. We also apply a top performing TempEval 2013 system against this new resource to measure the difficulty of adapting systems to the clinical domain. The corpus is available to the community and has been proposed for use in a SemEval 2015 task.
Affiliation(s)
- Steven Bethard
- Department of Computer and Information Sciences, University of Alabama at Birmingham
- Sean Finan
- Children's Hospital Boston Informatics Program and Harvard Medical School
- Martha Palmer
- Department of Linguistics, University of Colorado at Boulder
- Sameer Pradhan
- Children's Hospital Boston Informatics Program and Harvard Medical School
- Brad Erickson
- Mayo Clinic College of Medicine, Mayo Clinic, Rochester, MN
- Timothy Miller
- Children's Hospital Boston Informatics Program and Harvard Medical School
- Chen Lin
- Children's Hospital Boston Informatics Program and Harvard Medical School
- Guergana Savova
- Children's Hospital Boston Informatics Program and Harvard Medical School
49
Liu V, Clark MP, Mendoza M, Saket R, Gardner MN, Turk BJ, Escobar GJ. Automated identification of pneumonia in chest radiograph reports in critically ill patients. BMC Med Inform Decis Mak 2013; 13:90. [PMID: 23947340 PMCID: PMC3765332 DOI: 10.1186/1472-6947-13-90]
Abstract
BACKGROUND Prior studies demonstrate the suitability of natural language processing (NLP) for identifying pneumonia in chest radiograph (CXR) reports; however, few evaluate this approach in intensive care unit (ICU) patients. METHODS From a total of 194,615 ICU reports, we empirically developed a lexicon to categorize pneumonia-relevant terms and uncertainty profiles. We encoded lexicon items into unique queries within an NLP software application and designed an algorithm to assign automated interpretations ('positive', 'possible', or 'negative') based on each report's query profile. We evaluated algorithm performance in a sample of 2,466 CXR reports interpreted by physician consensus and in two ICU patient subgroups, including those admitted for pneumonia and for rheumatologic/endocrine diagnoses. RESULTS Most reports were deemed 'negative' (51.8%) by physician consensus. Many were 'possible' (41.7%); only 6.5% were 'positive' for pneumonia. The lexicon included 105 terms and uncertainty profiles that were encoded into 31 NLP queries. The queries identified 534,322 'hits' in the full sample, with 2.7 ± 2.6 'hits' per report. An algorithm comprising twenty rules and probability steps assigned interpretations to reports based on their query profiles. In the validation set, the algorithm had 92.7% sensitivity, 91.1% specificity, 93.3% positive predictive value, and 90.3% negative predictive value for differentiating 'negative' from 'positive'/'possible' reports. In the ICU subgroups, the algorithm also demonstrated good performance, misclassifying few reports (5.8%). CONCLUSIONS Many CXR reports in ICU patients demonstrate frank uncertainty regarding a pneumonia diagnosis. This electronic tool demonstrates promise for assigning automated interpretations to CXR reports by leveraging both terms and uncertainty profiles.
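The report-level decision step, mapping each report's query-hit profile to 'positive', 'possible', or 'negative', can be sketched as follows. The term lists and rules are invented for illustration and are far simpler than the system's 31 queries and twenty rules.

```python
# Toy term lists; the actual lexicon had 105 terms and uncertainty profiles.
POSITIVE_TERMS = {"pneumonia", "consolidation"}
UNCERTAIN_TERMS = {"possible", "cannot", "exclude", "may", "represent"}
NEGATION_TERMS = {"no", "without", "clear"}

def interpret(report):
    """Count query 'hits' per category, then apply simple profile rules."""
    words = report.lower().replace(",", " ").split()
    pos = sum(w in POSITIVE_TERMS for w in words)
    unc = sum(w in UNCERTAIN_TERMS for w in words)
    neg = sum(w in NEGATION_TERMS for w in words)
    if pos and not neg and not unc:
        return "positive"
    if pos and unc:
        return "possible"
    return "negative"

print(interpret("Focal consolidation, pneumonia likely"))  # positive
print(interpret("Opacity may represent pneumonia"))        # possible
print(interpret("No focal consolidation"))                 # negative
```

The middle case shows why the abstract stresses uncertainty: hedged radiology language ("may represent") is mapped to its own 'possible' class rather than forced into positive or negative.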
Affiliation(s)
- Vincent Liu
- Division of Research and Systems Research Initiative, Kaiser Permanente, 2000 Broadway, Webster Annex CA 94612 Oakland, Northern California
- Santa Clara Medical Center, Kaiser Permanente, Santa Clara, CA, Northern California
- Mark P Clark
- Vallejo Medical Center, Kaiser Permanente, Vallejo, CA, Northern California
- Mark Mendoza
- Santa Clara Medical Center, Kaiser Permanente, Santa Clara, CA, Northern California
- Ramin Saket
- Santa Clara Medical Center, Kaiser Permanente, Santa Clara, CA, Northern California
- Marla N Gardner
- Division of Research and Systems Research Initiative, Kaiser Permanente, 2000 Broadway, Webster Annex CA 94612 Oakland, Northern California
- Benjamin J Turk
- Division of Research and Systems Research Initiative, Kaiser Permanente, 2000 Broadway, Webster Annex CA 94612 Oakland, Northern California
- Gabriel J Escobar
- Division of Research and Systems Research Initiative, Kaiser Permanente, 2000 Broadway, Webster Annex CA 94612 Oakland, Northern California
- Walnut Creek Medical Center, Kaiser Permanente, Oakland, CA, Northern California
50
Friedman C, Rindflesch TC, Corn M. Natural language processing: state of the art and prospects for significant progress, a workshop sponsored by the National Library of Medicine. J Biomed Inform 2013; 46:765-73. [PMID: 23810857 DOI: 10.1016/j.jbi.2013.06.004]
Abstract
Natural language processing (NLP) is crucial for advancing healthcare because it is needed to transform relevant information locked in text into structured data that can be used by computer processes aimed at improving patient care and advancing medicine. In light of the importance of NLP to health, the National Library of Medicine (NLM) recently sponsored a workshop to review the state of the art in NLP focusing on text in English, both in biomedicine and in the general language domain. Specific goals of the NLM-sponsored workshop were to identify the current state of the art, grand challenges and specific roadblocks, and to identify effective use and best practices. This paper reports on the main outcomes of the workshop, including an overview of the state of the art, strategies for advancing the field, and obstacles that need to be addressed, resulting in recommendations for a research agenda intended to advance the field.
Affiliation(s)
- Carol Friedman
- Department of Biomedical Informatics, Columbia University, United States.