1
Campillos-Llanos L. MedLexSp - a medical lexicon for Spanish medical natural language processing. J Biomed Semantics 2023; 14:2. [PMID: 36732862 PMCID: PMC9892682 DOI: 10.1186/s13326-022-00281-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/25/2022] [Accepted: 12/03/2022] [Indexed: 02/04/2023]
Abstract
BACKGROUND Medical lexicons enable the natural language processing (NLP) of health texts. Lexicons gather terms and concepts from thesauri and ontologies, and linguistic data for part-of-speech (PoS) tagging, lemmatization or natural language generation. To date, there is no such type of resource for Spanish. CONSTRUCTION AND CONTENT This article describes a unified medical lexicon for Medical Natural Language Processing in Spanish. MedLexSp includes terms and inflected word forms with PoS information and Unified Medical Language System (UMLS) semantic types, groups and Concept Unique Identifiers (CUIs). To create it, we used NLP techniques and domain corpora (e.g. MedlinePlus). We also collected terms from the Dictionary of Medical Terms from the Spanish Royal Academy of Medicine, the Medical Subject Headings (MeSH), the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT), the Medical Dictionary for Regulatory Activities Terminology (MedDRA), the International Classification of Diseases, version 10, the Anatomical Therapeutic Chemical Classification, the National Cancer Institute (NCI) Dictionary, the Online Mendelian Inheritance in Man (OMIM) and OrphaData. Terms related to COVID-19 were assembled by applying a similarity-based approach with word embeddings trained on a large corpus. MedLexSp includes 100 887 lemmas, 302 543 inflected forms (conjugated verbs, and number/gender variants), and 42 958 UMLS CUIs. We report two use cases of MedLexSp. First, applying the lexicon to pre-annotate a corpus of 1200 texts related to clinical trials. Second, PoS tagging and lemmatizing texts about clinical cases. MedLexSp improved the scores for PoS tagging and lemmatization compared to the default spaCy and Stanza Python libraries.
CONCLUSIONS The lexicon is distributed in a delimiter-separated value file; an XML file with the Lexical Markup Framework; a lemmatizer module for the spaCy and Stanza libraries; and complementary Lexical Record (LR) files. The embeddings and code to extract COVID-19 terms, and the spaCy and Stanza lemmatizers enriched with medical terms are provided in a public repository.
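As a rough illustration of how a lexicon of this kind could be consumed, the sketch below loads a delimiter-separated file into a lookup table from (inflected form, PoS) to lemma and concept identifiers. The column names, the `|` separator, the sample rows and the CUI placeholders are all assumptions made for the example; the actual MedLexSp file layout should be checked against its own documentation.

```python
import csv
import io

# Hypothetical rows in an assumed form|lemma|pos|cui layout.
# The CUI values are placeholders, not real UMLS identifiers.
SAMPLE = """\
form|lemma|pos|cui
riñones|riñón|NOUN|CUI_PLACEHOLDER_1
riñón|riñón|NOUN|CUI_PLACEHOLDER_1
diagnosticó|diagnosticar|VERB|CUI_PLACEHOLDER_2
"""

def load_lexicon(handle, delimiter="|"):
    """Map (lowercased form, PoS) to (lemma, set of CUIs)."""
    lexicon = {}
    for row in csv.DictReader(handle, delimiter=delimiter):
        key = (row["form"].lower(), row["pos"])
        _lemma, cuis = lexicon.setdefault(key, (row["lemma"], set()))
        cuis.add(row["cui"])
    return lexicon

def lemmatize(lexicon, form, pos):
    """Return the lexicon lemma, or the form itself when it is unknown."""
    entry = lexicon.get((form.lower(), pos))
    return entry[0] if entry else form

lexicon = load_lexicon(io.StringIO(SAMPLE))
print(lemmatize(lexicon, "Riñones", "NOUN"))
```

Keying on the pair (form, PoS) rather than on the form alone lets the same surface string resolve to different lemmas depending on its tag, which is the usual reason a lexicon stores PoS information next to each inflected form.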
Affiliation(s)
- Leonardo Campillos-Llanos
- Instituto de Lengua, Literatura y Antropología (ILLA), CSIC (Spanish National Research Council), Albasanz 26-28, 28037, Madrid, Spain.
2
3
Coghlan A, Turner S, Coverdale S. Danger in discharge summaries: Abbreviations create confusion for both author and recipient. Intern Med J 2021; 53:550-558. [PMID: 34636114 DOI: 10.1111/imj.15582] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/08/2021] [Revised: 09/26/2021] [Accepted: 10/06/2021] [Indexed: 11/29/2022]
Abstract
BACKGROUND The transition from hospital inpatient care to medical care in the community is a high-risk period for adverse events. Inadequate communication, including low quality or unavailable discharge summaries, has been shown to impact patient care. AIMS Assess use of abbreviations in clinical handover documents from inpatient hospital teams to general practitioners (GPs), and the interpretation of these abbreviations by GPs and hospital-based junior doctors. METHODS Retrospective audit of 802 discharge summaries completed during a one-week period in 2017 by a Queensland regional health service. GPs and local junior doctors then attempted interpretation of twenty relevant abbreviations. RESULTS 99% (794) of discharge summaries included abbreviations. 1612 different abbreviations were used on 16 327 occasions. The median number of abbreviations per discharge summary was 17 (range 0-86). 254 GPs and 62 junior doctors responded to a survey which found that no abbreviation was interpreted the same by all respondents. GPs and junior doctors were unable to offer any interpretation in 17.9% and 15.2% of cases respectively. GPs offered a greater range of interpretations than junior doctors, with a median of 9 and 3 different interpretations per abbreviation respectively. 94% (239) of GPs felt that the use of abbreviations in discharge summaries had the potential to impact patient care. 152 (60%) GPs felt that time spent clarifying abbreviations in discharge summaries could be excessive. CONCLUSIONS Abbreviations are often used in discharge summaries, yet poorly understood. This has the potential to impact patient care in the transition period after hospitalisation.
Affiliation(s)
- Anna Coghlan
- Sunshine Coast Hospital and Health Service, 6 Doherty St, Birtinya QLD AUS 4575; Fernlands Radius Medical Centre, 10 Woodhill Road, Ferny Hills QLD AUS 4055; University of Queensland Faculty of Medicine, Herston QLD AUS 4006, Australia
- Sophie Turner
- Sunshine Coast Hospital and Health Service, 6 Doherty St, Birtinya QLD AUS 4575; Metro North Hospital and Health Service, 7 Butterfield St, Herston QLD AUS 4006, Australia; University of Queensland Faculty of Medicine, Herston QLD AUS 4006, Australia
- Steven Coverdale
- School of Medicine, Sunshine Coast, Griffith University, 6, Doherty St, BIRTINYA, QLD 4575, Australia
4
Jing X. The Unified Medical Language System at 30 Years and How It Is Used and Published: Systematic Review and Content Analysis. JMIR Med Inform 2021; 9:e20675. [PMID: 34236337 PMCID: PMC8433943 DOI: 10.2196/20675] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 05/25/2020] [Revised: 11/25/2020] [Accepted: 07/02/2021] [Indexed: 01/22/2023]
Abstract
BACKGROUND The Unified Medical Language System (UMLS) has been a critical tool in biomedical and health informatics, and the year 2021 marks its 30th anniversary. The UMLS brings together many broadly used vocabularies and standards in the biomedical field to facilitate interoperability among different computer systems and applications. OBJECTIVE Despite its longevity, there is no comprehensive publication analysis of the use of the UMLS. Thus, this review and analysis is conducted to provide an overview of the UMLS and its use in English-language peer-reviewed publications, with the objective of providing a comprehensive understanding of how the UMLS has been used in English-language peer-reviewed publications over the last 30 years. METHODS PubMed, ACM Digital Library, and the Nursing & Allied Health Database were used to search for studies. The primary search strategy was as follows: UMLS was used as a Medical Subject Headings term or a keyword or appeared in the title or abstract. Only English-language publications were considered. The publications were screened first, then coded and categorized iteratively, following the grounded theory. The review process followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. RESULTS A total of 943 publications were included in the final analysis. Moreover, 32 publications were categorized into 2 categories; hence the total number of publications before duplicates are removed is 975. 
After analysis and categorization of the publications, UMLS was found to be used in the following emerging themes or areas (the number of publications and their respective percentages are given in parentheses): natural language processing (230/975, 23.6%), information retrieval (125/975, 12.8%), terminology study (90/975, 9.2%), ontology and modeling (80/975, 8.2%), medical subdomains (76/975, 7.8%), other language studies (53/975, 5.4%), artificial intelligence tools and applications (46/975, 4.7%), patient care (35/975, 3.6%), data mining and knowledge discovery (25/975, 2.6%), medical education (20/975, 2.1%), degree-related theses (13/975, 1.3%), digital library (5/975, 0.5%), and the UMLS itself (150/975, 15.4%), as well as the UMLS for other purposes (27/975, 2.8%). CONCLUSIONS The UMLS has been used successfully in patient care, medical education, digital libraries, and software development, as originally planned, as well as in degree-related theses, the building of artificial intelligence tools, data mining and knowledge discovery, foundational work in methodology, and middle layers that may lead to advanced products. Natural language processing, the UMLS itself, and information retrieval are the 3 most common themes that emerged among the included publications. The results, although largely related to academia, demonstrate that UMLS achieves its intended uses successfully, in addition to achieving uses broadly beyond its original intentions.
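The reported shares can be checked with a few lines of arithmetic. The sketch below only reuses the counts given in the abstract; the variable names are illustrative.

```python
# Publication counts per theme, copied from the abstract above.
theme_counts = {
    "natural language processing": 230,
    "information retrieval": 125,
    "terminology study": 90,
    "ontology and modeling": 80,
    "medical subdomains": 76,
    "other language studies": 53,
    "artificial intelligence tools and applications": 46,
    "patient care": 35,
    "data mining and knowledge discovery": 25,
    "medical education": 20,
    "degree-related theses": 13,
    "digital library": 5,
    "the UMLS itself": 150,
    "the UMLS for other purposes": 27,
}

included = 943        # publications in the final analysis
double_counted = 32   # publications assigned to 2 categories

total = sum(theme_counts.values())
percentages = {theme: round(100 * n / total, 1) for theme, n in theme_counts.items()}

for theme, share in percentages.items():
    print(f"{theme}: {theme_counts[theme]}/{total} ({share}%)")
```

The category counts sum to 975, which equals 943 included publications plus the 32 that were counted in two categories, so the percentages in the abstract are shares of category assignments rather than of unique publications.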
Affiliation(s)
- Xia Jing
- Department of Public Health Sciences, College of Behavioral, Social and Health Sciences, Clemson University, Clemson, SC, United States
5
Carriere J, Shafi H, Brehon K, Pohar Manhas K, Churchill K, Ho C, Tavakoli M. Case Report: Utilizing AI and NLP to Assist with Healthcare and Rehabilitation During the COVID-19 Pandemic. Front Artif Intell 2021; 4:613637. [PMID: 33733232 PMCID: PMC7907599 DOI: 10.3389/frai.2021.613637] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 10/02/2020] [Accepted: 01/08/2021] [Indexed: 01/16/2023]
Abstract
The COVID-19 pandemic has profoundly affected healthcare systems and healthcare delivery worldwide. Policy makers are utilizing social distancing and isolation policies to reduce the risk of transmission and spread of COVID-19, while the research, development, and testing of antiviral treatments and vaccines are ongoing. As part of these isolation policies, in-person healthcare delivery has been reduced, or eliminated, to avoid the risk of COVID-19 infection in high-risk and vulnerable populations, particularly those with comorbidities. Clinicians, occupational therapists, and physiotherapists have traditionally relied on in-person diagnosis and treatment of acute and chronic musculoskeletal (MSK) and neurological conditions and illnesses. The assessment and rehabilitation of persons with acute and chronic conditions has, therefore, been particularly impacted during the pandemic. This article presents a perspective on how Artificial Intelligence and Machine Learning (AI/ML) technologies, such as Natural Language Processing (NLP), can be used to assist with assessment and rehabilitation for acute and chronic conditions.
Affiliation(s)
- Jay Carriere
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada
- Hareem Shafi
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada
- Katelyn Brehon
- School of Public Health, University of Alberta, Edmonton, AB, Canada
- Kiran Pohar Manhas
- Neurosciences, Rehabilitation, and Vision Strategic Clinical Network, Alberta Health Services, Calgary, AB, Canada
- Katie Churchill
- Department of Occupational Therapy, University of Alberta, Edmonton, AB, Canada
- Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
- Chester Ho
- Neurosciences, Rehabilitation, and Vision Strategic Clinical Network, Alberta Health Services, Calgary, AB, Canada
- Faculty of Medicine and Dentistry, University of Alberta, Edmonton, AB, Canada
- Mahdi Tavakoli
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada
6
Zhang C, Biś D, Liu X, He Z. Biomedical word sense disambiguation with bidirectional long short-term memory and attention-based neural networks. BMC Bioinformatics 2019; 20:502. [PMID: 31787096 PMCID: PMC6886160 DOI: 10.1186/s12859-019-3079-8] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 11/13/2022]
Abstract
Background In recent years, deep learning methods have been applied to many natural language processing tasks to achieve state-of-the-art performance. However, in the biomedical domain, they have not out-performed supervised word sense disambiguation (WSD) methods based on support vector machines or random forests, possibly due to inherent similarities of medical word senses. Results In this paper, we propose two deep-learning-based models for supervised WSD: a model based on bi-directional long short-term memory (BiLSTM) network, and an attention model based on self-attention architecture. Our result shows that the BiLSTM neural network model with a suitable upper layer structure performs even better than the existing state-of-the-art models on the MSH WSD dataset, while our attention model was 3 or 4 times faster than our BiLSTM model with good accuracy. In addition, we trained “universal” models in order to disambiguate all ambiguous words together. That is, we concatenate the embedding of the target ambiguous word to the max-pooled vector in the universal models, acting as a “hint”. The result shows that our universal BiLSTM neural network model yielded about 90 percent accuracy. Conclusion Deep contextual models based on sequential information processing methods are able to capture the relative contextual information from pre-trained input word embeddings, in order to provide state-of-the-art results for supervised biomedical WSD tasks.
Affiliation(s)
- Canlin Zhang
- Department of Mathematics, Florida State University, Tallahassee, FL, US
- Daniel Biś
- Department of Computer Science, Florida State University, Tallahassee, FL, US
- Xiuwen Liu
- Department of Computer Science, Florida State University, Tallahassee, FL, US
- Zhe He
- School of Information, Florida State University, Tallahassee, FL, US.
7
Wang Y, Zheng K, Xu H, Mei Q. Interactive medical word sense disambiguation through informed learning. J Am Med Inform Assoc 2018; 25:800-808. [PMID: 29584896 PMCID: PMC6658868 DOI: 10.1093/jamia/ocy013] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 10/25/2017] [Revised: 01/19/2018] [Accepted: 02/09/2018] [Indexed: 11/13/2022]
Abstract
Objective Medical word sense disambiguation (WSD) is challenging and often requires significant training with data labeled by domain experts. This work aims to develop an interactive learning algorithm that makes efficient use of expert's domain knowledge in building high-quality medical WSD models with minimal human effort. Methods We developed an interactive learning algorithm with expert labeling instances and features. An expert can provide supervision in 3 ways: labeling instances, specifying indicative words of a sense, and highlighting supporting evidence in a labeled instance. The algorithm learns from these labels and iteratively selects the most informative instances to ask for future labels. Our evaluation used 3 WSD corpora: 198 ambiguous terms from Medical Subject Headings (MSH) as MEDLINE indexing terms, 74 ambiguous abbreviations in clinical notes from the University of Minnesota (UMN), and 24 ambiguous abbreviations in clinical notes from Vanderbilt University Hospital (VUH). For each ambiguous term and each learning algorithm, a learning curve that plots the accuracy on the test set against the number of labeled instances was generated. The area under the learning curve was used as the primary evaluation metric. Results Our interactive learning algorithm significantly outperformed active learning, the previous fastest learning algorithm for medical WSD. Compared to active learning, it achieved 90% accuracy for the MSH corpus with 42% less labeling effort, 35% less labeling effort for the UMN corpus, and 16% less labeling effort for the VUH corpus. Conclusions High-quality WSD models can be efficiently trained with minimal supervision by inviting experts to label informative instances and provide domain knowledge through labeling/highlighting contextual features.
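The evaluation metric named above, the area under the learning curve, can be sketched as a trapezoidal integral of test accuracy over the number of labeled instances. The abstract does not say how the area was normalized, so dividing by the x-range is an assumption made here for readability, and the two curves are made-up numbers, not results from the paper.

```python
def area_under_learning_curve(num_labels, accuracies):
    """Trapezoidal area under accuracy vs. number of labeled instances.

    The area is divided by the x-range (an assumption of this sketch),
    so a learner with constant accuracy a scores exactly a.
    """
    if len(num_labels) != len(accuracies) or len(num_labels) < 2:
        raise ValueError("need at least two matching points")
    area = 0.0
    for i in range(1, len(num_labels)):
        width = num_labels[i] - num_labels[i - 1]
        area += width * (accuracies[i] + accuracies[i - 1]) / 2
    return area / (num_labels[-1] - num_labels[0])

# Hypothetical learning curves for two learners on the same label budget.
labels = [0, 10, 20, 30]
slow_learner = [0.50, 0.60, 0.70, 0.80]
fast_learner = [0.50, 0.80, 0.85, 0.90]

slow_auc = area_under_learning_curve(labels, slow_learner)
fast_auc = area_under_learning_curve(labels, fast_learner)
print(slow_auc, fast_auc)
```

A learner that reaches high accuracy with fewer labels accumulates more area early on, which is why this metric rewards label efficiency rather than only final accuracy.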
Affiliation(s)
- Yue Wang
- Department of Electrical Engineering and Computer Science, The University of Michigan, Ann Arbor, MI, 48109, USA
- Kai Zheng
- Department of Informatics, The University of California, Irvine, CA, 92697, USA
- Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
- Qiaozhu Mei
- Department of Electrical Engineering and Computer Science, The University of Michigan, Ann Arbor, MI, 48109, USA
- School of Information, The University of Michigan, Ann Arbor, MI, 48109, USA
8
Duque A, Stevenson M, Martinez-Romo J, Araujo L. Co-occurrence graphs for word sense disambiguation in the biomedical domain. Artif Intell Med 2018; 87:9-19. [DOI: 10.1016/j.artmed.2018.03.002] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 07/17/2017] [Revised: 01/23/2018] [Accepted: 03/11/2018] [Indexed: 10/17/2022]
9
Chen J, Druhl E, Polepalli Ramesh B, Houston TK, Brandt CA, Zulman DM, Vimalananda VG, Malkani S, Yu H. A Natural Language Processing System That Links Medical Terms in Electronic Health Record Notes to Lay Definitions: System Development Using Physician Reviews. J Med Internet Res 2018; 20:e26. [PMID: 29358159 PMCID: PMC5799720 DOI: 10.2196/jmir.8669] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 08/08/2017] [Revised: 11/21/2017] [Accepted: 12/06/2017] [Indexed: 11/23/2022]
Abstract
Background Many health care systems now allow patients to access their electronic health record (EHR) notes online through patient portals. Medical jargon in EHR notes can confuse patients, which may interfere with potential benefits of patient access to EHR notes. Objective The aim of this study was to develop and evaluate the usability and content quality of NoteAid, a Web-based natural language processing system that links medical terms in EHR notes to lay definitions, that is, definitions easily understood by lay people. Methods NoteAid incorporates two core components: CoDeMed, a lexical resource of lay definitions for medical terms, and MedLink, a computational unit that links medical terms to lay definitions. We developed innovative computational methods, including an adapted distant supervision algorithm to prioritize medical terms important for EHR comprehension to facilitate the effort of building CoDeMed. Ten physician domain experts evaluated the user interface and content quality of NoteAid. The evaluation protocol included a cognitive walkthrough session and a postsession questionnaire. Physician feedback sessions were audio-recorded. We used standard content analysis methods to analyze qualitative data from these sessions. Results Physician feedback was mixed. Positive feedback on NoteAid included (1) Easy to use, (2) Good visual display, (3) Satisfactory system speed, and (4) Adequate lay definitions. Opportunities for improvement arising from evaluation sessions and feedback included (1) improving the display of definitions for partially matched terms, (2) including more medical terms in CoDeMed, (3) improving the handling of terms whose definitions vary depending on different contexts, and (4) standardizing the scope of definitions for medicines. On the basis of these results, we have improved NoteAid’s user interface and a number of definitions, and added 4502 more definitions in CoDeMed. 
Conclusions Physician evaluation yielded useful feedback for content validation and refinement of this innovative tool that has the potential to improve patient EHR comprehension and experience using patient portals. Future ongoing work will develop algorithms to handle ambiguous medical terms and test and evaluate NoteAid with patients.
Affiliation(s)
- Jinying Chen
- Department of Quantitative Health Sciences, University of Massachusetts Medical School, Worcester, MA, United States
- Emily Druhl
- Bedford Veterans Affairs Medical Center, Center for Healthcare Organization and Implementation Research, Bedford, MA, United States
- Thomas K Houston
- Department of Quantitative Health Sciences, University of Massachusetts Medical School, Worcester, MA, United States; Bedford Veterans Affairs Medical Center, Center for Healthcare Organization and Implementation Research, Bedford, MA, United States
- Cynthia A Brandt
- Veterans Affairs Connecticut Health Care System, West Haven, CT, United States; Center for Medical Informatics, Yale University, New Haven, CT, United States
- Donna M Zulman
- Division of Primary Care and Population Health, Stanford University School of Medicine, Stanford, CA, United States; Veterans Affairs Palo Alto Health Care System, Menlo Park, CA, United States
- Varsha G Vimalananda
- Bedford Veterans Affairs Medical Center, Center for Healthcare Organization and Implementation Research, Bedford, MA, United States; School of Medicine, Boston University, Boston, MA, United States
- Samir Malkani
- Diabetes Center of Excellence, University of Massachusetts Medical School, Worcester, MA, United States
- Hong Yu
- Department of Quantitative Health Sciences, University of Massachusetts Medical School, Worcester, MA, United States; Bedford Veterans Affairs Medical Center, Center for Healthcare Organization and Implementation Research, Bedford, MA, United States
10
Henriksson A, Zhao J, Dalianis H, Boström H. Ensembles of randomized trees using diverse distributed representations of clinical events. BMC Med Inform Decis Mak 2016; 16 Suppl 2:69. [PMID: 27459846 PMCID: PMC4965720 DOI: 10.1186/s12911-016-0309-0] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Indexed: 12/18/2022]
Abstract
BACKGROUND Learning deep representations of clinical events based on their distributions in electronic health records has been shown to allow for subsequent training of higher-performing predictive models compared to the use of shallow, count-based representations. The predictive performance may be further improved by utilizing multiple representations of the same events, which can be obtained by, for instance, manipulating the representation learning procedure. The question, however, remains how to make best use of a set of diverse representations of clinical events - modeled in an ensemble of semantic spaces - for the purpose of predictive modeling. METHODS Three different ways of exploiting a set of (ten) distributed representations of four types of clinical events - diagnosis codes, drug codes, measurements, and words in clinical notes - are investigated in a series of experiments using ensembles of randomized trees. Here, the semantic space ensembles are obtained by varying the context window size in the representation learning procedure. The proposed method trains a forest wherein each tree is built from a bootstrap replicate of the training set whose entire original feature set is represented in a randomly selected set of semantic spaces - corresponding to the considered data types - of a given context window size. RESULTS The proposed method significantly outperforms concatenating the multiple representations of the bagged dataset; it also significantly outperforms representing, for each decision tree, only a subset of the features in a randomly selected set of semantic spaces. A follow-up analysis indicates that the proposed method exhibits less diversity while significantly improving average tree performance. It is also shown that the size of the semantic space ensemble has a significant impact on predictive performance and that performance tends to improve as the size increases. 
CONCLUSIONS The strategy for utilizing a set of diverse distributed representations of clinical events when constructing ensembles of randomized trees has a significant impact on predictive performance. The most successful strategy - significantly outperforming the considered alternatives - involves randomly sampling distributed representations of the clinical events when building each decision tree in the forest.
Affiliation(s)
- Aron Henriksson
- Department of Computer and Systems Sciences, Stockholm University, Borgarfjordsgatan 12, Kista, SE-16407, Sweden.
- Jing Zhao
- Department of Computer and Systems Sciences, Stockholm University, Borgarfjordsgatan 12, Kista, SE-16407, Sweden
- Hercules Dalianis
- Department of Computer and Systems Sciences, Stockholm University, Borgarfjordsgatan 12, Kista, SE-16407, Sweden
- Henrik Boström
- Department of Computer and Systems Sciences, Stockholm University, Borgarfjordsgatan 12, Kista, SE-16407, Sweden
11
Determining the difficulty of Word Sense Disambiguation. J Biomed Inform 2014; 47:83-90. [DOI: 10.1016/j.jbi.2013.09.009] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Received: 02/18/2013] [Revised: 09/10/2013] [Accepted: 09/13/2013] [Indexed: 11/19/2022]
12
Rajpathak DG. An ontology based text mining system for knowledge discovery from the diagnosis data in the automotive domain. Comput Ind 2013. [DOI: 10.1016/j.compind.2013.03.001] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Indexed: 10/27/2022]
13
Eriksson R, Jensen PB, Frankild S, Jensen LJ, Brunak S. Dictionary construction and identification of possible adverse drug events in Danish clinical narrative text. J Am Med Inform Assoc 2013; 20:947-53. [PMID: 23703825 PMCID: PMC3756275 DOI: 10.1136/amiajnl-2013-001708] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.7] [Indexed: 11/04/2022]
Abstract
OBJECTIVE Drugs have tremendous potential to cure and relieve disease, but the risk of unintended effects is always present. Healthcare providers increasingly record data in electronic patient records (EPRs), in which we aim to identify possible adverse events (AEs) and, specifically, possible adverse drug events (ADEs). MATERIALS AND METHODS Based on the undesirable effects section from the summary of product characteristics (SPC) of 7446 drugs, we have built a Danish ADE dictionary. Starting from this dictionary we have developed a pipeline for identifying possible ADEs in unstructured clinical narrative text. We use a named entity recognition (NER) tagger to identify dictionary matches in the text and post-coordination rules to construct ADE compound terms. Finally, we apply post-processing rules and filters to handle, for example, negations and sentences about subjects other than the patient. Moreover, this method allows synonyms to be identified and anatomical location descriptions can be merged to allow appropriate grouping of effects in the same location. RESULTS The method identified 1 970 731 (35 477 unique) possible ADEs in a large corpus of 6011 psychiatric hospital patient records. Validation was performed through manual inspection of possible ADEs, resulting in precision of 89% and recall of 75%. DISCUSSION The presented dictionary-building method could be used to construct other ADE dictionaries. The complication of compound words in Germanic languages was addressed. Additionally, the synonym and anatomical location collapse improve the method. CONCLUSIONS The developed dictionary and method can be used to identify possible ADEs in Danish clinical narratives.
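The pipeline described above (dictionary matching, synonym collapse, then filters for negation and for sentences about someone other than the patient) can be illustrated with a deliberately small sketch. It is not the authors' NER tagger or rule set: the English mini-dictionary, the cue lists and the sentence splitter are hypothetical stand-ins chosen only to show the order of the steps.

```python
import re

# Hypothetical mini-dictionary: surface form -> preferred term (synonyms collapse).
ADE_DICTIONARY = {
    "nausea": "nausea",
    "feeling sick": "nausea",
    "headache": "headache",
}
# Hypothetical cue lists; a real system would need far richer rules.
NEGATION_CUES = re.compile(r"\b(no|not|denies|without)\b")
OTHER_SUBJECT_CUES = re.compile(r"\b(mother|father|brother|sister)\b")

def find_possible_ades(text):
    """Return preferred terms for dictionary hits that survive the filters."""
    hits = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.lower()):
        # Filter 1: skip sentences about a subject other than the patient.
        if OTHER_SUBJECT_CUES.search(sentence):
            continue
        for surface, preferred in ADE_DICTIONARY.items():
            match = re.search(r"\b" + re.escape(surface) + r"\b", sentence)
            if not match:
                continue
            # Filter 2: skip hits preceded by a negation cue in the same sentence.
            if NEGATION_CUES.search(sentence[:match.start()]):
                continue
            hits.append(preferred)
    return hits

note = ("Patient reports nausea after the first dose. No headache. "
        "Her mother had headache. Later feeling sick again.")
print(find_possible_ades(note))
```

Mapping every surface form to a preferred term before counting is what lets "nausea" and "feeling sick" be grouped as one effect, mirroring the synonym collapse mentioned in the abstract.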
Affiliation(s)
- Robert Eriksson
- Department of Disease Systems Biology, Faculty of Health and Medical Sciences, NNF Center for Protein Research, University of Copenhagen, Copenhagen, Denmark
14
A controlled greedy supervised approach for co-reference resolution on clinical text. J Biomed Inform 2013; 46:506-15. [PMID: 23562650 DOI: 10.1016/j.jbi.2013.03.007] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 08/03/2012] [Revised: 03/24/2013] [Accepted: 03/26/2013] [Indexed: 11/22/2022]
Abstract
Identification of co-referent entity mentions inside text has significant importance for other natural language processing (NLP) tasks (e.g. event linking). However, this task, known as co-reference resolution, remains a complex problem, partly because of the confusion over different evaluation metrics and partly because the well-researched existing methodologies do not perform well on new domains such as clinical records. This paper presents a variant of the influential mention-pair model for co-reference resolution. Using a series of linguistically and semantically motivated constraints, the proposed approach controls generation of less-informative/sub-optimal training and test instances. Additionally, the approach also introduces some aggressive greedy strategies in chain clustering. The proposed approach has been tested on the official test corpus of the recently held i2b2/VA 2011 challenge. It achieves an unweighted average F1 score of 0.895, calculated from multiple evaluation metrics (MUC, B³ and CEAF scores). These results are comparable to the best systems of the challenge. What makes our proposed system distinct is that it also achieves high average F1 scores for each individual chain type (TEST: 0.897, PERSON: 0.852, PROBLEM: 0.855, TREATMENT: 0.884). Unlike other works, it obtains good scores for each of the individual metrics rather than being biased towards a particular metric.
15
Bennani-Baiti B, Bennani-Baiti IM. Gene symbol precision. Gene 2012; 491:103-9. [DOI: 10.1016/j.gene.2011.09.035] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 07/06/2011] [Revised: 09/21/2011] [Accepted: 09/29/2011] [Indexed: 11/26/2022]
16
Jimeno-Yepes AJ, McInnes BT, Aronson AR. Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation. BMC Bioinformatics 2011; 12:223. [PMID: 21635749 PMCID: PMC3123611 DOI: 10.1186/1471-2105-12-223] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.4] [Received: 12/09/2010] [Accepted: 06/02/2011] [Indexed: 11/10/2022]
Abstract
BACKGROUND Evaluation of Word Sense Disambiguation (WSD) methods in the biomedical domain is difficult because the available resources are either too small or too focused on specific types of entities (e.g. diseases or genes). We present a method that can be used to automatically develop a WSD test collection using the Unified Medical Language System (UMLS) Metathesaurus and the manual MeSH indexing of MEDLINE. We demonstrate the use of this method by developing such a data set, called MSH WSD. METHODS In our method, the Metathesaurus is first screened to identify ambiguous terms whose possible senses consist of two or more MeSH headings. We then use each ambiguous term and its corresponding MeSH heading to extract MEDLINE citations where the term and only one of the MeSH headings co-occur. The term found in the MEDLINE citation is automatically assigned the UMLS CUI linked to the MeSH heading. Each instance has been assigned a UMLS Concept Unique Identifier (CUI). We compare the characteristics of the MSH WSD data set to the previously existing NLM WSD data set. RESULTS The resulting MSH WSD data set consists of 106 ambiguous abbreviations, 88 ambiguous terms and 9 which are a combination of both, for a total of 203 ambiguous entities. For each ambiguous term/abbreviation, the data set contains a maximum of 100 instances per sense obtained from MEDLINE. We evaluated the reliability of the MSH WSD data set using existing knowledge-based methods and compared their performance to that of the results previously obtained by these algorithms on the pre-existing data set, NLM WSD. We show that the knowledge-based methods achieve different results but keep their relative performance except for the Journal Descriptor Indexing (JDI) method, whose performance is below the other methods. CONCLUSIONS The MSH WSD data set allows the evaluation of WSD algorithms in the biomedical domain.
Compared to previously existing data sets, MSH WSD contains a larger number of biomedical terms/abbreviations and covers the largest set of UMLS Semantic Types. Furthermore, the MSH WSD data set has been generated automatically reusing already existing annotations and, therefore, can be regenerated from subsequent UMLS versions.
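The instance-selection step described in the METHODS of this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `label_citations`, the dictionary layout of a citation, the whole-word matching rule and the placeholder identifiers `CUI-A` and `CUI-B` are all assumptions made for illustration.

```python
import re

def label_citations(term, cui_to_heading, citations):
    """Keep citations that mention `term` and are indexed with exactly
    one of the candidate MeSH headings; label them with that sense."""
    heading_to_cui = {heading: cui for cui, heading in cui_to_heading.items()}
    # Assumption: a case-insensitive whole-word match counts as a mention.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    labeled = []
    for citation in citations:
        if not pattern.search(citation["text"]):
            continue  # the ambiguous term does not occur in the text
        hits = heading_to_cui.keys() & citation["mesh"]
        if len(hits) == 1:  # only one candidate heading co-occurs
            labeled.append((citation["pmid"], heading_to_cui[hits.pop()]))
    return labeled

# Toy data; the CUI values are placeholders, not real UMLS identifiers.
senses = {"CUI-A": "Cold Temperature", "CUI-B": "Common Cold"}
citations = [
    {"pmid": "1", "text": "Exposure to cold impairs dexterity.",
     "mesh": {"Cold Temperature", "Humans"}},
    {"pmid": "2", "text": "Cold symptoms in children.",
     "mesh": {"Common Cold", "Child"}},
    {"pmid": "3", "text": "Cold air and the common cold.",
     "mesh": {"Cold Temperature", "Common Cold"}},
    {"pmid": "4", "text": "Rhinovirus infection.",
     "mesh": {"Common Cold"}},
]
labeled = label_citations("cold", senses, citations)
```

In this toy run, citation 3 is discarded because both candidate headings are present, and citation 4 because the term itself does not occur, which mirrors the "term and only one of the MeSH headings co-occur" condition.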
|
17
|
Disambiguation in the biomedical domain: The role of ambiguity type. J Biomed Inform 2010; 43:972-81. [DOI: 10.1016/j.jbi.2010.08.009] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 03/22/2010] [Revised: 08/19/2010] [Accepted: 08/20/2010] [Indexed: 10/19/2022]
|
18
|
Okazaki N, Ananiadou S, Tsujii J. Building a high-quality sense inventory for improved abbreviation disambiguation. Bioinformatics 2010; 26:1246-53. [PMID: 20360059 PMCID: PMC2859134 DOI: 10.1093/bioinformatics/btq129] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.3] [Indexed: 12/04/2022]
Abstract
Motivation: The ultimate goal of abbreviation management is to disambiguate every occurrence of an abbreviation into its expanded form (concept or sense). To collect expanded forms for abbreviations, previous studies have recognized abbreviations and their expanded forms in parenthetical expressions of bio-medical texts. However, expanded forms extracted by abbreviation recognition are mixtures of concepts/senses and their term variations. Consequently, a list of expanded forms should be structured into a sense inventory, which provides possible concepts or senses for abbreviation disambiguation. Results: A sense inventory is a key to robust management of abbreviations. Therefore, we present a supervised approach for clustering expanded forms. The experimental result reports 0.915 F1 score in clustering expanded forms. We then investigate the possibility of conflicts of protein and gene names with abbreviations. Finally, an experiment of abbreviation disambiguation on the sense inventory yielded 0.984 accuracy and 0.986 F1 score using the dataset obtained from MEDLINE abstracts. Availability: The sense inventory and disambiguator of abbreviations are accessible at http://www.nactem.ac.uk/software/acromine/ and http://www.nactem.ac.uk/software/acromine_disambiguation/ Contact: okazaki@chokkan.org
Affiliation(s)
- Naoaki Okazaki
- Graduate School of Information Science and Technology, University of Tokyo, Tokyo, Japan.
|
19
|
Abstract
Clinical coding is variable in UK general practice. The reasons for this remain undefined. This review explains why there are no readily available alternatives to recording structured clinical data and reviews the barriers to recording structured clinical data. Methods used included a literature review of bibliographic databases, university health informatics departments, and national and international medical informatics associations. The results show that the current state of development of computers and data processing means there is no practical alternative to coding data. The identified barriers to clinical coding are: the limitations of the coding systems and terminologies and the skill gap in their use; recording structured data in the consultation takes time and is distracting; the level of motivation of primary care professionals; and the priority within the organization. A taxonomy is proposed to describe the barriers to clinical coding. This can be used to identify barriers to coding and facilitate the development of strategies to overcome them.
|
20
|
Farkas R. The strength of co-authorship in gene name disambiguation. BMC Bioinformatics 2008; 9:69. [PMID: 18230174 PMCID: PMC2262057 DOI: 10.1186/1471-2105-9-69] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 09/18/2007] [Accepted: 01/29/2008] [Indexed: 12/04/2022]
Abstract
Background A biomedical entity mention in articles and other free texts is often ambiguous. For example, 13% of the gene names (aliases) might refer to more than one gene. The task of Gene Symbol Disambiguation (GSD) – a special case of Word Sense Disambiguation (WSD) – is to assign a unique gene identifier for all identified gene name aliases in biology-related articles. Supervised and unsupervised machine learning WSD techniques have been applied in the biomedical field with promising results. We examine here the utilisation potential of the fact – one of the special features of biological articles – that the authors of the documents are known through graph-based semi-supervised methods for the GSD task. Results Our key hypothesis is that a biologist refers to each particular gene by a fixed gene alias and this holds for the co-authors as well. To make use of the co-authorship information we decided to build the inverse co-author graph on MedLine abstracts. The nodes of the inverse co-author graph are articles and there is an edge between two nodes if and only if the two articles have a mutual author. We introduce here two methods using distances (based on the graph) of abstracts for the GSD task. We found that a disambiguation decision can be made in 85% of cases with an extremely high (99.5%) precision rate just by using information obtained from the inverse co-author graph. We incorporated the co-authorship information into two GSD systems in order to attain full coverage and in experiments our procedure achieved precision of 94.3%, 98.85%, 96.05% and 99.63% on the human, mouse, fly and yeast GSD evaluation sets, respectively. 
Conclusion Based on the promising results obtained so far we suggest that the co-authorship information and the circumstances of the articles' release (like the title of the journal, the year of publication) can be a crucial building block of any sophisticated similarity measure among biological articles and hence the methods introduced here should be useful for other biomedical natural language processing tasks (like organism or target disease detection) as well.
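The inverse co-author graph defined in this abstract (articles as nodes, an edge if and only if two articles share an author) can be sketched as follows. The abstract does not specify the two distance-based methods, so the shortest-path distance used here is only one plausible choice, and the function names, article identifiers and author names are hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations

def build_inverse_coauthor_graph(article_authors):
    """Nodes are articles; two articles are linked iff they share an author."""
    articles_by_author = defaultdict(set)
    for article, authors in article_authors.items():
        for author in authors:
            articles_by_author[author].add(article)
    graph = {article: set() for article in article_authors}
    for articles in articles_by_author.values():
        for u, v in combinations(sorted(articles), 2):
            graph[u].add(v)
            graph[v].add(u)
    return graph

def graph_distance(graph, source, target):
    """Breadth-first shortest-path length, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

# Toy corpus with made-up identifiers and author names.
article_authors = {
    "A1": {"Smith", "Lee"},
    "A2": {"Lee", "Kovacs"},
    "A3": {"Kovacs"},
    "A4": {"Nagy"},
}
graph = build_inverse_coauthor_graph(article_authors)
```

Under the abstract's hypothesis that co-authors use a fixed alias for each gene, an ambiguous alias in one article could then be resolved toward the sense used in the nearest article whose sense is already known; that decision rule is left out of the sketch because the abstract does not describe it in detail.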
Affiliation(s)
- Richárd Farkas
- Hungarian Academy of Science, Research Group on Artificial Intelligence, Aradi vertanuk tere, Szeged, Hungary.
|
21
|
Torii M, Hu ZZ, Song M, Wu CH, Liu H. A comparison study on algorithms of detecting long forms for short forms in biomedical text. BMC Bioinformatics 2007; 8 Suppl 9:S5. [PMID: 18047706 PMCID: PMC2217663 DOI: 10.1186/1471-2105-8-s9-s5] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Indexed: 11/10/2022]
Abstract
MOTIVATION With more and more research dedicated to literature mining in the biomedical domain, more and more systems are available for people to choose from when building literature mining applications. In this study, we focus on one specific kind of literature mining task, i.e., detecting definitions of acronyms, abbreviations, and symbols in biomedical text. We denote acronyms, abbreviations, and symbols as short forms (SFs) and their corresponding definitions as long forms (LFs). The study was designed to answer the following questions; i) how well a system performs in detecting LFs from novel text, ii) what the coverage is for various terminological knowledge bases in including SFs as synonyms of their LFs, and iii) how to combine results from various SF knowledge bases. METHOD We evaluated the following three publicly available detection systems in detecting LFs for SFs: i) a handcrafted pattern/rule based system by Ao and Takagi, ALICE, ii) a machine learning system by Chang et al., and iii) a simple alignment-based program by Schwartz and Hearst. In addition, we investigated the conceptual coverage of two terminological knowledge bases: i) the UMLS (the Unified Medical Language System), and ii) the BioThesaurus (a thesaurus of names for all UniProt protein records). We also implemented a web interface that provides a virtual integration of various SF knowledge bases. RESULTS We found that detection systems agree with each other on most cases, and the existing terminological knowledge bases have a good coverage of synonymous relationship for frequently defined LFs. The web interface allows people to detect SF definitions from text and to search several SF knowledge bases. AVAILABILITY The web site is http://gauss.dbb.georgetown.edu/liblab/SFThesaurus.
Affiliation(s)
- Manabu Torii
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, 4000 Reservoir Rd, NW, Washington, DC 20057, USA.
|
22
|
Erhardt RAA, Schneider R, Blaschke C. Status of text-mining techniques applied to biomedical text. Drug Discov Today 2007; 11:315-25. [PMID: 16580973 DOI: 10.1016/j.drudis.2006.02.011] [Citation(s) in RCA: 82] [Impact Index Per Article: 4.6] [Received: 10/10/2005] [Revised: 02/08/2006] [Accepted: 02/27/2006] [Indexed: 11/16/2022]
Abstract
Scientific progress is increasingly based on knowledge and information. Knowledge is now recognized as the driver of productivity and economic growth, leading to a new focus on the role of information in the decision-making process. Most scientific knowledge is registered in publications and other unstructured representations that make it difficult to use and to integrate the information with other sources (e.g. biological databases). Making a computer understand human language has proven to be a complex achievement, but there are techniques capable of detecting, distinguishing and extracting a limited number of different classes of facts. In the biomedical field, extracting information has specific problems: complex and ever-changing nomenclature (especially genes and proteins) and the limited representation of domain knowledge.
|
23
|
Yu H, Kim W, Hatzivassiloglou V, Wilbur J. A large scale, corpus-based approach for automatically disambiguating biomedical abbreviations. ACM T INFORM SYST 2006. [DOI: 10.1145/1165774.1165778] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 10/24/2022]
Abstract
Abbreviations and acronyms are widely used in the biomedical literature and many of them represent important biomedical concepts. Because many abbreviations are ambiguous (e.g., CAT denotes both chloramphenicol acetyl transferase and computed axial tomography, depending on the context), recognizing the full form associated with each abbreviation is in most cases equivalent to identifying the meaning of the abbreviation. This, in turn, allows us to perform more accurate natural language processing, information extraction, and retrieval. In this study, we have developed supervised approaches to identifying the full forms of ambiguous abbreviations within the context they appear. We first automatically assigned multiple possible full forms for each abbreviation; we then treated the in-context full-form prediction for each specific abbreviation occurrence as a case of word-sense disambiguation. We generated automatically a dictionary of all possible full forms for each abbreviation. We applied supervised machine-learning algorithms for disambiguation. Because some of the links between abbreviations and their corresponding full forms are explicitly given in the text and can be recovered automatically, we can use these explicit links to automatically provide training data for disambiguating the abbreviations that are not linked to a full form within a text. We evaluated our methods on over 150 thousand abstracts and obtain for coverage and precision results of 82% and 92%, respectively, when performed as tenfold cross-validation, and 79% and 80%, respectively, when evaluated against an external set of abstracts in which the abbreviations are not defined.
Affiliation(s)
- Hong Yu
- University of Wisconsin-Milwaukee, Milwaukee, WI
- Won Kim
- National Center for Biotechnology Information, Bethesda, MD
- John Wilbur
- National Center for Biotechnology Information, Bethesda, MD
|
24
|
Liu H, Hu ZZ, Torii M, Wu C, Friedman C. Quantitative assessment of dictionary-based protein named entity tagging. J Am Med Inform Assoc 2006; 13:497-507. [PMID: 16799122 PMCID: PMC1561801 DOI: 10.1197/jamia.m2085] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.4] [Indexed: 11/10/2022]
Abstract
OBJECTIVE Natural language processing (NLP) approaches have been explored to manage and mine information recorded in biological literature. A critical step for biological literature mining is biological named entity tagging (BNET) that identifies names mentioned in text and normalizes them with entries in biological databases. The aim of this study was to provide quantitative assessment of the complexity of BNET on protein entities through BioThesaurus, a thesaurus of gene/protein names for UniProt knowledgebase (UniProtKB) entries that was acquired using online resources. METHODS We evaluated the complexity through several perspectives: ambiguity (i.e., the number of genes/proteins represented by one name), synonymy (i.e., the number of names associated with the same gene/protein), and coverage (i.e., the percentage of gene/protein names in text included in the thesaurus). We also normalized names in BioThesaurus and measures were obtained twice, once before normalization and once after. RESULTS The current version of BioThesaurus has over 2.6 million names or 2.1 million normalized names covering more than 1.8 million UniProtKB entries. The average synonymy is 3.53 (2.86 after normalization), ambiguity is 2.31 before normalization and 2.32 after, while the coverage is 94.0% based on the BioCreAtive data set comprising MEDLINE abstracts containing genes/proteins. CONCLUSION The study indicated that names for genes/proteins are highly ambiguous and there are usually multiple names for the same gene or protein. It also demonstrated that most gene/protein names appearing in text can be found in BioThesaurus.
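The three measures defined in the METHODS of this abstract (ambiguity, synonymy and coverage) can be made concrete with a small worked example. This is a sketch under assumptions: the abstract does not state how the averages were taken, so plain means over distinct names and over distinct entries are assumed here, and the function name, the entry identifiers `E1` to `E3` and the toy name list are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def thesaurus_stats(name_entry_pairs, names_in_text):
    """Average ambiguity, average synonymy and coverage (%) of a thesaurus."""
    entries_by_name = defaultdict(set)
    names_by_entry = defaultdict(set)
    for name, entry in name_entry_pairs:
        entries_by_name[name].add(entry)
        names_by_entry[entry].add(name)
    # Ambiguity: number of entries represented by one name, averaged over names.
    ambiguity = mean(len(entries) for entries in entries_by_name.values())
    # Synonymy: number of names attached to one entry, averaged over entries.
    synonymy = mean(len(names) for names in names_by_entry.values())
    # Coverage: share of names seen in text that the thesaurus contains.
    found = sum(1 for name in names_in_text if name in entries_by_name)
    coverage = 100.0 * found / len(names_in_text)
    return ambiguity, synonymy, coverage

# Toy thesaurus; entry identifiers are placeholders, not UniProtKB accessions.
pairs = [
    ("p53", "E1"), ("TP53", "E1"),
    ("CAT", "E2"), ("CAT", "E3"),
    ("catalase", "E2"),
]
ambiguity, synonymy, coverage = thesaurus_stats(
    pairs, ["p53", "CAT", "BRCA1", "catalase"]
)
```

Here "CAT" maps to two entries while the other three names map to one each, so the average ambiguity is 5/4; entries E1 and E2 have two names and E3 has one, so the average synonymy is 5/3; three of the four names found in text are in the thesaurus, giving 75% coverage.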
Affiliation(s)
- Hongfang Liu
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, DC 20007, USA.
|
25
|
Using MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles. J Biomed Inform 2006; 40:150-9. [PMID: 16843731 DOI: 10.1016/j.jbi.2006.06.001] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Received: 02/22/2006] [Revised: 06/02/2006] [Accepted: 06/02/2006] [Indexed: 11/17/2022]
Abstract
Biomedical abbreviations and acronyms are widely used in biomedical literature. Since many of them represent important content in biomedical literature, information retrieval and extraction benefit from identifying the meanings of those terms. However, many abbreviations and acronyms are ambiguous, so it is important to map them to their full forms, which ultimately represent the meanings of the abbreviations. In this study, we present a semi-supervised method that applies MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles. We first automatically generated from the MEDLINE abstracts a dictionary of abbreviation-full pairs based on a rule-based system that maps abbreviations to full forms when full forms are defined in the abstracts. We then trained on the MEDLINE abstracts and predicted the full forms of abbreviations in full-text journal articles by applying supervised machine-learning algorithms in a semi-supervised fashion. We report up to 92% prediction precision and up to 91% coverage.
|
26
|
Abstract
Background Accuracy of document retrieval from MEDLINE for gene queries is crucially important for many applications in bioinformatics. We explore five information retrieval-based methods to rank documents retrieved by PubMed gene queries for the human genome. The aim is to rank relevant documents higher in the retrieved list. We address the special challenges faced due to ambiguity in gene nomenclature: gene terms that refer to multiple genes, gene terms that are also English words, and gene terms that have other biological meanings. Results Our two baseline ranking strategies are quite similar in performance. Two of our three LocusLink-based strategies offer significant improvements. These methods work very well even when there is ambiguity in the gene terms. Our best ranking strategy offers significant improvements on three different kinds of ambiguities over our two baseline strategies (improvements range from 15.9% to 17.7% and 11.7% to 13.3% depending on the baseline). For most genes the best ranking query is one that is built from the LocusLink (now Entrez Gene) summary and product information along with the gene names and aliases. For others, the gene names and aliases suffice. We also present an approach that successfully predicts, for a given gene, which of these two ranking queries is more appropriate. Conclusion We explore the effect of different post-retrieval strategies on the ranking of documents returned by PubMed for human gene queries. We have successfully applied some of these strategies to improve the ranking of relevant documents in the retrieved sets. This holds true even when various kinds of ambiguity are encountered. We feel that it would be very useful to apply strategies like ours on PubMed search results as these are not ordered by relevance in any way. This is especially so for queries that retrieve a large number of documents.
Affiliation(s)
- Aditya K Sehgal
- Department of Computer Science, The University of Iowa, Iowa City, IA 52246, USA
- Padmini Srinivasan
- Department of Computer Science, The University of Iowa, Iowa City, IA 52246, USA
- School of Library and Information Science, The University of Iowa, Iowa City, IA 52246, USA
|
27
|
|
28
|
Schuemie MJ, Kors JA, Mons B. Word sense disambiguation in the biomedical domain: an overview. J Comput Biol 2005; 12:554-65. [PMID: 15952878 DOI: 10.1089/cmb.2005.12.554] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.4] [Indexed: 11/12/2022]
Abstract
There is a trend towards automatic analysis of large amounts of literature in the biomedical domain. However, this can be effective only if the ambiguity in natural language is resolved. In this paper, the current state of research in word sense disambiguation (WSD) is reviewed. Several methods for WSD have already been proposed, but many systems have been tested only on evaluation sets of limited size. There are currently only very few applications of WSD in the biomedical domain. The current direction of research points towards statistically based algorithms that use existing curated data and can be applied to large sets of biomedical literature. There is a need for manually tagged evaluation sets to test WSD algorithms in the biomedical domain. WSD algorithms should preferably be able to take into account both known and unknown senses of a word. Without WSD, automatic metaanalysis of large corpora of text will be error prone.
Affiliation(s)
- Martijn J Schuemie
- Biosemantics Group, Medical Informatics Department, Erasmus Medical Center, Dr. Molewaterplein 50, 3015 GE, Rotterdam, The Netherlands.
|
29
|
Leroy G, Rindflesch TC. Effects of information and machine learning algorithms on word sense disambiguation with small datasets. Int J Med Inform 2005; 74:573-85. [PMID: 15897005 DOI: 10.1016/j.ijmedinf.2005.03.013] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Received: 11/16/2004] [Revised: 02/17/2005] [Accepted: 03/17/2005] [Indexed: 11/26/2022]
Abstract
Current approaches to word sense disambiguation use (and often combine) various machine learning techniques. Most refer to characteristics of the ambiguity and its surrounding words and are based on thousands of examples. Unfortunately, developing large training sets is burdensome, and in response to this challenge, we investigate the use of symbolic knowledge for small datasets. A naïve Bayes classifier was trained for 15 words with 100 examples for each. Unified Medical Language System (UMLS) semantic types assigned to concepts found in the sentence and relationships between these semantic types form the knowledge base. The most frequent sense of a word served as the baseline. The effect of increasingly accurate symbolic knowledge was evaluated in nine experimental conditions. Performance was measured by accuracy based on 10-fold cross-validation. The best condition used only the semantic types of the words in the sentence. Accuracy was then on average 10% higher than the baseline; however, it varied from 8% deterioration to 29% improvement. To investigate this large variance, we performed several follow-up evaluations, testing additional algorithms (decision tree and neural network), and gold standards (per expert), but the results did not significantly differ. However, we noted a trend that the best disambiguation was found for words that were the least troublesome to the human evaluators. We conclude that neither algorithm nor individual human behavior cause these large differences, but that the structure of the UMLS Metathesaurus (used to represent senses of ambiguous words) contributes to inaccuracies in the gold standard, leading to varied performance of word sense disambiguation techniques.
Affiliation(s)
- Gondy Leroy
- School of Information Science, Claremont Graduate University, 130 E. Ninth Street, Claremont, CA 91711, USA.
|
30
|
Schijvenaars BJA, Mons B, Weeber M, Schuemie MJ, van Mulligen EM, Wain HM, Kors JA. Thesaurus-based disambiguation of gene symbols. BMC Bioinformatics 2005; 6:149. [PMID: 15958172 PMCID: PMC1183190 DOI: 10.1186/1471-2105-6-149] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.4] [Received: 11/22/2004] [Accepted: 06/16/2005] [Indexed: 11/28/2022]
Abstract
Background Massive text mining of the biological literature holds great promise of relating disparate information and discovering new knowledge. However, disambiguation of gene symbols is a major bottleneck. Results We developed a simple thesaurus-based disambiguation algorithm that can operate with very little training data. The thesaurus comprises the information from five human genetic databases and MeSH. The extent of the homonym problem for human gene symbols is shown to be substantial (33% of the genes in our combined thesaurus had one or more ambiguous symbols), not only because one symbol can refer to multiple genes, but also because a gene symbol can have many non-gene meanings. A test set of 52,529 Medline abstracts, containing 690 ambiguous human gene symbols taken from OMIM, was automatically generated. Overall accuracy of the disambiguation algorithm was up to 92.7% on the test set. Conclusion The ambiguity of human gene symbols is substantial, not only because one symbol may denote multiple genes but particularly because many symbols have other, non-gene meanings. The proposed disambiguation approach resolves most ambiguities in our test set with high accuracy, including the important gene/not a gene decisions. The algorithm is fast and scalable, enabling gene-symbol disambiguation in massive text mining applications.
Affiliation(s)
- Bob JA Schijvenaars
- Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
- Barend Mons
- Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
- Marc Weeber
- Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
- Martijn J Schuemie
- Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
- Erik M van Mulligen
- Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
- Hester M Wain
- HUGO Gene Nomenclature Committee, Department of Biology, University College London, Wolfson House, 4 Stephenson Way, London NW1 2HE, UK
- Jan A Kors
- Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
|
31
|
Jelier R, Jenster G, Dorssers LCJ, van der Eijk CC, van Mulligen EM, Mons B, Kors JA. Co-occurrence based meta-analysis of scientific texts: retrieving biological relationships between genes. Bioinformatics 2005; 21:2049-58. [PMID: 15657104 DOI: 10.1093/bioinformatics/bti268] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.2] [Indexed: 11/14/2022]
Abstract
MOTIVATION The advent of high-throughput experiments in molecular biology creates a need for methods to efficiently extract and use information for large numbers of genes. Recently, the associative concept space (ACS) has been developed for the representation of information extracted from biomedical literature. The ACS is a Euclidean space in which thesaurus concepts are positioned and the distances between concepts indicates their relatedness. The ACS uses co-occurrence of concepts as a source of information. In this paper we evaluate how well the system can retrieve functionally related genes and we compare its performance with a simple gene co-occurrence method. RESULTS To assess the performance of the ACS we composed a test set of five groups of functionally related genes. With the ACS good scores were obtained for four of the five groups. When compared to the gene co-occurrence method, the ACS is capable of revealing more functional biological relations and can achieve results with less literature available per gene. Hierarchical clustering was performed on the ACS output, as a potential aid to users, and was found to provide useful clusters. Our results suggest that the algorithm can be of value for researchers studying large numbers of genes. AVAILABILITY The ACS program is available upon request from the authors.
Affiliation(s)
- R Jelier
- Department of Medical Informatics, Erasmus MC-University Medical Center, Rotterdam, The Netherlands.
|
32
|
Friedman C, Shagina L, Lussier Y, Hripcsak G. Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc 2004; 11:392-402. [PMID: 15187068 PMCID: PMC516246 DOI: 10.1197/jamia.m1552] [Citation(s) in RCA: 301] [Impact Index Per Article: 14.3] [Received: 02/04/2004] [Accepted: 04/13/2004] [Indexed: 11/10/2022]
Abstract
OBJECTIVE The aim of this study was to develop a method based on natural language processing (NLP) that automatically maps an entire clinical document to codes with modifiers and to quantitatively evaluate the method. METHODS An existing NLP system, MedLEE, was adapted to automatically generate codes. The method involves matching of structured output generated by MedLEE consisting of findings and modifiers to obtain the most specific code. Recall and precision applied to Unified Medical Language System (UMLS) coding were evaluated in two separate studies. Recall was measured using a test set of 150 randomly selected sentences, which were processed using MedLEE. Results were compared with a reference standard determined manually by seven experts. Precision was measured using a second test set of 150 randomly selected sentences from which UMLS codes were automatically generated by the method and then validated by experts. RESULTS Recall of the system for UMLS coding of all terms was .77 (95% CI.72-.81), and for coding terms that had corresponding UMLS codes recall was .83 (.79-.87). Recall of the system for extracting all terms was .84 (.81-.88). Recall of the experts ranged from .69 to .91 for extracting terms. The precision of the system was .89 (.87-.91), and precision of the experts ranged from .61 to .91. CONCLUSION Extraction of relevant clinical information and UMLS coding were accomplished using a method based on NLP. The method appeared to be comparable to or better than six experts. The advantage of the method is that it maps text to codes along with other related information, rendering the coded output suitable for effective retrieval.
Affiliation(s)
- Carol Friedman
- Department of Biomedical Informatics, Columbia University, 622 West 168 Street, VC-5, New York, NY 10032, USA.
|
33
|
Liu H, Teller V, Friedman C. A multi-aspect comparison study of supervised word sense disambiguation. J Am Med Inform Assoc 2004; 11:320-31. [PMID: 15064284 PMCID: PMC436083 DOI: 10.1197/jamia.m1533] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.1] [Indexed: 11/10/2022]
Abstract
OBJECTIVE The aim of this study was to investigate relations among different aspects in supervised word sense disambiguation (WSD; supervised machine learning for disambiguating the sense of a term in a context) and compare supervised WSD in the biomedical domain with that in the general English domain. METHODS The study involves three data sets (a biomedical abbreviation data set, a general biomedical term data set, and a general English data set). The authors implemented three machine-learning algorithms, including (1) naïve Bayes (NBL) and decision lists (TDLL), (2) their adaptation of decision lists (ODLL), and (3) their mixed supervised learning (MSL). There were six feature representations (various combinations of collocations, bag of words, oriented bag of words, etc.) and five window sizes (2, 4, 6, 8, and 10). RESULTS Supervised WSD is suitable only when there are enough sense-tagged instances with at least a few dozens of instances for each sense. Collocations combined with neighboring words are appropriate selections for the context. For terms with unrelated biomedical senses, a large window size such as the whole paragraph should be used, while for general English words a moderate window size between 4 and 10 should be used. The performance of the authors' implementation of decision list classifiers for abbreviations was better than that of traditional decision list classifiers. However, the opposite held for the other two sets. Also, the authors' mixed supervised learning was stable and generally better than others for all sets. CONCLUSION From this study, it was found that different aspects of supervised WSD depend on each other. The experiment method presented in the study can be used to select the best supervised WSD classifier for each ambiguous term.
Affiliation(s)
- Hongfang Liu
- Department of Information Systems, University of Maryland at Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250, USA.
|
34
|
Rzhetsky A, Iossifov I, Koike T, Krauthammer M, Kra P, Morris M, Yu H, Duboué PA, Weng W, Wilbur WJ, Hatzivassiloglou V, Friedman C. GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data. J Biomed Inform 2004; 37:43-53. [PMID: 15016385 DOI: 10.1016/j.jbi.2003.10.001] [Citation(s) in RCA: 144] [Impact Index Per Article: 6.9] [Received: 08/13/2003] [Indexed: 11/16/2022]
Abstract
The immense growth in the volume of research literature and experimental data in the field of molecular biology calls for efficient automatic methods to capture and store information. In recent years, several groups have worked on specific problems in this area, such as automated selection of articles pertinent to molecular biology, or automated extraction of information using natural-language processing, information visualization, and generation of specialized knowledge bases for molecular biology. GeneWays is an integrated system that combines several such subtasks. It analyzes interactions between molecular substances, drawing on multiple sources of information to infer a consensus view of molecular networks. GeneWays is designed as an open platform, allowing researchers to query, review, and critique stored information.
Affiliation(s)
- Andrey Rzhetsky
- Columbia Genome Center, Columbia University, New York, NY 10032, USA.
|
35
|
Leroy G, Chen H, Martinez JD. A shallow parser based on closed-class words to capture relations in biomedical text. J Biomed Inform 2003; 36:145-58. [PMID: 14615225 DOI: 10.1016/s1532-0464(03)00039-x] [Citation(s) in RCA: 60] [Impact Index Per Article: 2.7] [Indexed: 11/30/2022]
Abstract
Natural language processing for biomedical text currently focuses mostly on entity and relation extraction. These entities and relations are usually pre-specified entities, e.g., proteins, and pre-specified relations, e.g., inhibit relations. A shallow parser that captures the relations between noun phrases automatically from free text has been developed and evaluated. It uses heuristics and a noun phraser to capture entities of interest in the text. Cascaded finite state automata structure the relations between individual entities. The automata are based on closed-class English words and model generic relations not limited to specific words. The parser also recognizes coordinating conjunctions and captures negation in text, a feature usually ignored by others. Three cancer researchers evaluated 330 relations extracted from 26 abstracts of interest to them. There were 296 relations correctly extracted from the abstracts, resulting in 90% precision of the relations and an average of 11 correct relations per abstract.
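The evaluation figures in this abstract follow directly from the three counts it reports. The short check below uses only those counts; the variable names are illustrative.

```python
# Counts reported in the abstract.
extracted_relations = 330   # relations evaluated by the three researchers
correct_relations = 296     # relations judged correctly extracted
abstracts = 26              # abstracts the relations came from

# Precision: share of extracted relations that were correct.
precision = correct_relations / extracted_relations
# Average number of correct relations per abstract.
correct_per_abstract = correct_relations / abstracts

print(f"precision = {precision:.1%}")                      # 89.7%, reported as 90%
print(f"correct per abstract = {correct_per_abstract:.1f}")  # 11.4, reported as 11
```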
Affiliation(s)
- Gondy Leroy
- Management Information Systems, The University of Arizona, McClelland Hall, Room 430, 1130 E. Helen St., Tucson, AZ 85721, USA.
|
36
|
Abstract
Literature mining is the process of extracting and combining facts from scientific publications. In recent years, many computer programs have been designed to extract various molecular biology findings from Medline abstracts or full-text articles. The present article describes the range of text mining techniques that have been applied to scientific documents. It divides 'automated reading' into four general subtasks: text categorization, named entity tagging, fact extraction, and collection-wide analysis. Literature mining offers powerful methods to support knowledge discovery and the construction of topic maps and ontologies. An overview is given of recent developments in medical language processing. Special attention is given to the domain particularities of molecular biology, and the emerging synergy between literature mining and molecular databases accessible through the Internet.
Affiliation(s)
- Berry de Bruijn
- Institute for Information Technology, National Research Council, Montreal Road Bldg M50, Ottawa, Ont, Canada K1A 0R6.
|
37
|
Liu H, Johnson SB, Friedman C. Automatic resolution of ambiguous terms based on machine learning and conceptual relations in the UMLS. J Am Med Inform Assoc 2002; 9:621-36. [PMID: 12386113 PMCID: PMC349379 DOI: 10.1197/jamia.m1101] [Citation(s) in RCA: 59] [Impact Index Per Article: 2.6] [Indexed: 11/10/2022] Open
Abstract
MOTIVATION The UMLS has been used in natural language processing applications such as information retrieval and information extraction systems. The mapping of free text to UMLS concepts is important for these applications. To improve the mapping, we need a method to disambiguate terms that map to multiple UMLS concepts. In the general English domain, machine-learning techniques have been applied to sense-tagged corpora, in which senses (or concepts) of ambiguous terms have been annotated (mostly manually). Sense disambiguation classifiers are then derived to determine senses (or concepts) of those ambiguous terms automatically. However, manual annotation of a corpus is an expensive task. We propose an automatic method that constructs sense-tagged corpora for ambiguous terms in the UMLS using MEDLINE abstracts. METHODS For a term W that represents multiple UMLS concepts, a collection of MEDLINE abstracts that contain W is extracted. For each abstract in the collection, occurrences of concepts that have relations with W as defined in the UMLS are automatically identified. A sense-tagged corpus, in which senses of W are annotated, is then derived based on those identified concepts. The method was evaluated on a set of 35 frequently occurring ambiguous biomedical abbreviations using a gold standard set that was automatically derived. The quality of the derived sense-tagged corpus was measured using precision and recall. RESULTS The derived sense-tagged corpus had an overall precision of 92.9% and an overall recall of 47.4%. After removing rare senses and ignoring abbreviations with closely related senses, the overall precision was 96.8% and the overall recall was 50.6%. CONCLUSIONS UMLS conceptual relations and MEDLINE abstracts can be used to automatically acquire knowledge needed for resolving ambiguity when mapping free text to UMLS concepts.
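The idea of tagging an occurrence of W by the related concepts found in the same abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' procedure: the related-term lists are hand-written stand-ins rather than real UMLS relations, matching is by plain substring, and the rule "tag only when exactly one sense has supporting evidence" is an assumed simplification. The helper names `tag_sense` and `precision_recall` are hypothetical.

```python
def tag_sense(text, related_terms):
    """Return the single sense whose related terms occur in `text`, else None."""
    lowered = text.lower()
    matched = [
        sense
        for sense, terms in related_terms.items()
        if any(term in lowered for term in terms)
    ]
    # Leave the instance untagged when evidence is absent or conflicting.
    return matched[0] if len(matched) == 1 else None


def precision_recall(predicted, gold):
    """Precision over tagged instances; recall over all gold instances."""
    tagged = [(p, g) for p, g in zip(predicted, gold) if p is not None]
    correct = sum(1 for p, g in tagged if p == g)
    precision = correct / len(tagged) if tagged else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical related terms for two senses of the abbreviation "CSF".
related = {
    "cerebrospinal_fluid": ["lumbar puncture", "meningitis", "brain"],
    "colony_stimulating_factor": ["granulocyte", "macrophage", "bone marrow"],
}

# Hypothetical abstracts with hand-assigned gold senses.
corpus = [
    ("CSF obtained by lumbar puncture showed elevated protein.", "cerebrospinal_fluid"),
    ("Granulocyte CSF was given to stimulate bone marrow recovery.", "colony_stimulating_factor"),
    ("CSF levels were measured in all patients.", "cerebrospinal_fluid"),
    ("CSF concentrations rose after treatment.", "colony_stimulating_factor"),
]

predicted = [tag_sense(text, related) for text, _ in corpus]
gold = [sense for _, sense in corpus]
precision, recall = precision_recall(predicted, gold)
print(predicted)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

On this toy corpus the two abstracts without related terms stay untagged, so precision is high while recall is halved. That is one plausible reading of the precision/recall pattern reported in the abstract, not a confirmed explanation of it.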
Affiliation(s)
- Hongfang Liu
- City University of New York, New York, New York 10032, USA.
|