1
Zhou H, Austin R, Lu SC, Silverman GM, Zhou Y, Kilicoglu H, Xu H, Zhang R. Complementary and Integrative Health Information in the literature: its lexicon and named entity recognition. J Am Med Inform Assoc 2024; 31:426-434. [PMID: 37952122 PMCID: PMC10797266 DOI: 10.1093/jamia/ocad216] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/15/2023] [Revised: 10/20/2023] [Accepted: 11/08/2023] [Indexed: 11/14/2023] Open
Abstract
OBJECTIVE To construct an exhaustive Complementary and Integrative Health (CIH) Lexicon (CIHLex) to help better represent the often underrepresented physical and psychological CIH approaches in standard terminologies, and to also apply state-of-the-art natural language processing (NLP) techniques to help recognize them in the biomedical literature. MATERIALS AND METHODS We constructed the CIHLex by integrating various resources, compiling and integrating data from biomedical literature and relevant sources of knowledge. The Lexicon encompasses 724 unique concepts with 885 corresponding unique terms. We matched these concepts to the Unified Medical Language System (UMLS), and we developed and utilized BERT models comparing their efficiency in CIH named entity recognition to well-established models including MetaMap and CLAMP, as well as the large language model GPT3.5-turbo. RESULTS Of the 724 unique concepts in CIHLex, 27.2% could be matched to at least one term in the UMLS. About 74.9% of the mapped UMLS Concept Unique Identifiers were categorized as "Therapeutic or Preventive Procedure." Among the models applied to CIH named entity recognition, BLUEBERT delivered the highest macro-average F1-score of 0.91, surpassing other models. CONCLUSION Our CIHLex significantly augments representation of CIH approaches in biomedical literature. Demonstrating the utility of advanced NLP models, BERT notably excelled in CIH entity recognition. These results highlight promising strategies for enhancing standardization and recognition of CIH terminology in biomedical contexts.
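The macro-average F1-score reported above is the unweighted mean of per-class F1-scores. The following is a minimal sketch of that calculation, assuming per-entity-type counts of true positives, false positives, and false negatives; the function names and the toy counts are illustrative and are not taken from the study.

```python
def f1(tp, fp, fn):
    """F1-score for one class from its confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(counts):
    """Unweighted mean of per-class F1; counts maps class -> (tp, fp, fn)."""
    scores = [f1(tp, fp, fn) for tp, fp, fn in counts.values()]
    return sum(scores) / len(scores)


# Hypothetical counts for two entity types (not data from the paper).
toy_counts = {"type_a": (8, 2, 2), "type_b": (1, 1, 3)}
print(macro_f1(toy_counts))
```

Because every class contributes equally, a rare entity type with poor recall lowers the macro average as much as a frequent one would.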
Affiliation(s)
- Huixue Zhou
  - Institute for Health Informatics, University of Minnesota, Minneapolis, MN, United States
- Robin Austin
  - School of Nursing, University of Minnesota, Minneapolis, MN, United States
- Sheng-Chieh Lu
  - Department of Symptom Research, The University of Texas MD Anderson Cancer Center, Houston, TX, United States
- Greg Marc Silverman
  - Department of Surgery, University of Minnesota, Minneapolis, MN, United States
- Yuqi Zhou
  - Institute for Health Informatics, University of Minnesota, Minneapolis, MN, United States
  - Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis, MN, United States
- Halil Kilicoglu
  - School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL, United States
- Hua Xu
  - Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, United States
- Rui Zhang
  - Department of Surgery, University of Minnesota, Minneapolis, MN, United States
2
Newbury A, Liu H, Idnay B, Weng C. The suitability of UMLS and SNOMED-CT for encoding outcome concepts. J Am Med Inform Assoc 2023; 30:1895-1903. [PMID: 37615994 PMCID: PMC10654851 DOI: 10.1093/jamia/ocad161] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2023] [Revised: 06/14/2023] [Accepted: 08/02/2023] [Indexed: 08/25/2023] Open
Abstract
OBJECTIVE Outcomes are important clinical study information. Despite progress in automated extraction of PICO (Population, Intervention, Comparison, and Outcome) entities from PubMed, rarely are these entities encoded by standard terminology to achieve semantic interoperability. This study aims to evaluate the suitability of the Unified Medical Language System (UMLS) and SNOMED-CT in encoding outcome concepts in randomized controlled trial (RCT) abstracts. MATERIALS AND METHODS We iteratively developed and validated an outcome annotation guideline and manually annotated clinically significant outcome entities in the Results and Conclusions sections of 500 randomly selected RCT abstracts on PubMed. The extracted outcomes were fully, partially, or not mapped to the UMLS via MetaMap based on established heuristics. Manual UMLS browser search was performed for select unmapped outcome entities to further differentiate between UMLS and MetaMap errors. RESULTS Only 44% of 2617 outcome concepts were fully covered in the UMLS, among which 67% were complex concepts that required the combination of 2 or more UMLS concepts to represent them. SNOMED-CT was present as a source in 61% of the fully mapped outcomes. DISCUSSION Domains such as Metabolism and Nutrition, and Infections and Infectious Diseases need expanded outcome concept coverage in the UMLS and MetaMap. Future work is warranted to similarly assess the terminology coverage for P, I, C entities. CONCLUSION Computational representation of clinical outcomes is important for clinical evidence extraction and appraisal and yet faces challenges from the inherent complexity and lack of coverage of these concepts in UMLS and SNOMED-CT, as demonstrated in this study.
Affiliation(s)
- Abigail Newbury
  - Department of Biomedical Informatics, Columbia University, New York City, NY 10032, United States
- Hao Liu
  - Department of Biomedical Informatics, Columbia University, New York City, NY 10032, United States
- Betina Idnay
  - Department of Biomedical Informatics, Columbia University, New York City, NY 10032, United States
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, New York City, NY 10032, United States
3
Chen A, Huang R, Wu E, Han R, Wen J, Li Q, Zhang Z, Shen B. The Generation of a Lung Cancer Health Factor Distribution Using Patient Graphs Constructed From Electronic Medical Records: Retrospective Study. J Med Internet Res 2022; 24:e40361. [PMID: 36427233 PMCID: PMC9736747 DOI: 10.2196/40361] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2022] [Revised: 09/09/2022] [Accepted: 10/25/2022] [Indexed: 11/27/2022] Open
Abstract
BACKGROUND Electronic medical records (EMRs) of patients with lung cancer (LC) capture a variety of health factors. Understanding the distribution of these factors will help identify key factors for risk prediction in preventive screening for LC. OBJECTIVE We aimed to generate an integrated biomedical graph from EMR data and Unified Medical Language System (UMLS) ontology for LC, and to generate an LC health factor distribution from a hospital EMR of approximately 1 million patients. METHODS The data were collected from 2 sets of 1397 patients with and those without LC. A patient-centered health factor graph was plotted with 108,000 standardized data, and a graph database was generated to integrate the graphs of patient health factors and the UMLS ontology. With the patient graph, we calculated the connection delta ratio (CDR) for each of the health factors to measure the relative strength of the factor's relationship to LC. RESULTS The patient graph had 93,000 relations between the 2794 patient nodes and 650 factor nodes. An LC graph with 187 related biomedical concepts and 188 horizontal biomedical relations was plotted and linked to the patient graph. Searching the integrated biomedical graph with any number or category of health factors resulted in graphical representations of relationships between patients and factors, while searches using any patient presented the patient's health factors from the EMR and the LC knowledge graph (KG) from the UMLS in the same graph. Sorting the health factors by CDR in descending order generated a distribution of health factors for LC. The top 70 CDR-ranked factors of disease, symptom, medical history, observation, and laboratory test categories were verified to be concordant with those found in the literature. 
CONCLUSIONS By collecting standardized data of thousands of patients with and those without LC from the EMR, it was possible to generate a hospital-wide patient-centered health factor graph for graph search and presentation. The patient graph could be integrated with the UMLS KG for LC and thus enable hospitals to bring continuously updated international standard biomedical KGs from the UMLS for clinical use in hospitals. CDR analysis of the graph of patients with LC generated a CDR-sorted distribution of health factors, in which the top CDR-ranked health factors were concordant with the literature. The resulting distribution of LC health factors can be used to help personalize risk evaluation and preventive screening recommendations.
Affiliation(s)
- Anjun Chen
  - Institutes for System Genetics, West China Hospital, Chengdu, China
- Ran Huang
  - Institutes for System Genetics, West China Hospital, Chengdu, China
- Erman Wu
  - Institutes for System Genetics, West China Hospital, Chengdu, China
- Jian Wen
  - Guilin Medical University Affiliated Hospital, Guilin, China
- Qinghua Li
  - Guilin Medical University, Guilin, China
- Bairong Shen
  - Institutes for System Genetics, West China Hospital, Chengdu, China
4
Güngör B, Deppenwiese N, Mang JM, Toddenroth D. Analysis of the Representation of Frequent Clinical Attributes in the Unified Medical Language System. Stud Health Technol Inform 2022; 299:217-222. [PMID: 36325866 DOI: 10.3233/shti220987] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
Mapping clinical attributes from hospital information systems to standardized terminologies may allow their scientific reuse for multicenter studies. The Unified Medical Language System (UMLS) defines synonyms in different terminologies, which could be valuable for achieving semantic interoperability between different sites. Here we aim to explore the potential relevance of UMLS concepts and associated semantic relations for widely used clinical terminologies in a German university hospital. To semi-automatically examine a sample of the 200 most frequent codes from Erlangen University Hospital for three relevant terminologies, we implemented a script that queries their UMLS representation and associated mappings via a programming interface. We found that 94% of frequent diagnostic codes were available in UMLS, and that most of these codes could be mapped to other terminologies such as SNOMED CT. We observed that all examined laboratory codes were represented in UMLS, and that various translations to other languages were available for these concepts. The classification that is most widely used in German hospitals for documenting clinical procedures was not originally represented in UMLS, but external mappings to SNOMED CT allowed identifying UMLS entries for 90.5% of frequent codes. Future research could extend this investigation to other code sets and terminologies, or study the potential utility of available mappings for specific applications.
Affiliation(s)
- Baris Güngör
  - Medical Informatics, University Erlangen-Nuremberg, Germany
- Noemi Deppenwiese
  - Medical Center for Information and Communication Technology, University Hospital Erlangen, Germany
- Jonathan M Mang
  - Medical Center for Information and Communication Technology, University Hospital Erlangen, Germany
5
Nguyen V, Bodenreider O. Adding an Attention Layer Improves the Performance of a Neural Network Architecture for Synonymy Prediction in the UMLS Metathesaurus. Stud Health Technol Inform 2022; 290:116-119. [PMID: 35672982 PMCID: PMC9484765 DOI: 10.3233/shti220043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
BACKGROUND Terminology integration at the scale of the UMLS Metathesaurus (i.e., over 200 source vocabularies) remains challenging despite recent advances in ontology alignment techniques based on neural networks. OBJECTIVES To improve the performance of the neural network architecture we developed for predicting synonymy between terms in the UMLS Metathesaurus, specifically through the addition of an attention layer. METHODS We modify our original Siamese neural network architecture with Long Short-Term Memory (LSTM) and create two variants by (1) adding an attention layer on top of the existing LSTM, and (2) replacing the existing LSTM layer with an attention layer. RESULTS Adding an attention layer to the LSTM layer resulted in increasing precision to 92.38% (+3.63%) and F1 score to 91.74% (+1.13%), with limited impact on recall at 91.12% (-1.42%). CONCLUSIONS Although limited, this increase in precision substantially reduces the false positive rate and minimizes the need for manual curation.
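The three figures in the RESULTS sentence are tied together by the standard definition of F1 as the harmonic mean of precision and recall. As a quick consistency check (a sketch, not the authors' evaluation code), the reported precision and recall can be plugged into that formula; the variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    return 2 * precision * recall / (precision + recall)


# Values as reported in the abstract, in percent.
reported_precision = 92.38
reported_recall = 91.12

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 recomputed from rounded inputs: {f1:.2f}%")
```

The recomputed value lands within about 0.01 percentage points of the reported F1 score; a gap of that size is what one would expect when the inputs have already been rounded to two decimals.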
Affiliation(s)
- Vinh Nguyen
  - National Library of Medicine, National Institutes of Health, USA
6
Ulrich H, Uzunova H, Handels H, Ingenerf J. Proposal of Semantic Annotation for German Metadata Using Bidirectional Recurrent Neural Networks. Stud Health Technol Inform 2022; 294:357-361. [PMID: 35612096 DOI: 10.3233/shti220474] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
The distributed nature of our digital healthcare and the rapid emergence of new data sources prevent a compelling overview and the joint use of new data. Data integration, e.g., with metadata and semantic annotations, is expected to overcome this challenge. In this paper, we present an approach to predict UMLS codes for given German metadata using recurrent neural networks. The augmentation of the training dataset using the Medical Subject Headings (MeSH), particularly the German translations, also improved the model accuracy. The model demonstrates robust performance with 75% accuracy and aims to show that increasingly sophisticated machine learning tools can already play a significant role in data integration.
Affiliation(s)
- Hannes Ulrich
  - IT Center for Clinical Research (ITCR-L), University of Lübeck, Germany
- Hristina Uzunova
  - German Research Center for Artificial Intelligence, Lübeck, Germany
- Heinz Handels
  - German Research Center for Artificial Intelligence, Lübeck, Germany
  - Institute of Medical Informatics, University of Lübeck, Germany
- Josef Ingenerf
  - IT Center for Clinical Research (ITCR-L), University of Lübeck, Germany
  - Institute of Medical Informatics, University of Lübeck, Germany
7
Humphreys BL, Tuttle MS. Something New and Different: The Unified Medical Language System. Stud Health Technol Inform 2022; 288:100-112. [PMID: 35102832 DOI: 10.3233/shti210985] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Donald A.B. Lindberg M.D. arrived at the U.S. National Library of Medicine in 1984 and quickly launched the Unified Medical Language System (UMLS) research and development project to help computers understand biomedical meaning and to enable retrieval and integration of information from disparate electronic sources, e.g., patient records, biomedical literature, knowledge bases. This chapter focuses on how Lindberg's thinking, preferred ways of working, and decision-making guided UMLS goals and development and on what made the UMLS markedly "new and different" and ahead of its time.
8
Abdollahi M, Gao X, Mei Y, Ghosh S, Li J, Narag M. Substituting clinical features using synthetic medical phrases: Medical text data augmentation techniques. Artif Intell Med 2021; 120:102167. [PMID: 34629150 DOI: 10.1016/j.artmed.2021.102167] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/15/2021] [Revised: 09/02/2021] [Accepted: 09/03/2021] [Indexed: 11/22/2022]
Abstract
Biomedical natural language processing (NLP) has an important role in extracting consequential information in medical discharge notes. Detecting meaningful features from unstructured notes is a challenging task in medical document classification. The domain specific phrases and different synonyms within the medical documents make it hard to analyze them. Analyzing clinical notes becomes more challenging for short documents like abstract texts. All of these can result in poor classification performance, especially when there is a shortage of the clinical data in real life. Two new approaches (an ontology-guided approach and a combined ontology-based with dictionary-based approach) are suggested for augmenting medical data to enrich training data. Three different deep learning approaches are used to evaluate the classification performance of the proposed methods. The obtained results show that the proposed methods improved the classification accuracy in clinical notes classification.
9
Jing X. The Unified Medical Language System at 30 Years and How It Is Used and Published: Systematic Review and Content Analysis. JMIR Med Inform 2021; 9:e20675. [PMID: 34236337 PMCID: PMC8433943 DOI: 10.2196/20675] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/25/2020] [Revised: 11/25/2020] [Accepted: 07/02/2021] [Indexed: 01/22/2023] Open
Abstract
BACKGROUND The Unified Medical Language System (UMLS) has been a critical tool in biomedical and health informatics, and the year 2021 marks its 30th anniversary. The UMLS brings together many broadly used vocabularies and standards in the biomedical field to facilitate interoperability among different computer systems and applications. OBJECTIVE Despite its longevity, there is no comprehensive publication analysis of the use of the UMLS. Thus, this review and analysis is conducted to provide an overview of the UMLS and its use in English-language peer-reviewed publications, with the objective of providing a comprehensive understanding of how the UMLS has been used in English-language peer-reviewed publications over the last 30 years. METHODS PubMed, ACM Digital Library, and the Nursing & Allied Health Database were used to search for studies. The primary search strategy was as follows: UMLS was used as a Medical Subject Headings term or a keyword or appeared in the title or abstract. Only English-language publications were considered. The publications were screened first, then coded and categorized iteratively, following the grounded theory. The review process followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. RESULTS A total of 943 publications were included in the final analysis. Moreover, 32 publications were categorized into 2 categories; hence the total number of publications before duplicates are removed is 975. 
After analysis and categorization of the publications, UMLS was found to be used in the following emerging themes or areas (the number of publications and their respective percentages are given in parentheses): natural language processing (230/975, 23.6%), information retrieval (125/975, 12.8%), terminology study (90/975, 9.2%), ontology and modeling (80/975, 8.2%), medical subdomains (76/975, 7.8%), other language studies (53/975, 5.4%), artificial intelligence tools and applications (46/975, 4.7%), patient care (35/975, 3.6%), data mining and knowledge discovery (25/975, 2.6%), medical education (20/975, 2.1%), degree-related theses (13/975, 1.3%), digital library (5/975, 0.5%), and the UMLS itself (150/975, 15.4%), as well as the UMLS for other purposes (27/975, 2.8%). CONCLUSIONS The UMLS has been used successfully in patient care, medical education, digital libraries, and software development, as originally planned, as well as in degree-related theses, the building of artificial intelligence tools, data mining and knowledge discovery, foundational work in methodology, and middle layers that may lead to advanced products. Natural language processing, the UMLS itself, and information retrieval are the 3 most common themes that emerged among the included publications. The results, although largely related to academia, demonstrate that UMLS achieves its intended uses successfully, in addition to achieving uses broadly beyond its original intentions.
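The theme breakdown in the RESULTS can be checked with simple arithmetic: the per-theme counts should add up to the 975 categorized publications, and each percentage is the count divided by 975. The short sketch below does that check; the counts are copied from the abstract, while the dictionary layout and variable names are just one convenient way to hold them.

```python
# Theme counts as listed in the abstract (denominator: 975 categorized publications).
theme_counts = {
    "natural language processing": 230,
    "information retrieval": 125,
    "terminology study": 90,
    "ontology and modeling": 80,
    "medical subdomains": 76,
    "other language studies": 53,
    "artificial intelligence tools and applications": 46,
    "patient care": 35,
    "data mining and knowledge discovery": 25,
    "medical education": 20,
    "degree-related theses": 13,
    "digital library": 5,
    "the UMLS itself": 150,
    "the UMLS for other purposes": 27,
}

total = sum(theme_counts.values())
shares = {theme: round(100 * n / total, 1) for theme, n in theme_counts.items()}

print(total)
# List themes from most to least frequent.
for theme, n in sorted(theme_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{theme}: {n}/{total} ({shares[theme]}%)")
```

Sorting by count also reproduces the ordering stated in the CONCLUSIONS: natural language processing first, then the UMLS itself, then information retrieval.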
Affiliation(s)
- Xia Jing
  - Department of Public Health Sciences, College of Behavioral, Social and Health Sciences, Clemson University, Clemson, SC, United States
10
Newman-Griffis D, Divita G, Desmet B, Zirikly A, Rosé CP, Fosler-Lussier E. Ambiguity in medical concept normalization: An analysis of types and coverage in electronic health record datasets. J Am Med Inform Assoc 2021; 28:516-532. [PMID: 33319905 DOI: 10.1093/jamia/ocaa269] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/11/2020] [Revised: 09/13/2020] [Accepted: 11/17/2020] [Indexed: 12/18/2022] Open
Abstract
OBJECTIVES Normalizing mentions of medical concepts to standardized vocabularies is a fundamental component of clinical text analysis. Ambiguity (words or phrases that may refer to different concepts) has been extensively researched as part of information extraction from biomedical literature, but less is known about the types and frequency of ambiguity in clinical text. This study characterizes the distribution and distinct types of ambiguity exhibited by benchmark clinical concept normalization datasets, in order to identify directions for advancing medical concept normalization research. MATERIALS AND METHODS We identified ambiguous strings in datasets derived from the 2 available clinical corpora for concept normalization and categorized the distinct types of ambiguity they exhibited. We then compared observed string ambiguity in the datasets with potential ambiguity in the Unified Medical Language System (UMLS) to assess how representative available datasets are of ambiguity in clinical language. RESULTS We found that <15% of strings were ambiguous within the datasets, while over 50% were ambiguous in the UMLS, indicating only partial coverage of clinical ambiguity. The percentage of strings in common between any pair of datasets ranged from 2% to only 36%; of these, 40% were annotated with different sets of concepts, severely limiting generalization. Finally, we observed 12 distinct types of ambiguity, distributed unequally across the available datasets, reflecting diverse linguistic and medical phenomena. DISCUSSION Existing datasets are not sufficient to cover the diversity of clinical concept ambiguity, limiting both training and evaluation of normalization methods for clinical text. Additionally, the UMLS offers important semantic information for building and evaluating normalization methods. 
CONCLUSIONS Our findings identify 3 opportunities for concept normalization research, including a need for ambiguity-specific clinical datasets and leveraging the rich semantics of the UMLS in new methods and evaluation measures for normalization.
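The notion of an ambiguous string used in this abstract (one surface string annotated with more than one concept) can be illustrated with a small sketch. This is not the authors' pipeline: the function name, the case-folding choice, and the toy annotations with placeholder concept labels are all assumptions made for illustration.

```python
from collections import defaultdict


def ambiguous_share(annotations):
    """Return (share of distinct strings with >1 concept, set of those strings).

    annotations: iterable of (mention_string, concept_id) pairs.
    Strings are lowercased before grouping, which is an assumption of this sketch.
    """
    concepts_by_string = defaultdict(set)
    for mention, concept_id in annotations:
        concepts_by_string[mention.lower()].add(concept_id)
    ambiguous = {s for s, ids in concepts_by_string.items() if len(ids) > 1}
    return len(ambiguous) / len(concepts_by_string), ambiguous


# Toy annotations with placeholder concept labels (not real UMLS identifiers).
toy_annotations = [
    ("cold", "CONCEPT_LOW_TEMPERATURE"),
    ("cold", "CONCEPT_COMMON_COLD"),
    ("Cold", "CONCEPT_LOW_TEMPERATURE"),
    ("fever", "CONCEPT_FEVER"),
    ("cough", "CONCEPT_COUGH"),
]

share, strings = ambiguous_share(toy_annotations)
print(f"{share:.0%} of distinct strings are ambiguous: {sorted(strings)}")
```

Running the same count over a dataset's annotations and over a terminology's string-to-concept table gives the two kinds of figures the abstract contrasts: observed ambiguity in the data versus potential ambiguity in the vocabulary.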
Affiliation(s)
- Denis Newman-Griffis
  - Rehabilitation Medicine Department, National Institutes of Health Clinical Center, Bethesda, Maryland, USA
  - Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio, USA
- Guy Divita
  - Rehabilitation Medicine Department, National Institutes of Health Clinical Center, Bethesda, Maryland, USA
- Bart Desmet
  - Rehabilitation Medicine Department, National Institutes of Health Clinical Center, Bethesda, Maryland, USA
- Ayah Zirikly
  - Rehabilitation Medicine Department, National Institutes of Health Clinical Center, Bethesda, Maryland, USA
- Carolyn P Rosé
  - Rehabilitation Medicine Department, National Institutes of Health Clinical Center, Bethesda, Maryland, USA
  - Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Eric Fosler-Lussier
  - Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio, USA
11
Chang E, Mostafa J. The use of SNOMED CT, 2013-2020: a literature review. J Am Med Inform Assoc 2021; 28:2017-2026. [PMID: 34151978 DOI: 10.1093/jamia/ocab084] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 02/05/2021] [Revised: 03/30/2021] [Accepted: 04/26/2021] [Indexed: 11/12/2022] Open
Abstract
OBJECTIVE This article reviews recent literature on the use of SNOMED CT as an extension of Lee et al's 2014 review on the same topic. Lee et al's article covered literature published from 2001 to 2012, and the scope of this review was 2013-2020. MATERIALS AND METHODS In line with Lee et al's methods, we searched the PubMed and Embase databases and identified 1002 articles for review, including studies from January 2013 to September 2020. The retrieved articles were categorized and analyzed according to SNOMED CT focus categories (ie, indeterminate, theoretical, pre-development, implementation, and evaluation/commodity), usage categories (eg, illustrate terminology systems theory, prospective content coverage, used to classify or code in a study, retrieve or analyze patient data, etc.), medical domains, and countries. RESULTS After applying inclusion and exclusion criteria, 622 articles were selected for final review. Compared to the papers published between 2001 and 2012, papers published between 2013 and 2020 revealed an increase in more mature usage of SNOMED CT, and the number of papers classified in the "implementation" and "evaluation/commodity" focus categories expanded. When analyzed by decade, papers in the "pre-development," "implementation," and "evaluation/commodity" categories were much more numerous in 2011-2020 than in 2001-2010, increasing from 169 to 293, 30 to 138, and 3 to 65, respectively. CONCLUSION Published papers in more mature usage categories have substantially increased since 2012. From 2013 to present, SNOMED CT has been increasingly implemented in more practical settings. Future research should concentrate on addressing whether SNOMED CT influences improvement in patient care.
Affiliation(s)
- Eunsuk Chang
  - Carolina Health Informatics Program, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Javed Mostafa
  - Carolina Health Informatics Program, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
12
Kang T, Perotte A, Tang Y, Ta C, Weng C. UMLS-based data augmentation for natural language processing of clinical research literature. J Am Med Inform Assoc 2021; 28:812-823. [PMID: 33367705 DOI: 10.1093/jamia/ocaa309] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 07/09/2020] [Accepted: 11/23/2020] [Indexed: 01/17/2023] Open
Abstract
OBJECTIVE The study sought to develop and evaluate a knowledge-based data augmentation method to improve the performance of deep learning models for biomedical natural language processing by overcoming training data scarcity. MATERIALS AND METHODS We extended the easy data augmentation (EDA) method for biomedical named entity recognition (NER) by incorporating the Unified Medical Language System (UMLS) knowledge and called this method UMLS-EDA. We designed experiments to systematically evaluate the effect of UMLS-EDA on popular deep learning architectures for both NER and classification. We also compared UMLS-EDA to BERT. RESULTS UMLS-EDA enables substantial improvement for NER tasks from the original long short-term memory conditional random fields (LSTM-CRF) model (micro-F1 score: +5%, +17%, and +15%), helps the LSTM-CRF model (micro-F1 score: 0.66) outperform LSTM-CRF with transfer learning by BERT (0.63), and improves the performance of the state-of-the-art sentence classification model. The largest gain on micro-F1 score is 9%, from 0.75 to 0.84, better than classifiers with BERT pretraining (0.82). CONCLUSIONS This study presents a UMLS-based data augmentation method, UMLS-EDA. It is effective at improving deep learning models for both NER and sentence classification, and contributes original insights for designing new, superior deep learning approaches for low-resource biomedical domains.
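To make the idea of knowledge-based augmentation concrete, here is a minimal sketch of one EDA-style operation, synonym replacement, driven by a terminology lookup. It is not the UMLS-EDA implementation: the abstract does not spell out the operations, and the `TOY_SYNONYMS` dictionary (standing in for synonyms drawn from the UMLS), the function name, and the parameters are all hypothetical.

```python
import random

# Hypothetical stand-in for a UMLS synonym lookup (term -> alternative strings).
TOY_SYNONYMS = {
    "hypertension": ["high blood pressure"],
    "renal": ["kidney"],
}


def synonym_replace(tokens, synonyms, n, rng):
    """Return a copy of tokens with up to n tokens swapped for a listed synonym."""
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in synonyms]
    chosen = rng.sample(candidates, k=min(n, len(candidates)))
    augmented = list(tokens)
    for i in chosen:
        augmented[i] = rng.choice(synonyms[tokens[i].lower()])
    return augmented


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sentence = ["patients", "with", "hypertension", "and", "renal", "failure"]
augmented = synonym_replace(sentence, TOY_SYNONYMS, n=2, rng=rng)
print(" ".join(augmented))
```

For NER, a real pipeline would also have to carry the entity labels across when a single token is replaced by a multi-word synonym; this sketch leaves that step out.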
Affiliation(s)
- Tian Kang
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Adler Perotte
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Youlan Tang
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Casey Ta
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
13
Tran TTT, Nghiem SV, Le VT, Quan TT, Nguyen V, Yip HY, Bodenreider O. Siamese KG-LSTM: A deep learning model for enriching UMLS Metathesaurus synonymy. Int Conf Knowl Syst Eng 2020; 2020:281-286. [PMID: 36277606 PMCID: PMC9584311 DOI: 10.1109/kse50997.2020.9287797] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 06/16/2023]
Abstract
The Unified Medical Language System, or UMLS, is a repository of medical terminology developed by the U.S. National Library of Medicine to improve the ability of computer systems to understand biomedical and health language. The UMLS Metathesaurus is one of the three UMLS knowledge sources, containing medical terms and their relationships. Due to the recent rapid increase in the number of medical terms, the current construction of the UMLS Metathesaurus, which heavily depends on lexical tools and human editors, is error-prone and time-consuming. This paper takes advantage of emerging deep learning models for learning to predict synonymy and non-synonymy between pairs of biomedical terms in the Metathesaurus. Our learning approach focuses on a subset of specific terms instead of the whole Metathesaurus corpus. In particular, we train the models with biomedical terms from the Disorders semantic group. To strengthen the models, we enrich the inputs with different strategies, including synonyms and hierarchical relationships from source vocabularies. Our deep learning model adopts the Siamese KG-LSTM (Siamese Knowledge Graph - Long Short-Term Memory) in the architecture. The experimental results show that this approach yields excellent performance when handling the task of synonym detection for the Disorders semantic group in the Metathesaurus. This shows the potential of applying machine learning techniques in the UMLS Metathesaurus construction process. Although the work in this paper focuses only on the Disorders semantic group, we believe that the proposed method can be applied to other semantic groups in the UMLS Metathesaurus.
Affiliation(s)
- Tien T T Tran
  - Computer Science and Engineering, Ho Chi Minh University of Technology, HCMC, Vietnam
- Sy V Nghiem
  - Computer Science and Engineering, Ho Chi Minh University of Technology, HCMC, Vietnam
- Van T Le
  - Computer Science and Engineering, Ho Chi Minh University of Technology, HCMC, Vietnam
- Tho T Quan
  - Computer Science and Engineering, Ho Chi Minh University of Technology, HCMC, Vietnam
- Vinh Nguyen
  - National Library of Medicine, National Institute of Health, Bethesda, MD, USA
- Hong Yung Yip
  - Artificial Intelligence Institute, University of South Carolina, Columbia, SC, USA
- Olivier Bodenreider
  - National Library of Medicine, National Institute of Health, Bethesda, MD, USA
|
14
|
Xu S, Xu D, Wen L, Zhu C, Yang Y, Han S, Guan P. Integrating Unified Medical Language System and Kleinberg's Burst Detection Algorithm into Research Topics of Medications for Post-Traumatic Stress Disorder. Drug Des Devel Ther 2020; 14:3899-3913. [PMID: 33061296 PMCID: PMC7522601 DOI: 10.2147/dddt.s270379] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/02/2020] [Accepted: 09/07/2020] [Indexed: 11/23/2022]
Abstract
Background The treatment of post-traumatic stress disorder (PTSD) has long been a challenge because the symptoms of PTSD are multifaceted. PTSD is primarily treated with psychotherapy, medication, or a combination of the two. The present study was designed to analyze the literature on medications for PTSD, to explore high-frequency common drugs and low-frequency burst drugs using a burst detection algorithm combined with the Unified Medical Language System (UMLS), and to provide references for developing new drugs for PTSD. Methods Publications related to medications for PTSD from 2010 to 2019 were identified through PubMed, Web of Science Core Collection, and BIOSIS Previews. SemRep and a SemRep semantic result processing system were used to extract the set of drug concepts with a therapeutic relationship according to the semantic relationships of the UMLS. Kleinberg’s burst detection algorithm was applied by a Java-based program to calculate the burst weight index of drug concepts. These concepts were sorted according to frequency and burst weight index. Results Four hundred and fifty-nine treatment-related drug concepts were extracted. The drug with the highest burst weight index was “Psilocybine”, a hallucinogen, which was more likely to be a hotspot for the pharmacotherapy of PTSD. The highest-frequency concept was “prazosin”, which was more likely to be the focus of research on medications for PTSD. Conclusion The present study assessed the medication-related literature on PTSD treatment, providing a framework based on burst word detection, a baseline of information for future research, and a new approach to the discovery of textual knowledge. Bibliometric analysis based on the burst detection algorithm combined with the UMLS has shown a certain feasibility in amplifying the microscopic changes of a specific research direction in a field; it can also be applied to other aspects of disease and used to explore trends in various disciplines.
Affiliation(s)
- Shuang Xu
  - School of Library and Medical Informatics, China Medical University, Shenyang, Liaoning, People's Republic of China
- Dan Xu
  - School of Library and Medical Informatics, China Medical University, Shenyang, Liaoning, People's Republic of China
- Liang Wen
  - Department of Neurosurgery, The General Hospital of Shenyang Military Command, Shenyang, Liaoning, People's Republic of China
- Chen Zhu
  - Department of Neurosurgery, The First Hospital of China Medical University, Shenyang, Liaoning, People's Republic of China
- Ying Yang
  - School of Library and Medical Informatics, China Medical University, Shenyang, Liaoning, People's Republic of China
- Shuang Han
  - School of Library and Medical Informatics, China Medical University, Shenyang, Liaoning, People's Republic of China
- Peng Guan
  - Department of Epidemiology, School of Public Health, China Medical University, Shenyang, Liaoning, People's Republic of China
|
15
|
Zheng F, Shi J, Yang Y, Zheng WJ, Cui L. A transformation-based method for auditing the IS-A hierarchy of biomedical terminologies in the Unified Medical Language System. J Am Med Inform Assoc 2020; 27:1568-1575. [PMID: 32918476 PMCID: PMC7566369 DOI: 10.1093/jamia/ocaa123] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 02/10/2020] [Revised: 05/09/2020] [Accepted: 05/20/2020] [Indexed: 01/06/2023]
Abstract
OBJECTIVE The Unified Medical Language System (UMLS) integrates various source terminologies to support interoperability between biomedical information systems. In this article, we introduce a novel transformation-based auditing method that leverages UMLS knowledge to systematically identify missing hierarchical IS-A relations in the source terminologies. MATERIALS AND METHODS Given a concept name in the UMLS, we first identify its base and secondary noun chunks. For each identified noun chunk, we generate replacement candidates that are more general than the noun chunk. Then, we replace the noun chunks with their replacement candidates to generate new potential concept names that may serve as supertypes of the original concept. If a newly generated name is an existing concept name in the same source terminology as the original concept, then a potentially missing IS-A relation between the original and the new concept is identified. RESULTS Applying our transformation-based method to English-language concept names in the UMLS (2019AB release), a total of 39 359 potentially missing IS-A relations were detected in 13 source terminologies. Domain experts evaluated a random sample of 200 potentially missing IS-A relations identified in SNOMED CT (U.S. edition) and 100 in Gene Ontology. A total of 173 of 200 and 63 of 100 potentially missing IS-A relations were confirmed by domain experts, indicating that our method achieved a precision of 86.5% and 63% for SNOMED CT and Gene Ontology, respectively. CONCLUSIONS Our results showed that our transformation-based method is effective in identifying missing IS-A relations in the UMLS source terminologies.
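As a rough illustration of the transformation idea described in this abstract, the sketch below swaps a noun chunk for a more general one and checks whether the resulting name already exists in the same terminology without an IS-A link. The toy terminology, the `generalizations` table, and all function names are hypothetical. Noun chunking is reduced to substring matching, and only direct IS-A pairs are checked, which is a simplification and not the authors' implementation. The `precision` helper reproduces the arithmetic behind the reported figures (173 of 200, 63 of 100).

```python
# Toy illustration only: terminology, generalization table, and names are
# hypothetical and are not taken from the UMLS or from the paper.

def candidate_supertypes(name, generalizations):
    """Yield names formed by swapping one chunk for a more general term."""
    for chunk, broader_terms in generalizations.items():
        if chunk in name:
            for broader in broader_terms:
                yield name.replace(chunk, broader)


def find_missing_isa(terminology, existing_isa, generalizations):
    """Return (child, parent) pairs whose generated parent name exists in the
    same terminology but is not yet linked to the child by a direct IS-A."""
    missing = []
    for name in sorted(terminology):
        for candidate in candidate_supertypes(name, generalizations):
            if (candidate != name
                    and candidate in terminology
                    and (name, candidate) not in existing_isa):
                missing.append((name, candidate))
    return missing


def precision(confirmed, sampled):
    """Fraction of sampled candidate relations confirmed by reviewers."""
    return confirmed / sampled


terminology = {"fracture of femur", "fracture of bone", "injury of bone"}
existing_isa = {("fracture of bone", "injury of bone")}
generalizations = {"femur": ["bone"], "fracture": ["injury"]}

missing = find_missing_isa(terminology, existing_isa, generalizations)
print(missing)
print(precision(173, 200), precision(63, 100))
```

A fuller version would also consult the transitive closure of the hierarchy, so that a relation already implied through an intermediate concept is not reported as missing.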
Affiliation(s)
- Fengbo Zheng
  - Department of Computer Science, University of Kentucky, Lexington, Kentucky, USA
- Jay Shi
  - Department of Internal Medicine, University of Kentucky, Lexington, Kentucky, USA
- Yuntao Yang
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
- W Jim Zheng
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
- Licong Cui
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
|
16
|
Amos L, Anderson D, Brody S, Ripple A, Humphreys BL. UMLS users and uses: a current overview. J Am Med Inform Assoc 2020; 27:ocaa084. [PMID: 32683453 PMCID: PMC7580803 DOI: 10.1093/jamia/ocaa084] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 02/08/2020] [Revised: 03/31/2020] [Accepted: 05/01/2020] [Indexed: 11/16/2022]
Abstract
The US National Library of Medicine regularly collects summary data on direct use of Unified Medical Language System (UMLS) resources. The summary data sources include UMLS user registration data, required annual reports submitted by registered users, and statistics on downloads and application programming interface calls. In 2019, the National Library of Medicine analyzed the summary data on 2018 UMLS use. The library also conducted a scoping review of the literature to provide additional intelligence about the research uses of UMLS as input to a planned 2020 review of UMLS production methods and priorities. 5043 direct users of UMLS data and tools downloaded 4402 copies of the UMLS resources and issued 66 130 951 UMLS application programming interface requests in 2018. The annual reports and the scoping review results agree that the primary UMLS uses are to process and interpret text and facilitate mapping or linking between terminologies. These uses align with the original stated purpose of the UMLS.
Affiliation(s)
- Liz Amos
  - Office of the Director, National Library of Medicine, Bethesda, USA
- David Anderson
  - Library Operations, National Library of Medicine, Bethesda, USA
- Stacy Brody
  - Himmelfarb Health Sciences Library, Reference & Instructional Services, George Washington University, Washington DC, USA
- Anna Ripple
  - Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, USA
|
17
|
Grosjean J, Billey K, Charlet J, Darmoni SJ. Manual Evaluation of the Automatic Mapping of International Classification of Diseases (ICD)-11 (in French). Stud Health Technol Inform 2020; 270:1335-1336. [PMID: 32570646 DOI: 10.3233/shti200429] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
A lexical method was used to map ICD-11 to the terminologies included in the HeTOP server. About half of ICD-11 codes (47.76%) were mapped to at least one concept. The developed tool reached a global precision of 0.98 and a recall of 0.66. Lexical methods are powerful methods to map health terminologies. Supervised and manual mapping is still necessary to complete the mapping.
Affiliation(s)
- Julien Grosjean
  - Department of Biomedical Informatics, Rouen University Hospital, Normandy, France
  - LIMICS U1142 INSERM, Sorbonne Université, Paris & Rouen University, France
- Kévin Billey
  - Department of Biomedical Informatics, Rouen University Hospital, Normandy, France
  - LITIS EA4108, Rouen University, France
- Jean Charlet
  - LIMICS U1142 INSERM, Sorbonne Université, Paris & Rouen University, France
  - Assistance Publique-Hôpitaux de Paris, DRCI, Paris, France
- Stefan J Darmoni
  - Department of Biomedical Informatics, Rouen University Hospital, Normandy, France
  - LIMICS U1142 INSERM, Sorbonne Université, Paris & Rouen University, France
|
18
|
Abstract
The ability to automatically categorize submitted questions by topic and to suggest similar questions and answers to users reduces the number of redundant questions. Our objective was to compare intra-topic and inter-topic similarity between questions and answers using concept-based similarity analysis. We gathered existing questions and answers from several popular online health communities. Then, Unified Medical Language System concepts related to selected questions and experts in different topics were extracted and weighted by term frequency-inverse document frequency values. Finally, the similarity between weighted vectors of Unified Medical Language System concepts was computed. Our results showed a considerable gap between intra-topic and inter-topic similarities: the average intra-topic similarity (0.095, 0.192, and 0.110, respectively) was higher than the average inter-topic similarity (0.012, 0.025, and 0.018, respectively) for questions from the top 3 popular online communities, NetWellness, WebMD, and Yahoo Answers. Similarity scores between the content of questions answered by experts in the same and in different topics were 0.51 and 0.11, respectively. Concept-based similarity methods can be used to develop intelligent question-answering retrieval systems that include auto-recommendation functionality for similar questions and experts.
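The weighting and comparison steps in this abstract can be sketched as follows, assuming each question has already been reduced to a list of concept identifiers. The identifiers, the tf-idf variant (relative term frequency times the natural log of inverse document frequency), and the function names are illustrative assumptions; the abstract does not specify the exact formula.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of lists of concept identifiers. Returns one sparse
    {concept: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter(c for doc in docs for c in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            c: (term_freq[c] / len(doc)) * math.log(n_docs / doc_freq[c])
            for c in term_freq
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical concept labels standing in for extracted UMLS concepts.
questions = [
    ["C_headache", "C_migraine", "C_nausea"],   # neurology question
    ["C_migraine", "C_headache", "C_aura"],     # neurology question
    ["C_diabetes", "C_insulin", "C_nausea"],    # endocrinology question
]
v_neuro_1, v_neuro_2, v_endo = tfidf_vectors(questions)

same_topic = cosine(v_neuro_1, v_neuro_2)
cross_topic = cosine(v_neuro_1, v_endo)
print(same_topic > cross_topic)
```

In this toy corpus the two questions on the same topic share more weighted concepts than the cross-topic pair, which is the kind of gap the abstract describes.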
|
19
|
Li Y, Yao L, Mao C, Srivastava A, Jiang X, Luo Y. Early Prediction of Acute Kidney Injury in Critical Care Setting Using Clinical Notes. Proceedings (IEEE Int Conf Bioinformatics Biomed) 2018; 2018:683-686. [PMID: 33376624 PMCID: PMC7768909 DOI: 10.1109/bibm.2018.8621574] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Indexed: 01/28/2023]
Abstract
Acute kidney injury (AKI) in critically ill patients is associated with significant morbidity and mortality. Development of novel methods to identify patients with AKI earlier will allow for testing of novel strategies to prevent or reduce the complications of AKI. We developed data-driven prediction models to estimate the risk of new AKI onset. We generated models from clinical notes written within the first 24 hours following intensive care unit (ICU) admission, extracted from the Medical Information Mart for Intensive Care III (MIMIC-III). From the clinical notes, we generated clinically meaningful word and concept representations and embeddings, respectively. Five supervised learning classifiers and a knowledge-guided deep learning architecture were used to construct prediction models. The best configuration yielded a competitive AUC of 0.779. Our work suggests that natural language processing of clinical notes can be applied to assist clinicians in identifying the risk of incident AKI onset in critically ill patients upon admission to the ICU.
Affiliation(s)
- Yikuan Li
  - Dept. of EECS, Northwestern University, Evanston, IL, U.S.A
- Chengsheng Mao
  - Dept. of Preventive Medicine, Northwestern University, Chicago, IL, U.S.A
- Anand Srivastava
  - Div. of Nephrology and Hypertension, Northwestern University, Chicago, IL, U.S.A
- Xiaoqian Jiang
  - School of Biomedical Informatics, Univ. of Texas Health Science Center, Houston, TX, U.S.A
- Yuan Luo
  - Dept. of Preventive Medicine, Northwestern University, Chicago, IL, U.S.A
|
20
|
Varghese J, Sandmann S, Dugas M. Web-Based Information Infrastructure Increases the Interrater Reliability of Medical Coders: Quasi-Experimental Study. J Med Internet Res 2018; 20:e274. [PMID: 30322834 PMCID: PMC6231825 DOI: 10.2196/jmir.9644] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 12/13/2017] [Revised: 05/03/2018] [Accepted: 06/28/2018] [Indexed: 01/05/2023]
Abstract
Background Medical coding is essential for standardized communication and integration of clinical data. The Unified Medical Language System by the National Library of Medicine is the largest clinical terminology system for medical coders and Natural Language Processing tools. However, the abundance of ambiguous codes leads to low rates of uniform coding among different coders. Objective The objective of our study was to measure uniform coding among different medical experts in terms of interrater reliability and analyze the effect on interrater reliability using an expert- and Web-based code suggestion system. Methods We conducted a quasi-experimental study in which 6 medical experts coded 602 medical items from structured quality assurance forms or free-text eligibility criteria of 20 different clinical trials. The medical item content was selected on the basis of mortality-leading diseases according to World Health Organization data. The intervention comprised using a semiautomatic code suggestion tool that is linked to a European information infrastructure providing a large medical text corpus of >300,000 medical form items with expert-assigned semantic codes. Krippendorff alpha (Kalpha) with bootstrap analysis was used for the interrater reliability analysis, and coding times were measured before and after the intervention. Results The intervention improved interrater reliability in structured quality assurance form items (from Kalpha=0.50, 95% CI 0.43-0.57 to Kalpha=0.62 95% CI 0.55-0.69) and free-text eligibility criteria (from Kalpha=0.19, 95% CI 0.14-0.24 to Kalpha=0.43, 95% CI 0.37-0.50) while preserving or slightly reducing the mean coding time per item for all 6 coders. 
Regardless of the intervention, precoordination and structured items were associated with significantly high interrater reliability, but the proportion of items that were precoordinated significantly increased after intervention (eligibility criteria: OR 4.92, 95% CI 2.78-8.72; quality assurance: OR 1.96, 95% CI 1.19-3.25). Conclusions The Web-based code suggestion mechanism improved interrater reliability toward moderate or even substantial intercoder agreement. Precoordination and the use of structured versus free-text data elements are key drivers of higher interrater reliability.
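The abstract reports Krippendorff alpha with bootstrap analysis. As a minimal sketch of the statistic itself, the function below computes the nominal-level alpha from a coincidence matrix, with no bootstrap. It assumes the codes are unordered categories, which the abstract does not state explicitly. The function name and the toy ratings are illustrative.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """units: list of lists; each inner list holds the codes that different
    raters assigned to one item (missing ratings are simply left out)."""
    coincidences = Counter()
    for codes in units:
        m = len(codes)
        if m < 2:
            continue  # an item with fewer than two codes cannot be paired
        # Every ordered pair of ratings within an item counts 1 / (m - 1).
        for a, b in permutations(codes, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category occurs")
    # alpha = 1 - D_o / D_e, with D_o = observed / n
    # and D_e = expected / (n * (n - 1)).
    return 1.0 - (n - 1) * observed / expected


# Two hypothetical coders assigning codes "a" or "b" to four items.
ratings = [["a", "a"], ["b", "b"], ["a", "b"], ["b", "b"]]
alpha = krippendorff_alpha_nominal(ratings)
print(round(alpha, 3))
```

A bootstrap confidence interval, as reported in the study, would resample items with replacement and recompute alpha on each resample.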
Affiliation(s)
- Julian Varghese
  - Institute of Medical Informatics, University of Münster, Münster, Germany
- Sarah Sandmann
  - Institute of Medical Informatics, University of Münster, Münster, Germany
- Martin Dugas
  - Institute of Medical Informatics, European Research Center for Information Systems, Münster, Germany
|
21
|
Varghese J, Fujarski M, Hegselmann S, Neuhaus P, Dugas M. CDEGenerator: an online platform to learn from existing data models to build model registries. Clin Epidemiol 2018; 10:961-970. [PMID: 30127646 PMCID: PMC6089100 DOI: 10.2147/clep.s170075] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 12/04/2022]
Abstract
OBJECTIVE Best-practice data models harmonize semantics and data structure of medical variables in clinical or epidemiological studies. While there exist several published data sets, it remains challenging to find and reuse published eligibility criteria or other data items that match specific needs of a newly planned study or registry. A novel Internet-based method for rapid comparison of published data models was implemented to enable reuse, customization, and harmonization of item catalogs for the early planning and development phase of research databases. METHODS Based on prior work, a European information infrastructure with a large collection of medical data models was established. A newly developed analysis module called CDEGenerator provides systematic comparison of selected data models and user-tailored creation of minimum data sets or harmonized item catalogs. Usability was assessed by eight external medical documentation experts in a workshop by the umbrella organization for networked medical research in Germany with the System Usability Scale. RESULTS The analysis and item-tailoring module provides multilingual comparisons of semantically complex eligibility criteria of clinical trials. The System Usability Scale yielded "good usability" (mean 75.0, range 65.0-92.5). User-tailored models can be exported to several data formats, such as XLS, REDCap or Operational Data Model by the Clinical Data Interchange Standards Consortium, which is supported by the US Food and Drug Administration and European Medicines Agency for metadata exchange of clinical studies. CONCLUSION The online tool provides user-friendly methods to reuse, compare, and thus learn from data items of standardized or published models to design a blueprint for a harmonized research database.
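For readers unfamiliar with the System Usability Scale mentioned in this abstract, the standard scoring rule can be sketched as below. The questionnaire has ten items answered on a 1 to 5 scale. Odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5 to give a score from 0 to 100. The function name and the example responses are illustrative and are not data from the study.

```python
def sus_score(responses):
    """Compute the System Usability Scale score for one respondent.

    responses: ten integers from 1 to 5, in questionnaire order.
    """
    if len(responses) != 10 or any(r < 1 or r > 5 for r in responses):
        raise ValueError("SUS needs ten responses, each between 1 and 5")
    total = 0
    for index, response in enumerate(responses):
        if index % 2 == 0:
            # Items 1, 3, 5, 7, 9 are positively worded.
            total += response - 1
        else:
            # Items 2, 4, 6, 8, 10 are negatively worded.
            total += 5 - response
    return total * 2.5


# Hypothetical respondent: agrees (4) with positive items,
# disagrees (2) with negative items.
example = sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2])
print(example)
```

A study-level figure such as the reported mean would be the average of these per-respondent scores.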
Affiliation(s)
- Michael Fujarski
  - Faculty of Mathematics and Computer Sciences, University of Münster
- Martin Dugas
  - Institute of Medical Informatics, University of Münster,
  - Institute of Medical Informatics, European Research Center for Information Systems (ERCIS), Münster, Germany
|
22
|
Chen D, Zhang R, Liu K, Hou L. Knowledge Discovery from Posts in Online Health Communities Using Unified Medical Language System. Int J Environ Res Public Health 2018; 15:E1291. [PMID: 29921824 PMCID: PMC6025155 DOI: 10.3390/ijerph15061291] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 05/31/2018] [Revised: 06/15/2018] [Accepted: 06/16/2018] [Indexed: 12/03/2022]
Abstract
Patient-reported posts in Online Health Communities (OHCs) contain various valuable information that can help establish knowledge-based online support for online patients. However, utilizing these reports to improve online patient services in the absence of appropriate medical and healthcare expert knowledge is difficult. Thus, we propose a comprehensive knowledge discovery method that is based on the Unified Medical Language System for the analysis of narrative posts in OHCs. First, we propose a domain-knowledge support framework for OHCs to provide a basis for post analysis. Second, we develop a Knowledge-Involved Topic Modeling (KI-TM) method to extract and expand explicit knowledge within the text. We propose four metrics, namely, explicit knowledge rate, latent knowledge rate, knowledge correlation rate, and perplexity, for the evaluation of the KI-TM method. Our experimental results indicate that our proposed method outperforms existing methods in terms of providing knowledge support. Our method enhances knowledge support for online patients and can help develop intelligent OHCs in the future.
Affiliation(s)
- Donghua Chen
  - Department of Information Management, School of Economics and Management, Beijing Jiaotong University, Beijing 100044, China.
- Runtong Zhang
  - Department of Information Management, School of Economics and Management, Beijing Jiaotong University, Beijing 100044, China.
- Kecheng Liu
  - Henley Business School, University of Reading, Reading RG6 6UD, UK.
- Lei Hou
  - Henley Business School, University of Reading, Reading RG6 6UD, UK.
|
23
|
Weng WH, Wagholikar KB, McCray AT, Szolovits P, Chueh HC. Medical subdomain classification of clinical notes using a machine learning-based natural language processing approach. BMC Med Inform Decis Mak 2017; 17:155. [PMID: 29191207 PMCID: PMC5709846 DOI: 10.1186/s12911-017-0556-8] [Citation(s) in RCA: 79] [Impact Index Per Article: 11.3] [Received: 06/15/2017] [Accepted: 11/19/2017] [Indexed: 01/18/2023]
Abstract
BACKGROUND The medical subdomain of a clinical note, such as cardiology or neurology, is useful content-derived metadata for developing machine learning downstream applications. To classify the medical subdomain of a note accurately, we have constructed a machine learning-based natural language processing (NLP) pipeline and developed medical subdomain classifiers based on the content of the note. METHODS We constructed the pipeline using the clinical NLP system, clinical Text Analysis and Knowledge Extraction System (cTAKES), the Unified Medical Language System (UMLS) Metathesaurus, Semantic Network, and learning algorithms to extract features from two datasets - clinical notes from Integrating Data for Analysis, Anonymization, and Sharing (iDASH) data repository (n = 431) and Massachusetts General Hospital (MGH) (n = 91,237), and built medical subdomain classifiers with different combinations of data representation methods and supervised learning algorithms. We evaluated the performance of classifiers and their portability across the two datasets. RESULTS The medical subdomain classifier trained as a convolutional recurrent neural network with neural word embeddings yielded the best performance on the iDASH and MGH datasets, with area under the receiver operating characteristic curve (AUC) of 0.975 and 0.991, and F1 scores of 0.845 and 0.870, respectively. Considering better clinical interpretability, the medical subdomain classifier trained as a linear support vector machine, using hybrid bag-of-words and clinically relevant UMLS concepts as the feature representation with term frequency-inverse document frequency (tf-idf) weighting, outperformed other shallow learning classifiers on the iDASH and MGH datasets, with AUC of 0.957 and 0.964, and F1 scores of 0.932 and 0.934, respectively. We trained classifiers on one dataset, applied them to the other dataset, and reached the F1 score threshold of 0.7 in classifiers for half of the medical subdomains we studied.
CONCLUSION Our study shows that a supervised learning-based NLP approach is useful for developing medical subdomain classifiers. The deep learning algorithm with distributed word representation yields better performance, yet shallow learning algorithms with the word and concept representation achieve comparable performance with better clinical interpretability. Portable classifiers may also be used across datasets from different institutions.
Affiliation(s)
- Wei-Hung Weng
  - Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck Street, 4th Floor, Boston, MA 02115 USA
  - Laboratory of Computer Science, Massachusetts General Hospital, 50 Staniford Street, Suite 750, Boston, MA 02114 USA
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar Street, Cambridge, MA 02139 USA
- Kavishwar B. Wagholikar
  - Laboratory of Computer Science, Massachusetts General Hospital, 50 Staniford Street, Suite 750, Boston, MA 02114 USA
  - Department of Medicine, Massachusetts General Hospital, 55 Fruit St, Boston, MA 02114 USA
- Alexa T. McCray
  - Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck Street, 4th Floor, Boston, MA 02115 USA
- Peter Szolovits
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar Street, Cambridge, MA 02139 USA
- Henry C. Chueh
  - Laboratory of Computer Science, Massachusetts General Hospital, 50 Staniford Street, Suite 750, Boston, MA 02114 USA
  - Department of Medicine, Massachusetts General Hospital, 55 Fruit St, Boston, MA 02114 USA
|
24
|
Hegselmann S, Gessner S, Neuhaus P, Henke J, Schmidt CO, Dugas M. Automatic Conversion of Metadata from the Study of Health in Pomerania to ODM. Stud Health Technol Inform 2017; 236:88-96. [PMID: 28508783] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
BACKGROUND Electronic collection and high-quality analysis of medical data are expected to have great potential to improve patient care and medical research. However, the integration of data from different stakeholders poses a crucial problem. The exchange and reuse of medical data models, as well as annotations with unique semantic identifiers, were proposed as a solution. OBJECTIVES Convert metadata from the Study of Health in Pomerania to the standardized CDISC ODM format. METHODS The structure of the two data formats is analyzed and a mapping is suggested and implemented. RESULTS The metadata from the Study of Health in Pomerania was successfully converted to ODM. All relevant information was included in the resulting forms. Three sample forms were evaluated in depth, which demonstrates the feasibility of this conversion. CONCLUSION Hundreds of data entry forms with more than 15,000 items can be converted into a standardized format with some limitations, e.g. regarding logical constraints. This enables the integration of the Study of Health in Pomerania metadata into various systems, facilitating implementation and reuse at different study sites.
Affiliation(s)
- Sophia Gessner
  - Institute of Medical Informatics, University of Münster, Germany
- Philipp Neuhaus
  - Institute of Medical Informatics, University of Münster, Germany
- Jörg Henke
  - Institute for Community Medicine, University of Greifswald, Germany
- Martin Dugas
  - Institute of Medical Informatics, University of Münster, Germany
|
25
|
Raje S, Bodenreider O. Interoperability of Disease Concepts in Clinical and Research Ontologies: Contrasting Coverage and Structure in the Disease Ontology and SNOMED CT. Stud Health Technol Inform 2017; 245:925-929. [PMID: 29295235 PMCID: PMC5881393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
OBJECTIVES To contrast the coverage of diseases between the Disease Ontology (DO) and SNOMED CT, and to compare the hierarchical structure of the two ontologies. METHODS We establish a reference list of mappings. We characterize unmapped concepts in DO semantically and structurally. Finally, we compare the hierarchical structure between the two ontologies. RESULTS Overall, 4478 (65%) of the 6931 DO concepts are mapped to SNOMED CT. The cancer and neoplasm subtrees of DO account for many of the unmapped concepts. The most frequent differentiae in unmapped concepts include morphology (for cancers and neoplasms), specific subtypes (for rare genetic disorders), and anatomical subtypes. Unmapped concepts usually form subtrees, and less often correspond to isolated leaves or intermediary concepts. CONCLUSION This detailed analysis of the gaps in coverage and structural differences between DO and SNOMED CT contributes to the interoperability between these two ontologies and will guide further validation of the mapping.
|
26
|
Yu Z, Wallace BC, Johnson T, Cohen T. Retrofitting Concept Vector Representations of Medical Concepts to Improve Estimates of Semantic Similarity and Relatedness. Stud Health Technol Inform 2017; 245:657-661. [PMID: 29295178 PMCID: PMC6464117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
Estimation of semantic similarity and relatedness between biomedical concepts has utility for many informatics applications. Automated methods fall into two categories: methods based on distributional statistics drawn from text corpora, and methods using the structure of existing knowledge resources. Methods in the former category disregard taxonomic structure, while those in the latter fail to consider semantically relevant empirical information. In this paper, we present a method that retrofits distributional context vector representations of biomedical concepts using structural information from the UMLS Metathesaurus, such that the similarity between vector representations of linked concepts is augmented. We evaluated it on the UMNSRS benchmark. Our results demonstrate that retrofitting of concept vector representations leads to better correlation with human raters for both similarity and relatedness, surpassing the best results reported to date. They also demonstrate a clear improvement in performance on this reference standard for retrofitted vector representations, as compared to those without retrofitting.
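To make the retrofitting idea concrete, here is a minimal sketch that pulls the vectors of linked concepts toward each other while keeping each one anchored to its original distributional vector. It uses the iterative averaging update popularized by Faruqui et al. for retrofitting word vectors to lexicons, which is an assumption here, since the abstract does not give the update rule. The concept labels, the link table, and the parameter values are hypothetical.

```python
def retrofit(vectors, links, iterations=10, alpha=1.0):
    """Return retrofitted copies of `vectors`.

    vectors: {concept: list of floats} distributional vectors.
    links:   {concept: list of linked concepts}, e.g. from a knowledge source.
    alpha:   weight that anchors each concept to its original vector.
    """
    original = {c: list(v) for c, v in vectors.items()}
    current = {c: list(v) for c, v in vectors.items()}
    for _ in range(iterations):
        for concept, neighbours in links.items():
            known = [n for n in neighbours if n in current]
            if concept not in current or not known:
                continue
            beta = 1.0 / len(known)  # neighbour weights sum to 1
            updated = []
            for d in range(len(original[concept])):
                neighbour_part = sum(beta * current[n][d] for n in known)
                updated.append(
                    (alpha * original[concept][d] + neighbour_part)
                    / (alpha + 1.0)
                )
            current[concept] = updated
    return current


# Hypothetical concepts: A and B are linked, C has no links.
vectors = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [-1.0, 0.0]}
links = {"A": ["B"], "B": ["A"]}

retrofitted = retrofit(vectors, links)
dot_before = sum(a * b for a, b in zip(vectors["A"], vectors["B"]))
dot_after = sum(a * b for a, b in zip(retrofitted["A"], retrofitted["B"]))
print(dot_before, dot_after)
```

In this toy case the linked concepts start out orthogonal and end up with a positive dot product, while the unlinked concept keeps its original vector.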
Affiliation(s)
- Zhiguo Yu: The University of Texas School of Biomedical Informatics at Houston, Houston, Texas, USA
- Byron C. Wallace: College of Computer and Information Science, Northeastern University, Boston, Massachusetts, USA
- Todd Johnson: The University of Texas School of Biomedical Informatics at Houston, Houston, Texas, USA
- Trevor Cohen: The University of Texas School of Biomedical Informatics at Houston, Houston, Texas, USA
|
27
|
Festag S, Spreckelsen C. Word Sense Disambiguation of Medical Terms via Recurrent Convolutional Neural Networks. Stud Health Technol Inform 2017; 236:8-15. [PMID: 28508773] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
BACKGROUND Tagging text data with codes representing biomedical concepts plays an important role in medical data management and analysis. A problem occurs if there are ambiguous words linked to several concepts. OBJECTIVES AND METHODS This study investigates word sense disambiguation based on word embedding and recurrent convolutional neural networks. The study focuses on terms mapped to multiple concepts of the Unified Medical Language System (UMLS). RESULTS We created 20 text processing pipelines trained on a subset of the MeSH Word Sense Disambiguation (MSH WSD) data set, each pipeline disambiguating the sense of one word. The pipelines were then tested on a disjoint subset of MSH WSD data. Most pipelines achieved good or even excellent results (70% of the pipelines achieved at least 90% accuracy, 40% achieved at least 98% accuracy). One poor-performing outlier was detected. CONCLUSION The proposed approach can serve as a basis for a scaled-up system combining pipelines for many ambiguous words. The methods used here recently proved very successful in other fields of text understanding and can be expected to scale up with improved availability of training data.
Affiliation(s)
- Sven Festag: Department of Medical Informatics, Medical Faculty, RWTH Aachen University
- Cord Spreckelsen: Department of Medical Informatics, Medical Faculty, RWTH Aachen University
|
28
|
Lu CJ, Tormey D, McCreedy L, Browne AC. Enhanced LexSynonym Acquisition for Effective UMLS Concept Mapping. Stud Health Technol Inform 2017; 245:501-505. [PMID: 29295145] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
Concept mapping is important in natural language processing (NLP) for bioinformatics. The UMLS Metathesaurus provides a rich synonym thesaurus and is a popular resource for concept mapping. Query expansion using synonyms for subterm substitutions is an effective technique to increase recall for UMLS concept mapping. Synonyms used to substitute subterms are called element synonyms. The completeness and quality of both the element synonyms and the UMLS synonym thesaurus are key to success in such applications. The Lexical Systems Group (LSG) has developed a new system for element synonym acquisition, based on enhanced requirements and a new design, for better performance. The results show: 1) a 36.71-fold growth of synonyms in the Lexicon (lexSynonym) in the 2017 release; 2) improvements in recall and F1 for concept mapping, with similar precision, when using lexSynonym.2017 as element synonyms, due to its broader coverage and better quality.
Affiliation(s)
- Chris J Lu: National Library of Medicine, Bethesda, MD, USA
|
29
|
Duque A, Martinez-Romo J, Araujo L. Can multilinguality improve Biomedical Word Sense Disambiguation? J Biomed Inform 2016; 64:320-332. [PMID: 27815227 DOI: 10.1016/j.jbi.2016.10.020] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 07/26/2016] [Revised: 10/24/2016] [Accepted: 10/31/2016] [Indexed: 10/20/2022]
Abstract
Ambiguity in the biomedical domain represents a major issue when performing Natural Language Processing tasks over the huge amount of available information in the field. For this reason, Word Sense Disambiguation is critical for achieving accurate systems able to tackle complex tasks such as information extraction, summarization or document classification. In this work we explore whether multilinguality can help to solve the problem of ambiguity, and the conditions required for a system to improve the results obtained by monolingual approaches. Also, we analyze the best ways to generate those useful multilingual resources, and study different languages and sources of knowledge. The proposed system, based on co-occurrence graphs containing biomedical concepts and textual information, is evaluated on a test dataset frequently used in biomedicine. We can conclude that multilingual resources are able to provide a clear improvement of more than 7% compared to monolingual approaches, for graphs built from a small number of documents. Also, empirical results show that automatically translated resources are a useful source of information for this particular task.
Affiliation(s)
- Andres Duque: NLP & IR Group, Dpto. Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia (UNED), Madrid 28040, Spain
- Juan Martinez-Romo: NLP & IR Group, Dpto. Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia (UNED), Madrid 28040, Spain
- Lourdes Araujo: NLP & IR Group, Dpto. Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia (UNED), Madrid 28040, Spain
|
30
|
Mowery DL, South BR, Christensen L, Leng J, Peltonen LM, Salanterä S, Suominen H, Martinez D, Velupillai S, Elhadad N, Savova G, Pradhan S, Chapman WW. Normalizing acronyms and abbreviations to aid patient understanding of clinical texts: ShARe/CLEF eHealth Challenge 2013, Task 2. J Biomed Semantics 2016; 7:43. [PMID: 27370271 PMCID: PMC4930590 DOI: 10.1186/s13326-016-0084-y] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 08/28/2014] [Accepted: 06/01/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The ShARe/CLEF eHealth challenge lab aims to stimulate development of natural language processing and information retrieval technologies to aid patients in understanding their clinical reports. In clinical text, acronyms and abbreviations, also referred to as short forms, can be difficult for patients to understand. For one of three shared tasks in 2013 (Task 2), we generated a reference standard of clinical short forms normalized to the Unified Medical Language System. This reference standard can be used to improve patient understanding by linking to web sources with lay descriptions of annotated short forms or by substituting short forms with a simpler lay term. METHODS In this study, we evaluate 1) the accuracy of participating systems in normalizing short forms compared to a majority sense baseline approach, 2) the performance of participating systems on short forms with variable majority sense distributions, and 3) the accuracy of participating systems in normalizing concepts shared between the test set and the Consumer Health Vocabulary, a vocabulary of lay medical terms. RESULTS The best systems submitted by the five participating teams performed with accuracies ranging from 43% to 72%. A majority sense baseline approach achieved the second best performance. For short forms with two or more senses, the accuracy of participating systems ranged from 52% to 78% with low ambiguity (majority sense greater than 80%), from 23% to 57% with moderate ambiguity (majority sense between 50% and 80%), and from 2% to 45% with high ambiguity (majority sense less than 50%). With respect to the ShARe test set, 69% of short form annotations contained common concept unique identifiers with the Consumer Health Vocabulary. For these 2594 possible annotations, the performance of participating systems ranged from 50% to 75% accuracy. CONCLUSION Short form normalization continues to be a challenging problem. Short form normalization systems perform with moderate to reasonable accuracies. The Consumer Health Vocabulary could enrich its knowledge base with missed concept unique identifiers from the ShARe test set to further support patient understanding of unfamiliar medical terms.
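A majority sense baseline of the kind this abstract mentions can be sketched as follows. This is an illustration under assumptions, not the challenge's actual implementation: each short form is assigned whichever concept identifier it was most often annotated with in the training data. The short forms and the placeholder identifiers (`CUI_PATIENT` and so on) are hypothetical and are not real UMLS CUIs.

```python
from collections import Counter, defaultdict


def train_majority_sense(training_pairs):
    """Map each short form to the concept it is most often annotated with."""
    counts = defaultdict(Counter)
    for short_form, concept in training_pairs:
        counts[short_form.lower()][concept] += 1
    return {sf: c.most_common(1)[0][0] for sf, c in counts.items()}


def accuracy(model, test_pairs):
    """Fraction of test annotations whose concept the baseline predicts."""
    correct = sum(1 for sf, concept in test_pairs
                  if model.get(sf.lower()) == concept)
    return correct / len(test_pairs)


# Hypothetical annotations: (short form, placeholder concept identifier).
train = [
    ("pt", "CUI_PATIENT"), ("pt", "CUI_PATIENT"), ("pt", "CUI_PATIENT"),
    ("pt", "CUI_PHYSIO"),
    ("ra", "CUI_RA"),
]
test = [
    ("pt", "CUI_PATIENT"),   # majority sense, predicted correctly
    ("pt", "CUI_PHYSIO"),    # minority sense, missed by the baseline
    ("RA", "CUI_RA"),        # matched case-insensitively
    ("cxr", "CUI_XRAY"),     # unseen short form, no prediction
]
baseline = train_majority_sense(train)
baseline_accuracy = accuracy(baseline, test)
```

The sketch makes the abstract's ambiguity buckets concrete: the more evenly a short form's annotations are spread across senses, the more test cases a baseline like this one misses.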
Affiliation(s)
- Danielle L Mowery: Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA
- Brett R South: Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA
- Lee Christensen: Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA
- Jianwei Leng: Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA
- Laura-Maria Peltonen: Nursing Science, University of Turku, and Turku University Hospital, Turku, Finland
- Sanna Salanterä: Nursing Science, University of Turku, and Turku University Hospital, Turku, Finland
- Hanna Suominen: Data61, CSIRO, The Australian National University, University of Canberra, and University of Turku, Locked Bag 8001, Canberra, 2601, ACT, Australia
- David Martinez: MedWhat.com, San Francisco, CA, USA; University of Melbourne, Parkville, VIC, Australia
- Sumithra Velupillai: Department of Computer and Systems Sciences (DSV), Stockholm University, Stockholm, Sweden
- Noémie Elhadad: Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Guergana Savova: Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
- Sameer Pradhan: Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
- Wendy W Chapman: Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA
|
31
|
Scuba W, Tharp M, Mowery D, Tseytlin E, Liu Y, Drews FA, Chapman WW. Knowledge Author: facilitating user-driven, domain content development to support clinical information extraction. J Biomed Semantics 2016; 7:42. [PMID: 27338146 PMCID: PMC4919842 DOI: 10.1186/s13326-016-0086-9] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 02/28/2015] [Accepted: 06/01/2016] [Indexed: 11/26/2022] Open
Abstract
BACKGROUND Clinical Natural Language Processing (NLP) systems require a semantic schema comprising domain-specific concepts, their lexical variants, and associated modifiers to accurately extract information from clinical texts. An NLP system leverages this schema to structure concepts and extract meaning from the free texts. In the clinical domain, creating a semantic schema typically requires input from both a domain expert, such as a clinician, and an NLP expert who translates the clinical concepts derived from the clinician's domain expertise into a computable format usable by an NLP system. The goal of this work is to develop a web-based tool, Knowledge Author, that bridges the gap between the clinical domain expert and NLP system development by facilitating the development of domain content represented in a semantic schema for extracting information from clinical free text. RESULTS Knowledge Author is a web-based recommendation system that supports users in developing the domain content necessary for clinical NLP applications. Knowledge Author's schematic model leverages a set of semantic types derived from the Secondary Use Clinical Element Models and the Common Type System to allow the user to quickly create and modify domain-related concepts. Features such as collaborative development and domain content suggestions, provided by mapping concepts to the Unified Medical Language System Metathesaurus database, further support the domain content creation process. Two proof-of-concept studies were performed to evaluate the system's performance. The first study evaluated Knowledge Author's flexibility to create a broad range of concepts. A dataset of 115 concepts was assembled, of which 87 (76%) could be created using Knowledge Author. The second study evaluated the effectiveness of Knowledge Author's output in an NLP system by extracting concepts and associated modifiers representing a clinical element, carotid stenosis, from 34 clinical free-text radiology reports using Knowledge Author and an NLP system, pyConText. Knowledge Author's domain content produced high recall for concepts (targeted findings: 86%) and varied recall for modifiers (certainty: 91%, sidedness: 80%, neurovascular anatomy: 46%). CONCLUSION Knowledge Author can support clinical domain content development for information extraction by supporting semantic schema creation by domain experts.
Affiliation(s)
- William Scuba: Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, 84108, USA
- Melissa Tharp: Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, 84108, USA
- Danielle Mowery: Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, 84108, USA
- Eugene Tseytlin: Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, 15206, USA
- Yang Liu: University of California, San Diego, CA, 92093, USA
- Frank A Drews: Department of Psychology, University of Utah, Salt Lake City, UT, 84108, USA
- Wendy W Chapman: Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, 84108, USA
|
32
|
Shivade C, Malewadkar P, Fosler-Lussier E, Lai AM. Comparison of UMLS terminologies to identify risk of heart disease using clinical notes. J Biomed Inform 2015; 58 Suppl:S103-S110. [PMID: 26375493 DOI: 10.1016/j.jbi.2015.08.025] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 02/11/2015] [Revised: 08/23/2015] [Accepted: 08/25/2015] [Indexed: 10/23/2022]
Abstract
The second track of the 2014 i2b2 challenge asked participants to automatically identify risk factors for heart disease among diabetic patients using natural language processing techniques for clinical notes. This paper describes a rule-based system developed using a combination of regular expressions, concepts from the Unified Medical Language System (UMLS), and freely-available resources from the community. With a performance (F1=90.7) that is significantly higher than the median (F1=87.20) and close to the top performing system (F1=92.8), it was the best rule-based system of all the submissions in the challenge. We also used this system to evaluate the utility of different terminologies in the UMLS towards the challenge task. Of the 155 terminologies in the UMLS, 129 (76.78%) have no representation in the corpus. The Consumer Health Vocabulary had very good coverage of relevant concepts and was the most useful terminology for the challenge task. While segmenting notes into sections and lists has a significant impact on the performance, identifying negations and experiencer of the medical event results in negligible gain.
Affiliation(s)
- Chaitanya Shivade: Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
- Pranav Malewadkar: Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
- Eric Fosler-Lussier: Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
- Albert M Lai: Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
|
33
|
Hanauer DA, Saeed M, Zheng K, Mei Q, Shedden K, Aronson AR, Ramakrishnan N. Applying MetaMap to Medline for identifying novel associations in a large clinical dataset: a feasibility analysis. J Am Med Inform Assoc 2014; 21:925-37. [PMID: 24928177 PMCID: PMC4147617 DOI: 10.1136/amiajnl-2014-002767] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 03/01/2014] [Revised: 05/23/2014] [Accepted: 05/27/2014] [Indexed: 02/07/2023] Open
Abstract
OBJECTIVE We describe experiments designed to determine the feasibility of distinguishing known from novel associations based on a clinical dataset comprised of International Classification of Disease, V.9 (ICD-9) codes from 1.6 million patients by comparing them to associations of ICD-9 codes derived from 20.5 million Medline citations processed using MetaMap. Associations appearing only in the clinical dataset, but not in Medline citations, are potentially novel. METHODS Pairwise associations of ICD-9 codes were independently identified in both the clinical and Medline datasets, which were then compared to quantify their degree of overlap. We also performed a manual review of a subset of the associations to validate how well MetaMap performed in identifying diagnoses mentioned in Medline citations that formed the basis of the Medline associations. RESULTS The overlap of associations based on ICD-9 codes in the clinical and Medline datasets was low: only 6.6% of the 3.1 million associations found in the clinical dataset were also present in the Medline dataset. Further, a manual review of a subset of the associations that appeared in both datasets revealed that co-occurring diagnoses from Medline citations do not always represent clinically meaningful associations. DISCUSSION Identifying novel associations derived from large clinical datasets remains challenging. Medline as a sole data source for existing knowledge may not be adequate to filter out widely known associations. CONCLUSIONS In this study, novel associations were not readily identified. Further improvements in accuracy and relevance for tools such as MetaMap are needed to realize their expected utility.
Affiliation(s)
- David A Hanauer: Department of Pediatrics, University of Michigan Medical School, Ann Arbor, Michigan, USA
- Mohammed Saeed: Department of Internal Medicine, University of Michigan Medical School, Ann Arbor, Michigan, USA
- Kai Zheng: Department of Health Management and Policy, University of Michigan School of Public Health, Ann Arbor, Michigan, USA; School of Information, University of Michigan, Ann Arbor, Michigan, USA
- Qiaozhu Mei: School of Information, University of Michigan, Ann Arbor, Michigan, USA; Department of Electronic Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan, USA
- Kerby Shedden: Center for Statistical Consultation and Research, University of Michigan, Ann Arbor, Michigan, USA
- Alan R Aronson: Lister Hill Center, National Library of Medicine, Bethesda, Maryland, USA
- Naren Ramakrishnan: Department of Computer Science, Discovery Analytics Center, Virginia Tech, Arlington, Virginia, USA
|
34
|
Guo X, Yu Q, Alm CO, Calvelli C, Pelz JB, Shi P, Haake AR. From spoken narratives to domain knowledge: mining linguistic data for medical image understanding. Artif Intell Med 2014; 62:79-90. [PMID: 25174882 DOI: 10.1016/j.artmed.2014.08.001] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 06/02/2013] [Revised: 07/29/2014] [Accepted: 08/10/2014] [Indexed: 10/24/2022]
Abstract
OBJECTIVES Extracting useful visual clues from medical images allowing accurate diagnoses requires physicians' domain knowledge acquired through years of systematic study and clinical training. This is especially true in the dermatology domain, a medical specialty that requires physicians to have image inspection experience. Automating or at least aiding such efforts requires understanding physicians' reasoning processes and their use of domain knowledge. Mining physicians' references to medical concepts in narratives during image-based diagnosis of a disease is an interesting research topic that can help reveal experts' reasoning processes. It can also be a useful resource to assist with design of information technologies for image use and for image case-based medical education systems. METHODS AND MATERIALS We collected data for analyzing physicians' diagnostic reasoning processes by conducting an experiment that recorded their spoken descriptions during inspection of dermatology images. In this paper we focus on the benefit of physicians' spoken descriptions and provide a general workflow for mining medical domain knowledge based on linguistic data from these narratives. The challenge of a medical image case can influence the accuracy of the diagnosis as well as how physicians pursue the diagnostic process. Accordingly, we define two lexical metrics for physicians' narratives--lexical consensus score and top N relatedness score--and evaluate their usefulness by assessing the diagnostic challenge levels of corresponding medical images. We also report on clustering medical images based on anchor concepts obtained from physicians' medical term usage. These analyses are based on physicians' spoken narratives that have been preprocessed by incorporating the Unified Medical Language System for detecting medical concepts. 
RESULTS The image rankings based on lexical consensus score and on top 1 relatedness score are well correlated with those based on challenge levels (Spearman correlation>0.5 and Kendall correlation>0.4). Clustering results are largely improved based on our anchor concept method (accuracy>70% and mutual information>80%). CONCLUSIONS Physicians' spoken narratives are valuable for the purpose of mining the domain knowledge that physicians use in medical image inspections. We also show that the semantic metrics introduced in the paper can be successfully applied to medical image understanding and allow discussion of additional uses of these metrics.
Affiliation(s)
- Xuan Guo: College of Computing & Information Sciences, Rochester Institute of Technology, 20 Lomb Memorial Drive, Rochester, NY 14623, USA
- Qi Yu: College of Computing & Information Sciences, Rochester Institute of Technology, 20 Lomb Memorial Drive, Rochester, NY 14623, USA
- Cecilia Ovesdotter Alm: College of Liberal Arts, Rochester Institute of Technology, 92 Lomb Memorial Drive, Rochester, NY 14623, USA
- Cara Calvelli: College of Health Sciences & Technology, Rochester Institute of Technology, 90 Lomb Memorial Drive, Rochester, NY 14623, USA
- Jeff B Pelz: Center for Imaging Science, Rochester Institute of Technology, 54 Lomb Memorial Drive, Rochester, NY 14623, USA
- Pengcheng Shi: College of Computing & Information Sciences, Rochester Institute of Technology, 20 Lomb Memorial Drive, Rochester, NY 14623, USA
- Anne R Haake: College of Computing & Information Sciences, Rochester Institute of Technology, 20 Lomb Memorial Drive, Rochester, NY 14623, USA
|
35
|
Abstract
OBJECTIVE This work focuses on multiply-related Unified Medical Language System (UMLS) concepts, that is, concepts associated through multiple relations. The relations involved in such situations are audited to determine whether they are provided by source vocabularies or result from the integration of these vocabularies within the UMLS. METHODS We study the compatibility of the multiple relations which associate the concepts under investigation and try to explain the reason why they co-occur. Towards this end, we analyze the relations both at the concept and term levels. In addition, we randomly select 288 concepts associated through contradictory relations and manually analyze them. RESULTS At the UMLS scale, only 0.7% of combinations of relations are contradictory, while homogeneous combinations are observed in one-third of situations. At the scale of source vocabularies, one-third do not contain more than one relation between the concepts under investigation. Among the remaining source vocabularies, seven of them mainly present multiple non-homogeneous relations between terms. Analysis at the term level also shows that only in a quarter of cases are the source vocabularies responsible for the presence of multiply-related concepts in the UMLS. These results are available at: http://www.isped.u-bordeaux2.fr/ArticleJAMIA/results_multiply_related_concepts.aspx. DISCUSSION Manual analysis was useful to explain the conceptualization difference in relations between terms across source vocabularies. The exploitation of source relations was helpful for understanding why some source vocabularies describe multiple relations between a given pair of terms.
Affiliation(s)
- Fleur Mougin: ISPED, Université de Bordeaux 2, Bordeaux, France; ERIAS, INSERM, Centre INSERM U897, Bordeaux, France
- Natalia Grabar: CNRS UMR 8163 STL, Université Lille 1 and 3, Villeneuve d'Ascq, France
|
36
|
Gobbel GT, Reeves R, Jayaramaraja S, Giuse D, Speroff T, Brown SH, Elkin PL, Matheny ME. Development and evaluation of RapTAT: a machine learning system for concept mapping of phrases from medical narratives. J Biomed Inform 2013; 48:54-65. [PMID: 24316051 DOI: 10.1016/j.jbi.2013.11.008] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Received: 01/23/2013] [Revised: 08/16/2013] [Accepted: 11/17/2013] [Indexed: 11/16/2022]
Abstract
Rapid, automated determination of the mapping of free text phrases to pre-defined concepts could assist in the annotation of clinical notes and increase the speed of natural language processing systems. The aim of this study was to design and evaluate a token-order-specific naïve Bayes-based machine learning system (RapTAT) to predict associations between phrases and concepts. Performance was assessed using a reference standard generated from 2860 VA discharge summaries containing 567,520 phrases that had been mapped to 12,056 distinct Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT) concepts by the MCVS natural language processing system. It was also assessed on the manually annotated, 2010 i2b2 challenge data. Performance was established with regard to precision, recall, and F-measure for each of the concepts within the VA documents using bootstrapping. Within that corpus, concepts identified by MCVS were broadly distributed throughout SNOMED CT, and the token-order-specific language model achieved better performance based on precision, recall, and F-measure (0.95±0.15, 0.96±0.16, and 0.95±0.16, respectively; mean±SD) than the bag-of-words based, naïve Bayes model (0.64±0.45, 0.61±0.46, and 0.60±0.45, respectively) that has previously been used for concept mapping. Precision, recall, and F-measure on the i2b2 test set were 92.9%, 85.9%, and 89.2% respectively, using the token-order-specific model. RapTAT required just 7.2ms to map all phrases within a single discharge summary, and mapping rate did not decrease as the number of processed documents increased. The high performance attained by the tool in terms of both accuracy and speed was encouraging, and the mapping rate should be sufficient to support near-real-time, interactive annotation of medical narratives. 
These results demonstrate the feasibility of rapidly and accurately mapping phrases to a wide range of medical concepts based on a token-order-specific naïve Bayes model and machine learning.
Affiliation(s)
- Glenn T Gobbel: Geriatric Research, Education and Clinical Center (GRECC), Department of Veterans Affairs Medical Center, Nashville, TN, USA; Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA; Division of General Internal Medicine & Public Health, Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA
- Ruth Reeves: Geriatric Research, Education and Clinical Center (GRECC), Department of Veterans Affairs Medical Center, Nashville, TN, USA; Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA
- Shrimalini Jayaramaraja: Geriatric Research, Education and Clinical Center (GRECC), Department of Veterans Affairs Medical Center, Nashville, TN, USA; Division of General Internal Medicine & Public Health, Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA
- Dario Giuse: Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA
- Theodore Speroff: Geriatric Research, Education and Clinical Center (GRECC), Department of Veterans Affairs Medical Center, Nashville, TN, USA; Division of General Internal Medicine & Public Health, Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA; Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN, USA
- Steven H Brown: Geriatric Research, Education and Clinical Center (GRECC), Department of Veterans Affairs Medical Center, Nashville, TN, USA; Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA
- Peter L Elkin: Department of Biomedical Informatics, University at Buffalo, SUNY, Buffalo, NY, USA
- Michael E Matheny: Geriatric Research, Education and Clinical Center (GRECC), Department of Veterans Affairs Medical Center, Nashville, TN, USA; Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA; Division of General Internal Medicine & Public Health, Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA; Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN, USA
|
37
|
Wei WQ, Cronin RM, Xu H, Lasko TA, Bastarache L, Denny JC. Development and evaluation of an ensemble resource linking medications to their indications. J Am Med Inform Assoc 2013; 20:954-61. [PMID: 23576672 PMCID: PMC3756263 DOI: 10.1136/amiajnl-2012-001431] [Citation(s) in RCA: 68] [Impact Index Per Article: 6.2] [Received: 10/21/2012] [Revised: 02/25/2013] [Accepted: 03/18/2013] [Indexed: 11/09/2022] Open
Abstract
OBJECTIVE To create a computable MEDication Indication resource (MEDI) to support primary and secondary use of electronic medical records (EMRs). MATERIALS AND METHODS We processed four public medication resources, RxNorm, Side Effect Resource (SIDER) 2, MedlinePlus, and Wikipedia, to create MEDI. We applied natural language processing and ontology relationships to extract indications for prescribable, single-ingredient medication concepts and all ingredient concepts as defined by RxNorm. Indications were coded as Unified Medical Language System (UMLS) concepts and International Classification of Diseases, 9th edition (ICD9) codes. A total of 689 extracted indications were randomly selected for manual review for accuracy using dual-physician review. We identified a subset of medication-indication pairs that optimizes recall while maintaining high precision. RESULTS MEDI contains 3112 medications and 63 343 medication-indication pairs. Wikipedia was the largest resource, with 2608 medications and 34 911 pairs. For each resource, estimated precision and recall, respectively, were 94% and 20% for RxNorm, 75% and 33% for MedlinePlus, 67% and 31% for SIDER 2, and 56% and 51% for Wikipedia. The MEDI high-precision subset (MEDI-HPS) includes indications found within either RxNorm or at least two of the three other resources. MEDI-HPS contains 13 304 unique indication pairs regarding 2136 medications. The mean±SD number of indications for each medication in MEDI-HPS is 6.22 ± 6.09. The estimated precision of MEDI-HPS is 92%. CONCLUSIONS MEDI is a publicly available, computable resource that links medications with their indications as represented by concepts and billing codes. MEDI may benefit clinical EMR applications and reuse of EMR data for research.
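The inclusion rule stated for MEDI-HPS (found in RxNorm, or in at least two of the three other resources) can be written as a small filter. The sketch below is only an illustration of that stated rule, not the authors' code. The function name, the data layout, and the medication and indication names are hypothetical.

```python
# The three non-RxNorm resources named in the abstract.
OTHER_RESOURCES = {"MedlinePlus", "SIDER 2", "Wikipedia"}


def in_high_precision_subset(sources):
    """Apply the stated MEDI-HPS rule to one medication-indication pair.

    sources: set of resource names in which the pair was found.
    """
    sources = set(sources)
    return "RxNorm" in sources or len(sources & OTHER_RESOURCES) >= 2


# Hypothetical pairs mapped to the resources that support them.
toy_pairs = {
    ("drug_a", "indication_x"): {"RxNorm"},
    ("drug_a", "indication_y"): {"Wikipedia"},
    ("drug_b", "indication_z"): {"MedlinePlus", "Wikipedia"},
    ("drug_b", "indication_w"): {"SIDER 2"},
}
high_precision_pairs = [pair for pair, found_in in toy_pairs.items()
                        if in_high_precision_subset(found_in)]
```

On the reported per-resource estimates, the rule admits RxNorm pairs on their own (94% estimated precision) and requires agreement between two of the lower-precision resources otherwise.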
Affiliation(s)
- Wei-Qi Wei: Department of Biomedical Informatics, Vanderbilt University, Nashville, Tennessee, USA
|
38
|
Abstract
Taxonomies are commonly used for organizing knowledge, particularly in biomedicine where the taxonomy of living organisms and the classification of diseases are central to the domain. The principles used to produce taxonomies are either intrinsic (properties of the partial ordering relation) or added to make knowledge more manageable (opposition of siblings and economy). The applicability of these principles in the biomedical domain is presented using the Unified Medical Language System (UMLS) and issues raised by the application of these principles are illustrated. While intrinsic principles are not challenged, we argue that the opposition of siblings brings to bear excessive constraints on a domain ontology and that the adverse effects of economy may outweigh its benefits. The two-level structure used in the UMLS is discussed.
|