1
|
Thakkar V, Silverman GM, Kc A, Ingraham NE, Jones EK, King S, Melton GB, Zhang R, Tignanelli CJ. A comparative analysis of large language models versus traditional information extraction methods for real-world evidence of patient symptomatology in acute and post-acute sequelae of SARS-CoV-2. PLoS One 2025; 20:e0323535. [PMID: 40373001 PMCID: PMC12080813 DOI: 10.1371/journal.pone.0323535] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2024] [Accepted: 04/10/2025] [Indexed: 05/17/2025] Open
Abstract
BACKGROUND Patient symptoms, crucial for disease progression and diagnosis, are often captured in unstructured clinical notes. Large language models (LLMs) offer potential advantages in extracting patient symptoms compared to traditional rule-based information extraction (IE) systems. METHODS This study compared fine-tuned LLMs (LLaMA2-13B and LLaMA3-8B) against BioMedICUS, a rule-based IE system, for extracting symptoms related to acute and post-acute sequelae of SARS-CoV-2 from clinical notes. The study utilized three corpora: UMN-COVID, UMN-PASC, and N3C-COVID. Prevalence, keyword and fairness analyses were conducted to assess symptom distribution and model equity across demographics. RESULTS BioMedICUS outperformed fine-tuned LLMs in most cases. On the UMN PASC dataset, BioMedICUS achieved a macro-averaged F1-score of 0.70 for positive mention detection, compared to 0.66 for LLaMA2-13B and 0.62 for LLaMA3-8B. For the N3C COVID dataset, BioMedICUS scored 0.75, while LLaMA2-13B and LLaMA3-8B scored 0.53 and 0.68, respectively for positive mention detection. However, LLMs performed better in specific instances, such as detecting positive mentions of change in sleep in the UMN PASC dataset, where LLaMA2-13B (0.79) and LLaMA3-8B (0.65) outperformed BioMedICUS (0.60). For fairness analysis, BioMedICUS generally showed stronger performance across patient demographics. Keyword analysis using ANOVA on symptom distributions across all three corpora showed that both corpus (df = 2, p < 0.001) and symptom (df = 79, p < 0.001) have a statistically significant effect on log-transformed term frequency-inverse document frequency (TF-IDF) values such that corpus accounts for 52% of the variance in log_tfidf values and symptom accounts for 35%. CONCLUSION While BioMedICUS generally outperformed the LLMs, the latter showed promising results in specific areas, particularly LLaMA3-8B, in identifying negative symptom mentions. 
However, both LLaMA models faced challenges in demographic fairness and generalizability. These findings underscore the need for diverse, high-quality training datasets and robust annotation processes to enhance LLMs' performance and reliability in clinical applications.
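The macro-averaged F1-scores reported above can be illustrated with a minimal sketch. The abstract does not say exactly what the average runs over, so this example assumes one F1 per label followed by an unweighted mean; the function name `macro_f1` and the toy labels are hypothetical and not taken from the study.

```python
from collections import defaultdict


def macro_f1(gold, pred):
    """Macro-averaged F1: one F1 per label, then the unweighted mean."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it was not the gold label
            fn[g] += 1  # missed the gold label g
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (assumed for illustration only)
gold = ["fever", "cough", "fever", "fatigue"]
pred = ["fever", "fever", "fever", "fatigue"]
score = macro_f1(gold, pred)
```

Because every label counts equally, a rare symptom that is always missed pulls the macro average down as much as a common one would.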
Affiliation(s)
- Vedansh Thakkar
- Department of Surgery, University of Minnesota, Minneapolis, Minnesota, United States of America
- Natural Language Processing/Information Extraction Program, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Greg M. Silverman
- Department of Surgery, University of Minnesota, Minneapolis, Minnesota, United States of America
- Natural Language Processing/Information Extraction Program, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Abhinab Kc
- University of Minnesota Medical School, Minneapolis, Minnesota, United States of America
| | - Nicholas E. Ingraham
- Department of Pulmonary, Allergy, Critical Care, and Sleep Medicine, University of Minnesota, Minneapolis, Minnesota, United States of America
- Center for Learning Health Systems Sciences, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Emma K. Jones
- Department of Surgery, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Samantha King
- Department of Surgery, University of Washington, Seattle, Washington, United States of America
| | - Genevieve B. Melton
- Department of Surgery, University of Minnesota, Minneapolis, Minnesota, United States of America
- Natural Language Processing/Information Extraction Program, University of Minnesota, Minneapolis, Minnesota, United States of America
- Center for Learning Health Systems Sciences, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Rui Zhang
- Department of Surgery, University of Minnesota, Minneapolis, Minnesota, United States of America
- Natural Language Processing/Information Extraction Program, University of Minnesota, Minneapolis, Minnesota, United States of America
- Center for Learning Health Systems Sciences, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Christopher J. Tignanelli
- Department of Surgery, University of Minnesota, Minneapolis, Minnesota, United States of America
- Natural Language Processing/Information Extraction Program, University of Minnesota, Minneapolis, Minnesota, United States of America
- Center for Learning Health Systems Sciences, University of Minnesota, Minneapolis, Minnesota, United States of America
|
2
|
Erlanson N, China JF, Taavola H, Norén GN. Clinical Relatedness and Stability of vigiVec Semantic Vector Representations of Adverse Events and Drugs in Pharmacovigilance. Drug Saf 2025; 48:401-413. [PMID: 39833656 PMCID: PMC11903574 DOI: 10.1007/s40264-024-01509-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 12/10/2024] [Indexed: 01/22/2025]
Abstract
INTRODUCTION Individual case reports are essential to identify and assess previously unknown adverse effects of medicines. On these reports, information on adverse events (AEs) and drugs is encoded in hierarchical terminologies. Encoding differences may hinder the retrieval and analysis of clinically related reports relevant to a topic of interest. Recent studies have explored the use of data-driven semantic vector representations to support analysis of pharmacovigilance data. OBJECTIVE This study aims to evaluate the stability and clinical relatedness of vigiVec, a semantic vector representation for codes of AEs and drugs. METHODS vigiVec is a published adaptation to pharmacovigilance of the publicly available Word2Vec model, applied to structured data instead of free text. It provides vector representations for MedDRA® Preferred Terms and WHODrug Global active ingredients, learned from reporting patterns in VigiBase, the WHO global database of adverse event reports for medicines and vaccines. For this study, a 20-dimensional Skip-gram architecture with window size 250 was used. Our evaluation focused on nearest neighbors identified by the cosine similarity of vigiVec vector representations. Clinical relatedness was measured through term intruder detection, whereby a medical doctor was tasked to identify a randomly selected term (the intruder) included among the four nearest neighbors to a specific AE or drug. Stability was measured as the average overlap in the ten nearest neighbors for each AE or drug, in repeated fittings of vigiVec. RESULTS Among the ten nearest neighbors, 1.8 AEs on average belonged to the same MedDRA High Level Term (HLT; e.g., coagulopathies), and 1.3 drugs belonged to the same Anatomical Therapeutic Chemical level 3 (ATC-3; e.g., opioids). In the intruder detection task, when neighbors and intruders were both chosen from the same HLT, the intruder detection rate was 46%. When selected from different HLTs, it was 79%. 
By random chance, we should expect 20% (1 in 5). Corresponding rates for drugs were 42% within the same ATC-3 and 65% across different ATC-3 groups. The stability of nearest neighbors was 80% for AEs and 64% for drugs. CONCLUSION Nearest neighbors identified with vigiVec are stable and show a high level of clinical relatedness. They are often from different parts of the existing hierarchies and complement these.
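The two quantities behind this evaluation, nearest neighbors by cosine similarity and stability as the average neighbor overlap between repeated fittings, can be sketched as follows. This is an illustrative sketch only, not the vigiVec implementation: the function names, the two-dimensional toy vectors, and the use of just two fittings are assumptions, and all vectors are assumed to be nonzero.

```python
import math


def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest_neighbors(vectors, term, k):
    """The k terms most cosine-similar to `term`, excluding the term itself."""
    others = [t for t in vectors if t != term]
    others.sort(key=lambda t: cosine(vectors[term], vectors[t]), reverse=True)
    return others[:k]


def neighbor_stability(fit_a, fit_b, k):
    """Mean fraction of k-nearest neighbors shared between two fittings."""
    shared = [
        len(set(nearest_neighbors(fit_a, t, k)) & set(nearest_neighbors(fit_b, t, k))) / k
        for t in fit_a
        if t in fit_b
    ]
    return sum(shared) / len(shared)


# Toy embeddings (hypothetical values, two dimensions for readability)
fit_a = {"nausea": [1.0, 0.1], "vomiting": [0.9, 0.2], "rash": [0.1, 1.0], "pruritus": [0.2, 0.9]}
fit_b = {"nausea": [0.9, 0.1], "vomiting": [1.0, 0.3], "rash": [0.2, 1.0], "pruritus": [0.1, 0.8]}

closest = nearest_neighbors(fit_a, "nausea", 1)
stability = neighbor_stability(fit_a, fit_b, 1)
```

The 20% chance rate quoted for intruder detection follows from the task design: one intruder among five displayed terms gives 1/5.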
|
3
|
Gan Z, Zhou D, Rush E, Panickan VA, Ho YL, Ostrouchovm G, Xu Z, Shen S, Xiong X, Greco KF, Hong C, Bonzel CL, Wen J, Costa L, Cai T, Begoli E, Xia Z, Gaziano JM, Liao KP, Cho K, Cai T, Lu J. ARCH: Large-scale knowledge graph via aggregated narrative codified health records analysis. J Biomed Inform 2025; 162:104761. [PMID: 39863245 PMCID: PMC12066163 DOI: 10.1016/j.jbi.2024.104761] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2024] [Revised: 11/26/2024] [Accepted: 12/08/2024] [Indexed: 01/27/2025]
Abstract
OBJECTIVE Electronic health record (EHR) systems contain a wealth of clinical data stored as both codified data and free-text narrative notes (NLP). The complexity of EHR presents challenges in feature representation, information extraction, and uncertainty quantification. To address these challenges, we proposed an efficient Aggregated naRrative Codified Health (ARCH) records analysis to generate a large-scale knowledge graph (KG) for a comprehensive set of EHR codified and narrative features. METHODS Using data from 12.5 million Veterans Affairs patients, ARCH first derives embedding vectors and generates similarities along with associated p-values to measure the strength of relatedness between clinical features with statistical certainty quantification. Next, ARCH performs a sparse embedding regression to remove indirect linkage between features to build a sparse KG. Finally, ARCH was validated on various clinical tasks, including detecting known relationships between entity pairs, predicting drug side effects, disease phenotyping, as well as sub-typing Alzheimer's disease patients. RESULTS ARCH produces high-quality clinical embeddings and KG for over 60,000 codified and narrative EHR concepts. The KG and embeddings are visualized in the R-shiny powered web-API. ARCH achieved high accuracy in detecting EHR concept relationships, with AUCs of 0.926 (codified) and 0.861 (NLP) for similar EHR concepts, and 0.810 (codified) and 0.843 (NLP) for related pairs. It detected drug side effects with a 0.723 AUC, which improved to 0.826 after fine-tuning. Using both codified and NLP features, the detection power increased significantly. Compared to other methods, ARCH has superior accuracy and enhances weakly supervised phenotyping algorithms' performance. Notably, it successfully categorized Alzheimer's patients into two subgroups with varying mortality rates. 
CONCLUSION The proposed ARCH algorithm generates large-scale high-quality semantic representations and knowledge graph for both codified and NLP EHR features, useful for a wide range of predictive modeling tasks.
Affiliation(s)
- Ziming Gan
- Department of Statistics, University of Chicago, 5801 S Ellis Ave, Chicago, 60615, IL, USA
| | - Doudou Zhou
- Department of Statistics and Data Science, National University of Singapore, 117546, Singapore
| | - Everett Rush
- Oak Ridge National Laboratory, Bethel Valley Rd, Oak Ridge, 37830, TN, USA
| | - Vidul A Panickan
- Harvard Medical School, 25 Shattuck St, Boston, 02115, MA, USA; VA Boston Healthcare System, 150 S Huntington Ave, Boston, 02130, MA, USA
| | - Yuk-Lam Ho
- VA Boston Healthcare System, 150 S Huntington Ave, Boston, 02130, MA, USA
| | - George Ostrouchovm
- Oak Ridge National Laboratory, Bethel Valley Rd, Oak Ridge, 37830, TN, USA
| | - Zhiwei Xu
- Department of Statistics, University of Michigan, 500 S State St, Ann Arbor, 48109, MI, USA
| | - Shuting Shen
- Department of Biostatistics & Bioinformatics, Duke University, 1121 West Main St, Durham, 27708, NC, USA
| | - Xin Xiong
- Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, 02115, MA, USA
| | - Kimberly F Greco
- Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, 02115, MA, USA
| | - Chuan Hong
- Department of Biostatistics & Bioinformatics, Duke University, 1121 West Main St, Durham, 27708, NC, USA
| | - Clara-Lea Bonzel
- Harvard Medical School, 25 Shattuck St, Boston, 02115, MA, USA; VA Boston Healthcare System, 150 S Huntington Ave, Boston, 02130, MA, USA
| | - Jun Wen
- Harvard Medical School, 25 Shattuck St, Boston, 02115, MA, USA
| | - Lauren Costa
- VA Boston Healthcare System, 150 S Huntington Ave, Boston, 02130, MA, USA
| | - Tianrun Cai
- VA Boston Healthcare System, 150 S Huntington Ave, Boston, 02130, MA, USA; Brigham and Women's Hospital, 60 Fenwood Rd, Boston, 02115, MA, USA
| | - Edmon Begoli
- Oak Ridge National Laboratory, Bethel Valley Rd, Oak Ridge, 37830, TN, USA
| | - Zongqi Xia
- Clinical and Translational Science, University of Pittsburgh, 3501 Fifth Avenue, Pittsburgh, 15260, PA, USA
| | - J Michael Gaziano
- Harvard Medical School, 25 Shattuck St, Boston, 02115, MA, USA; VA Boston Healthcare System, 150 S Huntington Ave, Boston, 02130, MA, USA; Brigham and Women's Hospital, 60 Fenwood Rd, Boston, 02115, MA, USA
| | - Katherine P Liao
- VA Boston Healthcare System, 150 S Huntington Ave, Boston, 02130, MA, USA; Brigham and Women's Hospital, 60 Fenwood Rd, Boston, 02115, MA, USA
| | - Kelly Cho
- Harvard Medical School, 25 Shattuck St, Boston, 02115, MA, USA; VA Boston Healthcare System, 150 S Huntington Ave, Boston, 02130, MA, USA; Brigham and Women's Hospital, 60 Fenwood Rd, Boston, 02115, MA, USA
| | - Tianxi Cai
- Harvard Medical School, 25 Shattuck St, Boston, 02115, MA, USA; VA Boston Healthcare System, 150 S Huntington Ave, Boston, 02130, MA, USA; Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, 02115, MA, USA
| | - Junwei Lu
- VA Boston Healthcare System, 150 S Huntington Ave, Boston, 02130, MA, USA; Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, 02115, MA, USA.
|
4
|
Zahra FA, Kate RJ. Obtaining clinical term embeddings from SNOMED CT ontology. J Biomed Inform 2024; 149:104560. [PMID: 38070816 DOI: 10.1016/j.jbi.2023.104560] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2023] [Revised: 11/29/2023] [Accepted: 12/05/2023] [Indexed: 01/22/2024]
Abstract
Clinical term embeddings are traditionally obtained using corpus-based methods, however, these methods cannot incorporate knowledge about clinical terms which is already present in medical ontologies. On the other hand, graph-based methods can obtain embeddings of clinical concepts from ontologies, but they cannot obtain embeddings for clinical terms and words. In this paper, a novel method is presented to obtain embeddings for clinical terms and words from the SNOMED CT ontology. The method first obtains embeddings of clinical concepts from SNOMED CT using a graph-based method. Next, these concept embeddings are used as targets to train a deep learning model to map clinical terms to concepts embeddings. The learned model then provides embeddings for clinical terms and words as well as maps novel clinical terms to their embeddings. The embeddings obtained using the method out-performed corpus-based embeddings on the task of predicting clinical term similarity on five benchmark datasets. On the clinical term normalization task, using these embeddings simply as a means of computing similarity between clinical terms obtained accuracy which was competitive to methods trained specifically for this task. Both corpus-based and ontology-based embeddings have a limitation that they tend to learn similar embeddings for opposite or analogous terms. To counter this, we also introduce a method to automatically learn patterns that indicate when two clinical terms represent the same concept and when they represent different concepts. Supplementing the normalization process with these patterns showed improvement. Although clinical term embeddings obtained from SNOMED CT incorporate ontological knowledge which is missed by corpus-based embeddings, they do not incorporate linguistic knowledge which is needed for sentence-based tasks. Hence combining ontology-based embeddings with corpus-based embeddings is an avenue for future work.
Affiliation(s)
- Fuad Abu Zahra
- Department of Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI, USA
| | - Rohit J Kate
- Department of Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI, USA.
|
5
|
Zhou H, Silverman G, Niu Z, Silverman J, Evans R, Austin R, Zhang R. Extracting Complementary and Integrative Health Approaches in Electronic Health Records. JOURNAL OF HEALTHCARE INFORMATICS RESEARCH 2023; 7:277-290. [PMID: 37637720 PMCID: PMC10449701 DOI: 10.1007/s41666-023-00137-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2022] [Revised: 04/12/2023] [Accepted: 07/03/2023] [Indexed: 08/29/2023]
Abstract
Complementary and Integrative Health (CIH) has gained increasing popularity in the past decades. While the evidence base supporting CIH approaches is growing, there is still a gap in understanding their effects and potential adverse events using real-world data. The overall goal of this study is to represent information pertinent to both psychological and physical CIH approaches (specifically, using examples of music therapy, chiropractic, and aquatic exercise in this study) in an electronic health record (EHR) system. We also aim to evaluate the ability of existing natural language processing (NLP) systems to identify CIH approaches. A total of 300 notes were randomly selected and manually annotated. Annotations were made for status, symptom, and frequency of each approach. This set of annotations was used as a gold standard to evaluate the performance of NLP systems used in this study (specifically BioMedICUS, MetaMap, and cTAKES) for extracting CIH concepts. A Venn diagram was used to investigate the consistency between medical record searches by Current Procedural Terminology (CPT) codes and by CIH approach keywords in SQL. Since CPT codes usually do not mention specific CIH approaches, the CPT-based results showed little overlap with those found in clinical notes for all three CIH therapies. The three NLP systems achieved an average lenient-match F1-score of 0.41 across all three CIH approaches. BioMedICUS achieved the best performance in aquatic exercise with an F1-score of 0.66. This study contributes to the overall representation of CIH in clinical notes and lays a foundation for using EHR for clinical research for CIH approaches.
Affiliation(s)
- Huixue Zhou
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN 55414 USA
| | - Greg Silverman
- Department of Surgery, University of Minnesota, Minneapolis, MN 55414 USA
| | - Zhongran Niu
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN 55414 USA
| | - Jenzi Silverman
- Earl E. Bakken Center for Spirituality & Healing, University of Minnesota, Minneapolis, MN 55414 USA
| | - Roni Evans
- Earl E. Bakken Center for Spirituality & Healing, University of Minnesota, Minneapolis, MN 55414 USA
| | - Robin Austin
- School of Nursing, University of Minnesota, Minneapolis, MN 55414 USA
| | - Rui Zhang
- Department of Surgery, University of Minnesota, Minneapolis, MN 55414 USA
|
6
|
Welvaars K, Oosterhoff JHF, van den Bekerom MPJ, Doornberg JN, van Haarst EP, OLVG Urology Consortium, and the Machine Learning Consortium; van der Zee JA, van Andel GA, Lagerveld BW, Hovius MC, Kauer PC, Boevé LMS, van der Kuit A, Mallee W, Poolman R. Implications of resampling data to address the class imbalance problem (IRCIP): an evaluation of impact on performance between classification algorithms in medical data. JAMIA Open 2023; 6:ooad033. [PMID: 37266187 PMCID: PMC10232287 DOI: 10.1093/jamiaopen/ooad033] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 02/03/2023] [Revised: 04/04/2023] [Accepted: 05/11/2023] [Indexed: 06/03/2023] Open
Abstract
Objective When correcting for the "class imbalance" problem in medical data, the effects of resampling applied on classifier algorithms remain unclear. We examined the effect on performance over several combinations of classifiers and resampling ratios. Materials and Methods Multiple classification algorithms were trained on 7 resampled datasets: no correction, random undersampling, 4 ratios of Synthetic Minority Oversampling Technique (SMOTE), and random oversampling with the Adaptive Synthetic algorithm (ADASYN). Performance was evaluated in Area Under the Curve (AUC), precision, recall, Brier score, and calibration metrics. A case study on prediction modeling for 30-day unplanned readmissions in previously admitted Urology patients was presented. Results For most algorithms, using resampled data showed a significant increase in AUC and precision, ranging from 0.74 (CI: 0.69-0.79) to 0.93 (CI: 0.92-0.94), and 0.35 (CI: 0.12-0.58) to 0.86 (CI: 0.81-0.92) respectively. All classification algorithms showed significant increases in recall, and significant decreases in Brier score with distorted calibration overestimating positives. Discussion Imbalance correction resulted in an overall improved performance, yet poorly calibrated models. There can still be clinical utility due to a strong discriminating performance, specifically when predicting only low and high risk cases is clinically more relevant. Conclusion Resampling data resulted in increased performances in classification algorithms, yet produced an overestimation of positive predictions. Based on the findings from our case study, a thoughtful predefinition of the clinical prediction task may guide the use of resampling techniques in future studies aiming to improve clinical decision support tools.
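The oversampling idea behind SMOTE can be sketched in a few lines: pick a minority-class point, pick one of its nearest minority neighbors, and create a synthetic point somewhere on the segment between them. The sketch below is a simplified illustration under stated assumptions, not the implementation or the resampling ratios used in the study; the function name `smote_like`, the parameter defaults, and the toy points are hypothetical, and the minority set is assumed to hold at least two distinct points.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority points by neighbor interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbors of x by squared Euclidean distance
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nn = rng.choice(neighbors)
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nn)))
    return synthetic


# Toy minority class (assumed values for illustration)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote_like(minority, n_new=5)
```

Because the synthetic points raise the apparent prevalence of the minority class in the training data, a model fitted on them tends to predict positives more often than they occur, which is consistent with the calibration distortion the abstract reports.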
Affiliation(s)
- Koen Welvaars
- Corresponding Author: Koen Welvaars, MSc, Data Science Team, OLVG, Jan Tooropstraat 164, 1061 AE Amsterdam, the Netherlands;
|
7
|
Kartheeswaran KP, Rayan AXA, Varrieth GT. Enhanced disease-disease association with information enriched disease representation. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:8892-8932. [PMID: 37161227 DOI: 10.3934/mbe.2023391] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 05/11/2023]
Abstract
OBJECTIVE Quantification of disease-disease association (DDA) enables the understanding of disease relationships for discovering disease progression and finding comorbidity. For effective DDA strength calculation, the main challenge is to integrate various biomedical aspects of DDA to obtain an information-rich disease representation. MATERIALS AND METHODS An enhanced and integrated DDA framework is developed that integrates an enriched literature-based DDA representation with a concept-based one. The literature component of the proposed framework uses PubMed abstracts and consists of an improved neural network model that classifies DDAs for an enhanced literature-based DDA representation. Similarly, an ontology-based joint multi-source association embedding model is proposed in the ontology component using Disease Ontology (DO), UMLS, insurance claims, clinical notes, etc. RESULTS AND DISCUSSION The obtained information-rich disease representation is evaluated on different aspects of DDA datasets such as Gene, Variant, Gene Ontology (GO) and a human-rated benchmark dataset. The DDA scores calculated using the proposed method achieved a high correlation, mainly on the gene-based dataset. The quantified scores also showed a better correlation of 0.821 when evaluated on 213 human-rated disease pairs. In addition, the generated disease representation is proved to have a substantial effect on the correlation of DDA scores for different categories of disease pairs. CONCLUSION The enhanced context and semantic DDA framework provides an enriched disease representation, resulting in highly correlated results with different DDA datasets. We have also presented the biological interpretation of disease pairs. The developed framework can also be used for deriving the strength of other biomedical associations.
|
8
|
Parwez MA, Fazil M, Arif M, Nafis MT, Auwul MR. Biomedical Text Classification Using Augmented Word Representation Based on Distributional and Relational Contexts. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2023; 2023:2989791. [PMID: 39262497 PMCID: PMC11390191 DOI: 10.1155/2023/2989791] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2022] [Revised: 09/26/2022] [Accepted: 09/27/2022] [Indexed: 09/13/2024]
Abstract
Due to the increasing use of information technologies by biomedical experts, researchers, public health agencies, and healthcare professionals, a large number of scientific literatures, clinical notes, and other structured and unstructured text resources are rapidly increasing and being stored in various data sources like PubMed. These massive text resources can be leveraged to extract valuable knowledge and insights using machine learning techniques. Recent advancement in neural network-based classification models has gained popularity which takes numeric vectors (aka word representation) of training data as the input to train classification models. Better the input vectors, more accurate would be the classification. Word representations are learned as the distribution of words in an embedding space, wherein each word has its vector and the semantically similar words based on the contexts appear nearby each other. However, such distributional word representations are incapable of encapsulating relational semantics between distant words. In the biomedical domain, relation mining is a well-studied problem which aims to extract relational words, which associates distant entities generally representing the subject and object of a sentence. Our goal is to capture the relational semantics information between distant words from a large corpus to learn enhanced word representation and employ the learned word representation for various natural language processing tasks such as text classification. In this article, we have proposed an application of biomedical relation triplets to learn word representation through incorporating relational semantic information within the distributional representation of words. In other words, the proposed approach aims to capture both distributional and relational contexts of the words to learn their numeric vectors from text corpus. We have also proposed an application of the learned word representations for text classification. 
The proposed approach is evaluated over multiple benchmark datasets, and the efficacy of the learned word representations is tested in terms of word similarity and concept categorization tasks. Our proposed approach provides better performance in comparison to the state-of-the-art GloVe model. Furthermore, we have applied the learned word representations to classify biomedical texts using four neural network-based classification models, and the classification accuracy further confirms the effectiveness of the learned word representations by our proposed approach.
Affiliation(s)
- Md Aslam Parwez
- Department of Computer Science & Engineering, Jamia Hamdard, New Delhi, India
| | - Mohd Fazil
- University of Limerick, Limerick, Ireland
| | - Muhammad Arif
- Department of Computer Science, Superior University Lahore, Lahore 54000, Pakistan
| | - Md Tabrez Nafis
- Department of Computer Science & Engineering, Jamia Hamdard, New Delhi, India
| | - Md Rabiul Auwul
- Department of Statistics, Bangabandhu Sheikh Mujibur Rahman Agricultural University, Gazipur 1706, Bangladesh
|
9
|
Yokokawa D, Noda K, Yanagita Y, Uehara T, Ohira Y, Shikino K, Tsukamoto T, Ikusaka M. Validating the representation of distance between infarct diseases using word embedding. BMC Med Inform Decis Mak 2022; 22:322. [PMID: 36476486 PMCID: PMC9730570 DOI: 10.1186/s12911-022-02061-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2022] [Accepted: 11/22/2022] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND The pivot and cluster strategy (PCS) is a diagnostic reasoning strategy that automatically elicits disease clusters similar to a differential diagnosis in a batch. Although physicians know empirically which disease clusters are similar, there has been no quantitative evaluation. This study aimed to determine whether inter-disease distances between word embedding vectors using the PCS are a valid quantitative representation of similar disease groups in a limited domain. METHODS Abstracts were extracted from the Ichushi Web database and subjected to morphological analysis and training using Word2Vec, FastText, and GloVe. Consequently, word embedding vectors were obtained. For words including "infarction," we calculated the cophenetic correlation coefficient (CCC) as an internal validity measure and the adjusted rand index (ARI), normalized mutual information (NMI), and adjusted mutual information (AMI) with ICD-10 codes as the external validity measures. This was performed for each combination of metric and hierarchical clustering method. RESULTS Seventy-one words included "infarction," of which 38 diseases matched the ICD-10 standard with the appearance of 21 unique ICD-10 codes. When using Word2Vec, the CCC was most significant at 0.8690 (metric and method: euclidean and centroid), whereas the AMI was maximal at 0.4109 (metric and method: cosine and correlation, and average and weighted). The NMI and ARI were maximal at 0.8463 and 0.3593, respectively (metric and method: cosine and complete). FastText and GloVe generally resulted in the same trend as Word2Vec, and the metric and method that maximized CCC differed from the ones that maximized the external validity measures. CONCLUSIONS The metric and method that maximized the internal validity measure differed from those that maximized the external validity measures; both produced different results. 
The cosine distance should be used when considering ICD-10, and the Euclidean distance when considering the frequency of word occurrence. The distributed representation, when trained by Word2Vec on the "infarction" domain from a Japanese academic corpus, provides an objective inter-disease distance used in PCS.
Affiliation(s)
- Daiki Yokokawa
- Department of General Medicine, Chiba University Hospital, 1-8-1 Inohana, Chuo-Ku, Chiba City, Chiba 260-8677 Japan
| | - Kazutaka Noda
- Department of General Medicine, Chiba University Hospital, 1-8-1 Inohana, Chuo-Ku, Chiba City, Chiba 260-8677 Japan
| | - Yasutaka Yanagita
- Department of General Medicine, Chiba University Hospital, 1-8-1 Inohana, Chuo-Ku, Chiba City, Chiba 260-8677 Japan
| | - Takanori Uehara
- Department of General Medicine, Chiba University Hospital, 1-8-1 Inohana, Chuo-Ku, Chiba City, Chiba 260-8677 Japan
| | - Yoshiyuki Ohira
- Department of General Internal Medicine, St. Marianna University School of Medicine, 2-16-1 Sugao, Miyamae-Ku, Kawasaki City, Kanagawa Japan
| | - Kiyoshi Shikino
- Department of General Medicine, Chiba University Hospital, 1-8-1 Inohana, Chuo-Ku, Chiba City, Chiba 260-8677 Japan
| | - Tomoko Tsukamoto
- Department of General Medicine, Chiba University Hospital, 1-8-1 Inohana, Chuo-Ku, Chiba City, Chiba 260-8677 Japan
| | - Masatomi Ikusaka
- Department of General Medicine, Chiba University Hospital, 1-8-1 Inohana, Chuo-Ku, Chiba City, Chiba 260-8677 Japan
| |
Collapse
|
10
|
Chanda AK, Bai T, Yang Z, Vucetic S. Improving medical term embeddings using UMLS Metathesaurus. BMC Med Inform Decis Mak 2022; 22:114. [PMID: 35488252 PMCID: PMC9052653 DOI: 10.1186/s12911-022-01850-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/27/2020] [Accepted: 03/29/2022] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND Health providers create Electronic Health Records (EHRs) to describe the conditions and procedures used to treat their patients. Medical notes entered by medical staff in the form of free text are a particularly insightful component of EHRs. There is a great interest in applying machine learning tools on medical notes in numerous medical informatics applications. Learning vector representations, or embeddings, of terms in the notes, is an important pre-processing step in such applications. However, learning good embeddings is challenging because medical notes are rich in specialized terminology, and the number of available EHRs in practical applications is often very small. METHODS In this paper, we propose a novel algorithm to learn embeddings of medical terms from a limited set of medical notes. The algorithm, called definition2vec, exploits external information in the form of medical term definitions. It is an extension of a skip-gram algorithm that incorporates textual definitions of medical terms provided by the Unified Medical Language System (UMLS) Metathesaurus. RESULTS To evaluate the proposed approach, we used a publicly available Medical Information Mart for Intensive Care (MIMIC-III) EHR data set. We performed quantitative and qualitative experiments to measure the usefulness of the learned embeddings. The experimental results show that definition2vec keeps the semantically similar medical terms together in the embedding vector space even when they are rare or unobserved in the corpus. We also demonstrate that learned vector embeddings are helpful in downstream medical informatics applications. CONCLUSION This paper shows that medical term definitions can be helpful when learning embeddings of rare or previously unseen medical terms from a small corpus of specialized documents such as medical notes.
Affiliation(s)
- Ashis Kumar Chanda: Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
- Tian Bai: Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
- Ziyu Yang: Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
- Slobodan Vucetic: Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
|
11
|
Khan MS, Landman BA, Deppen SA, Matheny ME. Intrinsic Evaluation of Contextual and Non-contextual Word Embeddings using Radiology Reports. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2022; 2021:631-640. [PMID: 35308988 PMCID: PMC8861761] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/13/2023]
Abstract
Many clinical natural language processing methods rely on non-contextual word embedding (NCWE) or contextual word embedding (CWE) models. Yet, few, if any, intrinsic evaluation benchmarks exist comparing embedding representations against clinician judgment. We developed intrinsic evaluation tasks for embedding models using a corpus of radiology reports: term pair similarity for NCWEs and cloze task accuracy for CWEs. Using surveys, we quantified the agreement between clinician judgment and embedding model representations. We compare embedding models trained on a custom radiology report corpus (RRC), a general corpus, and PubMed and MIMIC-III corpora (P&MC). Cloze task accuracy was equivalent for RRC and P&MC models. For term pair similarity, P&MC-trained NCWEs outperformed all other NCWE models (ρspearman 0.61 vs. 0.27-0.44). Among models trained on RRC, fastText models often outperformed other NCWE models and spherical embeddings provided overly optimistic representations of term pair similarity.
Affiliation(s)
- Mirza S Khan: US Dept. of Veterans Affairs, Nashville, TN; Vanderbilt University, Nashville, TN; Vanderbilt University Medical Center, Nashville, TN
- Bennett A Landman: Vanderbilt University, Nashville, TN; Vanderbilt University Medical Center, Nashville, TN
- Michael E Matheny: US Dept. of Veterans Affairs, Nashville, TN; Vanderbilt University Medical Center, Nashville, TN
|
12
|
Flamholz ZN, Crane-Droesch A, Ungar LH, Weissman GE. Word embeddings trained on published case reports are lightweight, effective for clinical tasks, and free of protected health information. J Biomed Inform 2022; 125:103971. [PMID: 34920127 PMCID: PMC8766939 DOI: 10.1016/j.jbi.2021.103971] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/02/2021] [Revised: 11/22/2021] [Accepted: 12/02/2021] [Indexed: 01/03/2023]
Abstract
OBJECTIVE Quantify tradeoffs in performance, reproducibility, and resource demands across several strategies for developing clinically relevant word embeddings. MATERIALS AND METHODS We trained separate embeddings on all full-text manuscripts in the PubMed Central (PMC) Open Access subset, case reports therein, the English Wikipedia corpus, the Medical Information Mart for Intensive Care (MIMIC) III dataset, and all notes in the University of Pennsylvania Health System (UPHS) electronic health record. We tested embeddings in six clinically relevant tasks including mortality prediction and de-identification, and assessed performance using the scaled Brier score (SBS) and the proportion of notes successfully de-identified, respectively. RESULTS Embeddings from UPHS notes best predicted mortality (SBS 0.30, 95% CI 0.15 to 0.45) while Wikipedia embeddings performed worst (SBS 0.12, 95% CI -0.05 to 0.28). Wikipedia embeddings most consistently (78% of notes) and the full PMC corpus embeddings least consistently (48%) de-identified notes. Across all six tasks, the full PMC corpus demonstrated the most consistent performance, and the Wikipedia corpus the least. Corpus size ranged from 49 million tokens (PMC case reports) to 10 billion (UPHS). DISCUSSION Embeddings trained on published case reports performed at least as well as embeddings trained on other corpora in most tasks, and clinical corpora consistently outperformed non-clinical corpora. No single corpus produced a strictly dominant set of embeddings across all tasks and so the optimal training corpus depends on intended use. CONCLUSION Embeddings trained on published case reports performed comparably on most clinical tasks to embeddings trained on larger corpora. Open access corpora allow training of clinically relevant, effective, and reproducible embeddings.
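The abstract does not spell out how the scaled Brier score is computed. A common definition, assumed here, rescales the Brier score against a reference model that always predicts the observed event rate, so that 1 is perfect, 0 matches the reference, and negative values are worse than the reference. The function names and toy data below are illustrative assumptions.

```python
def brier_score(outcomes, probabilities):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(outcomes) != len(probabilities) or not outcomes:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(outcomes, probabilities)) / len(outcomes)


def scaled_brier_score(outcomes, probabilities):
    """1 - Brier / Brier_reference, where the reference predicts the event rate."""
    event_rate = sum(outcomes) / len(outcomes)
    reference = brier_score(outcomes, [event_rate] * len(outcomes))
    if reference == 0:
        raise ValueError("undefined when all outcomes are identical")
    return 1.0 - brier_score(outcomes, probabilities) / reference


# Hypothetical toy data: 1 = died, 0 = survived.
observed = [1, 0, 1, 0]
predicted = [0.8, 0.3, 0.6, 0.1]
print(scaled_brier_score(observed, predicted))
```

Under this definition a confidence interval that crosses zero, as reported for the Wikipedia embeddings, means the model cannot be distinguished from simply predicting the event rate.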
Affiliation(s)
- Zachary N. Flamholz: Medical Scientist Training Program, Albert Einstein College of Medicine, Bronx, New York, USA
- Andrew Crane-Droesch: Penn Medicine Predictive Healthcare, University of Pennsylvania Health System, Philadelphia, Pennsylvania, USA; Palliative and Advanced Illness Research (PAIR) Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA
- Lyle H. Ungar: Department of Computer and Information Science, University of Pennsylvania, Philadelphia, Pennsylvania, USA; Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Gary E. Weissman: Palliative and Advanced Illness Research (PAIR) Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA; Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA; Leonard Davis Institute of Health Economics, University of Pennsylvania, Philadelphia, Pennsylvania, USA; Pulmonary, Allergy, and Critical Care Division, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA
|
13
|
Lin C, Lee YT, Wu FJ, Lin SA, Hsu CJ, Lee CC, Tsai DJ, Fang WH. The Application of Projection Word Embeddings on Medical Records Scoring System. Healthcare (Basel) 2021; 9:healthcare9101298. [PMID: 34682978 PMCID: PMC8544381 DOI: 10.3390/healthcare9101298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2021] [Revised: 09/24/2021] [Accepted: 09/28/2021] [Indexed: 11/16/2022] Open
Abstract
Medical records scoring is important in a health care system. Artificial intelligence (AI) with projection word embeddings has been validated in disease coding tasks, where it maintains the vocabulary diversity of open internet databases and the medical terminology understanding of electronic health records (EHRs). We considered that an AI-enhanced system might also be applied to automatically score medical records. This study aimed to develop a series of deep learning models (DLMs) and validate their performance in a medical records scoring task. We also analyzed the practical value of the best model. We used admission medical records from the Tri-Service General Hospital from January 2016 to May 2020, which were scored by visiting staff of different levels from different departments. The medical records were scored from 0 to 10. All samples were divided by time into a training set (n = 74,959) and a testing set (n = 152,730), which were used to train and validate the DLMs, respectively. The mean absolute error (MAE) was used to evaluate the performance of each DLM. In the original AI medical record scoring, the score predicted by the BERT architecture was closer to the actual reviewer score than that predicted by the projection word embedding with LSTM architecture: the original MAE was 0.84 ± 0.27 using the BERT model and 1.00 ± 0.32 using the LSTM model. A linear mixed model can be used to improve model performance, and the adjusted predicted score was closer to the reviewer score than the original predicted score. However, the projection word embedding with the LSTM model (0.66 ± 0.39) provided better performance than BERT (0.70 ± 0.33) after linear mixed model enhancement (p < 0.001). In addition to comparing different architectures for scoring medical records, this study used a linear mixed model to successfully adjust the AI medical record score, bringing it closer to the actual physician's score.
Affiliation(s)
- Chin Lin: School of Medicine, National Defense Medical Center, Taipei 114, Taiwan; School of Public Health, National Defense Medical Center, Taipei 114, Taiwan; Graduate Institute of Life Sciences, National Defense Medical Center, Taipei 114, Taiwan; Artificial Intelligence of Things Center, Tri-Service General Hospital, National Defense Medical Center, Taipei 114, Taiwan
- Yung-Tsai Lee: Division of Cardiovascular Surgery, Cheng Hsin Rehabilitation and Medical Center, Taipei 112, Taiwan
- Feng-Jen Wu: Department of Informatics, Taoyuan Armed Forces General Hospital, Taoyuan 325, Taiwan
- Shing-An Lin: Department of Medical Informatics, Tri-Service General Hospital, National Defense Medical Center, Taipei 114, Taiwan
- Chia-Jung Hsu: Department of Medical Informatics, Tri-Service General Hospital, National Defense Medical Center, Taipei 114, Taiwan
- Chia-Cheng Lee: Department of Medical Informatics, Tri-Service General Hospital, National Defense Medical Center, Taipei 114, Taiwan; Division of Colorectal Surgery, Department of Surgery, Tri-Service General Hospital, National Defense Medical Center, Taipei 114, Taiwan
- Dung-Jang Tsai: School of Public Health, National Defense Medical Center, Taipei 114, Taiwan; Graduate Institute of Life Sciences, National Defense Medical Center, Taipei 114, Taiwan; Artificial Intelligence of Things Center, Tri-Service General Hospital, National Defense Medical Center, Taipei 114, Taiwan
- Wen-Hui Fang: Artificial Intelligence of Things Center, Tri-Service General Hospital, National Defense Medical Center, Taipei 114, Taiwan; Department of Family and Community Medicine, Department of Internal Medicine, Tri-Service General Hospital, National Defense Medical Center, Taipei 114, Taiwan
- Correspondence (D.-J.T. and W.-H.F.): Tel.: +886-2-8792-3100 (ext. #18305) (D.-J.T.); +886-2-8792-3100 (ext. #12322) (W.-H.F.); Fax: +886-2-8792-3147 (D.-J.T. and W.-H.F.)
|
14
|
Sahoo HS, Silverman GM, Ingraham NE, Lupei MI, Puskarich MA, Finzel RL, Sartori J, Zhang R, Knoll BC, Liu S, Liu H, Melton GB, Tignanelli CJ, Pakhomov SVS. A fast, resource efficient, and reliable rule-based system for COVID-19 symptom identification. JAMIA Open 2021; 4:ooab070. [PMID: 34423261 PMCID: PMC8374371 DOI: 10.1093/jamiaopen/ooab070] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 03/16/2021] [Revised: 07/16/2021] [Accepted: 08/05/2021] [Indexed: 11/14/2022] Open
Abstract
OBJECTIVE With COVID-19, there was a need for a rapidly scalable annotation system that facilitated real-time integration with clinical decision support systems (CDS). Current annotation systems suffer from a high-resource utilization and poor scalability limiting real-world integration with CDS. A potential solution to mitigate these issues is to use the rule-based gazetteer developed at our institution. MATERIALS AND METHODS Performance, resource utilization, and runtime of the rule-based gazetteer were compared with five annotation systems: BioMedICUS, cTAKES, MetaMap, CLAMP, and MedTagger. RESULTS This rule-based gazetteer was the fastest, had a low resource footprint, and similar performance for weighted microaverage and macroaverage measures of precision, recall, and f1-score compared to other annotation systems. DISCUSSION Opportunities to increase its performance include fine-tuning lexical rules for symptom identification. Additionally, it could run on multiple compute nodes for faster runtime. CONCLUSION This rule-based gazetteer overcame key technical limitations facilitating real-time symptomatology identification for COVID-19 and integration of unstructured data elements into our CDS. It is ideal for large-scale deployment across a wide variety of healthcare settings for surveillance of acute COVID-19 symptoms for integration into prognostic modeling. Such a system is currently being leveraged for monitoring of postacute sequelae of COVID-19 (PASC) progression in COVID-19 survivors. This study conducted the first in-depth analysis and developed a rule-based gazetteer for COVID-19 symptom extraction with the following key features: low processor and memory utilization, faster runtime, and similar weighted microaverage and macroaverage measures for precision, recall, and f1-score compared to industry-standard annotation systems.
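Since the comparison above rests on micro- and macro-averaged F1, a small sketch of how the two averages differ may help. It uses plain (unweighted) averaging over hypothetical per-symptom counts; the abstract's "weighted" variants and the actual symptom counts are not reproduced here, and all names are assumptions.

```python
def f1(true_pos, false_pos, false_neg):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denominator = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denominator if denominator else 0.0


def macro_f1(counts):
    """Average the per-label F1 scores, so every label weighs the same."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


def micro_f1(counts):
    """Pool the counts over labels first, so frequent labels dominate."""
    pooled = [sum(c[i] for c in counts.values()) for i in range(3)]
    return f1(*pooled)


# Hypothetical (true positive, false positive, false negative) counts per symptom.
symptom_counts = {"fever": (8, 2, 0), "cough": (1, 0, 3)}
print(macro_f1(symptom_counts), micro_f1(symptom_counts))
```

A large gap between the two averages usually signals that rare labels are handled much worse (or better) than common ones.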
Affiliation(s)
- Himanshu S Sahoo: Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, Minnesota, USA; Department of Surgery, University of Minnesota, Minneapolis, Minnesota, USA
- Greg M Silverman: Department of Surgery, University of Minnesota, Minneapolis, Minnesota, USA
- Nicholas E Ingraham: Pulmonary Disease and Critical Care Medicine, University of Minnesota, Minneapolis, Minnesota, USA
- Monica I Lupei: Department of Anesthesiology, University of Minnesota, Minneapolis, Minnesota, USA
- Michael A Puskarich: Department of Emergency Medicine, University of Minnesota, Minneapolis, Minnesota, USA
- Raymond L Finzel: Department of Pharmaceutical Care and Health Systems, University of Minnesota, Minneapolis, Minnesota, USA
- John Sartori: Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, Minnesota, USA
- Rui Zhang: Department of Pharmaceutical Care and Health Systems, University of Minnesota, Minneapolis, Minnesota, USA; Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
- Benjamin C Knoll: Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
- Sijia Liu: Department of Health Science Research, Mayo Clinic, Rochester, New York, USA
- Hongfang Liu: Department of Health Science Research, Mayo Clinic, Rochester, New York, USA
- Genevieve B Melton: Department of Surgery, University of Minnesota, Minneapolis, Minnesota, USA; Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
- Serguei V S Pakhomov: Department of Pharmaceutical Care and Health Systems, University of Minnesota, Minneapolis, Minnesota, USA
|
15
|
Yum Y, Lee JM, Jang MJ, Kim Y, Kim JH, Kim S, Shin U, Song S, Joo HJ. A Word Pair Dataset for Semantic Similarity and Relatedness in Korean Medical Vocabulary: Reference Development and Validation. JMIR Med Inform 2021; 9:e29667. [PMID: 34185005 PMCID: PMC8277378 DOI: 10.2196/29667] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/16/2021] [Revised: 05/08/2021] [Accepted: 05/16/2021] [Indexed: 01/16/2023] Open
Abstract
Background The fact that medical terms require special expertise and are becoming increasingly complex makes it difficult to employ natural language processing techniques in medical informatics. Several human-validated reference standards for medical terms have been developed to evaluate word embedding models using the semantic similarity and relatedness of medical word pairs. However, there are very few reference standards in non-English languages. In addition, because the existing reference standards were developed a long time ago, there is a need to develop an updated standard to represent recent findings in medical sciences. Objective We propose a new Korean word pair reference set to verify embedding models. Methods From January 2010 to December 2020, 518 medical textbooks, 72,844 health information news, and 15,698 medical research articles were collected, and the top 10,000 medical terms were selected to develop medical word pairs. Attending physicians (n=16) participated in the verification of the developed set with 607 word pairs. Results The proportion of word pairs answered by all participants was 90.8% (551/607) for the similarity task and 86.5% (525/605) for the relatedness task. The similarity and relatedness of the word pair showed a high correlation (ρ=0.70, P<.001). The intraclass correlation coefficients to assess the interrater agreements of the word pair sets were 0.47 on the similarity task and 0.53 on the relatedness task. The final reference standard was 604 word pairs for the similarity task and 599 word pairs for relatedness, excluding word pairs with answers corresponding to outliers and word pairs that were answered by less than 50% of all the respondents. 
When FastText models were applied to the final reference standard word pair sets, the embedding models trained on medical documents showed a higher correlation between calculated cosine similarity scores and human-judged similarity and relatedness scores than the namu-trained model (similarity task: namu, ρ=0.12 vs medical text, ρ=0.47; relatedness task: namu, ρ=0.02 vs medical text, ρ=0.30). Conclusions Korean medical word pair reference standard sets for semantic similarity and relatedness were developed based on medical documents from the past 10 years. It is expected that our word pair reference sets will be actively utilized in the development of medical and multilingual natural language processing technology in the future.
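The evaluation described above, correlating model cosine similarities with human ratings, can be sketched in a few lines of pure Python. The vectors and ratings below are hypothetical toy values rather than data from the study, and the helper names are assumptions.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread = sqrt(sum((a - mean_x) ** 2 for a in rx) * sum((b - mean_y) ** 2 for b in ry))
    return cov / spread


# Hypothetical word-pair vectors and mean physician ratings for three pairs.
pairs = [([1.0, 0.0], [0.9, 0.1]), ([1.0, 0.0], [0.5, 0.5]), ([1.0, 0.0], [0.0, 1.0])]
human_ratings = [3.8, 2.1, 0.4]
model_scores = [cosine(u, v) for u, v in pairs]
print(spearman(model_scores, human_ratings))
```

Because Spearman's ρ depends only on rank order, it tolerates the different scales of cosine scores and human rating scales.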
Affiliation(s)
- Yunjin Yum: Department of Biostatistics, Korea University College of Medicine, Seoul, Republic of Korea; Korea University Research Institute for Medical Bigdata Science, Korea University, Seoul, Republic of Korea
- Jeong Moon Lee: Korea University Research Institute for Medical Bigdata Science, Korea University, Seoul, Republic of Korea
- Moon Joung Jang: Korea University Research Institute for Medical Bigdata Science, Korea University, Seoul, Republic of Korea
- Yoojoong Kim: Korea University Research Institute for Medical Bigdata Science, Korea University, Seoul, Republic of Korea
- Jong-Ho Kim: Korea University Research Institute for Medical Bigdata Science, Korea University, Seoul, Republic of Korea; Department of Cardiology, Cardiovascular Center, Korea University College of Medicine, Seoul, Republic of Korea
- Seongtae Kim: Department of Linguistics, Korea University, Seoul, Republic of Korea
- Unsub Shin: Department of Linguistics, Korea University, Seoul, Republic of Korea
- Sanghoun Song: Department of Linguistics, Korea University, Seoul, Republic of Korea
- Hyung Joon Joo: Department of Cardiology, Cardiovascular Center, Korea University College of Medicine, Seoul, Republic of Korea; Korea University Research Institute for Medical Bigdata Science, Korea University Anam Hospital, Seoul, Republic of Korea; Department of Medical Informatics, Korea University College of Medicine, Seoul, Republic of Korea
|
16
|
Mao Y, Fung KW. Use of word and graph embedding to measure semantic relatedness between Unified Medical Language System concepts. J Am Med Inform Assoc 2021; 27:1538-1546. [PMID: 33029614 PMCID: PMC7566472 DOI: 10.1093/jamia/ocaa136] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 02/10/2020] [Revised: 06/03/2020] [Accepted: 06/04/2020] [Indexed: 12/03/2022] Open
Abstract
Objective The study sought to explore the use of deep learning techniques to measure the semantic relatedness between Unified Medical Language System (UMLS) concepts. Materials and Methods Concept sentence embeddings were generated for UMLS concepts by applying the word embedding models BioWordVec and various flavors of BERT to concept sentences formed by concatenating UMLS terms. Graph embeddings were generated by the graph convolutional networks and 4 knowledge graph embedding models, using graphs built from UMLS hierarchical relations. Semantic relatedness was measured by the cosine between the concepts’ embedding vectors. Performance was compared with 2 traditional path-based (shortest path and Leacock-Chodorow) measurements and the publicly available concept embeddings, cui2vec, generated from large biomedical corpora. The concept sentence embeddings were also evaluated on a word sense disambiguation (WSD) task. Reference standards used included the semantic relatedness and semantic similarity datasets from the University of Minnesota, concept pairs generated from the Standardized MedDRA Queries and the MeSH (Medical Subject Headings) WSD corpus. Results Sentence embeddings generated by BioWordVec outperformed all other methods used individually in semantic relatedness measurements. Graph convolutional network graph embedding uniformly outperformed path-based measurements and was better than some word embeddings for the Standardized MedDRA Queries dataset. When used together, combined word and graph embedding achieved the best performance in all datasets. For WSD, the enhanced versions of BERT outperformed BioWordVec. Conclusions Word and graph embedding techniques can be used to harness terms and relations in the UMLS to measure semantic relatedness between concepts. Concept sentence embedding outperforms path-based measurements and cui2vec, and can be further enhanced by combining with graph embedding.
Affiliation(s)
- Yuqing Mao: National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Kin Wah Fung: National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
|
17
|
Newman-Griffis D, Fosler-Lussier E. Automated Coding of Under-Studied Medical Concept Domains: Linking Physical Activity Reports to the International Classification of Functioning, Disability, and Health. Front Digit Health 2021; 3:620828. [PMID: 33791684 PMCID: PMC8009547 DOI: 10.3389/fdgth.2021.620828] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 10/23/2020] [Accepted: 02/16/2021] [Indexed: 11/13/2022] Open
Abstract
Linking clinical narratives to standardized vocabularies and coding systems is a key component of unlocking the information in medical text for analysis. However, many domains of medical concepts, such as functional outcomes and social determinants of health, lack well-developed terminologies that can support effective coding of medical text. We present a framework for developing natural language processing (NLP) technologies for automated coding of medical information in under-studied domains, and demonstrate its applicability through a case study on physical mobility function. Mobility function is a component of many health measures, from post-acute care and surgical outcomes to chronic frailty and disability, and is represented as one domain of human activity in the International Classification of Functioning, Disability, and Health (ICF). However, mobility and other types of functional activity remain under-studied in the medical informatics literature, and neither the ICF nor commonly-used medical terminologies capture functional status terminology in practice. We investigated two data-driven paradigms, classification and candidate selection, to link narrative observations of mobility status to standardized ICF codes, using a dataset of clinical narratives from physical therapy encounters. Recent advances in language modeling and word embedding were used as features for established machine learning models and a novel deep learning approach, achieving a macro-averaged F-1 score of 84% on linking mobility activity reports to ICF codes. Both classification and candidate selection approaches present distinct strengths for automated coding in under-studied domains, and we highlight that the combination of (i) a small annotated data set; (ii) expert definitions of codes of interest; and (iii) a representative text corpus is sufficient to produce high-performing automated coding systems. 
This research has implications for continued development of language technologies to analyze functional status information, and the ongoing growth of NLP tools for a variety of specialized applications in clinical care and research.
Affiliation(s)
- Denis Newman-Griffis: Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, United States; Epidemiology & Biostatistics Section, Rehabilitation Medicine Department, National Institutes of Health Clinical Center, Bethesda, MD, United States
- Eric Fosler-Lussier: Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, United States
|
18
|
|
19
|
Jiang S, Wu W, Tomita N, Ganoe C, Hassanpour S. Multi-Ontology Refined Embeddings (MORE): A hybrid multi-ontology and corpus-based semantic representation model for biomedical concepts. J Biomed Inform 2020; 111:103581. [PMID: 33010425 DOI: 10.1016/j.jbi.2020.103581] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 04/14/2020] [Revised: 09/22/2020] [Accepted: 09/26/2020] [Indexed: 11/25/2022]
Abstract
OBJECTIVE Currently, a major limitation for natural language processing (NLP) analyses in clinical applications is that concepts are not effectively referenced in various forms across different texts. This paper introduces Multi-Ontology Refined Embeddings (MORE), a novel hybrid framework that incorporates domain knowledge from multiple ontologies into a distributional semantic model, learned from a corpus of clinical text. MATERIALS AND METHODS We use the RadCore and MIMIC-III free-text datasets for the corpus-based component of MORE. For the ontology-based part, we use the Medical Subject Headings (MeSH) ontology and three state-of-the-art ontology-based similarity measures. In our approach, we propose a new learning objective, modified from the sigmoid cross-entropy objective function. RESULTS AND DISCUSSION We used two established datasets of semantic similarities among biomedical concept pairs to evaluate the quality of the generated word embeddings. On the first dataset with 29 concept pairs, with similarity scores established by physicians and medical coders, MORE's similarity scores have the highest combined correlation (0.633), which is 5.0% higher than that of the baseline model, and 12.4% higher than that of the best ontology-based similarity measure. On the second dataset with 449 concept pairs, MORE's similarity scores have a correlation of 0.481, based on the average of four medical residents' similarity ratings, and that outperforms the skip-gram model by 8.1%, and the best ontology measure by 6.9%. Furthermore, MORE outperforms three pre-trained transformer-based word embedding models (i.e., BERT, ClinicalBERT, and BioBERT) on both datasets. CONCLUSION MORE incorporates knowledge from several biomedical ontologies into an existing corpus-based distributional semantics model, improving both the accuracy of the learned word embeddings and the extensibility of the model to a broader range of biomedical concepts. 
MORE allows for more accurate clustering of concepts across a wide range of applications, such as analyzing patient health records to identify subjects with similar pathologies, or integrating heterogeneous clinical data to improve interoperability between hospitals.
Affiliation(s)
- Steven Jiang: Department of Computer Science, Dartmouth College, Hanover, NH 03755, USA
- Weiyi Wu: Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Naofumi Tomita: Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Craig Ganoe: Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Saeed Hassanpour: Department of Computer Science, Dartmouth College, Hanover, NH 03755, USA; Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA; Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
|
20
|
Arguello Casteleiro M, Des Diz J, Maroto N, Fernandez Prieto MJ, Peters S, Wroe C, Sevillano Torrado C, Maseda Fernandez D, Stevens R. Semantic Deep Learning: Prior Knowledge and a Type of Four-Term Embedding Analogy to Acquire Treatments for Well-Known Diseases. JMIR Med Inform 2020; 8:e16948. [PMID: 32759099 PMCID: PMC7441383 DOI: 10.2196/16948] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2019] [Revised: 02/27/2020] [Accepted: 02/27/2020] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND How to treat a disease remains the most common type of clinical question. Obtaining evidence-based answers from biomedical literature is difficult. Analogical reasoning with embeddings from deep learning (embedding analogies) may extract such biomedical facts, although the state-of-the-art focuses on pair-based proportional (pairwise) analogies such as man:woman::king:queen ("queen = -man +king +woman"). OBJECTIVE This study aimed to systematically extract disease treatment statements with a Semantic Deep Learning (SemDeep) approach underpinned by prior knowledge and another type of 4-term analogy (other than pairwise). METHODS As preliminaries, we investigated Continuous Bag-of-Words (CBOW) embedding analogies in a common-English corpus with five lines of text and observed a type of 4-term analogy (not pairwise) applying the 3CosAdd formula and relating the semantic fields person and death: "dagger = -Romeo +die +died" (search query: -Romeo +die +died). Our SemDeep approach worked with pre-existing items of knowledge (what is known) to make inferences sanctioned by a 4-term analogy (search query -x +z1 +z2) from CBOW and Skip-gram embeddings created with a PubMed systematic reviews subset (PMSB dataset). Stage 1: Knowledge acquisition. Obtaining a set of terms, candidate y, from embeddings using vector arithmetic. Some n-gram pairs from the cosine and validated with evidence (prior knowledge) are the input for the 3CosAdd, seeking a type of 4-term analogy relating the semantic fields disease and treatment. Stage 2: Knowledge organization. Identification of candidates sanctioned by the analogy belonging to the semantic field treatment and mapping these candidates to Unified Medical Language System Metathesaurus concepts with MetaMap. A concept pair is a brief disease treatment statement (biomedical fact). Stage 3: Knowledge validation. An evidence-based evaluation followed by human validation of biomedical facts potentially useful for clinicians.
RESULTS We obtained 5352 n-gram pairs from 446 search queries by applying the 3CosAdd. The microaveraging performance of MetaMap for candidate y belonging to the semantic field treatment was F-measure=80.00% (precision=77.00%, recall=83.25%). We developed an empirical heuristic with some predictive power for clinical winners, that is, search queries bringing candidate y with evidence of a therapeutic intent for target disease x. The search queries -asthma +inhaled_corticosteroids +inhaled_corticosteroid and -epilepsy +valproate +antiepileptic_drug were clinical winners, finding eight evidence-based beneficial treatments. CONCLUSIONS Extracting treatments with therapeutic intent by analogical reasoning from embeddings (423K n-grams from the PMSB dataset) is an ambitious goal. Our SemDeep approach is knowledge-based, underpinned by embedding analogies that exploit prior knowledge. Biomedical facts from embedding analogies (4-term type, not pairwise) are potentially useful for clinicians. The heuristic offers a practical way to discover beneficial treatments for well-known diseases. Learning from deep learning models does not require a massive amount of data. Embedding analogies are not limited to pairwise analogies; hence, analogical reasoning with embeddings is underexploited.
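The 4-term search query of the form -x +z1 +z2 described in this abstract can be illustrated with a toy 3CosAdd computation. This is a minimal sketch under stated assumptions, not the authors' implementation: the three-dimensional vectors, the tokens `candidate_treatment`, `other_disease` and `unrelated_term`, and the function names are all made up for illustration. Real vectors would come from a trained CBOW or Skip-gram model.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))

def three_cos_add(emb, minus, plus, topn=1):
    """Rank vocabulary items by cosine similarity to
    (sum of unit 'plus' vectors) - (sum of unit 'minus' vectors),
    skipping the query terms themselves."""
    dim = len(next(iter(emb.values())))
    target = [0.0] * dim
    for word in plus:
        for i, x in enumerate(normalize(emb[word])):
            target[i] += x
    for word in minus:
        for i, x in enumerate(normalize(emb[word])):
            target[i] -= x
    query_terms = set(plus) | set(minus)
    scored = [(cosine(vec, target), word)
              for word, vec in emb.items() if word not in query_terms]
    scored.sort(reverse=True)
    return [word for _, word in scored[:topn]]

# Hypothetical toy embeddings: axis 0 ~ "disease", axis 1 ~ "treatment".
toy_embeddings = {
    "asthma": [1.0, 0.0, 0.0],
    "inhaled_corticosteroids": [0.6, 0.8, 0.0],
    "inhaled_corticosteroid": [0.5, 0.85, 0.0],
    "candidate_treatment": [0.1, 1.0, 0.0],
    "other_disease": [0.9, 0.1, 0.1],
    "unrelated_term": [0.0, 0.0, 1.0],
}

# Search query: -asthma +inhaled_corticosteroids +inhaled_corticosteroid
ranking = three_cos_add(
    toy_embeddings,
    minus=["asthma"],
    plus=["inhaled_corticosteroids", "inhaled_corticosteroid"],
    topn=2,
)
print(ranking)
```

Subtracting the disease vector pushes the target away from the disease axis, so in this toy setup the treatment-like token ranks first.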
Affiliation(s)
- Nava Maroto
- Departamento de Lingüística Aplicada a la Ciencia y a la Tecnología, Universidad Politécnica de Madrid, Madrid, Spain
- Simon Peters
- School of Social Sciences, University of Manchester, Manchester, United Kingdom
- Robert Stevens
- Department of Computer Science, University of Manchester, Manchester, United Kingdom
21
Liu H, Perl Y, Geller J. Transfer Learning from BERT to Support Insertion of New Concepts into SNOMED CT. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2020; 2019:1129-1138. [PMID: 32308910 PMCID: PMC7153142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
With advances in Machine Learning (ML), neural network-based methods, such as Convolutional/Recurrent Neural Networks, have been proposed to assist terminology curators in the development and maintenance of terminologies. Bidirectional Encoder Representations from Transformers (BERT), a new language representation model, obtains state-of-the-art results on a wide array of general English NLP tasks. We explore BERT's applicability to medical terminology-related tasks. Utilizing the "next sentence prediction" capability of BERT, we show that the Fine-tuning strategy of Transfer Learning (TL) from the BERTBASE model can address a challenging problem in automatic terminology enrichment - insertion of new concepts. Adding a pre-training strategy enhances the results. We apply our strategies to the two largest hierarchies of SNOMED CT, with one release as training data and the following release as test data. The performance of the combined two proposed TL models achieves an average F1 score of 0.85 and 0.86 for the two hierarchies, respectively.
Affiliation(s)
- Hao Liu
- Dept of Computer Science, NJIT, Newark, NJ, USA
22
Pesaranghader A, Matwin S, Sokolova M, Pesaranghader A. deepBioWSD: effective deep neural word sense disambiguation of biomedical text data. J Am Med Inform Assoc 2020; 26:438-446. [PMID: 30811548 DOI: 10.1093/jamia/ocy189] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 09/21/2018] [Revised: 12/03/2018] [Accepted: 12/19/2018] [Indexed: 01/05/2023] Open
Abstract
OBJECTIVE In biomedicine, there is a wealth of information hidden in unstructured narratives such as research articles and clinical reports. To exploit these data properly, a word sense disambiguation (WSD) algorithm prevents downstream difficulties in the natural language processing applications pipeline. Supervised WSD algorithms largely outperform un- or semisupervised and knowledge-based methods; however, they train 1 separate classifier for each ambiguous term, necessitating a large number of expert-labeled training data, an unattainable goal in medical informatics. To alleviate this need, a single model that shares statistical strength across all instances and scales well with the vocabulary size is desirable. MATERIALS AND METHODS Built on recent advances in deep learning, our deepBioWSD model leverages 1 single bidirectional long short-term memory network that makes sense predictions for any ambiguous term. In the model, first, the Unified Medical Language System sense embeddings will be computed using their text definitions; and then, after initializing the network with these embeddings, it will be trained on all (available) training data collectively. This method also considers a novel technique for automatic collection of training data from PubMed to (pre)train the network in an unsupervised manner. RESULTS We use the MSH WSD dataset to compare WSD algorithms, with macro and micro accuracies employed as evaluation metrics. deepBioWSD outperforms existing models in biomedical text WSD by achieving the state-of-the-art performance of 96.82% for macro accuracy. CONCLUSIONS Apart from the disambiguation improvement and unsupervised training, deepBioWSD depends on considerably fewer expert-labeled data as it learns the target and the context terms jointly. These merits make deepBioWSD conveniently deployable in real-time biomedical applications.
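The abstract reports both macro and micro accuracy. The sketch below shows how the two can differ on the same predictions, assuming the common definitions (macro: mean of per-term accuracies; micro: accuracy pooled over all instances). The exact evaluation protocol of the paper is not given here, and the terms and sense labels are hypothetical.

```python
from collections import defaultdict

def macro_micro_accuracy(records):
    """records: iterable of (ambiguous_term, gold_sense, predicted_sense).

    Returns (macro, micro): macro averages the accuracy of each ambiguous
    term, micro pools every instance regardless of term.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for term, gold, predicted in records:
        total[term] += 1
        correct[term] += int(gold == predicted)
    macro = sum(correct[t] / total[t] for t in total) / len(total)
    micro = sum(correct.values()) / sum(total.values())
    return macro, micro

# Hypothetical toy predictions for two ambiguous terms.
toy_records = [
    ("cold", "S1", "S1"),
    ("cold", "S1", "S1"),
    ("cold", "S2", "S2"),
    ("cold", "S2", "S1"),       # wrong: per-term accuracy 3/4
    ("discharge", "S1", "S1"),
    ("discharge", "S2", "S1"),  # wrong: per-term accuracy 1/2
]

macro, micro = macro_micro_accuracy(toy_records)
print(f"macro={macro:.3f} micro={micro:.3f}")
```

Macro accuracy weights every ambiguous term equally, so a term with few instances influences it as much as a frequent one; micro accuracy is dominated by frequent terms.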
Affiliation(s)
- Ahmad Pesaranghader
- Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada; Institute for Big Data Analytics, Dalhousie University, Halifax, NS B3H 4R2, Canada
- Stan Matwin
- Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada; Institute for Big Data Analytics, Dalhousie University, Halifax, NS B3H 4R2, Canada
- Marina Sokolova
- Institute for Big Data Analytics, Dalhousie University, Halifax, NS B3H 4R2, Canada; School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON K1N 6N5, Canada; School of Epidemiology and Public Health, University of Ottawa, University of Ottawa, Ottawa, ON K1G 5Z3, Canada
- Ali Pesaranghader
- School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON K1N 6N5, Canada
23
Weegar R, Pérez A, Casillas A, Oronoz M. Recent advances in Swedish and Spanish medical entity recognition in clinical texts using deep neural approaches. BMC Med Inform Decis Mak 2019; 19:274. [PMID: 31865900 PMCID: PMC6927099 DOI: 10.1186/s12911-019-0981-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Text mining and natural language processing of clinical text, such as notes from electronic health records, requires specific consideration of the specialized characteristics of these texts. Deep learning methods could potentially mitigate domain specific challenges such as limited access to in-domain tools and data sets. METHODS A bi-directional Long Short-Term Memory network is applied to clinical notes in Spanish and Swedish for the task of medical named entity recognition. Several types of embeddings, both generated from in-domain and out-of-domain text corpora, and a number of generation and combination strategies for embeddings have been evaluated in order to investigate different input representations and the influence of domain on the final results. RESULTS For Spanish, a micro averaged F1-score of 75.25 was obtained and for Swedish, the corresponding score was 76.04. The best results for both languages were achieved using embeddings generated from in-domain corpora extracted from electronic health records, but embeddings generated from related domains were also found to be beneficial. CONCLUSIONS A recurrent neural network with in-domain embeddings improved the medical named entity recognition compared to shallow learning methods, showing this combination to be suitable for entity recognition in clinical text for both languages.
Affiliation(s)
- Rebecka Weegar
- Department of Computer and Systems Sciences, DSV, Stockholm University, Borgarfjordsgatan 12, Kista, Sweden.
- Alicia Pérez
- IXA (UPV/EHU), University of the Basque Country, M. Lardizabal 1, Donostia, 20080, Spain
- Arantza Casillas
- IXA (UPV/EHU), University of the Basque Country, M. Lardizabal 1, Donostia, 20080, Spain
- Maite Oronoz
- IXA (UPV/EHU), University of the Basque Country, M. Lardizabal 1, Donostia, 20080, Spain
24
Fan Y, Pakhomov S, McEwan R, Zhao W, Lindemann E, Zhang R. Using word embeddings to expand terminology of dietary supplements on clinical notes. JAMIA Open 2019; 2:246-253. [PMID: 31825016 PMCID: PMC6904105 DOI: 10.1093/jamiaopen/ooz007] [Citation(s) in RCA: 33] [Impact Index Per Article: 5.5] [Indexed: 12/31/2022] Open
Abstract
Objective The objective of this study is to demonstrate the feasibility of applying word embeddings to expand the terminology of dietary supplements (DS) using over 26 million clinical notes. Methods Word embedding models (ie, word2vec and GloVe) trained on clinical notes were used to predefine a list of top 40 semantically related terms for each of 14 commonly used DS. Each list was further evaluated by experts to generate semantically similar terms. We investigated the effect of corpus size and other settings (ie, vector size and window size) as well as the 2 word embedding models on performance for DS term expansion. We compared the number of clinical notes (and patients they represent) that were retrieved using the word embedding expanded terms to both the baseline terms and external DS sources expanded terms. Results Using the word embedding models trained on clinical notes, we could identify 1–12 semantically similar terms for each DS. Using the word embedding expanded terms, we were able to retrieve on average 8.39% more clinical notes and 11.68% more patients for each DS compared with 2 sets of terms. Increasing the corpus size results in more misspellings, but not more semantic variants and brand names. The word2vec model was also found to be more capable of detecting semantically similar terms than GloVe. Conclusion Our study demonstrates the utility of word embeddings on clinical notes for terminology expansion on 14 DS. We propose that this method can be potentially applied to create a DS vocabulary for downstream applications, such as information extraction.
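The retrieval comparison in this abstract (notes found with baseline terms versus expanded terms) can be sketched as follows. This is an illustrative toy under stated assumptions: the notes, the baseline term, the expanded variants, and the case-insensitive substring matching are all hypothetical and are not taken from the study.

```python
def retrieve(notes, terms):
    """Return the ids of notes that mention any term
    (case-insensitive substring match, an assumed simplification)."""
    lowered = [t.lower() for t in terms]
    return {note_id for note_id, text in notes.items()
            if any(t in text.lower() for t in lowered)}

def percent_gain(baseline_ids, expanded_ids):
    """Percentage of additional notes retrieved relative to the baseline."""
    return 100.0 * (len(expanded_ids) - len(baseline_ids)) / len(baseline_ids)

# Hypothetical clinical-note snippets.
toy_notes = {
    1: "Patient takes fish oil daily.",
    2: "Reports using fishoil capsules.",   # misspelling
    3: "Started omega-3 supplement.",       # semantic variant
    4: "No supplements reported.",
    5: "Continues Fish Oil 1000 mg.",
}

baseline_terms = ["fish oil"]
expanded_terms = baseline_terms + ["fishoil", "omega-3"]

baseline_ids = retrieve(toy_notes, baseline_terms)
expanded_ids = retrieve(toy_notes, expanded_terms)
gain = percent_gain(baseline_ids, expanded_ids)
print(sorted(baseline_ids), sorted(expanded_ids), gain)
```

Because the expanded list contains the baseline terms, the expanded result set is always a superset of the baseline set, so the gain is never negative.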
Affiliation(s)
- Yadan Fan
- Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
- Serguei Pakhomov
- Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA; College of Pharmacy, University of Minnesota, Minneapolis, Minnesota, USA
- Reed McEwan
- Academic Health Center-Information Systems, University of Minnesota, Minneapolis, Minnesota, USA
- Wendi Zhao
- Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
- Rui Zhang
- Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA; College of Pharmacy, University of Minnesota, Minneapolis, Minnesota, USA
25
Arguello-Casteleiro M, Stevens R, Des-Diz J, Wroe C, Fernandez-Prieto MJ, Maroto N, Maseda-Fernandez D, Demetriou G, Peters S, Noble PJM, Jones PH, Dukes-McEwan J, Radford AD, Keane J, Nenadic G. Exploring semantic deep learning for building reliable and reusable one health knowledge from PubMed systematic reviews and veterinary clinical notes. J Biomed Semantics 2019; 10:22. [PMID: 31711540 PMCID: PMC6849172 DOI: 10.1186/s13326-019-0212-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 01/14/2023] Open
Abstract
BACKGROUND Deep Learning opens up opportunities for routinely scanning large bodies of biomedical literature and clinical narratives to represent the meaning of biomedical and clinical terms. However, the validation and integration of this knowledge on a scale requires cross checking with ground truths (i.e. evidence-based resources) that are unavailable in an actionable or computable form. In this paper we explore how to turn information about diagnoses, prognoses, therapies and other clinical concepts into computable knowledge using free-text data about human and animal health. We used a Semantic Deep Learning approach that combines the Semantic Web technologies and Deep Learning to acquire and validate knowledge about 11 well-known medical conditions mined from two sets of unstructured free-text data: 300 K PubMed Systematic Review articles (the PMSB dataset) and 2.5 M veterinary clinical notes (the VetCN dataset). For each target condition we obtained 20 related clinical concepts using two deep learning methods applied separately on the two datasets, resulting in 880 term pairs (target term, candidate term). Each concept, represented by an n-gram, is mapped to UMLS using MetaMap; we also developed a bespoke method for mapping short forms (e.g. abbreviations and acronyms). Existing ontologies were used to formally represent associations. We also create ontological modules and illustrate how the extracted knowledge can be queried. The evaluation was performed using the content within BMJ Best Practice. RESULTS MetaMap achieves an F measure of 88% (precision 85%, recall 91%) when applied directly to the total of 613 unique candidate terms for the 880 term pairs. When the processing of short forms is included, MetaMap achieves an F measure of 94% (precision 92%, recall 96%). Validation of the term pairs with BMJ Best Practice yields precision between 98 and 99%. 
CONCLUSIONS The Semantic Deep Learning approach can transform neural embeddings built from unstructured free-text data into reliable and reusable One Health knowledge using ontologies and content from BMJ Best Practice.
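The F measures reported for MetaMap in this abstract can be checked against the stated precision and recall. The sketch below assumes the balanced F-measure (F1, the harmonic mean of precision and recall); the abstract does not name the variant, so that is an assumption, and the function name is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values quoted in the abstract.
direct = f_measure(0.85, 0.91)            # MetaMap applied directly
with_short_forms = f_measure(0.92, 0.96)  # with short-form processing

print(round(direct * 100), round(with_short_forms * 100))  # prints: 88 94
```

Both rounded values agree with the 88% and 94% stated in the abstract, which is consistent with the balanced F-measure having been used.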
Affiliation(s)
- Robert Stevens
- School of Computer Science, University of Manchester, Manchester, UK
- Julio Des-Diz
- Hospital do Salnés, Villagarcía de Arousa, Pontevedra, Spain
- Nava Maroto
- Departamento de Lingüística Aplicada a la Ciencia y a la Tecnología, Universidad Politécnica de Madrid, Madrid, Spain
- Diego Maseda-Fernandez
- Midcheshire Hospital Foundation Trust, NHS England, Crewe, UK
- School of Medical Sciences, University of Manchester, Manchester, UK
- George Demetriou
- School of Computer Science, University of Manchester, Manchester, UK
- Simon Peters
- School of Social Sciences, University of Manchester, Manchester, UK
- Peter-John M Noble
- Small Animal Veterinary Surveillance Network, University of Liverpool, Liverpool, UK
- Phil H Jones
- Small Animal Veterinary Surveillance Network, University of Liverpool, Liverpool, UK
- Jo Dukes-McEwan
- Small Animal Teaching Hospital, University of Liverpool, Liverpool, UK
- Alan D Radford
- Small Animal Veterinary Surveillance Network, University of Liverpool, Liverpool, UK
- John Keane
- School of Computer Science, University of Manchester, Manchester, UK
- Manchester Institute of Biotechnology, University of Manchester, Manchester, UK
- Goran Nenadic
- School of Computer Science, University of Manchester, Manchester, UK
- Manchester Institute of Biotechnology, University of Manchester, Manchester, UK
- Health eResearch Centre, University of Manchester, Manchester, UK
26
Lavertu A, Altman RB. RedMed: Extending drug lexicons for social media applications. J Biomed Inform 2019; 99:103307. [PMID: 31627020 PMCID: PMC6874884 DOI: 10.1016/j.jbi.2019.103307] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 06/11/2019] [Revised: 10/02/2019] [Accepted: 10/11/2019] [Indexed: 10/25/2022]
Abstract
Social media has been identified as a promising potential source of information for pharmacovigilance. The adoption of social media data has been hindered by the massive and noisy nature of the data. Initial attempts to use social media data have relied on exact text matches to drugs of interest, and therefore suffer from the gap between formal drug lexicons and the informal nature of social media. The Reddit comment archive represents an ideal corpus for bridging this gap. We trained a word embedding model, RedMed, to facilitate the identification and retrieval of health entities from Reddit data. We compare the performance of our model trained on a consumer-generated corpus against publicly available models trained on expert-generated corpora. Our automated classification pipeline achieves an accuracy of 0.88 and a specificity of >0.9 across four different term classes. Of all drug mentions, an average of 79% (±0.5%) were exact matches to a generic or trademark drug name, 14% (±0.5%) were misspellings, 6.4% (±0.3%) were synonyms, and 0.13% (±0.05%) were pill marks. We find that our system captures an additional 20% of mentions; these would have been missed by approaches that rely solely on exact string matches. We provide a lexicon of misspellings and synonyms for 2978 drugs and a word embedding model trained on a health-oriented subset of Reddit.
Affiliation(s)
- Adam Lavertu
- Biomedical Informatics Training Program, Stanford University, Stanford, CA 94305, USA
- Russ B Altman
- Department of Bioengineering, Stanford University, Stanford, CA 94305, USA.
27
Si Y, Wang J, Xu H, Roberts K. Enhancing clinical concept extraction with contextual embeddings. J Am Med Inform Assoc 2019; 26:1297-1304. [PMID: 31265066 PMCID: PMC6798561 DOI: 10.1093/jamia/ocz096] [Citation(s) in RCA: 115] [Impact Index Per Article: 19.2] [Received: 03/18/2019] [Revised: 05/10/2019] [Accepted: 05/24/2019] [Indexed: 11/14/2022] Open
Abstract
OBJECTIVE Neural network-based representations ("embeddings") have dramatically advanced natural language processing (NLP) tasks, including clinical NLP tasks such as concept extraction. Recently, however, more advanced embedding methods and representations (eg, ELMo, BERT) have further pushed the state of the art in NLP, yet there are no common best practices for how to integrate these representations into clinical tasks. The purpose of this study, then, is to explore the space of possible options in utilizing these new models for clinical concept extraction, including comparing these to traditional word embedding methods (word2vec, GloVe, fastText). MATERIALS AND METHODS Both off-the-shelf, open-domain embeddings and pretrained clinical embeddings from MIMIC-III (Medical Information Mart for Intensive Care III) are evaluated. We explore a battery of embedding methods consisting of traditional word embeddings and contextual embeddings and compare these on 4 concept extraction corpora: i2b2 2010, i2b2 2012, SemEval 2014, and SemEval 2015. We also analyze the impact of the pretraining time of a large language model like ELMo or BERT on the extraction performance. Last, we present an intuitive way to understand the semantic information encoded by contextual embeddings. RESULTS Contextual embeddings pretrained on a large clinical corpus achieve new state-of-the-art performance across all concept extraction tasks. The best-performing model outperforms all state-of-the-art methods with respective F1-measures of 90.25, 93.18 (partial), 80.74, and 81.65. CONCLUSIONS We demonstrate the potential of contextual embeddings through the state-of-the-art performance these methods achieve on clinical concept extraction. Additionally, we demonstrate that contextual embeddings encode valuable semantic information not accounted for in traditional word representations.
Affiliation(s)
- Yuqi Si
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
- Jingqi Wang
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
- Hua Xu
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
- Kirk Roberts
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
28
Silverman GM, Lindemann EA, Rajamani G, Finzel RL, McEwan R, Knoll BC, Pakhomov S, Melton GB, Tignanelli CJ. Named Entity Recognition in Prehospital Trauma Care. Stud Health Technol Inform 2019; 264:1586-1587. [PMID: 31438244 PMCID: PMC7360018 DOI: 10.3233/shti190547] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/15/2022]
Abstract
Natural language processing (NLP) methods would improve outcomes in the area of prehospital Emergency Medical Services (EMS) data collection and abstraction. This study evaluated off-the-shelf solutions for automating labelling of clinically relevant data from EMS reports. A qualitative approach for choosing the best possible ensemble of pretrained NLP systems was developed and validated along with a feature using word embeddings to test phrase synonymy. The ensemble showed increased performance over individual systems.
Affiliation(s)
- Greg M Silverman
- Academic Health Center - Information Systems, University of Minnesota, Minneapolis, Minnesota, USA; Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
- Raymond L Finzel
- College of Pharmacy, University of Minnesota, Minneapolis, Minnesota, USA
- Reed McEwan
- Academic Health Center - Information Systems, University of Minnesota, Minneapolis, Minnesota, USA
- Benjamin C Knoll
- Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
- Serguei Pakhomov
- College of Pharmacy, University of Minnesota, Minneapolis, Minnesota, USA
- Genevieve B Melton
- Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA; Department of Surgery, University of Minnesota, Minneapolis, Minnesota, USA
- Christopher J Tignanelli
- Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA; Department of Surgery, University of Minnesota, Minneapolis, Minnesota, USA; Department of Surgery, North Memorial Health Hospital, Robbinsdale, Minnesota, USA
29
Lin C, Lou YS, Tsai DJ, Lee CC, Hsu CJ, Wu DC, Wang MC, Fang WH. Projection Word Embedding Model With Hybrid Sampling Training for Classifying ICD-10-CM Codes: Longitudinal Observational Study. JMIR Med Inform 2019; 7:e14499. [PMID: 31339103 PMCID: PMC6683650 DOI: 10.2196/14499] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 04/26/2019] [Revised: 06/13/2019] [Accepted: 06/17/2019] [Indexed: 12/26/2022] Open
Abstract
BACKGROUND Most current state-of-the-art models for searching the International Classification of Diseases, Tenth Revision Clinical Modification (ICD-10-CM) codes use word embedding technology to capture useful semantic properties. However, they are limited by the quality of initial word embeddings. Word embedding trained by electronic health records (EHRs) is considered the best, but the vocabulary diversity is limited by previous medical records. Thus, we require a word embedding model that maintains the vocabulary diversity of open internet databases and the medical terminology understanding of EHRs. Moreover, we need to consider the particularity of the disease classification, wherein discharge notes present only positive disease descriptions. OBJECTIVE We aimed to propose a projection word2vec model and a hybrid sampling method. In addition, we aimed to conduct a series of experiments to validate the effectiveness of these methods. METHODS We compared the projection word2vec model and traditional word2vec model using two corpora sources: English Wikipedia and PubMed journal abstracts. We used seven published datasets to measure the medical semantic understanding of the word2vec models and used these embeddings to identify the three-character-level ICD-10-CM diagnostic codes in a set of discharge notes. On the basis of embedding technology improvement, we also tried to apply the hybrid sampling method to improve accuracy. The 94,483 labeled discharge notes from the Tri-Service General Hospital of Taipei, Taiwan, from June 1, 2015, to June 30, 2017, were used. To evaluate the model performance, 24,762 discharge notes from July 1, 2017, to December 31, 2017, from the same hospital were used. Moreover, 74,324 additional discharge notes collected from seven other hospitals were tested. The F-measure, which is the major global measure of effectiveness, was adopted. 
RESULTS In medical semantic understanding, the original EHR embeddings and PubMed embeddings exhibited superior performance to the original Wikipedia embeddings. After projection training technology was applied, the projection Wikipedia embeddings exhibited an obvious improvement but did not reach the level of original EHR embeddings or PubMed embeddings. In the subsequent ICD-10-CM coding experiment, the model that used both projection PubMed and Wikipedia embeddings had the highest testing mean F-measure (0.7362 and 0.6693 in Tri-Service General Hospital and the seven other hospitals, respectively). Moreover, the hybrid sampling method was found to improve the model performance (F-measure=0.7371/0.6698). CONCLUSIONS The word embeddings trained using EHR and PubMed could understand medical semantics better, and the proposed projection word2vec model improved the ability of medical semantics extraction in Wikipedia embeddings. Although the improvement from the projection word2vec model in the real ICD-10-CM coding task was not substantial, the models could effectively handle emerging diseases. The proposed hybrid sampling method enables the model to behave like a human expert.
Affiliation(s)
- Chin Lin
- Graduate Institute of Life Sciences, National Defense Medical Center, Taipei, Taiwan
- School of Public Health, National Defense Medical Center, Taipei, Taiwan
- Yu-Sheng Lou
- Graduate Institute of Life Sciences, National Defense Medical Center, Taipei, Taiwan
- School of Public Health, National Defense Medical Center, Taipei, Taiwan
- Dung-Jang Tsai
- Graduate Institute of Life Sciences, National Defense Medical Center, Taipei, Taiwan
- School of Public Health, National Defense Medical Center, Taipei, Taiwan
- Chia-Cheng Lee
- Planning and Management Office, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan
- Chia-Jung Hsu
- Planning and Management Office, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan
- Ding-Chung Wu
- Department of Medical Record, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan
- Mei-Chuen Wang
- Department of Medical Record, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan
- Wen-Hui Fang
- Department of Family and Community Medicine, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan
30
Ning W, Chan S, Beam A, Yu M, Geva A, Liao K, Mullen M, Mandl KD, Kohane I, Cai T, Yu S. Feature extraction for phenotyping from semantic and knowledge resources. J Biomed Inform 2019; 91:103122. [PMID: 30738949 PMCID: PMC6424621 DOI: 10.1016/j.jbi.2019.103122] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 01/10/2023]
Abstract
OBJECTIVE Phenotyping algorithms can efficiently and accurately identify patients with a specific disease phenotype and construct electronic health records (EHR)-based cohorts for subsequent clinical or genomic studies. Previous studies have introduced unsupervised EHR-based feature selection methods that yielded algorithms with high accuracy. However, those selection methods still require expert intervention to tweak the parameter settings according to the EHR data distribution for each phenotype. To further accelerate the development of phenotyping algorithms, we propose a fully automated and robust unsupervised feature selection method that leverages only publicly available medical knowledge sources, instead of EHR data. METHODS SEmantics-Driven Feature Extraction (SEDFE) collects medical concepts from online knowledge sources as candidate features and gives them vector-form distributional semantic representations derived with neural word embedding and the Unified Medical Language System Metathesaurus. A number of features that are semantically closest and that sufficiently characterize the target phenotype are determined by a linear decomposition criterion and are selected for the final classification algorithm. RESULTS SEDFE was compared with the EHR-based SAFE algorithm and domain experts on feature selection for the classification of five phenotypes including coronary artery disease, rheumatoid arthritis, Crohn's disease, ulcerative colitis, and pediatric pulmonary arterial hypertension using both supervised and unsupervised approaches. Algorithms yielded by SEDFE achieved comparable accuracy to those yielded by SAFE and expert-curated features. SEDFE is also robust to the input semantic vectors. CONCLUSION SEDFE attains satisfying performance in unsupervised feature selection for EHR phenotyping. Both fully automated and EHR-independent, this method promises efficiency and accuracy in developing algorithms for high-throughput phenotyping.
Affiliation(s)
- Wenxin Ning
- Department of Industrial Engineering, Tsinghua University, Beijing, China
| | - Stephanie Chan
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Andrew Beam
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Ming Yu
- Department of Industrial Engineering, Tsinghua University, Beijing, China
| | - Alon Geva
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, USA; Department of Anesthesiology, Critical Care, and Pain Medicine, Boston Children's Hospital, Boston, MA, USA; Department of Anesthesia, Harvard Medical School, Boston, MA, USA
| | - Katherine Liao
- Department of Medicine, Division of Rheumatology, Immunology and Allergy, Brigham and Women's Hospital, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Mary Mullen
- Department of Cardiology, Boston Children's Hospital, Boston, MA, USA; Department of Pediatrics, Harvard Medical School, Boston, MA, USA
| | - Kenneth D Mandl
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Isaac Kohane
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Tianxi Cai
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Sheng Yu
- Center for Statistical Science, Tsinghua University, Beijing, China; Department of Industrial Engineering, Tsinghua University, Beijing, China; Institute for Data Science, Tsinghua University, Beijing, China.
|
31
|
Chang KP, Chu YW, Wang J. Analysis of Hormone Receptor Status in Primary and Recurrent Breast Cancer Via Data Mining Pathology Reports. Open Med (Wars) 2019; 14:91-98. [PMID: 30847396 PMCID: PMC6401490 DOI: 10.1515/med-2019-0013] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 08/16/2018] [Accepted: 12/05/2018] [Indexed: 11/15/2022] Open
Abstract
BACKGROUND Hormone receptors of breast cancer, such as estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (Her-2), are important prognostic factors for breast cancer. OBJECTIVE The current study aimed to develop a method to retrieve the statistics of hormone receptor expression status, documented in pathology reports, given their importance in research for primary and recurrent breast cancer, and quality management of pathology laboratories. METHOD A two-stage text mining approach, based on regular expression word/phrase matching, was developed to retrieve the data. RESULTS The method achieved sensitivities of 98.8%, 98.7% and 98.4% for extraction of ER, PR, and Her-2 results, respectively. The hormone expression status from 3679 primary and 44 recurrent breast cancer cases was successfully retrieved with the method. Statistical analysis of these data showed that the recurrent disease had a significantly lower positivity rate for ER (54.5% vs 76.5%, p=0.001278) than primary breast cancer and a higher positivity rate for Her-2 (48.8% vs 16.2%, p=9.79e-8). These results corroborated the previous literature. CONCLUSION Text mining on pathology reports using the developed method may benefit research of primary and recurrent breast cancer.
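The two-stage regular expression idea described in this abstract can be illustrated with a minimal sketch. The patterns, the function name `extract_receptor_status`, and the sample report text below are all hypothetical; the abstract does not give the authors' actual expressions, so this only shows the general shape of locating a marker mention first and then reading its status.

```python
import re

# Stage 1 (hypothetical patterns): locate the report segment that mentions each marker.
# Each pattern captures text up to the next period, semicolon, or line break.
MARKER_PATTERNS = {
    "ER": re.compile(r"\b(?:estrogen receptor|ER)\b[^.;\n]*", re.IGNORECASE),
    "PR": re.compile(r"\b(?:progesterone receptor|PR)\b[^.;\n]*", re.IGNORECASE),
    "HER2": re.compile(r"\bHER-?2(?:/neu)?\b[^.;\n]*", re.IGNORECASE),
}

# Stage 2 (hypothetical pattern): read the status word inside the matched segment.
STATUS_PATTERN = re.compile(r"\b(positive|negative|equivocal)\b", re.IGNORECASE)


def extract_receptor_status(report: str) -> dict:
    """Return a marker -> status mapping; None when a marker or status is not found."""
    results = {}
    for marker, pattern in MARKER_PATTERNS.items():
        segment = pattern.search(report)
        if segment is None:
            results[marker] = None
            continue
        status = STATUS_PATTERN.search(segment.group(0))
        results[marker] = status.group(1).lower() if status else None
    return results


# Invented sample text, not taken from any real pathology report.
example = (
    "Estrogen receptor: positive (90%). "
    "Progesterone receptor: negative. "
    "HER-2: equivocal (2+)."
)
print(extract_receptor_status(example))
```

Real pathology reports vary far more than this sample (negation, percentages, scoring scales), so a production pattern set would need validation against annotated reports, as the sensitivity figures in the abstract suggest the authors did.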
Affiliation(s)
- Kai-Po Chang
- Department of Pathology, China Medical University Hospital, Taichung 404, Taiwan
- Ph.D. Program in Medical Biotechnology, National Chung Hsing University, Taichung 402, Taiwan
| | - Yen-Wei Chu
- Biotechnology Center, Agricultural Biotechnology Center, Institute of Molecular Biology, National Chung Hsing University, Taichung 402, Taiwan
- Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung 402, Taiwan
| | - John Wang
- Department of Pathology, China Medical University Hospital, Taichung 404, Taiwan
|
32
|
Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings. J Biomed Inform 2019; 90:103096. [PMID: 30654030 DOI: 10.1016/j.jbi.2019.103096] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 01/08/2018] [Revised: 11/27/2018] [Accepted: 12/31/2018] [Indexed: 11/21/2022]
Abstract
Neural embeddings are a popular set of methods for representing words, phrases or text as a low dimensional vector (typically 50-500 dimensions). However, it is difficult to interpret these dimensions in a meaningful manner, and creating neural embeddings requires extensive training and tuning of multiple parameters and hyperparameters. We present here a simple unsupervised method for representing words, phrases or text as a low dimensional vector, in which the meaning and relative importance of dimensions is transparent to inspection. We have created a near-comprehensive vector representation of words, and selected bigrams, trigrams and abbreviations, using the set of titles and abstracts in PubMed as a corpus. This vector is used to create several novel implicit word-word and text-text similarity metrics. The implicit word-word similarity metrics correlate well with human judgement of word pair similarity and relatedness, and outperform or equal all other reported methods on a variety of biomedical benchmarks, including several implementations of neural embeddings trained on PubMed corpora. Our implicit word-word metrics capture different aspects of word-word relatedness than word2vec-based metrics and are only partially correlated (rho = 0.5-0.8 depending on task and corpus). The vector representations of words, bigrams, trigrams, abbreviations, and PubMed title + abstracts are all publicly available from http://arrowsmith.psych.uic.edu/arrowsmith_uic/word_similarity_metrics.html for release under CC-BY-NC license. Several public web query interfaces are also available at the same site, including one which allows the user to specify a given word and view its most closely related terms according to direct co-occurrence as well as different implicit similarity metrics.
|
33
|
Liu H, Geller J, Halper M, Perl Y. Using Convolutional Neural Networks to Support Insertion of New Concepts into SNOMED CT. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2018; 2018:750-759. [PMID: 30815117 PMCID: PMC6371320] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Many major medical ontologies go through a regular (bi-annual, monthly, etc.) release cycle. A new release will contain corrections to the previous release, as well as genuinely new concepts that are the result of either user requests or new developments in the domain. New concepts need to be placed at the correct place in the ontology hierarchy. Traditionally, this is done by an expert modeling a new concept and running a classifier algorithm. We propose an alternative approach that is based on providing only the name of a new concept and using a Convolutional Neural Network-based machine learning method. We first tested this approach within one version of SNOMED CT and achieved an average 88.5% precision and an F1 score of 0.793. In comparing the July 2017 release with the January 2018 release, limiting ourselves to predicting one out of two or more parents, our average F1 score was 0.701.
Affiliation(s)
- Hao Liu
- New Jersey Institute of Technology, Newark, NJ
|
34
|
Vemulakonda VM, Bush RA, Kahn MG. "Minimally invasive research?" Use of the electronic health record to facilitate research in pediatric urology. J Pediatr Urol 2018; 14:374-381. [PMID: 29929853 PMCID: PMC6286872 DOI: 10.1016/j.jpurol.2018.04.033] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 02/05/2018] [Accepted: 04/19/2018] [Indexed: 01/20/2023]
Abstract
BACKGROUND The electronic health record (EHR) was designed as a clinical and administrative tool to improve clinical patient care. Electronic healthcare systems have been successfully adopted across the world through use of government mandates and incentives. METHODS Using electronic health record, health information system, electronic medical record, health information systems, research, outcomes, pediatric, surgery, and urology as initial search terms, the literature focusing on clinical documentation data capture and the EHR as a potential resource for research related to clinical outcomes, quality improvement, and comparative effectiveness was reviewed. Relevant articles were supplemented by secondary review of article references as well as seminal articles in the field as identified by the senior author. FINDINGS US federal funding agencies, including the Agency for Healthcare Research and Quality, the Patient-Centered Outcomes Research Institute, the National Institutes of Health, and the Food and Drug Administration, have recognized the EHR's role in supporting research. The main approaches to using EHR data include enhanced lists, direct data extraction, structured data entry, and unstructured data entry. The EHR's potential to facilitate research, overcoming cost and time burdens associated with traditional data collection, has not resulted in widespread use of EHR-based research tools. CONCLUSION There are strengths and weaknesses for all existing methodologies of using EHR data to support research. Collaboration is needed to identify the method that best suits the institution for incorporation of research-oriented data collection into routine pediatric urologic clinical practice.
Affiliation(s)
- Vijaya M Vemulakonda
- Department of Pediatric Urology, Children's Hospital Colorado, Aurora, CO, USA; Division of Urology, Department of Surgery, University of Colorado Denver Anschutz Medical Campus, Aurora, CO, USA.
| | - Ruth A Bush
- Clinical Informatics, Rady Children's Hospital San Diego, San Diego, CA, USA; University of San Diego Beyster Institute for Nursing Research, San Diego, CA, USA
| | - Michael G Kahn
- Department of Pediatrics, Colorado Clinical and Translational Sciences Institute and Colorado Center for Personalized Medicine, University of Colorado Denver Anschutz Medical Campus, Aurora, CO, USA; Research Informatics, Children's Hospital Colorado, Aurora, CO, USA
|
35
|
Wang Y, Liu S, Afzal N, Rastegar-Mojarad M, Wang L, Shen F, Kingsbury P, Liu H. A comparison of word embeddings for the biomedical natural language processing. J Biomed Inform 2018; 87:12-20. [PMID: 30217670 DOI: 10.1016/j.jbi.2018.09.008] [Citation(s) in RCA: 136] [Impact Index Per Article: 19.4] [Received: 04/04/2018] [Revised: 07/18/2018] [Accepted: 09/10/2018] [Indexed: 10/28/2022]
Abstract
BACKGROUND Word embeddings have been prevalently used in biomedical Natural Language Processing (NLP) applications because vector representations can capture useful semantic properties and linguistic relationships between words. Different textual resources (e.g., Wikipedia and biomedical literature corpus) have been utilized in biomedical NLP to train word embeddings and these word embeddings have been commonly leveraged as feature input to downstream machine learning models. However, there has been little work on evaluating the word embeddings trained from different textual resources. METHODS In this study, we empirically evaluated word embeddings trained from four different corpora, namely clinical notes, biomedical publications, Wikipedia, and news. For the former two resources, we trained word embeddings using unstructured electronic health record (EHR) data available at Mayo Clinic and articles (MedLit) from PubMed Central, respectively. For the latter two resources, we used publicly available pre-trained word embeddings, GloVe and Google News. The evaluation was done qualitatively and quantitatively. For the qualitative evaluation, we randomly selected medical terms from three categories (i.e., disorder, symptom, and drug), and manually inspected the five most similar words computed by embeddings for each term. We also analyzed the word embeddings through a 2-dimensional visualization plot of 377 medical terms. For the quantitative evaluation, we conducted both intrinsic and extrinsic evaluation. For the intrinsic evaluation, we evaluated the word embeddings' ability to capture medical semantics by measuring the semantic similarity between medical terms using four published datasets: Pedersen's dataset, Hliaoutakis's dataset, MayoSRS, and UMNSRS.
For the extrinsic evaluation, we applied word embeddings to multiple downstream biomedical NLP applications, including clinical information extraction (IE), biomedical information retrieval (IR), and relation extraction (RE), with data from shared tasks. RESULTS The qualitative evaluation shows that the word embeddings trained from EHR and MedLit can find more similar medical terms than those trained from GloVe and Google News. The intrinsic quantitative evaluation verifies that the semantic similarity captured by the word embeddings trained from EHR is closer to human experts' judgments on all four tested datasets. The extrinsic quantitative evaluation shows that the word embeddings trained on EHR achieved the best F1 score of 0.900 for the clinical IE task; no word embeddings improved the performance for the biomedical IR task; and the word embeddings trained on Google News had the best overall F1 score of 0.790 for the RE task. CONCLUSION Based on the evaluation results, we can draw the following conclusions. First, the word embeddings trained from EHR and MedLit can capture the semantics of medical terms better, and find semantically relevant medical terms closer to human experts' judgments than those trained from GloVe and Google News. Second, there does not exist a consistent global ranking of word embeddings for all downstream biomedical NLP applications. However, adding word embeddings as extra features will improve results on most downstream tasks. Finally, the word embeddings trained from the biomedical domain corpora do not necessarily have better performance than those trained from the general domain corpora for any downstream biomedical NLP task.
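The intrinsic evaluation described above compares model similarity scores with human ratings of term pairs. A minimal sketch of that kind of comparison follows, assuming cosine similarity as the model score and Spearman rank correlation as the agreement measure; the abstract names neither explicitly, and the vectors and ratings below are toy values invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Toy embeddings and toy human ratings (hypothetical, for illustration only).
embeddings = {
    "fever": [0.9, 0.1, 0.0],
    "pyrexia": [0.8, 0.2, 0.1],
    "cough": [0.3, 0.8, 0.1],
    "fracture": [0.0, 0.1, 0.9],
}
human_ratings = [
    ("fever", "pyrexia", 4.0),
    ("fever", "cough", 2.5),
    ("fever", "fracture", 1.0),
]

model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in human_ratings]
gold_scores = [rating for _, _, rating in human_ratings]
print(spearman(model_scores, gold_scores))
```

A rank correlation is a natural fit here because human similarity ratings and cosine values sit on different scales; only their ordering of term pairs is compared.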
Affiliation(s)
- Yanshan Wang
- Department of Health Sciences Research, Mayo Clinic, Rochester, USA.
| | - Sijia Liu
- Department of Health Sciences Research, Mayo Clinic, Rochester, USA.
| | - Naveed Afzal
- Department of Health Sciences Research, Mayo Clinic, Rochester, USA.
| | | | - Liwei Wang
- Department of Health Sciences Research, Mayo Clinic, Rochester, USA.
| | - Feichen Shen
- Department of Health Sciences Research, Mayo Clinic, Rochester, USA.
| | - Paul Kingsbury
- Department of Health Sciences Research, Mayo Clinic, Rochester, USA.
| | - Hongfang Liu
- Department of Health Sciences Research, Mayo Clinic, Rochester, USA.
|
36
|
Lossio-Ventura JA, Bian J, Jonquet C, Roche M, Teisseire M. A novel framework for biomedical entity sense induction. J Biomed Inform 2018; 84:31-41. [PMID: 29935347 DOI: 10.1016/j.jbi.2018.06.007] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 06/13/2017] [Revised: 05/08/2018] [Accepted: 06/12/2018] [Indexed: 11/28/2022]
Abstract
BACKGROUND Rapid advancements in biomedical research have accelerated the number of relevant electronic documents published online, ranging from scholarly articles to news, blogs, and user-generated social media content. Nevertheless, the vast amount of this information is poorly organized, making it difficult to navigate. Emerging technologies such as ontologies and knowledge bases (KBs) could help organize and track the information associated with biomedical research developments. A major challenge in the automatic construction of ontologies and KBs is the identification of words with their respective sense(s) from a free-text corpus. Word-sense induction (WSI) is the task of automatically inducing the different senses of a target word in different contexts. In the last two decades, there have been several efforts on WSI. However, few methods are effective in biomedicine and life sciences. METHODS We developed a framework for biomedical entity sense induction using a mixture of natural language processing, supervised, and unsupervised learning methods with promising results. It is composed of three main steps: (1) a polysemy detection method to determine if a biomedical entity has many possible meanings; (2) a clustering quality index-based approach to predict the number of senses for the biomedical entity; and (3) a method to induce the concept(s) (i.e., senses) of the biomedical entity in a given context. RESULTS To evaluate our framework, we used the well-known MSH WSD polysemic dataset that contains 203 annotated ambiguous biomedical entities, where each entity is linked to 2-5 concepts. Our polysemy detection method obtained an F-measure of 98%. Second, our approach for predicting the number of senses achieved an F-measure of 93%. Finally, we induced the concepts of the biomedical entities based on a clustering algorithm and then extracted the keywords of each cluster to represent the concept.
CONCLUSIONS We have developed a framework for biomedical entity sense induction with promising results. Our study results can benefit a number of downstream applications, for example, help to resolve concept ambiguities when building Semantic Web KBs from biomedical text.
Affiliation(s)
| | - J Bian
- College of Medicine, University of Florida, USA.
| | - C Jonquet
- University of Montpellier, LIRMM, CNRS, Montpellier, France.
| | - M Roche
- Cirad, TETIS, Montpellier, France; TETIS, Univ. Montpellier, APT, Cirad, Cnrs, Irstea, Montpellier, France.
|
37
|
Ye C, Fabbri D. Extracting similar terms from multiple EMR-based semantic embeddings to support chart reviews. J Biomed Inform 2018; 83:63-72. [PMID: 29793071 DOI: 10.1016/j.jbi.2018.05.014] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 09/27/2017] [Revised: 04/24/2018] [Accepted: 05/20/2018] [Indexed: 01/20/2023]
Abstract
OBJECTIVE Word embeddings project semantically similar terms into nearby points in a vector space. When trained on clinical text, these embeddings can be leveraged to improve keyword search and text highlighting. In this paper, we present methods to refine the selection process of similar terms from multiple EMR-based word embeddings, and evaluate their performance quantitatively and qualitatively across multiple chart review tasks. MATERIALS AND METHODS Word embeddings were trained on each clinical note type in an EMR. These embeddings were then combined, weighted, and truncated to select a refined set of similar terms to be used in keyword search and text highlighting. To evaluate their quality, we measured the similar terms' information retrieval (IR) performance using precision-at-K (P@5, P@10). Additionally, a user study evaluated users' search term preferences, while a timing study measured the time to answer a question from a clinical chart. RESULTS The refined terms outperformed the baseline method's information retrieval performance (e.g., increasing the average P@5 from 0.48 to 0.60). Additionally, the refined terms were preferred by most users, and reduced the average time to answer a question. CONCLUSIONS Clinical information can be more quickly retrieved and synthesized when using semantically similar terms from multiple embeddings.
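Precision-at-K, the retrieval measure named in this abstract, is the fraction of the top K ranked items that are relevant. A minimal sketch follows; the function name `precision_at_k` and the toy term lists are illustrative assumptions and are not drawn from the study's data.

```python
def precision_at_k(ranked_terms, relevant_terms, k):
    """Fraction of the first k ranked terms that appear in the relevant set.

    The denominator is k even when fewer than k terms were returned,
    which is the usual convention for P@K.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_terms[:k]
    hits = sum(1 for term in top_k if term in relevant_terms)
    return hits / k


# Toy example (hypothetical): candidate similar terms for the query "dyspnea".
ranked = ["sob", "shortness of breath", "breathlessness", "cough", "doe", "fever"]
relevant = {"sob", "shortness of breath", "breathlessness", "doe"}

print(precision_at_k(ranked, relevant, 5))
```

Averaging this value over many query terms gives figures of the kind reported in the abstract (an average P@5).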
Affiliation(s)
- Cheng Ye
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA.
| | - Daniel Fabbri
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
|
38
|
Zhang Y, Li HJ, Wang J, Cohen T, Roberts K, Xu H. Adapting Word Embeddings from Multiple Domains to Symptom Recognition from Psychiatric Notes. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2018; 2017:281-289. [PMID: 29888086 PMCID: PMC5961810] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Mental health is increasingly recognized as an important topic in healthcare. Information concerning psychiatric symptoms is critical for the timely diagnosis of mental disorders, as well as for the personalization of interventions. However, the diversity and sparsity of psychiatric symptoms make it challenging for conventional natural language processing techniques to automatically extract such information from clinical text. To address this problem, this study takes the initiative to use and adapt word embeddings from four source domains - intensive care, biomedical literature, Wikipedia and Psychiatric Forum - to recognize symptoms in the target domain of psychiatry. We investigated four different approaches including 1) only using word embeddings of the source domain, 2) directly combining data of the source and target to generate word embeddings, 3) assigning different weights to word embeddings, and 4) retraining the word embedding model of the source domain using a corpus of the target domain. To the best of our knowledge, this is the first work of adapting multiple word embeddings of external domains to improve psychiatric symptom recognition in clinical text. Experimental results showed that the last two approaches outperformed the baseline methods, indicating the effectiveness of our new strategies to leverage embeddings from other domains.
Affiliation(s)
- Yaoyun Zhang
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Hee-Jin Li
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Jingqi Wang
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Trevor Cohen
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Kirk Roberts
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
|
39
|
Jackson R, Patel R, Velupillai S, Gkotsis G, Hoyle D, Stewart R. Knowledge discovery for Deep Phenotyping serious mental illness from Electronic Mental Health records. F1000Res 2018; 7:210. [PMID: 29899974 PMCID: PMC5968362 DOI: 10.12688/f1000research.13830.2] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Accepted: 04/30/2018] [Indexed: 11/23/2022] Open
Abstract
Background: Deep Phenotyping is the precise and comprehensive analysis of phenotypic features in which the individual components of the phenotype are observed and described. In UK mental health clinical practice, most clinically relevant information is recorded as free text in the Electronic Health Record, and offers a granularity of information beyond what is expressed in most medical knowledge bases. The SNOMED CT nomenclature potentially offers the means to model such information at scale, yet given a sufficiently large body of clinical text collected over many years, it is difficult to identify the language that clinicians favour to express concepts. Methods: By utilising a large corpus of healthcare data, we sought to make use of semantic modelling and clustering techniques to represent the relationship between the clinical vocabulary of internationally recognised SMI symptoms and the preferred language used by clinicians within a care setting. We explore how such models can be used for discovering novel vocabulary relevant to the task of phenotyping Serious Mental Illness (SMI) with only a small amount of prior knowledge. Results: 20 403 terms were derived and curated via a two-stage methodology. The list was reduced to 557 putative concepts based on eliminating redundant information content. These were then organised into 9 distinct categories pertaining to different aspects of psychiatric assessment. 235 concepts were found to be expressions of putative clinical significance. Of these, 53 were identified as having novel synonymy with existing SNOMED CT concepts. 106 had no mapping to SNOMED CT. Conclusions: We demonstrate a scalable approach to discovering new concepts of SMI symptomatology based on real-world clinical observation.
Such approaches may offer the opportunity to consider broader manifestations of SMI symptomatology than is typically assessed via current diagnostic frameworks, and create the potential for enhancing nomenclatures such as SNOMED CT based on real-world expressions.
Affiliation(s)
- Richard Jackson
- Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, SE5 8AF, UK
| | - Rashmi Patel
- Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, SE5 8AF, UK; South London and Maudsley NHS Foundation Trust, London, SE5 8AZ, UK
| | - Sumithra Velupillai
- Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, SE5 8AF, UK; School of Computer Science and Communication, KTH Royal Institute of Technology, Stockholm, SE-100 44, Sweden
| | - George Gkotsis
- Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, SE5 8AF, UK
| | | | - Robert Stewart
- Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, SE5 8AF, UK; South London and Maudsley NHS Foundation Trust, London, SE5 8AZ, UK
|
40
|
Sabbir A, Jimeno-Yepes A, Kavuluru R. Knowledge-Based Biomedical Word Sense Disambiguation with Neural Concept Embeddings. PROCEEDINGS. IEEE INTERNATIONAL SYMPOSIUM ON BIOINFORMATICS AND BIOENGINEERING 2017; 2017:163-170. [PMID: 29399672 PMCID: PMC5792196 DOI: 10.1109/bibe.2017.00-61] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 06/07/2023]
Abstract
Biomedical word sense disambiguation (WSD) is an important intermediate task in many natural language processing applications such as named entity recognition, syntactic parsing, and relation extraction. In this paper, we employ knowledge-based approaches that also exploit recent advances in neural word/concept embeddings to improve over the state-of-the-art in biomedical WSD using the public MSH WSD dataset [1] as the test set. Our methods involve weak supervision - we do not use any hand-labeled examples for WSD to build our prediction models; however, we employ an existing concept mapping program, MetaMap, to obtain our concept vectors. Over the MSH WSD dataset, our linear time (in terms of numbers of senses and words in the test instance) method achieves an accuracy of 92.24% which is a 3% improvement over the best known results [2] obtained via unsupervised means. A more expensive approach that we developed relies on a nearest neighbor framework and achieves accuracy of 94.34%, essentially cutting the error rate in half. Employing dense vector representations learned from unlabeled free text has been shown to benefit many language processing tasks recently and our efforts show that biomedical WSD is no exception to this trend. For a complex and rapidly evolving domain such as biomedicine, building labeled datasets for larger sets of ambiguous terms may be impractical. Here, we show that weak supervision that leverages recent advances in representation learning can rival supervised approaches in biomedical WSD. However, external knowledge bases (here sense inventories) play a key role in the improvements achieved.
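One common way to use word and concept embeddings for knowledge-based disambiguation is to average the vectors of the context words and pick the candidate concept whose vector is closest by cosine similarity. The sketch below illustrates that general idea only; it is not the authors' method, and the function names, concept labels, and two-dimensional toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


def disambiguate(context_words, word_vectors, concept_vectors):
    """Return the candidate concept closest to the averaged context vector.

    Returns None when no context word has a known vector.
    """
    known = [word_vectors[w] for w in context_words if w in word_vectors]
    if not known:
        return None
    context = average(known)
    return max(concept_vectors, key=lambda c: cosine(context, concept_vectors[c]))


# Toy vectors (hypothetical): two candidate senses of the ambiguous term "cold".
word_vectors = {
    "virus": [1.0, 0.0],
    "cough": [0.9, 0.1],
    "winter": [0.1, 0.9],
    "ice": [0.0, 1.0],
}
concept_vectors = {
    "common_cold": [1.0, 0.1],
    "cold_temperature": [0.1, 1.0],
}

print(disambiguate(["virus", "cough"], word_vectors, concept_vectors))
print(disambiguate(["winter", "ice"], word_vectors, concept_vectors))
```

The cost of this sketch grows linearly with the number of context words and candidate senses, which matches the kind of complexity the abstract describes for its cheaper method; the nearest neighbor variant mentioned there would instead compare the test instance against stored example contexts.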
Affiliation(s)
- Akm Sabbir
- Department of Computer Science, University of Kentucky, Lexington, KY, USA
| | | | - Ramakanth Kavuluru
- Division of Biomedical Informatics (Department of Internal Medicine) and the Department of Computer Science, University of Kentucky, Lexington, KY, USA
|
41
|
Névéol A, Zweigenbaum P, Section Editors for the IMIA Yearbook Section on Clinical Natural Language Processing. Making Sense of Big Textual Data for Health Care: Findings from the Section on Clinical Natural Language Processing. Yearb Med Inform 2017; 26:228-234. [PMID: 29063569 PMCID: PMC6239234 DOI: 10.15265/iy-2017-027] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 08/18/2017] [Indexed: 02/01/2023] Open
Abstract
Objectives: To summarize recent research and present a selection of the best papers published in 2016 in the field of clinical Natural Language Processing (NLP). Method: A survey of the literature was performed by the two section editors of the IMIA Yearbook NLP section. Bibliographic databases were searched for papers with a focus on NLP efforts applied to clinical texts or aimed at a clinical outcome. Papers were automatically ranked and then manually reviewed based on titles and abstracts. A shortlist of candidate best papers was first selected by the section editors before being peer-reviewed by independent external reviewers. Results: The five clinical NLP best papers provide a contribution that ranges from emerging original foundational methods to transitioning solid established research results to a practical clinical setting. They offer a framework for abbreviation disambiguation and coreference resolution, a classification method to identify clinically useful sentences, an analysis of counseling conversations to improve support to patients with mental disorder and grounding of gradable adjectives. Conclusions: Clinical NLP continued to thrive in 2016, with an increasing number of contributions towards applications compared to fundamental methods. Fundamental work addresses increasingly complex problems such as lexical semantics, coreference resolution, and discourse analysis. Research results translate into freely available tools, mainly for English.
Affiliation(s)
- A. Névéol
- LIMSI, CNRS, Université Paris Saclay, Orsay, France
|
42
|
Zhu Y, Yan E, Wang F. Semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec. BMC Med Inform Decis Mak 2017; 17:95. [PMID: 28673289 PMCID: PMC5496182 DOI: 10.1186/s12911-017-0498-1] [Citation(s) in RCA: 44] [Impact Index Per Article: 5.5] [Received: 02/20/2017] [Accepted: 06/28/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Understanding semantic relatedness and similarity between biomedical terms has a great impact on a variety of applications such as biomedical information retrieval, information extraction, and recommender systems. The objective of this study is to examine word2vec's ability to derive semantic relatedness and similarity between biomedical terms from large publication data. Specifically, we focus on the effects of recency, size, and section of biomedical publication data on the performance of word2vec. METHODS We download abstracts of 18,777,129 articles from PubMed and 766,326 full-text articles from PubMed Central (PMC). The datasets are preprocessed and grouped into subsets by recency, size, and section. Word2vec models are trained on these subsets. Cosine similarities between biomedical terms obtained from the word2vec models are compared against reference standards. The performance of models trained on different subsets is compared to examine recency, size, and section effects. RESULTS Models trained on recent datasets did not boost the performance. Models trained on larger datasets identified more pairs of biomedical terms than models trained on smaller datasets in the relatedness task (from 368 at the 10% level to 494 at the 100% level) and the similarity task (from 374 at the 10% level to 491 at the 100% level). The model trained on abstracts produced results that have higher correlations with the reference standards than the one trained on article bodies (i.e., 0.65 vs. 0.62 in the similarity task and 0.66 vs. 0.59 in the relatedness task). However, the latter identified more pairs of biomedical terms than the former (i.e., 344 vs. 498 in the similarity task and 339 vs. 503 in the relatedness task). CONCLUSIONS Increasing the size of the dataset does not always enhance the performance. Increasing the size of datasets can result in the identification of more relations of biomedical terms even though it does not guarantee better precision. As summaries of research articles, compared with article bodies, abstracts excel in accuracy but lose in coverage of identifiable relations.
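The evaluation described in this abstract (cosine similarity between term vectors, compared against a reference standard, with both correlation and the number of covered pairs reported) can be sketched as follows. This is a minimal illustration, not the authors' code: the toy vectors, the term pairs, the human scores, and the choice of Spearman rank correlation are all assumptions, since the abstract does not state which correlation coefficient was used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def evaluate(vectors, reference):
    """Return (number of reference pairs covered by the model, correlation)."""
    model_scores, human_scores = [], []
    for term_a, term_b, human_score in reference:
        if term_a in vectors and term_b in vectors:  # pair is "identified"
            model_scores.append(cosine(vectors[term_a], vectors[term_b]))
            human_scores.append(human_score)
    return len(model_scores), spearman(model_scores, human_scores)

# Hypothetical toy embeddings and reference ratings (illustration only).
vectors = {
    "aspirin": [1.0, 0.0],
    "ibuprofen": [0.9, 0.1],
    "fracture": [0.0, 1.0],
    "cast": [0.2, 0.9],
}
reference = [
    ("aspirin", "ibuprofen", 3.8),
    ("fracture", "cast", 3.0),
    ("aspirin", "fracture", 1.2),
    ("aspirin", "warfarin", 2.0),  # "warfarin" has no vector, so the pair is skipped
]
n_pairs, rho = evaluate(vectors, reference)
print(n_pairs, round(rho, 3))
```

Reporting the covered-pair count next to the correlation mirrors the trade-off in the abstract: a model can cover more pairs without correlating better with the reference standard.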
Affiliation(s)
- Yongjun Zhu
- Healthcare Policy and Research, Weill Cornell Medicine, Cornell University, New York, NY, USA
- Erjia Yan
- College of Computing and Informatics, Drexel University, Philadelphia, PA, USA
- Fei Wang
- Healthcare Policy and Research, Weill Cornell Medicine, Cornell University, New York, NY, USA
|
43
|
Zhang Y, Zhang O, Wu Y, Lee HJ, Xu J, Xu H, Roberts K. Psychiatric symptom recognition without labeled data using distributional representations of phrases and on-line knowledge. J Biomed Inform 2017. [PMID: 28624644 DOI: 10.1016/j.jbi.2017.06.014] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 10/19/2022]
Abstract
OBJECTIVE Mental health is becoming an increasingly important topic in healthcare. Psychiatric symptoms, which consist of subjective descriptions of the patient's experience, as well as the nature and severity of mental disorders, are critical to support the phenotypic classification for personalized prevention, diagnosis, and intervention of mental disorders. However, few automated approaches have been proposed to extract psychiatric symptoms from clinical text, mainly due to (a) the lack of annotated corpora, which are time-consuming and costly to build, and (b) the inherent linguistic difficulties that symptoms present as they are not well-defined clinical concepts like diseases. The goal of this study is to investigate techniques for recognizing psychiatric symptoms in clinical text without labeled data. Instead, external knowledge in the form of publicly available "seed" lists of symptoms is leveraged using unsupervised distributional representations. MATERIALS AND METHODS First, psychiatric symptoms are collected from three online repositories of healthcare knowledge for consumers (MedlinePlus, Mayo Clinic, and the American Psychiatric Association) for use as seed terms. Candidate symptoms in psychiatric notes are automatically extracted using phrasal syntax patterns. In particular, the 2016 CEGS N-GRID challenge data serves as the psychiatric note corpus. Second, three corpora (psychiatric notes, psychiatric forum data, and MIMIC II) are adopted to generate distributional representations with paragraph2vec. Finally, semantic similarity between the distributional representations of the seed symptoms and candidate symptoms is calculated to assess the relevance of a phrase. Experiments were performed on a set of psychiatric notes from the CEGS N-GRID 2016 Challenge. RESULTS & CONCLUSION Our method demonstrates good performance at extracting symptoms from an unseen corpus, including symptoms with no word overlap with the provided seed terms. Semantic similarity based on the distributional representation outperformed baseline methods. Our experiment yielded two interesting results. First, distributional representations built from social media data outperformed those built from clinical data. Second, the distributional representation model built from sentences resulted in better representations of phrases than the model built from phrases alone.
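The final step of the method (scoring each candidate phrase by its semantic similarity to the seed symptoms) might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the vectors are hand-written toy values standing in for paragraph2vec output, and both the "maximum similarity over all seeds" scoring rule and the `threshold` parameter are hypothetical choices made here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rank_candidates(seed_vectors, candidate_vectors, threshold=0.5):
    """Keep candidates whose best similarity to any seed meets the threshold.

    Returns (phrase, score) pairs sorted from most to least similar.
    The max-over-seeds rule and the threshold are assumptions for illustration.
    """
    scored = []
    for phrase, vec in candidate_vectors.items():
        best = max(cosine(vec, seed) for seed in seed_vectors.values())
        if best >= threshold:
            scored.append((phrase, best))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical toy vectors; a real system would learn these from a corpus.
seeds = {
    "insomnia": [1.0, 0.1, 0.0],
    "hopelessness": [0.0, 1.0, 0.1],
}
candidates = {
    "trouble sleeping": [0.9, 0.2, 0.0],
    "feels worthless": [0.1, 0.9, 0.2],
    "blue sedan": [0.0, 0.0, 1.0],
}
ranked = rank_candidates(seeds, candidates)
for phrase, score in ranked:
    print(f"{phrase}: {score:.3f}")
```

Because the score depends only on vector similarity, a candidate such as "trouble sleeping" can be accepted even though it shares no words with any seed term, which is the property the abstract highlights.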
Affiliation(s)
- Yaoyun Zhang
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Yonghui Wu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Hee-Jin Lee
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Jun Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Kirk Roberts
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
|
44
|
Cohen T, Widdows D. Embedding of semantic predications. J Biomed Inform 2017; 68:150-166. [PMID: 28284761 DOI: 10.1016/j.jbi.2017.03.003] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 11/13/2016] [Revised: 02/27/2017] [Accepted: 03/05/2017] [Indexed: 11/20/2022]
Abstract
This paper concerns the generation of distributed vector representations of biomedical concepts from structured knowledge, in the form of subject-relation-object triplets known as semantic predications. Specifically, we evaluate the extent to which a representational approach we have developed for this purpose previously, known as Predication-based Semantic Indexing (PSI), might benefit from insights gleaned from neural-probabilistic language models, which have enjoyed a surge in popularity in recent years as a means to generate distributed vector representations of terms from free text. To do so, we develop a novel neural-probabilistic approach to encoding predications, called Embedding of Semantic Predications (ESP), by adapting aspects of the Skipgram with Negative Sampling (SGNS) algorithm to this purpose. We compare ESP and PSI across a number of tasks including recovery of encoded information, estimation of semantic similarity and relatedness, and identification of potentially therapeutic and harmful relationships using both analogical retrieval and supervised learning. We find advantages for ESP in some, but not all of these tasks, revealing the contexts in which the additional computational work of neural-probabilistic modeling is justified.
Affiliation(s)
- Trevor Cohen
- School of Biomedical Informatics, The University of Texas Health Science Center, Houston, TX, United States.
|
45
|
Yu Z, Wallace BC, Johnson T, Cohen T. Retrofitting Concept Vector Representations of Medical Concepts to Improve Estimates of Semantic Similarity and Relatedness. Stud Health Technol Inform 2017; 245:657-661. [PMID: 29295178 PMCID: PMC6464117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
Estimation of semantic similarity and relatedness between biomedical concepts has utility for many informatics applications. Automated methods fall into two categories: methods based on distributional statistics drawn from text corpora, and methods using the structure of existing knowledge resources. Methods in the former category disregard taxonomic structure, while those in the latter fail to consider semantically relevant empirical information. In this paper, we present a method that retrofits distributional context vector representations of biomedical concepts using structural information from the UMLS Metathesaurus, such that the similarity between vector representations of linked concepts is augmented. We evaluated it on the UMNSRS benchmark. Our results demonstrate that retrofitting of concept vector representations leads to better correlation with human raters for both similarity and relatedness, surpassing the best results reported to date. They also demonstrate a clear improvement in performance on this reference standard for retrofitted vector representations, as compared to those without retrofitting.
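The idea of retrofitting, pulling the vectors of linked concepts closer together while staying near the original distributional vectors, can be illustrated with a generic iterative averaging update. The sketch below is an assumption-laden illustration rather than the method evaluated in the paper: the update rule (equal weight on the original vector and on the mean of linked neighbours), the concept identifiers `C1` to `C3`, and the toy `links` dictionary standing in for UMLS Metathesaurus relations are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrofit(vectors, links, iterations=10):
    """Move each linked concept toward the mean of its neighbours.

    Each update sets a concept's vector to the average of (a) its original
    vector and (b) the mean of its neighbours' current vectors. Concepts
    without links keep their original vectors.
    """
    original = {c: list(v) for c, v in vectors.items()}
    current = {c: list(v) for c, v in vectors.items()}
    for _ in range(iterations):
        for concept, neighbours in links.items():
            linked = [n for n in neighbours if n in current]
            if concept not in current or not linked:
                continue
            k = len(linked)
            current[concept] = [
                (sum(current[n][d] for n in linked) + k * original[concept][d]) / (2 * k)
                for d in range(len(original[concept]))
            ]
    return current

# Hypothetical toy data: C1 and C2 are linked in the knowledge resource, C3 is not.
vectors = {"C1": [1.0, 0.0], "C2": [0.0, 1.0], "C3": [1.0, 1.0]}
links = {"C1": ["C2"], "C2": ["C1"]}

before_sim = cosine(vectors["C1"], vectors["C2"])
retrofitted = retrofit(vectors, links)
after_sim = cosine(retrofitted["C1"], retrofitted["C2"])
print(round(before_sim, 3), round(after_sim, 3))
```

In this toy case the two linked concepts start out orthogonal and end up with a clearly positive cosine similarity, while the unlinked concept is untouched, which is the "augmented similarity between linked concepts" effect the abstract describes.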
Affiliation(s)
- Zhiguo Yu
- The University of Texas School of Biomedical Informatics at Houston, Houston, Texas, USA
- Byron C. Wallace
- College of Computer and Information Science, Northeastern University, Boston, Massachusetts, USA
- Todd Johnson
- The University of Texas School of Biomedical Informatics at Houston, Houston, Texas, USA
- Trevor Cohen
- The University of Texas School of Biomedical Informatics at Houston, Houston, Texas, USA
|
46
|
Measuring content overlap during handoff communication using distributional semantics: An exploratory study. J Biomed Inform 2017; 65:132-144. [DOI: 10.1016/j.jbi.2016.11.009] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 08/06/2016] [Revised: 11/08/2016] [Accepted: 11/26/2016] [Indexed: 11/23/2022]
|