1
|
Ong JCL, Chen MH, Ng N, Elangovan K, Tan NYT, Jin L, Xie Q, Ting DSW, Rodriguez-Monguio R, Bates DW, Liu N. A scoping review on generative AI and large language models in mitigating medication related harm. NPJ Digit Med 2025; 8:182. [PMID: 40155703 PMCID: PMC11953325 DOI: 10.1038/s41746-025-01565-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2024] [Accepted: 03/06/2025] [Indexed: 04/01/2025] Open
Abstract
Medication-related harm has a significant impact on global healthcare costs and patient outcomes. Generative artificial intelligence (GenAI) and large language models (LLM) have emerged as a promising tool in mitigating risks of medication-related harm. This review evaluates the scope and effectiveness of GenAI and LLM in reducing medication-related harm. We screened 4 databases for literature published from 1st January 2012 to 15th October 2024. A total of 3988 articles were identified, and 30 met the criteria for inclusion into the final review. Generative AI and LLMs were applied in three key applications: drug-drug interaction identification and prediction, clinical decision support, and pharmacovigilance. While the performance and utility of these models varied, they generally showed promise in early identification, classification of adverse drug events, and supporting decision-making for medication management. However, no studies tested these models prospectively, suggesting a need for further investigation into integration and real-world application.
Affiliation(s)
- Jasmine Chiat Ling Ong
- Division of Pharmacy, Singapore General Hospital, Singapore, Singapore
- Department of Pharmacy, University of California, San Francisco, CA, USA
- Duke-NUS Medical School, Singapore, Singapore
| | | | - Ning Ng
- Artificial Intelligence Office, Singapore Health Services, Singapore, Singapore
| | - Kabilan Elangovan
- Artificial Intelligence Office, Singapore Health Services, Singapore, Singapore
- Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore
| | | | - Liyuan Jin
- Duke-NUS Medical School, Singapore, Singapore
- Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore
| | - Qihuang Xie
- School of Pharmacy, National University of Singapore, Singapore, Singapore
| | - Daniel Shu Wei Ting
- Artificial Intelligence Office, Singapore Health Services, Singapore, Singapore
- Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore
- Byers Eye Institute, Stanford University, California, CA, USA
| | - Rosa Rodriguez-Monguio
- Department of Clinical Pharmacy, School of Pharmacy, University of California, San Francisco, CA, USA
- Medication Outcomes Center, University of California, San Francisco, CA, USA
| | - David W Bates
- Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Division of General Internal Medicine, Brigham and Women's Hospital, Boston, MA, USA
| | - Nan Liu
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore, Singapore
- Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore, Singapore
- NUS AI Institute, National University of Singapore, Singapore, Singapore
| |
|
2
|
Xu K, Song Y, Ma J. Identifying protected health information by transformers-based deep learning approach in Chinese medical text. Health Informatics J 2025; 31:14604582251315594. [PMID: 39862116 DOI: 10.1177/14604582251315594] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/27/2025]
Abstract
Purpose: In the context of Chinese clinical texts, this paper aims to propose a deep learning algorithm based on Bidirectional Encoder Representation from Transformers (BERT) to identify privacy information and to verify the feasibility of our method for privacy protection in the Chinese clinical context. Methods: We collected and double-annotated 33,017 discharge summaries from 151 medical institutions on a municipal regional health information platform, developed a BERT-based Bidirectional Long Short-Term Memory Model (BiLSTM) and Conditional Random Field (CRF) model, and tested the performance of privacy identification on the dataset. To explore the performance of different substructures of the neural network, we created five additional baseline models and evaluated the impact of different models on performance. Results: Based on the annotated data, the BERT model pre-trained with the medical corpus showed a significant performance improvement to the BiLSTM-CRF model with a micro-recall of 0.979 and an F1 value of 0.976, which indicates that the model has promising performance in identifying private information in Chinese clinical texts. Conclusions: The BERT-based BiLSTM-CRF model excels in identifying privacy information in Chinese clinical texts, and the application of this model is very effective in protecting patient privacy and facilitating data sharing.
Affiliation(s)
- Kun Xu
- School of Medicine and Health Management in Huazhong University of Science and Technology, Wuhan, China
| | - Yang Song
- School of Medicine and Health Management in Huazhong University of Science and Technology, Wuhan, China
| | - Jingdong Ma
- School of Medicine and Health Management in Huazhong University of Science and Technology, Wuhan, China
| |
|
3
|
Kugic A, Martin I, Modersohn L, Pallaoro P, Kreuzthaler M, Schulz S, Boeker M. Processing of Short-Form Content in Clinical Narratives: Systematic Scoping Review. J Med Internet Res 2024; 26:e57852. [PMID: 39325515 PMCID: PMC11467596 DOI: 10.2196/57852] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2024] [Revised: 05/24/2024] [Accepted: 07/25/2024] [Indexed: 09/27/2024] Open
Abstract
BACKGROUND: Clinical narratives are essential components of electronic health records. The adoption of electronic health records has increased documentation time for hospital staff, leading to the use of abbreviations and acronyms more frequently. This brevity can potentially hinder comprehension for both professionals and patients. OBJECTIVE: This review aims to provide an overview of the types of short forms found in clinical narratives, as well as the natural language processing (NLP) techniques used for their identification, expansion, and disambiguation. METHODS: In the databases Web of Science, Embase, MEDLINE, EBMR (Evidence-Based Medicine Reviews), and ACL Anthology, publications that met the inclusion criteria were searched according to PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines for a systematic scoping review. Original, peer-reviewed publications focusing on short-form processing in human clinical narratives were included, covering the period from January 2018 to February 2023. Short-form types were extracted, and multidimensional research methodologies were assigned to each target objective (identification, expansion, and disambiguation). NLP study recommendations and study characteristics were systematically assigned occurrence rates for evaluation. RESULTS: Out of a total of 6639 records, only 19 articles were included in the final analysis. Rule-based approaches were predominantly used for identifying short forms, while string similarity and vector representations were applied for expansion. Embeddings and deep learning approaches were used for disambiguation. CONCLUSIONS: The scope and types of what constitutes a clinical short form were often not explicitly defined by the authors. This lack of definition poses challenges for reproducibility and for determining whether specific methodologies are suitable for different types of short forms. Analysis of a subset of NLP recommendations for assessing quality and reproducibility revealed only partial adherence to these recommendations. Single-character abbreviations were underrepresented in studies on clinical narrative processing, as were investigations in languages other than English. Future research should focus on these 2 areas, and each paper should include descriptions of the types of content analyzed.
Affiliation(s)
- Amila Kugic
- Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Graz, Austria
| | - Ingrid Martin
- Institute for AI and Informatics in Medicine, School of Medicine and Health, Technical University of Munich, Munich, Germany
| | - Luise Modersohn
- Institute for AI and Informatics in Medicine, School of Medicine and Health, Technical University of Munich, Munich, Germany
| | - Peter Pallaoro
- Institute for AI and Informatics in Medicine, School of Medicine and Health, Technical University of Munich, Munich, Germany
| | - Markus Kreuzthaler
- Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Graz, Austria
| | - Stefan Schulz
- Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Graz, Austria
| | - Martin Boeker
- Institute for AI and Informatics in Medicine, School of Medicine and Health, Technical University of Munich, Munich, Germany
| |
|
4
|
Jonker RAA, Almeida T, Antunes R, Almeida JR, Matos S. Multi-head CRF classifier for biomedical multi-class named entity recognition on Spanish clinical notes. Database (Oxford) 2024; 2024:baae068. [PMID: 39083461 PMCID: PMC11290360 DOI: 10.1093/database/baae068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2024] [Revised: 05/15/2024] [Accepted: 07/08/2024] [Indexed: 08/02/2024]
Abstract
The identification of medical concepts from clinical narratives has a large interest in the biomedical scientific community due to its importance in treatment improvements or drug development research. Biomedical named entity recognition (NER) in clinical texts is crucial for automated information extraction, facilitating patient record analysis, drug development, and medical research. Traditional approaches often focus on single-class NER tasks, yet recent advancements emphasize the necessity of addressing multi-class scenarios, particularly in complex biomedical domains. This paper proposes a strategy to integrate a multi-head conditional random field (CRF) classifier for multi-class NER in Spanish clinical documents. Our methodology overcomes overlapping entity instances of different types, a common challenge in traditional NER methodologies, by using a multi-head CRF model. This architecture enhances computational efficiency and ensures scalability for multi-class NER tasks, maintaining high performance. By combining four diverse datasets, SympTEMIST, MedProcNER, DisTEMIST, and PharmaCoNER, we expand the scope of NER to encompass five classes: symptoms, procedures, diseases, chemicals, and proteins. To the best of our knowledge, these datasets combined create the largest Spanish multi-class dataset focusing on biomedical entity recognition and linking for clinical notes, which is important to train a biomedical model in Spanish. We also provide entity linking to the multi-lingual Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) vocabulary, with the eventual goal of performing biomedical relation extraction. Through experimentation and evaluation of Spanish clinical documents, our strategy provides competitive results against single-class NER models. For NER, our system achieves a combined micro-averaged F1-score of 78.73, with clinical mentions normalized to SNOMED CT with an end-to-end F1-score of 54.51. 
The code to run our system is publicly available at https://github.com/ieeta-pt/Multi-Head-CRF. Database URL: https://github.com/ieeta-pt/Multi-Head-CRF.
Affiliation(s)
- Richard A A Jonker
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Tiago Almeida
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Rui Antunes
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - João R Almeida
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Sérgio Matos
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| |
|
5
|
Almeida T, Jonker RAA, Antunes R, Almeida JR, Matos S. Towards discovery: an end-to-end system for uncovering novel biomedical relations. Database (Oxford) 2024; 2024:baae057. [PMID: 38994795 PMCID: PMC11240158 DOI: 10.1093/database/baae057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2024] [Revised: 05/20/2024] [Accepted: 06/19/2024] [Indexed: 07/13/2024]
Abstract
Biomedical relation extraction is an ongoing challenge within the natural language processing community. Its application is important for understanding scientific biomedical literature, with many use cases, such as drug discovery, precision medicine, disease diagnosis, treatment optimization and biomedical knowledge graph construction. Therefore, the development of a tool capable of effectively addressing this task holds the potential to improve knowledge discovery by automating the extraction of relations from research manuscripts. The first track in the BioCreative VIII competition extended the scope of this challenge by introducing the detection of novel relations within the literature. This paper describes that our participation system initially focused on jointly extracting and classifying novel relations between biomedical entities. We then describe our subsequent advancement to an end-to-end model. Specifically, we enhanced our initial system by incorporating it into a cascading pipeline that includes a tagger and linker module. This integration enables the comprehensive extraction of relations and classification of their novelty directly from raw text. Our experiments yielded promising results, and our tagger module managed to attain state-of-the-art named entity recognition performance, with a micro F1-score of 90.24, while our end-to-end system achieved a competitive novelty F1-score of 24.59. The code to run our system is publicly available at https://github.com/ieeta-pt/BioNExt. Database URL: https://github.com/ieeta-pt/BioNExt.
Affiliation(s)
- Tiago Almeida
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Richard A A Jonker
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Rui Antunes
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - João R Almeida
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Sérgio Matos
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| |
|
6
|
Vakili T, Henriksson A, Dalianis H. End-to-end pseudonymization of fine-tuned clinical BERT models: Privacy preservation with maintained data utility. BMC Med Inform Decis Mak 2024; 24:162. [PMID: 38915012 PMCID: PMC11197357 DOI: 10.1186/s12911-024-02546-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2023] [Accepted: 05/21/2024] [Indexed: 06/26/2024] Open
Abstract
Many state-of-the-art results in natural language processing (NLP) rely on large pre-trained language models (PLMs). These models consist of large amounts of parameters that are tuned using vast amounts of training data. These factors cause the models to memorize parts of their training data, making them vulnerable to various privacy attacks. This is cause for concern, especially when these models are applied in the clinical domain, where data are very sensitive. Training data pseudonymization is a privacy-preserving technique that aims to mitigate these problems. This technique automatically identifies and replaces sensitive entities with realistic but non-sensitive surrogates. Pseudonymization has yielded promising results in previous studies. However, no previous study has applied pseudonymization to both the pre-training data of PLMs and the fine-tuning data used to solve clinical NLP tasks. This study evaluates the effects on the predictive performance of end-to-end pseudonymization of Swedish clinical BERT models fine-tuned for five clinical NLP tasks. A large number of statistical tests are performed, revealing minimal harm to performance when using pseudonymized fine-tuning data. The results also find no deterioration from end-to-end pseudonymization of pre-training and fine-tuning data. These results demonstrate that pseudonymizing training data to reduce privacy risks can be done without harming data utility for training PLMs.
Affiliation(s)
- Thomas Vakili
- Department of Computer and Systems Sciences, Stockholm University, P.O. Box 7003, 164 07, Kista, Stockholm, Sweden
| | - Aron Henriksson
- Department of Computer and Systems Sciences, Stockholm University, P.O. Box 7003, 164 07, Kista, Stockholm, Sweden
| | - Hercules Dalianis
- Department of Computer and Systems Sciences, Stockholm University, P.O. Box 7003, 164 07, Kista, Stockholm, Sweden
| |
|
7
|
Zuo X, Zhou Y, Duke J, Hripcsak G, Shah N, Banda JM, Reeves R, Miller T, Waitman LR, Natarajan K, Xu H. Standardizing Multi-site Clinical Note Titles to LOINC Document Ontology: A Transformer-based Approach. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2024; 2023:834-843. [PMID: 38222429 PMCID: PMC10785935] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/16/2024]
Abstract
The types of clinical notes in electronic health records (EHRs) are diverse and it would be great to standardize them to ensure unified data retrieval, exchange, and integration. The LOINC Document Ontology (DO) is a subset of LOINC that is created specifically for naming and describing clinical documents. Despite the efforts of promoting and improving this ontology, how to efficiently deploy it in real-world clinical settings has yet to be explored. In this study we evaluated the utility of LOINC DO by mapping clinical note titles collected from five institutions to the LOINC DO and classifying the mapping into three classes based on semantic similarity between note titles and LOINC DO codes. Additionally, we developed a standardization pipeline that automatically maps clinical note titles from multiple sites to suitable LOINC DO codes, without accessing the content of clinical notes. The pipeline can be initialized with different large language models, and we compared the performances between them. The results showed that our automated pipeline achieved an accuracy of 0.90. By comparing the manual and automated mapping results, we analyzed the coverage of LOINC DO in describing multi-site clinical note titles and summarized the potential scope for extension.
Affiliation(s)
- Xu Zuo
- University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Yujia Zhou
- University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Jon Duke
- Georgia Institute of Technology, Atlanta, GA, USA
- OHDSI Consortium, Natural Language Processing Working Group
| | - George Hripcsak
- Columbia University, New York City, NY, USA
- OHDSI Consortium, Natural Language Processing Working Group
| | - Nigam Shah
- Stanford University, Stanford, CA, USA
- OHDSI Consortium, Natural Language Processing Working Group
| | - Juan M Banda
- Georgia State University, Atlanta, GA, USA
- OHDSI Consortium, Natural Language Processing Working Group
| | - Ruth Reeves
- Vanderbilt University Medical Center, Nashville, TN, USA
- OHDSI Consortium, Natural Language Processing Working Group
| | - Timothy Miller
- Boston Children's Hospital, Boston, MA, USA
- OHDSI Consortium, Natural Language Processing Working Group
| | | | - Karthik Natarajan
- Columbia University, New York City, NY, USA
- OHDSI Consortium, Natural Language Processing Working Group
| | - Hua Xu
- Yale University, New Haven, CT, USA
- OHDSI Consortium, Natural Language Processing Working Group
| |
|
8
|
Abdulnazar A, Roller R, Schulz S, Kreuzthaler M. Unsupervised SapBERT-based bi-encoders for medical concept annotation of clinical narratives with SNOMED CT. Digit Health 2024; 10:20552076241288681. [PMID: 39493636 PMCID: PMC11531008 DOI: 10.1177/20552076241288681] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2024] [Accepted: 09/03/2024] [Indexed: 11/05/2024] Open
Abstract
Objective: Clinical narratives provide comprehensive patient information. Achieving interoperability involves mapping relevant details to standardized medical vocabularies. Typically, natural language processing divides this task into named entity recognition (NER) and medical concept normalization (MCN). State-of-the-art results require supervised setups with abundant training data. However, the limited availability of annotated data due to sensitivity and time constraints poses challenges. This study addressed the need for unsupervised medical concept annotation (MCA) to overcome these limitations and support the creation of annotated datasets. Method: We use an unsupervised SapBERT-based bi-encoder model to analyze n-grams from narrative text and measure their similarity to SNOMED CT concepts. At the end, we apply a syntactical re-ranker. For evaluation, we use the semantic tags of SNOMED CT candidates to assess the NER phase and their concept IDs to assess the MCN phase. The approach is evaluated with both English and German narratives. Result: Without training data, our unsupervised approach achieves an F1 score of 0.765 in English and 0.557 in German for MCN. Evaluation at the semantic tag level reveals that "disorder" has the highest F1 scores, 0.871 and 0.648 on English and German datasets. Furthermore, the MCA approach on the semantic tag "disorder" shows F1 scores of 0.839 and 0.696 in English and 0.685 and 0.437 in German for NER and MCN, respectively. Conclusion: This unsupervised approach demonstrates potential for initial annotation (pre-labeling) in manual annotation tasks. While promising for certain semantic tags, challenges remain, including false positives, contextual errors, and variability of clinical language, requiring further fine-tuning.
Affiliation(s)
- Akhila Abdulnazar
- Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Austria
- CBmed GmbH – Center for Biomarker Research in Medicine, Graz, Austria
| | - Roland Roller
- German Research Center for Artificial Intelligence (DFKI), Berlin, Germany
| | - Stefan Schulz
- Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Austria
| | - Markus Kreuzthaler
- Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Austria
| |
|
9
|
Zahra FA, Kate RJ. Obtaining clinical term embeddings from SNOMED CT ontology. J Biomed Inform 2024; 149:104560. [PMID: 38070816 DOI: 10.1016/j.jbi.2023.104560] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2023] [Revised: 11/29/2023] [Accepted: 12/05/2023] [Indexed: 01/22/2024]
Abstract
Clinical term embeddings are traditionally obtained using corpus-based methods, however, these methods cannot incorporate knowledge about clinical terms which is already present in medical ontologies. On the other hand, graph-based methods can obtain embeddings of clinical concepts from ontologies, but they cannot obtain embeddings for clinical terms and words. In this paper, a novel method is presented to obtain embeddings for clinical terms and words from the SNOMED CT ontology. The method first obtains embeddings of clinical concepts from SNOMED CT using a graph-based method. Next, these concept embeddings are used as targets to train a deep learning model to map clinical terms to concepts embeddings. The learned model then provides embeddings for clinical terms and words as well as maps novel clinical terms to their embeddings. The embeddings obtained using the method out-performed corpus-based embeddings on the task of predicting clinical term similarity on five benchmark datasets. On the clinical term normalization task, using these embeddings simply as a means of computing similarity between clinical terms obtained accuracy which was competitive to methods trained specifically for this task. Both corpus-based and ontology-based embeddings have a limitation that they tend to learn similar embeddings for opposite or analogous terms. To counter this, we also introduce a method to automatically learn patterns that indicate when two clinical terms represent the same concept and when they represent different concepts. Supplementing the normalization process with these patterns showed improvement. Although clinical term embeddings obtained from SNOMED CT incorporate ontological knowledge which is missed by corpus-based embeddings, they do not incorporate linguistic knowledge which is needed for sentence-based tasks. Hence combining ontology-based embeddings with corpus-based embeddings is an avenue for future work.
Affiliation(s)
- Fuad Abu Zahra
- Department of Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI, USA
| | - Rohit J Kate
- Department of Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI, USA
| |
|
10
|
Garda S, Weber-Genzel L, Martin R, Leser U. BELB: a biomedical entity linking benchmark. Bioinformatics 2023; 39:btad698. [PMID: 37975879 PMCID: PMC10681865 DOI: 10.1093/bioinformatics/btad698] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2023] [Revised: 10/30/2023] [Accepted: 11/16/2023] [Indexed: 11/19/2023] Open
Abstract
MOTIVATION: Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base (KB). It plays a vital role in information extraction pipelines for the life sciences literature. We review recent work in the field and find that, as the task is absent from existing benchmarks for biomedical text mining, different studies adopt different experimental setups making comparisons based on published numbers problematic. Furthermore, neural systems are tested primarily on instances linked to the broad coverage KB UMLS, leaving their performance to more specialized ones, e.g. genes or variants, understudied. RESULTS: We therefore developed BELB, a biomedical entity linking benchmark, providing access in a unified format to 11 corpora linked to 7 KBs and spanning six entity types: gene, disease, chemical, species, cell line, and variant. BELB greatly reduces preprocessing overhead in testing BEL systems on multiple corpora offering a standardized testbed for reproducible experiments. Using BELB, we perform an extensive evaluation of six rule-based entity-specific systems and three recent neural approaches leveraging pre-trained language models. Our results reveal a mixed picture showing that neural approaches fail to perform consistently across entity types, highlighting the need of further studies towards entity-agnostic models. AVAILABILITY AND IMPLEMENTATION: The source code of BELB is available at: https://github.com/sg-wbi/belb. The code to reproduce our experiments can be found at: https://github.com/sg-wbi/belb-exp.
Affiliation(s)
- Samuele Garda
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
| | - Leon Weber-Genzel
- Center for Information and Language Processing, Ludwig-Maximilians-Universität München, München 80539, Germany
| | - Robert Martin
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
| | - Ulf Leser
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
| |
|
11
|
Ramakrishnaiah Y, Macesic N, Webb GI, Peleg AY, Tyagi S. EHR-QC: A streamlined pipeline for automated electronic health records standardisation and preprocessing to predict clinical outcomes. J Biomed Inform 2023; 147:104509. [PMID: 37827477 DOI: 10.1016/j.jbi.2023.104509] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/02/2023] [Revised: 09/26/2023] [Accepted: 09/28/2023] [Indexed: 10/14/2023]
Abstract
The adoption of electronic health records (EHRs) has created opportunities to analyse historical data for predicting clinical outcomes and improving patient care. However, non-standardised data representations and anomalies pose major challenges to the use of EHRs in digital health research. To address these challenges, we have developed EHR-QC, a tool comprising two modules: the data standardisation module and the preprocessing module. The data standardisation module migrates source EHR data to a standard format using advanced concept mapping techniques, surpassing expert curation in benchmarking analysis. The preprocessing module includes several functions designed specifically to handle healthcare data subtleties. We provide automated detection of data anomalies and solutions to handle those anomalies. We believe that the development and adoption of tools like EHR-QC is critical for advancing digital health. Our ultimate goal is to accelerate clinical research by enabling rapid experimentation with data-driven observational research to generate robust, generalisable biomedical knowledge.
Affiliation(s)
- Yashpal Ramakrishnaiah
- Department of Infectious Diseases, The Alfred Hospital and Central Clinical School, Monash University, Melbourne 3000, VIC, Australia
| | - Nenad Macesic
- Department of Infectious Diseases, The Alfred Hospital and Central Clinical School, Monash University, Melbourne 3000, VIC, Australia; Centre to Impact AMR, Monash University, Melbourne 3000, VIC, Australia
| | - Geoffrey I Webb
- Department of Infectious Diseases, The Alfred Hospital and Central Clinical School, Monash University, Melbourne 3000, VIC, Australia; Centre to Impact AMR, Monash University, Melbourne 3000, VIC, Australia
| | - Anton Y Peleg
- Department of Infectious Diseases, The Alfred Hospital and Central Clinical School, Monash University, Melbourne 3000, VIC, Australia; Centre to Impact AMR, Monash University, Melbourne 3000, VIC, Australia
| | - Sonika Tyagi
- Department of Infectious Diseases, The Alfred Hospital and Central Clinical School, Monash University, Melbourne 3000, VIC, Australia; School of Computing Technologies, RMIT University, Melbourne 3000, VIC, Australia
| |
|
12
|
Kreuzthaler M, Brochhausen M, Zayas C, Blobel B, Schulz S. Linguistic and ontological challenges of multiple domains contributing to transformed health ecosystems. Front Med (Lausanne) 2023; 10:1073313. [PMID: 37007792 PMCID: PMC10050682 DOI: 10.3389/fmed.2023.1073313] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2022] [Accepted: 02/13/2023] [Indexed: 03/17/2023] Open
Abstract
This paper provides an overview of the current linguistic and ontological challenges that must be met to fully support the transformation of health ecosystems toward precision medicine (5 PM) standards. It highlights standardization and interoperability aspects of formal, controlled representations of clinical and research data, as well as the requirements for smart support to produce and encode content in a way that both humans and machines can understand and process. Starting from the current text-centered communication practices in healthcare and biomedical research, it addresses the state of the art in information extraction using natural language processing (NLP). An important aspect of the language-centered perspective of managing health data is the integration of heterogeneous data sources that employ different natural languages and different terminologies. This is where biomedical ontologies, in the sense of formal, interchangeable representations of types of domain entities, come into play. The paper discusses the state of the art of biomedical ontologies, addresses their importance for standardization and interoperability, and sheds light on current misconceptions and shortcomings. Finally, the paper points out next steps and possible synergies between the field of NLP and the area of Applied Ontology and the Semantic Web to foster data interoperability for 5 PM.
Affiliation(s)
- Markus Kreuzthaler
- Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Graz, Austria
| | - Mathias Brochhausen
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Cilia Zayas
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Bernd Blobel
- Medical Faculty, University of Regensburg, Regensburg, Germany
- eHealth Competence Center Bavaria, Deggendorf Institute of Technology, Deggendorf, Germany
- First Medical Faculty, Charles University Prague, Prague, Czechia
| | - Stefan Schulz
- Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Graz, Austria
- Averbis GmbH, Freiburg, Germany
- *Correspondence: Stefan Schulz,
|
13
|
Leaman R, Islamaj R, Adams V, Alliheedi MA, Almeida JR, Antunes R, Bevan R, Chang YC, Erdengasileng A, Hodgskiss M, Ida R, Kim H, Li K, Mercer RE, Mertová L, Mobasher G, Shin HC, Sung M, Tsujimura T, Yeh WC, Lu Z. Chemical identification and indexing in full-text articles: an overview of the NLM-Chem track at BioCreative VII. Database (Oxford) 2023; 2023:7071696. [PMID: 36882099 PMCID: PMC9991492 DOI: 10.1093/database/baad005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/19/2022] [Revised: 01/06/2023] [Accepted: 02/15/2023] [Indexed: 03/09/2023]
Abstract
The BioCreative National Library of Medicine (NLM)-Chem track calls for a community effort to fine-tune automated recognition of chemical names in the biomedical literature. Chemicals are one of the most searched biomedical entities in PubMed, and, as highlighted during the coronavirus disease 2019 pandemic, their identification may significantly advance research in multiple biomedical subfields. While previous community challenges focused on identifying chemical names mentioned in titles and abstracts, the full text contains valuable additional detail. We, therefore, organized the BioCreative NLM-Chem track as a community effort to address automated chemical entity recognition in full-text articles. The track consisted of two tasks: (i) chemical identification and (ii) chemical indexing. The chemical identification task required predicting all chemicals mentioned in recently published full-text articles, both span [i.e. named entity recognition (NER)] and normalization (i.e. entity linking), using Medical Subject Headings (MeSH). The chemical indexing task required identifying which chemicals reflect topics for each article and should therefore appear in the listing of MeSH terms for the document in the MEDLINE article indexing. This manuscript summarizes the BioCreative NLM-Chem track and post-challenge experiments. We received a total of 85 submissions from 17 teams worldwide. The highest performance achieved for the chemical identification task was 0.8672 F-score (0.8759 precision and 0.8587 recall) for strict NER performance and 0.8136 F-score (0.8621 precision and 0.7702 recall) for strict normalization performance. The highest performance achieved for the chemical indexing task was 0.6073 F-score (0.7417 precision and 0.5141 recall).
This community challenge demonstrated that (i) the current substantial achievements in deep learning technologies can be utilized to improve automated prediction accuracy further and (ii) the chemical indexing task is substantially more challenging. We look forward to further developing biomedical text-mining methods to respond to the rapid growth of biomedical literature. The NLM-Chem track dataset and other challenge materials are publicly available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BC7-NLM-Chem-track/. Database URL https://ftp.ncbi.nlm.nih.gov/pub/lu/BC7-NLM-Chem-track/.
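The F-scores quoted in this abstract can be checked against the reported precision and recall values. The following is a minimal sketch, assuming the track reports the balanced F-score (the harmonic mean of precision and recall); the function name `f_score` and the dictionary of reported values are illustrative and not part of the challenge materials.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as quoted in the abstract above.
reported = {
    "strict NER": (0.8759, 0.8587),
    "strict normalization": (0.8621, 0.7702),
    "chemical indexing": (0.7417, 0.5141),
}

for task, (p, r) in reported.items():
    # Prints 0.8672, 0.8136 and 0.6073 respectively.
    print(f"{task}: F = {f_score(p, r):.4f}")
```

Under that assumption, the recomputed values agree with the three F-scores stated in the abstract to four decimal places.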
Affiliation(s)
| | | | - Virginia Adams
- NVIDIA, 2788 San Tomas Expressway, Santa Clara, CA 95051, USA
| | - Mohammed A Alliheedi
- Department of Computer Science, Al Baha University, 4781 King Fahd Rd, Al Aqiq 65779, Saudi Arabia
| | - João Rafael Almeida
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Department of Information and Communications Technologies, University of A Coruña, Camiño do Lagar de Castro, A Coruña 15008, Spain
| | - Rui Antunes
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Robert Bevan
- Informatics Department, Medicines Discovery Catapult, Alderley Park, Block 35, Mereside, Macclesfield SK10 4ZF, UK
| | - Yung-Chun Chang
- Graduate Institute of Data Science, Taipei Medical University, No. 172-1, Section 2, Keelung Rd, Da’an District, Taipei City, Taipei 106, Taiwan
| | - Arslan Erdengasileng
- Department of Statistics, Florida State University, 117 N. Woodward Ave, Tallahassee, FL 32306, USA
| | - Matthew Hodgskiss
- Informatics Department, Medicines Discovery Catapult, Alderley Park, Block 35, Mereside, Macclesfield SK10 4ZF, UK
| | - Ryuki Ida
- Computational Intelligence Laboratory, Toyota Technological Institute, 2-12-1 Hisakata, Tempaku-ku, Nagoya, Aichi 468-8511, Japan
| | - Hyunjae Kim
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, South Korea
| | - Keqiao Li
- Department of Statistics, Florida State University, 117 N. Woodward Ave, Tallahassee, FL 32306, USA
| | - Robert E Mercer
- Department of Computer Science, The University of Western Ontario, Room 355, Middlesex College, Ontario, London N6A 5B7, Canada
| | - Lukrécia Mertová
- Scientific Databases and Visualization Group, Heidelberg Institute for Theoretical Studies (HITS gGmbH), Schloss-Wolfsbrunnenweg 35, Heidelberg 69118, Germany
| | - Ghadeer Mobasher
- Scientific Databases and Visualization Group, Heidelberg Institute for Theoretical Studies (HITS gGmbH), Schloss-Wolfsbrunnenweg 35, Heidelberg 69118, Germany
- Institute of Computer Science, Heidelberg University, Im Neuenheimer Feld 205, Heidelberg 69120, Germany
| | - Hoo-Chang Shin
- NVIDIA, 2788 San Tomas Expressway, Santa Clara, CA 95051, USA
| | - Mujeen Sung
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, South Korea
| | - Tomoki Tsujimura
- Computational Intelligence Laboratory, Toyota Technological Institute, 2-12-1 Hisakata, Tempaku-ku, Nagoya, Aichi 468-8511, Japan
| | - Wen-Chao Yeh
- Institute of Information Systems and Applications, National Tsing Hua University, No. 101, Section 2, Kuang-Fu Road, Hsinchu 30013, Taiwan
| | - Zhiyong Lu
- *Corresponding author: Tel: +1-301-594-7089; Fax: +1-301-480-2290;
|
14
|
Lin S, Nateqi J, Weingartner-Ortner R, Gruarin S, Marling H, Pilgram V, Lagler FB, Aigner E, Martin AG. An artificial intelligence-based approach for identifying rare disease patients using retrospective electronic health records applied for Pompe disease. Front Neurol 2023; 14:1108222. [PMID: 37153672 PMCID: PMC10160659 DOI: 10.3389/fneur.2023.1108222] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 12/13/2022] [Accepted: 04/03/2023] [Indexed: 05/10/2023] Open
Abstract
Objective We retrospectively screened 350,116 electronic health records (EHRs) to identify suspected patients for Pompe disease. Using these suspected patients, we then describe their phenotypical characteristics and estimate the prevalence in the respective population covered by the EHRs. Methods We applied Symptoma's Artificial Intelligence-based approach for identifying rare disease patients to retrospective anonymized EHRs provided by the "University Hospital Salzburg" clinic group. Within 1 month, the AI screened 350,116 EHRs reaching back 15 years from five hospitals, and 104 patients were flagged as probable for Pompe disease. Flagged patients were manually reviewed and assessed by generalist and specialist physicians for their likelihood for Pompe disease, from which the performance of the algorithms was evaluated. Results Of the 104 patients flagged by the algorithms, generalist physicians found five "diagnosed," 10 "suspected," and seven patients with "reduced suspicion." After feedback from Pompe disease specialist physicians, 19 patients remained clinically plausible for Pompe disease, resulting in a specificity of 18.27% for the AI. Estimating from the remaining plausible patients, the prevalence of Pompe disease for the greater Salzburg region [incl. Bavaria (Germany), Styria (Austria), and Upper Austria (Austria)] was one in every 18,427 people. Phenotypes for patient cohorts with an approximated onset of symptoms below or above 1 year of age were established, which correspond to infantile-onset Pompe disease (IOPD) and late-onset Pompe disease (LOPD), respectively. Conclusion Our study shows the feasibility of Symptoma's AI-based approach for identifying rare disease patients using retrospective EHRs. Via the algorithm's screening of an entire EHR population, a physician had only to manually review 5.47 patients on average to find one suspected candidate.
This efficiency is crucial as Pompe disease, while rare, is a progressively debilitating but treatable neuromuscular disease. As such, we demonstrated both the efficiency of the approach and the potential of a scalable solution to the systematic identification of rare disease patients. Thus, similar implementation of this methodology should be encouraged to improve care for all rare disease patients.
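The two headline ratios in this abstract follow directly from the counts it reports (104 flagged patients, 19 remaining clinically plausible). The sketch below reproduces that arithmetic; the function names are illustrative. Note that the abstract labels the 18.27% figure "specificity", whereas a ratio of confirmed cases among flagged cases is conventionally called positive predictive value (precision).

```python
def share_plausible(flagged: int, plausible: int) -> float:
    """Fraction of flagged patients that remained clinically plausible."""
    return plausible / flagged


def reviews_per_candidate(flagged: int, plausible: int) -> float:
    """Average number of flagged records reviewed per plausible candidate."""
    return flagged / plausible


FLAGGED = 104    # patients flagged by the algorithm (from the abstract)
PLAUSIBLE = 19   # patients still plausible after specialist feedback

print(f"{share_plausible(FLAGGED, PLAUSIBLE):.2%}")        # 18.27%
print(f"{reviews_per_candidate(FLAGGED, PLAUSIBLE):.2f}")  # 5.47
```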
Affiliation(s)
- Simon Lin
- Science Department, Symptoma GmbH, Vienna, Austria
- Department of Internal Medicine, Paracelsus Medical University, Salzburg, Austria
| | - Jama Nateqi
- Science Department, Symptoma GmbH, Vienna, Austria
- Department of Internal Medicine, Paracelsus Medical University, Salzburg, Austria
| | | | | | | | - Vinzenz Pilgram
- Medical and Information Technology - MIT, University Hospital Salzburg (SALK), Salzburg, Austria
| | - Florian B. Lagler
- Medical and Information Technology - MIT, University Hospital Salzburg (SALK), Salzburg, Austria
- Department of Pediatrics and Institute for Inherited Metabolic Diseases, Paracelsus Medical University, Salzburg, Austria
| | - Elmar Aigner
- Department of Internal Medicine, Paracelsus Medical University, Salzburg, Austria
- Medical and Information Technology - MIT, University Hospital Salzburg (SALK), Salzburg, Austria
| | - Alistair G. Martin
- Science Department, Symptoma GmbH, Vienna, Austria
- *Correspondence: Alistair G. Martin
|
15
|
Grabar N, Grouin C. Year 2021: COVID-19, Information Extraction and BERTization among the Hottest Topics in Medical Natural Language Processing. Yearb Med Inform 2022; 31:254-260. [PMID: 36463883 PMCID: PMC9719758 DOI: 10.1055/s-0042-1742547] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/05/2022] Open
Abstract
OBJECTIVES Analyze the content of publications within the medical natural language processing (NLP) domain in 2021. METHODS Automatic and manual preselection of publications to be reviewed, and selection of the best NLP papers of the year. Analysis of the important issues. RESULTS Four best papers have been selected in 2021. We also propose an analysis of the content of the NLP publications in 2021, all topics included. CONCLUSIONS The main issues addressed in 2021 are related to the investigation of COVID-related questions and to the further adaptation and use of transformer models. Besides, the trends from the past years continue, such as information extraction and use of information from social networks.
Affiliation(s)
- Natalia Grabar
- STL, CNRS, Université de Lille, Domaine du Pont-de-bois, Villeneuve-d'Ascq cedex, France
| | - Cyril Grouin
- Université Paris Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique, Orsay, France
|
16
|
Almeida T, Antunes R, Silva JF, Almeida JR, Matos S. Chemical identification and indexing in PubMed full-text articles using deep learning and heuristics. Database (Oxford) 2022; 2022:6625810. [PMID: 35776534 PMCID: PMC9248917 DOI: 10.1093/database/baac047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2022] [Revised: 05/13/2022] [Accepted: 06/06/2022] [Indexed: 11/14/2022]
Abstract
The identification of chemicals in articles has attracted considerable interest in the biomedical scientific community, given its importance in drug development research. Most previous research has focused on PubMed abstracts, and further investigation using full-text documents is required because these contain additional valuable information that must be explored. The manual expert task of indexing Medical Subject Headings (MeSH) terms to these articles later helps researchers find the most relevant publications for their ongoing work. The BioCreative VII NLM-Chem track fostered the development of systems for chemical identification and indexing in PubMed full-text articles. Chemical identification consisted of identifying the chemical mentions and linking these to unique MeSH identifiers. This manuscript describes our participation system and the post-challenge improvements we made. We propose a three-stage pipeline that individually performs chemical mention detection, entity normalization and indexing. Regarding chemical identification, we adopted a deep-learning solution that utilizes the PubMedBERT contextualized embeddings followed by a multilayer perceptron and a conditional random field tagging layer. For the normalization approach, we use a sieve-based dictionary filtering followed by a deep-learning similarity search strategy. Finally, for the indexing we developed rules for identifying the more relevant MeSH codes for each article. During the challenge, our system obtained the best official results in the normalization and indexing tasks despite the lower performance in the chemical mention recognition task. In a post-contest phase we boosted our results by improving our named entity recognition model with additional techniques. The final system achieved 0.8731, 0.8275 and 0.4849 in the chemical identification, normalization and indexing tasks, respectively. The code to reproduce our experiments and run the pipeline is publicly available.
Database URL
https://github.com/bioinformatics-ua/biocreativeVII_track2
Affiliation(s)
- Tiago Almeida
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
| | - Rui Antunes
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
| | - João F. Silva
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
| | - João R Almeida
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
- Department of Information and Communications Technologies, University of A Coruña, A Coruña, Spain
| | - Sérgio Matos
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
|
17
|
Fast medical concept normalization for biomedical literature based on stack and index optimized self-attention. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07228-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
|
18
|
Lybarger K, Damani A, Gunn M, Uzuner OZ, Yetisgen M. Extracting Radiological Findings With Normalized Anatomical Information Using a Span-Based BERT Relation Extraction Model. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2022; 2022:339-348. [PMID: 35854739 PMCID: PMC9285141] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/27/2023]
Abstract
Medical imaging is critical to the diagnosis and treatment of numerous medical problems, including many forms of cancer. Medical imaging reports distill the findings and observations of radiologists, creating an unstructured textual representation of unstructured medical images. Large-scale use of this text-encoded information requires converting the unstructured text to a structured, semantic representation. We explore the extraction and normalization of anatomical information in radiology reports that is associated with radiological findings. We investigate this extraction and normalization task using a span-based relation extraction model that jointly extracts entities and relations using BERT. This work examines the factors that influence extraction and normalization performance, including the body part/organ system, frequency of occurrence, span length, and span diversity. It discusses approaches for improving performance and creating high-quality semantic representations of radiological phenomena.
|
19
|
Vashishth S, Newman-Griffis D, Joshi R, Dutt R, Rosé CP. Improving broad-coverage medical entity linking with semantic type prediction and large-scale datasets. J Biomed Inform 2021; 121:103880. [PMID: 34390853 PMCID: PMC8952339 DOI: 10.1016/j.jbi.2021.103880] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 01/30/2021] [Revised: 07/31/2021] [Accepted: 07/31/2021] [Indexed: 10/28/2022]
Abstract
OBJECTIVES Biomedical natural language processing tools are increasingly being applied for broad-coverage information extraction, that is, extracting medical information of all types in a scientific document or a clinical note. In such broad-coverage settings, linking mentions of medical concepts to standardized vocabularies requires choosing the best candidate concepts from large inventories covering dozens of types. This study presents a novel semantic type prediction module for biomedical NLP pipelines and two automatically-constructed, large-scale datasets with broad coverage of semantic types. METHODS We experiment with five off-the-shelf biomedical NLP toolkits on four benchmark datasets for medical information extraction from scientific literature and clinical notes. All toolkits adopt a staged approach of mention detection followed by two stages of medical entity linking: (1) generating a list of candidate concepts, and (2) picking the best concept among them. We introduce a semantic type prediction module to alleviate the problem of overgeneration of candidate concepts by filtering out irrelevant candidate concepts based on the predicted semantic type of a mention. We present MedType, a fully modular semantic type prediction model which we integrate into the existing NLP toolkits. To address the dearth of broad-coverage training data for medical information extraction, we further present WikiMed and PubMedDS, two large-scale datasets for medical entity linking. RESULTS Semantic type filtering improves medical entity linking performance across all toolkits and datasets, often by several percentage points of F-1. Further, pretraining MedType on our novel datasets achieves state-of-the-art performance for semantic type prediction in biomedical text. CONCLUSIONS Semantic type prediction is a key part of building accurate NLP pipelines for broad-coverage information extraction from biomedical text.
We make our source code and novel datasets publicly available to foster reproducible research.
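The filtering step described in this abstract (discarding candidate concepts whose semantic type does not match the type predicted for the mention, then picking the best remaining candidate) can be illustrated with a small sketch. This is not MedType's code or API: the `Candidate` class, the function names, the placeholder concept identifiers, the scores, and the fall-back to the unfiltered list when nothing matches are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    concept_id: str     # placeholder identifier, not a real vocabulary code
    name: str
    semantic_type: str
    score: float        # ranking score from a hypothetical candidate generator


def filter_by_type(candidates, predicted_types):
    """Keep only candidates whose semantic type was predicted for the mention."""
    kept = [c for c in candidates if c.semantic_type in predicted_types]
    # Assumption: if the filter removes everything, fall back to all candidates.
    return kept or list(candidates)


def link(candidates, predicted_types):
    """Pick the highest-scoring candidate after semantic type filtering."""
    pool = filter_by_type(candidates, predicted_types)
    return max(pool, key=lambda c: c.score) if pool else None


# Toy example for the ambiguous mention "cold".
candidates = [
    Candidate("C1", "Cold temperature", "Natural Phenomenon", 0.55),
    Candidate("C2", "Common cold", "Disease", 0.50),
]

print(link(candidates, {"Disease"}).name)  # Common cold
```

Without the type filter, the higher-scoring but irrelevant candidate would be chosen in this toy example, which is the overgeneration problem the abstract refers to.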
Affiliation(s)
| | | | - Rishabh Joshi
- Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA, USA
| | - Ritam Dutt
- Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA, USA
| | - Carolyn P Rosé
- Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA, USA
|
20
|
De Silva K, Mathews N, Teede H, Forbes A, Jönsson D, Demmer RT, Enticott J. Clinical notes as prognostic markers of mortality associated with diabetes mellitus following critical care: A retrospective cohort analysis using machine learning and unstructured big data. Comput Biol Med 2021; 132:104305. [PMID: 33705995 DOI: 10.1016/j.compbiomed.2021.104305] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 12/24/2020] [Revised: 02/23/2021] [Accepted: 02/27/2021] [Indexed: 12/14/2022]
Abstract
BACKGROUND Clinical notes are ubiquitous resources offering potential value in optimizing critical care via data mining technologies. OBJECTIVE To determine the predictive value of clinical notes as prognostic markers of 1-year all-cause mortality among people with diabetes following critical care. MATERIALS AND METHODS Mortality of diabetes patients was predicted using three cohorts of clinical text in a critical care database, written by physicians (n = 45253), nurses (n = 159027), and both (n = 204280). Natural language processing was used to pre-process text documents, and LASSO-regularized logistic regression models were trained and tested. Confusion matrix metrics of each model were calculated and AUROC estimates between models were compared. All predictive words and corresponding coefficients were extracted. The outcome probability associated with each text document was estimated. RESULTS Models built on clinical text of physicians, nurses, and the combined cohort predicted mortality with AUROC of 0.996, 0.893, and 0.922, respectively. Predictive performance of the models significantly differed from one another, whereas inter-rater reliability ranged from substantial to almost perfect across them. The numbers of predictive words with non-zero coefficients were 3994, 8159, and 10579, respectively, in the models of physicians, nurses, and the combined cohort. Physicians' and nursing notes, both individually and when combined, strongly predicted 1-year all-cause mortality among people with diabetes following critical care. CONCLUSION Clinical notes of physicians and nurses are strong and novel prognostic markers of diabetes-associated mortality in critical care, offering potentially generalizable and scalable applications. Clinical text-derived personalized risk estimates of prognostic outcomes such as mortality could be used to optimize patient care.
Affiliation(s)
- Kushan De Silva
- Monash Centre for Health Research and Implementation, School of Public Health and Preventive Medicine, Faculty of Medicine, Nursing, and Health Sciences, Monash University, Clayton, 3168, Australia.
| | - Noel Mathews
- Monash Centre for Health Research and Implementation, School of Public Health and Preventive Medicine, Faculty of Medicine, Nursing, and Health Sciences, Monash University, Clayton, 3168, Australia
| | - Helena Teede
- Monash Centre for Health Research and Implementation, School of Public Health and Preventive Medicine, Faculty of Medicine, Nursing, and Health Sciences, Monash University, Clayton, 3168, Australia
| | - Andrew Forbes
- Biostatistics Unit, Division of Research Methodology, School of Public Health and Preventive Medicine, Faculty of Medicine, Nursing, and Health Sciences, Monash University, Melbourne, 3004, Australia
| | - Daniel Jönsson
- Department of Periodontology, Faculty of Odontology, Malmö University, Malmö, 21119, Sweden; Swedish Dental Service of Skane, Lund, 22647, Sweden
| | - Ryan T Demmer
- Division of Epidemiology and Community Health, School of Public Health, University of Minnesota, Minneapolis, MN, USA; Mailman School of Public Health, Columbia University, New York, USA
| | - Joanne Enticott
- Monash Centre for Health Research and Implementation, School of Public Health and Preventive Medicine, Faculty of Medicine, Nursing, and Health Sciences, Monash University, Clayton, 3168, Australia
|
21
|
Kate RJ. Clinical Term Normalization Using Learned Edit Patterns and Subconcept Matching: System Development and Evaluation. JMIR Med Inform 2021; 9:e23104. [PMID: 33443483 PMCID: PMC7843202 DOI: 10.2196/23104] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 08/01/2020] [Revised: 10/31/2020] [Accepted: 11/18/2020] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Clinical terms mentioned in clinical text are often not in their standardized forms as listed in clinical terminologies because of linguistic and stylistic variations. However, many automated downstream applications require clinical terms mapped to their corresponding concepts in clinical terminologies, thus necessitating the task of clinical term normalization. OBJECTIVE In this paper, a system for clinical term normalization is presented that utilizes edit patterns to convert clinical terms into their normalized forms. METHODS The edit patterns are automatically learned from the Unified Medical Language System (UMLS) Metathesaurus as well as from the given training data. The edit patterns are generalized sequences of edits that are derived from edit distance computations. The edit patterns are both character based and word based and are learned separately for different semantic types. In addition to these edit patterns, the system also normalizes clinical terms through the subconcepts mentioned within them. RESULTS The system was evaluated as part of the 2019 n2c2 Track 3 shared task of clinical term normalization. It obtained 80.79% accuracy on the standard test data. This paper includes ablation studies to evaluate the contributions of different components of the system. A challenging part of the task was disambiguation when a clinical term could be normalized to multiple concepts. CONCLUSIONS The learned edit patterns led the system to perform well on the normalization task. Given that the system is based on patterns, it is human interpretable and is also capable of giving insights about common variations of clinical terms mentioned in clinical text that are different from their standardized forms.
Affiliation(s)
- Rohit J Kate
- Department of Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI, United States
|
22
|
Humphreys BL, Del Fiol G, Xu H. The UMLS knowledge sources at 30: indispensable to current research and applications in biomedical informatics. J Am Med Inform Assoc 2020; 27:1499-1501. [PMID: 33059366 DOI: 10.1093/jamia/ocaa208] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 08/07/2020] [Indexed: 01/22/2023] Open
Affiliation(s)
| | - Guilherme Del Fiol
- Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA
| | - Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
|