1
Shen Y, Wang J, Wang Z, Shi Z, Chen H, Wang Z, Jiang Y, Wang X, Cheng C, Wang X, Zhu H, Ye J. CATI: A medical context-enhanced framework for diagnosis code assignment in the UK Biobank study. Artif Intell Med 2025; 166:103136. [PMID: 40344999 DOI: 10.1016/j.artmed.2025.103136] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2024] [Revised: 03/10/2025] [Accepted: 04/15/2025] [Indexed: 05/11/2025]
Abstract
Diagnosis codes are a standardized format for recording diseases or medical conditions. This study aims to assign diagnosis codes to patients in large-scale biobanks, particularly addressing the problem of missing codes for some patients, which is crucial for downstream disease-related tasks. While recent methods primarily rely on structured biobank data for code assignment, they often overlook the valuable medical context provided by textual information in the biobanks and by the hierarchical structure of the disease coding system. To address this gap, we have developed CATI, a medical context-enhanced framework for diagnosis Code Assignment by integrating Textual details derived from key features and disease hIerarchy. The study is based on UK Biobank data and considers Phecodes and ICD-10 codes as standard disease formats. We start by representing ten informative codified features using their formal names and then integrate them into CATI as text embeddings, obtained through prompt tuning of the pre-trained language model BioBERT. Recognizing the hierarchical structure of diagnosis codes, we have developed a novel convolution layer that propagates logits between adjacent diagnosis codes. Evaluation results demonstrate that CATI outperforms existing state-of-the-art methods on both Phecodes and ICD-10 codes, with at least a 5.16% improvement in average AUROC for unseen disease codes and an 8.68% rise in average AUPRC for disease codes whose number of training instances falls in (1000, 10000]. This framework contributes to the formation of well-defined cohorts for downstream studies and offers a unique perspective for addressing complex healthcare tasks by incorporating vital medical context.
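The abstract does not specify how the convolution layer propagates logits, so the following is only a rough illustration of the general idea of sharing scores between adjacent codes in a hierarchy. It is not the CATI layer: the function name `propagate_logits`, the blending weight `alpha`, and the neighbour-mean rule are all assumptions made for this sketch.

```python
def propagate_logits(logits, parent_of, alpha=0.5):
    """Blend each code's logit with the mean logit of its hierarchy neighbours.

    logits:    dict mapping a diagnosis code to its raw score
    parent_of: dict mapping a child code to its parent code
    alpha:     hypothetical blending weight (0 keeps the original logits)
    """
    # Build an undirected adjacency list from the child -> parent links.
    neighbours = {code: set() for code in logits}
    for child, parent in parent_of.items():
        if child in logits and parent in logits:
            neighbours[child].add(parent)
            neighbours[parent].add(child)

    smoothed = {}
    for code, value in logits.items():
        adjacent = neighbours[code]
        if adjacent:
            neighbour_mean = sum(logits[n] for n in adjacent) / len(adjacent)
            smoothed[code] = (1 - alpha) * value + alpha * neighbour_mean
        else:
            smoothed[code] = value  # isolated code: nothing to propagate
    return smoothed


# Toy hierarchy: one parent code with two child codes (illustrative values).
example = propagate_logits(
    {"I20": 2.0, "I20.0": 0.0, "I20.1": 4.0},
    {"I20.0": "I20", "I20.1": "I20"},
)
```

In this toy case a child with a low score is pulled toward its parent's score, which is one simple way a rarely observed code could borrow evidence from related codes.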
Affiliation(s)
- Yue Shen
  - MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, Hefei, Anhui, 230027, China
- Jie Wang
  - MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, Hefei, Anhui, 230027, China.
- Zhe Wang
  - MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, Hefei, Anhui, 230027, China
- Zhihao Shi
  - MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, Hefei, Anhui, 230027, China
- Hanzhu Chen
  - MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, Hefei, Anhui, 230027, China
- Zheng Wang
  - Alibaba Cloud Computing, Hangzhou, Zhejiang, 310030, China
- Yukang Jiang
  - Department of Radiology, University of North Carolina at Chapel Hill, NC 27599, USA
- Xiaopu Wang
  - School of Management, University of Science and Technology of China, Hefei, Anhui, 230027, China
- Chuandong Cheng
  - Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, 230027, China
- Xueqin Wang
  - School of Management, University of Science and Technology of China, Hefei, Anhui, 230027, China
- Hongtu Zhu
  - Biomedical Research Imaging Center, Department of Biostatistics, University of North Carolina at Chapel Hill, NC 27599, USA.
- Jieping Ye
  - Alibaba Cloud Computing, Hangzhou, Zhejiang, 310030, China.
2
Gupta S, Sharma S, Sharma R, Chandra J. Healing with hierarchy: Hierarchical attention empowered graph neural networks for predictive analysis in medical data. Artif Intell Med 2025; 165:103134. [PMID: 40286587 DOI: 10.1016/j.artmed.2025.103134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2024] [Revised: 04/11/2025] [Accepted: 04/12/2025] [Indexed: 04/29/2025]
Abstract
In healthcare, predictive analysis using unstructured medical data is crucial for gaining insights into patient conditions and outcomes. However, unstructured data, which contains valuable patient information such as symptoms and medical histories, often presents challenges, including lengthy text sequences and incomplete data. To address these issues, we introduce a new framework named Hierarchical Attention-based Integrated Learning (HAIL), designed to predict in-hospital mortality and the duration of stay in the intensive care unit. HAIL combines hierarchical attention mechanisms with graph neural networks to effectively manage missing data and enhance outcome predictions. Our model iteratively refines embeddings, resulting in a more thorough analysis of electronic health record data. Experimental findings demonstrate a notable performance improvement of 2%-3% across various metrics when compared to existing benchmarks on standard datasets, highlighting HAIL's effectiveness in time-sensitive clinical decision-making. Additionally, our analysis underscores the significance of patient networks in maintaining the robustness and consistent performance of the HAIL framework.
Affiliation(s)
- Shivani Gupta
  - Indian Institute of Technology Patna, Department of Computer Science and Engineering, Patna, 801103, Bihar, India.
- Saurabh Sharma
  - Indian Institute of Technology Patna, Department of Computer Science and Engineering, Patna, 801103, Bihar, India.
- Rajesh Sharma
  - University of Tartu, Institute of Computer Science, Ülikooli 18-133, Tartu, 50090, Estonia.
- Joydeep Chandra
  - Indian Institute of Technology Patna, Department of Computer Science and Engineering, Patna, 801103, Bihar, India.
3
Fathy W, Emeriaud G, Cheriet F. A comprehensive review of ICU readmission prediction models: From statistical methods to deep learning approaches. Artif Intell Med 2025; 165:103126. [PMID: 40300338 DOI: 10.1016/j.artmed.2025.103126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Revised: 10/04/2024] [Accepted: 03/29/2025] [Indexed: 05/01/2025]
Abstract
The prediction of Intensive Care Unit (ICU) readmission has become a crucial area of research due to the increasing demand for ICU resources and the need to provide timely interventions to critically ill patients. In recent years, several studies have explored the use of statistical, machine learning (ML), and deep learning (DL) models to predict ICU readmission. This review paper presents an extensive overview of these studies and discusses the challenges associated with ICU readmission prediction. We categorize the studies based on the type of model used and evaluate their strengths and limitations. We also discuss the performance metrics used to evaluate the models and their potential clinical applications. In addition, this review explores current methodologies, data usage, and recent advances in interpretability and explainable AI for medical applications, offering insights to guide future research and development in this field. Finally, we identify gaps in the current literature and provide recommendations for future research. Recent advances like ML and DL have moderately improved the prediction of the risk of ICU readmission. However, more progress is needed to reach the precision required to build computerized decision support tools.
Affiliation(s)
- Waleed Fathy
  - Department of Computer and Software Engineering, Polytechnique Montréal, Montreal, Quebec, Canada; Department of Electronic and Communication Engineering, Zagazig University, Zagazig, Sharkia, Egypt.
- Guillaume Emeriaud
  - Department of Pediatrics, CHU Sainte-Justine, Université de Montréal, Montreal, Quebec, Canada.
- Farida Cheriet
  - Department of Computer and Software Engineering, Polytechnique Montréal, Montreal, Quebec, Canada.
4
Ye X, Shi T, Huang D, Sakurai T. Multi-Omics clustering by integrating clinical features from large language model. Methods 2025; 239:64-71. [PMID: 40180255 DOI: 10.1016/j.ymeth.2025.03.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2025] [Revised: 03/16/2025] [Accepted: 03/26/2025] [Indexed: 04/05/2025] Open
Abstract
Multi-omics clustering has emerged as a powerful approach for understanding complex biological systems and enabling cancer subtyping by integrating diverse omics data. Existing methods primarily focus on the integration of different types of omics data, often overlooking the value of clinical context. In this study, we propose a novel framework that incorporates clinical features extracted by a large language model (LLM) to enhance multi-omics clustering. Leveraging clinical data extracted from pathology reports using a BERT-based model, our framework converts unstructured medical text into structured clinical features. These features are integrated with omics data through an autoencoder, enriching the information content of each omics layer to improve feature extraction. The extracted features are then projected into a latent subspace using singular value decomposition (SVD), followed by spectral clustering to obtain the final clustering result. We evaluate the proposed framework on six cancer datasets across three omics levels, comparing it with several state-of-the-art methods. The experimental results demonstrate that the proposed framework outperforms existing methods in multi-omics clustering for cancer subtyping. Moreover, the results highlight the efficacy of integrating clinical features derived from an LLM, which significantly enhances clustering performance. This work underscores the importance of clinical context in multi-omics analysis and showcases the transformative potential of LLMs in advancing precision medicine.
Affiliation(s)
- Xiucai Ye
  - Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan.
- Tianyi Shi
  - Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
- Dong Huang
  - Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan.
- Tetsuya Sakurai
  - Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
5
Hu Y, Chen Y, Xu Y. A shape composition method for named entity recognition. Neural Netw 2025; 187:107389. [PMID: 40117979 DOI: 10.1016/j.neunet.2025.107389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2024] [Revised: 12/16/2024] [Accepted: 03/10/2025] [Indexed: 03/23/2025]
Abstract
Large language models (LLMs) roughly encode a sentence into a dense representation (a vector), which mixes up the semantic expression of all named entities within the sentence. The decoding process is therefore easily overwhelmed by sentence-specific information learned during pre-training. This results in serious performance degradation in recognizing named entities, especially those annotated with nested structures. In contrast to LLMs, which condense a sentence into a single vector, our model adopts a discriminative language model to map each sentence into a high-order semantic space. In this space, named entities are decomposed into an entity body and an entity edge. This decomposition is effective for decoding the complex semantic structures of named entities. In this paper, a shape composition method is proposed for recognizing named entities. This approach leverages a multi-objective learning neural architecture to simultaneously detect entity bodies and classify entity edges. During training, the dual objectives for body and edge learning guide the deep network to encode more task-relevant semantic information. Our method is evaluated on eight widely used public datasets and demonstrates competitive performance. Analytical experiments show that the strategy of letting semantic expressions take their course aligns with the entity recognition task. This approach yields finer-grained semantic representations, which enhance not only NER but also other NLP tasks.
Affiliation(s)
- Ying Hu
  - Text Computing and Cognitive Intelligence Engineering Research Center of National Education Ministry, State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, 550025, China.
- Yanping Chen
  - Text Computing and Cognitive Intelligence Engineering Research Center of National Education Ministry, State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, 550025, China.
- Yong Xu
  - Bio-Computing Research Center, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, China.
6
Khan N, Mufti MR, Arif M, Ali A, Shah Z. KEM-IoMT: Knowledge graph embedding-enhanced accurate medical service recommendation against diabetes. Comput Biol Med 2025; 194:110463. [PMID: 40516453 DOI: 10.1016/j.compbiomed.2025.110463] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/25/2024] [Revised: 03/16/2025] [Accepted: 05/25/2025] [Indexed: 06/16/2025]
Abstract
The Internet of Medical Things (IoMT)-enhanced Recommender System (RS) has advanced rapidly in integrating diverse medical data into intelligent systems to generate personalized medical services. However, due to the heterogeneous and complex nature of diabetes data, generating accurate and context-sensitive service recommendations remains challenging. Additionally, existing RSs do not extend their knowledge bases by incorporating user reviews and current updates on the given disease alongside the medical data. Thus, this paper introduces Knowledge graph Embedding-enhanced accurate Medical service recommendation (KEM) in the IoMT, aiming to enhance the precision of RSs for diabetes care. KEM collects user reviews and online data about the disease, preprocesses the collected data, and transforms it into a Knowledge Graph (KG). The model embeds the KG and encapsulates the embedding representations into independent latent factors through a Graph Neural Network. Moreover, KEM employs Deep Matrix Factorization to compute the latent factors and obtain the relations required for recommendation. Extensive experiments on real-world data demonstrate the effectiveness of the KEM model in enhancing performance compared to baseline methods.
Affiliation(s)
- Nasrullah Khan
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar.
- Muhammad Rafiq Mufti
  - Department of Computer Science, COMSATS University Islamabad, Vehari Campus, Vehari 61100, Pakistan.
- Muhammad Arif
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar.
- Amjad Ali
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar.
- Zubair Shah
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar.
7
Kohli S, Agarwal P, Ho Wing Chan A, Erekat A, Nadkarni G, Kummer B. Machine learning to predict penumbra core mismatch in acute ischemic stroke using clinical note data. NPJ Digit Med 2025; 8:340. [PMID: 40481318 PMCID: PMC12144192 DOI: 10.1038/s41746-025-01703-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2024] [Accepted: 05/03/2025] [Indexed: 06/11/2025] Open
Abstract
In acute ischemic stroke due to large-vessel occlusion (AIS-LVO), late-window endovascular thrombectomy (EVT) decisions depend on penumbra-to-core (P:C) mismatch from computed tomographic perfusion (CTP). We developed multiple machine learning (ML) models to predict P:C ratios from a retrospectively-identified cohort of AIS-LVO patients who underwent CTP within 30 min of initial neuroimaging, using non-imaging electronic health record (EHR) data available prior to CTP evaluation. We extracted structured data and free-text clinical notes from the EHR, generating document embeddings as sums of BioWordVec vectors weighted by term-frequency-inverse-document-frequency scores. We identified 120 patients; an extreme-gradient-boosting model classified P:C ratios as ≥ or <1.8, achieving an AUROC of 0.80 (95% CI 0.57-0.92) with optimal performance using text limited to 500 characters. Sensitivity was 0.80, specificity 0.66, and F1 score 0.86. Our findings suggest that ML models leveraging real-world non-imaging data can potentially aid LVO-AIS triage, though further validation is needed.
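The document-embedding step described in this abstract (a sum of word vectors weighted by TF-IDF scores) can be sketched as below. This is a minimal illustration under stated assumptions: the toy `word_vectors` dictionary stands in for BioWordVec, tokenisation is plain whitespace splitting, and the smoothed IDF formula is one common variant, since the abstract does not give the exact weighting used.

```python
import math
from collections import Counter


def tfidf_weighted_embeddings(docs, word_vectors):
    """Return one vector per document: the TF-IDF weighted sum of its word vectors."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    dim = len(next(iter(word_vectors.values())))

    embeddings = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        vector = [0.0] * dim
        for term, count in term_counts.items():
            if term not in word_vectors:
                continue  # out-of-vocabulary terms contribute nothing
            # Smoothed IDF (an assumed variant, not taken from the paper).
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            weight = (count / len(tokens)) * idf
            for i, component in enumerate(word_vectors[term]):
                vector[i] += weight * component
        embeddings.append(vector)
    return embeddings


# Hypothetical 2-dimensional vectors, for illustration only.
toy_vectors = {"stroke": [1.0, 0.0], "onset": [0.0, 1.0], "headache": [1.0, 1.0]}
notes = ["stroke stroke onset", "onset headache"]
note_embeddings = tfidf_weighted_embeddings(notes, toy_vectors)
```

Terms that appear in every note (here "onset") receive the lowest IDF, so rarer, more distinctive terms dominate each note's embedding.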
Affiliation(s)
- Shaun Kohli
  - Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Parul Agarwal
  - Department of Neurology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
  - Institute for Health Care Delivery Science, Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Andy Ho Wing Chan
  - Department of Neurology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Asala Erekat
  - Department of Neurology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
  - Clinical Neuro-informatics Center, Department of Neurology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Girish Nadkarni
  - Windreich Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
  - Division of Data and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Benjamin Kummer
  - Department of Neurology, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
  - Clinical Neuro-informatics Center, Department of Neurology, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
  - Windreich Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
8
Withers CA, Rufai AM, Venkatesan A, Tirunagari S, Lobentanzer S, Harrison M, Zdrazil B. Natural language processing in drug discovery: bridging the gap between text and therapeutics with artificial intelligence. Expert Opin Drug Discov 2025; 20:765-783. [PMID: 40298230 DOI: 10.1080/17460441.2025.2490835] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2024] [Revised: 03/07/2025] [Accepted: 04/04/2025] [Indexed: 04/30/2025]
Abstract
INTRODUCTION The field of Natural Language Processing (NLP) within the life sciences has exploded in its capacity to aid the extraction and analysis of data from scientific texts in recent years through the advancement of Artificial Intelligence (AI). Drug discovery pipelines have been innovated and accelerated by the uptake of AI/Machine Learning (ML) techniques. AREAS COVERED The authors provide background on Named Entity Recognition (NER) in text - from tagging terms in text using ontologies to entity identification via ML models. They also explore the use of Knowledge Graphs (KGs) in biological data ingestion, manipulation, and extraction, leading into the modern age of Large Language Models (LLMs) and their ability to maneuver complex and abundant data. The authors also cover the main strengths and weaknesses of the many methods available when undertaking NLP tasks in drug discovery. Literature was derived from searches utilizing Europe PMC, ResearchRabbit and SciSpace. EXPERT OPINION The mass of scientific data that is now produced each year is both a huge positive for potential innovation in drug discovery and a new hurdle for researchers to overcome. Notably, methods should be selected to fit a use case and the data available, as each method performs optimally under different conditions.
Affiliation(s)
- Christine Ann Withers
  - Chemical Biology Services, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
- Amina Mardiyyah Rufai
  - Literature Services, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
- Aravind Venkatesan
  - Literature Services, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
- Santosh Tirunagari
  - Literature Services, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
- Sebastian Lobentanzer
  - Institute of Computational Biology, Helmholtz Centre, Munich, Germany
  - Faculty of Medicine and Heidelberg University Hospital, Heidelberg University, Institute for Computational Biomedicine, Heidelberg, Germany
  - Open Targets, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
- Melissa Harrison
  - Literature Services, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
- Barbara Zdrazil
  - Chemical Biology Services, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
  - Open Targets, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
9
Zhou F, Parrish R, Afzal M, Saha A, Haynes RB, Iorio A, Lokker C. Benchmarking domain-specific pretrained language models to identify the best model for methodological rigor in clinical studies. J Biomed Inform 2025; 166:104825. [PMID: 40246186 DOI: 10.1016/j.jbi.2025.104825] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2024] [Revised: 03/02/2025] [Accepted: 04/03/2025] [Indexed: 04/19/2025]
Abstract
OBJECTIVE Encoder-only transformer-based language models have shown promise in automating critical appraisal of clinical literature. However, a comprehensive evaluation of the models for classifying the methodological rigor of randomized controlled trials is necessary to identify the more robust ones. This study benchmarks several state-of-the-art transformer-based language models using a diverse set of performance metrics. METHODS Seven transformer-based language models were fine-tuned on the title and abstract of 42,575 articles from 2003 to 2023 in McMaster University's Premium LiteratUre Service database under different configurations. The studies reported in the articles addressed questions related to treatment, prevention, or quality improvement for which randomized controlled trials are the gold standard with defined criteria for rigorous methods. Models were evaluated on the validation set using 12 schemes and metrics, including optimization for cross-entropy loss, Brier score, AUROC, average precision, sensitivity, specificity, and accuracy, among others. Threshold tuning was performed to optimize threshold-dependent metrics. Models that achieved the best performance in one or more schemes on the validation set were further tested in hold-out and external datasets. RESULTS A total of 210 models were fine-tuned. Six models achieved top performance in one or more evaluation schemes. Three BioLinkBERT models outperformed others on 8 of the 12 schemes. BioBERT, BiomedBERT, and SciBERT were best on 1, 1 and 2 schemes, respectively. While model performance remained robust on the hold-out test set, it declined in external datasets. Class weight adjustments improved performance in most instances. CONCLUSION BioLinkBERT generally outperformed the other models. Using comprehensive evaluation metrics and threshold tuning optimizes model selection for real-world applications. Future work should assess generalizability to other datasets, explore alternate imbalance strategies, and examine training on full-text articles.
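Threshold tuning of the kind mentioned in this abstract can be illustrated with a small sketch. The choice of F1 as the metric, the candidate set (each distinct predicted probability), and the function names are assumptions made for illustration; the abstract does not state which threshold-dependent metrics were tuned or how.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when probabilities at or above `threshold` count as positive."""
    pairs = list(zip(y_true, y_prob))
    tp = sum(1 for t, p in pairs if t == 1 and p >= threshold)
    fp = sum(1 for t, p in pairs if t == 0 and p >= threshold)
    fn = sum(1 for t, p in pairs if t == 1 and p < threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_threshold(y_true, y_prob):
    """Pick the threshold that maximises F1 on a validation set.

    Ties are resolved in favour of the lowest candidate threshold.
    """
    candidates = sorted(set(y_prob))
    best = max(candidates, key=lambda th: f1_at_threshold(y_true, y_prob, th))
    return best, f1_at_threshold(y_true, y_prob, best)


# Toy validation labels and predicted probabilities (illustrative only).
labels = [0, 0, 1, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8, 0.9]
best_threshold, best_f1 = tune_threshold(labels, probabilities)
```

A threshold chosen this way should then be frozen and applied unchanged to hold-out or external data, otherwise the reported metric is optimistically biased.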
Affiliation(s)
- Fangwen Zhou
  - Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada
- Rick Parrish
  - Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada
- Muhammad Afzal
  - Department of Computing, Faculty of Computing, Engineering and the Built Environment, Birmingham City University, Birmingham, United Kingdom
- Ashirbani Saha
  - Department of Oncology, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada
- R Brian Haynes
  - Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada
- Alfonso Iorio
  - Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada; Department of Medicine, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada
- Cynthia Lokker
  - Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada.
10
Dias AC, Moreira VP, Comba JLD. RoBIn: A Transformer-based model for risk of bias inference with machine reading comprehension. J Biomed Inform 2025; 166:104819. [PMID: 40250743 DOI: 10.1016/j.jbi.2025.104819] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2024] [Revised: 03/12/2025] [Accepted: 03/25/2025] [Indexed: 04/20/2025]
Abstract
OBJECTIVE Scientific publications are essential for uncovering insights, testing new drugs, and informing healthcare policies. Evaluating the quality of these publications often involves assessing their Risk of Bias (RoB), a task traditionally performed by human reviewers. The goal of this work is to create a dataset and develop models that allow automated RoB assessment in clinical trials. METHODS We use data from the Cochrane Database of Systematic Reviews (CDSR) as ground truth to label open-access clinical trial publications from PubMed. This process enabled us to develop training and test datasets specifically for machine reading comprehension and RoB inference. Additionally, we created extractive (RoBInExt) and generative (RoBInGen) Transformer-based approaches to extract relevant evidence and classify the RoB effectively. RESULTS RoBIn was evaluated across various settings and benchmarked against state-of-the-art methods, including large language models (LLMs). In most cases, the best-performing RoBIn variant surpasses traditional machine learning and LLM-based approaches, achieving an AUROC of 0.83. CONCLUSION This work addresses RoB assessment in clinical trials by introducing RoBIn, which comprises two Transformer-based models for RoB inference and evidence retrieval. These models outperform traditional models and LLMs, demonstrating their potential to improve efficiency and scalability in clinical research evaluation. We also introduce a public dataset that is automatically annotated and can be used to enable future research to enhance automated RoB assessment.
Affiliation(s)
- Abel Corrêa Dias
  - Instituto de Informatica, Av. Bento Goncalves 9500 - Caixa Postal 15064, Porto Alegre, 91501-970, Rio Grande do Sul, Brazil
- Viviane Pereira Moreira
  - Instituto de Informatica, Av. Bento Goncalves 9500 - Caixa Postal 15064, Porto Alegre, 91501-970, Rio Grande do Sul, Brazil
- João Luiz Dihl Comba
  - Instituto de Informatica, Av. Bento Goncalves 9500 - Caixa Postal 15064, Porto Alegre, 91501-970, Rio Grande do Sul, Brazil.
11
Shen Y, Xu Y, Ma J, Rui W, Zhao C, Heacock L, Huang C. Multi-modal large language models in radiology: principles, applications, and potential. Abdom Radiol (NY) 2025; 50:2745-2757. [PMID: 39621074 DOI: 10.1007/s00261-024-04708-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/29/2024] [Revised: 11/13/2024] [Accepted: 11/15/2024] [Indexed: 05/13/2025]
Abstract
Large language models (LLMs) and multi-modal large language models (MLLMs) represent the cutting-edge in artificial intelligence. This review provides a comprehensive overview of their capabilities and potential impact on radiology. Unlike most existing literature reviews focusing solely on LLMs, this work examines both LLMs and MLLMs, highlighting their potential to support radiology workflows such as report generation, image interpretation, EHR summarization, differential diagnosis generation, and patient education. By streamlining these tasks, LLMs and MLLMs could reduce radiologist workload, improve diagnostic accuracy, support interdisciplinary collaboration, and ultimately enhance patient care. We also discuss key limitations, such as the limited capacity of current MLLMs to interpret 3D medical images and to integrate information from both image and text data, as well as the lack of effective evaluation methods. Ongoing efforts to address these challenges are introduced.
Affiliation(s)
- Yiqiu Shen
  - New York University Langone Medical Center, New York, USA.
- Yanqi Xu
  - New York University, New York, USA
- Chen Zhao
  - New York University Shanghai, Shanghai, China
- Laura Heacock
  - New York University Langone Medical Center, New York, USA
- Chenchan Huang
  - New York University Langone Medical Center, New York, USA
12
Dorfner FJ, Dada A, Busch F, Makowski MR, Han T, Truhn D, Kleesiek J, Sushil M, Adams LC, Bressem KK. Evaluating the effectiveness of biomedical fine-tuning for large language models on clinical tasks. J Am Med Inform Assoc 2025; 32:1015-1024. [PMID: 40190132 DOI: 10.1093/jamia/ocaf045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2024] [Accepted: 03/02/2025] [Indexed: 05/21/2025] Open
Abstract
OBJECTIVES Large language models (LLMs) have shown potential in biomedical applications, leading to efforts to fine-tune them on domain-specific data. However, the effectiveness of this approach remains unclear. This study aims to critically evaluate the performance of biomedically fine-tuned LLMs against their general-purpose counterparts across a range of clinical tasks. MATERIALS AND METHODS We evaluated the performance of biomedically fine-tuned LLMs against their general-purpose counterparts on clinical case challenges from NEJM and JAMA, and on multiple clinical tasks, such as information extraction, document summarization and clinical coding. We used a diverse set of benchmarks specifically chosen to be outside the likely fine-tuning datasets of biomedical models, ensuring a fair assessment of generalization capabilities. RESULTS Biomedical LLMs generally underperformed compared to general-purpose models, especially on tasks not focused on probing medical knowledge. While on the case challenges, larger biomedical and general-purpose models showed similar performance (eg, OpenBioLLM-70B: 66.4% vs Llama-3-70B-Instruct: 65% on JAMA), smaller biomedical models showed more pronounced underperformance (OpenBioLLM-8B: 30% vs Llama-3-8B-Instruct: 64.3% on NEJM). Similar trends appeared across CLUE benchmarks, with general-purpose models often achieving higher scores in text generation, question answering, and coding. Notably, biomedical LLMs also showed a higher tendency to hallucinate. DISCUSSION Our findings challenge the assumption that biomedical fine-tuning inherently improves LLM performance, as general-purpose models consistently performed better on unseen medical tasks. Retrieval-augmented generation may offer a more effective strategy for clinical adaptation. CONCLUSION Fine-tuning LLMs on biomedical data may not yield the anticipated benefits. Alternative approaches, such as retrieval augmentation, should be further explored for effective and reliable clinical integration of LLMs.
Affiliation(s)
- Felix J Dorfner
- Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin 10117, Germany
- Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital and Harvard Medical School, Charlestown, MA 02129, United States
| | - Amin Dada
- Institute for AI in Medicine (IKIM), University Hospital Essen (AöR), Essen 45131, Germany
| | - Felix Busch
- Department of Radiology, Klinikum Rechts Der Isar, Technical University Munich, Munich 81675, Germany
| | - Marcus R Makowski
- Department of Radiology, Klinikum Rechts Der Isar, Technical University Munich, Munich 81675, Germany
| | - Tianyu Han
- Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen 52074, Germany
| | - Daniel Truhn
- Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen 52074, Germany
| | - Jens Kleesiek
- Institute for AI in Medicine (IKIM), University Hospital Essen (AöR), Essen 45131, Germany
- Cancer Research Center Cologne Essen (CCCE), West German Cancer Center Essen, University Hospital Essen (AöR), Essen 45147, Germany
- German Cancer Consortium (DKTK, Partner Site Essen), Heidelberg, Germany
- Department of Physics, TU Dortmund, Dortmund 44227, Germany
| | - Madhumita Sushil
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA 94158, United States
| | - Lisa C Adams
- Department of Radiology, Klinikum Rechts Der Isar, Technical University Munich, Munich 81675, Germany
| | - Keno K Bressem
- Department of Radiology, Klinikum Rechts Der Isar, Technical University Munich, Munich 81675, Germany
- German Heart Center Munich, Technical University Munich, Munich 80636, Germany
| |
|
13
|
Raj S, Namdeo V, Singh P, Srivastava A. Identification and prioritization of disease candidate genes using biomedical named entity recognition and random forest classification. Comput Biol Med 2025; 192:110320. [PMID: 40349579 DOI: 10.1016/j.compbiomed.2025.110320] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2024] [Revised: 04/13/2025] [Accepted: 04/30/2025] [Indexed: 05/14/2025]
Abstract
BACKGROUND AND OBJECTIVE The elucidation of candidate genes is fundamental to comprehending intricate diseases, vital for early diagnosis, personalized treatment, and drug discovery. Traditional Disease Gene Identification methods encounter limitations, necessitating substantial sample sizes and statistical power, particularly challenging for complex diseases. Conversely, Disease Gene Prioritization methods leverage biological knowledge but rely on computational predictions, often lacking experimental validation. Addressing existing tool challenges, this study introduces an innovative two-tier machine-learning protocol that distils Disease Gene Association details from disease-specific abstracts, incorporating diverse findings. Employing advanced text mining, the model classifies disease-gene associations from the abstracts into Positive, Negative, and Ambiguous classes. METHODS Leveraging Random Forest as a robust text classification tool, this study demonstrates its efficacy in navigating complexities within biomedical texts. In the developed 2-tiered protocol, the level 1 classifier categorizes information into two classes, distinguished by the presence or absence of disease-gene associations, whereas the level 2 classifier further classifies into three classes: Positive, Negative, and Ambiguous associations. The developed classifier underwent rigorous training and cross-validation on different gold standard datasets - Alzheimer's, Breast Cancer and Type 2 Diabetes. Its performance across these varied disease contexts underscores its versatility and robustness without succumbing to overfitting. RESULTS Achieving an average accuracy of 97.29 % and 98.14 % for level 1 and level 2 classification, the protocol successfully extracted 2769, 3220 and 740 genes associated positively with Alzheimer's, Breast Cancer and Type 2 Diabetes. 
From the identified positive genes, a substantial number (1008, 670, and 165 genes, respectively) were not reported in established databases, thus expanding the genetic exploration of these diseases. These identified genes offer promising opportunities for targeted interventions, while ambiguous genes warrant further investigation to unravel deeper disease associations. CONCLUSIONS This research significantly contributes to the understanding of genetic diseases by offering a comprehensive roadmap for their intricate exploration. Beyond the study's focus on Alzheimer's, Breast Cancer, and Type 2 Diabetes, the protocol's applicability extends to diverse biomedical landscapes, demonstrating its versatility and impactful potential for comprehensive disease exploration.
Affiliation(s)
- Sushrutha Raj
- Amity Institute of Integrative Sciences and Health, Amity University Haryana, Amity Education Valley, Gurgaon, 122413, India
| | - Vindhya Namdeo
- Sri Innovation and Research Foundation, Ghaziabad, Uttar Pradesh, 201009, India
| | - Payal Singh
- Sri Innovation and Research Foundation, Ghaziabad, Uttar Pradesh, 201009, India
| | - Alok Srivastava
- Sri Innovation and Research Foundation, Ghaziabad, Uttar Pradesh, 201009, India; L V Prasad Eye Institute, Hyderabad, Telangana, 500034, India.
| |
|
14
|
Wang Y, Cao P, Fang H, Ye Y. Span-aware pre-trained network with deep information bottleneck for scientific entity relation extraction. Neural Netw 2025; 186:107250. [PMID: 39955959 DOI: 10.1016/j.neunet.2025.107250] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2024] [Revised: 12/19/2024] [Accepted: 02/02/2025] [Indexed: 02/18/2025]
Abstract
Scientific entity relation extraction intends to promote the performance of each subtask through exploring the contextual representations with rich scientific semantics. However, most existing models encounter the dilemma of scientific semantic dilution, where task-irrelevant information entangles with task-relevant information, making science-friendly representation learning challenging. In addition, existing models isolate task-relevant information among subtasks, undermining the coherence of scientific semantics and consequently impairing the performance of each subtask. To deal with these challenges, a novel and effective Span-aware Pre-trained network with deep Information Bottleneck (SpIB) is proposed, which aims to conduct scientific entity and relation extraction by minimizing task-irrelevant information while maximizing the relatedness of task-relevant information. Specifically, the SpIB model includes a minimum span-based representation learning (SRL) module and a relatedness-oriented task-relevant representation learning (TRL) module to disentangle the task-irrelevant information and discover the relatedness hidden in task-relevant information across subtasks. Then, an information minimum-maximum strategy is designed to minimize the mutual information of span-based representations and maximize the multivariate information of task-relevant representations. Finally, we design a unified loss function to simultaneously optimize the learned span-based and task-relevant representations. Experimental results on several scientific datasets (SciERC, ADE, and BioRelEx) show the superiority of the proposed SpIB model over various state-of-the-art models. The source code is publicly available at https://github.com/SWT-AITeam/SpIB.
Affiliation(s)
- Youwei Wang
- School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou, 450001, China.
| | - Peisong Cao
- School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou, 450001, China.
| | - Haichuan Fang
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, 450001, China.
| | - Yangdong Ye
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, 450001, China.
| |
|
15
|
Guan H, Novoa-Laurentiev J, Zhou L. CD-Tron: Leveraging large clinical language model for early detection of cognitive decline from electronic health records. J Biomed Inform 2025; 166:104830. [PMID: 40320101 DOI: 10.1016/j.jbi.2025.104830] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2024] [Revised: 03/28/2025] [Accepted: 04/13/2025] [Indexed: 05/08/2025]
Abstract
BACKGROUND Early detection of cognitive decline during the preclinical stage of Alzheimer's disease and related dementias (AD/ADRD) is crucial for timely intervention and treatment. Clinical notes in the electronic health record contain valuable information that can aid in the early identification of cognitive decline. In this study, we utilize advanced large clinical language models, fine-tuned on clinical notes, to improve the early detection of cognitive decline. METHODS We collected clinical notes from 2,166 patients spanning the 4 years preceding their initial mild cognitive impairment (MCI) diagnosis from the Enterprise Data Warehouse of Mass General Brigham. To train the model, we developed CD-Tron, built upon a large clinical language model that was fine-tuned using 4,949 expert-labeled note sections. For evaluation, the trained model was applied to 1,996 independent note sections to assess its performance on real-world unstructured clinical data. Additionally, we used explainable AI techniques, specifically SHAP values (SHapley Additive exPlanations), to interpret the model's predictions and provide insight into the most influential features. Error analysis was also performed to further examine the model's predictions. RESULTS CD-Tron significantly outperforms baseline models, achieving notable improvements in precision, recall, and AUC metrics for detecting cognitive decline (CD). Tested on many real-world clinical notes, CD-Tron demonstrated high sensitivity with only one false negative, crucial for clinical applications prioritizing early and accurate CD detection. SHAP-based interpretability analysis highlighted key textual features contributing to model predictions, supporting transparency and clinician understanding. CONCLUSION CD-Tron offers a novel approach to early cognitive decline detection by applying large clinical language models to free-text EHR data. 
Pretrained on real-world clinical notes, it accurately identifies early cognitive decline and integrates SHAP for interpretability, enhancing transparency in predictions.
Affiliation(s)
- Hao Guan
- Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA 02115, USA.
| | - John Novoa-Laurentiev
- Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA 02115, USA
| | - Li Zhou
- Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA 02115, USA
| |
|
16
|
Asim MN, Asif T, Hassan F, Dengel A. Protein Sequence Analysis landscape: A Systematic Review of Task Types, Databases, Datasets, Word Embeddings Methods, and Language Models. Database (Oxford) 2025; 2025:baaf027. [PMID: 40448683 DOI: 10.1093/database/baaf027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2024] [Revised: 02/06/2025] [Accepted: 03/26/2025] [Indexed: 06/02/2025]
Abstract
Protein sequence analysis examines the order of amino acids within protein sequences to unlock a wealth of diverse knowledge about biological processes and genetic disorders. It helps in forecasting disease susceptibility by finding unique protein signatures, or biomarkers, that are linked to particular disease states. Protein sequence analysis through wet-lab experiments is expensive, time-consuming, and error-prone. To facilitate large-scale proteomics sequence analysis, the biological community is striving to use AI to transition from wet-lab to computer-aided applications. However, proteomics and AI are two distinct fields, and the development of AI-driven protein sequence analysis applications requires knowledge of both domains. To bridge the gap between the two fields, various review articles have been written. However, these articles focus on a few individual tasks or specific applications rather than providing a comprehensive overview of a wide range of tasks and applications. To meet the need for a comprehensive review that presents a holistic view of a wide array of tasks and applications, this manuscript makes several contributions: It bridges the gap between the proteomics and AI fields by presenting a comprehensive array of AI-driven applications for 63 distinct protein sequence analysis tasks. It equips AI researchers with the biological foundations of 63 protein sequence analysis tasks. It enhances the development of AI-driven protein sequence analysis applications by providing comprehensive details of 68 protein databases. It presents a rich data landscape, encompassing 627 benchmark datasets for 63 diverse protein sequence analysis tasks. It highlights the utilization of 25 unique word embedding methods and 13 language models in AI-driven protein sequence analysis applications. 
It accelerates the development of AI-driven applications by reporting current state-of-the-art performance across 63 protein sequence analysis tasks.
Affiliation(s)
- Muhammad Nabeel Asim
- German Research Center for Artificial Intelligence, Kaiserslautern 67663, Germany
- Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
| | - Tayyaba Asif
- Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern 67663, Germany
| | - Faiza Hassan
- Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern 67663, Germany
| | - Andreas Dengel
- German Research Center for Artificial Intelligence, Kaiserslautern 67663, Germany
- Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern 67663, Germany
- Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
| |
|
17
|
Potu ST, Niranjan Murthy R, Thomas A, Mishra L, Prange N, Durmaz AR. Ontology-conformal recognition of materials entities using language models. Sci Rep 2025; 15:18597. [PMID: 40425727 PMCID: PMC12116928 DOI: 10.1038/s41598-025-03619-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2024] [Accepted: 05/21/2025] [Indexed: 05/29/2025] Open
Abstract
Extracting structured and semantically annotated materials information from unstructured scientific literature is a crucial step toward constructing machine-interpretable knowledge graphs and accelerating data-driven materials research. This is especially important in materials science, which is adversely affected by data scarcity. Data scarcity further motivates employing solutions such as foundation language models for extracting information which can in principle address several subtasks of the information extraction problem in a range of domains without the need of generating costly large-scale annotated datasets for each downstream task. However, foundation language models struggle with tasks like Named Entity Recognition (NER) due to domain-specific terminologies, fine-grained entities, and semantic ambiguity. The issue is even more pronounced when entities must map directly to pre-existing domain ontologies. This work aims to assess whether foundation large language models (LLMs) can successfully perform ontology-conformal NER in the materials mechanics and fatigue domain. Specifically, we present a comparative evaluation of in-context learning (ICL) with foundation models such as GPT-4 against fine-tuned task-specific language models, including MatSciBERT and DeBERTa. The study is performed on two materials fatigue datasets, which contain annotations at a comparatively fine-grained level adhering to the class definitions of a formal ontology to ensure semantic alignment and cross-dataset interoperability. Both datasets cover adjacent domains to assess how well both NER methodologies generalize when presented with typical domain shifts. Task-specific models are shown to significantly outperform general foundation models on an ontology-constrained NER. Our findings reveal a strong dependence on the quality of few-shot demonstrations in ICL to handle domain-shift. 
The study also highlights the significance of domain-specific pre-training by comparing task-specific models that differ primarily in their pre-training corpus.
Affiliation(s)
- Sai Teja Potu
- Group of Meso and Micromechanics, Fraunhofer Institute for Mechanics of Materials IWM, 79108, Freiburg, Germany
- University of Freiburg, 79098, Freiburg, Germany
| | - Rachana Niranjan Murthy
- Group of Meso and Micromechanics, Fraunhofer Institute for Mechanics of Materials IWM, 79108, Freiburg, Germany
- University of Freiburg, 79098, Freiburg, Germany
| | - Akhil Thomas
- Group of Meso and Micromechanics, Fraunhofer Institute for Mechanics of Materials IWM, 79108, Freiburg, Germany.
- University of Freiburg, 79098, Freiburg, Germany.
| | - Lokesh Mishra
- IBM Research Europe Zurich, Rüschlikon, 8803, Switzerland
| | | | - Ali Riza Durmaz
- Group of Meso and Micromechanics, Fraunhofer Institute for Mechanics of Materials IWM, 79108, Freiburg, Germany
| |
|
18
|
Tran M, Schmidle P, Guo RR, Wagner SJ, Koch V, Lupperger V, Novotny B, Murphree DH, Hardway HD, D'Amato M, Lefkes J, Geijs DJ, Feuchtinger A, Böhner A, Kaczmarczyk R, Biedermann T, Amir AL, Mooyaart AL, Ciompi F, Litjens G, Wang C, Comfere NI, Eyerich K, Braun SA, Marr C, Peng T. Generating dermatopathology reports from gigapixel whole slide images with HistoGPT. Nat Commun 2025; 16:4886. [PMID: 40419470 DOI: 10.1038/s41467-025-60014-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2024] [Accepted: 05/12/2025] [Indexed: 05/28/2025] Open
Abstract
Histopathology is the reference standard for diagnosing the presence and nature of many diseases, including cancer. However, analyzing tissue samples under a microscope and summarizing the findings in a comprehensive pathology report is time-consuming, labor-intensive, and non-standardized. To address this problem, we present HistoGPT, a vision language model that generates pathology reports from a patient's multiple full-resolution histology images. It is trained on 15,129 whole slide images from 6705 dermatology patients with corresponding pathology reports. The generated reports match the quality of human-written reports for common and homogeneous malignancies, as confirmed by natural language processing metrics and domain expert analysis. We evaluate HistoGPT in an international, multi-center clinical study and show that it can accurately predict tumor subtypes, tumor thickness, and tumor margins in a zero-shot fashion. Our model demonstrates the potential of artificial intelligence to assist pathologists in evaluating, reporting, and understanding routine dermatopathology cases.
Affiliation(s)
- Manuel Tran
- Helmholtz AI, Helmholtz Munich, Neuherberg, Germany
- School of Computation, Information and Technology, Technical University of Munich, Munich, Germany
| | - Paul Schmidle
- Department of Dermatology, Medical Center, University of Freiburg, Freiburg, Germany
| | - Ruifeng Ray Guo
- Department of Laboratory Medicine and Pathology, Mayo Clinic, Jacksonville, FL, USA
| | - Sophia J Wagner
- Helmholtz AI, Helmholtz Munich, Neuherberg, Germany
- School of Computation, Information and Technology, Technical University of Munich, Munich, Germany
| | - Valentin Koch
- School of Computation, Information and Technology, Technical University of Munich, Munich, Germany
- Institute of AI for Health, Helmholtz Munich, Neuherberg, Germany
| | | | - Brenna Novotny
- Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, USA
| | - Dennis H Murphree
- Digital Health, Artificial Intelligence and Innovations Program, Mayo Clinic, Rochester, MN, USA
| | - Heather D Hardway
- Digital Health, Artificial Intelligence and Innovations Program, Mayo Clinic, Rochester, MN, USA
| | - Marina D'Amato
- Computational Pathology Group, Radboud University Medical Center, Nijmegen, The Netherlands
| | - Judith Lefkes
- Computational Pathology Group, Radboud University Medical Center, Nijmegen, The Netherlands
- Oncode Institute, Utrecht, The Netherlands
| | - Daan J Geijs
- Computational Pathology Group, Radboud University Medical Center, Nijmegen, The Netherlands
- Oncode Institute, Utrecht, The Netherlands
| | - Annette Feuchtinger
- Core Facility Pathology and Tissue Analytics, Helmholtz Munich, Neuherberg, Germany
| | - Alexander Böhner
- Department of Dermatology and Allergy, Technical University of Munich, Munich, Germany
| | - Robert Kaczmarczyk
- Department of Dermatology and Allergy, Technical University of Munich, Munich, Germany
| | - Tilo Biedermann
- Department of Dermatology and Allergy, Technical University of Munich, Munich, Germany
| | - Avital L Amir
- Department of Pathology, Radboud University Medical Center, Nijmegen, The Netherlands
| | - Antien L Mooyaart
- Department of Pathology, Erasmus University Medical Center, Rotterdam, The Netherlands
| | - Francesco Ciompi
- Computational Pathology Group, Radboud University Medical Center, Nijmegen, The Netherlands
| | - Geert Litjens
- Computational Pathology Group, Radboud University Medical Center, Nijmegen, The Netherlands
- Oncode Institute, Utrecht, The Netherlands
| | - Chen Wang
- Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, USA
| | - Nneka I Comfere
- Digital Health, Artificial Intelligence and Innovations Program, Mayo Clinic, Rochester, MN, USA
- Department of Dermatology and Laboratory Medicine & Pathology, Mayo Clinic, Rochester, MN, USA
| | - Kilian Eyerich
- Department of Dermatology, Medical Center, University of Freiburg, Freiburg, Germany.
| | - Stephan A Braun
- Dermatology Department, University Hospital Münster, Münster, Germany.
- Department of Dermatology, Medical Faculty, Heinrich-Heine University, Düsseldorf, Germany.
| | - Carsten Marr
- Helmholtz AI, Helmholtz Munich, Neuherberg, Germany.
- Institute of AI for Health, Helmholtz Munich, Neuherberg, Germany.
| | - Tingying Peng
- Helmholtz AI, Helmholtz Munich, Neuherberg, Germany.
- School of Computation, Information and Technology, Technical University of Munich, Munich, Germany.
| |
|
19
|
Dastani M, Mardaneh J, Rostamian M. Large language models' capabilities in responding to tuberculosis medical questions: testing ChatGPT, Gemini, and Copilot. Sci Rep 2025; 15:18004. [PMID: 40410343 PMCID: PMC12102205 DOI: 10.1038/s41598-025-03074-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2025] [Accepted: 05/19/2025] [Indexed: 05/25/2025] Open
Abstract
This study aims to evaluate the capability of Large Language Models (LLMs) in responding to questions related to tuberculosis. Three large language models (ChatGPT, Gemini, and Copilot) were selected based on public accessibility criteria and their ability to respond to medical questions. Questions were designed across four main domains (diagnosis, treatment, prevention and control, and disease management). The responses were subsequently evaluated using DISCERN-AI and NLAT-AI assessment tools. ChatGPT achieved higher scores (4 out of 5) across all domains, while Gemini demonstrated superior performance in specific areas such as prevention and control with a score of 4.4. Copilot showed the weakest performance in disease management with a score of 3.6. In the diagnosis domain, all three models demonstrated equivalent performance (4 out of 5). According to the DISCERN-AI criteria, ChatGPT excelled in information relevance but showed deficiencies in providing sources and information production dates. All three models exhibited similar performance in balance and objectivity indicators. While all three models demonstrate acceptable capabilities in responding to medical questions related to tuberculosis, they share common limitations such as insufficient source citation and failure to acknowledge response uncertainties. Enhancement of these models could strengthen their role in providing medical information.
Affiliation(s)
- Meisam Dastani
- Infectious Diseases Research Center, Gonabad University of Medical Sciences, Gonabad, Iran
| | - Jalal Mardaneh
- Department of Microbiology, Infectious Diseases Research Center, School of Medicine, Gonabad University of Medical Sciences, Gonabad, Iran
| | - Morteza Rostamian
- English Department, School of Medicine, Gonabad University of Medical Sciences, Gonabad, Iran.
| |
|
20
|
Hein D, Christie A, Holcomb M, Xie B, Jain AJ, Vento J, Rakheja N, Shakur AH, Christley S, Cowell LG, Brugarolas J, Jamieson AR, Kapur P. Iterative refinement and goal articulation to optimize large language models for clinical information extraction. NPJ Digit Med 2025; 8:301. [PMID: 40410408 PMCID: PMC12102345 DOI: 10.1038/s41746-025-01686-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2025] [Accepted: 04/28/2025] [Indexed: 05/25/2025] Open
Abstract
Extracting structured data from free-text medical records at scale is laborious, and traditional approaches struggle in complex clinical domains. We present a novel, end-to-end pipeline leveraging large language models (LLMs) for highly accurate information extraction and normalization from unstructured pathology reports, focusing initially on kidney tumors. Our innovation combines flexible prompt templates, the direct production of analysis-ready tabular data, and a rigorous, human-in-the-loop iterative refinement process guided by a comprehensive error ontology. Applying the finalized pipeline to 2297 kidney tumor reports with pre-existing templated data available for validation yielded a macro-averaged F1 of 0.99 for six kidney tumor subtypes and 0.97 for detecting kidney metastasis. We further demonstrate flexibility with multiple LLM backbones and adaptability to new domains, utilizing publicly available breast and prostate cancer reports. Beyond performance metrics or pipeline specifics, we emphasize the critical importance of task definition, interdisciplinary collaboration, and complexity management in LLM-based clinical workflows.
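The macro-averaged F1 reported above is the unweighted mean of per-class F1 scores, so a rare tumor subtype counts as much as a common one. A minimal sketch in plain Python, assuming single-label predictions; the function name `macro_f1` and the subtype labels are hypothetical and are not taken from the study's pipeline:

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels, for illustration only.
y_true = ["clear_cell", "clear_cell", "papillary", "chromophobe"]
y_pred = ["clear_cell", "papillary", "papillary", "chromophobe"]
print(round(macro_f1(y_true, y_pred), 3))
```

Because every class contributes equally, a macro-averaged score near 1 indicates that no single subtype is being systematically missed, which a micro-averaged or accuracy figure could hide.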
Affiliation(s)
- David Hein
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, Texas, USA.
| | - Alana Christie
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Michael Holcomb
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - Bingqing Xie
- Department of Internal Medicine, Division of Hematology & Oncology, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - A J Jain
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - Joseph Vento
- Department of Internal Medicine, Division of Hematology & Oncology, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Neil Rakheja
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Ameer Hamza Shakur
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - Scott Christley
- Department of Health Data Science and Biostatistics, Peter O'Donnell Jr. School of Public Health, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Lindsay G Cowell
- Department of Health Data Science and Biostatistics, Peter O'Donnell Jr. School of Public Health, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - James Brugarolas
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Andrew R Jamieson
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - Payal Kapur
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Department of Pathology, University of Texas Southwestern Medical Center, Dallas, TX, USA
| |
|
21
|
Kell G, Roberts A, Umansky S, Khare Y, Ahmed N, Patel N, Simela C, Coumbe J, Rozario J, Griffiths RR, Marshall IJ. RealMedQA: A pilot biomedical question answering dataset containing realistic clinical questions. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2025; 2024:590-599. [PMID: 40417548 PMCID: PMC12099375] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 05/27/2025]
Abstract
Clinical question answering systems have the potential to provide clinicians with relevant and timely answers to their questions. Nonetheless, despite the advances that have been made, adoption of these systems in clinical settings has been slow. One issue is a lack of question-answering datasets which reflect the real-world needs of health professionals. In this work, we present RealMedQA, a dataset of realistic clinical questions generated by humans and an LLM. We describe the process for generating and verifying the QA pairs and assess several QA models on BioASQ and RealMedQA to assess the relative difficulty of matching answers to questions. We show that the LLM is more cost-efficient for generating "ideal" QA pairs. Additionally, we achieve a lower lexical similarity between questions and answers than BioASQ which provides an additional challenge to the top two QA models, as per the results. We release our code and our dataset publicly to encourage further research.
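The abstract does not say how lexical similarity between questions and answers was measured. As one illustrative possibility only, token-level Jaccard overlap can be sketched as follows; the function name, the tokenizer, and the example strings are assumptions and not the authors' method:

```python
import re


def jaccard_similarity(question, answer):
    """Share of distinct lowercase tokens that the two texts have in common."""
    q_tokens = set(re.findall(r"[a-z0-9]+", question.lower()))
    a_tokens = set(re.findall(r"[a-z0-9]+", answer.lower()))
    union = q_tokens | a_tokens
    if not union:
        return 0.0  # both texts empty: treat as no overlap
    return len(q_tokens & a_tokens) / len(union)


# Hypothetical question-answer pair, for illustration only.
question = "What is the first-line treatment for hypertension?"
answer = "Offer an ACE inhibitor as first-line treatment for hypertension."
print(jaccard_similarity(question, answer))
```

Under a measure like this, a dataset whose answers reuse few of the question's words scores lower, which makes retrieval by surface matching harder.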
Affiliation(s)
- Gregory Kell
- King's College London, London, Greater London, United Kingdom
| | - Angus Roberts
- King's College London, London, Greater London, United Kingdom
| | - Serge Umansky
- Metadvice Ltd., London, Greater London, United Kingdom
| | - Yuti Khare
- Maidstone and Tunbridge Wells NHS Trust, Maidstone, Kent, United Kingdom
| | - Najma Ahmed
- King's College London, London, Greater London, United Kingdom
| | - Nikhil Patel
- King's College London, London, Greater London, United Kingdom
| | - Chloe Simela
- King's College London, London, Greater London, United Kingdom
| | - Jack Coumbe
- King's College London, London, Greater London, United Kingdom
| | - Julian Rozario
- King's College London, London, Greater London, United Kingdom
| | | | - Iain J Marshall
- King's College London, London, Greater London, United Kingdom
| |
Collapse
|
22
|
Munzir SI, Hier DB, Oommen C, Carrithers MD. A Large Language Model Outperforms Other Computational Approaches to the High-Throughput Phenotyping of Physician Notes. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2025; 2024:838-846. [PMID: 40417529 PMCID: PMC12099424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/27/2025]
Abstract
High-throughput phenotyping, the automated mapping of patient signs and symptoms to standardized ontology concepts, is essential for realizing value from electronic health records (EHR) in support of precision medicine. Despite technological advances, high-throughput phenotyping remains a challenge. This study compares three computational approaches to high-throughput phenotyping: a large language model (LLM) incorporating generative AI, a deep learning (DL) approach utilizing span categorization, and a machine learning (ML) approach with word embeddings. The LLM approach that implemented GPT-4 demonstrated superior performance, suggesting that large language models are poised to become the preferred method for high-throughput phenotyping of physician notes.
Affiliation(s)
- Syed I Munzir
  - Department of Neurology and Rehabilitation, University of Illinois at Chicago, Chicago, USA
- Daniel B Hier
  - Department of Neurology and Rehabilitation, University of Illinois at Chicago, Chicago, USA
  - Kummer Institute, Missouri University of Science and Technology, Rolla, MO, USA
- Chelsea Oommen
  - Department of Neurology and Rehabilitation, University of Illinois at Chicago, Chicago, USA
- Michael D Carrithers
  - Department of Neurology and Rehabilitation, University of Illinois at Chicago, Chicago, USA
|
23
|
Shi Y, Xu S, Yang T, Liu Z, Liu T, Li X, Liu N. MKRAG: Medical Knowledge Retrieval Augmented Generation for Medical Question Answering. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2025; 2024:1011-1020. [PMID: 40417500 PMCID: PMC12099378] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/27/2025]
Abstract
Large Language Models (LLMs), although powerful in general domains, often perform poorly on domain-specific tasks such as medical question answering (QA). In addition, LLMs tend to function as "black-boxes", making it challenging to modify their behavior. To address the problem, our work employs a transparent process of retrieval augmented generation (RAG), aiming to improve LLM responses without the need for fine-tuning or retraining. Specifically, we propose a comprehensive retrieval strategy to extract medical facts from an external knowledge base, and then inject them into the LLM's query prompt. Focusing on medical QA, we evaluate the impact of different retrieval models and the number of facts on LLM performance using the MedQA-SMILE dataset. Notably, our retrieval-augmented Vicuna-7B model exhibited an accuracy improvement from 44.46% to 48.54%. This work underscores the potential of RAG to enhance LLM performance, offering a practical approach to mitigate the challenges posed by black-box LLMs.
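The retrieve-then-inject idea in the abstract above can be illustrated with a minimal sketch. It assumes a toy in-memory fact list and a simple token-overlap retriever; the fact list, the function names `retrieve` and `build_prompt`, and the prompt wording are all hypothetical and are not taken from the paper, which evaluates its own retrieval models and knowledge base.

```python
import re

# Hypothetical toy knowledge base; the paper's external knowledge base is not reproduced here.
FACTS = [
    "Metformin is a first-line drug for type 2 diabetes.",
    "Warfarin is an anticoagulant.",
    "Amoxicillin is a penicillin-class antibiotic.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, facts, k=2):
    """Return the k facts sharing the most tokens with the question.

    Token overlap is a stand-in for a real retrieval model (an assumption).
    """
    q_tokens = tokenize(question)
    ranked = sorted(facts, key=lambda f: len(q_tokens & tokenize(f)), reverse=True)
    return ranked[:k]


def build_prompt(question, facts, k=2):
    """Inject the retrieved facts ahead of the question in a plain-text prompt."""
    lines = ["Use the following medical facts when answering."]
    lines += [f"- {fact}" for fact in retrieve(question, facts, k)]
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


question = "Which drug is first-line for type 2 diabetes?"
prompt = build_prompt(question, FACTS)
print(prompt)
```

The resulting prompt string would then be passed to the LLM unchanged, which is what lets this approach avoid fine-tuning or retraining.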
Affiliation(s)
- Yucheng Shi
  - School of Computing, University of Georgia, Athens, GA 30602 USA
  - Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA
- Shaochen Xu
  - School of Computing, University of Georgia, Athens, GA 30602 USA
- Tianze Yang
  - School of Computing, University of Georgia, Athens, GA 30602 USA
- Zhengliang Liu
  - School of Computing, University of Georgia, Athens, GA 30602 USA
- Tianming Liu
  - School of Computing, University of Georgia, Athens, GA 30602 USA
- Xiang Li
  - Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA
- Ninghao Liu
  - School of Computing, University of Georgia, Athens, GA 30602 USA
|
24
|
Das T, Shafquat A, Beigi M, Aptekar J, Mezey J, Sun J. SeqTrial: Utility Preserving Sequential Clinical Trial Data Generator. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2025; 2024:329-338. [PMID: 40417577 PMCID: PMC12099387] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/27/2025]
Abstract
Clinical trial data used to evaluate new treatments have value beyond the original studies, but limitations in data access due to privacy concerns make further use of these data challenging. Digital twins offer a solution by simulating patient outcomes, providing less restricted data access, reducing costs and increasing sample sizes. However, existing research focuses on synthetic Electronic Healthcare Records (EHRs) and lacks personalized patient record generation. This paper introduces SeqTrial, a framework for generating personalized digital twins for sequential clinical trial event data. The method uses BioBERT word embeddings to capture biomedical term semantics, an attention mechanism to understand visit relationships, and synthesizes digital twins for each patient. SeqTrial generates utility-preserving digital twins capable of estimating clinical outcomes, while addressing data scarcity through self-supervised pretraining. The method demonstrates high fidelity and utility in generating synthetic sequential clinical trial data for patient outcome prediction while ensuring privacy protection. The code is available at.
Affiliation(s)
- Trisha Das
  - University of Illinois Urbana-Champaign, Urbana, IL
- Jimeng Sun
  - University of Illinois Urbana-Champaign, Urbana, IL
|
25
|
Chekuri A, Johal AS, Allen MR, Ayers JW, Hogarth M, Farcas E. Towards Optimizing LLM Use in Healthcare: Identifying Patient Questions in MyChart Messages. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2025; 2024:232-241. [PMID: 40417557 PMCID: PMC12099336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/27/2025]
Abstract
The volume of patient-provider messages is on the rise, and Large Language Models (LLMs) can potentially streamline the clinical messaging process, but their success hinges on triaging messages they can optimally address. In this study, we analyzed Electronic Health Records with over 4 million messages exchanged between patients and providers to characterize the utility of using LLMs for messages containing knowledge questions. We implemented a rule-based Syntactic Question Detector as a triage tool, and we evaluated it on 500 messages. The interrater reliability metrics and comparison with LLMs show the difficulty of detecting questions due to the informal text and implicit requests. Our results show that 25% of MyChart messages with questions do not have a response from the clinical team. This paper provides insights into the challenges of real-world data, highlights the importance and non-triviality of detecting questions, and suggests a pipeline for LLM use in healthcare.
Affiliation(s)
- Akhila Chekuri
  - University of California San Diego, La Jolla, CA Computer Science and Engineering
- Armaan S Johal
  - University of California San Diego, La Jolla, CA Computer Science and Engineering
  - University of California San Diego, La Jolla, CA Cognitive Science
- Matthew R Allen
  - University of California San Diego, La Jolla, CA Division of Biomedical Informatics
- John W Ayers
  - University of California San Diego, La Jolla, CA Division of Biomedical Informatics
- Michael Hogarth
  - University of California San Diego, La Jolla, CA Division of Biomedical Informatics
- Emilia Farcas
  - University of California San Diego, La Jolla, CA Qualcomm Institute
|
26
|
Chen Z, Zhang M, Ahmed MM, Guo Y, George TJ, Bian J, Wu Y. Narrative Feature or Structured Feature? A Study of Large Language Models to Identify Cancer Patients at Risk of Heart Failure. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2025; 2024:242-251. [PMID: 40417538 PMCID: PMC12099403] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/27/2025]
Abstract
Cancer treatments are known to introduce cardiotoxicity, negatively impacting outcomes and survivorship. Identifying cancer patients at risk of heart failure (HF) is critical to improving cancer treatment outcomes and safety. This study examined machine learning (ML) models to identify cancer patients at risk of HF using electronic health records (EHRs), including traditional ML, Time-Aware long short-term memory (T-LSTM), and large language models (LLMs) using novel narrative features derived from the structured medical codes. We identified a cancer cohort of 12,806 patients from the University of Florida Health, diagnosed with lung, breast, and colorectal cancers, among which 1,602 individuals developed HF after cancer. The LLM, GatorTron-3.9B, achieved the best F1 scores, outperforming the traditional support vector machines by 39%, the T-LSTM deep learning model by 7%, and a widely used transformer model, BERT, by 5.6%. The analysis shows that the proposed narrative features remarkably increased feature density and improved performance.
Affiliation(s)
- Ziyi Chen
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Mengyuan Zhang
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Mustafa Mohammed Ahmed
  - Division of Cardiovascular Medicine, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA
- Yi Guo
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Thomas J George
  - Division of Hematology & Oncology, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA
- Jiang Bian
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Yonghui Wu
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
|
27
|
Lyu W, Bi Z, Wang F, Chen C. BadCLM: Backdoor Attack in Clinical Language Models for Electronic Health Records. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2025; 2024:768-777. [PMID: 40417555 PMCID: PMC12099347] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/27/2025]
Abstract
The advent of clinical language models integrated into electronic health records (EHR) for clinical decision support has marked a significant advancement, leveraging the depth of clinical notes for improved decision-making. Despite their success, the potential vulnerabilities of these models remain largely unexplored. This paper delves into the realm of backdoor attacks on clinical language models, introducing an innovative attention-based backdoor attack method, BadCLM (Bad Clinical Language Models). This technique clandestinely embeds a backdoor within the models, causing them to produce incorrect predictions when a pre-defined trigger is present in inputs, while functioning accurately otherwise. We demonstrate the efficacy of BadCLM through an in-hospital mortality prediction task with the MIMIC III dataset, showcasing its potential to compromise model integrity. Our findings illuminate a significant security risk in clinical decision support systems and pave the way for future endeavors in fortifying clinical language models against such vulnerabilities.
Affiliation(s)
- Weimin Lyu
  - The Stony Brook University, New York, NY
- Zexin Bi
  - The Webb Schools, Claremont, CA
- Chao Chen
  - The Stony Brook University, New York, NY
|
28
|
Majid I, Mishra V, Ravindranath R, Wang SY. Evaluating the Performance of Large Language Models for Named Entity Recognition in Ophthalmology Clinical Free-Text Notes. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2025; 2024:778-787. [PMID: 40417582 PMCID: PMC12099357] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/27/2025]
Abstract
This study compared large language models (LLMs) and Bidirectional Encoder Representations from Transformers (BERT) models in identifying medication names, routes, and frequencies from publicly available free-text ophthalmology progress notes of 480 patients. 5,520 lines of annotated text were divided into train (N=3,864), validation (N=1,104), and test sets (N=552). We evaluated ChatGPT-3.5, ChatGPT-4, PaLM 2, and Gemini to identify these medication entities. We fine-tuned BERT, BioBERT, ClinicalBERT, DistilBERT, and RoBERTa for the same task using the training set. On the test set, GPT-4 achieved the best performance (macro-averaged F1 0.962). Among the BERT models, BioBERT achieved the best performance (macro-averaged F1 0.875). Modern LLMs outperformed BERT models even in the highly domain-specific task of identifying ophthalmic medication information from progress notes, showcasing the potential of LLMs for medical named entity recognition to enhance patient care.
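For readers unfamiliar with the metric reported above, the following is a minimal sketch of a macro-averaged F1 computation over three entity labels. It assumes simple per-item label comparison; the label names, the toy gold and predicted lists, and the function name `macro_f1` are hypothetical, and the study's actual entity-matching rules are not stated in the abstract.

```python
def macro_f1(gold, pred, labels):
    """Average the per-label F1 scores, giving every label equal weight."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Hypothetical toy annotations, not data from the study.
LABELS = ["NAME", "ROUTE", "FREQ"]
gold = ["NAME", "NAME", "ROUTE", "FREQ", "FREQ", "ROUTE"]
pred = ["NAME", "ROUTE", "ROUTE", "FREQ", "FREQ", "ROUTE"]

score = macro_f1(gold, pred, LABELS)
print(f"macro-averaged F1 = {score:.3f}")
```

Because each label contributes equally to the average, a rare entity type that is recognized poorly lowers the macro score as much as a frequent one would.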
Affiliation(s)
- Iyad Majid
  - Stanford University School of Medicine, Palo Alto, CA, United States
- Vaibhav Mishra
  - Stanford University School of Medicine, Palo Alto, CA, United States
- Sophia Y Wang
  - Stanford University School of Medicine, Palo Alto, CA, United States
|
29
|
Wang J, Li H, Liu H. A Comprehensive System for Searching and Evaluating Genomic Variant Evidence Using AI and Knowledge Bases to Support Personalized Medicine. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2025; 2024:1206-1214. [PMID: 40417484 PMCID: PMC12099401] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/27/2025]
Abstract
We introduce an innovative automated system for the search and assessment of genetic variant evidence, meticulously aligned with ACMG guidelines. Leveraging the synergistic power of artificial intelligence (AI), elastic search, and an extensive knowledge base, our system advances the efficiency and accuracy of genetic variant interpretation. Distinct from existing methodologies, it features a pioneering literature filtering mechanism that automates the identification and relevance ranking of scientific articles, significantly reducing the time spent on literature evidence search and optimizing the evidence assessment process. Implemented and rigorously tested by a commercial company's hereditary cancer variant curation team, the system demonstrated its effectiveness and scalability by processing over 3 million PMIDs and 1.8 million full-text articles. Throughout the period of active utilization, significant insights were gleaned into the real-world impact and user experience of the system, conclusively affirming its robustness. Our comparative analysis with Mastermind 2.0 highlights the system's enhanced performance in minimizing false positives for various mutation types. The core AI model exhibits exceptional precision, recall, and F1 scores above 0.8, signifying its adeptness in selecting pertinent literature for variant classification. The experience and knowledge acquired from deploying the system in a commercial setting provide a distinctive outlook on its practicality and prospects for future development. The novel integration of AI with traditional genetic variant curation processes heralds a new era in the field, promising significant advancements and broader application prospects.
Affiliation(s)
- Jinlian Wang
  - The McWilliams School of Biomedical Informatics, Houston, TX, USA
- Hui Li
  - The McWilliams School of Biomedical Informatics, Houston, TX, USA
- Hongfang Liu
  - The McWilliams School of Biomedical Informatics, Houston, TX, USA
|
30
|
Das A, Tariq A, Batalini F, Dhara B, Banerjee I. Exposing Vulnerabilities in Clinical LLMs Through Data Poisoning Attacks: Case Study in Breast Cancer. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2025; 2024:339-348. [PMID: 40417494 PMCID: PMC12099371] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/27/2025]
Abstract
Training Large Language Models (LLMs) with billions of parameters on a dataset and publishing the model for public access is the current standard practice. Despite their transformative impact on natural language processing (NLP), public LLMs present notable vulnerabilities given that the source of training data is often web-based or crowdsourced, and hence can be manipulated by perpetrators. We delve into the vulnerabilities of clinical LLMs, particularly BioGPT, which is trained on publicly available biomedical literature and clinical notes from MIMIC-III, in the realm of data poisoning attacks. Exploring susceptibility to data poisoning-based attacks on de-identified breast cancer clinical notes, our approach is the first one to assess the extent of such attacks, and our findings reveal successful manipulation of LLM outputs. Through this work, we emphasize the urgency of comprehending these vulnerabilities in LLMs, and encourage the mindful and responsible usage of LLMs in the clinical domain.
Affiliation(s)
- Avisha Das
  - Arizona Advanced AI & Innovation (A3I) Hub, Mayo Clinic Arizona
- Amara Tariq
  - Arizona Advanced AI & Innovation (A3I) Hub, Mayo Clinic Arizona
- Imon Banerjee
  - Arizona Advanced AI & Innovation (A3I) Hub, Mayo Clinic Arizona
  - Department of Radiology, Mayo Clinic Arizona
  - School of Computing and Augmented Intelligence, Arizona State University
|
31
|
Dobreva J, Simjanoska Misheva M, Mishev K, Trajanov D, Mishkovski I. A Unified Framework for Alzheimer's Disease Knowledge Graphs: Architectures, Principles, and Clinical Translation. Brain Sci 2025; 15:523. [PMID: 40426694 PMCID: PMC12110335 DOI: 10.3390/brainsci15050523] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/08/2025] [Revised: 05/07/2025] [Accepted: 05/12/2025] [Indexed: 05/29/2025] Open
Abstract
This review paper synthesizes the application of knowledge graphs (KGs) in Alzheimer's disease (AD) research around two basic questions: what types of input data are available to construct these knowledge graphs, and what purpose the knowledge graph is intended to fulfill. We synthesize results from existing works to illustrate how diverse knowledge graph structures behave in different data availability settings with distinct application targets in AD research. Through comparative analysis, we define best methodological practices by data type (literature, structured databases, neuroimaging, and clinical records) and application of interest (drug repurposing, disease classification, mechanism discovery, and clinical decision support). From this analysis, we recommend AD-KG 2.0, a new framework that consolidates best practices into a unified architecture with well-defined decision pathways for implementation. Our key contributions are as follows: (1) a dynamic adaptation mechanism that adapts methodological elements automatically according to both data availability and application objectives, (2) a specialized semantic alignment layer that harmonizes terminologies across biological scales, and (3) a multi-constraint optimization approach for knowledge graph building. The framework accommodates a variety of applications, including drug repurposing, patient stratification for precision medicine, disease progression modeling, and clinical decision support. Our system, with a decision-tree-structured, pipeline-layered architecture, offers researchers precise directions on how to use knowledge graphs in AD research by aligning methodological choices with data availability and application goals. We provide precise component designs and adaptation processes that deliver optimal performance across varying research and clinical settings. We conclude by addressing implementation challenges and future directions for translating knowledge graph technologies from research tool to clinical use, with a specific focus on interpretability, workflow integration, and regulatory matters.
Affiliation(s)
- Jovana Dobreva
  - Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, 1000 Skopje, North Macedonia; (M.S.M.); (K.M.); (D.T.); (I.M.)
|
32
|
Karabuğa B, Karaçin C, Büyükkör M, Bayram D, Aydemir E, Kaya OB, Yılmaz ME, Çamöz ES, Ergün Y. The Role of Artificial Intelligence (ChatGPT-4o) in Supporting Tumor Board Decisions. J Clin Med 2025; 14:3535. [PMID: 40429531 PMCID: PMC12112035 DOI: 10.3390/jcm14103535] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2025] [Revised: 05/09/2025] [Accepted: 05/16/2025] [Indexed: 05/29/2025] Open
Abstract
Background/Objectives: Artificial intelligence (AI) has emerged as a promising field in the era of personalized oncology due to its potential to save time and workforce while serving as a supportive tool in patient management decisions. Although several studies in the literature have explored the integration of AI into oncology practice across different tumor types, available data remain limited. In our study, we aimed to evaluate the role of AI in the management of complex cancer cases by comparing the decisions of an in-house tumor board and ChatGPT-4o for patients with various tumor types. Methods: A total of 102 patients with diverse cancer types were included. Treatment and follow-up decisions proposed by both the tumor board and ChatGPT-4o were independently evaluated by two medical oncologists using a 5-point Likert scale. Results: Analysis of agreement levels showed high inter-rater reliability (κ = 0.722, p < 0.001 for tumor board decisions; κ = 0.794, p < 0.001 for ChatGPT decisions). However, concordance between the tumor board and ChatGPT was low, as reflected in the assessments of both raters (Rater 1: κ = 0.211, p = 0.003; Rater 2: κ = 0.376, p < 0.001). Both raters more frequently agreed with the tumor board decisions, and a statistically significant difference between tumor board and AI decisions was observed for both (Rater 1: Z = +4.548, p < 0.001; Rater 2: Z = +3.990, p < 0.001). Conclusions: These findings suggest that AI, in its current form, is not yet capable of functioning as a standalone decision-maker in the management of challenging oncology cases. Clinical experience and expert judgment remain the most critical factors in guiding patient care.
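The κ values reported above are agreement statistics corrected for chance. As a minimal sketch, assuming the unweighted form of Cohen's kappa (the abstract does not say whether a weighted variant was used for the 5-point Likert ratings), the computation looks like this. The ratings and the function name `cohens_kappa` are hypothetical and are not data from the study.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must score the same non-empty set of cases")
    n = len(ratings_a)
    # Observed agreement: fraction of cases where both raters gave the same score.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement by chance, from each rater's marginal score frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical Likert scores (1 to 5) from two raters on eight cases.
rater_1 = [5, 4, 4, 3, 5, 2, 4, 5]
rater_2 = [5, 4, 3, 3, 5, 2, 4, 4]

kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.3f}")
```

In this toy case the raters agree on 6 of 8 cases (p_o = 0.75) while chance agreement is p_e = 18/64, so kappa is noticeably lower than raw agreement.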
Affiliation(s)
- Berkan Karabuğa
  - Department of Medical Oncology, Dr. Abdurrahman Yurtaslan Ankara Oncology Research and Training Hospital, 06200 Ankara, Turkey
- Cengiz Karaçin
  - Department of Medical Oncology, Dr. Abdurrahman Yurtaslan Ankara Oncology Research and Training Hospital, 06200 Ankara, Turkey
- Mustafa Büyükkör
  - Department of Medical Oncology, Dr. Abdurrahman Yurtaslan Ankara Oncology Research and Training Hospital, 06200 Ankara, Turkey
- Doğan Bayram
  - Department of Medical Oncology, Gülhane Research and Training Hospital, 06010 Ankara, Turkey
- Ergin Aydemir
  - Department of Medical Oncology, Dr. Abdurrahman Yurtaslan Ankara Oncology Research and Training Hospital, 06200 Ankara, Turkey
- Osman Bilge Kaya
  - Department of Medical Oncology, Dr. Abdurrahman Yurtaslan Ankara Oncology Research and Training Hospital, 06200 Ankara, Turkey
- Mehmet Emin Yılmaz
  - Department of Medical Oncology, Dr. Abdurrahman Yurtaslan Ankara Oncology Research and Training Hospital, 06200 Ankara, Turkey
- Elif Sertesen Çamöz
  - Department of Medical Oncology, Dr. Abdurrahman Yurtaslan Ankara Oncology Research and Training Hospital, 06200 Ankara, Turkey
- Yakup Ergün
  - Department of Medical Oncology, Bower Hospital, 21100 Diyarbakır, Turkey
|
33
|
Yamagiwa H, Hashimoto R, Arakane K, Murakami K, Soeda S, Oyama M, Zhu Y, Okada M, Shimodaira H. Predicting drug-gene relations via analogy tasks with word embeddings. Sci Rep 2025; 15:17240. [PMID: 40383732 PMCID: PMC12086191 DOI: 10.1038/s41598-025-01418-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2024] [Accepted: 05/06/2025] [Indexed: 05/20/2025] Open
Abstract
Natural language processing is utilized in a wide range of fields, where words in text are typically transformed into feature vectors called embeddings. BioConceptVec is a specific example of embeddings tailored for biology, trained on approximately 30 million PubMed abstracts using models such as skip-gram. Generally, word embeddings are known to solve analogy tasks through simple vector arithmetic. For example, subtracting the vector for man from that of king and then adding the vector for woman yields a point that lies closer to queen in the embedding space. In this study, we demonstrate that BioConceptVec embeddings, along with our own embeddings trained on PubMed abstracts, contain information about drug-gene relations and can predict target genes from a given drug through analogy computations. We also show that categorizing drugs and genes using biological pathways improves performance. Furthermore, we illustrate that vectors derived from known relations in the past can predict unknown future relations in datasets divided by year. Despite the simplicity of implementing analogy tasks as vector additions, our approach demonstrated performance comparable to that of large language models such as GPT-4 in predicting drug-gene relations.
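The king/man/woman/queen example in the abstract above can be made concrete with a minimal sketch. It assumes invented 3-dimensional toy vectors and cosine similarity for the nearest-neighbour search; the vectors, the vocabulary, and the function names are hypothetical and are not BioConceptVec embeddings or the authors' code.

```python
import numpy as np

# Invented toy embeddings for illustration only.
EMBEDDINGS = {
    "king": np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man": np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def solve_analogy(a, b, c, vocab):
    """Answer 'a is to b as c is to ?' via vec(b) - vec(a) + vec(c).

    The three query words are excluded from the candidates, as is common
    practice in analogy evaluation.
    """
    target = vocab[b] - vocab[a] + vocab[c]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vocab[w]))


answer = solve_analogy("man", "king", "woman", EMBEDDINGS)
print(answer)
```

Under the same assumption, a drug-to-gene query would replace the word pair with a known drug and its target gene and then search for the gene nearest to the shifted vector of a new drug.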
Affiliation(s)
- Kiwamu Arakane
  - Institute for Protein Research, Osaka University, Osaka, Japan
- Ken Murakami
  - Research Institute of Molecular Pathology, Vienna BioCenter, Vienna, Austria
- Shou Soeda
  - Institute for Protein Research, Osaka University, Osaka, Japan
- Momose Oyama
  - Kyoto University, Kyoto, Japan
  - RIKEN, Tokyo, Japan
- Mariko Okada
  - Institute for Protein Research, Osaka University, Osaka, Japan
|
34
|
Nun A, Birot O, Guibon G, Lapostolle F, Lerner I. SIMSAMU - A French medical dispatch dialog open dataset. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2025; 268:108857. [PMID: 40408830 DOI: 10.1016/j.cmpb.2025.108857] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2024] [Revised: 04/27/2025] [Accepted: 05/12/2025] [Indexed: 05/25/2025]
Abstract
BACKGROUND: Dispatch Services (DS) are essential to Emergency Medical Services (EMS). Dispatchers enable patients to access medical assistance in emergencies, anytime and anywhere, within limited time and resources. AI-based decision-support tools hold great promise for dispatchers. Developing these tools requires medical field-specific data. Medical dispatch dialogue is unique: it is a brief phone exchange in an emergency, within a limited time frame, without a physical examination. OBJECTIVE: Our main objective was to (i) create an open French dataset of medical dispatch dialogues. Our secondary objectives were to (ii) develop a detailed medical dispatch scheme from this dataset using an unsupervised method, and (iii) provide a baseline evaluation of diarization and speech recognition models for this domain in French. METHODS: From 2022 to 2023, emergency medicine junior doctors simulated real-life medical dispatch calls. These calls were recorded and transcribed to form the SIMSAMU corpus. We developed a dispatch scheme based on (i) recording analysis, (ii) data-driven utterance typology, and (iii) domain expertise. Utterance typology was derived via hierarchical clustering of representations learned by fine-tuning BERT embeddings on SIMSAMU. Clusters were mapped to the Roter Interaction Analysis System (RIAS) and included in our dispatch scheme. SIMSAMU was used to train and evaluate state-of-the-art neural network models for diarization and speech recognition. Diarization used the PyaNet model, fine-tuned on the ESLO2 dataset. Speech recognition used a CTC model with pre-trained wav2vec 2.0 embedding, compared to the multilingual Whisper model. The CTC-wav2vec model was further fine-tuned on SIMSAMU and evaluated by leave-one-speaker-out cross-validation. RESULTS: The dataset consists of 61 audio recordings totaling 3 h 14 min. Four clusters were identified for callers and three for dispatchers. Two main dialogue phases were identified: interrogation and contractualization. The diarization model achieved a 10.4% error rate. Speech recognition word error rates were 35.8% for Whisper, 24.8% for the CTC-wav2vec model fine-tuned on ESLO2, and 16.1% after in-domain fine-tuning. CONCLUSION: We propose a French open medical dispatch dialogue dataset and an expert-validated schema of the medical dispatch dialogue based on unsupervised analysis. Notable gaps in how well speech recognition models generalize underscore the need for targeted, in-domain fine-tuning in this specialized application. SIMSAMU is designed to support this effort by serving as a benchmark for evaluating domain-adapted speech recognition and dialogue modeling strategies.
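The word error rates quoted above follow the usual definition: word-level edit distance between the reference transcript and the model output, divided by the number of reference words. A minimal sketch is given below, assuming whitespace tokenization and no text normalization; the example sentences and the function name `word_error_rate` are hypothetical, and the study's own scoring pipeline is not described in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the hypothesis
            insertion = cur[j - 1] + 1  # extra word in the hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


# Hypothetical transcripts: the hypothesis drops one of six reference words.
reference = "le patient a mal au thorax"
hypothesis = "le patient a mal thorax"

wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.3f}")
```

Because insertions count as errors, this ratio can exceed 1 when the hypothesis is much longer than the reference.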
Affiliation(s)
- Aimé Nun
  - Université Paris Cité, Inserm, Centre de Recherche des Cordeliers, Sorbonne Université, Paris F-75006, France; HeKA, Inria Paris, Paris F-75012, France; Assistance Publique Hôpitaux de Paris (AP-HP), Department of Medical Informatics, Georges Pompidou European Hospital, Paris, France
- Olivier Birot
  - Université Paris Cité, Inserm, Centre de Recherche des Cordeliers, Sorbonne Université, Paris F-75006, France; HeKA, Inria Paris, Paris F-75012, France
- Gaël Guibon
  - LORIA, Université de Lorraine, CNRS, 54600, France; Université Sorbonne Paris Nord, CNRS, Laboratoire d'Informatique de Paris Nord, LIPN, Villetaneuse F-93430, France
- Frédéric Lapostolle
  - SAMU 93, UF Recherche-Enseignement-Qualité, Avicenne Hospital-APHP, Bobigny, France; Université Paris 13, INSERM Unit 942, Sorbonne Paris Cité, Bobigny, France
- Ivan Lerner
  - Université Paris Cité, Inserm, Centre de Recherche des Cordeliers, Sorbonne Université, Paris F-75006, France; HeKA, Inria Paris, Paris F-75012, France; Assistance Publique Hôpitaux de Paris (AP-HP), Department of Medical Informatics, Georges Pompidou European Hospital, Paris, France
|
35
|
Liu Z, Zhang G, Shen Y. Psychomedical named entity recognition method based on multi-level feature extraction and multi-granularity embedding fusion. Sci Rep 2025; 15:16927. [PMID: 40374721 PMCID: PMC12081933 DOI: 10.1038/s41598-025-90939-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/17/2024] [Accepted: 02/17/2025] [Indexed: 05/17/2025] Open
Abstract
Named Entity Recognition (NER) is one of the key natural language processing tasks in psychomedicine. It aims to identify and classify specialized terms in psychomedical texts and to support downstream tasks. Psychological medicine texts are characterized by long paragraphs, complex sentences, and scattered knowledge. Current character-based psychomedicine NER models rely on a single type of embedding and lack structural and phonetic information. Migrating NER models from the general-purpose domain to the psychomedical domain does not effectively improve entity recognition accuracy. To solve this problem, we propose a NER method based on multi-level feature extraction and multi-granularity embedding fusion (MFME-NER). First, three granularities of embedding information (character, radical, and pinyin embeddings) are introduced to enrich the semantic representation of the input text. Second, the BERT model is improved by merging the features of all encoder layers into the output, which gives it multi-layer feature extraction capability (MFE-BERT). The character embedding is pre-trained by MFE-BERT, and a BiLSTM model extracts features at the character granularity. The features of the radical and pinyin embeddings are extracted separately by a CNN model and then fused. Finally, the feature vectors at the three granularities are integrated using a gated feed-forward neural network attention mechanism (GA-FNNAtention). The experimental results show that MFME-NER achieved F1 scores of 94.26% and 89.63% on the self-constructed psychomedical dataset PsyDatase and the CBLUE dataset, respectively. The proposed method outperforms currently used methods on the evaluation metrics, supporting its rationality and efficacy. This study can better contribute to the analysis of psychomedical data.
Affiliation(s)
- Zixuan Liu
  - School of Cyber Security and Computer, Hebei University, Baoding, 071000, China
- Guofang Zhang
  - School of Cyber Security and Computer, Hebei University, Baoding, 071000, China
- Yanguang Shen
  - School of Information Science and Electrical Engineering, Hebei University of Engineering, Handan, 056038, China
|
36
|
Harel-Canada F, Salimian A, Moghanian B, Clingan S, Nguyen A, Avra T, Poimboeuf M, Romero R, Funnell A, Petousis P, Shin M, Peng N, Shover CL, Goodman-Meza D. Enhancing Substance Use Detection in Clinical Notes with Large Language Models. Research Square 2025:rs.3.rs-6615981. [PMID: 40470194 PMCID: PMC12136207 DOI: 10.21203/rs.3.rs-6615981/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2025]
Abstract
Identifying substance use behaviors in electronic health records (EHRs) is challenging because critical details are often buried in unstructured notes that use varied terminology and negation, requiring careful contextual interpretation to distinguish relevant use from historical mentions or denials. Using MIMIC-III/IV discharge summaries, we created a large, annotated drug detection dataset to tackle this problem and support future systemic substance use surveillance. We then investigated the performance of multiple large language models (LLMs) for detecting eight substance use categories within this data. Evaluating models in zero-shot, few-shot, and fine-tuning configurations, we found that a fine-tuned model, Llama-DrugDetector-70B, outperformed others. It achieved near-perfect F1-scores (>=0.95) for most individual substances and strong scores for more complex tasks like prescription opioid misuse (F1=0.815) and polysubstance use (F1=0.917). These findings demonstrate that LLMs significantly enhance detection, showing promise for clinical decision support and research, although further work on scalability is warranted.
Affiliation(s)
- Fabrice Harel-Canada
  - Computer Science Department, University of California, Los Angeles, 404 Westwood Plaza Suite 277, Los Angeles, 90095, CA, USA
- Anabel Salimian
  - Semel Institute for Neuroscience and Human Behavior at University of California, Los Angeles, 760 Westwood Plaza, Los Angeles, 90024, CA, USA
- Brandon Moghanian
  - University of California, Los Angeles, 200 Medical Plaza Suite 365C, Los Angeles, 90024, CA, USA
- Sarah Clingan
  - Integrated Substance Abuse Programs at University of California, Los Angeles, 10911 Weyburn Ave, Ste. 200, Los Angeles, 90024, CA, USA
- Allan Nguyen
  - University of California, Los Angeles, 200 Medical Plaza Suite 365C, Los Angeles, 90024, CA, USA
- Tucker Avra
  - David Geffen School of Medicine at University of California, Los Angeles, 10833 Le Conte Ave, Los Angeles, 90095, CA, USA
- Michelle Poimboeuf
  - Division of General Internal Medicine and Health Services Research, University of California, Los Angeles, 1100 Glendon Ave STE 850, Los Angeles, 90024, CA, USA
- Ruby Romero
  - Division of General Internal Medicine and Health Services Research, University of California, Los Angeles, 1100 Glendon Ave STE 850, Los Angeles, 90024, CA, USA
- Arthur Funnell
  - Clinical and Translational Science Institute, University of California, Los Angeles, 924 Westwood Blvd Suite 420, Los Angeles, 90024, CA, USA
- Panayiotis Petousis
  - Clinical and Translational Science Institute, University of California, Los Angeles, 924 Westwood Blvd Suite 420, Los Angeles, 90024, CA, USA
- Michael Shin
  - Department of Geography, University of California, Los Angeles, 1255 Bunche Hall, Los Angeles, 90095, CA, USA
- Nanyun Peng
  - Computer Science Department, University of California, Los Angeles, 404 Westwood Plaza Suite 277, Los Angeles, 90095, CA, USA
- Chelsea L. Shover
  - Division of General Internal Medicine and Health Services Research, University of California, Los Angeles, 1100 Glendon Ave STE 850, Los Angeles, 90024, CA, USA
- David Goodman-Meza
  - Kirby Institute, University of New South Wales, Wallace Wurth Building (C27), Cnr High St & Botany St, UNSW, Sydney, 2052, NSW, Australia
|
37
|
Wang X, Figueredo G, Li R, Zhang WE, Chen W, Chen X. A survey of deep-learning-based radiology report generation using multimodal inputs. Med Image Anal 2025; 103:103627. [PMID: 40382855 DOI: 10.1016/j.media.2025.103627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2024] [Revised: 04/09/2025] [Accepted: 04/24/2025] [Indexed: 05/20/2025]
Abstract
Automatic radiology report generation can alleviate the workload for physicians and minimize regional disparities in medical resources, therefore becoming an important topic in the medical image analysis field. It is a challenging task, as the computational model needs to mimic physicians to obtain information from multi-modal input data (i.e., medical images, clinical information, medical knowledge, etc.), and produce comprehensive and accurate reports. Recently, numerous works have emerged to address this issue using deep-learning-based methods, such as transformers, contrastive learning, and knowledge-base construction. This survey summarizes the key techniques developed in the most recent works and proposes a general workflow for deep-learning-based report generation with five main components, including multi-modality data acquisition, data preparation, feature learning, feature fusion and interaction, and report generation. The state-of-the-art methods for each of these components are highlighted. Additionally, we summarize the latest developments in large model-based methods and model explainability, along with public datasets, evaluation methods, current challenges, and future directions in this field. We have also conducted a quantitative comparison between different methods in the same experimental setting. This is the most up-to-date survey that focuses on multi-modality inputs and data fusion for radiology report generation. The aim is to provide comprehensive and rich information for researchers interested in automatic clinical report generation and medical image analysis, especially when using multimodal inputs, and to assist them in developing new algorithms to advance the field.
Affiliation(s)
- Xinyi Wang
  - School of Computer Science, The University of Nottingham, Nottingham NG7 2RD, United Kingdom
- Grazziela Figueredo
  - School of Medicine, The University of Nottingham, Nottingham NG7 2RD, United Kingdom
- Ruizhe Li
  - School of Computer Science, The University of Nottingham, Nottingham NG7 2RD, United Kingdom
- Wei Emma Zhang
  - School of Computer and Mathematical Sciences, The University of Adelaide, Adelaide, SA 5005, Australia
- Weitong Chen
  - School of Computer and Mathematical Sciences, The University of Adelaide, Adelaide, SA 5005, Australia
- Xin Chen
  - School of Computer Science, The University of Nottingham, Nottingham NG7 2RD, United Kingdom
|
38
|
Shi B, Chen L, Pang S, Wang Y, Wang S, Li F, Zhao W, Guo P, Zhang L, Fan C, Zou Y, Wu X. Large Language Models and Artificial Neural Networks for Assessing 1-Year Mortality in Patients With Myocardial Infarction: Analysis From the Medical Information Mart for Intensive Care IV (MIMIC-IV) Database. J Med Internet Res 2025; 27:e67253. [PMID: 40354652 PMCID: PMC12107198 DOI: 10.2196/67253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2024] [Revised: 04/01/2025] [Accepted: 04/17/2025] [Indexed: 05/14/2025] Open
Abstract
BACKGROUND Accurate mortality risk prediction is crucial for effective cardiovascular risk management. Recent advancements in artificial intelligence (AI) have demonstrated potential in this specific medical field. Qwen-2 and Llama-3 are high-performance, open-source large language models (LLMs) available online. An artificial neural network (ANN) algorithm derived from the SWEDEHEART (Swedish Web System for Enhancement and Development of Evidence-Based Care in Heart Disease Evaluated According to Recommended Therapies) registry, termed SWEDEHEART-AI, can predict patient prognosis following acute myocardial infarction (AMI).
OBJECTIVE This study aims to evaluate the 3 models mentioned above in predicting 1-year all-cause mortality in critically ill patients with AMI.
METHODS The Medical Information Mart for Intensive Care IV (MIMIC-IV) database is a publicly available data set in critical care medicine. We included 2758 patients who were first admitted for AMI and discharged alive. SWEDEHEART-AI calculated the mortality rate based on each patient's 21 clinical variables. Qwen-2 and Llama-3 analyzed the content of patients' discharge records and directly provided a 1-decimal value between 0 and 1 to represent 1-year death risk probabilities. The patients' actual mortality was verified using follow-up data. The predictive performance of the 3 models was assessed and compared using the Harrell C-statistic (C-index), the area under the receiver operating characteristic curve (AUROC), calibration plots, Kaplan-Meier curves, and decision curve analysis.
RESULTS SWEDEHEART-AI demonstrated strong discrimination in predicting 1-year all-cause mortality in patients with AMI, with a higher C-index than Qwen-2 and Llama-3 (C-index 0.72, 95% CI 0.69-0.74 vs C-index 0.65, 95% CI 0.62-0.67 vs C-index 0.56, 95% CI 0.53-0.58, respectively; P<.001 for both comparisons). SWEDEHEART-AI also showed high and consistent AUROC in the time-dependent ROC curve. The death rates calculated by SWEDEHEART-AI were positively correlated with actual mortality, and the 3 risk classes derived from this model showed clear differentiation in the Kaplan-Meier curve (P<.001). Calibration plots indicated that SWEDEHEART-AI tended to overestimate mortality risk, with an observed-to-expected ratio of 0.478. Compared with the LLMs, SWEDEHEART-AI demonstrated positive and greater net benefits at risk thresholds below 19%.
CONCLUSIONS SWEDEHEART-AI, a trained ANN model, demonstrated the best performance, with strong discrimination and clinical utility in predicting 1-year all-cause mortality in patients with AMI from an intensive care cohort. Among the LLMs, Qwen-2 outperformed Llama-3 and showed moderate predictive value. Qwen-2 and SWEDEHEART-AI exhibited comparable classification effectiveness. The future integration of LLMs into clinical decision support systems holds promise for accurate risk stratification in patients with AMI; however, further research is needed to optimize LLM performance and address calibration issues across diverse patient populations.
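Two of the metrics in this abstract have compact definitions that a short sketch can make concrete: the Harrell C-index (the fraction of comparable patient pairs in which the patient who died earlier was assigned the higher risk) and the observed-to-expected ratio (observed deaths divided by the sum of predicted probabilities, so a value below 1 indicates overestimated risk). The code below is a simplified illustration under stated assumptions, not the study's analysis code: function names and toy inputs are hypothetical, pairs with tied follow-up times are skipped, and tied risk scores count as half-concordant, which matters for models that output coarse 1-decimal risks.

```python
from itertools import combinations


def harrell_c_index(times, events, risks):
    """Simplified Harrell C-index for right-censored data.

    times  : follow-up time per patient
    events : 1 if death was observed, 0 if censored
    risks  : predicted risk score (higher means higher predicted risk)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied times are skipped in this simplified sketch
        early, late = (i, j) if times[i] < times[j] else (j, i)
        if not events[early]:
            continue  # earlier patient censored: ordering is unknown
        comparable += 1
        if risks[early] > risks[late]:
            concordant += 1.0
        elif risks[early] == risks[late]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def observed_to_expected(outcomes, predicted_probs):
    """Observed deaths divided by expected deaths (sum of predicted risks)."""
    expected = sum(predicted_probs)
    if expected == 0:
        raise ValueError("expected number of events is zero")
    return sum(outcomes) / expected
```

With `times=[2, 4, 6, 8]`, `events=[1, 1, 0, 1]`, and `risks=[0.9, 0.5, 0.5, 0.1]`, five pairs are comparable and the function returns 0.9; one observed death against four predictions of 0.5 gives an observed-to-expected ratio of 0.5.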
Affiliation(s)
- Boqun Shi
  - Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Liangguo Chen
  - Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Shuo Pang
  - Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Yue Wang
  - Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Shen Wang
  - Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Fadong Li
  - Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Wenxin Zhao
  - Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Pengrong Guo
  - Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Leli Zhang
  - Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Chu Fan
  - Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Yi Zou
  - Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Xiaofan Wu
  - Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
|
39
|
Krzyzanowski A, Pickett SD, Pogány P. Exploring BERT for Reaction Yield Prediction: Evaluating the Impact of Tokenization, Molecular Representation, and Pretraining Data Augmentation. J Chem Inf Model 2025; 65:4381-4402. [PMID: 40311104 DOI: 10.1021/acs.jcim.5c00359] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/03/2025]
Abstract
Predicting reaction yields in synthetic chemistry remains a significant challenge. This study systematically evaluates the impact of tokenization, molecular representation, pretraining data, and adversarial training on a BERT-based model for yield prediction of Buchwald-Hartwig and Suzuki-Miyaura coupling reactions using publicly available HTE data sets. We demonstrate that molecular representation choice (SMILES, DeepSMILES, SELFIES, Morgan fingerprint-based notation, IUPAC names) has minimal impact on model performance, while typically BPE and SentencePiece tokenization outperform other methods. WordPiece is strongly discouraged for SELFIES and fingerprint-based notation. Furthermore, pretraining with relatively small data sets (<100 K reactions) achieves comparable performance to larger data sets containing millions of examples. The use of artificially generated domain-specific pretraining data is proposed. The artificially generated sets prove to be a good surrogate for the reaction schemes extracted from reaction data sets such as Pistachio or Reaxys. The best performance was observed for hybrid pretraining sets combining the real and the domain-specific, artificial data. Finally, we show that a novel adversarial training approach, perturbing input embeddings dynamically, improves model robustness and generalizability for yield and reaction success prediction. These findings provide valuable insights for developing robust and practical machine learning models for yield prediction in synthetic chemistry. GSK's BERT training code base is made available to the community with this work.
Affiliation(s)
- Stephen D Pickett
  - GSK Medicines Research Centre, Gunnels Wood Road, Stevenage SG1 2NY, U.K.
- Peter Pogány
  - GSK Medicines Research Centre, Gunnels Wood Road, Stevenage SG1 2NY, U.K.
|
40
|
Zhang Y, Vlachos DG, Liu D, Fang H. Rapid Adaptation of Chemical Named Entity Recognition Using Few-Shot Learning and LLM Distillation. J Chem Inf Model 2025; 65:4334-4345. [PMID: 40310732 DOI: 10.1021/acs.jcim.5c00248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/03/2025]
Abstract
Named entity recognition (NER) has been widely used in chemical text mining for the automatic identification and extraction of chemical entities. However, existing chemical NER systems primarily focus on scenarios with abundant training data, requiring significant human effort on annotations. This poses challenges for applications in the chemical field, such as catalysis, where many advancements have traditionally relied on trial-and-error investigations and incremental adjustment of variables. This hinders catalysis science and technology progress in addressing emerging energy and environmental crises. In this work, we propose a few-shot NER model that can quickly adapt to extract new types of chemical entities by using only a limited number of annotated examples. Our model employs a metric-learning approach to transfer entity similarity knowledge from high-resource chemical domains (with abundant annotations) to enable effective entity recognition in low-resource specialized domains (limited annotation). We validate the effectiveness of our model on a few-shot chemical NER benchmark built based on six existing chemical NER data sets. Experiments show that the proposed few-shot NER model can achieve reasonable performance with only 5 examples per entity type and shows consistent improvement as the number of examples increases. Furthermore, we demonstrate how the proposed model can be trained with large language model (LLM) annotated data, opening a new pathway for rapid adaptation of NER systems. Our approach leverages the knowledge broadness of large language models for chemistry while distilling this knowledge into a lightweight model suitable for efficient and in-house use.
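The abstract describes a metric-learning approach that adapts to new entity types from a handful of annotated examples but does not specify the exact formulation. One common way to realize that idea is nearest-prototype classification: average the embeddings of the few support examples for each entity type, then label a new span by its most similar prototype. The sketch below illustrates this under explicit assumptions: it is not the authors' model, the two-dimensional vectors stand in for learned token embeddings, and the entity-type names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_prototypes(support_set):
    """Map each entity type to the mean embedding of its support examples."""
    return {label: mean_vector(vectors) for label, vectors in support_set.items()}


def classify(query, prototypes):
    """Return the entity type whose prototype is most similar to the query."""
    return max(prototypes, key=lambda label: cosine_similarity(query, prototypes[label]))


# Toy support set: two annotated examples per (hypothetical) entity type.
support = {
    "CATALYST": [[1.0, 0.0], [0.9, 0.1]],
    "SOLVENT": [[0.0, 1.0], [0.1, 0.9]],
}
prototypes = build_prototypes(support)
```

Here `classify([0.8, 0.2], prototypes)` returns `"CATALYST"`. Adding a new entity type only requires adding its few support embeddings to the dictionary, which is what makes this family of methods attractive in low-resource domains.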
Affiliation(s)
- Yue Zhang
  - Center for Plastics Innovation, University of Delaware, Newark, Delaware 19716, United States
  - Department of Electrical and Computer Engineering, University of Delaware, Newark, Delaware 19716, United States
- Dionisios G Vlachos
  - Center for Plastics Innovation, University of Delaware, Newark, Delaware 19716, United States
  - Department of Chemical and Biomolecular Engineering, University of Delaware, Newark, Delaware 19711, United States
- Dongxia Liu
  - Center for Plastics Innovation, University of Delaware, Newark, Delaware 19716, United States
  - Department of Chemical and Biomolecular Engineering, University of Delaware, Newark, Delaware 19711, United States
- Hui Fang
  - Center for Plastics Innovation, University of Delaware, Newark, Delaware 19716, United States
  - Department of Electrical and Computer Engineering, University of Delaware, Newark, Delaware 19716, United States
|
41
|
Li TZ, Still JM, Zuo L, Liu Y, Krishnan AR, Sandler KL, Maldonado F, Lasko TA, Landman BA. Longitudinal Masked Representation Learning for Pulmonary Nodule Diagnosis from Language Embedded EHRs. medRxiv 2025:2025.05.09.25327341. [PMID: 40385386 PMCID: PMC12083608 DOI: 10.1101/2025.05.09.25327341] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/20/2025]
Abstract
Electronic health records (EHRs) are a rich source of clinical data, yet exploiting longitudinal signals for pulmonary nodule diagnosis remains challenging due to the administrative noise and high level of clinical abstraction present in these records. Because of this complexity, classification models are prone to overfitting when labeled data is scarce. This study explores masked representation learning (MRL) as a strategy to improve pulmonary nodule diagnosis by modeling longitudinal EHRs across multiple modalities: clinical conditions, procedures, and medications. We leverage a web-scale text embedding model to encode EHR event streams into semantically embedded sequences. We then pretrain a bidirectional transformer using MRL conditioned on time encodings on a large cohort of general pulmonary conditions from our home institution. Evaluation on a cohort of diagnosed pulmonary nodules demonstrates significant improvement in diagnosis accuracy with a model finetuned from MRL (0.781 AUC, 95% CI: [0.780, 0.782]) compared to a supervised model with the same architecture (0.768 AUC, 95% CI: [0.766, 0.770]) when integrating all three modalities. These findings suggest that language-embedded MRL can facilitate downstream clinical classification, offering potential advancements in the comprehensive analysis of longitudinal EHR modalities.
Affiliation(s)
- Thomas Z Li
  - Department of Biomedical Engineering, Vanderbilt University, Nashville, TN
  - Medical Scientist Training Program, Vanderbilt University, Nashville, TN
- John M Still
  - Department of Biomedical Informatics, Vanderbilt University, Nashville, TN
- Lianrui Zuo
  - Department of Electrical and Computer Engineering, Vanderbilt University, Nashville, TN
- Yihao Liu
  - Department of Electrical and Computer Engineering, Vanderbilt University, Nashville, TN
- Aravind R Krishnan
  - Department of Electrical and Computer Engineering, Vanderbilt University, Nashville, TN
- Kim L Sandler
  - Department of Radiology and Radiological Sciences, Vanderbilt University Medical Center, Nashville, TN
- Fabien Maldonado
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
- Thomas A Lasko
  - Department of Biomedical Informatics, Vanderbilt University, Nashville, TN
- Bennett A Landman
  - Department of Biomedical Engineering, Vanderbilt University, Nashville, TN
  - Department of Biomedical Informatics, Vanderbilt University, Nashville, TN
  - Department of Electrical and Computer Engineering, Vanderbilt University, Nashville, TN
  - Department of Radiology and Radiological Sciences, Vanderbilt University Medical Center, Nashville, TN
|
42
|
Li W, Wang H, Li W, Zhao J, Sun Y. Generation-Based Few-Shot BioNER via Local Knowledge Index and Dual Prompts. Interdiscip Sci 2025:10.1007/s12539-025-00709-3. [PMID: 40347393 DOI: 10.1007/s12539-025-00709-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2024] [Revised: 04/01/2025] [Accepted: 04/05/2025] [Indexed: 05/12/2025]
Abstract
Few-shot Biomedical Named Entity Recognition (BioNER) presents significant challenges due to limited training data and the presence of nested and discontinuous entities. To tackle these issues, a novel approach GKP-BioNER, Generation-based Few-Shot BioNER via Local Knowledge Index and Dual Prompts, is proposed. It redefines BioNER as a generation task by integrating hard and soft prompts. Specifically, GKP-BioNER constructs a localized knowledge index using a Wikipedia dump, facilitating the retrieval of semantically relevant texts to the original sentence. These texts are then reordered to prioritize the most semantically relevant content to the input data, serving as hard prompts. This helps the model to address challenges demanding domain-specific insights. Simultaneously, GKP-BioNER preserves the integrity of the pre-trained models while introducing learnable parameters as soft prompts to guide the self-attention layer, allowing the model to adapt to the context. Moreover, a soft prompt mechanism is designed to support knowledge transfer across domains. Extensive experiments on five datasets demonstrate that GKP-BioNER significantly outperforms eight state-of-the-art methods. It shows robust performance in low-resource and complex scenarios across various domains, highlighting its strength in knowledge transfer and broad applicability.
Affiliation(s)
- Weixin Li
  - School of Information Science and Engineering, Shandong Normal University, Jinan, 250358, China
- Hong Wang
  - School of Information Science and Engineering, Shandong Normal University, Jinan, 250358, China
- Wei Li
  - School of Information Science and Engineering, Shandong Normal University, Jinan, 250358, China
- Jun Zhao
  - School of Information Science and Engineering, Shandong Normal University, Jinan, 250358, China
- Yanshen Sun
  - Department of Computer Science, Virginia Tech, Blacksburg, 24061, USA
|
43
|
Aghaarabi E, Murray D. Transformer-Based Language Models for Group Randomized Trial Classification in Biomedical Literature: Model Development and Validation. JMIR Med Inform 2025; 13:e63267. [PMID: 40344669 DOI: 10.2196/63267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2024] [Revised: 02/02/2025] [Accepted: 02/06/2025] [Indexed: 05/11/2025] Open
Abstract
Background For the public health community, monitoring recently published articles is crucial for staying informed about the latest research developments. However, identifying publications about studies with specific research designs from the extensive body of public health publications is a challenge with the currently available methods.
Objective Our objective is to develop a fine-tuned pretrained language model that can accurately identify publications from clinical trials that use a group- or cluster-randomized trial (GRT), individually randomized group-treatment trial (IRGT), or stepped wedge group- or cluster-randomized trial (SWGRT) design within the biomedical literature.
Methods We fine-tuned the BioMedBERT language model using a dataset of biomedical literature from the Office of Disease Prevention at the National Institutes of Health. The model was trained to classify publications into three categories of clinical trials that use nested designs. The model performance was evaluated on unseen data and demonstrated high sensitivity and specificity for each class.
Results When our proposed model was tested for generalizability with unseen data, it delivered high sensitivity and specificity for each class as follows: negatives (0.95 and 0.93), GRTs (0.94 and 0.90), IRGTs (0.81 and 0.97), and SWGRTs (0.96 and 0.99), respectively.
Conclusions Our work demonstrates the potential of fine-tuned, domain-specific language models to accurately identify publications reporting on complex and specialized study designs, addressing a critical need in the public health research community. This model offers a valuable tool for the public health community to directly identify publications from clinical trials that use one of the three classes of nested designs.
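Per-class sensitivity and specificity in a multiclass setting such as this one are usually computed one-vs-rest: each class in turn is treated as the positive label and all other classes as negative. The sketch below shows that calculation; it is an illustration rather than the authors' evaluation code, and the function name and toy labels are assumptions made for the example.

```python
def sensitivity_specificity(y_true, y_pred, positive_label):
    """One-vs-rest sensitivity and specificity for a single class.

    Returns (sensitivity, specificity); a value is NaN when its denominator
    is zero (no positive or no negative examples for this class).
    """
    tp = fn = fp = tn = 0
    for truth, prediction in zip(y_true, y_pred):
        truth_is_positive = truth == positive_label
        predicted_positive = prediction == positive_label
        if truth_is_positive and predicted_positive:
            tp += 1
        elif truth_is_positive:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels using the four classes named in the abstract.
y_true = ["GRT", "GRT", "IRGT", "negative", "SWGRT", "negative"]
y_pred = ["GRT", "negative", "IRGT", "negative", "SWGRT", "GRT"]
```

On these toy labels, `sensitivity_specificity(y_true, y_pred, "GRT")` returns `(0.5, 0.75)`: one of two true GRT publications is found, and three of four non-GRT publications are correctly rejected.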
Affiliation(s)
- Elaheh Aghaarabi
  - Office of Disease Prevention, National Institutes of Health, 6705 Rockledge Dr, Bethesda, MD, 20892, United States, 1 3014964000
- David Murray
  - Office of Disease Prevention, National Institutes of Health, 6705 Rockledge Dr, Bethesda, MD, 20892, United States, 1 3014964000
|
44
|
Han P, Wang J, Liu D, Liu L, Song T. Robust temporal knowledge inference via pathway snapshots with liquid neural network. Methods 2025; 241:24-32. [PMID: 40349883 DOI: 10.1016/j.ymeth.2025.05.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2025] [Revised: 04/30/2025] [Accepted: 05/08/2025] [Indexed: 05/14/2025] Open
Abstract
Static graphs play a pivotal role in modeling and analyzing biological and biomedical data. However, many real-world scenarios, such as disease progression and drug pharmacokinetic processes, exhibit dynamic behaviors. Consequently, static graph methods often struggle to robustly address new environments characterized by complex and previously unseen relationship changes. Here, we propose a method for constructing temporal knowledge inference agents tailored to disease pathways, enabling effective relation reasoning beyond their training environment under complex shifts. To achieve this, we developed an imitation learning framework using liquid neural networks, a class of continuous-time neural models inspired by brain function that are causal and adaptable to changing conditions. Our findings indicate that liquid agents can distill the essential tasks from knowledge graph inputs while accounting for temporal evolution, thereby enabling the transfer of temporal skills to novel time nodes. Compared to state-of-the-art deep reinforcement learning agents, experiments demonstrate that temporal robustness in decision-making emerges uniquely in liquid networks.
Affiliation(s)
- Peifu Han
  - College of Computer Science and Technology, China University of Petroleum, Qingdao 266580, China
- Jianmin Wang
  - The Interdisciplinary Graduate Program in Integrative Biotechnology, Yonsei University, Incheon 21983, Republic of Korea
- Dayan Liu
  - College of Computer Science and Technology, China University of Petroleum, Qingdao 266580, China
- Lin Liu
  - Department of Stomatology, The First Medical Center, Chinese PLA General Hospital, Beijing 100853, China
- Tao Song
  - College of Computer Science and Technology, China University of Petroleum, Qingdao 266580, China
|
45
|
Tait K, Cronin J, Wiper O, Wallis J, Davies J, Dürichen R. ArcTEX-a novel clinical data enrichment pipeline to support real-world evidence oncology studies. Front Digit Health 2025; 7:1561358. [PMID: 40416094 PMCID: PMC12098606 DOI: 10.3389/fdgth.2025.1561358] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2025] [Accepted: 04/23/2025] [Indexed: 05/27/2025] Open
Abstract
Data stored within electronic health records (EHRs) offer a valuable source of information for real-world evidence (RWE) studies in oncology. However, many key clinical features are only available within unstructured notes. We present ArcTEX, a novel data enrichment pipeline developed to extract oncological features from NHS unstructured clinical notes with high accuracy, even in resource-constrained environments where availability of GPUs might be limited. By design, the predicted outcomes of ArcTEX are free of patient-identifiable information, making this pipeline ideally suited for use in Trust environments. We compare our pipeline to existing discriminative and generative models, demonstrating its superiority over approaches such as Llama3/3.1/3.2 and other BERT-based models, with a mean accuracy of 98.67% for several essential clinical features in endometrial and breast cancer. Additionally, we show that as few as 50 annotated training examples are needed to adapt the model to a different oncology area, such as lung cancer, with a different set of priority clinical features, achieving a comparable mean accuracy of 95%.
Affiliation(s)
- Jim Davies
- Department of Computer Science, University of Oxford, Oxford, United Kingdom
46
Moassefi M, Houshmand S, Faghani S, Chang PD, Sun SH, Khosravi B, Triphati AG, Rasool G, Bhatia NK, Folio L, Andriole KP, Gichoya JW, Erickson BJ. Cross-Institutional Evaluation of Large Language Models for Radiology Diagnosis Extraction: A Prompt-Engineering Perspective. Journal of Imaging Informatics in Medicine 2025:10.1007/s10278-025-01523-5. [PMID: 40341981 DOI: 10.1007/s10278-025-01523-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2025] [Revised: 04/14/2025] [Accepted: 04/23/2025] [Indexed: 05/11/2025]
Abstract
The rapid evolution of large language models (LLMs) offers promising opportunities for radiology report annotation, aiding in determining the presence of specific findings. This study evaluates the effectiveness of a human-optimized prompt in labeling radiology reports across multiple institutions using LLMs. Six distinct institutions collected 500 radiology reports: 100 in each of 5 categories. A standardized Python script was distributed to participating sites, allowing the use of one common locally executed LLM with a standard human-optimized prompt. The script ran the LLM on each report and compared its predictions to reference labels provided by local investigators. Model performance was measured by accuracy, and results were aggregated centrally. The human-optimized prompt demonstrated high consistency across sites and pathologies. Preliminary analysis indicates significant agreement between the LLM's outputs and the investigator-provided reference labels across multiple institutions. At one site, eight LLMs were systematically compared, with Llama 3.1 70b achieving the highest performance in accurately identifying the specified findings. Comparable performance with Llama 3.1 70b was observed at two additional centers, demonstrating the model's robust adaptability to variations in report structures and institutional practices. Our findings illustrate the potential of optimized prompt engineering in leveraging LLMs for cross-institutional radiology report labeling. This approach is straightforward while maintaining high accuracy and adaptability. Future work will explore model robustness to diverse report structures and further refine prompts to improve generalizability.
Affiliation(s)
- Mana Moassefi
- Mayo Clinic Artificial Intelligence Lab, Department of Radiology, Mayo Clinic, 200 1st Street, S.W., Rochester, MN, 55905, USA
- Sina Houshmand
- Department of Radiology, University of California San Francisco, San Francisco, CA, USA
- Shahriar Faghani
- Mayo Clinic Artificial Intelligence Lab, Department of Radiology, Mayo Clinic, 200 1st Street, S.W., Rochester, MN, 55905, USA
- Peter D Chang
- Departments of Radiological Sciences and Computer Science, University of California, Irvine, CA, USA
- The Center for Artificial Intelligence in Diagnostic Medicine (CAIDM), University of California, Irvine, CA, USA
- Shawn H Sun
- Departments of Radiological Sciences and Computer Science, University of California, Irvine, CA, USA
- Bardia Khosravi
- Mayo Clinic Artificial Intelligence Lab, Department of Radiology, Mayo Clinic, 200 1st Street, S.W., Rochester, MN, 55905, USA
- Neil K Bhatia
- Department of Radiology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Les Folio
- Moffitt Cancer Center, Tampa, FL, USA
- Katherine P Andriole
- Department of Radiology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Judy W Gichoya
- Department of Radiology and Imaging Sciences, Emory University School of Medicine, Atlanta, GA, USA
- Healthcare AI Innovation and Translational Informatics (HITI) Lab, Emory University School of Medicine, Atlanta, GA, USA
- Bradley J Erickson
- Mayo Clinic Artificial Intelligence Lab, Department of Radiology, Mayo Clinic, 200 1st Street, S.W., Rochester, MN, 55905, USA.
47
Tomita K, Nishida T, Kitaguchi Y, Kitazawa K, Miyake M. Image Recognition Performance of GPT-4V(ision) and GPT-4o in Ophthalmology: Use of Images in Clinical Questions. Clin Ophthalmol 2025; 19:1557-1564. [PMID: 40357454 PMCID: PMC12068282 DOI: 10.2147/opth.s494480] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2024] [Accepted: 04/09/2025] [Indexed: 05/15/2025] Open
Abstract
Purpose: To compare the diagnostic accuracy of Generative Pre-trained Transformer (GPT)-4, GPT-4 with Vision (GPT-4V), and GPT-4o for clinical questions in ophthalmology. Patients and Methods: The questions were collected from the "Diagnosis This" section on the American Academy of Ophthalmology website. We tested 580 questions and presented ChatGPT with the same questions under two conditions: 1) multimodal, incorporating both the question text and associated images, and 2) text-only. We then compared accuracy among the multimodal (GPT-4o and GPT-4V) and text-only (GPT-4V) models using McNemar tests. The overall percentage of correct answers reported on the website was also collected. Results: Multimodal GPT-4o achieved the highest accuracy (77.1%), followed by multimodal GPT-4V (71.0%) and text-only GPT-4V (68.7%) (P values < 0.001, 0.012, and 0.001, respectively). All GPT-4 models showed higher accuracy than the overall correct-answer rate on the website (64.6%). Conclusion: The addition of information from images enhances the performance of GPT-4V in diagnosing clinical questions in ophthalmology. This suggests that integrating multimodal data could be crucial in developing more effective and reliable diagnostic tools in medical fields.
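The abstract compares paired model accuracies with McNemar tests but does not say which variant was used. As an illustration only, the following is a minimal sketch of the exact (binomial) form of the test, assuming each model's per-question correctness is available as a list of booleans. The function names and the toy data are hypothetical and are not taken from the study.

```python
from math import comb


def discordant_counts(correct_a, correct_b):
    """Count questions that exactly one of the two models answered correctly."""
    if len(correct_a) != len(correct_b):
        raise ValueError("paired inputs must have the same length")
    # b: model A right, model B wrong; c: model A wrong, model B right
    b = sum(1 for a, other in zip(correct_a, correct_b) if a and not other)
    c = sum(1 for a, other in zip(correct_a, correct_b) if not a and other)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1))
    return min(1.0, 2 * tail / 2 ** n)


# Hypothetical toy data: correctness of two models on the same 8 questions.
model_a = [True, True, True, False, True, True, False, True]
model_b = [True, False, False, False, True, False, False, True]

b, c = discordant_counts(model_a, model_b)
p_value = mcnemar_exact(b, c)
print(b, c, p_value)
```

Only the discordant pairs enter the test, which is why two models with similar overall accuracy can still differ significantly when they fail on different questions.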
Affiliation(s)
- Kosei Tomita
- Department of Ophthalmology, Kawasaki Medical School, Okayama, Japan
- Takashi Nishida
- Hamilton Glaucoma Center, Shiley Eye Institute, Viterbi Family Department of Ophthalmology, University of California, San Diego, La Jolla, CA, USA
- Yoshiyuki Kitaguchi
- Department of Ophthalmology, Osaka University Graduate School of Medicine, Osaka, Japan
- Koji Kitazawa
- Department of Ophthalmology, Kyoto Prefectural University of Medicine, Kyoto, Japan
- Masahiro Miyake
- Department of Ophthalmology and Visual Sciences, Kyoto University Graduate School of Medicine, Kyoto, Japan
48
Alkhoury N, Shaik M, Wurmus R, Akalin A. Enhancing biomarker based oncology trial matching using large language models. NPJ Digit Med 2025; 8:250. [PMID: 40325165 PMCID: PMC12053753 DOI: 10.1038/s41746-025-01673-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2024] [Accepted: 04/24/2025] [Indexed: 05/07/2025] Open
Abstract
Clinical trials are an essential component of drug development for new cancer treatments, yet the information required to determine a patient's eligibility for enrollment is scattered in large amounts of unstructured text. Genomic biomarkers are especially important in precision medicine and targeted therapies, making them essential for matching patients to appropriate trials. Large language models (LLMs) offer a promising solution for extracting this information from clinical trial study descriptions (e.g., brief summary, eligibility criteria), aiding in identifying suitable patient matches in downstream applications. In this study, we explore various strategies for extracting genetic biomarkers from oncology trials. Therefore, our focus is on structuring unstructured clinical trial data, not processing individual patient records. Our results show that open-source language models, when applied out-of-the-box, effectively capture complex logical expressions and structure genomic biomarkers, outperforming closed-source models such as GPT-4. Furthermore, fine-tuning these open-source models with additional data significantly enhances their performance.
Affiliation(s)
- Nour Alkhoury
- Berlin Institute for Medical Systems Biology (BIMSB), Max Delbrück Center for Molecular Medicine, Berlin, Germany
- Maqsood Shaik
- Berlin Institute for Medical Systems Biology (BIMSB), Max Delbrück Center for Molecular Medicine, Berlin, Germany
- Ricardo Wurmus
- Berlin Institute for Medical Systems Biology (BIMSB), Max Delbrück Center for Molecular Medicine, Berlin, Germany
- Altuna Akalin
- Berlin Institute for Medical Systems Biology (BIMSB), Max Delbrück Center for Molecular Medicine, Berlin, Germany.
49
Naufal T, Mahendra R, Wicaksono AF. Sentences, entities, and keyphrases extraction from consumer health forums using multi-task learning. J Biomed Semantics 2025; 16:8. [PMID: 40329333 PMCID: PMC12057135 DOI: 10.1186/s13326-025-00329-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2024] [Accepted: 04/08/2025] [Indexed: 05/08/2025] Open
Abstract
PURPOSE: Online consumer health forums offer an alternative source of health-related information for internet users seeking specific details that may not be readily available through articles or other one-way communication channels. However, the effectiveness of these forums can be constrained by the limited number of healthcare professionals actively participating, which can impact response times to user inquiries. One potential solution to this issue is the integration of a semi-automatic system. A critical component of such a system is question processing, which often involves sentence recognition (SR), medical entity recognition (MER), and keyphrase extraction (KE) modules. We posit that the development of these three modules would enable the system to identify critical components of the question, thereby facilitating a deeper understanding of the question and allowing for the re-formulation of more effective questions with extracted key information. METHODS: This work contributes to two key aspects related to these three tasks. First, we expand and publicly release an Indonesian dataset for each task. Second, we establish a baseline for all three tasks within the Indonesian language domain by employing transformer-based models with nine distinct encoder variations. Our feature studies revealed an interdependence among these three tasks. Consequently, we propose several multi-task learning (MTL) models, in both pairwise and three-way configurations, incorporating parallel and hierarchical architectures. RESULTS: Using F1-score at the chunk level, the inter-annotator agreements for the SR, MER, and KE tasks were 88.61%, 64.83%, and 35.01%, respectively. In single-task learning (STL) settings, the best performance for each task was achieved by a different model, with IndoNLU LARGE obtaining the highest average score. These results suggested that a larger model did not always perform better. We also found no indication of whether Indonesian or multilingual language models generally performed better for our tasks. In pairwise MTL settings, we found that pairing tasks could outperform the STL baseline for all three tasks. Despite varying loss weights across our three-way MTL models, we did not identify a consistent pattern. While some configurations improved MER and KE performance, none surpassed the best pairwise MTL model for the SR task. CONCLUSION: We extended an Indonesian dataset for the SR, MER, and KE tasks, resulting in 1,173 labeled data points, split into 773 training instances, 200 validation instances, and 200 testing instances. We then used transformer-based models to set a baseline for all three tasks. Our MTL experiments suggested that additional information regarding the other two tasks could help the learning process for the MER and KE tasks, while having only a small effect on the SR task.
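The abstract reports inter-annotator agreement as a chunk-level F1-score without giving the matching rule. One common reading, shown here as a sketch only, treats one annotator's chunks as the reference and counts a chunk as matched when its start, end, and label are all identical. That exact-match criterion, the `(start, end, label)` tuple layout, and the labels in the toy data are assumptions, not details from the paper.

```python
def chunk_f1(chunks_a, chunks_b):
    """F1 between two annotators' chunk sets, using exact span-and-label match.

    Each chunk is a hypothetical (start, end, label) tuple. Annotator A is
    treated as the reference; the score is symmetric in A and B.
    """
    set_a, set_b = set(chunks_a), set(chunks_b)
    if not set_a and not set_b:
        return 1.0  # both annotators marked nothing: full agreement
    matched = len(set_a & set_b)
    if matched == 0:
        return 0.0
    precision = matched / len(set_b)
    recall = matched / len(set_a)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy annotations of the same forum question.
annotator_1 = [(0, 2, "DISEASE"), (5, 6, "DRUG"), (9, 11, "SYMPTOM")]
annotator_2 = [(0, 2, "DISEASE"), (5, 7, "DRUG"), (9, 11, "SYMPTOM"), (12, 13, "DRUG")]

f1 = chunk_f1(annotator_1, annotator_2)
print(round(f1, 4))
```

Under exact matching, a chunk whose boundary differs by a single token counts as a full disagreement, which is one plausible reason agreement on a loosely bounded task such as keyphrase extraction comes out much lower than on sentence recognition.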
Affiliation(s)
- Tsaqif Naufal
- Faculty of Computer Science, Universitas Indonesia, Kampus UI, 16424, Depok, West Java, Indonesia
- Rahmad Mahendra
- Faculty of Computer Science, Universitas Indonesia, Kampus UI, 16424, Depok, West Java, Indonesia
- Alfan Farizki Wicaksono
- Faculty of Computer Science, Universitas Indonesia, Kampus UI, 16424, Depok, West Java, Indonesia.
50
Chang YC, Huang MS, Huang YH, Lin YH. The influence of prompt engineering on large language models for protein-protein interaction identification in biomedical literature. Sci Rep 2025; 15:15493. [PMID: 40319086 PMCID: PMC12049485 DOI: 10.1038/s41598-025-99290-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2024] [Accepted: 04/18/2025] [Indexed: 05/07/2025] Open
Abstract
Identifying protein-protein interactions (PPIs) is a foundational task in biomedical natural language processing. While specialized models have been developed, the potential of general-domain large language models (LLMs) in PPI extraction, particularly for researchers without computational expertise, remains unexplored. This study evaluates the effectiveness of proprietary LLMs (GPT-3.5, GPT-4, and Google Gemini) in PPI prediction through systematic prompt engineering. We designed six prompting scenarios of increasing complexity, from basic interaction queries to sophisticated entity-tagged formats, and assessed model performance across multiple benchmark datasets (LLL, IEPA, HPRD50, AIMed, BioInfer, and PEDD). Carefully designed prompts effectively guided LLMs in PPI prediction. Gemini 1.5 Pro achieved the highest performance across most datasets, with notable F1-scores in LLL (90.3%), IEPA (68.2%), HPRD50 (67.5%), and PEDD (70.2%). GPT-4 showed competitive performance, particularly in the LLL dataset (87.3%). We identified and addressed a positive prediction bias, demonstrating improved performance after evaluation refinement. While not surpassing specialized models, general-purpose LLMs with appropriate prompting strategies can effectively perform PPI prediction tasks, offering valuable tools for biomedical researchers without extensive computational expertise.
Affiliation(s)
- Yung-Chun Chang
- Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan.
- Clinical Big Data Research Center, Taipei Medical University Hospital, Taipei, Taiwan.
- Ming-Siang Huang
- Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
- Yi-Hsuan Huang
- Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
- Yi-Hsuan Lin
- Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan