1
Munzir SI, Hier DB, Oommen C, Carrithers MD. A Large Language Model Outperforms Other Computational Approaches to the High-Throughput Phenotyping of Physician Notes. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2025; 2024:838-846. [PMID: 40417529 PMCID: PMC12099424] [Indexed: 05/27/2025]
Abstract
High-throughput phenotyping, the automated mapping of patient signs and symptoms to standardized ontology concepts, is essential for realizing value from electronic health records (EHR) in support of precision medicine. Despite technological advances, high-throughput phenotyping remains a challenge. This study compares three computational approaches to high-throughput phenotyping: a large language model (LLM) incorporating generative AI, a deep learning (DL) approach utilizing span categorization, and a machine learning (ML) approach with word embeddings. The LLM approach that implemented GPT-4 demonstrated superior performance, suggesting that large language models are poised to become the preferred method for high-throughput phenotyping of physician notes.
Affiliation(s)
- Syed I Munzir
  - Department of Neurology and Rehabilitation, University of Illinois at Chicago, Chicago, USA
- Daniel B Hier
  - Department of Neurology and Rehabilitation, University of Illinois at Chicago, Chicago, USA
  - Kummer Institute, Missouri University of Science and Technology, Rolla, MO, USA
- Chelsea Oommen
  - Department of Neurology and Rehabilitation, University of Illinois at Chicago, Chicago, USA
- Michael D Carrithers
  - Department of Neurology and Rehabilitation, University of Illinois at Chicago, Chicago, USA
2
Lu Q, Li R, Wen A, Wang J, Wang L, Liu H. Large Language Models Struggle in Token-Level Clinical Named Entity Recognition. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2025; 2024:748-757. [PMID: 40417588 PMCID: PMC12099373] [Indexed: 05/27/2025]
Abstract
Large Language Models (LLMs) have revolutionized various sectors, including healthcare, where they are employed in diverse applications. Their utility is particularly significant in the context of rare diseases, where data scarcity, complexity, and specificity pose considerable challenges. In the clinical domain, Named Entity Recognition (NER) stands out as an essential task, and it plays a crucial role in extracting relevant information from clinical texts. Despite the promise of LLMs, current research mostly concentrates on document-level NER, identifying entities in a more general context across entire documents, without extracting their precise location. Additionally, efforts have been directed towards adapting ChatGPT for token-level NER. However, there is a significant research gap when it comes to employing token-level NER for clinical texts, especially with the use of local open-source LLMs. This study aims to bridge this gap by investigating the effectiveness of both proprietary and local LLMs in token-level clinical NER. Essentially, we delve into the capabilities of these models through a series of experiments involving zero-shot prompting, few-shot prompting, retrieval-augmented generation (RAG), and instruction-fine-tuning. Our exploration reveals the inherent challenges LLMs face in token-level NER, particularly in the context of rare diseases, and suggests possible improvements for their application in healthcare. This research contributes to narrowing a significant gap in healthcare informatics and offers insights that could lead to a more refined application of LLMs in the healthcare sector.
Affiliation(s)
- Qiuhao Lu
  - McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, USA
- Rui Li
  - McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, USA
- Andrew Wen
  - McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, USA
- Jinlian Wang
  - McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, USA
- Liwei Wang
  - McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, USA
- Hongfang Liu
  - McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, USA
3
Merdler-Rabinowicz R, Omar M, Ganesh J, Morava E, Nadkarni GN, Klang E. The role of large language models in medical genetics. Mol Genet Metab 2025; 145:109098. [PMID: 40154187 DOI: 10.1016/j.ymgme.2025.109098] [Indexed: 04/01/2025]
Affiliation(s)
- Mahmud Omar
  - Tel-Aviv University, Faculty of Medicine, Tel-Aviv, Israel
- Jaya Ganesh
  - Department of Genomics and Genetic Sciences, Icahn School of Medicine, New York, NY, USA
- Eva Morava
  - Department of Genomics and Genetic Sciences, Icahn School of Medicine, New York, NY, USA
- Girish N Nadkarni
  - Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Eyal Klang
  - Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
4
Germain DP, Gruson D, Malcles M, Garcelon N. Applying artificial intelligence to rare diseases: a literature review highlighting lessons from Fabry disease. Orphanet J Rare Dis 2025; 20:186. [PMID: 40247315 PMCID: PMC12007257 DOI: 10.1186/s13023-025-03655-x] [Received: 07/31/2024] [Accepted: 03/06/2025] [Indexed: 04/19/2025]
Abstract
BACKGROUND Use of artificial intelligence (AI) in rare diseases has grown rapidly in recent years. In this review we have outlined the most common machine-learning and deep-learning methods currently being used to classify and analyse large amounts of data, such as standardized images or specific text in electronic health records. To illustrate how these methods have been adapted or developed for use with rare diseases, we have focused on Fabry disease, an X-linked genetic disorder caused by lysosomal α-galactosidase A deficiency that can result in multiple organ damage. METHODS We searched PubMed for articles focusing on AI, rare diseases, and Fabry disease published anytime up to 08 January 2025. Further searches, limited to articles published between 01 January 2021 and 31 December 2023, were also performed using double combinations of keywords related to AI and each organ affected in Fabry disease, and AI and rare diseases. RESULTS In total, 20 articles on AI and Fabry disease were included. In the rare disease field, AI methods may be applied prospectively to large populations to identify specific patients, or retrospectively to large data sets to diagnose a previously overlooked rare disease. Different AI methods may facilitate Fabry disease diagnosis, help monitor progression in affected organs, and potentially contribute to personalized therapy development. The implementation of AI methods in general healthcare and medical imaging centres may help raise awareness of rare diseases and prompt general practitioners to consider these conditions earlier in the diagnostic pathway, while chatbots and telemedicine may accelerate patient referral to rare disease experts. The use of AI technologies in healthcare may generate specific ethical risks, prompting new AI regulatory frameworks aimed at addressing these issues to be established in Europe and the United States. CONCLUSION AI-based methods will lead to substantial improvements in the diagnosis and management of rare diseases. The need for a human guarantee of AI is a key issue in pursuing innovation while ensuring that human involvement remains at the centre of patient care during this technological revolution.
Affiliation(s)
- Dominique P Germain
  - Division of Medical Genetics, University of Versailles-St Quentin en Yvelines (UVSQ), Paris-Saclay University, 2 avenue de la Source de la Bièvre, 78180, Montigny, France
  - First Faculty of Medicine, Charles University, Prague, Czech Republic
- David Gruson
  - Ethik-IA, PariSanté Campus, 10 Rue Oradour-Sur-Glane, 75015, Paris, France
- Nicolas Garcelon
  - Imagine Institute, Data Science Platform, INSERM UMR 1163, Université de Paris, 75015, Paris, France
5
Garcia BT, Westerfield L, Yelemali P, Gogate N, Andres Rivera-Munoz E, Du H, Dawood M, Jolly A, Lupski JR, Posey JE. Improving Automated Deep Phenotyping Through Large Language Models Using Retrieval Augmented Generation. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.12.01.24318253. [PMID: 39677442 PMCID: PMC11643181 DOI: 10.1101/2024.12.01.24318253] [Indexed: 12/17/2024]
Abstract
Background Diagnosing rare genetic disorders relies on precise phenotypic and genotypic analysis, with the Human Phenotype Ontology (HPO) providing a standardized language for capturing clinical phenotypes. Traditional HPO tools, such as Doc2HPO and ClinPhen, employ concept recognition to automate phenotype extraction but struggle with incomplete phenotype assignment, often requiring intensive manual review. While large language models (LLMs) hold promise for more context-driven phenotype extraction, they are prone to errors and "hallucinations," making them less reliable without further refinement. We present RAG-HPO, a Python-based tool that leverages Retrieval-Augmented Generation (RAG) to elevate LLM accuracy in HPO term assignment, bypassing the limitations of baseline models while avoiding the time- and resource-intensive process of fine-tuning. RAG-HPO integrates a dynamic vector database, allowing real-time retrieval and contextual matching. Methods The high-dimensional vector database utilized by RAG-HPO includes >54,000 phenotypic phrases mapped to HPO IDs, derived from the HPO database and supplemented with additional validated phrases. The RAG-HPO workflow uses an LLM to first extract phenotypic phrases, which are then matched via semantic similarity to entries within a vector database; the best term matches are then provided back to the LLM as context for final HPO term assignment. A benchmarking dataset of 120 published case reports with 1,792 manually-assigned HPO terms was developed, and the performance of RAG-HPO was measured against existing published tools Doc2HPO, ClinPhen, and FastHPOCR. Results In evaluations, RAG-HPO, powered by Llama-3 70B and applied to a set of 120 case reports, achieved a mean precision of 0.84, recall of 0.78, and an F1 score of 0.80, significantly surpassing conventional tools (p<0.00001). False positive HPO term identification occurred for 15.8% (256/1,624) of terms, of which only 2.7% (7/256) represented hallucinations, and 33.6% (86/256) unrelated terms; the remainder of false positives (63.7%, 163/256) were relative terms of the target term. Conclusions RAG-HPO is a user-friendly, adaptable tool designed for secure evaluation of clinical text and outperforms standard HPO-matching tools in precision, recall, and F1. Its enhanced precision and recall represent a substantial advancement in phenotypic analysis, accelerating the identification of genetic mechanisms underlying rare diseases and driving progress in genetic research and clinical genomics.
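The retrieval step described in the Methods of this abstract (matching an extracted phrase to indexed phrases by semantic similarity, then passing the best matches on as context) can be illustrated with a minimal sketch. This is not the RAG-HPO implementation: the four-entry `PHRASE_INDEX`, the bag-of-words `embed` function standing in for a learned embedding model, and the names `cosine` and `retrieve_candidates` are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter

# Hypothetical miniature phrase-to-HPO index (illustrative only); the abstract
# describes a vector database of >54,000 phrases.
PHRASE_INDEX = {
    "seizure": "HP:0001250",
    "global developmental delay": "HP:0001263",
    "microcephaly": "HP:0000252",
    "short stature": "HP:0004322",
}


def embed(text):
    """Toy embedding: bag-of-words counts, a stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_candidates(phrase, k=2):
    """Return up to k (indexed phrase, HPO ID) pairs most similar to `phrase`.

    In a RAG-style workflow these candidates would be handed to the LLM as
    context for the final term assignment; that step is omitted here.
    """
    query = embed(phrase)
    scored = [
        (cosine(query, embed(indexed)), indexed, hpo_id)
        for indexed, hpo_id in PHRASE_INDEX.items()
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(indexed, hpo_id) for score, indexed, hpo_id in scored[:k] if score > 0]


if __name__ == "__main__":
    print(retrieve_candidates("delayed global developmental milestones"))
```

A real system would replace `embed` with a sentence-embedding model and the dictionary scan with a vector-database query; the toy version only shows the shape of the retrieve-then-assign design, in which the LLM chooses among a short list of valid ontology terms rather than generating identifiers freely.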
Affiliation(s)
- Brandon T. Garcia
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
  - Medical Scientist Training Program, Baylor College of Medicine, Houston, TX, 77030, USA
  - Genetics and Genomics Graduate Program, Baylor College of Medicine, Houston, TX, 77030, USA
- Lauren Westerfield
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
  - Texas Children’s Hospital, Houston, TX, 77303, USA
- Priya Yelemali
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Nikhita Gogate
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
  - Genetics and Genomics Graduate Program, Baylor College of Medicine, Houston, TX, 77030, USA
- E. Andres Rivera-Munoz
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
  - Genetics and Genomics Graduate Program, Baylor College of Medicine, Houston, TX, 77030, USA
- Haowei Du
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Moez Dawood
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
  - Medical Scientist Training Program, Baylor College of Medicine, Houston, TX, 77030, USA
  - Genetics and Genomics Graduate Program, Baylor College of Medicine, Houston, TX, 77030, USA
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA
- Angad Jolly
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- James R. Lupski
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
  - Texas Children’s Hospital, Houston, TX, 77303, USA
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA
  - Department of Pediatrics, Baylor College of Medicine, Houston, TX, 77030, USA
- Jennifer E. Posey
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
6
Chen F, Ahimaz P, Nguyen QM, Lewis R, Chung WK, Ta CN, Szigety KM, Sheppard SE, Campbell IM, Wang K, Weng C, Liu C. Phenotype driven molecular genetic test recommendation for diagnosing pediatric rare disorders. NPJ Digit Med 2024; 7:333. [PMID: 39572625 PMCID: PMC11582592 DOI: 10.1038/s41746-024-01331-1] [Received: 11/10/2023] [Accepted: 11/07/2024] [Indexed: 11/24/2024]
Abstract
Patients with rare diseases often experience prolonged diagnostic delays. Ordering appropriate genetic tests is crucial yet challenging, especially for general pediatricians without genetic expertise. Recent American College of Medical Genetics (ACMG) guidelines embrace early use of exome sequencing (ES) or genome sequencing (GS) for conditions like congenital anomalies or developmental delays, while still recommending gene panels for patients exhibiting strong manifestations of a specific disease. Recognizing the difficulty in navigating these options, we developed a machine learning model trained on 1005 patient records from Columbia University Irving Medical Center to recommend appropriate genetic tests based on the phenotype information. The model achieved an AUROC of 0.823 and an AUPRC of 0.918, aligning closely with decisions made by genetic specialists, and demonstrated strong generalizability (AUROC: 0.77, AUPRC: 0.816) in an external cohort, indicating its potential value for general pediatricians to expedite rare disease diagnosis by enhancing genetic test ordering.
Affiliation(s)
- Fangyi Chen
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Priyanka Ahimaz
  - Department of Pediatrics, Columbia University, New York, NY, USA
  - Institute of Genomic Medicine, Columbia University, New York, NY, USA
- Quan M Nguyen
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
  - Department of Bioengineering, University of Pennsylvania, Philadelphia, PA, USA
- Rachel Lewis
  - Department of Pediatrics, Columbia University, New York, NY, USA
- Wendy K Chung
  - Division of Genetics and Genomics, Department of Pediatrics, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
- Casey N Ta
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Katherine M Szigety
  - Division of Human Genetics, Department of Pediatrics, Children's Hospital of Philadelphia, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Sarah E Sheppard
  - Division of Human Genetics, Department of Pediatrics, Children's Hospital of Philadelphia, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Ian M Campbell
  - Division of Human Genetics, Department of Pediatrics, Children's Hospital of Philadelphia, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Kai Wang
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Cong Liu
  - Division of Genetics and Genomics, Department of Pediatrics, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
7
Tekumalla R, Banda JM. Towards automated phenotype definition extraction using large language models. Genomics Inform 2024; 22:21. [PMID: 39482749 PMCID: PMC11529293 DOI: 10.1186/s44342-024-00023-2] [Received: 05/31/2024] [Accepted: 09/29/2024] [Indexed: 11/03/2024]
Abstract
Electronic phenotyping involves a detailed analysis of both structured and unstructured data, employing rule-based methods, machine learning, natural language processing, and hybrid approaches. Currently, the development of accurate phenotype definitions demands extensive literature reviews and clinical experts, rendering the process time-consuming and inherently unscalable. Large language models offer a promising avenue for automating phenotype definition extraction but come with significant drawbacks, including reliability issues, the tendency to generate non-factual data ("hallucinations"), misleading results, and potential harm. To address these challenges, our study embarked on two key objectives: (1) defining a standard evaluation set to ensure large language model outputs are both useful and reliable and (2) evaluating various prompting approaches to extract phenotype definitions from large language models, assessing them with our established evaluation task. Our findings reveal promising results that still require human evaluation and validation for this task. However, enhanced phenotype extraction is possible, reducing the amount of time spent in literature review and evaluation.
Affiliation(s)
- Juan M Banda
  - Stanford Health Care, Stanford, CA, USA
  - Observational Health Data Sciences and Informatics, New York, NY, USA
8
Walsh CG, Wilimitis D, Chen Q, Wright A, Kolli J, Robinson K, Ripperger MA, Johnson KB, Carrell D, Desai RJ, Mosholder A, Dharmarajan S, Adimadhyam S, Fabbri D, Stojanovic D, Matheny ME, Bejan CA. Scalable incident detection via natural language processing and probabilistic language models. Sci Rep 2024; 14:23429. [PMID: 39379449 PMCID: PMC11461638 DOI: 10.1038/s41598-024-72756-7] [Received: 12/21/2023] [Accepted: 09/10/2024] [Indexed: 10/10/2024]
Abstract
Post-marketing safety surveillance depends in part on the ability to detect concerning clinical events at scale. Spontaneous reporting might be an effective component of safety surveillance, but it requires awareness and understanding among healthcare professionals to achieve its potential. Reliance on readily available structured data such as diagnostic codes risks under-coding and imprecision. Clinical textual data might bridge these gaps, and natural language processing (NLP) has been shown to aid in scalable phenotyping across healthcare records in multiple clinical domains. In this study, we developed and validated a novel incident phenotyping approach using unstructured clinical textual data agnostic to Electronic Health Record (EHR) and note type. It is based on a published, validated approach (PheRe) used to ascertain social determinants of health and suicidality across entire healthcare records. To demonstrate generalizability, we validated this approach on two separate phenotypes that share common challenges with respect to accurate ascertainment: (1) suicide attempt; (2) sleep-related behaviors. With samples of 89,428 records and 35,863 records for suicide attempt and sleep-related behaviors, respectively, we conducted silver standard (diagnostic coding) and gold standard (manual chart review) validation. We showed Area Under the Precision-Recall Curve (AUPR) of ~0.77 (95% CI 0.75-0.78) for suicide attempt and AUPR of ~0.31 (95% CI 0.28-0.34) for sleep-related behaviors. We also evaluated performance by coded race and demonstrated that differences in performance by race varied across phenotypes. Scalable phenotyping models, like most healthcare AI, require algorithmovigilance and debiasing prior to implementation.
Affiliation(s)
- Colin G Walsh
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
  - Department of Psychiatry and Behavioral Sciences, Vanderbilt University Medical Center, Nashville, TN, USA
  - Vanderbilt University Medical Center, Nashville, USA
- Drew Wilimitis
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Qingxia Chen
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
- Aileen Wright
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Jhansi Kolli
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Katelyn Robinson
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Michael A Ripperger
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Kevin B Johnson
  - Department of Biostatistics, Epidemiology and Informatics, and Pediatrics, University of Pennsylvania, Pennsylvania, USA
  - Department of Computer and Information Science, Bioengineering, University of Pennsylvania, Pennsylvania, USA
  - Department of Science Communication, University of Pennsylvania, Pennsylvania, USA
- David Carrell
  - Washington Health Research Institute, Kaiser Permanente Washington, Washington, USA
- Rishi J Desai
  - Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, USA
- Andrew Mosholder
  - Center for Drug Evaluation and Research, United States Food and Drug Administration, Maryland, USA
  - Office of Surveillance and Epidemiology, United States Food and Drug Administration, Maryland, USA
- Sai Dharmarajan
  - Center for Drug Evaluation and Research, United States Food and Drug Administration, Maryland, USA
  - Office of Translational Science, United States Food and Drug Administration, Maryland, USA
- Sruthi Adimadhyam
  - Department of Population Medicine, Harvard Medical School, Harvard Pilgrim Health Care Institute, Boston, USA
- Daniel Fabbri
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Danijela Stojanovic
  - Center for Drug Evaluation and Research, United States Food and Drug Administration, Maryland, USA
  - Office of Surveillance and Epidemiology, United States Food and Drug Administration, Maryland, USA
- Michael E Matheny
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Cosmin A Bejan
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
9
Wu J, Dong H, Li Z, Wang H, Li R, Patra A, Dai C, Ali W, Scordis P, Wu H. A hybrid framework with large language models for rare disease phenotyping. BMC Med Inform Decis Mak 2024; 24:289. [PMID: 39375687 PMCID: PMC11460004 DOI: 10.1186/s12911-024-02698-7] [Received: 07/10/2024] [Accepted: 09/26/2024] [Indexed: 10/09/2024]
Abstract
PURPOSE Rare diseases pose significant challenges in diagnosis and treatment due to their low prevalence and heterogeneous clinical presentations. Unstructured clinical notes contain valuable information for identifying rare diseases, but manual curation is time-consuming and prone to subjectivity. This study aims to develop a hybrid approach combining dictionary-based natural language processing (NLP) tools with large language models (LLMs) to improve rare disease identification from unstructured clinical reports. METHODS We propose a novel hybrid framework that integrates the Orphanet Rare Disease Ontology (ORDO) and the Unified Medical Language System (UMLS) to create a comprehensive rare disease vocabulary. SemEHR, a dictionary-based NLP tool, is employed to extract rare disease mentions from clinical notes. To refine the results and improve accuracy, we leverage various LLMs, including LLaMA3, Phi3-mini, and domain-specific models like OpenBioLLM and BioMistral. Different prompting strategies, such as zero-shot, few-shot, and knowledge-augmented generation, are explored to optimize the LLMs' performance. RESULTS The proposed hybrid approach demonstrates superior performance compared to traditional NLP systems and standalone LLMs. LLaMA3 and Phi3-mini achieve the highest F1 scores in rare disease identification. Few-shot prompting with 1-3 examples yields the best results, while knowledge-augmented generation shows limited improvement. Notably, the approach uncovers a significant number of potential rare disease cases not documented in structured diagnostic records, highlighting its ability to identify previously unrecognized patients. CONCLUSION The hybrid approach combining dictionary-based NLP tools with LLMs shows great promise for improving rare disease identification from unstructured clinical reports. By leveraging the strengths of both techniques, the method demonstrates superior performance and the potential to uncover hidden rare disease cases. Further research is needed to address limitations related to ontology mapping and overlapping case identification, and to integrate the approach into clinical practice for early diagnosis and improved patient outcomes.
Affiliation(s)
- Jinge Wu
  - Institute of Health Informatics, University College London, London, UK
  - UCB Pharma UK, Slough, UK
- Hang Dong
  - Department of Computer Science, University of Exeter, Exeter, UK
- Zexi Li
  - The Nuffield Department of Surgical Sciences, University of Oxford, Oxford, UK
- Haowei Wang
  - Division of Medicine, University College London, London, UK
- Runci Li
  - EGA Institute for Women's Health, University College London, London, UK
- Honghan Wu
  - Institute of Health Informatics, University College London, London, UK
  - School of Health and Wellbeing, University of Glasgow, Glasgow, UK
10
Bhattarai K, Oh IY, Sierra JM, Tang J, Payne PRO, Abrams Z, Lai AM. Leveraging GPT-4 for identifying cancer phenotypes in electronic health records: a performance comparison between GPT-4, GPT-3.5-turbo, Flan-T5, Llama-3-8B, and spaCy's rule-based and machine learning-based methods. JAMIA Open 2024; 7:ooae060. [PMID: 38962662 PMCID: PMC11221943 DOI: 10.1093/jamiaopen/ooae060] [Received: 04/05/2024] [Revised: 06/12/2024] [Accepted: 06/18/2024] [Indexed: 07/05/2024]
Abstract
Objective Accurately identifying clinical phenotypes from Electronic Health Records (EHRs) provides additional insights into patients' health, especially when such information is unavailable in structured data. This study evaluates the application of OpenAI's Generative Pre-trained Transformer (GPT)-4 model to identify clinical phenotypes from EHR text in non-small cell lung cancer (NSCLC) patients. The goal was to identify disease stages, treatments and progression utilizing GPT-4, and compare its performance against GPT-3.5-turbo, Flan-T5-xl, Flan-T5-xxl, Llama-3-8B, and 2 rule-based and machine learning-based methods, namely, scispaCy and medspaCy. Materials and Methods Phenotypes such as initial cancer stage, initial treatment, evidence of cancer recurrence, and affected organs during recurrence were identified from 13 646 clinical notes for 63 NSCLC patients from Washington University in St. Louis, Missouri. The performance of the GPT-4 model is evaluated against GPT-3.5-turbo, Flan-T5-xxl, Flan-T5-xl, Llama-3-8B, medspaCy, and scispaCy by comparing precision, recall, and micro-F1 scores. Results GPT-4 achieved higher F1 score, precision, and recall than the Flan-T5-xl, Flan-T5-xxl, Llama-3-8B, medspaCy, and scispaCy models. GPT-3.5-turbo performed similarly to GPT-4. GPT, Flan-T5, and Llama models were not constrained by explicit rule requirements for contextual pattern recognition. spaCy models relied on predefined patterns, leading to their suboptimal performance. Discussion and Conclusion GPT-4 improves clinical phenotype identification due to its robust pre-training and remarkable pattern recognition capability on the embedded tokens. It demonstrates data-driven effectiveness even with limited context in the input. While rule-based models remain useful for some tasks, GPT models offer improved contextual understanding of the text and robust clinical phenotype extraction.
Collapse
Affiliation(s)
- Kriti Bhattarai
- Institute for Informatics, Data Science & Biostatistics, Washington University School of Medicine, St. Louis, MO 63110, United States
- Department of Computer Science, Washington University in St Louis, St. Louis, MO 63110, United States
| | - Inez Y Oh
- Institute for Informatics, Data Science & Biostatistics, Washington University School of Medicine, St. Louis, MO 63110, United States
| | - Jonathan Moran Sierra
- Medical Scientist Training Program, Washington University School of Medicine, St. Louis, MO 63110, United States
| | - Jonathan Tang
- Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO 63110, United States
| | - Philip R O Payne
- Institute for Informatics, Data Science & Biostatistics, Washington University School of Medicine, St. Louis, MO 63110, United States
- Department of Computer Science, Washington University in St Louis, St. Louis, MO 63110, United States
| | - Zach Abrams
- Institute for Informatics, Data Science & Biostatistics, Washington University School of Medicine, St. Louis, MO 63110, United States
| | - Albert M Lai
- Institute for Informatics, Data Science & Biostatistics, Washington University School of Medicine, St. Louis, MO 63110, United States
- Department of Computer Science, Washington University in St Louis, St. Louis, MO 63110, United States
| |
|
11
|
Wu D, Yang J, Wang K. Exploring the reversal curse and other deductive logical reasoning in BERT and GPT-based large language models. PATTERNS (NEW YORK, N.Y.) 2024; 5:101030. [PMID: 39568650 PMCID: PMC11573886 DOI: 10.1016/j.patter.2024.101030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2024] [Revised: 04/11/2024] [Accepted: 07/01/2024] [Indexed: 11/22/2024]
Abstract
The "Reversal Curse" describes the inability of autoregressive decoder large language models (LLMs) to deduce "B is A" from "A is B," assuming that B and A are distinct and can be uniquely identified from each other. This logical failure suggests limitations in using generative pretrained transformer (GPT) models for tasks like constructing knowledge graphs. Our study revealed that a bidirectional LLM, bidirectional encoder representations from transformers (BERT), does not suffer from this issue. To investigate further, we focused on more complex deductive reasoning by training encoder and decoder LLMs to perform union and intersection operations on sets. While both types of models managed tasks involving two sets, they struggled with operations involving three sets. Our findings underscore the differences between encoder and decoder models in handling logical reasoning. Thus, selecting BERT or GPT should depend on the task's specific needs, utilizing BERT's bidirectional context comprehension or GPT's sequence prediction strengths.
Affiliation(s)
- Da Wu
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Department of Mathematics, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Jingye Yang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Department of Mathematics, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| |
|
12
|
Zaghir J, Naguib M, Bjelogrlic M, Névéol A, Tannier X, Lovis C. Prompt Engineering Paradigms for Medical Applications: Scoping Review. J Med Internet Res 2024; 26:e60501. [PMID: 39255030 PMCID: PMC11422740 DOI: 10.2196/60501] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2024] [Revised: 07/09/2024] [Accepted: 07/22/2024] [Indexed: 09/11/2024] Open
Abstract
BACKGROUND Prompt engineering, focusing on crafting effective prompts to large language models (LLMs), has garnered attention for its capabilities at harnessing the potential of LLMs. This is even more crucial in the medical domain due to its specialized terminology and language technicity. Clinical natural language processing applications must navigate complex language and ensure privacy compliance. Prompt engineering offers a novel approach by designing tailored prompts to guide models in exploiting clinically relevant information from complex medical texts. Despite its promise, the efficacy of prompt engineering in the medical domain remains to be fully explored. OBJECTIVE The aim of the study is to review research efforts and technical approaches in prompt engineering for medical applications as well as provide an overview of opportunities and challenges for clinical practice. METHODS Databases indexing the fields of medicine, computer science, and medical informatics were queried in order to identify relevant published papers. Since prompt engineering is an emerging field, preprint databases were also considered. Multiple data were extracted, such as the prompt paradigm, the involved LLMs, the languages of the study, the domain of the topic, the baselines, and several learning, design, and architecture strategies specific to prompt engineering. We include studies that apply prompt engineering-based methods to the medical domain, published between 2022 and 2024, and covering multiple prompt paradigms such as prompt learning (PL), prompt tuning (PT), and prompt design (PD). RESULTS We included 114 recent prompt engineering studies. Among the 3 prompt paradigms, we have observed that PD is the most prevalent (78 papers). In 12 papers, PD, PL, and PT terms were used interchangeably. While ChatGPT is the most commonly used LLM, we have identified 7 studies using this LLM on a sensitive clinical data set. 
Chain-of-thought, present in 17 studies, emerges as the most frequent PD technique. While PL and PT papers typically provide a baseline for evaluating prompt-based approaches, 61% (48/78) of the PD studies do not report any nonprompt-related baseline. Finally, we individually examine each of the key prompt engineering-specific information reported across papers and find that many studies neglect to explicitly mention them, posing a challenge for advancing prompt engineering research. CONCLUSIONS In addition to reporting on trends and the scientific landscape of prompt engineering, we provide reporting guidelines for future studies to help advance research in the medical field. We also disclose tables and figures summarizing medical prompt engineering papers available and hope that future contributions will leverage these existing works to better advance the field.
Affiliation(s)
- Jamil Zaghir
- Division of Medical Information Sciences, Geneva University Hospitals, Geneva, Switzerland
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
| | - Marco Naguib
- Université Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique, Orsay, France
| | - Mina Bjelogrlic
- Division of Medical Information Sciences, Geneva University Hospitals, Geneva, Switzerland
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
| | - Aurélie Névéol
- Université Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique, Orsay, France
| | - Xavier Tannier
- Sorbonne Université, INSERM, Université Sorbonne Paris-Nord, Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances en eSanté, LIMICS, Paris, France
| | - Christian Lovis
- Division of Medical Information Sciences, Geneva University Hospitals, Geneva, Switzerland
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
| |
|
13
|
Wang A, Liu C, Yang J, Weng C. Fine-tuning large language models for rare disease concept normalization. J Am Med Inform Assoc 2024; 31:2076-2083. [PMID: 38829731 PMCID: PMC11339522 DOI: 10.1093/jamia/ocae133] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2023] [Revised: 05/20/2024] [Accepted: 05/22/2024] [Indexed: 06/05/2024] Open
Abstract
OBJECTIVE We aim to develop a novel method for rare disease concept normalization by fine-tuning Llama 2, an open-source large language model (LLM), using a domain-specific corpus sourced from the Human Phenotype Ontology (HPO). METHODS We developed an in-house template-based script to generate two corpora for fine-tuning. The first (NAME) contains standardized HPO names, sourced from the HPO vocabularies, along with their corresponding identifiers. The second (NAME+SYN) includes HPO names and half of the concept's synonyms as well as identifiers. Subsequently, we fine-tuned Llama 2 (Llama2-7B) for each sentence set and conducted an evaluation using a range of sentence prompts and various phenotype terms. RESULTS When the phenotype terms for normalization were included in the fine-tuning corpora, both models demonstrated nearly perfect performance, averaging over 99% accuracy. In comparison, ChatGPT-3.5 has only ∼20% accuracy in identifying HPO IDs for phenotype terms. When single-character typos were introduced in the phenotype terms, the accuracy of NAME and NAME+SYN is 10.2% and 36.1%, respectively, but increases to 61.8% (NAME+SYN) with additional typo-specific fine-tuning. For terms sourced from HPO vocabularies as unseen synonyms, the NAME model achieved 11.2% accuracy, while the NAME+SYN model achieved 92.7% accuracy. CONCLUSION Our fine-tuned models demonstrate ability to normalize phenotype terms unseen in the fine-tuning corpus, including misspellings, synonyms, terms from other ontologies, and laymen's terms. Our approach provides a solution for the use of LLMs to identify named medical entities from clinical narratives, while successfully normalizing them to standard concepts in a controlled vocabulary.
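The methods above describe a template-based script that builds two fine-tuning corpora (NAME, and NAME+SYN with half of each concept's synonyms) and a robustness test using single-character typos. The sketch below illustrates that idea only; it is not the authors' script. The sentence templates, the function names `build_corpora` and `add_typo`, the random choice of which synonyms to keep, the use of character substitution as the typo operation, and the toy concept entries are all assumptions and should be checked against the paper and the ontology before reuse.

```python
import random

# Hypothetical sentence templates; the paper's actual templates are not given here.
TEMPLATES = [
    "The HPO identifier of {term} is {hpo_id}.",
    "{term} is normalized to {hpo_id}.",
]


def build_corpora(concepts, seed=0):
    """Return (NAME corpus, NAME+SYN corpus) as lists of training sentences."""
    rng = random.Random(seed)
    name_corpus, name_syn_corpus = [], []
    for concept in concepts:
        for template in TEMPLATES:
            sentence = template.format(term=concept["name"], hpo_id=concept["id"])
            name_corpus.append(sentence)
            name_syn_corpus.append(sentence)
        # Keep half of the synonyms (assumed: chosen at random) for NAME+SYN;
        # the held-out half can serve as "unseen synonym" test terms.
        synonyms = list(concept["synonyms"])
        rng.shuffle(synonyms)
        for synonym in synonyms[: len(synonyms) // 2]:
            for template in TEMPLATES:
                name_syn_corpus.append(
                    template.format(term=synonym, hpo_id=concept["id"])
                )
    return name_corpus, name_syn_corpus


def add_typo(term, rng):
    """Replace one randomly chosen character with a different lowercase letter."""
    i = rng.randrange(len(term))
    choices = [c for c in "abcdefghijklmnopqrstuvwxyz" if c != term[i].lower()]
    return term[:i] + rng.choice(choices) + term[i + 1:]


# Toy entries for illustration only; verify labels and IDs against the HPO.
concepts = [
    {"id": "HP:0001250", "name": "Seizure",
     "synonyms": ["Seizures", "Epileptic seizure", "Fits", "Convulsion"]},
    {"id": "HP:0001324", "name": "Muscle weakness",
     "synonyms": ["Muscular weakness", "Weak muscles"]},
]
name_corpus, name_syn_corpus = build_corpora(concepts)
print(len(name_corpus), len(name_syn_corpus))
print(add_typo("Muscle weakness", random.Random(1)))
```

Holding out half of the synonyms is what makes the "unseen synonym" evaluation in the abstract possible: the NAME+SYN model sees some surface variants during fine-tuning and is tested on the rest.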
Affiliation(s)
- Andy Wang
- Peddie School, Hightstown, NJ 08520, United States
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States
| | - Cong Liu
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States
| | - Jingye Yang
- Department of Mathematics, University of Pennsylvania, Philadelphia, PA 19104, United States
| | - Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States
| |
|
14
|
Yan C, Ong HH, Grabowska ME, Krantz MS, Su WC, Dickson AL, Peterson JF, Feng Q, Roden DM, Stein CM, Kerchberger VE, Malin BA, Wei WQ. Large language models facilitate the generation of electronic health record phenotyping algorithms. J Am Med Inform Assoc 2024; 31:1994-2001. [PMID: 38613820 PMCID: PMC11339509 DOI: 10.1093/jamia/ocae072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2023] [Revised: 02/21/2024] [Accepted: 03/22/2024] [Indexed: 04/15/2024] Open
Abstract
OBJECTIVES Phenotyping is a core task in observational health research utilizing electronic health records (EHRs). Developing an accurate algorithm demands substantial input from domain experts, involving extensive literature review and evidence synthesis. This burdensome process limits scalability and delays knowledge discovery. We investigate the potential for leveraging large language models (LLMs) to enhance the efficiency of EHR phenotyping by generating high-quality algorithm drafts. MATERIALS AND METHODS We prompted four LLMs (GPT-4 and GPT-3.5 of ChatGPT, Claude 2, and Bard) in October 2023, asking them to generate executable phenotyping algorithms in the form of SQL queries adhering to a common data model (CDM) for three phenotypes (ie, type 2 diabetes mellitus, dementia, and hypothyroidism). Three phenotyping experts evaluated the returned algorithms across several critical metrics. We further implemented the top-rated algorithms and compared them against clinician-validated phenotyping algorithms from the Electronic Medical Records and Genomics (eMERGE) network. RESULTS GPT-4 and GPT-3.5 exhibited significantly higher overall expert evaluation scores in instruction following, algorithmic logic, and SQL executability, when compared to Claude 2 and Bard. Although GPT-4 and GPT-3.5 effectively identified relevant clinical concepts, they exhibited immature capability in organizing phenotyping criteria with the proper logic, leading to phenotyping algorithms that were either excessively restrictive (with low recall) or overly broad (with low positive predictive values). CONCLUSION GPT versions 3.5 and 4 are capable of drafting phenotyping algorithms by identifying relevant clinical criteria aligned with a CDM. However, expertise in informatics and clinical experience is still required to assess and further refine generated algorithms.
Affiliation(s)
- Chao Yan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Henry H Ong
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Monika E Grabowska
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Matthew S Krantz
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Wu-Chen Su
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Alyson L Dickson
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Josh F Peterson
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - QiPing Feng
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Dan M Roden
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - C Michael Stein
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - V Eric Kerchberger
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Bradley A Malin
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Department of Computer Science, Vanderbilt University, Nashville, TN 37203, United States
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Wei-Qi Wei
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Department of Computer Science, Vanderbilt University, Nashville, TN 37203, United States
| |
|
15
|
Pool J, Indulska M, Sadiq S. Large language models and generative AI in telehealth: a responsible use lens. J Am Med Inform Assoc 2024; 31:2125-2136. [PMID: 38441296 PMCID: PMC11339524 DOI: 10.1093/jamia/ocae035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/18/2023] [Revised: 02/05/2024] [Accepted: 02/14/2024] [Indexed: 08/23/2024] Open
Abstract
OBJECTIVE This scoping review aims to assess the current research landscape of the application and use of large language models (LLMs) and generative Artificial Intelligence (AI), through tools such as ChatGPT in telehealth. Additionally, the review seeks to identify key areas for future research, with a particular focus on AI ethics considerations for responsible use and ensuring trustworthy AI. MATERIALS AND METHODS Following the scoping review methodological framework, a search strategy was conducted across 6 databases. To structure our review, we employed AI ethics guidelines and principles, constructing a concept matrix for investigating the responsible use of AI in telehealth. Using the concept matrix in our review enabled the identification of gaps in the literature and informed future research directions. RESULTS Twenty studies were included in the review. Among the included studies, 5 were empirical, and 15 were reviews and perspectives focusing on different telehealth applications and healthcare contexts. Benefit and reliability concepts were frequently discussed in these studies. Privacy, security, and accountability were peripheral themes, with transparency, explainability, human agency, and contestability lacking conceptual or empirical exploration. CONCLUSION The findings emphasized the potential of LLMs, especially ChatGPT, in telehealth. They provide insights into understanding the use of LLMs, enhancing telehealth services, and taking ethical considerations into account. By proposing three future research directions with a focus on responsible use, this review further contributes to the advancement of this emerging phenomenon of healthcare AI.
Affiliation(s)
- Javad Pool
- ARC Industrial Transformation Training Centre for Information Resilience (CIRES), The University of Queensland, Brisbane 4072, Australia
- School of Electrical Engineering and Computer Science, The University of Queensland, Brisbane 4072, Australia
| | - Marta Indulska
- ARC Industrial Transformation Training Centre for Information Resilience (CIRES), The University of Queensland, Brisbane 4072, Australia
- Business School, The University of Queensland, Brisbane 4072, Australia
| | - Shazia Sadiq
- ARC Industrial Transformation Training Centre for Information Resilience (CIRES), The University of Queensland, Brisbane 4072, Australia
- School of Electrical Engineering and Computer Science, The University of Queensland, Brisbane 4072, Australia
| |
|
16
|
Du X, Zhou Z, Wang Y, Chuang YW, Yang R, Zhang W, Wang X, Zhang R, Hong P, Bates DW, Zhou L. Generative Large Language Models in Electronic Health Records for Patient Care Since 2023: A Systematic Review. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.08.11.24311828. [PMID: 39228726 PMCID: PMC11370524 DOI: 10.1101/2024.08.11.24311828] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2024]
Abstract
Background Generative large language models (LLMs) represent a significant advancement in natural language processing, achieving state-of-the-art performance across various tasks. However, their application in clinical settings using real electronic health records (EHRs) is still rare and presents numerous challenges. Objective This study aims to systematically review the use of generative LLMs, and the effectiveness of relevant techniques in patient care-related topics involving EHRs, summarize the challenges faced, and suggest future directions. Methods A Boolean search for peer-reviewed articles was conducted on May 19th, 2024 using PubMed and Web of Science to include research articles published since 2023, which was one month after the release of ChatGPT. The search results were deduplicated. Multiple reviewers, including biomedical informaticians, computer scientists, and a physician, screened the publications for eligibility and conducted data extraction. Only studies utilizing generative LLMs to analyze real EHR data were included. We summarized the use of prompt engineering, fine-tuning, multimodal EHR data, and evaluation metrics. Additionally, we identified current challenges in applying LLMs in clinical settings as reported by the included studies and proposed future directions. Results The initial search identified 6,328 unique studies, with 76 studies included after eligibility screening. Of these, 67 studies (88.2%) employed zero-shot prompting; five of them reported 100% accuracy on five specific clinical tasks. Nine studies used advanced prompting strategies; four tested these strategies experimentally, finding that prompt engineering improved performance, with one study noting a non-linear relationship between the number of examples in a prompt and performance improvement. 
Eight studies explored fine-tuning generative LLMs; all reported performance improvements on specific tasks, but three of them noted potential performance degradation after fine-tuning on certain tasks. Only two studies utilized multimodal data, which improved LLM-based decision-making and enabled accurate rare disease diagnosis and prognosis. The studies employed 55 different evaluation metrics for 22 purposes, such as correctness, completeness, and conciseness. Two studies investigated LLM bias, with one detecting no bias and the other finding that male patients received more appropriate clinical decision-making suggestions. Six studies identified hallucinations, such as fabricating patient names in structured thyroid ultrasound reports. Additional challenges included but were not limited to the impersonal tone of LLM consultations, which made patients uncomfortable, and the difficulty patients had in understanding LLM responses. Conclusion Our review indicates that few studies have employed advanced computational techniques to enhance LLM performance. The diverse evaluation metrics used highlight the need for standardization. LLMs currently cannot replace physicians due to challenges such as bias, hallucinations, and impersonal responses.
Affiliation(s)
- Xinsong Du
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115
| | - Zhengyang Zhou
- Department of Computer Science, Brandeis University, Waltham, MA 02453
| | - Yifei Wang
- Department of Computer Science, Brandeis University, Waltham, MA 02453
| | - Ya-Wen Chuang
- Division of Nephrology, Department of Internal Medicine, Taichung Veterans General Hospital, Taichung, Taiwan, 407219
- Department of Post-Baccalaureate Medicine, College of Medicine, National Chung Hsing University, Taichung, Taiwan, 402202
- School of Medicine, College of Medicine, China Medical University, Taichung, Taiwan, 404328
| | - Richard Yang
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
| | - Wenyu Zhang
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115
| | - Xinyi Wang
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115
| | - Rui Zhang
- Division of Computational Health Sciences, University of Minnesota, Minneapolis, MN 55455
| | - Pengyu Hong
- Department of Computer Science, Brandeis University, Waltham, MA 02453
| | - David W. Bates
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Department of Health Policy and Management, Harvard T.H. Chan School of Public Health, Boston, MA 02115
| | - Li Zhou
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115
| |
|
17
|
Matheny ME, Yang J, Smith JC, Walsh CG, Al-Garadi MA, Davis SE, Marsolo KA, Fabbri D, Reeves RR, Johnson KB, Dal Pan GJ, Ball R, Desai RJ. Enhancing Postmarketing Surveillance of Medical Products With Large Language Models. JAMA Netw Open 2024; 7:e2428276. [PMID: 39150707 DOI: 10.1001/jamanetworkopen.2024.28276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/17/2024] Open
Abstract
Importance The Sentinel System is a key component of the US Food and Drug Administration (FDA) postmarketing safety surveillance commitment and uses clinical health care data to conduct analyses to inform drug labeling and safety communications, FDA advisory committee meetings, and other regulatory decisions. However, observational data are frequently deemed insufficient for reliable evaluation of safety concerns owing to limitations in underlying data or methodology. Advances in large language models (LLMs) provide new opportunities to address some of these limitations. However, careful consideration is necessary for how and where LLMs can be effectively deployed for these purposes. Observations LLMs may provide new avenues to support signal-identification activities to identify novel adverse event signals from narrative text of electronic health records. These algorithms may be used to support epidemiologic investigations examining the causal relationship between exposure to a medical product and an adverse event through development of probabilistic phenotyping of health outcomes of interest and extraction of information related to important confounding factors. LLMs may perform like traditional natural language processing tools by annotating text with controlled vocabularies with additional tailored training activities. LLMs offer opportunities for enhancing information extraction from adverse event reports, medical literature, and other biomedical knowledge sources. There are several challenges that must be considered when leveraging LLMs for postmarket surveillance. Prompt engineering is needed to ensure that LLM-extracted associations are accurate and specific. LLMs require extensive infrastructure to use, which many health care systems lack, and this can impact diversity, equity, and inclusion, and result in obscuring significant adverse event patterns in some populations. 
LLMs are known to generate nonfactual statements, which could lead to false positive signals and downstream evaluation activities by the FDA and other entities, incurring substantial cost. Conclusions and Relevance LLMs represent a novel paradigm that may facilitate generation of information to support medical product postmarket surveillance activities that have not been possible. However, additional work is required to ensure LLMs can be used in a fair and equitable manner, minimize false positive findings, and support the necessary rigor of signal detection needed for regulatory activities.
Affiliation(s)
- Michael E Matheny
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee
- Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee
- Geriatric Research Education and Clinical Care Service, Tennessee Valley Healthcare System VA, Nashville
| | - Jie Yang
- Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts
| | - Joshua C Smith
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee
| | - Colin G Walsh
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee
- Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee
- Department of Psychiatry, Vanderbilt University Medical Center, Nashville, Tennessee
| | - Mohammed A Al-Garadi
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee
| | - Sharon E Davis
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee
| | - Keith A Marsolo
- Department of Population Health Sciences, Duke University School of Medicine, Durham, North Carolina
| | - Daniel Fabbri
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee
- Department of Computer Science, Vanderbilt University, Nashville, Tennessee
| | - Ruth R Reeves
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee
- Geriatric Research Education and Clinical Care Service, Tennessee Valley Healthcare System VA, Nashville
| | - Kevin B Johnson
- Department of Epidemiology and Informatics, University of Pennsylvania, Philadelphia
- Department of Pediatrics, University of Pennsylvania, Philadelphia
| | - Gerald J Dal Pan
- Office of Surveillance and Epidemiology, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, Maryland
| | - Robert Ball
- Office of Surveillance and Epidemiology, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, Maryland
| | - Rishi J Desai
- Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts
| |
|
18
|
Munzir SI, Hier DB, Carrithers MD. High Throughput Phenotyping of Physician Notes with Large Language and Hybrid NLP Models. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2024; 2024:1-5. [PMID: 40039752 DOI: 10.1109/embc53108.2024.10782119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/06/2025]
Abstract
Deep phenotyping is the detailed description of patient signs and symptoms using concepts from an ontology. The deep phenotyping of the numerous physician notes in electronic health records requires high throughput methods. Over the past 30 years, progress has been made toward making high-throughput phenotyping feasible. In this study, we demonstrate that a large language model and a hybrid NLP model (combining word vectors with a machine learning classifier) can perform high throughput phenotyping on physician notes with high accuracy. Large language models will likely emerge as the preferred method for high throughput deep phenotyping of physician notes. Clinical relevance: Large language models will likely emerge as the dominant method for the high throughput phenotyping of signs and symptoms in physician notes.
|
19
|
Groza T, Gration D, Baynam G, Robinson PN. FastHPOCR: pragmatic, fast, and accurate concept recognition using the human phenotype ontology. Bioinformatics 2024; 40:btae406. [PMID: 38913850 PMCID: PMC11227366 DOI: 10.1093/bioinformatics/btae406] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2024] [Revised: 05/18/2024] [Accepted: 06/19/2024] [Indexed: 06/26/2024] Open
Abstract
MOTIVATION Human Phenotype Ontology (HPO)-based phenotype concept recognition (CR) underpins a faster and more effective mechanism to create patient phenotype profiles or to document novel phenotype-centred knowledge statements. While the increasing adoption of large language models (LLMs) for natural language understanding has led to several LLM-based solutions, we argue that their intrinsic resource-intensive nature is not suitable for realistic management of the phenotype CR lifecycle. Consequently, we propose to go back to the basics and adopt a dictionary-based approach that enables both an immediate refresh of the ontological concepts as well as efficient re-analysis of past data. RESULTS We developed a dictionary-based approach using a pre-built large collection of clusters of morphologically equivalent tokens to address lexical variability, and a more effective CR step that restricts entity boundary detection strictly to candidates consisting of tokens belonging to ontology concepts. Our method achieves state-of-the-art results (0.76 F1 on the GSC+ corpus) and a processing efficiency of 10 000 publication abstracts in 5 s. AVAILABILITY AND IMPLEMENTATION FastHPOCR is available as a Python package installable via pip. The source code is available at https://github.com/tudorgroza/fast_hpo_cr. A Java implementation of FastHPOCR will be made available as part of the Fenominal Java library available at https://github.com/monarch-initiative/fenominal. The up-to-date GSC-2024 corpus is available at https://github.com/tudorgroza/code-for-papers/tree/main/gsc-2024.
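The dictionary-based idea in this abstract (map morphological variants of a token to one form, and only consider candidate spans whose tokens all belong to ontology concepts) can be sketched in a few lines. The Python below is a toy illustration and does not use or reproduce the FastHPOCR package or its API. The crude plural-stripping `normalize` function stands in for the pre-built clusters of morphologically equivalent tokens, and the function names, the order-insensitive matching, the `max_len` window, and the two label entries are assumptions made for the example.

```python
import re


def normalize(token):
    """Crude stand-in for a morphological cluster: lowercase, strip a final 's'."""
    token = token.lower()
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def tokenize(text):
    return [normalize(t) for t in re.findall(r"[A-Za-z]+", text)]


def build_index(labels):
    """Map each label's sorted normalized tokens to its concept identifier."""
    index = {tuple(sorted(tokenize(label))): cid for label, cid in labels.items()}
    vocab = {tok for key in index for tok in key}
    return index, vocab


def recognize(text, index, vocab, max_len=4):
    """Greedy longest-match scan over windows made only of ontology tokens."""
    tokens = tokenize(text)
    hits, i = [], 0
    while i < len(tokens):
        matched = 0
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            window = tokens[i:i + n]
            # Candidate boundaries are limited to spans of ontology tokens.
            if not all(tok in vocab for tok in window):
                continue
            cid = index.get(tuple(sorted(window)))
            if cid is not None:
                hits.append(cid)
                matched = n
                break
        i += matched if matched else 1
    return hits


# Toy dictionary for illustration only; verify labels and IDs against the HPO.
labels = {"Seizure": "HP:0001250", "Muscle weakness": "HP:0001324"}
index, vocab = build_index(labels)
print(recognize("The patient reports seizures and muscle weakness.", index, vocab))
```

Because the index is a plain dictionary built from ontology labels, refreshing it after an ontology release only requires rebuilding the mapping, which is the maintenance advantage the abstract argues for over LLM-based approaches.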
Affiliation(s)
- Tudor Groza
- Rare Care Centre, Perth Children’s Hospital, Nedlands, WA 6009, Australia
- Telethon Kids Institute, Nedlands, WA 6009, Australia
- School of Electrical Engineering, Computing and Mathematical Sciences, Curtin University, Bentley, WA 6102, Australia
- SingHealth Duke-NUS Institute of Precision Medicine, Singapore 169609, Singapore
| | - Dylan Gration
- Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital, Subiaco, WA 6008, Australia
| | - Gareth Baynam
- Rare Care Centre, Perth Children’s Hospital, Nedlands, WA 6009, Australia
- Telethon Kids Institute, Nedlands, WA 6009, Australia
- Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital, Subiaco, WA 6008, Australia
- Faculty of Health and Medical Sciences, University of Western Australia, Crawley, WA 6009, Australia
| | - Peter N Robinson
- Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Charitéplatz 1, 10117 Berlin, Germany
- The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, United States
| |
|
20
|
Wang A, Liu C, Yang J, Weng C. Fine-tuning Large Language Models for Rare Disease Concept Normalization. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.12.28.573586. [PMID: 38234802 PMCID: PMC10793431 DOI: 10.1101/2023.12.28.573586] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2024]
Abstract
Objective: We aim to develop a novel method for rare disease concept normalization by fine-tuning Llama 2, an open-source large language model (LLM), using a domain-specific corpus sourced from the Human Phenotype Ontology (HPO). Methods: We developed an in-house template-based script to generate two corpora for fine-tuning. The first (NAME) contains standardized HPO names, sourced from the HPO vocabularies, along with their corresponding identifiers. The second (NAME+SYN) includes HPO names and half of the concept's synonyms as well as identifiers. Subsequently, we fine-tuned Llama 2 (Llama2-7B) for each sentence set and conducted an evaluation using a range of sentence prompts and various phenotype terms. Results: When the phenotype terms for normalization were included in the fine-tuning corpora, both models demonstrated nearly perfect performance, averaging over 99% accuracy. In comparison, ChatGPT-3.5 has only ~20% accuracy in identifying HPO IDs for phenotype terms. When single-character typos were introduced in the phenotype terms, the accuracy of NAME and NAME+SYN is 10.2% and 36.1%, respectively, but increases to 61.8% (NAME+SYN) with additional typo-specific fine-tuning. For terms sourced from HPO vocabularies as unseen synonyms, the NAME model achieved 11.2% accuracy, while the NAME+SYN model achieved 92.7% accuracy. Conclusion: Our fine-tuned models demonstrate the ability to normalize phenotype terms unseen in the fine-tuning corpus, including misspellings, synonyms, terms from other ontologies, and laymen's terms. Our approach provides a solution for the use of LLMs to identify named medical entities from clinical narratives, while successfully normalizing them to standard concepts in a controlled vocabulary.
Affiliation(s)
- Andy Wang
- Peddie School, Hightstown, NJ, USA
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
| | - Cong Liu
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
| | - Jingye Yang
- Department of Mathematics, University of Pennsylvania, Philadelphia, PA, USA
| | - Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
| |
|
21
|
Mao D, Liu C, Wang L, Ai-Ouran R, Deisseroth C, Pasupuleti S, Kim SY, Li L, Rosenfeld JA, Meng L, Burrage LC, Wangler MF, Yamamoto S, Santana M, Perez V, Shukla P, Eng CM, Lee B, Yuan B, Xia F, Bellen HJ, Liu P, Liu Z. AI-MARRVEL - A Knowledge-Driven AI System for Diagnosing Mendelian Disorders. NEJM AI 2024; 1:10.1056/aioa2300009. [PMID: 38962029 PMCID: PMC11221788 DOI: 10.1056/aioa2300009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/05/2024]
Abstract
BACKGROUND: Diagnosing genetic disorders requires extensive manual curation and interpretation of candidate variants, a labor-intensive task even for trained geneticists. Although artificial intelligence (AI) shows promise in aiding these diagnoses, existing AI tools have only achieved moderate success for primary diagnosis. METHODS: AI-MARRVEL (AIM) uses a random-forest machine-learning classifier trained on over 3.5 million variants from thousands of diagnosed cases. AIM additionally incorporates expert-engineered features into training to recapitulate the intricate decision-making processes in molecular diagnosis. The online version of AIM is available at https://ai.marrvel.org. To evaluate AIM, we benchmarked it with diagnosed patients from three independent cohorts. RESULTS: AIM improved the rate of accurate genetic diagnosis, doubling the number of solved cases as compared with benchmarked methods, across three distinct real-world cohorts. To better identify diagnosable cases from the unsolved pools accumulated over time, we designed a confidence metric on which AIM achieved a precision rate of 98% and identified 57% of diagnosable cases out of a collection of 871 cases. Furthermore, AIM's performance improved after being fine-tuned for targeted settings including recessive disorders and trio analysis. Finally, AIM demonstrated potential for novel disease gene discovery by correctly predicting two newly reported disease genes from the Undiagnosed Diseases Network. CONCLUSIONS: AIM achieved superior accuracy compared with existing methods for genetic diagnosis. We anticipate that this tool may aid in primary diagnosis, reanalysis of unsolved cases, and the discovery of novel disease genes. (Funded by the NIH Common Fund and others.)
Affiliation(s)
- Dongxue Mao
- Department of Pediatrics, Baylor College of Medicine, Houston
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston
- Jan and Dan Duncan Neurological Research Institute at Texas Children's Hospital, Houston
| | - Chaozhong Liu
- Jan and Dan Duncan Neurological Research Institute at Texas Children's Hospital, Houston
- Graduate School of Biomedical Sciences, Program in Quantitative and Computational Biosciences, Baylor College of Medicine, Houston
| | - Linhua Wang
- Jan and Dan Duncan Neurological Research Institute at Texas Children's Hospital, Houston
- Graduate School of Biomedical Sciences, Program in Quantitative and Computational Biosciences, Baylor College of Medicine, Houston
| | - Rami Ai-Ouran
- Department of Pediatrics, Baylor College of Medicine, Houston
- Jan and Dan Duncan Neurological Research Institute at Texas Children's Hospital, Houston
- Department of Data Science and AI, Al Hussein Technical University, Amman, Jordan
| | - Cole Deisseroth
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston
- Jan and Dan Duncan Neurological Research Institute at Texas Children's Hospital, Houston
| | - Sasidhar Pasupuleti
- Department of Pediatrics, Baylor College of Medicine, Houston
- Jan and Dan Duncan Neurological Research Institute at Texas Children's Hospital, Houston
| | - Seon Young Kim
- Department of Pediatrics, Baylor College of Medicine, Houston
- Jan and Dan Duncan Neurological Research Institute at Texas Children's Hospital, Houston
| | - Lucian Li
- Department of Pediatrics, Baylor College of Medicine, Houston
- Jan and Dan Duncan Neurological Research Institute at Texas Children's Hospital, Houston
| | - Jill A Rosenfeld
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston
| | - Linyan Meng
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston
- Baylor Genetics, Houston
| | - Lindsay C Burrage
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston
| | - Michael F Wangler
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston
- Jan and Dan Duncan Neurological Research Institute at Texas Children's Hospital, Houston
| | - Shinya Yamamoto
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston
- Jan and Dan Duncan Neurological Research Institute at Texas Children's Hospital, Houston
| | | | | | | | - Christine M Eng
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston
- Baylor Genetics, Houston
| | - Brendan Lee
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston
| | - Bo Yuan
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston
- Human Genome Sequencing Center, Baylor College of Medicine, Houston
| | - Fan Xia
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston
- Baylor Genetics, Houston
| | - Hugo J Bellen
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston
- Jan and Dan Duncan Neurological Research Institute at Texas Children's Hospital, Houston
- Department of Neuroscience, Baylor College of Medicine, Houston
| | - Pengfei Liu
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston
- Baylor Genetics, Houston
| | - Zhandong Liu
- Department of Pediatrics, Baylor College of Medicine, Houston
- Jan and Dan Duncan Neurological Research Institute at Texas Children's Hospital, Houston
| |
|
22
|
Wu D, Yang J, Liu C, Hsieh TC, Marchi E, Blair J, Krawitz P, Weng C, Chung W, Lyon GJ, Krantz ID, Kalish JM, Wang K. GestaltMML: Enhancing Rare Genetic Disease Diagnosis through Multimodal Machine Learning Combining Facial Images and Clinical Texts. ARXIV 2024:arXiv:2312.15320v2. [PMID: 38711434 PMCID: PMC11071539] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/08/2024]
Abstract
Individuals with suspected rare genetic disorders often undergo multiple clinical evaluations, imaging studies, laboratory tests and genetic tests, to find a possible answer over a prolonged period of time. Addressing this "diagnostic odyssey" thus has substantial clinical, psychosocial, and economic benefits. Many rare genetic diseases have distinctive facial features, which can be used by artificial intelligence algorithms to facilitate clinical diagnosis, in prioritizing candidate diseases to be further examined by lab tests or genetic assays, or in helping the phenotype-driven reinterpretation of genome/exome sequencing data. Existing methods using frontal facial photos were built on conventional Convolutional Neural Networks (CNNs), rely exclusively on facial images, and cannot capture non-facial phenotypic traits and demographic information essential for guiding accurate diagnoses. Here we introduce GestaltMML, a multimodal machine learning (MML) approach solely based on the Transformer architecture. It integrates facial images, demographic information (age, sex, ethnicity), and clinical notes (optionally, a list of Human Phenotype Ontology terms) to improve prediction accuracy. Furthermore, we also evaluated GestaltMML on a diverse range of datasets, including 528 diseases from the GestaltMatcher Database, several in-house datasets of Beckwith-Wiedemann syndrome (BWS, overgrowth syndrome with distinct facial features), Sotos syndrome (overgrowth syndrome with overlapping features with BWS), NAA10-related neurodevelopmental syndrome, Cornelia de Lange syndrome (multiple malformation syndrome), and KBG syndrome (multiple malformation syndrome). Our results suggest that GestaltMML effectively incorporates multiple modalities of data, greatly narrowing candidate genetic diagnoses of rare diseases and may facilitate the reinterpretation of genome/exome sequencing data.
Affiliation(s)
- Da Wu
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
| | - Jingye Yang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
| | - Cong Liu
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY 10032, USA
| | - Tzung-Chien Hsieh
- Institute for Genomic Statistics and Bioinformatics, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
| | - Elaine Marchi
- Department of Human Genetics, New York State Institute for Basic Research in Developmental Disabilities, Staten Island, NY, USA
| | - Justin Blair
- Division of Human Genetics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
| | - Peter Krawitz
- Institute for Genomic Statistics and Bioinformatics, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
| | - Chunhua Weng
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY 10032, USA
| | - Wendy Chung
- Department of Pediatrics, Boston Children’s Hospital, Harvard Medical School, Boston, MA, USA
| | - Gholson J. Lyon
- Department of Human Genetics, New York State Institute for Basic Research in Developmental Disabilities, Staten Island, NY, USA
- Biology PhD Program, The Graduate Center, The City University of New York, New York, United States of America
| | - Ian D. Krantz
- Division of Human Genetics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
| | - Jennifer M. Kalish
- Division of Human Genetics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
| | - Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| |
|
23
|
Bhasin MA, Knaus A, Incardona P, Schmid A, Holtgrewe M, Elbracht M, Krawitz PM, Hsieh TC. Enhancing Variant Prioritization in VarFish through On-Premise Computational Facial Analysis. Genes (Basel) 2024; 15:370. [PMID: 38540429 PMCID: PMC10969976 DOI: 10.3390/genes15030370] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Revised: 03/03/2024] [Accepted: 03/13/2024] [Indexed: 06/14/2024] Open
Abstract
Genomic variant prioritization is crucial for identifying disease-associated genetic variations. Integrating facial and clinical feature analyses into this process enhances performance. This study demonstrates the integration of facial analysis (GestaltMatcher) and Human Phenotype Ontology analysis (CADA) within VarFish, an open-source variant analysis framework. Challenges related to non-open-source components were addressed by providing an open-source version of GestaltMatcher, facilitating on-premise facial analysis to address data privacy concerns. Performance evaluation on 163 patients recruited from a German multi-center study of rare diseases showed PEDIA's superior accuracy in variant prioritization compared to individual scores. This study highlights the importance of further benchmarking and future integration of advanced facial analysis approaches aligned with ACMG guidelines to enhance variant classification.
Affiliation(s)
- Meghna Ahuja Bhasin
- Institute for Genomic Statistics and Bioinformatics, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, 53127 Bonn, Germany
| | - Alexej Knaus
- Institute for Genomic Statistics and Bioinformatics, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, 53127 Bonn, Germany
| | - Pietro Incardona
- Institute for Genomic Statistics and Bioinformatics, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, 53127 Bonn, Germany
- Core Unit for Bioinformatics Data Analysis, Medical Faculty, University of Bonn, 53127 Bonn, Germany
| | - Alexander Schmid
- Institute for Genomic Statistics and Bioinformatics, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, 53127 Bonn, Germany
| | - Manuel Holtgrewe
- CUBI – Core Unit Bioinformatics, Berlin Institute of Health, 10117 Berlin, Germany
| | - Miriam Elbracht
- Institute for Human Genetics and Genomic Medicine, Medical Faculty, RWTH Aachen University, 52062 Aachen, Germany
| | - Peter M. Krawitz
- Institute for Genomic Statistics and Bioinformatics, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, 53127 Bonn, Germany
| | - Tzung-Chien Hsieh
- Institute for Genomic Statistics and Bioinformatics, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, 53127 Bonn, Germany
| |
|