1
Groza T, Rayabsri W, Gration D, Hariram H, Jamuar SS, Baynam G. First steps toward building natural history of diseases computationally: Lessons learned from the Noonan syndrome use case. Am J Hum Genet 2025; 112:1158-1172. [PMID: 40245863 PMCID: PMC12120186 DOI: 10.1016/j.ajhg.2025.03.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2024] [Revised: 03/20/2025] [Accepted: 03/21/2025] [Indexed: 04/19/2025]
Abstract
Rare diseases (RDs) are conditions affecting fewer than 1 in 2,000 people, with over 7,000 identified, primarily genetic in nature, and more than half impacting children. Although each RD affects a small population, collectively, between 3.5% and 5.9% of the global population, or 262.9-446.2 million people, live with an RD. Most RDs lack established treatment protocols, highlighting the need for proper care pathways addressing prognosis, diagnosis, and management. Advances in generative AI and large language models (LLMs) offer new opportunities to document the temporal progression of phenotypic features, addressing gaps in current knowledge bases. This study proposes an LLM-based framework to capture the natural history of diseases, specifically focusing on Noonan syndrome. The framework aims to document phenotypic trajectories, validate against RD knowledge bases, and integrate insights into care coordination using electronic health record (EHR) data from the Undiagnosed Diseases Program Singapore.
Affiliation(s)
- Tudor Groza
  - Rare Care Centre, Perth Children's Hospital, Nedlands, WA 6009, Australia; Bioinformatics Institute, Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street #07-01 Matrix, Singapore 138671, Singapore; SingHealth Duke-NUS Institute of Precision Medicine, 5 Hospital Drive Level 9, Singapore 169609, Singapore; School of Electrical Engineering, Computing and Mathematical Sciences, Curtin University, Kent Street, Bentley, WA 6102, Australia.
- Warittha Rayabsri
  - Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital, 374 Bagot Road, Subiaco, WA 6008, Australia
- Dylan Gration
  - Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital, 374 Bagot Road, Subiaco, WA 6008, Australia
- Harshini Hariram
  - Medical Student, Division of Medical Education, School of Medical Sciences, Faculty of Biology, Medicine and Health, The University of Manchester, Manchester M13 9PL, UK
- Saumya Shekhar Jamuar
  - SingHealth Duke-NUS Institute of Precision Medicine, 5 Hospital Drive Level 9, Singapore 169609, Singapore; Genetics Service, Department of Paediatrics, KK Women's and Children's Hospital, 100 Bukit Timah Road, Singapore 229899, Singapore; SingHealth Duke-NUS Genomic Medicine Centre, 100 Bukit Timah Road, Singapore 229899, Singapore
- Gareth Baynam
  - Rare Care Centre, Perth Children's Hospital, Nedlands, WA 6009, Australia; Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital, 374 Bagot Road, Subiaco, WA 6008, Australia; Faculty of Health and Medical Sciences, University of Western Australia, 35 Stirling Highway, Crawley, WA 6009, Australia
2
Hier DB, Do TS, Obafemi-Ajayi T. A simplified retriever to improve accuracy of phenotype normalizations by large language models. Front Digit Health 2025; 7:1495040. [PMID: 40103736 PMCID: PMC11913805 DOI: 10.3389/fdgth.2025.1495040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2024] [Accepted: 02/12/2025] [Indexed: 03/20/2025]
Abstract
Large language models have shown improved accuracy in phenotype term normalization tasks when augmented with retrievers that suggest candidate normalizations based on term definitions. In this work, we introduce a simplified retriever that enhances large language model accuracy by searching the Human Phenotype Ontology (HPO) for candidate matches using contextual word embeddings from BioBERT without the need for explicit term definitions. Testing this method on terms derived from the clinical synopses of Online Mendelian Inheritance in Man (OMIM®), we demonstrate that the normalization accuracy of GPT-4o increases from a baseline of 62% without augmentation to 85% with retriever augmentation. This approach is potentially generalizable to other biomedical term normalization tasks and offers an efficient alternative to more complex retrieval methods.
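The retrieval step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the character-trigram `embed` function is a deliberately crude stand-in for BioBERT contextual embeddings, the four-term `HPO_TERMS` dictionary stands in for the full ontology, and `retrieve_candidates` and `build_prompt` are hypothetical helper names. Only the overall shape follows the abstract: embed the input term, rank HPO labels by similarity, and pass the top candidates to the language model as context.

```python
import math
from collections import Counter

# Tiny illustrative subset of HPO labels (assumption: a real retriever
# would index every term in the current HPO release).
HPO_TERMS = {
    "HP:0001250": "Seizure",
    "HP:0004322": "Short stature",
    "HP:0000252": "Microcephaly",
    "HP:0000316": "Hypertelorism",
}

def embed(text):
    """Stand-in embedding: character-trigram counts (NOT BioBERT)."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_candidates(term, k=3):
    """Return the k (hpo_id, label) pairs most similar to the input term."""
    query = embed(term)
    ranked = sorted(
        HPO_TERMS.items(),
        key=lambda item: cosine(query, embed(item[1])),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(term, candidates):
    """Assemble a hypothetical prompt that offers the candidates to an LLM."""
    options = "\n".join(f"- {hpo_id}: {label}" for hpo_id, label in candidates)
    return (
        f"Normalize the phenotype term '{term}' to one HPO ID.\n"
        f"Candidate matches:\n{options}\n"
        "Answer with the single best HPO ID."
    )

if __name__ == "__main__":
    candidates = retrieve_candidates("seizures", k=2)
    print(build_prompt("seizures", candidates))
```

A surface-form stand-in like this only matches near-identical spellings; a synonym such as "widely spaced eyes" for Hypertelorism would need the semantic similarity that contextual embeddings are meant to provide.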
Affiliation(s)
- Daniel B Hier
  - Department of Neurology and Rehabilitation, University of Illinois at Chicago, Chicago, IL, United States
- Thanh Son Do
  - Department of Computer Science, Missouri State University, Springfield, MO, United States
- Tayo Obafemi-Ajayi
  - Engineering Program, Missouri State University, Springfield, MO, United States
3
Garcia BT, Westerfield L, Yelemali P, Gogate N, Andres Rivera-Munoz E, Du H, Dawood M, Jolly A, Lupski JR, Posey JE. Improving Automated Deep Phenotyping Through Large Language Models Using Retrieval Augmented Generation. medRxiv 2024:2024.12.01.24318253. [PMID: 39677442 PMCID: PMC11643181 DOI: 10.1101/2024.12.01.24318253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/17/2024]
Abstract
Background: Diagnosing rare genetic disorders relies on precise phenotypic and genotypic analysis, with the Human Phenotype Ontology (HPO) providing a standardized language for capturing clinical phenotypes. Traditional HPO tools, such as Doc2HPO and ClinPhen, employ concept recognition to automate phenotype extraction but struggle with incomplete phenotype assignment, often requiring intensive manual review. While large language models (LLMs) hold promise for more context-driven phenotype extraction, they are prone to errors and "hallucinations," making them less reliable without further refinement. We present RAG-HPO, a Python-based tool that leverages Retrieval-Augmented Generation (RAG) to elevate LLM accuracy in HPO term assignment, bypassing the limitations of baseline models while avoiding the time- and resource-intensive process of fine-tuning. RAG-HPO integrates a dynamic vector database, allowing real-time retrieval and contextual matching.

Methods: The high-dimensional vector database utilized by RAG-HPO includes >54,000 phenotypic phrases mapped to HPO IDs, derived from the HPO database and supplemented with additional validated phrases. The RAG-HPO workflow uses an LLM to first extract phenotypic phrases, which are then matched via semantic similarity to entries within the vector database; the best term matches are then provided back to the LLM as context for final HPO term assignment. A benchmarking dataset of 120 published case reports with 1,792 manually assigned HPO terms was developed, and the performance of RAG-HPO was measured against the existing published tools Doc2HPO, ClinPhen, and FastHPOCR.

Results: In evaluations, RAG-HPO, powered by Llama-3 70B and applied to a set of 120 case reports, achieved a mean precision of 0.84, recall of 0.78, and an F1 score of 0.80, significantly surpassing conventional tools (p < 0.00001). False positive HPO term identification occurred for 15.8% (256/1,624) of terms, of which only 2.7% (7/256) represented hallucinations and 33.6% (86/256) were unrelated terms; the remainder of false positives (63.7%, 163/256) were terms related to the target term.

Conclusions: RAG-HPO is a user-friendly, adaptable tool designed for secure evaluation of clinical text and outperforms standard HPO-matching tools in precision, recall, and F1. Its enhanced precision and recall represent a substantial advancement in phenotypic analysis, accelerating the identification of genetic mechanisms underlying rare diseases and driving progress in genetic research and clinical genomics.
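The evaluation figures in this abstract can be reproduced in outline with a short sketch. It assumes, based on the word "mean" in the abstract, that precision, recall, and F1 were computed per case report and then averaged; under that assumption the mean F1 need not equal the F1 computed from the mean precision and recall, which may be why 0.84 and 0.78 sit alongside 0.80. The two toy reports and their placeholder term labels are invented for illustration; only the false-positive counts (7, 86, 163 of 256, out of 1,624 assigned terms) come from the abstract.

```python
def precision_recall_f1(predicted, gold):
    """Score one report: sets of predicted vs. manually assigned term IDs."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def macro_average(reports):
    """Average per-report precision, recall, and F1 across all reports."""
    scores = [precision_recall_f1(pred, gold) for pred, gold in reports]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))

# Toy data with placeholder term labels (not real HPO IDs).
toy_reports = [
    ({"T1", "T2", "T3", "T4"}, {"T1", "T2", "T3", "T5"}),
    ({"T1"}, {"T1", "T2"}),
]
mean_p, mean_r, mean_f1 = macro_average(toy_reports)
f1_of_means = 2 * mean_p * mean_r / (mean_p + mean_r)

# False-positive breakdown, using the counts reported in the abstract.
fp_counts = {"hallucination": 7, "unrelated": 86, "related_to_target": 163}
total_fp = sum(fp_counts.values())
fp_shares = {k: round(100 * v / total_fp, 1) for k, v in fp_counts.items()}
fp_rate = round(100 * total_fp / 1624, 1)

if __name__ == "__main__":
    print(f"mean P={mean_p:.3f} R={mean_r:.3f} F1={mean_f1:.3f}")
    print(f"F1 computed from mean P and R: {f1_of_means:.3f}")
    print(f"false positives: {total_fp} of 1624 ({fp_rate}%) -> {fp_shares}")
```

The three false-positive categories sum to 256, so the reported percentages are internally consistent.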
Affiliation(s)
- Brandon T. Garcia
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
  - Medical Scientist Training Program, Baylor College of Medicine, Houston, TX, 77030, USA
  - Genetics and Genomics Graduate Program, Baylor College of Medicine, Houston, TX, 77030, USA
- Lauren Westerfield
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
  - Texas Children’s Hospital, Houston, TX, 77303, USA
- Priya Yelemali
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Nikhita Gogate
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
  - Genetics and Genomics Graduate Program, Baylor College of Medicine, Houston, TX, 77030, USA
- E. Andres Rivera-Munoz
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
  - Genetics and Genomics Graduate Program, Baylor College of Medicine, Houston, TX, 77030, USA
- Haowei Du
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Moez Dawood
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
  - Medical Scientist Training Program, Baylor College of Medicine, Houston, TX, 77030, USA
  - Genetics and Genomics Graduate Program, Baylor College of Medicine, Houston, TX, 77030, USA
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA
- Angad Jolly
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- James R. Lupski
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
  - Texas Children’s Hospital, Houston, TX, 77303, USA
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA
  - Department of Pediatrics, Baylor College of Medicine, Houston, TX, 77030, USA
- Jennifer E. Posey
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA