1
Wang A, Liu C, Yang J, Weng C. Fine-tuning large language models for rare disease concept normalization. J Am Med Inform Assoc 2024:ocae133. [PMID: 38829731 DOI: 10.1093/jamia/ocae133] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2023] [Revised: 05/20/2024] [Accepted: 05/22/2024] [Indexed: 06/05/2024]
Abstract
OBJECTIVE We aim to develop a novel method for rare disease concept normalization by fine-tuning Llama 2, an open-source large language model (LLM), using a domain-specific corpus sourced from the Human Phenotype Ontology (HPO). METHODS We developed an in-house template-based script to generate two corpora for fine-tuning. The first (NAME) contains standardized HPO names, sourced from the HPO vocabularies, along with their corresponding identifiers. The second (NAME+SYN) includes HPO names and half of the concept's synonyms as well as identifiers. Subsequently, we fine-tuned Llama 2 (Llama2-7B) for each sentence set and conducted an evaluation using a range of sentence prompts and various phenotype terms. RESULTS When the phenotype terms for normalization were included in the fine-tuning corpora, both models demonstrated nearly perfect performance, averaging over 99% accuracy. In comparison, ChatGPT-3.5 has only ∼20% accuracy in identifying HPO IDs for phenotype terms. When single-character typos were introduced in the phenotype terms, the accuracy of NAME and NAME+SYN is 10.2% and 36.1%, respectively, but increases to 61.8% (NAME+SYN) with additional typo-specific fine-tuning. For terms sourced from HPO vocabularies as unseen synonyms, the NAME model achieved 11.2% accuracy, while the NAME+SYN model achieved 92.7% accuracy. CONCLUSION Our fine-tuned models demonstrate ability to normalize phenotype terms unseen in the fine-tuning corpus, including misspellings, synonyms, terms from other ontologies, and laymen's terms. Our approach provides a solution for the use of LLMs to identify named medical entities from clinical narratives, while successfully normalizing them to standard concepts in a controlled vocabulary.
Affiliation(s)
- Andy Wang
  - Peddie School, Hightstown, NJ 08520, United States
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States
- Cong Liu
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States
- Jingye Yang
  - Department of Mathematics, University of Pennsylvania, Philadelphia, PA 19104, United States
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States
2
Wang A, Liu C, Yang J, Weng C. Fine-tuning Large Language Models for Rare Disease Concept Normalization. bioRxiv 2024:2023.12.28.573586. [PMID: 38234802 PMCID: PMC10793431 DOI: 10.1101/2023.12.28.573586] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2024]
Abstract
Objective We aim to develop a novel method for rare disease concept normalization by fine-tuning Llama 2, an open-source large language model (LLM), using a domain-specific corpus sourced from the Human Phenotype Ontology (HPO). Methods We developed an in-house template-based script to generate two corpora for fine-tuning. The first (NAME) contains standardized HPO names, sourced from the HPO vocabularies, along with their corresponding identifiers. The second (NAME+SYN) includes HPO names and half of the concept's synonyms as well as identifiers. Subsequently, we fine-tuned Llama2 (Llama2-7B) for each sentence set and conducted an evaluation using a range of sentence prompts and various phenotype terms. Results When the phenotype terms for normalization were included in the fine-tuning corpora, both models demonstrated nearly perfect performance, averaging over 99% accuracy. In comparison, ChatGPT-3.5 has only ~20% accuracy in identifying HPO IDs for phenotype terms. When single-character typos were introduced in the phenotype terms, the accuracy of NAME and NAME+SYN is 10.2% and 36.1%, respectively, but increases to 61.8% (NAME+SYN) with additional typo-specific fine-tuning. For terms sourced from HPO vocabularies as unseen synonyms, the NAME model achieved 11.2% accuracy, while the NAME+SYN model achieved 92.7% accuracy. Conclusion Our fine-tuned models demonstrate ability to normalize phenotype terms unseen in the fine-tuning corpus, including misspellings, synonyms, terms from other ontologies, and laymen's terms. Our approach provides a solution for the use of LLM to identify named medical entities from the clinical narratives, while successfully normalizing them to standard concepts in a controlled vocabulary.
Affiliation(s)
- Andy Wang
  - Peddie School, Hightstown, NJ, USA
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Cong Liu
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Jingye Yang
  - Department of Mathematics, University of Pennsylvania, Philadelphia, PA, USA
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
3
Facile R, Muhlbradt EE, Gong M, Li Q, Popat V, Pétavy F, Cornet R, Ruan Y, Koide D, Saito TI, Hume S, Rockhold F, Bao W, Dubman S, Jauregui Wurst B. Use of Clinical Data Interchange Standards Consortium (CDISC) Standards for Real-world Data: Expert Perspectives From a Qualitative Delphi Survey. JMIR Med Inform 2022; 10:e30363. [PMID: 35084343 PMCID: PMC8832264 DOI: 10.2196/30363] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 06/01/2021] [Revised: 09/17/2021] [Accepted: 10/09/2021] [Indexed: 01/16/2023]
Abstract
Background Real-world data (RWD) and real-world evidence (RWE) are playing increasingly important roles in clinical research and health care decision-making. To leverage RWD and generate reliable RWE, data should be well defined and structured in a way that is semantically interoperable and consistent across stakeholders. The adoption of data standards is one of the cornerstones supporting high-quality evidence for the development of clinical medicine and therapeutics. Clinical Data Interchange Standards Consortium (CDISC) data standards are mature, globally recognized, and heavily used by the pharmaceutical industry for regulatory submissions. The CDISC RWD Connect Initiative aims to better understand the barriers to implementing CDISC standards for RWD and to identify the tools and guidance needed to more easily implement them. Objective The aim of this study is to understand the barriers to implementing CDISC standards for RWD and to identify the tools and guidance that may be needed to implement CDISC standards more easily for this purpose. Methods We conducted a qualitative Delphi survey involving an expert advisory board with multiple key stakeholders, with 3 rounds of input and review. Results Overall, 66 experts participated in round 1, 56 in round 2, and 49 in round 3 of the Delphi survey. Their inputs were collected and analyzed, culminating in group statements. It was widely agreed that the standardization of RWD is highly necessary, and the primary focus should be on its ability to improve data sharing and the quality of RWE. The priorities for RWD standardization included electronic health records, such as data shared using Health Level 7 Fast Healthcare Interoperability Resources (FHIR), and the data stemming from observational studies. With different standardization efforts already underway in these areas, a gap analysis should be performed to identify the areas where synergies and efficiencies are possible and then collaborate with stakeholders to create or extend existing mappings between CDISC and other standards, controlled terminologies, and models to represent data originating across different sources. Conclusions There are many ongoing data standardization efforts around human health data–related activities, each with different definitions, levels of granularity, and purpose. Among these, CDISC has been successful in standardizing clinical trial-based data for regulation worldwide. However, the complexity of the CDISC standards and the fact that they were developed for different purposes, combined with the lack of awareness and incentives to use a new standard and insufficient training and implementation support, are significant barriers to setting up the use of CDISC standards for RWD. The collection and dissemination of use cases, development of tools and support systems for the RWD community, and collaboration with other standards development organizations are potential steps forward. Using CDISC will help link clinical trial data and RWD and promote innovation in health data science.
Affiliation(s)
- Rhonda Facile
  - Clinical Data Interchange Standards Consortium, Austin, TX, United States
- Mengchun Gong
  - Digital Health China Technologies, Beijing, China
  - Institute of Health Management, Southern Medical University, Guangzhou, China
- Qingna Li
  - Institute of Clinical Pharmacology, Xiyuan Hospital of China Academy of Chinese Medical Sciences, Beijing, China
  - Key Laboratory for Clinical Research and Evaluation of Traditional Chinese Medicine of National Medical Products Administration, Beijing, China
  - National Clinical Research Center for Chinese Medicine Cardiology, Beijing, China
- Vaishali Popat
  - Food and Drug Administration, Center for Drug Evaluation Research, Silver Spring, MD, United States
- Frank Pétavy
  - European Medicines Agency, Amsterdam, Netherlands
- Ronald Cornet
  - Department of Medical Informatics, Amsterdam Public Health Research Institute, Amsterdam University Medical Centers - University of Amsterdam, Amsterdam, Netherlands
- Daisuke Koide
  - Department of Biostatistics & Bioinformatics, Graduate School of Medicine, University of Tokyo, Tokyo, Japan
- Toshiki I Saito
  - National Hospital Organization Nagoya Medical Center, Nagoya, Japan
- Sam Hume
  - Clinical Data Interchange Standards Consortium, Austin, TX, United States
- Frank Rockhold
  - Duke Clinical Research Institute, Duke University Medical Center, Durham, NC, United States
- Wenjun Bao
  - JMP Life Sciences, SAS Institute Inc, Cary, NC, United States
- Sue Dubman
  - Clinical Data Interchange Standards Consortium, Austin, TX, United States
4
Ros F, Kush R, Friedman C, Gil Zorzo E, Rivero Corte P, Rubin JC, Sanchez B, Stocco P, Van Houweling D. Addressing the Covid-19 pandemic and future public health challenges through global collaboration and a data-driven systems approach. Learn Health Syst 2021; 5:e10253. [PMID: 33349796 PMCID: PMC7744897 DOI: 10.1002/lrh2.10253] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 07/13/2020] [Revised: 11/02/2020] [Accepted: 11/17/2020] [Indexed: 11/11/2022]
Abstract
Covid-19 has already taught us that the greatest public health challenges of our generation will show no respect for national boundaries, will impact lives and health of people of all nations, and will affect economies and quality of life in unprecedented ways. The types of rapid learning envisioned to address Covid-19 and future public health crises require a systems approach that enables sharing of data and lessons learned at scale. Agreement on a systems approach augmented by technology and standards will be foundational to making such learning meaningful and to ensuring its scientific integrity. With this purpose in mind, a group of individuals from Spain, Italy, and the United States have formed a transatlantic collaboration, with the aim of generating a proposed comprehensive standards-based systems approach and data-driven framework for collection, management, and analysis of high-quality data. This framework will inform decisions in managing clinical responses and social measures to overcome the Covid-19 global pandemic and to prepare for future public health crises. We first argue that standardized data of the type now common in global regulated clinical research is the essential fuel that will power a global system for addressing (and preventing) current and future pandemics. We then present a blueprint for a system that will put these data to use in driving a range of key decisions. In the context of this system, we describe and categorize the specific types of data the system will require for different purposes and document the standards currently in use for each of these categories in the three nations participating in this work. In so doing, we anticipate some of the challenges to harmonizing these data but also suggest opportunities for further global standardization and harmonization. While we have scaled this transnational effort to three nations, we hope to stimulate an international dialogue with a culmination of realizing such a system.
Affiliation(s)
- Francisco Ros
  - Escuela Técnica Superior Ingenieros de Telecomunicación, Universidad Politécnica de Madrid, Madrid, Spain
- Rebecca Kush
  - Elligo Health Research and Catalysis, Austin, Texas, USA
- Charles Friedman
  - Department of Learning Health Sciences, University of Michigan Medical School, Ann Arbor, Michigan, USA
- Joshua C. Rubin
  - Department of Learning Health Sciences, University of Michigan Medical School, Ann Arbor, Michigan, USA
- Borja Sanchez
  - Ministry of Science, Innovation and University, Government of the Principality of Asturias, Oviedo, Spain
- Paolo Stocco
  - Services Care for Elderly, ASSP Cortina, Cortina d'Ampezzo BL, Italy
- Douglas Van Houweling
  - Department of Learning Health Sciences, University of Michigan Medical School, Ann Arbor, Michigan, USA
5
Kush R, Warzel D, Kush M, Sherman A, Navarro E, Fitzmartin R, Pétavy F, Galvez J, Becnel L, Zhou F, Harmon N, Jauregui B, Jackson T, Hudson L. FAIR data sharing: The roles of common data elements and harmonization. J Biomed Inform 2020; 107:103421. [DOI: 10.1016/j.jbi.2020.103421] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 11/11/2019] [Revised: 03/13/2020] [Accepted: 04/09/2020] [Indexed: 10/24/2022]
6
Development of electronic medical records for clinical and research purposes: the breast cancer module using an implementation framework in a middle income country- Malaysia. BMC Bioinformatics 2019; 19:402. [PMID: 30717675 PMCID: PMC7394320 DOI: 10.1186/s12859-018-2406-9] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 05/24/2018] [Accepted: 10/03/2018] [Indexed: 01/18/2023]
Abstract
BACKGROUND Advances in the medical domain have led to an increase in clinical data production, which offers enhancement opportunities for the clinical research sector. In this paper, we propose to expand the scope of Electronic Medical Records in the University Malaya Medical Center (UMMC) using different techniques in establishing interoperability functions between multiple clinical departments involving diagnosis, screening and treatment of breast cancer and building automatic systems for clinical audits as well as for potential data mining to enhance clinical breast cancer research in the future. RESULTS The Quality Implementation Framework (QIF) was adopted to develop the breast cancer module as part of the in-house EMR system used at UMMC, called i-Pesakit©. The completion of the i-Pesakit© Breast Cancer Module requires electronic management of clinical data and integration of clinical data from multiple internal clinical departments, towards setting up a research-focused patient data governance model. The 14 QIF steps were performed in the four main phases of this study, which are (i) initial considerations regarding host setting, (ii) creating structure for implementation, (iii) ongoing structure once implementation begins, and (iv) improving future applications. The architectural framework of the module incorporates both clinical and research needs that comply with the Personal Data Protection Act. CONCLUSION The completion of the UMMC i-Pesakit© Breast Cancer Module required populating the EMR, including management of clinical data access, establishing an information technology and research-focused governance model, and integrating clinical data from multiple internal clinical departments. This multidisciplinary collaboration has enhanced the quality of data capture in clinical service, benefited hospital data monitoring, quality assurance, audit reporting and research data management, and provided a framework for implementing a responsive EMR for a clinical and research organization in a typical middle-income country setting. Future applications include establishing integration with external organizations such as the National Registration Department for mortality data, reporting of institutional data for the national cancer registry, as well as data mining for clinical research. We believe that integration of multiple clinical visit data sources provides a more comprehensive, accurate and real-time update of clinical data to be used for epidemiological studies and audits.
7
Costeloe K, Turner MA, Padula MA, Shah PS, Modi N, Soll R, Haumont D, Kusuda S, Göpel W, Chang YS, Smith PB, Lui K, Davis JM, Hudson LD. Sharing Data to Accelerate Medicine Development and Improve Neonatal Care: Data Standards and Harmonized Definitions. J Pediatr 2018; 203:437-441.e1. [PMID: 30293637 DOI: 10.1016/j.jpeds.2018.07.082] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 11/18/2017] [Revised: 06/06/2018] [Accepted: 07/25/2018] [Indexed: 01/06/2023]
Affiliation(s)
- Kate Costeloe
  - Paediatric Research, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- Mark A Turner
  - Institute of Translational Medicine, University of Liverpool, Liverpool, United Kingdom
- Michael A Padula
  - Division of Neonatology, Children's Hospital of Philadelphia, Philadelphia, PA
- Prakesh S Shah
  - Department of Paediatrics and Institute of Health Policy, Management and Evaluation, Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Canada
- Neena Modi
  - Neonatal Medicine, Imperial College London, Chelsea and Westminster Hospital Campus, London, United Kingdom
- Roger Soll
  - Vermont Oxford Network, Neonatology, University of Vermont College of Medicine, Burlington, VT
- Dominique Haumont
  - Department of Neonatology, Saint-Pierre University Hospital, Brussels, Belgium
- Satoshi Kusuda
  - Department of Neonatology, Maternal and Perinatal Center, Tokyo Women's Medical University, Tokyo, Japan
- Wolfgang Göpel
  - Neonatology and Paediatric Intensive Care, University of Lübeck, Department of Paediatrics, Lübeck, Germany
- Yun Sil Chang
  - Department of Pediatrics, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Republic of Korea
- P Brian Smith
  - Department of Pediatrics, Duke University School of Medicine, Durham, NC
- Kei Lui
  - Discipline of Paediatrics, School of Women's and Children's Health, Sydney, New South Wales, Australia
- Jonathan M Davis
  - Department of Paediatrics, Floating Hospital for Children, Tufts Medical Center, Tufts Clinical and Translational Science Institute, Tufts University, Boston, MA