1
Adams MC, Perkins ML, Hudson C, Madhira V, Akbilgic O, Ma D, Hurley RW, Topaloglu U. Breaking Digital Health Barriers Through a Large Language Model-Based Tool for Automated Observational Medical Outcomes Partnership Mapping: Development and Validation Study. J Med Internet Res 2025; 27:e69004. [PMID: 40146872 DOI: 10.2196/69004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2024] [Revised: 02/09/2025] [Accepted: 03/27/2025] [Indexed: 03/29/2025] Open
Abstract
BACKGROUND: The integration of diverse clinical data sources requires standardization through models such as Observational Medical Outcomes Partnership (OMOP). However, mapping data elements to OMOP concepts demands significant technical expertise and time. While large health care systems often have resources for OMOP conversion, smaller clinical trials and studies frequently lack such support, leaving valuable research data siloed.
OBJECTIVE: This study aims to develop and validate a user-friendly tool that leverages large language models to automate the OMOP conversion process for clinical trials, electronic health records, and registry data.
METHODS: We developed a 3-tiered semantic matching system using GPT-3 embeddings to transform heterogeneous clinical data to the OMOP Common Data Model. The system processes input terms by generating vector embeddings, computing cosine similarity against precomputed Observational Health Data Sciences and Informatics vocabulary embeddings, and ranking potential matches. We validated the system using two independent datasets: (1) a development set of 76 National Institutes of Health Helping to End Addiction Long-term Initiative clinical trial common data elements for chronic pain and opioid use disorders and (2) a separate validation set of electronic health record concepts from the National Institutes of Health National COVID Cohort Collaborative COVID-19 enclave. The architecture combines Unified Medical Language System semantic frameworks with asynchronous processing for efficient concept mapping, made available through an open-source implementation.
RESULTS: The system achieved an area under the receiver operating characteristic curve of 0.9975 for mapping clinical trial common data element terms. Precision ranged from 0.92 to 0.99 and recall ranged from 0.88 to 0.97 across similarity thresholds from 0.85 to 1.0. In practical application, the tool successfully automated mappings that previously required manual informatics expertise, reducing the technical barriers for research teams to participate in large-scale, data-sharing initiatives. Representative mappings demonstrated high accuracy, such as demographic terms achieving 100% similarity with corresponding Logical Observation Identifiers Names and Codes concepts. The implementation successfully processes diverse data types through both individual term mapping and batch processing capabilities.
CONCLUSIONS: Our validated large language model-based tool effectively automates the transformation of clinical data into the OMOP format while maintaining high accuracy. The combination of semantic matching capabilities and a researcher-friendly interface makes data harmonization accessible to smaller research teams without requiring extensive informatics support. This has direct implications for accelerating clinical research data standardization and enabling broader participation in initiatives such as the National Institutes of Health Helping to End Addiction Long-term Initiative Data Ecosystem.
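The matching step described in the abstract (embed the input term, score it by cosine similarity against precomputed vocabulary embeddings, rank the candidates) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_similarity` and `rank_matches`, the `TOY_VOCAB` dictionary, and its three-dimensional vectors are hypothetical stand-ins for real embedding output. The 0.85 cutoff is taken from the lowest similarity threshold reported in the abstract.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_matches(term_vec, vocab, threshold=0.85, top_k=3):
    """Score a term embedding against every vocabulary embedding,
    drop candidates below the threshold, and return the best top_k."""
    scored = [(name, cosine_similarity(term_vec, vec)) for name, vec in vocab.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Hypothetical toy embeddings; a real system would precompute these
# with an embedding model over the target vocabulary.
TOY_VOCAB = {
    "Gender": [1.0, 0.0, 0.0],
    "Pain severity": [0.0, 1.0, 0.0],
    "Opioid use": [0.0, 0.6, 0.8],
}

# Hypothetical embedding of an input term that lies close to "Gender".
matches = rank_matches([0.9, 0.1, 0.0], TOY_VOCAB)
for name, score in matches:
    print(f"{name}: {score:.4f}")
```

Raising `threshold` in this sketch keeps fewer candidates, so the mappings that survive are more likely to be correct but more terms go unmapped. That is the usual precision and recall trade-off behind reporting both metrics across a range of thresholds.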
Affiliation(s)
- Meredith CB Adams
  - Department of Anesthesiology, Artificial Intelligence, Translational Neuroscience, and Public Health Sciences, Wake Forest University School of Medicine, Winston-Salem, NC, United States
- Matthew L Perkins
  - Department of Cancer Biology, Wake Forest University School of Medicine, Winston-Salem, NC, United States
- Cody Hudson
  - Department of Cancer Biology, Wake Forest University School of Medicine, Winston-Salem, NC, United States
- Oguz Akbilgic
  - Department of Cardiovascular Medicine, Wake Forest University School of Medicine, Winston-Salem, NC, United States
- Da Ma
  - Department of Internal Medicine, Wake Forest University School of Medicine, Winston-Salem, NC, United States
- Robert W Hurley
  - Department of Anesthesiology, Translational Neuroscience, and Public Health Sciences, Wake Forest University School of Medicine, Winston-Salem, NC, United States
- Umit Topaloglu
  - Department of Cancer Biology, Wake Forest University School of Medicine, Winston-Salem, NC, United States
  - Clinical Translational Research Informatics Branch, National Cancer Institute, Bethesda, MD, United States
2
Shiferaw KB, Balaur I, Welter D, Waltemath D, Zeleke AA. CALIFRAME: a proposed method of calibrating reporting guidelines with FAIR principles to foster reproducibility of AI research in medicine. JAMIA Open 2024; 7:ooae105. [PMID: 39430802 PMCID: PMC11488973 DOI: 10.1093/jamiaopen/ooae105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2024] [Revised: 09/20/2024] [Accepted: 09/26/2024] [Indexed: 10/22/2024] Open
Abstract
Background: Procedural and reporting guidelines are crucial in framing scientific practices and communications among researchers and the broader community. These guidelines aim to ensure transparency, reproducibility, and reliability in scientific research. Despite several methodological frameworks proposed by various initiatives to foster reproducibility, challenges such as data leakage and reproducibility remain prevalent. Recent studies have highlighted the transformative potential of incorporating the FAIR (Findable, Accessible, Interoperable, and Reusable) principles into workflows, particularly in contexts like software and machine learning model development, to promote open science.
Objective: This study aims to introduce a comprehensive framework, designed to calibrate existing reporting guidelines against the FAIR principles. The goal is to enhance reproducibility and promote open science by integrating these principles into the scientific reporting process.
Methods: We employed the "Best fit" framework synthesis approach which involves systematically reviewing and synthesizing existing frameworks and guidelines to identify best practices and gaps. We then proposed a series of defined workflows to align reporting guidelines with FAIR principles. A use case was developed to demonstrate the practical application of the framework.
Results: The integration of FAIR principles with established reporting guidelines through the framework effectively bridges the gap between FAIR metrics and traditional reporting standards. The framework provides a structured approach to enhance the findability, accessibility, interoperability, and reusability of scientific data and outputs. The use case demonstrated the practical benefits of the framework, showing improved data management and reporting practices.
Discussion: The framework addresses critical challenges in scientific research, such as data leakage and reproducibility issues. By embedding FAIR principles into reporting guidelines, the framework ensures that scientific outputs are more transparent, reliable, and reusable. This integration not only benefits researchers by improving data management practices but also enhances the overall scientific process by promoting open science and collaboration.
Conclusion: The proposed framework successfully combines FAIR principles with reporting guidelines, offering a robust solution to enhance reproducibility and open science. This framework can be applied across various contexts, including software and machine learning model development stages, to foster a more transparent and collaborative scientific environment.
Affiliation(s)
- Kirubel Biruk Shiferaw
  - Department of Medical Informatics, Institute for Community Medicine, University Medicine Greifswald, Greifswald D-17475, Germany
- Irina Balaur
  - Luxembourg Centre for Systems Biology, University of Luxembourg, Belvaux L-4367, Luxembourg
- Danielle Welter
  - Luxembourg National Data Service, Esch-sur-Alzette L-4362, Luxembourg
- Dagmar Waltemath
  - Department of Medical Informatics, Institute for Community Medicine, University Medicine Greifswald, Greifswald D-17475, Germany
- Atinkut Alamirrew Zeleke
  - Department of Medical Informatics, Institute for Community Medicine, University Medicine Greifswald, Greifswald D-17475, Germany
3
Gierend K, Krüger F, Genehr S, Hartmann F, Siegel F, Waltemath D, Ganslandt T, Zeleke AA. Provenance Information for Biomedical Data and Workflows: Scoping Review. J Med Internet Res 2024; 26:e51297. [PMID: 39178413 PMCID: PMC11380065 DOI: 10.2196/51297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2023] [Revised: 05/30/2024] [Accepted: 06/17/2024] [Indexed: 08/25/2024] Open
Abstract
BACKGROUND: The record of the origin and the history of data, known as provenance, holds importance. Provenance information leads to higher interpretability of scientific results and enables reliable collaboration and data sharing. However, the lack of comprehensive evidence on provenance approaches hinders the uptake of good scientific practice in clinical research.
OBJECTIVE: This scoping review aims to identify approaches and criteria for provenance tracking in the biomedical domain. We reviewed the state-of-the-art frameworks, associated artifacts, and methodologies for provenance tracking.
METHODS: This scoping review followed the methodological framework developed by Arksey and O'Malley. We searched the PubMed and Web of Science databases for English-language articles published from 2006 to 2022. Title and abstract screening were carried out by 4 independent reviewers using the Rayyan screening tool. A majority vote was required for consent on the eligibility of papers based on the defined inclusion and exclusion criteria. Full-text reading and screening were performed independently by 2 reviewers, and information was extracted into a pretested template for the 5 research questions. Disagreements were resolved by a domain expert. The study protocol has previously been published.
RESULTS: The search resulted in a total of 764 papers. Of 624 identified, deduplicated papers, 66 (10.6%) studies fulfilled the inclusion criteria. We identified diverse provenance-tracking approaches ranging from practical provenance processing and managing to theoretical frameworks distinguishing diverse concepts and details of data and metadata models, provenance components, and notations. A substantial majority investigated underlying requirements to varying extents and validation intensities but lacked completeness in provenance coverage. Mostly, cited requirements concerned the knowledge about data integrity and reproducibility. Moreover, these revolved around robust data quality assessments, consistent policies for sensitive data protection, improved user interfaces, and automated ontology development. We found that different stakeholder groups benefit from the availability of provenance information. Thereby, we recognized that the term provenance is subjected to an evolutionary and technical process with multifaceted meanings and roles. Challenges included organizational and technical issues linked to data annotation, provenance modeling, and performance, amplified by subsequent matters such as enhanced provenance information and quality principles.
CONCLUSIONS: As data volumes grow and computing power increases, the challenge of scaling provenance systems to handle data efficiently and assist complex queries intensifies, necessitating automated and scalable solutions. With rising legal and scientific demands, there is an urgent need for greater transparency in implementing provenance systems in research projects, despite the challenges of unresolved granularity and knowledge bottlenecks. We believe that our recommendations enable quality and guide the implementation of auditable and measurable provenance approaches as well as solutions in the daily tasks of biomedical scientists.
INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): RR2-10.2196/31750.
Affiliation(s)
- Kerstin Gierend
  - Department of Biomedical Informatics, Mannheim Institute for intelligent Systems in Medicine, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
- Frank Krüger
  - Faculty of Engineering, Wismar University of Applied Sciences, Wismar, Germany
  - Institute of Communications Engineering, University of Rostock, Rostock, Germany
- Sascha Genehr
  - Institute of Communications Engineering, University of Rostock, Rostock, Germany
- Francisca Hartmann
  - Department of Biomedical Informatics, Mannheim Institute for intelligent Systems in Medicine, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
- Fabian Siegel
  - Department of Biomedical Informatics, Mannheim Institute for intelligent Systems in Medicine, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
- Dagmar Waltemath
  - Department of Medical Informatics, University Medicine Greifswald, Greifswald, Germany
- Thomas Ganslandt
  - Chair of Medical Informatics, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
4
Clark T, Mohan J, Schaffer L, Obernier K, Al Manir S, Churas CP, Dailamy A, Doctor Y, Forget A, Hansen JN, Hu M, Lenkiewicz J, Levinson MA, Marquez C, Nourreddine S, Niestroy J, Pratt D, Qian G, Thaker S, Bélisle-Pipon JC, Brandt C, Chen J, Ding Y, Fodeh S, Krogan N, Lundberg E, Mali P, Payne-Foster P, Ratcliffe S, Ravitsky V, Sali A, Schulz W, Ideker T. Cell Maps for Artificial Intelligence: AI-Ready Maps of Human Cell Architecture from Disease-Relevant Cell Lines. bioRxiv 2024:2024.05.21.589311. [PMID: 38826258 PMCID: PMC11142054 DOI: 10.1101/2024.05.21.589311] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2024]
Abstract
This article describes the Cell Maps for Artificial Intelligence (CM4AI) project and its goals, methods, standards, current datasets, software tools, status, and future directions. CM4AI is the Functional Genomics Data Generation Project in the U.S. National Institutes of Health's (NIH) Bridge2AI program. Its overarching mission is to produce ethical, AI-ready datasets of cell architecture, inferred from multimodal data collected for human cell lines, to enable transformative biomedical AI research.
5
Sullivan BA, Kausch SL, Fairchild KD. Artificial and human intelligence for early identification of neonatal sepsis. Pediatr Res 2023; 93:350-356. [PMID: 36127407 PMCID: PMC11749885 DOI: 10.1038/s41390-022-02274-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 04/18/2022] [Revised: 07/29/2022] [Accepted: 08/05/2022] [Indexed: 11/09/2022]
Abstract
Artificial intelligence may have a role in the early detection of sepsis in neonates. Machine learning can identify patterns that predict high or increasing risk for clinical deterioration from a sepsis-like illness. In developing this potential addition to NICU care, careful consideration should be given to the data and methods used to develop, validate, and evaluate prediction models. When an AI system alerts clinicians to a change in a patient's condition that warrants a bedside evaluation, human intelligence and experience come into play to determine an appropriate course of action: evaluate and treat or wait and watch closely. With intelligently developed, validated, and implemented AI sepsis systems, both clinicians and patients stand to benefit. IMPACT: This narrative review highlights the application of AI in neonatal sepsis prediction. It describes issues in clinical prediction model development specific to this population. This article reviews the methods, considerations, and literature on neonatal sepsis model development and validation. Challenges of AI technology and potential barriers to using sepsis AI systems in the NICU are discussed.
Affiliation(s)
- Brynne A Sullivan
  - Department of Pediatrics, University of Virginia School of Medicine, Charlottesville, VA, USA
- Sherry L Kausch
  - Department of Pediatrics, University of Virginia School of Medicine, Charlottesville, VA, USA
- Karen D Fairchild
  - Department of Pediatrics, University of Virginia School of Medicine, Charlottesville, VA, USA
6
Sheffield NC, Bonazzi VR, Bourne PE, Burdett T, Clark T, Grossman RL, Spjuth O, Yates AD. From biomedical cloud platforms to microservices: next steps in FAIR data and analysis. Sci Data 2022; 9:553. [PMID: 36075919 PMCID: PMC9458632 DOI: 10.1038/s41597-022-01619-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 02/24/2022] [Accepted: 08/08/2022] [Indexed: 11/29/2022] Open
Affiliation(s)
- Nathan C Sheffield
  - Center for Public Health Genomics, School of Medicine, University of Virginia, 22908, Charlottesville, VA, USA
  - School of Data Science, University of Virginia, 22904, Charlottesville, VA, USA
  - Department of Biomedical Engineering, School of Medicine, University of Virginia, 22904, Charlottesville, VA, USA
  - Department of Public Health Sciences, School of Medicine, University of Virginia, 22908, Charlottesville, VA, USA
  - Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, 22908, Charlottesville, VA, USA
- Philip E Bourne
  - School of Data Science, University of Virginia, 22904, Charlottesville, VA, USA
  - Department of Biomedical Engineering, School of Medicine, University of Virginia, 22904, Charlottesville, VA, USA
- Tony Burdett
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Timothy Clark
  - School of Data Science, University of Virginia, 22904, Charlottesville, VA, USA
  - Department of Public Health Sciences, School of Medicine, University of Virginia, 22908, Charlottesville, VA, USA
- Robert L Grossman
  - Center for Translational Data Science, University of Chicago, Chicago, IL, 60615, USA
- Ola Spjuth
  - Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, 75124, Uppsala, Sweden
- Andrew D Yates
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK