1
Wang L, Wan Z, Ni C, Song Q, Li Y, Clayton E, Malin B, Yin Z. Applications and Concerns of ChatGPT and Other Conversational Large Language Models in Health Care: Systematic Review. J Med Internet Res 2024; 26:e22769. [PMID: 39509695 PMCID: PMC11582494 DOI: 10.2196/22769] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/23/2024] [Revised: 09/19/2024] [Accepted: 10/03/2024] [Indexed: 11/15/2024] Open
Abstract
BACKGROUND: The launch of ChatGPT (OpenAI) in November 2022 attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including health care. Numerous studies have since been conducted regarding how to use state-of-the-art LLMs in health-related scenarios.
OBJECTIVE: This review aims to summarize applications of and concerns regarding conversational LLMs in health care and provide an agenda for future research in this field.
METHODS: We used PubMed, ACM, and the IEEE digital libraries as primary sources for this review. We followed the guidance of PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) to screen and select peer-reviewed research articles that (1) were related to health care applications and conversational LLMs and (2) were published before September 1, 2023, the date when we started paper collection. We investigated these papers and classified them according to their applications and concerns.
RESULTS: Our search initially identified 820 papers according to targeted keywords, out of which 65 (7.9%) papers met our criteria and were included in the review. The most popular conversational LLM was ChatGPT (60/65, 92% of papers), followed by Bard (Google LLC; 1/65, 2% of papers), LLaMA (Meta; 1/65, 2% of papers), and other LLMs (6/65, 9% of papers). These papers were classified into four categories of applications: (1) summarization, (2) medical knowledge inquiry, (3) prediction (eg, diagnosis, treatment recommendation, and drug synergy), and (4) administration (eg, documentation and information collection), and four categories of concerns: (1) reliability (eg, training data quality, accuracy, interpretability, and consistency in responses), (2) bias, (3) privacy, and (4) public acceptability. There were 49 (75%) papers using LLMs for either summarization or medical knowledge inquiry, or both, and there were 58 (89%) papers expressing concerns about either reliability or bias, or both. We found that conversational LLMs exhibited promising results in summarization and in providing general medical knowledge to patients with relatively high accuracy. However, conversational LLMs such as ChatGPT are not always able to provide reliable answers to complex health-related tasks (eg, diagnosis) that require specialized domain expertise. While bias or privacy issues are often noted as concerns, no experiments in our reviewed papers thoughtfully examined how conversational LLMs lead to these issues in health care research.
CONCLUSIONS: Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as investigating the mechanisms of how LLM applications bring bias and privacy issues. Considering the vast accessibility of LLMs, legal, social, and technical efforts are all needed to address concerns about LLMs to promote, improve, and regularize the application of LLMs in health care.
Affiliation(s)
- Leyao Wang
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
- Zhiyu Wan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
- School of Biomedical Engineering, ShanghaiTech University, Shanghai, China
- Congning Ni
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
- Qingyuan Song
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
- Yang Li
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
- Ellen Clayton
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
- Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN, United States
- School of Law, Vanderbilt University Medical Center, Nashville, TN, United States
- Bradley Malin
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, United States
- Zhijun Yin
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
2
Desmet C, Cook DJ. HydraGAN: A Cooperative Agent Model for Multi-Objective Data Generation. ACM T INTEL SYST TEC 2024; 15:60. [PMID: 39469108 PMCID: PMC11513586 DOI: 10.1145/3653982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2023] [Accepted: 02/26/2024] [Indexed: 10/30/2024]
Abstract
Generative adversarial networks have become a de facto approach to generate synthetic data points that resemble their real counterparts. We tackle the situation where the realism of individual samples is not the sole criterion for synthetic data generation. Additional constraints such as privacy preservation, distribution realism, and diversity promotion may also be essential to optimize. To address this challenge, we introduce HydraGAN, a multi-agent network that performs multi-objective synthetic data generation. We theoretically verify that training the HydraGAN system, containing a single generator and an arbitrary number of discriminators, leads to a Nash equilibrium. Experimental results for six datasets indicate that HydraGAN consistently outperforms prior methods in maximizing the Area under the Radar Curve (AuRC), balancing a combination of cooperative or competitive data generation goals.
3
Gourabathina A, Wan Z, Brown JT, Yan C, Malin BA. PanDa Game: Optimized Privacy-Preserving Publishing of Individual-Level Pandemic Data Based on a Game Theoretic Model. IEEE Trans Nanobioscience 2023; 22:808-817. [PMID: 37289605 PMCID: PMC10702143 DOI: 10.1109/tnb.2023.3284092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Sharing individual-level pandemic data is essential for accelerating the understanding of a disease. For example, COVID-19 data have been widely collected to support public health surveillance and research. In the United States, these data are typically de-identified before publication to protect the privacy of the corresponding individuals. However, current data publishing approaches for this type of data, such as those adopted by the U.S. Centers for Disease Control and Prevention (CDC), have not flexed over time to account for the dynamic nature of infection rates. Thus, the policies generated by these strategies have the potential to either raise privacy risks or overprotect the data and impair the data utility (or usability). To optimize the tradeoff between privacy risk and data utility, we introduce a game theoretic model that adaptively generates policies for the publication of individual-level COVID-19 data according to infection dynamics. We model the data publishing process as a two-player Stackelberg game between a data publisher and a data recipient and then search for the best strategy for the publisher. In this game, we consider 1) average performance of predicting future case counts; and 2) mutual information between the original data and the released data. We use COVID-19 case data from Vanderbilt University Medical Center from March 2020 to December 2021 to demonstrate the effectiveness of the new model. The results indicate that the game theoretic model outperforms all state-of-the-art baseline approaches, including those adopted by CDC, while maintaining low privacy risk. We further perform extensive sensitivity analyses to show that our findings are robust to order-of-magnitude parameter fluctuations.
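The abstract names mutual information between the original and released data as one quantity in the game but gives no formula. As a rough illustration only, the sketch below computes the standard empirical mutual information for paired values; the ages, the `generalize` helper, and the bin widths are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y), in bits, of two paired sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # I(X;Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def generalize(age, width):
    """Hypothetical de-identification step: replace an age with a bin label."""
    low = age // width * width
    return f"{low}-{low + width - 1}"


# Made-up original ages and two made-up release policies (10-year and 40-year bins).
original = [23, 27, 34, 38, 45, 49, 52, 58]
released_10 = [generalize(a, 10) for a in original]
released_40 = [generalize(a, 40) for a in original]

mi_10 = mutual_information(original, released_10)
mi_40 = mutual_information(original, released_40)
print(mi_10, mi_40)
```

In this toy setting the coarser policy retains less information about the original values, which is the kind of tradeoff a publisher would weigh against privacy risk.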
Affiliation(s)
- Abinitha Gourabathina
- Department of Operations Research & Financial Engineering, Princeton University, Princeton, NJ 08540 USA
- Zhiyu Wan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203 USA
- J. Thomas Brown
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203 USA
- Chao Yan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203 USA
- Bradley A. Malin
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203 USA
- Department of Computer Science, Vanderbilt University, Nashville, TN 37212 USA
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203 USA
4
Ma S, Yu J, Qin X, Liu J. Current status and challenges in establishing reference intervals based on real-world data. Crit Rev Clin Lab Sci 2023; 60:427-441. [PMID: 37038925 DOI: 10.1080/10408363.2023.2195496] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 09/27/2022] [Revised: 01/29/2023] [Accepted: 03/22/2023] [Indexed: 04/12/2023]
Abstract
Reference intervals (RIs) are the cornerstone for evaluation of test results in clinical practice and are invaluable in judging patient health and making clinical decisions. Establishing RIs based on clinical laboratory data is a branch of real-world data mining research. Compared to the traditional direct method, this indirect approach is highly practical, widely applicable, and low-cost. Improving the accuracy of RIs requires not only the collection of sufficient data and the use of correct statistical methods, but also proper stratification of heterogeneous subpopulations. This includes the establishment of age-specific RIs and taking into account other characteristics of reference individuals. Although there are many studies on establishing RIs by indirect methods, it is still very difficult for laboratories to select appropriate statistical methods due to the lack of formal guidelines. This review describes the application of real-world data and an approach for establishing indirect reference intervals (iRIs). We summarize the processes for establishing iRIs using real-world data and analyze the principle and applicable scope of the indirect method model in detail. Moreover, we compare different methods for constructing growth curves to establish age-specific RIs, in hopes of providing laboratories with a reference for establishing specific iRIs and giving new insight into clinical laboratory RI research.
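The abstract does not specify which statistical models the review compares, so the following is only a minimal sketch of the stratification idea it describes: a plain nonparametric central 95% interval (2.5th to 97.5th percentile) computed separately per stratum. This is not one of the indirect models themselves, which additionally separate healthy from pathological results; the stratum labels and values are made up.

```python
from statistics import quantiles


def reference_interval(values):
    """Central 95% interval: the 2.5th and 97.5th percentiles of the results."""
    # n=40 yields 39 cut points spaced 2.5 percentage points apart.
    cuts = quantiles(values, n=40, method="inclusive")
    return cuts[0], cuts[-1]


def stratified_intervals(results_by_stratum):
    """Compute one interval per stratum (for example, per age group)."""
    return {name: reference_interval(vals) for name, vals in results_by_stratum.items()}


# Hypothetical laboratory results for two hypothetical age strata.
results = {
    "0-17 y": [float(x) for x in range(101)],
    "18+ y": [float(x) for x in range(50, 151)],
}
intervals = stratified_intervals(results)
print(intervals)
```

Pooling both strata would produce a single interval that fits neither group well, which is why the stratification step matters.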
Affiliation(s)
- Sijia Ma
- Department of Laboratory Medicine, Shengjing Hospital of China Medical University, Liaoning Clinical Research Center for Laboratory Medicine, Shenyang, P.R. China
- Juntong Yu
- Department of Laboratory Medicine, Shengjing Hospital of China Medical University, Liaoning Clinical Research Center for Laboratory Medicine, Shenyang, P.R. China
- Xiaosong Qin
- Department of Laboratory Medicine, Shengjing Hospital of China Medical University, Liaoning Clinical Research Center for Laboratory Medicine, Shenyang, P.R. China
- Jianhua Liu
- Department of Laboratory Medicine, Shengjing Hospital of China Medical University, Liaoning Clinical Research Center for Laboratory Medicine, Shenyang, P.R. China
5
Qi T, Wu F, Wu C, He L, Huang Y, Xie X. Differentially private knowledge transfer for federated learning. Nat Commun 2023; 14:3785. [PMID: 37355643 PMCID: PMC10290720 DOI: 10.1038/s41467-023-38794-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/27/2023] [Accepted: 05/15/2023] [Indexed: 06/26/2023] Open
Abstract
Extracting useful knowledge from big data is important for machine learning. When data is privacy-sensitive and cannot be directly collected, federated learning is a promising option that extracts knowledge from decentralized data by learning and exchanging model parameters, rather than raw data. However, model parameters may encode not only non-private knowledge but also private information of local data, so transferring knowledge via model parameters is not privacy-secure. Here, we present a knowledge transfer method named PrivateKT, which uses actively selected small public data to transfer high-quality knowledge in federated learning with privacy guarantees. We verify PrivateKT on three different datasets, and results show that PrivateKT can reduce up to 84% of the performance gap between centralized learning and existing federated learning methods under strict differential privacy restrictions. PrivateKT provides a potential direction toward effective and privacy-preserving knowledge transfer in machine intelligent systems.
Affiliation(s)
- Tao Qi
- Department of Electronic Engineering, Tsinghua University, 100084, Beijing, China
- Fangzhao Wu
- Microsoft Research Asia, 100080, Beijing, China
- Chuhan Wu
- Department of Electronic Engineering, Tsinghua University, 100084, Beijing, China
- Liang He
- Department of Electronic Engineering, Tsinghua University, 100084, Beijing, China
- Yongfeng Huang
- Department of Electronic Engineering, Tsinghua University, 100084, Beijing, China
- Zhongguancun Laboratory, 100094, Beijing, China
- Institute for Precision Medicine, Tsinghua University, 102218, Beijing, China
- Xing Xie
- Microsoft Research Asia, 100080, Beijing, China
6
Brown JT, Wan Z, Gkoulalas-Divanis A, Kantarcioglu M, Malin BA. Supporting COVID-19 Disparity Investigations with Dynamically Adjusting Case Reporting Policies. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2023; 2022:279-288. [PMID: 37128430 PMCID: PMC10148367] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/03/2023]
Abstract
Data access limitations have stifled COVID-19 disparity investigations in the United States. Though federal and state legislation permits publicly disseminating de-identified data, methods for de-identification, including a recently proposed dynamic policy approach to pandemic data sharing, remain unproved in their ability to support pandemic disparity studies. Thus, in this paper, we evaluate how such an approach enables timely, accurate, and fair disparity detection, with respect to potential adversaries with varying prior knowledge about the population. We show that, when considering reasonably enabled adversaries, dynamic policies support up to three times earlier disparity detection in partially synthetic data than data sharing policies derived from two current, public datasets. Using real-world COVID-19 data, we also show how granular date information, which dynamic policies were designed to share, improves disparity characterization. Our results highlight the potential of the dynamic policy approach to publish data that supports disparity investigations in current and future pandemics.
Affiliation(s)
- Zhiyu Wan
- Vanderbilt University, Nashville, TN, USA
7
Xia W, Basford M, Carroll R, Clayton EW, Harris P, Kantacioglu M, Liu Y, Nyemba S, Vorobeychik Y, Wan Z, Malin BA. Managing re-identification risks while providing access to the All of Us research program. J Am Med Inform Assoc 2023; 30:907-914. [PMID: 36809550 PMCID: PMC10114067 DOI: 10.1093/jamia/ocad021] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 10/04/2022] [Revised: 01/23/2023] [Accepted: 02/09/2023] [Indexed: 02/23/2023] Open
Abstract
OBJECTIVE: The All of Us Research Program makes individual-level data available to researchers while protecting the participants' privacy. This article describes the protections embedded in the multistep access process, with a particular focus on how the data were transformed to meet generally accepted re-identification risk levels.
METHODS: At the time of the study, the resource consisted of 329 084 participants. Systematic amendments were applied to the data to mitigate re-identification risk (eg, generalization of geographic regions, suppression of public events, and randomization of dates). We computed the re-identification risk for each participant using a state-of-the-art adversarial model specifically assuming that it is known that someone is a participant in the program. We confirmed the expected risk is no greater than 0.09, a threshold that is consistent with guidelines from various US state and federal agencies. We further investigated how risk varied as a function of participant demographics.
RESULTS: The results indicated that the 95th percentile of the re-identification risk of all the participants is below current thresholds. At the same time, we observed that risk levels were higher for certain race, ethnicity, and gender groups.
CONCLUSIONS: While the re-identification risk was sufficiently low, this does not imply that the system is devoid of risk. Rather, All of Us uses a multipronged data protection strategy that includes strong authentication practices, active monitoring of data misuse, and penalization mechanisms for users who violate terms of service.
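The abstract does not describe the adversarial model in detail. A common simplification, used here purely as an assumption, scores each participant as 1 divided by the number of records sharing the same quasi-identifier values, which matches a setting where the adversary already knows the person is in the dataset. The records and field names below are hypothetical; only the 0.09 threshold comes from the abstract.

```python
from collections import Counter


def reidentification_risks(records, quasi_identifiers):
    """Per-record risk: 1 / size of the group sharing the same quasi-identifiers."""
    keys = [tuple(rec[q] for q in quasi_identifiers) for rec in records]
    group_sizes = Counter(keys)
    return [1 / group_sizes[k] for k in keys]


# Hypothetical records; field names are illustrative, not from the program's data.
records = [
    {"age_group": "30-39", "state": "TN"},
    {"age_group": "30-39", "state": "TN"},
    {"age_group": "40-49", "state": "TN"},
]
risks = reidentification_risks(records, ["age_group", "state"])
expected_risk = sum(risks) / len(risks)

THRESHOLD = 0.09  # threshold quoted in the abstract
print(risks, expected_risk, expected_risk <= THRESHOLD)
```

With only three records the groups are tiny and the expected risk is far above the threshold; generalizing or suppressing fields enlarges the groups and lowers the score.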
Affiliation(s)
- Weiyi Xia
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Melissa Basford
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Robert Carroll
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Ellen Wright Clayton
- Law School, Vanderbilt University, Nashville, Tennessee, USA
- Department of Pediatrics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Department of Health Policy, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Paul Harris
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Department of Biomedical Engineering, Vanderbilt University, Nashville, Tennessee, USA
- Murat Kantacioglu
- Department of Computer Science, University of Texas at Dallas, Dallas, Texas, USA
- Yongtai Liu
- Department of Computer Science, Vanderbilt University, Nashville, Tennessee, USA
- Steve Nyemba
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Yevgeniy Vorobeychik
- Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, Missouri, USA
- Zhiyu Wan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Bradley A Malin
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Department of Computer Science, Vanderbilt University, Nashville, Tennessee, USA
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
8
Yan C, Yan Y, Wan Z, Zhang Z, Omberg L, Guinney J, Mooney SD, Malin BA. A multifaceted benchmarking of synthetic electronic health record generation models. Nat Commun 2022; 13:7609. [PMID: 36494374 PMCID: PMC9734113 DOI: 10.1038/s41467-022-35295-1] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 08/05/2022] [Accepted: 11/28/2022] [Indexed: 12/13/2022] Open
Abstract
Synthetic health data have the potential to mitigate privacy concerns in supporting biomedical research and healthcare applications. Modern approaches for data generation continue to evolve and demonstrate remarkable potential. Yet there is a lack of a systematic assessment framework to benchmark methods as they emerge and determine which methods are most appropriate for which use cases. In this work, we introduce a systematic benchmarking framework to appraise key characteristics with respect to utility and privacy metrics. We apply the framework to evaluate synthetic data generation methods for electronic health records data from two large academic medical centers with respect to several use cases. The results illustrate that there is a utility-privacy tradeoff for sharing synthetic health data and further indicate that no method is unequivocally the best on all criteria in each use case, which makes it evident why synthetic data generation methods need to be assessed in context.
Affiliation(s)
- Chao Yan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Yao Yan
- Sage Bionetworks, Seattle, WA, USA
- Zhiyu Wan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Ziqi Zhang
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA
- Justin Guinney
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
- Tempus Labs, Chicago, IL, USA
- Sean D Mooney
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
- Bradley A Malin
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
9
Wan Z, Hazel JW, Clayton EW, Vorobeychik Y, Kantarcioglu M, Malin BA. Sociotechnical safeguards for genomic data privacy. Nat Rev Genet 2022; 23:429-445. [PMID: 35246669 PMCID: PMC8896074 DOI: 10.1038/s41576-022-00455-y] [Citation(s) in RCA: 46] [Impact Index Per Article: 15.3] [Accepted: 01/24/2022] [Indexed: 12/21/2022]
Abstract
Recent developments in a variety of sectors, including health care, research and the direct-to-consumer industry, have led to a dramatic increase in the amount of genomic data that are collected, used and shared. This state of affairs raises new and challenging concerns for personal privacy, both legally and technically. This Review appraises existing and emerging threats to genomic data privacy and discusses how well current legal frameworks and technical safeguards mitigate these concerns. It concludes with a discussion of remaining and emerging challenges and illustrates possible solutions that can balance protecting privacy and realizing the benefits that result from the sharing of genetic information.
Affiliation(s)
- Zhiyu Wan
- Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- James W Hazel
- Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA
- Center for Biomedical Ethics and Society, Vanderbilt University, Nashville, TN, USA
- Ellen Wright Clayton
- Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA
- Center for Biomedical Ethics and Society, Vanderbilt University, Nashville, TN, USA
- Vanderbilt University Law School, Nashville, TN, USA
- Yevgeniy Vorobeychik
- Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO, USA
- Murat Kantarcioglu
- Department of Computer Science, University of Texas at Dallas, Richardson, TX, USA
- Bradley A Malin
- Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA