1. Kashyap A, Black TA, McQuitty EN, Ali A, Nambiar N, Rashid RM. The future of dermatological research: ethical implications of artificial intelligence integration for medical students. Clin Exp Dermatol 2025; 50:1046-1047. PMID: 39579069; DOI: 10.1093/ced/llae523.
Abstract
This letter examines the increasing reliance on artificial intelligence (AI) tools like ChatGPT among medical students in dermatology research, driven by the competitive nature of residency matching. While AI can significantly boost research productivity, it also introduces ethical concerns, including risks of academic dishonesty, data privacy issues, and potential over-reliance on technology that may undermine students’ critical thinking and intellectual development. Balancing the benefits of AI with a commitment to ethical standards is crucial to preserving the integrity and quality of medical research.
Affiliations
- Alisha Kashyap: John P. and Kathrine G. McGovern Medical School at UTHealth, Houston, TX, USA
- Troy Austin Black: John P. and Kathrine G. McGovern Medical School at UTHealth, Houston, TX, USA
- Emelie N McQuitty: John P. and Kathrine G. McGovern Medical School at UTHealth, Houston, TX, USA
- Amna Ali: John P. and Kathrine G. McGovern Medical School at UTHealth, Houston, TX, USA
- Nayna Nambiar: Rice University School of Natural Sciences, Houston, TX, USA

2. Mathes S, Seurig S, Bluhme F, Beyer K, Heizmann F, Wagner M, Neugärtner I, Biedermann T, Darsow U. ChatGPT Performance on 120 Interdisciplinary Allergology Questions-Systematic Evaluation With Clinical Error Impact Assessment for Critical Erroneous AI-Guided Chatbot Advice. J Allergy Clin Immunol Pract 2025; S2213-2198(25)00280-6. PMID: 40157421; DOI: 10.1016/j.jaip.2025.03.030.
Abstract
BACKGROUND ChatGPT (Chatbot with Generative Pretrained Transformer), despite not being a medical device, may be used by patients for medical inquiries. Its accessibility and convenience, particularly amidst long waiting times for allergology appointments, make it an attractive but potentially erroneous source of advice. OBJECTIVES This study evaluates ChatGPT's performance on allergological questions from clinical practice, offering a systematic approach to rating its errors. An Allergological Error Impact Assessment is proposed to analyze the potential consequences of these errors for patients. METHODS A total of 120 multidisciplinary allergology questions from dermatology, pediatrics, and pulmonology were posed to ChatGPT (version 3.5). Responses were assessed for content, accuracy (ACC), completeness (CO), perceived humanness (PHU), and readability (Flesch Reading Ease). Erroneous responses were categorized on a 3-step severity scale (minor, major, and critical), and critical errors underwent the proposed Allergological Error Impact Assessment. Statistical evaluation included descriptive analyses and Kruskal-Wallis and Mann-Whitney U tests. RESULTS ChatGPT demonstrated good accuracy (mean ACC 4.1/5, standard deviation 0.78, range 1-5). CO and PHU were sufficient but lowest for pediatric queries. Readability was at an academic level for most responses. Six critical errors were identified: 1 in dermatology, 2 in pediatrics, and 3 in pulmonology. Notably, a critical pediatric food allergen error carried a potentially life-threatening risk. CONCLUSION ChatGPT's imperfect reliability in allergology highlights the need for expert counseling in specialized fields. Tailoring these tools to allergy use cases could improve the utility of models like ChatGPT for clinical applications, such as answering questions from routine allergological care.
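The readability finding above rests on the Flesch Reading Ease score, which combines average sentence length with average syllables per word; scores below roughly 30 are conventionally read as academic-level text. A minimal Python sketch of that computation follows; the study's exact tooling is not described, and the syllable counter below is a crude vowel-group heuristic rather than a validated one:

```python
import re

def count_syllables(word: str) -> int:
    """Rough English syllable count via a vowel-group heuristic."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    """FRE = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    Lower scores mean harder text; ~30 and below reads as academic."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

# Illustrative sentence, not taken from the study's question set
print(round(flesch_reading_ease(
    "Allergen immunotherapy modifies the underlying immune response and "
    "is considered for moderate to severe allergic rhinitis."), 1))
```
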
Affiliations
- Sonja Mathes: Department for Dermatology and Allergology, School of Medicine, Technical University of Munich, Munich, Germany
- Sebastian Seurig: Department of Respiratory Medicine, Allergology and Sleep Medicine, General Hospital Nuremberg, Campus North, Paracelsus Medical University, Nuremberg, Germany
- Friederike Bluhme: Department of Pediatric Respiratory Medicine, Immunology, and Critical Care Medicine, Charité-Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Kirsten Beyer: Department of Pediatric Respiratory Medicine, Immunology, and Critical Care Medicine, Charité-Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Fabian Heizmann: Department of Respiratory Medicine, Allergology and Sleep Medicine, General Hospital Nuremberg, Campus North, Paracelsus Medical University, Nuremberg, Germany
- Manfred Wagner: Department of Respiratory Medicine, Allergology and Sleep Medicine, General Hospital Nuremberg, Campus North, Paracelsus Medical University, Nuremberg, Germany
- Ina Neugärtner: Department of Pediatric Respiratory Medicine, Immunology, and Critical Care Medicine, Charité-Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Tilo Biedermann: Department for Dermatology and Allergology, School of Medicine, Technical University of Munich, Munich, Germany
- Ulf Darsow: Department for Dermatology and Allergology, School of Medicine, Technical University of Munich, Munich, Germany

3. Estrada-Mendizabal RJ, Cojuc-Konigsberg G, Labib EN, De la Cruz-De la Cruz C, Gonzalez-Estrada A, Cuervo-Pardo L, Zwiener R, Canel-Paredes A. Assessing the accuracy of ChatGPT's responses to common allergy myths. J Allergy Clin Immunol Pract 2025; 13:440-442.e1. PMID: 39551385; DOI: 10.1016/j.jaip.2024.11.002.
Affiliations
- Ebram N Labib: Division of Allergy, Asthma, and Clinical Immunology, Department of Medicine, Mayo Clinic, Scottsdale, Ariz
- Alexei Gonzalez-Estrada: Division of Allergy, Asthma, and Clinical Immunology, Department of Medicine, Mayo Clinic, Scottsdale, Ariz
- Lyda Cuervo-Pardo: Division of Rheumatology, Allergy and Clinical Immunology, Department of Medicine, University of Florida, Gainesville, Fla
- Ricardo Zwiener: Allergy and Immunology Department, Hospital Universitario Austral, Buenos Aires, Argentina
- Alejandra Canel-Paredes: Division of Allergy and Clinical Immunology, Hospital Zambrano-Hellion, Monterrey, NL, Mexico

4. Durmuş MA, Kömeç S, Gülmez A. Artificial intelligence applications for immunology laboratory: image analysis and classification study of IIF photos. Immunol Res 2024; 72:1277-1287. PMID: 39107556; DOI: 10.1007/s12026-024-09527-z.
Abstract
Artificial intelligence (AI) is increasingly being used in medicine to enhance the speed and accuracy of disease diagnosis and treatment. AI-based image analysis is expected to play a crucial role in future healthcare facilities and laboratories, offering improved precision and cost-effectiveness. As technology advances, the specialized software knowledge required to use AI applications is diminishing. Our study examines the advantages and challenges of employing AI-based image analysis in the field of immunology and investigates whether physicians without software expertise can use MS Azure Portal for ANA IIF test classification and image analysis. This is the first study to perform Hep-2 image analysis using MS Azure Portal. We also assess the potential for AI applications to aid physicians in interpreting ANA IIF results in immunology laboratories. The study was designed in four stages by two specialists: Stage 1, creation of an image library; Stage 2, selection of an artificial intelligence application; Stage 3, uploading images and training the artificial intelligence; Stage 4, performance analysis of the artificial intelligence application. In the first training, the average pattern identification accuracy for the 72 test images was 81.94%; after the second training, this accuracy increased to 87.5%. Pattern precision improved from 71.42% to 79.96% after the second training. In short, both the number of correctly identified patterns and their accuracy increased with the second training process. Artificial intelligence-based image analysis shows promising potential and is expected to become essential in healthcare facility laboratories, offering higher accuracy rates and lower costs.
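For readers unfamiliar with the two reported metrics: accuracy is the share of all test images classified correctly, while per-pattern precision is the share of images assigned to a pattern that truly belong to it. A minimal sketch of both computations follows; the pattern names and error counts are invented for illustration and are not the study's data:

```python
# Hypothetical predictions for 72 ANA IIF (Hep-2) test images;
# labels and mistake counts below are illustrative only.
y_true = ["homogeneous"] * 30 + ["speckled"] * 30 + ["nucleolar"] * 12
y_pred = (["homogeneous"] * 27 + ["speckled"] * 3    # 3 homogeneous images misread
          + ["speckled"] * 28 + ["nucleolar"] * 2    # 2 speckled images misread
          + ["nucleolar"] * 12)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(label: str) -> float:
    """Of the images predicted as `label`, the fraction that truly are."""
    hits = [t == p for t, p in zip(y_true, y_pred) if p == label]
    return sum(hits) / len(hits) if hits else 0.0

labels = sorted(set(y_true))
macro_precision = sum(precision(l) for l in labels) / len(labels)
print(f"accuracy={accuracy:.2%}, macro precision={macro_precision:.2%}")
```
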
Affiliations
- Mehmet Akif Durmuş: Medical Microbiology Laboratory, Çam and Sakura City Hospital, Istanbul, Türkiye
- Selda Kömeç: Medical Microbiology Laboratory, Çam and Sakura City Hospital, Istanbul, Türkiye
- Abdurrahman Gülmez: Medical Microbiology Laboratory, Aydın Atatürk State Hospital, Aydın, Türkiye

5. Wang L, Wan Z, Ni C, Song Q, Li Y, Clayton E, Malin B, Yin Z. Applications and Concerns of ChatGPT and Other Conversational Large Language Models in Health Care: Systematic Review. J Med Internet Res 2024; 26:e22769. PMID: 39509695; PMCID: PMC11582494; DOI: 10.2196/22769.
Abstract
BACKGROUND The launch of ChatGPT (OpenAI) in November 2022 attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including health care. Numerous studies have since been conducted regarding how to use state-of-the-art LLMs in health-related scenarios. OBJECTIVE This review aims to summarize applications of and concerns regarding conversational LLMs in health care and provide an agenda for future research in this field. METHODS We used PubMed, ACM, and the IEEE digital libraries as primary sources for this review. We followed the guidance of PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) to screen and select peer-reviewed research articles that (1) were related to health care applications and conversational LLMs and (2) were published before September 1, 2023, the date when we started paper collection. We investigated these papers and classified them according to their applications and concerns. RESULTS Our search initially identified 820 papers according to targeted keywords, out of which 65 (7.9%) met our criteria and were included in the review. The most popular conversational LLM was ChatGPT (60/65, 92% of papers), followed by Bard (Google LLC; 1/65, 2% of papers), LLaMA (Meta; 1/65, 2% of papers), and other LLMs (6/65, 9% of papers). These papers were classified into four categories of applications: (1) summarization, (2) medical knowledge inquiry, (3) prediction (eg, diagnosis, treatment recommendation, and drug synergy), and (4) administration (eg, documentation and information collection), and four categories of concerns: (1) reliability (eg, training data quality, accuracy, interpretability, and consistency in responses), (2) bias, (3) privacy, and (4) public acceptability. There were 49 (75%) papers using LLMs for either summarization or medical knowledge inquiry, or both, and 58 (89%) papers expressed concerns about either reliability or bias, or both. We found that conversational LLMs exhibited promising results in summarization and providing general medical knowledge to patients with relatively high accuracy. However, conversational LLMs such as ChatGPT are not always able to provide reliable answers to complex health-related tasks (eg, diagnosis) that require specialized domain expertise. While bias or privacy issues are often noted as concerns, no experiments in our reviewed papers thoughtfully examined how conversational LLMs lead to these issues in health care research. CONCLUSIONS Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as investigating the mechanisms by which LLM applications introduce bias and privacy issues. Considering the vast accessibility of LLMs, legal, social, and technical efforts are all needed to address concerns about LLMs to promote, improve, and regularize their application in health care.
Affiliations
- Leyao Wang: Department of Computer Science, Vanderbilt University, Nashville, TN, United States
- Zhiyu Wan: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States; School of Biomedical Engineering, ShanghaiTech University, Shanghai, China
- Congning Ni: Department of Computer Science, Vanderbilt University, Nashville, TN, United States
- Qingyuan Song: Department of Computer Science, Vanderbilt University, Nashville, TN, United States
- Yang Li: Department of Computer Science, Vanderbilt University, Nashville, TN, United States
- Ellen Clayton: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States; Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN, United States; School of Law, Vanderbilt University, Nashville, TN, United States
- Bradley Malin: Department of Computer Science, Vanderbilt University, Nashville, TN, United States; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States; Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, United States
- Zhijun Yin: Department of Computer Science, Vanderbilt University, Nashville, TN, United States; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States

6. Goktas P, Grzybowski A. Assessing the Impact of ChatGPT in Dermatology: A Comprehensive Rapid Review. J Clin Med 2024; 13:5909. PMID: 39407969; PMCID: PMC11477344; DOI: 10.3390/jcm13195909.
Abstract
Background/Objectives: The use of artificial intelligence (AI) in dermatology is expanding rapidly, with ChatGPT, a large language model (LLM) from OpenAI, showing promise in patient education, clinical decision-making, and teledermatology. Despite its potential, the ethical, clinical, and practical implications of its application remain insufficiently explored. This study aims to evaluate the effectiveness, challenges, and future prospects of ChatGPT in dermatology, focusing on clinical applications, patient interactions, and medical writing. ChatGPT was selected due to its broad adoption, extensive validation, and strong performance in dermatology-related tasks. Methods: A thorough literature review was conducted, focusing on publications related to ChatGPT and dermatology. The search included articles in English from November 2022 to August 2024, a period that captures the developments following ChatGPT's launch and the most recent discussions of its role in dermatology. Studies were chosen based on their relevance to clinical applications, patient interactions, and ethical issues. Descriptive metrics, such as average accuracy scores and reliability percentages, were used to summarize study characteristics, and key findings were analyzed. Results: ChatGPT has shown significant potential in passing dermatology specialty exams and providing reliable responses to patient queries, especially for common dermatological conditions. However, it faces limitations in diagnosing complex cases such as cutaneous neoplasms, and concerns about the accuracy and completeness of its information persist. Ethical issues, including data privacy, algorithmic bias, and the need for transparent guidelines, were identified as critical challenges. Conclusions: While ChatGPT has the potential to significantly enhance dermatological practice, particularly in patient education and teledermatology, its integration must be cautious, addressing ethical concerns and complementing, rather than replacing, dermatologist expertise. Future research should refine ChatGPT's diagnostic capabilities, mitigate biases, and develop comprehensive clinical guidelines.
Affiliations
- Polat Goktas: UCD School of Computer Science, University College Dublin, D04 V1W8 Dublin, Ireland
- Andrzej Grzybowski: Department of Ophthalmology, University of Warmia and Mazury, 10-719 Olsztyn, Poland; Institute for Research in Ophthalmology, Foundation for Ophthalmology Development, 61-553 Poznan, Poland

7. Jo E, Song S, Kim JH, Lim S, Kim JH, Cha JJ, Kim YM, Joo HJ. Assessing GPT-4's Performance in Delivering Medical Advice: Comparative Analysis With Human Experts. JMIR Med Educ 2024; 10:e51282. PMID: 38989848; PMCID: PMC11250047; DOI: 10.2196/51282.
Abstract
Background Accurate medical advice is paramount in ensuring optimal patient care, and misinformation can lead to misguided decisions with potentially detrimental health outcomes. The emergence of large language models (LLMs) such as OpenAI's GPT-4 has spurred interest in their potential health care applications, particularly in automated medical consultation. Yet, rigorous investigations comparing their performance to human experts remain sparse. Objective This study aims to compare the medical accuracy of GPT-4 with human experts in providing medical advice using real-world user-generated queries, with a specific focus on cardiology. It also seeks to analyze the performance of GPT-4 and human experts in specific question categories, including drug or medication information and preliminary diagnoses. Methods We collected 251 pairs of cardiology-specific questions from general users and answers from human experts via an internet portal. GPT-4 was tasked with generating responses to the same questions. Three independent cardiologists (SL, JHK, and JJC) evaluated the answers provided by both human experts and GPT-4. Using a computer interface, each evaluator compared the pairs and determined which answer was superior, and they quantitatively measured the clarity and complexity of the questions as well as the accuracy and appropriateness of the responses, applying a 3-tiered grading scale (low, medium, and high). Furthermore, a linguistic analysis was conducted to compare the length and vocabulary diversity of the responses using word count and type-token ratio. Results GPT-4 and human experts displayed comparable efficacy in medical accuracy ("GPT-4 is better" at 132/251, 52.6% vs "Human expert is better" at 119/251, 47.4%). In accuracy level categorization, humans had more high-accuracy responses than GPT-4 (50/237, 21.1% vs 30/238, 12.6%) but also a greater proportion of low-accuracy responses (11/237, 4.6% vs 1/238, 0.4%; P=.001). GPT-4 responses were generally longer and used a less diverse vocabulary than those of human experts, potentially enhancing their comprehensibility for general users (sentence count: mean 10.9, SD 4.2 vs mean 5.9, SD 3.7; P<.001; type-token ratio: mean 0.69, SD 0.07 vs mean 0.79, SD 0.09; P<.001). Nevertheless, human experts outperformed GPT-4 in specific question categories, notably those related to drug or medication information and preliminary diagnoses. These findings highlight the limitations of GPT-4 in providing advice based on clinical experience. Conclusions GPT-4 has shown promising potential in automated medical consultation, with medical accuracy comparable to human experts. However, challenges remain, particularly in the realm of nuanced clinical judgment. Future improvements in LLMs may require the integration of specific clinical reasoning pathways and regulatory oversight for safe use. Further research is needed to understand the full potential of LLMs across various medical specialties and conditions.
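The type-token ratio in the linguistic analysis is simply the number of distinct words (types) divided by the total number of words (tokens); longer, more repetitive answers drive the ratio down. A minimal sketch on invented answer texts (the example sentences are ours, not the study's data):

```python
import re

def type_token_ratio(text: str) -> float:
    """TTR = distinct words (types) / total words (tokens); lower values
    indicate less diverse, more repetitive vocabulary."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Illustrative answers only
gpt_answer = ("Chest pain can have many causes. Chest pain that worsens with "
              "exertion should prompt urgent evaluation for cardiac causes.")
expert_answer = "Exertional chest pain warrants prompt cardiology review."
print(round(type_token_ratio(gpt_answer), 2),     # longer, repetitive -> lower TTR
      round(type_token_ratio(expert_answer), 2))  # short, varied -> higher TTR
```
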
Affiliations
- Eunbeen Jo: Department of Medical Informatics, Korea University College of Medicine, Seoul, Republic of Korea
- Sanghoun Song: Department of Linguistics, Korea University, Seoul, Republic of Korea
- Jong-Ho Kim: Korea University Research Institute for Medical Bigdata Science, Korea University, Seoul, Republic of Korea; Department of Cardiology, Cardiovascular Center, Korea University College of Medicine, Seoul, Republic of Korea
- Subin Lim: Division of Cardiology, Department of Internal Medicine, Korea University Anam Hospital, Seoul, Republic of Korea
- Ju Hyeon Kim: Division of Cardiology, Department of Internal Medicine, Korea University Anam Hospital, Seoul, Republic of Korea
- Jung-Joon Cha: Division of Cardiology, Department of Internal Medicine, Korea University Anam Hospital, Seoul, Republic of Korea
- Young-Min Kim: School of Interdisciplinary Industrial Studies, Hanyang University, Seoul, Republic of Korea
- Hyung Joon Joo: Department of Medical Informatics, Korea University College of Medicine, Seoul, Republic of Korea; Korea University Research Institute for Medical Bigdata Science, Korea University, Seoul, Republic of Korea; Department of Cardiology, Cardiovascular Center, Korea University College of Medicine, Seoul, Republic of Korea

8. Khan M, Banerjee S, Muskawad S, Maity R, Chowdhury SR, Ejaz R, Kuuzie E, Satnarine T. The Impact of Artificial Intelligence on Allergy Diagnosis and Treatment. Curr Allergy Asthma Rep 2024; 24:361-372. PMID: 38954325; DOI: 10.1007/s11882-024-01152-y.
Abstract
PURPOSE OF REVIEW Artificial intelligence (AI), whether neural networks, machine learning, or deep learning, has numerous beneficial effects on healthcare systems; however, its potential applications and diagnostic capabilities for immunologic diseases have yet to be fully explored. Understanding AI systems can help healthcare workers better assimilate artificial intelligence into their practice and unravel its potential in diagnostics, clinical research, and disease management. RECENT FINDINGS We reviewed recent advancements in AI systems and their integration into healthcare systems, along with their potential benefits in the diagnosis and management of diseases. We explored machine learning as employed in allergy diagnosis and its learning patterns from patient datasets, as well as the possible advantages of using AI in research on allergic reactions and even in remote monitoring. Considering the ethical challenges and privacy concerns raised by clinicians and patients with regard to integrating AI in healthcare, we examined the new guidelines adopted by regulatory bodies. Despite these challenges, AI appears to have been successfully incorporated into various healthcare systems, providing patient-centered solutions while assisting healthcare workers. Artificial intelligence offers new hope in the diagnosis, monitoring, and management of immunologic diseases and thus has the potential to revolutionize healthcare systems.
Affiliations
- Maham Khan: Fatima Jinnah Medical University, Lahore, Pakistan
- Rick Maity: Institute of Post Graduate Medical Education and Research, Kolkata, West Bengal, India
- Rida Ejaz: Shifa College of Medicine, Islamabad, Pakistan

9. Wang L, Wan Z, Ni C, Song Q, Li Y, Clayton EW, Malin BA, Yin Z. A Systematic Review of ChatGPT and Other Conversational Large Language Models in Healthcare. medRxiv 2024 [Preprint]: 2024.04.26.24306390. PMID: 38712148; PMCID: PMC11071576; DOI: 10.1101/2024.04.26.24306390.
Abstract
Background The launch of the Chat Generative Pre-trained Transformer (ChatGPT) in November 2022 has attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including healthcare. Numerous studies have since been conducted regarding how to employ state-of-the-art LLMs in health-related scenarios to assist patients, doctors, and public health administrators. Objective This review aims to summarize the applications and concerns of applying conversational LLMs in healthcare and provide an agenda for future research on LLMs in healthcare. Methods We utilized PubMed, ACM, and IEEE digital libraries as primary sources for this review. We followed the guidance of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) to screen and select peer-reviewed research articles that (1) were related to both healthcare applications and conversational LLMs and (2) were published before September 1st, 2023, the date when we started paper collection and screening. We investigated these papers and classified them according to their applications and concerns. Results Our search initially identified 820 papers according to targeted keywords, out of which 65 met our criteria and were included in the review. The most popular conversational LLM was ChatGPT from OpenAI (60 papers), followed by Bard from Google (1), Large Language Model Meta AI (LLaMA) from Meta (1), and other LLMs (5). These papers were classified into four categories in terms of their applications: 1) summarization, 2) medical knowledge inquiry, 3) prediction, and 4) administration, and four categories of concerns: 1) reliability, 2) bias, 3) privacy, and 4) public acceptability. There were 49 (75%) research papers using LLMs for summarization and/or medical knowledge inquiry, and 58 (89%) research papers expressing concerns about reliability and/or bias. We found that conversational LLMs exhibit promising results in summarization and providing medical knowledge to patients with relatively high accuracy. However, conversational LLMs like ChatGPT are not able to provide reliable answers to complex health-related tasks that require specialized domain expertise. Additionally, no experiments in our reviewed papers were conducted to thoughtfully examine how conversational LLMs lead to bias or privacy issues in healthcare research. Conclusions Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as investigating the mechanisms by which LLM applications introduce bias and privacy issues. Considering the vast accessibility of LLMs, legal, social, and technical efforts are all needed to address concerns about LLMs to promote, improve, and regularize their application in healthcare.
Affiliations
- Leyao Wang: Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
- Zhiyu Wan: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA, 37203
- Congning Ni: Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
- Qingyuan Song: Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
- Yang Li: Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
- Ellen Wright Clayton: Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN, USA, 37203; Center for Biomedical Ethics and Society, Vanderbilt University Medical Center, Nashville, TN, USA, 37203
- Bradley A. Malin: Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA, 37203; Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA, 37203
- Zhijun Yin: Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA, 37203

10. Yan S, Du D, Liu X, Dai Y, Kim MK, Zhou X, Wang L, Zhang L, Jiang X. Assessment of the Reliability and Clinical Applicability of ChatGPT's Responses to Patients' Common Queries About Rosacea. Patient Prefer Adherence 2024; 18:249-253. PMID: 38313827; PMCID: PMC10838492; DOI: 10.2147/ppa.s444928.
Abstract
Objective Artificial intelligence chatbots, particularly ChatGPT (Chat Generative Pre-trained Transformer), are capable of analyzing human input and generating human-like responses, which suggests potential applications in healthcare. People with rosacea often have questions about alleviating symptoms and daily skin care, questions well suited for ChatGPT to answer. This study aims to assess the reliability and clinical applicability of ChatGPT 3.5 in responding to patients' common queries about rosacea and to evaluate the extent of ChatGPT's coverage of dermatology resources. Methods Based on a qualitative analysis of the literature on queries from rosacea patients, we extracted the 20 questions of greatest concern to patients, covering four main categories: treatment, triggers and diet, skincare, and special manifestations of rosacea. Each question was entered into ChatGPT separately for three rounds of question-and-answer conversations. The generated answers were evaluated by three experienced dermatologists, each holding a postgraduate degree and over five years of clinical experience in dermatology, to assess their reliability and applicability for clinical practice. Results The reviewers unanimously rated ChatGPT as highly reliable (92.22% to 97.78%) in responding to patients' common queries about rosacea. Additionally, almost all answers were applicable for supporting rosacea patient education, with clinical applicability ranging from 98.61% to 100.00%. Agreement among the expert ratings was statistically significant (all P values less than .05), with concordance coefficients of 0.404 for content reliability and 0.456 for clinical applicability, indicating a meaningful level of agreement among the raters. Conclusion ChatGPT 3.5 exhibits excellent reliability and clinical applicability in responding to patients' common queries about rosacea. This artificial intelligence tool is applicable for supporting rosacea patient education.
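Concordance coefficients of this kind are commonly computed as Kendall's W, which rescales the spread of the raters' rank sums to [0, 1] (0 = no agreement, 1 = perfect agreement). The abstract does not name its exact statistic, so the following Python sketch is an assumption; it omits the tie correction that ordinal 1-5 scores would need in a real analysis, and the scores are simulated:

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(ratings: np.ndarray) -> float:
    """Kendall's coefficient of concordance for m raters scoring n items.
    ratings has shape (m, n); ties are rank-averaged, no tie correction."""
    m, n = ratings.shape
    ranks = np.vstack([rankdata(row) for row in ratings])  # rank within each rater
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Simulated 1-5 reliability scores from 3 raters over 20 questions
rng = np.random.default_rng(0)
scores = rng.integers(1, 6, size=(3, 20))
print(round(kendalls_w(scores), 3))
```
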
Affiliations
- Sihan Yan: Department of Dermatology, West China Hospital, Sichuan University, Chengdu, People’s Republic of China; Laboratory of Dermatology, Clinical Institute of Inflammation and Immunology, Frontiers Science Center for Disease-Related Molecular Network, West China Hospital, Sichuan University, Chengdu, People’s Republic of China
- Dan Du: Department of Dermatology, West China Hospital, Sichuan University, Chengdu, People’s Republic of China; Laboratory of Dermatology, Clinical Institute of Inflammation and Immunology, Frontiers Science Center for Disease-Related Molecular Network, West China Hospital, Sichuan University, Chengdu, People’s Republic of China
- Xu Liu: Department of Dermatology, West China Hospital, Sichuan University, Chengdu, People’s Republic of China; Laboratory of Dermatology, Clinical Institute of Inflammation and Immunology, Frontiers Science Center for Disease-Related Molecular Network, West China Hospital, Sichuan University, Chengdu, People’s Republic of China
- Yingying Dai: Department of Dermatology, West China Hospital, Sichuan University, Chengdu, People’s Republic of China; Laboratory of Dermatology, Clinical Institute of Inflammation and Immunology, Frontiers Science Center for Disease-Related Molecular Network, West China Hospital, Sichuan University, Chengdu, People’s Republic of China
- Min-Kyu Kim: Department of Dermatology, West China Hospital, Sichuan University, Chengdu, People’s Republic of China; Laboratory of Dermatology, Clinical Institute of Inflammation and Immunology, Frontiers Science Center for Disease-Related Molecular Network, West China Hospital, Sichuan University, Chengdu, People’s Republic of China
- Xinyu Zhou: Department of Dermatology, Nanbu County People’s Hospital, Nanbu County, Nanchong, Sichuan, People’s Republic of China
- Lian Wang: Department of Dermatology, West China Hospital, Sichuan University, Chengdu, People’s Republic of China; Laboratory of Dermatology, Clinical Institute of Inflammation and Immunology, Frontiers Science Center for Disease-Related Molecular Network, West China Hospital, Sichuan University, Chengdu, People’s Republic of China
- Lu Zhang: Department of Dermatology, West China Hospital, Sichuan University, Chengdu, People’s Republic of China; Laboratory of Dermatology, Clinical Institute of Inflammation and Immunology, Frontiers Science Center for Disease-Related Molecular Network, West China Hospital, Sichuan University, Chengdu, People’s Republic of China
- Xian Jiang: Department of Dermatology, West China Hospital, Sichuan University, Chengdu, People’s Republic of China; Laboratory of Dermatology, Clinical Institute of Inflammation and Immunology, Frontiers Science Center for Disease-Related Molecular Network, West China Hospital, Sichuan University, Chengdu, People’s Republic of China

11. Wang G, Liu Q, Chen G, Xia B, Zeng D, Chen G, Guo C. AI's deep dive into complex pediatric inguinal hernia issues: a challenge to traditional guidelines? Hernia 2023; 27:1587-1599. PMID: 37843604; DOI: 10.1007/s10029-023-02900-1.
Abstract
OBJECTIVE This study utilized ChatGPT, an artificial intelligence program based on large language models, to explore controversial issues in pediatric inguinal hernia surgery and compare its responses with the guidelines of the European Association of Pediatric Surgeons (EUPSA). METHODS Six contentious issues raised by EUPSA were submitted to ChatGPT 4.0 for analysis, and two independent responses were generated for each issue. These generated answers were then compared with systematic reviews and guidelines. To ensure content accuracy and reliability, a content analysis was conducted and expert evaluations were solicited for validation. The content analysis evaluated the consistency or discrepancy between ChatGPT 4.0's responses and the guidelines, and an expert scoring method assessed the quality, reliability, and applicability of the responses. A TF-IDF model tested the stability and consistency of the two responses. RESULTS The responses generated by ChatGPT 4.0 were mostly consistent with the guidelines, although some differences and contradictions were noted. The average quality score was 3.33, the reliability score 2.75, and the applicability score 3.46 (out of 5). The average similarity between the two responses was 0.72 (out of 1). Content analysis and expert ratings yielded consistent conclusions, strengthening the credibility of our findings. CONCLUSION ChatGPT can provide valuable responses to clinical questions, but it has limitations and requires further improvement. It is recommended to combine ChatGPT with other reliable data sources to improve clinical practice and decision-making.
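A similarity of 0.72 between paired responses is most plausibly obtained by embedding each response as a TF-IDF vector and comparing the pair with cosine similarity, though the abstract does not spell out the procedure. A minimal sketch under that assumption, with invented answer texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two hypothetical independent ChatGPT answers to the same contentious question
answer_a = ("Routine contralateral exploration is not recommended in unilateral "
            "pediatric inguinal hernia because metachronous hernia risk is low.")
answer_b = ("Because the risk of a metachronous hernia is low, routine exploration "
            "of the contralateral side is generally not advised.")

vectors = TfidfVectorizer().fit_transform([answer_a, answer_b])
similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]  # 1.0 = identical wording
print(f"TF-IDF cosine similarity: {similarity:.2f}")
```
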
Affiliations
- G Wang: Department of Pediatrics, Women's and Children's Hospital, Chongqing Medical University, 120 Longshan Rd., Chongqing, 401147, People's Republic of China; Department of Pediatrics, Children's Hospital, Chongqing Medical University, Chongqing, People's Republic of China; Department of Pediatric General Surgery, Chongqing Maternal and Child Health Hospital, Chongqing Medical University, Chongqing, People's Republic of China
- Q Liu: Department of Pediatrics, Women's and Children's Hospital, Chongqing Medical University, 120 Longshan Rd., Chongqing, 401147, People's Republic of China; Department of Fetus and Pediatrics, Chongqing Health Center for Women and Children, Chongqing, People's Republic of China
- G Chen: Department of Pediatrics, Women's and Children's Hospital, Chongqing Medical University, 120 Longshan Rd., Chongqing, 401147, People's Republic of China; Department of Fetus and Pediatrics, Chongqing Health Center for Women and Children, Chongqing, People's Republic of China
- B Xia: Department of Pediatrics, Women's and Children's Hospital, Chongqing Medical University, 120 Longshan Rd., Chongqing, 401147, People's Republic of China; Department of Fetus and Pediatrics, Chongqing Health Center for Women and Children, Chongqing, People's Republic of China
- D Zeng: Department of Pediatrics, Women's and Children's Hospital, Chongqing Medical University, 120 Longshan Rd., Chongqing, 401147, People's Republic of China; Department of Fetus and Pediatrics, Chongqing Health Center for Women and Children, Chongqing, People's Republic of China
- G Chen: Department of Pediatrics, Women's and Children's Hospital, Chongqing Medical University, 120 Longshan Rd., Chongqing, 401147, People's Republic of China; Department of Fetus and Pediatrics, Chongqing Health Center for Women and Children, Chongqing, People's Republic of China; Department of Pediatric General Surgery, Chongqing Maternal and Child Health Hospital, Chongqing Medical University, Chongqing, People's Republic of China; Department of Obstetrics and Gynecology, Chongqing Health Center for Women and Children, Women and Children's Hospital of Chongqing Medical University, 120 Longshan Rd., Chongqing, 401147, People's Republic of China
- C Guo: Department of Pediatrics, Women's and Children's Hospital, Chongqing Medical University, 120 Longshan Rd., Chongqing, 401147, People's Republic of China; Department of Fetus and Pediatrics, Chongqing Health Center for Women and Children, Chongqing, People's Republic of China; Department of Pediatric General Surgery, Chongqing Maternal and Child Health Hospital, Chongqing Medical University, Chongqing, People's Republic of China

12. Guillen-Grima F, Guillen-Aguinaga S, Guillen-Aguinaga L, Alas-Brun R, Onambele L, Ortega W, Montejo R, Aguinaga-Ontoso E, Barach P, Aguinaga-Ontoso I. Evaluating the Efficacy of ChatGPT in Navigating the Spanish Medical Residency Entrance Examination (MIR): Promising Horizons for AI in Clinical Medicine. Clin Pract 2023; 13:1460-1487. PMID: 37987431; PMCID: PMC10660543; DOI: 10.3390/clinpract13060130.
Abstract
The rapid progress in artificial intelligence, machine learning, and natural language processing has led to increasingly sophisticated large language models (LLMs) for use in healthcare. This study assesses the performance of two LLMs, GPT-3.5 and GPT-4, in passing the MIR medical examination for access to medical specialist training in Spain. Our objectives included gauging the models' overall performance, analyzing discrepancies across different medical specialties, discerning between theoretical and practical questions, estimating error proportions, and assessing the hypothetical severity of equivalent errors committed by a physician. MATERIAL AND METHODS We studied the 2022 Spanish MIR examination results after excluding those questions requiring image evaluations or having acknowledged errors. The remaining 182 questions were presented to GPT-4 and GPT-3.5 in Spanish and English. Logistic regression models analyzed the relationships between question length, sequence, and performance. We also analyzed the 23 questions with images, using GPT-4's new image analysis capability. RESULTS GPT-4 outperformed GPT-3.5, scoring 86.81% in Spanish (p < 0.001). English translations performed slightly better. GPT-4 answered 26.1% of the image-based questions correctly in English; the results were worse in Spanish (13.0%), although the difference was not statistically significant (p = 0.250). Among medical specialties, GPT-4 achieved a 100% correct response rate in several areas, while Pharmacology, Critical Care, and Infectious Diseases showed lower performance. The error analysis revealed that, while a 13.2% error rate existed, the gravest categories, such as "error requiring intervention to sustain life" and "error resulting in death", had a 0% rate. CONCLUSIONS GPT-4 performs robustly on the Spanish MIR examination, with varying capability to discriminate knowledge across specialties. While the model's high success rate is commendable, understanding error severity is critical, especially when considering AI's potential role in real-world medical practice and its implications for patient safety.
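The abstract reports logistic regressions linking question length and sequence to whether the model answered correctly. A minimal sketch of that analysis on simulated data follows; the 182-question size matches the study, but the feature values and outcomes are randomly generated, and statsmodels is our choice of library, not necessarily the authors':

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: per-question word length, position in the exam, and a 0/1
# flag for a correct answer (~87% hit rate, close to the reported score).
rng = np.random.default_rng(42)
length = rng.integers(20, 120, size=182)
order = np.arange(1, 183)
correct = (rng.random(182) < 0.87).astype(int)

# Logistic regression of correctness on question length and sequence
X = sm.add_constant(np.column_stack([length, order]).astype(float))
result = sm.Logit(correct, X).fit(disp=0)
print(result.summary(xname=["const", "question_length", "question_order"]))
```
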
Affiliations
- Francisco Guillen-Grima: Department of Health Sciences, Public University of Navarra, 31008 Pamplona, Spain; Healthcare Research Institute of Navarra (IdiSNA), 31008 Pamplona, Spain; Department of Preventive Medicine, Clinica Universidad de Navarra, 31008 Pamplona, Spain; CIBER in Epidemiology and Public Health (CIBERESP), Institute of Health Carlos III, 46980 Madrid, Spain
- Sara Guillen-Aguinaga: Department of Health Sciences, Public University of Navarra, 31008 Pamplona, Spain
- Laura Guillen-Aguinaga: Department of Health Sciences, Public University of Navarra, 31008 Pamplona, Spain; Department of Nursing, Kystad Helse-og Velferdssenter, 7026 Trondheim, Norway
- Rosa Alas-Brun: Department of Health Sciences, Public University of Navarra, 31008 Pamplona, Spain
- Luc Onambele: School of Health Sciences, Catholic University of Central Africa, Yaoundé 1100, Cameroon
- Wilfrido Ortega: Department of Surgery, Medical and Social Sciences, University of Alcala de Henares, 28871 Alcalá de Henares, Spain
- Rocio Montejo: Department of Obstetrics and Gynecology, Institute of Clinical Sciences, University of Gothenburg, 413 46 Gothenburg, Sweden; Department of Obstetrics and Gynecology, Sahlgrenska University Hospital, 413 46 Gothenburg, Sweden
- Paul Barach: Jefferson College of Population Health, Philadelphia, PA 19107, USA; School of Medicine, Thomas Jefferson University, Philadelphia, PA 19107, USA; Interdisciplinary Research Institute for Health Law and Science, Sigmund Freud University, 1020 Vienna, Austria; Department of Surgery, Imperial College, London SW7 2AZ, UK
- Ines Aguinaga-Ontoso: Department of Health Sciences, Public University of Navarra, 31008 Pamplona, Spain; Healthcare Research Institute of Navarra (IdiSNA), 31008 Pamplona, Spain

13. Goktas P, Kucukkaya A, Karacay P. Leveraging the efficiency and transparency of artificial intelligence-driven visual Chatbot through smart prompt learning concept. Skin Res Technol 2023; 29:e13417. PMID: 38009033; PMCID: PMC10587733; DOI: 10.1111/srt.13417.
Affiliations
- Polat Goktas: UCD School of Computer Science, University College Dublin, Belfield, Dublin, Ireland; CeADAR: Ireland's Centre for Applied Artificial Intelligence, Clonskeagh, Dublin, Ireland

14. Minzoni A, Gallo O. Artificial intelligence's potential in tailoring prescription of biologic therapy for chronic rhinosinusitis. J Allergy Clin Immunol Pract 2023; 11:3285-3286. PMID: 37805232; DOI: 10.1016/j.jaip.2023.07.043.
Affiliations
- Alberto Minzoni: Department of Otorhinolaryngology, Careggi University Hospital, Florence, Italy
- Oreste Gallo: Department of Otorhinolaryngology, Careggi University Hospital, Florence, Italy

15. Goktas P, Karakaya G, Kalyoncu AF, Damadoglu E. Reply to "Artificial intelligence's potential in tailoring prescription of biological therapy for chronic rhinosinusitis". J Allergy Clin Immunol Pract 2023; 11:3286-3287. PMID: 37805233; DOI: 10.1016/j.jaip.2023.07.044.
Affiliations
- Polat Goktas: UCD School of Computer Science, University College Dublin, Belfield, Ireland; CeADAR: Ireland's Centre for Applied Artificial Intelligence, Clonskeagh, Ireland
- Gul Karakaya: School of Medicine, Department of Chest Diseases, Division of Allergy and Clinical Immunology, Hacettepe University, Ankara, Turkey
- Ali Fuat Kalyoncu: School of Medicine, Department of Chest Diseases, Division of Allergy and Clinical Immunology, Hacettepe University, Ankara, Turkey
- Ebru Damadoglu: School of Medicine, Department of Chest Diseases, Division of Allergy and Clinical Immunology, Hacettepe University, Ankara, Turkey

16. Cerci P. Allergic to ChatGPT? Introducing a Desensitization Protocol to Embrace Artificial Intelligence. Int Arch Allergy Immunol 2023; 184:903-905. PMID: 37557096; DOI: 10.1159/000531785.
Affiliations
- Pamir Cerci: Division of Immunology and Allergy, Department of Internal Medicine, Eskisehir City Hospital, Eskisehir, Turkey