1. Wen B, Shi S, Long Y, Dang Y, Tian W. PhenoDP: leveraging deep learning for phenotype-based case reporting, disease ranking, and symptom recommendation. Genome Med 2025;17:67. PMID: 40481598; PMCID: PMC12143081; DOI: 10.1186/s13073-025-01496-8.
Abstract
BACKGROUND Current phenotype-based diagnostic tools often struggle with accurate disease prioritization due to incomplete phenotypic data and the complexity of rare disease presentations. Additionally, they lack the ability to generate patient-centered clinical insights or recommend further symptoms for differential diagnosis. METHODS We developed PhenoDP, a deep learning-based toolkit with three modules: Summarizer, Ranker, and Recommender. The Summarizer uses a fine-tuned, distilled large language model to create clinical summaries from a patient's Human Phenotype Ontology (HPO) terms. The Ranker prioritizes diseases by combining information content-based, phi-based, and semantic-based similarity measures. The Recommender employs contrastive learning to recommend additional HPO terms for enhanced diagnostic accuracy. RESULTS PhenoDP's Summarizer produces more clinically coherent and patient-centered summaries than the general-purpose language model Flan-T5. The Ranker achieves state-of-the-art diagnostic performance, consistently outperforming existing phenotype-based methods across both simulated and real-world datasets. The Recommender also outperformed GPT-4o and PhenoTips in improving diagnostic accuracy when its suggested terms were incorporated into different ranking pipelines. CONCLUSIONS PhenoDP enhances Mendelian disease diagnosis through deep learning, offering precise summarization, ranking, and symptom recommendation. Its superior performance and open-source design make it a valuable clinical tool, with potential to accelerate diagnosis and improve patient outcomes. PhenoDP is freely available at https://github.com/TianLab-Bioinfo/PhenoDP.
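The Ranker described above combines several similarity measures over HPO terms. As a loose illustration of one ingredient, the sketch below computes a Resnik-style information-content similarity on a toy ontology; the term names, counts, and structure are invented for the example and are not PhenoDP's actual data or code.

```python
import math

# Toy ontology: each HPO-like term maps to its ancestor set (including itself),
# plus the number of diseases annotated with it, used to estimate IC.
ANCESTORS = {
    "HP:A": {"HP:A", "HP:ROOT"},
    "HP:B": {"HP:B", "HP:ROOT"},
    "HP:C": {"HP:C", "HP:A", "HP:ROOT"},
}
ANNOT_COUNT = {"HP:ROOT": 100, "HP:A": 40, "HP:B": 30, "HP:C": 10}
TOTAL = 100

def ic(term):
    # Information content: -log p(term), with p estimated from annotation frequency.
    return -math.log(ANNOT_COUNT[term] / TOTAL)

def resnik_sim(t1, t2):
    # Similarity = IC of the most informative common ancestor (MICA).
    common = ANCESTORS[t1] & ANCESTORS[t2]
    return max(ic(t) for t in common)
```

Terms whose only shared ancestor is the root (annotated to everything) get similarity 0, while terms sharing a rarer ancestor score higher.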
Affiliation(s)
- Baole Wen
- State Key Laboratory of Genetics and Development of Complex Phenotypes, Department of Computational Biology, School of Life Sciences, Fudan University, 2005 Songhu Road, Shanghai, 200438, China
- Sheng Shi
- State Key Laboratory of Genetics and Development of Complex Phenotypes, Department of Computational Biology, School of Life Sciences, Fudan University, 2005 Songhu Road, Shanghai, 200438, China
- Yi Long
- School of Medicine, Nankai University, Tianjin, 300071, China
- Yanan Dang
- State Key Laboratory of Genetics and Development of Complex Phenotypes, Department of Computational Biology, School of Life Sciences, Fudan University, 2005 Songhu Road, Shanghai, 200438, China
- Weidong Tian
- State Key Laboratory of Genetics and Development of Complex Phenotypes, Department of Computational Biology, School of Life Sciences, Fudan University, 2005 Songhu Road, Shanghai, 200438, China
- Children's Hospital of Fudan University, Shanghai, 201102, China
- Children's Hospital of Shandong University, Jinan, Shandong, 250022, China
2. Birol NY, Çiftci HB, Yılmaz A, Çağlayan A, Alkan F. Is there any room for ChatGPT AI bot in speech-language pathology? Eur Arch Otorhinolaryngol 2025;282:3267-3280. PMID: 40025183; PMCID: PMC12122639; DOI: 10.1007/s00405-025-09295-y.
Abstract
PURPOSE This study investigates the potential of the ChatGPT-4.0 artificial intelligence bot to assist speech-language pathologists (SLPs) by assessing its accuracy, comprehensiveness, and relevance in various tasks related to speech, language, and swallowing disorders. METHOD In this cross-sectional descriptive study, 15 practicing SLPs evaluated ChatGPT-4.0's responses to task-specific queries across six core areas: report writing, assessment material generation, clinical decision support, therapy stimulus generation, therapy planning, and client/family training material generation. English prompts were created in seven areas: speech sound disorders, motor speech disorders, aphasia, stuttering, childhood language disorders, voice disorders, and swallowing disorders. These prompts were entered into ChatGPT-4.0, and its responses were evaluated. Using a three-point Likert-type scale, participants rated each response for accuracy, relevance, and comprehensiveness based on clinical expectations and their professional judgment. RESULTS The study revealed that ChatGPT-4.0 performed with predominantly high accuracy, comprehensiveness, and relevance in tasks related to speech and language disorders. High accuracy, comprehensiveness, and relevance were observed in report writing, clinical decision support, and education material creation. However, tasks such as therapy stimulus generation and therapy planning showed more variation, with medium and high accuracy levels. CONCLUSIONS ChatGPT-4.0 shows promise in assisting SLPs with various professional tasks, particularly report writing, clinical decision support, and education material creation. However, further research is needed to address its limitations in therapy stimulus generation and therapy planning to improve its usability in clinical practice. Integrating AI technologies such as ChatGPT could improve the efficiency and effectiveness of therapeutic processes in speech-language pathology.
Affiliation(s)
- Namık Yücel Birol
- Department of Speech and Language Therapy, Faculty of Health Sciences, Tarsus University, Mersin, Türkiye
- Hilal Berber Çiftci
- Department of Speech and Language Therapy, Faculty of Health Sciences, Tarsus University, Mersin, Türkiye
- Ayşegül Yılmaz
- Department of Speech and Language Therapy, Graduate School of Health Sciences, İstanbul Medipol University, İstanbul, Türkiye
- Ayhan Çağlayan
- Çağlayan Speech and Language Therapy Center, İzmir, Türkiye
- Ferhat Alkan
- Department of Speech and Language Therapy, Institute of Graduate Education, İstinye University, İstanbul, Türkiye
3. Owens OL, Leonard M. A Comparison of Prostate Cancer Screening Information Quality on Standard and Advanced Versions of ChatGPT, Google Gemini, and Microsoft Copilot: A Cross-Sectional Study. Am J Health Promot 2025;39:766-776. PMID: 39854615; DOI: 10.1177/08901171251316371.
Abstract
PURPOSE Artificially Intelligent (AI) chatbots have the potential to produce information to support shared prostate cancer (PrCA) decision-making. Therefore, our purpose was to evaluate and compare the accuracy, completeness, readability, and credibility of responses from standard and advanced versions of popular chatbots: ChatGPT-3.5, ChatGPT-4.0, Microsoft Copilot, Microsoft Copilot Pro, Google Gemini, and Google Gemini Advanced. We also investigated whether prompting chatbots for low-literacy PrCA information would improve the readability of responses. Lastly, we determined whether the responses were appropriate for African-American men, who have the worst PrCA outcomes. APPROACH The study used a cross-sectional approach to examine the quality of responses solicited from chatbots. PARTICIPANTS The study did not include human subjects. METHOD Eleven frequently asked PrCA questions, based on resources produced by the Centers for Disease Control and Prevention (CDC) and the American Cancer Society (ACS), were posed to each chatbot twice (once for low-literacy populations). A coding/rating form containing questions with key points/answers from the ACS or CDC was used to facilitate the rating process. Accuracy and completeness were rated dichotomously (i.e., yes/no). Credibility was determined by whether a trustworthy medical or health-related organization was cited. Readability was determined using a Flesch-Kincaid readability score calculator into which chatbot responses were entered individually. Average accuracy, completeness, credibility, and readability percentages or scores were calculated using Excel. RESULTS All chatbots were accurate, but the completeness, readability, and credibility of responses varied. Soliciting low-literacy responses significantly improved readability, but sometimes to the detriment of completeness. All chatbots recognized the higher PrCA risk in African-American men and tailored screening recommendations. Microsoft Copilot Pro had the best overall performance on standard screening questions. Microsoft Copilot outperformed other chatbots on responses for low-literacy populations. CONCLUSIONS AI chatbots are useful tools for learning about PrCA screening but should be combined with healthcare provider advice.
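The readability assessment above relies on Flesch-Kincaid scoring. As a rough illustration, the sketch below computes the standard Flesch-Kincaid grade-level formula with a crude vowel-group syllable counter; real calculators use more careful syllable rules, so scores will differ slightly.

```python
import re

def count_syllables(word):
    # Rough heuristic: count runs of consecutive vowels; every word gets >= 1.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fk_grade(text):
    # Flesch-Kincaid grade level:
    #   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words)) - 15.59)
```

Short sentences of one-syllable words score below grade 0, which is why prompting for low-literacy output measurably lowers the grade level.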
Affiliation(s)
- Otis L Owens
- College of Social Work, University of South Carolina, Columbia, SC, USA
- Michael Leonard
- College of Social Work, University of South Carolina, Columbia, SC, USA
4. Ye Q, Wang S, Liu Y, Luo J, Deng Z, Liu S, Jiang Y. Application of AI-assisted multi-advisor system combined with BOPPPS teaching model in clinical pharmacy education. BMC Med Educ 2025;25:783. PMID: 40426135; PMCID: PMC12107824; DOI: 10.1186/s12909-025-07394-2.
Abstract
BACKGROUND The development of clinical pharmacy in China has been relatively slow, and standardized, effective training for clinical pharmacists remains a major challenge. At present, traditional teaching methods are not conducive to cultivating clinical practice abilities or critical thinking, leading to a lack of enthusiasm among students. To address this issue, a hybrid teaching model combining an AI-assisted multi-advisor system with the BOPPPS (bridge, objective, pre-assessment, participatory learning, post-assessment, summary) teaching method was applied in clinical pharmacy education. METHODS In this study, a teaching model was developed that consists of five components: advisor system, teaching strategy, teaching method, teaching mode, and assessment methods. The model was implemented in the Pharmacy Department of Xiangya Hospital at Central South University in Changsha, China, from spring 2023 to spring 2024. After the spring 2024 training session, anonymous questionnaires were completed by students, and the effects of the teaching reform were evaluated. The questionnaire focused on clinical practice effectiveness, course design rationality, improvement in professional ability, interest in learning, and perceptions of ChatGPT as a teaching tool. RESULTS A total of 81 students participated in the study, with 69 completing the questionnaire survey (response rate: 85.2%). Overall, students gave positive comments regarding the instructional improvements. A higher proportion of interns and advanced students considered the courses difficult (32.4%, n = 12 and 31.6%, n = 6, respectively). Identified shortcomings included frequent periodic assessments and increased pressure. Regarding the use of ChatGPT, 49 students (71.0%) commented that it played a multidimensional role in teaching. Its most valued asset was the ability to provide large amounts of information (43.5%, n = 30), while its lack of interactivity was a prominent issue (56.5%, n = 39). CONCLUSION This innovative teaching model, integrating an AI-assisted BOPPPS framework with a supervised ChatGPT implementation, may facilitate comprehensive self-evaluation of abilities. This research offers a replicable model for AI-driven educational transformation in clinical pharmacy training.
Affiliation(s)
- Qianqian Ye
- Department of Pharmacy, Xiangya Hospital, Central South University, Changsha, Hunan, 410008, China
- National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha, Hunan, 410008, China
- Shu Wang
- Zhongshan Medical College, Sun Yat-Sen University, Guangzhou, Guangdong, China
- Yufeng Liu
- College of Pharmacy, Jining Medical University, Rizhao, Shandong, 276826, China
- Jinque Luo
- Hunan Provincial Key Laboratory of the Research and Development of Novel Pharmaceutical Preparations, "The 14th Five-Year Plan" Application Characteristic Discipline of Hunan Province (Pharmaceutical Science), College of Pharmacy, Changsha Medical University, Changsha, Hunan, 410219, China
- Ziwei Deng
- Department of Clinical Pharmacy, Hunan University of Medicine General Hospital, Huaihua, Hunan, 418000, China
- Shao Liu
- Department of Pharmacy, Xiangya Hospital, Central South University, Changsha, Hunan, 410008, China
- National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha, Hunan, 410008, China
- Yueping Jiang
- Department of Pharmacy, Xiangya Hospital, Central South University, Changsha, Hunan, 410008, China
- National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha, Hunan, 410008, China
- Department of Clinical Pharmacy, Hunan University of Medicine General Hospital, Huaihua, Hunan, 418000, China
- Hunan Provincial Key Laboratory of the Research and Development of Novel Pharmaceutical Preparations, "The 14th Five-Year Plan" Application Characteristic Discipline of Hunan Province (Pharmaceutical Science), College of Pharmacy, Changsha Medical University, Changsha, Hunan, 410219, China
5. Ali MJ. DeepSeek™ and lacrimal drainage disorders: hype or is it performing better than ChatGPT™? Orbit 2025:1-7. PMID: 40336348; DOI: 10.1080/01676830.2025.2501656.
Abstract
PURPOSE This study aimed to report the performance of the large language model DeepSeek (DeepSeek, Hangzhou, China) and perform a head-to-head comparison with ChatGPT (OpenAI, San Francisco, USA) in the context of lacrimal drainage disorders. METHODS Questions and statements were used to construct prompts covering common and uncommon aspects of lacrimal drainage disorders. Prompts avoided new knowledge beyond February 2024 and were presented at least twice to the latest versions of DeepSeek and ChatGPT (accessed February 15-18, 2025). A set of prompts assessed with ChatGPT in 2023 (ChatGPT-2023) was also utilized in this study. The responses of DeepSeek and ChatGPT were analyzed for evidence-based content, updated knowledge, specific responses, speed, and factual inaccuracies. The responses of the current ChatGPT were also compared with those from 2023 to assess the improvement of the artificial intelligence chatbot. Three lacrimal surgeons graded the responses into three categories: correct, partially correct, and factually incorrect. They also compared the overall quality of the responses between DeepSeek and ChatGPT based on the overall content, organization, and clarity of the answers. RESULTS Twenty-five prompts were presented to the latest versions (February 2025) of DeepSeek and ChatGPT. There was no significant difference in the speed of response. The agreement among the three observers was high (96%) in grading the responses. In terms of accuracy, both AI models were similar. DeepSeek's responses were graded as correct in 60% (15/25), partially correct in 36% (9/25), and factually incorrect in 4% (1/25). ChatGPT-2025 responses were graded as correct in 56% (14/25), partially correct in 40% (10/25), and factually incorrect in 4% (1/25). Compared to 2023, ChatGPT-2025 gave responses that were more specific, more accurate, and less generic, with less recycling of phrases. When confronted with inaccuracies, both models admitted and corrected the mistakes in subsequent responses. Both AI models demonstrated the capability of challenging incorrect prompts and premises. CONCLUSION DeepSeek was not superior but comparable to ChatGPT in the context of lacrimal drainage disorders. Each had unique advantages, and they could complement each other. They need to be specifically trained and re-trained for individual medical subspecialties.
Affiliation(s)
- Mohammad Javed Ali
- Govindram Seksaria Institute of Dacryology, L.V. Prasad Eye Institute, Hyderabad, India
6. Iqbal U, Tanweer A, Rahmanti AR, Greenfield D, Lee LTJ, Li YCJ. Impact of large language model (ChatGPT) in healthcare: an umbrella review and evidence synthesis. J Biomed Sci 2025;32:45. PMID: 40335969; PMCID: PMC12057020; DOI: 10.1186/s12929-025-01131-z.
Abstract
BACKGROUND The emergence of Artificial Intelligence (AI), particularly Chat Generative Pre-Trained Transformer (ChatGPT), a Large Language Model (LLM), in healthcare promises to reshape patient care, clinical decision-making, and medical education. This review aims to synthesise research findings to consolidate the implications of ChatGPT integration in healthcare and identify research gaps. MAIN BODY The umbrella review was conducted following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. The Cochrane Library, PubMed, Scopus, Web of Science, and Google Scholar were searched from inception until February 2024. Due to the heterogeneity of the included studies, no quantitative analysis was performed; instead, information was extracted, summarised, synthesised, and presented in narrative form. Two reviewers undertook title, abstract, and full-text screening independently. The methodological quality and overall rating of the included reviews were assessed using the A MeaSurement Tool to Assess systematic Reviews (AMSTAR-2) checklist. The review examined 17 studies, comprising 15 systematic reviews and 2 meta-analyses, on ChatGPT in healthcare, revealing diverse focuses. The AMSTAR-2 assessment identified 5 moderate- and 12 low-quality reviews, with deficiencies such as missing justification of study designs and unreported funding sources. The most frequently reported theme was ChatGPT's use in disease diagnosis or clinical decision-making. While 82.4% of studies focused on its general usage, 17.6% explored unique topics such as its role in medical examinations and in conducting systematic reviews. Among these, 52.9% targeted general healthcare, with 41.2% focusing on specific domains like radiology, neurosurgery, gastroenterology, public health dentistry, and ophthalmology. ChatGPT's use for manuscript review or writing was mentioned in 17.6% of reviews. Promising applications include enhancing patient care and clinical decision-making, though ethical, legal, and accuracy concerns require cautious integration. CONCLUSION We summarise the areas identified in reviews regarding ChatGPT's transformative impact in healthcare, highlighting patient care, decision-making, and medical education. Emphasising the importance of ethical regulations and the involvement of policymakers, we urge further investigation to ensure the reliability of ChatGPT and to promote trust in healthcare and research.
Affiliation(s)
- Usman Iqbal
- Institute for Evidence-Based Healthcare, Faculty of Health Sciences & Medicine, Bond University, Gold Coast, Australia
- Evidence-Based Practice Professorial Unit, Gold Coast Hospital & Health Service (GCHHS), Gold Coast, QLD, Australia
- Afifa Tanweer
- Department of Nutrition & Dietetics, School of Health Sciences, University of Management and Technology, Lahore, Pakistan
- Annisa Ristya Rahmanti
- Department of Health Policy and Management, Faculty of Medicine, Public Health and Nursing, Universitas Gadjah Mada, Yogyakarta, Indonesia
- Department of Computer Science, Faculty of Science and Technology, Middlesex University, London, UK
- David Greenfield
- School of Population Health, Faculty of Medicine and Health, University of New South Wales (UNSW), Sydney, Australia
- Leon Tsung-Ju Lee
- Graduate Institute of Clinical Medicine, Taipei Medical University, Taipei, Taiwan
- Department of Dermatology, Taipei Medical University Hospital, Taipei Medical University, Taipei, Taiwan
- Department of Dermatology, School of Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
- Yu-Chuan Jack Li
- Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan
- Department of Dermatology, Taipei Municipal Wanfang Hospital, Taipei Medical University, Taipei, Taiwan
- International Center for Health Information Technology, Taipei Medical University, Taipei, Taiwan
7. Bui N, Nguyen G, Nguyen N, Vo B, Vo L, Huynh T, Tang A, Tran VN, Huynh T, Nguyen HQ, Dinh M. Fine-tuning large language models for improved health communication in low-resource languages. Comput Methods Programs Biomed 2025;263:108655. PMID: 39987667; DOI: 10.1016/j.cmpb.2025.108655.
Abstract
BACKGROUND This study illustrates a methodology for compiling training datasets to fine-tune Large Language Models (LLMs) for healthcare information in Vietnamese, a low-resource language. The objective is to bridge the gap in medical information accessibility and enhance healthcare communication in developing countries by adapting LLMs to specific linguistic nuances and domain needs. METHOD The methodology involves selecting a base model, compiling a domain-specific dataset, and fine-tuning the model with this dataset. Three open-source models were selected. The dataset, comprising approximately 337,000 prompt-response pairs in Vietnamese, was compiled from existing datasets, data crawled from Vietnamese medical online forums, and material distilled from Vietnamese medical textbooks. The three models were fine-tuned using the Low-Rank Adaptation (LoRA) and Quantized Low-Rank Adaptation (QLoRA) techniques. Model performance was evaluated using BERTScore, ROUGE-L, and the "LLM-as-a-Judge" method. RESULTS The fine-tuned models showed performance enhancements over their base versions across all three evaluation metrics, confirming the effectiveness of the fine-tuning process. This study details the process of fine-tuning open-source LLMs for health information inquiries in Vietnamese, demonstrating its potential to improve healthcare communication in low-resource languages. Deploying the fine-tuned LLM on-premise enhances data privacy and security. However, the significant computing power and costs required pose challenges, especially for organizations in developing countries. CONCLUSION This case study highlights the unique challenges faced by developing countries using low-resource languages. Initiatives are needed to bridge healthcare gaps in underserved areas and contribute to global health equity.
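One of the evaluation metrics above, ROUGE-L, is simple enough to sketch: it scores a candidate against a reference by the length of their longest common subsequence (LCS). The minimal implementation below tokenizes on whitespace and returns the ROUGE-L F-measure; production toolkits add stemming and other preprocessing, so exact numbers will differ.

```python
def lcs_len(a, b):
    # Dynamic-programming longest common subsequence length over token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    # ROUGE-L F-measure: precision and recall of the LCS against each text.
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)
```

Because the LCS preserves word order without requiring contiguity, ROUGE-L rewards fluent paraphrases that keep the reference's sentence structure.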
Affiliation(s)
- Nhat Bui
- School of Science, Engineering and Technology, RMIT University, Ho Chi Minh City, Vietnam
- Giang Nguyen
- School of Science, Engineering and Technology, RMIT University, Ho Chi Minh City, Vietnam
- Nguyen Nguyen
- School of Science, Engineering and Technology, RMIT University, Ho Chi Minh City, Vietnam
- Bao Vo
- School of Science, Engineering and Technology, RMIT University, Ho Chi Minh City, Vietnam
- Luan Vo
- School of Science, Engineering and Technology, RMIT University, Ho Chi Minh City, Vietnam
- Tom Huynh
- School of Science, Engineering and Technology, RMIT University, Ho Chi Minh City, Vietnam
- Arthur Tang
- School of Science, Engineering and Technology, RMIT University, Ho Chi Minh City, Vietnam
- Van Nhiem Tran
- AI Research Center, Hon Hai Research Institute, Taipei 114699, Taiwan
- Tuyen Huynh
- Oxford University Clinical Research Unit (OUCRU), Ho Chi Minh City, Vietnam
- Huy Quang Nguyen
- Oxford University Clinical Research Unit (OUCRU), Ho Chi Minh City, Vietnam
- Minh Dinh
- School of Science, Engineering and Technology, RMIT University, Ho Chi Minh City, Vietnam
8. Chen X, Wang T, Zhou J, Song Z, Gao X, Zhang X. Evaluating and mitigating bias in AI-based medical text generation. Nat Comput Sci 2025;5:388-396. PMID: 40269315; DOI: 10.1038/s43588-025-00789-7.
Abstract
Artificial intelligence (AI) systems, particularly those based on deep learning models, have increasingly achieved expert-level performance in medical applications. However, there is growing concern that such AI systems may reflect and amplify human bias, reducing the quality of their performance in historically underserved populations. The fairness issue has attracted considerable research interest in the medical imaging classification field, yet it remains understudied in the text-generation domain. In this study, we investigate the fairness problem in text generation within the medical field and observe substantial performance discrepancies across different races, sexes and age groups, including intersectional groups, various model scales and different evaluation metrics. To mitigate this fairness issue, we propose an algorithm that selectively optimizes those underserved groups to reduce bias. Our evaluations across multiple backbones, datasets and modalities demonstrate that our proposed algorithm enhances fairness in text generation without compromising overall performance.
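The paper's mitigation idea, selectively optimizing underserved groups, can be loosely illustrated by reweighting training losses toward the worst-performing groups. The sketch below is a generic reweighting scheme of this flavour, not the authors' algorithm; the softmax temperature and function names are invented for the example.

```python
import numpy as np

def group_weights(losses, groups, temperature=1.0):
    # Compute per-group mean loss, then softmax over those means so that
    # higher-loss (underserved) groups receive larger optimization weight.
    uniq = sorted(set(groups))
    means = np.array([np.mean([l for l, g in zip(losses, groups) if g == u])
                      for u in uniq])
    w = np.exp(means / temperature)
    w /= w.sum()
    return dict(zip(uniq, w))
```

During training, each example's loss would be scaled by its group's weight, so gradient updates favour the groups the model currently serves worst; lowering the temperature sharpens this emphasis.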
Affiliation(s)
- Xiuying Chen
- Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
- King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Tairan Wang
- King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Juexiao Zhou
- King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Zirui Song
- Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
- Xin Gao
- King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Xiangliang Zhang
- King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- University of Notre Dame, Notre Dame, IN, USA
9. Young T, Au Yeung J, Sambasivan K, Adjogatse D, Kong A, Petkar I, Reis Ferreira M, Lei M, King A, Teo J, Guerrero Urbano T. Natural Language Processing to Extract Head and Neck Cancer Data From Unstructured Electronic Health Records. Clin Oncol (R Coll Radiol) 2025;41:103805. PMID: 40188745; DOI: 10.1016/j.clon.2025.103805.
Abstract
AIMS Patient data is frequently stored as unstructured data within Electronic Health Records (EHRs), requiring manual curation. AI tools using Natural Language Processing (NLP) may rapidly curate accurate real-world unstructured EHRs to enrich datasets. We evaluated this approach for Head and Neck Cancer (HNC) patient data extraction using an open-source general-purpose healthcare NLP tool (CogStack). MATERIALS AND METHODS CogStack was applied to extract relevant SNOMED-CT concepts from HNC patients' documents, generating outputs denoting the identification of each concept for each patient. Outputs were compared to manually curated ground-truth HNC datasets to calculate pre-training performance. Supervised model training was then performed using SNOMED-CT concept annotation on clinical documents, and the updated model was re-evaluated. A second training cycle was performed before the final evaluation. A thresholding approach (multiple detections needed to qualify a concept as 'present') was used to increase precision. The final model was evaluated on an unseen test cohort. The F1 score (the harmonic mean of precision and recall) was used for evaluation. RESULTS Pre-training, the F1 score was incalculable for 19.5% of concepts due to insufficient recall. Following one training cycle, the F1 score became calculable for all concepts (median 0.692). After further training, the final model demonstrated improvement in the median F1 score (0.708). The test cohort median F1 score was 0.750. Thresholding analysis produced a concept-specific best-threshold approach, resulting in a median F1 score of 0.778 in the test cohort, where 50 out of 109 SNOMED-CT concepts met pre-set criteria to be considered adequately fine-tuned. CONCLUSIONS NLP can mine unstructured cancer data following limited training, although certain concepts, such as histopathology terms, remained poorly retrieved. Model performance was maintained when applied to a test cohort, demonstrating good generalisability. The concept-specific thresholding strategy improved performance, and fine-tuning annotations were incorporated into the NLP parent model to improve future performance. CogStack has been applied to extract data for 50 concepts with validated performance across our entire retrospective HNC cohort.
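The thresholding approach described above (a concept counts as 'present' only after enough detections) pairs naturally with F1-based tuning. The sketch below picks, for one concept, the detection threshold that maximises F1 against ground truth; the variable names and toy data are illustrative, not CogStack's implementation.

```python
def f1(tp, fp, fn):
    # Harmonic mean of precision and recall; zero when nothing true is retrieved.
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def best_threshold(counts, truth, max_t=5):
    # counts: detection counts of one concept per patient document set.
    # truth: ground-truth presence of that concept per patient.
    best = (1, 0.0)
    for t in range(1, max_t + 1):
        pred = [c >= t for c in counts]
        tp = sum(p and g for p, g in zip(pred, truth))
        fp = sum(p and not g for p, g in zip(pred, truth))
        fn = sum((not p) and g for p, g in zip(pred, truth))
        score = f1(tp, fp, fn)
        if score > best[1]:
            best = (t, score)
    return best  # (threshold, F1 at that threshold)
```

Raising the threshold trades recall for precision, which is why a per-concept best threshold can beat a single global cut-off.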
Affiliation(s)
- T Young
- Guy's and St Thomas' NHS Foundation Trust (GSTT), UK; King's College London, UK
- J Au Yeung
- Guy's and St Thomas' NHS Foundation Trust (GSTT), UK; King's College London, UK
- K Sambasivan
- Guy's and St Thomas' NHS Foundation Trust (GSTT), UK; King's College London, UK
- D Adjogatse
- Guy's and St Thomas' NHS Foundation Trust (GSTT), UK; King's College London, UK
- A Kong
- Guy's and St Thomas' NHS Foundation Trust (GSTT), UK; King's College London, UK
- I Petkar
- Guy's and St Thomas' NHS Foundation Trust (GSTT), UK; King's College London, UK
- M Reis Ferreira
- Guy's and St Thomas' NHS Foundation Trust (GSTT), UK; King's College London, UK
- M Lei
- Guy's and St Thomas' NHS Foundation Trust (GSTT), UK; King's College London, UK
- A King
- King's College London, UK
- J Teo
- Guy's and St Thomas' NHS Foundation Trust (GSTT), UK; King's College London, UK
- T Guerrero Urbano
- Guy's and St Thomas' NHS Foundation Trust (GSTT), UK; King's College London, UK
10. Xiong YT, Zhan ZZ, Zhong CL, Zeng W, Guo JX, Tang W, Liu C. Evaluating the Performance of Large Language Models (LLMs) in Answering and Analysing the Chinese Dental Licensing Examination. Eur J Dent Educ 2025;29:332-340. PMID: 39889108; DOI: 10.1111/eje.13073.
Abstract
BACKGROUND This study aimed to simulate diverse scenarios of students employing LLMs to prepare for the Chinese Dental Licensing Examination (CDLE), providing a detailed evaluation of their performance in medical education. METHODS A stratified random sampling strategy was implemented to select and subsequently revise 200 questions from the CDLE. Seven LLMs, recognised for their exceptional performance in the Chinese domain, were selected as test subjects. Three distinct testing scenarios were constructed: answering questions, explaining questions and adversarial testing. The evaluation metrics included accuracy, agreement rate and teaching effectiveness score. Wald χ2 tests and Kruskal-Wallis tests were employed to determine whether the differences among the LLMs across various scenarios, and before and after adversarial testing, were statistically significant. RESULTS The majority of the tested LLMs met the passing threshold on the CDLE benchmark, with Doubao-pro 32k and Qwen2-72b achieving the highest accuracy rates (both 81%). Doubao-pro 32k demonstrated the highest agreement rate with the reference answers (98%) when providing explanations. Although statistically significant differences existed among the LLMs in their teaching effectiveness scores based on the Likert scale, all models demonstrated a commendable ability to deliver comprehensible and effective instructional content. In adversarial testing, GPT-4 exhibited the smallest decline in accuracy (2%, p = 0.623), while ChatGLM-4 demonstrated the least reduction in agreement rate (14.6%, p = 0.001). CONCLUSIONS LLMs trained on Chinese corpora, such as Doubao-pro 32k, performed better than GPT-4 in answering and explaining questions, although the difference was not statistically significant. However, during adversarial testing, all models exhibited diminished performance, with GPT-4 displaying comparatively greater robustness.
Future research should further investigate the interpretability of LLM outputs and develop strategies to mitigate hallucinations generated in medical education.
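The accuracy metric and the adversarial decline reported above reduce to simple proportions. A hedged sketch with hypothetical answer keys (not the study's data):

```python
def accuracy(model_answers, answer_key):
    """Proportion of exam questions where the model's chosen option
    matches the reference key."""
    hits = sum(m == k for m, k in zip(model_answers, answer_key))
    return hits / len(answer_key)


def adversarial_decline(acc_before: float, acc_after: float) -> float:
    """Absolute drop in accuracy after adversarial rewording of the items,
    e.g. the 2-percentage-point decline reported for GPT-4."""
    return acc_before - acc_after
```

For instance, a model answering three of four hypothetical items correctly scores an accuracy of 0.75; comparing accuracies before and after rewording gives the decline tested for significance.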
Affiliation(s)
- Yu-Tao Xiong, State Key Laboratory of Oral Diseases & National Center for Stomatology & National Clinical Research Center for Oral Diseases & Department of Oral and Maxillofacial Surgery, West China Hospital of Stomatology, Sichuan University, Chengdu, China
- Zheng-Zhe Zhan, State Key Laboratory of Oral Diseases & National Center for Stomatology & National Clinical Research Center for Oral Diseases & Department of Oral and Maxillofacial Surgery, West China Hospital of Stomatology, Sichuan University, Chengdu, China
- Cheng-Lan Zhong, Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu, China
- Wei Zeng, State Key Laboratory of Oral Diseases & National Center for Stomatology & National Clinical Research Center for Oral Diseases & Department of Oral and Maxillofacial Surgery, West China Hospital of Stomatology, Sichuan University, Chengdu, China
- Ji-Xiang Guo, Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu, China
- Wei Tang, State Key Laboratory of Oral Diseases & National Center for Stomatology & National Clinical Research Center for Oral Diseases & Department of Oral and Maxillofacial Surgery, West China Hospital of Stomatology, Sichuan University, Chengdu, China
- Chang Liu, State Key Laboratory of Oral Diseases & National Center for Stomatology & National Clinical Research Center for Oral Diseases & Department of Oral and Maxillofacial Surgery, West China Hospital of Stomatology, Sichuan University, Chengdu, China
11
Tay JRH, Chow DY, Lim YRI, Ng E. Enhancing patient-centered information on implant dentistry through prompt engineering: a comparison of four large language models. FRONTIERS IN ORAL HEALTH 2025; 6:1566221. [PMID: 40260428 PMCID: PMC12009804 DOI: 10.3389/froh.2025.1566221] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2025] [Accepted: 03/21/2025] [Indexed: 04/23/2025] Open
Abstract
Background Patients frequently seek dental information online, and generative pre-trained transformers (GPTs) may be a valuable resource. However, the quality of responses based on varying prompt designs has not been evaluated. As dental implant treatment is widely performed, this study aimed to investigate the influence of prompt design on GPT performance in answering commonly asked questions related to dental implants. Materials and methods Thirty commonly asked questions about implant dentistry - covering patient selection, associated risks, peri-implant disease symptoms, treatment for missing teeth, prevention, and prognosis - were posed to four different GPT models with different prompt designs. Responses were recorded and independently appraised by two periodontists across six quality domains. Results All models performed well, with responses classified as good quality. The contextualized model performed worse on treatment-related questions (21.5 ± 3.4, p < 0.05), but outperformed the input-output, zero-shot chain-of-thought, and instruction-tuned models in citing appropriate sources in its responses (4.1 ± 1.0, p < 0.001). However, its responses had less clarity and relevance compared to the other models. Conclusion GPTs can provide accurate, complete, and useful information for questions related to dental implants. While prompt designs can enhance response quality, further refinement is necessary to optimize their performance.
Affiliation(s)
- John Rong Hao Tay, Department of Restorative Dentistry, National Dental Centre Singapore, Singapore, Singapore; Health Services and Systems Research Programme, Duke-NUS Medical School, Singapore, Singapore
- Dian Yi Chow, Department of Restorative Dentistry, National Dental Centre Singapore, Singapore, Singapore
- Ethan Ng, Department of Restorative Dentistry, National Dental Centre Singapore, Singapore, Singapore; Centre for Oral Clinical Research, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
12
Meyer MKR, Kandathil CK, Davis SJ, Durairaj KK, Patel PN, Pepper JP, Spataro EA, Most SP. Evaluation of Rhinoplasty Information from ChatGPT, Gemini, and Claude for Readability and Accuracy. Aesthetic Plast Surg 2025; 49:1868-1873. [PMID: 39285054 DOI: 10.1007/s00266-024-04343-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2024] [Accepted: 08/23/2024] [Indexed: 04/26/2025]
Abstract
OBJECTIVE Assessment of the readability, accuracy, quality, and completeness of ChatGPT (OpenAI, San Francisco, CA), Gemini (Google, Mountain View, CA), and Claude (Anthropic, San Francisco, CA) responses to common questions about rhinoplasty. METHODS Ten questions commonly encountered in the senior author's (SPM) rhinoplasty practice were presented to ChatGPT-4, Gemini and Claude. Seven Facial Plastic and Reconstructive Surgeons with experience in rhinoplasty were asked to evaluate these responses for accuracy, quality, completeness, relevance, and use of medical jargon on a Likert scale. The responses were also evaluated using several readability indices. RESULTS ChatGPT achieved significantly higher evaluator scores for accuracy and overall quality but scored significantly lower on completeness compared to Gemini and Claude. The responses of all three chatbots to the ten questions were rated as neutral to incomplete. All three chatbots were found to use medical jargon and scored at a college reading level on the readability indices. CONCLUSIONS Rhinoplasty surgeons should be aware that the medical information found on chatbot platforms is incomplete and still needs to be scrutinized for accuracy. However, the technology does have potential for use in healthcare education by training it on evidence-based recommendations and improving readability. LEVEL OF EVIDENCE V This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .
Affiliation(s)
- Monica K Rossi Meyer, Division of Facial Plastic and Reconstructive Surgery, Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, USA
- Cherian Kurian Kandathil, Division of Facial Plastic and Reconstructive Surgery, Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, USA
- Seth J Davis, Division of Facial Plastic and Reconstructive Surgery, Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, USA
- K Kay Durairaj, Department of Otolaryngology, Head and Neck Surgery, Huntington Hospital, Pasadena, California, USA; Kay Durairaj, MD, A Medical Corp, Pasadena, California, USA
- Priyesh N Patel, Division of Facial Plastic and Reconstructive Surgery, Department of Otolaryngology-Head and Neck Surgery, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Jon-Paul Pepper, Division of Facial Plastic and Reconstructive Surgery, Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, USA
- Emily A Spataro, Division of Facial Plastic and Reconstructive Surgery, Department of Otolaryngology-Head and Neck Surgery, Washington University School of Medicine, St. Louis, Missouri, USA
- Sam P Most, Division of Facial Plastic and Reconstructive Surgery, Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, USA
13
Umesh C, Mahendra M, Bej S, Wolkenhauer O, Wolfien M. Challenges and applications in generative AI for clinical tabular data in physiology. Pflugers Arch 2025; 477:531-542. [PMID: 39417878 PMCID: PMC11958401 DOI: 10.1007/s00424-024-03024-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2024] [Revised: 09/17/2024] [Accepted: 09/23/2024] [Indexed: 10/19/2024]
Abstract
Recent advancements in generative AI have opened up the prospect of synthetic tabular clinical data generation. From filling in missing values in real-world data, these approaches have now advanced to creating complex multi-table datasets. This review explores the development of techniques capable of synthesizing patient data and modeling multiple tables. We highlight the challenges and opportunities of these methods for analyzing patient data in physiology, and we discuss their potential to improve clinical research, personalized medicine, and healthcare policy. The integration of these generative models into physiological settings may represent both a theoretical advancement and a practical tool that has the potential to improve mechanistic understanding and patient care. By providing a reliable source of synthetic data, these models can also help mitigate privacy concerns and facilitate large-scale data sharing.
Affiliation(s)
- Chaithra Umesh, Institute of Computer Science, Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany
- Manjunath Mahendra, Institute of Computer Science, Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany
- Saptarshi Bej, School of Data Science, Indian Institute of Science Education and Research (IISER), Thiruvananthapuram, India
- Olaf Wolkenhauer, Institute of Computer Science, Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany; Leibniz-Institute for Food Systems Biology, Technical University of Munich, Freising, Germany
- Markus Wolfien, Faculty of Medicine Carl Gustav Carus, Institute for Medical Informatics and Biometry, TUD Dresden University of Technology, Dresden, Germany; Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI), Dresden, Germany
14
Asiksoy G. Nurses' assessment of artificial intelligence chatbots for health literacy education. JOURNAL OF EDUCATION AND HEALTH PROMOTION 2025; 14:128. [PMID: 40271238 PMCID: PMC12017437 DOI: 10.4103/jehp.jehp_1195_24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/30/2024] [Accepted: 09/24/2024] [Indexed: 04/25/2025]
Abstract
BACKGROUND Artificial intelligence (AI)-powered chatbots are emerging as a new tool in healthcare, offering the potential to provide patients with information and support. Despite their growing presence, there are concerns regarding the medical reliability of the information they provide and the potential risks to patient safety. MATERIAL AND METHODS The aim of this study is to assess the medical reliability of responses to health-related questions provided by an AI-powered chatbot and to evaluate the risks to patient safety. The study is designed using a mixed-methods phenomenology approach. The participants are 44 nurses working at a private hospital in Cyprus. Data collection was conducted via survey forms and focus group discussions. Quantitative data were analyzed using descriptive statistics, while qualitative data were examined using content analysis. RESULTS The results indicate that according to the nurses' evaluations, the medical reliability of the AI chatbot's responses is generally high. However, instances of incorrect or incomplete information were also noted. Specifically, the quantitative analysis showed that a majority of the nurses found the chatbot's responses to be accurate and useful. The qualitative analysis revealed concerns about the potential for the chatbot to misdirect patients or contribute to diagnostic errors. These risks highlight the importance of monitoring and improving the AI systems to minimize errors and enhance reliability. CONCLUSION AI chatbots can provide valuable information and support to patients, improving accessibility and engagement in healthcare. However, concerns about medical reliability and patient safety remain. Continuous evaluation and improvement of these systems are necessary, alongside efforts to enhance patients' health literacy to help them accurately assess information from AI chatbots.
Affiliation(s)
- Gulsum Asiksoy, Department of Education and Instructional Technology, Atatürk Faculty of Education, North Cyprus via Mersin 10, Turkey
15
Akrimi S, Schwensfeier L, Düking P, Kreutz T, Brinkmann C. ChatGPT-4o-Generated Exercise Plans for Patients with Type 2 Diabetes Mellitus-Assessment of Their Safety and Other Quality Criteria by Coaching Experts. Sports (Basel) 2025; 13:92. [PMID: 40278718 PMCID: PMC12031090 DOI: 10.3390/sports13040092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2025] [Revised: 03/04/2025] [Accepted: 03/17/2025] [Indexed: 04/26/2025] Open
Abstract
In this discussion paper based on preliminary data, the safety and other quality criteria of ChatGPT-4o-generated exercise plans for patients with type 2 diabetes mellitus (T2DM) are evaluated. The study team created three fictional patient profiles varying in sex, age, body mass index, secondary diseases/complications, medication, self-rated physical fitness, weekly exercise routine and personal exercise preferences. Three distinct prompts were used to generate three exercise plans for each fictional patient. While Prompt 1 was very simple, Prompt 2 and Prompt 3 included more detailed requests. Prompt 3 was optimized by ChatGPT itself. Three coaching experts reviewed the exercise plans for safety and other quality criteria and discussed their evaluations. Some of the exercise plans showed serious safety issues, especially for patients with secondary diseases/complications. While most exercise plans incorporated key training principles, they showed some deficits, e.g., insufficient feasibility. The use of more detailed prompts (Prompt 2 and Prompt 3) tended to result in more elaborate exercise plans with better ratings. ChatGPT-4o-generated exercise plans may have safety issues for patients with T2DM, indicating the need to consult a professional coach for feedback before starting a training program.
Affiliation(s)
- Samir Akrimi, Department of Preventive and Rehabilitative Sport Medicine, Institute of Cardiovascular Research and Sport Medicine, German Sport University Cologne, 50933 Cologne, Germany
- Leon Schwensfeier, Department of Preventive and Rehabilitative Sport Medicine, Institute of Cardiovascular Research and Sport Medicine, German Sport University Cologne, 50933 Cologne, Germany
- Peter Düking, Department of Sports Science and Movement Pedagogy, TU Braunschweig, 38106 Braunschweig, Germany
- Thorsten Kreutz, Department of Fitness & Health, IST University of Applied Sciences, 40233 Düsseldorf, Germany
- Christian Brinkmann, Department of Preventive and Rehabilitative Sport Medicine, Institute of Cardiovascular Research and Sport Medicine, German Sport University Cologne, 50933 Cologne, Germany; Department of Fitness & Health, IST University of Applied Sciences, 40233 Düsseldorf, Germany
16
Schnepper R, Roemmel N, Schaefert R, Lambrecht-Walzinger L, Meinlschmidt G. Exploring Biases of Large Language Models in the Field of Mental Health: Comparative Questionnaire Study of the Effect of Gender and Sexual Orientation in Anorexia Nervosa and Bulimia Nervosa Case Vignettes. JMIR Ment Health 2025; 12:e57986. [PMID: 40111287 PMCID: PMC11949086 DOI: 10.2196/57986] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 03/01/2024] [Revised: 10/30/2024] [Accepted: 11/24/2024] [Indexed: 03/22/2025] Open
Abstract
Background Large language models (LLMs) are increasingly used in mental health, showing promise in assessing disorders. However, concerns exist regarding their accuracy, reliability, and fairness. Societal biases and underrepresentation of certain populations may impact LLMs. Because LLMs are already used for clinical practice, including decision support, it is important to investigate potential biases to ensure a responsible use of LLMs. Anorexia nervosa (AN) and bulimia nervosa (BN) show a lifetime prevalence of 1%-2%, affecting more women than men. Among men, homosexual men face a higher risk of eating disorders (EDs) than heterosexual men. However, men are underrepresented in ED research, and studies on gender, sexual orientation, and their impact on AN and BN prevalence, symptoms, and treatment outcomes remain limited. Objectives We aimed to estimate the presence and size of bias related to gender and sexual orientation produced by a common LLM as well as a smaller LLM specifically trained for mental health analyses, exemplified in the context of ED symptomatology and health-related quality of life (HRQoL) of patients with AN or BN. Methods We extracted 30 case vignettes (22 AN and 8 BN) from scientific papers. We adapted each vignette to create 4 versions, describing a female versus male patient living with their female versus male partner (2 × 2 design), yielding 120 vignettes. We then fed each vignette into ChatGPT-4 and into "MentaLLaMA", based on the Large Language Model Meta AI (LLaMA) architecture, thrice, with the instruction to evaluate them by providing responses to 2 psychometric instruments: the RAND-36 questionnaire assessing HRQoL and the eating disorder examination questionnaire. With the resulting LLM-generated scores, we calculated multilevel models with a random intercept for gender and sexual orientation (accounting for within-vignette variance), nested in vignettes (accounting for between-vignette variance).
Results In ChatGPT-4, the multilevel model with 360 observations indicated a significant association with gender for the RAND-36 mental composite summary (conditional means: 12.8 for male and 15.1 for female cases; 95% CI of the effect -6.15 to -0.35; P=.04) but neither with sexual orientation (P=.71) nor with an interaction effect (P=.37). We found no indications for main effects of gender (conditional means: 5.65 for male and 5.61 for female cases; 95% CI -0.10 to 0.14; P=.88), sexual orientation (conditional means: 5.63 for heterosexual and 5.62 for homosexual cases; 95% CI -0.14 to 0.09; P=.67), or for an interaction effect (P=.61; 95% CI -0.11 to 0.19) on the eating disorder examination questionnaire overall score (conditional means 5.59-5.65; 95% CIs 5.45 to 5.7). MentaLLaMA did not yield reliable results. Conclusions LLM-generated mental HRQoL estimates for AN and BN case vignettes may be biased by gender, with male cases scoring lower despite no real-world evidence supporting this pattern. This highlights the risk of bias in generative artificial intelligence in the field of mental health. Understanding and mitigating biases related to gender and other factors, such as ethnicity and socioeconomic status, is crucial for responsible use in diagnostics and treatment recommendations.
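The 2 × 2 vignette design described in the methods (patient gender crossed with partner gender) can be sketched as follows; the vignette identifiers and dictionary keys are hypothetical, not the authors' code:

```python
from itertools import product


def expand_vignette(vignette_id: str):
    """Create the four versions of one case vignette: female/male patient
    crossed with female/male partner; same-gender partners code the
    homosexual condition, mixed-gender the heterosexual one."""
    versions = []
    for patient, partner in product(("female", "male"), repeat=2):
        versions.append({
            "vignette": vignette_id,
            "patient_gender": patient,
            "partner_gender": partner,
            "orientation": "homosexual" if patient == partner else "heterosexual",
        })
    return versions


# 30 source vignettes x 4 versions = 120 prompts; each was fed to the LLM 3 times
prompts = [v for i in range(30) for v in expand_vignette(f"vignette_{i + 1:02d}")]
```

This reproduces the arithmetic in the abstract: 30 vignettes expand to 120 prompts, and querying each thrice yields the 360 observations entering the multilevel model.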
Affiliation(s)
- Rebekka Schnepper, Department of Psychosomatic Medicine, University Hospital and University of Basel, Hebelstr. 2, Basel, 4031, Switzerland, 41 613284633; Department of Digital and Blended Psychosomatics and Psychotherapy, Psychosomatic Medicine, University Hospital and University of Basel, Basel, Switzerland
- Noa Roemmel, Department of Psychosomatic Medicine, University Hospital and University of Basel, Basel, Switzerland; Department of Digital and Blended Psychosomatics and Psychotherapy, Psychosomatic Medicine, University Hospital and University of Basel, Basel, Switzerland
- Rainer Schaefert, Department of Psychosomatic Medicine, University Hospital and University of Basel, Basel, Switzerland
- Lena Lambrecht-Walzinger, Department of Psychosomatic Medicine, University Hospital and University of Basel, Basel, Switzerland
- Gunther Meinlschmidt, Department of Psychosomatic Medicine, University Hospital and University of Basel, Basel, Switzerland; Department of Digital and Blended Psychosomatics and Psychotherapy, Psychosomatic Medicine, University Hospital and University of Basel, Basel, Switzerland; Department of Clinical Psychology and Psychotherapy, University of Trier, Trier, Rheinland-Pfalz, Germany; Department of Psychology, Division of Clinical Psychology and Epidemiology, University of Basel, Basel, Switzerland
17
Sridhar GR, Gumpeny L. Prospects and perils of ChatGPT in diabetes. World J Diabetes 2025; 16:98408. [PMID: 40093292 PMCID: PMC11885976 DOI: 10.4239/wjd.v16.i3.98408] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 06/25/2024] [Revised: 11/05/2024] [Accepted: 12/03/2024] [Indexed: 01/21/2025] Open
Abstract
ChatGPT, a popular large language model developed by OpenAI, has the potential to transform the management of diabetes mellitus. It is a conversational artificial intelligence model trained on extensive datasets, although not specifically health-related. The development and core components of ChatGPT include neural networks and machine learning. Since the current model is not yet developed on diabetes-related datasets, it has limitations such as the risk of inaccuracies and the need for human supervision. Nevertheless, it has the potential to aid in patient engagement, medical education, and clinical decision support. In diabetes management, it can contribute to patient education, personalized dietary guidelines, and providing emotional support. Specifically, it is being tested in clinical scenarios such as assessment of obesity, screening for diabetic retinopathy, and provision of guidelines for the management of diabetic ketoacidosis. Ethical and legal considerations are essential before ChatGPT can be integrated into healthcare. Potential concerns relate to data privacy, accuracy of responses, and maintenance of the patient-doctor relationship. Ultimately, while ChatGPT and large language models hold immense potential to revolutionize diabetes care, one needs to weigh their limitations, ethical implications, and the need for human supervision. The integration promises a future of proactive, personalized, and patient-centric care in diabetes management.
Affiliation(s)
- Gumpeny R Sridhar, Department of Endocrinology and Diabetes, Endocrine and Diabetes Centre, Visakhapatnam 530002, Andhra Pradesh, India
- Lakshmi Gumpeny, Department of Internal Medicine, Gayatri Vidya Parishad Institute of Healthcare & Medical Technology, Visakhapatnam 530048, Andhra Pradesh, India
18
Grosser J, Düvel J, Hasemann L, Schneider E, Greiner W. Studying the Potential Effects of Artificial Intelligence on Physician Autonomy: Scoping Review. JMIR AI 2025; 4:e59295. [PMID: 40080059 PMCID: PMC11950692 DOI: 10.2196/59295] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/08/2024] [Revised: 05/15/2024] [Accepted: 12/31/2024] [Indexed: 03/15/2025]
Abstract
BACKGROUND Physician autonomy has been found to play a role in physician acceptance and adoption of artificial intelligence (AI) in medicine. However, there is still no consensus in the literature on how to define and assess physician autonomy. Furthermore, there is a lack of research focusing specifically on the potential effects of AI on physician autonomy. OBJECTIVE This scoping review addresses the following research questions: (1) How do qualitative studies conceptualize and assess physician autonomy? (2) Which aspects of physician autonomy are addressed by these studies? (3) What are the potential benefits and harms of AI for physician autonomy identified by these studies? METHODS We performed a scoping review of qualitative studies on AI and physician autonomy published before November 6, 2023, by searching MEDLINE and Web of Science. To answer research question 1, we determined whether the included studies explicitly include physician autonomy as a research focus and whether their interview, survey, and focus group questions explicitly name or implicitly include aspects of physician autonomy. To answer research question 2, we extracted the qualitative results of the studies, categorizing them into the 7 components of physician autonomy introduced by Schulz and Harrison. We then inductively formed subcomponents based on the results of the included studies in each component. To answer research question 3, we summarized the potentially harmful and beneficial effects of AI on physician autonomy in each of the inductively formed subcomponents. RESULTS The search yielded 369 studies after duplicates were removed. Of these, 27 studies remained after titles and abstracts were screened. After full texts were screened, we included a total of 7 qualitative studies. Most studies did not explicitly name physician autonomy as a research focus or explicitly address physician autonomy in their interview, survey, and focus group questions. 
No studies addressed a complete set of components of physician autonomy; while 3 components were addressed by all included studies, 2 components were addressed by none. We identified a total of 11 subcomponents for the 5 components of physician autonomy that were addressed by at least 1 study. For most of these subcomponents, studies reported both potential harms and potential benefits of AI for physician autonomy. CONCLUSIONS Little research to date has explicitly addressed the potential effects of AI on physician autonomy and existing results on these potential effects are mixed. Further qualitative and quantitative research is needed that focuses explicitly on physician autonomy and addresses all relevant components of physician autonomy.
Affiliation(s)
- John Grosser, Department of Health Economics and Health Care Management, School of Public Health, Bielefeld University, Bielefeld, Germany
- Juliane Düvel, Centre for Electronic Public Health Research (CePHR), School of Public Health, Bielefeld University, Bielefeld, Germany
- Lena Hasemann, Department of Health Economics and Health Care Management, School of Public Health, Bielefeld University, Bielefeld, Germany
- Emilia Schneider, Department of Health Economics and Health Care Management, School of Public Health, Bielefeld University, Bielefeld, Germany
- Wolfgang Greiner, Department of Health Economics and Health Care Management, School of Public Health, Bielefeld University, Bielefeld, Germany
19
Zada T, Tam N, Barnard F, Van Sittert M, Bhat V, Rambhatla S. Medical Misinformation in AI-Assisted Self-Diagnosis: Development of a Method (EvalPrompt) for Analyzing Large Language Models. JMIR Form Res 2025; 9:e66207. [PMID: 40063849 PMCID: PMC11913316 DOI: 10.2196/66207] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2024] [Revised: 01/29/2025] [Accepted: 01/29/2025] [Indexed: 03/19/2025] Open
Abstract
Background Rapid integration of large language models (LLMs) in health care is sparking global discussion about their potential to revolutionize health care quality and accessibility. At a time when improving health care quality and access remains a critical concern for countries worldwide, the ability of these models to pass medical examinations is often cited as a reason to use them for medical training and diagnosis. However, the impact of their inevitable use as a self-diagnostic tool and their role in spreading health care misinformation has not been evaluated. Objective This study aims to assess the effectiveness of LLMs, particularly ChatGPT, from the perspective of an individual self-diagnosing, to better understand the clarity, correctness, and robustness of the models. Methods We propose Evaluation of LLM Prompts (EvalPrompt), a comprehensive testing methodology that uses multiple-choice medical licensing examination questions to evaluate LLM responses. Experiment 1 prompts ChatGPT with open-ended questions to mimic real-world self-diagnosis use cases, and experiment 2 performs sentence dropout on the correct responses from experiment 1 to mimic self-diagnosis with missing information. Human assessors then rate the responses returned by ChatGPT in both experiments to evaluate their clarity, correctness, and robustness. Results In experiment 1, we found that ChatGPT-4.0 was deemed correct for 31% (29/94) of the questions by both nonexperts and experts, with only 34% (32/94) agreement between the 2 groups. Similarly, in experiment 2, which assessed robustness, 61% (92/152) of the responses continued to be categorized as correct by all assessors. As a result, in comparison to a passing threshold of 60%, ChatGPT-4.0 is considered incorrect and unclear, though robust. This indicates that sole reliance on ChatGPT-4.0 for self-diagnosis could increase the risk of individuals being misinformed.
Conclusions The results highlight the modest capabilities of LLMs, as their responses are often unclear and inaccurate. Any medical advice provided by LLMs should be cautiously approached due to the significant risk of misinformation. However, evidence suggests that LLMs are steadily improving and could potentially play a role in health care systems in the future. To address the issue of medical misinformation, there is a pressing need for the development of a comprehensive self-diagnosis dataset. This dataset could enhance the reliability of LLMs in medical applications by featuring more realistic prompt styles with minimal information across a broader range of medical fields.
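The sentence-dropout step of experiment 2 can be sketched in a few lines. The clinical vignette below is invented for illustration, and the sentence-splitting heuristic is an assumption, not the authors' exact procedure:

```python
# Sketch of a sentence-dropout robustness probe: each variant of a
# question drops one sentence, mimicking self-diagnosis with missing
# information. The vignette is invented for illustration.
import re

def sentence_dropout(text: str):
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", text.strip())
                 if s.strip()]
    # One variant per sentence, with that sentence removed.
    return [" ".join(sentences[:i] + sentences[i + 1:])
            for i in range(len(sentences))]

vignette = ("A 45-year-old man has chest pain. "
            "The pain worsens with exertion. "
            "He has a history of smoking.")
for variant in sentence_dropout(vignette):
    print(variant)
```

Each variant would then be submitted as a fresh prompt and its response graded by the human assessors.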
Affiliation(s)
- Troy Zada
- Department of Management Sciences and Engineering, University of Waterloo, 200 University Avenue West, Waterloo, ON, N2L 3G1, Canada, 1 5198884567 ext 33279
- Natalie Tam
- Department of Management Sciences and Engineering, University of Waterloo, 200 University Avenue West, Waterloo, ON, N2L 3G1, Canada, 1 5198884567 ext 33279
- Francois Barnard
- Department of Management Sciences and Engineering, University of Waterloo, 200 University Avenue West, Waterloo, ON, N2L 3G1, Canada, 1 5198884567 ext 33279
- Venkat Bhat
- Department of Psychiatry, University of Toronto, Toronto, ON, Canada
- Interventional Psychiatry Program, St. Michael’s Hospital, Unity Health Toronto, Toronto, ON, Canada
- Sirisha Rambhatla
- Department of Management Sciences and Engineering, University of Waterloo, 200 University Avenue West, Waterloo, ON, N2L 3G1, Canada, 1 5198884567 ext 33279

20
Pinard CJ, Poon AC, Lagree A, Wu K, Li J, Tran WT. Precision in Parsing: Evaluation of an Open-Source Named Entity Recognizer (NER) in Veterinary Oncology. Vet Comp Oncol 2025; 23:102-108. [PMID: 39711253 PMCID: PMC11830456 DOI: 10.1111/vco.13035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/15/2024] [Revised: 11/14/2024] [Accepted: 12/02/2024] [Indexed: 12/24/2024]
Abstract
Integrating Artificial Intelligence (AI) through Natural Language Processing (NLP) can improve veterinary medical oncology clinical record analytics. Named Entity Recognition (NER), a critical component of NLP, can facilitate efficient data extraction and automated labelling for research and clinical decision-making. This study assesses the efficacy of the Bio-Epidemiology-NER (BioEN), an open-source NER developed using human epidemiological and medical data, on veterinary medical oncology records. The NER's performance was compared with manual annotations by a veterinary medical oncologist and a veterinary intern. Evaluation metrics included Jaccard similarity, intra-rater reliability, ROUGE scores, and standard NER performance metrics (precision, recall, F1-score). Results indicate poor direct translatability to veterinary medical oncology record text and room for improvement in the NER's performance, with precision, recall, and F1-score suggesting a marginally better alignment with the oncologist than with the intern. While challenges remain, these insights contribute to the ongoing development of AI tools tailored for veterinary healthcare and highlight the need for veterinary-specific models.
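The entity-set agreement metrics used in evaluations like this (Jaccard similarity plus precision, recall, and F1 over predicted vs. reference entities) can be sketched as follows. The entity sets are toy examples, not the study's annotations:

```python
# Minimal sketch of set-based NER agreement metrics on toy data.
# The entity spans here are illustrative, not taken from the study.

def jaccard(pred: set, gold: set) -> float:
    """Jaccard similarity between predicted and reference entity sets."""
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)

def precision_recall_f1(pred: set, gold: set):
    tp = len(pred & gold)                      # entities found in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {"osteosarcoma", "carboplatin", "mast cell tumor"}
pred = {"osteosarcoma", "carboplatin", "lymphoma"}

print(jaccard(pred, gold))              # 0.5
print(precision_recall_f1(pred, gold))
```

Real NER evaluation usually scores spans with character offsets rather than bare strings, but the set arithmetic is the same.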
Affiliation(s)
- Christopher J. Pinard
- Department of Clinical Studies, Ontario Veterinary College, University of Guelph, Guelph, Ontario, Canada
- Department of Oncology, Lakeshore Animal Health Partners, Mississauga, Ontario, Canada
- Centre for Advancing Responsible & Ethical Artificial Intelligence, University of Guelph, Guelph, Ontario, Canada
- Radiogenomics Laboratory, Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada
- ANI.ML Research, ANI.ML Health Inc., Toronto, Ontario, Canada
- Andrew C. Poon
- VCA Mississauga Oakville Veterinary Emergency Hospital, Mississauga, Ontario, Canada
- Andrew Lagree
- Radiogenomics Laboratory, Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada
- ANI.ML Research, ANI.ML Health Inc., Toronto, Ontario, Canada
- Odette Cancer Program, Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada
- Kuan‐Chuen Wu
- ANI.ML Research, ANI.ML Health Inc., Toronto, Ontario, Canada
- Jiaxu Li
- Radiogenomics Laboratory, Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada
- William T. Tran
- Radiogenomics Laboratory, Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada
- Odette Cancer Program, Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada
- Department of Radiation Oncology, University of Toronto, Toronto, Ontario, Canada
- Temerty Centre for AI Research and Education in Medicine, University of Toronto, Toronto, Ontario, Canada

21
Gan W, Ouyang J, She G, Xue Z, Zhu L, Lin A, Mou W, Jiang A, Qi C, Cheng Q, Luo P, Li H, Zheng X. ChatGPT's role in alleviating anxiety in total knee arthroplasty consent process: a randomized controlled trial pilot study. Int J Surg 2025; 111:2546-2557. [PMID: 39903546 DOI: 10.1097/js9.0000000000002223] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/10/2024] [Accepted: 12/01/2024] [Indexed: 02/06/2025]
Abstract
BACKGROUND Recent advancements in artificial intelligence (AI) like ChatGPT have expanded possibilities for patient education, yet its impact on perioperative anxiety in total knee arthroplasty (TKA) patients remains unexplored. METHODS In this single-blind, randomized controlled pilot study from April to July 2023, 60 patients were randomly allocated using sealed envelopes to either ChatGPT-assisted or traditional surgeon-led informed consent groups. In the ChatGPT group, physicians used ChatGPT 4.0 to provide standardized, comprehensive responses to patient queries during the consent process, while maintaining their role in interpreting and contextualizing the information. Outcomes were measured using Hospital Anxiety and Depression Scales (HADS), Perioperative Apprehension Scale-7 (PAS-7), Visual Analogue Scales for Anxiety and Pain (VAS-A, VAS-P), Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), and satisfaction questionnaires. RESULTS Of 55 patients completing the study, the ChatGPT group showed significantly lower anxiety scores after informed consent (HADS-A: 10.48 ± 3.84 vs 12.75 ± 4.12, P = .04, Power = .67; PAS-7: 12.44 ± 3.70 vs 14.64 ± 2.11, P = .01, Power = .85; VAS-A: 5.40 ± 1.89 vs 6.71 ± 2.27, P = .02, Power = .75) and on the fifth postoperative day (HADS-A: 8.33 ± 3.20 vs 10.71 ± 3.83, P = .01, Power = .79; VAS-A: 3.41 ± 1.58 vs 4.64 ± 1.70, P = .008, Power = .85). The ChatGPT group also reported higher satisfaction with preoperative education (4.22 ± 0.51 vs 3.43 ± 0.84, P <.001, Power = .99) and overall hospitalization experience (4.11 ± 0.65 vs 3.46 ± 0.69, P = .001, Power = .97). No significant differences were found in depression scores, knee function, or pain levels. CONCLUSIONS ChatGPT-assisted informed consent effectively reduced perioperative anxiety and improved patient satisfaction in TKA patients. 
While these preliminary findings are promising, larger studies are needed to validate these results and explore broader applications of AI in preoperative patient education.
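Group comparisons like "HADS-A: 10.48 ± 3.84 vs 12.75 ± 4.12, P = .04" can be reproduced from summary statistics alone with a Welch two-sample t test. The per-arm sizes below are assumed (the abstract reports only 55 completers in total), so the output is illustrative:

```python
# Sketch of a Welch t test computed from summary statistics.
# The per-arm sizes (27 vs 28) are an assumption, not reported above.
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic and Welch-Satterthwaite df from group summaries."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2      # squared standard errors
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Post-consent HADS-A: ChatGPT arm vs traditional arm
t, df = welch_t(10.48, 3.84, 27, 12.75, 4.12, 28)
print(round(t, 2))   # -2.11
```

The p value would then come from the t distribution with the returned degrees of freedom (e.g. `scipy.stats.t.sf`), consistent with the significance level reported above.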
Affiliation(s)
- Wenyi Gan
- Department of Joint Surgery and Sports Medicine, Zhuhai People's Hospital (The Affiliated Hospital of Beijing Institute of Technology, Zhuhai Clinical Medical College of Jinan University), Zhuhai, Guangdong, China
- Department of Bone and Joint Surgery and Sports Medicine Center, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Jianfeng Ouyang
- Department of Joint Surgery and Sports Medicine, Zhuhai People's Hospital (The Affiliated Hospital of Beijing Institute of Technology, Zhuhai Clinical Medical College of Jinan University), Zhuhai, Guangdong, China
- Guorong She
- Department of Bone and Joint Surgery and Sports Medicine Center, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Zhaowen Xue
- Department of Bone and Joint Surgery and Sports Medicine Center, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Lingxuan Zhu
- Department of Oncology, Zhujiang Hospital, Southern Medical University, Guangzhou, Guangdong, China
- Anqi Lin
- Department of Oncology, Zhujiang Hospital, Southern Medical University, Guangzhou, Guangdong, China
- Weiming Mou
- Department of Urology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Aimin Jiang
- Department of Urology, Changhai Hospital, Naval Medical University (Second Military Medical University), Shanghai, China
- Chang Qi
- The University of Hong Kong, Hong Kong, China
- Quan Cheng
- Department of Neurosurgery, Xiangya Hospital, Central South University, Changsha, Hunan, China
- Peng Luo
- Department of Oncology, Zhujiang Hospital, Southern Medical University, Guangzhou, Guangdong, China
- Hua Li
- Department of Foot and Ankle Surgery, Beijing Jishuitan Hospital, Capital Medical University, Beijing, China
- Xiaofei Zheng
- Department of Bone and Joint Surgery and Sports Medicine Center, The First Affiliated Hospital of Jinan University, Guangzhou, China

22
Johnson ES, Welch EK, Kikuchi J, Barbier H, Vaccaro CM, Balzano F, Dengler KL. Use of ChatGPT to Generate Informed Consent for Surgery in Urogynecology. UROGYNECOLOGY (PHILADELPHIA, PA.) 2025; 31:285-291. [PMID: 39823203 DOI: 10.1097/spv.0000000000001638] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/19/2025]
Abstract
IMPORTANCE Use of the publicly available Large Language Model, Chat Generative Pre-trained Transformer (ChatGPT 3.5; OpenAI, 2022), is growing in health care despite varying accuracies. OBJECTIVE The aim of this study was to assess the accuracy and readability of ChatGPT's responses to questions encompassing surgical informed consent in urogynecology. STUDY DESIGN Five fellowship-trained urogynecology attending physicians and 1 reconstructive female urologist evaluated ChatGPT's responses to questions about 4 surgical procedures: (1) retropubic midurethral sling, (2) total vaginal hysterectomy, (3) uterosacral ligament suspension, and (4) sacrocolpopexy. Questions involved procedure descriptions, risks/benefits/alternatives, and additional resources. Responses were rated using the DISCERN tool, a 4-point accuracy scale, and the Flesch-Kincaid Grade Level score. RESULTS The median DISCERN tool overall rating was 3 (interquartile range [IQR], 3-4), indicating a moderate rating ("potentially important but not serious shortcomings"). Retropubic midurethral sling received the highest overall score (median, 4; IQR, 3-4), and uterosacral ligament suspension received the lowest (median, 3; IQR, 3-3). Using the 4-point accuracy scale, 44.0% of responses received a score of 4 ("correct and adequate"), 22.6% received a score of 3 ("correct but insufficient"), 29.8% received a score of 2 ("accurate and misleading information together"), and 3.6% received a score of 1 ("wrong or irrelevant answer"). ChatGPT performance was poor for discussion of benefits and alternatives for all surgical procedures, with some responses being inaccurate. The mean Flesch-Kincaid Grade Level score for all responses was 17.5 (SD, 2.1), corresponding to a postgraduate reading level. CONCLUSIONS Overall, ChatGPT generated accurate responses to questions about surgical informed consent. 
However, it produced clearly false portions of responses, highlighting the need for a careful review of responses by qualified health care professionals.
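The Flesch formulas behind grade-level findings like the 17.5 reported above are simple functions of sentence, word, and syllable counts. The sketch below uses a crude vowel-group syllable heuristic, so its scores only approximate those of dedicated readability tools, and the sample sentence is invented:

```python
# Minimal sketch of the Flesch Reading Ease (FRES) and Flesch-Kincaid
# Grade Level (FKGL) formulas. The syllable counter is a rough
# vowel-group heuristic, so scores are approximate.
import re

def count_syllables(word: str) -> int:
    # Approximate syllables as runs of vowels (minimum 1 per word).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences                 # words per sentence
    spw = syllables / len(words)                 # syllables per word
    fres = 206.835 - 1.015 * wps - 84.6 * spw    # Flesch Reading Ease
    fkgl = 0.39 * wps + 11.8 * spw - 15.59       # Flesch-Kincaid Grade Level
    return fres, fkgl

fres, fkgl = readability(
    "The sling supports the urethra. Risks include pain and bleeding.")
print(round(fres, 1), round(fkgl, 1))
```

Longer sentences and more polysyllabic words push FKGL up toward the postgraduate range reported for ChatGPT's consent answers.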
Affiliation(s)
- Emily S Johnson
- Division of Urogynecology, Walter Reed National Military Medical Center, Bethesda, MD
- Eva K Welch
- Division of Urogynecology, Walter Reed National Military Medical Center, Bethesda, MD
- Jacqueline Kikuchi
- Division of Urogynecology, AT Augusta Military Medical Center, Fort Belvoir, VA
- Heather Barbier
- Division of Urogynecology, Walter Reed National Military Medical Center, Bethesda, MD
- Christine M Vaccaro
- Division of Urogynecology, Walter Reed National Military Medical Center, Bethesda, MD
- Felicia Balzano
- Department of Urology, Walter Reed National Military Medical Center, Bethesda, MD
- Katherine L Dengler
- Division of Urogynecology, Walter Reed National Military Medical Center, Bethesda, MD

23
Köroğlu EY, Ersoy R, Saçıkara M, Dellal Kahramanca FD, Polat ŞB, Topaloğlu O, Çakır B. Evaluation of the impact of ChatGPT support on acromegaly management and patient education. Endocrine 2025; 87:1141-1149. [PMID: 39497015 DOI: 10.1007/s12020-024-04086-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 07/16/2024] [Accepted: 10/26/2024] [Indexed: 11/06/2024]
Abstract
PURPOSE ChatGPT is a widely used artificial intelligence modeling tool. Healthcare is one potential area of use of ChatGPT. This study aimed to test the usability and reliability of ChatGPT in acromegaly, which is less known in society and should be evaluated by a group of specialized physicians. METHODS The study is designed in two parts. For the first part, 35 questions regarding acromegaly that patients frequently ask were identified, and these questions were asked to ChatGPT. In the second part, four patient examples were presented to ChatGPT using medical terminology. Three experts evaluated ChatGPT's answers to the questions and approaches in case management using 7-point scales in terms of safety, reliability, correctness, and usability. RESULTS When the ChatGPT answers to the patient's questions were evaluated, a mean score of 6.78 ± 0.55 was given for correctness and 6.69 ± 0.60 for reliability. The mean scores given by the raters for correctness, safety and usability in the evaluation of the cases were as follows: 6.33 ± 0.88, 6.16 ± 0.71 and 6.08 ± 0.79 points for case 1; 5.35 ± 1.88, 5.29 ± 1.80 and 5.20 ± 1.86 points for case 2; 6.08 ± 0.97, 6.00 ± 0.93 and 5.91 ± 0.82 points for case 3; 6.10 ± 1.29, 6.13 ± 1.30 and 6.16 ± 1.14 points for case 4. CONCLUSION ChatGPT can actively answer the questions of acromegaly patients. Although it is not a reliable source alone in managing patients with acromegaly, it can be a supportive tool for physicians.
Affiliation(s)
- Ekin Yiğit Köroğlu
- Ankara Bilkent City Hospital, Endocrinology and Metabolism Department, Üniversiteler Mahallesi, 1604, Cadde No: 9 Çankaya, Ankara, Türkiye
- Reyhan Ersoy
- Ankara Yıldırım Beyazıt Faculty of Medicine, Endocrinology and Metabolism Department, Üniversiteler Mahallesi, 1604, Cadde No: 9 Çankaya, Ankara, Türkiye
- Muhammed Saçıkara
- Ankara Bilkent City Hospital, Endocrinology and Metabolism Department, Üniversiteler Mahallesi, 1604, Cadde No: 9 Çankaya, Ankara, Türkiye
- Fatma Dilek Dellal Kahramanca
- Ankara Bilkent City Hospital, Endocrinology and Metabolism Department, Üniversiteler Mahallesi, 1604, Cadde No: 9 Çankaya, Ankara, Türkiye
- Şefika Burçak Polat
- Ankara Yıldırım Beyazıt Faculty of Medicine, Endocrinology and Metabolism Department, Üniversiteler Mahallesi, 1604, Cadde No: 9 Çankaya, Ankara, Türkiye
- Oya Topaloğlu
- Ankara Yıldırım Beyazıt Faculty of Medicine, Endocrinology and Metabolism Department, Üniversiteler Mahallesi, 1604, Cadde No: 9 Çankaya, Ankara, Türkiye
- Bekir Çakır
- Ankara Yıldırım Beyazıt Faculty of Medicine, Endocrinology and Metabolism Department, Üniversiteler Mahallesi, 1604, Cadde No: 9 Çankaya, Ankara, Türkiye

24
Puts S, Zegers CML, Dekker A, Bermejo I. Developing an ICD-10 Coding Assistant: Pilot Study Using RoBERTa and GPT-4 for Term Extraction and Description-Based Code Selection. JMIR Form Res 2025; 9:e60095. [PMID: 39935026 PMCID: PMC11835781 DOI: 10.2196/60095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2024] [Revised: 11/20/2024] [Accepted: 11/24/2024] [Indexed: 02/13/2025] Open
Abstract
Background The International Classification of Diseases (ICD), developed by the World Health Organization, standardizes health condition coding to support health care policy, research, and billing, but artificial intelligence automation, while promising, still underperforms compared with human accuracy and lacks the explainability needed for adoption in medical settings. Objective The potential of large language models for assisting medical coders in ICD-10 coding was explored through the development of a computer-assisted coding system. This study aimed to augment human coding by initially identifying lead terms and using retrieval-augmented generation (RAG)-based methods for computer-assisted coding enhancement. Methods The explainability dataset from the CodiEsp challenge (CodiEsp-X) was used, featuring 1000 Spanish clinical cases annotated with ICD-10 codes. A new dataset, CodiEsp-X-lead, was generated using GPT-4 to replace full-textual evidence annotations with lead term annotations. A RoBERTa (Robustly Optimized BERT Pretraining Approach) transformer model was fine-tuned for named entity recognition to extract lead terms. GPT-4 was subsequently employed to generate code descriptions from the extracted textual evidence. Using a RAG approach, ICD codes were assigned to the lead terms by querying a vector database of ICD code descriptions with OpenAI's text-embedding-ada-002 model. Results The fine-tuned RoBERTa model achieved an overall F1-score of 0.80 for ICD lead term extraction on the new CodiEsp-X-lead dataset. GPT-4-generated code descriptions reduced retrieval failures in the RAG approach by approximately 5% for both diagnoses and procedures. However, the overall explainability F1-score for the CodiEsp-X task was limited to 0.305, significantly lower than the state-of-the-art F1-score of 0.633. 
The diminished performance was partly due to the reliance on code descriptions, as some ICD codes lacked descriptions, and the approach did not fully align with the medical coder's workflow. Conclusions While lead term extraction showed promising results, the subsequent RAG-based code assignment using GPT-4 and code descriptions was less effective. Future research should focus on refining the approach to more closely mimic the medical coder's workflow, potentially integrating the alphabetic index and official coding guidelines, rather than relying solely on code descriptions. This alignment may enhance system accuracy and better support medical coders in practice.
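The description-based code-selection step can be sketched as nearest-neighbour retrieval over embedded ICD code descriptions. In this sketch a bag-of-words vector stands in for the text-embedding model used in the study, and the codes and descriptions are illustrative, not a real ICD subset:

```python
# Toy sketch of description-based ICD code retrieval: embed the code
# descriptions, then assign the code whose description is most similar
# to the extracted lead term. Codes/descriptions are illustrative, and
# a bag-of-words Counter stands in for a learned embedding model.
import math
from collections import Counter

ICD_DESCRIPTIONS = {
    "J18.9": "pneumonia unspecified organism",
    "I10": "essential primary hypertension",
    "E11.9": "type 2 diabetes mellitus without complications",
}

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def assign_code(lead_term: str) -> str:
    vec = embed(lead_term)
    return max(ICD_DESCRIPTIONS,
               key=lambda code: cosine(vec, embed(ICD_DESCRIPTIONS[code])))

print(assign_code("pneumonia"))   # J18.9
```

A production system would replace `embed` with dense embeddings in a vector database; the retrieval logic is otherwise the same.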
Affiliation(s)
- Sander Puts
- Department of Radiation Oncology (Maastro), GROW Research Institute for Oncology and Reproduction, Maastricht University Medical Centre+, P.O. Box 616, Maastricht, 6200 MD, Netherlands, 31 43 38 81863
- Catharina M L Zegers
- Department of Radiation Oncology (Maastro), GROW Research Institute for Oncology and Reproduction, Maastricht University Medical Centre+, P.O. Box 616, Maastricht, 6200 MD, Netherlands, 31 43 38 81863
- Andre Dekker
- Department of Radiation Oncology (Maastro), GROW Research Institute for Oncology and Reproduction, Maastricht University Medical Centre+, P.O. Box 616, Maastricht, 6200 MD, Netherlands, 31 43 38 81863
- Iñigo Bermejo
- Department of Radiation Oncology (Maastro), GROW Research Institute for Oncology and Reproduction, Maastricht University Medical Centre+, P.O. Box 616, Maastricht, 6200 MD, Netherlands, 31 43 38 81863
- Data Science Institute (DSI), Hasselt University, Belgium

25
Huo B, Boyle A, Marfo N, Tangamornsuksan W, Steen JP, McKechnie T, Lee Y, Mayol J, Antoniou SA, Thirunavukarasu AJ, Sanger S, Ramji K, Guyatt G. Large Language Models for Chatbot Health Advice Studies: A Systematic Review. JAMA Netw Open 2025; 8:e2457879. [PMID: 39903463 PMCID: PMC11795331 DOI: 10.1001/jamanetworkopen.2024.57879] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 07/30/2024] [Accepted: 11/26/2024] [Indexed: 02/06/2025] Open
Abstract
Importance There is much interest in the clinical integration of large language models (LLMs) in health care. Many studies have assessed the ability of LLMs to provide health advice, but the quality of their reporting is uncertain. Objective To perform a systematic review to examine the reporting variability among peer-reviewed studies evaluating the performance of generative artificial intelligence (AI)-driven chatbots for summarizing evidence and providing health advice to inform the development of the Chatbot Assessment Reporting Tool (CHART). Evidence Review A search of MEDLINE via Ovid, Embase via Elsevier, and Web of Science from inception to October 27, 2023, was conducted with the help of a health sciences librarian to yield 7752 articles. Two reviewers screened articles by title and abstract followed by full-text review to identify primary studies evaluating the clinical accuracy of generative AI-driven chatbots in providing health advice (chatbot health advice studies). Two reviewers then performed data extraction for 137 eligible studies. Findings A total of 137 studies were included. Studies examined topics in surgery (55 [40.1%]), medicine (51 [37.2%]), and primary care (13 [9.5%]). Many studies focused on treatment (91 [66.4%]), diagnosis (60 [43.8%]), or disease prevention (29 [21.2%]). Most studies (136 [99.3%]) evaluated inaccessible, closed-source LLMs and did not provide enough information to identify the version of the LLM under evaluation. All studies lacked a sufficient description of LLM characteristics, including temperature, token length, fine-tuning availability, layers, and other details. Most studies (136 [99.3%]) did not describe a prompt engineering phase in their study. The date of LLM querying was reported in 54 (39.4%) studies. 
Most studies (89 [65.0%]) used subjective means to define the successful performance of the chatbot, while less than one-third addressed the ethical, regulatory, and patient safety implications of the clinical integration of LLMs. Conclusions and Relevance In this systematic review of 137 chatbot health advice studies, the reporting quality was heterogeneous and may inform the development of the CHART reporting standards. Ethical, regulatory, and patient safety considerations are crucial as interest grows in the clinical integration of LLMs.
Affiliation(s)
- Bright Huo
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Amy Boyle
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, Ontario, Canada
- Nana Marfo
- H. Ross University School of Medicine, Miramar, Florida
- Wimonchat Tangamornsuksan
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
- Jeremy P. Steen
- Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
- Tyler McKechnie
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Yung Lee
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Julio Mayol
- Hospital Clinico San Carlos, IdISSC, Universidad Complutense de Madrid, Madrid, Spain
- Stephanie Sanger
- Health Science Library, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada
- Karim Ramji
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Gordon Guyatt
- Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada

26
Dias R, Castan A, Gotoff K, Kadkoy Y, Ippolito J, Beebe K, Benevenia J. ChatGPT 3.5 Better Improves Comprehensibility of English, Than Spanish, Generated Responses to Osteosarcoma Questions. J Surg Oncol 2025. [PMID: 39898783 DOI: 10.1002/jso.28109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2025] [Accepted: 01/12/2025] [Indexed: 02/04/2025]
Abstract
BACKGROUND Despite adequate discussion and counseling in the office, inadequate health literacy or language barriers may make it difficult to follow instructions from a physician and access necessary resources. This may negatively impact survival outcomes. Most healthcare materials are written at a 10th grade level, while many patients read at an 8th grade level. Hispanic Americans comprise about 25% of the US patient population, while only 6% of physicians identify as bilingual. QUESTIONS/PURPOSE (1) Does ChatGPT 3.5 provide appropriate responses to frequently asked patient questions that are sufficient for clinical practice and accurate in English and Spanish? (2) What is the comprehensibility of the responses provided by ChatGPT 3.5, and are these modifiable? METHODS Twenty frequently asked osteosarcoma patient questions, evaluated by two fellowship-trained musculoskeletal oncologists, were input into ChatGPT 3.5. Responses were evaluated by two independent reviewers to assess appropriateness for clinical practice and accuracy. Responses were graded using the Flesch Reading Ease Score (FRES) and the Flesch-Kincaid Grade Level test (FKGL). The responses were then input into ChatGPT 3.5 a second time with the following command: "Make text easier to understand". The same process was repeated in Spanish. RESULTS All responses generated were appropriate for a patient-facing informational platform. There was no difference in the Flesch Reading Ease Score between English and Spanish responses before the modification (p = 0.307) or in the Flesch-Kincaid grade level (p = 0.294). After modification, there was a statistically significant difference in comprehensibility between English and Spanish responses (p = 0.003 and p = 0.011). CONCLUSION In both English and Spanish, none of the ChatGPT generated responses were found to be factually inaccurate. ChatGPT was able to modify responses upon follow-up with a simplified command. 
However, it was shown to be better at improving English responses than equivalent Spanish responses.
Affiliation(s)
- Rosamaria Dias
- Department of Orthopaedics, Rutgers New Jersey Medical School, Newark, New Jersey, USA
- Ashley Castan
- Department of Orthopaedics, Rutgers New Jersey Medical School, Newark, New Jersey, USA
- Katie Gotoff
- Department of Orthopaedics, Rutgers New Jersey Medical School, Newark, New Jersey, USA
- Yazan Kadkoy
- Department of Orthopaedics, Rutgers New Jersey Medical School, Newark, New Jersey, USA
- Joseph Ippolito
- Department of Orthopaedics, Rutgers New Jersey Medical School, Newark, New Jersey, USA
- Kathleen Beebe
- Department of Orthopaedics, Rutgers New Jersey Medical School, Newark, New Jersey, USA
- Joseph Benevenia
- Department of Orthopaedics, Rutgers New Jersey Medical School, Newark, New Jersey, USA

27
Haaker TS, Choi JS, Nanjo CJ, Warner PB, Abu-Hanna A, Kawamoto K. Approaches for extracting daily dosage from free-text prescription signatures in heart failure with reduced ejection fraction: a comparative study. JAMIA Open 2025; 8:ooae153. [PMID: 39759772 PMCID: PMC11700559 DOI: 10.1093/jamiaopen/ooae153] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/01/2024] [Revised: 12/10/2024] [Accepted: 12/21/2024] [Indexed: 01/07/2025] Open
Abstract
Objective To compare various methods for extracting daily dosage information from prescription signatures (sigs) and identify the best performers. Materials and Methods Five daily dosage extraction methods were identified and selected: Parsigs, RxSig, Sig2db, a large language model (LLM), and a bidirectional long short-term memory (BiLSTM) model. The methods were analyzed with regard to positive predictive value (PPV), sensitivity, F1-score, cost to compute, and time to finish on a sig dataset in the context of heart failure with reduced ejection fraction. Results The dataset consisted of 29 896 free-text sigs, which were split into training and validation sets of 70% and 30%, respectively. The BiLSTM model scored lowest with an F1-score of 0.71. The LLM GPT-4o and the regular expression-based RxSig achieved the highest F1-scores, with 0.98 and 0.95, respectively. The LLM outperformed RxSig in sensitivity; RxSig outperformed the LLM in PPV. Additionally, RxSig had a lower run time and no cost, compared with the LLM's cost of 25 dollars. Discussion In practical usage, it would be preferable for an algorithm to score high on PPV and F1-score, to reduce false positive assertions of daily dosage. Additionally, long running times and high costs are not scalable for larger datasets. Thus, RxSig is likely the most scalable approach. Further research is needed to investigate the generalizability of the findings. Conclusion This study demonstrates that both the LLM and RxSig models excel in daily dose extraction from free-text sigs, with the RxSig model appearing to be the more scalable approach.
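A rule-based extractor in the spirit of the regular-expression approaches compared here can be sketched as follows. The patterns and frequency vocabulary are simplified assumptions for illustration, not RxSig's actual rules:

```python
# Rough sketch of regex-based daily-dose extraction from free-text sigs.
# The dose-form pattern and frequency vocabulary are simplified
# assumptions, covering only a few common phrasings.
import re

FREQ_PER_DAY = {
    "once daily": 1, "daily": 1, "qd": 1,
    "twice daily": 2, "bid": 2,
    "three times daily": 3, "tid": 3,
}

def daily_dose(sig: str):
    sig = sig.lower()
    m = re.search(r"(\d+(?:\.\d+)?)\s*(?:tablet|tab|capsule|cap)s?", sig)
    if not m:
        return None                      # no recognizable dose form
    per_dose = float(m.group(1))
    # Longest matching frequency phrase wins ("twice daily" before "daily").
    for phrase in sorted(FREQ_PER_DAY, key=len, reverse=True):
        if phrase in sig:
            return per_dose * FREQ_PER_DAY[phrase]
    return None                          # dose found, frequency unknown

print(daily_dose("Take 1 tablet twice daily with food"))   # 2.0
print(daily_dose("take 0.5 tab tid"))                      # 1.5
```

Real sigs also encode routes, units (mg vs. tablets), ranges ("1-2 tabs"), and as-needed dosing, which is where rule-based coverage gets hard and LLM-based extraction becomes attractive.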
Affiliation(s)
- Theodorus S Haaker
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT 84108, United States
- Department of Medical Informatics, University of Amsterdam, 1105 AZ Amsterdam, The Netherlands
- Joshua S Choi
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT 84108, United States
- Department of Internal Medicine, University of Utah Health, Salt Lake City, UT 84112, United States
- Claude J Nanjo
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT 84108, United States
- Phillip B Warner
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT 84108, United States
- Ameen Abu-Hanna
- Department of Medical Informatics, University of Amsterdam, 1105 AZ Amsterdam, The Netherlands
- Kensaku Kawamoto
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT 84108, United States

28
Razai MS, Ussher M, Goldsmith L, Hargreaves S, Oakeshott P. Navigating vaccination in pregnancy: Qualitative study in 21 ethnically diverse pregnant women. PLoS One 2025; 20:e0310823. [PMID: 39888913 PMCID: PMC11785291 DOI: 10.1371/journal.pone.0310823] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/18/2024] [Accepted: 11/19/2024] [Indexed: 02/02/2025] Open
Abstract
BACKGROUND Vaccination during pregnancy is crucial for safeguarding maternal and neonatal health, but vaccination rates remain suboptimal, especially in women from Black and Asian ethnic minorities. We explored the perspectives and decision-making processes of pregnant women regarding uptake of the three recommended vaccines in pregnancy: influenza, pertussis (whooping cough) and COVID-19. We also explored women's attitudes to taking part in vaccine trials during pregnancy and the use of artificial intelligence (AI) to obtain information on vaccines. METHODS In 2023, we conducted in-depth telephone interviews with ethnically diverse pregnant women in the Greater London area using convenience and snowball sampling. The interviews focused on participants' views on vaccination during pregnancy, participation in vaccine trials, information-seeking behaviours, and attitudes to emerging technologies for health information. Interviews were transcribed verbatim and thematically analysed. The data collection and analysis were conducted alongside the iterative development of the topic guide and coding framework, with key themes emerging through collaborative team discussions. RESULTS Twenty-one pregnant women aged 20-39 were interviewed, of whom 67% were from ethnic minorities and 29% were from migrant backgrounds. Just over half of the participants (53%) reported hesitancy towards at least one of the vaccines. The analysis revealed several themes: concerns about vaccine safety, particularly regarding newer vaccines due to lack of long-term data; reliance on healthcare professionals for guidance, balanced with personal research; and a strong desire for clear and comprehensive information specifically tailored to pregnant women. Pregnant women reported insufficient information, explanation, or recommendation by midwives. 
Additionally, there was widespread refusal regarding participation in vaccine trials; and mixed responses to the use of AI (such as chatbots) for obtaining vaccine information. CONCLUSIONS Pregnant women's vaccination decisions are complex and require clear, unambiguous communication from healthcare providers, especially midwives, to address their specific concerns. Although information obtained via AI can be useful, responses were mixed.
Affiliation(s)
- Mohammad S. Razai
- Primary Care Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom
- St George’s School of Health and Medical Sciences, Population Health Research Institute, City St George’s, University of London, London, United Kingdom
| | - Michael Ussher
- St George’s School of Health and Medical Sciences, Population Health Research Institute, City St George’s, University of London, London, United Kingdom
- Institute of Social Marketing and Health, University of Stirling, Stirling, United Kingdom
| | - Lucy Goldsmith
- The Health Foundation, London, United Kingdom
- Department of Health Services Research and Management, School of Health & Psychological Sciences, City St George’s, University of London, London, United Kingdom
| | - Sally Hargreaves
- St George’s School of Health and Medical Sciences, Population Health Research Institute, City St George’s, University of London, London, United Kingdom
- The Migrant Health Research Group, Institute for Infection and Immunity, City St George’s, University of London, London, United Kingdom
| | - Pippa Oakeshott
- St George’s School of Health and Medical Sciences, Population Health Research Institute, City St George’s, University of London, London, United Kingdom
29
Tayyar Madabushi H, Jones MD. Large language models in healthcare information research: making progress in an emerging field. BMJ Qual Saf 2025; 34:73-76. [PMID: 39443104 DOI: 10.1136/bmjqs-2024-017896] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 10/15/2024] [Indexed: 10/25/2024]
Affiliation(s)
| | - Matthew D Jones
- Department of Life Sciences, University of Bath, Bath, Somerset, UK
30
Ding L, Fan L, Shen M, Wang Y, Sheng K, Zou Z, An H, Jiang Z. Evaluating ChatGPT's diagnostic potential for pathology images. Front Med (Lausanne) 2025; 11:1507203. [PMID: 39917264 PMCID: PMC11798939 DOI: 10.3389/fmed.2024.1507203] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2024] [Accepted: 12/27/2024] [Indexed: 02/09/2025] Open
Abstract
Background Chat Generative Pretrained Transformer (ChatGPT) is a type of large language model (LLM) developed by OpenAI, known for its extensive knowledge base and interactive capabilities. These attributes make it a valuable tool in the medical field, particularly for tasks such as answering medical questions, drafting clinical notes, and optimizing the generation of radiology reports. However, maintaining accuracy in medical contexts is the biggest challenge to employing GPT-4 in a clinical setting. This study aims to investigate the accuracy of GPT-4, which can process both text and image inputs, in generating diagnoses from pathological images. Methods This study analyzed 44 histopathological images from 16 organs and 100 colorectal biopsy photomicrographs. The initial evaluation was conducted using the standard GPT-4 model in January 2024, with a subsequent re-evaluation performed in July 2024. The diagnostic accuracy of GPT-4 was assessed by comparing its outputs to a reference standard using statistical measures. Additionally, four pathologists independently reviewed the same images to compare their diagnoses with the model's outputs. Both scanned and photographed images were tested to evaluate GPT-4's generalization ability across different image types. Results GPT-4 achieved an overall accuracy of 0.64 in identifying tumors and their tissue origins. For colon polyp classification, accuracy varied from 0.57 to 0.75 across subtypes. The model achieved 0.88 accuracy in distinguishing low-grade from high-grade dysplasia and 0.75 in distinguishing high-grade dysplasia from adenocarcinoma, with a high sensitivity in detecting adenocarcinoma. Consistency between initial and follow-up evaluations showed slight to moderate agreement, with Kappa values ranging from 0.204 to 0.375. Conclusion GPT-4 demonstrates the ability to diagnose pathological images, showing improved performance over earlier versions.
Its diagnostic accuracy in cancer is comparable to that of pathology residents. These findings suggest that GPT-4 holds promise as a supportive tool in pathology diagnostics, offering the potential to assist pathologists in routine diagnostic workflows.
Affiliation(s)
- Liya Ding
- Department of Pathology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Lei Fan
- Department of Pathology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Department of Pathology, Ninghai County Traditional Chinese Medicine Hospital, Ningbo, China
| | - Miao Shen
- Department of Pathology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Department of Pathology, Deqing People’s Hospital, Hangzhou, China
| | - Yawen Wang
- College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China
| | - Kaiqin Sheng
- Department of Pathology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Zijuan Zou
- Department of Pathology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Huimin An
- Department of Pathology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Zhinong Jiang
- Department of Pathology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
31
Busch F, Hoffmann L, Rueger C, van Dijk EH, Kader R, Ortiz-Prado E, Makowski MR, Saba L, Hadamitzky M, Kather JN, Truhn D, Cuocolo R, Adams LC, Bressem KK. Current applications and challenges in large language models for patient care: a systematic review. COMMUNICATIONS MEDICINE 2025; 5:26. [PMID: 39838160 PMCID: PMC11751060 DOI: 10.1038/s43856-024-00717-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2024] [Accepted: 12/17/2024] [Indexed: 01/23/2025] Open
Abstract
BACKGROUND The introduction of large language models (LLMs) into clinical practice promises to improve patient education and empowerment, thereby personalizing medical care and broadening access to medical knowledge. Despite the popularity of LLMs, there is a significant gap in systematized information on their use in patient care. Therefore, this systematic review aims to synthesize current applications and limitations of LLMs in patient care. METHODS We systematically searched 5 databases for qualitative, quantitative, and mixed methods articles on LLMs in patient care published between 2022 and 2023. From 4349 initial records, 89 studies across 29 medical specialties were included. Quality assessment was performed using the Mixed Methods Appraisal Tool 2018. A data-driven convergent synthesis approach was applied for thematic syntheses of LLM applications and limitations using free line-by-line coding in Dedoose. RESULTS We show that most studies investigate Generative Pre-trained Transformers (GPT)-3.5 (53.2%, n = 66 of 124 different LLMs examined) and GPT-4 (26.6%, n = 33/124) in answering medical questions, followed by patient information generation, including medical text summarization or translation, and clinical documentation. Our analysis delineates two primary domains of LLM limitations: design and output. Design limitations include 6 second-order and 12 third-order codes, such as lack of medical domain optimization, data transparency, and accessibility issues, while output limitations include 9 second-order and 32 third-order codes, for example, non-reproducibility, non-comprehensiveness, incorrectness, unsafety, and bias. CONCLUSIONS This review systematically maps LLM applications and limitations in patient care, providing a foundational framework and taxonomy for their implementation and evaluation in healthcare settings.
Affiliation(s)
- Felix Busch
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany.
| | - Lena Hoffmann
- Department of Neuroradiology, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt Universität zu Berlin, Berlin, Germany
| | - Christopher Rueger
- Department of Neuroradiology, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt Universität zu Berlin, Berlin, Germany
| | - Elon Hc van Dijk
- Department of Ophthalmology, Leiden University Medical Center, Leiden, The Netherlands
- Department of Ophthalmology, Sir Charles Gairdner Hospital, Perth, Australia
| | - Rawen Kader
- Division of Surgery and Interventional Sciences, University College London, London, United Kingdom
| | - Esteban Ortiz-Prado
- One Health Research Group, Faculty of Health Science, Universidad de Las Américas, Quito, Ecuador
| | - Marcus R Makowski
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
| | - Luca Saba
- Department of Radiology, Azienda Ospedaliero Universitaria (A.O.U.), Cagliari, Italy
| | - Martin Hadamitzky
- School of Medicine and Health, Institute for Cardiovascular Radiology and Nuclear Medicine, German Heart Center Munich, TUM University Hospital, Technical University of Munich, Munich, Germany
| | - Jakob Nikolas Kather
- Department of Medical Oncology, National Center for Tumor Diseases (NCT), Heidelberg University Hospital, Heidelberg, Germany
- Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, Technical University Dresden, Dresden, Germany
| | - Daniel Truhn
- Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen, Germany
| | - Renato Cuocolo
- Department of Medicine, Surgery and Dentistry, University of Salerno, Baronissi, Italy
| | - Lisa C Adams
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
| | - Keno K Bressem
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
- School of Medicine and Health, Institute for Cardiovascular Radiology and Nuclear Medicine, German Heart Center Munich, TUM University Hospital, Technical University of Munich, Munich, Germany
32
Bahrami S, Rubulotta F. Artificial Intelligence-Driven Translation Tools in Intensive Care Units for Enhancing Communication and Research. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2025; 22:95. [PMID: 39857547 PMCID: PMC11765060 DOI: 10.3390/ijerph22010095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/18/2024] [Revised: 01/10/2025] [Accepted: 01/10/2025] [Indexed: 01/27/2025]
Abstract
BACKGROUND There is a need to improve communication for patients and relatives from cultural minority communities in intensive care units (ICUs). Language barriers negatively impact patient safety and family participation in the care of critically ill patients, as well as recruitment to clinical trials. Recent studies indicate that Google Translate and ChatGPT are not accurate enough for advanced medical terminology. Therefore, developing and implementing an ad hoc machine translation tool is essential for bridging language barriers. Such a tool would enable language minority communities to access advanced healthcare facilities and innovative research in a timely and effective manner, ensuring they receive the comprehensive care and information they need. METHODS Key factors that facilitate access to advanced health services, in particular ICUs, for language minority communities are reviewed. RESULTS The existing digital communication tools in emergency departments and ICUs are reviewed. To the best of our knowledge, no AI English/French translation app has been developed for deployment in ICUs. Patient privacy and data confidentiality are other important issues that should be addressed. CONCLUSIONS Developing an artificial intelligence-driven translation tool for intensive care units (AITIC) that uses language models trained on medical/ICU terminology datasets could offer fast and accurate real-time translation. An AITIC could support communication and consolidate and expand original research involving language minority communities.
Affiliation(s)
- Sahar Bahrami
- Department of Critical Care Medicine, McGill University Health Centre, Montreal, QC H3A 0G4, Canada
| | - Francesca Rubulotta
- Department of Critical Care Medicine, University of Catania, 95124 Catania, Italy;
33
Farhadi Nia M, Ahmadi M, Irankhah E. Transforming dental diagnostics with artificial intelligence: advanced integration of ChatGPT and large language models for patient care. FRONTIERS IN DENTAL MEDICINE 2025; 5:1456208. [PMID: 39917691 PMCID: PMC11797834 DOI: 10.3389/fdmed.2024.1456208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2024] [Accepted: 10/16/2024] [Indexed: 02/09/2025] Open
Abstract
Artificial intelligence has dramatically reshaped our interaction with digital technologies, ushering in an era where advancements in AI algorithms and Large Language Models (LLMs) have given rise to natural language processing (NLP) systems like ChatGPT. This study delves into the impact of cutting-edge LLMs, notably OpenAI's ChatGPT, on medical diagnostics, with a keen focus on the dental sector. Leveraging publicly accessible datasets, these models augment the diagnostic capabilities of medical professionals, streamline communication between patients and healthcare providers, and enhance the efficiency of clinical procedures. The advent of ChatGPT-4 is poised to make substantial inroads into dental practices, especially in the realm of oral surgery. This paper sheds light on the current landscape and explores potential future research directions in the burgeoning field of LLMs, offering valuable insights for both practitioners and developers. Furthermore, it critically assesses the broad implications and challenges within various sectors, including academia and healthcare, thus mapping out an overview of AI's role in transforming dental diagnostics for enhanced patient care.
Affiliation(s)
- Masoumeh Farhadi Nia
- Department of Electrical and Computer Engineering, University of Massachusetts Lowell, Lowell, MA, United States
| | - Mohsen Ahmadi
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, United States
- Department of Industrial Engineering, Urmia University of Technology, Urmia, Iran
| | - Elyas Irankhah
- Department of Mechanical Engineering, University of Massachusetts Lowell, Lowell, MA, United States
34
Cheng HY. ChatGPT's Attitude, Knowledge, and Clinical Application in Geriatrics Practice and Education: Exploratory Observational Study. JMIR Form Res 2025; 9:e63494. [PMID: 39752214 PMCID: PMC11742095 DOI: 10.2196/63494] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2024] [Revised: 10/26/2024] [Accepted: 11/17/2024] [Indexed: 01/04/2025] Open
Abstract
BACKGROUND The increasing use of ChatGPT in clinical practice and medical education necessitates the evaluation of its reliability, particularly in geriatrics. OBJECTIVE This study aimed to evaluate ChatGPT's trustworthiness in geriatrics through 3 distinct approaches: evaluating ChatGPT's geriatrics attitude, knowledge, and clinical application with 2 vignettes of geriatric syndromes (polypharmacy and falls). METHODS We used the validated University of California, Los Angeles, geriatrics attitude and knowledge instruments to evaluate ChatGPT's geriatrics attitude and knowledge and compare its performance with that of medical students, residents, and geriatrics fellows from reported results in the literature. We also evaluated ChatGPT's application to 2 vignettes of geriatric syndromes (polypharmacy and falls). RESULTS The mean total score on geriatrics attitude of ChatGPT was significantly lower than that of trainees (medical students, internal medicine residents, and geriatric medicine fellows; 2.7 vs 3.7 on a scale from 1-5; 1=strongly disagree; 5=strongly agree). The mean subscore on positive geriatrics attitude of ChatGPT was higher than that of the trainees (medical students, internal medicine residents, and neurologists; 4.1 vs 3.7 on a scale from 1 to 5 where a higher score means a more positive attitude toward older adults). The mean subscore on negative geriatrics attitude of ChatGPT was lower than that of the trainees and neurologists (1.8 vs 2.8 on a scale from 1 to 5 where a lower subscore means a less negative attitude toward aging). On the University of California, Los Angeles geriatrics knowledge test, ChatGPT outperformed all medical students, internal medicine residents, and geriatric medicine fellows from validated studies (14.7 vs 11.3 with a score range of -18 to +18 where +18 means that all questions were answered correctly). 
Regarding the polypharmacy vignette, ChatGPT not only demonstrated solid knowledge of potentially inappropriate medications but also accurately identified 7 common potentially inappropriate medications and 5 drug-drug and 3 drug-disease interactions. However, ChatGPT missed 5 drug-disease and 1 drug-drug interaction and produced 2 hallucinations. Regarding the fall vignette, ChatGPT answered 3 of 5 pretests correctly and 2 of 5 pretests partially correctly, identified 6 categories of fall risks, followed fall guidelines correctly, listed 6 key physical examinations, and recommended 6 categories of fall prevention methods. CONCLUSIONS This study suggests that ChatGPT can be a valuable supplemental tool in geriatrics, offering reliable information with less age bias, robust geriatrics knowledge, and comprehensive recommendations for managing 2 common geriatric syndromes (polypharmacy and falls) that are consistent with evidence from guidelines, systematic reviews, and other types of studies. ChatGPT's potential as an educational and clinical resource could significantly benefit trainees, health care providers, and laypeople. Further research using GPT-4o, larger geriatrics question sets, and more geriatric syndromes is needed to expand and confirm these findings before adopting ChatGPT widely for geriatrics education and practice.
Affiliation(s)
- Huai Yong Cheng
- Minneapolis VA Health Care System, Minneapolis, MN, United States
35
Solomon BD, Khatri P. Clustering of clinical symptoms using large language models reveals low diagnostic specificity of proposed alternatives to consensus mast cell activation syndrome criteria. J Allergy Clin Immunol 2025; 155:213-218.e4. [PMID: 39278360 DOI: 10.1016/j.jaci.2024.09.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2024] [Revised: 09/03/2024] [Accepted: 09/04/2024] [Indexed: 09/18/2024]
Abstract
BACKGROUND The rate of diagnosis of mast cell activation syndrome (MCAS) has increased since the disorder's original description as a mastocytosis-like phenotype. While a set of consortium MCAS criteria is well described and widely accepted, this increase occurs in the setting of a broader set of proposed alternative MCAS criteria. OBJECTIVE Effective diagnostic criteria must minimize the range of unrelated diagnoses that can be erroneously classified as the condition of interest. We sought to determine if the symptoms associated with alternative MCAS criteria result in less concise or consistent diagnostic alternatives, reducing diagnostic specificity. METHODS We used multiple large language models, including ChatGPT, Claude, and Gemini, to bootstrap the probabilities of diagnoses that are compatible with consortium or alternative MCAS criteria. We utilized diversity and network analyses to quantify diagnostic precision and specificity compared to control diagnostic criteria including systemic lupus erythematosus, Kawasaki disease, and migraines. RESULTS Compared to consortium MCAS criteria, alternative MCAS criteria are associated with more variable (Shannon diversity 5.8 vs 4.6, respectively; P = .004) and less precise (mean Bray-Curtis similarity 0.07 vs 0.19, respectively; P = .004) diagnoses. The diagnosis networks derived from consortium and alternative MCAS criteria had lower between-network similarity compared to the similarity between diagnosis networks derived from 2 distinct systemic lupus erythematosus criteria (cosine similarity 0.55 vs 0.86, respectively; P = .0022). CONCLUSION Alternative MCAS criteria are associated with a distinct set of diagnoses compared to consortium MCAS criteria and have lower diagnostic consistency. 
This lack of specificity is pronounced in relation to multiple control criteria, raising the concern that alternative criteria could disproportionately contribute to MCAS overdiagnosis, to the exclusion of more appropriate diagnoses.
Affiliation(s)
- Benjamin D Solomon
- Department of Pediatrics, Division of Allergy and Immunology, Stanford University, Palo Alto, Calif.
| | - Purvesh Khatri
- Institute for Immunity, Transplantation, and Infection, School of Medicine, Stanford University, Palo Alto, Calif; Department of Medicine, Center for Biomedical Informatics Research, School of Medicine, Stanford University, Palo Alto, Calif
36
Su Z, Jin K, Wu H, Luo Z, Grzybowski A, Ye J. Assessment of Large Language Models in Cataract Care Information Provision: A Quantitative Comparison. Ophthalmol Ther 2025; 14:103-116. [PMID: 39516445 PMCID: PMC11724831 DOI: 10.1007/s40123-024-01066-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2024] [Accepted: 10/24/2024] [Indexed: 11/16/2024] Open
Abstract
INTRODUCTION Cataracts are a significant cause of blindness. While individuals frequently turn to the Internet for medical advice, distinguishing reliable information can be challenging. Large language models (LLMs) have attracted attention for generating accurate, human-like responses that may be used for medical consultation. However, a comprehensive assessment of LLMs' accuracy within specific medical domains is still lacking. METHODS We compiled 46 commonly inquired questions related to cataract care, categorized into six domains. Each question was presented to the LLMs, and three consultant-level ophthalmologists independently assessed the accuracy of their responses on a three-point scale (poor, borderline, good) and their comprehensiveness on a five-point scale. A majority consensus approach established the final rating for each response. Models were prompted to self-correct responses rated as 'Poor', which were then reassessed. RESULTS For accuracy, ChatGPT-4o and Google Bard both achieved average sum scores of 8.7 (out of 9), followed by ChatGPT-3.5, Bing Chat, Llama 2, and Wenxin Yiyan. In consensus-based ratings, ChatGPT-4o outperformed Google Bard in the 'Good' rating. For completeness, ChatGPT-4o had the highest average sum score of 13.22 (out of 15), followed by Google Bard, ChatGPT-3.5, Llama 2, Bing Chat, and Wenxin Yiyan. Detailed performance data reveal nuanced differences in model capabilities. In the 'Prevention' domain, apart from Wenxin Yiyan, all other models were rated as 'Good'. All models showed improvement in self-correction: Bard and Bing each improved their one 'Poor' response, Llama 2 improved 3 of 4, and Wenxin Yiyan improved 4 of 5. CONCLUSIONS Our findings emphasize the potential of LLMs, particularly ChatGPT-4o, to deliver accurate and comprehensive responses to cataract-related queries, especially in prevention, indicating potential for medical consultation. Continuous efforts to enhance LLMs' accuracy through ongoing strategies and evaluations are essential.
Affiliation(s)
- Zichang Su
- Eye Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, 310009, China
| | - Kai Jin
- Eye Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, 310009, China.
| | - Hongkang Wu
- Eye Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, 310009, China
| | - Ziyao Luo
- Eye Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, 310009, China
- Zhejiang University Chu Kochen Honors College, Hangzhou, 310009, China
| | - Andrzej Grzybowski
- Institute for Research in Ophthalmology, Foundation for Ophthalmology Development, Poznań, Poland
| | - Juan Ye
- Eye Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, 310009, China.
37
Chang Y, Yin JM, Li JM, Liu C, Cao LY, Lin SY. Applications and Future Prospects of Medical LLMs: A Survey Based on the M-KAT Conceptual Framework. J Med Syst 2024; 48:112. [PMID: 39725770 DOI: 10.1007/s10916-024-02132-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2024] [Accepted: 12/10/2024] [Indexed: 12/28/2024]
Abstract
The success of large language models (LLMs) in general domains has sparked a wave of research into their applications in the medical field. However, enhancing the medical professionalism of these models remains a major challenge. This study proposed a novel model training theoretical framework, the M-KAT framework, which integrated domain-specific training methods for LLMs with the unique characteristics of the medical discipline. This framework aimed to improve the medical professionalism of the models from three perspectives: general knowledge acquisition, specialized skill development, and alignment with clinical thinking. This study summarized the outcomes of medical LLMs across four tasks: clinical diagnosis and treatment, medical question answering, medical research, and health management. Using the M-KAT framework, we analyzed how different training stages contribute to enhancing model professionalism. At the same time, some of the potential risks associated with medical LLMs can be addressed through targeted solutions in pre-training, supervised fine-tuning (SFT), and model alignment built on these cultivated professional capabilities. Additionally, this study identified main directions for future research on medical LLMs: advancing professional evaluation datasets and metrics tailored to the needs of medical tasks, conducting in-depth studies on medical multimodal large language models (MLLMs) capable of integrating diverse data types, and exploring the forms of medical agents and multi-agent frameworks that can interact with real healthcare environments and support clinical decision-making. It is hoped that these predictions can provide a reference for subsequent research.
Affiliation(s)
- Ying Chang
- School of Basic Medical Sciences, Zhejiang Chinese Medical University, 548 Binwen Road, Binjiang District, Hangzhou, 310053, China
| | - Jian-Ming Yin
- School of Basic Medical Sciences, Zhejiang Chinese Medical University, 548 Binwen Road, Binjiang District, Hangzhou, 310053, China
| | - Jian-Min Li
- Gancao Doctor Chinese Medicine Artificial Intelligence Joint Engineering Center, Zhejiang Chinese Medical University, Zhejiang Chinese Medical University, Hangzhou, China
| | - Chang Liu
- School of Basic Medical Sciences, Zhejiang Chinese Medical University, 548 Binwen Road, Binjiang District, Hangzhou, 310053, China
- Gancao Doctor Chinese Medicine Artificial Intelligence Joint Engineering Center, Zhejiang Chinese Medical University, Zhejiang Chinese Medical University, Hangzhou, China
- Breast Disease Specialist Hospital of Guangdong Provincial Hospital of Chinese Medicine, Guangdong Provincial Hospital of Chinese Medicine, Guangzhou, China
| | - Ling-Yong Cao
- School of Basic Medical Sciences, Zhejiang Chinese Medical University, 548 Binwen Road, Binjiang District, Hangzhou, 310053, China.
| | - Shu-Yuan Lin
- School of Basic Medical Sciences, Zhejiang Chinese Medical University, 548 Binwen Road, Binjiang District, Hangzhou, 310053, China.
- Gancao Doctor Chinese Medicine Artificial Intelligence Joint Engineering Center, Zhejiang Chinese Medical University, Zhejiang Chinese Medical University, Hangzhou, China.
38
Maaß L, Grab-Kroll C, Koerner J, Öchsner W, Schön M, Messerer DAC, Böckers TM, Böckers A. Artificial Intelligence and ChatGPT in Medical Education: A Cross-Sectional Questionnaire on students' Competence. JOURNAL OF CME 2024; 14:2437293. [PMID: 39776442 PMCID: PMC11703531 DOI: 10.1080/28338073.2024.2437293] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/25/2024] [Revised: 11/16/2024] [Accepted: 11/27/2024] [Indexed: 01/11/2025]
Abstract
Artificial intelligence is rapidly transforming the field of health science and medical education, but less is known about students' competencies related to knowledge, skills, and attitudes towards the application of AI tools like ChatGPT. Therefore, a unicentric questionnaire-based cross-sectional study was administered to students in the medical field (n = 207). The data revealed that while most students were familiar with ChatGPT (66.7%), other AI tools were significantly less known or utilised for study purposes. Students approached AI tools rather informally, often preferring to use them as a simple search engine. More than half of the students admitted that they were not sufficiently informed about the underlying technology of AI. They applied ChatGPT in a self-directed manner but expressed considerable uncertainty regarding effective prompt engineering and ChatGPT's legal implications. Overall, the majority of respondents showed interest in and positivity towards the introduction of AI. However, they did not feel adequately prepared to handle AI confidently, leading many to express interest in further training. This training should be directly related to students' professional roles, e.g. as a physician. The three most favoured AI-topics for voluntary learning formats were AI in their studies (62.5%), AI in general (58.0%), and the use of AI in scientific writing (57.0%). Notable subgroup differences related to the students' gender or self-assessed study performance were observed and should be considered in future research.
Affiliation(s)
- L. Maaß
- Institute for Anatomy and Cell Biology, Faculty of Medicine, Ulm University, Ulm, Germany
- C. Grab-Kroll
- Office of the Dean of Studies, Faculty of Medicine, Ulm University, Ulm, Germany
- J. Koerner
- Office of the Dean of Studies, Faculty of Medicine, Ulm University, Ulm, Germany
- W. Öchsner
- Office of the Dean of Studies, Faculty of Medicine, Ulm University, Ulm, Germany
- Department of Anesthesiology and Intensive Care Medicine, University Hospital Ulm, Ulm, Germany
- M. Schön
- Institute for Anatomy and Cell Biology, Faculty of Medicine, Ulm University, Ulm, Germany
- DAC Messerer
- Institute of Transfusion Medicine, University of Ulm, Ulm, Germany
- TM Böckers
- Institute for Anatomy and Cell Biology, Faculty of Medicine, Ulm University, Ulm, Germany
- Anja Böckers
- Institute for Anatomy and Cell Biology, Faculty of Medicine, Ulm University, Ulm, Germany
|
39
|
Chung D, Sidhom K, Dhillon H, Bal DS, Fidel MG, Jawanda G, Patel P. Real-world utility of ChatGPT in pre-vasectomy counselling, a safe and efficient practice: a prospective single-centre clinical study. World J Urol 2024; 43:32. [PMID: 39673635 DOI: 10.1007/s00345-024-05385-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2024] [Accepted: 11/15/2024] [Indexed: 12/16/2024] Open
Abstract
PURPOSE This study sought to assess whether pre-vasectomy counselling with ChatGPT can safely streamline the consultation process by reducing visit times and increasing patient satisfaction. METHODS A single-institution randomized pilot study was conducted to evaluate the safety and efficacy of ChatGPT for pre-vasectomy counselling. All adult patients interested in undergoing a vasectomy were included. Unwillingness to provide consent or lack of internet access constituted exclusion. Patients were randomized 1:1 to standard in-person consultation with ChatGPT or in-person consultation without ChatGPT. Length of visit, number of questions asked, and responses to a Likert scale questionnaire (on a scale of 0 to 10, with 10 defined as great and 0 as poor) were collected. Descriptive statistics and a comparative analysis were performed. RESULTS Eighteen patients were included, with a mean age of 35.8 ± 5.4 years (n = 9) in the intervention arm and 36.9 ± 7.4 years (n = 9) in the control arm. Pre-vasectomy counselling with ChatGPT was associated with a higher provider perception of patient understanding of the procedure (8.8 ± 1.0 vs. 6.7 ± 2.8; p = 0.047) and a decreased length of in-person consultation (7.7 ± 2.3 min vs. 10.6 ± 3.4 min; p = 0.05). Quality of information provided by ChatGPT, ease of use, and overall experience were rated highly at 8.3 ± 1.9, 9.1 ± 1.5, and 8.6 ± 1.7, respectively. CONCLUSIONS ChatGPT for pre-vasectomy counselling improved the efficiency of consultations and the provider's perception of the patient's understanding of the procedure.
Affiliation(s)
- David Chung
- Section of Urology, Department of Surgery, University of Manitoba, AD203-720 McDermot Avenue, Winnipeg, Manitoba, R3N 1B1, Canada
- Karim Sidhom
- Section of Urology, Department of Surgery, University of Manitoba, AD203-720 McDermot Avenue, Winnipeg, Manitoba, R3N 1B1, Canada
- Dhiraj S Bal
- Max Rady College of Medicine, University of Manitoba, Winnipeg, MB, Canada
- Maximilian G Fidel
- Max Rady College of Medicine, University of Manitoba, Winnipeg, MB, Canada
- Gary Jawanda
- Manitoba Men's Health Clinic, Winnipeg, MB, Canada
- Premal Patel
- Section of Urology, Department of Surgery, University of Manitoba, AD203-720 McDermot Avenue, Winnipeg, Manitoba, R3N 1B1, Canada
- Manitoba Men's Health Clinic, Winnipeg, MB, Canada
|
40
|
Breitwieser M, Moore V, Wiesner T, Wichlas F, Deininger C. NLP-Driven Analysis of Pneumothorax Incidence Following Central Venous Catheter Procedures: A Data-Driven Re-Evaluation of Routine Imaging in Value-Based Medicine. Diagnostics (Basel) 2024; 14:2792. [PMID: 39767153 PMCID: PMC11674588 DOI: 10.3390/diagnostics14242792] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2024] [Revised: 11/14/2024] [Accepted: 12/10/2024] [Indexed: 01/11/2025] Open
Abstract
Background: This study presents a systematic approach using a natural language processing (NLP) algorithm to assess the necessity of routine imaging after central venous catheter (CVC) placement and removal. With pneumothorax being a key complication of CVC procedures, this research aims to provide evidence-based recommendations for optimizing imaging protocols and minimizing unnecessary imaging risks. Methods: We analyzed electronic health records from four university hospitals in Salzburg, Austria, focusing on X-rays performed between 2012 and 2021 following CVC procedures. A custom-built NLP algorithm identified cases of pneumothorax from radiologists' reports and clinician requests, while excluding cases with contraindications such as chest injuries, prior pneumothorax, or missing data. Chi-square tests were used to compare pneumothorax rates between CVC insertion and removal, and multivariate logistic regression identified risk factors, with a focus on age and gender. Results: This study analyzed 17,175 cases of patients aged 18 and older, with 95.4% involving CVC insertion and 4.6% involving CVC removal. Pneumothorax was observed in 106 cases post-insertion (1.3%) and in 3 cases post-removal (0.02%), with no statistically significant difference between procedures (p = 0.5025). The NLP algorithm achieved an accuracy of 93%, with a sensitivity of 97.9%, a specificity of 87.9%, and an area under the ROC curve (AUC) of 0.9283. Conclusions: The findings indicate no significant difference in pneumothorax incidence between CVC insertion and removal, supporting existing recommendations against routine imaging post-removal for asymptomatic patients and suggesting that routine imaging after CVC insertion may also be unnecessary in similar cases. This study demonstrates how advanced NLP techniques can support value-based medicine by enhancing clinical decision making and optimizing resources.
Affiliation(s)
- Martin Breitwieser
- Department for Orthopedic Surgery and Traumatology, Paracelsus Medical University, 5020 Salzburg, Austria; (V.M.); (F.W.); (C.D.)
|
41
|
De Busser B, Roth L, De Loof H. The role of large language models in self-care: a study and benchmark on medicines and supplement guidance accuracy. Int J Clin Pharm 2024:10.1007/s11096-024-01839-2. [PMID: 39644377 DOI: 10.1007/s11096-024-01839-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2024] [Accepted: 11/12/2024] [Indexed: 12/09/2024]
Abstract
BACKGROUND The recent surge in the capabilities of artificial intelligence systems, particularly large language models, is also impacting the medical and pharmaceutical field in a major way. Beyond specialized uses in diagnostics and data discovery, these tools have now become accessible to the general public. AIM The study aimed to critically analyse the current performance of large language models in answering patients' self-care questions regarding medications and supplements. METHOD Answers from six major language models were analysed for correctness, language-independence, context-sensitivity, and reproducibility using a newly developed reference set of questions and a scoring matrix. RESULTS The investigated large language models are capable of answering a clear majority of self-care questions accurately, providing relevant health information. However, substantial variability in the responses, including potentially unsafe advice, was observed, influenced by language, question structure, user context and time. GPT 4.0 scored highest on average, while GPT 3.5, Gemini, and Gemini Advanced had varied scores. Responses were context and language sensitive. In terms of consistency over time, Perplexity had the worst performance. CONCLUSION Given the high-quality output of large language models, their potential in self-care applications is undeniable. The newly created benchmark can facilitate further validation and guide the establishment of strict safeguards to combat the sizable risk of misinformation in order to reach a more favourable risk/benefit ratio when this cutting-edge technology is used by patients.
Affiliation(s)
- Branco De Busser
- Laboratory of Physiopharmacology, University of Antwerp, Universiteitsplein 1, 2610, Antwerp, Belgium
- Lynn Roth
- Laboratory of Physiopharmacology, University of Antwerp, Universiteitsplein 1, 2610, Antwerp, Belgium
- Hans De Loof
- Laboratory of Physiopharmacology, University of Antwerp, Universiteitsplein 1, 2610, Antwerp, Belgium
|
42
|
Yu H, Fan L, Li L, Zhou J, Ma Z, Xian L, Hua W, He S, Jin M, Zhang Y, Gandhi A, Ma X. Large Language Models in Biomedical and Health Informatics: A Review with Bibliometric Analysis. JOURNAL OF HEALTHCARE INFORMATICS RESEARCH 2024; 8:658-711. [PMID: 39463859 PMCID: PMC11499577 DOI: 10.1007/s41666-024-00171-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2024] [Revised: 08/16/2024] [Accepted: 08/22/2024] [Indexed: 10/29/2024]
Abstract
Large language models (LLMs) have rapidly become important tools in Biomedical and Health Informatics (BHI), potentially enabling new ways to analyze data, treat patients, and conduct research. This study aims to provide a comprehensive overview of LLM applications in BHI, highlighting their transformative potential and addressing the associated ethical and practical challenges. We reviewed 1698 research articles from January 2022 to December 2023, categorizing them by research themes and diagnostic categories. Additionally, we conducted network analysis to map scholarly collaborations and research dynamics. Our findings reveal a substantial increase in the potential applications of LLMs to a variety of BHI tasks, including clinical decision support, patient interaction, and medical document analysis. Notably, LLMs are expected to be instrumental in enhancing the accuracy of diagnostic tools and patient care protocols. The network analysis highlights dense and dynamically evolving collaborations across institutions, underscoring the interdisciplinary nature of LLM research in BHI. A significant trend was the application of LLMs in managing specific disease categories, such as mental health and neurological disorders, demonstrating their potential to influence personalized medicine and public health strategies. LLMs hold promising potential to further transform biomedical research and healthcare delivery. While promising, the ethical implications and challenges of model validation call for rigorous scrutiny to optimize their benefits in clinical settings. This survey serves as a resource for stakeholders in healthcare, including researchers, clinicians, and policymakers, to understand the current state and future potential of LLMs in BHI.
Affiliation(s)
- Huizi Yu
- University of Michigan, Ann Arbor, MI, USA
- Lizhou Fan
- University of Michigan, Ann Arbor, MI, USA
- Lingyao Li
- University of Michigan, Ann Arbor, MI, USA
- Zihui Ma
- University of Maryland, College Park, MD, USA
- Lu Xian
- University of Michigan, Ann Arbor, MI, USA
- Sijia He
- University of Michigan, Ann Arbor, MI, USA
- Ashvin Gandhi
- University of California, Los Angeles, Los Angeles, CA, USA
- Xin Ma
- Shandong University, Jinan, Shandong, China
|
43
|
Lee JJ, Zepeda A, Arbour G, Isaac KV, Ng RT, Nichol AM. Automated Identification of Breast Cancer Relapse in Computed Tomography Reports Using Natural Language Processing. JCO Clin Cancer Inform 2024; 8:e2400107. [PMID: 39705642 DOI: 10.1200/cci.24.00107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2024] [Revised: 08/15/2024] [Accepted: 10/18/2024] [Indexed: 12/22/2024] Open
Abstract
PURPOSE Breast cancer relapses are rarely collected by cancer registries because of logistical and financial constraints. Hence, we investigated natural language processing (NLP), enhanced with state-of-the-art deep learning transformer tools and large language models, to automate relapse identification in the text of computed tomography (CT) reports. METHODS We analyzed follow-up CT reports from patients diagnosed with breast cancer between January 1, 2005, and December 31, 2014. The reports were curated and annotated for the presence or absence of local, regional, and distant breast cancer relapses. We performed 10-fold cross-validation to evaluate models identifying different types of relapses in CT reports. Model performance was assessed with classification metrics, reported with 95% confidence intervals. RESULTS In our data set of 1,445 CT reports, 799 (55.3%) described any relapse, 72 (5.0%) local relapses, 97 (6.7%) regional relapses, and 743 (51.4%) distant relapses. The any-relapse model achieved an accuracy of 89.6% (87.8-91.1), with a sensitivity of 93.2% (91.4-94.9) and a specificity of 84.2% (80.9-87.1). The local relapse model achieved an accuracy of 94.6% (93.3-95.7), a sensitivity of 44.4% (32.8-56.3), and a specificity of 97.2% (96.2-98.0). The regional relapse model showed an accuracy of 93.6% (92.3-94.9), a sensitivity of 70.1% (60.0-79.1), and a specificity of 95.3% (94.2-96.5). Finally, the distant relapse model demonstrated an accuracy of 88.1% (86.2-89.7), a sensitivity of 91.8% (89.9-93.8), and a specificity of 83.7% (80.5-86.4). CONCLUSION We developed NLP models to identify local, regional, and distant breast cancer relapses from CT reports. Automating the identification of breast cancer relapses can enhance data collection about patient outcomes.
Affiliation(s)
- Jaimie J Lee
- Department of Radiation Oncology, BC Cancer, Vancouver, BC, Canada
- Department of Surgery, University of British Columbia, Vancouver, BC, Canada
- Andres Zepeda
- Department of Computer Science, University of British Columbia, Vancouver, BC, Canada
- Gregory Arbour
- Department of Computer Science, University of British Columbia, Vancouver, BC, Canada
- Kathryn V Isaac
- Department of Surgery, University of British Columbia, Vancouver, BC, Canada
- Raymond T Ng
- Department of Computer Science, University of British Columbia, Vancouver, BC, Canada
- Alan M Nichol
- Department of Radiation Oncology, BC Cancer, Vancouver, BC, Canada
- Department of Surgery, University of British Columbia, Vancouver, BC, Canada
|
44
|
Farook TH, Dudley J. Understanding Occlusion and Temporomandibular Joint Function Using Deep Learning and Predictive Modeling. Clin Exp Dent Res 2024; 10:e70028. [PMID: 39563180 PMCID: PMC11576518 DOI: 10.1002/cre2.70028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2024] [Revised: 08/19/2024] [Accepted: 10/01/2024] [Indexed: 11/21/2024] Open
Abstract
OBJECTIVES Advancements in artificial intelligence (AI)-driven predictive modeling in dentistry are outpacing the clinical translation of research findings. Predictive modeling uses statistical methods to anticipate norms related to TMJ dynamics, complementing imaging modalities like cone beam computed tomography (CBCT) and magnetic resonance imaging (MRI). Deep learning, a subset of AI, helps quantify and analyze complex hierarchical relationships in occlusion and TMJ function. This narrative review explores the application of predictive modeling and deep learning to identify clinical trends and associations related to occlusion and TMJ function. RESULTS Debates persist regarding best practices for managing occlusal factors in temporomandibular joint (TMJ) function analysis while interpreting and quantifying findings related to the TMJ and occlusion and mitigating biases remain challenging. Data generated from noninvasive chairside tools such as jaw trackers, video tracking, and 3D scanners with virtual articulators offer unique insights by predicting variations in dynamic jaw movement, TMJ, and occlusion. The predictions help us understand the highly individualized norms surrounding TMJ function that are often required to address temporomandibular disorders (TMDs) in general practice. CONCLUSIONS Normal TMJ function, occlusion, and the appropriate management of TMDs are complex and continue to attract ongoing debate. This review examines how predictive modeling and artificial intelligence aid in understanding occlusion and TMJ function and provides insights into complex dental conditions such as TMDs that may improve diagnosis and treatment outcomes with noninvasive techniques.
Affiliation(s)
- James Dudley
- Adelaide Dental School, The University of Adelaide, South Australia, Australia
|
45
|
Kwok KO, Huynh T, Wei WI, Wong SYS, Riley S, Tang A. Utilizing large language models in infectious disease transmission modelling for public health preparedness. Comput Struct Biotechnol J 2024; 23:3254-3257. [PMID: 39286528 PMCID: PMC11402906 DOI: 10.1016/j.csbj.2024.08.006] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2024] [Revised: 08/07/2024] [Accepted: 08/07/2024] [Indexed: 09/19/2024] Open
Abstract
Introduction OpenAI's ChatGPT, a Large Language Model (LLM), is a powerful tool across domains, designed for text and code generation, fostering collaboration, especially in public health. Investigating the role of this advanced LLM chatbot in assisting public health practitioners in shaping disease transmission models to inform infection control strategies marks a new era in infectious disease epidemiology research. This study used a case study to illustrate how ChatGPT collaborates with a public health practitioner in co-designing a mathematical transmission model. Methods Using natural conversation, the practitioner initiated a dialogue involving an iterative process of code generation, refinement, and debugging with ChatGPT to develop a model to fit 10 days of prevalence data to estimate two key epidemiological parameters: i) the basic reproductive number (R0) and ii) the final epidemic size. Verification and validation processes were conducted to ensure the accuracy and functionality of the final model. Results ChatGPT developed a validated transmission model which replicated the epidemic curve and gave estimates of R0 of 4.19 (95% CI: 4.13-4.26) and a final epidemic size of 98.3% of the population within 60 days. It highlighted the advantages of using maximum likelihood estimation with Poisson distribution over the least squares method. Conclusion Integration of LLMs in medical research accelerates model development, reducing technical barriers for health practitioners, democratizing access to advanced modeling and potentially enhancing pandemic preparedness globally, particularly in resource-constrained populations.
Affiliation(s)
- Kin On Kwok
- JC School of Public Health and Primary Care, The Chinese University of Hong Kong, Hong Kong Special Administrative Region of China
- Hong Kong Institute of Asia-Pacific Studies, The Chinese University of Hong Kong, Hong Kong Special Administrative Region of China
- Department of Infectious Disease Epidemiology, School of Public Health, Imperial College London, London, United Kingdom
- Tom Huynh
- School of Science, Engineering and Technology, RMIT University, Viet Nam
- Wan In Wei
- JC School of Public Health and Primary Care, The Chinese University of Hong Kong, Hong Kong Special Administrative Region of China
- Samuel Y S Wong
- JC School of Public Health and Primary Care, The Chinese University of Hong Kong, Hong Kong Special Administrative Region of China
- Steven Riley
- MRC Centre for Global Infectious Disease Analysis and Jameel Institute, Imperial College London, London, United Kingdom
- School of Public Health, Imperial College London, Norfolk Place, London W2 1PG, United Kingdom
- Arthur Tang
- School of Science, Engineering and Technology, RMIT University, Viet Nam
|
46
|
Abhari S, Afshari Y, Fatehi F, Salmani H, Garavand A, Chumachenko D, Zakerabasali S, Morita PP. Exploring ChatGPT in clinical inquiry: a scoping review of characteristics, applications, challenges, and evaluation. Ann Med Surg (Lond) 2024; 86:7094-7104. [PMID: 39649918 PMCID: PMC11623824 DOI: 10.1097/ms9.0000000000002716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/19/2024] [Accepted: 10/25/2024] [Indexed: 12/11/2024] Open
Abstract
Introduction Recent advancements in generative AI, exemplified by ChatGPT, hold promise for healthcare applications such as decision-making support, education, and patient engagement. However, rigorous evaluation is crucial to ensure reliability and safety in clinical contexts. This scoping review explores ChatGPT's role in clinical inquiry, focusing on its characteristics, applications, challenges, and evaluation. Methods This review, conducted in 2023, followed PRISMA-ScR guidelines (Supplemental Digital Content 1, http://links.lww.com/MS9/A636). Searches were performed across PubMed, Scopus, IEEE, Web of Science, Cochrane, and Google Scholar using relevant keywords. The review explored ChatGPT's effectiveness in various medical domains, evaluation methods, target users, and comparisons with other AI models. Data synthesis and analysis incorporated both quantitative and qualitative approaches. Results Analysis of 41 academic studies highlights ChatGPT's potential in medical education, patient care, and decision support, though performance varies by medical specialty and linguistic context. GPT-3.5, frequently referenced in 26 studies, demonstrated adaptability across diverse scenarios. Challenges include limited access to official answer keys and inconsistent performance, underscoring the need for ongoing refinement. Evaluation methods, including expert comparisons and statistical analyses, provided significant insights into ChatGPT's efficacy. The identification of target users, such as medical educators and nonexpert clinicians, illustrates its broad applicability. Conclusion ChatGPT shows significant potential in enhancing clinical practice and medical education. Nevertheless, continuous refinement is essential for its successful integration into healthcare, aiming to improve patient care outcomes, and address the evolving needs of the medical community.
Affiliation(s)
- Shahabeddin Abhari
- School of Public Health Sciences, University of Waterloo, Waterloo, Ontario, Canada
- Yasna Afshari
- Department of Radiology and Nuclear Medicine, Erasmus MC University Medical Center Rotterdam, Rotterdam
- Department of Epidemiology, Erasmus MC University Medical Center Rotterdam, Rotterdam, The Netherlands
- Farhad Fatehi
- Business School, The University of Queensland, Brisbane, Australia
- Hosna Salmani
- Department of Health Information Management, School of Health Management and Information Sciences, Iran University of Medical Sciences, Tehran, Iran
- Ali Garavand
- Department of Health Information Technology, School of Allied Medical Sciences, Lorestan University of Medical Sciences, Khorramabad, Iran
- Dmytro Chumachenko
- Department of Mathematical Modeling and Artificial Intelligence, National Aerospace University ‘Kharkiv Aviation Institute’, Kharkiv, Ukraine
- Somayyeh Zakerabasali
- Department of Health Information Management, Clinical Education Research Center, Health Human Resources Research Center, School of Health Management and Information Sciences, Shiraz University of Medical Sciences, Shiraz, Iran
- Plinio P. Morita
- School of Public Health Sciences, University of Waterloo, Waterloo, Ontario, Canada
- Department of Systems Design Engineering, University of Waterloo
- Research Institute for Aging, University of Waterloo, Waterloo, Ontario, Canada
- Centre for Digital Therapeutics, Techna Institute, University Health Network, Toronto
- Dalla Lana School of Public Health, Institute of Health Policy, Management, and Evaluation, University of Toronto, Toronto, Ontario, Canada
|
47
|
Elkarmi R, Abu-Ghazaleh S, Sonbol H, Haha O, Al-Haddad A, Hassona Y. ChatGPT for parents' education about early childhood caries: A friend or foe? Int J Paediatr Dent 2024. [PMID: 39533165 DOI: 10.1111/ipd.13283] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 05/23/2024] [Revised: 08/25/2024] [Accepted: 10/22/2024] [Indexed: 11/16/2024]
Abstract
BACKGROUND With the increasing popularity of online sources for health information, parents may seek information related to early childhood caries (ECC) from artificial intelligence-based chatbots. AIM The aim of this article was to evaluate the usefulness, quality, reliability, and readability of ChatGPT answers to parents' questions about ECC. DESIGN Eighty questions commonly asked about ECC were compiled from experts and keyword research tools. ChatGPT 3.5 was asked these questions independently. The answers were evaluated by experts in paediatric dentistry. RESULTS ChatGPT provided "very useful" and "useful" responses to 82.5% of the questions. The mean global quality score was 4.3 ± 1 (good quality). The mean reliability score was 18.5 ± 8.9 (average to very good). The mean understandability score was 59.5% ± 13.8 (not highly understandable), and the mean actionability score was 40.5% ± 12.8 (low actionability). The mean Flesch-Kincaid reading ease score was 32% ± 25.7, and the mean Simple Measure of Gobbledygook index readability score was 15.3 ± 9.1 (indicating poor readability for the lay person). Misleading and false information were detected in some answers. CONCLUSION ChatGPT has significant potential as a tool for answering parents' questions about ECC. Concerns, however, do exist about the readability and actionability of the answers. The presence of false information should not be overlooked.
Affiliation(s)
- Rawan Elkarmi
- Department of Paediatric Dentistry, Orthodontics, and Preventive Dentistry, School of Dentistry, The University of Jordan, Amman, Jordan
- School of Dentistry, The University of Jordan, Amman, Jordan
- Suha Abu-Ghazaleh
- Department of Paediatric Dentistry, Orthodontics, and Preventive Dentistry, School of Dentistry, The University of Jordan, Amman, Jordan
- School of Dentistry, The University of Jordan, Amman, Jordan
- Hawazen Sonbol
- Department of Paediatric Dentistry, Orthodontics, and Preventive Dentistry, School of Dentistry, The University of Jordan, Amman, Jordan
- School of Dentistry, The University of Jordan, Amman, Jordan
- Ola Haha
- School of Dentistry, The University of Jordan, Amman, Jordan
- Alaa Al-Haddad
- School of Dentistry, The University of Jordan, Amman, Jordan
- Department of Prosthodontics, School of Dentistry, The University of Jordan, Amman, Jordan
- Yazan Hassona
- School of Dentistry, The University of Jordan, Amman, Jordan
- Department of Oral and Maxillofacial Surgery, Oral Medicine and Periodontology, School of Dentistry, The University of Jordan, Amman, Jordan
|
48
|
Sanduleanu S, Ersahin K, Bremm J, Talibova N, Damer T, Erdogan M, Kottlors J, Goertz L, Bruns C, Maintz D, Abdullayev N. Feasibility of GPT-3.5 versus Machine Learning for Automated Surgical Decision-Making Determination: A Multicenter Study on Suspected Appendicitis. AI 2024; 5:1942-1954. [DOI: 10.3390/ai5040096] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2025] Open
Abstract
Background: Nonsurgical treatment of uncomplicated appendicitis is a reasonable option in many cases despite the sparsity of robust, easy-access, externally validated, and multimodally informed clinical decision support systems (CDSSs). Developed by OpenAI, the Generative Pre-trained Transformer 3.5 model (GPT-3.5) may provide enhanced decision support for surgeons in less certain appendicitis cases or those posing a higher risk for (relative) operative contra-indications. Our objective was to determine whether GPT-3.5, when provided high-throughput clinical, laboratory, and radiological text-based information, would come to clinical decisions similar to those of a machine learning model and a board-certified surgeon (reference standard) in deciding between appendectomy and conservative treatment. Methods: In this cohort study, we randomly collected patients presenting at the emergency department (ED) of two German hospitals (GFO, Troisdorf, and University Hospital Cologne) with right abdominal pain between October 2022 and October 2023. Statistical analysis was performed using R, version 3.6.2, on RStudio, version 2023.03.0 + 386. Overall agreement between the GPT-3.5 output and the reference standard was assessed by means of inter-observer kappa values as well as accuracy, sensitivity, specificity, and positive and negative predictive values with the “Caret” and “irr” packages. Statistical significance was defined as p < 0.05. Results: There was agreement between the surgeon’s decision and GPT-3.5 in 102 of 113 cases, and all cases where the surgeon decided upon conservative treatment were correctly classified by GPT-3.5. The estimated model training accuracy was 83.3% (95% CI: 74.0, 90.4), and the validation accuracy was 87.0% (95% CI: 66.4, 97.2). By comparison, GPT-3.5 reached an accuracy of 90.3% (95% CI: 83.2, 95.0), which was not significantly better than the machine learning model (p = 0.21).
Conclusions: This study, to our knowledge the first to examine the “intended use” of GPT-3.5 for surgical treatment decisions, compared surgical decision-making by board-certified surgeons with an algorithm and found a high degree of agreement between the surgeons and GPT-3.5 in patients presenting to the emergency department with lower abdominal pain.
Affiliation(s)
- Koray Ersahin
- Department of General and Visceral Surgery, GFO Clinics Troisdorf, Academic Hospital of the Friedrich-Wilhelms-University Bonn, 50937 Troisdorf, Germany
- Johannes Bremm
- Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50937 Cologne, Germany
- Narmin Talibova
- Department of Internal Medicine III, University Hospital, 89081 Ulm, Germany
- Tim Damer
- Department of General and Visceral Surgery, GFO Clinics Troisdorf, Academic Hospital of the Friedrich-Wilhelms-University Bonn, 50937 Troisdorf, Germany
- Merve Erdogan
- Department of Radiology and Neuroradiology, GFO Clinics Troisdorf, Academic Hospital of the Friedrich-Wilhelms-University Bonn, 53840 Troisdorf, Germany
- Jonathan Kottlors
- Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50937 Cologne, Germany
- Lukas Goertz
- Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50937 Cologne, Germany
- Christiane Bruns
- Department of General, Visceral, Tumor and Transplantation Surgery, University Hospital of Cologne, Kerpener Straße 62, 50937 Cologne, Germany
- Center for Integrated Oncology (CIO) Aachen, Bonn, Cologne and Düsseldorf, 50937 Cologne, Germany
- David Maintz
- Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50937 Cologne, Germany
- Nuran Abdullayev
- Department of Radiology and Neuroradiology, GFO Clinics Troisdorf, Academic Hospital of the Friedrich-Wilhelms-University Bonn, 53840 Troisdorf, Germany
|
49
|
O'Sullivan C, Gaddum C, Lee AJ. Use of AI to enhance written information in paediatric settings-stochastic parrot or clinical tool? Evid Based Nurs 2024:ebnurs-2024-104164. [PMID: 39357997 DOI: 10.1136/ebnurs-2024-104164] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 09/18/2024] [Indexed: 10/04/2024]
Affiliation(s)
- Clare Gaddum
- Faculty of Health Sciences, Manchester Metropolitan University, Manchester, UK
- Amanda J Lee
- Faculty of Health Sciences, Manchester Metropolitan University, Manchester, UK
|
50
|
Hölzing CR, Rumpf S, Huber S, Papenfuß N, Meybohm P, Happel O. The Potential of Using Generative AI/NLP to Identify and Analyse Critical Incidents in a Critical Incident Reporting System (CIRS): A Feasibility Case-Control Study. Healthcare (Basel) 2024; 12:1964. [PMID: 39408144 PMCID: PMC11475821 DOI: 10.3390/healthcare12191964] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2024] [Revised: 09/28/2024] [Accepted: 09/29/2024] [Indexed: 10/20/2024] Open
Abstract
BACKGROUND To enhance patient safety in healthcare, it is crucial to address the underreporting of issues in Critical Incident Reporting Systems (CIRSs). This study aims to evaluate the effectiveness of generative Artificial Intelligence and Natural Language Processing (AI/NLP) in reviewing CIRS cases by comparing its performance with human reviewers and categorising these cases into relevant topics. METHODS A case-control feasibility study was conducted using CIRS cases from the German CIRS-Anaesthesiology subsystem. Each case was reviewed by a human expert and by an AI/NLP model (ChatGPT-3.5). Two CIRS experts blindly assessed these reviews, rating them on linguistic quality, recognisable expertise, logical derivability, and overall quality using six-point Likert scales. RESULTS On average, the CIRS experts correctly classified 80% of human CIRS reviews as created by a human and misclassified 45.8% of AI reviews as written by a human. Ratings on a scale of 1 (very good) to 6 (failed) revealed a comparable performance between human- and AI-generated reviews across the dimensions of linguistic expression (p = 0.39), recognisable expertise (p = 0.89), logical derivability (p = 0.84), and overall quality (p = 0.87). The AI model was able to categorise the cases into relevant topics independently. CONCLUSIONS This feasibility study demonstrates the potential of generative AI/NLP in analysing and categorising cases from the CIRS, which could have implications for improving incident reporting in healthcare. Further research is required to verify and extend these findings.
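The blinded attribution experiment above reduces to two rates: how often experts correctly identify a human-written review, and how often an AI-generated review passes as human. A minimal sketch follows; the review counts (20 human, 24 AI) are hypothetical assumptions chosen only to reproduce the reported 80% and 45.8%, and are not taken from the paper.

```python
def attribution_rates(true_authors, judged_authors):
    """Return (fraction of human reviews judged human,
               fraction of AI reviews judged human)."""
    pairs = list(zip(true_authors, judged_authors))
    human_judgments = [j for t, j in pairs if t == "human"]
    ai_judgments = [j for t, j in pairs if t == "ai"]
    return (human_judgments.count("human") / len(human_judgments),
            ai_judgments.count("human") / len(ai_judgments))

# Hypothetical counts: 16 of 20 human reviews judged human (80%),
# 11 of 24 AI reviews judged human (45.8%).
true_authors = ["human"] * 20 + ["ai"] * 24
judged = ["human"] * 16 + ["ai"] * 4 + ["human"] * 11 + ["ai"] * 13

correct_human, ai_as_human = attribution_rates(true_authors, judged)
print(round(correct_human, 3), round(ai_as_human, 3))  # 0.8 0.458
```

An AI-as-human rate near 50% is the chance level for a binary judgment, which is why the study reads it as AI reviews being hard to distinguish from human ones.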
Affiliation(s)
- Carlos Ramon Hölzing
- Department of Anaesthesiology, Intensive Care, Emergency and Pain Medicine, University Hospital Würzburg, Oberdürrbacher Str. 6, 97080 Würzburg, Germany
- Sebastian Rumpf
- Department of Anaesthesiology, Intensive Care, Emergency and Pain Medicine, University Hospital Würzburg, Oberdürrbacher Str. 6, 97080 Würzburg, Germany
- Stephan Huber
- Psychological Ergonomics, University of Würzburg, 97070 Würzburg, Germany
- Nathalie Papenfuß
- Psychological Ergonomics, University of Würzburg, 97070 Würzburg, Germany
- Patrick Meybohm
- Department of Anaesthesiology, Intensive Care, Emergency and Pain Medicine, University Hospital Würzburg, Oberdürrbacher Str. 6, 97080 Würzburg, Germany
- Oliver Happel
- Department of Anaesthesiology, Intensive Care, Emergency and Pain Medicine, University Hospital Würzburg, Oberdürrbacher Str. 6, 97080 Würzburg, Germany
|