1. Zhou M, Pan Y, Zhang Y, Song X, Zhou Y. Evaluating AI-generated patient education materials for spinal surgeries: Comparative analysis of readability and DISCERN quality across ChatGPT and DeepSeek models. Int J Med Inform 2025;198:105871. PMID: 40107040. DOI: 10.1016/j.ijmedinf.2025.105871.
Abstract
BACKGROUND Access to patient-centered health information is essential for informed decision-making. However, online medical resources vary in quality and often fail to accommodate differing degrees of health literacy. This issue is particularly evident in surgical contexts, where complex terminology obstructs patient comprehension. As patients increasingly rely on AI models for supplementary medical information, the reliability and readability of AI-generated content require thorough evaluation. OBJECTIVE This study aimed to evaluate four natural language processing models (ChatGPT-4o, ChatGPT-o3 mini, DeepSeek-V3, and DeepSeek-R1) in generating patient education materials for three common spinal surgeries: lumbar discectomy, spinal fusion, and decompressive laminectomy. Information quality was evaluated using the DISCERN score, and readability was assessed with Flesch-Kincaid indices. RESULTS DeepSeek-R1 produced the most readable responses, with Flesch-Kincaid Grade Level (FKGL) scores ranging from 7.2 to 9.0, followed by ChatGPT-4o. In contrast, ChatGPT-o3 exhibited the lowest readability (FKGL > 10.4). DISCERN scores for all AI models were below 60, classifying the information quality as "fair," primarily due to insufficient cited references. CONCLUSION All models achieved only a "fair" quality rating, underscoring the need for improved citation practices and personalization. Nonetheless, DeepSeek-R1 and ChatGPT-4o generated more readable surgical information than ChatGPT-o3. Given that enhanced readability can improve patient engagement, reduce anxiety, and contribute to better surgical outcomes, these two models should be prioritized for assisting patients in clinical settings. LIMITATIONS & FUTURE DIRECTIONS This study is limited by the rapid evolution of AI models, its exclusive focus on spinal surgery education, and the absence of real-world patient feedback, which may affect the generalizability and long-term applicability of the findings. Future research should explore interactive, multimodal approaches and incorporate patient feedback to ensure that AI-generated health information is accurate, accessible, and supportive of informed healthcare decisions.
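The FKGL scores reported above come from a fixed formula over sentence, word, and syllable counts. A minimal Python sketch of that computation (the vowel-group syllable counter is a crude stand-in for the stronger heuristics published tools use, and the example sentence is invented):

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count runs of consecutive vowels. This overcounts
    # silent e's and misses edge cases; treat the result as approximate.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text: str) -> float:
    # Flesch-Kincaid Grade Level:
    #   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

# Hypothetical patient-education sentence; a low grade level means easier reading.
print(round(fkgl("The spine surgeon removes the damaged disc. Most patients go home the same day."), 1))
```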
Affiliation(s)
- Mi Zhou
- Allied Health & Human Performance, University of South Australia, Adelaide, Australia
- Yun Pan
- Department of Cardiovascular Medicine, The Second Affiliated Hospital of Soochow University, Suzhou, Jiangsu, China
- Yuye Zhang
- Department of Orthopaedics, The Second Affiliated Hospital of Soochow University, Suzhou, Jiangsu, China
- Xiaomei Song
- Department of Nursing, The Second Affiliated Hospital of Soochow University, Suzhou, Jiangsu, China
- Youbin Zhou
- College of Intelligent Science and Control Engineering, Jinling Institute of Technology, Nanjing, China
2. Cadiente A, Implicito C, Udaiyar A, Ho A, Wan C, Chen J, Palmer C, Cao Q, Raver M, Lembrikova K, Billah M. Evaluating Incontinence Abstracts: Artificial Intelligence-Generated Versus Cochrane Review. Urogynecology (Philadelphia, PA) 2025:02273501-990000000-00377. PMID: 40193590. DOI: 10.1097/spv.0000000000001688.
Abstract
IMPORTANCE As the volume of medical literature continues to expand, using artificial intelligence (AI) to produce concise, accessible summaries has the potential to make content review more efficient. OBJECTIVES This project assessed the readability and quality of summaries generated by ChatGPT in comparison with the Plain Text Summaries from Cochrane Review, a systematic review database, in incontinence research. STUDY DESIGN Seventy-three abstracts from the Cochrane Library tagged under "Incontinence" were summarized using ChatGPT-3.5 (July 2023 version) and compared with their corresponding Cochrane Plain Text Summaries. Readability was assessed using the Flesch-Kincaid Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog Score, SMOG Index, Coleman-Liau Index, and Automated Readability Index. A two-tailed t test was used to compare the summaries. Each summary was also evaluated by 2 blinded, independent reviewers on a 5-point scale on which higher scores indicated greater accuracy and adherence to the abstract. RESULTS Cochrane Review's Plain Text Summaries scored significantly higher on the Flesch-Kincaid Reading Ease score and required significantly lower education levels on the 5 other readability metrics, indicating better readability than ChatGPT. However, ChatGPT earned a significantly higher mean accuracy grade (4.25) than Cochrane Review (4.05). CONCLUSIONS Cochrane Review's Plain Text Summaries provide clearer summaries of the incontinence literature than ChatGPT, yet ChatGPT generated more comprehensive summaries. While ChatGPT can effectively summarize the medical literature, further work is needed to make such summaries more accessible to readers.
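The two-tailed t test named in the study design compares readability scores between the two summary sources. A sketch under assumptions the abstract does not state: the scores below are invented, and a paired test is used because each abstract yields one summary from each source (the paper may instead have used an independent-samples test):

```python
from scipy import stats

# Hypothetical FKGL scores for the same five abstracts summarized two ways;
# the actual study used 73 pairs and six readability metrics.
cochrane = [8.9, 9.4, 10.1, 8.7, 9.8]
chatgpt = [11.2, 12.0, 11.5, 10.9, 12.3]

# Paired two-tailed t test: lower FKGL means an easier read, so a negative
# t statistic here would favor the Cochrane summaries.
t_stat, p_value = stats.ttest_rel(cochrane, chatgpt)
print(f"t = {t_stat:.2f}, two-tailed p = {p_value:.4f}")
```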
Affiliation(s)
- Abinav Udaiyar
- Hackensack Meridian School of Medicine, Nutley, NJ
- Andre Ho
- Hackensack Meridian School of Medicine, Nutley, NJ
- Jamie Chen
- Hackensack Meridian School of Medicine, Nutley, NJ
- Charles Palmer
- Hackensack Meridian School of Medicine, Nutley, NJ
- Qilin Cao
- Hackensack Meridian School of Medicine, Nutley, NJ
- Michael Raver
- Hackensack University Medical Center, Hackensack, NJ
3. Baturu M, Solakhan M, Kazaz TG, Bayrak O. Frequently asked questions on erectile dysfunction: evaluating artificial intelligence answers with expert mentorship. Int J Impot Res 2025;37:310-314. PMID: 38714784. DOI: 10.1038/s41443-024-00898-3.
Abstract
The present study assessed the accuracy of artificial intelligence-generated responses to frequently asked questions on erectile dysfunction. A cross-sectional analysis involved 56 erectile dysfunction-related questions searched on Google, categorized into nine sections: causes, diagnosis, treatment options, treatment complications, protective measures, relationship with other illnesses, treatment costs, treatment with herbal agents, and appointments. Responses from ChatGPT 3.5, ChatGPT 4, and BARD were evaluated by two experienced urology experts using the F1 score and global quality score (GQS) for accuracy, relevance, and comprehensibility. ChatGPT 3.5 and ChatGPT 4 achieved higher GQS than BARD in categories such as causes (4.5 ± 0.54, 4.5 ± 0.51, and 3.15 ± 1.01, respectively; p < 0.001), treatment options (4.35 ± 0.6, 4.5 ± 0.43, and 2.71 ± 1.38; p < 0.001), protective measures (5.0 ± 0, 5.0 ± 0, and 4 ± 0.5; p = 0.013), relationships with other illnesses (4.58 ± 0.58, 4.83 ± 0.25, and 3.58 ± 0.8; p = 0.006), and treatment with herbal agents (3 ± 0.61, 3.33 ± 0.83, and 1.8 ± 1.09; p = 0.043). F1 scores in the categories of causes (1.0), diagnosis (0.857), treatment options (0.726), and protective measures (1.0) indicated alignment with the guidelines. There was no significant difference between ChatGPT 3.5 and ChatGPT 4 in answer quality, but both outperformed BARD on the GQS. These results emphasize the need to continually enhance and validate AI-generated medical information, underscoring the importance of artificial intelligence systems in delivering reliable information on erectile dysfunction.
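The F1 score used above is the harmonic mean of precision and recall. A sketch with hypothetical counts, since the paper does not report its underlying true/false-positive tallies against the guidelines:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    # F1 = 2 * precision * recall / (precision + recall)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Invented counts: of 11 guideline-relevant statements produced, 8 were correct,
# 3 were wrong, and 3 guideline points were missed entirely.
print(round(f1_score(tp=8, fp=3, fn=3), 3))
```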
Affiliation(s)
- Muharrem Baturu
- Department of Urology, University of Gaziantep, Gaziantep, Turkey
- Mehmet Solakhan
- Department of Urology, Hasan Kalyoncu University, Gaziantep, Turkey
- Omer Bayrak
- Department of Urology, University of Gaziantep, Gaziantep, Turkey
4. Kianian R, Sun D, Rojas-Carabali W, Agrawal R, Tsui E. Large Language Models May Help Patients Understand Peer-Reviewed Scientific Articles About Ophthalmology: Development and Usability Study. J Med Internet Res 2024;26:e59843. PMID: 39719077. PMCID: PMC11707445. DOI: 10.2196/59843.
Abstract
BACKGROUND Adequate health literacy has been shown to be important for the general health of a population. To address this, it is recommended that patient-targeted medical information be written at a sixth-grade reading level. To make well-informed decisions about their health, patients may want to interact directly with peer-reviewed open access scientific articles. However, studies have shown that such text is often written in highly complex language, above the levels that can be comprehended by the general population. We have previously published on the use of large language models (LLMs) in easing the readability of patient-targeted health information on the internet. In this study, we continue to explore the advantages of LLMs in patient education. OBJECTIVE This study aimed to explore the use of LLMs, specifically ChatGPT (OpenAI), to enhance the readability of peer-reviewed scientific articles in the field of ophthalmology. METHODS A total of 12 open access, peer-reviewed papers published by the senior authors of this study (ET and RA) were selected. Readability was assessed using the Flesch-Kincaid Grade Level and Simple Measure of Gobbledygook (SMOG) tests. ChatGPT 4.0 was asked: "I will give you the text of a peer-reviewed scientific paper. Considering that the recommended readability of the text is 6th grade, can you simplify the following text so that a layperson reading this text can fully comprehend it? - Insert Manuscript Text -". Appropriateness was evaluated by the 2 uveitis-trained ophthalmologists. Statistical analysis was performed in Microsoft Excel. RESULTS ChatGPT significantly reduced the reading level of the selected papers from the 15th to the 7th grade (P<.001) and shortened them, while generating responses that were deemed appropriate by the expert ophthalmologists. CONCLUSIONS LLMs show promise in improving health literacy by enhancing the accessibility of peer-reviewed scientific articles and allowing the general population to interact directly with the medical literature.
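Both readability tests named in the methods have standard open implementations. A sketch using the third-party textstat package as the scorer (an assumption; the study does not name its scoring software, and both passages below are invented rather than drawn from the 12 papers):

```python
import textstat  # third-party package implementing common readability formulas

original = ("Intraocular inflammation was quantified via laser flare photometry, "
            "demonstrating statistically significant attenuation after therapy.")
simplified = ("We measured eye swelling with a special light test. "
              "The swelling went down a lot after treatment.")

# Lower FKGL and SMOG values indicate an easier read; simplification
# should move both scores toward the recommended sixth-grade level.
for label, text in [("original", original), ("simplified", simplified)]:
    print(label,
          "FKGL:", textstat.flesch_kincaid_grade(text),
          "SMOG:", textstat.smog_index(text))
```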
Affiliation(s)
- Reza Kianian
- Stein Eye Institute, Department of Ophthalmology, David Geffen School of Medicine, Los Angeles, CA, United States
- Deyu Sun
- Stein Eye Institute, Department of Ophthalmology, David Geffen School of Medicine, Los Angeles, CA, United States
- William Rojas-Carabali
- Nanyang Technological University, Lee Kong Chian School of Medicine, Singapore, Singapore
- Tan Tock Seng Hospital, National Healthcare Group Eye Institute, Singapore, Singapore
- National Healthcare Group, Programme for Ocular Inflammation & Infection Translational Research, Singapore, Singapore
- Rupesh Agrawal
- Nanyang Technological University, Lee Kong Chian School of Medicine, Singapore, Singapore
- Tan Tock Seng Hospital, National Healthcare Group Eye Institute, Singapore, Singapore
- National Healthcare Group, Programme for Ocular Inflammation & Infection Translational Research, Singapore, Singapore
- Edmund Tsui
- Stein Eye Institute, Department of Ophthalmology, David Geffen School of Medicine, Los Angeles, CA, United States
5. Pompili D, Richa Y, Collins P, Richards H, Hennessey DB. Using artificial intelligence to generate medical literature for urology patients: a comparison of three different large language models. World J Urol 2024;42:455. PMID: 39073590. PMCID: PMC11286728. DOI: 10.1007/s00345-024-05146-3.
Abstract
PURPOSE Large language models (LLMs) are a form of artificial intelligence (AI) that uses deep learning techniques to understand, summarize, and generate content. The potential benefits of LLMs in healthcare are predicted to be immense. The objective of this study was to examine the quality of patient information leaflets (PILs) produced by 3 LLMs on urological topics. METHODS Prompts were created to generate PILs from 3 LLMs, ChatGPT-4, PaLM 2 (Google Bard), and Llama 2 (Meta), across four urology topics (circumcision, nephrectomy, overactive bladder syndrome, and transurethral resection of the prostate [TURP]). PILs were evaluated using a quality assessment checklist, and PIL readability was assessed with the Average Reading Level Consensus Calculator. RESULTS PILs generated by PaLM 2 had the highest overall average quality score (3.58), followed by Llama 2 (3.34) and ChatGPT-4 (3.08). PaLM 2-generated PILs were of the highest quality on all topics except TURP, and PaLM 2 was the only LLM to include images. Medical inaccuracies were present in all generated content, including instances of significant error. Readability analysis identified PaLM 2-generated PILs as the simplest (age 14-15 average reading level) and Llama 2 PILs as the most difficult (age 16-17 average). CONCLUSION While LLMs can generate PILs that may help reduce healthcare professional workload, generated content requires clinician input for accuracy and for the inclusion of health literacy aids, such as images. LLM-generated PILs were above the average reading level for adults, necessitating improvement in LLM algorithms and/or prompt design. Patient satisfaction with LLM-generated PILs remains to be evaluated.
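The Average Reading Level Consensus Calculator presumably aggregates several grade-level formulas into a single estimate. A sketch of that consensus idea using textstat's analogous function (an assumption; this is not the study's tool, and the leaflet text is invented):

```python
import textstat  # assumed proxy; not the calculator the study actually used

pil = ("Overactive bladder means a sudden, strong need to pass urine. "
       "Simple bladder training and cutting down on caffeine often help.")

# text_standard averages several grade-level formulas (FKGL, SMOG, Gunning Fog,
# Coleman-Liau, and others) into one consensus grade, mirroring the
# "average reading level" concept; the study's exact formula set may differ.
print(textstat.text_standard(pil, float_output=True))
```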
Affiliation(s)
- David Pompili
- School of Medicine, University College Cork, Cork, Ireland
- Yasmina Richa
- School of Medicine, University College Cork, Cork, Ireland
- Patrick Collins
- Department of Urology, Mercy University Hospital, Cork, Ireland
- Helen Richards
- School of Medicine, University College Cork, Cork, Ireland
- Department of Clinical Psychology, Mercy University Hospital, Cork, Ireland
- Derek B Hennessey
- School of Medicine, University College Cork, Cork, Ireland
- Department of Urology, Mercy University Hospital, Cork, Ireland
6. Kleebayoon A, Wiwanitkit V. Artificial Intelligence to Patient-Targeted Health Information on Kidney Stone Disease: Comment. J Ren Nutr 2024;34:266. PMID: 38007186. DOI: 10.1053/j.jrn.2023.11.002.
7. Kistler B, Avesani CM, Burrowes JD, Chan M, Cuppari L, Hensley MK, Karupaiah T, Kilates MC, Mafra D, Manley K, Vennegoor M, Wang AYM, Lambert K, Sumida K, Moore LW, Kalantar-Zadeh K, Campbell KL. Dietitians Play a Crucial and Expanding Role in Renal Nutrition and Medical Nutrition Therapy. J Ren Nutr 2024;34:91-94. PMID: 38373524. DOI: 10.1053/j.jrn.2024.02.001.
Affiliation(s)
- Brandon Kistler
- Department of Nutrition Science, Purdue University, West Lafayette, Indiana
- Carla Maria Avesani
- Nephrology Division, Baxter Novum, Department of Clinical Science, Intervention and Technology, Karolinska Institutet, Stockholm, Sweden
- Maria Chan
- The St. George Hospital, Sydney, New South Wales, Australia
- Tilakavati Karupaiah
- School of Biosciences, Faculty of Health & Medical Science, Taylor's University Lakeside Campus, Subang Jaya, Malaysia
- Denise Mafra
- Federal University Fluminense, UFF, Niterói, Brazil
- Marianne Vennegoor
- Retired, Department of Renal Medicine, Guy's and St Thomas' Hospital NHS Foundation Trust, London, United Kingdom
- Angela Yee-Moon Wang
- Department of Medicine, Queen Mary Hospital, The University of Hong Kong, Hong Kong
- Kelly Lambert
- School of Medical, Indigenous and Health Sciences, University of Wollongong, Wollongong, New South Wales, Australia
- Keiichi Sumida
- Division of Nephrology, Department of Medicine, University of Tennessee Health Science Center, Memphis, Tennessee
- Linda W Moore
- Department of Surgery, Houston Methodist Hospital, Houston, Texas
- Kamyar Kalantar-Zadeh
- Department of Epidemiology, UCLA Fielding School of Public Health, Los Angeles, California; Division of Nephrology, Hypertension, and Transplantation, Harbor-UCLA and the Lundquist Institute, Torrance, California
- Katrina L Campbell
- Metro North Hospital and Health Service, Brisbane, Queensland, Australia