1
Tepe M, Emekli E. Decoding medical jargon: The use of AI language models (ChatGPT-4, BARD, Microsoft Copilot) in radiology reports. Patient Educ Couns 2024;126:108307. PMID: 38743965. DOI: 10.1016/j.pec.2024.108307.
Abstract
OBJECTIVE Evaluate Artificial Intelligence (AI) language models (ChatGPT-4, BARD, Microsoft Copilot) in simplifying radiology reports, assessing readability, understandability, actionability, and urgency classification. METHODS This study evaluated the effectiveness of these AI models in translating radiology reports into patient-friendly language and providing understandable and actionable suggestions and urgency classifications. Thirty radiology reports were processed using AI tools, and their outputs were assessed for readability (Flesch Reading Ease, Flesch-Kincaid Grade Level), understandability (PEMAT), and the accuracy of urgency classification. ANOVA and Chi-Square tests were performed to compare the models' performances. RESULTS All three AI models successfully transformed medical jargon into more accessible language, with BARD showing superior readability scores. In terms of understandability, all models achieved scores above 70%, with ChatGPT-4 and BARD leading (p < 0.001, both). However, the AI models varied in accuracy of urgency recommendations, with no significant statistical difference (p = 0.284). CONCLUSION AI language models have proven effective in simplifying radiology reports, thereby potentially improving patient comprehension and engagement in their health decisions. However, their accuracy in assessing the urgency of medical conditions based on radiology reports suggests a need for further refinement. PRACTICE IMPLICATIONS Incorporating AI in radiology communication can empower patients, but further development is crucial for comprehensive and actionable patient support.
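The two readability formulas used in this study are standard and straightforward to reproduce. A minimal Python sketch using the published Flesch constants, assuming the word, sentence, and syllable counts are already available:

```python
# A sketch of the two readability formulas, using the standard published
# constants; real tools also handle sentence and syllable counting.
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade required."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Illustrative counts for a short simplified report
print(round(flesch_reading_ease(120, 8, 180), 1))   # 64.7 (plain English)
print(round(flesch_kincaid_grade(120, 8, 180), 1))  # 8.0 (8th-grade level)
```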
Affiliation(s)
- Murat Tepe: Department of Radiology, King's College Hospital London, Dubai, United Arab Emirates
- Emre Emekli: Department of Radiology, Eskişehir Osmangazi University, Eskişehir, Turkiye; Department of Medical Education, Gazi University, Ankara, Turkiye
2
Tailor PD, Dalvin LA, Chen JJ, Iezzi R, Olsen TW, Scruggs BA, Barkmeier AJ, Bakri SJ, Ryan EH, Tang PH, Parke DW, Belin PJ, Sridhar J, Xu D, Kuriyan AE, Yonekawa Y, Starr MR. A Comparative Study of Responses to Retina Questions from Either Experts, Expert-Edited Large Language Models, or Large Language Models Alone. Ophthalmol Sci 2024;4:100485. PMID: 38660460. PMCID: PMC11041826. DOI: 10.1016/j.xops.2024.100485.
Abstract
Objective To assess the quality, empathy, and safety of expert-edited large language model (LLM), human expert-created, and LLM responses to common retina patient questions. Design Randomized, masked, multicenter study. Participants Twenty-one common retina patient questions were randomly assigned among 13 retina specialists. Methods Each expert created a response (Expert) and then edited an LLM (ChatGPT-4)-generated response to that question (Expert + artificial intelligence [AI]), timing themselves for both tasks. Five LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing, and Bard) also generated responses to each question. The original question, along with anonymized and randomized Expert + AI, Expert, and LLM responses, was evaluated by the other experts who did not write an expert response to the question. Evaluators judged quality and empathy (very poor, poor, acceptable, good, or very good) along with safety metrics (incorrect information, likelihood to cause harm, extent of harm, and missing content). Main Outcome Measures Mean quality and empathy scores, and the proportion of responses with incorrect information, likelihood to cause harm, extent of harm, and missing content for each response type. Results There were 4008 total grades collected (2608 for quality and empathy; 1400 for safety metrics), with significant differences in both quality and empathy (P < 0.001 for both) among the LLM, Expert, and Expert + AI groups. For quality, Expert + AI (3.86 ± 0.85) performed the best overall, while GPT-3.5 (3.75 ± 0.79) was the top-performing LLM. For empathy, GPT-3.5 (3.75 ± 0.69) had the highest mean score, followed by Expert + AI (3.73 ± 0.63). By mean score, Expert placed 4 out of 7 for quality and 6 out of 7 for empathy. For both quality (P < 0.001) and empathy (P < 0.001), expert-edited LLM responses performed better than expert-created responses. There were time savings for an expert-edited LLM response versus an expert-created response (P = 0.02). ChatGPT-4 performed similarly to Expert for inappropriate content (P = 0.35), missing content (P = 0.001), extent of possible harm (P = 0.356), and likelihood of possible harm (P = 0.129). Conclusions In this randomized, masked, multicenter study, LLM responses were comparable with expert responses in terms of quality, empathy, and safety metrics, warranting further exploration of their potential benefits in clinical settings. Financial Disclosures Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of the article.
Affiliation(s)
- John J. Chen: Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- Raymond Iezzi: Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- Sophie J. Bakri: Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- Edwin H. Ryan: Retina Consultants of Minnesota, Edina, Minnesota; Department of Ophthalmology & Visual Neurosciences, University of Minnesota Medical School, Minneapolis, Minnesota
- Peter H. Tang: Retina Consultants of Minnesota, Edina, Minnesota; Department of Ophthalmology & Visual Neurosciences, University of Minnesota Medical School, Minneapolis, Minnesota
- D. Wilkin Parke: Retina Consultants of Minnesota, Edina, Minnesota; Department of Ophthalmology & Visual Neurosciences, University of Minnesota Medical School, Minneapolis, Minnesota
- Jayanth Sridhar: Olive View Medical Center, University of California Los Angeles, Los Angeles, California
- David Xu: Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
- Ajay E. Kuriyan: Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
- Yoshihiro Yonekawa: Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
3
Luo S, Canavese F, Aroojis A, Andreacchio A, Anticevic D, Bouchard M, Castaneda P, De Rosa V, Fiogbe MA, Frick SL, Hui JH, Johari AN, Loro A, Lyu X, Matsushita M, Omeroglu H, Roye DP, Shah MM, Yong B, Li L. Are Generative Pretrained Transformer 4 Responses to Developmental Dysplasia of the Hip Clinical Scenarios Universal? An International Review. J Pediatr Orthop 2024;44:e504-e511. PMID: 38597198. DOI: 10.1097/bpo.0000000000002682.
Abstract
OBJECTIVE There is increasing interest in applying artificial intelligence chatbots like generative pretrained transformer 4 (GPT-4) in the medical field. This study aimed to explore the universality of GPT-4 responses to simulated clinical scenarios of developmental dysplasia of the hip (DDH) across diverse global settings. METHODS Seventeen international experts with more than 15 years of experience in pediatric orthopaedics were selected for the evaluation panel. Eight simulated DDH clinical scenarios were created, covering 4 key areas: (1) initial evaluation and diagnosis, (2) initial examination and treatment, (3) nursing care and follow-up, and (4) prognosis and rehabilitation planning. Each scenario was completed independently in a new GPT-4 session. Interrater reliability was assessed using Fleiss kappa, and the quality, relevance, and applicability of GPT-4 responses were analyzed using median scores and interquartile ranges. Following scoring, experts met in Zoom sessions to generate Regional Consensus Assessment Scores, intended to represent a consistent regional assessment of the use of GPT-4 in pediatric orthopaedic care. RESULTS GPT-4's responses to the 8 clinical DDH scenarios received performance scores ranging from 44.3% to 98.9% of the 88-point maximum. The Fleiss kappa statistic of 0.113 (P = 0.001) indicated low agreement among experts in their ratings. When assessing the responses' quality, relevance, and applicability, the median scores were 3, with interquartile ranges of 3 to 4, 3 to 4, and 2 to 3, respectively. Significant differences were noted in the prognosis and rehabilitation domain scores (P < 0.05 for all). Regional consensus scores were 75 for Africa, 74 for Asia, 73 for India, 80 for Europe, and 65 for North America, with the Kruskal-Wallis test highlighting significant disparities between these regions (P = 0.034). CONCLUSIONS This study demonstrates the promise of GPT-4 in pediatric orthopaedic care, particularly in supporting preliminary DDH assessments and guiding treatment strategies for specialist care. However, effective integration of GPT-4 into clinical practice will require adaptation to specific regional health care contexts, highlighting the importance of a nuanced approach to health technology adaptation. LEVEL OF EVIDENCE Level IV.
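Fleiss kappa, the interrater statistic reported above, extends Cohen's kappa to more than two raters. A minimal sketch, assuming the statsmodels package; the ratings matrix is illustrative, not the study's data:

```python
# Illustrative ratings: rows are rated items (e.g., scenario responses),
# columns are raters, values are assigned categories (e.g., Likert 1-5).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([
    [3, 4, 3, 2, 3],
    [5, 5, 4, 5, 4],
    [2, 3, 2, 2, 1],
    [4, 4, 4, 3, 4],
])

table, _ = aggregate_raters(ratings)         # items x categories count table
print(fleiss_kappa(table, method="fleiss"))  # 1 = perfect agreement, 0 = chance
```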
Affiliation(s)
- Shaoting Luo: Department of Pediatric Orthopaedics, Shengjing Hospital of China Medical University, Shenyang, Liaoning
- Federico Canavese: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- Alaric Aroojis: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- Antonio Andreacchio: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- Darko Anticevic: Pediatric Orthopedics Clinic of Pediatric Surgery and Orthopedics, Pediatric Institute of Southern Switzerland (IPSI), Via Athos Gallino, Bellinzona, Switzerland
- Pablo Castaneda: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- Vincenzo De Rosa: Pediatric Orthopedics Clinic of Pediatric Surgery and Orthopedics, Pediatric Institute of Southern Switzerland (IPSI), Via Athos Gallino, Bellinzona, Switzerland
- Steven L Frick: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- James H Hui: Department of Orthopaedic Surgery, Nagoya University Graduate School of Medicine, Nagoya, Aichi, Japan
- Ashok N Johari: Pediatric Orthopedics Clinic of Pediatric Surgery and Orthopedics, Pediatric Institute of Southern Switzerland (IPSI), Via Athos Gallino, Bellinzona, Switzerland
- Antonio Loro: Ufuk University Faculty of Medicine, Ankara, Turkey
- Xuemin Lyu: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- Masaki Matsushita: Department of Orthopaedic Surgery, Nagoya University Graduate School of Medicine, Nagoya, Aichi, Japan
- David P Roye: Department of Orthopaedic Surgery, Nagoya University Graduate School of Medicine, Nagoya, Aichi, Japan
- Bicheng Yong: Department of Pediatric Orthopaedics, Beit CURE Children's Hospital of Malawi, Chichiri Blantyre, Malawi
- Lianyong Li: Department of Pediatric Orthopaedics, Shengjing Hospital of China Medical University, Shenyang, Liaoning
4
Maywood MJ, Parikh R, Deobhakta A, Begaj T. Performance assessment of an artificial intelligence chatbot in clinical vitreoretinal scenarios. Retina 2024;44:954-964. PMID: 38271674. DOI: 10.1097/iae.0000000000004053.
Abstract
PURPOSE To determine how often ChatGPT is able to provide accurate and comprehensive information regarding clinical vitreoretinal scenarios. To assess the types of sources ChatGPT primarily uses and to determine whether they are hallucinated. METHODS This was a retrospective cross-sectional study. The authors designed 40 open-ended clinical scenarios across four main topics in vitreoretinal disease. Responses were graded on correctness and comprehensiveness by three blinded retina specialists. The primary outcome was the number of clinical scenarios that ChatGPT answered correctly and comprehensively. Secondary outcomes included theoretical harm to patients, the distribution of the type of references used by the chatbot, and the frequency of hallucinated references. RESULTS In June 2023, ChatGPT answered 83% of clinical scenarios (33/40) correctly but provided a comprehensive answer in only 52.5% of cases (21/40). Subgroup analysis demonstrated an average correct score of 86.7% in neovascular age-related macular degeneration, 100% in diabetic retinopathy, 76.7% in retinal vascular disease, and 70% in the surgical domain. There were six incorrect responses with one case (16.7%) of no harm, three cases (50%) of possible harm, and two cases (33.3%) of definitive harm. CONCLUSION ChatGPT correctly answered more than 80% of complex open-ended vitreoretinal clinical scenarios, with a reduced capability to provide a comprehensive response.
Affiliation(s)
- Michael J Maywood: Department of Ophthalmology, Corewell Health William Beaumont University Hospital, Royal Oak, Michigan
- Ravi Parikh: Manhattan Retina and Eye Consultants, New York, New York; Department of Ophthalmology, New York University School of Medicine, New York, New York
- Tedi Begaj: Department of Ophthalmology, Corewell Health William Beaumont University Hospital, Royal Oak, Michigan; Associated Retinal Consultants, Royal Oak, Michigan
5
Tailor PD, D'Souza HS, Li H, Starr MR. Vision of the future: large language models in ophthalmology. Curr Opin Ophthalmol 2024. PMID: 38814572. DOI: 10.1097/icu.0000000000001062. Online ahead of print.
Abstract
PURPOSE OF REVIEW Large language models (LLMs) are rapidly entering the landscape of medicine in areas from patient interaction to clinical decision-making. This review discusses the evolving role of LLMs in ophthalmology, focusing on their current applications and future potential in enhancing ophthalmic care. RECENT FINDINGS LLMs in ophthalmology have demonstrated potential in improving patient communication and aiding preliminary diagnostics because of their ability to process complex language and generate human-like domain-specific interactions. However, some studies have shown potential for harm and there have been no prospective real-world studies evaluating the safety and efficacy of LLMs in practice. SUMMARY While current applications are largely theoretical and require rigorous safety testing before implementation, LLMs exhibit promise in augmenting patient care quality and efficiency. Challenges such as data privacy and user acceptance must be overcome before LLMs can be fully integrated into clinical practice.
Affiliation(s)
- Haley S D'Souza: Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- Hanzhou Li: Department of Radiology, Emory University, Atlanta, Georgia, USA
- Matthew R Starr: Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
6
Gomez-Cabello CA, Borna S, Pressman SM, Haider SA, Sehgal A, Leibovich BC, Forte AJ. Artificial Intelligence in Postoperative Care: Assessing Large Language Models for Patient Recommendations in Plastic Surgery. Healthcare (Basel) 2024;12:1083. PMID: 38891158. PMCID: PMC11171524. DOI: 10.3390/healthcare12111083.
Abstract
Since their release, the medical community has been actively exploring large language models' (LLMs) capabilities, which show promise in providing accurate medical knowledge. One potential application is as a patient resource. This study analyzes and compares the ability of the currently available LLMs, ChatGPT-3.5, GPT-4, and Gemini, to provide postoperative care recommendations to plastic surgery patients. We presented each model with 32 questions addressing common patient concerns after surgical cosmetic procedures and evaluated the medical accuracy, readability, understandability, and actionability of the models' responses. The three LLMs provided equally accurate information, with GPT-3.5 averaging the highest on the Likert scale (LS) (4.18 ± 0.93) (p = 0.849), while Gemini provided significantly more readable (p = 0.001) and understandable responses (p = 0.014; p = 0.001). There was no difference in the actionability of the models' responses (p = 0.830). Although LLMs have shown their potential as adjunctive tools in postoperative patient care, further refinement and research are imperative to enable their evolution into comprehensive standalone resources.
Affiliation(s)
- Sahar Borna: Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Syed Ali Haider: Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Ajai Sehgal: Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
- Bradley C. Leibovich: Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA; Department of Urology, Mayo Clinic, Rochester, MN 55905, USA
- Antonio J. Forte: Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA; Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
7
Xu P, Chen X, Zhao Z, Shi D. Unveiling the clinical incapabilities: a benchmarking study of GPT-4V(ision) for ophthalmic multimodal image analysis. Br J Ophthalmol 2024. PMID: 38789133. DOI: 10.1136/bjo-2023-325054. Online ahead of print.
Abstract
PURPOSE To evaluate the capabilities and incapabilities of a GPT-4V(ision)-based chatbot in interpreting ocular multimodal images. METHODS We developed a digital ophthalmologist app using GPT-4V and evaluated its performance with a dataset (60 images, 60 ophthalmic conditions, 6 modalities) that included slit-lamp, scanning laser ophthalmoscopy, fundus photography of the posterior pole (FPP), optical coherence tomography, fundus fluorescein angiography and ocular ultrasound images. The chatbot was tested with ten open-ended questions per image, covering examination identification, lesion detection, diagnosis and decision support. The responses were manually assessed for accuracy, usability, safety and diagnosis repeatability. Auto-evaluation was performed using sentence similarity and GPT-4-based auto-evaluation. RESULTS Out of 600 responses, 30.6% were accurate, 21.5% were highly usable and 55.6% were deemed to pose no harm. GPT-4V performed best with slit-lamp images, with 42.0%, 38.5% and 68.5% of the responses rated accurate, highly usable and no harm, respectively. However, its performance was weaker on FPP images, with only 13.7%, 3.7% and 38.5% in the same categories. GPT-4V correctly identified 95.6% of the imaging modalities and showed varying accuracies in lesion identification (25.6%), diagnosis (16.1%) and decision support (24.0%). The overall repeatability of GPT-4V in diagnosing ocular images was 63.3% (38/60). The overall sentence similarity between responses generated by GPT-4V and human answers was 55.5%, with Spearman correlations of 0.569 for accuracy and 0.576 for usability. CONCLUSION GPT-4V is not yet suitable for clinical decision-making in ophthalmology. Our study serves as a benchmark for enhancing ophthalmic multimodal models.
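The sentence-similarity auto-evaluation described above can be approximated with embedding cosine similarity. A minimal sketch, assuming the sentence-transformers package; the model name and example texts are illustrative assumptions, not the authors' exact setup:

```python
# Assumes the sentence-transformers package; model and texts are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
gpt4v_answer = "The image shows optic disc swelling consistent with papilledema."
human_answer = "Fundus photograph demonstrating papilledema with blurred disc margins."

embeddings = model.encode([gpt4v_answer, human_answer], convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]).item())  # cosine similarity
```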
Affiliation(s)
- Pusheng Xu: School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China
- Xiaolan Chen: School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China
- Ziwei Zhao: School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China
- Danli Shi: School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China; Research Centre for SHARP Vision, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China; Centre for Eye and Vision Research (CEVR), 17W Hong Kong Science Park, Hong Kong
8
Gül Ş, Erdemir İ, Hanci V, Aydoğmuş E, Erkoç YS. How artificial intelligence can provide information about subdural hematoma: Assessment of readability, reliability, and quality of ChatGPT, BARD, and Perplexity responses. Medicine (Baltimore) 2024;103:e38009. PMID: 38701313. PMCID: PMC11062651. DOI: 10.1097/md.0000000000038009.
Abstract
Subdural hematoma is defined as a collection of blood in the subdural space between the dura mater and the arachnoid. Subdural hematoma is a condition that neurosurgeons frequently encounter and has acute, subacute and chronic forms. The incidence in adults is reported to be 1.72 to 20.60 per 100,000 people annually. Our study aimed to evaluate the quality, reliability and readability of the answers to questions asked of ChatGPT, Bard, and Perplexity about "Subdural Hematoma." In this observational and cross-sectional study, we asked ChatGPT, Bard, and Perplexity to provide the 100 most frequently asked questions about "Subdural Hematoma" separately. The responses of the three chatbots were analyzed separately for readability, quality, reliability and adequacy. When the median readability scores of the ChatGPT, Bard, and Perplexity answers were compared with the sixth-grade reading level, a statistically significant difference was observed for all formulas (P < .001). All three chatbots' responses were found to be difficult to read. Bard responses were more readable than ChatGPT's (P < .001) and Perplexity's (P < .001) responses for all scores evaluated. Although there were differences between the results of the readability formulas evaluated, Perplexity's answers were more readable than ChatGPT's (P < .05). Bard answers had the best Global Quality Scale (GQS) scores (P < .001). Perplexity responses had the best Journal of the American Medical Association (JAMA) benchmark and modified DISCERN scores (P < .001). The current capabilities of ChatGPT, Bard, and Perplexity are inadequate in terms of the quality and readability of "Subdural Hematoma"-related text content. The readability standard for patient education materials, as determined by the American Medical Association, National Institutes of Health, and the United States Department of Health and Human Services, is at or below grade 6. The readability levels of the responses of artificial intelligence applications such as ChatGPT, Bard, and Perplexity are significantly higher than the recommended sixth-grade level.
Affiliation(s)
- Şanser Gül: Department of Neurosurgery, Ankara Ataturk Sanatory Education and Research Hospital, Ankara, Turkey
- İsmail Erdemir: Department of Anesthesiology and Critical Care, Faculty of Medicine, Dokuz Eylül University, Izmir, Turkey
- Volkan Hanci: Department of Anesthesiology and Reanimation, Ankara Sincan Education and Research Hospital, Ankara, Turkey
- Evren Aydoğmuş: Department of Neurosurgery, Istanbul Kartal Dr Lütfi Kırdar City Hospital, Istanbul, Turkey
- Yavuz Selim Erkoç: Department of Neurosurgery, Ankara Ataturk Sanatory Education and Research Hospital, Ankara, Turkey
9
Chen X, Zhang W, Xu P, Zhao Z, Zheng Y, Shi D, He M. FFA-GPT: an automated pipeline for fundus fluorescein angiography interpretation and question-answer. NPJ Digit Med 2024;7:111. PMID: 38702471. PMCID: PMC11068733. DOI: 10.1038/s41746-024-01101-z.
Abstract
Fundus fluorescein angiography (FFA) is a crucial diagnostic tool for chorioretinal diseases, but its interpretation requires significant expertise and time. Prior studies have used Artificial Intelligence (AI)-based systems to assist FFA interpretation, but these systems lack user interaction and comprehensive evaluation by ophthalmologists. Here, we used large language models (LLMs) to develop an automated interpretation pipeline for both report generation and medical question-answering (QA) for FFA images. The pipeline comprises two parts: an image-text alignment module (Bootstrapping Language-Image Pre-training) for report generation and an LLM (Llama 2) for interactive QA. The model was developed using 654,343 FFA images with 9392 reports. It was evaluated both automatically, using language-based and classification-based metrics, and manually by three experienced ophthalmologists. The automatic evaluation of the generated reports demonstrated that the system can generate coherent and comprehensible free-text reports, achieving a BERTScore of 0.70 and F1 scores ranging from 0.64 to 0.82 for detecting top-5 retinal conditions. The manual evaluation revealed acceptable accuracy (68.3%, Kappa 0.746) and completeness (62.3%, Kappa 0.739) of the generated reports. The generated free-form answers were evaluated manually, with the majority meeting the ophthalmologists' criteria (error-free: 70.7%, complete: 84.0%, harmless: 93.7%, satisfied: 65.3%, Kappa: 0.762-0.834). This study introduces an innovative framework that combines multi-modal transformers and LLMs, enhancing ophthalmic image interpretation, and facilitating interactive communications during medical consultation.
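BERTScore, one of the language-based metrics reported above, compares generated and reference texts via contextual embeddings and yields precision, recall, and F1. A minimal sketch, assuming the open-source bert-score package; the report texts are illustrative placeholders:

```python
# Assumes the open-source bert-score package; the texts are placeholders.
from bert_score import score

generated = ["FFA shows leakage consistent with choroidal neovascularization."]
reference = ["Angiography demonstrates leakage suggestive of CNV."]

P, R, F1 = score(generated, reference, lang="en")
print(f"BERTScore F1: {F1.mean().item():.2f}")
```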
Affiliation(s)
- Xiaolan Chen: School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
- Weiyi Zhang: School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
- Pusheng Xu: State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou, China
- Ziwei Zhao: School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
- Yingfeng Zheng: State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou, China
- Danli Shi: School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China; Research Centre for SHARP Vision (RCSV), The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
- Mingguang He: School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China; Research Centre for SHARP Vision (RCSV), The Hong Kong Polytechnic University, Kowloon, Hong Kong, China; Centre for Eye and Vision Research (CEVR), 17W Hong Kong Science Park, Hong Kong, China
10
Ghanem YK, Rouhi AD, Al-Houssan A, Saleh Z, Moccia MC, Joshi H, Dumon KR, Hong Y, Spitz F, Joshi AR, Kwiatt M. Dr. Google to Dr. ChatGPT: assessing the content and quality of artificial intelligence-generated medical information on appendicitis. Surg Endosc 2024;38:2887-2893. PMID: 38443499. PMCID: PMC11078845. DOI: 10.1007/s00464-024-10739-5.
Abstract
INTRODUCTION Generative artificial intelligence (AI) chatbots have recently been posited as potential sources of online medical information for patients making medical decisions. Existing online patient-oriented medical information has repeatedly been shown to be of variable quality and difficult readability. Therefore, we sought to evaluate the content and quality of AI-generated medical information on acute appendicitis. METHODS A modified DISCERN assessment tool, comprising 16 distinct criteria each scored on a 5-point Likert scale (score range 16-80), was used to assess AI-generated content. Readability was determined using the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) scores. Four popular chatbots (ChatGPT-3.5, ChatGPT-4, Bard, and Claude-2) were prompted to generate medical information about appendicitis. Three investigators independently scored the generated texts, blinded to the identity of the AI platforms. RESULTS ChatGPT-3.5, ChatGPT-4, Bard, and Claude-2 had overall mean (SD) quality scores of 60.7 (1.2), 62.0 (1.0), 62.3 (1.2), and 51.3 (2.3), respectively, on a scale of 16-80. Inter-rater reliability was 0.81, 0.75, 0.81, and 0.72, respectively, indicating substantial agreement. Claude-2 demonstrated a significantly lower mean quality score compared with ChatGPT-4 (p = 0.001), ChatGPT-3.5 (p = 0.005), and Bard (p = 0.001). Bard was the only AI platform that listed verifiable sources, while Claude-2 provided fabricated sources. All chatbots except for Claude-2 advised readers to consult a physician if experiencing symptoms. Regarding readability, the FKGL and FRE scores of ChatGPT-3.5, ChatGPT-4, Bard, and Claude-2 were 14.6 and 23.8, 11.9 and 33.9, 8.6 and 52.8, and 11.0 and 36.6, respectively, indicating difficult readability at a college reading skill level. CONCLUSION AI-generated medical information on appendicitis scored favorably upon quality assessment, but most platforms either fabricated sources or did not provide any altogether. Additionally, overall readability far exceeded recommended levels for the public. Generative AI platforms demonstrate measured potential for patient education and engagement about appendicitis.
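The modified DISCERN instrument described above is a simple additive scale: 16 criteria, each rated 1 to 5, yielding a total between 16 and 80. A worked sketch with illustrative item scores:

```python
# One rater's illustrative item scores for the 16-criterion modified DISCERN.
item_scores = [4, 3, 5, 4, 4, 3, 4, 5, 3, 4, 4, 3, 4, 4, 5, 3]

assert len(item_scores) == 16 and all(1 <= s <= 5 for s in item_scores)
total = sum(item_scores)  # possible range: 16 (all 1s) to 80 (all 5s)
print(total)              # 62, on the same scale as the means reported above
```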
Affiliation(s)
- Yazid K Ghanem: Department of Surgery, Cooper University Hospital, 3 Cooper Plaza, Suite 411, Camden, NJ, 08103, USA; Cooper Medical School of Rowan University, Camden, NJ, USA
- Armaun D Rouhi: Department of Surgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Ammr Al-Houssan: Department of Surgery, University of Connecticut, Hartford, CT, USA
- Zena Saleh: Department of Surgery, Cooper University Hospital, 3 Cooper Plaza, Suite 411, Camden, NJ, 08103, USA
- Matthew C Moccia: Department of Surgery, Cooper University Hospital, 3 Cooper Plaza, Suite 411, Camden, NJ, 08103, USA
- Hansa Joshi: Department of Surgery, Cooper University Hospital, 3 Cooper Plaza, Suite 411, Camden, NJ, 08103, USA
- Kristoffel R Dumon: Department of Surgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Young Hong: Department of Surgery, Cooper University Hospital, 3 Cooper Plaza, Suite 411, Camden, NJ, 08103, USA; Cooper Medical School of Rowan University, Camden, NJ, USA
- Francis Spitz: Department of Surgery, Cooper University Hospital, 3 Cooper Plaza, Suite 411, Camden, NJ, 08103, USA; Cooper Medical School of Rowan University, Camden, NJ, USA
- Amit R Joshi: Department of Surgery, Cooper University Hospital, 3 Cooper Plaza, Suite 411, Camden, NJ, 08103, USA; Cooper Medical School of Rowan University, Camden, NJ, USA
- Michael Kwiatt: Department of Surgery, Cooper University Hospital, 3 Cooper Plaza, Suite 411, Camden, NJ, 08103, USA; Cooper Medical School of Rowan University, Camden, NJ, USA
11
Momenaei B, Mansour HA, Kuriyan AE, Xu D, Sridhar J, Ting DSW, Yonekawa Y. ChatGPT enters the room: what it means for patient counseling, physician education, academics, and disease management. Curr Opin Ophthalmol 2024;35:205-209. PMID: 38334288. DOI: 10.1097/icu.0000000000001036.
Abstract
PURPOSE OF REVIEW This review seeks to provide a summary of the most recent research findings regarding the utilization of ChatGPT, an artificial intelligence (AI)-powered chatbot, in the field of ophthalmology in addition to exploring the limitations and ethical considerations associated with its application. RECENT FINDINGS ChatGPT has gained widespread recognition and demonstrated potential in enhancing patient and physician education, boosting research productivity, and streamlining administrative tasks. In various studies examining its utility in ophthalmology, ChatGPT has exhibited fair to good accuracy, with its most recent iteration showcasing superior performance in providing ophthalmic recommendations across various ophthalmic disorders such as corneal diseases, orbital disorders, vitreoretinal diseases, uveitis, neuro-ophthalmology, and glaucoma. This proves beneficial for patients in accessing information and aids physicians in triaging as well as formulating differential diagnoses. Despite such benefits, ChatGPT has limitations that require acknowledgment including the potential risk of offering inaccurate or harmful information, dependence on outdated data, the necessity for a high level of education for data comprehension, and concerns regarding patient privacy and ethical considerations within the research domain. SUMMARY ChatGPT is a promising new tool that could contribute to ophthalmic healthcare education and research, potentially reducing work burdens. However, its current limitations necessitate a complementary role with human expert oversight.
Affiliation(s)
- Bita Momenaei: Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
- Hana A Mansour: Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
- Ajay E Kuriyan: Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
- David Xu: Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
- Jayanth Sridhar: University of California Los Angeles, Los Angeles, California, USA
- Yoshihiro Yonekawa: Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
12
Biswas S, Davies LN, Sheppard AL, Logan NS, Wolffsohn JS. Utility of artificial intelligence-based large language models in ophthalmic care. Ophthalmic Physiol Opt 2024;44:641-671. PMID: 38404172. DOI: 10.1111/opo.13284.
Abstract
PURPOSE With the introduction of ChatGPT, artificial intelligence (AI)-based large language models (LLMs) are rapidly becoming popular within the scientific community. They use natural language processing to generate human-like responses to queries. However, the application of LLMs and the comparison of their abilities with those of their human counterparts in ophthalmic care remain under-reported. RECENT FINDINGS Hitherto, studies in eye care have demonstrated the utility of ChatGPT in generating patient information, supporting clinical diagnosis and passing ophthalmology question-based examinations, among others. LLMs' performance (median accuracy, %) is influenced by factors such as the iteration, prompts utilised and the domain. Human experts (86%) demonstrated the highest proficiency in disease diagnosis, while ChatGPT-4 outperformed others in ophthalmology examinations (75.9%), symptom triaging (98%) and providing information and answering questions (84.6%). LLMs exhibited superior performance in general ophthalmology but reduced accuracy in ophthalmic subspecialties. Although AI-based LLMs like ChatGPT are deemed more efficient than their human counterparts, these AIs are constrained by their nonspecific and outdated training, lack of access to current knowledge, generation of plausible-sounding 'fake' responses or hallucinations, inability to process images, lack of critical literature analysis, and ethical and copyright issues. A comprehensive evaluation of recently published studies is crucial to deepen understanding of LLMs and the potential of these AI-based LLMs. SUMMARY Ophthalmic care professionals should undertake a conservative approach when using AI, as human judgement remains essential for clinical decision-making and monitoring the accuracy of information. This review identified the ophthalmic applications and potential usages that need further exploration. With the advancement of LLMs, setting standards for benchmarking and promoting best practices is crucial. Potential clinical deployment requires the evaluation of these LLMs to move away from artificial settings, delve into clinical trials and determine their usefulness in the real world.
Affiliation(s)
- Sayantan Biswas: School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
- Leon N Davies: School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
- Amy L Sheppard: School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
- Nicola S Logan: School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
- James S Wolffsohn: School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
13
Al-Sharif EM, Penteado RC, Dib El Jalbout N, Topilow NJ, Shoji MK, Kikkawa DO, Liu CY, Korn BS. Evaluating the Accuracy of ChatGPT and Google BARD in Fielding Oculoplastic Patient Queries: A Comparative Study on Artificial versus Human Intelligence. Ophthalmic Plast Reconstr Surg 2024;40:303-311. PMID: 38215452. DOI: 10.1097/iop.0000000000002567.
Abstract
PURPOSE This study evaluates and compares the accuracy of responses from 2 artificial intelligence platforms to patients' oculoplastics-related questions. METHODS Questions directed toward oculoplastic surgeons were collected, rephrased, and input independently into the ChatGPT-3.5 and BARD chatbots, using the prompt: "As an oculoplastic surgeon, how can I respond to my patient's question?" Responses were independently evaluated by 4 experienced oculoplastic specialists as comprehensive, correct but inadequate, mixed correct and incorrect/outdated data, or completely incorrect. Additionally, the empathy level, length, and automated readability index of the responses were assessed. RESULTS A total of 112 patient questions underwent evaluation. The rates of comprehensive, correct but inadequate, mixed, and completely incorrect answers for ChatGPT were 71.4%, 12.9%, 10.5%, and 5.1%, respectively, compared with 53.1%, 18.3%, 18.1%, and 10.5%, respectively, for BARD. ChatGPT showed more empathy (48.9%) than BARD (13.2%). All graders found that ChatGPT outperformed BARD in the question categories of postoperative healing, medical eye conditions, and medications. Categorizing questions by anatomy, ChatGPT excelled in answering lacrimal questions (83.8%), while BARD performed best in the eyelid group (60.4%). ChatGPT's answers were longer and potentially more challenging to comprehend than BARD's. CONCLUSION This study emphasizes the promising role of artificial intelligence-powered chatbots in oculoplastic patient education and support. With continued development, these chatbots may potentially assist physicians and offer patients accurate information, ultimately contributing to improved patient care while alleviating surgeon burnout. However, it is crucial to highlight that artificial intelligence may be good at answering questions, but physician oversight remains essential to ensure the highest standard of care and address complex medical cases.
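The automated readability index mentioned above is a character-based formula. A minimal sketch using the standard published constants; the counts are illustrative:

```python
# Standard ARI constants; character, word, and sentence counts are illustrative.
def automated_readability_index(chars: int, words: int, sentences: int) -> float:
    """ARI approximates the US grade level needed to read the text."""
    return 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43

print(round(automated_readability_index(600, 120, 8), 1))  # 9.6
```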
Affiliation(s)
- Eman M Al-Sharif: Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A.; Clinical Sciences Department, College of Medicine, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia
- Rafaella C Penteado: Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A.
- Nahia Dib El Jalbout: Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A.
- Nicole J Topilow: Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A.
- Marissa K Shoji: Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A.
- Don O Kikkawa: Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A.; Division of Plastic and Reconstructive Surgery, Department of Surgery, UC San Diego School of Medicine, La Jolla, California, U.S.A.
- Catherine Y Liu: Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A.
- Bobby S Korn: Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A.; Division of Plastic and Reconstructive Surgery, Department of Surgery, UC San Diego School of Medicine, La Jolla, California, U.S.A.
14
Doğan L, Özçakmakcı GB, Yılmaz İE. The Performance of Chatbots and the AAPOS Website as a Tool for Amblyopia Education. J Pediatr Ophthalmol Strabismus 2024:1-7. PMID: 38661309. DOI: 10.3928/01913913-20240409-01. Online ahead of print.
Abstract
PURPOSE To evaluate the understandability, actionability, and readability of responses provided by the website of the American Association for Pediatric Ophthalmology and Strabismus (AAPOS), ChatGPT-3.5, Bard, and Bing Chat about amblyopia, and the appropriateness of the responses generated by the chatbots. METHODS Twenty-five questions provided by the AAPOS website were directed three times to fresh ChatGPT-3.5, Bard, and Bing Chat interfaces. Two experienced pediatric ophthalmologists categorized the responses of the chatbots in terms of their appropriateness. The Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), and Coleman-Liau Index (CLI) were used to evaluate the readability of the responses of the AAPOS website and the chatbots. Furthermore, understandability scores were evaluated using the Patient Education Materials Assessment Tool (PEMAT). RESULTS The appropriateness of the chatbots' responses was 84.0% for ChatGPT-3.5 and Bard and 80% for Bing Chat (P > .05). For understandability (mean PEMAT-U score: AAPOS website 81.5%, Bard 77.6%, ChatGPT-3.5 76.1%, and Bing Chat 71.5%; P < .05) and actionability (mean PEMAT-A score: AAPOS website 74.6%, Bard 69.2%, ChatGPT-3.5 67.8%, and Bing Chat 64.8%; P < .05), the AAPOS website scored better than the chatbots. Three readability analyses showed that Bard had the highest mean score, followed by the AAPOS website, Bing Chat, and ChatGPT-3.5, and these scores were more challenging than the recommended level. CONCLUSIONS Chatbots have the potential to provide detailed and appropriate responses at acceptable levels. The AAPOS website has the advantage of providing information that is more understandable and actionable. The AAPOS website and the chatbots, especially ChatGPT, provided difficult-to-read material for patient education regarding amblyopia. [J Pediatr Ophthalmol Strabismus. 20XX;X(X):XXX-XXX.]
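PEMAT scoring, as used above, expresses the share of applicable checklist items rated "Agree" as a percentage. A minimal sketch with illustrative ratings:

```python
# Illustrative item ratings: 1 = Agree, 0 = Disagree, None = Not Applicable.
def pemat_score(ratings):
    """PEMAT score: percentage of applicable items rated Agree."""
    applicable = [r for r in ratings if r is not None]
    return 100 * sum(applicable) / len(applicable)

# Nine of eleven applicable items rated Agree -> ~81.8%
print(round(pemat_score([1, 1, 0, 1, None, 1, 1, 0, 1, None, 1, 1, 1]), 1))
```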
15
Lam MR, Manion GN, Young BK. Search engine optimization and its association with readability and accessibility of diabetic retinopathy websites. Graefes Arch Clin Exp Ophthalmol 2024. PMID: 38639789. DOI: 10.1007/s00417-024-06472-3. Online ahead of print.
Abstract
PURPOSE This study investigated whether websites regarding diabetic retinopathy are readable for patients and adequately designed to be found by search engines. METHODS The term "diabetic retinopathy" was queried in the Google search engine. Patient-oriented websites from the first 10 pages were categorized by search result page number and website organization type. Metrics of search engine optimization (SEO) and readability were then calculated. RESULTS Among the 71 sites meeting inclusion criteria, informational and organizational sites were best optimized for search engines, and informational sites were the most visited. Better optimization as measured by authority score was correlated with lower Flesch-Kincaid Grade Level (r = 0.267, P = 0.024). There was a significant increase in Flesch-Kincaid Grade Level with successive search result pages (r = 0.275, P = 0.020). Only 2 sites met the AMA-recommended sixth-grade reading level by Flesch-Kincaid Grade Level; the average reading level was 10.5. There was no significant difference in readability between website categories. CONCLUSION While the readability of diabetic retinopathy patient information was poor, better readability was correlated with better SEO metrics. While we cannot assess causality, we recommend that websites improve their readability, which may increase uptake of their resources.
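The readability-versus-SEO association above is a simple bivariate correlation. A minimal sketch, assuming SciPy; the paired values are illustrative placeholders, not the study's measurements:

```python
# Assumes SciPy; the paired values are illustrative, not the study's data.
from scipy.stats import pearsonr

authority_score = [55, 62, 40, 71, 48, 66, 35, 58]  # SEO authority per site
fk_grade_level = [9.8, 9.1, 12.0, 8.7, 11.2, 9.5, 12.8, 10.1]

r, p = pearsonr(authority_score, fk_grade_level)
print(f"r = {r:.3f}, P = {p:.3f}")
```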
Affiliation(s)
- Matthew R Lam: Creighton University School of Medicine-Phoenix Regional Campus, Phoenix, AZ, USA
- Benjamin K Young: Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, Portland, OR, USA
16
King RC, Samaan JS, Yeo YH, Peng Y, Kunkel DC, Habib AA, Ghashghaei R. A Multidisciplinary Assessment of ChatGPT's Knowledge of Amyloidosis: Observational Study. JMIR Cardio 2024;8:e53421. PMID: 38640472. PMCID: PMC11069089. DOI: 10.2196/53421.
Abstract
BACKGROUND Amyloidosis, a rare multisystem condition, often requires complex, multidisciplinary care. Its low prevalence underscores the importance of efforts to ensure the availability of high-quality patient education materials for better outcomes. ChatGPT (OpenAI) is a large language model powered by artificial intelligence that offers a potential avenue for disseminating accurate, reliable, and accessible educational resources for both patients and providers. Its user-friendly interface, engaging conversational responses, and the capability for users to ask follow-up questions make it a promising future tool in delivering accurate and tailored information to patients. OBJECTIVE We performed a multidisciplinary assessment of the accuracy, reproducibility, and readability of ChatGPT in answering questions related to amyloidosis. METHODS In total, 98 amyloidosis questions related to cardiology, gastroenterology, and neurology were curated from medical societies, institutions, and amyloidosis Facebook support groups and input into ChatGPT-3.5 and ChatGPT-4. Cardiology- and gastroenterology-related responses were independently graded by a board-certified cardiologist and gastroenterologist, respectively, who specialize in amyloidosis. These 2 reviewers (RG and DCK) also graded general questions, with disagreements resolved by discussion. Neurology-related responses were graded by a board-certified neurologist (AAH) who specializes in amyloidosis. Reviewers used the following grading scale: (1) comprehensive, (2) correct but inadequate, (3) some correct and some incorrect, and (4) completely incorrect. Questions were stratified by category for further analysis. Reproducibility was assessed by inputting each question twice into each model. The readability of ChatGPT-4 responses was also evaluated using the Textstat library in Python (Python Software Foundation) and the Textstat readability package in R software (R Foundation for Statistical Computing). RESULTS ChatGPT-4 (n=98) provided 93 (95%) responses with accurate information, and 82 (84%) were comprehensive. ChatGPT-3.5 (n=83) provided 74 (89%) responses with accurate information, and 66 (79%) were comprehensive. When examined by question category, ChatGPT-4 and ChatGPT-3.5 provided 53 (95%) and 48 (86%) comprehensive responses, respectively, to "general questions" (n=56). When examined by subject, ChatGPT-4 and ChatGPT-3.5 performed best in response to cardiology questions (n=12), with both models producing 10 (83%) comprehensive responses. For gastroenterology (n=15), ChatGPT-4 received comprehensive grades for 9 (60%) responses, and ChatGPT-3.5 for 8 (53%). Overall, 96 of 98 (98%) responses for ChatGPT-4 and 73 of 83 (88%) for ChatGPT-3.5 were reproducible. The readability of ChatGPT-4's responses ranged from 10th grade to beyond graduate US grade levels, with an average of 15.5 (SD 1.9). CONCLUSIONS Large language models are a promising tool for providing accurate and reliable health information for patients living with amyloidosis. However, ChatGPT's responses exceeded the American Medical Association's recommended fifth- to sixth-grade reading level. Future studies focusing on improving response accuracy and readability are warranted. Prior to widespread implementation, the technology's limitations and ethical implications must be further explored to ensure patient safety and equitable implementation.
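The Methods name the Python Textstat library, which wraps the common readability formulas. A minimal sketch of how a response might be graded with it; the sample text is illustrative:

```python
# Assumes the Python textstat library named in the Methods; text is illustrative.
import textstat

response = (
    "Amyloidosis is a condition in which abnormal proteins build up "
    "in organs such as the heart, nerves, and digestive tract."
)

print(textstat.flesch_kincaid_grade(response))  # approximate US grade level
print(textstat.flesch_reading_ease(response))   # higher = easier to read
```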
Affiliation(s)
- Ryan C King: Division of Cardiology, Department of Medicine, University of California, Irvine Medical Center, Orange, CA, United States
- Jamil S Samaan: Karsh Division of Gastroenterology and Hepatology, Department of Medicine, Cedars-Sinai Medical Center, Los Angeles, CA, United States
- Yee Hui Yeo: Karsh Division of Gastroenterology and Hepatology, Department of Medicine, Cedars-Sinai Medical Center, Los Angeles, CA, United States
- Yuxin Peng: School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China
- David C Kunkel: GI Motility and Physiology Program, Division of Gastroenterology, University of California, San Diego, La Jolla, CA, United States
- Ali A Habib: Division of Neurology, University of California, Irvine Medical Center, Orange, CA, United States
- Roxana Ghashghaei: Division of Cardiology, Department of Medicine, University of California, Irvine Medical Center, Orange, CA, United States
17
Şahin MF, Ateş H, Keleş A, Özcan R, Doğan Ç, Akgül M, Yazıcı CM. Responses of Five Different Artificial Intelligence Chatbots to the Top Searched Queries About Erectile Dysfunction: A Comparative Analysis. J Med Syst 2024;48:38. PMID: 38568432. PMCID: PMC10990980. DOI: 10.1007/s10916-024-02056-0.
Abstract
The aim of this study was to evaluate and compare the quality and readability of responses generated by five different artificial intelligence (AI) chatbots (ChatGPT, Bard, Bing, Ernie, and Copilot) to the top searched queries about erectile dysfunction (ED). Google Trends was used to identify relevant ED-related phrases. Each AI chatbot received a specific sequence of 25 frequently searched terms as input. Responses were evaluated using the DISCERN, Ensuring Quality Information for Patients (EQIP), and Flesch-Kincaid Grade Level (FKGL) and Reading Ease (FKRE) metrics. The top three most frequently searched phrases were "erectile dysfunction cause", "how to erectile dysfunction," and "erectile dysfunction treatment." Zimbabwe, Zambia, and Ghana exhibited the highest level of interest in ED. None of the AI chatbots achieved the necessary degree of readability. However, Bard exhibited significantly higher FKRE and FKGL ratings (p = 0.001), and Copilot achieved better EQIP and DISCERN ratings than the other chatbots (p = 0.001). Bard exhibited the simplest linguistic framework and posed the least challenge in terms of readability and comprehension, and Copilot's text quality on ED was superior to that of the other chatbots. As new chatbots are introduced, their understandability and text quality increase, providing better guidance to patients.
Affiliation(s)
- Mehmet Fatih Şahin: Faculty of Medicine, Department of Urology, Tekirdağ Namık Kemal University, Süleymanpaşa, Tekirdağ, 59020, Turkey
- Hüseyin Ateş: Faculty of Medicine, Department of Urology, Tekirdağ Namık Kemal University, Süleymanpaşa, Tekirdağ, 59020, Turkey
- Anıl Keleş: Faculty of Medicine, Department of Urology, Tekirdağ Namık Kemal University, Süleymanpaşa, Tekirdağ, 59020, Turkey
- Rıdvan Özcan: Department of Urology, Bursa State Hospital, Nilüfer, Bursa, 16110, Turkey
- Çağrı Doğan: Faculty of Medicine, Department of Urology, Tekirdağ Namık Kemal University, Süleymanpaşa, Tekirdağ, 59020, Turkey
- Murat Akgül: Faculty of Medicine, Department of Urology, Tekirdağ Namık Kemal University, Süleymanpaşa, Tekirdağ, 59020, Turkey
- Cenk Murat Yazıcı: Faculty of Medicine, Department of Urology, Tekirdağ Namık Kemal University, Süleymanpaşa, Tekirdağ, 59020, Turkey
18
Shen OY, Pratap JS, Li X, Chen NC, Bhashyam AR. How Does ChatGPT Use Source Information Compared With Google? A Text Network Analysis of Online Health Information. Clin Orthop Relat Res 2024;482:578-588. PMID: 38517757. PMCID: PMC10936961. DOI: 10.1097/corr.0000000000002995.
Abstract
BACKGROUND The lay public is increasingly using ChatGPT (a large language model) as a source of medical information. Traditional search engines such as Google provide several distinct responses to each search query and indicate the source for each response, but ChatGPT provides responses in paragraph form in prose without providing the sources used, which makes it difficult or impossible to ascertain whether those sources are reliable. One practical method to infer the sources used by ChatGPT is text network analysis. By understanding how ChatGPT uses source information in relation to traditional search engines, physicians and physician organizations can better counsel patients on the use of this new tool. QUESTIONS/PURPOSES (1) In terms of key content words, how similar are ChatGPT and Google Search responses for queries related to topics in orthopaedic surgery? (2) Does the source distribution (academic, governmental, commercial, or form of a scientific manuscript) differ for Google Search responses based on the topic's level of medical consensus, and how is this reflected in the text similarity between ChatGPT and Google Search responses? (3) Do these results vary between different versions of ChatGPT? METHODS We evaluated three search queries relating to orthopaedic conditions: "What is the cause of carpal tunnel syndrome?," "What is the cause of tennis elbow?," and "Platelet-rich plasma for thumb arthritis?" These were selected because of their relatively high, medium, and low consensus in the medical evidence, respectively. Each question was posed to ChatGPT version 3.5 and version 4.0 20 times for a total of 120 responses. Text network analysis using term frequency-inverse document frequency (TF-IDF) was used to compare text similarity between responses from ChatGPT and Google Search. In the field of information retrieval, TF-IDF is a weighted statistical measure of the importance of a key word to a document in a collection of documents. Higher TF-IDF scores indicate greater similarity between two sources. TF-IDF scores are most often used to compare and rank the text similarity of documents. Using this type of text network analysis, text similarity between ChatGPT and Google Search can be determined by calculating and summing the TF-IDF for all keywords in a ChatGPT response and comparing it with each Google search result to assess their text similarity to each other. In this way, text similarity can be used to infer relative content similarity. To answer our first question, we characterized the text similarity between ChatGPT and Google Search responses by finding the TF-IDF scores of the ChatGPT response and each of the 20 Google Search results for each question. Using these scores, we could compare the similarity of each ChatGPT response to the Google Search results. To provide a reference point for interpreting TF-IDF values, we generated randomized text samples with the same term distribution as the Google Search results. By comparing ChatGPT TF-IDF to the random text sample, we could assess whether TF-IDF values were statistically significant from TF-IDF values obtained by random chance, and it allowed us to test whether text similarity was an appropriate quantitative statistical measure of relative content similarity. To answer our second question, we classified the Google Search results to better understand sourcing. Google Search provides 20 or more distinct sources of information, but ChatGPT gives only a single prose paragraph in response to each query. 
To answer this question, we therefore used TF-IDF to ascertain whether the ChatGPT response was principally driven by one of four source categories: academic, government, commercial, or material that took the form of a scientific manuscript but was not peer-reviewed or indexed on a government site (such as PubMed). We then compared the TF-IDF similarity between ChatGPT responses and each source category. To answer our third research question, we repeated both analyses and compared the results when using ChatGPT 3.5 versus ChatGPT 4.0. RESULTS The ChatGPT response was dominated by the top Google Search result. For example, for carpal tunnel syndrome, the top result was an academic website with a mean TF-IDF of 7.2. A similar result was observed for the other search topics. To provide a reference point for interpreting TF-IDF values, a randomly generated sample of text compared with Google Search would have a mean TF-IDF of 2.7 ± 1.9, controlling for text length and keyword distribution. The observed TF-IDF distribution was higher for ChatGPT responses than for random text samples, supporting the claim that keyword text similarity is a measure of relative content similarity. When comparing source distribution, the ChatGPT response was most similar to the most common source category from Google Search. For the subject where there was strong consensus (carpal tunnel syndrome), the ChatGPT response was most similar to high-quality academic sources rather than lower-quality commercial sources (TF-IDF 8.6 versus 2.2). For topics with low consensus, the ChatGPT response paralleled lower-quality commercial websites rather than higher-quality academic websites (TF-IDF 14.6 versus 0.2). ChatGPT 4.0 had higher text similarity to Google Search results than ChatGPT 3.5 (mean increase in TF-IDF similarity of 0.80 to 0.91; p < 0.001). The ChatGPT 4.0 response was still dominated by the top Google Search result and reflected the most common search category for all search topics. CONCLUSION ChatGPT responses are similar to individual Google Search results for queries related to orthopaedic surgery, but the distribution of source information can vary substantially based on the relative level of consensus on a topic. For example, for carpal tunnel syndrome, where there is widely accepted medical consensus, ChatGPT responses had higher similarity to academic sources and therefore drew on those sources more. When fewer academic or government sources are available, especially in our search related to platelet-rich plasma, ChatGPT appears to have relied more heavily on a small number of nonacademic sources. These findings persisted even as ChatGPT was updated from version 3.5 to version 4.0. CLINICAL RELEVANCE Physicians should be aware that ChatGPT and Google likely use the same sources for a specific question. The main difference is that ChatGPT can draw on multiple sources to create one aggregate response, whereas Google keeps sources distinct by providing multiple separate results. For topics with low consensus, and therefore few high-quality sources, there is a much higher chance that ChatGPT will use less-reliable sources, in which case physicians should take the time to educate patients on the topic or provide resources that give more reliable information. Physician organizations should make it clear when the evidence is limited so that ChatGPT can reflect the lack of quality information or evidence.
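The TF-IDF workflow described above can be sketched in a few lines. This is not the authors' code: the documents are invented, and cosine similarity over a shared TF-IDF vocabulary is used as one standard way of reducing per-keyword weights to a single similarity score, which differs in detail from the keyword summation the authors describe.

```python
# Minimal sketch of TF-IDF text similarity between a ChatGPT response
# and a set of Google Search results. Document texts are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chatgpt_response = (
    "Carpal tunnel syndrome is caused by compression of the median nerve "
    "as it passes through the carpal tunnel of the wrist."
)
google_results = [
    "The median nerve is compressed as it passes through the carpal tunnel.",
    "Carpal tunnel syndrome treatment options include splinting and surgery.",
]

# Fit one vocabulary over all documents so term weights are comparable.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform([chatgpt_response] + google_results)

# Compare the ChatGPT response (row 0) against each Google result.
similarities = cosine_similarity(tfidf[0], tfidf[1:]).flatten()
for rank, score in enumerate(similarities, start=1):
    print(f"Google result {rank}: similarity = {score:.3f}")
```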
Collapse
Affiliation(s)
- Oscar Y. Shen
- Department of Orthopaedic Surgery, Hand and Upper Extremity Service, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong
| | - Jayanth S. Pratap
- Department of Orthopaedic Surgery, Hand and Upper Extremity Service, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Harvard University, Cambridge, MA, USA
| | - Xiang Li
- Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
| | - Neal C. Chen
- Department of Orthopaedic Surgery, Hand and Upper Extremity Service, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
| | - Abhiram R. Bhashyam
- Department of Orthopaedic Surgery, Hand and Upper Extremity Service, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
| |
Collapse
|
19
|
Parikh AO, Oca MC, Conger JR, McCoy A, Chang J, Zhang-Nunes S. Accuracy and Bias in Artificial Intelligence Chatbot Recommendations for Oculoplastic Surgeons. Cureus 2024; 16:e57611. [PMID: 38707042 PMCID: PMC11069401 DOI: 10.7759/cureus.57611] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 03/30/2024] [Indexed: 05/07/2024] Open
Abstract
Purpose The purpose of this study is to assess the accuracy of and bias in recommendations for oculoplastic surgeons from three artificial intelligence (AI) chatbot systems. Methods ChatGPT, Microsoft Bing Balanced, and Google Bard were asked for recommendations for oculoplastic surgeons practicing in the 20 most populous cities in the United States. Three prompts were used: "can you help me find (an oculoplastic surgeon)/(a doctor who does eyelid lifts)/(an oculofacial plastic surgeon) in (city)." Results A total of 672 suggestions were made across the three prompts (oculoplastic surgeon; doctor who does eyelid lifts; oculofacial plastic surgeon); 19.8% of suggestions were excluded, leaving 539 suggested physicians. Of these, 64.1% were oculoplastics specialists (of whom 70.1% were American Society of Ophthalmic Plastic and Reconstructive Surgery (ASOPRS) members); 16.1% were trained in general plastic surgery, 9.0% in ENT, 8.8% in ophthalmology but not oculoplastics, and 1.9% in another specialty. Across all AI systems, 27.7% of recommended surgeons were female. Conclusions Among the chatbot systems tested, there were high rates of inaccuracy: up to 38% of recommended surgeons were nonexistent or not practicing in the city requested, and 35.9% of those recommended as oculoplastic/oculofacial plastic surgeons were not oculoplastics specialists. The choice of prompt affected the result, with requests for "a doctor who does eyelid lifts" yielding more plastic surgeons and ENTs and fewer oculoplastic surgeons. It is important to identify inaccuracies and biases in the recommendations provided by AI systems as more patients may start using them to choose a surgeon.
Collapse
Affiliation(s)
- Alomi O Parikh
- Ophthalmology, USC Roski Eye Institute, Keck School of Medicine, University of Southern California, Los Angeles, USA
| | - Michael C Oca
- Ophthalmology, University of California San Diego School of Medicine, La Jolla, USA
| | - Jordan R Conger
- Oculofacial Plastic Surgery, USC Roski Eye Institute, Keck School of Medicine, University of Southern California, Los Angeles, USA
| | - Allison McCoy
- Oculofacial Plastic Surgery, Del Mar Plastic Surgery, San Diego, USA
| | - Jessica Chang
- Oculofacial Plastic Surgery, USC Roski Eye Institute, Keck School of Medicine, University of Southern California, Los Angeles, USA
| | - Sandy Zhang-Nunes
- Ophthalmology, USC Roski Eye Institute, Keck School of Medicine, University of Southern California, Los Angeles, USA
| |
Collapse
|
20
|
Young BK, Zhao PY. Large Language Models and the Shoreline of Ophthalmology. JAMA Ophthalmol 2024; 142:375-376. [PMID: 38386327 DOI: 10.1001/jamaophthalmol.2023.6937] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/23/2024]
Affiliation(s)
- Benjamin K Young
- Casey Eye Institute, Oregon Health & Science University, Portland
| | - Peter Y Zhao
- New England Eye Center, Tufts University School of Medicine, Boston, Massachusetts
| |
Collapse
|
21
|
Huang AS, Hirabayashi K, Barna L, Parikh D, Pasquale LR. Assessment of a Large Language Model's Responses to Questions and Cases About Glaucoma and Retina Management. JAMA Ophthalmol 2024; 142:371-375. [PMID: 38386351 PMCID: PMC10884943 DOI: 10.1001/jamaophthalmol.2023.6917] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2023] [Accepted: 12/03/2023] [Indexed: 02/23/2024]
Abstract
Importance Large language models (LLMs) are revolutionizing medical diagnosis and treatment, offering unprecedented accuracy and ease surpassing conventional search engines. Their integration into medical assistance programs will become pivotal for ophthalmologists as an adjunct for practicing evidence-based medicine. Therefore, the diagnostic and treatment accuracy of LLM-generated responses compared with fellowship-trained ophthalmologists can help assess their accuracy and validate their potential utility in ophthalmic subspecialties. Objective To compare the diagnostic accuracy and comprehensiveness of responses from an LLM chatbot with those of fellowship-trained glaucoma and retina specialists on ophthalmological questions and real patient case management. Design, Setting, and Participants This comparative cross-sectional study recruited 15 participants aged 31 to 67 years, including 12 attending physicians and 3 senior trainees, from eye clinics affiliated with the Department of Ophthalmology at Icahn School of Medicine at Mount Sinai, New York, New York. Glaucoma and retina questions (10 of each type) were randomly selected from the American Academy of Ophthalmology's commonly asked questions Ask an Ophthalmologist. Deidentified glaucoma and retinal cases (10 of each type) were randomly selected from ophthalmology patients seen at Icahn School of Medicine at Mount Sinai-affiliated clinics. The LLM used was GPT-4 (version dated May 12, 2023). Data were collected from June to August 2023. Main Outcomes and Measures Responses were assessed via a Likert scale for medical accuracy and completeness. Statistical analysis involved the Mann-Whitney U test and the Kruskal-Wallis test, followed by pairwise comparison. Results The combined question-case mean rank for accuracy was 506.2 for the LLM chatbot and 403.4 for glaucoma specialists (n = 831; Mann-Whitney U = 27976.5; P < .001), and the mean rank for completeness was 528.3 and 398.7, respectively (n = 828; Mann-Whitney U = 25218.5; P < .001). The mean rank for accuracy was 235.3 for the LLM chatbot and 216.1 for retina specialists (n = 440; Mann-Whitney U = 15518.0; P = .17), and the mean rank for completeness was 258.3 and 208.7, respectively (n = 439; Mann-Whitney U = 13123.5; P = .005). The Dunn test revealed a significant difference between all pairwise comparisons, except specialist vs trainee in rating chatbot completeness. The overall pairwise comparisons showed that both trainees and specialists rated the chatbot's accuracy and completeness more favorably than those of their specialist counterparts, with specialists noting a significant difference in the chatbot's accuracy (z = 3.23; P = .007) and completeness (z = 5.86; P < .001). Conclusions and Relevance This study accentuates the comparative proficiency of LLM chatbots in diagnostic accuracy and completeness compared with fellowship-trained ophthalmologists in various clinical scenarios. The LLM chatbot outperformed glaucoma specialists and matched retina specialists in diagnostic and treatment accuracy, substantiating its role as a promising diagnostic adjunct in ophthalmology.
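The Mann-Whitney U comparisons reported above can be reproduced in outline with SciPy. The ratings below are fabricated Likert scores for illustration, not study data.

```python
# Illustrative Mann-Whitney U test contrasting chatbot and specialist
# accuracy ratings, as in the comparison above. Scores are invented.
from scipy.stats import mannwhitneyu

chatbot_scores = [5, 4, 4, 5, 3, 4, 5, 4]      # hypothetical Likert ratings
specialist_scores = [4, 3, 4, 3, 3, 4, 4, 3]   # hypothetical Likert ratings

u_stat, p_value = mannwhitneyu(chatbot_scores, specialist_scores,
                               alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```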
Collapse
Affiliation(s)
- Andy S. Huang
- Department of Ophthalmology, Icahn School of Medicine at Mount Sinai, New York, New York
| | - Kyle Hirabayashi
- Department of Ophthalmology, Icahn School of Medicine at Mount Sinai, New York, New York
| | - Laura Barna
- Department of Ophthalmology, Icahn School of Medicine at Mount Sinai, New York, New York
- Department of Ophthalmology, Massachusetts Eye and Ear, Harvard Medical School, Boston
| | - Deep Parikh
- Department of Ophthalmology, Icahn School of Medicine at Mount Sinai, New York, New York
| | - Louis R. Pasquale
- Department of Ophthalmology, Icahn School of Medicine at Mount Sinai, New York, New York
| |
Collapse
|
22
|
Chen X, Zhang W, Zhao Z, Xu P, Zheng Y, Shi D, He M. ICGA-GPT: report generation and question answering for indocyanine green angiography images. Br J Ophthalmol 2024:bjo-2023-324446. [PMID: 38508675 DOI: 10.1136/bjo-2023-324446] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/21/2023] [Accepted: 03/03/2024] [Indexed: 03/22/2024]
Abstract
BACKGROUND Indocyanine green angiography (ICGA) is vital for diagnosing chorioretinal diseases, but its interpretation and patient communication require extensive expertise and time-consuming effort. We aim to develop a bilingual ICGA report generation and question-answering (QA) system. METHODS Our dataset comprised 213 129 ICGA images from 2919 participants. The system comprised two stages: image-text alignment for report generation by a multimodal transformer architecture, and large language model (LLM)-based QA with ICGA text reports and human-input questions. Performance was assessed using both quantitative metrics (including Bilingual Evaluation Understudy (BLEU), Consensus-based Image Description Evaluation (CIDEr), Recall-Oriented Understudy for Gisting Evaluation-Longest Common Subsequence (ROUGE-L), Semantic Propositional Image Caption Evaluation (SPICE), accuracy, sensitivity, specificity, precision and F1 score) and subjective evaluation by three experienced ophthalmologists using 5-point scales (5 refers to high quality). RESULTS We produced 8757 ICGA reports covering 39 disease-related conditions after bilingual translation (66.7% English, 33.3% Chinese). The ICGA-GPT model's report generation performance was evaluated with BLEU scores (1-4) of 0.48, 0.44, 0.40 and 0.37; CIDEr of 0.82; ROUGE-L of 0.41 and SPICE of 0.18. For disease-based metrics, the average specificity, accuracy, precision, sensitivity and F1 score were 0.98, 0.94, 0.70, 0.68 and 0.64, respectively. Assessing the quality of 50 images (100 reports), three ophthalmologists achieved substantial agreement (kappa=0.723 for completeness, kappa=0.738 for accuracy), yielding scores from 3.20 to 3.55. In an interactive QA scenario involving 100 generated answers, the ophthalmologists provided scores of 4.24, 4.22 and 4.10, displaying good consistency (kappa=0.779). CONCLUSION This pioneering study introduces the ICGA-GPT model for report generation and interactive QA, underscoring the potential of LLMs in assisting with automated ICGA image interpretation.
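BLEU, one of the report-generation metrics above, measures n-gram overlap between a generated report and a reference. A minimal sketch using NLTK with invented sentences (not the ICGA dataset); smoothing is applied because short texts otherwise yield zero scores for higher-order n-grams.

```python
# Sketch of BLEU-1 through BLEU-4 for a generated sentence against one
# reference. Tokens are invented, not from the ICGA reports.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["indocyanine", "green", "angiography", "shows", "polypoidal", "lesions"]]
candidate = ["angiography", "shows", "polypoidal", "lesions"]

smooth = SmoothingFunction().method1  # avoids zero scores on short texts
for n in range(1, 5):
    weights = tuple([1.0 / n] * n)  # uniform weights over 1..n-grams
    score = sentence_bleu(reference, candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.2f}")
```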
Collapse
Affiliation(s)
- Xiaolan Chen
- School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
| | - Weiyi Zhang
- School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
| | - Ziwei Zhao
- School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
| | - Pusheng Xu
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou, Guangdong, China
| | - Yingfeng Zheng
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou, Guangdong, China
| | - Danli Shi
- School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
- Research Centre for SHARP Vision (RCSV), The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
| | - Mingguang He
- School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
- Research Centre for SHARP Vision (RCSV), The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
- Centre for Eye and Vision Research (CEVR), 17W Hong Kong Science Park, Hong Kong, China
| |
Collapse
|
23
|
Cohen SA, Brant A, Fisher AC, Pershing S, Do D, Pan C. Dr. Google vs. Dr. ChatGPT: Exploring the Use of Artificial Intelligence in Ophthalmology by Comparing the Accuracy, Safety, and Readability of Responses to Frequently Asked Patient Questions Regarding Cataracts and Cataract Surgery. Semin Ophthalmol 2024:1-8. [PMID: 38516983 DOI: 10.1080/08820538.2024.2326058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 02/01/2024] [Accepted: 02/27/2024] [Indexed: 03/23/2024]
Abstract
PURPOSE Patients are using online search modalities to learn about their eye health. While Google remains the most popular search engine, the use of large language models (LLMs) like ChatGPT has increased. Cataract surgery is the most common surgical procedure in the US, and there are limited data on the quality of the online information that surfaces in searches related to cataract surgery on search engines such as Google and LLM platforms such as ChatGPT. We identified the most common patient frequently asked questions (FAQs) about cataracts and cataract surgery and evaluated the accuracy, safety, and readability of the answers to these questions provided by both Google and ChatGPT. We also demonstrated the utility of ChatGPT in writing notes and creating patient education materials. METHODS The top 20 FAQs related to cataracts and cataract surgery were recorded from Google. Responses to the questions provided by Google and ChatGPT were evaluated by a panel of ophthalmologists for accuracy and safety. Evaluators were also asked to distinguish between Google and LLM chatbot answers. Five validated readability indices were used to assess the readability of responses. ChatGPT was instructed to generate operative notes, post-operative instructions, and customizable patient education materials according to specific readability criteria. RESULTS Responses to 20 patient FAQs generated by ChatGPT were significantly longer and written at a higher reading level than responses provided by Google (p < .001), with an average grade level of 14.8 (college level). Expert reviewers were able to correctly distinguish between human-reviewed and chatbot-generated responses an average of 31% of the time. Google answers contained incorrect or inappropriate material 27% of the time, compared with 6% of LLM-generated answers (p < .001). When expert reviewers were asked to compare the responses directly, chatbot responses were favored (66%). CONCLUSIONS When comparing the responses to patients' cataract FAQs provided by ChatGPT and Google, practicing ophthalmologists overwhelmingly preferred ChatGPT responses. LLM chatbot responses were less likely to contain inaccurate information. ChatGPT represents a viable information source on eye health for patients with higher health literacy. ChatGPT may also be used by ophthalmologists to create customizable patient education materials for patients with varying health literacy.
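Readability indices such as the two Flesch measures referenced above are simple closed-form formulas over sentence, word, and syllable counts. A minimal sketch of both follows; the naive vowel-group syllable counter is an approximation (production tools use pronunciation dictionaries), and the sample sentence is invented.

```python
# Sketch of Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL).
import re

def count_syllables(word: str) -> int:
    # Approximate: count vowel groups; real tools use dictionaries.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences   # words per sentence
    spw = syllables / len(words)   # syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fre, fkgl

fre, fkgl = flesch_scores("The lens of your eye gets cloudy. Surgery can fix it.")
print(f"Flesch Reading Ease: {fre:.1f}, Grade Level: {fkgl:.1f}")
```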
Collapse
Affiliation(s)
- Samuel A Cohen
- Byers Eye Institute, Stanford University School of Medicine, Stanford, CA, USA
| | - Arthur Brant
- Byers Eye Institute, Stanford University School of Medicine, Stanford, CA, USA
| | - Ann Caroline Fisher
- Byers Eye Institute, Stanford University School of Medicine, Stanford, CA, USA
| | - Suzann Pershing
- Byers Eye Institute, Stanford University School of Medicine, Stanford, CA, USA
| | - Diana Do
- Byers Eye Institute, Stanford University School of Medicine, Stanford, CA, USA
| | - Carolyn Pan
- Byers Eye Institute, Stanford University School of Medicine, Stanford, CA, USA
| |
Collapse
|
24
|
Chervonski E, Harish KB, Rockman CB, Sadek M, Teter KA, Jacobowitz GR, Berland TL, Lohr J, Moore C, Maldonado TS. Generative artificial intelligence chatbots may provide appropriate informational responses to common vascular surgery questions by patients. Vascular 2024:17085381241240550. [PMID: 38500300 DOI: 10.1177/17085381241240550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/20/2024]
Abstract
OBJECTIVES Generative artificial intelligence (AI) has emerged as a promising tool to engage with patients. The objective of this study was to assess the quality of AI responses to common patient questions regarding vascular surgery disease processes. METHODS OpenAI's ChatGPT-3.5 and Google Bard were queried with 24 mock patient questions spanning seven vascular surgery disease domains. Six experienced vascular surgery faculty at a tertiary academic center independently graded AI responses on their accuracy (rated 1-4 from completely inaccurate to completely accurate), completeness (rated 1-4 from totally incomplete to totally complete), and appropriateness (binary). Responses were also evaluated with three readability scales. RESULTS ChatGPT responses were rated, on average, more accurate than Bard responses (3.08 ± 0.33 vs 2.82 ± 0.40, p < .01). ChatGPT responses were also scored, on average, more complete than Bard responses (2.98 ± 0.34 vs 2.62 ± 0.36, p < .01). Most ChatGPT responses (75.0%, n = 18) and almost half of Bard responses (45.8%, n = 11) were unanimously deemed appropriate. Almost one-third of Bard responses (29.2%, n = 7) were deemed inappropriate by at least two reviewers, and two Bard responses (8.4%) were considered inappropriate by the majority. The mean Flesch Reading Ease, Flesch-Kincaid Grade Level, and Gunning Fog Index of ChatGPT responses were 29.4 ± 10.8, 14.5 ± 2.2, and 17.7 ± 3.1, respectively, indicating that responses were readable with a post-secondary education. Bard's mean readability scores were 58.9 ± 10.5, 8.2 ± 1.7, and 11.0 ± 2.0, respectively, indicating that responses were readable with a high-school education (p < .0001 for all three metrics). ChatGPT's mean response length (332 ± 79 words) was greater than Bard's (183 ± 53 words, p < .001). There was no difference in the accuracy, completeness, readability, or response length of ChatGPT or Bard between disease domains (p > .05 for all analyses). CONCLUSIONS AI offers a novel means of educating patients that avoids the inundation of information from "Dr Google" and the time barriers of physician-patient encounters. ChatGPT provides largely valid, though imperfect, responses to myriad patient questions at the expense of readability. While Bard responses are more readable and concise, their quality is poorer. Further research is warranted to better understand failure points for large language models in vascular surgery patient education.
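Of the three readability scales reported above, the Gunning Fog Index has the simplest published formula: 0.4 × (average sentence length + percentage of words with three or more syllables). A minimal sketch, with a rough vowel-group syllable counter and an invented sample sentence:

```python
# Sketch of the Gunning Fog Index; "complex" words are those with three
# or more syllables, counted here by a rough vowel-group approximation.
import re

def syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = [w for w in words if syllables(w) >= 3]
    return 0.4 * (len(words) / sentences + 100 * len(complex_words) / len(words))

sample = "An aneurysm is a bulge in the aorta. It can rupture without warning."
print(f"Fog index: {gunning_fog(sample):.1f}")
```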
Collapse
Affiliation(s)
- Ethan Chervonski
- New York University Grossman School of Medicine, New York, NY, USA
| | - Keerthi B Harish
- New York University Grossman School of Medicine, New York, NY, USA
| | - Caron B Rockman
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Mikel Sadek
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Katherine A Teter
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Glenn R Jacobowitz
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Todd L Berland
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Joann Lohr
- Dorn Veterans Affairs Medical Center, Columbia, SC, USA
| | | | - Thomas S Maldonado
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| |
Collapse
|
25
|
Hershenhouse JS, Cacciamani GE. Comment on: Assessing ChatGPT's ability to answer questions pertaining to erectile dysfunction. Int J Impot Res 2024:10.1038/s41443-023-00821-2. [PMID: 38467775 DOI: 10.1038/s41443-023-00821-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2023] [Revised: 12/12/2023] [Accepted: 12/21/2023] [Indexed: 03/13/2024]
Affiliation(s)
- Jacob S Hershenhouse
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Artificial Intelligence Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA
| | - Giovanni E Cacciamani
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA.
- Artificial Intelligence Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA.
| |
Collapse
|
26
|
Carlà MM, Gambini G, Baldascino A, Giannuzzi F, Boselli F, Crincoli E, D'Onofrio NC, Rizzo S. Exploring AI-chatbots' capability to suggest surgical planning in ophthalmology: ChatGPT versus Google Gemini analysis of retinal detachment cases. Br J Ophthalmol 2024:bjo-2023-325143. [PMID: 38448201 DOI: 10.1136/bjo-2023-325143] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2023] [Accepted: 02/16/2024] [Indexed: 03/08/2024]
Abstract
BACKGROUND We aimed to define the capability of three publicly available large language models, Chat Generative Pretrained Transformer (ChatGPT-3.5), ChatGPT-4 and Google Gemini, in analysing retinal detachment cases and suggesting the best possible surgical planning. METHODS Analysis of 54 retinal detachment records entered into the ChatGPT and Gemini interfaces. After asking 'Specify what kind of surgical planning you would suggest and the eventual intraocular tamponade.' and collecting the given answers, we assessed the level of agreement with the common opinion of three expert vitreoretinal surgeons. Moreover, ChatGPT and Gemini answers were graded 1-5 (from poor to excellent quality), according to the Global Quality Score (GQS). RESULTS After excluding 4 controversial cases, 50 cases were included. Overall, the ChatGPT-3.5, ChatGPT-4 and Google Gemini surgical choices agreed with those of the vitreoretinal surgeons in 40/50 (80%), 42/50 (84%) and 35/50 (70%) of cases, respectively. Google Gemini was not able to respond in five cases. Contingency analysis showed significant differences between ChatGPT-4 and Gemini (p=0.03). ChatGPT's GQS scores were 3.9±0.8 and 4.2±0.7 for versions 3.5 and 4, while Gemini scored 3.5±1.1. There was no statistical difference between the two ChatGPT versions (p=0.22), while both outperformed Gemini's scores (p=0.03 and p=0.002, respectively). The main source of error was endotamponade choice (14% for ChatGPT-3.5 and 4, and 12% for Google Gemini). Only ChatGPT-4 was able to suggest a combined phacovitrectomy approach. CONCLUSION Google Gemini and ChatGPT evaluated vitreoretinal patients' records in a coherent manner, showing a good level of agreement with expert surgeons. According to the GQS, ChatGPT's recommendations were more accurate and precise than Gemini's.
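The contingency comparison reported above can be reproduced in outline from the agreement counts in the abstract (42/50 for ChatGPT-4 versus 35/50 for Gemini). Whether the authors applied a continuity correction or an exact test is not stated, so the p value from this sketch may differ from the published one.

```python
# Sketch of a 2x2 contingency analysis of agreement with expert surgeons,
# using counts taken from the abstract above.
from scipy.stats import chi2_contingency

table = [[42, 8],   # ChatGPT-4: agree, disagree with surgeons
         [35, 15]]  # Gemini:    agree, disagree with surgeons

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```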
Collapse
Affiliation(s)
- Matteo Mario Carlà
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| | - Gloria Gambini
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| | - Antonio Baldascino
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| | - Federico Giannuzzi
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| | - Francesco Boselli
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| | - Emanuele Crincoli
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| | - Nicola Claudio D'Onofrio
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| | - Stanislao Rizzo
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| |
Collapse
|
27
|
Nikdel M, Ghadimi H, Tavakoli M, Suh DW. Assessment of the Responses of the Artificial Intelligence-based Chatbot ChatGPT-4 to Frequently Asked Questions About Amblyopia and Childhood Myopia. J Pediatr Ophthalmol Strabismus 2024; 61:86-89. [PMID: 37882183 DOI: 10.3928/01913913-20231005-02] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 10/27/2023]
Abstract
PURPOSE To assess the responses of ChatGPT-4, the forerunner artificial intelligence-based chatbot, to frequently asked questions regarding two common pediatric ophthalmologic disorders, amblyopia and childhood myopia. METHODS Twenty-seven questions about amblyopia and 28 questions about childhood myopia were each asked of ChatGPT twice (110 questions in total). The responses were evaluated by two pediatric ophthalmologists as acceptable, incomplete, or unacceptable. RESULTS There was remarkable agreement (96.4%) between the two pediatric ophthalmologists in their assessment of the responses. Acceptable responses were provided by ChatGPT to 93 of 110 (84.6%) questions in total (44 of 54 [81.5%] questions on amblyopia and 49 of 56 [87.5%] questions on childhood myopia). Seven of 54 (12.9%) responses to questions on amblyopia were graded as incomplete, compared with 4 of 56 (7.1%) responses to questions on childhood myopia. ChatGPT gave inappropriate responses to three questions each about amblyopia (5.6%) and childhood myopia (5.4%). The most noticeable inappropriate responses were related to the definition of reverse amblyopia and the threshold of refractive error for prescribing spectacles to children with myopia. CONCLUSIONS ChatGPT has the potential to serve as an adjunct informational tool for pediatric ophthalmology patients and their caregivers, demonstrating relatively good performance in answering 84.6% of the most frequently asked questions about amblyopia and childhood myopia. [J Pediatr Ophthalmol Strabismus. 2024;61(2):86-89.].
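The inter-rater agreement reported above is raw percent agreement; a chance-corrected complement such as Cohen's kappa is often reported alongside it. A minimal sketch with invented labels (the study itself reports only percent agreement):

```python
# Sketch comparing raw percent agreement with chance-corrected Cohen's
# kappa for two raters. The labels below are fabricated for illustration.
from sklearn.metrics import cohen_kappa_score

rater1 = ["acceptable", "acceptable", "incomplete", "acceptable", "unacceptable"]
rater2 = ["acceptable", "acceptable", "incomplete", "incomplete", "unacceptable"]

percent_agreement = sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)
kappa = cohen_kappa_score(rater1, rater2)
print(f"Agreement = {percent_agreement:.1%}, kappa = {kappa:.2f}")
```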
Collapse
|
28
|
Momenaei B, Wakabayashi T, Shahlaee A, Durrani AF, Pandit SA, Wang K, Mansour HA, Abishek RM, Xu D, Sridhar J, Yonekawa Y, Kuriyan AE. Assessing ChatGPT-3.5 Versus ChatGPT-4 Performance in Surgical Treatment of Retinal Diseases: A Comparative Study. Ophthalmic Surg Lasers Imaging Retina 2024:1-2. [PMID: 38531015 DOI: 10.3928/23258160-20240227-02] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/28/2024]
|
29
|
Yalla GR, Hyman N, Hock LE, Zhang Q, Shukla AG, Kolomeyer NN. Performance of Artificial Intelligence Chatbots on Glaucoma Questions Adapted From Patient Brochures. Cureus 2024; 16:e56766. [PMID: 38650824 PMCID: PMC11034394 DOI: 10.7759/cureus.56766] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 03/23/2024] [Indexed: 04/25/2024] Open
Abstract
Introduction With the potential for artificial intelligence (AI) chatbots to serve as the primary source of glaucoma information for patients, it is essential to characterize the information that chatbots provide so that providers can tailor discussions, anticipate patient concerns, and identify misleading information. Therefore, the purpose of this study was to evaluate glaucoma information from AI chatbots, including ChatGPT-4, Bard, and Bing, by analyzing response accuracy, comprehensiveness, readability, word count, and character count in comparison to each other and to glaucoma-related American Academy of Ophthalmology (AAO) patient materials. Methods Section headers from AAO glaucoma-related patient education brochures were adapted into question form and asked five times of each AI chatbot (ChatGPT-4, Bard, and Bing). Two sets of responses from each chatbot were used to evaluate the accuracy of AI chatbot responses and AAO brochure information, and the comprehensiveness of AI chatbot responses compared with the AAO brochure information; each was scored 1-5 by three independent glaucoma-trained ophthalmologists. Readability (assessed with the Flesch-Kincaid Grade Level (FKGL), corresponding to United States school grade levels), word count, and character count were determined for all chatbot responses and AAO brochure sections. Results Accuracy scores for AAO, ChatGPT, Bing, and Bard were 4.84, 4.26, 4.53, and 3.53, respectively. On direct comparison, AAO was more accurate than ChatGPT (p=0.002), and Bard was the least accurate (Bard versus AAO, p<0.001; Bard versus ChatGPT, p<0.002; Bard versus Bing, p=0.001). ChatGPT had the most comprehensive responses (ChatGPT versus Bing, p<0.001; ChatGPT versus Bard, p=0.008), with comprehensiveness scores for ChatGPT, Bing, and Bard of 3.32, 2.16, and 2.79, respectively. AAO information and Bard responses were at the most accessible readability levels (AAO versus ChatGPT, AAO versus Bing, Bard versus ChatGPT, Bard versus Bing, all p<0.0001), with readability levels for AAO, ChatGPT, Bing, and Bard at 8.11, 13.01, 11.73, and 7.90, respectively. Bing responses had the lowest word and character counts. Conclusion AI chatbot responses varied in accuracy, comprehensiveness, and readability. With accuracy and comprehensiveness scores below those of AAO brochures and elevated readability levels, AI chatbots require improvement to become a more useful supplementary source of glaucoma information for patients. Physicians must be aware of these limitations so that they can ask patients about their existing knowledge and questions and then provide clarifying, comprehensive information.
Collapse
Affiliation(s)
- Goutham R Yalla
- Department of Ophthalmology, Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, USA
- Glaucoma Research Center, Wills Eye Hospital, Philadelphia, USA
| | - Nicholas Hyman
- Department of Ophthalmology, Vagelos College of Physicians and Surgeons, Columbia University, New York, USA
- Department of Ophthalmology, Glaucoma Division, Columbia University Irving Medical Center, New York, USA
| | - Lauren E Hock
- Glaucoma Research Center, Wills Eye Hospital, Philadelphia, USA
| | - Qiang Zhang
- Glaucoma Research Center, Wills Eye Hospital, Philadelphia, USA
- Biostatistics Consulting Core, Vickie and Jack Farber Vision Research Center, Wills Eye Hospital, Philadelphia, USA
| | - Aakriti G Shukla
- Department of Ophthalmology, Glaucoma Division, Columbia University Irving Medical Center, New York, USA
| | | |
Collapse
|
30
|
Kianian R, Sun D, Crowell EL, Tsui E. The Use of Large Language Models to Generate Education Materials about Uveitis. Ophthalmol Retina 2024; 8:195-201. [PMID: 37716431 DOI: 10.1016/j.oret.2023.09.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2023] [Revised: 09/08/2023] [Accepted: 09/08/2023] [Indexed: 09/18/2023]
Abstract
OBJECTIVE To assess large language models in generating readable uveitis information and in improving the readability of online health information. DESIGN Evaluation of technology. SUBJECTS Not applicable. METHODS ChatGPT and Bard were asked the following prompts: (prompt A) "considering that the average American reads at a 6th grade level, using the Flesch-Kincaid Grade Level (FKGL) formula, can you write patient-targeted health information on uveitis of around 6th grade level?" and (prompt B) "can you write patient-targeted health information on uveitis that is easy to understand by an average American?" Additionally, ChatGPT and Bard were given text from the first-page Google results for the search term "uveitis" with the following prompt: "Considering that the average American reads at a 6th grade level, using the FKGL formula, can you rewrite the following text to 6th grade level: [insert text]." The readability of each response was analyzed and compared using the metrics described below. MAIN OUTCOME MEASURES The FKGL, a highly validated readability assessment tool that assigns a grade level to a given text, along with the total numbers of words, sentences, syllables, and complex words. Complex words were defined as those with > 2 syllables. RESULTS ChatGPT and Bard generated responses with lower FKGL scores (i.e., easier to understand) in response to prompt A compared with prompt B. This was only significant for ChatGPT (P < 0.0001). The mean FKGL of ChatGPT responses (6.3 ± 1.2) was significantly lower (P < 0.0001) than that of Bard responses (10.5 ± 0.8). ChatGPT responses also contained fewer complex words than Bard responses (P < 0.0001). Online health information on uveitis had a mean grade level of 11.0 ± 1.4. ChatGPT lowered the FKGL to 8.0 ± 1.0 (P < 0.0001) when asked to rewrite the content; Bard was not able to do so (mean FKGL of 11.1 ± 1.6). CONCLUSIONS ChatGPT can aid clinicians in producing easier-to-understand patient health information on uveitis compared with already-existing content. It can also help reduce the difficulty of the language used in uveitis health information targeted at patients. FINANCIAL DISCLOSURE(S) Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
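The rewrite experiment above was run through the chat interfaces; for readers who want to issue the same grade-level prompt programmatically, a minimal sketch using the OpenAI Python client follows. The client usage is standard, but the model name and the example source text are illustrative assumptions, not details from the study.

```python
# Sketch of issuing the study's grade-level rewrite prompt via the OpenAI
# Python client. The study used web chat interfaces; model name and source
# text here are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

source_text = "Uveitis is inflammation of the uvea, the middle layer of the eye."
prompt = (
    "Considering that the average American reads at a 6th grade level, "
    "using the FKGL formula, can you rewrite the following text to 6th "
    f"grade level: {source_text}"
)

response = client.chat.completions.create(
    model="gpt-4",  # illustrative; any chat-capable model
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```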
Collapse
Affiliation(s)
- Reza Kianian
- Stein Eye Institute, Department of Ophthalmology, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California
| | - Deyu Sun
- Stein Eye Institute, Department of Ophthalmology, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California
| | - Eric L Crowell
- Mitchel and Shannon Wong Eye Institute, Dell Medical School at the University of Texas at Austin, Austin, Texas
| | - Edmund Tsui
- Stein Eye Institute, Department of Ophthalmology, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California.
| |
Collapse
|
31
|
Onder CE, Koc G, Gokbulut P, Taskaldiran I, Kuskonmaz SM. Evaluation of the reliability and readability of ChatGPT-4 responses regarding hypothyroidism during pregnancy. Sci Rep 2024; 14:243. [PMID: 38167988 PMCID: PMC10761760 DOI: 10.1038/s41598-023-50884-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2023] [Accepted: 12/27/2023] [Indexed: 01/05/2024] Open
Abstract
Hypothyroidism is characterized by thyroid hormone deficiency and has adverse effects on both pregnancy and fetal health. Chat Generative Pre-trained Transformer (ChatGPT) is a large language model trained on a very large corpus of text from many sources. Our study aimed to evaluate the reliability and readability of ChatGPT-4 answers about hypothyroidism in pregnancy. A total of 19 questions were created in line with the recommendations in the latest guideline of the American Thyroid Association (ATA) on hypothyroidism in pregnancy and were asked of ChatGPT-4. The reliability and quality of the responses were scored by two independent researchers using the global quality scale (GQS) and modified DISCERN tools. The readability of the ChatGPT responses was assessed using the Flesch Reading Ease (FRE) score, Flesch-Kincaid grade level (FKGL), Gunning Fog Index (GFI), Coleman-Liau Index (CLI), and Simple Measure of Gobbledygook (SMOG) tools. No misleading information was found in any of the answers. The mean mDISCERN score of the responses was 30.26 ± 3.14; the median GQS score was 4 (2-4). In terms of reliability, most of the answers showed moderate (78.9%) followed by good (21.1%) reliability. In the readability analysis, the median FRE was 32.20 (13.00-37.10). Most answers [9 (47.3%)] required a university-level education to read. ChatGPT-4 has significant potential and can be used as an auxiliary information source for counseling, bridging patients and clinicians on hypothyroidism in pregnancy. Nonetheless, efforts should be made to improve the reliability and readability of ChatGPT.
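Two of the additional readability indices named above, SMOG and the Coleman-Liau Index, can be sketched under their usual published formulas; the vowel-group syllable counter is a rough approximation, and the sample sentence is invented.

```python
# Sketch of the SMOG and Coleman-Liau readability indices.
import math
import re

def _syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def smog(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if _syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / sentences) + 3.1291

def coleman_liau(text: str) -> float:
    words = re.findall(r"[A-Za-z']+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    letters = sum(len(w) for w in words)
    L = 100 * letters / len(words)    # letters per 100 words
    S = 100 * sentences / len(words)  # sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8

sample = "Thyroid hormone helps the baby grow. Take your pill every day."
print(f"SMOG: {smog(sample):.1f}, Coleman-Liau: {coleman_liau(sample):.1f}")
```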
Collapse
Affiliation(s)
- C E Onder
- Department of Endocrinology and Metabolic Diseases, Ankara Training and Research Hospital, Ankara, Turkey.
| | - G Koc
- Department of Endocrinology and Metabolic Diseases, Ankara Training and Research Hospital, Ankara, Turkey
| | - P Gokbulut
- Department of Endocrinology and Metabolic Diseases, Ankara Training and Research Hospital, Ankara, Turkey
| | - I Taskaldiran
- Department of Endocrinology and Metabolic Diseases, Ankara Training and Research Hospital, Ankara, Turkey
| | - S M Kuskonmaz
- Department of Endocrinology and Metabolic Diseases, Ankara Training and Research Hospital, Ankara, Turkey
| |
Collapse
|
32
|
Patnaik SS, Hoffmann U. Quantitative evaluation of ChatGPT versus Bard responses to anaesthesia-related queries. Br J Anaesth 2024; 132:169-171. [PMID: 37945414 DOI: 10.1016/j.bja.2023.09.030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2023] [Revised: 09/28/2023] [Accepted: 09/29/2023] [Indexed: 11/12/2023] Open
Affiliation(s)
- Sourav S Patnaik
- Department of Anesthesiology and Pain Management, The University of Texas Southwestern Medical Center, Dallas, TX, USA.
| | - Ulrike Hoffmann
- Department of Anesthesiology and Pain Management, The University of Texas Southwestern Medical Center, Dallas, TX, USA
| |
Collapse
|
33
|
Ray PP. Can we depend on LLMs to persuade myopia-related issues? Ophthalmic Physiol Opt 2024; 44:231-232. [PMID: 37635304 DOI: 10.1111/opo.13226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2023] [Accepted: 08/17/2023] [Indexed: 08/29/2023]
|
34
|
Biswas S, Logan NS, Davies LN, Sheppard AL, Wolffsohn JS. Authors' Reply: Assessing the utility of ChatGPT as an artificial intelligence-based large language model for information to answer questions on myopia. Ophthalmic Physiol Opt 2024; 44:233-234. [PMID: 37635297 DOI: 10.1111/opo.13227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2023] [Accepted: 08/17/2023] [Indexed: 08/29/2023]
Affiliation(s)
- Sayantan Biswas
- School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
| | - Nicola S Logan
- School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
| | - Leon N Davies
- School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
| | - Amy L Sheppard
- School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
| | - James S Wolffsohn
- School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
| |
Collapse
|
35
|
Momenaei B, Wakabayashi T, Shahlaee A, Durrani AF, Pandit SA, Wang K, Mansour HA, Abishek RM, Xu D, Sridhar J, Yonekawa Y, Kuriyan AE. Reply. Ophthalmol Retina 2024; 8:e1-e2. [PMID: 37815785 DOI: 10.1016/j.oret.2023.09.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2023] [Accepted: 09/06/2023] [Indexed: 10/11/2023]
Affiliation(s)
- Bita Momenaei
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
| | - Taku Wakabayashi
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
| | - Abtin Shahlaee
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
| | - Asad F Durrani
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
| | - Saagar A Pandit
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
| | - Kristine Wang
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
| | - Hana A Mansour
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
| | - Robert M Abishek
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
| | - David Xu
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
| | - Jayanth Sridhar
- Bascom Palmer Eye Institute, University of Miami Miller School of Medicine, Miami, Florida
| | - Yoshihiro Yonekawa
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
| | - Ajay E Kuriyan
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania.
| |
Collapse
|
36
|
Wei K, Fritz C, Rajasekaran K. Answering head and neck cancer questions: An assessment of ChatGPT responses. Am J Otolaryngol 2024; 45:104085. [PMID: 37844413 DOI: 10.1016/j.amjoto.2023.104085] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2023] [Accepted: 10/01/2023] [Indexed: 10/18/2023]
Abstract
PURPOSE To examine and compare ChatGPT versus Google websites in answering common head and neck cancer questions. MATERIALS AND METHODS Commonly asked questions about head and neck cancer were obtained and entered into both ChatGPT-4 and the Google search engine. For each question, the ChatGPT response and the first website search result were compiled and examined. Content quality was assessed by independent reviewers using standardized grading criteria and the modified Ensuring Quality Information for Patients (EQIP) tool. Readability was determined using the Flesch reading ease scale. RESULTS In total, 49 questions related to head and neck cancer were included. Google sources were on average of significantly higher quality than ChatGPT responses (4.2 vs 3.6, p = 0.005). According to the EQIP tool, Google and ChatGPT had on average similar response rates per criterion (24.4 vs 20.5, p = 0.09), while Google had a significantly higher average score per question than ChatGPT (13.8 vs 11.7, p < 0.001). According to the Flesch reading ease scale, ChatGPT and Google sources were similarly difficult to read (33.1 vs 37.0, p = 0.180), both at a college level (14.3 vs 14.2, p = 0.820). CONCLUSION ChatGPT responses were as challenging to read as Google sources but of poorer quality, owing to decreased reliability and accuracy in answering questions. Though promising, ChatGPT in its current form should not be considered dependable. Google sources remain the preferred resource for patient educational materials.
Collapse
Affiliation(s)
- Kimberly Wei
- Department of Otorhinolaryngology - Head and Neck Surgery, University of Pennsylvania, Philadelphia, PA, USA
| | - Christian Fritz
- Department of Otorhinolaryngology - Head and Neck Surgery, University of Pennsylvania, Philadelphia, PA, USA
| | - Karthik Rajasekaran
- Department of Otorhinolaryngology - Head and Neck Surgery, University of Pennsylvania, Philadelphia, PA, USA; Leonard Davis Institute of Health Economics, University of Pennsylvania, Philadelphia, PA, USA.
| |
Collapse
|
37
|
Tangadulrat P, Sono S, Tangtrakulwanich B. Using ChatGPT for Clinical Practice and Medical Education: Cross-Sectional Survey of Medical Students' and Physicians' Perceptions. JMIR MEDICAL EDUCATION 2023; 9:e50658. [PMID: 38133908 PMCID: PMC10770783 DOI: 10.2196/50658] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/08/2023] [Revised: 10/17/2023] [Accepted: 12/11/2023] [Indexed: 12/23/2023]
Abstract
BACKGROUND ChatGPT is a well-known chatbot based on a large language model. It can be used in the medical field in many ways. However, some physicians are still unfamiliar with ChatGPT and are concerned about its benefits and risks. OBJECTIVE We aimed to evaluate the perception of physicians and medical students toward using ChatGPT in the medical field. METHODS A web-based questionnaire was sent to medical students, interns, residents, and attending staff with questions regarding their perception toward using ChatGPT in clinical practice and medical education. Participants were also asked to rate their perception of a ChatGPT-generated response about knee osteoarthritis. RESULTS Participants included 124 medical students, 46 interns, 37 residents, and 32 attending staff. After reading ChatGPT's response, 132 of the 239 (55.2%) participants had a positive rating of using ChatGPT for clinical practice. The proportion of positive answers was significantly lower in graduated physicians (48/115, 42%) than in medical students (84/124, 68%; P<.001). Participants listed the lack of patient-specific treatment plans, possibly outdated evidence, and the language barrier as ChatGPT's pitfalls. Regarding the use of ChatGPT for medical education, the proportion of positive responses was also significantly lower in graduated physicians (71/115, 62%) than in medical students (103/124, 83.1%; P<.001). Participants were concerned that ChatGPT's responses might be too superficial, might lack scientific evidence, and might need expert verification. CONCLUSIONS Medical students generally had a positive perception of using ChatGPT for guiding treatment and medical education, whereas graduated physicians were more cautious in this regard. Nonetheless, both medical students and graduated physicians positively perceived using ChatGPT for creating patient educational materials.
Collapse
Affiliation(s)
- Pasin Tangadulrat
- Department of Orthopedics, Faculty of Medicine, Prince of Songkla University, Hatyai, Thailand
| | - Supinya Sono
- Division of Family and Preventive Medicine, Faculty of Medicine, Prince of Songkla University, Hatyai, Thailand
| | | |
Collapse
|
38
|
Wong M, Lim ZW, Pushpanathan K, Cheung CY, Wang YX, Chen D, Tham YC. Review of emerging trends and projection of future developments in large language models research in ophthalmology. Br J Ophthalmol 2023:bjo-2023-324734. [PMID: 38164563 DOI: 10.1136/bjo-2023-324734] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2023] [Accepted: 11/14/2023] [Indexed: 01/03/2024]
Abstract
BACKGROUND Large language models (LLMs) are fast emerging as potent tools in healthcare, including ophthalmology. This systematic review offers a twofold contribution: it summarises current trends in ophthalmology-related LLM research and projects future directions for this burgeoning field. METHODS We systematically searched across various databases (PubMed, Europe PMC, Scopus and Web of Science) for articles related to LLM use in ophthalmology, published between 1 January 2022 and 31 July 2023. Selected articles were summarised, and categorised by type (editorial, commentary, original research, etc) and their research focus (eg, evaluating ChatGPT's performance in ophthalmology examinations or clinical tasks). FINDINGS We identified 32 articles meeting our criteria, published between January and July 2023, with a peak in June (n=12). Most were original research evaluating LLMs' proficiency in clinically related tasks (n=9). Studies demonstrated that ChatGPT-4.0 outperformed its predecessor, ChatGPT-3.5, in ophthalmology exams. Furthermore, ChatGPT excelled in constructing discharge notes (n=2), evaluating diagnoses (n=2) and answering general medical queries (n=6). However, it struggled with generating scientific articles or abstracts (n=3) and answering specific subdomain questions, especially those regarding specific treatment options (n=2). ChatGPT's performance relative to other LLMs (Google's Bard, Microsoft's Bing) varied by study design. Ethical concerns such as data hallucination (n=27), authorship (n=5) and data privacy (n=2) were frequently cited. INTERPRETATION While LLMs hold transformative potential for healthcare and ophthalmology, concerns over accountability, accuracy and data security remain. Future research should focus on application programming interface integration, comparative assessments of popular LLMs, their ability to interpret image-based data and the establishment of standardised evaluation frameworks.
Collapse
Affiliation(s)
| | - Zhi Wei Lim
- Yong Loo Lin School of Medicine, National University of Singapore, Singapore
| | - Krithi Pushpanathan
- Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
| | - Carol Y Cheung
- Ophthalmology and Visual Sciences, The Chinese University of Hong Kong, Hong Kong, Hong Kong
| | - Ya Xing Wang
- Beijing Institute of Ophthalmology, Beijing Tongren Hospital, Capital University of Medical Science, Beijing, China
| | - David Chen
- Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
| | - Yih Chung Tham
- Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
| |
Collapse
|
39
|
Kunze KN. Editorial Commentary: Recognizing and Avoiding Medical Misinformation Across Digital Platforms: Smoke, Mirrors (and Streaming). Arthroscopy 2023; 39:2454-2455. [PMID: 37981387 DOI: 10.1016/j.arthro.2023.06.054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 06/18/2023] [Revised: 06/27/2023] [Accepted: 06/30/2023] [Indexed: 11/21/2023]
Abstract
The evolution of social media and related online sources has substantially increased the ability of patients to query and access publicly available information that may have relevance to a potential musculoskeletal condition of interest. Although increased accessibility to information has several purported benefits, including encouragement of patients to become more invested in their care through self-teaching, a downside to the existence of a vast number of unregulated resources remains the risk of misinformation. As health care providers, we have a moral and ethical obligation to mitigate this risk by directing patients to high-quality resources for medical information and to be aware of resources that are unreliable. To this end, a growing body of evidence has suggested that YouTube lacks reliability and quality in terms of medical information concerning a variety of musculoskeletal conditions.
40
Ferro Desideri L, Roth J, Zinkernagel M, Anguita R. "Application and accuracy of artificial intelligence-derived large language models in patients with age related macular degeneration". Int J Retina Vitreous 2023; 9:71. [PMID: 37980501 PMCID: PMC10657493 DOI: 10.1186/s40942-023-00511-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2023] [Accepted: 11/11/2023] [Indexed: 11/20/2023] Open
Abstract
INTRODUCTION Age-related macular degeneration (AMD) affects millions of people globally, leading to a surge in online searches for putative diagnoses and causing potential misinformation and anxiety in patients and their relatives. This study explores the efficacy of artificial intelligence-derived large language models (LLMs), such as ChatGPT, in addressing AMD patients' questions. METHODS ChatGPT 3.5 (2023), Bing AI (2023) and Google Bard (2023) were adopted as LLMs. Patients' questions were subdivided into two categories, (a) general medical advice and (b) pre- and post-intravitreal injection advice, and responses were classified as (1) accurate and sufficient, (2) partially accurate but sufficient or (3) inaccurate and not sufficient. A non-parametric test was used to compare mean scores among the three LLMs, and analysis of variance and reliability tests were performed across the three groups. RESULTS In category (a), the average score was 1.20 (± 0.41) with ChatGPT 3.5, 1.60 (± 0.63) with Bing AI and 1.60 (± 0.73) with Google Bard, showing no significant difference among the three groups (p = 0.129). In category (b), the average score was 1.07 (± 0.27) with ChatGPT 3.5, 1.69 (± 0.63) with Bing AI and 1.38 (± 0.63) with Google Bard, a significant difference among the three groups (p = 0.0042). Reliability statistics showed a Cronbach's α of 0.237 (range 0.448, 0.096-0.544). CONCLUSION ChatGPT 3.5 consistently offered the most accurate and satisfactory responses, particularly for technical queries. LLMs show promise in providing precise information about AMD; however, further improvement is needed, especially for more technical questions.
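The reliability coefficient reported above (Cronbach's α) is a simple function of the per-rater and total-score variances. A minimal Python sketch follows, assuming an invented matrix of 1-3 accuracy ratings (rows = questions, columns = the three LLMs); this is not the study's data.

```python
# Minimal sketch of Cronbach's alpha over a questions-by-models score matrix.
# The ratings below are invented for illustration only.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: (n_items, n_raters) matrix of ratings."""
    k = scores.shape[1]                          # number of raters (LLMs)
    rater_vars = scores.var(axis=0, ddof=1)      # variance of each rater's scores
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of per-item totals
    return (k / (k - 1)) * (1 - rater_vars.sum() / total_var)

# Hypothetical 1-3 accuracy ratings for 10 questions from 3 models
ratings = np.array([
    [1, 2, 2], [1, 1, 2], [2, 2, 1], [1, 2, 2], [1, 1, 1],
    [1, 2, 1], [1, 1, 2], [2, 3, 2], [1, 1, 1], [1, 2, 2],
])
print(round(cronbach_alpha(ratings), 3))
```

A low α, as in the study, indicates that the three models' ratings do not move together across questions.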
Affiliation(s)
- Lorenzo Ferro Desideri
  Department of Ophthalmology, Inselspital, University Hospital of Bern, Bern, Switzerland
  Bern Photographic Reading Center, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland
- Janice Roth
  Department of Ophthalmology, Inselspital, University Hospital of Bern, Bern, Switzerland
- Martin Zinkernagel
  Department of Ophthalmology, Inselspital, University Hospital of Bern, Bern, Switzerland
  Bern Photographic Reading Center, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland
- Rodrigo Anguita
  Department of Ophthalmology, Inselspital, University Hospital of Bern, Bern, Switzerland
  Moorfields Eye Hospital NHS Foundation Trust, City Road, London, EC1V 2PD, UK
41
Haidar O, Jaques A, McCaughran PW, Metcalfe MJ. AI-Generated Information for Vascular Patients: Assessing the Standard of Procedure-Specific Information Provided by the ChatGPT AI-Language Model. Cureus 2023; 15:e49764. [PMID: 38046759 PMCID: PMC10691169 DOI: 10.7759/cureus.49764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/30/2023] [Indexed: 12/05/2023] Open
Abstract
Introduction Ensuring access to high-quality information is paramount to facilitating informed surgical decision-making. Use of the internet to access health-related information is increasing, along with the growing prevalence of AI language models such as ChatGPT. We aimed to assess the standard of AI-generated patient-facing information through a qualitative analysis of its readability and quality. Materials and methods We performed a retrospective qualitative analysis of information regarding three common vascular procedures: endovascular aortic repair (EVAR), endovenous laser ablation (EVLA), and femoro-popliteal bypass (FPBP). The ChatGPT responses were compared with patient information leaflets provided by the vascular charity Circulation Foundation UK. Readability was assessed using four readability scores: the Flesch-Kincaid reading ease (FKRE) score, the Flesch-Kincaid grade level (FKGL), the Gunning fog score (GFS), and the simple measure of gobbledygook (SMOG) index. Quality was assessed using the DISCERN tool by two independent assessors. Results The mean FKRE score was 33.3, compared with 59.1 for the information provided by the Circulation Foundation (SD=14.5, p=0.025), indicating poor readability of the AI-generated information. The FKGL indicated that the school grade of students expected to read and understand the ChatGPT responses was consistently higher than for the information leaflets, at 12.7 vs. 9.4 (SD=1.9, p=0.002). Two metrics measure readability in terms of the number of years of education required to understand a piece of writing: the GFS and the SMOG index. Both indicated that AI-generated answers were less accessible. The GFS for ChatGPT-provided information was 16.7 years versus 12.8 years for the leaflets (SD=2.2, p=0.002), and the SMOG index scores were 12.2 and 9.4 years for ChatGPT and the patient information leaflets, respectively (SD=1.7, p=0.001). The DISCERN scores were consistently higher for human-generated patient information leaflets than for AI-generated information across all procedures; the mean score for the information provided by ChatGPT was 50.3 vs. 56.0 for the Circulation Foundation leaflets (SD=3.38, p<0.001). Conclusion AI-generated information about vascular surgical procedures is currently poor in both the readability of its text and the quality of its information. Patients should be directed to reputable, human-generated information sources from trusted professional bodies to supplement direct education from the clinician during the pre-procedure consultation process.
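All four indices above are closed-form functions of simple text counts (words, sentences, syllables, polysyllabic words). The Python sketch below implements the published formulas with a naive vowel-group syllable counter, so its values only approximate those of dedicated readability tools; the sample sentence is invented.

```python
# Sketch of the four readability indices named above (FKRE, FKGL, GFS, SMOG).
# Syllables are estimated by counting vowel groups, a rough heuristic.
import math
import re

def text_counts(text: str):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syls = [max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words]
    polysyllables = sum(1 for s in syls if s >= 3)
    return len(words), sentences, sum(syls), polysyllables

def readability(text: str) -> dict:
    w, s, syl, poly = text_counts(text)
    return {
        "FKRE": 206.835 - 1.015 * (w / s) - 84.6 * (syl / w),
        "FKGL": 0.39 * (w / s) + 11.8 * (syl / w) - 15.59,
        "GFS": 0.4 * ((w / s) + 100 * (poly / w)),
        "SMOG": 1.043 * math.sqrt(30 * poly / s) + 3.1291,
    }

print(readability("The aneurysm is repaired through a small incision in the groin."))
```

Note the opposite polarities: a higher FKRE means easier text, while higher FKGL, GFS, and SMOG values mean more years of education are required, which is why ChatGPT's lower FKRE and higher grade-level scores all point the same way.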
Affiliation(s)
- Omar Haidar
  Vascular Surgery, Lister Hospital, Stevenage, GBR
42
Hernandez CA, Vazquez Gonzalez AE, Polianovskaia A, Amoro Sanchez R, Muyolema Arce V, Mustafa A, Vypritskaya E, Perez Gutierrez O, Bashir M, Eighaei Sedeh A. The Future of Patient Education: AI-Driven Guide for Type 2 Diabetes. Cureus 2023; 15:e48919. [PMID: 38024047 PMCID: PMC10654048 DOI: 10.7759/cureus.48919] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/16/2023] [Indexed: 12/01/2023] Open
Abstract
Introduction and aim The surging incidence of type 2 diabetes has become a growing concern for the healthcare sector. This chronic ailment, characterized by a complex blend of genetic and lifestyle determinants, has increased notably in recent times, exerting substantial pressure on healthcare resources. As more individuals turn to online platforms for health guidance and embrace the Chat Generative Pre-trained Transformer (ChatGPT; San Francisco, CA: OpenAI), a text-generating AI (TGAI), for insights into their well-being, evaluating its effectiveness and reliability becomes crucial. This research primarily aimed to evaluate the correctness of TGAI responses to type 2 diabetes (T2DM) inquiries via ChatGPT. Furthermore, the study examined the consistency of TGAI in addressing common queries on T2DM complications for patient education. Material and methods Questions on T2DM were formulated by experienced physicians and screened by research personnel before being posed to ChatGPT. Each question was posed three times, and the collected answers were summarized. Responses were then sorted by two seasoned physicians into three distinct categories: (a) appropriate, (b) inappropriate, and (c) unreliable. In instances of differing opinions, a third physician was consulted to reach consensus. Results From the initial set of 110 T2DM questions, 40 were dismissed by the experts as not relevant, leaving a final set of 70. An overwhelming 98.5% of the AI's answers were judged appropriate, underscoring its reliability relative to traditional online search engines. Nonetheless, the 1.5% rate of inappropriate responses underlines the importance of ongoing AI improvement and strict adherence to medical protocols. Conclusion TGAI provides medical information of high quality and reliability. This study underscores TGAI's effectiveness in delivering reliable information about T2DM, with 98.5% of responses aligning with the standard of care. These results hold promise for integrating AI platforms as supplementary tools to enhance patient education and outcomes.
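The two-physician-plus-tiebreaker labelling described in the methods reduces to a simple consensus rule. Here is a minimal sketch of that workflow; the category names come from the abstract, but the example labels are hypothetical.

```python
# Sketch of the adjudication rule described above: two physicians label each
# response, and a third is consulted only when they disagree. Data invented.
from collections import Counter

CATEGORIES = ("appropriate", "inappropriate", "unreliable")

def adjudicate(rater1: str, rater2: str, tiebreaker: str) -> str:
    """Return the consensus label for one response."""
    if rater1 == rater2:
        return rater1
    return tiebreaker  # third physician resolves the disagreement

labels = [
    adjudicate("appropriate", "appropriate", "appropriate"),
    adjudicate("appropriate", "unreliable", "appropriate"),
    adjudicate("inappropriate", "unreliable", "inappropriate"),
]
print(Counter(labels))  # e.g. Counter({'appropriate': 2, 'inappropriate': 1})
```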
43
Kleebayoon A, Wiwanitkit V. Re: Momenaei et al.: Appropriateness and readability of ChatGPT-4 generated responses for surgical treatment of retinal diseases (Ophthalmol Retina. 2023;7:862-868.). Ophthalmol Retina 2023; 7:e15. [PMID: 37379882 DOI: 10.1016/j.oret.2023.06.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2023] [Accepted: 06/22/2023] [Indexed: 06/30/2023]
44
Nazir T, Ahmad U, Mal M, Rehman MU, Saeed R, Kalia J. Microsoft Bing vs Google Bard in Neurology: A Comparative Study of AI-Generated Patient Education Material. medRxiv 2023 [Preprint]. [DOI: 10.1101/2023.08.25.23294641] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/01/2023]
Abstract
BACKGROUND Patient education is an essential component of healthcare, and artificial intelligence (AI) language models such as Google Bard and Microsoft Bing have the potential to improve information transmission and enhance patient care. However, it is crucial to evaluate the quality, accuracy, and understandability of the materials generated by these models before applying them in medical practice. This study aimed to assess and compare the quality of patient education materials produced by Google Bard and Microsoft Bing in response to questions related to neurological conditions. METHODS A cross-sectional study design was used to evaluate and compare the ability of Google Bard and Microsoft Bing to generate patient education materials. The study included the top ten prevalent neurological diseases based on WHO prevalence data. Ten board-certified neurologists and four neurology residents evaluated the responses generated by the models on six quality metrics. The scores for each model were compiled and averaged across all measures, and the significance of any observed differences was assessed using an independent t-test. RESULTS Google Bard performed better than Microsoft Bing on all six quality metrics, with overall mean scores of 79% and 69%, respectively. Google Bard outperformed Microsoft Bing on all measures for eight questions, while Microsoft Bing performed marginally better in terms of objectivity and clarity for the epilepsy query. CONCLUSION This study showed that Google Bard performs better than Microsoft Bing in generating patient education materials for neurological diseases. However, healthcare professionals should weigh both AI models' advantages and disadvantages when supporting patients' health information needs. Future studies can help determine the underlying causes of these variations and guide cooperative initiatives to create more user-focused AI-generated patient education materials. Finally, researchers should consider patients' perceptions of AI-generated patient education material and its impact on implementing these solutions in healthcare settings.
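The significance testing described in the methods is a standard independent-samples t-test on the two models' scores. A minimal sketch follows, assuming SciPy is available; the score vectors are invented stand-ins, since the study's raw data are not reproduced here.

```python
# Sketch of the comparison described above: an independent t-test on the two
# models' per-evaluator mean quality scores. Scores below are hypothetical.
from scipy import stats

bard_scores = [0.82, 0.75, 0.80, 0.77, 0.81, 0.78, 0.79, 0.76]  # hypothetical
bing_scores = [0.70, 0.66, 0.72, 0.68, 0.71, 0.65, 0.69, 0.67]  # hypothetical

t_stat, p_value = stats.ttest_ind(bard_scores, bing_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```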