1
Tepe M, Emekli E. Decoding medical jargon: The use of AI language models (ChatGPT-4, BARD, Microsoft Copilot) in radiology reports. Patient Educ Couns 2024;126:108307. PMID: 38743965. DOI: 10.1016/j.pec.2024.108307.
Abstract
OBJECTIVE Evaluate Artificial Intelligence (AI) language models (ChatGPT-4, BARD, Microsoft Copilot) in simplifying radiology reports, assessing readability, understandability, actionability, and urgency classification. METHODS This study evaluated the effectiveness of these AI models in translating radiology reports into patient-friendly language and providing understandable and actionable suggestions and urgency classifications. Thirty radiology reports were processed using AI tools, and their outputs were assessed for readability (Flesch Reading Ease, Flesch-Kincaid Grade Level), understandability (PEMAT), and the accuracy of urgency classification. ANOVA and Chi-Square tests were performed to compare the models' performances. RESULTS All three AI models successfully transformed medical jargon into more accessible language, with BARD showing superior readability scores. In terms of understandability, all models achieved scores above 70%, with ChatGPT-4 and BARD leading (p < 0.001, both). However, the AI models varied in accuracy of urgency recommendations, with no significant statistical difference (p = 0.284). CONCLUSION AI language models have proven effective in simplifying radiology reports, thereby potentially improving patient comprehension and engagement in their health decisions. However, their accuracy in assessing the urgency of medical conditions based on radiology reports suggests a need for further refinement. PRACTICE IMPLICATIONS Incorporating AI in radiology communication can empower patients, but further development is crucial for comprehensive and actionable patient support.
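The two readability formulas used in this study are standard and straightforward to reproduce. A minimal Python sketch using the published Flesch constants, assuming the word, sentence, and syllable counts are already available:

```python
# A sketch of the two readability formulas, using the standard published
# constants; real tools also handle sentence and syllable counting.
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade required."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Illustrative counts for a short simplified report
print(round(flesch_reading_ease(120, 8, 180), 1))   # 64.7 (plain English)
print(round(flesch_kincaid_grade(120, 8, 180), 1))  # 8.0 (8th-grade level)
```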
Affiliation(s)
- Murat Tepe: Department of Radiology, King's College Hospital London, Dubai, United Arab Emirates
- Emre Emekli: Department of Radiology, Eskişehir Osmangazi University, Eskişehir, Turkiye; Department of Medical Education, Gazi University, Ankara, Turkiye
2
Tailor PD, Dalvin LA, Chen JJ, Iezzi R, Olsen TW, Scruggs BA, Barkmeier AJ, Bakri SJ, Ryan EH, Tang PH, Parke DW, Belin PJ, Sridhar J, Xu D, Kuriyan AE, Yonekawa Y, Starr MR. A Comparative Study of Responses to Retina Questions from Either Experts, Expert-Edited Large Language Models, or Large Language Models Alone. Ophthalmol Sci 2024;4:100485. PMID: 38660460. PMCID: PMC11041826. DOI: 10.1016/j.xops.2024.100485.
Abstract
Objective To assess the quality, empathy, and safety of expert-edited large language model (LLM), human expert-created, and LLM responses to common retina patient questions. Design Randomized, masked, multicenter study. Participants Twenty-one common retina patient questions were randomly assigned among 13 retina specialists. Methods Each expert created a response (Expert) and then edited an LLM (ChatGPT-4)-generated response to that question (Expert + artificial intelligence [AI]), timing themselves for both tasks. Five LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing, and Bard) also generated responses to each question. The original question, along with anonymized and randomized Expert + AI, Expert, and LLM responses, was evaluated by the other experts who did not write an expert response to the question. Evaluators judged quality and empathy (very poor, poor, acceptable, good, or very good) along with safety metrics (incorrect information, likelihood to cause harm, extent of harm, and missing content). Main Outcome Measures Mean quality and empathy scores, and the proportion of responses with incorrect information, likelihood to cause harm, extent of harm, and missing content for each response type. Results There were 4008 total grades collected (2608 for quality and empathy; 1400 for safety metrics), with significant differences in both quality and empathy (P < 0.001 for both) among the LLM, Expert, and Expert + AI groups. For quality, Expert + AI (3.86 ± 0.85) performed the best overall, while GPT-3.5 (3.75 ± 0.79) was the top-performing LLM. For empathy, GPT-3.5 (3.75 ± 0.69) had the highest mean score, followed by Expert + AI (3.73 ± 0.63). By mean score, Expert placed 4 out of 7 for quality and 6 out of 7 for empathy. For both quality (P < 0.001) and empathy (P < 0.001), expert-edited LLM responses performed better than expert-created responses. There were time savings for an expert-edited LLM response versus an expert-created response (P = 0.02). ChatGPT-4 performed similarly to Expert for inappropriate content (P = 0.35), missing content (P = 0.001), extent of possible harm (P = 0.356), and likelihood of possible harm (P = 0.129). Conclusions In this randomized, masked, multicenter study, LLM responses were comparable with expert responses in terms of quality, empathy, and safety metrics, warranting further exploration of their potential benefits in clinical settings. Financial Disclosures Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of the article.
Affiliation(s)
- John J. Chen: Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- Raymond Iezzi: Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- Sophie J. Bakri: Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- Edwin H. Ryan: Retina Consultants of Minnesota, Edina, Minnesota; Department of Ophthalmology & Visual Neurosciences, University of Minnesota Medical School, Minneapolis, Minnesota
- Peter H. Tang: Retina Consultants of Minnesota, Edina, Minnesota; Department of Ophthalmology & Visual Neurosciences, University of Minnesota Medical School, Minneapolis, Minnesota
- D. Wilkin Parke: Retina Consultants of Minnesota, Edina, Minnesota; Department of Ophthalmology & Visual Neurosciences, University of Minnesota Medical School, Minneapolis, Minnesota
- Jayanth Sridhar: Olive View Medical Center, University of California Los Angeles, Los Angeles, California
- David Xu: Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
- Ajay E. Kuriyan: Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
- Yoshihiro Yonekawa: Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
3
Luo S, Canavese F, Aroojis A, Andreacchio A, Anticevic D, Bouchard M, Castaneda P, De Rosa V, Fiogbe MA, Frick SL, Hui JH, Johari AN, Loro A, Lyu X, Matsushita M, Omeroglu H, Roye DP, Shah MM, Yong B, Li L. Are Generative Pretrained Transformer 4 Responses to Developmental Dysplasia of the Hip Clinical Scenarios Universal? An International Review. J Pediatr Orthop 2024;44:e504-e511. PMID: 38597198. DOI: 10.1097/bpo.0000000000002682.
Abstract
OBJECTIVE There is increasing interest in applying artificial intelligence chatbots like generative pretrained transformer 4 (GPT-4) in the medical field. This study aimed to explore the universality of GPT-4 responses to simulated clinical scenarios of developmental dysplasia of the hip (DDH) across diverse global settings. METHODS Seventeen international experts with more than 15 years of experience in pediatric orthopaedics were selected for the evaluation panel. Eight simulated DDH clinical scenarios were created, covering 4 key areas: (1) initial evaluation and diagnosis, (2) initial examination and treatment, (3) nursing care and follow-up, and (4) prognosis and rehabilitation planning. Each scenario was completed independently in a new GPT-4 session. Interrater reliability was assessed using Fleiss kappa, and the quality, relevance, and applicability of GPT-4 responses were analyzed using median scores and interquartile ranges. Following scoring, experts met in Zoom sessions to generate Regional Consensus Assessment Scores, intended to represent a consistent regional assessment of the use of GPT-4 in pediatric orthopaedic care. RESULTS GPT-4's responses to the 8 clinical DDH scenarios received performance scores ranging from 44.3% to 98.9% of the 88-point maximum. The Fleiss kappa statistic of 0.113 (P = 0.001) indicated low agreement among experts in their ratings. When assessing the responses' quality, relevance, and applicability, the median scores were 3, with interquartile ranges of 3 to 4, 3 to 4, and 2 to 3, respectively. Significant differences were noted in the prognosis and rehabilitation domain scores (P < 0.05 for all). Regional consensus scores were 75 for Africa, 74 for Asia, 73 for India, 80 for Europe, and 65 for North America, with the Kruskal-Wallis test highlighting significant disparities between these regions (P = 0.034). CONCLUSIONS This study demonstrates the promise of GPT-4 in pediatric orthopaedic care, particularly in supporting preliminary DDH assessments and guiding treatment strategies for specialist care. However, effective integration of GPT-4 into clinical practice will require adaptation to specific regional health care contexts, highlighting the importance of a nuanced approach to health technology adaptation. LEVEL OF EVIDENCE Level IV.
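Fleiss kappa, the interrater statistic reported above, extends Cohen's kappa to more than two raters. A minimal sketch, assuming the statsmodels package; the ratings matrix is illustrative, not the study's data:

```python
# Illustrative ratings: rows are rated items (e.g., scenario responses),
# columns are raters, values are assigned categories (e.g., Likert 1-5).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([
    [3, 4, 3, 2, 3],
    [5, 5, 4, 5, 4],
    [2, 3, 2, 2, 1],
    [4, 4, 4, 3, 4],
])

table, _ = aggregate_raters(ratings)         # items x categories count table
print(fleiss_kappa(table, method="fleiss"))  # 1 = perfect agreement, 0 = chance
```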
Affiliation(s)
- Shaoting Luo: Department of Pediatric Orthopaedics, Shengjing Hospital of China Medical University, Shenyang, Liaoning
- Federico Canavese: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- Alaric Aroojis: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- Antonio Andreacchio: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- Darko Anticevic: Pediatric Orthopedics Clinic of Pediatric Surgery and Orthopedics, Pediatric Institute of Southern Switzerland (IPSI), Via Athos Gallino, Bellinzona, Switzerland
- Pablo Castaneda: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- Vincenzo De Rosa: Pediatric Orthopedics Clinic of Pediatric Surgery and Orthopedics, Pediatric Institute of Southern Switzerland (IPSI), Via Athos Gallino, Bellinzona, Switzerland
- Steven L Frick: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- James H Hui: Department of Orthopaedic Surgery, Nagoya University Graduate School of Medicine, Nagoya, Aichi, Japan
- Ashok N Johari: Pediatric Orthopedics Clinic of Pediatric Surgery and Orthopedics, Pediatric Institute of Southern Switzerland (IPSI), Via Athos Gallino, Bellinzona, Switzerland
- Antonio Loro: Ufuk University Faculty of Medicine, Ankara, Turkey
- Xuemin Lyu: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- Masaki Matsushita: Department of Orthopaedic Surgery, Nagoya University Graduate School of Medicine, Nagoya, Aichi, Japan
- David P Roye: Department of Orthopaedic Surgery, Nagoya University Graduate School of Medicine, Nagoya, Aichi, Japan
- Bicheng Yong: Department of Pediatric Orthopaedics, Beit CURE Children's Hospital of Malawi, Chichiri Blantyre, Malawi
- Lianyong Li: Department of Pediatric Orthopaedics, Shengjing Hospital of China Medical University, Shenyang, Liaoning
4
Maywood MJ, Parikh R, Deobhakta A, Begaj T. Performance assessment of an artificial intelligence chatbot in clinical vitreoretinal scenarios. Retina 2024;44:954-964. PMID: 38271674. DOI: 10.1097/iae.0000000000004053.
Abstract
PURPOSE To determine how often ChatGPT is able to provide accurate and comprehensive information regarding clinical vitreoretinal scenarios. To assess the types of sources ChatGPT primarily uses and to determine whether they are hallucinated. METHODS This was a retrospective cross-sectional study. The authors designed 40 open-ended clinical scenarios across four main topics in vitreoretinal disease. Responses were graded on correctness and comprehensiveness by three blinded retina specialists. The primary outcome was the number of clinical scenarios that ChatGPT answered correctly and comprehensively. Secondary outcomes included theoretical harm to patients, the distribution of the type of references used by the chatbot, and the frequency of hallucinated references. RESULTS In June 2023, ChatGPT answered 83% of clinical scenarios (33/40) correctly but provided a comprehensive answer in only 52.5% of cases (21/40). Subgroup analysis demonstrated an average correct score of 86.7% in neovascular age-related macular degeneration, 100% in diabetic retinopathy, 76.7% in retinal vascular disease, and 70% in the surgical domain. There were six incorrect responses with one case (16.7%) of no harm, three cases (50%) of possible harm, and two cases (33.3%) of definitive harm. CONCLUSION ChatGPT correctly answered more than 80% of complex open-ended vitreoretinal clinical scenarios, with a reduced capability to provide a comprehensive response.
Affiliation(s)
- Michael J Maywood: Department of Ophthalmology, Corewell Health William Beaumont University Hospital, Royal Oak, Michigan
- Ravi Parikh: Manhattan Retina and Eye Consultants, New York, New York; Department of Ophthalmology, New York University School of Medicine, New York, New York
- Tedi Begaj: Department of Ophthalmology, Corewell Health William Beaumont University Hospital, Royal Oak, Michigan; Associated Retinal Consultants, Royal Oak, Michigan
5
Tailor PD, D'Souza HS, Li H, Starr MR. Vision of the future: large language models in ophthalmology. Curr Opin Ophthalmol 2024. PMID: 38814572. DOI: 10.1097/icu.0000000000001062. Online ahead of print.
Abstract
PURPOSE OF REVIEW Large language models (LLMs) are rapidly entering the landscape of medicine in areas from patient interaction to clinical decision-making. This review discusses the evolving role of LLMs in ophthalmology, focusing on their current applications and future potential in enhancing ophthalmic care. RECENT FINDINGS LLMs in ophthalmology have demonstrated potential in improving patient communication and aiding preliminary diagnostics because of their ability to process complex language and generate human-like domain-specific interactions. However, some studies have shown potential for harm and there have been no prospective real-world studies evaluating the safety and efficacy of LLMs in practice. SUMMARY While current applications are largely theoretical and require rigorous safety testing before implementation, LLMs exhibit promise in augmenting patient care quality and efficiency. Challenges such as data privacy and user acceptance must be overcome before LLMs can be fully integrated into clinical practice.
Affiliation(s)
- Haley S D'Souza: Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- Hanzhou Li: Department of Radiology, Emory University, Atlanta, Georgia, USA
- Matthew R Starr: Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
6
Gomez-Cabello CA, Borna S, Pressman SM, Haider SA, Sehgal A, Leibovich BC, Forte AJ. Artificial Intelligence in Postoperative Care: Assessing Large Language Models for Patient Recommendations in Plastic Surgery. Healthcare (Basel) 2024;12:1083. PMID: 38891158. PMCID: PMC11171524. DOI: 10.3390/healthcare12111083.
Abstract
Since their release, the medical community has been actively exploring large language models' (LLMs) capabilities, which show promise in providing accurate medical knowledge. One potential application is as a patient resource. This study analyzes and compares the ability of the currently available LLMs, ChatGPT-3.5, GPT-4, and Gemini, to provide postoperative care recommendations to plastic surgery patients. We presented each model with 32 questions addressing common patient concerns after surgical cosmetic procedures and evaluated the medical accuracy, readability, understandability, and actionability of the models' responses. The three LLMs provided equally accurate information, with GPT-3.5 averaging the highest on the Likert scale (LS) (4.18 ± 0.93) (p = 0.849), while Gemini provided significantly more readable (p = 0.001) and understandable responses (p = 0.014; p = 0.001). There was no difference in the actionability of the models' responses (p = 0.830). Although LLMs have shown their potential as adjunctive tools in postoperative patient care, further refinement and research are imperative to enable their evolution into comprehensive standalone resources.
Affiliation(s)
- Sahar Borna: Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Syed Ali Haider: Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Ajai Sehgal: Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
- Bradley C. Leibovich: Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA; Department of Urology, Mayo Clinic, Rochester, MN 55905, USA
- Antonio J. Forte: Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA; Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
7
Xu P, Chen X, Zhao Z, Shi D. Unveiling the clinical incapabilities: a benchmarking study of GPT-4V(ision) for ophthalmic multimodal image analysis. Br J Ophthalmol 2024. PMID: 38789133. DOI: 10.1136/bjo-2023-325054. Online ahead of print.
Abstract
PURPOSE To evaluate the capabilities and incapabilities of a GPT-4V(ision)-based chatbot in interpreting ocular multimodal images. METHODS We developed a digital ophthalmologist app using GPT-4V and evaluated its performance with a dataset (60 images, 60 ophthalmic conditions, 6 modalities) that included slit-lamp, scanning laser ophthalmoscopy, fundus photography of the posterior pole (FPP), optical coherence tomography, fundus fluorescein angiography and ocular ultrasound images. The chatbot was tested with ten open-ended questions per image, covering examination identification, lesion detection, diagnosis and decision support. The responses were manually assessed for accuracy, usability, safety and diagnosis repeatability. Auto-evaluation was performed using sentence similarity and GPT-4-based auto-evaluation. RESULTS Out of 600 responses, 30.6% were accurate, 21.5% were highly usable and 55.6% were deemed to pose no harm. GPT-4V performed best with slit-lamp images, with 42.0%, 38.5% and 68.5% of the responses rated accurate, highly usable and no harm, respectively. However, its performance was weaker on FPP images, with only 13.7%, 3.7% and 38.5% in the same categories. GPT-4V correctly identified 95.6% of the imaging modalities and showed varying accuracies in lesion identification (25.6%), diagnosis (16.1%) and decision support (24.0%). The overall repeatability of GPT-4V in diagnosing ocular images was 63.3% (38/60). The overall sentence similarity between responses generated by GPT-4V and human answers was 55.5%, with Spearman correlations of 0.569 for accuracy and 0.576 for usability. CONCLUSION GPT-4V is not yet suitable for clinical decision-making in ophthalmology. Our study serves as a benchmark for enhancing ophthalmic multimodal models.
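The sentence-similarity auto-evaluation described above can be approximated with embedding cosine similarity. A minimal sketch, assuming the sentence-transformers package; the model name and example texts are illustrative assumptions, not the authors' exact setup:

```python
# Assumes the sentence-transformers package; model and texts are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
gpt4v_answer = "The image shows optic disc swelling consistent with papilledema."
human_answer = "Fundus photograph demonstrating papilledema with blurred disc margins."

embeddings = model.encode([gpt4v_answer, human_answer], convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]).item())  # cosine similarity
```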
Affiliation(s)
- Pusheng Xu: School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China
- Xiaolan Chen: School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China
- Ziwei Zhao: School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China
- Danli Shi: School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China; Research Centre for SHARP Vision, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China; Centre for Eye and Vision Research (CEVR), 17W Hong Kong Science Park, Hong Kong
8
Gül Ş, Erdemir İ, Hanci V, Aydoğmuş E, Erkoç YS. How artificial intelligence can provide information about subdural hematoma: Assessment of readability, reliability, and quality of ChatGPT, BARD, and Perplexity responses. Medicine (Baltimore) 2024;103:e38009. PMID: 38701313. PMCID: PMC11062651. DOI: 10.1097/md.0000000000038009.
Abstract
Subdural hematoma is defined as a collection of blood in the subdural space between the dura mater and the arachnoid. Subdural hematoma is a condition that neurosurgeons frequently encounter and has acute, subacute and chronic forms. The incidence in adults is reported to be 1.72 to 20.60 per 100,000 people annually. Our study aimed to evaluate the quality, reliability and readability of the answers to questions asked of ChatGPT, Bard, and Perplexity about "Subdural Hematoma." In this observational and cross-sectional study, we asked ChatGPT, Bard, and Perplexity to provide the 100 most frequently asked questions about "Subdural Hematoma" separately. The responses of the three chatbots were analyzed separately for readability, quality, reliability and adequacy. When the median readability scores of the ChatGPT, Bard, and Perplexity answers were compared with the sixth-grade reading level, a statistically significant difference was observed for all formulas (P < .001). All three chatbots' responses were found to be difficult to read. Bard responses were more readable than ChatGPT's (P < .001) and Perplexity's (P < .001) responses for all scores evaluated. Although there were differences between the results of the readability formulas evaluated, Perplexity's answers were more readable than ChatGPT's (P < .05). Bard answers had the best Global Quality Scale (GQS) scores (P < .001). Perplexity responses had the best Journal of the American Medical Association (JAMA) benchmark and modified DISCERN scores (P < .001). The current capabilities of ChatGPT, Bard, and Perplexity are inadequate in terms of the quality and readability of "Subdural Hematoma"-related text content. The readability standard for patient education materials, as determined by the American Medical Association, National Institutes of Health, and the United States Department of Health and Human Services, is at or below grade 6. The readability levels of the responses of artificial intelligence applications such as ChatGPT, Bard, and Perplexity are significantly higher than the recommended sixth-grade level.
Affiliation(s)
- Şanser Gül: Department of Neurosurgery, Ankara Ataturk Sanatory Education and Research Hospital, Ankara, Turkey
- İsmail Erdemir: Department of Anesthesiology and Critical Care, Faculty of Medicine, Dokuz Eylül University, Izmir, Turkey
- Volkan Hanci: Department of Anesthesiology and Reanimation, Ankara Sincan Education and Research Hospital, Ankara, Turkey
- Evren Aydoğmuş: Department of Neurosurgery, Istanbul Kartal Dr Lütfi Kırdar City Hospital, Istanbul, Turkey
- Yavuz Selim Erkoç: Department of Neurosurgery, Ankara Ataturk Sanatory Education and Research Hospital, Ankara, Turkey
9
Chen X, Zhang W, Xu P, Zhao Z, Zheng Y, Shi D, He M. FFA-GPT: an automated pipeline for fundus fluorescein angiography interpretation and question-answer. NPJ Digit Med 2024;7:111. PMID: 38702471. PMCID: PMC11068733. DOI: 10.1038/s41746-024-01101-z.
Abstract
Fundus fluorescein angiography (FFA) is a crucial diagnostic tool for chorioretinal diseases, but its interpretation requires significant expertise and time. Prior studies have used Artificial Intelligence (AI)-based systems to assist FFA interpretation, but these systems lack user interaction and comprehensive evaluation by ophthalmologists. Here, we used large language models (LLMs) to develop an automated interpretation pipeline for both report generation and medical question-answering (QA) for FFA images. The pipeline comprises two parts: an image-text alignment module (Bootstrapping Language-Image Pre-training) for report generation and an LLM (Llama 2) for interactive QA. The model was developed using 654,343 FFA images with 9392 reports. It was evaluated both automatically, using language-based and classification-based metrics, and manually by three experienced ophthalmologists. The automatic evaluation of the generated reports demonstrated that the system can generate coherent and comprehensible free-text reports, achieving a BERTScore of 0.70 and F1 scores ranging from 0.64 to 0.82 for detecting top-5 retinal conditions. The manual evaluation revealed acceptable accuracy (68.3%, Kappa 0.746) and completeness (62.3%, Kappa 0.739) of the generated reports. The generated free-form answers were evaluated manually, with the majority meeting the ophthalmologists' criteria (error-free: 70.7%, complete: 84.0%, harmless: 93.7%, satisfied: 65.3%, Kappa: 0.762-0.834). This study introduces an innovative framework that combines multi-modal transformers and LLMs, enhancing ophthalmic image interpretation, and facilitating interactive communications during medical consultation.
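BERTScore, one of the language-based metrics reported above, compares generated and reference texts via contextual embeddings and yields precision, recall, and F1. A minimal sketch, assuming the open-source bert-score package; the report texts are illustrative placeholders:

```python
# Assumes the open-source bert-score package; the texts are placeholders.
from bert_score import score

generated = ["FFA shows leakage consistent with choroidal neovascularization."]
reference = ["Angiography demonstrates leakage suggestive of CNV."]

P, R, F1 = score(generated, reference, lang="en")
print(f"BERTScore F1: {F1.mean().item():.2f}")
```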
Affiliation(s)
- Xiaolan Chen: School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
- Weiyi Zhang: School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
- Pusheng Xu: State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou, China
- Ziwei Zhao: School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
- Yingfeng Zheng: State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou, China
- Danli Shi: School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China; Research Centre for SHARP Vision (RCSV), The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
- Mingguang He: School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China; Research Centre for SHARP Vision (RCSV), The Hong Kong Polytechnic University, Kowloon, Hong Kong, China; Centre for Eye and Vision Research (CEVR), 17W Hong Kong Science Park, Hong Kong, China
10
Ghanem YK, Rouhi AD, Al-Houssan A, Saleh Z, Moccia MC, Joshi H, Dumon KR, Hong Y, Spitz F, Joshi AR, Kwiatt M. Dr. Google to Dr. ChatGPT: assessing the content and quality of artificial intelligence-generated medical information on appendicitis. Surg Endosc 2024;38:2887-2893. PMID: 38443499. PMCID: PMC11078845. DOI: 10.1007/s00464-024-10739-5.
Abstract
INTRODUCTION Generative artificial intelligence (AI) chatbots have recently been posited as potential sources of online medical information for patients making medical decisions. Existing online patient-oriented medical information has repeatedly been shown to be of variable quality and difficult readability. Therefore, we sought to evaluate the content and quality of AI-generated medical information on acute appendicitis. METHODS A modified DISCERN assessment tool, comprising 16 distinct criteria each scored on a 5-point Likert scale (score range 16-80), was used to assess AI-generated content. Readability was determined using the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) scores. Four popular chatbots (ChatGPT-3.5, ChatGPT-4, Bard, and Claude-2) were prompted to generate medical information about appendicitis. Three investigators independently scored the generated texts, blinded to the identity of the AI platforms. RESULTS ChatGPT-3.5, ChatGPT-4, Bard, and Claude-2 had overall mean (SD) quality scores of 60.7 (1.2), 62.0 (1.0), 62.3 (1.2), and 51.3 (2.3), respectively, on a scale of 16-80. Inter-rater reliability was 0.81, 0.75, 0.81, and 0.72, respectively, indicating substantial agreement. Claude-2 demonstrated a significantly lower mean quality score compared with ChatGPT-4 (p = 0.001), ChatGPT-3.5 (p = 0.005), and Bard (p = 0.001). Bard was the only AI platform that listed verifiable sources, while Claude-2 provided fabricated sources. All chatbots except for Claude-2 advised readers to consult a physician if experiencing symptoms. Regarding readability, the FKGL and FRE scores of ChatGPT-3.5, ChatGPT-4, Bard, and Claude-2 were 14.6 and 23.8, 11.9 and 33.9, 8.6 and 52.8, and 11.0 and 36.6, respectively, indicating difficult readability at a college reading skill level. CONCLUSION AI-generated medical information on appendicitis scored favorably upon quality assessment, but most platforms either fabricated sources or did not provide any altogether. Additionally, overall readability far exceeded recommended levels for the public. Generative AI platforms demonstrate measured potential for patient education and engagement about appendicitis.
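The modified DISCERN instrument described above is a simple additive scale: 16 criteria, each rated 1 to 5, yielding a total between 16 and 80. A worked sketch with illustrative item scores:

```python
# One rater's illustrative item scores for the 16-criterion modified DISCERN.
item_scores = [4, 3, 5, 4, 4, 3, 4, 5, 3, 4, 4, 3, 4, 4, 5, 3]

assert len(item_scores) == 16 and all(1 <= s <= 5 for s in item_scores)
total = sum(item_scores)  # possible range: 16 (all 1s) to 80 (all 5s)
print(total)              # 62, on the same scale as the means reported above
```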
Affiliation(s)
- Yazid K Ghanem: Department of Surgery, Cooper University Hospital, 3 Cooper Plaza, Suite 411, Camden, NJ, 08103, USA; Cooper Medical School of Rowan University, Camden, NJ, USA
- Armaun D Rouhi: Department of Surgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Ammr Al-Houssan: Department of Surgery, University of Connecticut, Hartford, CT, USA
- Zena Saleh: Department of Surgery, Cooper University Hospital, 3 Cooper Plaza, Suite 411, Camden, NJ, 08103, USA
- Matthew C Moccia: Department of Surgery, Cooper University Hospital, 3 Cooper Plaza, Suite 411, Camden, NJ, 08103, USA
- Hansa Joshi: Department of Surgery, Cooper University Hospital, 3 Cooper Plaza, Suite 411, Camden, NJ, 08103, USA
- Kristoffel R Dumon: Department of Surgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Young Hong: Department of Surgery, Cooper University Hospital, 3 Cooper Plaza, Suite 411, Camden, NJ, 08103, USA; Cooper Medical School of Rowan University, Camden, NJ, USA
- Francis Spitz: Department of Surgery, Cooper University Hospital, 3 Cooper Plaza, Suite 411, Camden, NJ, 08103, USA; Cooper Medical School of Rowan University, Camden, NJ, USA
- Amit R Joshi: Department of Surgery, Cooper University Hospital, 3 Cooper Plaza, Suite 411, Camden, NJ, 08103, USA; Cooper Medical School of Rowan University, Camden, NJ, USA
- Michael Kwiatt: Department of Surgery, Cooper University Hospital, 3 Cooper Plaza, Suite 411, Camden, NJ, 08103, USA; Cooper Medical School of Rowan University, Camden, NJ, USA
11
Momenaei B, Mansour HA, Kuriyan AE, Xu D, Sridhar J, Ting DSW, Yonekawa Y. ChatGPT enters the room: what it means for patient counseling, physician education, academics, and disease management. Curr Opin Ophthalmol 2024;35:205-209. PMID: 38334288. DOI: 10.1097/icu.0000000000001036.
Abstract
PURPOSE OF REVIEW This review seeks to provide a summary of the most recent research findings regarding the utilization of ChatGPT, an artificial intelligence (AI)-powered chatbot, in the field of ophthalmology in addition to exploring the limitations and ethical considerations associated with its application. RECENT FINDINGS ChatGPT has gained widespread recognition and demonstrated potential in enhancing patient and physician education, boosting research productivity, and streamlining administrative tasks. In various studies examining its utility in ophthalmology, ChatGPT has exhibited fair to good accuracy, with its most recent iteration showcasing superior performance in providing ophthalmic recommendations across various ophthalmic disorders such as corneal diseases, orbital disorders, vitreoretinal diseases, uveitis, neuro-ophthalmology, and glaucoma. This proves beneficial for patients in accessing information and aids physicians in triaging as well as formulating differential diagnoses. Despite such benefits, ChatGPT has limitations that require acknowledgment including the potential risk of offering inaccurate or harmful information, dependence on outdated data, the necessity for a high level of education for data comprehension, and concerns regarding patient privacy and ethical considerations within the research domain. SUMMARY ChatGPT is a promising new tool that could contribute to ophthalmic healthcare education and research, potentially reducing work burdens. However, its current limitations necessitate a complementary role with human expert oversight.
Affiliation(s)
- Bita Momenaei: Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
- Hana A Mansour: Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
- Ajay E Kuriyan: Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
- David Xu: Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
- Jayanth Sridhar: University of California Los Angeles, Los Angeles, California, USA
- Yoshihiro Yonekawa: Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
12
Biswas S, Davies LN, Sheppard AL, Logan NS, Wolffsohn JS. Utility of artificial intelligence-based large language models in ophthalmic care. Ophthalmic Physiol Opt 2024;44:641-671. PMID: 38404172. DOI: 10.1111/opo.13284.
Abstract
PURPOSE With the introduction of ChatGPT, artificial intelligence (AI)-based large language models (LLMs) are rapidly becoming popular within the scientific community. They use natural language processing to generate human-like responses to queries. However, the application of LLMs and the comparison of their abilities with those of their human counterparts in ophthalmic care remain under-reported. RECENT FINDINGS Hitherto, studies in eye care have demonstrated the utility of ChatGPT in generating patient information, supporting clinical diagnosis and passing ophthalmology question-based examinations, among others. LLMs' performance (median accuracy, %) is influenced by factors such as the iteration, prompts utilised and the domain. Human experts (86%) demonstrated the highest proficiency in disease diagnosis, while ChatGPT-4 outperformed others in ophthalmology examinations (75.9%), symptom triaging (98%) and providing information and answering questions (84.6%). LLMs exhibited superior performance in general ophthalmology but reduced accuracy in ophthalmic subspecialties. Although AI-based LLMs like ChatGPT are deemed more efficient than their human counterparts, these AIs are constrained by their nonspecific and outdated training, lack of access to current knowledge, generation of plausible-sounding 'fake' responses or hallucinations, inability to process images, lack of critical literature analysis, and ethical and copyright issues. A comprehensive evaluation of recently published studies is crucial to deepen understanding of LLMs and the potential of these AI-based LLMs. SUMMARY Ophthalmic care professionals should undertake a conservative approach when using AI, as human judgement remains essential for clinical decision-making and monitoring the accuracy of information. This review identified the ophthalmic applications and potential usages that need further exploration. With the advancement of LLMs, setting standards for benchmarking and promoting best practices is crucial. Potential clinical deployment requires the evaluation of these LLMs to move away from artificial settings, delve into clinical trials and determine their usefulness in the real world.
Affiliation(s)
- Sayantan Biswas: School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
- Leon N Davies: School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
- Amy L Sheppard: School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
- Nicola S Logan: School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
- James S Wolffsohn: School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
13
Al-Sharif EM, Penteado RC, Dib El Jalbout N, Topilow NJ, Shoji MK, Kikkawa DO, Liu CY, Korn BS. Evaluating the Accuracy of ChatGPT and Google BARD in Fielding Oculoplastic Patient Queries: A Comparative Study on Artificial versus Human Intelligence. Ophthalmic Plast Reconstr Surg 2024;40:303-311. PMID: 38215452. DOI: 10.1097/iop.0000000000002567.
Abstract
PURPOSE This study evaluates and compares the accuracy of responses from 2 artificial intelligence platforms to patients' oculoplastics-related questions. METHODS Questions directed toward oculoplastic surgeons were collected, rephrased, and input independently into the ChatGPT-3.5 and BARD chatbots, using the prompt: "As an oculoplastic surgeon, how can I respond to my patient's question?" Responses were independently evaluated by 4 experienced oculoplastic specialists as comprehensive, correct but inadequate, mixed correct and incorrect/outdated data, or completely incorrect. Additionally, the empathy level, length, and automated readability index of the responses were assessed. RESULTS A total of 112 patient questions underwent evaluation. The rates of comprehensive, correct but inadequate, mixed, and completely incorrect answers for ChatGPT were 71.4%, 12.9%, 10.5%, and 5.1%, respectively, compared with 53.1%, 18.3%, 18.1%, and 10.5%, respectively, for BARD. ChatGPT showed more empathy (48.9%) than BARD (13.2%). All graders found that ChatGPT outperformed BARD in the question categories of postoperative healing, medical eye conditions, and medications. Categorizing questions by anatomy, ChatGPT excelled in answering lacrimal questions (83.8%), while BARD performed best in the eyelid group (60.4%). ChatGPT's answers were longer and potentially more challenging to comprehend than BARD's. CONCLUSION This study emphasizes the promising role of artificial intelligence-powered chatbots in oculoplastic patient education and support. With continued development, these chatbots may potentially assist physicians and offer patients accurate information, ultimately contributing to improved patient care while alleviating surgeon burnout. However, it is crucial to highlight that artificial intelligence may be good at answering questions, but physician oversight remains essential to ensure the highest standard of care and address complex medical cases.
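The automated readability index mentioned above is a character-based formula. A minimal sketch using the standard published constants; the counts are illustrative:

```python
# Standard ARI constants; character, word, and sentence counts are illustrative.
def automated_readability_index(chars: int, words: int, sentences: int) -> float:
    """ARI approximates the US grade level needed to read the text."""
    return 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43

print(round(automated_readability_index(600, 120, 8), 1))  # 9.6
```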
Affiliation(s)
- Eman M Al-Sharif: Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A.; Clinical Sciences Department, College of Medicine, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia
- Rafaella C Penteado: Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A.
- Nahia Dib El Jalbout: Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A.
- Nicole J Topilow: Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A.
- Marissa K Shoji: Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A.
- Don O Kikkawa: Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A.; Division of Plastic and Reconstructive Surgery, Department of Surgery, UC San Diego School of Medicine, La Jolla, California, U.S.A.
- Catherine Y Liu: Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A.
- Bobby S Korn: Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A.; Division of Plastic and Reconstructive Surgery, Department of Surgery, UC San Diego School of Medicine, La Jolla, California, U.S.A.
14
Doğan L, Özçakmakcı GB, Yılmaz İE. The Performance of Chatbots and the AAPOS Website as a Tool for Amblyopia Education. J Pediatr Ophthalmol Strabismus 2024:1-7. PMID: 38661309. DOI: 10.3928/01913913-20240409-01. Online ahead of print.
Abstract
PURPOSE To evaluate the understandability, actionability, and readability of responses provided by the website of the American Association for Pediatric Ophthalmology and Strabismus (AAPOS), ChatGPT-3.5, Bard, and Bing Chat about amblyopia, and the appropriateness of the responses generated by the chatbots. METHODS Twenty-five questions provided by the AAPOS website were directed three times to fresh ChatGPT-3.5, Bard, and Bing Chat interfaces. Two experienced pediatric ophthalmologists categorized the responses of the chatbots in terms of their appropriateness. The Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), and Coleman-Liau Index (CLI) were used to evaluate the readability of the responses of the AAPOS website and the chatbots. Furthermore, understandability scores were evaluated using the Patient Education Materials Assessment Tool (PEMAT). RESULTS The appropriateness of the chatbots' responses was 84.0% for ChatGPT-3.5 and Bard and 80% for Bing Chat (P > .05). For understandability (mean PEMAT-U score: AAPOS website 81.5%, Bard 77.6%, ChatGPT-3.5 76.1%, and Bing Chat 71.5%; P < .05) and actionability (mean PEMAT-A score: AAPOS website 74.6%, Bard 69.2%, ChatGPT-3.5 67.8%, and Bing Chat 64.8%; P < .05), the AAPOS website scored better than the chatbots. Three readability analyses showed that Bard had the highest mean score, followed by the AAPOS website, Bing Chat, and ChatGPT-3.5, and these scores were more challenging than the recommended level. CONCLUSIONS Chatbots have the potential to provide detailed and appropriate responses at acceptable levels. The AAPOS website has the advantage of providing information that is more understandable and actionable. The AAPOS website and the chatbots, especially ChatGPT, provided difficult-to-read material for patient education regarding amblyopia. [J Pediatr Ophthalmol Strabismus. 20XX;X(X):XXX-XXX.]
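PEMAT scoring, as used above, expresses the share of applicable checklist items rated "Agree" as a percentage. A minimal sketch with illustrative ratings:

```python
# Illustrative item ratings: 1 = Agree, 0 = Disagree, None = Not Applicable.
def pemat_score(ratings):
    """PEMAT score: percentage of applicable items rated Agree."""
    applicable = [r for r in ratings if r is not None]
    return 100 * sum(applicable) / len(applicable)

# Nine of eleven applicable items rated Agree -> ~81.8%
print(round(pemat_score([1, 1, 0, 1, None, 1, 1, 0, 1, None, 1, 1, 1]), 1))
```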
15
Lam MR, Manion GN, Young BK. Search engine optimization and its association with readability and accessibility of diabetic retinopathy websites. Graefes Arch Clin Exp Ophthalmol 2024. PMID: 38639789. DOI: 10.1007/s00417-024-06472-3. Online ahead of print.
Abstract
PURPOSE This study investigated whether websites regarding diabetic retinopathy are readable for patients and adequately designed to be found by search engines. METHODS The term "diabetic retinopathy" was queried in the Google search engine. Patient-oriented websites from the first 10 pages were categorized by search result page number and website organization type. Metrics of search engine optimization (SEO) and readability were then calculated. RESULTS Among the 71 sites meeting inclusion criteria, informational and organizational sites were best optimized for search engines, and informational sites were the most visited. Better optimization as measured by authority score was correlated with lower Flesch-Kincaid Grade Level (r = 0.267, P = 0.024). There was a significant increase in Flesch-Kincaid Grade Level with successive search result pages (r = 0.275, P = 0.020). Only 2 sites met the AMA-recommended sixth-grade reading level by Flesch-Kincaid Grade Level; the average reading level was 10.5. There was no significant difference in readability between website categories. CONCLUSION While the readability of diabetic retinopathy patient information was poor, better readability was correlated with better SEO metrics. While we cannot assess causality, we recommend that websites improve their readability, which may increase uptake of their resources.
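The readability-versus-SEO association above is a simple bivariate correlation. A minimal sketch, assuming SciPy; the paired values are illustrative placeholders, not the study's measurements:

```python
# Assumes SciPy; the paired values are illustrative, not the study's data.
from scipy.stats import pearsonr

authority_score = [55, 62, 40, 71, 48, 66, 35, 58]  # SEO authority per site
fk_grade_level = [9.8, 9.1, 12.0, 8.7, 11.2, 9.5, 12.8, 10.1]

r, p = pearsonr(authority_score, fk_grade_level)
print(f"r = {r:.3f}, P = {p:.3f}")
```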
Affiliation(s)
- Matthew R Lam: Creighton University School of Medicine-Phoenix Regional Campus, Phoenix, AZ, USA
- Benjamin K Young: Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, Portland, OR, USA
16
King RC, Samaan JS, Yeo YH, Peng Y, Kunkel DC, Habib AA, Ghashghaei R. A Multidisciplinary Assessment of ChatGPT's Knowledge of Amyloidosis: Observational Study. JMIR Cardio 2024;8:e53421. PMID: 38640472. PMCID: PMC11069089. DOI: 10.2196/53421.
Abstract
BACKGROUND Amyloidosis, a rare multisystem condition, often requires complex, multidisciplinary care. Its low prevalence underscores the importance of efforts to ensure the availability of high-quality patient education materials for better outcomes. ChatGPT (OpenAI) is a large language model powered by artificial intelligence that offers a potential avenue for disseminating accurate, reliable, and accessible educational resources for both patients and providers. Its user-friendly interface, engaging conversational responses, and the capability for users to ask follow-up questions make it a promising future tool in delivering accurate and tailored information to patients. OBJECTIVE We performed a multidisciplinary assessment of the accuracy, reproducibility, and readability of ChatGPT in answering questions related to amyloidosis. METHODS In total, 98 amyloidosis questions related to cardiology, gastroenterology, and neurology were curated from medical societies, institutions, and amyloidosis Facebook support groups and input into ChatGPT-3.5 and ChatGPT-4. Cardiology- and gastroenterology-related responses were independently graded by a board-certified cardiologist and gastroenterologist, respectively, who specialize in amyloidosis. These 2 reviewers (RG and DCK) also graded general questions, with disagreements resolved by discussion. Neurology-related responses were graded by a board-certified neurologist (AAH) who specializes in amyloidosis. Reviewers used the following grading scale: (1) comprehensive, (2) correct but inadequate, (3) some correct and some incorrect, and (4) completely incorrect. Questions were stratified by category for further analysis. Reproducibility was assessed by inputting each question twice into each model. The readability of ChatGPT-4 responses was also evaluated using the Textstat library in Python (Python Software Foundation) and the Textstat readability package in R software (R Foundation for Statistical Computing). RESULTS ChatGPT-4 (n=98) provided 93 (95%) responses with accurate information, and 82 (84%) were comprehensive. ChatGPT-3.5 (n=83) provided 74 (89%) responses with accurate information, and 66 (79%) were comprehensive. When examined by question category, ChatGPT-4 and ChatGPT-3.5 provided 53 (95%) and 48 (86%) comprehensive responses, respectively, to "general questions" (n=56). When examined by subject, ChatGPT-4 and ChatGPT-3.5 performed best in response to cardiology questions (n=12), with both models producing 10 (83%) comprehensive responses. For gastroenterology (n=15), ChatGPT-4 received comprehensive grades for 9 (60%) responses, and ChatGPT-3.5 for 8 (53%). Overall, 96 of 98 (98%) responses for ChatGPT-4 and 73 of 83 (88%) for ChatGPT-3.5 were reproducible. The readability of ChatGPT-4's responses ranged from 10th grade to beyond graduate US grade levels, with an average of 15.5 (SD 1.9). CONCLUSIONS Large language models are a promising tool for providing accurate and reliable health information for patients living with amyloidosis. However, ChatGPT's responses exceeded the American Medical Association's recommended fifth- to sixth-grade reading level. Future studies focusing on improving response accuracy and readability are warranted. Prior to widespread implementation, the technology's limitations and ethical implications must be further explored to ensure patient safety and equitable implementation.
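The Methods name the Python Textstat library, which wraps the common readability formulas. A minimal sketch of how a response might be graded with it; the sample text is illustrative:

```python
# Assumes the Python textstat library named in the Methods; text is illustrative.
import textstat

response = (
    "Amyloidosis is a condition in which abnormal proteins build up "
    "in organs such as the heart, nerves, and digestive tract."
)

print(textstat.flesch_kincaid_grade(response))  # approximate US grade level
print(textstat.flesch_reading_ease(response))   # higher = easier to read
```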
Affiliation(s)
- Ryan C King: Division of Cardiology, Department of Medicine, University of California, Irvine Medical Center, Orange, CA, United States
- Jamil S Samaan: Karsh Division of Gastroenterology and Hepatology, Department of Medicine, Cedars-Sinai Medical Center, Los Angeles, CA, United States
- Yee Hui Yeo: Karsh Division of Gastroenterology and Hepatology, Department of Medicine, Cedars-Sinai Medical Center, Los Angeles, CA, United States
- Yuxin Peng: School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China
- David C Kunkel: GI Motility and Physiology Program, Division of Gastroenterology, University of California, San Diego, La Jolla, CA, United States
- Ali A Habib: Division of Neurology, University of California, Irvine Medical Center, Orange, CA, United States
- Roxana Ghashghaei: Division of Cardiology, Department of Medicine, University of California, Irvine Medical Center, Orange, CA, United States
17
Şahin MF, Ateş H, Keleş A, Özcan R, Doğan Ç, Akgül M, Yazıcı CM. Responses of Five Different Artificial Intelligence Chatbots to the Top Searched Queries About Erectile Dysfunction: A Comparative Analysis. J Med Syst 2024;48:38. PMID: 38568432. PMCID: PMC10990980. DOI: 10.1007/s10916-024-02056-0.
Abstract
The aim of this study was to evaluate and compare the quality and readability of responses generated by five different artificial intelligence (AI) chatbots (ChatGPT, Bard, Bing, Ernie, and Copilot) to the top searched queries about erectile dysfunction (ED). Google Trends was used to identify relevant ED-related phrases. Each AI chatbot received a specific sequence of 25 frequently searched terms as input. Responses were evaluated using the DISCERN, Ensuring Quality Information for Patients (EQIP), and Flesch-Kincaid Grade Level (FKGL) and Reading Ease (FKRE) metrics. The top three most frequently searched phrases were "erectile dysfunction cause", "how to erectile dysfunction," and "erectile dysfunction treatment." Zimbabwe, Zambia, and Ghana exhibited the highest level of interest in ED. None of the AI chatbots achieved the necessary degree of readability. However, Bard exhibited significantly higher FKRE and FKGL ratings (p = 0.001), and Copilot achieved better EQIP and DISCERN ratings than the other chatbots (p = 0.001). Bard exhibited the simplest linguistic framework and posed the least challenge in terms of readability and comprehension, and Copilot's text quality on ED was superior to that of the other chatbots. As new chatbots are introduced, their understandability and text quality increase, providing better guidance to patients.
Affiliation(s)
- Mehmet Fatih Şahin: Faculty of Medicine, Department of Urology, Tekirdağ Namık Kemal University, Süleymanpaşa, Tekirdağ, 59020, Turkey
- Hüseyin Ateş: Faculty of Medicine, Department of Urology, Tekirdağ Namık Kemal University, Süleymanpaşa, Tekirdağ, 59020, Turkey
- Anıl Keleş: Faculty of Medicine, Department of Urology, Tekirdağ Namık Kemal University, Süleymanpaşa, Tekirdağ, 59020, Turkey
- Rıdvan Özcan: Department of Urology, Bursa State Hospital, Nilüfer, Bursa, 16110, Turkey
- Çağrı Doğan: Faculty of Medicine, Department of Urology, Tekirdağ Namık Kemal University, Süleymanpaşa, Tekirdağ, 59020, Turkey
- Murat Akgül: Faculty of Medicine, Department of Urology, Tekirdağ Namık Kemal University, Süleymanpaşa, Tekirdağ, 59020, Turkey
- Cenk Murat Yazıcı: Faculty of Medicine, Department of Urology, Tekirdağ Namık Kemal University, Süleymanpaşa, Tekirdağ, 59020, Turkey
18
Shen OY, Pratap JS, Li X, Chen NC, Bhashyam AR. How Does ChatGPT Use Source Information Compared With Google? A Text Network Analysis of Online Health Information. Clin Orthop Relat Res 2024;482:578-588. PMID: 38517757. PMCID: PMC10936961. DOI: 10.1097/corr.0000000000002995.
Abstract
BACKGROUND The lay public is increasingly using ChatGPT (a large language model) as a source of medical information. Traditional search engines such as Google provide several distinct responses to each search query and indicate the source for each response, but ChatGPT provides responses in paragraph form in prose without providing the sources used, which makes it difficult or impossible to ascertain whether those sources are reliable. One practical method to infer the sources used by ChatGPT is text network analysis. By understanding how ChatGPT uses source information in relation to traditional search engines, physicians and physician organizations can better counsel patients on the use of this new tool. QUESTIONS/PURPOSES (1) In terms of key content words, how similar are ChatGPT and Google Search responses for queries related to topics in orthopaedic surgery? (2) Does the source distribution (academic, governmental, commercial, or form of a scientific manuscript) differ for Google Search responses based on the topic's level of medical consensus, and how is this reflected in the text similarity between ChatGPT and Google Search responses? (3) Do these results vary between different versions of ChatGPT? METHODS We evaluated three search queries relating to orthopaedic conditions: "What is the cause of carpal tunnel syndrome?," "What is the cause of tennis elbow?," and "Platelet-rich plasma for thumb arthritis?" These were selected because of their relatively high, medium, and low consensus in the medical evidence, respectively. Each question was posed to ChatGPT version 3.5 and version 4.0 20 times for a total of 120 responses. Text network analysis using term frequency-inverse document frequency (TF-IDF) was used to compare text similarity between responses from ChatGPT and Google Search. In the field of information retrieval, TF-IDF is a weighted statistical measure of the importance of a key word to a document in a collection of documents. Higher TF-IDF scores indicate greater similarity between two sources. TF-IDF scores are most often used to compare and rank the text similarity of documents. Using this type of text network analysis, text similarity between ChatGPT and Google Search can be determined by calculating and summing the TF-IDF for all keywords in a ChatGPT response and comparing it with each Google search result to assess their text similarity to each other. In this way, text similarity can be used to infer relative content similarity. To answer our first question, we characterized the text similarity between ChatGPT and Google Search responses by finding the TF-IDF scores of the ChatGPT response and each of the 20 Google Search results for each question. Using these scores, we could compare the similarity of each ChatGPT response to the Google Search results. To provide a reference point for interpreting TF-IDF values, we generated randomized text samples with the same term distribution as the Google Search results. By comparing ChatGPT TF-IDF to the random text sample, we could assess whether TF-IDF values were statistically significant from TF-IDF values obtained by random chance, and it allowed us to test whether text similarity was an appropriate quantitative statistical measure of relative content similarity. To answer our second question, we classified the Google Search results to better understand sourcing. Google Search provides 20 or more distinct sources of information, but ChatGPT gives only a single prose paragraph in response to each query. 
To answer this question, we therefore used TF-IDF to ascertain whether the ChatGPT response was principally driven by one of four source categories: academic, government, commercial, or material that took the form of a scientific manuscript but was not peer-reviewed or indexed on a government site (such as PubMed). We then compared the TF-IDF similarity between ChatGPT responses and each source category. To answer our third research question, we repeated both analyses and compared the results when using ChatGPT 3.5 versus ChatGPT 4.0. RESULTS The ChatGPT response was dominated by the top Google Search result. For example, for carpal tunnel syndrome, the top result was an academic website with a mean TF-IDF of 7.2. A similar result was observed for the other search topics. To provide a reference point for interpreting TF-IDF values, a randomly generated sample of text compared with Google Search would have a mean TF-IDF of 2.7 ± 1.9, controlling for text length and keyword distribution. The observed TF-IDF distribution was higher for ChatGPT responses than for random text samples, supporting the claim that keyword text similarity is a measure of relative content similarity. When comparing source distribution, the ChatGPT response was most similar to the most common source category from Google Search. For the subject where there was strong consensus (carpal tunnel syndrome), the ChatGPT response was most similar to high-quality academic sources rather than lower-quality commercial sources (TF-IDF 8.6 versus 2.2). For topics with low consensus, the ChatGPT response paralleled lower-quality commercial websites rather than higher-quality academic websites (TF-IDF 14.6 versus 0.2). ChatGPT 4.0 had higher text similarity to Google Search results than ChatGPT 3.5 (mean increase in TF-IDF similarity of 0.80 to 0.91; p < 0.001). The ChatGPT 4.0 response was still dominated by the top Google Search result and reflected the most common search category for all search topics. CONCLUSION ChatGPT responses are similar to individual Google Search results for queries related to orthopaedic surgery, but the distribution of source information can vary substantially based on the relative level of consensus on a topic. For example, for carpal tunnel syndrome, where there is widely accepted medical consensus, ChatGPT responses had higher similarity to academic sources and therefore drew on those sources more. When fewer academic or government sources are available, especially in our search related to platelet-rich plasma, ChatGPT appears to have relied more heavily on a small number of nonacademic sources. These findings persisted even as ChatGPT was updated from version 3.5 to version 4.0. CLINICAL RELEVANCE Physicians should be aware that ChatGPT and Google likely use the same sources for a specific question. The main difference is that ChatGPT can draw on multiple sources to create one aggregate response, whereas Google keeps sources distinct by providing multiple separate results. For topics with low consensus, and therefore few high-quality sources, there is a much higher chance that ChatGPT will use less-reliable sources, in which case physicians should take the time to educate patients on the topic or provide resources that give more reliable information. Physician organizations should make it clear when the evidence is limited so that ChatGPT can reflect the lack of quality information or evidence.
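The TF-IDF workflow described above can be sketched in a few lines. This is not the authors' code: the documents are invented, and cosine similarity over a shared TF-IDF vocabulary is used as one standard way of reducing per-keyword weights to a single similarity score, which differs in detail from the keyword summation the authors describe.

```python
# Minimal sketch of TF-IDF text similarity between a ChatGPT response
# and a set of Google Search results. Document texts are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chatgpt_response = (
    "Carpal tunnel syndrome is caused by compression of the median nerve "
    "as it passes through the carpal tunnel of the wrist."
)
google_results = [
    "The median nerve is compressed as it passes through the carpal tunnel.",
    "Carpal tunnel syndrome treatment options include splinting and surgery.",
]

# Fit one vocabulary over all documents so term weights are comparable.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform([chatgpt_response] + google_results)

# Compare the ChatGPT response (row 0) against each Google result.
similarities = cosine_similarity(tfidf[0], tfidf[1:]).flatten()
for rank, score in enumerate(similarities, start=1):
    print(f"Google result {rank}: similarity = {score:.3f}")
```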
Collapse
Affiliation(s)
- Oscar Y. Shen
- Department of Orthopaedic Surgery, Hand and Upper Extremity Service, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong
| | - Jayanth S. Pratap
- Department of Orthopaedic Surgery, Hand and Upper Extremity Service, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Harvard University, Cambridge, MA, USA
| | - Xiang Li
- Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
| | - Neal C. Chen
- Department of Orthopaedic Surgery, Hand and Upper Extremity Service, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
| | - Abhiram R. Bhashyam
- Department of Orthopaedic Surgery, Hand and Upper Extremity Service, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
| |
Collapse
|
19
|
Parikh AO, Oca MC, Conger JR, McCoy A, Chang J, Zhang-Nunes S. Accuracy and Bias in Artificial Intelligence Chatbot Recommendations for Oculoplastic Surgeons. Cureus 2024; 16:e57611. [PMID: 38707042 PMCID: PMC11069401 DOI: 10.7759/cureus.57611] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 03/30/2024] [Indexed: 05/07/2024] Open
Abstract
Purpose The purpose of this study is to assess the accuracy of and bias in recommendations for oculoplastic surgeons from three artificial intelligence (AI) chatbot systems. Methods ChatGPT, Microsoft Bing Balanced, and Google Bard were asked for recommendations for oculoplastic surgeons practicing in the 20 most populous cities in the United States. Three prompts were used: "can you help me find (an oculoplastic surgeon)/(a doctor who does eyelid lifts)/(an oculofacial plastic surgeon) in (city)." Results A total of 672 suggestions were made across the three prompts (oculoplastic surgeon; doctor who does eyelid lifts; oculofacial plastic surgeon); 19.8% of suggestions were excluded, leaving 539 suggested physicians. Of these, 64.1% were oculoplastics specialists (of whom 70.1% were American Society of Ophthalmic Plastic and Reconstructive Surgery (ASOPRS) members); 16.1% were trained in general plastic surgery, 9.0% in ENT, 8.8% in ophthalmology but not oculoplastics, and 1.9% in another specialty. Across all AI systems, 27.7% of recommended surgeons were female. Conclusions Among the chatbot systems tested, there were high rates of inaccuracy: up to 38% of recommended surgeons were nonexistent or not practicing in the city requested, and 35.9% of those recommended as oculoplastic/oculofacial plastic surgeons were not oculoplastics specialists. The choice of prompt affected the result, with requests for "a doctor who does eyelid lifts" yielding more plastic surgeons and ENTs and fewer oculoplastic surgeons. It is important to identify inaccuracies and biases in the recommendations provided by AI systems as more patients may start using them to choose a surgeon.
Collapse
Affiliation(s)
- Alomi O Parikh
- Ophthalmology, USC Roski Eye Institute, Keck School of Medicine, University of Southern California, Los Angeles, USA
| | - Michael C Oca
- Ophthalmology, University of California San Diego School of Medicine, La Jolla, USA
| | - Jordan R Conger
- Oculofacial Plastic Surgery, USC Roski Eye Institute, Keck School of Medicine, University of Southern California, Los Angeles, USA
| | - Allison McCoy
- Oculofacial Plastic Surgery, Del Mar Plastic Surgery, San Diego, USA
| | - Jessica Chang
- Oculofacial Plastic Surgery, USC Roski Eye Institute, Keck School of Medicine, University of Southern California, Los Angeles, USA
| | - Sandy Zhang-Nunes
- Ophthalmology, USC Roski Eye Institute, Keck School of Medicine, University of Southern California, Los Angeles, USA
| |
Collapse
|
20
|
Young BK, Zhao PY. Large Language Models and the Shoreline of Ophthalmology. JAMA Ophthalmol 2024; 142:375-376. [PMID: 38386327 DOI: 10.1001/jamaophthalmol.2023.6937] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/23/2024]
Affiliation(s)
- Benjamin K Young
- Casey Eye Institute, Oregon Health & Science University, Portland
| | - Peter Y Zhao
- New England Eye Center, Tufts University School of Medicine, Boston, Massachusetts
| |
Collapse
|
21
|
Huang AS, Hirabayashi K, Barna L, Parikh D, Pasquale LR. Assessment of a Large Language Model's Responses to Questions and Cases About Glaucoma and Retina Management. JAMA Ophthalmol 2024; 142:371-375. [PMID: 38386351 PMCID: PMC10884943 DOI: 10.1001/jamaophthalmol.2023.6917] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2023] [Accepted: 12/03/2023] [Indexed: 02/23/2024]
Abstract
Importance Large language models (LLMs) are revolutionizing medical diagnosis and treatment, offering unprecedented accuracy and ease surpassing conventional search engines. Their integration into medical assistance programs will become pivotal for ophthalmologists as an adjunct for practicing evidence-based medicine. Therefore, the diagnostic and treatment accuracy of LLM-generated responses compared with fellowship-trained ophthalmologists can help assess their accuracy and validate their potential utility in ophthalmic subspecialties. Objective To compare the diagnostic accuracy and comprehensiveness of responses from an LLM chatbot with those of fellowship-trained glaucoma and retina specialists on ophthalmological questions and real patient case management. Design, Setting, and Participants This comparative cross-sectional study recruited 15 participants aged 31 to 67 years, including 12 attending physicians and 3 senior trainees, from eye clinics affiliated with the Department of Ophthalmology at Icahn School of Medicine at Mount Sinai, New York, New York. Glaucoma and retina questions (10 of each type) were randomly selected from the American Academy of Ophthalmology's commonly asked questions Ask an Ophthalmologist. Deidentified glaucoma and retinal cases (10 of each type) were randomly selected from ophthalmology patients seen at Icahn School of Medicine at Mount Sinai-affiliated clinics. The LLM used was GPT-4 (version dated May 12, 2023). Data were collected from June to August 2023. Main Outcomes and Measures Responses were assessed via a Likert scale for medical accuracy and completeness. Statistical analysis involved the Mann-Whitney U test and the Kruskal-Wallis test, followed by pairwise comparison. Results The combined question-case mean rank for accuracy was 506.2 for the LLM chatbot and 403.4 for glaucoma specialists (n = 831; Mann-Whitney U = 27976.5; P < .001), and the mean rank for completeness was 528.3 and 398.7, respectively (n = 828; Mann-Whitney U = 25218.5; P < .001). The mean rank for accuracy was 235.3 for the LLM chatbot and 216.1 for retina specialists (n = 440; Mann-Whitney U = 15518.0; P = .17), and the mean rank for completeness was 258.3 and 208.7, respectively (n = 439; Mann-Whitney U = 13123.5; P = .005). The Dunn test revealed a significant difference between all pairwise comparisons, except specialist vs trainee in rating chatbot completeness. The overall pairwise comparisons showed that both trainees and specialists rated the chatbot's accuracy and completeness more favorably than those of their specialist counterparts, with specialists noting a significant difference in the chatbot's accuracy (z = 3.23; P = .007) and completeness (z = 5.86; P < .001). Conclusions and Relevance This study accentuates the comparative proficiency of LLM chatbots in diagnostic accuracy and completeness compared with fellowship-trained ophthalmologists in various clinical scenarios. The LLM chatbot outperformed glaucoma specialists and matched retina specialists in diagnostic and treatment accuracy, substantiating its role as a promising diagnostic adjunct in ophthalmology.
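The Mann-Whitney U comparisons reported above can be reproduced in outline with SciPy. The ratings below are fabricated Likert scores for illustration, not study data.

```python
# Illustrative Mann-Whitney U test contrasting chatbot and specialist
# accuracy ratings, as in the comparison above. Scores are invented.
from scipy.stats import mannwhitneyu

chatbot_scores = [5, 4, 4, 5, 3, 4, 5, 4]      # hypothetical Likert ratings
specialist_scores = [4, 3, 4, 3, 3, 4, 4, 3]   # hypothetical Likert ratings

u_stat, p_value = mannwhitneyu(chatbot_scores, specialist_scores,
                               alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```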
Collapse
Affiliation(s)
- Andy S. Huang
- Department of Ophthalmology, Icahn School of Medicine at Mount Sinai, New York, New York
| | - Kyle Hirabayashi
- Department of Ophthalmology, Icahn School of Medicine at Mount Sinai, New York, New York
| | - Laura Barna
- Department of Ophthalmology, Icahn School of Medicine at Mount Sinai, New York, New York
- Department of Ophthalmology, Massachusetts Eye and Ear, Harvard Medical School, Boston
| | - Deep Parikh
- Department of Ophthalmology, Icahn School of Medicine at Mount Sinai, New York, New York
| | - Louis R. Pasquale
- Department of Ophthalmology, Icahn School of Medicine at Mount Sinai, New York, New York
| |
Collapse
|
22
|
Chen X, Zhang W, Zhao Z, Xu P, Zheng Y, Shi D, He M. ICGA-GPT: report generation and question answering for indocyanine green angiography images. Br J Ophthalmol 2024:bjo-2023-324446. [PMID: 38508675 DOI: 10.1136/bjo-2023-324446] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/21/2023] [Accepted: 03/03/2024] [Indexed: 03/22/2024]
Abstract
BACKGROUND Indocyanine green angiography (ICGA) is vital for diagnosing chorioretinal diseases, but its interpretation and patient communication require extensive expertise and time-consuming effort. We aim to develop a bilingual ICGA report generation and question-answering (QA) system. METHODS Our dataset comprised 213 129 ICGA images from 2919 participants. The system comprised two stages: image-text alignment for report generation by a multimodal transformer architecture, and large language model (LLM)-based QA with ICGA text reports and human-input questions. Performance was assessed using both quantitative metrics (including Bilingual Evaluation Understudy (BLEU), Consensus-based Image Description Evaluation (CIDEr), Recall-Oriented Understudy for Gisting Evaluation-Longest Common Subsequence (ROUGE-L), Semantic Propositional Image Caption Evaluation (SPICE), accuracy, sensitivity, specificity, precision and F1 score) and subjective evaluation by three experienced ophthalmologists using 5-point scales (5 refers to high quality). RESULTS We produced 8757 ICGA reports covering 39 disease-related conditions after bilingual translation (66.7% English, 33.3% Chinese). The ICGA-GPT model's report generation performance was evaluated with BLEU scores (1-4) of 0.48, 0.44, 0.40 and 0.37; CIDEr of 0.82; ROUGE-L of 0.41 and SPICE of 0.18. For disease-based metrics, the average specificity, accuracy, precision, sensitivity and F1 score were 0.98, 0.94, 0.70, 0.68 and 0.64, respectively. Assessing the quality of 50 images (100 reports), three ophthalmologists achieved substantial agreement (kappa=0.723 for completeness, kappa=0.738 for accuracy), yielding scores from 3.20 to 3.55. In an interactive QA scenario involving 100 generated answers, the ophthalmologists provided scores of 4.24, 4.22 and 4.10, displaying good consistency (kappa=0.779). CONCLUSION This pioneering study introduces the ICGA-GPT model for report generation and interactive QA, underscoring the potential of LLMs in assisting with automated ICGA image interpretation.
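BLEU, one of the report-generation metrics above, measures n-gram overlap between a generated report and a reference. A minimal sketch using NLTK with invented sentences (not the ICGA dataset); smoothing is applied because short texts otherwise yield zero scores for higher-order n-grams.

```python
# Sketch of BLEU-1 through BLEU-4 for a generated sentence against one
# reference. Tokens are invented, not from the ICGA reports.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["indocyanine", "green", "angiography", "shows", "polypoidal", "lesions"]]
candidate = ["angiography", "shows", "polypoidal", "lesions"]

smooth = SmoothingFunction().method1  # avoids zero scores on short texts
for n in range(1, 5):
    weights = tuple([1.0 / n] * n)  # uniform weights over 1..n-grams
    score = sentence_bleu(reference, candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.2f}")
```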
Collapse
Affiliation(s)
- Xiaolan Chen
- School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
| | - Weiyi Zhang
- School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
| | - Ziwei Zhao
- School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
| | - Pusheng Xu
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou, Guangdong, China
| | - Yingfeng Zheng
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou, Guangdong, China
| | - Danli Shi
- School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
- Research Centre for SHARP Vision (RCSV), The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
| | - Mingguang He
- School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
- Research Centre for SHARP Vision (RCSV), The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
- Centre for Eye and Vision Research (CEVR), 17W Hong Kong Science Park, Hong Kong, China
| |
Collapse
|
23
|
Cohen SA, Brant A, Fisher AC, Pershing S, Do D, Pan C. Dr. Google vs. Dr. ChatGPT: Exploring the Use of Artificial Intelligence in Ophthalmology by Comparing the Accuracy, Safety, and Readability of Responses to Frequently Asked Patient Questions Regarding Cataracts and Cataract Surgery. Semin Ophthalmol 2024:1-8. [PMID: 38516983 DOI: 10.1080/08820538.2024.2326058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 02/01/2024] [Accepted: 02/27/2024] [Indexed: 03/23/2024]
Abstract
PURPOSE Patients are using online search modalities to learn about their eye health. While Google remains the most popular search engine, the use of large language models (LLMs) like ChatGPT has increased. Cataract surgery is the most common surgical procedure in the US, and there are limited data on the quality of the online information that surfaces in searches related to cataract surgery on search engines such as Google and LLM platforms such as ChatGPT. We identified the most common patient frequently asked questions (FAQs) about cataracts and cataract surgery and evaluated the accuracy, safety, and readability of the answers to these questions provided by both Google and ChatGPT. We also demonstrated the utility of ChatGPT in writing notes and creating patient education materials. METHODS The top 20 FAQs related to cataracts and cataract surgery were recorded from Google. Responses to the questions provided by Google and ChatGPT were evaluated by a panel of ophthalmologists for accuracy and safety. Evaluators were also asked to distinguish between Google and LLM chatbot answers. Five validated readability indices were used to assess the readability of responses. ChatGPT was instructed to generate operative notes, post-operative instructions, and customizable patient education materials according to specific readability criteria. RESULTS Responses to 20 patient FAQs generated by ChatGPT were significantly longer and written at a higher reading level than responses provided by Google (p < .001), with an average grade level of 14.8 (college level). Expert reviewers were able to correctly distinguish between human-reviewed and chatbot-generated responses an average of 31% of the time. Google answers contained incorrect or inappropriate material 27% of the time, compared with 6% of LLM-generated answers (p < .001). When expert reviewers were asked to compare the responses directly, chatbot responses were favored (66%). CONCLUSIONS When comparing the responses to patients' cataract FAQs provided by ChatGPT and Google, practicing ophthalmologists overwhelmingly preferred ChatGPT responses. LLM chatbot responses were less likely to contain inaccurate information. ChatGPT represents a viable information source on eye health for patients with higher health literacy. ChatGPT may also be used by ophthalmologists to create customizable patient education materials for patients with varying health literacy.
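Readability indices such as the two Flesch measures referenced above are simple closed-form formulas over sentence, word, and syllable counts. A minimal sketch of both follows; the naive vowel-group syllable counter is an approximation (production tools use pronunciation dictionaries), and the sample sentence is invented.

```python
# Sketch of Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL).
import re

def count_syllables(word: str) -> int:
    # Approximate: count vowel groups; real tools use dictionaries.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences   # words per sentence
    spw = syllables / len(words)   # syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fre, fkgl

fre, fkgl = flesch_scores("The lens of your eye gets cloudy. Surgery can fix it.")
print(f"Flesch Reading Ease: {fre:.1f}, Grade Level: {fkgl:.1f}")
```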
Collapse
Affiliation(s)
- Samuel A Cohen
- Byers Eye Institute, Stanford University School of Medicine, Stanford, CA, USA
| | - Arthur Brant
- Byers Eye Institute, Stanford University School of Medicine, Stanford, CA, USA
| | - Ann Caroline Fisher
- Byers Eye Institute, Stanford University School of Medicine, Stanford, CA, USA
| | - Suzann Pershing
- Byers Eye Institute, Stanford University School of Medicine, Stanford, CA, USA
| | - Diana Do
- Byers Eye Institute, Stanford University School of Medicine, Stanford, CA, USA
| | - Carolyn Pan
- Byers Eye Institute, Stanford University School of Medicine, Stanford, CA, USA
| |
Collapse
|
24
|
Chervonski E, Harish KB, Rockman CB, Sadek M, Teter KA, Jacobowitz GR, Berland TL, Lohr J, Moore C, Maldonado TS. Generative artificial intelligence chatbots may provide appropriate informational responses to common vascular surgery questions by patients. Vascular 2024:17085381241240550. [PMID: 38500300 DOI: 10.1177/17085381241240550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/20/2024]
Abstract
OBJECTIVES Generative artificial intelligence (AI) has emerged as a promising tool to engage with patients. The objective of this study was to assess the quality of AI responses to common patient questions regarding vascular surgery disease processes. METHODS OpenAI's ChatGPT-3.5 and Google Bard were queried with 24 mock patient questions spanning seven vascular surgery disease domains. Six experienced vascular surgery faculty at a tertiary academic center independently graded AI responses on their accuracy (rated 1-4 from completely inaccurate to completely accurate), completeness (rated 1-4 from totally incomplete to totally complete), and appropriateness (binary). Responses were also evaluated with three readability scales. RESULTS ChatGPT responses were rated, on average, more accurate than Bard responses (3.08 ± 0.33 vs 2.82 ± 0.40, p < .01). ChatGPT responses were also scored, on average, more complete than Bard responses (2.98 ± 0.34 vs 2.62 ± 0.36, p < .01). Most ChatGPT responses (75.0%, n = 18) and almost half of Bard responses (45.8%, n = 11) were unanimously deemed appropriate. Almost one-third of Bard responses (29.2%, n = 7) were deemed inappropriate by at least two reviewers, and two Bard responses (8.4%) were considered inappropriate by the majority. The mean Flesch Reading Ease, Flesch-Kincaid Grade Level, and Gunning Fog Index of ChatGPT responses were 29.4 ± 10.8, 14.5 ± 2.2, and 17.7 ± 3.1, respectively, indicating that responses were readable with a post-secondary education. Bard's mean readability scores were 58.9 ± 10.5, 8.2 ± 1.7, and 11.0 ± 2.0, respectively, indicating that responses were readable with a high-school education (p < .0001 for all three metrics). ChatGPT's mean response length (332 ± 79 words) was greater than Bard's (183 ± 53 words, p < .001). There was no difference in the accuracy, completeness, readability, or response length of ChatGPT or Bard between disease domains (p > .05 for all analyses). CONCLUSIONS AI offers a novel means of educating patients that avoids the inundation of information from "Dr Google" and the time barriers of physician-patient encounters. ChatGPT provides largely valid, though imperfect, responses to myriad patient questions at the expense of readability. While Bard responses are more readable and concise, their quality is poorer. Further research is warranted to better understand failure points for large language models in vascular surgery patient education.
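Of the three readability scales reported above, the Gunning Fog Index has the simplest published formula: 0.4 × (average sentence length + percentage of words with three or more syllables). A minimal sketch, with a rough vowel-group syllable counter and an invented sample sentence:

```python
# Sketch of the Gunning Fog Index; "complex" words are those with three
# or more syllables, counted here by a rough vowel-group approximation.
import re

def syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = [w for w in words if syllables(w) >= 3]
    return 0.4 * (len(words) / sentences + 100 * len(complex_words) / len(words))

sample = "An aneurysm is a bulge in the aorta. It can rupture without warning."
print(f"Fog index: {gunning_fog(sample):.1f}")
```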
Collapse
Affiliation(s)
- Ethan Chervonski
- New York University Grossman School of Medicine, New York, NY, USA
| | - Keerthi B Harish
- New York University Grossman School of Medicine, New York, NY, USA
| | - Caron B Rockman
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Mikel Sadek
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Katherine A Teter
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Glenn R Jacobowitz
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Todd L Berland
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Joann Lohr
- Dorn Veterans Affairs Medical Center, Columbia, SC, USA
| | | | - Thomas S Maldonado
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| |
Collapse
|
25
|
Hershenhouse JS, Cacciamani GE. Comment on: Assessing ChatGPT's ability to answer questions pertaining to erectile dysfunction. Int J Impot Res 2024:10.1038/s41443-023-00821-2. [PMID: 38467775 DOI: 10.1038/s41443-023-00821-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2023] [Revised: 12/12/2023] [Accepted: 12/21/2023] [Indexed: 03/13/2024]
Affiliation(s)
- Jacob S Hershenhouse
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Artificial Intelligence Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA
| | - Giovanni E Cacciamani
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA.
- Artificial Intelligence Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA.
| |
Collapse
|
26
|
Carlà MM, Gambini G, Baldascino A, Giannuzzi F, Boselli F, Crincoli E, D'Onofrio NC, Rizzo S. Exploring AI-chatbots' capability to suggest surgical planning in ophthalmology: ChatGPT versus Google Gemini analysis of retinal detachment cases. Br J Ophthalmol 2024:bjo-2023-325143. [PMID: 38448201 DOI: 10.1136/bjo-2023-325143] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2023] [Accepted: 02/16/2024] [Indexed: 03/08/2024]
Abstract
BACKGROUND We aimed to define the capability of three publicly available large language models, Chat Generative Pretrained Transformer (ChatGPT-3.5), ChatGPT-4 and Google Gemini, in analysing retinal detachment cases and suggesting the best possible surgical planning. METHODS Analysis of 54 retinal detachment records entered into the ChatGPT and Gemini interfaces. After asking 'Specify what kind of surgical planning you would suggest and the eventual intraocular tamponade.' and collecting the given answers, we assessed the level of agreement with the common opinion of three expert vitreoretinal surgeons. Moreover, ChatGPT and Gemini answers were graded 1-5 (from poor to excellent quality), according to the Global Quality Score (GQS). RESULTS After excluding 4 controversial cases, 50 cases were included. Overall, the ChatGPT-3.5, ChatGPT-4 and Google Gemini surgical choices agreed with those of the vitreoretinal surgeons in 40/50 (80%), 42/50 (84%) and 35/50 (70%) of cases, respectively. Google Gemini was not able to respond in five cases. Contingency analysis showed significant differences between ChatGPT-4 and Gemini (p=0.03). ChatGPT's GQS scores were 3.9±0.8 and 4.2±0.7 for versions 3.5 and 4, while Gemini scored 3.5±1.1. There was no statistical difference between the two ChatGPT versions (p=0.22), while both outperformed Gemini's scores (p=0.03 and p=0.002, respectively). The main source of error was endotamponade choice (14% for ChatGPT-3.5 and 4, and 12% for Google Gemini). Only ChatGPT-4 was able to suggest a combined phacovitrectomy approach. CONCLUSION Google Gemini and ChatGPT evaluated vitreoretinal patients' records in a coherent manner, showing a good level of agreement with expert surgeons. According to the GQS, ChatGPT's recommendations were more accurate and precise than Gemini's.
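The contingency comparison reported above can be reproduced in outline from the agreement counts in the abstract (42/50 for ChatGPT-4 versus 35/50 for Gemini). Whether the authors applied a continuity correction or an exact test is not stated, so the p value from this sketch may differ from the published one.

```python
# Sketch of a 2x2 contingency analysis of agreement with expert surgeons,
# using counts taken from the abstract above.
from scipy.stats import chi2_contingency

table = [[42, 8],   # ChatGPT-4: agree, disagree with surgeons
         [35, 15]]  # Gemini:    agree, disagree with surgeons

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```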
Collapse
Affiliation(s)
- Matteo Mario Carlà
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| | - Gloria Gambini
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| | - Antonio Baldascino
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| | - Federico Giannuzzi
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| | - Francesco Boselli
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| | - Emanuele Crincoli
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| | - Nicola Claudio D'Onofrio
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| | - Stanislao Rizzo
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| |
Collapse
|
27
|
Nikdel M, Ghadimi H, Tavakoli M, Suh DW. Assessment of the Responses of the Artificial Intelligence-based Chatbot ChatGPT-4 to Frequently Asked Questions About Amblyopia and Childhood Myopia. J Pediatr Ophthalmol Strabismus 2024; 61:86-89. [PMID: 37882183 DOI: 10.3928/01913913-20231005-02] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 10/27/2023]
Abstract
PURPOSE To assess the responses of ChatGPT-4, the forerunner artificial intelligence-based chatbot, to frequently asked questions regarding two common pediatric ophthalmologic disorders, amblyopia and childhood myopia. METHODS Twenty-seven questions about amblyopia and 28 questions about childhood myopia were each asked of ChatGPT twice (110 questions in total). The responses were evaluated by two pediatric ophthalmologists as acceptable, incomplete, or unacceptable. RESULTS There was remarkable agreement (96.4%) between the two pediatric ophthalmologists in their assessment of the responses. Acceptable responses were provided by ChatGPT to 93 of 110 (84.6%) questions in total (44 of 54 [81.5%] questions on amblyopia and 49 of 56 [87.5%] questions on childhood myopia). Seven of 54 (12.9%) responses to questions on amblyopia were graded as incomplete, compared with 4 of 56 (7.1%) responses to questions on childhood myopia. ChatGPT gave inappropriate responses to three questions each about amblyopia (5.6%) and childhood myopia (5.4%). The most noticeable inappropriate responses were related to the definition of reverse amblyopia and the threshold of refractive error for prescribing spectacles to children with myopia. CONCLUSIONS ChatGPT has the potential to serve as an adjunct informational tool for pediatric ophthalmology patients and their caregivers, demonstrating relatively good performance in answering 84.6% of the most frequently asked questions about amblyopia and childhood myopia. [J Pediatr Ophthalmol Strabismus. 2024;61(2):86-89.].
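The inter-rater agreement reported above is raw percent agreement; a chance-corrected complement such as Cohen's kappa is often reported alongside it. A minimal sketch with invented labels (the study itself reports only percent agreement):

```python
# Sketch comparing raw percent agreement with chance-corrected Cohen's
# kappa for two raters. The labels below are fabricated for illustration.
from sklearn.metrics import cohen_kappa_score

rater1 = ["acceptable", "acceptable", "incomplete", "acceptable", "unacceptable"]
rater2 = ["acceptable", "acceptable", "incomplete", "incomplete", "unacceptable"]

percent_agreement = sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)
kappa = cohen_kappa_score(rater1, rater2)
print(f"Agreement = {percent_agreement:.1%}, kappa = {kappa:.2f}")
```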
Collapse
|
28
|
Momenaei B, Wakabayashi T, Shahlaee A, Durrani AF, Pandit SA, Wang K, Mansour HA, Abishek RM, Xu D, Sridhar J, Yonekawa Y, Kuriyan AE. Assessing ChatGPT-3.5 Versus ChatGPT-4 Performance in Surgical Treatment of Retinal Diseases: A Comparative Study. Ophthalmic Surg Lasers Imaging Retina 2024:1-2. [PMID: 38531015 DOI: 10.3928/23258160-20240227-02] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/28/2024]
|
29
|
Yalla GR, Hyman N, Hock LE, Zhang Q, Shukla AG, Kolomeyer NN. Performance of Artificial Intelligence Chatbots on Glaucoma Questions Adapted From Patient Brochures. Cureus 2024; 16:e56766. [PMID: 38650824 PMCID: PMC11034394 DOI: 10.7759/cureus.56766] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 03/23/2024] [Indexed: 04/25/2024] Open
Abstract
Introduction With the potential for artificial intelligence (AI) chatbots to serve as the primary source of glaucoma information for patients, it is essential to characterize the information that chatbots provide so that providers can tailor discussions, anticipate patient concerns, and identify misleading information. Therefore, the purpose of this study was to evaluate glaucoma information from AI chatbots, including ChatGPT-4, Bard, and Bing, by analyzing response accuracy, comprehensiveness, readability, word count, and character count in comparison to each other and to glaucoma-related American Academy of Ophthalmology (AAO) patient materials. Methods Section headers from AAO glaucoma-related patient education brochures were adapted into question form and asked five times of each AI chatbot (ChatGPT-4, Bard, and Bing). Two sets of responses from each chatbot were used to evaluate the accuracy of AI chatbot responses and AAO brochure information, and the comprehensiveness of AI chatbot responses compared with the AAO brochure information; each was scored 1-5 by three independent glaucoma-trained ophthalmologists. Readability (assessed with the Flesch-Kincaid Grade Level (FKGL), corresponding to United States school grade levels), word count, and character count were determined for all chatbot responses and AAO brochure sections. Results Accuracy scores for AAO, ChatGPT, Bing, and Bard were 4.84, 4.26, 4.53, and 3.53, respectively. On direct comparison, AAO was more accurate than ChatGPT (p=0.002), and Bard was the least accurate (Bard versus AAO, p<0.001; Bard versus ChatGPT, p<0.002; Bard versus Bing, p=0.001). ChatGPT had the most comprehensive responses (ChatGPT versus Bing, p<0.001; ChatGPT versus Bard, p=0.008), with comprehensiveness scores for ChatGPT, Bing, and Bard of 3.32, 2.16, and 2.79, respectively. AAO information and Bard responses were at the most accessible readability levels (AAO versus ChatGPT, AAO versus Bing, Bard versus ChatGPT, Bard versus Bing, all p<0.0001), with readability levels for AAO, ChatGPT, Bing, and Bard at 8.11, 13.01, 11.73, and 7.90, respectively. Bing responses had the lowest word and character counts. Conclusion AI chatbot responses varied in accuracy, comprehensiveness, and readability. With accuracy and comprehensiveness scores below those of AAO brochures and elevated readability levels, AI chatbots require improvement to become a more useful supplementary source of glaucoma information for patients. Physicians must be aware of these limitations so that they can ask patients about their existing knowledge and questions and then provide clarifying, comprehensive information.
Collapse
Affiliation(s)
- Goutham R Yalla
- Department of Ophthalmology, Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, USA
- Glaucoma Research Center, Wills Eye Hospital, Philadelphia, USA
| | - Nicholas Hyman
- Department of Ophthalmology, Vagelos College of Physicians and Surgeons, Columbia University, New York, USA
- Department of Ophthalmology, Glaucoma Division, Columbia University Irving Medical Center, New York, USA
| | - Lauren E Hock
- Glaucoma Research Center, Wills Eye Hospital, Philadelphia, USA
| | - Qiang Zhang
- Glaucoma Research Center, Wills Eye Hospital, Philadelphia, USA
- Biostatistics Consulting Core, Vickie and Jack Farber Vision Research Center, Wills Eye Hospital, Philadelphia, USA
| | - Aakriti G Shukla
- Department of Ophthalmology, Glaucoma Division, Columbia University Irving Medical Center, New York, USA
| | | |
Collapse
|
30
|
Kianian R, Sun D, Crowell EL, Tsui E. The Use of Large Language Models to Generate Education Materials about Uveitis. Ophthalmol Retina 2024; 8:195-201. [PMID: 37716431 DOI: 10.1016/j.oret.2023.09.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2023] [Revised: 09/08/2023] [Accepted: 09/08/2023] [Indexed: 09/18/2023]
Abstract
OBJECTIVE To assess large language models in generating readable uveitis information and in improving the readability of online health information. DESIGN Evaluation of technology. SUBJECTS Not applicable. METHODS ChatGPT and Bard were asked the following prompts: (prompt A) "considering that the average American reads at a 6th grade level, using the Flesch-Kincaid Grade Level (FKGL) formula, can you write patient-targeted health information on uveitis of around 6th grade level?" and (prompt B) "can you write patient-targeted health information on uveitis that is easy to understand by an average American?" Additionally, ChatGPT and Bard were given text from the first-page Google results for the search term "uveitis" with the following prompt: "Considering that the average American reads at a 6th grade level, using the FKGL formula, can you rewrite the following text to 6th grade level: [insert text]." The readability of each response was analyzed and compared using the metrics described below. MAIN OUTCOME MEASURES The FKGL, a highly validated readability assessment tool that assigns a grade level to a given text, along with the total numbers of words, sentences, syllables, and complex words. Complex words were defined as those with > 2 syllables. RESULTS ChatGPT and Bard generated responses with lower FKGL scores (i.e., easier to understand) in response to prompt A compared with prompt B. This was only significant for ChatGPT (P < 0.0001). The mean FKGL of ChatGPT responses (6.3 ± 1.2) was significantly lower (P < 0.0001) than that of Bard responses (10.5 ± 0.8). ChatGPT responses also contained fewer complex words than Bard responses (P < 0.0001). Online health information on uveitis had a mean grade level of 11.0 ± 1.4. ChatGPT lowered the FKGL to 8.0 ± 1.0 (P < 0.0001) when asked to rewrite the content; Bard was not able to do so (mean FKGL of 11.1 ± 1.6). CONCLUSIONS ChatGPT can aid clinicians in producing easier-to-understand patient health information on uveitis compared with already-existing content. It can also help reduce the difficulty of the language used in uveitis health information targeted at patients. FINANCIAL DISCLOSURE(S) Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
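The rewrite experiment above was run through the chat interfaces; for readers who want to issue the same grade-level prompt programmatically, a minimal sketch using the OpenAI Python client follows. The client usage is standard, but the model name and the example source text are illustrative assumptions, not details from the study.

```python
# Sketch of issuing the study's grade-level rewrite prompt via the OpenAI
# Python client. The study used web chat interfaces; model name and source
# text here are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

source_text = "Uveitis is inflammation of the uvea, the middle layer of the eye."
prompt = (
    "Considering that the average American reads at a 6th grade level, "
    "using the FKGL formula, can you rewrite the following text to 6th "
    f"grade level: {source_text}"
)

response = client.chat.completions.create(
    model="gpt-4",  # illustrative; any chat-capable model
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```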
Collapse
Affiliation(s)
- Reza Kianian
- Stein Eye Institute, Department of Ophthalmology, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California
| | - Deyu Sun
- Stein Eye Institute, Department of Ophthalmology, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California
| | - Eric L Crowell
- Mitchel and Shannon Wong Eye Institute, Dell Medical School at the University of Texas at Austin, Austin, Texas
| | - Edmund Tsui
- Stein Eye Institute, Department of Ophthalmology, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California.
| |
Collapse
|
31
|
Onder CE, Koc G, Gokbulut P, Taskaldiran I, Kuskonmaz SM. Evaluation of the reliability and readability of ChatGPT-4 responses regarding hypothyroidism during pregnancy. Sci Rep 2024; 14:243. [PMID: 38167988 PMCID: PMC10761760 DOI: 10.1038/s41598-023-50884-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2023] [Accepted: 12/27/2023] [Indexed: 01/05/2024] Open
Abstract
Hypothyroidism is characterized by thyroid hormone deficiency and has adverse effects on both pregnancy and fetal health. Chat Generative Pre-trained Transformer (ChatGPT) is a large language model trained on a very large corpus of text from many sources. Our study aimed to evaluate the reliability and readability of ChatGPT-4 answers about hypothyroidism in pregnancy. A total of 19 questions were created in line with the recommendations in the latest guideline of the American Thyroid Association (ATA) on hypothyroidism in pregnancy and were asked of ChatGPT-4. The reliability and quality of the responses were scored by two independent researchers using the global quality scale (GQS) and modified DISCERN tools. The readability of the ChatGPT responses was assessed using the Flesch Reading Ease (FRE) score, Flesch-Kincaid grade level (FKGL), Gunning Fog Index (GFI), Coleman-Liau Index (CLI), and Simple Measure of Gobbledygook (SMOG) tools. No misleading information was found in any of the answers. The mean mDISCERN score of the responses was 30.26 ± 3.14; the median GQS score was 4 (2-4). In terms of reliability, most of the answers showed moderate (78.9%) followed by good (21.1%) reliability. In the readability analysis, the median FRE was 32.20 (13.00-37.10). Most answers [9 (47.3%)] required a university-level education to read. ChatGPT-4 has significant potential and can be used as an auxiliary information source for counseling, bridging patients and clinicians on hypothyroidism in pregnancy. Nonetheless, efforts should be made to improve the reliability and readability of ChatGPT.
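Two of the additional readability indices named above, SMOG and the Coleman-Liau Index, can be sketched under their usual published formulas; the vowel-group syllable counter is a rough approximation, and the sample sentence is invented.

```python
# Sketch of the SMOG and Coleman-Liau readability indices.
import math
import re

def _syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def smog(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if _syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / sentences) + 3.1291

def coleman_liau(text: str) -> float:
    words = re.findall(r"[A-Za-z']+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    letters = sum(len(w) for w in words)
    L = 100 * letters / len(words)    # letters per 100 words
    S = 100 * sentences / len(words)  # sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8

sample = "Thyroid hormone helps the baby grow. Take your pill every day."
print(f"SMOG: {smog(sample):.1f}, Coleman-Liau: {coleman_liau(sample):.1f}")
```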
Collapse
Affiliation(s)
- C E Onder
- Department of Endocrinology and Metabolic Diseases, Ankara Training and Research Hospital, Ankara, Turkey.
| | - G Koc
- Department of Endocrinology and Metabolic Diseases, Ankara Training and Research Hospital, Ankara, Turkey
| | - P Gokbulut
- Department of Endocrinology and Metabolic Diseases, Ankara Training and Research Hospital, Ankara, Turkey
| | - I Taskaldiran
- Department of Endocrinology and Metabolic Diseases, Ankara Training and Research Hospital, Ankara, Turkey
| | - S M Kuskonmaz
- Department of Endocrinology and Metabolic Diseases, Ankara Training and Research Hospital, Ankara, Turkey
| |
Collapse
|
32
|
Patnaik SS, Hoffmann U. Quantitative evaluation of ChatGPT versus Bard responses to anaesthesia-related queries. Br J Anaesth 2024; 132:169-171. [PMID: 37945414 DOI: 10.1016/j.bja.2023.09.030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2023] [Revised: 09/28/2023] [Accepted: 09/29/2023] [Indexed: 11/12/2023] Open
Affiliation(s)
- Sourav S Patnaik
- Department of Anesthesiology and Pain Management, The University of Texas Southwestern Medical Center, Dallas, TX, USA.
| | - Ulrike Hoffmann
- Department of Anesthesiology and Pain Management, The University of Texas Southwestern Medical Center, Dallas, TX, USA
| |
Collapse
|
33
|
Ray PP. Can we depend on LLMs to persuade myopia-related issues? Ophthalmic Physiol Opt 2024; 44:231-232. [PMID: 37635304 DOI: 10.1111/opo.13226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2023] [Accepted: 08/17/2023] [Indexed: 08/29/2023]
|
34
|
Biswas S, Logan NS, Davies LN, Sheppard AL, Wolffsohn JS. Authors' Reply: Assessing the utility of ChatGPT as an artificial intelligence-based large language model for information to answer questions on myopia. Ophthalmic Physiol Opt 2024; 44:233-234. [PMID: 37635297 DOI: 10.1111/opo.13227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2023] [Accepted: 08/17/2023] [Indexed: 08/29/2023]
Affiliation(s)
- Sayantan Biswas
- School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
| | - Nicola S Logan
- School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
| | - Leon N Davies
- School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
| | - Amy L Sheppard
- School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
| | - James S Wolffsohn
- School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
| |
Collapse
|
35
|
Momenaei B, Wakabayashi T, Shahlaee A, Durrani AF, Pandit SA, Wang K, Mansour HA, Abishek RM, Xu D, Sridhar J, Yonekawa Y, Kuriyan AE. Reply. Ophthalmol Retina 2024; 8:e1-e2. [PMID: 37815785 DOI: 10.1016/j.oret.2023.09.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2023] [Accepted: 09/06/2023] [Indexed: 10/11/2023]
Affiliation(s)
- Bita Momenaei
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
| | - Taku Wakabayashi
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
| | - Abtin Shahlaee
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
| | - Asad F Durrani
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
| | - Saagar A Pandit
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
| | - Kristine Wang
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
| | - Hana A Mansour
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
| | - Robert M Abishek
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
| | - David Xu
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
| | - Jayanth Sridhar
- Bascom Palmer Eye Institute, University of Miami Miller School of Medicine, Miami, Florida
| | - Yoshihiro Yonekawa
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
| | - Ajay E Kuriyan
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania.
| |
Collapse
|
36
|
Wei K, Fritz C, Rajasekaran K. Answering head and neck cancer questions: An assessment of ChatGPT responses. Am J Otolaryngol 2024; 45:104085. [PMID: 37844413 DOI: 10.1016/j.amjoto.2023.104085] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2023] [Accepted: 10/01/2023] [Indexed: 10/18/2023]
Abstract
PURPOSE To examine and compare ChatGPT versus Google websites in answering common head and neck cancer questions. MATERIALS AND METHODS Commonly asked questions about head and neck cancer were obtained and entered into both ChatGPT-4 and the Google search engine. For each question, the ChatGPT response and the first website search result were compiled and examined. Content quality was assessed by independent reviewers using standardized grading criteria and the modified Ensuring Quality Information for Patients (EQIP) tool. Readability was determined using the Flesch reading ease scale. RESULTS In total, 49 questions related to head and neck cancer were included. Google sources were on average of significantly higher quality than ChatGPT responses (4.2 vs 3.6, p = 0.005). According to the EQIP tool, Google and ChatGPT had on average similar response rates per criterion (24.4 vs 20.5, p = 0.09), while Google had a significantly higher average score per question than ChatGPT (13.8 vs 11.7, p < 0.001). According to the Flesch reading ease scale, ChatGPT and Google sources were similarly difficult to read (33.1 vs 37.0, p = 0.180), both at a college level (14.3 vs 14.2, p = 0.820). CONCLUSION ChatGPT responses were as challenging to read as Google sources but of poorer quality, owing to decreased reliability and accuracy in answering questions. Though promising, ChatGPT in its current form should not be considered dependable. Google sources remain the preferred resource for patient educational materials.
Collapse
Affiliation(s)
- Kimberly Wei
- Department of Otorhinolaryngology - Head and Neck Surgery, University of Pennsylvania, Philadelphia, PA, USA
| | - Christian Fritz
- Department of Otorhinolaryngology - Head and Neck Surgery, University of Pennsylvania, Philadelphia, PA, USA
| | - Karthik Rajasekaran
- Department of Otorhinolaryngology - Head and Neck Surgery, University of Pennsylvania, Philadelphia, PA, USA; Leonard Davis Institute of Health Economics, University of Pennsylvania, Philadelphia, PA, USA.
| |
Collapse
|
37
|
Tangadulrat P, Sono S, Tangtrakulwanich B. Using ChatGPT for Clinical Practice and Medical Education: Cross-Sectional Survey of Medical Students' and Physicians' Perceptions. JMIR MEDICAL EDUCATION 2023; 9:e50658. [PMID: 38133908 PMCID: PMC10770783 DOI: 10.2196/50658] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/08/2023] [Revised: 10/17/2023] [Accepted: 12/11/2023] [Indexed: 12/23/2023]
Abstract
BACKGROUND ChatGPT is a well-known chatbot based on a large language model. It can be used in the medical field in many ways. However, some physicians are still unfamiliar with ChatGPT and are concerned about its benefits and risks. OBJECTIVE We aimed to evaluate the perception of physicians and medical students toward using ChatGPT in the medical field. METHODS A web-based questionnaire was sent to medical students, interns, residents, and attending staff with questions regarding their perception toward using ChatGPT in clinical practice and medical education. Participants were also asked to rate their perception of a ChatGPT-generated response about knee osteoarthritis. RESULTS Participants included 124 medical students, 46 interns, 37 residents, and 32 attending staff. After reading ChatGPT's response, 132 of the 239 (55.2%) participants had a positive rating of using ChatGPT for clinical practice. The proportion of positive answers was significantly lower in graduated physicians (48/115, 42%) than in medical students (84/124, 68%; P<.001). Participants listed the lack of patient-specific treatment plans, possibly outdated evidence, and the language barrier as ChatGPT's pitfalls. Regarding the use of ChatGPT for medical education, the proportion of positive responses was also significantly lower in graduated physicians (71/115, 62%) than in medical students (103/124, 83.1%; P<.001). Participants were concerned that ChatGPT's responses might be too superficial, might lack scientific evidence, and might need expert verification. CONCLUSIONS Medical students generally had a positive perception of using ChatGPT for guiding treatment and medical education, whereas graduated physicians were more cautious in this regard. Nonetheless, both medical students and graduated physicians positively perceived using ChatGPT for creating patient educational materials.
Collapse
Affiliation(s)
- Pasin Tangadulrat
- Department of Orthopedics, Faculty of Medicine, Prince of Songkla University, Hatyai, Thailand
| | - Supinya Sono
- Division of Family and Preventive Medicine, Faculty of Medicine, Prince of Songkla University, Hatyai, Thailand
| | | |
Collapse
|
38
|
Wong M, Lim ZW, Pushpanathan K, Cheung CY, Wang YX, Chen D, Tham YC. Review of emerging trends and projection of future developments in large language models research in ophthalmology. Br J Ophthalmol 2023:bjo-2023-324734. [PMID: 38164563 DOI: 10.1136/bjo-2023-324734] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2023] [Accepted: 11/14/2023] [Indexed: 01/03/2024]
Abstract
BACKGROUND Large language models (LLMs) are fast emerging as potent tools in healthcare, including ophthalmology. This systematic review offers a twofold contribution: it summarises current trends in ophthalmology-related LLM research and projects future directions for this burgeoning field. METHODS We systematically searched across various databases (PubMed, Europe PMC, Scopus and Web of Science) for articles related to LLM use in ophthalmology, published between 1 January 2022 and 31 July 2023. Selected articles were summarised, and categorised by type (editorial, commentary, original research, etc) and their research focus (eg, evaluating ChatGPT's performance in ophthalmology examinations or clinical tasks). FINDINGS We identified 32 articles meeting our criteria, published between January and July 2023, with a peak in June (n=12). Most were original research evaluating LLMs' proficiency in clinically related tasks (n=9). Studies demonstrated that ChatGPT-4.0 outperformed its predecessor, ChatGPT-3.5, in ophthalmology exams. Furthermore, ChatGPT excelled in constructing discharge notes (n=2), evaluating diagnoses (n=2) and answering general medical queries (n=6). However, it struggled with generating scientific articles or abstracts (n=3) and answering specific subdomain questions, especially those regarding specific treatment options (n=2). ChatGPT's performance relative to other LLMs (Google's Bard, Microsoft's Bing) varied by study design. Ethical concerns such as data hallucination (n=27), authorship (n=5) and data privacy (n=2) were frequently cited. INTERPRETATION While LLMs hold transformative potential for healthcare and ophthalmology, concerns over accountability, accuracy and data security remain. Future research should focus on application programming interface integration, comparative assessments of popular LLMs, their ability to interpret image-based data and the establishment of standardised evaluation frameworks.
Collapse
Affiliation(s)
| | - Zhi Wei Lim
- Yong Loo Lin School of Medicine, National University of Singapore, Singapore
| | - Krithi Pushpanathan
- Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
| | - Carol Y Cheung
- Ophthalmology and Visual Sciences, The Chinese University of Hong Kong, Hong Kong, Hong Kong
| | - Ya Xing Wang
- Beijing Institute of Ophthalmology, Beijing Tongren Hospital, Capital University of Medical Science, Beijing, China
| | - David Chen
- Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
| | - Yih Chung Tham
- Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
| |
Collapse
|
39
|
Kunze KN. Editorial Commentary: Recognizing and Avoiding Medical Misinformation Across Digital Platforms: Smoke, Mirrors (and Streaming). Arthroscopy 2023; 39:2454-2455. [PMID: 37981387 DOI: 10.1016/j.arthro.2023.06.054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 06/18/2023] [Revised: 06/27/2023] [Accepted: 06/30/2023] [Indexed: 11/21/2023]
Abstract
The evolution of social media and related online sources has substantially increased the ability of patients to query and access publicly available information that may have relevance to a potential musculoskeletal condition of interest. Although increased accessibility to information has several purported benefits, including encouragement of patients to become more invested in their care through self-teaching, a downside to the existence of a vast number of unregulated resources remains the risk of misinformation. As health care providers, we have a moral and ethical obligation to mitigate this risk by directing patients to high-quality resources for medical information and to be aware of resources that are unreliable. To this end, a growing body of evidence has suggested that YouTube lacks reliability and quality in terms of medical information concerning a variety of musculoskeletal conditions.
40
Ferro Desideri L, Roth J, Zinkernagel M, Anguita R. "Application and accuracy of artificial intelligence-derived large language models in patients with age related macular degeneration". Int J Retina Vitreous 2023; 9:71. [PMID: 37980501 PMCID: PMC10657493 DOI: 10.1186/s40942-023-00511-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2023] [Accepted: 11/11/2023] [Indexed: 11/20/2023] Open
Abstract
INTRODUCTION Age-related macular degeneration (AMD) affects millions of people globally, leading to a surge in online searches for putative diagnoses and causing potential misinformation and anxiety in patients and their relatives. This study explores the efficacy of artificial intelligence-derived large language models (LLMs), such as ChatGPT, in addressing AMD patients' questions. METHODS ChatGPT 3.5 (2023), Bing AI (2023) and Google Bard (2023) were adopted as LLMs. Patients' questions were subdivided into two categories, (a) general medical advice and (b) pre- and post-intravitreal injection advice, and responses were classified as (1) accurate and sufficient, (2) partially accurate but sufficient or (3) inaccurate and not sufficient. A non-parametric test was used to compare mean scores among the three LLMs, and analysis of variance and reliability tests were performed across the three groups. RESULTS In category (a), the average score was 1.20 (± 0.41) with ChatGPT 3.5, 1.60 (± 0.63) with Bing AI and 1.60 (± 0.73) with Google Bard, showing no significant difference among the three groups (p = 0.129). In category (b), the average score was 1.07 (± 0.27) with ChatGPT 3.5, 1.69 (± 0.63) with Bing AI and 1.38 (± 0.63) with Google Bard, a significant difference among the three groups (p = 0.0042). Reliability statistics showed a Cronbach's α of 0.237 (range 0.448, 0.096-0.544). CONCLUSION ChatGPT 3.5 consistently offered the most accurate and satisfactory responses, particularly for technical queries. LLMs show promise in providing precise information about AMD; however, further improvement is needed, especially for more technical questions.
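The reliability coefficient reported above (Cronbach's α) is a simple function of the per-rater and total-score variances. A minimal Python sketch follows, assuming an invented matrix of 1-3 accuracy ratings (rows = questions, columns = the three LLMs); this is not the study's data.

```python
# Minimal sketch of Cronbach's alpha over a questions-by-models score matrix.
# The ratings below are invented for illustration only.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: (n_items, n_raters) matrix of ratings."""
    k = scores.shape[1]                          # number of raters (LLMs)
    rater_vars = scores.var(axis=0, ddof=1)      # variance of each rater's scores
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of per-item totals
    return (k / (k - 1)) * (1 - rater_vars.sum() / total_var)

# Hypothetical 1-3 accuracy ratings for 10 questions from 3 models
ratings = np.array([
    [1, 2, 2], [1, 1, 2], [2, 2, 1], [1, 2, 2], [1, 1, 1],
    [1, 2, 1], [1, 1, 2], [2, 3, 2], [1, 1, 1], [1, 2, 2],
])
print(round(cronbach_alpha(ratings), 3))
```

A low α, as in the study, indicates that the three models' ratings do not move together across questions.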
Affiliation(s)
- Lorenzo Ferro Desideri
  Department of Ophthalmology, Inselspital, University Hospital of Bern, Bern, Switzerland
  Bern Photographic Reading Center, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland
- Janice Roth
  Department of Ophthalmology, Inselspital, University Hospital of Bern, Bern, Switzerland
- Martin Zinkernagel
  Department of Ophthalmology, Inselspital, University Hospital of Bern, Bern, Switzerland
  Bern Photographic Reading Center, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland
- Rodrigo Anguita
  Department of Ophthalmology, Inselspital, University Hospital of Bern, Bern, Switzerland
  Moorfields Eye Hospital NHS Foundation Trust, City Road, London, EC1V 2PD, UK
41
Haidar O, Jaques A, McCaughran PW, Metcalfe MJ. AI-Generated Information for Vascular Patients: Assessing the Standard of Procedure-Specific Information Provided by the ChatGPT AI-Language Model. Cureus 2023; 15:e49764. [PMID: 38046759 PMCID: PMC10691169 DOI: 10.7759/cureus.49764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/30/2023] [Indexed: 12/05/2023] Open
Abstract
Introduction Ensuring access to high-quality information is paramount to facilitating informed surgical decision-making. Use of the internet to access health-related information is increasing, along with the growing prevalence of AI language models such as ChatGPT. We aimed to assess the standard of AI-generated patient-facing information through a qualitative analysis of its readability and quality. Materials and methods We performed a retrospective qualitative analysis of information regarding three common vascular procedures: endovascular aortic repair (EVAR), endovenous laser ablation (EVLA), and femoro-popliteal bypass (FPBP). The ChatGPT responses were compared with patient information leaflets provided by the vascular charity Circulation Foundation UK. Readability was assessed using four readability scores: the Flesch-Kincaid reading ease (FKRE) score, the Flesch-Kincaid grade level (FKGL), the Gunning fog score (GFS), and the simple measure of gobbledygook (SMOG) index. Quality was assessed using the DISCERN tool by two independent assessors. Results The mean FKRE score was 33.3, compared with 59.1 for the information provided by the Circulation Foundation (SD=14.5, p=0.025), indicating poor readability of the AI-generated information. The FKGL indicated that the school grade of students expected to read and understand the ChatGPT responses was consistently higher than for the information leaflets, at 12.7 vs. 9.4 (SD=1.9, p=0.002). Two metrics measure readability in terms of the number of years of education required to understand a piece of writing: the GFS and the SMOG index. Both indicated that AI-generated answers were less accessible. The GFS for ChatGPT-provided information was 16.7 years versus 12.8 years for the leaflets (SD=2.2, p=0.002), and the SMOG index scores were 12.2 and 9.4 years for ChatGPT and the patient information leaflets, respectively (SD=1.7, p=0.001). The DISCERN scores were consistently higher for human-generated patient information leaflets than for AI-generated information across all procedures; the mean score for the information provided by ChatGPT was 50.3 vs. 56.0 for the Circulation Foundation leaflets (SD=3.38, p<0.001). Conclusion AI-generated information about vascular surgical procedures is currently poor in both the readability of its text and the quality of its information. Patients should be directed to reputable, human-generated information sources from trusted professional bodies to supplement direct education from the clinician during the pre-procedure consultation process.
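All four indices above are closed-form functions of simple text counts (words, sentences, syllables, polysyllabic words). The Python sketch below implements the published formulas with a naive vowel-group syllable counter, so its values only approximate those of dedicated readability tools; the sample sentence is invented.

```python
# Sketch of the four readability indices named above (FKRE, FKGL, GFS, SMOG).
# Syllables are estimated by counting vowel groups, a rough heuristic.
import math
import re

def text_counts(text: str):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syls = [max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words]
    polysyllables = sum(1 for s in syls if s >= 3)
    return len(words), sentences, sum(syls), polysyllables

def readability(text: str) -> dict:
    w, s, syl, poly = text_counts(text)
    return {
        "FKRE": 206.835 - 1.015 * (w / s) - 84.6 * (syl / w),
        "FKGL": 0.39 * (w / s) + 11.8 * (syl / w) - 15.59,
        "GFS": 0.4 * ((w / s) + 100 * (poly / w)),
        "SMOG": 1.043 * math.sqrt(30 * poly / s) + 3.1291,
    }

print(readability("The aneurysm is repaired through a small incision in the groin."))
```

Note the opposite polarities: a higher FKRE means easier text, while higher FKGL, GFS, and SMOG values mean more years of education are required, which is why ChatGPT's lower FKRE and higher grade-level scores all point the same way.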
Affiliation(s)
- Omar Haidar
  Vascular Surgery, Lister Hospital, Stevenage, GBR
42
Hernandez CA, Vazquez Gonzalez AE, Polianovskaia A, Amoro Sanchez R, Muyolema Arce V, Mustafa A, Vypritskaya E, Perez Gutierrez O, Bashir M, Eighaei Sedeh A. The Future of Patient Education: AI-Driven Guide for Type 2 Diabetes. Cureus 2023; 15:e48919. [PMID: 38024047 PMCID: PMC10654048 DOI: 10.7759/cureus.48919] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/16/2023] [Indexed: 12/01/2023] Open
Abstract
Introduction and aim The surging incidence of type 2 diabetes has become a growing concern for the healthcare sector. This chronic ailment, characterized by a complex blend of genetic and lifestyle determinants, has increased notably in recent times, exerting substantial pressure on healthcare resources. As more individuals turn to online platforms for health guidance and embrace the Chat Generative Pre-trained Transformer (ChatGPT; San Francisco, CA: OpenAI), a text-generating AI (TGAI), for insights into their well-being, evaluating its effectiveness and reliability becomes crucial. This research primarily aimed to evaluate the correctness of TGAI responses to type 2 diabetes (T2DM) inquiries via ChatGPT. Furthermore, the study examined the consistency of TGAI in addressing common queries on T2DM complications for patient education. Material and methods Questions on T2DM were formulated by experienced physicians and screened by research personnel before being posed to ChatGPT. Each question was posed three times, and the collected answers were summarized. Responses were then sorted by two seasoned physicians into three distinct categories: (a) appropriate, (b) inappropriate, and (c) unreliable. In instances of differing opinions, a third physician was consulted to reach consensus. Results From the initial set of 110 T2DM questions, 40 were dismissed by the experts as not relevant, leaving a final set of 70. An overwhelming 98.5% of the AI's answers were judged appropriate, underscoring its reliability relative to traditional online search engines. Nonetheless, the 1.5% rate of inappropriate responses underlines the importance of ongoing AI improvement and strict adherence to medical protocols. Conclusion TGAI provides medical information of high quality and reliability. This study underscores TGAI's effectiveness in delivering reliable information about T2DM, with 98.5% of responses aligning with the standard of care. These results hold promise for integrating AI platforms as supplementary tools to enhance patient education and outcomes.
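The two-physician-plus-tiebreaker labelling described in the methods reduces to a simple consensus rule. Here is a minimal sketch of that workflow; the category names come from the abstract, but the example labels are hypothetical.

```python
# Sketch of the adjudication rule described above: two physicians label each
# response, and a third is consulted only when they disagree. Data invented.
from collections import Counter

CATEGORIES = ("appropriate", "inappropriate", "unreliable")

def adjudicate(rater1: str, rater2: str, tiebreaker: str) -> str:
    """Return the consensus label for one response."""
    if rater1 == rater2:
        return rater1
    return tiebreaker  # third physician resolves the disagreement

labels = [
    adjudicate("appropriate", "appropriate", "appropriate"),
    adjudicate("appropriate", "unreliable", "appropriate"),
    adjudicate("inappropriate", "unreliable", "inappropriate"),
]
print(Counter(labels))  # e.g. Counter({'appropriate': 2, 'inappropriate': 1})
```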
43
Kleebayoon A, Wiwanitkit V. Re: Momenaei et al.: Appropriateness and readability of ChatGPT-4 generated responses for surgical treatment of retinal diseases (Ophthalmol Retina. 2023;7:862-868.). Ophthalmol Retina 2023; 7:e15. [PMID: 37379882 DOI: 10.1016/j.oret.2023.06.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2023] [Accepted: 06/22/2023] [Indexed: 06/30/2023]
44
Nazir T, Ahmad U, Mal M, Rehman MU, Saeed R, Kalia J. Microsoft Bing vs Google Bard in Neurology: A Comparative Study of AI-Generated Patient Education Material. medRxiv 2023 [Preprint]. [DOI: 10.1101/2023.08.25.23294641] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/01/2023]
Abstract
BACKGROUND Patient education is an essential component of healthcare, and artificial intelligence (AI) language models such as Google Bard and Microsoft Bing have the potential to improve information transmission and enhance patient care. However, it is crucial to evaluate the quality, accuracy, and understandability of the materials generated by these models before applying them in medical practice. This study aimed to assess and compare the quality of patient education materials produced by Google Bard and Microsoft Bing in response to questions related to neurological conditions. METHODS A cross-sectional study design was used to evaluate and compare the ability of Google Bard and Microsoft Bing to generate patient education materials. The study included the top ten prevalent neurological diseases based on WHO prevalence data. Ten board-certified neurologists and four neurology residents evaluated the responses generated by the models on six quality metrics. The scores for each model were compiled and averaged across all measures, and the significance of any observed differences was assessed using an independent t-test. RESULTS Google Bard performed better than Microsoft Bing on all six quality metrics, with overall mean scores of 79% and 69%, respectively. Google Bard outperformed Microsoft Bing on all measures for eight questions, while Microsoft Bing performed marginally better in terms of objectivity and clarity for the epilepsy query. CONCLUSION This study showed that Google Bard performs better than Microsoft Bing in generating patient education materials for neurological diseases. However, healthcare professionals should weigh both AI models' advantages and disadvantages when supporting patients' health information needs. Future studies can help determine the underlying causes of these variations and guide cooperative initiatives to create more user-focused AI-generated patient education materials. Finally, researchers should consider patients' perceptions of AI-generated patient education material and its impact on implementing these solutions in healthcare settings.
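The significance testing described in the methods is a standard independent-samples t-test on the two models' scores. A minimal sketch follows, assuming SciPy is available; the score vectors are invented stand-ins, since the study's raw data are not reproduced here.

```python
# Sketch of the comparison described above: an independent t-test on the two
# models' per-evaluator mean quality scores. Scores below are hypothetical.
from scipy import stats

bard_scores = [0.82, 0.75, 0.80, 0.77, 0.81, 0.78, 0.79, 0.76]  # hypothetical
bing_scores = [0.70, 0.66, 0.72, 0.68, 0.71, 0.65, 0.69, 0.67]  # hypothetical

t_stat, p_value = stats.ttest_ind(bard_scores, bing_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```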