1. Maywood MJ, Parikh R, Deobhakta A, Begaj T. Performance assessment of an artificial intelligence chatbot in clinical vitreoretinal scenarios. Retina 2024; 44:954-964. [PMID: 38271674] [DOI: 10.1097/iae.0000000000004053]
Abstract
PURPOSE To determine how often ChatGPT is able to provide accurate and comprehensive information regarding clinical vitreoretinal scenarios. To assess the types of sources ChatGPT primarily uses and to determine whether they are hallucinated. METHODS This was a retrospective cross-sectional study. The authors designed 40 open-ended clinical scenarios across four main topics in vitreoretinal disease. Responses were graded on correctness and comprehensiveness by three blinded retina specialists. The primary outcome was the number of clinical scenarios that ChatGPT answered correctly and comprehensively. Secondary outcomes included theoretical harm to patients, the distribution of the type of references used by the chatbot, and the frequency of hallucinated references. RESULTS In June 2023, ChatGPT answered 83% of clinical scenarios (33/40) correctly but provided a comprehensive answer in only 52.5% of cases (21/40). Subgroup analysis demonstrated an average correct score of 86.7% in neovascular age-related macular degeneration, 100% in diabetic retinopathy, 76.7% in retinal vascular disease, and 70% in the surgical domain. There were six incorrect responses with one case (16.7%) of no harm, three cases (50%) of possible harm, and two cases (33.3%) of definitive harm. CONCLUSION ChatGPT correctly answered more than 80% of complex open-ended vitreoretinal clinical scenarios, with a reduced capability to provide a comprehensive response.
Affiliation(s)
- Michael J Maywood
- Department of Ophthalmology, Corewell Health William Beaumont University Hospital, Royal Oak, Michigan
- Ravi Parikh
- Manhattan Retina and Eye Consultants, New York, New York
- Department of Ophthalmology, New York University School of Medicine, New York, New York
- Tedi Begaj
- Department of Ophthalmology, Corewell Health William Beaumont University Hospital, Royal Oak, Michigan
- Associated Retinal Consultants, Royal Oak, Michigan
2. Bressler NM. JAMA Ophthalmology-The Year in Review, 2023. JAMA Ophthalmol 2024; 142:405-406. [PMID: 38512250] [DOI: 10.1001/jamaophthalmol.2024.0435]
3. Mihalache A, Huang RS, Popovic MM, Muni RH. Artificial intelligence chatbot and Academy Preferred Practice Pattern® Guidelines on cataract and glaucoma. J Cataract Refract Surg 2024; 50:534-535. [PMID: 38468154] [DOI: 10.1097/j.jcrs.0000000000001317]
Affiliation(s)
- Andrew Mihalache
- From the Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada (Mihalache, Huang); Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada (Popovic, Muni); Department of Ophthalmology, St. Michael's Hospital/Unity Health Toronto, Toronto, Ontario, Canada (Muni)
4. Biswas S, Davies LN, Sheppard AL, Logan NS, Wolffsohn JS. Utility of artificial intelligence-based large language models in ophthalmic care. Ophthalmic Physiol Opt 2024; 44:641-671. [PMID: 38404172] [DOI: 10.1111/opo.13284]
Abstract
PURPOSE With the introduction of ChatGPT, artificial intelligence (AI)-based large language models (LLMs) are rapidly becoming popular within the scientific community. They use natural language processing to generate human-like responses to queries. However, the application of LLMs and comparison of the abilities among different LLMs with their human counterparts in ophthalmic care remain under-reported. RECENT FINDINGS Hitherto, studies in eye care have demonstrated the utility of ChatGPT in generating patient information, clinical diagnosis and passing ophthalmology question-based examinations, among others. LLMs' performance (median accuracy, %) is influenced by factors such as the iteration, prompts utilised and the domain. Human expert (86%) demonstrated the highest proficiency in disease diagnosis, while ChatGPT-4 outperformed others in ophthalmology examinations (75.9%), symptom triaging (98%) and providing information and answering questions (84.6%). LLMs exhibited superior performance in general ophthalmology but reduced accuracy in ophthalmic subspecialties. Although AI-based LLMs like ChatGPT are deemed more efficient than their human counterparts, these AIs are constrained by their nonspecific and outdated training, no access to current knowledge, generation of plausible-sounding 'fake' responses or hallucinations, inability to process images, lack of critical literature analysis and ethical and copyright issues. A comprehensive evaluation of recently published studies is crucial to deepen understanding of LLMs and the potential of these AI-based LLMs. SUMMARY Ophthalmic care professionals should undertake a conservative approach when using AI, as human judgement remains essential for clinical decision-making and monitoring the accuracy of information. This review identified the ophthalmic applications and potential usages which need further exploration. With the advancement of LLMs, setting standards for benchmarking and promoting best practices is crucial. Potential clinical deployment requires the evaluation of these LLMs to move away from artificial settings, delve into clinical trials and determine their usefulness in the real world.
Affiliation(s)
- Sayantan Biswas
- School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
- Leon N Davies
- School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
- Amy L Sheppard
- School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
- Nicola S Logan
- School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
- James S Wolffsohn
- School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
5. Mihalache A, Huang RS, Cruz-Pimentel M, Patil NS, Popovic MM, Pandya BU, Shor R, Pereira A, Muni RH. Artificial intelligence chatbot interpretation of ophthalmic multimodal imaging cases. Eye (Lond) 2024. [PMID: 38649474] [DOI: 10.1038/s41433-024-03074-5]
Affiliation(s)
- Andrew Mihalache
- Temerty School of Medicine, University of Toronto, Toronto, ON, Canada
- Ryan S Huang
- Temerty School of Medicine, University of Toronto, Toronto, ON, Canada
- Miguel Cruz-Pimentel
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, ON, Canada
- Nikhil S Patil
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON, Canada
- Marko M Popovic
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, ON, Canada
- Bhadra U Pandya
- Temerty School of Medicine, University of Toronto, Toronto, ON, Canada
- Reut Shor
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, ON, Canada
- Austin Pereira
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, ON, Canada
- Rajeev H Muni
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, ON, Canada.
- Department of Ophthalmology, St. Michael's Hospital/Unity Health Toronto, Toronto, ON, Canada.
6. Mihalache A, Grad J, Patil NS, Huang RS, Popovic MM, Mallipatna A, Kertes PJ, Muni RH. Google Gemini and Bard artificial intelligence chatbot performance in ophthalmology knowledge assessment. Eye (Lond) 2024. [PMID: 38615098] [DOI: 10.1038/s41433-024-03067-4]
Abstract
PURPOSE With the popularization of ChatGPT (Open AI, San Francisco, California, United States) in recent months, understanding the potential of artificial intelligence (AI) chatbots in a medical context is important. Our study aims to evaluate Google Gemini and Bard's (Google, Mountain View, California, United States) knowledge in ophthalmology. METHODS In this study, we evaluated Google Gemini and Bard's performance on EyeQuiz, a platform containing ophthalmology board certification examination practice questions, when used from the United States (US). Accuracy, response length, response time, and provision of explanations were evaluated. Subspecialty-specific performance was noted. A secondary analysis was conducted using Bard from Vietnam, and Gemini from Vietnam, Brazil, and the Netherlands. RESULTS Overall, Google Gemini and Bard both had accuracies of 71% across 150 text-based multiple-choice questions. The secondary analysis revealed an accuracy of 67% using Bard from Vietnam, with 32 questions (21%) answered differently than when using Bard from the US. Moreover, the Vietnam version of Gemini achieved an accuracy of 74%, with 23 (15%) answered differently than the US version of Gemini. While the Brazil (68%) and Netherlands (65%) versions of Gemini performed slightly worse than the US version, differences in performance across the various country-specific versions of Bard and Gemini were not statistically significant. CONCLUSION Google Gemini and Bard had an acceptable performance in responding to ophthalmology board examination practice questions. Subtle variability was noted in the performance of the chatbots across different countries. The chatbots also tended to provide a confident explanation even when providing an incorrect answer.
Affiliation(s)
- Andrew Mihalache
- Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada
- Justin Grad
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON, Canada
- Nikhil S Patil
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON, Canada
- Ryan S Huang
- Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada
- Marko M Popovic
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, ON, Canada
- Ashwin Mallipatna
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, ON, Canada
- Department of Ophthalmology, Hospital for Sick Children, University of Toronto, Toronto, ON, Canada
- Peter J Kertes
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, ON, Canada
- John and Liz Tory Eye Centre, Sunnybrook Health Sciences Centre, Toronto, ON, Canada
- Rajeev H Muni
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, ON, Canada.
- Department of Ophthalmology, St. Michael's Hospital/Unity Health Toronto, Toronto, ON, Canada.
7. Mihalache A, Huang RS, Popovic MM, Patil NS, Pandya BU, Shor R, Pereira A, Kwok JM, Yan P, Wong DT, Kertes PJ, Muni RH. Accuracy of an Artificial Intelligence Chatbot's Interpretation of Clinical Ophthalmic Images. JAMA Ophthalmol 2024; 142:321-326. [PMID: 38421670] [PMCID: PMC10905373] [DOI: 10.1001/jamaophthalmol.2024.0017]
Abstract
Importance Ophthalmology is reliant on effective interpretation of multimodal imaging to ensure diagnostic accuracy. The new ability of ChatGPT-4 (OpenAI) to interpret ophthalmic images has not yet been explored. Objective To evaluate the performance of the novel release of an artificial intelligence chatbot that is capable of processing imaging data. Design, Setting, and Participants This cross-sectional study used a publicly available dataset of ophthalmic cases from OCTCases, a medical education platform based out of the Department of Ophthalmology and Vision Sciences at the University of Toronto, with accompanying clinical multimodal imaging and multiple-choice questions. Across 137 available cases, 136 contained multiple-choice questions (99%). Exposures The chatbot answered questions requiring multimodal input from October 16 to October 23, 2023. Main Outcomes and Measures The primary outcome was the accuracy of the chatbot in answering multiple-choice questions pertaining to image recognition in ophthalmic cases, measured as the proportion of correct responses. χ² tests were conducted to compare the proportion of correct responses across different ophthalmic subspecialties. Results A total of 429 multiple-choice questions from 136 ophthalmic cases and 448 images were included in the analysis. The chatbot answered 299 of the 429 multiple-choice questions correctly across all cases (70%). The chatbot's performance was better on retina questions than neuro-ophthalmology questions (77% vs 58%; difference = 18%; 95% CI, 7.5%-29.4%; χ²₁ = 11.4; P < .001). The chatbot achieved a better performance on non-image-based questions compared with image-based questions (82% vs 65%; difference = 17%; 95% CI, 7.8%-25.1%; χ²₁ = 12.2; P < .001). The chatbot performed best on questions in the retina category (77% correct) and poorest in the neuro-ophthalmology category (58% correct). The chatbot demonstrated intermediate performance on questions from the ocular oncology (72% correct), pediatric ophthalmology (68% correct), uveitis (67% correct), and glaucoma (61% correct) categories. Conclusions and Relevance In this study, the recent version of the chatbot accurately responded to approximately two-thirds of multiple-choice questions pertaining to ophthalmic cases based on imaging interpretation. The multimodal chatbot performed better on questions that did not rely on the interpretation of imaging modalities. As the use of multimodal chatbots becomes increasingly widespread, it is imperative to stress their appropriate integration within medical contexts.
Affiliation(s)
- Andrew Mihalache
- Temerty School of Medicine, University of Toronto, Toronto, Ontario, Canada
- Ryan S. Huang
- Temerty School of Medicine, University of Toronto, Toronto, Ontario, Canada
- Marko M. Popovic
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- Nikhil S. Patil
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, Ontario, Canada
- Bhadra U. Pandya
- Temerty School of Medicine, University of Toronto, Toronto, Ontario, Canada
- Reut Shor
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- Austin Pereira
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- Jason M. Kwok
- Temerty School of Medicine, University of Toronto, Toronto, Ontario, Canada
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- Peng Yan
- Temerty School of Medicine, University of Toronto, Toronto, Ontario, Canada
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- David T. Wong
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- Department of Ophthalmology, St Michael’s Hospital/Unity Health Toronto, Toronto, Ontario, Canada
- Peter J. Kertes
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- John and Liz Tory Eye Centre, Sunnybrook Health Science Centre, Toronto, Ontario, Canada
- Rajeev H. Muni
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- Department of Ophthalmology, St Michael’s Hospital/Unity Health Toronto, Toronto, Ontario, Canada
8. Tao BKL, Hua N, Milkovich J, Micieli JA. ChatGPT-3.5 and Bing Chat in ophthalmology: an updated evaluation of performance, readability, and informative sources. Eye (Lond) 2024. [PMID: 38509182] [DOI: 10.1038/s41433-024-03037-w]
Abstract
BACKGROUND/OBJECTIVES Experimental investigation. Bing Chat (Microsoft) integration with ChatGPT-4 (OpenAI) has conferred the capability of accessing online data past 2021. We investigate its performance against ChatGPT-3.5 on a multiple-choice question ophthalmology exam. SUBJECTS/METHODS In August 2023, ChatGPT-3.5 and Bing Chat were evaluated against 913 questions derived from the Academy's Basic and Clinical Science Course (BCSC) collection. For each response, the sub-topic, performance, Simple Measure of Gobbledygook readability score (measuring years of required education to understand a given passage), and cited resources were collected. The primary outcomes were the comparative scores between models, and qualitatively, the resources referenced by Bing Chat. Secondary outcomes included performance stratified by response readability, question type (explicit or situational), and BCSC sub-topic. RESULTS Across 913 questions, ChatGPT-3.5 scored 59.69% [95% CI 56.45,62.94] while Bing Chat scored 73.60% [95% CI 70.69,76.52]. Both models performed significantly better in explicit than clinical reasoning questions. Both models performed better on general medicine questions than on ophthalmology subsections. Bing Chat referenced 927 online entities and provided at least one citation for 836 of the 913 questions. The use of more reliable (peer-reviewed) sources was associated with a higher likelihood of a correct response. The most-cited resources were eyewiki.aao.org, aao.org, wikipedia.org, and ncbi.nlm.nih.gov. Bing Chat showed significantly better readability than ChatGPT-3.5, averaging a reading level of grade 11.4 [95% CI 7.14, 15.7] versus 12.4 [95% CI 8.77, 16.1], respectively (p-value < 0.0001, ρ = 0.25). CONCLUSIONS The online access, improved readability, and citation feature of Bing Chat confer additional utility for ophthalmology learners. We recommend critical appraisal of cited sources during response interpretation.
Affiliation(s)
- Brendan Ka-Lok Tao
- Faculty of Medicine, The University of British Columbia, 317-2194 Health Sciences Mall, Vancouver, BC, V6T 1Z3, Canada
- Nicholas Hua
- Temerty Faculty of Medicine, University of Toronto, 1 King's College Circle, Toronto, ON, M5S 1A8, Canada
- John Milkovich
- Temerty Faculty of Medicine, University of Toronto, 1 King's College Circle, Toronto, ON, M5S 1A8, Canada
- Jonathan Andrew Micieli
- Temerty Faculty of Medicine, University of Toronto, 1 King's College Circle, Toronto, ON, M5S 1A8, Canada.
- Department of Ophthalmology and Vision Sciences, University of Toronto, 340 College Street, Toronto, ON, M5T 3A9, Canada.
- Division of Neurology, Department of Medicine, University of Toronto, 6 Queen's Park Crescent West, Toronto, ON, M5S 3H2, Canada.
- Kensington Vision and Research Center, 340 College Street, Toronto, ON, M5T 3A9, Canada.
- St. Michael's Hospital, 36 Queen Street East, Toronto, ON, M5B 1W8, Canada.
- Toronto Western Hospital, 399 Bathurst Street, Toronto, ON, M5T 2S8, Canada.
- University Health Network, 190 Elizabeth Street, Toronto, ON, M5G 2C4, Canada.
9. Mihalache A, Huang RS, Patil NS, Popovic MM, Lee WW, Yan P, Cruz-Pimentel M, Muni RH. Chatbot and Academy Preferred Practice Pattern Guidelines on Retinal Diseases. Ophthalmol Retina 2024. [PMID: 38499086] [DOI: 10.1016/j.oret.2024.03.013]
Affiliation(s)
- Andrew Mihalache
- Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
- Ryan S Huang
- Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
- Nikhil S Patil
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, Ontario, Canada
- Marko M Popovic
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- Wei Wei Lee
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- Peng Yan
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada; Toronto Western Hospital, University Health Network, University of Toronto, Toronto, Ontario, Canada; Department of Ophthalmology, Kensington Vision and Research Center, Toronto, Ontario, Canada
- Miguel Cruz-Pimentel
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- Rajeev H Muni
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada; Department of Ophthalmology, St. Michael's Hospital/Unity Health Toronto, Toronto, Ontario, Canada.
10. Mihalache A, Huang RS, Popovic MM, Muni RH. ChatGPT-4: An assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination. Med Teach 2024; 46:366-372. [PMID: 37839017] [DOI: 10.1080/0142159x.2023.2249588]
Abstract
PURPOSE ChatGPT-4 is an upgraded version of an artificial intelligence chatbot. The performance of ChatGPT-4 on the United States Medical Licensing Examination (USMLE) has not been independently characterized. We aimed to assess the performance of ChatGPT-4 at responding to USMLE Step 1, Step 2CK, and Step 3 practice questions. METHOD Practice multiple-choice questions for the USMLE Step 1, Step 2CK, and Step 3 were compiled. Of 376 available questions, 319 (85%) were analyzed by ChatGPT-4 on March 21st, 2023. Our primary outcome was the performance of ChatGPT-4 for the practice USMLE Step 1, Step 2CK, and Step 3 examinations, measured as the proportion of multiple-choice questions answered correctly. Our secondary outcomes were the mean length of questions and responses provided by ChatGPT-4. RESULTS ChatGPT-4 responded to 319 text-based multiple-choice questions from USMLE practice test material. ChatGPT-4 answered 82 of 93 (88%) questions correctly on USMLE Step 1, 91 of 106 (86%) on Step 2CK, and 108 of 120 (90%) on Step 3. ChatGPT-4 provided explanations for all questions. ChatGPT-4 spent 30.8 ± 11.8 s on average responding to practice questions for USMLE Step 1, 23.0 ± 9.4 s per question for Step 2CK, and 23.1 ± 8.3 s per question for Step 3. The mean length of practice USMLE multiple-choice questions that were answered correctly and incorrectly by ChatGPT-4 was similar (difference = 17.48 characters, SE = 59.75, 95%CI = [-100.09,135.04], t = 0.29, p = 0.77). The mean length of ChatGPT-4's correct responses to practice questions was significantly shorter than the mean length of incorrect responses (difference = 79.58 characters, SE = 35.42, 95%CI = [9.89,149.28], t = 2.25, p = 0.03). CONCLUSIONS ChatGPT-4 answered a remarkably high proportion of practice questions correctly for USMLE examinations. ChatGPT-4 performed substantially better at USMLE practice questions than previous models of the same AI chatbot.
Affiliation(s)
- Andrew Mihalache
- Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
- Ryan S Huang
- Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
- Marko M Popovic
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- Rajeev H Muni
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- Department of Ophthalmology, St. Michael's Hospital/Unity Health Toronto, Toronto, Ontario, Canada
11. Marshall RF, Mallem K, Xu H, Thorne J, Burkholder B, Chaon B, Liberman P, Berkenstock M. Investigating the Accuracy and Completeness of an Artificial Intelligence Large Language Model About Uveitis: An Evaluation of ChatGPT. Ocul Immunol Inflamm 2024:1-4. [PMID: 38394625] [DOI: 10.1080/09273948.2024.2317417]
Abstract
PURPOSE To assess the accuracy and completeness of ChatGPT-generated answers regarding uveitis description, prevention, treatment, and prognosis. METHODS Thirty-two uveitis-related questions were generated by a uveitis specialist and inputted into ChatGPT 3.5. Answers were compiled into a survey and were reviewed by five uveitis specialists using standardized Likert scales of accuracy and completeness. RESULTS In total, the median accuracy score for all the uveitis questions (n = 32) was 4.00 (between "more correct than incorrect" and "nearly all correct"), and the median completeness score was 2.00 ("adequate, addresses all aspects of the question and provides the minimum amount of information required to be considered complete"). The interrater variability assessment had a total kappa value of 0.0278 for accuracy and 0.0847 for completeness. CONCLUSION ChatGPT can provide relatively high accuracy responses for various questions related to uveitis; however, the answers it provides are incomplete, with some inaccuracies. Its utility in providing medical information requires further validation and development prior to serving as a source of uveitis information for patients.
Affiliation(s)
- Rayna F Marshall
- The Drexel University College of Medicine, Philadelphia, Pennsylvania, USA
- Krishna Mallem
- The Drexel University College of Medicine, Philadelphia, Pennsylvania, USA
- Hannah Xu
- University of California San Diego, San Diego, California, USA
- Jennifer Thorne
- The Wilmer Eye Institute, Division of Ocular Immunology, The Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Bryn Burkholder
- The Wilmer Eye Institute, Division of Ocular Immunology, The Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Benjamin Chaon
- The Wilmer Eye Institute, Division of Ocular Immunology, The Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Paulina Liberman
- The Wilmer Eye Institute, Division of Ocular Immunology, The Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Meghan Berkenstock
- The Wilmer Eye Institute, Division of Ocular Immunology, The Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
12. Olis M, Dyjak P, Weppelmann TA. Performance of three artificial intelligence chatbots on Ophthalmic Knowledge Assessment Program materials. Can J Ophthalmol 2024. [PMID: 38408734] [DOI: 10.1016/j.jcjo.2024.01.011]
Affiliation(s)
- Mathew Olis
- Nova Southeastern Kiran C. Patel College of Osteopathic Medicine, Tampa, FL
- Patrick Dyjak
- Kansas City University College of Osteopathic Medicine, Kansas City, MO
- Thomas A Weppelmann
- Department of Ophthalmology, Morsani College of Medicine, University of South Florida, Tampa, FL; James A Haley Veterans Hospital Eye Clinic, Department of Veterans Affairs, Tampa, FL.
13. Gritti MN, AlTurki H, Farid P, Morgan CT. Progression of an Artificial Intelligence Chatbot (ChatGPT) for Pediatric Cardiology Educational Knowledge Assessment. Pediatr Cardiol 2024; 45:309-313. [PMID: 38170274] [DOI: 10.1007/s00246-023-03385-6]
Abstract
Artificial intelligence chatbots, like ChatGPT, have become powerful tools that are disrupting how humans interact with technology. The potential uses within medicine are vast. In medical education, these chatbots have shown improvements, in a short time span, in generalized medical examinations. We evaluated the overall performance and improvement between ChatGPT 3.5 and 4.0 in a test of pediatric cardiology knowledge. ChatGPT 3.5 and ChatGPT 4.0 were used to answer text-based multiple-choice questions derived from a Pediatric Cardiology Board Review textbook. Each chatbot was given an 88 question test, subcategorized into 11 topics. We excluded questions with modalities other than text (sound clips or images). Statistical analysis was done using an unpaired two-tailed t-test. Of the same 88 questions, ChatGPT 4.0 answered 66% of the questions correctly (n = 58/88) which was significantly greater (p < 0.0001) than ChatGPT 3.5, which only answered 38% (33/88). The ChatGPT 4.0 version also did better on each subspeciality topic as compared to ChatGPT 3.5. While acknowledging that ChatGPT does not yet offer subspecialty level knowledge in pediatric cardiology, the performance in pediatric cardiology educational assessments showed a considerable improvement in a short period of time between ChatGPT 3.5 and 4.0.
Affiliation(s)
- Michael N Gritti
- Division of Cardiology, The Labatt Family Heart Centre, The Hospital for Sick Children, 555 University Ave, Toronto, ON, M5G 1X8, Canada.
- Department of Pediatrics, University of Toronto, Toronto, ON, Canada.
- Hussain AlTurki
- Department of Pediatrics, University of Toronto, Toronto, ON, Canada
- Department of Pediatrics, The Hospital for Sick Children, Toronto, ON, Canada
- Pedrom Farid
- Division of Cardiology, The Labatt Family Heart Centre, The Hospital for Sick Children, 555 University Ave, Toronto, ON, M5G 1X8, Canada
- Schulich School of Medicine and Dentistry, University of Western Ontario, London, ON, Canada
- Conall T Morgan
- Division of Cardiology, The Labatt Family Heart Centre, The Hospital for Sick Children, 555 University Ave, Toronto, ON, M5G 1X8, Canada
- Department of Pediatrics, University of Toronto, Toronto, ON, Canada
14. Danesh A, Pazouki H, Danesh F, Danesh A, Vardar-Sengul S. Artificial intelligence in dental education: ChatGPT's performance on the periodontic in-service examination. J Periodontol 2024. [PMID: 38197146] [DOI: 10.1002/jper.23-0514]
Abstract
BACKGROUND ChatGPT's (Chat Generative Pre-Trained Transformer) remarkable capacity to generate human-like output makes it an appealing learning tool for healthcare students worldwide. Nevertheless, the chatbot's responses may be subject to inaccuracies, putting forth an intense risk of misinformation. ChatGPT's capabilities should be examined in every corner of healthcare education, including dentistry and its specialties, to understand the potential of misinformation associated with the chatbot's use as a learning tool. Our investigation aims to explore ChatGPT's foundation of knowledge in the field of periodontology by evaluating the chatbot's performance on questions obtained from an in-service examination administered by the American Academy of Periodontology (AAP). METHODS ChatGPT3.5 and ChatGPT4 were evaluated on 311 multiple-choice questions obtained from the 2023 in-service examination administered by the AAP. The dataset of in-service examination questions was accessed through Nova Southeastern University's Department of Periodontology. Our study excluded questions containing an image as ChatGPT does not accept image inputs. RESULTS ChatGPT3.5 and ChatGPT4 answered 57.9% and 73.6% of in-service questions correctly on the 2023 Periodontics In-Service Written Examination, respectively. A two-tailed t test was incorporated to compare independent sample means, and sample proportions were compared using a two-tailed χ2 test. A p value below the threshold of 0.05 was deemed statistically significant. CONCLUSION While ChatGPT4 showed a higher proficiency compared to ChatGPT3.5, both chatbot models leave considerable room for misinformation with their responses relating to periodontology. The findings of the study encourage residents to scrutinize the periodontic information generated by ChatGPT to account for the chatbot's current limitations.
Affiliation(s)
- Arman Danesh
- Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada
- Hirad Pazouki
- Faculty of Science, Western University, London, Ontario, Canada
- Farzad Danesh
- Elgin Mills Endodontic Specialists, Richmond Hill, Ontario, Canada
- Arsalan Danesh
- Department of Periodontology, College of Dental Medicine, Nova Southeastern University, Davie, Florida, USA
- Saynur Vardar-Sengul
- Department of Periodontology, College of Dental Medicine, Nova Southeastern University, Davie, Florida, USA
15. Wong M, Lim ZW, Pushpanathan K, Cheung CY, Wang YX, Chen D, Tham YC. Review of emerging trends and projection of future developments in large language models research in ophthalmology. Br J Ophthalmol 2023. [PMID: 38164563] [DOI: 10.1136/bjo-2023-324734]
Abstract
BACKGROUND Large language models (LLMs) are fast emerging as potent tools in healthcare, including ophthalmology. This systematic review offers a twofold contribution: it summarises current trends in ophthalmology-related LLM research and projects future directions for this burgeoning field. METHODS We systematically searched across various databases (PubMed, Europe PMC, Scopus and Web of Science) for articles related to LLM use in ophthalmology, published between 1 January 2022 and 31 July 2023. Selected articles were summarised, and categorised by type (editorial, commentary, original research, etc) and their research focus (eg, evaluating ChatGPT's performance in ophthalmology examinations or clinical tasks). FINDINGS We identified 32 articles meeting our criteria, published between January and July 2023, with a peak in June (n=12). Most were original research evaluating LLMs' proficiency in clinically related tasks (n=9). Studies demonstrated that ChatGPT-4.0 outperformed its predecessor, ChatGPT-3.5, in ophthalmology exams. Furthermore, ChatGPT excelled in constructing discharge notes (n=2), evaluating diagnoses (n=2) and answering general medical queries (n=6). However, it struggled with generating scientific articles or abstracts (n=3) and answering specific subdomain questions, especially those regarding specific treatment options (n=2). ChatGPT's performance relative to other LLMs (Google's Bard, Microsoft's Bing) varied by study design. Ethical concerns such as data hallucination (n=27), authorship (n=5) and data privacy (n=2) were frequently cited. INTERPRETATION While LLMs hold transformative potential for healthcare and ophthalmology, concerns over accountability, accuracy and data security remain. Future research should focus on application programming interface integration, comparative assessments of popular LLMs, their ability to interpret image-based data and the establishment of standardised evaluation frameworks.
Affiliation(s)
- Zhi Wei Lim
- Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Krithi Pushpanathan
- Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Carol Y Cheung
- Ophthalmology and Visual Sciences, The Chinese University of Hong Kong, Hong Kong, Hong Kong
- Ya Xing Wang
- Beijing Institute of Ophthalmology, Beijing Tongren Hospital, Capital University of Medical Science, Beijing, China
- David Chen
- Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Yih Chung Tham
- Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
16. Ittarat M, Cheungpasitporn W, Chansangpetch S. Personalized Care in Eye Health: Exploring Opportunities, Challenges, and the Road Ahead for Chatbots. J Pers Med 2023; 13:1679. [PMID: 38138906] [PMCID: PMC10744965] [DOI: 10.3390/jpm13121679]
Abstract
In modern eye care, the adoption of ophthalmology chatbots stands out as a pivotal technological progression. These digital assistants present numerous benefits, such as better access to vital information, heightened patient interaction, and streamlined triaging. Recent evaluations have highlighted their performance in both the triage of ophthalmology conditions and ophthalmology knowledge assessment, underscoring their potential and areas for improvement. However, assimilating these chatbots into the prevailing healthcare infrastructures brings challenges. These encompass ethical dilemmas, legal compliance, seamless integration with electronic health records (EHR), and fostering effective dialogue with medical professionals. Addressing these challenges necessitates the creation of bespoke standards and protocols for ophthalmology chatbots. The horizon for these chatbots is illuminated by advancements and anticipated innovations, poised to redefine the delivery of eye care. The synergy of artificial intelligence (AI) and machine learning (ML) with chatbots amplifies their diagnostic prowess. Additionally, their capability to adapt linguistically and culturally ensures they can cater to a global patient demographic. In this article, we explore in detail the utilization of chatbots in ophthalmology, examining their accuracy, reliability, data protection, security, transparency, potential algorithmic biases, and ethical considerations. We provide a comprehensive review of their roles in the triage of ophthalmology conditions and knowledge assessment, emphasizing their significance and future potential in the field.
Affiliation(s)
- Mantapond Ittarat
- Surin Hospital and Surin Medical Education Center, Suranaree University of Technology, Surin 32000, Thailand;
- Sunee Chansangpetch
- Center of Excellence in Glaucoma, Chulalongkorn University, Bangkok 10330, Thailand;
- Department of Ophthalmology, Faculty of Medicine, Chulalongkorn University and King Chulalongkorn Memorial Hospital, Thai Red Cross Society, Bangkok 10330, Thailand
17. Cohen A, Alter R, Lessans N, Meyer R, Brezinov Y, Levin G. Performance of ChatGPT in Israeli Hebrew OBGYN national residency examinations. Arch Gynecol Obstet 2023; 308:1797-1802. [PMID: 37668790] [DOI: 10.1007/s00404-023-07185-4]
Abstract
PURPOSE Previous studies of ChatGPT performance in the field of medical examinations have reached contradictory results. Moreover, the performance of ChatGPT in languages other than English is yet to be explored. We aim to study the performance of ChatGPT in the Hebrew OBGYN-'Shlav-Alef' (Phase 1) examination. METHODS A performance study was conducted using a consecutive sample of text-based multiple choice questions originating from authentic Hebrew OBGYN-'Shlav-Alef' examinations in 2021-2022. We constructed 150 multiple choice questions from consecutive text-based-only original questions. We compared the performance of ChatGPT to the real-life actual performance of OBGYN residents who completed the tests in 2021-2022. We also compared ChatGPT's Hebrew performance vs. previously published English medical tests. RESULTS In 2021-2022, 27.8% of OBGYN residents failed the 'Shlav-Alef' examination and the mean score of the residents was 68.4. Overall, 150 authentic questions were evaluated (one examination). ChatGPT correctly answered 58 questions (38.7%) and reached a failing score. The performance of Hebrew ChatGPT was lower when compared to the actual performance of residents: 38.7% vs. 68.4%, p < .001. In a comparison to ChatGPT performance on 9,091 English language questions in the field of medicine, the performance of Hebrew ChatGPT was lower (38.7% in Hebrew vs. 60.7% in English, p < .001). CONCLUSIONS ChatGPT correctly answered less than 40% of Hebrew OBGYN resident examination questions. Residents cannot rely on ChatGPT for the preparation of this examination. Efforts should be made to improve ChatGPT performance in other languages besides English.
Affiliation(s)
- Adiel Cohen
- Department of Obstetrics and Gynecology, Hadassah Medical Organization and Faculty of Medicine, Hebrew University of Jerusalem, Ein Kerem, P.O.B. 12000, 91120, Jerusalem, Israel.
- Roie Alter
- Department of Obstetrics and Gynecology, Hadassah Medical Organization and Faculty of Medicine, Hebrew University of Jerusalem, Ein Kerem, P.O.B. 12000, 91120, Jerusalem, Israel
- Naama Lessans
- Department of Obstetrics and Gynecology, Hadassah Medical Organization and Faculty of Medicine, Hebrew University of Jerusalem, Ein Kerem, P.O.B. 12000, 91120, Jerusalem, Israel
- Raanan Meyer
- Department of Obstetrics and Gynecology, Chaim Sheba Medical Center, Ramat-Gan, Israel
- Faculty of Medicine, Tel-Aviv University, Tel-Aviv, Israel
- Cedar-Sinai Medical Center, Los Angeles, USA
- Yoav Brezinov
- Lady Davis Institute for Cancer Research, Jewish General Hospital, McGill University, Montreal, Canada
- Gabriel Levin
- Lady Davis Institute for Cancer Research, Jewish General Hospital, McGill University, Montreal, Canada
- The Department of Gynecologic Oncology, Hadassah Medical Center, Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel
18. Hu X, Ran AR, Nguyen TX, Szeto S, Yam JC, Chan CKM, Cheung CY. What can GPT-4 do for Diagnosing Rare Eye Diseases? A Pilot Study. Ophthalmol Ther 2023; 12:3395-3402. [PMID: 37656399] [PMCID: PMC10640532] [DOI: 10.1007/s40123-023-00789-8]
Abstract
INTRODUCTION Generative pretrained transformer-4 (GPT-4) has gained widespread attention from society, and its potential has been extensively evaluated in many areas. However, investigation of GPT-4's use in medicine, especially in the ophthalmology field, is still limited. This study aims to evaluate GPT-4's capability to identify rare ophthalmic diseases in three simulated scenarios for different end-users, including patients, family physicians, and junior ophthalmologists. METHODS We selected ten treatable rare ophthalmic disease cases from the publicly available EyeRounds service. We gradually increased the amount of information fed into GPT-4 to simulate the scenarios of patient, family physician, and junior ophthalmologist using GPT-4. GPT-4's responses were evaluated from two aspects: suitability (appropriate or inappropriate) and accuracy (right or wrong) by senior ophthalmologists (> 10 years' experiences). RESULTS Among the 30 responses, 83.3% were considered "appropriate" by senior ophthalmologists. In the scenarios of simulated patient, family physician, and junior ophthalmologist, seven (70%), ten (100%), and eight (80%) responses were graded as "appropriate" by senior ophthalmologists. However, compared to the ground truth, GPT-4 could only output several possible diseases generally without "right" responses in the simulated patient scenarios. In contrast, in the simulated family physician scenario, 50% of GPT-4's responses were "right," and in the simulated junior ophthalmologist scenario, the model achieved a higher "right" rate of 90%. CONCLUSION To our knowledge, this is the first proof-of-concept study that evaluates GPT-4's capacity to identify rare eye diseases in simulated scenarios involving patients, family physicians, and junior ophthalmologists. The results indicate that GPT-4 has the potential to serve as a consultation assisting tool for patients and family physicians to receive referral suggestions and an assisting tool for junior ophthalmologists to diagnose rare eye diseases. However, it is important to approach GPT-4 with caution and acknowledge the need for verification and careful referrals in clinical settings.
Affiliation(s)
- Xiaoyan Hu
- Department of Ophthalmology and Visual Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
- An Ran Ran
- Department of Ophthalmology and Visual Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
- Truong X Nguyen
- Department of Ophthalmology and Visual Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
- Simon Szeto
- Department of Ophthalmology and Visual Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
- Jason C Yam
- Department of Ophthalmology and Visual Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
- Carol Y Cheung
- Department of Ophthalmology and Visual Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China.
19. Fowler T, Pullen S, Birkett L. Performance of ChatGPT and Bard on the official part 1 FRCOphth practice questions. Br J Ophthalmol 2023. [PMID: 37932006] [DOI: 10.1136/bjo-2023-324091]
Abstract
BACKGROUND Chat Generative Pre-trained Transformer (ChatGPT), a large language model by OpenAI, and Bard, Google's artificial intelligence (AI) chatbot, have been evaluated in various contexts. This study aims to assess these models' proficiency in the part 1 Fellowship of the Royal College of Ophthalmologists (FRCOphth) Multiple Choice Question (MCQ) examination, highlighting their potential in medical education. METHODS Both models were tested on a sample question bank for the part 1 FRCOphth MCQ exam. Their performances were compared with historical human performance on the exam, focusing on the ability to comprehend, retain and apply information related to ophthalmology. We also tested them on the book 'MCQs for FRCOphth part 1', and assessed their performance across subjects. RESULTS ChatGPT demonstrated a strong performance, surpassing historical human pass marks and examination performance, while Bard underperformed. The comparison indicates the potential of certain AI models to match, and even exceed, human standards in such tasks. CONCLUSION The results demonstrate the potential of AI models, such as ChatGPT, in processing and applying medical knowledge at a postgraduate level. However, performance varied among different models, highlighting the importance of appropriate AI selection. The study underlines the potential for AI applications in medical education and the necessity for further investigation into their strengths and limitations.
Affiliation(s)
- Thomas Fowler
- Department of Medicine, Barking Havering and Redbridge University Hospitals NHS Trust, London, UK
- Simon Pullen
- Department of Anaesthetics, Princess Alexandra Hospital, Harlow, UK
- Liam Birkett
- Emergency Medicine, Royal Free Hospital, London, UK
20. Antaki F, Milad D, Chia MA, Giguère CÉ, Touma S, El-Khoury J, Keane PA, Duval R. Capabilities of GPT-4 in ophthalmology: an analysis of model entropy and progress towards human-level medical question answering. Br J Ophthalmol 2023. [PMID: 37923374] [DOI: 10.1136/bjo-2023-324438]
Abstract
BACKGROUND Evidence on the performance of Generative Pre-trained Transformer 4 (GPT-4), a large language model (LLM), in the ophthalmology question-answering domain is needed. METHODS We tested GPT-4 on two 260-question multiple choice question sets from the Basic and Clinical Science Course (BCSC) Self-Assessment Program and the OphthoQuestions question banks. We compared the accuracy of GPT-4 models with varying temperatures (creativity setting) and evaluated their responses in a subset of questions. We also compared the best-performing GPT-4 model to GPT-3.5 and to historical human performance. RESULTS GPT-4-0.3 (GPT-4 with a temperature of 0.3) achieved the highest accuracy among GPT-4 models, with 75.8% on the BCSC set and 70.0% on the OphthoQuestions set. The combined accuracy was 72.9%, which represents an 18.3% raw improvement in accuracy compared with GPT-3.5 (p<0.001). Human graders preferred responses from models with a temperature higher than 0 (more creative). Exam section, question difficulty and cognitive level were all predictive of GPT-4-0.3 answer accuracy. GPT-4-0.3's performance was numerically superior to human performance on the BCSC (75.8% vs 73.3%) and OphthoQuestions (70.0% vs 63.0%), but the difference was not statistically significant (p=0.55 and p=0.09). CONCLUSION GPT-4, an LLM trained on non-ophthalmology-specific data, performs significantly better than its predecessor on simulated ophthalmology board-style exams. Remarkably, its performance tended to be superior to historical human performance, but that difference was not statistically significant in our study.
Affiliation(s)
- Fares Antaki
- Moorfields Eye Hospital NHS Foundation Trust, London, UK
- Institute of Ophthalmology, UCL, London, UK
- The CHUM School of Artificial Intelligence in Healthcare, Montreal, Quebec, Canada
- Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada
- Department of Ophthalmology, Centre Hospitalier de l'Universite de Montreal (CHUM), Montreal, Quebec, Canada
- Daniel Milad
- Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada
- Department of Ophthalmology, Centre Hospitalier de l'Universite de Montreal (CHUM), Montreal, Quebec, Canada
- Department of Ophthalmology, Hopital Maisonneuve-Rosemont, Montreal, Quebec, Canada
- Mark A Chia
- Moorfields Eye Hospital NHS Foundation Trust, London, UK
- Institute of Ophthalmology, UCL, London, UK
- Samir Touma
- Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada
- Department of Ophthalmology, Centre Hospitalier de l'Universite de Montreal (CHUM), Montreal, Quebec, Canada
- Department of Ophthalmology, Hopital Maisonneuve-Rosemont, Montreal, Quebec, Canada
- Jonathan El-Khoury
- Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada
- Department of Ophthalmology, Centre Hospitalier de l'Universite de Montreal (CHUM), Montreal, Quebec, Canada
- Department of Ophthalmology, Hopital Maisonneuve-Rosemont, Montreal, Quebec, Canada
- Pearse A Keane
- Moorfields Eye Hospital NHS Foundation Trust, London, UK
- Institute of Ophthalmology, UCL, London, UK
- NIHR Moorfields Biomedical Research Centre, London, UK
- Renaud Duval
- Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada
- Department of Ophthalmology, Hopital Maisonneuve-Rosemont, Montreal, Quebec, Canada
21. Mihalache A, Popovic MM, Muni RH. Advances in Artificial Intelligence Chatbot Technology in Ophthalmology-Reply. JAMA Ophthalmol 2023; 141:1088-1089. [PMID: 37856111] [DOI: 10.1001/jamaophthalmol.2023.4623]
Affiliation(s)
- Andrew Mihalache
- Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
- Marko M Popovic
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- Rajeev H Muni
- Department of Ophthalmology, St. Michael's Hospital/Unity Health Toronto, Toronto, Ontario, Canada
22. Wiedemann P. Artificial intelligence in ophthalmology. Int J Ophthalmol 2023; 16:1357-1360. [PMID: 37724277] [PMCID: PMC10409517] [DOI: 10.18240/ijo.2023.09.01]
23. Liu J, Zheng J, Cai X, Wu D, Yin C. A descriptive study based on the comparison of ChatGPT and evidence-based neurosurgeons. iScience 2023; 26:107590. [PMID: 37705958] [PMCID: PMC10495632] [DOI: 10.1016/j.isci.2023.107590]
Abstract
ChatGPT is an artificial intelligence product developed by OpenAI. This study aims to investigate whether ChatGPT can respond in accordance with evidence-based medicine in neurosurgery. We generated 50 neurosurgical questions covering neurosurgical diseases. Each question was posed three times to GPT-3.5 and GPT-4.0. We also recruited three neurosurgeons with high, middle, and low seniority to respond to questions. The results were analyzed regarding ChatGPT's overall performance score, mean scores by the items' specialty classification, and question type. In conclusion, GPT-3.5's ability to respond in accordance with evidence-based medicine was comparable to that of neurosurgeons with low seniority, and GPT-4.0's ability was comparable to that of neurosurgeons with high seniority. Although ChatGPT is yet to be comparable to a neurosurgeon with high seniority, future upgrades could enhance its performance and abilities.
Affiliation(s)
- Jiayu Liu
- Department of Neurosurgery, the First Medical Centre, Chinese PLA General Hospital, Beijing 100853, China
- Jiqi Zheng
- School of Health Humanities, Peking University, Beijing 100191, China
- Xintian Cai
- Department of Graduate School, Xinjiang Medical University, Urumqi 830001, China
- Dongdong Wu
- Department of Information, Daping Hospital, Army Medical University, Chongqing 400042, China
- Chengliang Yin
- Faculty of Medicine, Macau University of Science and Technology, Macau 999078, China