1
Agnihotri AP, Nagel ID, Artiaga JCM, Guevarra MCB, Sosuan GMN, Kalaw FGP. Large Language Models in Ophthalmology: A Review of Publications from Top Ophthalmology Journals. Ophthalmol Sci 2025; 5:100681. PMID: 40114712; PMCID: PMC11925577; DOI: 10.1016/j.xops.2024.100681
Abstract
Purpose: To review and evaluate the current literature on the application and impact of large language models (LLMs) in ophthalmology, focusing on studies published in high-ranking ophthalmology journals.
Design: Retrospective review of published articles.
Participants: This study did not involve human participants.
Methods: Articles discussing LLMs, published up to June 7, 2024, in first-quartile (Q1) ophthalmology journals on the Scimago Journal & Country Rank, were reviewed, parsed, and analyzed.
Main Outcome Measures: Article and author characteristics, as well as data regarding the LLM used and its applications, focusing on medical education, clinical assistance, research, and patient education.
Results: There were 35 Q1-ranked journals identified, 19 of which contained articles discussing LLMs, with 101 articles eligible for review. One-third were original investigations (32%; 32/101), with an average of 5.3 authors per article. The United States (50.4%; 51/101) was the most represented country, followed by the United Kingdom (25.7%; 26/101) and Canada (16.8%; 17/101). ChatGPT was the most used LLM among the studies, with different versions discussed and compared. LLM applications were discussed in relation to medical education, clinical assistance, research, and patient education.
Conclusions: The numerous publications on the use of LLMs in ophthalmology can provide valuable insights for stakeholders and consumers of these applications. LLMs present significant opportunities for advancement in ophthalmology, particularly in team science, education, clinical assistance, and research. Although LLMs show promise, they also present challenges such as performance inconsistencies, bias, and ethical concerns. The study emphasizes the need for ongoing artificial intelligence improvement, ethical guidelines, and multidisciplinary collaboration.
Financial Disclosures: The author(s) have no proprietary or commercial interest in any materials discussed in this article.
Affiliation(s)
- Akshay Prashant Agnihotri
- Jacobs Retina Center, University of California, San Diego, La Jolla, California
- Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California, San Diego, La Jolla, California
- Retina Care Hospital, Nagpur, India
- Ines Doris Nagel
- Jacobs Retina Center, University of California, San Diego, La Jolla, California
- Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California, San Diego, La Jolla, California
- Department of Ophthalmology, University Hospital Augsburg, Augsburg, Germany
- Jose Carlo M Artiaga
- Department of Ophthalmology and Visual Sciences, Philippine General Hospital, University of the Philippines Manila, Manila City, Philippines
- International Eye Institute, St. Luke's Medical Center Global City, Taguig City, Philippines
- Ma Carmela B Guevarra
- Department of Ophthalmology, Massachusetts Eye and Ear, Boston, Massachusetts
- Harvard Medical School, Department of Ophthalmology, Boston, Massachusetts
- George Michael N Sosuan
- Department of Ophthalmology and Visual Sciences, Philippine General Hospital, University of the Philippines Manila, Manila City, Philippines
- Fritz Gerald P Kalaw
- Jacobs Retina Center, University of California, San Diego, La Jolla, California
- Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California, San Diego, La Jolla, California
- Division of Ophthalmology Informatics and Data Science, Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California, San Diego, La Jolla, California
2
Mikhail D, Milad D, Antaki F, Milad J, Farah A, Khairy T, El-Khoury J, Bachour K, Szigiato AA, Nayman T, Mullie GA, Duval R. Multimodal Performance of GPT-4 in Complex Ophthalmology Cases. J Pers Med 2025; 15:160. PMID: 40278339; PMCID: PMC12028970; DOI: 10.3390/jpm15040160
Abstract
Objectives: The integration of multimodal capabilities into GPT-4 represents a transformative leap for artificial intelligence in ophthalmology, yet its utility in scenarios requiring advanced reasoning remains underexplored. This study evaluates GPT-4's multimodal performance on open-ended diagnostic and next-step reasoning tasks in complex ophthalmology cases, comparing it against human expertise.
Methods: GPT-4 was assessed across three study arms: (1) text-based case details with figure descriptions, (2) cases with text and accompanying ophthalmic figures, and (3) cases with figures only (no figure descriptions). We compared GPT-4's diagnostic and next-step accuracy across arms and benchmarked its performance against three board-certified ophthalmologists.
Results: GPT-4 achieved 38.4% (95% CI [33.9%, 43.1%]) diagnostic accuracy and 57.8% (95% CI [52.8%, 62.2%]) next-step accuracy when prompted with figures without descriptions. Diagnostic accuracy declined significantly compared to text-only prompts (p = 0.007), though next-step performance was similar (p = 0.140). Adding figure descriptions restored diagnostic accuracy (49.3%) to near parity with text-only prompts (p = 0.684). Using figures without descriptions, GPT-4's diagnostic accuracy was comparable to that of two ophthalmologists (p = 0.30, p = 0.41) but fell short of the highest-performing ophthalmologist (p = 0.0004). For next-step accuracy, GPT-4 was similar to one ophthalmologist (p = 0.22) but underperformed relative to the other two (p = 0.0015, p = 0.0017).
Conclusions: GPT-4's diagnostic performance diminishes when it relies solely on ophthalmic images without textual context, highlighting limitations in its current multimodal capabilities. Despite this, GPT-4 performed comparably to at least one ophthalmologist on both diagnostic and next-step reasoning tasks, underscoring its potential as an assistive tool. Future research should refine multimodal prompts and explore iterative or sequential prompting strategies to optimize AI-driven interpretation of complex ophthalmic datasets.
Affiliation(s)
- David Mikhail
- Temerty Faculty of Medicine, University of Toronto, Toronto, ON M5S 1A1, Canada
- Daniel Milad
- Department of Ophthalmology, University of Montreal, Montreal, QC H3T 1J4, Canada
- Department of Ophthalmology, Hôpital Maisonneuve-Rosemont, Montreal, QC H1T 2M4, Canada
- Fares Antaki
- Department of Ophthalmology, University of Montreal, Montreal, QC H3T 1J4, Canada
- Cole Eye Institute, Cleveland Clinic, Cleveland, OH 44195, USA
- CHUM School of Artificial Intelligence in Healthcare (SAIH), Centre Hospitalier de l’Université de Montréal (CHUM), Montreal, QC H2X 3E4, Canada
- Jason Milad
- Department of Software Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Andrew Farah
- Faculty of Medicine, McGill University, Montreal, QC H3A 0G4, Canada
- Thomas Khairy
- Faculty of Medicine, McGill University, Montreal, QC H3A 0G4, Canada
- Jonathan El-Khoury
- Department of Ophthalmology, University of Montreal, Montreal, QC H3T 1J4, Canada
- Department of Ophthalmology, Hôpital Maisonneuve-Rosemont, Montreal, QC H1T 2M4, Canada
- Kenan Bachour
- Department of Ophthalmology, University of Montreal, Montreal, QC H3T 1J4, Canada
- Department of Ophthalmology, Hôpital Maisonneuve-Rosemont, Montreal, QC H1T 2M4, Canada
- Taylor Nayman
- Department of Ophthalmology, University of Montreal, Montreal, QC H3T 1J4, Canada
- Department of Ophthalmology, Hôpital Maisonneuve-Rosemont, Montreal, QC H1T 2M4, Canada
- Guillaume A. Mullie
- Department of Ophthalmology, University of Montreal, Montreal, QC H3T 1J4, Canada
- Department of Ophthalmology, St. Mary’s Hospital Center, Montreal, QC H3T 1M5, Canada
- Renaud Duval
- Department of Ophthalmology, University of Montreal, Montreal, QC H3T 1J4, Canada
- Department of Ophthalmology, Hôpital Maisonneuve-Rosemont, Montreal, QC H1T 2M4, Canada
3
Takita H, Kabata D, Walston SL, Tatekawa H, Saito K, Tsujimoto Y, Miki Y, Ueda D. A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians. NPJ Digit Med 2025; 8:175. PMID: 40121370; PMCID: PMC11929846; DOI: 10.1038/s41746-025-01543-z
Abstract
While generative artificial intelligence (AI) has shown potential in medical diagnostics, comprehensive evaluation of its diagnostic performance and comparison with physicians has not been extensively explored. We conducted a systematic review and meta-analysis of studies validating generative AI models for diagnostic tasks published between June 2018 and June 2024. Analysis of 83 studies revealed an overall diagnostic accuracy of 52.1%. No significant performance difference was found between AI models and physicians overall (p = 0.10) or non-expert physicians (p = 0.93). However, AI models performed significantly worse than expert physicians (p = 0.007). Several models demonstrated slightly higher performance compared to non-experts, although the differences were not significant. Generative AI demonstrates promising diagnostic capabilities with accuracy varying by model. Although it has not yet achieved expert-level reliability, these findings suggest potential for enhancing healthcare delivery and medical education when implemented with appropriate understanding of its limitations.
Affiliation(s)
- Hirotaka Takita
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Daijiro Kabata
- Center for Mathematical and Data Science, Kobe University, Kobe, Japan
- Shannon L Walston
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Department of Artificial Intelligence, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Hiroyuki Tatekawa
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Kenichi Saito
- Center for Digital Transformation of Health Care, Graduate School of Medicine, Kyoto University, Kyoto, Japan
- Yasushi Tsujimoto
- Oku Medical Clinic, Osaka, Japan
- Department of Health Promotion and Human Behavior, Kyoto University Graduate School of Medicine/School of Public Health, Kyoto University, Kyoto, Japan
- Scientific Research WorkS Peer Support Group (SRWS-PSG), Osaka, Japan
- Yukio Miki
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Daiju Ueda
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Department of Artificial Intelligence, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Center for Health Science Innovation, Osaka Metropolitan University, Osaka, Japan
4
Mikhail D, Mihalache A, Huang RS, Khairy T, Popovic MM, Milad D, Shor R, Pereira A, Kwok J, Yan P, Wong DT, Kertes PJ, Duval R, Muni RH. Performance of ChatGPT in French language analysis of multimodal retinal cases. J Fr Ophtalmol 2025; 48:104391. PMID: 39708623; DOI: 10.1016/j.jfo.2024.104391
Abstract
Purpose: Prior literature has suggested reduced performance of large language models (LLMs) in non-English analyses, including Arabic and French. However, no current studies have tested the multimodal performance of ChatGPT on French ophthalmology cases or compared it with results observed in the English literature. We compared the performance of ChatGPT-4 in French and English on open-ended prompts using multimodal input data from retinal cases.
Methods: GPT-4 was prompted in English and French using a public dataset containing 67 retinal cases from the ophthalmology education website OCTCases.com. Each prompt comprised the clinical case and accompanying ophthalmic images, along with the open-ended question: "What is the most likely diagnosis?" Systematic prompting was used to identify and compare the relevant factor(s) contributing to correct and incorrect responses. The primary outcome was diagnostic accuracy, defined as the proportion of correctly diagnosed cases in French and English; diagnoses were checked against the answer key on OCTCases. Clinically relevant factors reported by the LLM as contributory to its decision-making were secondary endpoints.
Results: The diagnostic accuracies of GPT-4 in English and French were 35.8% and 28.4%, respectively (χ2, P = 0.36). Imaging findings were reported as most influential for correct diagnoses in English (37.5%) and French (42.1%) (P = 0.76). In incorrectly diagnosed cases, imaging findings were primarily implicated in both English (35.6%) and French (33.3%) (P = 0.81), and the differential diagnosis list contained the correct diagnosis in 39.5% of English cases and 41.7% of French cases (P = 0.83).
Conclusion: Our results suggest that GPT-4 performed similarly in English and French on all quantitative performance metrics measured. Ophthalmic images were identified in both languages as critical for correct diagnosis. Future research should assess LLM comprehension through the clarity and grammatical, cultural, and idiomatic accuracy of its responses.
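The English-versus-French comparison above can be sanity-checked with a quick chi-square calculation. The sketch below assumes the reported 35.8% and 28.4% accuracies correspond to roughly 24/67 and 19/67 correctly diagnosed cases; these counts are inferred from the percentages, not stated in the abstract.

```python
# Sanity-check the reported chi-square comparison of diagnostic accuracy
# (35.8% English vs. 28.4% French, 67 cases per language). The
# correct/incorrect counts are inferred from the percentages.
import math

n = 67
correct_en = round(0.358 * n)  # ~24 correct English diagnoses (inferred)
correct_fr = round(0.284 * n)  # ~19 correct French diagnoses (inferred)

# 2x2 contingency table: rows = language, columns = correct / incorrect.
a, b = correct_en, n - correct_en
c, d = correct_fr, n - correct_fr
total = a + b + c + d

# Pearson chi-square statistic for a 2x2 table (no continuity correction).
chi2 = total * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# For 1 degree of freedom, the p-value is the chi-square survival
# function, which reduces to erfc(sqrt(chi2 / 2)).
p = math.erfc(math.sqrt(chi2 / 2))
print(f"chi2 = {chi2:.3f}, p = {p:.2f}")  # p lands near the reported 0.36
```

With these inferred counts the test is far from significance, consistent with the abstract's conclusion that the two languages performed similarly.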
Affiliation(s)
- D Mikhail
- Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
- A Mihalache
- Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
- R S Huang
- Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
- T Khairy
- Faculty of Medicine, McGill University, Montreal, Quebec, Canada
- M M Popovic
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- D Milad
- Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada
- R Shor
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- A Pereira
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- J Kwok
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- P Yan
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- D T Wong
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada; Department of Ophthalmology, St. Michael's Hospital/Unity Health Toronto, Toronto, Ontario, Canada
- P J Kertes
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada; John and Liz Tory Eye Centre, Sunnybrook Health Science Centre, Toronto, Ontario, Canada
- R Duval
- Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada; Department of Ophthalmology, Hospital Maisonneuve-Rosemont, Montreal, Quebec, Canada
- R H Muni
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada; Department of Ophthalmology, St. Michael's Hospital/Unity Health Toronto, Toronto, Ontario, Canada
5
Sabaner MC, Anguita R, Antaki F, Balas M, Boberg-Ans LC, Ferro Desideri L, Grauslund J, Hansen MS, Klefter ON, Potapenko I, Rasmussen MLR, Subhi Y. Opportunities and Challenges of Chatbots in Ophthalmology: A Narrative Review. J Pers Med 2024; 14:1165. PMID: 39728077; DOI: 10.3390/jpm14121165
Abstract
Artificial intelligence (AI) is becoming increasingly influential in ophthalmology, particularly through advancements in machine learning, deep learning, robotics, neural networks, and natural language processing (NLP). Among these, NLP-based chatbots are the most readily accessible and are driven by AI-based large language models (LLMs). These chatbots have facilitated new research avenues and have gained traction in both clinical and surgical applications in ophthalmology. They are also increasingly being utilized in studies on ophthalmology-related exams, particularly those containing multiple-choice questions (MCQs). This narrative review evaluates both the opportunities and the challenges of integrating chatbots into ophthalmology research, with separate assessments of studies involving open- and closed-ended questions. While chatbots have demonstrated sufficient accuracy on MCQ-based studies to support their use in education, additional exam security measures are necessary. Research on open-ended question responses suggests that AI-based LLM chatbots could be applied across nearly all areas of ophthalmology: they have shown promise for answering patient inquiries, offering medical advice, educating patients, supporting triage, facilitating diagnosis and differential diagnosis, and aiding in surgical planning. However, ethical implications, confidentiality concerns, physician liability, and patient privacy remain pressing challenges. Although AI has demonstrated significant promise in clinical patient care, it is currently most effective as a supportive tool rather than as a replacement for human physicians.
Affiliation(s)
- Mehmet Cem Sabaner
- Department of Ophthalmology, Kastamonu University, Training and Research Hospital, 37150 Kastamonu, Türkiye
- Rodrigo Anguita
- Department of Ophthalmology, Inselspital, University Hospital Bern, University of Bern, 3010 Bern, Switzerland
- Moorfields Eye Hospital National Health Service Foundation Trust, London EC1V 2PD, UK
- Fares Antaki
- Moorfields Eye Hospital National Health Service Foundation Trust, London EC1V 2PD, UK
- The CHUM School of Artificial Intelligence in Healthcare, Montreal, QC H2X 0A9, Canada
- Cole Eye Institute, Cleveland Clinic, Cleveland, OH 44195, USA
- Michael Balas
- Department of Ophthalmology & Vision Sciences, University of Toronto, Toronto, ON M5T 2S8, Canada
- Lorenzo Ferro Desideri
- Department of Ophthalmology, Inselspital, University Hospital Bern, University of Bern, 3010 Bern, Switzerland
- Graduate School for Health Sciences, University of Bern, 3012 Bern, Switzerland
- Jakob Grauslund
- Department of Ophthalmology, Odense University Hospital, 5000 Odense, Denmark
- Department of Clinical Research, University of Southern Denmark, 5230 Odense, Denmark
- Department of Ophthalmology, Vestfold Hospital Trust, 3103 Tønsberg, Norway
- Oliver Niels Klefter
- Department of Ophthalmology, Rigshospitalet, 2100 Copenhagen, Denmark
- Department of Clinical Medicine, University of Copenhagen, 1172 Copenhagen, Denmark
- Ivan Potapenko
- Department of Ophthalmology, Rigshospitalet, 2100 Copenhagen, Denmark
- Marie Louise Roed Rasmussen
- Department of Ophthalmology, Rigshospitalet, 2100 Copenhagen, Denmark
- Department of Clinical Medicine, University of Copenhagen, 1172 Copenhagen, Denmark
- Yousif Subhi
- Department of Clinical Research, University of Southern Denmark, 5230 Odense, Denmark
- Department of Ophthalmology, Rigshospitalet, 2100 Copenhagen, Denmark
- Department of Clinical Medicine, University of Copenhagen, 1172 Copenhagen, Denmark
6
Chotcomwongse P, Ruamviboonsuk P, Grzybowski A. Utilizing Large Language Models in Ophthalmology: The Current Landscape and Challenges. Ophthalmol Ther 2024; 13:2543-2558. PMID: 39180701; PMCID: PMC11408418; DOI: 10.1007/s40123-024-01018-6
Abstract
A large language model (LLM) is an artificial intelligence (AI) model that uses natural language processing (NLP) to understand, interpret, and generate human-like language from unstructured text input. Its real-time responses and fluent dialogue make human-AI communication more interactive than ever before. By drawing on numerous internet sources, LLM chatbots can respond to a wide range of queries, including problem solving, text summarization, and the creation of informative notes. Since ophthalmology is a medical field that integrates image analysis, telemedicine, AI, and other technologies, LLMs are likely to play an important role in eye care in the near future. This review summarizes the performance and potential applicability of LLMs in ophthalmology according to currently available publications.
Affiliation(s)
- Peranut Chotcomwongse
- Vitreoretina Unit, Department of Ophthalmology, Rajavithi Hospital, Rungsit University, Bangkok, Thailand
- Paisan Ruamviboonsuk
- Vitreoretina Unit, Department of Ophthalmology, Rajavithi Hospital, Rungsit University, Bangkok, Thailand
- Andrzej Grzybowski
- University of Warmia and Mazury, Olsztyn, Poland
- Institute for Research in Ophthalmology, Foundation for Ophthalmology Development, 61-553 Poznan, Poland
7
Mihalache A, Popovic MM, Muni RH. Need for Custom Artificial Intelligence Chatbots in Ophthalmology. JAMA Ophthalmol 2024; 142:806-807. PMID: 39023863; DOI: 10.1001/jamaophthalmol.2024.2738
Affiliation(s)
- Andrew Mihalache
- Temerty School of Medicine, University of Toronto, Toronto, Ontario, Canada
- Marko M Popovic
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- Rajeev H Muni
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- Department of Ophthalmology, St Michael's Hospital/Unity Health Toronto, Toronto, Ontario, Canada