1
Waqas A, Khan A, Ozturk ZG, Saeed-Vafa D, Chen W, Dhillon J, Bychkov A, Bui MM, Ullah E, Khalil F, Chumbalkar V, Jameel Z, Bittar HT, Singh RS, Parwani AV, Schabath MB, Rasool G. Reasoning Beyond Accuracy: Expert Evaluation of Large Language Models in Diagnostic Pathology. medRxiv: The Preprint Server for Health Sciences 2025:2025.04.11.25325686. PMID: 40297448; PMCID: PMC12036407; DOI: 10.1101/2025.04.11.25325686.
Abstract
Background: Diagnostic pathology depends on complex, structured reasoning to interpret clinical, histologic, and molecular data. Replicating this cognitive process algorithmically remains a significant challenge. As large language models (LLMs) gain traction in medicine, it is critical to determine whether they have clinical utility by providing reasoning in highly specialized domains such as pathology.
Methods: We evaluated the performance of four reasoning LLMs (OpenAI o1, OpenAI o3-mini, Gemini 2.0 Flash Thinking Experimental, and DeepSeek-R1 671B) on 15 board-style open-ended pathology questions. Responses were independently reviewed by 11 pathologists using a structured framework that assessed language quality (accuracy, relevance, coherence, depth, and conciseness) and seven diagnostic reasoning strategies. Scores were normalized and aggregated for analysis. We also evaluated inter-observer agreement to assess scoring consistency. Model comparisons were conducted using one-way ANOVA and Tukey's Honestly Significant Difference (HSD) test.
Results: Gemini and DeepSeek significantly outperformed OpenAI o1 and OpenAI o3-mini in overall reasoning quality (p < 0.05), particularly in analytical depth and coherence. While all models achieved comparable accuracy, only Gemini and DeepSeek consistently applied expert-like reasoning strategies, including algorithmic, inductive, and Bayesian approaches. Performance varied by reasoning type: models performed best in algorithmic and deductive reasoning and poorest in heuristic reasoning and pattern recognition. Inter-observer agreement was highest for Gemini (p < 0.05), indicating greater consistency and interpretability. Models with more in-depth reasoning (Gemini and DeepSeek) were generally less concise.
Conclusion: Advanced LLMs such as Gemini and DeepSeek can approximate aspects of expert-level diagnostic reasoning in pathology, particularly in algorithmic and structured approaches. However, limitations persist in contextual reasoning, heuristic decision-making, and consistency across questions. Addressing these gaps, along with trade-offs between depth and conciseness, will be essential for the safe and effective integration of AI tools into clinical pathology workflows.
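A minimal sketch of the statistical comparison named in the Methods (one-way ANOVA followed by Tukey's HSD on normalized reviewer scores) is shown below; the model means, group sizes, and scores are invented placeholders, not the study's data.

```python
# Minimal sketch of the comparison described in the Methods: one-way ANOVA
# followed by Tukey's HSD on normalized reviewer scores. All values below are
# invented placeholders, not the study's data.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
models = ["o1", "o3-mini", "Gemini", "DeepSeek-R1"]
# Hypothetical normalized scores (0-1): 11 reviewers x 15 questions per model.
scores = {m: rng.normal(loc=mu, scale=0.08, size=165).clip(0, 1)
          for m, mu in zip(models, [0.62, 0.60, 0.74, 0.72])}

# One-way ANOVA across the four models.
f_stat, p_value = f_oneway(*scores.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.3g}")

# Tukey's HSD for pairwise comparisons between models.
flat = np.concatenate(list(scores.values()))
labels = np.repeat(models, 165)
print(pairwise_tukeyhsd(flat, labels, alpha=0.05))
```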
Affiliation(s)
- Asim Waqas
- Department of Cancer Epidemiology, H. Lee Moffitt Cancer Center & Research Institute, Tampa, FL
- Asma Khan
- Armed Forces Institute of Pathology, Rawalpindi, Pakistan
- Weishen Chen
- Department of Dermatology & Cutaneous Surgery, University of South Florida, Tampa, FL
- Jasreman Dhillon
- Department of Anatomic Pathology, H. Lee Moffitt Cancer Center & Research Institute
- Andrey Bychkov
- Department of Pathology, Kameda Medical Center, Kamogawa City, Chiba Prefecture, Japan
- Marilyn M Bui
- Department of Pathology, H. Lee Moffitt Cancer Center & Research Institute
- Ehsan Ullah
- Department of Surgery, Health New Zealand, Counties Manukau, Auckland, New Zealand
- Farah Khalil
- Department of Pathology, H. Lee Moffitt Cancer Center & Research Institute
- Vaibhav Chumbalkar
- Department of Pathology, H. Lee Moffitt Cancer Center & Research Institute
- Zena Jameel
- Department of Pathology, H. Lee Moffitt Cancer Center & Research Institute
- Rajendra S Singh
- Dermatopathology and Digital Pathology, Summit Health, Berkeley Heights, NJ
- Anil V Parwani
- Department of Pathology, The Ohio State University, Columbus, Ohio
- Matthew B Schabath
- Department of Cancer Epidemiology, H. Lee Moffitt Cancer Center & Research Institute
- Ghulam Rasool
- Department of Machine Learning, H. Lee Moffitt Cancer Center & Research Institute
2
Brodsky V, Ullah E, Bychkov A, Song AH, Walk EE, Louis P, Rasool G, Singh RS, Mahmood F, Bui MM, Parwani AV. Generative Artificial Intelligence in Anatomic Pathology. Arch Pathol Lab Med 2025;149:298-318. PMID: 39836377; DOI: 10.5858/arpa.2024-0215-ra.
Abstract
Context: Generative artificial intelligence (AI) has emerged as a transformative force in various fields, including anatomic pathology, where it offers the potential to significantly enhance diagnostic accuracy, workflow efficiency, and research capabilities.
Objective: To explore the applications, benefits, and challenges of generative AI in anatomic pathology, with a focus on its impact on diagnostic processes, workflow efficiency, education, and research.
Data Sources: A comprehensive review of current literature and recent advancements in the application of generative AI within anatomic pathology, categorized into unimodal and multimodal applications, and evaluated for clinical utility, ethical considerations, and future potential.
Conclusions: Generative AI demonstrates significant promise in various domains of anatomic pathology, including diagnostic accuracy enhanced through AI-driven image analysis, virtual staining, and synthetic data generation; workflow efficiency, with potential for improvement by automating routine tasks, quality control, and reflex testing; education and research, facilitated by AI-generated educational content, synthetic histology images, and advanced data analysis methods; and clinical integration, with preliminary surveys indicating cautious optimism for nondiagnostic AI tasks and growing engagement in academic settings. Ethical and practical challenges require rigorous validation, prompt engineering, federated learning, and synthetic data generation to help ensure trustworthy, reliable, and unbiased AI applications. Generative AI can potentially revolutionize anatomic pathology, enhancing diagnostic accuracy, improving workflow efficiency, and advancing education and research. Successful integration into clinical practice will require continued interdisciplinary collaboration, careful validation, and adherence to ethical standards to ensure the benefits of AI are realized while maintaining the highest standards of patient care.
Affiliation(s)
- Victor Brodsky
- From the Department of Pathology and Immunology, Washington University School of Medicine in St Louis, St Louis, Missouri (Brodsky)
- Ehsan Ullah
- the Department of Surgery, Health New Zealand, Counties Manukau, New Zealand (Ullah)
- Andrey Bychkov
- the Department of Pathology, Kameda Medical Center, Kamogawa City, Chiba Prefecture, Japan (Bychkov)
- the Department of Pathology, Nagasaki University, Nagasaki, Japan (Bychkov)
- Andrew H Song
- the Department of Pathology, Brigham and Women's Hospital, Boston, Massachusetts (Song, Mahmood)
- Eric E Walk
- Office of the Chief Medical Officer, PathAI, Boston, Massachusetts (Walk)
- Peter Louis
- the Department of Pathology and Laboratory Medicine, Rutgers Robert Wood Johnson Medical School, New Brunswick, New Jersey (Louis)
- Ghulam Rasool
- the Department of Oncologic Sciences, Morsani College of Medicine and Department of Electrical Engineering, University of South Florida, Tampa (Rasool)
- the Department of Machine Learning, Moffitt Cancer Center and Research Institute, Tampa, Florida (Rasool)
- Department of Machine Learning, Neuro-Oncology, Moffitt Cancer Center and Research Institute, Tampa, Florida (Rasool)
- Rajendra S Singh
- Dermatopathology and Digital Pathology, Summit Health, Berkeley Heights, New Jersey (Singh)
- Faisal Mahmood
- the Department of Pathology, Brigham and Women's Hospital, Boston, Massachusetts (Song, Mahmood)
- Marilyn M Bui
- Department of Machine Learning, Pathology, Moffitt Cancer Center and Research Institute, Tampa, Florida (Bui)
- Anil V Parwani
- the Department of Pathology, The Ohio State University, Columbus (Parwani)
3
Zhou Z, Qin P, Cheng X, Shao M, Ren Z, Zhao Y, Li Q, Liu L. ChatGPT in Oncology Diagnosis and Treatment: Applications, Legal and Ethical Challenges. Curr Oncol Rep 2025;27:336-354. PMID: 39998782; DOI: 10.1007/s11912-025-01649-3.
Abstract
Purpose of Review: This study aims to systematically review the trajectory of artificial intelligence (AI) development in the medical field, with a particular emphasis on ChatGPT, a cutting-edge tool that is transforming oncology's diagnostic and treatment practices.
Recent Findings: Recent advancements have demonstrated that ChatGPT can be effectively utilized in various areas, including collecting medical histories, conducting radiological and pathological diagnoses, generating electronic medical records (EMRs), providing nutritional support, participating in multidisciplinary team (MDT) discussions, and formulating personalized, multidisciplinary treatment plans. However, significant challenges related to data privacy and legal issues need to be addressed before ChatGPT can be integrated safely and effectively into clinical practice. ChatGPT, an emerging AI technology, opens up new avenues and viewpoints for oncology diagnosis and treatment. If current technological and legal challenges can be overcome, ChatGPT is expected to play a more significant role in oncology diagnosis and treatment in the future, providing better treatment options and improving the quality of medical services.
Affiliation(s)
- Zihan Zhou
- The First Clinical Medical College of Nanjing Medical University, Nanjing, 211166, China
- Peng Qin
- The First Clinical Medical College of Nanjing Medical University, Nanjing, 211166, China
- Xi Cheng
- The First Clinical Medical College of Nanjing Medical University, Nanjing, 211166, China
- Maoxuan Shao
- The First Clinical Medical College of Nanjing Medical University, Nanjing, 211166, China
- Zhaozheng Ren
- The First Clinical Medical College of Nanjing Medical University, Nanjing, 211166, China
- Yiting Zhao
- Stomatological College of Nanjing Medical University, Nanjing, 211166, China
- Qiunuo Li
- The First Clinical Medical College of Nanjing Medical University, Nanjing, 211166, China
- Lingxiang Liu
- Department of Oncology, The First Affiliated Hospital of Nanjing Medical University, 300 Guangzhou Road, Nanjing, 210029, Jiangsu, China
4
Jain S, Chakraborty B, Agarwal A, Sharma R. Performance of Large Language Models (ChatGPT and Gemini Advanced) in Gastrointestinal Pathology and Clinical Review of Applications in Gastroenterology. Cureus 2025;17:e81618. PMID: 40322390; PMCID: PMC12048130; DOI: 10.7759/cureus.81618.
Abstract
Introduction: Artificial intelligence (AI) chatbots have been widely tested on various examinations, but data on their performance in clinical scenarios remain limited. Chat Generative Pre-Trained Transformer (ChatGPT) (OpenAI, San Francisco, California, United States) and Gemini Advanced (Google LLC, Mountain View, California, United States) have shown some promise in multiple aspects of gastroenterology, including answering patient questions, providing medical advice, and potentially assisting healthcare providers, though with many limitations. We aimed to study the performance of ChatGPT-4.0, ChatGPT-3.5, and Gemini Advanced across 20 clinicopathologic scenarios in the largely unexplored realm of gastrointestinal pathology.
Materials and methods: Twenty clinicopathological scenarios in gastrointestinal pathology were provided to these three large language models. Two fellowship-trained pathologists independently assessed their responses, evaluating both the diagnostic accuracy and the confidence of the models. The results were then compared using the chi-squared test. The study also evaluated each model's ability in four key areas, namely, (1) providing differential diagnoses, (2) interpreting immunohistochemical stains, (3) delivering a concise final diagnosis, and (4) explaining the thought process, using a five-point scoring system. The mean and median scores, standard deviation (SD), and interquartile ranges were calculated. A comparative analysis of these four parameters across ChatGPT-4.0, ChatGPT-3.5, and Gemini Advanced was conducted using the Mann-Whitney U test. A p-value of <0.05 was considered statistically significant. Other parameters evaluated were the ability to provide a tumor, node, and metastasis (TNM) stage and the incidence of pseudo-references ("hallucinations") while citing reference material.
Results: Gemini Advanced (diagnostic accuracy: p=0.01; providing differential diagnoses: p=0.03) and ChatGPT-4.0 (interpretation of immunohistochemistry (IHC) stains: p=0.001; providing differential diagnoses: p=0.002) performed significantly better in certain realms than ChatGPT-3.5, indicating continuously improving training data sets. However, the mean performances of ChatGPT-4.0 and Gemini Advanced ranged between 3.0 and 3.7 and were at best classified as average. None of the models could provide accurate TNM staging for these clinical scenarios, and 25-50% of cited references did not exist (hallucinations).
Conclusion: This study indicated that although these models are evolving, they need human supervision and definite improvements before being used in clinical medicine. To the best of our knowledge, this is the first study of its kind in gastrointestinal pathology.
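A hedged sketch of the two tests named in the Methods (a chi-squared test on diagnostic accuracy counts and a Mann-Whitney U comparison of rubric scores) follows; all counts and scores are invented for illustration and do not reproduce the study's results.

```python
# Hedged sketch of the tests named in the Methods: a chi-squared test on
# correct/incorrect counts and a Mann-Whitney U test on 1-5 rubric scores.
# All numbers are invented for illustration.
import numpy as np
from scipy.stats import chi2_contingency, mannwhitneyu

# Hypothetical correct/incorrect counts over 20 scenarios for two models.
contingency = np.array([[15, 5],   # model A: correct, incorrect
                        [9, 11]])  # model B: correct, incorrect
chi2, p_chi, dof, _ = chi2_contingency(contingency)
print(f"chi-squared = {chi2:.2f}, p = {p_chi:.3f}")

# Hypothetical 1-5 rubric scores (e.g., differential diagnosis) for 20 scenarios.
scores_a = [4, 3, 5, 4, 3, 4, 5, 3, 4, 4, 3, 5, 4, 4, 3, 4, 5, 4, 3, 4]
scores_b = [3, 2, 4, 3, 3, 2, 4, 3, 3, 2, 3, 4, 3, 2, 3, 3, 4, 3, 2, 3]
u_stat, p_u = mannwhitneyu(scores_a, scores_b, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_u:.4f}")
```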
Affiliation(s)
- Swachi Jain
- Pathology and Laboratory Medicine, Icahn School of Medicine at Mount Sinai, New York, USA
- Rashi Sharma
- Pathology and Laboratory Medicine, Medanta-The Medicity, Gurgaon, IND
5
Rashidi HH, Pantanowitz J, Chamanzar A, Fennell B, Wang Y, Gullapalli RR, Tafti A, Deebajah M, Albahra S, Glassy E, Hanna MG, Pantanowitz L. Generative Artificial Intelligence in Pathology and Medicine: A Deeper Dive. Mod Pathol 2025;38:100687. PMID: 39689760; DOI: 10.1016/j.modpat.2024.100687.
Abstract
This review article builds upon the introductory piece in our 7-part series, delving deeper into the transformative potential of generative artificial intelligence (Gen AI) in pathology and medicine. The article explores the applications of Gen AI models in pathology and medicine, including the use of custom chatbots for diagnostic report generation, synthetic image synthesis for training new models, data set augmentation, hypothetical scenario generation for educational purposes, and the use of multimodal along with multiagent models. This article also provides an overview of the common categories within Gen AI models, discussing open-source and closed-source models, as well as specific examples of popular models such as GPT-4, Llama, Mistral, DALL-E, Stable Diffusion, and their associated frameworks (eg, transformers, generative adversarial networks, diffusion-based neural networks), along with their limitations and challenges, especially within the medical domain. We also review common libraries and tools that are currently deemed necessary to build and integrate such models. Finally, we look to the future, discussing the potential impact of Gen AI on health care, including benefits, challenges, and concerns related to privacy, bias, ethics, application programming interface costs, and security measures.
Affiliation(s)
- Hooman H Rashidi
- Department of Pathology, University of Pittsburgh Medical Center, Pittsburgh, Pennsylvania; Computational Pathology and AI Center of Excellence (CPACE), University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania.
- Alireza Chamanzar
- Department of Pathology, University of Pittsburgh Medical Center, Pittsburgh, Pennsylvania; Computational Pathology and AI Center of Excellence (CPACE), University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania
- Brandon Fennell
- Department of Medicine, UCSF, School of Medicine, San Francisco, California
- Yanshan Wang
- Computational Pathology and AI Center of Excellence (CPACE), University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania; Department of Health Information Management, University of Pittsburgh, Pittsburgh, Pennsylvania
- Rama R Gullapalli
- Departments of Pathology and Chemical and Biological Engineering, University of New Mexico, Albuquerque, New Mexico
- Ahmad Tafti
- Computational Pathology and AI Center of Excellence (CPACE), University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania; Department of Health Information Management, University of Pittsburgh, Pittsburgh, Pennsylvania
- Mustafa Deebajah
- Pathology & Laboratory Medicine Institute, Cleveland Clinic, Cleveland, Ohio
- Samer Albahra
- Pathology & Laboratory Medicine Institute, Cleveland Clinic, Cleveland, Ohio
- Eric Glassy
- Affiliated Pathologists Medical Group, California
- Matthew G Hanna
- Department of Pathology, University of Pittsburgh Medical Center, Pittsburgh, Pennsylvania; Computational Pathology and AI Center of Excellence (CPACE), University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania
- Liron Pantanowitz
- Department of Pathology, University of Pittsburgh Medical Center, Pittsburgh, Pennsylvania; Computational Pathology and AI Center of Excellence (CPACE), University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania
6
Ye H. A more precise interpretation of the potential value of artificial intelligence tools in medical education is needed. Postgrad Med J 2025:qgaf024. PMID: 39927740; DOI: 10.1093/postmj/qgaf024.
Affiliation(s)
- Hongnan Ye
- Department of Medical Education and Research, Beijing Alumni Association of China Medical University, No. 9 Wenhuiyuan North Road, Haidian District, Beijing 100000, China
7
Laohawetwanit T, Apornvirat S, Namboonlue C. Thinking like a pathologist: Morphologic approach to hepatobiliary tumors by ChatGPT. Am J Clin Pathol 2025;163:3-11. PMID: 39030695; DOI: 10.1093/ajcp/aqae087.
Abstract
Objectives: This research aimed to evaluate the effectiveness of ChatGPT in accurately diagnosing hepatobiliary tumors using histopathologic images.
Methods: The study compared the diagnostic accuracies of the GPT-4 model when given the same set of images with 2 different input prompts. The first prompt, the morphologic approach, was designed to mimic pathologists' approach to analyzing tissue morphology. The second prompt functioned without this morphologic analysis feature. Diagnostic accuracy and consistency were analyzed.
Results: A total of 120 photomicrographs, comprising 60 images each of hepatobiliary tumors and nonneoplastic liver tissue, were used. The findings revealed that the morphologic approach significantly enhanced the diagnostic accuracy and consistency of the artificial intelligence (AI). This approach was notably more accurate in identifying hepatocellular carcinoma (mean accuracy: 62.0% vs 27.3%), bile duct adenoma (10.7% vs 3.3%), and cholangiocarcinoma (68.7% vs 16.0%), as well as in distinguishing nonneoplastic liver tissue (77.3% vs 37.5%) (Ps ≤ .01). It also demonstrated higher diagnostic consistency than the prompt without morphologic analysis (κ: 0.46 vs 0.27).
Conclusions: This research emphasizes the importance of incorporating pathologists' diagnostic approaches into AI to enhance accuracy and consistency in medical diagnostics. It showcases the AI's histopathologic promise when it replicates expert diagnostic processes.
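The two prompting conditions compared here can be sketched roughly as follows; the prompt wording and the query_model helper are assumptions for illustration, not the study's actual materials or API calls.

```python
# Illustrative sketch of the two prompting conditions compared above. The prompt
# wording and the query_model() helper are assumptions, not the study's materials.
MORPHOLOGIC_PROMPT = (
    "You are a pathologist. First describe the architecture, cytologic features, "
    "and background changes in the attached photomicrograph, then give the most "
    "likely diagnosis."
)
PLAIN_PROMPT = "Give the most likely diagnosis for the attached photomicrograph."

def query_model(prompt: str, image_path: str) -> str:
    """Placeholder for a call to a vision-capable LLM endpoint."""
    raise NotImplementedError("Wire this to the model being evaluated.")

def run_condition(image_paths: list[str], prompt: str) -> dict[str, str]:
    """Collect one response per image under a given prompting condition."""
    return {path: query_model(prompt, path) for path in image_paths}
```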
Affiliation(s)
- Thiyaphat Laohawetwanit
- Division of Pathology, Chulabhorn International College of Medicine, Thammasat University, Pathum Thani, Thailand
- Division of Pathology, Thammasat University Hospital, Pathum Thani, Thailand
- Sompon Apornvirat
- Division of Pathology, Chulabhorn International College of Medicine, Thammasat University, Pathum Thani, Thailand
- Division of Pathology, Thammasat University Hospital, Pathum Thani, Thailand
8
Ding L, Fan L, Shen M, Wang Y, Sheng K, Zou Z, An H, Jiang Z. Evaluating ChatGPT's diagnostic potential for pathology images. Front Med (Lausanne) 2025;11:1507203. PMID: 39917264; PMCID: PMC11798939; DOI: 10.3389/fmed.2024.1507203.
Abstract
Background: Chat Generative Pretrained Transformer (ChatGPT) is a type of large language model (LLM) developed by OpenAI, known for its extensive knowledge base and interactive capabilities. These attributes make it a valuable tool in the medical field, particularly for tasks such as answering medical questions, drafting clinical notes, and optimizing the generation of radiology reports. However, maintaining accuracy in medical contexts is the biggest challenge to employing GPT-4 in a clinical setting. This study aims to investigate the accuracy of GPT-4, which can process both text and image inputs, in generating diagnoses from pathological images.
Methods: This study analyzed 44 histopathological images from 16 organs and 100 colorectal biopsy photomicrographs. The initial evaluation was conducted using the standard GPT-4 model in January 2024, with a subsequent re-evaluation performed in July 2024. The diagnostic accuracy of GPT-4 was assessed by comparing its outputs to a reference standard using statistical measures. Additionally, four pathologists independently reviewed the same images to compare their diagnoses with the model's outputs. Both scanned and photographed images were tested to evaluate GPT-4's generalization ability across different image types.
Results: GPT-4 achieved an overall accuracy of 0.64 in identifying tumor imaging and tissue origins. For colon polyp classification, accuracy varied from 0.57 to 0.75 across subtypes. The model achieved 0.88 accuracy in distinguishing low-grade from high-grade dysplasia and 0.75 in distinguishing high-grade dysplasia from adenocarcinoma, with high sensitivity in detecting adenocarcinoma. Consistency between initial and follow-up evaluations showed slight to moderate agreement, with kappa values ranging from 0.204 to 0.375.
Conclusion: GPT-4 demonstrates the ability to diagnose pathological images, showing improved performance over earlier versions. Its diagnostic accuracy in cancer is comparable to that of pathology residents. These findings suggest that GPT-4 holds promise as a supportive tool in pathology diagnostics, offering the potential to assist pathologists in routine diagnostic workflows.
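The agreement statistic used for the repeat evaluations (Cohen's kappa between the January and July outputs) can be sketched as follows; the diagnosis labels are invented examples, not the study's data.

```python
# Minimal sketch of the consistency check described above: Cohen's kappa between
# the initial (January) and follow-up (July) diagnoses. The labels are invented.
from sklearn.metrics import cohen_kappa_score

initial = ["tubular adenoma", "hyperplastic polyp", "adenocarcinoma", "tubular adenoma",
           "sessile serrated lesion", "adenocarcinoma", "hyperplastic polyp", "tubular adenoma"]
followup = ["tubular adenoma", "tubular adenoma", "adenocarcinoma", "tubular adenoma",
            "hyperplastic polyp", "adenocarcinoma", "hyperplastic polyp", "adenocarcinoma"]

kappa = cohen_kappa_score(initial, followup)
print(f"Cohen's kappa between rounds: {kappa:.3f}")
```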
Affiliation(s)
- Liya Ding
- Department of Pathology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Lei Fan
- Department of Pathology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Department of Pathology, Ninghai County Traditional Chinese Medicine Hospital, Ningbo, China
- Miao Shen
- Department of Pathology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Department of Pathology, Deqing People's Hospital, Hangzhou, China
- Yawen Wang
- College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China
- Kaiqin Sheng
- Department of Pathology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Zijuan Zou
- Department of Pathology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Huimin An
- Department of Pathology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Zhinong Jiang
- Department of Pathology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
9
Tarris G, Martin L. Performance assessment of ChatGPT 4, ChatGPT 3.5, Gemini Advanced Pro 1.5 and Bard 2.0 to problem solving in pathology in French language. Digit Health 2025;11:20552076241310630. PMID: 39896270; PMCID: PMC11786284; DOI: 10.1177/20552076241310630.
Abstract
Digital teaching diversifies the ways knowledge can be assessed, as natural language processing (NLP) offers the possibility of answering questions posed by students and teachers.
Objective: This study evaluated the performance of ChatGPT, Bard, and Gemini on second-year medical studies (DFGSM2) pathology exams from the Health Sciences Center of Dijon (France) administered in 2018-2022.
Methods: From 2018 to 2022, exam scores, discriminating powers, and discordance rates were retrieved. Seventy questions (25 first-order single-response questions and 45 second-order multiple-response questions) were submitted in May 2023 to ChatGPT 3.5 and Bard 2.0, and in September 2024 to Gemini 1.5 and ChatGPT-4. Chatbots' and students' average scores were compared, as were the discriminating powers of the questions answered by the chatbots. The percentage of student-chatbot identical answers was retrieved, and linear regression analysis correlated the chatbots' scores with students' discordance rates. Chatbot reliability was assessed by submitting the questions in four successive rounds and comparing score variability using Fleiss' kappa and Cohen's kappa.
Results: Newer chatbots outperformed both students and older chatbots on overall scores and multiple-response questions. All chatbots outperformed students on less discriminating questions; conversely, all chatbots were outperformed by students on questions with a high discriminating power. Chatbots' scores were correlated with students' discordance rates. ChatGPT 4 and Gemini 1.5 provided variable answers, owing to effects linked to prompt engineering.
Conclusion: Our study, in line with the literature, confirms chatbots' moderate performance on questions requiring complex reasoning, with ChatGPT outperforming the Google chatbots. The use of NLP software based on distributional semantics remains a challenge for the generation of questions in French. Drawbacks to the use of NLP software for generating questions include hallucinations and erroneous medical knowledge, which have to be taken into account when using NLP software in medical education.
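The regression step described in the Methods (chatbot scores against student discordance rates) might look like the following sketch; the per-question values are illustrative assumptions only.

```python
# Hedged sketch of the regression described above: chatbot per-question scores
# against student discordance rates. The data points are illustrative only.
import numpy as np
from scipy.stats import linregress

# Hypothetical per-question values: fraction of students answering discordantly,
# and the chatbot's score on the same question (1 = correct, 0 = incorrect).
discordance_rate = np.array([0.10, 0.15, 0.22, 0.30, 0.35, 0.42, 0.55, 0.63, 0.70, 0.81])
chatbot_score = np.array([1.0, 1.0, 1.0, 1.0, 0.5, 1.0, 0.5, 0.0, 0.5, 0.0])

result = linregress(discordance_rate, chatbot_score)
print(f"slope = {result.slope:.2f}, r = {result.rvalue:.2f}, p = {result.pvalue:.3f}")
```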
Affiliation(s)
- Georges Tarris
- Department of Pathology, University Hospital François Mitterrand of Dijon–Bourgogne, Dijon, France
- University of Burgundy Health Sciences Center, Dijon, France
- Laurent Martin
- Department of Pathology, University Hospital François Mitterrand of Dijon–Bourgogne, Dijon, France
- University of Burgundy Health Sciences Center, Dijon, France
10
Apornvirat S, Thinpanja W, Damrongkiet K, Benjakul N, Laohawetwanit T. Comparing customized ChatGPT and pathology residents in histopathologic description and diagnosis of common diseases. Ann Diagn Pathol 2024;73:152359. PMID: 38972166; DOI: 10.1016/j.anndiagpath.2024.152359.
Abstract
This study aimed to evaluate and analyze the performance of a customized Chat Generative Pre-Trained Transformer (ChatGPT), known as GPT, against pathology residents in providing microscopic descriptions and diagnosing diseases from histopathological images. A dataset of representative photomicrographs from 70 diseases across 14 organ systems was analyzed by a customized version of ChatGPT-4 (GPT-4) and by pathology residents. Two pathologists independently evaluated the microscopic descriptions and diagnoses using a predefined scoring system (0-4 for microscopic descriptions and 0-2 for pathological diagnoses), with higher scores indicating greater accuracy. Microscopic descriptions that received perfect scores, which included all relevant keywords and findings, were then presented to the standard version of ChatGPT to assess its diagnostic capabilities based on these descriptions. GPT-4 showed consistency in microscopic description and diagnosis scores across five rounds, achieving median scores of 50% and 48.6%, respectively. However, its performance was still inferior to that of junior and senior pathology residents (description scores of 73.9% and 93.9%, and diagnosis scores of 63.9% and 87.9%, respectively). When the standard version of ChatGPT was given the microscopic descriptions provided by residents, it correctly diagnosed 35 (87.5%) of the cases described by junior residents and 44 (68.8%) of those described by senior residents, provided that the initial descriptions contained the relevant keywords and findings. While GPT-4 can accurately interpret some histopathological images, its overall performance is currently inferior to that of pathology residents. However, ChatGPT's ability to accurately interpret and diagnose diseases from the descriptions provided by residents suggests that this technology could serve as a valuable support tool in pathology diagnostics.
Affiliation(s)
- Sompon Apornvirat
- Division of Pathology, Chulabhorn International College of Medicine, Thammasat University, Pathum Thani, Thailand; Division of Pathology, Thammasat University Hospital, Pathum Thani, Thailand
- Warut Thinpanja
- Division of Pathology, Thammasat University Hospital, Pathum Thani, Thailand
- Khampee Damrongkiet
- Department of Pathology, King Chulalongkorn Memorial Hospital, Bangkok, Thailand; Department of Anatomical Pathology, Faculty of Medicine Vajira Hospital, Navamindradhiraj University, Bangkok, Thailand
- Nontawat Benjakul
- Department of Anatomical Pathology, Faculty of Medicine Vajira Hospital, Navamindradhiraj University, Bangkok, Thailand; Vajira Pathology-Clinical-Correlation Target Research Interest Group, Faculty of Medicine Vajira Hospital, Navamindradhiraj University, Bangkok, Thailand
- Thiyaphat Laohawetwanit
- Division of Pathology, Chulabhorn International College of Medicine, Thammasat University, Pathum Thani, Thailand; Division of Pathology, Thammasat University Hospital, Pathum Thani, Thailand
11
Du W, Jin X, Harris JC, Brunetti A, Johnson E, Leung O, Li X, Walle S, Yu Q, Zhou X, Bian F, McKenzie K, Kanathanavanich M, Ozcelik Y, El-Sharkawy F, Koga S. Large language models in pathology: A comparative study of ChatGPT and Bard with pathology trainees on multiple-choice questions. Ann Diagn Pathol 2024;73:152392. PMID: 39515029; DOI: 10.1016/j.anndiagpath.2024.152392.
Abstract
Large language models (LLMs), such as ChatGPT and Bard, have shown potential in various medical applications. This study aimed to evaluate the performance of LLMs, specifically ChatGPT and Bard, in pathology by comparing their performance with that of pathology trainees, and to assess the consistency of their responses. We selected 150 multiple-choice questions from 15 subspecialties, excluding those with images. Both ChatGPT and Bard were tested on these questions across three separate sessions between June 2023 and January 2024, and their responses were compared with those of 16 pathology trainees (8 junior and 8 senior) from two hospitals. Questions were categorized into easy, intermediate, and difficult based on trainee performance. Consistency and variability in LLM responses were analyzed across the three evaluation sessions. ChatGPT significantly outperformed Bard and the trainees, achieving an average total score of 82.2% compared with Bard's 49.5%, junior trainees' 45.1%, and senior trainees' 56.0%. ChatGPT's performance was notably stronger on difficult questions (63.4%-68.3%) compared with Bard (31.7%-34.1%) and trainees (4.9%-48.8%). For easy questions, ChatGPT (83.1%-91.5%) and trainees (73.7%-100.0%) showed similarly high scores. Consistency analysis revealed that ChatGPT showed a high consistency rate of 80%-85% across the three tests, whereas Bard exhibited greater variability with consistency rates of 54%-61%. While LLMs show significant promise in pathology education and practice, continued development and human oversight are crucial for reliable clinical application.
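One simple way to compute the session-to-session consistency rate reported here is sketched below; the answer letters are invented, and the study's own scoring pipeline may differ.

```python
# Illustrative sketch of one way to compute the session-to-session consistency
# rate reported above: the share of questions answered identically in all three
# sessions. The answer letters are invented.
def consistency_rate(sessions: list[list[str]]) -> float:
    """sessions: equal-length answer lists, one per evaluation session."""
    consistent = sum(1 for answers in zip(*sessions) if len(set(answers)) == 1)
    return consistent / len(sessions[0])

runs = [
    ["A", "C", "B", "D", "A", "B", "C", "A", "D", "B"],  # session 1
    ["A", "C", "B", "D", "A", "B", "C", "A", "C", "B"],  # session 2
    ["A", "C", "B", "D", "A", "A", "C", "A", "D", "B"],  # session 3
]
print(f"Consistency rate: {consistency_rate(runs):.0%}")  # 80% with these toy answers
```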
Affiliation(s)
- Wei Du
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
- Xueting Jin
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
- Jaryse Carol Harris
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
- Alessandro Brunetti
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
- Erika Johnson
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
- Olivia Leung
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
- Xingchen Li
- Department of Pathology and Laboratory Medicine, Pennsylvania Hospital, Philadelphia, PA, United States of America
- Selemon Walle
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
- Qing Yu
- Department of Pathology and Laboratory Medicine, Pennsylvania Hospital, Philadelphia, PA, United States of America
- Xiao Zhou
- Department of Pathology and Laboratory Medicine, Pennsylvania Hospital, Philadelphia, PA, United States of America
- Fang Bian
- Department of Pathology and Laboratory Medicine, Pennsylvania Hospital, Philadelphia, PA, United States of America
- Kajanna McKenzie
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
- Manita Kanathanavanich
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
- Yusuf Ozcelik
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
- Farah El-Sharkawy
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
- Shunsuke Koga
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
12
Apornvirat S, Namboonlue C, Laohawetwanit T. Comparative analysis of ChatGPT and Bard in answering pathology examination questions requiring image interpretation. Am J Clin Pathol 2024;162:252-260. PMID: 38619043; DOI: 10.1093/ajcp/aqae036.
Abstract
Objectives: To evaluate the accuracy of ChatGPT and Bard in answering pathology examination questions requiring image interpretation.
Methods: The study evaluated ChatGPT-4 and Bard's performance using 86 multiple-choice questions, with 17 (19.8%) focusing on general pathology and 69 (80.2%) on systemic pathology. Of these, 62 (72.1%) included microscopic images, and 57 (66.3%) were first-order questions focusing on diagnosing the disease. The authors presented these artificial intelligence (AI) tools with questions, both with and without clinical contexts, and assessed their answers against a reference standard set by pathologists.
Results: ChatGPT-4 achieved a 100% (n = 86) accuracy rate in questions with clinical context, surpassing Bard's 87.2% (n = 75). Without context, the accuracy of both AI tools declined significantly, with ChatGPT-4 at 52.3% (n = 45) and Bard at 38.4% (n = 33). ChatGPT-4 consistently outperformed Bard across various categories, particularly in systemic pathology and first-order questions. A notable issue identified was Bard's tendency to "hallucinate" or provide plausible but incorrect answers, especially without clinical context.
Conclusions: This study demonstrated the potential of ChatGPT and Bard in pathology education, stressing the importance of clinical context for accurate AI interpretations of pathology images. It underlined the need for careful AI integration in medical education.
Affiliation(s)
- Sompon Apornvirat
- Division of Pathology, Chulabhorn International College of Medicine, Thammasat University, Pathum Thani, Thailand
- Division of Pathology, Thammasat University Hospital, Pathum Thani, Thailand
- Thiyaphat Laohawetwanit
- Division of Pathology, Chulabhorn International College of Medicine, Thammasat University, Pathum Thani, Thailand
- Division of Pathology, Thammasat University Hospital, Pathum Thani, Thailand
13
Paul S, Govindaraj S, Jk J. ChatGPT Versus National Eligibility cum Entrance Test for Postgraduate (NEET PG). Cureus 2024;16:e63048. PMID: 39050297; PMCID: PMC11268980; DOI: 10.7759/cureus.63048.
Abstract
Introduction: With both suspicion and excitement, artificial intelligence tools are being integrated into nearly every aspect of human existence, including the medical sciences and medical education. The newest large language model (LLM) in the class of autoregressive language models is ChatGPT. While ChatGPT's potential to revolutionize clinical practice and medical education is under investigation, further research is necessary to understand its strengths and limitations in this field comprehensively.
Methods: Two hundred National Eligibility cum Entrance Test for Postgraduate (NEET-PG) 2023 questions were gathered from various public education websites and individually entered into Microsoft Bing (GPT-4 Version 2.2.1). The Microsoft Bing chatbot is currently the only platform incorporating all of GPT-4's multimodal features, including image recognition. The results were subsequently analyzed.
Results: Out of 200 questions, ChatGPT-4 answered 129 (64.5%) correctly. The most tested specialties were medicine (15%), obstetrics and gynecology (15%), general surgery (14%), and pathology (10%).
Conclusion: This study sheds light on how well GPT-4 performs on the NEET-PG entrance test. ChatGPT has potential as an adjunctive instrument within medical education and clinical settings. Its capacity to respond intelligently and accurately in complicated clinical settings demonstrates its versatility.
Affiliation(s)
- Sam Paul
- General Surgery, St John's Medical College Hospital, Bengaluru, IND
- Sridar Govindaraj
- Surgical Gastroenterology and Laparoscopy, St John's Medical College Hospital, Bengaluru, IND
- Jerisha Jk
- Pediatrics and Neonatology, Christian Medical College Ludhiana, Ludhiana, IND
14
Miao J, Thongprayoon C, Fülöp T, Cheungpasitporn W. Enhancing clinical decision-making: Optimizing ChatGPT's performance in hypertension care. J Clin Hypertens (Greenwich) 2024;26:588-593. PMID: 38646920; PMCID: PMC11088425; DOI: 10.1111/jch.14822.
Affiliation(s)
- Jing Miao
- Division of Nephrology, Department of Medicine, Mayo Clinic, Rochester, Minnesota, USA
- Charat Thongprayoon
- Division of Nephrology, Department of Medicine, Mayo Clinic, Rochester, Minnesota, USA
- Tibor Fülöp
- Division of Nephrology, Department of Medicine, Medical University of South Carolina, Charleston, South Carolina, USA
- Medicine Service, Ralph H. Johnson VA Medical Center, Charleston, South Carolina, USA
15
Cheng J. Applications of Large Language Models in Pathology. Bioengineering (Basel) 2024;11:342. PMID: 38671764; PMCID: PMC11047860; DOI: 10.3390/bioengineering11040342.
Abstract
Large language models (LLMs) are transformer-based neural networks that can provide human-like responses to questions and instructions. LLMs can generate educational material, summarize text, extract structured data from free text, create reports, write programs, and potentially assist in case sign-out. LLMs combined with vision models can assist in interpreting histopathology images. LLMs have immense potential in transforming pathology practice and education, but these models are not infallible, so any artificial intelligence generated content must be verified with reputable sources. Caution must be exercised on how these models are integrated into clinical practice, as these models can produce hallucinations and incorrect results, and an over-reliance on artificial intelligence may lead to de-skilling and automation bias. This review paper provides a brief history of LLMs and highlights several use cases for LLMs in the field of pathology.
Affiliation(s)
- Jerome Cheng
- Department of Pathology, University of Michigan, Ann Arbor, MI 48105, USA
16
Miao J, Thongprayoon C, Suppadungsuk S, Garcia Valencia OA, Cheungpasitporn W. Integrating Retrieval-Augmented Generation with Large Language Models in Nephrology: Advancing Practical Applications. Medicina (Kaunas) 2024;60:445. PMID: 38541171; PMCID: PMC10972059; DOI: 10.3390/medicina60030445.
Abstract
The integration of large language models (LLMs) into healthcare, particularly in nephrology, represents a significant advancement in applying advanced technology to patient care, medical research, and education. These advanced models have progressed from simple text processors to tools capable of deep language understanding, offering innovative ways to handle health-related data, thus improving medical practice efficiency and effectiveness. A significant challenge in medical applications of LLMs is their imperfect accuracy and/or tendency to produce hallucinations, that is, outputs that are factually incorrect or irrelevant. This issue is particularly critical in healthcare, where precision is essential, as inaccuracies can undermine the reliability of these models in crucial decision-making processes. To overcome these challenges, various strategies have been developed. One such strategy is prompt engineering, such as the chain-of-thought approach, which directs LLMs towards more accurate responses by breaking down the problem into intermediate steps or reasoning sequences. Another is the retrieval-augmented generation (RAG) strategy, which helps address hallucinations by integrating external data, enhancing output accuracy and relevance. Hence, RAG is favored for tasks requiring up-to-date, comprehensive information, such as clinical decision making or educational applications. In this article, we showcase the creation of a specialized ChatGPT model integrated with a RAG system, tailored to align with the KDIGO 2023 guidelines for chronic kidney disease. This example demonstrates its potential in providing specialized, accurate medical advice, marking a step towards more reliable and efficient nephrology practices.
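The retrieval-augmented generation pattern described here, retrieving relevant guideline passages and prepending them to the prompt, can be sketched minimally as follows; the passages, the TF-IDF retriever, and the answer_with_llm stub are illustrative assumptions and do not reproduce the paper's KDIGO-linked ChatGPT configuration.

```python
# Minimal sketch of the RAG pattern described above: retrieve the most relevant
# guideline passages for a question and prepend them to the prompt. The passages,
# the TF-IDF retriever, and answer_with_llm() are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [
    "CKD is classified by cause, GFR category, and albuminuria category.",
    "Blood pressure targets in CKD should be individualized.",
    "Refer to nephrology when eGFR falls below 30 mL/min/1.73 m2.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k passages most similar to the question by TF-IDF cosine similarity."""
    vectorizer = TfidfVectorizer().fit(passages + [question])
    vectors = vectorizer.transform(passages + [question])
    sims = cosine_similarity(vectors[len(passages)], vectors[:len(passages)]).ravel()
    return [passages[i] for i in sims.argsort()[::-1][:k]]

def build_prompt(question: str) -> str:
    """Assemble a context-augmented prompt from the retrieved passages."""
    context = "\n".join(retrieve(question))
    return f"Use only the context below to answer.\n\nContext:\n{context}\n\nQuestion: {question}"

def answer_with_llm(prompt: str) -> str:
    """Placeholder for the call to the deployed language model."""
    raise NotImplementedError

print(build_prompt("When should a patient with declining eGFR be referred to nephrology?"))
```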
Affiliation(s)
- Jing Miao
- Division of Nephrology and Hypertension, Department of Medicine, Mayo Clinic, Rochester, MN 55905, USA; (J.M.); (C.T.); (S.S.); (O.A.G.V.)
- Charat Thongprayoon
- Division of Nephrology and Hypertension, Department of Medicine, Mayo Clinic, Rochester, MN 55905, USA
- Supawadee Suppadungsuk
- Division of Nephrology and Hypertension, Department of Medicine, Mayo Clinic, Rochester, MN 55905, USA
- Chakri Naruebodindra Medical Institute, Faculty of Medicine Ramathibodi Hospital, Mahidol University, Samut Prakan 10540, Thailand
- Oscar A. Garcia Valencia
- Division of Nephrology and Hypertension, Department of Medicine, Mayo Clinic, Rochester, MN 55905, USA
- Wisit Cheungpasitporn
- Division of Nephrology and Hypertension, Department of Medicine, Mayo Clinic, Rochester, MN 55905, USA