1. Lafourcade C, Kérourédan O, Ballester B, Richert R. Accuracy, consistency, and contextual understanding of large language models in restorative dentistry and endodontics. J Dent 2025;157:105764. PMID: 40246058. DOI: 10.1016/j.jdent.2025.105764.
Abstract
OBJECTIVE This study aimed to evaluate and compare the performance of several large language models (LLMs) in the context of restorative dentistry and endodontics, focusing on their accuracy, consistency, and contextual understanding. METHODS The dataset was extracted from the national educational archives of the Collège National des Enseignants en Odontologie Conservatrice (CNEOC) and includes all chapters from the reference manual for dental residency applicants. Multiple-choice questions (MCQs) were selected following a review by three independent academic experts. Four LLMs were assessed: ChatGPT-3.5, ChatGPT-4 (OpenAI), Claude-3 (Anthropic), and Mistral 7B (Mistral AI). Model accuracy was determined by comparing responses with expert-provided answers. Consistency was measured through robustness (the ability to provide identical responses to paraphrased questions) and repeatability (the ability to provide identical responses to the same question). Contextual understanding was evaluated based on the model's ability to categorise questions correctly and infer terms from definitions. Additionally, accuracy was reassessed after providing the LLMs with the relevant full course chapter. RESULTS A total of 517 MCQs and 539 definitions were included. ChatGPT-4 and Claude-3 demonstrated significantly higher accuracy and repeatability than Mistral 7B, with ChatGPT-4 showing the greater robustness. Advanced LLMs displayed high accuracy in presenting dental content, although performance varied on closely related concepts. Supplying course chapters generally improved response accuracy, though inconsistently across topics. CONCLUSION Even the most advanced LLMs, such as ChatGPT-4 and Claude 3, achieve moderate performance and require cautious use due to inconsistencies in robustness. Future studies should focus on integrating validated content and refining prompt engineering to enhance the educational and clinical utility of LLMs. CLINICAL SIGNIFICANCE The findings underscore the potential of advanced LLMs and context-based prompting in restorative dentistry and endodontics.
Affiliation(s)
- Claire Lafourcade
- UFR des Sciences Odontologiques, Université de Bordeaux, Bordeaux, France; CHU de Bordeaux, Pôle de Médecine et Chirurgie bucco-dentaire, Bordeaux, France
- Olivia Kérourédan
- UFR des Sciences Odontologiques, Université de Bordeaux, Bordeaux, France; CHU de Bordeaux, Pôle de Médecine et Chirurgie bucco-dentaire, Bordeaux, France; UMR 1026 BioTis INSERM, Université de Bordeaux, Bordeaux, France
- Benoit Ballester
- Assistance Publique Des Hôpitaux de Marseille, Marseille, France; Aix Marseille Univ, Inserm, IRD, SESSTIM, Sciences Economiques & Sociales de la Santé & Traitement de l'Information Médicale, ISSPAM, Marseille, France
- Raphael Richert
- Faculté d'Odontologie Université Lyon 1, Lyon, France; INSA Lyon, CNRS, LaMCoS, UMR5259, Villeurbanne, France; Hospices Civils de Lyon, PAM Odontologie, Lyon, France.
2. Özbay Y, Erdoğan D, Dinçer GA. Evaluation of the performance of large language models in clinical decision-making in endodontics. BMC Oral Health 2025;25:648. PMID: 40296000. PMCID: PMC12039063. DOI: 10.1186/s12903-025-06050-x.
Abstract
BACKGROUND Artificial intelligence (AI) chatbots are excellent at generating language. The growing use of generative AI large language models (LLMs) in healthcare and dentistry, including endodontics, raises questions about their accuracy. The potential of LLMs to assist clinicians' decision-making processes in endodontics is worth evaluating. This study aims to comparatively evaluate the answers provided by Google Bard, ChatGPT-3.5, and ChatGPT-4 to clinically relevant questions from the field of endodontics. METHODS 40 open-ended questions covering different areas of endodontics were prepared and introduced to Google Bard, ChatGPT-3.5, and ChatGPT-4. The validity of the questions was evaluated using the Lawshe Content Validity Index. Two experienced endodontists, blinded to the chatbots, evaluated the answers using a 3-point Likert scale. All responses deemed to contain factually wrong information were noted, and a misinformation rate for each LLM was calculated (number of answers containing wrong information/total number of questions). One-way analysis of variance and post hoc Tukey tests were used to analyze the data, with significance set at p < 0.05. RESULTS ChatGPT-4 achieved the highest score and the lowest misinformation rate (P = 0.008), followed by ChatGPT-3.5 and Google Bard, respectively. The difference between ChatGPT-4 and Google Bard was statistically significant (P = 0.004). CONCLUSION ChatGPT-4 provided more accurate and informative responses in endodontics. However, all LLMs produced varying levels of incomplete or incorrect answers.
Affiliation(s)
- Yağız Özbay
- Department of Endodontics, Faculty of Dentistry, Karabük University, Karabük, Türkiye.
- Gözde Akbal Dinçer
- Department of Endodontics, Faculty of Dentistry, Okan University, İstanbul, Türkiye
3. Suárez A, Arena S, Herranz Calzada A, Castillo Varón AI, Diaz-Flores García V, Freire Y. Decoding wisdom: Evaluating ChatGPT's accuracy and reproducibility in analyzing orthopantomographic images for third molar assessment. Comput Struct Biotechnol J 2025;28:141-147. PMID: 40271108. PMCID: PMC12017887. DOI: 10.1016/j.csbj.2025.04.010.
Abstract
The integration of Artificial Intelligence (AI) into healthcare has opened new avenues for clinical decision support, particularly in radiology. The aim of this study was to evaluate the accuracy and reproducibility of ChatGPT-4o in the radiographic image interpretation of orthopantomograms (OPGs) for assessment of lower third molars, simulating real patient requests for tooth extraction. Thirty OPGs were analyzed, each paired with a standardized prompt submitted to ChatGPT-4o, generating 900 responses (30 per radiograph). Two oral surgery experts independently evaluated the responses using a three-point Likert scale (correct, partially correct/incomplete, incorrect), with disagreements resolved by a third expert. ChatGPT-4o achieved an accuracy rate of 38.44 % (95 % CI: 35.27 %-41.62 %). The percentage agreement among repeated responses was 82.7 %, indicating high consistency, though Gwet's coefficient of agreement (60.4 %) suggested only moderate repeatability. While the model correctly identified general features in some cases, it frequently provided incomplete or fabricated information, particularly in complex radiographs involving overlapping structures or underdeveloped roots. These findings highlight ChatGPT-4o's current limitations in dental radiographic interpretation. Although it demonstrated some capability in analyzing OPGs, its accuracy and reliability remain insufficient for unsupervised clinical use. Professional oversight is essential to prevent diagnostic errors. Further refinement and specialized training of AI models are needed to enhance their performance and ensure safe integration into dental practice, especially in patient-facing applications.
Affiliation(s)
- Ana Suárez
- Department of Pre-Clinic Dentistry II, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, Madrid 28670, Spain
- Stefania Arena
- Department of Pre-Clinic Dentistry II, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, Madrid 28670, Spain
- Alberto Herranz Calzada
- Department of Pre-Clinic Dentistry II, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, Madrid 28670, Spain
- Department of Pre-Clinic Dentistry I, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, Madrid 28670, Spain
- Ana Isabel Castillo Varón
- Department of Medicine. Faculty of Medicine, Health and Sports. Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, Madrid 28670, Spain
- Victor Diaz-Flores García
- Department of Pre-Clinic Dentistry I, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, Madrid 28670, Spain
- Yolanda Freire
- Department of Pre-Clinic Dentistry II, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, Madrid 28670, Spain
4. Tuygunov N, Samaranayake L, Khurshid Z, Rewthamrongsris P, Schwendicke F, Osathanon T, Yahya NA. The Transformative Role of Artificial Intelligence in Dentistry: A Comprehensive Overview Part 2: The Promise and Perils, and the International Dental Federation Communique. Int Dent J 2025;75:397-404. PMID: 40011130. PMCID: PMC11976557. DOI: 10.1016/j.identj.2025.02.006.
Abstract
In this final part of a two-part article on artificial intelligence (AI) in dentistry, we review its transformative role, focusing on AI in dental education, patient communications, challenges of integration, strategies to overcome barriers, ethical considerations, and finally, the recently released International Dental Federation (FDI) Communique (white paper) on AI in Dentistry. AI in dental education is highlighted for its potential in enhancing theoretical and practical dimensions, including patient telemonitoring and virtual training ecosystems. Challenges of AI integration in dentistry are outlined, such as data availability, bias, and human accountability. Strategies to overcome these challenges include promoting AI literacy, establishing regulations, and focusing on specific AI implementations. Ethical considerations in AI integration within dentistry, such as patient privacy and algorithm bias, are emphasized. The need for clear guidelines and ongoing evaluation of AI systems is crucial. The FDI White Paper on AI in Dentistry provides insights into the significance of AI in oral care, dental education, and research, along with standards for governance. It discusses AI's impact on individual patients, community health, dental education, and research. The paper addresses biases, limited generalizability, accessibility, and regulatory requirements for AI in dental practice. In conclusion, AI plays a significant role in modern dental care, offering benefits in diagnosis, treatment planning, and decision-making. While facing challenges, strategic initiatives focusing on AI literacy, regulations, and targeted implementations can help overcome barriers and maximize the potential of AI in dentistry. Ethical considerations and ongoing evaluation are essential for ensuring responsible, effective, and efficacious deployment of AI technologies in the dental ecosystem.
Affiliation(s)
- Nozimjon Tuygunov
- Faculty of Dentistry, Kimyo International University in Tashkent, Tashkent, Uzbekistan.
- Lakshman Samaranayake
- Faculty of Dentistry, University of Hong Kong, Sai Ying Pun, Hong Kong; Dr DY Patil Dental College and Hospital, Dr DY Patil Vidyapeeth, Pimpri, Pune, India; Center of Excellence for Dental Stem Cell Biology, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand.
- Zohaib Khurshid
- Department of Prosthodontics and Dental Implantology, College of Dentistry, King Faisal University, Al-Ahsa, Saudi Arabia; Department of Anatomy, Faculty of Dentistry, Center of Excellence for Regenerative Dentistry, Chulalongkorn University, Bangkok, Thailand
- Paak Rewthamrongsris
- Center of Excellence for Dental Stem Cell Biology, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand
- Falk Schwendicke
- Clinic for Conservative Dentistry and Periodontology, University Hospital of the Ludwig-Maximilians-University Munich, Munich, Germany
- Thanaphum Osathanon
- Center of Excellence for Dental Stem Cell Biology, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand
- Noor Azlin Yahya
- Clinic for Conservative Dentistry and Periodontology, University Hospital of the Ludwig-Maximilians-University Munich, Munich, Germany; Department of Restorative Dentistry, Faculty of Dentistry, Universiti Malaya, Kuala Lumpur, Malaysia
5. Şişman AÇ, Acar AH. Artificial intelligence-based chatbot assistance in clinical decision-making for medically complex patients in oral surgery: a comparative study. BMC Oral Health 2025;25:351. PMID: 40055745. PMCID: PMC11887094. DOI: 10.1186/s12903-025-05732-w.
Abstract
AIM This study aims to evaluate the potential of AI-based chatbots in assisting with clinical decision-making in the management of medically complex patients in oral surgery. MATERIALS AND METHODS A team of oral and maxillofacial surgeons developed a pool of open-ended questions de novo. The validity of the questions was assessed using Lawshe's Content Validity Index. The questions, which focused on systemic diseases and common conditions that may raise concerns during oral surgery, were presented to ChatGPT 3.5 and Claude-instant in two separate sessions, spaced one week apart. Two experienced maxillofacial surgeons, blinded to the chatbots, assessed the responses for quality, accuracy, and completeness using a modified DISCERN tool and Likert scale. Intraclass correlation, Mann-Whitney U test, skewness, and kurtosis coefficients were employed to compare the performances of the chatbots. RESULTS Most responses were high quality: 86% and 79.6% for ChatGPT, and 81.25% and 89% for Claude-instant in sessions 1 and 2, respectively. In terms of accuracy, ChatGPT had 92% and 93.4% of its responses rated as completely correct in sessions 1 and 2, respectively, while Claude-instant had 95.2% and 89%. For completeness, ChatGPT had 88.5% and 86.8% of its responses rated as adequate or comprehensive in sessions 1 and 2, respectively, while Claude-instant had 95.2% and 86%. CONCLUSION Ongoing software developments and the increasing acceptance of chatbots among healthcare professionals hold promise that these tools can provide rapid solutions to the high demand for medical care, ease professionals' workload, reduce costs, and save time.
Affiliation(s)
- Alanur Çiftçi Şişman
- Hamidiye Faculty of Dental Medicine, Department of Oral and Maxillofacial Surgery, University of Health Sciences, Istanbul, Türkiye.
- Ahmet Hüseyin Acar
- Faculty of Dentistry, Department of Oral and Maxillofacial Surgery, Istanbul Medeniyet University, Istanbul, Türkiye
6. Danesh A, Danesh A, Danesh F. Innovating dental diagnostics: ChatGPT's accuracy on diagnostic challenges. Oral Dis 2025;31:911-917. PMID: 39039720. DOI: 10.1111/odi.15082.
Abstract
INTRODUCTION Complex patient diagnoses in dentistry require a multifaceted approach which combines interpretations of clinical observations with an in-depth understanding of patient history and presenting problems. The present study aims to elucidate the implications of ChatGPT (OpenAI) as a comprehensive diagnostic tool in the dental clinic through examining the chatbot's diagnostic performance on challenging patient cases retrieved from the literature. METHODS Our study subjected ChatGPT3.5 and ChatGPT4 to descriptions of patient cases for diagnostic challenges retrieved from the literature. Sample means were compared using a two-tailed t-test, while sample proportions were compared using a two-tailed χ2 test. A p-value below the threshold of 0.05 was deemed statistically significant. RESULTS When prompted to generate their own differential diagnoses, ChatGPT3.5 and ChatGPT4 achieved a diagnostic accuracy of 40% and 62%, respectively. When basing their diagnostic processes on a differential diagnosis retrieved from the literature, ChatGPT3.5 and ChatGPT4 achieved a diagnostic accuracy of 70% and 80%, respectively. CONCLUSION ChatGPT displays an impressive capacity to correctly diagnose complex diagnostic challenges in the field of dentistry. Our study paints a promising picture of the chatbot's potential to one day serve as a comprehensive diagnostic tool in the dental clinic.
Affiliation(s)
- Arman Danesh
- Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada
- Arsalan Danesh
- Faculty of Dentistry, University of British Columbia, Vancouver, British Columbia, Canada
- Farzad Danesh
- Elgin Mills Endodontic Specialists, Richmond Hill, Ontario, Canada
7. Eraslan R, Ayata M, Yagci F, Albayrak H. Exploring the potential of artificial intelligence chatbots in prosthodontics education. BMC Med Educ 2025;25:321. PMID: 40016760. PMCID: PMC11869545. DOI: 10.1186/s12909-025-06849-w.
Abstract
BACKGROUND The purpose of this study was to evaluate the performance of widely used artificial intelligence (AI) chatbots in answering prosthodontics questions from the Dentistry Specialization Residency Examination (DSRE). METHODS A total of 126 DSRE prosthodontics questions were divided into seven subtopics (dental morphology, materials science, fixed dentures, removable partial dentures, complete dentures, occlusion/temporomandibular joint, and dental implantology). Questions were translated into English by the authors, and this version of the questions was presented to five chatbots (ChatGPT-3.5, Gemini Advanced, Claude Pro, Microsoft Copilot, and Perplexity) within a 7-day period. Statistical analyses, including chi-square and z-tests, were performed to compare accuracy rates across the chatbots and subtopics at a significance level of 0.05. RESULTS The overall accuracy rates for the chatbots were as follows: Copilot (73%), Gemini (63.5%), ChatGPT-3.5 (61.1%), Claude Pro (57.9%), and Perplexity (54.8%). Copilot significantly outperformed Perplexity (P = 0.035). However, no significant differences in accuracy were found across subtopics among chatbots. Questions on dental implantology had the highest accuracy rate (75%), while questions on removable partial dentures had the lowest (50.8%). CONCLUSION Copilot showed the highest accuracy rate (73%), significantly outperforming Perplexity (54.8%). AI models demonstrate potential as educational support tools but currently face limitations in serving as reliable educational tools across all areas of prosthodontics. Future advancements in AI may lead to better integration and more effective use in dental education.
Affiliation(s)
- Ravza Eraslan
- Department of Prosthodontics, Faculty of Dentistry, Erciyes University, Kayseri, Türkiye
- Mustafa Ayata
- Private Practice, Ortoperio Oral and Dental Health Polyclinic, Kayseri, Türkiye.
- Filiz Yagci
- Department of Prosthodontics, Faculty of Dentistry, Erciyes University, Kayseri, Türkiye
- Haydar Albayrak
- Department of Prosthodontics, Faculty of Dentistry, Erciyes University, Kayseri, Türkiye
8. On SW, Cho SW, Park SY, Ha JW, Yi SM, Park IY, Byun SH, Yang BE. Chat Generative Pre-Trained Transformer (ChatGPT) in Oral and Maxillofacial Surgery: A Narrative Review on Its Research Applications and Limitations. J Clin Med 2025;14:1363. PMID: 40004892. PMCID: PMC11856154. DOI: 10.3390/jcm14041363.
Abstract
Objectives: This review aimed to evaluate the role of ChatGPT in original research articles within the field of oral and maxillofacial surgery (OMS), focusing on its applications, limitations, and future directions. Methods: A literature search was conducted in PubMed using predefined search terms and Boolean operators to identify original research articles utilizing ChatGPT published up to October 2024. The selection process involved screening studies based on their relevance to OMS and ChatGPT applications, with 26 articles meeting the final inclusion criteria. Results: ChatGPT has been applied in various OMS-related domains, including clinical decision support in real and virtual scenarios, patient and practitioner education, scientific writing and referencing, and its ability to answer licensing exam questions. As a clinical decision support tool, ChatGPT demonstrated moderate accuracy (approximately 70-80%). It showed moderate to high accuracy (up to 90%) in providing patient guidance and information. However, its reliability remains inconsistent across different applications, necessitating further evaluation. Conclusions: While ChatGPT presents potential benefits in OMS, particularly in supporting clinical decisions and improving access to medical information, it should not be regarded as a substitute for clinicians and must be used as an adjunct tool. Further validation studies and technological refinements are required to enhance its reliability and effectiveness in clinical and research settings.
Affiliation(s)
- Sung-Woon On
- Division of Oral and Maxillofacial Surgery, Department of Dentistry, Dongtan Sacred Heart Hospital, Hallym University College of Medicine, Hwaseong 18450, Republic of Korea; (S.-W.O.); (J.-W.H.)
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
- Seoung-Won Cho
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
- Sang-Yoon Park
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
- Department of Oral and Maxillofacial Surgery, Hallym University Sacred Heart Hospital, Anyang 14066, Republic of Korea
- Dental Artificial Intelligence and Robotics R&D Center, Hallym University Medical Center, Anyang 14066, Republic of Korea
- Ji-Won Ha
- Division of Oral and Maxillofacial Surgery, Department of Dentistry, Dongtan Sacred Heart Hospital, Hallym University College of Medicine, Hwaseong 18450, Republic of Korea; (S.-W.O.); (J.-W.H.)
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
- Sang-Min Yi
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
- Department of Oral and Maxillofacial Surgery, Hallym University Sacred Heart Hospital, Anyang 14066, Republic of Korea
- Dental Artificial Intelligence and Robotics R&D Center, Hallym University Medical Center, Anyang 14066, Republic of Korea
- In-Young Park
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
- Dental Artificial Intelligence and Robotics R&D Center, Hallym University Medical Center, Anyang 14066, Republic of Korea
- Department of Orthodontics, Hallym University Sacred Heart Hospital, Anyang 14066, Republic of Korea
- Soo-Hwan Byun
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
- Department of Oral and Maxillofacial Surgery, Hallym University Sacred Heart Hospital, Anyang 14066, Republic of Korea
- Dental Artificial Intelligence and Robotics R&D Center, Hallym University Medical Center, Anyang 14066, Republic of Korea
- Byoung-Eun Yang
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
- Department of Oral and Maxillofacial Surgery, Hallym University Sacred Heart Hospital, Anyang 14066, Republic of Korea
- Dental Artificial Intelligence and Robotics R&D Center, Hallym University Medical Center, Anyang 14066, Republic of Korea
9. Zhang Q, Wu Z, Song J, Luo S, Chai Z. Comprehensiveness of Large Language Models in Patient Queries on Gingival and Endodontic Health. Int Dent J 2025;75:151-157. PMID: 39147663. PMCID: PMC11806297. DOI: 10.1016/j.identj.2024.06.022.
Abstract
AIM Given the increasing interest in using large language models (LLMs) for self-diagnosis, this study aimed to evaluate the comprehensiveness of two prominent LLMs, ChatGPT-3.5 and ChatGPT-4, in addressing common queries related to gingival and endodontic health across different language contexts and query types. METHODS We assembled a set of 33 common real-life questions related to gingival and endodontic healthcare, including 17 common-sense questions and 16 expert questions. Each question was presented to the LLMs in both English and Chinese. Three specialists were invited to evaluate the comprehensiveness of the responses on a five-point Likert scale, where a higher score indicated greater quality responses. RESULTS LLMs performed significantly better in English, with an average score of 4.53, compared to 3.95 in Chinese (Mann-Whitney U test, P < .05). Responses to common sense questions received higher scores than those to expert questions, with averages of 4.46 and 4.02 (Mann-Whitney U test, P < .05). Among the LLMs, ChatGPT-4 consistently outperformed ChatGPT-3.5, achieving average scores of 4.45 and 4.03 (Mann-Whitney U test, P < .05). CONCLUSIONS ChatGPT-4 provides more comprehensive responses than ChatGPT-3.5 for queries related to gingival and endodontic health. Both LLMs perform better in English and on common sense questions. However, the performance discrepancies across different language contexts and the presence of inaccurate responses suggest that further evaluation and understanding of their limitations are crucial to avoid potential misunderstandings. CLINICAL RELEVANCE This study revealed the performance differences of ChatGPT-3.5 and ChatGPT-4 in handling gingival and endodontic health issues across different language contexts, providing insights into the comprehensiveness and limitations of LLMs in addressing common oral healthcare queries.
Affiliation(s)
- Qian Zhang
- College of Stomatology, Chongqing Medical University, Chongqing, China; Chongqing Key Laboratory for Oral Diseases and Biomedical Sciences, Chongqing, China; Chongqing Municipal Key Laboratory of Oral Biomedical Engineering of Higher Education, Chongqing, China
- Zhengyu Wu
- College of Stomatology, Chongqing Medical University, Chongqing, China; Chongqing Key Laboratory for Oral Diseases and Biomedical Sciences, Chongqing, China; Chongqing Municipal Key Laboratory of Oral Biomedical Engineering of Higher Education, Chongqing, China
- Jinlin Song
- College of Stomatology, Chongqing Medical University, Chongqing, China; Chongqing Key Laboratory for Oral Diseases and Biomedical Sciences, Chongqing, China; Chongqing Municipal Key Laboratory of Oral Biomedical Engineering of Higher Education, Chongqing, China
- Shuicai Luo
- Quanzhou Institute of Equipment Manufacturing, Haixi Institute, Chinese Academy of Sciences, Quanzhou, China
- Zhaowu Chai
- College of Stomatology, Chongqing Medical University, Chongqing, China; Chongqing Key Laboratory for Oral Diseases and Biomedical Sciences, Chongqing, China; Chongqing Municipal Key Laboratory of Oral Biomedical Engineering of Higher Education, Chongqing, China.
10. Nguyen HC, Dang HP, Nguyen TL, Hoang V, Nguyen VA. Accuracy of latest large language models in answering multiple choice questions in dentistry: A comparative study. PLoS One 2025;20:e0317423. PMID: 39879192. PMCID: PMC11778630. DOI: 10.1371/journal.pone.0317423.
Abstract
OBJECTIVES This study aims to evaluate the performance of the latest large language models (LLMs) in answering dental multiple choice questions (MCQs), including both text-based and image-based questions. MATERIAL AND METHODS A total of 1490 MCQs from two board review books for the United States National Board Dental Examination were selected. This study evaluated six of the latest LLMs as of August 2024, including ChatGPT 4.0 omni (OpenAI), Gemini Advanced 1.5 Pro (Google), Copilot Pro with GPT-4 Turbo (Microsoft), Claude 3.5 Sonnet (Anthropic), Mistral Large 2 (Mistral AI), and Llama 3.1 405b (Meta). χ2 tests were performed to determine whether there were significant differences in the percentages of correct answers among LLMs for both the total sample and each discipline (p < 0.05). RESULTS Significant differences were observed in the percentage of accurate answers among the six LLMs across text-based questions, image-based questions, and the total sample (p<0.001). For the total sample, Copilot (85.5%), Claude (84.0%), and ChatGPT (83.8%) demonstrated the highest accuracy, followed by Mistral (78.3%) and Gemini (77.1%), with Llama (72.4%) exhibiting the lowest. CONCLUSIONS Newer versions of LLMs demonstrate superior performance in answering dental MCQs compared to earlier versions. Copilot, Claude, and ChatGPT achieved high accuracy on text-based questions and low accuracy on image-based questions. LLMs capable of handling image-based questions demonstrated superior performance compared to LLMs limited to text-based questions. CLINICAL RELEVANCE Dental clinicians and students should prioritize the most up-to-date LLMs when supporting their learning, clinical practice, and research.
Affiliation(s)
- Hai Phong Dang
- Faculty of Dentistry, PHENIKAA University, Hanoi, Vietnam
- Viet Hoang
- Faculty of Dentistry, Van Lang University, Ho Chi Minh City, Vietnam
11. Blasingame MN, Koonce TY, Williams AM, Giuse DA, Su J, Krump PA, Giuse NB. Evaluating a large language model's ability to answer clinicians' requests for evidence summaries. J Med Libr Assoc 2025;113:65-77. PMID: 39975503. PMCID: PMC11835037. DOI: 10.5195/jmla.2025.1985.
Abstract
Objective This study investigated the performance of a generative artificial intelligence (AI) tool using GPT-4 in answering clinical questions in comparison with medical librarians' gold-standard evidence syntheses. Methods Questions were extracted from an in-house database of clinical evidence requests previously answered by medical librarians. Questions with multiple parts were subdivided into individual topics. A standardized prompt was developed using the COSTAR framework. Librarians submitted each question into aiChat, an internally managed chat tool using GPT-4, and recorded the responses. The summaries generated by aiChat were evaluated on whether they contained the critical elements used in the established gold-standard summary of the librarian. A subset of questions was randomly selected for verification of references provided by aiChat. Results Of the 216 evaluated questions, aiChat's response was assessed as "correct" for 180 (83.3%) questions, "partially correct" for 35 (16.2%) questions, and "incorrect" for 1 (0.5%) question. No significant differences were observed in question ratings by question category (p=0.73). For a subset of 30% (n=66) of questions, 162 references were provided in the aiChat summaries, and 60 (37%) were confirmed as nonfabricated. Conclusions Overall, the performance of a generative AI tool was promising. However, many included references could not be independently verified, and attempts were not made to assess whether any additional concepts introduced by aiChat were factually accurate. Thus, we envision this being the first of a series of investigations designed to further our understanding of how current and future versions of generative AI can be used and integrated into medical librarians' workflow.
Affiliation(s)
- Mallory N Blasingame
- Information Scientist & Assistant Director for Evidence Provision, Center for Knowledge Management, Vanderbilt University Medical Center, Nashville, TN
- Taneya Y Koonce
- Deputy Director, Center for Knowledge Management, Vanderbilt University Medical Center, Nashville, TN
- Annette M Williams
- Senior Information Scientist and Associate Director for Metadata Management, Center for Knowledge Management, Vanderbilt University Medical Center, Nashville, TN
- Dario A Giuse
- Associate Professor, Department of Biomedical Informatics, Vanderbilt University School of Medicine and Vanderbilt University Medical Center, Nashville, TN
- Jing Su
- Senior Information Scientist, Center for Knowledge Management, Vanderbilt University Medical Center, Nashville, TN
- Poppy A Krump
- Information Scientist, Center for Knowledge Management, Vanderbilt University Medical Center, Nashville, TN
- Nunzia Bettinsoli Giuse
- Professor of Biomedical Informatics and Professor of Medicine; Vice President for Knowledge Management; and Director, Center for Knowledge Management, Vanderbilt University Medical Center, Nashville, TN
12. Farhadi Nia M, Ahmadi M, Irankhah E. Transforming dental diagnostics with artificial intelligence: advanced integration of ChatGPT and large language models for patient care. Front Dent Med 2025;5:1456208. PMID: 39917691. PMCID: PMC11797834. DOI: 10.3389/fdmed.2024.1456208.
Abstract
Artificial intelligence has dramatically reshaped our interaction with digital technologies, ushering in an era where advancements in AI algorithms and Large Language Models (LLMs) have given rise to natural language processing (NLP) systems like ChatGPT. This study delves into the impact of cutting-edge LLMs, notably OpenAI's ChatGPT, on medical diagnostics, with a keen focus on the dental sector. Leveraging publicly accessible datasets, these models augment the diagnostic capabilities of medical professionals, streamline communication between patients and healthcare providers, and enhance the efficiency of clinical procedures. The advent of ChatGPT-4 is poised to make substantial inroads into dental practices, especially in the realm of oral surgery. This paper sheds light on the current landscape and explores potential future research directions in the burgeoning field of LLMs, offering valuable insights for both practitioners and developers. Furthermore, it critically assesses the broad implications and challenges within various sectors, including academia and healthcare, thus mapping out an overview of AI's role in transforming dental diagnostics for enhanced patient care.
Affiliation(s)
- Masoumeh Farhadi Nia
- Department of Electrical and Computer Engineering, University of Massachusetts Lowell, Lowell, MA, United States
- Mohsen Ahmadi
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, United States
- Department of Industrial Engineering, Urmia University of Technology, Urmia, Iran
- Elyas Irankhah
- Department of Mechanical Engineering, University of Massachusetts Lowell, Lowell, MA, United States
13. Monteiro MA, Cunha JLS, de Carvalho Chaves-Júnior S. Potential of ChatGPT in children's oral health education: A friend or foe in guidance for parents and caregivers? Int J Paediatr Dent 2025;35:11-12. PMID: 38676298. DOI: 10.1111/ipd.13196.
Affiliation(s)
- John Lennon Silva Cunha
- Postgraduate Program in Dentistry, Department of Dentistry, State University of Paraíba (UEPB), Campina Grande, PB, Brazil
- Center of Biological and Health Sciences, Federal University of Western Bahia (UFOB), Barreiras, BA, Brazil
14. Pagano S, Strumolo L, Michalk K, Schiegl J, Pulido LC, Reinhard J, Maderbacher G, Renkawitz T, Schuster M. Evaluating ChatGPT, Gemini and other Large Language Models (LLMs) in orthopaedic diagnostics: A prospective clinical study. Comput Struct Biotechnol J 2024;28:9-15. PMID: 39850460. PMCID: PMC11754967. DOI: 10.1016/j.csbj.2024.12.013.
Abstract
Background Large Language Models (LLMs) such as ChatGPT are gaining attention for their potential applications in healthcare. This study aimed to evaluate the diagnostic sensitivity of various LLMs in detecting hip or knee osteoarthritis (OA) using only patient-reported data collected via a structured questionnaire, without prior medical consultation. Methods A prospective observational study was conducted at an orthopaedic outpatient clinic specialized in hip and knee OA treatment. A total of 115 patients completed a paper-based questionnaire covering symptoms, medical history, and demographic information. The diagnostic performance of five different LLMs-including four versions of ChatGPT, two of Gemini, Llama, Gemma 2, and Mistral-Nemo-was analysed. Model-generated diagnoses were compared against those provided by experienced orthopaedic clinicians, which served as the reference standard. Results GPT-4o achieved the highest diagnostic sensitivity at 92.3 %, significantly outperforming other LLMs. The completeness of patient responses to symptom-related questions was the strongest predictor of accuracy for GPT-4o (p < 0.001). Inter-model agreement was moderate among GPT-4 versions, whereas models such as Llama-3.1 demonstrated notably lower accuracy and concordance. Conclusions GPT-4o demonstrated high accuracy and consistency in diagnosing OA based solely on patient-reported questionnaires, underscoring its potential as a supplementary diagnostic tool in clinical settings. Nevertheless, the reliance on patient-reported data without direct physician involvement highlights the critical need for medical oversight to ensure diagnostic accuracy. Further research is needed to refine LLM capabilities and expand their utility in broader diagnostic applications.
Affiliation(s)
- Stefano Pagano
- Department of Orthopaedic Surgery, University of Regensburg, Asklepios Klinikum, Bad Abbach, Germany
- Luigi Strumolo
- Freelance health consultant & senior data analyst, Avellino, Italy
- Katrin Michalk
- Department of Orthopaedic Surgery, University of Regensburg, Asklepios Klinikum, Bad Abbach, Germany
- Julia Schiegl
- Department of Orthopaedic Surgery, University of Regensburg, Asklepios Klinikum, Bad Abbach, Germany
- Loreto C. Pulido
- Department of Orthopaedics Hospital of Trauma Surgery, Marktredwitz Hospital, Marktredwitz, Germany
- Jan Reinhard
- Department of Orthopaedic Surgery, University of Regensburg, Asklepios Klinikum, Bad Abbach, Germany
- Guenther Maderbacher
- Department of Orthopaedic Surgery, University of Regensburg, Asklepios Klinikum, Bad Abbach, Germany
- Tobias Renkawitz
- Department of Orthopaedic Surgery, University of Regensburg, Asklepios Klinikum, Bad Abbach, Germany
- Marie Schuster
- Department of Orthopaedic Surgery, University of Regensburg, Asklepios Klinikum, Bad Abbach, Germany
15. Umer F, Batool I, Naved N. Innovation and application of Large Language Models (LLMs) in dentistry - a scoping review. BDJ Open 2024;10:90. PMID: 39617779. PMCID: PMC11609263. DOI: 10.1038/s41405-024-00277-6.
Abstract
OBJECTIVE Large Language Models (LLMs) have revolutionized healthcare, yet their integration in dentistry remains underexplored. Therefore, this scoping review aims to systematically evaluate current literature on LLMs in dentistry. DATA SOURCES The search covered PubMed, Scopus, IEEE Xplore, and Google Scholar, with studies selected based on predefined criteria. Data were extracted to identify applications, evaluation metrics, prompting strategies, and deployment levels of LLMs in dental practice. RESULTS From 4079 records, 17 studies met the inclusion criteria. ChatGPT was the predominant model, mainly used for post-operative patient queries. Likert scale was the most reported evaluation metric, and only two studies employed advanced prompting strategies. Most studies were at level 3 of deployment, indicating practical application but requiring refinement. CONCLUSION LLMs showed extensive applicability in dental specialties; however, reliance on ChatGPT necessitates diversified assessments across multiple LLMs. Standardizing reporting practices and employing advanced prompting techniques are crucial for transparency and reproducibility, necessitating continuous efforts to optimize LLM utility and address existing challenges.
Affiliation(s)
- Fahad Umer
- Associate Professor, Operative Dentistry & Endodontics, Aga Khan University Hospital, Karachi, Pakistan
- Itrat Batool
- Resident, Operative Dentistry & Endodontics, Aga Khan University Hospital, Karachi, Pakistan
- Nighat Naved
- Resident, Operative Dentistry & Endodontics, Aga Khan University Hospital, Karachi, Pakistan.
16. Frodl A, Fuchs A, Yilmaz T, Izadpanah K, Schmal H, Siegel M. ChatGPT as a Source for Patient Information on Patellofemoral Surgery - A Comparative Study Amongst Laymen, Doctors, and Experts. Clin Pract 2024;14:2376-2384. PMID: 39585014. PMCID: PMC11587161. DOI: 10.3390/clinpract14060186.
Abstract
INTRODUCTION In November 2022, OpenAI launched ChatGPT for public use through a free online platform. ChatGPT is an artificial intelligence (AI) chatbot trained on a broad dataset encompassing a wide range of topics, including medical literature. The usability in the medical field and the quality of AI-generated responses are widely discussed and are the subject of current investigations. Patellofemoral pain is one of the most common conditions among young adults, often prompting patients to seek advice. This study examines the quality of ChatGPT as a source of information regarding patellofemoral conditions and surgery, hypothesizing that there will be differences in the evaluation of responses generated by ChatGPT between populations with different levels of expertise in patellofemoral disorders. METHODS A comparison was conducted between laymen, doctors (non-orthopedic), and experts in patellofemoral disorders based on a list of 12 questions. These questions were divided into descriptive and recommendatory categories, with each category further split into basic and advanced content. Questions were used to prompt ChatGPT in April 2024 using the ChatGPT 4.0 engine, and answers were evaluated using a custom tool inspired by the Ensuring Quality Information for Patients (EQIP) instrument. Evaluations were performed independently by laymen, non-orthopedic doctors, and experts, with the results statistically analyzed using a Mann-Whitney U Test. A p-value of less than 0.05 was considered statistically significant. RESULTS The study included data from seventeen participants: four experts in patellofemoral disorders, seven non-orthopedic doctors, and six laymen. Experts rated the answers lower on average compared to non-experts. Significant differences were observed in the ratings of descriptive answers with increasing complexity. The average score for experts was 29.3 ± 5.8, whereas non-experts averaged 35.3 ± 5.7. For recommendatory answers, experts also gave lower ratings, particularly for more complex questions. CONCLUSION ChatGPT provides good quality answers to questions concerning patellofemoral disorders, although questions with higher complexity were rated lower by patellofemoral experts compared to non-experts. This study emphasizes the potential of ChatGPT as a complementary tool for patient information on patellofemoral disorders, although the quality of the answers fluctuates with the complexity of the questions, which might not be recognized by non-experts. The lack of personalized recommendations and the problem of "AI hallucinations" remain a challenge. Human expertise and judgement, especially from trained healthcare experts, remain irreplaceable.
Affiliation(s)
- Andreas Frodl
- Department of Orthopedic Surgery and Traumatology, Freiburg University Hospital, Albert Ludwigs University Freiburg, Hugstetter Straße 55, 79106 Freiburg, Germany
- Andreas Fuchs
- Department of Orthopedic Surgery and Traumatology, Freiburg University Hospital, Albert Ludwigs University Freiburg, Hugstetter Straße 55, 79106 Freiburg, Germany
- Tayfun Yilmaz
- Department of Orthopedic Surgery and Traumatology, Freiburg University Hospital, Albert Ludwigs University Freiburg, Hugstetter Straße 55, 79106 Freiburg, Germany
- Kaywan Izadpanah
- Department of Orthopedic Surgery and Traumatology, Freiburg University Hospital, Albert Ludwigs University Freiburg, Hugstetter Straße 55, 79106 Freiburg, Germany
- Hagen Schmal
- Department of Orthopedic Surgery and Traumatology, Freiburg University Hospital, Albert Ludwigs University Freiburg, Hugstetter Straße 55, 79106 Freiburg, Germany
- Department of Orthopedic Surgery, University Hospital Odense, Sdr. Boulevard 29, 5000 Odense, Denmark
- Markus Siegel
- Department of Orthopedic Surgery and Traumatology, Freiburg University Hospital, Albert Ludwigs University Freiburg, Hugstetter Straße 55, 79106 Freiburg, Germany
17. Hartman H, Essis MD, Tung WS, Oh I, Peden S, Gianakos AL. Can ChatGPT-4 Diagnose and Treat Like an Orthopaedic Surgeon? Testing Clinical Decision Making and Diagnostic Ability in Soft-Tissue Pathologies of the Foot and Ankle. J Am Acad Orthop Surg 2024:00124635-990000000-01126. PMID: 39442011. DOI: 10.5435/jaaos-d-24-00595.
Abstract
INTRODUCTION ChatGPT-4, a chatbot with an ability to carry human-like conversation, has attracted attention after demonstrating aptitude to pass professional licensure examinations. The purpose of this study was to explore the diagnostic and decision-making capacities of ChatGPT-4 in clinical management specifically assessing for accuracy in the identification and treatment of soft-tissue foot and ankle pathologies. METHODS This study presented eight soft-tissue-related foot and ankle cases to ChatGPT-4, with each case assessed by three fellowship-trained foot and ankle orthopaedic surgeons. The evaluation system included five criteria within a Likert scale, scoring from 5 (lowest) to 25 (highest possible). RESULTS The average sum score of all cases was 22.0. The Morton neuroma case received the highest score (24.7), and the peroneal tendon tear case received the lowest score (16.3). Subgroup analyses of each of the five criteria showed no notable differences in surgeon grading. Criteria 3 (provide alternative treatments) and 4 (provide comprehensive information) were graded markedly lower than criteria 1 (diagnose), 2 (treat), and 5 (provide accurate information) (for both criteria 3 and 4: P = 0.007; P = 0.032; P < 0.0001). Criterion 5 was graded markedly higher than criteria 2, 3, and 4 (P = 0.02; P < 0.0001; P < 0.0001). CONCLUSION This study demonstrates that ChatGPT-4 effectively diagnosed and provided reliable treatment options for most soft-tissue foot and ankle cases presented, noting consistency among surgeon evaluators. Individual criterion assessment revealed that ChatGPT-4 was most effective in diagnosing and suggesting appropriate treatment, but limitations were seen in the chatbot's ability to provide comprehensive information and alternative treatment options. In addition, the chatbot successfully did not suggest fabricated treatment options, a common concern in prior literature. This resource could be useful for clinicians seeking reliable patient education materials without the fear of inconsistencies, although comprehensive information beyond treatment may be limited.
Affiliation(s)
- Hayden Hartman
- From the Lincoln Memorial University, DeBusk College of Osteopathic Medicine, Knoxville, TN (Hartman), and the Department of Orthopaedics and Rehabilitation, Yale University, New Haven, CT (Essis, Tung, Oh, Peden, and Gianakos)
18. Jeong H, Han SS, Yu Y, Kim S, Jeon KJ. How well do large language model-based chatbots perform in oral and maxillofacial radiology? Dentomaxillofac Radiol 2024;53:390-395. PMID: 38848473. PMCID: PMC11358622. DOI: 10.1093/dmfr/twae021.
Abstract
OBJECTIVES This study evaluated the performance of four large language model (LLM)-based chatbots by comparing their test results with those of dental students on an oral and maxillofacial radiology examination. METHODS ChatGPT, ChatGPT Plus, Bard, and Bing Chat were tested on 52 questions from regular dental college examinations. These questions were categorized into three educational content areas: basic knowledge, imaging and equipment, and image interpretation. They were also classified as multiple-choice questions (MCQs) and short-answer questions (SAQs). The accuracy rates of the chatbots were compared with the performance of students, and further analysis was conducted based on the educational content and question type. RESULTS The students' overall accuracy rate was 81.2%, while that of the chatbots varied: 50.0% for ChatGPT, 65.4% for ChatGPT Plus, 50.0% for Bard, and 63.5% for Bing Chat. ChatGPT Plus achieved a higher accuracy rate for basic knowledge than the students (93.8% vs. 78.7%). However, all chatbots performed poorly in image interpretation, with accuracy rates below 35.0%. All chatbots scored less than 60.0% on MCQs, but performed better on SAQs. CONCLUSIONS The performance of chatbots in oral and maxillofacial radiology was unsatisfactory. Further training using specific, relevant data derived solely from reliable sources is required. Additionally, the validity of these chatbots' responses must be meticulously verified.
Affiliation(s)
- Hui Jeong
- Department of Oral and Maxillofacial Radiology, Yonsei University College of Dentistry, Seoul 03722, Republic of Korea
- Sang-Sun Han
- Department of Oral and Maxillofacial Radiology, Yonsei University College of Dentistry, Seoul 03722, Republic of Korea
- Institute for Innovation in Digital Healthcare, Yonsei University, Seoul 03722, Republic of Korea
- Oral Science Research Center, College of Dentistry, Yonsei University, Seoul 03722, Republic of Korea
- Youngjae Yu
- Department of Artificial Intelligence, Yonsei University College of Computing, Seoul 03722, Republic of Korea
- Saejin Kim
- Department of Artificial Intelligence, Yonsei University College of Computing, Seoul 03722, Republic of Korea
- Kug Jin Jeon
- Department of Oral and Maxillofacial Radiology, Yonsei University College of Dentistry, Seoul 03722, Republic of Korea
- Institute for Innovation in Digital Healthcare, Yonsei University, Seoul 03722, Republic of Korea
19. Batool I, Naved N, Kazmi SMR, Umer F. Leveraging Large Language Models in the delivery of post-operative dental care: a comparison between an embedded GPT model and ChatGPT. BDJ Open 2024;10:48. PMID: 38866751. PMCID: PMC11169374. DOI: 10.1038/s41405-024-00226-3.
Abstract
OBJECTIVE This study underscores the transformative role of Artificial Intelligence (AI) in healthcare, particularly the promising applications of Large Language Models (LLMs) in the delivery of post-operative dental care. The aim is to evaluate the performance of an embedded GPT model and to compare it with ChatGPT-3.5 turbo. The assessment focuses on aspects such as response accuracy, clarity, relevance, and up-to-date knowledge in addressing patient concerns and facilitating informed decision-making. MATERIAL AND METHODS An embedded GPT model, employing GPT-3.5-16k, was crafted via GPT-trainer to answer postoperative questions in four dental specialties: Operative Dentistry & Endodontics, Periodontics, Oral & Maxillofacial Surgery, and Prosthodontics. The generated responses were validated by thirty-six dental experts, nine from each specialty, using a Likert scale, providing comprehensive insights into the embedded GPT model's performance relative to ChatGPT-3.5 turbo. For content validation, a quantitative Content Validity Index (CVI) was used. The CVI was calculated both at the item level (I-CVI) and the scale level (S-CVI/Ave). To adjust the I-CVI for chance agreement, a modified kappa statistic (K*) was computed. RESULTS The overall content validity of responses generated via the embedded GPT model and ChatGPT was 65.62% and 61.87%, respectively. Moreover, the embedded GPT model outperformed ChatGPT, with an accuracy of 62.5% and clarity of 72.5%. In contrast, the responses generated via ChatGPT achieved slightly lower scores, with an accuracy of 52.5% and clarity of 67.5%. However, both models performed equally well in terms of relevance and up-to-date knowledge. CONCLUSION The embedded GPT model showed better results than ChatGPT in providing post-operative dental care, emphasizing the benefits of embedding and prompt engineering and paving the way for future advancements in healthcare applications.
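The content-validity metrics named in this abstract (I-CVI, S-CVI/Ave, and the chance-adjusted modified kappa K*) follow standard published formulas. The sketch below assumes the common convention of dichotomising expert ratings into relevant/not relevant before computing the I-CVI; the rating matrix is invented for illustration and is not the study's data.

```python
# Minimal sketch of I-CVI, S-CVI/Ave and modified kappa (K*).
# Rows = items (generated responses), columns = experts, values already
# dichotomised (1 = rated relevant/acceptable, 0 = not). Hypothetical data.
from math import comb

ratings = [
    [1, 1, 1, 0, 1, 1, 1, 1, 1],   # item 1, nine experts
    [1, 0, 1, 1, 1, 0, 1, 1, 1],   # item 2
    [1, 1, 1, 1, 1, 1, 1, 1, 1],   # item 3
]

def i_cvi(item):
    # Item-level CVI: proportion of experts rating the item as relevant.
    return sum(item) / len(item)

def modified_kappa(item):
    n, a = len(item), sum(item)
    # Probability of a agreements arising by chance (each expert 50/50).
    pc = comb(n, a) * 0.5 ** n
    icvi = i_cvi(item)
    return (icvi - pc) / (1 - pc)

icvis = [i_cvi(item) for item in ratings]
s_cvi_ave = sum(icvis) / len(icvis)   # scale-level CVI, averaging method

for idx, item in enumerate(ratings, 1):
    print(f"item {idx}: I-CVI = {i_cvi(item):.2f}, K* = {modified_kappa(item):.2f}")
print(f"S-CVI/Ave = {s_cvi_ave:.2f}")
```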
Affiliation(s)
- Itrat Batool
- Section of Dentistry, Department of Surgery, Aga Khan University Hospital, Karachi, Pakistan
| | - Nighat Naved
- Section of Dentistry, Department of Surgery, Aga Khan University Hospital, Karachi, Pakistan
| | - Syed Murtaza Raza Kazmi
- Section of Dentistry, Department of Surgery, Aga Khan University Hospital, Karachi, Pakistan
| | - Fahad Umer
- Section of Dentistry, Department of Surgery, Aga Khan University Hospital, Karachi, Pakistan.
| |
|
20
|
Ahmed SK. The future of oral cancer care: Integrating ChatGPT into clinical practice. ORAL ONCOLOGY REPORTS 2024; 10:100317. [DOI: 10.1016/j.oor.2024.100317] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/12/2025]
|
21
|
Blasingame MN, Koonce TY, Williams AM, Giuse DA, Su J, Krump PA, Giuse NB. Evaluating a Large Language Model's Ability to Answer Clinicians' Requests for Evidence Summaries. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.05.01.24306691. [PMID: 38746273 PMCID: PMC11092721 DOI: 10.1101/2024.05.01.24306691] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/16/2024]
Abstract
Objective This study investigated the performance of a generative artificial intelligence (AI) tool using GPT-4 in answering clinical questions in comparison with medical librarians' gold-standard evidence syntheses. Methods Questions were extracted from an in-house database of clinical evidence requests previously answered by medical librarians. Questions with multiple parts were subdivided into individual topics. A standardized prompt was developed using the COSTAR framework. Librarians submitted each question into aiChat, an internally-managed chat tool using GPT-4, and recorded the responses. The summaries generated by aiChat were evaluated on whether they contained the critical elements used in the established gold-standard summary of the librarian. A subset of questions was randomly selected for verification of references provided by aiChat. Results Of the 216 evaluated questions, aiChat's response was assessed as "correct" for 180 (83.3%) questions, "partially correct" for 35 (16.2%) questions, and "incorrect" for 1 (0.5%) question. No significant differences were observed in question ratings by question category (p=0.39). For a subset of 30% (n=66) of questions, 162 references were provided in the aiChat summaries, and 60 (37%) were confirmed as nonfabricated. Conclusions Overall, the performance of a generative AI tool was promising. However, many included references could not be independently verified, and attempts were not made to assess whether any additional concepts introduced by aiChat were factually accurate. Thus, we envision this being the first of a series of investigations designed to further our understanding of how current and future versions of generative AI can be used and integrated into medical librarians' workflow.
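The COSTAR framework referenced for prompt standardisation structures a prompt into Context, Objective, Style, Tone, Audience, and Response format. The template below is a hypothetical illustration of that structure only; the field contents and the question are placeholders, not the actual prompt submitted to aiChat.

```python
# Hypothetical COSTAR-style prompt template; field wording is invented.
COSTAR_TEMPLATE = """\
# CONTEXT
You are assisting a medical librarian who answers clinicians' evidence requests.

# OBJECTIVE
Summarise the best available evidence for the question below and cite sources.

# STYLE
Structured evidence summary with a short bottom-line statement.

# TONE
Neutral, clinical, non-promotional.

# AUDIENCE
Practising clinicians at an academic medical center.

# RESPONSE FORMAT
Bulleted findings followed by a numbered reference list.

Question: {question}
"""

def build_prompt(question: str) -> str:
    return COSTAR_TEMPLATE.format(question=question)

print(build_prompt("Does early mobilisation reduce ICU-acquired weakness?"))
```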
|
22
|
Frosolini A, Catarzi L, Benedetti S, Latini L, Chisci G, Franz L, Gennaro P, Gabriele G. The Role of Large Language Models (LLMs) in Providing Triage for Maxillofacial Trauma Cases: A Preliminary Study. Diagnostics (Basel) 2024; 14:839. [PMID: 38667484 PMCID: PMC11048758 DOI: 10.3390/diagnostics14080839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2024] [Revised: 04/10/2024] [Accepted: 04/16/2024] [Indexed: 04/28/2024] Open
Abstract
BACKGROUND In the evolving field of maxillofacial surgery, integrating advanced technologies like Large Language Models (LLMs) into medical practices, especially for trauma triage, presents a promising yet largely unexplored potential. This study aimed to evaluate the feasibility of using LLMs for triaging complex maxillofacial trauma cases by comparing their performance against the expertise of a tertiary referral center. METHODS Utilizing a comprehensive review of patient records in a tertiary referral center over a year-long period, standardized prompts detailing patient demographics, injury characteristics, and medical histories were created. These prompts were used to assess the triage suggestions of ChatGPT 4.0 and Google GEMINI against the center's recommendations, supplemented by evaluating the AI's performance using the QAMAI and AIPI questionnaires. RESULTS The results in 10 cases of major maxillofacial trauma indicated moderate agreement rates between LLM recommendations and the referral center, with some variances in the suggestion of appropriate examinations (70% ChatGPT and 50% GEMINI) and treatment plans (60% ChatGPT and 45% GEMINI). Notably, the study found no statistically significant differences in several areas of the questionnaires, except in the diagnosis accuracy (GEMINI: 3.30, ChatGPT: 2.30; p = 0.032) and relevance of the recommendations (GEMINI: 2.90, ChatGPT: 3.50; p = 0.021). A Spearman correlation analysis highlighted significant correlations within the two questionnaires, specifically between the QAMAI total score and AIPI treatment scores (rho = 0.767, p = 0.010). CONCLUSIONS This exploratory investigation underscores the potential of LLMs in enhancing clinical decision making for maxillofacial trauma cases, indicating a need for further research to refine their application in healthcare settings.
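The correlation reported between the QAMAI total score and the AIPI treatment score is a rank correlation (Spearman's rho). The sketch below, using invented questionnaire scores for ten hypothetical cases rather than the study's data, shows how such a coefficient is typically computed with SciPy.

```python
# Minimal sketch: Spearman rank correlation between two questionnaire
# scores across trauma cases. Scores are invented, not the study's data.
from scipy.stats import spearmanr

qamai_total = [18, 22, 25, 20, 27, 19, 24, 21, 26, 23]
aipi_treatment = [3, 4, 5, 3, 5, 2, 4, 3, 5, 4]

rho, p_value = spearmanr(qamai_total, aipi_treatment)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")
```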
Affiliation(s)
- Andrea Frosolini
- Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy; (L.C.); (S.B.); (L.L.); (G.C.); (P.G.); (G.G.)
| | - Lisa Catarzi
- Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy; (L.C.); (S.B.); (L.L.); (G.C.); (P.G.); (G.G.)
| | - Simone Benedetti
- Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy; (L.C.); (S.B.); (L.L.); (G.C.); (P.G.); (G.G.)
| | - Linda Latini
- Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy; (L.C.); (S.B.); (L.L.); (G.C.); (P.G.); (G.G.)
| | - Glauco Chisci
- Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy; (L.C.); (S.B.); (L.L.); (G.C.); (P.G.); (G.G.)
| | - Leonardo Franz
- Phoniatrics and Audiology Unit, Department of Neuroscience DNS, University of Padova, 35122 Treviso, Italy;
- Artificial Intelligence in Medicine and Innovation in Clinical Research and Methodology (PhD Program), Department of Clinical and Experimental Sciences, University of Brescia, 25121 Brescia, Italy
| | - Paolo Gennaro
- Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy; (L.C.); (S.B.); (L.L.); (G.C.); (P.G.); (G.G.)
| | - Guido Gabriele
- Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy; (L.C.); (S.B.); (L.L.); (G.C.); (P.G.); (G.G.)
| |
|
23
|
Freire Y, Santamaría Laorden A, Orejas Pérez J, Gómez Sánchez M, Díaz-Flores García V, Suárez A. ChatGPT performance in prosthodontics: Assessment of accuracy and repeatability in answer generation. J Prosthet Dent 2024; 131:659.e1-659.e6. [PMID: 38310063 DOI: 10.1016/j.prosdent.2024.01.018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2023] [Revised: 01/17/2024] [Accepted: 01/18/2024] [Indexed: 02/05/2024]
Abstract
STATEMENT OF PROBLEM The artificial intelligence (AI) software program ChatGPT is based on large language models (LLMs) and is widely accessible. However, in prosthodontics, little is known about its performance in generating answers. PURPOSE The purpose of this study was to determine the performance of ChatGPT in generating answers about removable dental prostheses (RDPs) and tooth-supported fixed dental prostheses (FDPs). MATERIAL AND METHODS Thirty short questions were designed about RDPs and tooth-supported FDP, and 30 answers were generated for each of the questions using ChatGPT-4 in October 2023. The 900 generated answers were independently graded by experts using a 3-point Likert scale. The relative frequency and absolute percentage of answers were described. Accuracy was assessed using the Wald binomial method, while repeatability was evaluated using percentage agreement, Brennan and Prediger coefficient, Conger generalized Cohen kappa, Fleiss kappa, Gwet AC, and Krippendorff alpha methods. Confidence intervals were set at 95%. Statistical analysis was performed using the STATA software program. RESULTS The performance of ChatGPT in generating answers related to RDP and tooth-supported FDP was limited. The answers showed a reliability of 25.6%, with a confidence range between 22.9% and 28.6%. The repeatability ranged from substantial to moderate. CONCLUSIONS The results show that currently ChatGPT has limited ability to generate answers related to RDPs and tooth-supported FDPs. Therefore, ChatGPT cannot replace a dentist, and, if professionals were to use it, they should be aware of its limitations.
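Two of the measures named in this abstract's methods, the Wald binomial confidence interval for an accuracy proportion and simple percentage agreement across repeated generations, reduce to short formulas. The sketch below uses invented counts and gradings; chance-corrected coefficients such as Fleiss kappa, Gwet AC, or Krippendorff alpha need dedicated implementations and are not shown.

```python
# Minimal sketch: Wald binomial CI for an accuracy proportion, and
# pairwise percentage agreement across repeated answers to one question.
# All counts and gradings are hypothetical.
from math import sqrt

def wald_ci(successes, n, z=1.96):
    p = successes / n
    half = z * sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half), min(1.0, p + half)

p, lo, hi = wald_ci(successes=230, n=900)   # e.g. 230 of 900 answers graded "correct"
print(f"accuracy = {p:.1%} (95% CI {lo:.1%} to {hi:.1%})")

# Repeatability: agreement among 30 regenerated answers to the same question.
grades = ["correct"] * 19 + ["partially correct"] * 8 + ["incorrect"] * 3
pairs = agreements = 0
for i in range(len(grades)):
    for j in range(i + 1, len(grades)):
        pairs += 1
        agreements += grades[i] == grades[j]
print(f"pairwise percentage agreement = {agreements / pairs:.1%}")
```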
Affiliation(s)
- Yolanda Freire
- Assistant Professor, Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, European University of Madrid (UEM), Madrid, Spain
| | - Andrea Santamaría Laorden
- Assistant Professor, Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, European University of Madrid (UEM), Madrid, Spain
| | - Jaime Orejas Pérez
- Assistant Professor, Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, European University of Madrid (UEM), Madrid, Spain
| | - Margarita Gómez Sánchez
- Assistant Professor, Vice Dean of Dentistry, Department of Pre-Clinic Dentistry and Clinical Dentistry, Faculty of Biomedical and Health Sciences, European University of Madrid (UEM), Madrid, Spain
| | - Víctor Díaz-Flores García
- Assistant Professor, Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, European University of Madrid (UEM), Madrid, Spain.
| | - Ana Suárez
- Associate Professor, Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, European University of Madrid (UEM), Madrid, Spain
| |
|