1. Essis MD, Hartman H, Tung WS, Oh I, Peden S, Gianakos AL. Comparison of ChatGPT's Diagnostic and Management Accuracy of Foot and Ankle Bone-Related Pathologies to Orthopaedic Surgeons. J Am Acad Orthop Surg 2025:00124635-990000000-01297. PMID: 40233367. DOI: 10.5435/jaaos-d-24-01049.
Abstract
INTRODUCTION The steep rise in the use of large language model chatbots, such as ChatGPT, has spilled into medicine in recent years. The newest version of ChatGPT, ChatGPT-4, has passed medical licensure examinations and, specifically in orthopaedics, has performed at the level of a postgraduate year 3 orthopaedic surgery resident on Orthopaedic In-Service Training Examination question bank sets. The purpose of this study was to evaluate ChatGPT-4's diagnostic and decision-making capacity in the clinical management of bone-related injuries of the foot and ankle. METHODS Eight bone-related foot and ankle orthopaedic cases were presented to ChatGPT-4 and subsequently evaluated by three fellowship-trained foot and ankle orthopaedic surgeons. Cases were scored on a Likert scale across five criteria, yielding a total score from 5 (lowest) to 25 (highest). ChatGPT-4 was addressed as "Dr. GPT" to establish a peer dynamic in which the chatbot emulated the role of an orthopaedic surgeon. RESULTS The average score across all criteria for each case was 4.53 of 5, for an overall average sum score of 22.7 of 25 across all cases. The pathology with the highest score was second metatarsal stress fracture (24.3), whereas the case with the lowest score was hallux rigidus (21.3). Kendall correlation analysis of interrater reliability showed variable correlation among surgeons, without statistical significance. CONCLUSION ChatGPT-4 effectively diagnosed and provided appropriate treatment options for simple bone-related foot and ankle cases. Importantly, ChatGPT did not fabricate treatment options (ie, the hallucination phenomenon previously well documented in the literature), and it received its second-highest overall average score on this criterion. ChatGPT struggled to provide comprehensive information beyond standard treatment options. Overall, ChatGPT has the potential to serve as a widely accessible resource for patients and nonorthopaedic clinicians, although limitations may exist in the delivery of comprehensive information.
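The interrater analysis described above can be sketched in a few lines: three raters' per-case sum scores (on the 5-25 scale) are compared pairwise with Kendall's tau, mirroring the Kendall correlation analysis the abstract reports. The scores below are invented for illustration only, and the exact pairing scheme used by the authors is not stated, so this is a minimal sketch rather than a reproduction of the study's analysis.

```python
# Minimal sketch of a pairwise Kendall correlation between raters.
# All scores are hypothetical; none come from the study.
from itertools import combinations
from scipy.stats import kendalltau

case_sums = {                      # total score per case (5-25) for each rater
    "surgeon_1": [24, 22, 23, 21, 25, 22, 20, 24],
    "surgeon_2": [23, 22, 24, 20, 24, 23, 21, 25],
    "surgeon_3": [25, 21, 22, 22, 25, 22, 20, 23],
}

for rater_a, rater_b in combinations(case_sums, 2):
    tau, p_value = kendalltau(case_sums[rater_a], case_sums[rater_b])
    print(f"{rater_a} vs {rater_b}: tau = {tau:.2f}, p = {p_value:.3f}")
```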
Affiliation(s)
- Maritza Diane Essis
- From the Department of Orthopaedic Surgery, Yale Medicine, Orthopaedics and Rehabilitation, New Haven, CT (Essis, Tung, Oh, Peden, and Gianakos), and Lincoln Memorial University, DeBusk College of Osteopathic Medicine, Knoxville, TN (Hartman)

2. Kunze KN, Gerhold C, Dave U, Abunnur N, Mamonov A, Nwachukwu BU, Verma NN, Chahla J. Large Language Model Use Cases in Health Care Research Are Redundant and Often Lack Appropriate Methodological Conduct: A Scoping Review and Call for Improved Practices. Arthroscopy 2025:S0749-8063(25)00253-1. PMID: 40209833. DOI: 10.1016/j.arthro.2025.03.066.
Abstract
PURPOSE To describe the current use cases of large language models (LLMs) in musculoskeletal medicine and to evaluate the methodologic conduct of these investigations in order to safeguard future implementation of LLMs in clinical research and identify key areas for methodological improvement. METHODS A comprehensive literature search was performed in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines using PubMed, Cochrane Library, and Embase databases to identify eligible studies. Included studies evaluated the use of LLMs within any realm of orthopaedic surgery, regardless of application in a clinical or educational setting. The Methodological Index for Non-Randomized Studies criteria were used to assess the quality of all included studies. RESULTS In total, 114 studies published from 2022 to 2024 were identified. Extensive use case redundancy was observed, and 5 main categories of clinical applications of LLMs were identified: 48 studies (42.1%) that assessed the ability to answer patient questions, 24 studies (21.1%) that evaluated the ability to diagnose and manage medical conditions, 21 studies (18.4%) that evaluated the ability to take orthopaedic examinations, 11 studies (9.6%) that analyzed the ability to develop or evaluate patient educational materials, and 10 studies (8.8%) concerning other applications, such as generating images, generating discharge documents and clinical letters, writing scientific abstracts and manuscripts, and enhancing billing efficiency. General orthopaedics was the focus of most included studies (n = 39, 34.2%), followed by orthopaedic sports medicine (n = 18, 15.8%) and adult reconstructive surgery (n = 17, 14.9%). ChatGPT 3.5 was the most common LLM used or evaluated (n = 79, 69.2%), followed by ChatGPT 4.0 (n = 47, 41.2%). Methodological inconsistency was prevalent among studies, with 36 (31.6%) failing to disclose the exact prompts used, 64 (56.1%) failing to disclose the exact outputs generated by the LLM, and only 7 (6.1%) evaluating different prompting strategies to elicit desired outputs. No studies investigated how race or gender influenced model outputs. CONCLUSIONS Among studies evaluating LLM health care use cases, the scope of clinical investigations was limited, with most studies showing redundant use cases. Because of infrequently reported descriptions of prompting strategies, incomplete model specifications, failure to disclose exact model outputs, and limited attempts to address bias, methodological inconsistency was concerningly extensive. CLINICAL RELEVANCE A comprehensive understanding of current LLM use cases is critical to familiarize providers with the possibilities through which this technology may be used in clinical practice. As LLM health care applications transition from research to clinical integration, model transparency and trustworthiness are critical. The results of the current study suggest that guidance is urgently needed, with a focus on promoting appropriate methodological conduct practices and novel use cases to advance the field.
Affiliation(s)
- Kyle N Kunze
- Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A.
- Udit Dave
- Midwest Orthopaedics at Rush, Chicago, Illinois, U.S.A.
- Nezar Abunnur
- Midwest Orthopaedics at Rush, Chicago, Illinois, U.S.A.
- Benedict U Nwachukwu
- Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A.
- Jorge Chahla
- Midwest Orthopaedics at Rush, Chicago, Illinois, U.S.A.

3. Fares MY, Parmar T, Boufadel P, Daher M, Berg J, Witt A, Hill BW, Horneff JG, Khan AZ, Abboud JA. An Assessment of the Performance of Different Chatbots on Shoulder and Elbow Questions. J Clin Med 2025; 14:2289. PMID: 40217738. PMCID: PMC11989822. DOI: 10.3390/jcm14072289.
Abstract
Background/Objectives: The utility of artificial intelligence (AI) in medical education has recently garnered significant interest, with several studies exploring its applications across various educational domains; however, its role in orthopedic education, particularly in shoulder and elbow surgery, remains scarcely studied. This study aims to evaluate the performance of multiple AI models in answering shoulder- and elbow-related questions from the AAOS ResStudy question bank. Methods: A total of 50 shoulder- and elbow-related questions from the AAOS ResStudy question bank were selected for the study. Questions were categorized according to anatomical location, topic, concept, and difficulty. Each question, along with the possible multiple-choice answers, was provided to each chatbot. The performance of each chatbot was recorded and analyzed to identify significant differences between the chatbots' performances across various categories. Results: The overall average performance of all chatbots was 60.4%. There were significant differences in the performances of different chatbots (p = 0.034): GPT-4o performed best, answering 74% of the questions correctly. AAOS members outperformed all chatbots, with an average accuracy of 79.4%. There were no significant differences in performance between shoulder and elbow questions (p = 0.931). Topic-wise, chatbots did worse on questions relating to "Adhesive Capsulitis" than those relating to "Instability" (p = 0.013), "Nerve Injuries" (p = 0.002), and "Arthroplasty" (p = 0.028). Concept-wise, the best performance was seen in "Diagnosis" (71.4%), but there were no significant differences in scores between different chatbots. Difficulty analysis revealed that chatbots performed significantly better on easy questions (68.5%) compared to moderate (45.4%; p = 0.04) and hard questions (40.0%; p = 0.012). Conclusions: AI chatbots show promise as supplementary tools in medical education and clinical decision-making, but their limitations necessitate cautious and complementary use alongside expert human judgment.
Affiliation(s)
- Mohamad Y. Fares
- Division of Shoulder and Elbow Surgery, Rothman Orthopaedic Institute, Philadelphia, PA 19107, USA
- Tarishi Parmar
- Penn State College of Medicine, The Pennsylvania State University, Hershey, PA 17033, USA
- Peter Boufadel
- Division of Shoulder and Elbow Surgery, Rothman Orthopaedic Institute, Philadelphia, PA 19107, USA
- Mohammad Daher
- Department of Orthopedic Surgery, The Warren Alpert Medical School, Brown University, Providence, RI 02912, USA
- Jonathan Berg
- Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, PA 19107, USA
- Austin Witt
- Baylor University Medical Center, Dallas, TX 75246, USA
- Brian W. Hill
- Palm Beach Orthopaedic Institute, West Palm Beach, FL 33401, USA
- John G. Horneff
- Division of Shoulder and Elbow Surgery, Department of Orthopaedics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Adam Z. Khan
- Southern Permanente Medical Group, Pasadena, CA 91188, USA
- Joseph A. Abboud
- Division of Shoulder and Elbow Surgery, Rothman Orthopaedic Institute, Philadelphia, PA 19107, USA

4. Burisch C, Bellary A, Breuckmann F, Ehlers J, Thal SC, Sellmann T, Gödde D. ChatGPT-4 Performance on German Continuing Medical Education-Friend or Foe (Trick or Treat)? Protocol for a Randomized Controlled Trial. JMIR Res Protoc 2025; 14:e63887. PMID: 39913914. PMCID: PMC11843049. DOI: 10.2196/63887.
Abstract
BACKGROUND The increasing development and spread of artificial and assistive intelligence are opening up new areas of application not only in applied medicine but also in related fields such as continuing medical education (CME), which is part of the mandatory training program for medical doctors in Germany. This study aims to determine whether medical laypersons can successfully complete CME training courses designed specifically for physicians with the help of a large language model (LLM) such as ChatGPT-4, and to qualitatively and quantitatively investigate the impact of using artificial intelligence (AI; specifically ChatGPT) on the acquisition of credit points in German postgraduate medical education. OBJECTIVE Using this approach, we want to test further possible applications of AI in the postgraduate medical education setting and obtain results for practical use. Depending on the results, the potential influence of LLMs such as ChatGPT-4 on CME will be discussed, for example, as part of a SWOT (strengths, weaknesses, opportunities, threats) analysis. METHODS We designed a randomized controlled trial in which adult high school students attempt to solve CME tests across six medical specialties in three study arms, with 18 CME training courses per study arm, under different interventional conditions with varying amounts of permitted use of ChatGPT-4. Sample size calculation was performed including guess probability (20% correct answers, SD=40%; confidence level of 1-α=.95/α=.05; test power of 1-β=.95; P<.05). The study was registered with the Open Science Framework (OSF). RESULTS As of October 2024, recruitment of student participants and data acquisition are ongoing. We expect our findings to be ready for publication in early 2025. CONCLUSIONS We aim to show that advances in AI, especially LLMs such as ChatGPT-4, have considerable effects on medical laypersons' ability to pass CME tests. The implications for how the concept of continuing medical education may need to be reevaluated remain to be considered. TRIAL REGISTRATION OSF Registries 10.17605/OSF.IO/MZNUF; https://osf.io/mznuf. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID) PRR1-10.2196/63887.
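As a rough illustration of the kind of sample size calculation the protocol describes, the sketch below sizes a two-arm comparison of correct-answer proportions against the 20% guessing baseline at α = 0.05 and power = 0.95. The 50% assisted-response rate is an assumed effect size chosen purely for illustration; it is not taken from the protocol.

```python
# Hypothetical sample-size sketch for comparing two proportions of correct answers.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p_guess = 0.20       # expected proportion correct when only guessing
p_assisted = 0.50    # assumed proportion correct with ChatGPT-4 assistance (illustrative)

effect_size = proportion_effectsize(p_assisted, p_guess)  # Cohen's h
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.95,
    ratio=1.0,
    alternative="two-sided",
)
print(f"Participants needed per study arm: {n_per_arm:.0f}")
```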
Affiliation(s)
- Christian Burisch
- State of North Rhine-Westphalia, Regional Government Düsseldorf, Leibniz-Gymnasium, Essen, Germany
- Department of Didactics and Education Research in the Health Sector, Faculty of Health, Witten/Herdecke University, Witten, Germany
- Abhav Bellary
- Faculty of Health, Witten/Herdecke University, Witten, Germany
- Frank Breuckmann
- Department of Cardiology, Pneumology, Neurology and Intensive Care Medicine, Klinik Kitzinger Land, Kitzingen, Germany
- Department of Cardiology and Vascular Medicine, West German Heart and Vascular Center Essen, University Duisburg-Essen, Essen, Germany
- Jan Ehlers
- Department of Didactics and Education Research in the Health Sector, Faculty of Health, Witten/Herdecke University, Witten, Germany
- Serge C Thal
- Department of Anesthesiology, HELIOS University Hospital, Wuppertal, Germany
- Department of Anaesthesiology I, Witten-Herdecke University, Witten, Germany
- Timur Sellmann
- Department of Anaesthesiology I, Witten-Herdecke University, Witten, Germany
- Department of Anesthesiology and Intensive Care Medicine, Evangelisches Krankenhaus Hospital, BETHESDA zu Duisburg, Duisburg, Germany
- Daniel Gödde
- Department of Pathology and Molecular Pathology, HELIOS University Hospital Wuppertal, University Witten/Herdecke, Witten, Germany

5. Jin H, Guo J, Lin Q, Wu S, Hu W, Li X. Comparative study of Claude 3.5-Sonnet and human physicians in generating discharge summaries for patients with renal insufficiency: assessment of efficiency, accuracy, and quality. Front Digit Health 2024; 6:1456911. PMID: 39703756. PMCID: PMC11655460. DOI: 10.3389/fdgth.2024.1456911.
Abstract
Background The rapid development of artificial intelligence (AI) has shown great potential in medical document generation. This study aims to evaluate the performance of Claude 3.5-Sonnet, an advanced AI model, in generating discharge summaries for patients with renal insufficiency, compared to human physicians. Methods A prospective, comparative study was conducted involving 100 patients (50 with acute kidney injury (AKI) and 50 with chronic kidney disease (CKD)) from the nephrology department of Ningbo Hangzhou Bay Hospital between January and June 2024. Discharge summaries were independently generated by Claude 3.5-Sonnet and human physicians. The main evaluation indicators included accuracy, generation time, and overall quality. Results Claude 3.5-Sonnet demonstrated comparable accuracy to human physicians in generating discharge summaries for both AKI (90 vs. 92 points, p > 0.05) and CKD patients (88 vs. 90 points, p > 0.05). The AI model significantly outperformed human physicians in terms of efficiency, requiring only about 30 s to generate a summary compared with over 15 min for physicians (p < 0.001). The overall quality scores showed no significant difference between AI-generated and physician-written summaries for both AKI (26 vs. 27 points, p > 0.05) and CKD patients (25 vs. 26 points, p > 0.05). Conclusion Claude 3.5-Sonnet demonstrates high efficiency and reliability in generating discharge summaries for patients with renal insufficiency, with accuracy and quality comparable to those of human physicians. These findings suggest that AI has significant potential to improve the efficiency of medical documentation, though further research is needed to optimize its integration into clinical practice and address ethical and privacy concerns.
Affiliation(s)
- Haijiao Jin
- Department of Nephrology, Ren Ji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Department of Nephrology, Ningbo Hangzhou Bay Hospital, Zhejiang, China
- Molecular Cell Lab for Kidney Disease, Shanghai, China
- Shanghai Peritoneal Dialysis Research Center, Shanghai, China
- Uremia Diagnosis and Treatment Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Jinglu Guo
- Department of Nephrology, Ningbo Hangzhou Bay Hospital, Zhejiang, China
- Qisheng Lin
- Department of Nephrology, Ren Ji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Molecular Cell Lab for Kidney Disease, Shanghai, China
- Shanghai Peritoneal Dialysis Research Center, Shanghai, China
- Uremia Diagnosis and Treatment Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Shaun Wu
- WORK Medical Technology Group LTD., Hangzhou, China
- Weiguo Hu
- Department of Medical Education, Ruijin Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Xiaoyang Li
- Department of Medical Education, Ruijin Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China

6. Dagher T, Dwyer EP, Baker HP, Kalidoss S, Strelzow JA. "Dr. AI Will See You Now": How Do ChatGPT-4 Treatment Recommendations Align With Orthopaedic Clinical Practice Guidelines? Clin Orthop Relat Res 2024; 482:2098-2106. PMID: 39246048. PMCID: PMC11556953. DOI: 10.1097/corr.0000000000003234.
Abstract
BACKGROUND Artificial intelligence (AI) is engineered to emulate tasks that have historically required human interaction and intellect, including learning, pattern recognition, decision-making, and problem-solving. Although AI models like ChatGPT-4 have demonstrated satisfactory performance on medical licensing exams, suggesting a potential for supporting medical diagnostics and decision-making, no study of which we are aware has evaluated the ability of these tools to make treatment recommendations when given clinical vignettes and representative medical imaging of common orthopaedic conditions. As AI continues to advance, a thorough understanding of its strengths and limitations is necessary to inform safe and helpful integration into medical practice. QUESTIONS/PURPOSES (1) What is the concordance of ChatGPT-4-generated treatment recommendations for common orthopaedic conditions with both the American Academy of Orthopaedic Surgeons (AAOS) clinical practice guidelines (CPGs) and an orthopaedic attending physician's treatment plan? (2) In what specific areas do the ChatGPT-4-generated treatment recommendations diverge from the AAOS CPGs? METHODS Ten common orthopaedic conditions with associated AAOS CPGs were identified: carpal tunnel syndrome, distal radius fracture, glenohumeral joint osteoarthritis, rotator cuff injury, clavicle fracture, hip fracture, hip osteoarthritis, knee osteoarthritis, ACL injury, and acute Achilles rupture. For each condition, the medical records of 10 deidentified patients managed at our facility were used to construct clinical vignettes that each had an isolated, single diagnosis with adequate clarity. The vignettes also encompassed a range of diagnostic severity to more thoroughly evaluate adherence to the treatment guidelines outlined by the AAOS. These clinical vignettes were presented alongside representative radiographic imaging. The model was prompted to provide a single treatment plan recommendation. Each treatment plan was compared with established AAOS CPGs and with the treatment plan documented by the attending orthopaedic surgeon treating the specific patient. Vignettes where ChatGPT-4 recommendations diverged from CPGs were reviewed to identify patterns of error and summarized. RESULTS ChatGPT-4 provided treatment recommendations in accordance with the AAOS CPGs in 90% (90 of 100) of clinical vignettes. Concordance between ChatGPT-generated plans and the plan recommended by the treating orthopaedic attending physician was 78% (78 of 100). One hundred percent (30 of 30) of ChatGPT-4 recommendations for fracture vignettes and hip and knee arthritis vignettes matched CPG recommendations, whereas the model struggled most with recommendations for carpal tunnel syndrome (3 of 10 instances demonstrated discordance). ChatGPT-4 recommendations diverged from AAOS CPGs for three carpal tunnel syndrome vignettes; two ACL injury, rotator cuff injury, and glenohumeral joint osteoarthritis vignettes; as well as one acute Achilles rupture vignette. In these situations, ChatGPT-4 most often struggled to correctly interpret injury severity and progression, incorporate patient factors (such as lifestyle or comorbidities) into decision-making, and recognize a contraindication to surgery. CONCLUSION ChatGPT-4 can generate accurate treatment plans aligned with CPGs but can also make mistakes when it is required to integrate multiple patient factors into decision-making and understand disease severity and progression. Physicians must critically assess the full clinical picture when using AI tools to support their decision-making. CLINICAL RELEVANCE ChatGPT-4 may be used as an on-demand diagnostic companion, but patient-centered decision-making should continue to remain in the hands of the physician.
Affiliation(s)
- Tanios Dagher
- Department of Orthopaedic Surgery, The University of Chicago, Chicago, IL, USA
- Emma P. Dwyer
- Department of Orthopaedic Surgery, The University of Chicago, Chicago, IL, USA
- Hayden P. Baker
- Department of Orthopaedic Surgery, The University of Chicago, Chicago, IL, USA
- Senthooran Kalidoss
- Department of Orthopaedic Surgery, The University of Chicago, Chicago, IL, USA
- Jason A. Strelzow
- Department of Orthopaedic Surgery, The University of Chicago, Chicago, IL, USA

7. Villarreal-Espinosa JB, Berreta RS, Allende F, Garcia JR, Ayala S, Familiari F, Chahla J. Accuracy assessment of ChatGPT responses to frequently asked questions regarding anterior cruciate ligament surgery. Knee 2024; 51:84-92. PMID: 39241674. DOI: 10.1016/j.knee.2024.08.014.
Abstract
BACKGROUND The emergence of artificial intelligence (AI) has allowed users to access large sources of information in a chat-like manner. We therefore sought to evaluate the accuracy of ChatGPT-4 responses to the 10 most frequently asked patient questions (FAQs) regarding anterior cruciate ligament (ACL) surgery. METHODS A list of the top 10 FAQs pertaining to ACL surgery was created after conducting a search through all Sports Medicine Fellowship Institutions listed on the Arthroscopy Association of North America (AANA) and American Orthopaedic Society for Sports Medicine (AOSSM) websites. A Likert scale was used by two sports medicine fellowship-trained surgeons to grade response accuracy. Cohen's kappa was used to assess inter-rater agreement. Reproducibility of the responses over time was also assessed. RESULTS Five of the 10 responses received a 'completely accurate' grade from both fellowship-trained surgeons, with three additional replies receiving a 'completely accurate' grade from at least one. Moreover, the inter-rater reliability assessment revealed moderate agreement between the fellowship-trained attending physicians (weighted kappa = 0.57, 95% confidence interval 0.15-0.99). Additionally, 80% of the responses were reproducible over time. CONCLUSION ChatGPT can be considered an accurate additional tool to answer general patient questions regarding ACL surgery. Nonetheless, patient-surgeon interaction should not be deferred and must continue to be the driving force for information retrieval. Thus, the general recommendation is to address any questions in the presence of a qualified specialist.
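The agreement statistic reported above (weighted kappa = 0.57) can be reproduced in principle with a short script: two raters' Likert grades for the 10 responses are compared with a weighted Cohen's kappa. The grades and the three-level scale below are assumptions for illustration, not the study's data.

```python
# Minimal sketch of weighted Cohen's kappa between two raters (hypothetical grades).
from sklearn.metrics import cohen_kappa_score

# Assumed scale: 1 = inaccurate, 2 = partially accurate, 3 = completely accurate
rater_1 = [3, 3, 3, 2, 3, 2, 3, 2, 1, 3]
rater_2 = [3, 2, 3, 2, 3, 3, 3, 2, 2, 3]

weighted_kappa = cohen_kappa_score(rater_1, rater_2, weights="linear")
print(f"Weighted Cohen's kappa: {weighted_kappa:.2f}")
```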
Affiliation(s)
- Felicitas Allende
- Department of Orthopedics, Rush University Medical Center, Chicago, IL, USA
- José Rafael Garcia
- Department of Orthopedics, Rush University Medical Center, Chicago, IL, USA
- Salvador Ayala
- Department of Orthopedics, Rush University Medical Center, Chicago, IL, USA
- Jorge Chahla
- Department of Orthopedics, Rush University Medical Center, Chicago, IL, USA

8. Suárez A, Jiménez J, Llorente de Pedro M, Andreu-Vázquez C, Díaz-Flores García V, Gómez Sánchez M, Freire Y. Beyond the Scalpel: Assessing ChatGPT's potential as an auxiliary intelligent virtual assistant in oral surgery. Comput Struct Biotechnol J 2024; 24:46-52. PMID: 38162955. PMCID: PMC10755495. DOI: 10.1016/j.csbj.2023.11.058.
Abstract
AI has revolutionized the way we interact with technology. Noteworthy advances in AI algorithms and large language models (LLMs) have led to the development of natural generative language (NGL) systems such as ChatGPT. Although these LLMs can simulate human conversations and generate content in real time, they face challenges related to the topicality and accuracy of the information they generate. This study aimed to assess whether ChatGPT-4 could provide accurate and reliable answers to general dentists in the field of oral surgery, and thus explore its potential as an intelligent virtual assistant in clinical decision making in oral surgery. Thirty questions related to oral surgery were posed to ChatGPT-4, with each question repeated 30 times, yielding a total of 900 responses. Two surgeons graded the answers according to the guidelines of the Spanish Society of Oral Surgery, using a three-point Likert scale (correct, partially correct/incomplete, and incorrect). Disagreements were arbitrated by an experienced oral surgeon, who provided the final grade. Accuracy was found to be 71.7%, and consistency of the experts' grading across iterations ranged from moderate to almost perfect. ChatGPT-4, with its potential capabilities, will inevitably be integrated into dental disciplines, including oral surgery. In the future, it could be considered as an auxiliary intelligent virtual assistant, though it would never replace oral surgery experts. Proper training and verified information by experts will remain vital to the implementation of the technology. More comprehensive research is needed to ensure the safe and successful application of AI in oral surgery.
Affiliation(s)
- Ana Suárez
- Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Jaime Jiménez
- Department of Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- María Llorente de Pedro
- Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Cristina Andreu-Vázquez
- Department of Veterinary Medicine, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Víctor Díaz-Flores García
- Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Margarita Gómez Sánchez
- Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Yolanda Freire
- Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain

9. Zhang C, Liu S, Zhou X, Zhou S, Tian Y, Wang S, Xu N, Li W. Examining the Role of Large Language Models in Orthopedics: Systematic Review. J Med Internet Res 2024; 26:e59607. PMID: 39546795. DOI: 10.2196/59607.
Abstract
BACKGROUND Large language models (LLMs) can understand natural language and generate corresponding text, images, and even videos based on prompts, which holds great potential in medical scenarios. Orthopedics is a significant branch of medicine, and orthopedic diseases contribute to a substantial socioeconomic burden, which could be alleviated by the application of LLMs. Several pioneers in orthopedics have conducted research on LLMs across various subspecialties to explore their performance in addressing different issues. However, there are currently few reviews and summaries of these studies, and a systematic summary of existing research is absent. OBJECTIVE The objective of this review was to comprehensively summarize research findings on the application of LLMs in the field of orthopedics and explore the potential opportunities and challenges. METHODS PubMed, Embase, and Cochrane Library databases were searched from January 1, 2014, to February 22, 2024, with the language limited to English. The terms, which included variants of "large language model," "generative artificial intelligence," "ChatGPT," and "orthopaedics," were divided into 2 categories: large language model and orthopedics. After completing the search, the study selection process was conducted according to the inclusion and exclusion criteria. The quality of the included studies was assessed using the revised Cochrane risk-of-bias tool for randomized trials and CONSORT-AI (Consolidated Standards of Reporting Trials-Artificial Intelligence) guidance. Data extraction and synthesis were conducted after the quality assessment. RESULTS A total of 68 studies were selected. The application of LLMs in orthopedics involved the fields of clinical practice, education, research, and management. Of these 68 studies, 47 (69%) focused on clinical practice, 12 (18%) addressed orthopedic education, 8 (12%) were related to scientific research, and 1 (1%) pertained to the field of management. Of the 68 studies, only 8 (12%) recruited patients, and only 1 (1%) was a high-quality randomized controlled trial. ChatGPT was the most commonly mentioned LLM tool. There was considerable heterogeneity in the definition, measurement, and evaluation of the LLMs' performance across the different studies. For diagnostic tasks alone, the accuracy ranged from 55% to 93%. For disease classification tasks, the accuracy of ChatGPT with GPT-4 ranged from 2% to 100%. With regard to answering questions in orthopedic examinations, the scores ranged from 45% to 73.6% due to differences in models and test selections. CONCLUSIONS LLMs cannot replace orthopedic professionals in the short term. However, using LLMs as copilots could be a potential approach to effectively enhance work efficiency at present. More high-quality clinical trials are needed in the future, aiming to identify optimal applications of LLMs and advance orthopedics toward higher efficiency and precision.
Affiliation(s)
- Cheng Zhang
- Department of Orthopaedics, Peking University Third Hospital, Beijing, China
- Engineering Research Center of Bone and Joint Precision Medicine, Ministry of Education, Beijing, China
- Beijing Key Laboratory of Spinal Disease Research, Beijing, China
- Shanshan Liu
- Department of Orthopaedics, Peking University Third Hospital, Beijing, China
- Engineering Research Center of Bone and Joint Precision Medicine, Ministry of Education, Beijing, China
- Beijing Key Laboratory of Spinal Disease Research, Beijing, China
- Xingyu Zhou
- Peking University Health Science Center, Beijing, China
- Siyu Zhou
- Department of Orthopaedics, Peking University Third Hospital, Beijing, China
- Engineering Research Center of Bone and Joint Precision Medicine, Ministry of Education, Beijing, China
- Beijing Key Laboratory of Spinal Disease Research, Beijing, China
- Yinglun Tian
- Department of Orthopaedics, Peking University Third Hospital, Beijing, China
- Engineering Research Center of Bone and Joint Precision Medicine, Ministry of Education, Beijing, China
- Beijing Key Laboratory of Spinal Disease Research, Beijing, China
- Shenglin Wang
- Department of Orthopaedics, Peking University Third Hospital, Beijing, China
- Engineering Research Center of Bone and Joint Precision Medicine, Ministry of Education, Beijing, China
- Beijing Key Laboratory of Spinal Disease Research, Beijing, China
- Nanfang Xu
- Department of Orthopaedics, Peking University Third Hospital, Beijing, China
- Engineering Research Center of Bone and Joint Precision Medicine, Ministry of Education, Beijing, China
- Beijing Key Laboratory of Spinal Disease Research, Beijing, China
- Weishi Li
- Department of Orthopaedics, Peking University Third Hospital, Beijing, China
- Engineering Research Center of Bone and Joint Precision Medicine, Ministry of Education, Beijing, China
- Beijing Key Laboratory of Spinal Disease Research, Beijing, China

10. Sanders A, Lim R, Jones D, Vosburg RW. Artificial intelligence large language model scores highly on focused practice designation in metabolic and bariatric surgery board practice questions. Surg Endosc 2024; 38:6678-6681. PMID: 39317906. DOI: 10.1007/s00464-024-11267-y.
Abstract
BACKGROUND Artificial intelligence models such as ChatGPT (OpenAI) have performed well on the exams of various medical and surgical fields. It is not yet known how ChatGPT performs on similar metabolic and bariatric surgery (MBS) questions. OBJECTIVE Assess the performance of ChatGPT on Focused Practice Designation in Metabolic and Bariatric Surgery (FPD-MBS) board-style questions. SETTING United States. METHODS Questions obtained from the largest commercially available bank of FPD-MBS practice questions were entered into ChatGPT-4, as is, without prior training. We assessed the overall percentage correct as well as the percentage correct within each of the five American Board of Surgery (ABS) question categories. One-way ANOVA was used to determine whether the frequency of correct answers differed between categories. RESULTS Out of 255 questions, ChatGPT-4 correctly answered 189 (74.1%). There was no difference in the frequency of correct answers among the five question categories (p = 0.22). Accuracy did not differ whether questions were entered individually or in groups of up to 10. CONCLUSION Without prior training, ChatGPT-4 scored highly when evaluated on the largest practice question bank for the FPD-MBS exam.
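The category comparison described above can be sketched as follows: correct (1) / incorrect (0) outcomes are grouped by the five ABS question categories and compared with a one-way ANOVA, as in the abstract. The simulated outcome vectors below only approximate the reported overall accuracy and do not reproduce the study data.

```python
# Hypothetical one-way ANOVA across five question categories (simulated outcomes).
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Five categories of ~51 questions each, with per-category accuracy near 74% overall
categories = [rng.binomial(1, p, size=51) for p in (0.76, 0.72, 0.75, 0.70, 0.77)]

f_stat, p_value = f_oneway(*categories)
print(f"F = {f_stat:.2f}, p = {p_value:.2f}")
```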
Affiliation(s)
- A Sanders
- Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA, USA
- Department of Surgery, Grand Strand Medical Center, Myrtle Beach, SC, USA
- R Lim
- Atrium Health, Charlotte, NC, USA
- D Jones
- Department of Surgery, Rutgers Health, New Jersey Medical School, Newark, NJ, USA
- R W Vosburg
- Department of Surgery, Grand Strand Medical Center, Myrtle Beach, SC, USA

11. Hartman H, Essis MD, Tung WS, Oh I, Peden S, Gianakos AL. Can ChatGPT-4 Diagnose and Treat Like an Orthopaedic Surgeon? Testing Clinical Decision Making and Diagnostic Ability in Soft-Tissue Pathologies of the Foot and Ankle. J Am Acad Orthop Surg 2024:00124635-990000000-01126. PMID: 39442011. DOI: 10.5435/jaaos-d-24-00595.
Abstract
INTRODUCTION ChatGPT-4, a chatbot with the ability to carry on human-like conversation, has attracted attention after demonstrating the aptitude to pass professional licensure examinations. The purpose of this study was to explore the diagnostic and decision-making capacities of ChatGPT-4 in clinical management, specifically assessing accuracy in the identification and treatment of soft-tissue foot and ankle pathologies. METHODS This study presented eight soft-tissue-related foot and ankle cases to ChatGPT-4, with each case assessed by three fellowship-trained foot and ankle orthopaedic surgeons. The evaluation system comprised five criteria scored on a Likert scale, for a total score from 5 (lowest) to 25 (highest possible). RESULTS The average sum score of all cases was 22.0. The Morton neuroma case received the highest score (24.7), and the peroneal tendon tear case received the lowest score (16.3). Subgroup analyses of each of the five criteria showed no notable differences in surgeon grading. Criteria 3 (provide alternative treatments) and 4 (provide comprehensive information) were graded markedly lower than criteria 1 (diagnose), 2 (treat), and 5 (provide accurate information) (for both criteria 3 and 4: P = 0.007; P = 0.032; P < 0.0001). Criterion 5 was graded markedly higher than criteria 2, 3, and 4 (P = 0.02; P < 0.0001; P < 0.0001). CONCLUSION This study demonstrates that ChatGPT-4 effectively diagnosed and provided reliable treatment options for most soft-tissue foot and ankle cases presented, noting consistency among surgeon evaluators. Individual criterion assessment revealed that ChatGPT-4 was most effective in diagnosing and suggesting appropriate treatment, but limitations were seen in the chatbot's ability to provide comprehensive information and alternative treatment options. In addition, the chatbot did not suggest fabricated treatment options, a common concern in prior literature. This resource could be useful for clinicians seeking reliable patient education materials without the fear of inconsistencies, although comprehensive information beyond treatment may be limited.
Affiliation(s)
- Hayden Hartman
- From the Lincoln Memorial University, DeBusk College of Osteopathic Medicine, Knoxville, TN (Hartman), and the Department of Orthopaedics and Rehabilitation, Yale University, New Haven, CT (Essis, Tung, Oh, Peden, and Gianakos)

12. Sallam M, Al-Salahat K, Eid H, Egger J, Puladi B. Human versus Artificial Intelligence: ChatGPT-4 Outperforming Bing, Bard, ChatGPT-3.5 and Humans in Clinical Chemistry Multiple-Choice Questions. Adv Med Educ Pract 2024; 15:857-871. PMID: 39319062. PMCID: PMC11421444. DOI: 10.2147/amep.s479801.
Abstract
Introduction Artificial intelligence (AI) chatbots excel in language understanding and generation. These models can transform healthcare education and practice. However, it is important to assess the performance of such AI models across various topics to highlight their strengths and possible limitations. This study aimed to evaluate the performance of ChatGPT (GPT-3.5 and GPT-4), Bing, and Bard compared to human students at a postgraduate master's level in Medical Laboratory Sciences. Methods The study design was based on the METRICS checklist for the design and reporting of AI-based studies in healthcare. The study utilized a dataset of 60 Clinical Chemistry multiple-choice questions (MCQs) initially conceived for assessing 20 MSc students. The revised Bloom's taxonomy was used as the framework for classifying the MCQs into four cognitive categories: Remember, Understand, Analyze, and Apply. A modified version of the CLEAR tool was used for the assessment of the quality of AI-generated content, with Cohen's κ for inter-rater agreement. Results Compared with the students' mean score of 0.68 ± 0.23, GPT-4 scored 0.90 ± 0.30, followed by Bing (0.77 ± 0.43), GPT-3.5 (0.73 ± 0.45), and Bard (0.67 ± 0.48). Significantly better performance was noted in the lower cognitive domains (Remember and Understand) in GPT-3.5 (P=0.041), GPT-4 (P=0.003), and Bard (P=0.017) compared with the higher cognitive domains (Apply and Analyze). The CLEAR scores indicated that ChatGPT-4 performance was "Excellent" compared with the "Above average" performance of ChatGPT-3.5, Bing, and Bard. Discussion The findings indicated that ChatGPT-4 excelled in the Clinical Chemistry exam, while ChatGPT-3.5, Bing, and Bard were above average. Given that the MCQs were directed to postgraduate students with a high degree of specialization, the performance of these AI chatbots was remarkable. Due to the risk of academic dishonesty and possible dependence on these AI models, the appropriateness of MCQs as an assessment tool in higher education should be re-evaluated.
Affiliation(s)
- Malik Sallam
- Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, Jordan
- Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital, Amman, Jordan
- Scientific Approaches to Fight Epidemics of Infectious Diseases (SAFE-ID) Research Group, The University of Jordan, Amman, Jordan
- Khaled Al-Salahat
- Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, Jordan
- Scientific Approaches to Fight Epidemics of Infectious Diseases (SAFE-ID) Research Group, The University of Jordan, Amman, Jordan
- Huda Eid
- Scientific Approaches to Fight Epidemics of Infectious Diseases (SAFE-ID) Research Group, The University of Jordan, Amman, Jordan
- Jan Egger
- Institute for AI in Medicine (IKIM), University Medicine Essen (AöR), Essen, Germany
- Behrus Puladi
- Institute of Medical Informatics, University Hospital RWTH Aachen, Aachen, Germany

13. Kiser KJ, Waters M, Reckford J, Lundeberg C, Abraham CD. Large Language Models to Help Appeal Denied Radiotherapy Services. JCO Clin Cancer Inform 2024; 8:e2400129. PMID: 39250740. DOI: 10.1200/cci.24.00129.
Abstract
PURPOSE Large language model (LLM) artificial intelligences may help physicians appeal insurer denials of prescribed medical services, a task that delays patient care and contributes to burnout. We evaluated LLM performance at this task for denials of radiotherapy services. METHODS We evaluated generative pretrained transformer 3.5 (GPT-3.5; OpenAI, San Francisco, CA), GPT-4, GPT-4 with internet search functionality (GPT-4web), and GPT-3.5ft. The latter was developed by fine-tuning GPT-3.5 via an OpenAI application programming interface with 53 examples of appeal letters written by radiation oncologists. Twenty test prompts with simulated patient histories were programmatically presented to the LLMs, and output appeal letters were scored by three blinded radiation oncologists for language representation, clinical detail inclusion, clinical reasoning validity, literature citations, and overall readiness for insurer submission. RESULTS Interobserver agreement between radiation oncologists' scores was moderate or better for all domains (Cohen's kappa coefficients: 0.41-0.91). GPT-3.5, GPT-4, and GPT-4web wrote letters that were on average linguistically clear, summarized provided clinical histories without confabulation, reasoned appropriately, and were scored useful to expedite the insurance appeal process. GPT-4 and GPT-4web letters demonstrated superior clinical reasoning and were readier for submission than GPT-3.5 letters (P < .001). Fine-tuning increased GPT-3.5ft confabulation and compromised performance compared with other LLMs across all domains (P < .001). All LLMs, including GPT-4web, were poor at supporting clinical assertions with existing, relevant, and appropriately cited primary literature. CONCLUSION When prompted appropriately, three commercially available LLMs drafted letters that physicians deemed would expedite appealing insurer denials of radiotherapy services. LLMs may decrease this task's clerical workload on providers. However, LLM performance worsened when fine-tuned with a task-specific, small training data set.
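For orientation, the fine-tuning step described above (adapting GPT-3.5 through an OpenAI application programming interface on physician-written appeal letters) generally follows the workflow sketched below with the OpenAI Python SDK. The file name, prompt structure, and base-model identifier are illustrative assumptions, not details taken from the study.

```python
# Generic sketch of API-based chat-model fine-tuning; names and prompts are assumed.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Training data: one JSON object per line in chat format, e.g.
# {"messages": [{"role": "user", "content": "<denial letter + clinical history>"},
#               {"role": "assistant", "content": "<physician-written appeal letter>"}]}
training_file = client.files.create(
    file=open("appeal_letter_examples.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print("Fine-tuning job started:", job.id)
# Once the job finishes, the returned fine-tuned model name can be prompted like
# any other chat model to draft an appeal letter for a new simulated history.
```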
Affiliation(s)
- Kendall J Kiser
- Department of Radiation Oncology, Washington University School of Medicine in St Louis, St Louis, MO
- Michael Waters
- Department of Radiation Oncology, Washington University School of Medicine in St Louis, St Louis, MO
- Jocelyn Reckford
- Department of Radiation Oncology, Washington University School of Medicine in St Louis, St Louis, MO
- Christopher D Abraham
- Department of Radiation Oncology, Washington University School of Medicine in St Louis, St Louis, MO

14. Posner KM, Bakus C, Basralian G, Chester G, Zeiman M, O'Malley GR, Klein GR. Evaluating ChatGPT's Capabilities on Orthopedic Training Examinations: An Analysis of New Image Processing Features. Cureus 2024; 16:e55945. PMID: 38601421. PMCID: PMC11005479. DOI: 10.7759/cureus.55945.
Abstract
Introduction The efficacy of integrating artificial intelligence (AI) models like ChatGPT into the medical field, specifically orthopedic surgery, has yet to be fully determined. The most recently added feature of ChatGPT that has yet to be explored is its image analysis capability. This study assesses ChatGPT's performance in answering Orthopedic In-Training Examination (OITE) questions, including those that require image analysis. Methods Questions from the 2014, 2015, 2021, and 2022 AAOS OITE were screened for inclusion. All questions without images were entered into ChatGPT 3.5 and 4.0 twice. Questions that necessitated the use of images were entered only into ChatGPT 4.0, twice, as this is the only version of the system that can analyze images. The responses were recorded and compared to the AAOS's correct answers, evaluating the AI's accuracy and precision. Results A total of 940 questions were included in the final analysis (457 questions with images and 483 questions without images). ChatGPT 4.0 performed significantly better on questions that did not require image analysis (67.81% vs 47.59%, p<0.001). Discussion While the use of AI in orthopedics is an intriguing possibility, this evaluation demonstrates that, even with the addition of image processing capabilities, ChatGPT still falls short in accuracy. As AI technology evolves, ongoing research is vital to harness AI's potential effectively, ensuring it complements rather than attempts to replace the nuanced skills of orthopedic surgeons.
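The headline comparison above (67.81% vs 47.59% correct on text-only versus image-based questions) can be checked with a standard two-proportion test. The sketch below uses a two-proportion z-test, which is one reasonable choice and not necessarily the test the authors applied, with counts reconstructed approximately from the reported percentages and question totals.

```python
# Approximate two-proportion z-test on text-only vs image-based question accuracy.
from statsmodels.stats.proportion import proportions_ztest

correct = [round(0.6781 * 483), round(0.4759 * 457)]  # text-only, image-based
totals = [483, 457]

z_stat, p_value = proportions_ztest(count=correct, nobs=totals)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```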
Affiliation(s)
- Kevin M Posner
- Department of Orthopedic Surgery, Hackensack Meridian School of Medicine, Nutley, USA
- Cassandra Bakus
- Department of Orthopedic Surgery, Hackensack Meridian School of Medicine, Nutley, USA
- Grace Basralian
- Department of Orthopedic Surgery, Hackensack Meridian School of Medicine, Nutley, USA
- Grace Chester
- Department of Orthopedic Surgery, Hackensack Meridian School of Medicine, Nutley, USA
- Mallery Zeiman
- Department of Orthopedic Surgery, Hackensack Meridian School of Medicine, Nutley, USA
- Geoffrey R O'Malley
- Department of Orthopedic Surgery, Hackensack University Medical Center, Hackensack, USA
- Gregg R Klein
- Department of Orthopedic Surgery, Hackensack University Medical Center, Hackensack, USA