1
Gupta S, Tarapore R, Haislup B, Fillar A. Microsoft Copilot Provides More Accurate and Reliable Information About Anterior Cruciate Ligament Injury and Repair Than ChatGPT and Google Gemini; However, No Resource Was Overall the Best. Arthrosc Sports Med Rehabil 2025;7:101043. PMID: 40297090; PMCID: PMC12034078; DOI: 10.1016/j.asmr.2024.101043.
Abstract
Purpose To analyze and compare the quality, accuracy, and readability of information regarding anterior cruciate ligament (ACL) injury and reconstruction provided by various artificial intelligence (AI) interfaces (Google Gemini, Microsoft Copilot, and OpenAI ChatGPT). Methods Twenty questions regarding ACL reconstruction were entered into ChatGPT 3.5, Gemini, and the more precise subinterface within Copilot and were categorized on the basis of the Rothwell criteria into Fact, Policy, and Value. The answers generated were analyzed using the DISCERN scale, JAMA benchmark criteria, and Flesch-Kincaid Reading Ease Score and Grade Level. The citations provided by Gemini and Copilot were further categorized by source. Results All 3 AI interfaces generated DISCERN scores (≥50) demonstrating "good" quality of information, except for Policy and Value by Copilot, which were scored as "excellent" (≥70). The information provided by Copilot demonstrated greater reliability, with a JAMA benchmark score of 3 (of 4), as compared with Gemini (1) and ChatGPT (0). In terms of readability, the Flesch-Kincaid Reading Ease scores of all 3 sources were <30, apart from Fact by Copilot (31.9), demonstrating very complex answers. Similarly, all Flesch-Kincaid Grade Level scores were >13, indicating a minimum readability level of college or college graduate. Finally, both Copilot and Gemini drew a majority of their references from journals (65.6% for Gemini and 75.4% for Copilot), followed by academic sources, and Copilot provided a greater number of overall citations (163) than Gemini (64). Conclusions Microsoft Copilot was a better resource than Google Gemini or OpenAI ChatGPT for patients learning about ACL injuries and reconstruction in terms of quality of information, reliability, and readability. However, the answers provided by all 3 large language models are highly complex, and no resource was overall the best. Clinical Relevance As artificial intelligence models continually evolve and demonstrate increased potential for answering complex surgical questions, it is important to investigate the quality and usefulness of their responses for patients. Although these resources may be helpful, they should not be used as a substitute for discussions with health care providers.
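For readers unfamiliar with the readability metrics cited above and in the studies that follow: the Flesch Reading Ease Score and Flesch-Kincaid Grade Level are simple functions of average sentence length and average syllables per word, and scores below 30 on the former or above 13 on the latter correspond to college-graduate-level text. The sketch below illustrates the arithmetic only; the vowel-group syllable counter is a rough stand-in for the dictionary-based counters used by readability software, and the sample sentence is ours, not from the study.

import re

def count_syllables(word: str) -> int:
    # Crude approximation: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text: str) -> tuple[float, float]:
    # Return (Flesch Reading Ease, Flesch-Kincaid Grade Level).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / max(1, len(sentences))
    syllables_per_word = syllables / max(1, len(words))
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level

ease, grade = flesch_scores(
    "Arthroscopic anterior cruciate ligament reconstruction utilizes autograft "
    "or allograft tissue to restore anteroposterior stability of the knee."
)
print(f"Reading Ease: {ease:.1f}, Grade Level: {grade:.1f}")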
Affiliation(s)
- Suhasini Gupta
- UMass Chan Medical School, Worcester, Massachusetts, U.S.A
- Rae Tarapore
- Department of Orthopaedic Surgery, MedStar Union Memorial Hospital, Baltimore, Maryland, U.S.A
- Brett Haislup
- Department of Orthopaedic Surgery, MedStar Union Memorial Hospital, Baltimore, Maryland, U.S.A
- Allison Fillar
- Department of Orthopaedic Surgery, MedStar Union Memorial Hospital, Baltimore, Maryland, U.S.A
2
Kolac UC, Karademir OM, Ayik G, Kaymakoglu M, Familiari F, Huri G. Can popular AI large language models provide reliable answers to frequently asked questions about rotator cuff tears? JSES Int 2025;9:390-397. PMID: 40182256; PMCID: PMC11962600; DOI: 10.1016/j.jseint.2024.11.012.
Abstract
Background Rotator cuff tears are common upper-extremity injuries that significantly impair shoulder function, leading to pain, reduced range of motion, and a decrease in quality of life. With the increasing reliance on artificial intelligence large language models (AI LLMs) for health information, it is crucial to evaluate the quality and readability of the information these models provide. Methods A pool of 50 questions related to rotator cuff tears was generated by querying popular AI LLMs (ChatGPT 3.5, ChatGPT 4, Gemini, and Microsoft CoPilot) and using Google search. The responses from the AI LLMs were then saved and evaluated. Information quality was assessed with the DISCERN tool and a Likert scale; readability was assessed with the Patient Education Materials Assessment Tool for Printable Materials (PEMAT) Understandability Score and the Flesch-Kincaid Reading Ease Score. Two orthopedic surgeons assessed the responses, and discrepancies were resolved by a senior author. Results Of 198 answers, the median DISCERN score was 40, with 56.6% considered sufficient. The Likert scale showed 96% sufficiency. The median PEMAT Understandability score was 83.33, with 77.3% sufficiency, while the Flesch-Kincaid Reading Ease score had a median of 42.05, with 88.9% sufficiency. Overall, 39.8% of the answers were sufficient in both information quality and readability. Differences were found among AI models in DISCERN, Likert, PEMAT Understandability, and Flesch-Kincaid scores. Conclusion AI LLMs generally cannot yet offer sufficient information quality and readability. Although they are not ready for use in the medical field, they show promise, and their rapid evolution necessitates continuous re-evaluation. Developing new, comprehensive tools for evaluating medical information quality and readability is crucial to ensuring these models can effectively support patient education. Future research should focus on enhancing readability and consistent information quality to better serve patients.
Affiliation(s)
- Ulas Can Kolac
- Department of Orthopedics and Traumatology, Hacettepe University Faculty of Medicine, Ankara, Turkey
- Gokhan Ayik
- Department of Orthopedics and Traumatology, Yuksek Ihtisas University Faculty of Medicine, Ankara, Turkey
- Mehmet Kaymakoglu
- Department of Orthopedics and Traumatology, Faculty of Medicine, Izmir University of Economics, Izmir, Turkey
- Filippo Familiari
- Department of Orthopaedics, Magna Graecia University of Catanzaro, Catanzaro, Italy
- Research Center on Musculoskeletal Health, MusculoSkeletalHealth@UMG, Magna Graecia University, Catanzaro, Italy
- Gazi Huri
- Department of Orthopedics and Traumatology, Hacettepe University Faculty of Medicine, Ankara, Turkey
- Aspetar, Orthopedic and Sports Medicine Hospital, FIFA Medical Center of Excellence, Doha, Qatar
3
Mastrokostas PG, Mastrokostas LE, Emara AK, Wellington IJ, Ginalis E, Houten JK, Khalsa AS, Saleh A, Razi AE, Ng MK. Modern Internet Search Analytics: Is There a Difference in What Patients are Searching Regarding the Operative and Nonoperative Management of Scoliosis? Global Spine J 2025;15:103-111. PMID: 38613478; PMCID: PMC11571444; DOI: 10.1177/21925682241248110.
Abstract
STUDY DESIGN Observational study. OBJECTIVES This study aimed to investigate the most frequently searched types of questions and the online resources implicated in the operative and nonoperative management of scoliosis. METHODS Six terms related to operative and nonoperative scoliosis treatment were searched in Google's People Also Ask section on October 12, 2023. The Rothwell classification was used to sort questions into fact, policy, or value categories, and associated websites were classified by type. Fisher's exact tests compared question types and websites encountered between operative and nonoperative questions. Statistical significance was set at the .05 level. RESULTS The most common questions concerning operative and nonoperative management were fact (53.4%) and value (35.5%) questions, respectively. The most common subcategories pertaining to operative and nonoperative questions were specific activities/restrictions (21.7%) and evaluation of treatment (33.3%), respectively. Questions on indications/management (13.2% vs 31.2%, P < .001) and evaluation of treatment (10.1% vs 33.3%, P < .001) were associated with nonoperative scoliosis management. Medical practice websites were the most common websites to which questions concerning operative (31.9%) and nonoperative (51.4%) management were directed. Operative questions were more likely than nonoperative questions to be directed to academic websites (21.7% vs 10.0%, P = .037) and less likely to be directed to medical practice websites (31.9% vs 51.4%, P = .007). CONCLUSIONS During scoliosis consultations, spine surgeons should emphasize the postoperative recovery process and the efficacy of conservative treatment modalities for the operative and nonoperative management of scoliosis, respectively. Future research should assess the impact of website encounters on patients' decision-making.
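The Rothwell classification used above (and in several other studies listed here) sorts patient questions into three broad categories. A minimal illustration follows; the example questions are hypothetical, not drawn from the study's dataset.

from enum import Enum

class RothwellCategory(Enum):
    FACT = "fact"      # seeks objective information, e.g., "How long does scoliosis surgery take?"
    POLICY = "policy"  # asks which course of action to take, e.g., "Should my child have surgery or bracing?"
    VALUE = "value"    # asks for an evaluation or appraisal, e.g., "Is scoliosis surgery worth the risk?"

# Each harvested question is assigned one category before counts are compared
# between operative and nonoperative search terms.
print(RothwellCategory.POLICY.value)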
Affiliation(s)
- Paul G. Mastrokostas
- College of Medicine, State University of New York (SUNY) Downstate, Brooklyn, NY, USA
- Ahmed K. Emara
- Department of Orthopaedic Surgery, Cleveland Clinic, Cleveland, OH, USA
- Ian J. Wellington
- Department of Orthopaedic Surgery, University of Connecticut, Hartford, CT, USA
- John K. Houten
- Department of Neurosurgery, Mount Sinai School of Medicine, New York, NY, USA
- Amrit S. Khalsa
- Department of Orthopaedic Surgery, University of Pennsylvania, Philadelphia, PA, USA
- Ahmed Saleh
- Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, NY, USA
- Afshin E. Razi
- Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, NY, USA
- Mitchell K. Ng
- Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, NY, USA
4
Eryilmaz A, Aydin M, Turemis C, Surucu S. ChatGPT-4.0 vs. Google: Which Provides More Academic Answers to Patients' Questions on Arthroscopic Meniscus Repair? Cureus 2024;16:e76380. PMID: 39867098; PMCID: PMC11760333; DOI: 10.7759/cureus.76380.
Abstract
Purpose The purpose of this study was to evaluate the ability of the Chat Generative Pre-trained Transformer (ChatGPT) to provide academic answers to frequently asked questions, compared with Google web search FAQs and answers. This study attempted to determine what patients ask on Google and ChatGPT and whether ChatGPT and Google provide factual information for patients about arthroscopic meniscus repair. Methods A cleanly installed Google Chrome browser and ChatGPT were used to ensure that no individual cookies, browsing history, other site data, or sponsored sites influenced the results. The term "arthroscopic meniscus repair" was entered into the Google Chrome browser and ChatGPT. The first 15 frequently asked questions (FAQs), the answers, and the sources of those answers were identified from both ChatGPT and Google. Results Timeline of recovery (20%) and technical details (20%) were the most commonly asked question categories among the 30 questions. Technical details and timeline of recovery questions were more commonly asked on ChatGPT than on Google (technical details: 33.3% vs. 6.6%, p=0.168; timeline of recovery: 26.6% vs. 13.3%, p=0.651). Answers on ChatGPT came from academic websites more often than answers on Google (93.3% vs. 20%, p=0.0001). On Google, the most common sources of answers to frequently asked questions were academic (20%) and commercial (20%) websites. Conclusion Compared to Google, ChatGPT provided significantly fewer references to commercial content and offered responses that were more aligned with academic sources. ChatGPT may be a valuable adjunct in patient education when used under physician supervision, ensuring information aligns with evidence-based practices.
Affiliation(s)
- Atahan Eryilmaz
- Orthopedic Surgery, Haseki Training and Research Hospital, Istanbul, TUR
- Mahmud Aydin
- Orthopedic Surgery, Sisli Memorial Hospital, Istanbul, TUR
- Cihangir Turemis
- Orthopedic Surgery, Cesme Alper Cizgenakat State Hospital, Izmir, TUR
- Serkan Surucu
- Orthopedics and Rehabilitation, Yale University, New Haven, USA
5
Mastrokostas PG, Mastrokostas LE, Emara AK, Wellington IJ, Ginalis E, Houten JK, Khalsa AS, Saleh A, Razi AE, Ng MK. GPT-4 as a Source of Patient Information for Anterior Cervical Discectomy and Fusion: A Comparative Analysis Against Google Web Search. Global Spine J 2024;14:2389-2398. PMID: 38513636; PMCID: PMC11529100; DOI: 10.1177/21925682241241241.
Abstract
STUDY DESIGN Comparative study. OBJECTIVES This study aims to compare Google and GPT-4 in terms of (1) question types, (2) response readability, (3) source quality, and (4) numerical response accuracy for the top 10 most frequently asked questions (FAQs) about anterior cervical discectomy and fusion (ACDF). METHODS "Anterior cervical discectomy and fusion" was searched on Google and GPT-4 on December 18, 2023. The top 10 FAQs were classified according to the Rothwell system. Source quality was evaluated using JAMA benchmark criteria, and readability was assessed using the Flesch Reading Ease and Flesch-Kincaid grade level. Differences in JAMA scores, Flesch-Kincaid grade level, Flesch Reading Ease, and word count between platforms were analyzed using Student's t-tests. Statistical significance was set at the .05 level. RESULTS Frequently asked questions from Google were varied, while GPT-4 focused on technical details and indications/management. GPT-4 showed a higher Flesch-Kincaid grade level (12.96 vs 9.28, P = .003), a lower Flesch Reading Ease score (37.07 vs 54.85, P = .005), and higher JAMA scores for source quality (3.333 vs 1.800, P = .016). Numerically, 6 of 10 responses varied between platforms, with GPT-4 providing broader recovery timelines for ACDF. CONCLUSIONS This study demonstrates GPT-4's ability to elevate patient education by providing high-quality, diverse information tailored to those with advanced literacy levels. As AI technology evolves, refining these tools for accuracy and user-friendliness remains crucial, catering to patients' varying literacy levels and information needs in spine surgery.
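The JAMA benchmark cited above scores a source from 0 to 4, awarding one point for each transparency criterion it satisfies (authorship, attribution, disclosure, and currency). The sketch below is a minimal illustration of that tally; the field names are ours, not the study's instrument.

from dataclasses import dataclass

@dataclass
class JamaBenchmark:
    authorship: bool   # authors and their credentials are identified
    attribution: bool  # references and sources for the content are listed
    disclosure: bool   # ownership, sponsorship, and conflicts of interest are disclosed
    currency: bool     # dates of posting and last update are given

    def score(self) -> int:
        # One point per criterion met, for a total of 0-4.
        return sum([self.authorship, self.attribution, self.disclosure, self.currency])

# Example: a page that cites its sources and shows an update date, but names no
# author and carries no disclosure statement, scores 2 of 4.
print(JamaBenchmark(authorship=False, attribution=True,
                    disclosure=False, currency=True).score())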
Affiliation(s)
- Paul G. Mastrokostas
- College of Medicine, State University of New York (SUNY) Downstate, Brooklyn, NY, USA
- Ahmed K. Emara
- Department of Orthopaedic Surgery, Cleveland Clinic, Cleveland, OH, USA
- Ian J. Wellington
- Department of Orthopaedic Surgery, University of Connecticut, Hartford, CT, USA
- John K. Houten
- Department of Neurosurgery, Mount Sinai School of Medicine, New York, NY, USA
- Amrit S. Khalsa
- Department of Orthopaedic Surgery, University of Pennsylvania, Philadelphia, PA, USA
- Ahmed Saleh
- Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, NY, USA
- Afshin E. Razi
- Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, NY, USA
- Mitchell K. Ng
- Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, NY, USA
6
Singh SP, Ramprasad A, Luu A, Zaidi R, Siddiqui Z, Pham T. Health Literacy Analytics of Accessible Patient Resources in Cardiovascular Medicine: What are Patients Wanting to Know? Kans J Med 2023;16:309-315. PMID: 38298385; PMCID: PMC10829858; DOI: 10.17161/kjm.vol16.20554.
Abstract
Introduction Internet-based resources are increasingly used as a first-line source of medical knowledge. Among patients with cardiovascular disease, these resources are often relied upon for information about numerous diagnostic and therapeutic modalities; however, the reliability of this information is not fully understood. The aim of this study was to provide a descriptive profile of the literacy quality, readability, and transparency of publicly available educational resources in cardiology. Methods The frequently asked questions and associated online educational articles on common cardiovascular diagnostic and therapeutic interventions were investigated using publicly available data from the Google RankBrain machine learning algorithm after applying inclusion and exclusion criteria. Independent raters classified each question by Rothwell's classification and performed readability calculations. Results Collectively, 520 questions and articles were evaluated across 13 cardiac interventions, resulting in 3,120 readability scores. The articles came most frequently from academic institutions, followed by commercial sources. Most questions were classified as "Fact" (76.0%, n = 395), and questions regarding the "Technical Details" of each intervention were the most common subclassification (56.3%, n = 293). Conclusions Our data show that patients most often use online search queries to seek specific knowledge about each cardiovascular intervention rather than to form an evaluation of the intervention. Additionally, these online patient educational resources continue to fall short of grade-level reading recommendations.
Affiliation(s)
- Som P Singh
- University of Missouri-Kansas City School of Medicine, Kansas City, MO
- University of Texas Health Sciences Center at Houston, Houston, TX
- Aarya Ramprasad
- University of Missouri-Kansas City School of Medicine, Kansas City, MO
- Anh Luu
- University of Missouri-Kansas City School of Medicine, Kansas City, MO
- Rohma Zaidi
- University of Missouri-Kansas City School of Medicine, Kansas City, MO
- Zoya Siddiqui
- University of Missouri-Kansas City School of Medicine, Kansas City, MO
- Trung Pham
- University of Missouri-Kansas City School of Medicine, Kansas City, MO
7
Ahmed A, Jassim S, Karkuri A. Readability of Online Information on the Latarjet Procedure. Cureus 2023;15:e49184. PMID: 38024088; PMCID: PMC10662536; DOI: 10.7759/cureus.49184.
Abstract
Introduction A common complication of first-time or recurrent shoulder dislocation is bone loss at the humeral head and glenoid, and recurrent shoulder instability is often a result of bony defects in the glenoid. In the setting of glenoid bone loss, surgical intervention is generally required to restore stability. The Latarjet procedure is a challenging operation and, because of its complexity, may be associated with operative complications. It can be difficult to explain the procedure to patients in an easily comprehensible manner, which may leave them confused or overwhelmed with information. It is therefore important that the information available to patients is easily accessible and understandable to support adequate health literacy, defined as the ability to make health decisions in the context of everyday life. Methods The search engines Google and Bing were accessed on a single day in July 2023, searching the terms "Latarjet surgery" and "Latarjet procedure." For each term on both search engines, the first three pages of results were evaluated, yielding a total of 114 websites for review. Of these, 25 websites met the inclusion criteria and underwent further in-depth analysis with the online readability software WEB FX, which generated a Flesch Reading Ease Score (FRES) and a Reading Grade Level (RGL) for each website. Results The mean FRES was 50.3 (SD ±12.5), categorizing the material as 'fairly difficult to read.' The mean RGL was 8.12 (SD ±2.35), which exceeds the recommended target. Conclusion The results of this study demonstrate that the material available on the Internet about the Latarjet procedure is above the recommended readability levels for the majority of the population, in line with similar studies assessing the readability of online patient information. Based on these findings, physicians should provide patients with vetted information to facilitate a better understanding of the procedure, thereby enabling patients to make more informed decisions regarding their health.
Affiliation(s)
- Aathir Ahmed
- Orthopaedics, Royal College of Surgeons in Ireland, Dublin, IRL
- Sarmed Jassim
- Surgery, Royal College of Surgeons in Ireland, Dublin, IRL
- Ahmed Karkuri
- Orthopaedic Surgery, Sligo University Hospital, Sligo, IRL