1. Kunze KN, Gerhold C, Dave U, Abunnur N, Mamonov A, Nwachukwu BU, Verma NN, Chahla J. Large Language Model Use Cases in Health Care Research Are Redundant and Often Lack Appropriate Methodological Conduct: A Scoping Review and Call for Improved Practices. Arthroscopy 2025:S0749-8063(25)00253-1. [PMID: 40209833] [DOI: 10.1016/j.arthro.2025.03.066]
Abstract
PURPOSE To describe the current use cases of large language models (LLMs) in musculoskeletal medicine and to evaluate the methodologic conduct of these investigations in order to safeguard future implementation of LLMs in clinical research and identify key areas for methodological improvement.
METHODS A comprehensive literature search was performed in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines using the PubMed, Cochrane Library, and Embase databases to identify eligible studies. Included studies evaluated the use of LLMs within any realm of orthopaedic surgery, regardless of application in a clinical or educational setting. The Methodological Index for Non-Randomized Studies criteria were used to assess the quality of all included studies.
RESULTS In total, 114 studies published from 2022 to 2024 were identified. Extensive use case redundancy was observed, and 5 main categories of clinical applications of LLMs were identified: 48 studies (42.1%) assessed the ability to answer patient questions, 24 studies (21.1%) evaluated the ability to diagnose and manage medical conditions, 21 studies (18.4%) evaluated the ability to take orthopaedic examinations, 11 studies (9.6%) analyzed the ability to develop or evaluate patient educational materials, and 10 studies (8.8%) concerned other applications, such as generating images, generating discharge documents and clinical letters, writing scientific abstracts and manuscripts, and enhancing billing efficiency. General orthopaedics was the focus of most included studies (n = 39, 34.2%), followed by orthopaedic sports medicine (n = 18, 15.8%) and adult reconstructive surgery (n = 17, 14.9%). ChatGPT 3.5 was the most commonly used or evaluated LLM (n = 79, 69.2%), followed by ChatGPT 4.0 (n = 47, 41.2%). Methodological inconsistency was prevalent: 36 studies (31.6%) failed to disclose the exact prompts used, 64 (56.1%) failed to disclose the exact outputs generated by the LLM, and only 7 (6.1%) evaluated different prompting strategies to elicit desired outputs. No studies investigated how race or gender influenced model outputs.
CONCLUSIONS Among studies evaluating LLM health care use cases, the scope of clinical investigations was limited, with most studies showing redundant use cases. Methodological inconsistency was concerningly extensive because of infrequently reported prompting strategies, incomplete model specifications, failure to disclose exact model outputs, and limited attempts to address bias.
CLINICAL RELEVANCE A comprehensive understanding of current LLM use cases is critical to familiarize providers with the ways this technology may be used in clinical practice. As LLM health care applications transition from research to clinical integration, model transparency and trustworthiness are critical. The results of the current study suggest that guidance is urgently needed, with a focus on promoting appropriate methodological conduct and novel use cases to advance the field.
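For readers who want to reproduce this kind of descriptive synthesis, a minimal Python sketch of tallying use-case categories and reporting deficiencies from extracted study records follows. The records, field names, and values are hypothetical placeholders, not data extracted by the review.

```python
# Minimal sketch: tally use-case categories and reporting gaps across study records.
# The records and field names below are hypothetical, not the review's dataset.
from collections import Counter

studies = [
    {"use_case": "answer patient questions", "prompts_disclosed": True,  "outputs_disclosed": False},
    {"use_case": "take orthopaedic examinations", "prompts_disclosed": False, "outputs_disclosed": False},
    {"use_case": "answer patient questions", "prompts_disclosed": True,  "outputs_disclosed": True},
]

n = len(studies)
use_case_counts = Counter(s["use_case"] for s in studies)

print("Use-case distribution:")
for use_case, count in use_case_counts.most_common():
    print(f"  {use_case}: {count} ({100 * count / n:.1f}%)")

# Reporting deficiencies, expressed as counts and percentages of all studies.
missing_prompts = sum(not s["prompts_disclosed"] for s in studies)
missing_outputs = sum(not s["outputs_disclosed"] for s in studies)
print(f"Studies not disclosing exact prompts: {missing_prompts} ({100 * missing_prompts / n:.1f}%)")
print(f"Studies not disclosing exact outputs: {missing_outputs} ({100 * missing_outputs / n:.1f}%)")
```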
Affiliation(s)
- Kyle N Kunze: Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A.
- Udit Dave: Midwest Orthopaedics at Rush, Chicago, Illinois, U.S.A.
- Nezar Abunnur: Midwest Orthopaedics at Rush, Chicago, Illinois, U.S.A.
- Benedict U Nwachukwu: Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A.
- Jorge Chahla: Midwest Orthopaedics at Rush, Chicago, Illinois, U.S.A.
2. Milner JD, Quinn MS, Schmitt P, Hall RP, Bokshan S, Petit L, O’Donnell R, Marcaccio SE, DeFroda SF, Tabaddor RR, Owens BD. Performance of Artificial Intelligence in Addressing Questions Regarding Management of Osteochondritis Dissecans. Sports Health 2025:19417381251326549. [PMID: 40170344] [PMCID: PMC11966633] [DOI: 10.1177/19417381251326549]
Abstract
BACKGROUND Large language model (LLM)-based artificial intelligence (AI) chatbots, such as ChatGPT and Gemini, have become widespread sources of information. Few studies have evaluated LLM responses to questions about orthopaedic conditions, especially osteochondritis dissecans (OCD).
HYPOTHESIS ChatGPT and Gemini will generate accurate responses that align with American Academy of Orthopaedic Surgeons (AAOS) clinical practice guidelines.
STUDY DESIGN Cohort study.
LEVEL OF EVIDENCE Level 2.
METHODS LLM prompts were created based on AAOS clinical guidelines on OCD diagnosis and treatment, and responses from ChatGPT and Gemini were collected. Seven fellowship-trained orthopaedic surgeons evaluated the LLM responses on a 5-point Likert scale across 6 categories: relevance, accuracy, clarity, completeness, evidence-based, and consistency.
RESULTS ChatGPT and Gemini exhibited strong performance across all criteria. ChatGPT's mean score was highest for clarity (4.771 ± 0.141 [mean ± SD]). Gemini scored highest for relevance and accuracy (4.286 ± 0.296 and 4.286 ± 0.273, respectively). For both LLMs, the lowest scores were for evidence-based responses (ChatGPT, 3.857 ± 0.352; Gemini, 3.743 ± 0.353). For all other categories, ChatGPT's mean scores were higher than Gemini's. The consistency of responses between the 2 LLMs was rated at an overall mean of 3.486 ± 0.371. Inter-rater reliability ranged from 0.4 to 0.67 (mean, 0.59) and was highest (0.67) in the accuracy category and lowest (0.4) in the consistency category.
CONCLUSION The performance of both LLMs highlights their potential to provide clinically relevant and accurate answers to questions regarding the diagnosis and treatment of OCD and suggests that ChatGPT may be a better model for this purpose than Gemini. Further evaluation of LLM-generated information on other orthopaedic procedures and conditions may be necessary before LLMs can be recommended as an accurate source of orthopaedic information.
CLINICAL RELEVANCE Little is known about the ability of AI to provide answers regarding OCD.
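The category scores above are means and standard deviations of 5-point Likert ratings across seven surgeon raters. A minimal Python sketch of that summary follows; the rating matrix is a hypothetical placeholder, not the study's data.

```python
# Minimal sketch: per-category mean and sample SD of Likert ratings across raters.
# The rating values below are illustrative only, not the study's data.
import statistics

CATEGORIES = ["relevance", "accuracy", "clarity", "completeness", "evidence-based", "consistency"]

# Rows = 7 raters, columns = the 6 criteria above (hypothetical 5-point Likert scores).
ratings = [
    [5, 4, 5, 4, 4, 3],
    [4, 4, 5, 4, 4, 4],
    [5, 5, 5, 4, 3, 3],
    [4, 4, 4, 5, 4, 4],
    [4, 4, 5, 4, 4, 3],
    [5, 4, 5, 4, 4, 4],
    [4, 5, 5, 4, 4, 3],
]

for j, category in enumerate(CATEGORIES):
    scores = [row[j] for row in ratings]
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation across raters
    print(f"{category:>14}: {mean:.3f} ± {sd:.3f}")
```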
Affiliation(s)
- John D. Milner: Department of Orthopaedic Surgery, Brown University, Warren Alpert Medical School, Providence, Rhode Island
- Matthew S. Quinn: Department of Orthopaedic Surgery, Brown University, Warren Alpert Medical School, Providence, Rhode Island
- Phillip Schmitt: Department of Orthopaedic Surgery, Brown University, Warren Alpert Medical School, Providence, Rhode Island
- Rigel P. Hall: Creighton University School of Medicine, Phoenix, Arizona
- Logan Petit: Connecticut Orthopaedics, Hamden, Connecticut
- Ryan O’Donnell: Department of Orthopedic Surgery, St Luke’s University Health Network, Bethlehem, Pennsylvania
- Stephen E. Marcaccio: Department of Orthopaedic Surgery, Brown University, Warren Alpert Medical School, Providence, Rhode Island
- Steven F. DeFroda: Department of Orthopedic Surgery, Missouri Orthopedic Institute, University of Missouri, Columbia, Missouri
- Ramin R. Tabaddor: Department of Orthopaedic Surgery, Brown University, Warren Alpert Medical School, Providence, Rhode Island
- Brett D. Owens: Department of Orthopaedic Surgery, Brown University, Warren Alpert Medical School, Providence, Rhode Island
3. Henry JP, Tamer P, Scuderi GR. Internet-Based Patient Portals Increase Patient Connectivity Following Total Knee Arthroplasty. J Knee Surg 2025. [PMID: 40169134] [DOI: 10.1055/a-2542-7427]
Abstract
Many healthcare-related processes have been substantially transformed by the internet since the turn of the century. This technological revolution has fostered a fundamental shift from medical paternalism to patient autonomy and empowerment via a "patient-centric approach." Patient portals, or internet-enabled access to an electronic medical record, permit patients to access, manage, and share their health-related information. Patient connectivity following total knee arthroplasty (TKA) has the potential to positively influence overall outcomes, patient experience, and satisfaction. The purpose of this review was to understand current trends in patient portal usage, modalities of connectivity, and their implications following TKA. A systematic literature review was performed by searching PubMed and Google Scholar; articles specific to portal usage and connectivity after TKA or total joint arthroplasty were identified for further review. Patient portals and internet-based digital connectivity platforms enable physicians, team members, and patients to communicate in the perioperative period both directly and indirectly. Communication can occur through web-based patient portals, messaging services/apps, preprogrammed alerts (e.g., mobile applications or wearable devices), audio mediums, or videoconferencing. The spectrum and utilization of available patient engagement platforms continue to expand as the importance and implications of patient engagement and connectivity are elucidated. Connectivity through patient portals or other mediums will have an expanding role in all aspects of orthopedic surgery, patient care, and engagement, including preoperative education, postoperative rehabilitation, and, perhaps most importantly, collection of outcome measures. The level of evidence is V (expert opinion).
Affiliation(s)
- James P Henry: Department of Orthopaedic Surgery, Huntington Hospital, Northwell Health, Huntington, New York
- Pierre Tamer: Department of Orthopaedic Surgery, Lenox Hill Hospital, Northwell Health, New York, New York
- Giles R Scuderi: Department of Orthopaedic Surgery, Lenox Hill Hospital, Northwell Health, New York, New York
4. Rodriguez HC, Rust BD, Roche MW, Gupta A. Artificial intelligence and machine learning in knee arthroplasty. Knee 2025; 54:28-49. [PMID: 40022960] [DOI: 10.1016/j.knee.2025.02.014]
Abstract
BACKGROUND Artificial intelligence (AI) and its subset, machine learning (ML), have significantly impacted clinical medicine, particularly knee arthroplasty (KA). These technologies use algorithms for tasks such as predictive analytics and image recognition, improving preoperative planning, intraoperative navigation, and anticipation of postoperative complications. This systematic review presents the clinical implications of AI-driven tools in total and unicompartmental KA, focusing on enhancing patient outcomes and operational efficiency.
METHODS A systematic search was conducted across multiple databases, including the Cochrane Central Register of Controlled Trials, Embase, OVID Medline, PubMed, and Web of Science, following the PRISMA guidelines, for English-language studies published through March 2024. Inclusion criteria targeted studies in adult humans, without geographical restrictions, specifically related to total or unicompartmental KA.
RESULTS A total of 153 relevant studies were identified, covering various aspects of ML application in KA. Study topics included imaging modalities (n = 28), postoperative primary KA complications (n = 26), inpatient status (length of stay, readmissions, and cost) (n = 24), implant configuration (n = 14), revision (n = 12), patient-reported outcome measures (PROMs) (n = 11), function (n = 11), procedural communication (n = 8), total knee arthroplasty/unicompartmental knee arthroplasty prediction (n = 6), outpatient status (n = 4), perioperative efficiency (n = 4), patient satisfaction (n = 3), and opioid usage (n = 3). A total of 66 ML models were described, with 48.7% of studies using multiple approaches.
CONCLUSION This review assesses ML applications in knee arthroplasty, highlighting their potential to improve patient outcomes. While current algorithms and AI show promise, our findings suggest areas for improvement in predictive performance before widespread clinical adoption.
Affiliation(s)
- Hugo C Rodriguez: Larkin Community Hospital, Department of Orthopaedic Surgery, South Miami, FL, USA; Hospital for Special Surgery, West Palm Beach, FL, USA
- Brandon D Rust: Nova Southeastern University, Dr. Kiran C. Patel College of Osteopathic Medicine, Fort Lauderdale, FL, USA
5. Nwachukwu BU, Varady NH, Allen AA, Dines JS, Altchek DW, Williams RJ, Kunze KN. Currently Available Large Language Models Do Not Provide Musculoskeletal Treatment Recommendations That Are Concordant With Evidence-Based Clinical Practice Guidelines. Arthroscopy 2025; 41:263-275.e6. [PMID: 39173690] [DOI: 10.1016/j.arthro.2024.07.040]
Abstract
PURPOSE To determine whether several leading, commercially available large language models (LLMs) provide treatment recommendations concordant with evidence-based clinical practice guidelines (CPGs) developed by the American Academy of Orthopaedic Surgeons (AAOS).
METHODS All CPGs concerning the management of rotator cuff tears (n = 33) and anterior cruciate ligament injuries (n = 15) were extracted from the AAOS. Treatment recommendations from Chat-Generative Pretrained Transformer version 4 (ChatGPT-4), Gemini, Mistral-7B, and Claude-3 were graded by 2 blinded physicians as being concordant, discordant, or indeterminate (i.e., neutral response without definitive recommendation) with respect to AAOS CPGs. The overall concordance between LLM and AAOS recommendations was quantified, and the comparative overall concordance of recommendations among the 4 LLMs was evaluated with the Fisher exact test.
RESULTS Overall, 135 responses (70.3%) were concordant, 43 (22.4%) were indeterminate, and 14 (7.3%) were discordant. Inter-rater reliability for concordance classification was excellent (κ = 0.92). Concordance with AAOS CPGs was most frequently observed with ChatGPT-4 (n = 38, 79.2%) and least frequently observed with Mistral-7B (n = 28, 58.3%). Indeterminate recommendations were most frequently observed with Mistral-7B (n = 17, 35.4%) and least frequently observed with Claude-3 (n = 8, 6.7%). Discordant recommendations were most frequently observed with Gemini (n = 6, 12.5%) and least frequently observed with ChatGPT-4 (n = 1, 2.1%). Overall, no statistically significant difference in concordant recommendations was observed across LLMs (P = .12). Of all recommendations, only 20 (10.4%) were transparent and provided references with full bibliographic details or links to specific peer-reviewed content to support recommendations.
CONCLUSIONS Among leading commercially available LLMs, more than 1-in-4 recommendations concerning the evaluation and management of rotator cuff and anterior cruciate ligament injuries do not reflect current evidence-based CPGs. Although ChatGPT-4 showed the highest performance, clinically significant rates of recommendations without concordance or supporting evidence were observed. Only 10% of responses by LLMs were transparent, precluding users from fully interpreting the sources from which recommendations were provided.
CLINICAL RELEVANCE Although leading LLMs generally provide recommendations concordant with CPGs, a substantial error rate exists, and the proportion of recommendations that do not align with these CPGs suggests that LLMs are not trustworthy clinical support tools at this time. Each off-the-shelf, closed-source LLM has strengths and weaknesses. Future research should evaluate and compare multiple LLMs to avoid bias associated with narrow evaluation of few models as observed in the current literature.
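The analysis above rests on two standard statistics: Cohen's kappa for inter-rater agreement on concordance grades and the Fisher exact test for comparing models. The minimal Python sketch below illustrates both; the reviewer labels are hypothetical, and the 2 × 2 comparison of two models (counts taken from the proportions reported above) is a simplification of the study's analysis across all 4 LLMs.

```python
# Minimal sketch: Cohen's kappa for two reviewers' concordance grades, plus a
# 2x2 Fisher exact comparison of two models. Reviewer labels are hypothetical.
from scipy.stats import fisher_exact

def cohen_kappa(a, b):
    """Observed vs. chance-expected agreement for two raters over the same items."""
    labels = set(a) | set(b)
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    p_expected = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    return (p_observed - p_expected) / (1 - p_expected)

reviewer_1 = ["concordant", "concordant", "discordant", "indeterminate", "concordant", "concordant"]
reviewer_2 = ["concordant", "concordant", "discordant", "concordant", "concordant", "concordant"]
print(f"Inter-rater reliability (Cohen's kappa): {cohen_kappa(reviewer_1, reviewer_2):.2f}")

# Concordant vs. non-concordant counts for two models out of 48 recommendations each
# (illustrative pairwise comparison only).
table = [[38, 10],   # model A: concordant, non-concordant
         [28, 20]]   # model B: concordant, non-concordant
odds_ratio, p_value = fisher_exact(table)
print(f"Fisher exact test: OR = {odds_ratio:.2f}, p = {p_value:.3f}")
```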
Affiliation(s)
- Benedict U Nwachukwu: Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A.
- Nathan H Varady: Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A.
- Answorth A Allen: Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A.
- Joshua S Dines: Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A.
- David W Altchek: Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A.
- Riley J Williams: Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A.
- Kyle N Kunze: Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A.
6. Benaim EH, O’Rourke SP, Dillon MT. What Do People Want to Know About Cochlear Implants: A Google Analytic Study. Laryngoscope 2025; 135:840-847. [PMID: 39192469] [PMCID: PMC11729566] [DOI: 10.1002/lary.31741]
Abstract
OBJECTIVE Identify the questions most frequently asked online about cochlear implants (CIs) and assess the readability and quality of the content.
METHODS A Google search engine observational study was conducted via a search engine optimization (SEO) tool. The SEO tool listed the questions generated by Google's "People Also Ask" (PAA) feature for the search queries "cochlear implant" and "cochlear implant surgery." The top 50 PAA questions for each query were conceptually classified. Sourced websites were evaluated for readability, transparency and information quality, and ability to answer the question. Readability and accuracy in answering questions were also compared with the responses from ChatGPT 3.5.
RESULTS The PAA questions were commonly related to technical details (21%), surgical factors (18%), and postoperative experiences (12%). Sourced websites were mainly from academic institutions, followed by commercial companies. Across all website types, readability on average did not meet the recommended standard for health-related patient education materials; only two websites were at or below the 8th-grade level. Responses by ChatGPT had significantly poorer readability than the websites (p < 0.001). The online resources did not differ significantly in the percentage of questions answered accurately (websites: 78%; ChatGPT: 85%; p = 0.136).
CONCLUSIONS The most searched topics were technical details about devices, surgical factors, and the postoperative experience. Most websites did not meet the ideal criteria of readability, quality, and credibility for patient education. These results highlight potential knowledge gaps for patients, deficits in current online education materials, and possible tools to better support CI candidate decision-making.
LEVEL OF EVIDENCE NA.
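Readability of patient education text is commonly estimated with grade-level formulas such as the Flesch-Kincaid grade level. The study does not specify which instrument it used, so the minimal Python sketch below is only a generic illustration, with a crude vowel-group syllable heuristic and a made-up sample passage.

```python
# Minimal sketch: Flesch-Kincaid grade-level estimate for a block of text.
# The syllable counter is a rough heuristic, and the sample passage is invented.
import re

def count_syllables(word: str) -> int:
    # Approximate syllables as runs of vowels; drop a common silent trailing 'e'.
    word = word.lower()
    if word.endswith("e") and not word.endswith(("le", "ee")):
        word = word[:-1]
    return max(1, len(re.findall(r"[aeiouy]+", word)))

def flesch_kincaid_grade(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch-Kincaid grade-level formula.
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

sample = ("A cochlear implant is a small device that helps some people hear. "
          "A surgeon places part of it under the skin behind the ear.")
print(f"Estimated grade level: {flesch_kincaid_grade(sample):.1f}")
```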
Affiliation(s)
- Ezer H. Benaim: Department of Otolaryngology/Head and Neck Surgery, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, U.S.A.
- Samuel P. O’Rourke: Department of Otolaryngology/Head and Neck Surgery, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, U.S.A.
- Margaret T. Dillon: Department of Otolaryngology/Head and Neck Surgery, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, U.S.A.
7. Zhang C, Liu S, Zhou X, Zhou S, Tian Y, Wang S, Xu N, Li W. Examining the Role of Large Language Models in Orthopedics: Systematic Review. J Med Internet Res 2024; 26:e59607. [PMID: 39546795] [DOI: 10.2196/59607]
Abstract
BACKGROUND Large language models (LLMs) can understand natural language and generate corresponding text, images, and even videos based on prompts, which holds great potential in medical scenarios. Orthopedics is a major branch of medicine, and orthopedic diseases carry a substantial socioeconomic burden that could be alleviated by the application of LLMs. Several pioneers in orthopedics have conducted research on LLMs across various subspecialties to explore their performance in addressing different issues. However, there are currently few reviews and summaries of these studies, and a systematic summary of existing research is absent.
OBJECTIVE The objective of this review was to comprehensively summarize research findings on the application of LLMs in the field of orthopedics and explore the potential opportunities and challenges.
METHODS PubMed, Embase, and Cochrane Library databases were searched from January 1, 2014, to February 22, 2024, with the language limited to English. The search terms, which included variants of "large language model," "generative artificial intelligence," "ChatGPT," and "orthopaedics," were divided into 2 categories: large language model and orthopedics. After the search, study selection was conducted according to the inclusion and exclusion criteria. The quality of the included studies was assessed using the revised Cochrane risk-of-bias tool for randomized trials and the CONSORT-AI (Consolidated Standards of Reporting Trials-Artificial Intelligence) guidance. Data extraction and synthesis were conducted after the quality assessment.
RESULTS A total of 68 studies were selected. The application of LLMs in orthopedics involved clinical practice, education, research, and management. Of these 68 studies, 47 (69%) focused on clinical practice, 12 (18%) addressed orthopedic education, 8 (12%) were related to scientific research, and 1 (1%) pertained to management. Only 8 (12%) of the 68 studies recruited patients, and only 1 (1%) was a high-quality randomized controlled trial. ChatGPT was the most commonly mentioned LLM tool. There was considerable heterogeneity in how the different studies defined, measured, and evaluated LLM performance. For diagnostic tasks alone, accuracy ranged from 55% to 93%. For disease classification tasks, the accuracy of ChatGPT with GPT-4 ranged from 2% to 100%. For questions from orthopedic examinations, scores ranged from 45% to 73.6%, owing to differences in models and test selections.
CONCLUSIONS LLMs cannot replace orthopedic professionals in the short term. However, using LLMs as copilots could be a potential approach to effectively enhance work efficiency at present. More high-quality clinical trials are needed in the future, aiming to identify optimal applications of LLMs and advance orthopedics toward higher efficiency and precision.
Affiliation(s)
- Cheng Zhang: Department of Orthopaedics, Peking University Third Hospital, Beijing, China; Engineering Research Center of Bone and Joint Precision Medicine, Ministry of Education, Beijing, China; Beijing Key Laboratory of Spinal Disease Research, Beijing, China
- Shanshan Liu: Department of Orthopaedics, Peking University Third Hospital, Beijing, China; Engineering Research Center of Bone and Joint Precision Medicine, Ministry of Education, Beijing, China; Beijing Key Laboratory of Spinal Disease Research, Beijing, China
- Xingyu Zhou: Peking University Health Science Center, Beijing, China
- Siyu Zhou: Department of Orthopaedics, Peking University Third Hospital, Beijing, China; Engineering Research Center of Bone and Joint Precision Medicine, Ministry of Education, Beijing, China; Beijing Key Laboratory of Spinal Disease Research, Beijing, China
- Yinglun Tian: Department of Orthopaedics, Peking University Third Hospital, Beijing, China; Engineering Research Center of Bone and Joint Precision Medicine, Ministry of Education, Beijing, China; Beijing Key Laboratory of Spinal Disease Research, Beijing, China
- Shenglin Wang: Department of Orthopaedics, Peking University Third Hospital, Beijing, China; Engineering Research Center of Bone and Joint Precision Medicine, Ministry of Education, Beijing, China; Beijing Key Laboratory of Spinal Disease Research, Beijing, China
- Nanfang Xu: Department of Orthopaedics, Peking University Third Hospital, Beijing, China; Engineering Research Center of Bone and Joint Precision Medicine, Ministry of Education, Beijing, China; Beijing Key Laboratory of Spinal Disease Research, Beijing, China
- Weishi Li: Department of Orthopaedics, Peking University Third Hospital, Beijing, China; Engineering Research Center of Bone and Joint Precision Medicine, Ministry of Education, Beijing, China; Beijing Key Laboratory of Spinal Disease Research, Beijing, China
8. Aydin S, Karabacak M, Vlachos V, Margetis K. Large language models in patient education: a scoping review of applications in medicine. Front Med (Lausanne) 2024; 11:1477898. [PMID: 39534227] [PMCID: PMC11554522] [DOI: 10.3389/fmed.2024.1477898]
Abstract
Introduction Large Language Models (LLMs) are sophisticated algorithms that analyze and generate vast amounts of textual data, mimicking human communication. Notable LLMs include GPT-4o by OpenAI, Claude 3.5 Sonnet by Anthropic, and Gemini by Google. This scoping review aims to synthesize the current applications and potential uses of LLMs in patient education and engagement.
Materials and methods Following the PRISMA-ScR checklist and the methodologies of Arksey, O'Malley, and Levac, we conducted a scoping review. We searched PubMed in June 2024, using keywords and MeSH terms related to LLMs and patient education. Two authors conducted the initial screening, and discrepancies were resolved by consensus. We employed thematic analysis to address our primary research question.
Results The review identified 201 studies, predominantly from the United States (58.2%). Six themes emerged: generating patient education materials, interpreting medical information, providing lifestyle recommendations, supporting customized medication use, offering perioperative care instructions, and optimizing doctor-patient interaction. LLMs were found to provide accurate responses to patient queries, enhance existing educational materials, and translate medical information into patient-friendly language. However, challenges such as readability, accuracy, and potential biases were noted.
Discussion LLMs demonstrate significant potential in patient education and engagement by creating accessible educational materials, interpreting complex medical information, and enhancing communication between patients and healthcare providers. Nonetheless, issues related to the accuracy and readability of LLM-generated content, as well as ethical concerns, require further research and development. Future studies should focus on improving LLMs and ensuring content reliability while addressing ethical considerations.
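A keyword and MeSH search like the one described can be scripted against the public NCBI E-utilities esearch endpoint. The minimal Python sketch below is a hypothetical illustration; the query string is not the authors' actual search strategy.

```python
# Minimal sketch: run an illustrative PubMed query through NCBI E-utilities esearch.
# The query string is a made-up example, not the review's search strategy.
import requests

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
query = ('("large language model" OR "ChatGPT") AND '
         '("patient education"[MeSH Terms] OR "patient education")')

response = requests.get(
    ESEARCH_URL,
    params={"db": "pubmed", "term": query, "retmode": "json", "retmax": 20},
    timeout=30,
)
response.raise_for_status()
result = response.json()["esearchresult"]
print(f"Records matched: {result['count']}")
print("First PMIDs:", ", ".join(result["idlist"]))
```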
Affiliation(s)
- Serhat Aydin: School of Medicine, Koç University, Istanbul, Türkiye
- Mert Karabacak: Department of Neurosurgery, Mount Sinai Health System, New York, NY, United States
- Victoria Vlachos: College of Human Ecology, Cornell University, Ithaca, NY, United States
9. Hurley ET, Crook BS, Dickens JF. Editorial Commentary: At Present, ChatGPT Cannot Be Relied Upon to Answer Patient Questions and Requires Physician Expertise to Interpret Answers for Patients. Arthroscopy 2024; 40:2080-2082. [PMID: 38484923] [DOI: 10.1016/j.arthro.2024.02.039]
Abstract
ChatGPT is designed to provide accurate and reliable information to the best of its abilities based on the data input and knowledge available. Thus, ChatGPT is being studied as a patient information tool. This artificial intelligence (AI) tool has been shown to frequently provide technically correct information, but with limitations. ChatGPT provides different answers to similar questions depending on the prompts, and patients may not have the expertise in prompting ChatGPT to elicit the best answer. (Prompting large language models is a skill that can be learned and improved.) Of greater concern, ChatGPT fails to provide sources or references for its answers. At present, ChatGPT cannot be relied upon to address patient questions; in the future, ChatGPT will improve. Today, AI requires physician expertise to interpret AI answers for patients.