1
Lassalle L, Regnard NE, Ventre J, Marty V, Clovis L, Zhang Z, Nitche N, Guermazi A, Laredo JD. Automated weight-bearing foot measurements using an artificial intelligence-based software. Skeletal Radiol 2025; 54:229-241. [PMID: 38880791] [DOI: 10.1007/s00256-024-04726-z]
Abstract
OBJECTIVE To assess the accuracy of an artificial intelligence (AI) software (BoneMetrics, Gleamer) in performing automated measurements on weight-bearing forefoot and lateral foot radiographs.
METHODS Consecutive forefoot and lateral foot radiographs were retrospectively collected from three imaging institutions. Two senior musculoskeletal radiologists independently annotated key points to measure the hallux valgus, first-second metatarsal, and first-fifth metatarsal angles on forefoot radiographs and the talus-first metatarsal, medial arch, and calcaneal inclination angles on lateral foot radiographs. The ground truth was defined as the mean of their measurements. Statistical analysis included the mean absolute error (MAE) and Bland-Altman bias between the ground truth and the AI prediction, and the intraclass correlation coefficient (ICC) between the manual ratings.
RESULTS Eighty forefoot radiographs were included (53 ± 17 years, 50 women) and 26 were excluded. Ninety-seven lateral foot radiographs were included (51 ± 20 years, 46 women) and 21 were excluded. MAEs for the hallux valgus, first-second metatarsal, and first-fifth metatarsal angles on forefoot radiographs were respectively 1.2° (95% CI [1; 1.4], bias = -0.04°, ICC = 0.98), 0.7° (95% CI [0.6; 0.9], bias = -0.19°, ICC = 0.91), and 0.9° (95% CI [0.7; 1.1], bias = 0.44°, ICC = 0.96). MAEs for the talus-first metatarsal, medial arch, and calcaneal inclination angles on lateral foot radiographs were respectively 3.9° (95% CI [3.4; 4.5], bias = 0.61°, ICC = 0.88), 1.5° (95% CI [1.2; 1.8], bias = -0.18°, ICC = 0.95), and 1.0° (95% CI [0.8; 1.2], bias = 0.74°, ICC = 0.99). Bias and MAE between the ground truth and the AI prediction were low across all measurements. The ICC between the two manual ratings was excellent, except for the talus-first metatarsal angle.
CONCLUSION AI demonstrated potential for accurate and automated measurements on weight-bearing forefoot and lateral foot radiographs.
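For readers unfamiliar with the agreement statistics reported above, the following Python sketch shows how MAE and Bland-Altman bias (with 95% limits of agreement) are computed from paired readings. This is illustrative only, not the study's code; the angle values below are invented:

```python
import math

def mae(truth, pred):
    """Mean absolute error between paired measurements (degrees)."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)

def bland_altman(truth, pred):
    """Bland-Altman bias and 95% limits of agreement for paired measurements."""
    diffs = [p - t for t, p in zip(truth, pred)]
    bias = sum(diffs) / len(diffs)
    sd = math.sqrt(sum((d - bias) ** 2 for d in diffs) / (len(diffs) - 1))
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Invented hallux valgus angles (degrees): ground truth vs AI prediction.
truth = [12.0, 25.5, 18.2, 31.0, 9.8]
pred = [12.5, 24.9, 18.0, 31.8, 10.1]
error = mae(truth, pred)  # ≈ 0.48° for these invented values
bias, lo, hi = bland_altman(truth, pred)
```

The study's ICC between raters would typically be computed with a dedicated routine (e.g. `pingouin.intraclass_corr` in Python), omitted here for brevity.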
Affiliation(s)
- Louis Lassalle
- Réseau Imagerie Sud Francilien, Lieusaint, France.
- Clinique du Mousseau, Ramsay Santé, Evry, France.
- Gleamer, Paris, France.
- Nor-Eddine Regnard
- Réseau Imagerie Sud Francilien, Lieusaint, France
- Clinique du Mousseau, Ramsay Santé, Evry, France
- Gleamer, Paris, France
- Ali Guermazi
- Department of Radiology, Boston University School of Medicine, Boston, MA, USA
- Jean-Denis Laredo
- Gleamer, Paris, France
- Service de Radiologie, Institut Mutualiste Montsouris, Paris, France
- Laboratoire (B3OA) de Biomécanique Et Biomatériaux Ostéo-Articulaires, Faculté de Médecine Paris-Cité, Paris, France
- Professeur Émérite d'Imagerie Médicale, Université Paris-Cité, Paris, France
2
Jaques A, Abdelghafour K, Perkins O, Nuttall H, Haidar O, Johal K. A Study of Orthopedic Patient Leaflets and Readability of AI-Generated Text in Foot and Ankle Surgery (SOLE-AI). Cureus 2024; 16:e75826. [PMID: 39822447] [PMCID: PMC11737805] [DOI: 10.7759/cureus.75826]
Abstract
Introduction The internet age has broadened the horizons of modern medicine, and the ever-increasing scope of artificial intelligence (AI) has made information about healthcare, common pathologies, and available treatment options far more accessible to the wider population. Patient autonomy relies on clear, accurate, and user-friendly information to give informed consent to an intervention. Our paper aims to assess the quality, readability, and accuracy of readily available AI-produced information relating to common foot and ankle procedures.
Materials and methods A retrospective qualitative analysis of procedure-specific information relating to three common foot and ankle orthopedic procedures (ankle arthroscopy, ankle arthrodesis/fusion, and gastrocnemius lengthening) was undertaken. Patient information leaflets (PILs) created by the British Orthopaedic Foot and Ankle Society (BOFAS) were compared with ChatGPT responses for readability, quality, and accuracy of information. Four language tools were used to assess readability: the Flesch-Kincaid reading ease (FKRE) score, the Flesch-Kincaid grade level (FKGL), the Gunning fog score (GFS), and the simple measure of gobbledygook (SMOG) index. Quality and accuracy were determined using the DISCERN tool by five independent assessors.
Results PILs produced by AI had significantly lower FKRE scores than the BOFAS leaflets (40.4, SD ±7.69, vs 91.9, SD ±2.24; p ≤ 0.0001), indicating poor readability of AI-generated text. DISCERN scoring highlighted a statistically significant improvement in accuracy and quality of human-generated information across two PILs, with a mean score of 55.06 compared with 46.8. FKGL scoring indicated that the school grade required to understand the AI responses was consistently higher than for the information leaflets (11.7 vs 1.1; p ≤ 0.0001). The number of years in education required to understand the ChatGPT-produced PILs was also significantly higher by both the GFS (14.46 vs 2.0 years; p < 0.0001) and the SMOG index (11.0 vs 3.06 years; p < 0.0001).
Conclusion Despite significant advances in the implementation of AI in surgery, AI-generated PILs for common foot and ankle surgical procedures currently lack sufficient quality, depth, and readability; this risks leaving patients misinformed about upcoming procedures. We conclude that information from trusted professional bodies should be used to complement a clinical consultation, as there is currently insufficient evidence to support the routine incorporation of AI-generated information into the consent process.
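The four readability indices used in the study are all simple formulas over sentence, word, and syllable counts. The sketch below is my own illustration, not the authors' tooling: the vowel-group syllable counter is a crude heuristic (real implementations such as the textstat library count syllables more carefully), so the scores it produces are approximate:

```python
import re

def syllables(word):
    """Crude syllable estimate: count runs of vowels (including y)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text):
    """Approximate FKRE, FKGL, Gunning fog, and SMOG scores for a text."""
    sents = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n = len(words)
    syl = sum(syllables(w) for w in words)
    complex_words = sum(1 for w in words if syllables(w) >= 3)
    return {
        "FKRE": 206.835 - 1.015 * (n / sents) - 84.6 * (syl / n),
        "FKGL": 0.39 * (n / sents) + 11.8 * (syl / n) - 15.59,
        "GFS": 0.4 * ((n / sents) + 100 * complex_words / n),
        "SMOG": 1.0430 * (complex_words * 30 / sents) ** 0.5 + 3.1291,
    }

# Short, plain sentences score as highly readable (high FKRE, low grade level).
scores = readability("The doctor looks at your ankle. Then a small camera "
                     "goes inside the joint. You go home the same day.")
```

This makes the study's finding concrete: long sentences and polysyllabic words (typical of ChatGPT prose) drive FKRE down and the three grade-level indices up.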
Affiliation(s)
- Karim Abdelghafour
- Trauma and Orthopedics, Lister Hospital, Stevenage, GBR
- Trauma and Orthopedics, Cairo University Hospitals, Cairo, EGY
- Helen Nuttall
- Trauma and Orthopedics, Lister Hospital, Stevenage, GBR
- Omar Haidar
- Vascular Surgery, Lister Hospital, Stevenage, GBR
- General Surgery, Imperial College Healthcare NHS Trust, London, GBR
3
Kenig N, Monton Echeverria J, Muntaner Vives A. Artificial Intelligence in Surgery: A Systematic Review of Use and Validation. J Clin Med 2024; 13:7108. [PMID: 39685566] [DOI: 10.3390/jcm13237108]
Abstract
Background: Artificial Intelligence (AI) holds promise for transforming healthcare, with AI models gaining increasing clinical use in surgery. However, new AI models are developed without established standards for their validation and use. Before AI can be widely adopted, it is crucial to ensure these models are both accurate and safe for patients. Without proper validation, there is a risk of integrating AI models into practice without sufficient evidence of their safety and accuracy, potentially leading to suboptimal patient outcomes. In this work, we review the current use and validation methods of AI models in clinical surgical settings and propose a novel classification system.
Methods: A systematic review was conducted in PubMed and Cochrane using the keywords "validation", "artificial intelligence", and "surgery", following PRISMA guidelines.
Results: The search yielded a total of 7627 articles, of which 102 were included for data extraction, encompassing 2,837,211 patients. A validation classification system named the Surgical Validation Score (SURVAS) was developed. The primary applications of the models were risk assessment and decision-making in the preoperative setting. Validation methods were ranked as high evidence in only 45% of studies, and only 14% of the studies provided publicly available datasets.
Conclusions: AI has significant applications in surgery, but validation quality remains suboptimal, and public data availability is limited. Current AI applications are mainly focused on preoperative risk assessment and are suggested to improve decision-making. Classification systems such as SURVAS can help clinicians confirm the degree of validity of AI models before their application in practice.
Affiliation(s)
- Nitzan Kenig
- Department of Plastic Surgery, Quironsalud Palmaplanas Hospital, 07010 Palma, Spain
- Aina Muntaner Vives
- Department of Otolaryngology, Son Llatzer University Hospital, 07198 Palma, Spain
4
Mehta A, El-Najjar D, Howell H, Gupta P, Arciero E, Marigi EM, Parisien RL, Trofa DP. Artificial Intelligence Models Are Limited in Predicting Clinical Outcomes Following Hip Arthroscopy: A Systematic Review. JBJS Rev 2024; 12:01874474-202408000-00012. [PMID: 39172870] [DOI: 10.2106/jbjs.rvw.24.00087]
Abstract
BACKGROUND Hip arthroscopy has seen a significant surge in utilization, but complications remain, and optimal functional outcomes are not guaranteed. Artificial intelligence (AI) has emerged as an effective supportive decision-making tool for surgeons. The purpose of this systematic review was to characterize the outcomes, performance, and validity (generalizability) of AI-based prediction models for hip arthroscopy in the current literature.
METHODS Two reviewers independently completed structured searches using the PubMed/MEDLINE and Embase databases on August 10, 2022. The search query used the following terms: (artificial intelligence OR machine learning OR deep learning) AND (hip arthroscopy). Studies that investigated AI-based risk prediction models in hip arthroscopy were included. The primary outcomes of interest were the variable(s) predicted by the models, the best model performance achieved (primarily area under the curve, but also accuracy, etc.), and whether the model(s) had been externally validated (generalizable).
RESULTS Seventy-seven studies were identified from the primary search, and 13 were included in the final analysis. Six studies (n = 6,568) applied AI to predicting achievement of the minimal clinically important difference for various patient-reported outcome measures, such as the visual analog scale and the International Hip Outcome Tool 12-Item Questionnaire, with area under the receiver operating characteristic curve (AUC) values ranging from 0.572 to 0.94. Three studies used AI to predict repeat hip surgery, with AUC values between 0.67 and 0.848. Four studies focused on predicting other risks, such as prolonged postoperative opioid use, with AUC values ranging from 0.71 to 0.76. None of the 13 studies assessed the generalizability of their models through external validation.
CONCLUSION AI is being investigated for predicting clinical outcomes after hip arthroscopy. However, the performance of AI models varies widely, with AUC values ranging from 0.572 to 0.94. Critically, none of the models have undergone external validation, limiting their clinical applicability. Further research is needed to improve model performance and ensure generalizability before these tools can be reliably integrated into patient care.
LEVEL OF EVIDENCE Level IV. See Instructions for Authors for a complete description of levels of evidence.
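Since AUC is the yardstick throughout this review, it is worth recalling its pairwise interpretation: the probability that a randomly chosen positive case receives a higher predicted risk than a randomly chosen negative case. A minimal sketch, with invented labels and risk scores (not data from any of the reviewed models):

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney pairwise definition:
    the fraction of (positive, negative) pairs ranked correctly, ties = 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented example: 1 = revision surgery occurred, risks = model output.
labels = [0, 0, 1, 0, 1, 1]
risks = [0.10, 0.35, 0.40, 0.55, 0.80, 0.90]
result = auc(labels, risks)  # 8/9 here: one (pos, neg) pair is mis-ranked
```

An AUC of 0.5 means the model ranks cases no better than chance, which puts the review's low end (0.572) in perspective.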
Affiliation(s)
- Apoorva Mehta
- Department of Orthopaedic Surgery, Columbia University Irving Medical Center, New York, New York
- Dany El-Najjar
- Department of Orthopaedic Surgery, Columbia University Irving Medical Center, New York, New York
- Harrison Howell
- Department of Orthopaedic Surgery, Columbia University Irving Medical Center, New York, New York
- Puneet Gupta
- Department of Orthopaedic Surgery, Columbia University Irving Medical Center, New York, New York
- Emily Arciero
- Department of Orthopaedic Surgery, Columbia University Irving Medical Center, New York, New York
- Erick M Marigi
- Department of Orthopaedic Surgery, Mayo Clinic, Rochester, Minnesota
- David P Trofa
- Department of Orthopaedic Surgery, Columbia University Irving Medical Center, New York, New York
5
Hakam HT, Prill R, Korte L, Lovreković B, Ostojić M, Ramadanov N, Muehlensiepen F. Human-Written vs AI-Generated Texts in Orthopedic Academic Literature: Comparative Qualitative Analysis. JMIR Form Res 2024; 8:e52164. [PMID: 38363631] [PMCID: PMC10907945] [DOI: 10.2196/52164]
Abstract
BACKGROUND As large language models (LLMs) become increasingly integrated into different aspects of health care, questions about the implications for medical academic literature have begun to emerge. Key aspects such as authenticity in academic writing are at stake, with artificial intelligence (AI) generating highly linguistically accurate and grammatically sound texts.
OBJECTIVE The objective of this study is to compare human-written with AI-generated scientific literature in orthopedics and sports medicine.
METHODS Five original abstracts were selected from the PubMed database and rewritten with the assistance of 2 LLMs of different degrees of proficiency. Researchers with varying degrees of expertise and different areas of specialization were then asked to rank the abstracts according to linguistic and methodological parameters, and finally to classify each article as AI generated or human written.
RESULTS Neither the researchers nor AI-detection software could reliably identify the AI-generated texts. Furthermore, the criteria previously suggested in the literature did not correlate with whether the researchers deemed a text to be AI generated, or with whether they judged an article correctly on the basis of these parameters.
CONCLUSIONS The primary finding of this study was that researchers were unable to distinguish between LLM-generated and human-written texts. However, due to the small sample size, the results cannot be generalized. As with any tool used in academic research, the potential to cause harm can be mitigated by relying on the transparency and integrity of the researchers. With scientific integrity at stake, further research with a similar study design should be conducted to determine the magnitude of this issue.
Affiliation(s)
- Hassan Tarek Hakam
- Center of Orthopaedics and Trauma Surgery, University Clinic of Brandenburg, Brandenburg Medical School, Brandenburg an der Havel, Germany
- Faculty of Health Sciences, University Clinic of Brandenburg, Brandenburg an der Havel, Germany
- Center of Evidence Based Practice in Brandenburg, a JBI Affiliated Group, Brandenburg an der Havel, Germany
- Robert Prill
- Faculty of Health Sciences, University Clinic of Brandenburg, Brandenburg an der Havel, Germany
- Center of Evidence Based Practice in Brandenburg, a JBI Affiliated Group, Brandenburg an der Havel, Germany
- Lisa Korte
- Center of Health Services Research, Faculty of Health Sciences, University Clinic of Brandenburg, Rüdersdorf bei Berlin, Germany
- Bruno Lovreković
- Faculty of Orthopaedics, University Hospital Merkur, Zagreb, Croatia
- Marko Ostojić
- Department of Orthopaedics, University Hospital Mostar, Mostar, Bosnia and Herzegovina
- Nikolai Ramadanov
- Center of Orthopaedics and Trauma Surgery, University Clinic of Brandenburg, Brandenburg Medical School, Brandenburg an der Havel, Germany
- Faculty of Health Sciences, University Clinic of Brandenburg, Brandenburg an der Havel, Germany
- Felix Muehlensiepen
- Center of Evidence Based Practice in Brandenburg, a JBI Affiliated Group, Brandenburg an der Havel, Germany
- Center of Health Services Research, Faculty of Health Sciences, University Clinic of Brandenburg, Rüdersdorf bei Berlin, Germany
6
Kaczmarczyk K, Zakynthinaki M, Barton G, Baran M, Wit A. Biomechanical comparison of two surgical methods for Hallux Valgus deformity: Exploring the use of artificial neural networks as a decision-making tool for orthopedists. PLoS One 2024; 19:e0297504. [PMID: 38349907] [PMCID: PMC10863859] [DOI: 10.1371/journal.pone.0297504]
Abstract
Hallux Valgus foot deformity affects gait performance. Common treatment options include distal oblique metatarsal osteotomy and chevron osteotomy. Nonetheless, the current process of selecting the appropriate osteotomy method poses potential biases and risks, owing to its reliance on subjective human judgment and interpretation: variability among clinicians, the influence of individual clinical experience, and inherent measurement limitations may all contribute to inconsistent evaluations. To address this, incorporating objective tools such as neural networks, renowned for effective classification and decision-making support, holds promise for identifying optimal surgical approaches. The objective of this cross-sectional study was twofold: first, to investigate the feasibility of classifying patients based on the type of surgery; second, to explore the development of a decision-making tool to assist orthopedists in selecting the optimal surgical approach. To achieve this, the gait parameters of twenty-three women with moderate to severe Hallux Valgus, who underwent either distal oblique metatarsal osteotomy or chevron osteotomy, were analyzed. Parameters whose preoperative and postoperative values differed were identified after normalization using statistical tests (Shapiro-Wilk, non-parametric Wilcoxon, Student's t, and paired difference tests). Two artificial neural networks were constructed: one to classify patients based on the type of surgery, and one to simulate an optimal surgery type considering postoperative walking speed. The analysis demonstrated a strong correlation between surgery type and postoperative gait parameters, with the first neural network achieving 100% classification accuracy; cases were also identified where the model's assignment did not match the surgeon's decision. Our findings highlight the potential of artificial neural networks as a complementary tool for surgeons in making informed decisions. Addressing the study's limitations, future research may investigate a wider range of orthopedic procedures, examine additional gait parameters, and use more diverse and extensive datasets to enhance statistical robustness.
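To make the classification idea concrete, here is a deliberately simplified sketch: a plain logistic-regression classifier (far simpler than the paper's neural networks) trained to separate surgery types from gait features. Every value below, including the feature columns and the label meanings, is invented for illustration:

```python
import math

def train_logreg(X, y, lr=0.5, epochs=2000):
    """Stochastic-gradient-descent logistic regression; returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))
            g = p - yi  # gradient of the log loss with respect to z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, xi):
    """1 = chevron osteotomy, 0 = distal oblique (synthetic label meanings)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0

# Synthetic [postoperative walking speed (m/s), step length (m)] per patient.
X = [[1.1, 0.58], [1.0, 0.55], [1.2, 0.60],
     [0.8, 0.45], [0.7, 0.42], [0.9, 0.48]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logreg(X, y)
```

On this toy, linearly separable dataset the model reaches perfect training accuracy, echoing in miniature the 100% classification accuracy the abstract reports; a real decision-support tool would also require held-out validation.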
Affiliation(s)
- Katarzyna Kaczmarczyk
- Faculty of Rehabilitation, Józef Piłsudski Academy of Physical Education, Warsaw, Poland
- Maria Zakynthinaki
- School of Chemical and Environmental Engineering, Technical University of Crete, Chania, Greece
- Gabor Barton
- Research Institute for Sport and Exercise Sciences, Liverpool John Moores University, Liverpool, United Kingdom
- Mateusz Baran
- Faculty of Rehabilitation, Józef Piłsudski Academy of Physical Education, Warsaw, Poland
- Andrzej Wit
- Faculty of Rehabilitation, Józef Piłsudski Academy of Physical Education, Warsaw, Poland
7
Wang D, He Y, Ma Y, Wu H, Ni G. The Era of Artificial Intelligence: Talking About the Potential Application Value of ChatGPT/GPT-4 in Foot and Ankle Surgery. J Foot Ankle Surg 2024; 63:1-3. [PMID: 37516342] [DOI: 10.1053/j.jfas.2023.07.002]
Affiliation(s)
- Dongxue Wang
- School of Sport Medicine and Rehabilitation, Beijing Sport University, Beijing, China
- Yongbin He
- School of Sport Medicine and Rehabilitation, Beijing Sport University, Beijing, China
- Yixuan Ma
- College of Education, Beijing Sport University, Beijing, China
- Haiyang Wu
- Graduate School of Tianjin Medical University, Tianjin, China; Duke Molecular Physiology Institute, Duke University School of Medicine, Durham, NC
- Guoxin Ni
- Department of Rehabilitation Medicine, The First Affiliated Hospital of Xiamen University, Xiamen, China
8
Anastasio AT, Mills FB, Karavan MP, Adams SB. Evaluating the Quality and Usability of Artificial Intelligence-Generated Responses to Common Patient Questions in Foot and Ankle Surgery. Foot Ankle Orthop 2023; 8:24730114231209919. [PMID: 38027458] [PMCID: PMC10666700] [DOI: 10.1177/24730114231209919]
Abstract
Background Artificial intelligence (AI) platforms, such as ChatGPT, have become increasingly popular outlets for the consumption and distribution of health care-related advice. Because of a lack of regulation and oversight, the reliability of health care-related responses has become a topic of controversy in the medical community. To date, no study has explored the quality of AI-derived information as it relates to common foot and ankle pathologies. This study aims to assess the quality and educational benefit of ChatGPT responses to common foot and ankle-related questions.
Methods ChatGPT was asked a series of 5 questions: "What is the optimal treatment for ankle arthritis?" "How should I decide on ankle arthroplasty versus ankle arthrodesis?" "Do I need surgery for a Jones fracture?" "How can I prevent Charcot arthropathy?" and "Do I need to see a doctor for my ankle sprain?" Five responses (1 per question) were included after applying the exclusion criteria. The content was graded using DISCERN (a well-validated informational analysis tool) and AIRM (a self-designed tool for exercise evaluation).
Results Health care professionals graded the ChatGPT-generated responses as bottom tier 4.5% of the time, middle tier 27.3% of the time, and top tier 68.2% of the time.
Conclusion Although ChatGPT and other related AI platforms have become a popular means of medical information distribution, the educational value of the AI-generated responses related to foot and ankle pathologies was variable. With 4.5% of responses receiving a bottom-tier rating, 27.3% a middle-tier rating, and 68.2% a top-tier rating, health care professionals should be aware of the high viewership of variable-quality content easily accessible on ChatGPT.
Level of Evidence Level III, cross-sectional study.
Affiliation(s)
- Frederic Baker Mills
- Department of Orthopaedic Surgery, Duke University Medical Center, Durham, NC, USA
- Mark P. Karavan
- Department of Orthopaedic Surgery, Duke University Medical Center, Durham, NC, USA
- Samuel B. Adams
- Department of Orthopaedic Surgery, Duke University Medical Center, Durham, NC, USA