1
Till T, Scherkl M, Stranger N, Singer G, Hankel S, Flucher C, Hržić F, Štajduhar I, Tschauner S. Impact of test set composition on AI performance in pediatric wrist fracture detection in X-rays. Eur Radiol 2025:10.1007/s00330-025-11669-z. [PMID: 40379941 DOI: 10.1007/s00330-025-11669-z] [Received: 11/01/2024] [Revised: 02/24/2025] [Accepted: 04/14/2025] [Indexed: 05/19/2025]
Abstract
OBJECTIVES: To evaluate how different test set sampling strategies (random selection and balanced sampling) affect the performance of artificial intelligence (AI) models in pediatric wrist fracture detection using radiographs, aiming to highlight the need for standardization in test set design.

MATERIALS AND METHODS: This retrospective study utilized the open-sourced GRAZPEDWRI-DX dataset of 6091 pediatric wrist radiographs. Two test sets, each containing 4588 images, were constructed: one using a balanced approach based on case difficulty, projection type, and fracture presence and the other a random selection. EfficientNet and YOLOv11 models were trained and validated on 18,762 radiographs and tested on both sets. Binary classification and object detection tasks were evaluated using metrics such as precision, recall, F1 score, AP50, and AP50-95. Statistical comparisons between test sets were performed using nonparametric tests.

RESULTS: Performance metrics significantly decreased in the balanced test set with more challenging cases. For example, the precision for YOLOv11 models decreased from 0.95 in the random set to 0.83 in the balanced set. Similar trends were observed for recall, accuracy, and F1 score, indicating that models trained on easy-to-recognize cases performed poorly on more complex ones. These results were consistent across all model variants tested.

CONCLUSION: AI models for pediatric wrist fracture detection exhibit reduced performance when tested on balanced datasets containing more difficult cases, compared to randomly selected cases. This highlights the importance of constructing representative and standardized test sets that account for clinical complexity to ensure robust AI performance in real-world settings.

KEY POINTS:
- Question: Do different sampling strategies based on samples' complexity have an influence in deep learning models' performance in fracture detection?
- Findings: AI performance in pediatric wrist fracture detection significantly drops when tested on balanced datasets with more challenging cases, compared to randomly selected cases.
- Clinical relevance: Without standardized and validated test datasets for AI that reflect clinical complexities, performance metrics may be overestimated, limiting the utility of AI in real-world settings.
Affiliation(s)
- Tristan Till
  - Division of Pediatric Radiology, Department of Radiology, Medical University of Graz, Auenbruggerplatz 34, Graz, 8036, Austria
- Mario Scherkl
  - Division of Pediatric Radiology, Department of Radiology, Medical University of Graz, Auenbruggerplatz 34, Graz, 8036, Austria
- Nikolaus Stranger
  - Division of Pediatric Radiology, Department of Radiology, Medical University of Graz, Auenbruggerplatz 34, Graz, 8036, Austria
- Georg Singer
  - Department of Pediatric and Adolescent Surgery, Medical University of Graz, Auenbruggerplatz 34, Graz, 8036, Austria
- Saskia Hankel
  - Department of Pediatric and Adolescent Surgery, Medical University of Graz, Auenbruggerplatz 34, Graz, 8036, Austria
- Christina Flucher
  - Department of Pediatric and Adolescent Surgery, Medical University of Graz, Auenbruggerplatz 34, Graz, 8036, Austria
- Franko Hržić
  - Faculty of Engineering, Department of Computer Engineering, University of Rijeka, Vukovarska 58, Rijeka, 51000, Croatia
- Ivan Štajduhar
  - Faculty of Engineering, Department of Computer Engineering, University of Rijeka, Vukovarska 58, Rijeka, 51000, Croatia
- Sebastian Tschauner
  - Division of Pediatric Radiology, Department of Radiology, Medical University of Graz, Auenbruggerplatz 34, Graz, 8036, Austria
2
Suen K, Zhang R, Kutaiba N. Accuracy of wrist fracture detection on radiographs by artificial intelligence compared to human clinicians. A systematic review and meta-analysis. Eur J Radiol 2024; 178:111593. [PMID: 38981178 DOI: 10.1016/j.ejrad.2024.111593] [Received: 02/08/2024] [Revised: 06/23/2024] [Accepted: 06/28/2024] [Indexed: 07/11/2024]
Abstract
PURPOSE: The aim of the study is to perform a systematic review and meta-analysis comparing the diagnostic performance of artificial intelligence (AI) and human readers in the detection of wrist fractures.

METHOD: This study conducted a systematic review following PRISMA guidelines. Medline and Embase databases were searched for relevant articles published up to August 14, 2023. All included studies reported the diagnostic performance of AI to detect wrist fractures, with or without comparison to human readers. A meta-analysis was performed to calculate the pooled sensitivity and specificity of AI and human experts in detecting distal radius and scaphoid fractures, respectively.

RESULTS: Of 213 identified records, 20 studies were included after abstract screening and full-text review. Nine articles examined distal radius fractures, while eight studies examined scaphoid fractures. One study included distal radius and scaphoid fractures, and two studies examined paediatric distal radius fractures. The pooled sensitivity and specificity for AI in detecting distal radius fractures were 0.92 (95% CI 0.88-0.95) and 0.89 (0.84-0.92), respectively. The corresponding values for human readers were 0.95 (0.91-0.97) and 0.94 (0.91-0.96). For scaphoid fractures, pooled sensitivity and specificity for AI were 0.85 (0.73-0.92) and 0.83 (0.76-0.89), while human experts exhibited 0.71 (0.66-0.76) and 0.93 (0.90-0.95), respectively.

CONCLUSION: The results indicate comparable diagnostic accuracy between AI and human readers, especially for distal radius fractures. For the detection of scaphoid fractures, the human readers were similarly sensitive but more specific. These findings underscore the potential of AI to enhance fracture detection accuracy and improve clinical workflow, rather than to replace human intelligence.
Affiliation(s)
- Kary Suen
  - Department of Radiology, Austin Health, Victoria, Australia
- Richard Zhang
  - Department of Radiology, Austin Health, Victoria, Australia
- Numan Kutaiba
  - Department of Radiology, Austin Health, Victoria, Australia
3
Nowroozi A, Salehi MA, Shobeiri P, Agahi S, Momtazmanesh S, Kaviani P, Kalra MK. Artificial intelligence diagnostic accuracy in fracture detection from plain radiographs and comparing it with clinicians: a systematic review and meta-analysis. Clin Radiol 2024; 79:579-588. [PMID: 38772766 DOI: 10.1016/j.crad.2024.04.009] [Received: 02/08/2024] [Revised: 04/09/2024] [Accepted: 04/15/2024] [Indexed: 05/23/2024]
Abstract
PURPOSE: Fracture detection is one of the most commonly used and studied aspects of artificial intelligence (AI) in medicine. In this systematic review and meta-analysis, we aimed to summarize available literature and data regarding AI performance in fracture detection on plain radiographs and various factors affecting it.

METHODS: We systematically reviewed studies evaluating AI algorithms in detecting bone fractures in plain radiographs, combined their performance using meta-analysis (a bivariate regression approach), and compared it with that of clinicians. We also analyzed the factors potentially affecting algorithm performance using meta-regression.

RESULTS: Our analysis included 100 studies. In 83 studies with confusion matrices, AI algorithms showed a sensitivity of 91.43% and a specificity of 92.12% (Area under the summary receiver operator curve = 0.968). After adjustment and false discovery rate correction, tibia/fibula (excluding ankle) fractures were associated with higher (7.0%, p=0.004) AI sensitivity, while more recent publications (5.5%, p=0.003) and Xception architecture (6.6%, p<0.001) were associated with higher specificity. Clinicians and AI showed similar specificity in fracture identification, although AI leaned to higher sensitivity (7.6%, p=0.07). Radiologists, on the other hand, were more specific than AI overall and in several subgroups, and more sensitive to hip fractures before FDR correction.

CONCLUSIONS: Currently available AI aids could result in a significant improvement in care where radiologists are not readily available. Moreover, identifying factors affecting algorithm performance could guide AI development teams in their process of optimizing their products.
Affiliation(s)
- A Nowroozi
  - School of Medicine, Tehran University of Medical Sciences, Tehran, Iran
- M A Salehi
  - School of Medicine, Tehran University of Medical Sciences, Tehran, Iran
- P Shobeiri
  - School of Medicine, Tehran University of Medical Sciences, Tehran, Iran
- S Agahi
  - School of Medicine, Tehran University of Medical Sciences, Tehran, Iran
- S Momtazmanesh
  - School of Medicine, Tehran University of Medical Sciences, Tehran, Iran
- P Kaviani
  - Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA
- M K Kalra
  - Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA
4
Oeding JF, Kunze KN, Messer CJ, Pareek A, Fufa DT, Pulos N, Rhee PC. Diagnostic Performance of Artificial Intelligence for Detection of Scaphoid and Distal Radius Fractures: A Systematic Review. J Hand Surg Am 2024; 49:411-422. [PMID: 38551529 DOI: 10.1016/j.jhsa.2024.01.020] [Received: 08/23/2023] [Revised: 01/19/2024] [Accepted: 01/31/2024] [Indexed: 05/05/2024]
Abstract
PURPOSE: To review the existing literature to (1) determine the diagnostic efficacy of artificial intelligence (AI) models for detecting scaphoid and distal radius fractures and (2) compare the efficacy to human clinical experts.

METHODS: PubMed, OVID/Medline, and Cochrane libraries were queried for studies investigating the development, validation, and analysis of AI for the detection of scaphoid or distal radius fractures. Data regarding study design, AI model development and architecture, prediction accuracy/area under the receiver operator characteristic curve (AUROC), and imaging modalities were recorded.

RESULTS: A total of 21 studies were identified, of which 12 (57.1%) used AI to detect fractures of the distal radius, and nine (42.9%) used AI to detect fractures of the scaphoid. AI models demonstrated good diagnostic performance on average, with AUROC values ranging from 0.77 to 0.96 for scaphoid fractures and from 0.90 to 0.99 for distal radius fractures. Accuracy of AI models ranged from 72.0% to 90.3% and from 89.0% to 98.0% for scaphoid and distal radius fractures, respectively. When compared to clinical experts, 13 of 14 (92.9%) studies reported that AI models demonstrated comparable or better performance. The type of fracture influenced model performance, with worse overall performance on occult scaphoid fractures; however, models trained specifically on occult fractures demonstrated substantially improved performance when compared to humans.

CONCLUSIONS: AI models demonstrated excellent performance for detecting scaphoid and distal radius fractures, with the majority demonstrating comparable or better performance compared with human experts. Worse performance was demonstrated on occult fractures. However, when trained specifically on difficult fracture patterns, AI models demonstrated improved performance.

CLINICAL RELEVANCE: AI models can help detect commonly missed occult fractures while enhancing workflow efficiency for distal radius and scaphoid fracture diagnoses. As performance varies based on fracture type, future studies focused on wrist fracture detection should clearly define whether the goal is to (1) identify difficult-to-detect fractures or (2) improve workflow efficiency by assisting in routine tasks.
Affiliation(s)
- Jacob F Oeding
  - School of Medicine, Mayo Clinic Alix School of Medicine, Rochester, MN
  - Department of Orthopaedics, Institute of Clinical Sciences, The Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
- Kyle N Kunze
  - Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, NY
- Caden J Messer
  - School of Medicine, Mayo Clinic Alix School of Medicine, Rochester, MN
- Ayoosh Pareek
  - Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, NY
- Duretti T Fufa
  - Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, NY
- Nicholas Pulos
  - Department of Orthopaedic Surgery, Mayo Clinic, Rochester, MN
- Peter C Rhee
  - Department of Orthopaedic Surgery, Mayo Clinic, Rochester, MN
5
Till T, Tschauner S, Singer G, Lichtenegger K, Till H. Development and optimization of AI algorithms for wrist fracture detection in children using a freely available dataset. Front Pediatr 2023; 11:1291804. [PMID: 38188914 PMCID: PMC10768054 DOI: 10.3389/fped.2023.1291804] [Received: 09/10/2023] [Accepted: 12/05/2023] [Indexed: 01/09/2024]
Abstract
Introduction: In the field of pediatric trauma, computer-aided detection (CADe) and computer-aided diagnosis (CADx) systems have emerged, offering a promising avenue for improved patient care. Children with wrist fractures in particular may benefit from machine learning (ML) solutions, since some of these lesions may be overlooked on conventional X-ray due to minimal compression without dislocation or mistaken for cartilaginous growth plates. In this article, we describe the development and optimization of AI algorithms for wrist fracture detection in children.

Methods: A team of IT specialists, pediatric radiologists and pediatric surgeons used the freely available GRAZPEDWRI-DX dataset containing annotated pediatric trauma wrist radiographs of 6,091 patients, a total number of 10,643 studies (20,327 images). First, a basic object detection model, a You Only Look Once object detector of the seventh generation (YOLOv7), was trained and tested on these data. Then, team decisions were taken to adjust data preparation, image sizes used for training and testing, and configuration of the detection model. Furthermore, we investigated each of these models using an Explainable Artificial Intelligence (XAI) method called Gradient-weighted Class Activation Mapping (Grad-CAM). This method uses saliency maps to visualize where a model directs its attention before classifying and regressing a certain class.

Results: Mean average precision (mAP) improved when applying optimizations to the pre-processing of the dataset images (maximum increases of +25.51% mAP@0.5 and +39.78% mAP@[0.5:0.95]), as well as to the object detection model itself (maximum increases of +13.36% mAP@0.5 and +27.01% mAP@[0.5:0.95]). Generally, when analyzing the resulting models using XAI methods, higher scoring model variations in terms of mAP paid more attention to broader regions of the image, prioritizing detection accuracy over precision compared to the less accurate models.

Discussion: This paper supports the implementation of ML solutions for pediatric trauma care. Optimization of a large X-ray dataset and the YOLOv7 model improves the model's ability to detect objects and provide valid diagnostic support to health care specialists. Such optimization protocols must be understood and advocated before comparing ML performances against health care specialists.
Affiliation(s)
- Tristan Till
  - Department of Applied Computer Sciences, FH JOANNEUM - University of Applied Sciences, Graz, Austria
  - Division of Pediatric Radiology, Department of Radiology, Medical University of Graz, Graz, Austria
- Sebastian Tschauner
  - Division of Pediatric Radiology, Department of Radiology, Medical University of Graz, Graz, Austria
- Georg Singer
  - Department of Pediatric and Adolescent Surgery, Medical University of Graz, Graz, Austria
- Klaus Lichtenegger
  - Department of Applied Computer Sciences, FH JOANNEUM - University of Applied Sciences, Graz, Austria
- Holger Till
  - Department of Pediatric and Adolescent Surgery, Medical University of Graz, Graz, Austria