1
Campagner A, Marconi L, Bianchi E, Arosio B, Rossi P, Annoni G, Lucchi TA, Montano N, Cabitza F. Uncovering hidden subtypes in dementia: An unsupervised machine learning approach to dementia diagnosis and personalization of care. J Biomed Inform 2025; 165:104799. [PMID: 40118356 DOI: 10.1016/j.jbi.2025.104799] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2024] [Revised: 12/31/2024] [Accepted: 02/01/2025] [Indexed: 03/23/2025]
Abstract
OBJECTIVE Dementia represents a growing public health challenge, affecting an increasing number of individuals. It encompasses a broad spectrum of cognitive impairments, ranging from mild to severe stages, each of which demands varying levels of care. Current diagnostic approaches often treat dementia as a uniform condition, potentially overlooking clinically significant subtypes, which limits the effectiveness of treatment and care strategies. This study seeks to address the limitations of traditional diagnostic methods by applying unsupervised machine learning techniques to a large, multi-modal dataset of dementia patients (encompassing multiple data sources including clinical, demographic, gene expression and protein concentrations), with the aim of identifying distinct subtypes within the population. The primary focus is on differentiating between mild and severe stages of dementia to improve diagnostic accuracy and personalize treatment plans. METHODS The dataset analyzed included 911 individuals, described by 157 multi-modal characteristics, encompassing clinical, genomic, proteomic and sociodemographic features. After handling missing data, the dataset was reduced to 561 rows and 135 columns. Various dimensionality reduction techniques were applied to improve the feature-to-patient ratio, and unsupervised clustering methods were employed to identify potential subtypes. The major novelty in our methodology regards the combination of different techniques, bridging high-dimensional statistical inference, multi-modal dimensionality reduction and clustering analysis, to appropriately model the multi-modal nature of the data and ensure clinical relevance. RESULTS The analysis revealed distinct clusters within the dementia population, each characterized by specific clinical and demographic profiles. These profiles included variations in biomarkers, cognitive scores, and disability levels. 
The findings suggest the presence of previously unrecognized subgroups, distinguished by their genomic, proteomic, and clinical characteristics. CONCLUSION This study demonstrates that unsupervised machine learning can effectively identify clinically relevant subtypes of dementia, with important implications for diagnosis and personalized treatment. Further research is required to validate these findings and investigate their potential to improve patient outcomes.
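The abstract does not say which clustering algorithm was used, so the following is only a minimal sketch of the general idea (grouping patients by similarity after dimensionality reduction), with plain k-means swapped in as a stand-in. The function `kmeans`, the two-dimensional `patients` values, and the starting centroids are all hypothetical and are not taken from the study.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    labels = []
    for _ in range(iterations):
        # Assign each point to the index of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Recompute each centroid as the mean of the points assigned to it.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # assignments are stable, so stop early
        centroids = new_centroids
    return labels, centroids


# Hypothetical, already-reduced 2-D patient features (illustrative values only).
patients = [(0.1, 0.2), (0.0, 0.4), (0.3, 0.1), (5.0, 5.2), (5.3, 4.9), (4.8, 5.1)]
labels, centers = kmeans(patients, centroids=[(0.0, 0.0), (5.0, 5.0)])
print(labels)
```

In practice the number of clusters and the reduction step would need to be chosen and validated against clinical variables, which this sketch does not attempt.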
Affiliation(s)
- Luca Marconi
  - Department of Computer Science, Systems and Communication, University of Milano-Bicocca, Milan, Italy
- Edoardo Bianchi
  - Department of Computer Science, Systems and Communication, University of Milano-Bicocca, Milan, Italy
- Beatrice Arosio
  - Department of Clinical Sciences and Community Health, University of Milan, Milan, Italy
- Paolo Rossi
  - General Medicine, Hospital San Leopoldo Mandic, Merate, Italy
- Giorgio Annoni
  - Department of Medicine, University of Milano-Bicocca, Milan, Italy
- Nicola Montano
  - Department of Clinical Sciences and Community Health, University of Milan, Milan, Italy; Fondazione IRCCS Ca' Granda Ospedale Maggiore Policlinico, Milan, Italy
- Federico Cabitza
  - IRCCS Ospedale Galeazzi Sant'Ambrogio, Milan, Italy; Department of Computer Science, Systems and Communication, University of Milano-Bicocca, Milan, Italy
2
Chen C, Zhang W, Pan Y, Li Z. An interpretable hybrid machine learning approach for predicting three-month unfavorable outcomes in patients with acute ischemic stroke. Int J Med Inform 2025; 196:105807. [PMID: 39923294 DOI: 10.1016/j.ijmedinf.2025.105807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/16/2024] [Revised: 12/13/2024] [Accepted: 01/21/2025] [Indexed: 02/11/2025]
Abstract
BACKGROUND Acute ischemic stroke (AIS) is a clinical disorder caused by nontraumatic cerebrovascular disease with a high incidence, mortality, and disability rate. Most stroke survivors are left with speech and physical impairments, and emotional problems. Despite technological advances and improved treatment options, death and disability after stroke remain a major problem. Our research aims to develop interpretable hybrid machine learning (ML) models to accurately predict three-month unfavorable outcomes in patients with AIS. METHODS Within the framework of this analysis, the model was trained using data from 731 cases in the dataset and subsequently validated using data from both internal and external validation datasets. A total of 25 models (including ML and deep learning models) were initially employed, along with 14 evaluation metrics, and the results were subjected to cluster analysis to objectively validate the model's effectiveness and assess the similarity of evaluation metrics. For the final model evaluation, 10 metrics selected after metric screening and calibration analysis were utilized to evaluate model performance, while clinical decision analysis, cost curve analysis, and model fairness analysis were applied to assess the clinical applicability of the model. Nested cross-validation and optimal hyperparameter search were employed to determine the best hyperparameter for the ML models. The SHAP diagram is utilized to provide further visual explanations regarding the importance of features and their interaction effects, ultimately leading to the establishment of a practical AIS three-month prognostic prediction platform. RESULTS The frequencies of unfavorable outcomes in the internal dataset and external validation dataset were 389 / 1045 (37.2 %) and 161 / 411 (39.2 %), respectively. 
Through cluster analysis of the results of 14 evaluation metrics across 25 models and a comparison of clinical applicability, 12 ML models were ultimately selected for further analysis. The findings revealed that XGBoost and CatBoost performed the best. Further ensemble modeling of these two models and adjustment of decision thresholds using cost curves resulted in the final model performing as follows on the internal validation set: PRAUC of 0.856 (0.801, 0.902), ROCAUC of 0.856 (0.801, 0.901), specificity of 0.879 (0.797, 0.953), balanced accuracy of 0.840 (0.763, 0.912) and MCC of 0.678 (0.591, 0.760). Similarly, the model exhibited excellent performance on the external validation set, with a PRAUC of 0.823 (0.775, 0.872), ROCAUC of 0.842 (0.801, 0.890), specificity of 0.888 (0.822, 0.920), balanced accuracy of 0.814 (0.751, 0.869) and MCC of 0.639 (0.546, 0.721). In terms of the important features of AIS three-month outcomes, albumin ranked highest, followed by FBG, BMI, Scr, WBC, and age, while gender exhibited significant interactions with other indicators. Ultimately, based on the final ensemble model and optimal decision thresholds, a tailored short-term prognostic prediction platform for AIS patients was developed. CONCLUSIONS We constructed an interpretable hybrid ML model that maintained good performance on both internal and external validation datasets using the most readily accessible 30 clinical data variables, indicating its ability to accurately predict the three-month unfavorable outcomes for AIS patients. Meanwhile, our superior predictive model provides practicality for routine and more frequent initial risk assessments, making it easier to integrate into network or mobile-based telemedicine solutions.
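Three of the metrics reported above (specificity, balanced accuracy, and the Matthews correlation coefficient, MCC) follow directly from a binary confusion matrix. The sketch below uses their standard definitions; the function name `binary_metrics` and the counts are hypothetical and do not reproduce the study's data.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Compute threshold-dependent metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    balanced_accuracy = (sensitivity + specificity) / 2
    # MCC: correlation between predicted and true labels, in [-1, 1].
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=40, fp=10, tn=90, fn=10)
print(metrics)
```

Unlike plain accuracy, balanced accuracy and MCC stay informative when the unfavorable-outcome class is the minority, which is one plausible reason to report them alongside ROCAUC and PRAUC.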
Affiliation(s)
- Chen Chen
  - School of Cyber Science and Engineering, Southeast University, Nanjing 211102 Jiangsu, China; School of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003 Jiangsu, China
- Wenkang Zhang
  - Department of Cardiology, Zhongda Hospital, Southeast University, Nanjing 210009 Jiangsu, China; School of Medicine, Southeast University, Nanjing 210009 Jiangsu, China
- Yang Pan
  - Department of Geriatric Neurology, Affiliated Nanjing Brain Hospital, Nanjing Medical University, Nanjing 210029 Jiangsu, China
- Zhen Li
  - Department of Geriatric Neurology, Affiliated Nanjing Brain Hospital, Nanjing Medical University, Nanjing 210029 Jiangsu, China; Department of Neurology, The First Affiliated Hospital of Soochow University, Suzhou 215000 Jiangsu, China
3
Goktas P, Grzybowski A. Shaping the Future of Healthcare: Ethical Clinical Challenges and Pathways to Trustworthy AI. J Clin Med 2025; 14:1605. [PMID: 40095575 PMCID: PMC11900311 DOI: 10.3390/jcm14051605] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2025] [Revised: 02/06/2025] [Accepted: 02/22/2025] [Indexed: 03/19/2025] Open
Abstract
Background/Objectives: Artificial intelligence (AI) is transforming healthcare, enabling advances in diagnostics, treatment optimization, and patient care. Yet, its integration raises ethical, regulatory, and societal challenges. Key concerns include data privacy risks, algorithmic bias, and regulatory gaps that struggle to keep pace with AI advancements. This study aims to synthesize a multidisciplinary framework for trustworthy AI in healthcare, focusing on transparency, accountability, fairness, sustainability, and global collaboration. It moves beyond high-level ethical discussions to provide actionable strategies for implementing trustworthy AI in clinical contexts. Methods: A structured literature review was conducted using PubMed, Scopus, and Web of Science. Studies were selected based on relevance to AI ethics, governance, and policy in healthcare, prioritizing peer-reviewed articles, policy analyses, case studies, and ethical guidelines from authoritative sources published within the last decade. The conceptual approach integrates perspectives from clinicians, ethicists, policymakers, and technologists, offering a holistic "ecosystem" view of AI. No clinical trials or patient-level interventions were conducted. Results: The analysis identifies key gaps in current AI governance and introduces the Regulatory Genome, an adaptive AI oversight framework aligned with global policy trends and Sustainable Development Goals. It introduces quantifiable trustworthiness metrics, a comparative analysis of AI categories for clinical applications, and bias mitigation strategies. Additionally, it presents interdisciplinary policy recommendations for aligning AI deployment with ethical, regulatory, and environmental sustainability goals. This study emphasizes measurable standards, multi-stakeholder engagement strategies, and global partnerships to ensure that future AI innovations meet ethical and practical healthcare needs.
Conclusions: Trustworthy AI in healthcare requires more than technical advancements; it demands robust ethical safeguards, proactive regulation, and continuous collaboration. By adopting the recommended roadmap, stakeholders can foster responsible innovation, improve patient outcomes, and maintain public trust in AI-driven healthcare.
Affiliation(s)
- Polat Goktas
  - UCD School of Computer Science, University College Dublin, D04 V1W8 Dublin, Ireland
- Andrzej Grzybowski
  - Department of Ophthalmology, University of Warmia and Mazury, 10-719 Olsztyn, Poland
  - Institute for Research in Ophthalmology, Foundation for Ophthalmology Development, 61-553 Poznan, Poland
4
Campagner A, Agnello L, Carobene A, Padoan A, Del Ben F, Locatelli M, Plebani M, Ognibene A, Lorubbio M, De Vecchi E, Cortegiani A, Piva E, Poz D, Curcio F, Cabitza F, Ciaccio M. Complete Blood Count and Monocyte Distribution Width-Based Machine Learning Algorithms for Sepsis Detection: Multicentric Development and External Validation Study. J Med Internet Res 2025; 27:e55492. [PMID: 40009841 PMCID: PMC11904381 DOI: 10.2196/55492] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2023] [Revised: 05/04/2024] [Accepted: 09/09/2024] [Indexed: 02/28/2025] Open
Abstract
BACKGROUND Sepsis is an organ dysfunction caused by a dysregulated host response to infection. Early detection is fundamental to improving the patient outcome. Laboratory medicine can play a crucial role by providing biomarkers whose alteration can be detected before the onset of clinical signs and symptoms. In particular, the relevance of monocyte distribution width (MDW) as a sepsis biomarker has emerged in the previous decade. However, despite encouraging results, MDW has poor sensitivity and positive predictive value when compared to other biomarkers. OBJECTIVE This study aims to investigate the use of machine learning (ML) to overcome the limitations mentioned earlier by combining different parameters and therefore improving sepsis detection. However, making ML models function in clinical practice may be problematic, as their performance may suffer when deployed in contexts other than the research environment. In fact, even widely used commercially available models have been demonstrated to generalize poorly in out-of-distribution scenarios. METHODS In this multicentric study, we developed ML models whose intended use is the early detection of sepsis on the basis of MDW and complete blood count parameters. In total, data from 6 patient cohorts (encompassing 5344 patients) collected at 5 different Italian hospitals were used to train and externally validate ML models. The models were trained on a patient cohort encompassing patients enrolled at the emergency department, and it was externally validated on 5 different cohorts encompassing patients enrolled at both the emergency department and the intensive care unit. The cohorts were selected to exhibit a variety of data distribution shifts compared to the training set, including label, covariate, and missing data shifts, enabling a conservative validation of the developed models. 
To improve generalizability and robustness to different types of distribution shifts, the developed ML models combine traditional methodologies with advanced techniques inspired by controllable artificial intelligence (AI), namely cautious classification, which gives the ML models the ability to abstain from making predictions, and explainable AI, which provides health operators with useful information about the models' functioning. RESULTS The developed models achieved good performance on the internal validation (area under the receiver operating characteristic curve between 0.91 and 0.98), as well as consistent generalization performance across the external validation datasets (area under the receiver operating characteristic curve between 0.75 and 0.95), outperforming baseline biomarkers and state-of-the-art ML models for sepsis detection. Controllable AI techniques were further able to improve performance and were used to derive an interpretable set of diagnostic rules. CONCLUSIONS Our findings demonstrate how controllable AI approaches based on complete blood count and MDW may be used for the early detection of sepsis while also demonstrating how the proposed methodology can be used to develop ML models that are more resistant to different types of data distribution shifts.
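Cautious classification, as described above, lets a model abstain rather than commit to a prediction. One simple way to picture this is a rejection band around the decision boundary, sketched below. The abstract does not specify the authors' abstention rule, so the thresholds `lower` and `upper`, the function names, and the example probabilities are all assumptions made for illustration.

```python
def cautious_predict(probability, lower=0.3, upper=0.7):
    """Return a label only when the predicted probability is outside the band.

    The band [lower, upper] is a hypothetical choice, not a value from the study.
    """
    if probability >= upper:
        return "sepsis"
    if probability <= lower:
        return "no sepsis"
    return None  # abstain: defer the case to the clinician


def coverage(probabilities, lower=0.3, upper=0.7):
    """Fraction of cases on which the model commits to a prediction."""
    decided = [p for p in probabilities if cautious_predict(p, lower, upper) is not None]
    return len(decided) / len(probabilities)


# Hypothetical model outputs for four patients.
outputs = [0.9, 0.1, 0.5, 0.65]
decisions = [cautious_predict(p) for p in outputs]
print(decisions, coverage(outputs))
```

Widening the band typically trades coverage for accuracy on the cases that are still predicted, so the band would have to be tuned on validation data.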
Affiliation(s)
- Anna Carobene
  - IRCCS San Raffaele Scientific Institute, Milano, Italy
- Andrea Padoan
  - Department of Medicine, University of Padova, Padova, Italy
  - Laboratory Medicine Unit, University-Hospital of Padova, Padova, Italy
- Fabio Del Ben
  - IRCCS Centro Di Riferimento Oncologico Aviano, Aviano, Italy
- Mario Plebani
  - Department of Medicine, University of Padova, Padova, Italy
  - Laboratory Medicine Unit, University-Hospital of Padova, Padova, Italy
- Andrea Cortegiani
  - University of Palermo, Palermo, Italy
  - University Hospital Policlinico Paolo Giaccone, Palermo, Italy
- Elisa Piva
  - Azienda Socio Sanitaria Territoriale di Mantova, Mantova, Italy
- Federico Cabitza
  - IRCCS Ospedale Galeazzi Sant'Ambrogio, Milan, Italy
  - Department of Computer Science, Systems and Communication, University of Milano-Bicocca, Milano, Italy
- Marcello Ciaccio
  - University of Palermo, Palermo, Italy
  - University Hospital Policlinico Paolo Giaccone, Palermo, Italy
5
Rosenbacke R, Melhus Å, McKee M, Stuckler D. How Explainable Artificial Intelligence Can Increase or Decrease Clinicians' Trust in AI Applications in Health Care: Systematic Review. JMIR AI 2024; 3:e53207. [PMID: 39476365 PMCID: PMC11561425 DOI: 10.2196/53207] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2023] [Revised: 03/22/2024] [Accepted: 09/17/2024] [Indexed: 11/17/2024]
Abstract
BACKGROUND Artificial intelligence (AI) has significant potential in clinical practice. However, its "black box" nature can lead clinicians to question its value. The challenge is to create sufficient trust for clinicians to feel comfortable using AI, but not so much that they defer to it even when it produces results that conflict with their clinical judgment in ways that lead to incorrect decisions. Explainable AI (XAI) aims to address this by providing explanations of how AI algorithms reach their conclusions. However, it remains unclear whether such explanations foster an appropriate degree of trust to ensure the optimal use of AI in clinical practice. OBJECTIVE This study aims to systematically review and synthesize empirical evidence on the impact of XAI on clinicians' trust in AI-driven clinical decision-making. METHODS A systematic review was conducted in accordance with PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, searching PubMed and Web of Science databases. Studies were included if they empirically measured the impact of XAI on clinicians' trust using cognition- or affect-based measures. Out of 778 articles screened, 10 met the inclusion criteria. We assessed the risk of bias using standard tools appropriate to the methodology of each paper. RESULTS The risk of bias in all papers was moderate or moderate to high. All included studies operationalized trust primarily through cognitive-based definitions, with 2 also incorporating affect-based measures. Out of these, 5 studies reported that XAI increased clinicians' trust compared with standard AI, particularly when the explanations were clear, concise, and relevant to clinical practice. In addition, 3 studies found no significant effect of XAI on trust, and the presence of explanations does not automatically improve trust. 
Notably, 2 studies highlighted that XAI could either enhance or diminish trust, depending on the complexity and coherence of the provided explanations. The majority of studies suggest that XAI has the potential to enhance clinicians' trust in recommendations generated by AI. However, complex or contradictory explanations can undermine this trust. More critically, trust in AI is not inherently beneficial, as AI recommendations are not infallible. These findings underscore the nuanced role of explanation quality and suggest that trust can be modulated through the careful design of XAI systems. CONCLUSIONS Excessive trust in incorrect advice generated by AI can adversely impact clinical accuracy, just as can happen when correct advice is distrusted. Future research should focus on refining both cognitive and affect-based measures of trust and on developing strategies to achieve an appropriate balance in terms of trust, preventing both blind trust and undue skepticism. Optimizing trust in AI systems is essential for their effective integration into clinical practice.
Affiliation(s)
- Rikard Rosenbacke
  - Centre for Corporate Governance, Department of Accounting, Copenhagen Business School, Frederiksberg, Denmark
- Åsa Melhus
  - Department of Medical Sciences, Clinical Microbiology, Uppsala University, Uppsala, Sweden
- Martin McKee
  - European Observatory on Health Systems and Policies, London School of Hygiene & Tropical Medicine, London, United Kingdom
- David Stuckler
  - Department of Social and Political Sciences, Bocconi University, Milano, Italy
6
Hassoun S, Bruckmann C, Ciardullo S, Perseghin G, Marra F, Curto A, Arena U, Broccolo F, Di Gaudio F. NAIF: A novel artificial intelligence-based tool for accurate diagnosis of stage F3/F4 liver fibrosis in the general adult population, validated with three external datasets. Int J Med Inform 2024; 185:105373. [PMID: 38395017 DOI: 10.1016/j.ijmedinf.2024.105373] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2023] [Revised: 02/05/2024] [Accepted: 02/07/2024] [Indexed: 02/25/2024]
Abstract
OBJECTIVE The purpose of this study was to determine the effectiveness of a new AI-based tool called NAIF (NAFLD-AI-Fibrosis) in identifying individuals from the general population with advanced liver fibrosis (stage F3/F4). We compared NAIF's performance to two existing risk score calculators, aspartate aminotransferase-to-platelet ratio index (APRI) and fibrosis-4 (Fib4). METHODS To set up the algorithm for diagnosing severe liver fibrosis (defined as Fibroscan® values E ≥ 9.7 kPa), we used 19 blood biochemistry parameters and two demographic parameters in a group of 5,962 individuals from the NHANES population (2017-2020 pre-pandemic, public database). We then assessed the algorithm's performance by comparing its accuracy, precision, sensitivity, specificity, and F1 score values to those of the APRI and Fib4 scoring systems. RESULTS In a held-out subset of the NHANES population, NAIF achieved a predictive precision of 72%, a sensitivity of 61%, and a specificity of 77% in correctly identifying adults (aged 18-79 years) with severe liver fibrosis. Additionally, NAIF performed well when tested with two external datasets of Italian patients with a Fibroscan® score E ≥ 9.7 kPa, and with an external dataset of patients with a diagnosis of severe liver fibrosis through biopsy. CONCLUSIONS The results of our study suggest that NAIF, using routinely available parameters, outperforms existing scoring methods (Fib4 and APRI) in sensitivity when diagnosing severe liver fibrosis, even when tested with external validation datasets. NAIF uses routinely available parameters, making it a promising tool for identifying individuals with advanced liver fibrosis from the general population.
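For context, the two comparator scores are closed-form formulas over routine laboratory values. The sketch below uses their commonly published definitions, not anything stated in the abstract itself; the default upper limit of normal for AST (`ast_uln=40.0` U/L) is an assumption that varies between laboratories, and the patient values are hypothetical.

```python
from math import sqrt


def apri(ast, platelets, ast_uln=40.0):
    """AST-to-platelet ratio index.

    ast and ast_uln in U/L; platelets in 10^9/L.
    ast_uln (upper limit of normal) is lab-specific; 40 U/L is an assumed default.
    """
    return (ast / ast_uln) / platelets * 100


def fib4(age, ast, alt, platelets):
    """Fibrosis-4 index.

    age in years; ast and alt in U/L; platelets in 10^9/L.
    """
    return (age * ast) / (platelets * sqrt(alt))


# Hypothetical patient values, chosen only to make the arithmetic easy to follow.
apri_value = apri(ast=80, platelets=100)
fib4_value = fib4(age=50, ast=80, alt=64, platelets=100)
print(apri_value, fib4_value)
```

Both scores depend on only three or four inputs, whereas NAIF is described as using 19 biochemistry and two demographic parameters.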
Affiliation(s)
- Samir Hassoun
  - Unità Operativa Centro Controllo Qualità e Rischio Chimico (CQRC), Azienda Ospedaliera Villa Sofia Cervello, viale Strasburgo 233, 90146 Palermo, Italy
- Chiara Bruckmann
  - Unità Operativa Centro Controllo Qualità e Rischio Chimico (CQRC), Azienda Ospedaliera Villa Sofia Cervello, viale Strasburgo 233, 90146 Palermo, Italy
- Stefano Ciardullo
  - Department of Medicine and Surgery, University of Milano-Bicocca, via Modigliani 10, 20900 Monza, Italy; Department of Medicine and Rehabilitation, Policlinico di Monza, Monza, via Modigliani 10, 20900 Monza, Italy
- Gianluca Perseghin
  - Department of Medicine and Surgery, University of Milano-Bicocca, via Modigliani 10, 20900 Monza, Italy; Department of Medicine and Rehabilitation, Policlinico di Monza, Monza, via Modigliani 10, 20900 Monza, Italy
- Fabio Marra
  - Dipartimento di Medicina Sperimentale e Clinica, University of Florence, Largo Giovanni Alessandro Brambilla, 3, 50134 Firenze, Italy
- Armando Curto
  - Dipartimento di Medicina Sperimentale e Clinica, University of Florence, Largo Giovanni Alessandro Brambilla, 3, 50134 Firenze, Italy
- Umberto Arena
  - Dipartimento di Medicina Sperimentale e Clinica, University of Florence, Largo Giovanni Alessandro Brambilla, 3, 50134 Firenze, Italy
- Francesco Broccolo
  - Department of Experimental Medicine, University of Salento, 73100 Lecce, Italy
- Francesca Di Gaudio
  - Unità Operativa Centro Controllo Qualità e Rischio Chimico (CQRC), Azienda Ospedaliera Villa Sofia Cervello, viale Strasburgo 233, 90146 Palermo, Italy; PROMISE-Promotion of Health, Maternal-Childhood, Internal and Specialized Medicine of Excellence G. D'Alessandro, Piazza delle Cliniche, 2, 90127 Palermo, Italy
7
Chen TLW, Buddhiraju A, Seo HH, Subih MA, Tuchinda P, Kwon YM. Internal and External Validation of the Generalizability of Machine Learning Algorithms in Predicting Non-home Discharge Disposition Following Primary Total Knee Joint Arthroplasty. J Arthroplasty 2023; 38:1973-1981. [PMID: 36764409 DOI: 10.1016/j.arth.2023.01.065] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 11/24/2022] [Revised: 01/24/2023] [Accepted: 01/31/2023] [Indexed: 02/12/2023] Open
Abstract
BACKGROUND Nonhome discharge disposition following primary total knee arthroplasty (TKA) is associated with a higher rate of complications and constitutes a socioeconomic burden on the health care system. While existing algorithms predicting nonhome discharge disposition varied in degrees of mathematical complexity and prediction power, their capacity to generalize predictions beyond the development dataset remains limited. Therefore, this study aimed to establish the machine learning model generalizability by performing internal and external validations using nation-scale and institutional cohorts, respectively. METHODS Four machine learning models were trained using the national cohort. Recursive feature elimination and hyper-parameter tuning were applied. Internal validation was achieved through five-fold cross-validation during model training. The trained models' performance was externally validated using the institutional cohort and assessed by discrimination, calibration, and clinical utility. RESULTS The national (424,354 patients) and institutional (10,196 patients) cohorts had non-home discharge rates of 19.4 and 36.4%, respectively. The areas under the receiver operating curve of the model predictions were 0.83 to 0.84 during internal validation and increased to 0.88 to 0.89 during external validation. Artificial neural network and histogram-based gradient boosting elicited the best performance with a mean area under the receiver operating curve of 0.89, calibration slope of 1.39, and Brier score of 0.14, which indicated that the two models were robust in distinguishing non-home discharge and well-calibrated with accurate predictions of the probabilities. The low inter-dataset similarity indicated reliable external validation. Length of stay, age, body mass index, and sex were the strongest predictors of discharge destination after primary TKA. 
CONCLUSION The machine learning models demonstrated excellent predictive performance during both internal and external validations, supporting their generalizability across different patient cohorts and potential applicability in the clinical workflow.
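Two of the quantities reported above can be computed by hand from predicted probabilities: the Brier score (mean squared error of the probabilities) and the area under the receiver operating characteristic curve, which equals the probability that a randomly chosen positive case is ranked above a randomly chosen negative one. The sketch below uses these standard definitions; the labels and probabilities are hypothetical, and the calibration slope is left out because it requires fitting a logistic model.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """AUC via pairwise comparison of positives and negatives (ties count one half)."""
    positives = [p for y, p in zip(y_true, y_prob) if y == 1]
    negatives = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(
        1.0 if pos > neg else 0.5 if pos == neg else 0.0
        for pos in positives
        for neg in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes (1 = non-home discharge) and model probabilities.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.7]
print(brier_score(y_true, y_prob), roc_auc(y_true, y_prob))
```

The pairwise AUC is quadratic in the number of cases, which is fine for illustration but too slow for a cohort of hundreds of thousands of patients; a rank-based implementation would be used there.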
Affiliation(s)
- Tony Lin-Wei Chen
  - Bioengineering Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
- Anirudh Buddhiraju
  - Bioengineering Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
- Henry Hojoon Seo
  - Bioengineering Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
- Murad Abdullah Subih
  - Bioengineering Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
- Pete Tuchinda
  - Bioengineering Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
- Young-Min Kwon
  - Bioengineering Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
8
Bottrighi A, Pennisi M. Exploring the State of Machine Learning and Deep Learning in Medicine: A Survey of the Italian Research Community. INFORMATION 2023; 14:513. [DOI: 10.3390/info14090513] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2025] Open
Abstract
Artificial intelligence (AI) is becoming increasingly important, especially in the medical field. While AI has been used in medicine for some time, its growth in the last decade is remarkable. Specifically, machine learning (ML) and deep learning (DL) techniques in medicine have been increasingly adopted due to the growing abundance of health-related data, the improved suitability of such techniques for managing large datasets, and more computational power. ML and DL methodologies are fostering the development of new “intelligent” tools and expert systems to process data, to automatize human–machine interactions, and to deliver advanced predictive systems that are changing every aspect of the scientific research, industry, and society. The Italian scientific community was instrumental in advancing this research area. This article aims to conduct a comprehensive investigation of the ML and DL methodologies and applications used in medicine by the Italian research community in the last five years. To this end, we selected all the papers published in the last five years with at least one of the authors affiliated to an Italian institution that in the title, in the abstract, or in the keywords present the terms “machine learning” or “deep learning” and reference a medical area. We focused our research on journal papers under the hypothesis that Italian researchers prefer to present novel but well-established research in scientific journals. We then analyzed the selected papers considering different dimensions, including the medical topic, the type of data, the pre-processing methods, the learning methods, and the evaluation methods. As a final outcome, a comprehensive overview of the Italian research landscape is given, highlighting how the community has increasingly worked on a very heterogeneous range of medical problems.
Affiliation(s)
- Alessio Bottrighi
  - Dipartimento di Scienze e Innovazione Tecnologica (DiSIT), Computer Science Institute, Università del Piemonte Orientale, 15121 Alessandria, Italy
  - Laboratorio Integrato di Intelligenza Artificiale e Informatica in Medicina, Azienda Ospedaliera SS. Antonio e Biagio e Cesare Arrigo, Alessandria, and DiSIT, Università del Piemonte Orientale, 15121 Alessandria, Italy
- Marzio Pennisi
  - Dipartimento di Scienze e Innovazione Tecnologica (DiSIT), Computer Science Institute, Università del Piemonte Orientale, 15121 Alessandria, Italy
  - Laboratorio Integrato di Intelligenza Artificiale e Informatica in Medicina, Azienda Ospedaliera SS. Antonio e Biagio e Cesare Arrigo, Alessandria, and DiSIT, Università del Piemonte Orientale, 15121 Alessandria, Italy
9
|
Buddhiraju A, Chen TLW, Subih MA, Seo HH, Esposito JG, Kwon YM. Validation and Generalizability of Machine Learning Models for the Prediction of Discharge Disposition Following Revision Total Knee Arthroplasty. J Arthroplasty 2023; 38:S253-S258. [PMID: 36849013 DOI: 10.1016/j.arth.2023.02.054] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 11/06/2022] [Revised: 02/16/2023] [Accepted: 02/20/2023] [Indexed: 03/01/2023] Open
Abstract
BACKGROUND Postoperative discharges to facilities account for over 33% of the $2.7 billion in annual expenditures associated with revision total knee arthroplasty (TKA) and are associated with increased complications compared to home discharges. Prior studies predicting discharge disposition using advanced machine learning (ML) have been limited by a lack of generalizability and validation. This study aimed to establish ML model generalizability by externally validating its prediction for nonhome discharge following revision TKA using national and institutional databases. METHODS The national and institutional cohorts comprised 52,533 and 1,628 patients, respectively, with 20.6% and 19.4% nonhome discharge rates. Five ML models were trained and internally validated (five-fold cross-validation) on a large national dataset. Subsequently, external validation was performed on our institutional dataset. Model performance was assessed using discrimination, calibration, and clinical utility. Global predictor importance plots and local surrogate models were used for interpretation. RESULTS The strongest predictors of nonhome discharge were patient age, body mass index, and surgical indication. The area under the receiver operating characteristic curve increased from internal to external validation and ranged between 0.77 and 0.79. The artificial neural network was the best predictive model for identifying patients at risk for nonhome discharge (area under the receiver operating characteristic curve = 0.78), and also the most accurate (calibration slope = 0.93, intercept = 0.02, and Brier score = 0.12). CONCLUSION All five ML models demonstrated good-to-excellent discrimination, calibration, and clinical utility on external validation, with the artificial neural network being the best model for predicting discharge disposition following revision TKA. Our findings establish the generalizability of ML models developed using data from a national database. The integration of these predictive models into clinical workflow may assist in optimizing discharge planning, bed management, and cost containment associated with revision TKA.
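Two of the performance measures named in this abstract, the area under the receiver operating characteristic curve (discrimination) and the Brier score (overall accuracy of predicted probabilities), can be illustrated with a minimal Python sketch. It assumes binary labels (1 = nonhome discharge) and predicted probabilities; the function names and toy values are hypothetical and are not taken from the study. The calibration slope and intercept are omitted because they require fitting a logistic recalibration model.

```python
def auroc(y_true, y_score):
    """Probability that a random positive case is scored above a random
    negative case; tied scores count as half a win."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome
    (0 is perfect; lower is better)."""
    if not y_true or len(y_true) != len(y_prob):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy data for illustration only (hypothetical, not from the study).
labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
toy_auroc = auroc(labels, probs)
toy_brier = brier_score(labels, probs)
```

The pairwise AUROC above is quadratic in the number of cases, which is fine for a small illustration; a rank-based implementation would be preferable for cohorts of the size reported here.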
Affiliation(s)
- Anirudh Buddhiraju
  - Bioengineering Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
- Tony L-W Chen
  - Bioengineering Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
- Murad A Subih
  - Bioengineering Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
- Henry H Seo
  - Bioengineering Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
- John G Esposito
  - Bioengineering Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
- Young-Min Kwon
  - Bioengineering Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
10
|
Abstract
Medical imaging is a great asset for modern medicine, since it allows physicians to spatially interrogate a disease site, resulting in precise intervention for diagnosis and treatment, and to observe particular aspects of patients' conditions that otherwise would not be noticeable. Computational analysis of medical images, moreover, can allow the discovery of disease patterns and correlations among cohorts of patients with the same disease, thus suggesting common causes or providing useful information for better therapies and cures. Machine learning and deep learning applied to medical images, in particular, have produced new, unprecedented results that can pave the way to advanced frontiers of medical discoveries. While computational analysis of medical images has become easier, making mistakes or generating inflated or misleading results has become easier, too, hindering reproducibility and deployment. In this article, we provide ten quick tips to perform computational analysis of medical images avoiding common mistakes and pitfalls that we noticed in multiple studies in the past. We believe our ten guidelines, if put into practice, can help the computational-medical imaging community to perform better scientific research that eventually can have a positive impact on the lives of patients worldwide.
Affiliation(s)
- Davide Chicco
  - Institute of Health Policy Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
- Rakesh Shiradkar
  - Department of Biomedical Engineering, Emory University, Atlanta, Georgia, United States of America
11
|
Homeyer A, Geißler C, Schwen LO, Zakrzewski F, Evans T, Strohmenger K, Westphal M, Bülow RD, Kargl M, Karjauv A, Munné-Bertran I, Retzlaff CO, Romero-López A, Sołtysiński T, Plass M, Carvalho R, Steinbach P, Lan YC, Bouteldja N, Haber D, Rojas-Carulla M, Vafaei Sadr A, Kraft M, Krüger D, Fick R, Lang T, Boor P, Müller H, Hufnagl P, Zerbe N. Recommendations on compiling test datasets for evaluating artificial intelligence solutions in pathology. Mod Pathol 2022; 35:1759-1769. [PMID: 36088478 PMCID: PMC9708586 DOI: 10.1038/s41379-022-01147-y] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2022] [Revised: 07/24/2022] [Accepted: 07/25/2022] [Indexed: 12/24/2022]
Abstract
Artificial intelligence (AI) solutions that automatically extract information from digital histology images have shown great promise for improving pathological diagnosis. Prior to routine use, it is important to evaluate their predictive performance and obtain regulatory approval. This assessment requires appropriate test datasets. However, compiling such datasets is challenging and specific recommendations are missing. A committee of various stakeholders, including commercial AI developers, pathologists, and researchers, discussed key aspects and conducted extensive literature reviews on test datasets in pathology. Here, we summarize the results and derive general recommendations on compiling test datasets. We address several questions: Which and how many images are needed? How to deal with low-prevalence subsets? How can potential bias be detected? How should datasets be reported? What are the regulatory requirements in different countries? The recommendations are intended to help AI developers demonstrate the utility of their products and to help pathologists and regulatory agencies verify reported performance measures. Further research is needed to formulate criteria for sufficiently representative test datasets so that AI solutions can operate with less user intervention and better support diagnostic workflows in the future.
Affiliation(s)
- André Homeyer
  - Fraunhofer Institute for Digital Medicine MEVIS, Max-von-Laue-Straße 2, 28359, Bremen, Germany
- Christian Geißler
  - Technische Universität Berlin, DAI-Labor, Ernst-Reuter-Platz 7, 10587, Berlin, Germany
- Lars Ole Schwen
  - Fraunhofer Institute for Digital Medicine MEVIS, Max-von-Laue-Straße 2, 28359, Bremen, Germany
- Falk Zakrzewski
  - Institute of Pathology, Carl Gustav Carus University Hospital Dresden (UKD), TU Dresden (TUD), Fetscherstrasse 74, 01307, Dresden, Germany
- Theodore Evans
  - Technische Universität Berlin, DAI-Labor, Ernst-Reuter-Platz 7, 10587, Berlin, Germany
- Klaus Strohmenger
  - Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt Universität zu Berlin, Institute of Pathology, Charitéplatz 1, 10117, Berlin, Germany
- Max Westphal
  - Fraunhofer Institute for Digital Medicine MEVIS, Max-von-Laue-Straße 2, 28359, Bremen, Germany
- Roman David Bülow
  - Institute of Pathology, University Hospital RWTH Aachen, Pauwelsstraße 30, 52074, Aachen, Germany
- Michaela Kargl
  - Medical University of Graz, Diagnostic and Research Center for Molecular BioMedicine, Diagnostic & Research Institute of Pathology, Neue Stiftingtalstrasse 6, 8010, Graz, Austria
- Aray Karjauv
  - Technische Universität Berlin, DAI-Labor, Ernst-Reuter-Platz 7, 10587, Berlin, Germany
- Isidre Munné-Bertran
  - MoticEurope, S.L.U., C. Les Corts, 12 Poligono Industrial, 08349, Barcelona, Spain
- Carl Orge Retzlaff
  - Technische Universität Berlin, DAI-Labor, Ernst-Reuter-Platz 7, 10587, Berlin, Germany
- Markus Plass
  - Medical University of Graz, Diagnostic and Research Center for Molecular BioMedicine, Diagnostic & Research Institute of Pathology, Neue Stiftingtalstrasse 6, 8010, Graz, Austria
- Rita Carvalho
  - Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt Universität zu Berlin, Institute of Pathology, Charitéplatz 1, 10117, Berlin, Germany
- Peter Steinbach
  - Helmholtz-Zentrum Dresden Rossendorf, Bautzner Landstraße 400, 01328, Dresden, Germany
- Yu-Chia Lan
  - Institute of Pathology, University Hospital RWTH Aachen, Pauwelsstraße 30, 52074, Aachen, Germany
- Nassim Bouteldja
  - Institute of Pathology, University Hospital RWTH Aachen, Pauwelsstraße 30, 52074, Aachen, Germany
- David Haber
  - Lakera AI AG, Zelgstrasse 7, 8003, Zürich, Switzerland
- Alireza Vafaei Sadr
  - Institute of Pathology, University Hospital RWTH Aachen, Pauwelsstraße 30, 52074, Aachen, Germany
- Daniel Krüger
  - Olympus Soft Imaging Solutions GmbH, Johann-Krane-Weg 39, 48149, Münster, Germany
- Rutger Fick
  - Tribun Health, 2 Rue du Capitaine Scott, 75015, Paris, France
- Tobias Lang
  - Mindpeak GmbH, Zirkusweg 2, 20359, Hamburg, Germany
- Peter Boor
  - Institute of Pathology, University Hospital RWTH Aachen, Pauwelsstraße 30, 52074, Aachen, Germany
- Heimo Müller
  - Medical University of Graz, Diagnostic and Research Center for Molecular BioMedicine, Diagnostic & Research Institute of Pathology, Neue Stiftingtalstrasse 6, 8010, Graz, Austria
- Peter Hufnagl
  - Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt Universität zu Berlin, Institute of Pathology, Charitéplatz 1, 10117, Berlin, Germany
- Norman Zerbe
  - Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt Universität zu Berlin, Institute of Pathology, Charitéplatz 1, 10117, Berlin, Germany
12
|
Bento N, Rebelo J, Barandas M, Carreiro AV, Campagner A, Cabitza F, Gamboa H. Comparing Handcrafted Features and Deep Neural Representations for Domain Generalization in Human Activity Recognition. SENSORS (BASEL, SWITZERLAND) 2022; 22:s22197324. [PMID: 36236427 PMCID: PMC9572241 DOI: 10.3390/s22197324] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/04/2022] [Revised: 09/21/2022] [Accepted: 09/23/2022] [Indexed: 06/02/2023]
Abstract
Human Activity Recognition (HAR) has been studied extensively, yet current approaches are not capable of generalizing across different domains (i.e., subjects, devices, or datasets) with acceptable performance. This lack of generalization hinders the applicability of these models in real-world environments. As deep neural networks are becoming increasingly popular in recent work, there is a need for an explicit comparison between handcrafted and deep representations in Out-of-Distribution (OOD) settings. This paper compares both approaches in multiple domains using homogenized public datasets. First, we compare several metrics to validate three different OOD settings. In our main experiments, we then verify that even though deep learning initially outperforms models with handcrafted features, the situation is reversed as the distance from the training distribution increases. These findings support the hypothesis that handcrafted features may generalize better across specific domains.
Affiliation(s)
- Nuno Bento
  - Associação Fraunhofer Portugal Research, Rua Alfredo Allen 455/461, 4200-135 Porto, Portugal
- Joana Rebelo
  - Associação Fraunhofer Portugal Research, Rua Alfredo Allen 455/461, 4200-135 Porto, Portugal
- Marília Barandas
  - Associação Fraunhofer Portugal Research, Rua Alfredo Allen 455/461, 4200-135 Porto, Portugal
  - Laboratório de Instrumentação, Engenharia Biomédica e Física da Radiação (LIBPhys–UNL), Departamento de Física, Faculdade de Ciências e Tecnologia (FCT), Universidade Nova de Lisboa, 2829-516 Caparica, Portugal
- André V. Carreiro
  - Associação Fraunhofer Portugal Research, Rua Alfredo Allen 455/461, 4200-135 Porto, Portugal
- Andrea Campagner
  - Dipartimento di Informatica, Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, 20126 Milan, Italy
- Federico Cabitza
  - Dipartimento di Informatica, Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, 20126 Milan, Italy
  - IRCCS Istituto Ortopedico Galeazzi, 20161 Milan, Italy
- Hugo Gamboa
  - Associação Fraunhofer Portugal Research, Rua Alfredo Allen 455/461, 4200-135 Porto, Portugal
  - Laboratório de Instrumentação, Engenharia Biomédica e Física da Radiação (LIBPhys–UNL), Departamento de Física, Faculdade de Ciências e Tecnologia (FCT), Universidade Nova de Lisboa, 2829-516 Caparica, Portugal
13
|
Araújo DC, Veloso AA, Borges KBG, Carvalho MDG. Prognosing the risk of COVID-19 death through a machine learning-based routine blood panel: A retrospective study in Brazil. Int J Med Inform 2022; 165:104835. [PMID: 35908372 PMCID: PMC9327247 DOI: 10.1016/j.ijmedinf.2022.104835] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/08/2022] [Revised: 07/17/2022] [Accepted: 07/19/2022] [Indexed: 01/08/2023]
Abstract
BACKGROUND Despite an extensive network of primary care availability, Brazil has suffered profoundly during the COVID-19 pandemic, experiencing the greatest sanitary collapse in its history. Thus, it is important to understand phenotype risk factors for SARS-CoV-2 infection severity in the Brazilian population in order to provide novel insights into the pathogenesis of the disease. OBJECTIVE This study proposes to predict the risk of COVID-19 death through machine learning, using blood biomarkers data from patients admitted to two large hospitals in Brazil. METHODS We retrospectively collected blood biomarkers data in a 24-h time window from 6,979 patients with COVID-19 confirmed by positive RT-PCR admitted to two large hospitals in Brazil, of whom 291 (4.2%) died and 6,688 (95.8%) were discharged. We then developed a large-scale exploration of risk models to predict the probability of COVID-19 severity, finally choosing the best performing model regarding the average AUROC. To improve generalizability, for each model five different testing scenarios were conducted, including two external validations. RESULTS We developed a machine learning-based panel composed of parameters extracted from the complete blood count (lymphocytes, MCV, platelets and RDW), in addition to C-Reactive Protein, which yielded an average AUROC of 0.91 ± 0.01 to predict death by COVID-19 confirmed by positive RT-PCR within a 24-h window. CONCLUSION Our study suggests that routine laboratory variables could be useful to identify COVID-19 patients under higher risk of death using machine learning. Further studies are needed for validating the model in other populations and contexts, since the natural history of SARS-CoV-2 infection and its consequences on the hematopoietic system and other organs is still quite recent.
Affiliation(s)
- Daniella Castro Araújo
  - Huna, São Paulo, SP, Brazil; Departamento de Ciência da Computação, Universidade Federal de Minas Gerais, Belo Horizonte, MG, Brazil
- Adriano Alonso Veloso
  - Departamento de Ciência da Computação, Universidade Federal de Minas Gerais, Belo Horizonte, MG, Brazil
14
|
Mollaei N, Fujao C, Silva L, Rodrigues J, Cepeda C, Gamboa H. Human-Centered Explainable Artificial Intelligence: Automotive Occupational Health Protection Profiles in Prevention Musculoskeletal Symptoms. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2022; 19:ijerph19159552. [PMID: 35954919 PMCID: PMC9368597 DOI: 10.3390/ijerph19159552] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/01/2022] [Revised: 07/27/2022] [Accepted: 07/27/2022] [Indexed: 02/05/2023]
Abstract
In automotive and industrial settings, occupational physicians are responsible for monitoring workers’ health protection profiles. Workers’ Functional Work Ability (FWA) status is used to create Occupational Health Protection Profiles (OHPP). This is a novel longitudinal study in comparison with previous research that has predominantly relied on the causality and explainability of human-understandable models for industrial technical teams like ergonomists. The application of artificial intelligence can support the decision-making to go from a worker’s Functional Work Ability to explanations by integrating explainability into medical (restriction) and support in contexts of individual, work-related, and organizational risk conditions. A sample of 7857 for the prognosis part of OHPP based on Functional Work Ability in the Portuguese language in the automotive industry was taken from 2019 to 2021. The most suitable regression models to predict the next medical appointment for the workers’ body parts protection were the models based on CatBoost regression, with an RMSLE of 0.84 and 1.23 weeks (mean error), respectively. CatBoost algorithm is also used to predict the next body part severity of OHPP. This information can help our understanding of potential risk factors for OHPP and identify warning signs of the early stages of musculoskeletal symptoms and work-related absenteeism.
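The RMSLE reported in this abstract is, in its usual definition, the root mean squared error computed on log(1 + x)-transformed values. A minimal Python sketch under that assumption follows; the function name and the toy inputs are hypothetical and unrelated to the study data, and the abstract does not state which variant of the metric the authors computed.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative targets,
    using log1p(x) = log(1 + x) so that zero values are allowed."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    if min(min(y_true), min(y_pred)) < 0:
        raise ValueError("RMSLE is defined for non-negative values only")
    squared = [(math.log1p(p) - math.log1p(t)) ** 2
               for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Toy example (hypothetical): weeks until the next appointment.
toy_error = rmsle([4.0, 10.0, 2.0], [5.0, 8.0, 2.0])
```

Because the error is taken on a log scale, the metric penalizes relative rather than absolute deviations, which suits targets such as time intervals that span different orders of magnitude.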
Affiliation(s)
- Nafiseh Mollaei
  - LIBPhys, Physics Department, Faculty of Sciences and Technology, Nova University of Lisbon, 2825-149 Caparica, Portugal
- Carlos Fujao
  - Volkswagen Autoeuropa, Industrial Engineering and Lean Management, Quinta da Marquesa, 2954-024 Quinta do Anjo, Portugal
- Luis Silva
  - LIBPhys, Physics Department, Faculty of Sciences and Technology, Nova University of Lisbon, 2825-149 Caparica, Portugal
- Joao Rodrigues
  - LIBPhys, Physics Department, Faculty of Sciences and Technology, Nova University of Lisbon, 2825-149 Caparica, Portugal
- Catia Cepeda
  - LIBPhys, Physics Department, Faculty of Sciences and Technology, Nova University of Lisbon, 2825-149 Caparica, Portugal
- Hugo Gamboa
  - LIBPhys, Physics Department, Faculty of Sciences and Technology, Nova University of Lisbon, 2825-149 Caparica, Portugal
15
|
Campagner A, Sternini F, Cabitza F. Decisions are not all equal-Introducing a utility metric based on case-wise raters' perceptions. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2022; 221:106930. [PMID: 35690505 DOI: 10.1016/j.cmpb.2022.106930] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/10/2022] [Revised: 05/13/2022] [Accepted: 05/31/2022] [Indexed: 06/15/2023]
Abstract
Background and Objective Evaluation of AI-based decision support systems (AI-DSS) is of critical importance in practical applications, nonetheless common evaluation metrics fail to properly consider relevant and contextual information. In this article we discuss a novel utility metric, the weighted Utility (wU), for the evaluation of AI-DSS, which is based on the raters' perceptions of their annotation hesitation and of the relevance of the training cases. Methods We discuss the relationship between the proposed metric and other previous proposals; and we describe the application of the proposed metric for both model evaluation and optimization, through three realistic case studies. Results We show that our metric generalizes the well-known Net Benefit, as well as other common error-based and utility-based metrics. Through the empirical studies, we show that our metric can provide a more flexible tool for the evaluation of AI models. We also show that, compared to other optimization metrics, model optimization based on the wU can provide significantly better performance (AUC 0.862 vs 0.895, p-value <0.05), especially on cases judged to be more complex by the human annotators (AUC 0.85 vs 0.92, p-value <0.05). Conclusions We make the point for having utility as a primary concern in the evaluation and optimization of machine learning models in critical domains, like the medical one; and for the importance of a human-centred approach to assess the potential impact of AI models on human decision making also on the basis of further information that can be collected during the ground-truthing process.
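The abstract states that the weighted Utility generalizes the Net Benefit but does not give the wU formula, so the sketch below covers only the standard Net Benefit from decision curve analysis: true positives per case minus false positives per case weighted by the odds of the decision threshold. The function name and toy data are hypothetical; this is not the authors' metric.

```python
def net_benefit(y_true, y_prob, threshold):
    """Standard Net Benefit at a probability threshold pt:
    TP/N - FP/N * pt / (1 - pt)."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    if not y_true or len(y_true) != len(y_prob):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # A case is flagged positive when its predicted probability
    # reaches the threshold.
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


# Toy data for illustration only (hypothetical).
labels = [1, 0, 1, 0]
probs = [0.9, 0.6, 0.4, 0.1]
nb_low = net_benefit(labels, probs, 0.2)
nb_mid = net_benefit(labels, probs, 0.5)
```

The threshold encodes how many false positives a decision maker would accept per true positive, so the same model can have a different Net Benefit at each threshold.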
Affiliation(s)
- Andrea Campagner
  - Dipartimento di Informatica, Sistemistica e Comunicazione, Università di Milano-Bicocca, Milano, Italy
- Federico Sternini
  - Polito(BIO)Med Lab, Politecnico di Torino, Torino, Italy; USE-ME-D srl, I3P Politecnico di Torino, Torino, Italy
- Federico Cabitza
  - Dipartimento di Informatica, Sistemistica e Comunicazione, Università di Milano-Bicocca, Milano, Italy; IRCCS Istituto Ortopedico Galeazzi, Milan, Italy
16
|
Borsci S, Lehtola VV, Nex F, Yang MY, Augustijn EW, Bagheriye L, Brune C, Kounadi O, Li J, Moreira J, Van Der Nagel J, Veldkamp B, Le DV, Wang M, Wijnhoven F, Wolterink JM, Zurita-Milla R. Embedding artificial intelligence in society: looking beyond the EU AI master plan using the culture cycle. AI & SOCIETY 2022. [DOI: 10.1007/s00146-021-01383-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022]
Abstract
The European Union (EU) Commission’s whitepaper on Artificial Intelligence (AI) proposes shaping the emerging AI market so that it better reflects common European values. It is a master plan that builds upon the EU AI High-Level Expert Group guidelines. This article reviews the master plan, from a culture cycle perspective, to reflect on its potential clashes with current societal, technical, and methodological constraints. We identify two main obstacles in the implementation of this plan: (i) the lack of a coherent EU vision to drive future decision-making processes at state and local levels and (ii) the lack of methods to support a sustainable diffusion of AI in our society. The lack of a coherent vision stems from not considering societal differences across the EU member states. We suggest that these differences may lead to a fractured market and an AI crisis in which different members of the EU will adopt nation-centric strategies to exploit AI, thus preventing the development of a frictionless market as envisaged by the EU. Moreover, the Commission aims at changing the AI development culture by proposing a human-centred and safety-first perspective that is not supported by methodological advancements, thus running the risk of unforeseen social and societal impacts of AI. We discuss potential societal, technical, and methodological gaps that should be filled to avoid the risks of developing AI systems at the expense of society. Our analysis results in the recommendation that the EU regulators and policymakers consider how to complement the EC programme with rules and compensatory mechanisms to avoid market fragmentation due to local and global ambitions. Moreover, regulators should go beyond the human-centred approach by establishing a research agenda that seeks answers to the open technical and methodological questions regarding the development and assessment of human-AI co-action, aiming for a sustainable diffusion of AI in society.
17
|
Oala L, Murchison AG, Balachandran P, Choudhary S, Fehr J, Leite AW, Goldschmidt PG, Johner C, Schörverth EDM, Nakasi R, Meyer M, Cabitza F, Baird P, Prabhu C, Weicken E, Liu X, Wenzel M, Vogler S, Akogo D, Alsalamah S, Kazim E, Koshiyama A, Piechottka S, Macpherson S, Shadforth I, Geierhofer R, Matek C, Krois J, Sanguinetti B, Arentz M, Bielik P, Calderon-Ramirez S, Abbood A, Langer N, Haufe S, Kherif F, Pujari S, Samek W, Wiegand T. Machine Learning for Health: Algorithm Auditing & Quality Control. J Med Syst 2021; 45:105. [PMID: 34729675 PMCID: PMC8562935 DOI: 10.1007/s10916-021-01783-y] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2021] [Accepted: 10/11/2021] [Indexed: 01/26/2023]
Abstract
Developers proposing new machine learning for health (ML4H) tools often pledge to match or even surpass the performance of existing tools, yet the reality is usually more complicated. Reliable deployment of ML4H to the real world is challenging as examples from diabetic retinopathy or Covid-19 screening show. We envision an integrated framework of algorithm auditing and quality control that provides a path towards the effective and reliable application of ML systems in healthcare. In this editorial, we give a summary of ongoing work towards that vision and announce a call for participation to the special issue Machine Learning for Health: Algorithm Auditing & Quality Control in this journal to advance the practice of ML4H auditing.
Affiliation(s)
- Jana Fehr
  - Hasso-Plattner-Institute of Digital Engineering, Potsdam, Germany
- Alixandro Werneck Leite
  - Machine Learning Laboratory in Finance and Organizations, Universidade de Brasília, Brasília, Brazil
- Xiaoxuan Liu
  - University Hospitals Birmingham NHS Foundation Trust & Academic Unit of Ophthalmology, Institute of Inflammation and Ageing, College of Medical and Dental Sciences, University of Birmingham, Birmingham, United Kingdom
- Shada Alsalamah
  - Information Systems Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
  - Digital Health and Innovation Department, Science Division, World Health Organization, Winterthur, Switzerland
- Emre Kazim
  - University College London, London, United Kingdom
- Joachim Krois
  - Oral Diagnostics, Digital Health and Health Services Research, Charité-Universitätsmedizin, Berlin, Germany
- Matthew Arentz
  - Department of Global Health, University of Washington, Washington, USA
- Nicolas Langer
  - Department of Psychology, University of Zurich, Zürich, Switzerland
- Ferath Kherif
  - Laboratory for Research in Neuroimaging, Department of Clinical Neuroscience, Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland
- Sameer Pujari
  - Digital Health and Innovation Department, Science Division, World Health Organization, Winterthur, Switzerland
18
|
Cabitza F, Campagner A, Soares F, García de Guadiana-Romualdo L, Challa F, Sulejmani A, Seghezzi M, Carobene A. The importance of being external. methodological insights for the external validation of machine learning models in medicine. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2021; 208:106288. [PMID: 34352688 DOI: 10.1016/j.cmpb.2021.106288] [Citation(s) in RCA: 113] [Impact Index Per Article: 28.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/03/2021] [Accepted: 07/09/2021] [Indexed: 06/13/2023]
Abstract
Background and Objective Medical machine learning (ML) models tend to perform better on data from the same cohort than on new data, often due to overfitting or covariate shifts. For these reasons, external validation (EV) is a necessary practice in the evaluation of medical ML. However, there is still a gap in the literature on how to interpret EV results and hence assess the robustness of ML models. METHODS We fill this gap by proposing a meta-validation method to assess the soundness of EV procedures. In doing so, we complement the usual way to assess EV by considering both dataset cardinality and the similarity of the EV dataset with respect to the training set. We then investigate how the notions of cardinality and similarity can be used to inform on the reliability of a validation procedure, by integrating them into two summative data visualizations. RESULTS We illustrate our methodology by applying it to the validation of a state-of-the-art COVID-19 diagnostic model on 8 EV sets, collected across 3 different continents. The model performance was moderately impacted by data similarity (Pearson ρ = 0.38, p < 0.001). In the EV, the validated model reported good AUC (average: 0.84), acceptable calibration (average: 0.17) and utility (average: 0.50). The validation datasets were adequate in terms of dataset cardinality and similarity, thus suggesting the soundness of the results. We also provide a qualitative guideline to evaluate the reliability of validation procedures, and we discuss the importance of proper external validation in light of the obtained results. CONCLUSIONS In this paper, we propose a novel, lean methodology to: 1) study how the similarity between training and validation sets impacts the generalizability of an ML model; 2) assess the soundness of EV evaluations along three complementary performance dimensions: discrimination, utility and calibration; 3) draw conclusions on the robustness of the model under validation. We applied this methodology to a state-of-the-art model for the diagnosis of COVID-19 from routine blood tests, and showed how to interpret the results in light of the presented framework.
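The relationship between data similarity and model performance is summarized in this abstract with a Pearson correlation coefficient. The minimal Python sketch below shows how such a coefficient is computed from paired values. The abstract does not specify the similarity measure or the unit of analysis, so the function name, the variable names, and the placeholder numbers are all hypothetical and do not reproduce the study's result.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two paired samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(ss_x * ss_y)


# Placeholder values (hypothetical): one similarity score and one
# performance score per validation unit.
similarity = [1.0, 2.0, 3.0, 4.0]
performance = [1.0, 3.0, 2.0, 4.0]
r = pearson_r(similarity, performance)
```

A coefficient near zero would suggest that performance is largely insensitive to how closely the validation data resemble the training data, while a larger positive value points to dependence on similarity.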
Collapse
Affiliation(s)
- Federico Cabitza
- University of Milano-Bicocca, Viale Sarca 336, Milano, 20126, Italy.
| | - Andrea Campagner
- University of Milano-Bicocca, Viale Sarca 336, Milano, 20126, Italy
| | - Felipe Soares
- Department of Industrial Engineering - Universidade Federal do Rio Grande do Sul. Porto Alegre, Brazil
| | | | - Feyissa Challa
- National Reference Laboratory for Clinical Chemistry, Ethiopian Public Health Institute, Addis Ababa, Ethiopia
| | - Adela Sulejmani
- Laboratorio di chimica clinica, Ospedale di Desio e Monza, ASST-Monza, Dipartimento di medicina e chirurgia, Universit di Milano-Bicocca, Monza, Italy
| | - Michela Seghezzi
- Laboratorio di chimica clinica, Ospedale Papa Giovanni XXIII, Bergamo, Italy
| | - Anna Carobene
- Laboratory Medicine, IRCCS San Raffaele Scientific Institute, Milan, Italy
| |
|
19
|
Cabitza F, Campagner A. The need to separate the wheat from the chaff in medical informatics: Introducing a comprehensive checklist for the (self)-assessment of medical AI studies. Int J Med Inform 2021; 153:104510. [PMID: 34108105 DOI: 10.1016/j.ijmedinf.2021.104510] [Citation(s) in RCA: 158] [Impact Index Per Article: 39.5] [Received: 05/11/2021] [Revised: 05/26/2021] [Accepted: 05/27/2021] [Indexed: 12/23/2022]
Abstract
This editorial aims to contribute to the current debate about the quality of studies that apply machine learning (ML) methodologies to medical data to extract value from them and provide clinicians with viable and useful tools supporting everyday care practices. We propose a practical checklist to help authors self-assess the quality of their contribution and to help reviewers recognize and appreciate high-quality medical ML studies by distinguishing them from the mere application of ML techniques to medical data.
Affiliation(s)
- Federico Cabitza
- DISCo, University of Milano-Bicocca, viale Sarca 336, Milano 20126, Italy.
| | - Andrea Campagner
- DISCo, University of Milano-Bicocca, viale Sarca 336, Milano 20126, Italy
| |
|
20
|
Harada Y, Katsukura S, Kawamura R, Shimizu T. Effects of a Differential Diagnosis List of Artificial Intelligence on Differential Diagnoses by Physicians: An Exploratory Analysis of Data from a Randomized Controlled Study. Int J Environ Res Public Health 2021; 18:5562. [PMID: 34070958 PMCID: PMC8196999 DOI: 10.3390/ijerph18115562] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 03/31/2021] [Revised: 05/07/2021] [Accepted: 05/21/2021] [Indexed: 11/16/2022]
Abstract
A diagnostic decision support system (DDSS) is expected to reduce diagnostic errors. However, its effect on physicians' diagnostic decisions remains unclear. Our study aimed to assess the prevalence of artificial intelligence (AI) diagnoses in physicians' differential diagnoses when they use an AI-driven DDSS that generates a differential diagnosis from information entered by the patient before the clinical encounter. In this randomized controlled study, an exploratory analysis was performed. Twenty-two physicians were required to generate up to three differential diagnoses per case by reading 16 clinical vignettes. The participants were divided into an intervention group and a control group, with and without the AI differential diagnosis list, respectively. The prevalence of physician diagnoses identical to the AI differential diagnosis (primary outcome) was significantly higher in the intervention group than in the control group (70.2% vs. 55.1%, p < 0.001). The primary outcome was significantly more than 10% higher in the intervention group than in the control group, except among attending physicians and physicians who did not trust AI. This study suggests that at least 15% of physicians' differential diagnoses were affected by the differential diagnosis list in the AI-driven DDSS.
Affiliation(s)
- Yukinori Harada
- Department of General Internal Medicine, Nagano Chuo Hospital, Nagano 380-0814, Japan
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi 321-0293, Japan
| | - Shinichi Katsukura
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi 321-0293, Japan
| | - Ren Kawamura
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi 321-0293, Japan
| | - Taro Shimizu
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi 321-0293, Japan
- Correspondence: Tel.: +81-282-86-1111
| |
|
21
|
Cabitza F, Campagner A, Sconfienza LM. Studying human-AI collaboration protocols: the case of the Kasparov's law in radiological double reading. Health Inf Sci Syst 2021; 9:8. [PMID: 33585029 PMCID: PMC7864624 DOI: 10.1007/s13755-021-00138-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 09/17/2020] [Accepted: 01/13/2021] [Indexed: 12/17/2022]
Abstract
Purpose The integration of Artificial Intelligence into medical practices has recently been advocated for its promise to bring increased efficiency and effectiveness to these practices. Nonetheless, little research has so far been aimed at understanding the best human-AI interaction protocols in collaborative tasks, even in currently more viable settings, like independent double-reading screening tasks. Methods To this aim, we report on a retrospective case–control study, involving 12 board-certified radiologists, on the detection of knee lesions by means of Magnetic Resonance Imaging, in which we simulated the serial combination of two Deep Learning models with humans in eight double-reading protocols. Inspired by the so-called Kasparov’s Laws, we investigate whether the combination of humans and AI models could achieve better performance than AI models alone, and whether weak readers, when supported by fit-for-use interaction protocols, could outperform stronger readers. Results We discuss two main findings: groups of humans who perform significantly worse than a state-of-the-art AI can significantly outperform it if their judgements are aggregated by majority voting (in concordance with the first part of Kasparov’s law); small ensembles of significantly weaker readers can significantly outperform teams of stronger readers, supported by the same computational tool, when the judgements of the former are combined within “fit-for-use” protocols (in concordance with the second part of Kasparov’s law). Conclusion Our study shows that good interaction protocols can guarantee improved decision performance that easily surpasses the performance of individual agents, even of realistic super-human AI systems. This finding highlights the importance of focusing on how to guarantee better co-operation within human-AI teams, so as to enable safer and more human-sustainable care practices.
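The first finding above, that majority voting can lift a group of weaker readers above a stronger single reader, can be illustrated with a small worked calculation. This is only a sketch under a strong assumption that the abstract does not make: readers err independently and share the same accuracy (the classic Condorcet jury setting). The function names and the 70% and 80% figures are hypothetical and are not taken from the study.

```python
from math import comb


def majority_vote(labels):
    """Return the binary label (0 or 1) chosen by most readers; odd count expected."""
    if len(labels) % 2 == 0:
        raise ValueError("use an odd number of readers to avoid ties")
    return 1 if 2 * sum(labels) > len(labels) else 0


def majority_vote_accuracy(n_readers, reader_accuracy):
    """Probability that a majority of independent, equally accurate readers is right.

    Assumption: independence between readers, which real reading panels
    rarely satisfy exactly.
    """
    if n_readers % 2 == 0:
        raise ValueError("use an odd number of readers to avoid ties")
    needed = n_readers // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_readers, k)
        * reader_accuracy ** k
        * (1 - reader_accuracy) ** (n_readers - k)
        for k in range(needed, n_readers + 1)
    )


# Hypothetical numbers: five readers at 70% accuracy versus one system at 80%.
group_accuracy = majority_vote_accuracy(5, 0.7)
single_accuracy = 0.8
print(f"group of five: {group_accuracy:.4f}, single reader: {single_accuracy:.4f}")
print("example vote:", majority_vote([1, 0, 1]))
```

Under these assumptions the group of five reaches roughly 0.837, above the single 0.8 reader; correlated errors between readers would shrink or remove that gain.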
Affiliation(s)
- Federico Cabitza
- Università degli Studi di Milano-Bicocca, Viale Sarca 336, 20126 Milan, Italy
| | - Andrea Campagner
- Università degli Studi di Milano-Bicocca, Viale Sarca 336, 20126 Milan, Italy
| | - Luca Maria Sconfienza
- Department of Biomedical Sciences for Health, University of Milan, Milan, Italy; IRCCS Istituto Ortopedico Galeazzi, Milan, Italy
| |
|