1
Campagner A, Barandas M, Folgado D, Gamboa H, Cabitza F. Ensemble Predictors: Possibilistic Combination of Conformal Predictors for Multivariate Time Series Classification. IEEE Trans Pattern Anal Mach Intell 2024; PP:1-12. [PMID: 38607715 DOI: 10.1109/tpami.2024.3388097] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/14/2024]
Abstract
In this article, we propose a conceptual framework to study ensembles of conformal predictors (CP), which we call Ensemble Predictors (EP). Our approach is inspired by the application of imprecise probabilities in information fusion. Based on the proposed framework, we study, for the first time in the literature, the theoretical properties of CP ensembles in a general setting, by focusing on simple and commonly used possibilistic combination rules. We also illustrate the applicability of the proposed methods in the setting of multivariate time-series classification, showing that these methods provide better performance (in terms of robustness, conservativeness, accuracy and running time) than both standard classification algorithms and other combination rules proposed in the literature, on a large set of benchmarks from the UCR time series archive.
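As a rough illustration of what a possibilistic combination of conformal predictors can look like, the sketch below merges per-label conformal p-values from several predictors with the minimum (conjunctive) and maximum (disjunctive) rules, two standard operators from possibility theory. The abstract does not say which rules the authors study, so the choice of rules, the function names and the toy p-values are all assumptions made for this example.

```python
from typing import Dict, List, Set

def combine_p_values(p_values: List[Dict[str, float]], rule: str = "min") -> Dict[str, float]:
    """Combine per-label conformal p-values coming from several predictors.

    rule="min" is the conjunctive (cautious) rule, rule="max" the disjunctive one.
    Both rules are assumed here for illustration only.
    """
    if rule == "min":
        aggregate = min
    elif rule == "max":
        aggregate = max
    else:
        raise ValueError(f"unknown rule: {rule}")
    labels = p_values[0].keys()
    return {label: aggregate(p[label] for p in p_values) for label in labels}

def prediction_set(combined: Dict[str, float], epsilon: float) -> Set[str]:
    """Keep every label whose combined p-value exceeds the significance level."""
    return {label for label, p in combined.items() if p > epsilon}

# Hypothetical p-values from two conformal predictors (e.g. one per channel).
p_first = {"walk": 0.60, "run": 0.20, "sit": 0.02}
p_second = {"walk": 0.40, "run": 0.08, "sit": 0.05}

conjunctive = combine_p_values([p_first, p_second], rule="min")
disjunctive = combine_p_values([p_first, p_second], rule="max")
print(prediction_set(conjunctive, epsilon=0.1))
print(prediction_set(disjunctive, epsilon=0.1))
```

In this toy case the minimum rule yields a smaller prediction set than the maximum rule; whether either rule preserves the validity guarantee of the individual predictors is the kind of question the article addresses, and the sketch makes no claim about it.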
2
Cabitza F, Natali C, Famiglini L, Campagner A, Caccavella V, Gallazzi E. Never tell me the odds: Investigating pro-hoc explanations in medical decision making. Artif Intell Med 2024; 150:102819. [PMID: 38553159 DOI: 10.1016/j.artmed.2024.102819] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2023] [Revised: 01/28/2024] [Accepted: 02/21/2024] [Indexed: 04/02/2024]
Abstract
This paper examines a kind of explainable AI, centered around what we term pro-hoc explanations, that is a form of support that consists of offering alternative explanations (one for each possible outcome) instead of a specific post-hoc explanation following specific advice. Specifically, our support mechanism utilizes explanations by examples, featuring analogous cases for each category in a binary setting. Pro-hoc explanations are an instance of what we called frictional AI, a general class of decision support aimed at achieving a useful compromise between the increase of decision effectiveness and the mitigation of cognitive risks, such as over-reliance, automation bias and deskilling. To illustrate an instance of frictional AI, we conducted an empirical user study to investigate its impact on the task of radiological detection of vertebral fractures in x-rays. Our study engaged 16 orthopedists in a 'human-first, second-opinion' interaction protocol. In this protocol, clinicians first made initial assessments of the x-rays without AI assistance and then provided their final diagnosis after considering the pro-hoc explanations. Our findings indicate that physicians, particularly those with less experience, perceived pro-hoc XAI support as significantly beneficial, even though it did not notably enhance their diagnostic accuracy. However, their increased confidence in final diagnoses suggests a positive overall impact. Given the promisingly high effect size observed, our results advocate for further research into pro-hoc explanations specifically, and into the broader concept of frictional AI.
Affiliation(s)
- Federico Cabitza
  - Università degli Studi di Milano-Bicocca, Milan, Italy; IRCCS Istituto Ortopedico Galeazzi, Milan, Italy
- Chiara Natali
  - Università degli Studi di Milano-Bicocca, Milan, Italy
- Enrico Gallazzi
  - Istituto Ortopedico Gaetano Pini - ASST Pini-CTO, Milan, Italy
3
Famiglini L, Campagner A, Barandas M, La Maida GA, Gallazzi E, Cabitza F. Evidence-based XAI: An empirical approach to design more effective and explainable decision support systems. Comput Biol Med 2024; 170:108042. [PMID: 38308866 DOI: 10.1016/j.compbiomed.2024.108042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2023] [Revised: 12/19/2023] [Accepted: 01/26/2024] [Indexed: 02/05/2024]
Abstract
This paper proposes a user study aimed at evaluating the impact of Class Activation Maps (CAMs) as an eXplainable AI (XAI) method in a radiological diagnostic task, the detection of thoracolumbar (TL) fractures from vertebral X-rays. In particular, we focus on two oft-neglected features of CAMs, granularity and coloring: which features (lower-level vs higher-level) the maps should highlight, and which coloring scheme they should adopt, to have a better impact on the decision-making process, in terms of both diagnostic accuracy (that is, effectiveness) and user-centered dimensions, such as perceived confidence and utility (that is, satisfaction), depending on case complexity, AI accuracy, and user expertise. Our findings show that lower-level feature CAMs, which highlight more focused anatomical landmarks, are associated with higher diagnostic accuracy than higher-level feature CAMs, particularly among experienced physicians. Moreover, despite the intuitive appeal of semantic CAMs, traditionally colored CAMs consistently yielded higher diagnostic accuracy across all groups. Our results challenge some prevalent assumptions in the XAI field and emphasize the importance of adopting an evidence-based and human-centered approach to design and evaluate AI- and XAI-assisted diagnostic tools. To this aim, the paper also proposes a hierarchy of evidence framework to help designers and practitioners choose the XAI solutions that optimize performance and satisfaction on the basis of the strongest evidence available, or to focus on the gaps in the literature that need to be filled to move from opinionated and eminence-based research to research grounded in empirical evidence and in end-user work and preferences.
Affiliation(s)
- Lorenzo Famiglini
  - Department of Computer Science, Systems and Communication, University of Milano-Bicocca, Milan, Italy
- Marilia Barandas
  - Associação Fraunhofer Portugal Research, Rua Alfredo Allen 455/461, Porto, Portugal
- Enrico Gallazzi
  - Istituto Ortopedico Gaetano Pini - ASST Pini-CTO, Milan, Italy
- Federico Cabitza
  - Department of Computer Science, Systems and Communication, University of Milano-Bicocca, Milan, Italy; IRCCS Istituto Ortopedico Galeazzi, Milan, Italy
4
Campagner A, Milella F, Guida S, Bernareggi S, Banfi G, Cabitza F. Assessment of Fast-Track Pathway in Hip and Knee Replacement Surgery by Propensity Score Matching on Patient-Reported Outcomes. Diagnostics (Basel) 2023; 13:diagnostics13061189. [PMID: 36980497 PMCID: PMC10047673 DOI: 10.3390/diagnostics13061189] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2023] [Revised: 03/14/2023] [Accepted: 03/17/2023] [Indexed: 03/30/2023] Open
Abstract
Total hip (THA) and total knee (TKA) arthroplasty procedures have steadily increased over the past few decades, and their use is expected to grow further, mainly due to an increasing number of elderly patients. Cost-containment strategies that support rapid recovery with positive functional outcomes, high patient satisfaction, and enhanced patient-reported outcomes are needed. A Fast Track surgical procedure (FT) is a coordinated perioperative approach aimed at expediting early mobilization and recovery following surgery and, accordingly, shortening the length of hospital stay (LOS), convalescence and costs. In this view, rapid rehabilitation surgery optimizes traditional rehabilitation methods by integrating evidence-based practices into the procedure. The aim of the present study was to compare the effectiveness of Fast Track versus Care-as-Usual surgical procedures and pathways (including rehabilitation) on a mid-term patient-reported outcome (PRO), the SF12 (with regard to both Physical and Mental Scores), 3 months after hip or knee replacement surgery, using propensity score matching (PSM) analysis to address the comparability of the groups in a non-randomized study. We were interested in evaluating the entire pathways, including the postoperative rehabilitation stage; therefore, we only used early home discharge as a surrogate to differentiate between the Fast Track and Care-as-Usual rehabilitation pathways. Our study shows that the entire Fast Track pathway, which includes the post-operative rehabilitation stage, has a significantly positive impact on physical health-related status (SF12 Physical Scores), as perceived by patients 3 months after hip or knee replacement surgery, as opposed to the standardized program, both in terms of the PROs score and the relative improvements observed, as compared with the minimum clinically important difference. This result encourages additional research into the effects of Fast Track rehabilitation on the entire process of care for patients undergoing hip or knee arthroplasty, focusing only on patient-reported outcomes.
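To make the matching step concrete, here is a minimal sketch of greedy 1:1 nearest-neighbour matching on propensity scores, without replacement and with a caliper. The abstract does not describe the matching algorithm, the covariates or the caliper used in the study, so the procedure, the function name `match_nearest`, the caliper of 0.05 and the patient identifiers and scores below are all assumptions for illustration. The propensity scores are taken as already estimated (for instance by a logistic regression of pathway on baseline covariates).

```python
from typing import Dict, List, Tuple

def match_nearest(
    treated: Dict[str, float],
    control: Dict[str, float],
    caliper: float = 0.05,
) -> List[Tuple[str, str]]:
    """Greedy 1:1 nearest-neighbour matching on precomputed propensity scores.

    Each treated unit is paired with the closest still-unused control unit,
    provided the score difference does not exceed the caliper.
    """
    available = dict(control)  # controls not yet used in a pair
    pairs: List[Tuple[str, str]] = []
    # Process treated units in order of increasing score for determinism.
    for t_id, t_score in sorted(treated.items(), key=lambda item: item[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # matching without replacement
    return pairs

# Hypothetical propensity scores for Fast Track (treated) and Care-as-Usual (control).
fast_track = {"T1": 0.30, "T2": 0.62, "T3": 0.90}
care_as_usual = {"C1": 0.28, "C2": 0.35, "C3": 0.60, "C4": 0.10}

print(match_nearest(fast_track, care_as_usual))
```

Here T3 stays unmatched because no control lies within the caliper; outcomes such as the SF12 scores would then be compared only across the matched pairs.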
Affiliation(s)
- Frida Milella
  - IRCCS Istituto Ortopedico Galeazzi, 20157 Milano, Italy
- Giuseppe Banfi
  - IRCCS Istituto Ortopedico Galeazzi, 20157 Milano, Italy
  - Faculty of Medicine and Surgery, Università Vita-Salute San Raffaele, 20132 Milano, Italy
- Federico Cabitza
  - IRCCS Istituto Ortopedico Galeazzi, 20157 Milano, Italy
  - Dipartimento di Informatica, Sistemistica e Comunicazione, University of Milano-Bicocca, 20126 Milano, Italy
5
Campagner A, Ciucci D, Denœux T. A General Framework for Evaluating and Comparing Soft Clusterings. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.11.114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/15/2022]
6
Bento N, Rebelo J, Barandas M, Carreiro AV, Campagner A, Cabitza F, Gamboa H. Comparing Handcrafted Features and Deep Neural Representations for Domain Generalization in Human Activity Recognition. Sensors (Basel) 2022; 22:s22197324. [PMID: 36236427 PMCID: PMC9572241 DOI: 10.3390/s22197324] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/04/2022] [Revised: 09/21/2022] [Accepted: 09/23/2022] [Indexed: 06/02/2023]
Abstract
Human Activity Recognition (HAR) has been studied extensively, yet current approaches are not capable of generalizing across different domains (i.e., subjects, devices, or datasets) with acceptable performance. This lack of generalization hinders the applicability of these models in real-world environments. As deep neural networks are becoming increasingly popular in recent work, there is a need for an explicit comparison between handcrafted and deep representations in Out-of-Distribution (OOD) settings. This paper compares both approaches in multiple domains using homogenized public datasets. First, we compare several metrics to validate three different OOD settings. In our main experiments, we then verify that even though deep learning initially outperforms models with handcrafted features, the situation is reversed as the distance from the training distribution increases. These findings support the hypothesis that handcrafted features may generalize better across specific domains.
Affiliation(s)
- Nuno Bento
  - Associação Fraunhofer Portugal Research, Rua Alfredo Allen 455/461, 4200-135 Porto, Portugal
- Joana Rebelo
  - Associação Fraunhofer Portugal Research, Rua Alfredo Allen 455/461, 4200-135 Porto, Portugal
- Marília Barandas
  - Associação Fraunhofer Portugal Research, Rua Alfredo Allen 455/461, 4200-135 Porto, Portugal
  - Laboratório de Instrumentação, Engenharia Biomédica e Física da Radiação (LIBPhys–UNL), Departamento de Física, Faculdade de Ciências e Tecnologia (FCT), Universidade Nova de Lisboa, 2829-516 Caparica, Portugal
- André V. Carreiro
  - Associação Fraunhofer Portugal Research, Rua Alfredo Allen 455/461, 4200-135 Porto, Portugal
- Andrea Campagner
  - Dipartimento di Informatica, Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, 20126 Milan, Italy
- Federico Cabitza
  - Dipartimento di Informatica, Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, 20126 Milan, Italy
  - IRCCS Istituto Ortopedico Galeazzi, 20161 Milan, Italy
- Hugo Gamboa
  - Associação Fraunhofer Portugal Research, Rua Alfredo Allen 455/461, 4200-135 Porto, Portugal
  - Laboratório de Instrumentação, Engenharia Biomédica e Física da Radiação (LIBPhys–UNL), Departamento de Física, Faculdade de Ciências e Tecnologia (FCT), Universidade Nova de Lisboa, 2829-516 Caparica, Portugal
7
Campagner A, Ciucci D. Three-way Learnability: A Learning Theoretic Perspective on Three-way Decision. Annals of Computer Science and Information Systems 2022. [DOI: 10.15439/2022f18] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022] Open
8
Campagner A, Sternini F, Cabitza F. Decisions are not all equal-Introducing a utility metric based on case-wise raters' perceptions. Comput Methods Programs Biomed 2022; 221:106930. [PMID: 35690505 DOI: 10.1016/j.cmpb.2022.106930] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2022] [Revised: 05/13/2022] [Accepted: 05/31/2022] [Indexed: 06/15/2023]
Abstract
Background and Objective: Evaluation of AI-based decision support systems (AI-DSS) is of critical importance in practical applications; nonetheless, common evaluation metrics fail to properly consider relevant and contextual information. In this article we discuss a novel utility metric, the weighted Utility (wU), for the evaluation of AI-DSS, which is based on the raters' perceptions of their annotation hesitation and of the relevance of the training cases. Methods: We discuss the relationship between the proposed metric and previous proposals, and we describe the application of the proposed metric for both model evaluation and optimization, through three realistic case studies. Results: We show that our metric generalizes the well-known Net Benefit, as well as other common error-based and utility-based metrics. Through the empirical studies, we show that our metric can provide a more flexible tool for the evaluation of AI models. We also show that, compared to other optimization metrics, model optimization based on the wU can provide significantly better performance (AUC 0.862 vs 0.895, p-value <0.05), especially on cases judged to be more complex by the human annotators (AUC 0.85 vs 0.92, p-value <0.05). Conclusions: We make the case for having utility as a primary concern in the evaluation and optimization of machine learning models in critical domains, like the medical one, and for the importance of a human-centred approach that assesses the potential impact of AI models on human decision making also on the basis of further information that can be collected during the ground-truthing process.
Affiliation(s)
- Andrea Campagner
  - Dipartimento di Informatica, Sistemistica e Comunicazione, Università di Milano-Bicocca, Milano, Italy
- Federico Sternini
  - Polito(BIO)Med Lab, Politecnico di Torino, Torino, Italy; USE-ME-D srl, I3P Politecnico di Torino, Torino, Italy
- Federico Cabitza
  - Dipartimento di Informatica, Sistemistica e Comunicazione, Università di Milano-Bicocca, Milano, Italy; IRCCS Istituto Ortopedico Galeazzi, Milan, Italy
9
Campagner A, Famiglini L, Cabitza F. A Confidence Interval-Based Method for Classifier Re-Calibration. Stud Health Technol Inform 2022; 294:127-128. [PMID: 35612033 DOI: 10.3233/shti220413] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
We propose a re-calibration method for Machine Learning models, based on computing confidence intervals for the predicted confidence scores. We show the effectiveness of the proposed method on a COVID-19 diagnosis benchmark.
Affiliation(s)
- Federico Cabitza
  - University of Milano-Bicocca, Milano, Italy
  - IRCCS Istituto Ortopedico Galeazzi, Milan, Italy
10
Boffa S, Campagner A, Ciucci D, Yao Y. Aggregation operators on shadowed sets. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.02.046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
11
Famiglini L, Campagner A, Carobene A, Cabitza F. A robust and parsimonious machine learning method to predict ICU admission of COVID-19 patients. Med Biol Eng Comput 2022:10.1007/s11517-022-02543-x. [PMID: 35353302 PMCID: PMC8965547 DOI: 10.1007/s11517-022-02543-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2021] [Accepted: 02/27/2022] [Indexed: 01/08/2023]
Abstract
In this article, we discuss the development of prognostic machine learning (ML) models for COVID-19 progression, by focusing on the task of predicting ICU admission within (any of) the next 5 days. On the basis of 6,625 complete blood count (CBC) tests from 1,004 patients, of which 18% were admitted to intensive care unit (ICU), we created four ML models, by adopting a robust development procedure which was designed to minimize risks of bias and over-fitting, according to reference guidelines. The best model, a support vector machine, had an AUC of .85, a Brier score of .14, and a standardized net benefit of .69: these scores indicate that the model performed well over a variety of prediction criteria. We also conducted an interpretability study to back up our findings, showing that the data on which the developed model is based is consistent with the current medical literature. This also demonstrates that CBC data and ML methods can be used to predict COVID-19 patients' ICU admission at a relatively low cost: in particular, since CBC data can be quickly obtained by means of routine blood exams, our models could be used in resource-constrained settings and provide health practitioners with rapid and reliable indications.
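For readers unfamiliar with two of the metrics quoted above, the sketch below computes the Brier score and the (standardized) net benefit from their usual textbook definitions: the Brier score is the mean squared difference between predicted probability and outcome, the net benefit at threshold t is TP/N - (FP/N) * t / (1 - t), and the standardized version divides by the prevalence. The abstract does not report the decision threshold used, so the threshold of 0.2 and the toy labels and probabilities are assumptions for illustration only.

```python
from typing import Sequence

def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared error between predicted probabilities and binary outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def net_benefit(y_true: Sequence[int], y_prob: Sequence[float], threshold: float) -> float:
    """Net benefit at a probability threshold: TP/N - FP/N * t / (1 - t)."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)

def standardized_net_benefit(
    y_true: Sequence[int], y_prob: Sequence[float], threshold: float
) -> float:
    """Net benefit divided by the prevalence of the positive class."""
    prevalence = sum(y_true) / len(y_true)
    return net_benefit(y_true, y_prob, threshold) / prevalence

# Hypothetical outcomes (1 = admitted to ICU) and predicted probabilities.
outcomes = [1, 0, 0, 1, 0]
probabilities = [0.9, 0.2, 0.6, 0.4, 0.1]

print(brier_score(outcomes, probabilities))
print(standardized_net_benefit(outcomes, probabilities, threshold=0.2))
```

Lower Brier scores indicate better calibrated and sharper probabilities, while a standardized net benefit closer to 1 indicates more clinical utility at the chosen threshold.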
Affiliation(s)
- Lorenzo Famiglini
  - Department of Informatics, University of Milano-Bicocca, Milan, Italy
- Andrea Campagner
  - Department of Informatics, University of Milano-Bicocca, Milan, Italy
- Anna Carobene
  - IRCCS San Raffaele Scientific Institute, Milan, Italy
- Federico Cabitza
  - Department of Informatics, University of Milano-Bicocca, Milan, Italy
  - IRCCS Orthopedic Institute Galeazzi, Milan, Italy
12
Campagner A, Carobene A, Cabitza F. External validation of Machine Learning models for COVID-19 detection based on Complete Blood Count. Health Inf Sci Syst 2021; 9:37. [PMID: 34721844 PMCID: PMC8540880 DOI: 10.1007/s13755-021-00167-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 04/08/2021] [Accepted: 09/29/2021] [Indexed: 01/13/2023] Open
Abstract
Purpose: The rRT-PCR for COVID-19 diagnosis is affected by long turnaround time, potential shortage of reagents, high false-negative rates and high costs. Routine hematochemical tests are a faster and less expensive alternative for diagnosis. Thus, Machine Learning (ML) has been applied to hematological parameters to develop diagnostic tools and help clinicians in promptly managing positive patients. However, few ML models have been externally validated, making their real-world applicability unclear. Methods: We externally validate 6 state-of-the-art diagnostic ML models, based on Complete Blood Count (CBC) and trained on a dataset encompassing 816 COVID-19 positive cases. The external validation was performed based on two datasets, collected at two different hospitals in northern Italy and encompassing 163 and 104 COVID-19 positive cases, in terms of both error rate and calibration. Results and Conclusion: We report an average AUC of 95% and average Brier score of 0.11, out-performing existing ML methods, and showing good cross-site transportability. The best performing model (SVM) reported an average AUC of 97.5% (Sensitivity: 87.5%, Specificity: 94%), comparable with the performance of RT-PCR, and was also the best calibrated. The validated models can be useful in the early identification of potential COVID-19 patients, due to the rapid availability of CBC exams, and in multiple test settings.
Affiliation(s)
- Anna Carobene
  - Laboratory Medicine, IRCCS San Raffaele Scientific Institute, Milan, Italy
13
Campagner A, Cabitza F, Berjano P, Ciucci D. Three-way decision and conformal prediction: Isomorphisms, differences and theoretical properties of cautious learning approaches. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.08.009] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 10/20/2022]
14
|
|
15
Cabitza F, Campagner A, Soares F, García de Guadiana-Romualdo L, Challa F, Sulejmani A, Seghezzi M, Carobene A. The importance of being external. methodological insights for the external validation of machine learning models in medicine. Comput Methods Programs Biomed 2021; 208:106288. [PMID: 34352688 DOI: 10.1016/j.cmpb.2021.106288] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 05/03/2021] [Accepted: 07/09/2021] [Indexed: 06/13/2023]
Abstract
Background and Objective: Medical machine learning (ML) models tend to perform better on data from the same cohort than on new data, often due to overfitting or covariate shift. For these reasons, external validation (EV) is a necessary practice in the evaluation of medical ML. However, there is still a gap in the literature on how to interpret EV results and hence assess the robustness of ML models. Methods: We fill this gap by proposing a meta-validation method to assess the soundness of EV procedures. In doing so, we complement the usual way to assess EV by considering both dataset cardinality and the similarity of the EV dataset with respect to the training set. We then investigate how the notions of cardinality and similarity can be used to inform on the reliability of a validation procedure, by integrating them into two summative data visualizations. Results: We illustrate our methodology by applying it to the validation of a state-of-the-art COVID-19 diagnostic model on 8 EV sets, collected across 3 different continents. The model performance was moderately impacted by data similarity (Pearson ρ = 0.38, p < 0.001). In the EV, the validated model reported good AUC (average: 0.84), acceptable calibration (average: 0.17) and utility (average: 0.50). The validation datasets were adequate in terms of dataset cardinality and similarity, thus suggesting the soundness of the results. We also provide a qualitative guideline to evaluate the reliability of validation procedures, and we discuss the importance of proper external validation in light of the obtained results. Conclusions: In this paper, we propose a novel, lean methodology to: 1) study how the similarity between training and validation sets impacts the generalizability of an ML model; 2) assess the soundness of EV evaluations along three complementary performance dimensions: discrimination, utility and calibration; 3) draw conclusions on the robustness of the model under validation. We applied this methodology to a state-of-the-art model for the diagnosis of COVID-19 from routine blood tests, and showed how to interpret the results in light of the presented framework.
Affiliation(s)
- Federico Cabitza
  - University of Milano-Bicocca, Viale Sarca 336, Milano, 20126, Italy
- Andrea Campagner
  - University of Milano-Bicocca, Viale Sarca 336, Milano, 20126, Italy
- Felipe Soares
  - Department of Industrial Engineering, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
- Feyissa Challa
  - National Reference Laboratory for Clinical Chemistry, Ethiopian Public Health Institute, Addis Ababa, Ethiopia
- Adela Sulejmani
  - Laboratorio di chimica clinica, Ospedale di Desio e Monza, ASST-Monza, Dipartimento di medicina e chirurgia, Università di Milano-Bicocca, Monza, Italy
- Michela Seghezzi
  - Laboratorio di chimica clinica, Ospedale Papa Giovanni XXIII, Bergamo, Italy
- Anna Carobene
  - Laboratory Medicine, IRCCS San Raffaele Scientific Institute, Milan, Italy
16
Carobene A, Campagner A, Uccheddu C, Banfi G, Vidali M, Cabitza F. The multicenter European Biological Variation Study (EuBIVAS): a new glance provided by the Principal Component Analysis (PCA), a machine learning unsupervised algorithms, based on the basic metabolic panel linked measurands. Clin Chem Lab Med 2021; 60:556-568. [PMID: 34333884 DOI: 10.1515/cclm-2021-0599] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/19/2021] [Accepted: 07/20/2021] [Indexed: 02/03/2023]
Abstract
Objectives: The European Biological Variation Study (EuBIVAS), which includes 91 healthy volunteers from five European countries, estimated high-quality biological variation (BV) data for several measurands. Previous EuBIVAS papers reported no significant differences among laboratories/populations; however, they focused on specific sets of measurands, without a comprehensive overall view. The aim of this paper is to evaluate the homogeneity of EuBIVAS data using multivariate information, by applying Principal Component Analysis (PCA), an unsupervised machine learning algorithm. Methods: The EuBIVAS data for 13 basic metabolic panel linked measurands (glucose, albumin, total protein, electrolytes, urea, total bilirubin, creatinine, alkaline phosphatase, aminotransferases), together with age, sex, menopause, body mass index (BMI), country, alcohol, smoking habits, and physical activity, were used to generate three databases, developed using the traditional univariate and the multivariate Elliptic Envelope approaches to detect outliers, and different missing-value imputations. Two data matrices for each database, reporting mean values and "within-person BV" (CVP) values for each measurand/subject, were analyzed using PCA. Results: A clear clustering of male and female mean values was identified, with menopausal females closer to the males. Data interpretations for the three databases are similar. No significant differences in either mean or CVP values were found for country, alcohol, smoking habits, BMI or physical activity. Conclusions: The absence of meaningful differences among countries confirms the EuBIVAS sample homogeneity and that the obtained data are widely applicable to deliver APS. Our data suggest that PCA and the multivariate approach may be used to detect outliers, although further studies are required.
Affiliation(s)
- Anna Carobene
  - Laboratory Medicine, IRCCS San Raffaele Scientific Institute, Milan, Italy
- Giuseppe Banfi
  - IRCCS Istituto Ortopedico Galeazzi, Milan, Italy
  - Università Vita e Salute San Raffaele, Milan, Italy
- Matteo Vidali
  - Fondazione IRCCS Ca' Granda Ospedale Maggiore Policlinico, Milan, Italy
17
Cabitza F, Campagner A. The need to separate the wheat from the chaff in medical informatics: Introducing a comprehensive checklist for the (self)-assessment of medical AI studies. Int J Med Inform 2021; 153:104510. [PMID: 34108105 DOI: 10.1016/j.ijmedinf.2021.104510] [Citation(s) in RCA: 106] [Impact Index Per Article: 35.3] [Received: 05/11/2021] [Revised: 05/26/2021] [Accepted: 05/27/2021] [Indexed: 12/23/2022]
Abstract
This editorial aims to contribute to the current debate about the quality of studies that apply machine learning (ML) methodologies to medical data to extract value from them and provide clinicians with viable and useful tools supporting everyday care practices. We propose a practical checklist to help authors to self assess the quality of their contribution and to help reviewers to recognize and appreciate high-quality medical ML studies by distinguishing them from the mere application of ML techniques to medical data.
Affiliation(s)
- Federico Cabitza
  - DISCo, University of Milano-Bicocca, viale Sarca 336, Milano 20126, Italy
- Andrea Campagner
  - DISCo, University of Milano-Bicocca, viale Sarca 336, Milano 20126, Italy
18
Ronzio L, Campagner A, Cabitza F, Gensini GF. Unity Is Intelligence: A Collective Intelligence Experiment on ECG Reading to Improve Diagnostic Performance in Cardiology. J Intell 2021; 9:jintelligence9020017. [PMID: 33915991 PMCID: PMC8167709 DOI: 10.3390/jintelligence9020017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/17/2020] [Revised: 02/21/2021] [Accepted: 03/09/2021] [Indexed: 12/03/2022] Open
Abstract
Medical errors have a huge impact on clinical practice in terms of economic and human costs. As a result, technology-based solutions, such as those grounded in artificial intelligence (AI) or collective intelligence (CI), have attracted increasing interest as a means of reducing error rates and their impacts. Previous studies have shown that a combination of individual opinions based on rules, weighting mechanisms, or other CI solutions could improve diagnostic accuracy with respect to individual doctors. We conducted a study to investigate the potential of this approach in cardiology and, more precisely, in electrocardiogram (ECG) reading. To achieve this aim, we designed and conducted an experiment involving medical students, recent graduates, and residents, who were asked to annotate a collection of 10 ECGs of various complexity and difficulty. For each ECG, we considered groups of increasing size (from three to 30 members) and applied three different CI protocols. In all cases, the results showed a statistically significant improvement (ranging from 9% to 88%) in terms of diagnostic accuracy when compared to the performance of individual readers; this difference held for not only large groups, but also smaller ones. In light of these results, we conclude that CI approaches can support the tasks mentioned above, and possibly other similar ones as well. We discuss the implications of applying CI solutions to clinical settings, such as cases of augmented ‘second opinions’ and decision-making.
Affiliation(s)
- Luca Ronzio
  - Dipartimento di Informatica, Sistemistica e Comunicazione, University of Milano-Bicocca, Viale Sarca 336, 20126 Milan, Italy
- Andrea Campagner
  - Dipartimento di Informatica, Sistemistica e Comunicazione, University of Milano-Bicocca, Viale Sarca 336, 20126 Milan, Italy
- Federico Cabitza
  - Dipartimento di Informatica, Sistemistica e Comunicazione, University of Milano-Bicocca, Viale Sarca 336, 20126 Milan, Italy
|
19
|
Cabitza F, Campagner A, Sconfienza LM. Studying human-AI collaboration protocols: the case of the Kasparov's law in radiological double reading. Health Inf Sci Syst 2021; 9:8. [PMID: 33585029 PMCID: PMC7864624 DOI: 10.1007/s13755-021-00138-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/17/2020] [Accepted: 01/13/2021] [Indexed: 12/17/2022] Open
Abstract
Purpose: The integration of Artificial Intelligence into medical practices has recently been advocated for its promise to bring increased efficiency and effectiveness to these practices. Nonetheless, little research has so far been aimed at understanding the best human-AI interaction protocols in collaborative tasks, even in currently more viable settings, like independent double-reading screening tasks. Methods: To this aim, we report on a retrospective case–control study, involving 12 board-certified radiologists, in the detection of knee lesions by means of Magnetic Resonance Imaging, in which we simulated the serial combination of two Deep Learning models with humans in eight double-reading protocols. Inspired by the so-called Kasparov’s Laws, we investigate whether the combination of humans and AI models could achieve better performance than AI models alone, and whether weak readers, when supported by fit-for-use interaction protocols, could outperform stronger readers. Results: We discuss two main findings: groups of humans who perform significantly worse than a state-of-the-art AI can significantly outperform it if their judgements are aggregated by majority voting (in concordance with the first part of Kasparov’s law); small ensembles of significantly weaker readers can significantly outperform teams of stronger readers, supported by the same computational tool, when the judgments of the former are combined within “fit-for-use” protocols (in concordance with the second part of Kasparov’s law). Conclusion: Our study shows that good interaction protocols can guarantee improved decision performance that easily surpasses the performance of individual agents, even of realistic super-human AI systems. This finding highlights the importance of focusing on how to guarantee better co-operation within human-AI teams, so as to enable safer and more human-sustainable care practices.
Affiliation(s)
- Federico Cabitza
  - Università degli Studi di Milano-Bicocca, Viale Sarca 336, 20126 Milan, Italy
- Andrea Campagner
  - Università degli Studi di Milano-Bicocca, Viale Sarca 336, 20126 Milan, Italy
- Luca Maria Sconfienza
  - Department of Biomedical Sciences for Health, University of Milan, Milan, Italy
  - IRCCS Istituto Ortopedico Galeazzi, Milan, Italy
|
20
|
Ferrari D, Carobene A, Campagner A, Cabitza F, Sabetta E, Ceriotti D, Di Resta C, Locatelli M. Evidence of significant difference in key COVID-19 biomarkers during the Italian lockdown strategy. A retrospective study on patients admitted to a hospital emergency department in Northern Italy. Acta Biomed 2020; 91:e2020156. [PMID: 33525206 PMCID: PMC7927476 DOI: 10.23750/abm.v91i4.10371] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/30/2020] [Accepted: 07/30/2020] [Indexed: 12/24/2022]
Abstract
Background. The Lombardy region, Italy, has been severely affected by COVID-19. During the epidemic peak, in March 2020, patients needing intensive care unit treatments were approximately 10% of those infected. This fraction decreased to approximately 2% in the second part of April, and to 0.4% at the beginning of July. COVID-19 is characterized by several biochemical abnormalities whose discrepancy from normal values was associated with the severity of the disease. The aim of this retrospective study was to compare the biochemical patterns of patients during and after the pandemic peak in order to verify whether later patients were experiencing a milder COVID-19 course, as anecdotally observed by several clinicians of the same hospital. Material and Methods. The laboratory findings of two equivalent groups of 84 patients each, admitted to the emergency department of the San Raffaele Hospital (Milan, Italy) during March and April respectively, were analyzed and compared. Results. White blood cells, platelets, lymphocytes and lactate dehydrogenase showed a statistically significant improvement (i.e. closer to or within the normal clinical range) in the April group compared to March. Creatinine, C-reactive protein, calcium and liver enzymes were also pointing in that direction, although the differences were not significant. Discussion. The laboratory findings analyzed in this study were consistent with a milder COVID-19 course in the April group. After excluding several hypotheses, we concluded that our observation was likely the consequence of the lockdown strategy enforcement, which, by imposing social distancing and the use of respiratory protective devices, reduced viral loads upon infection. (www.actabiomedica.it)
|
21
|
Cabitza F, Campagner A, Ferrari D, Di Resta C, Ceriotti D, Sabetta E, Colombini A, De Vecchi E, Banfi G, Locatelli M, Carobene A. Development, evaluation, and validation of machine learning models for COVID-19 detection based on routine blood tests. Clin Chem Lab Med 2020; 59:421-431. [PMID: 33079698 DOI: 10.1515/cclm-2020-1294] [Citation(s) in RCA: 69] [Impact Index Per Article: 17.3] [Received: 08/25/2020] [Accepted: 10/07/2020] [Indexed: 02/07/2023]
Abstract
Objectives: The rRT-PCR test, the current gold standard for the detection of coronavirus disease (COVID-19), has known shortcomings, such as long turnaround time, potential shortage of reagents, false-negative rates around 15-20%, and expensive equipment. The hematochemical values of routine blood exams could represent a faster and less expensive alternative. Methods: Three different training data sets of hematochemical values from 1,624 patients (52% COVID-19 positive), admitted to San Raffaele Hospital (OSR) from February to May 2020, were used for developing machine learning (ML) models: the complete OSR dataset (72 features: complete blood count (CBC), biochemical, coagulation, hemogasanalysis and CO-Oxymetry values, age, sex and specific symptoms at triage) and two sub-datasets (COVID-specific and CBC dataset, 32 and 21 features respectively). 58 cases (50% COVID-19 positive) from another hospital, and 54 negative patients collected in 2018 at OSR, were used for internal-external and external validation. Results: We developed five ML models: for the complete OSR dataset, the area under the receiver operating characteristic curve (AUC) for the algorithms ranged from 0.83 to 0.90; for the COVID-specific dataset from 0.83 to 0.87; and for the CBC dataset from 0.74 to 0.86. The validations also achieved good results: respectively, AUC from 0.75 to 0.78; and specificity from 0.92 to 0.96. Conclusions: ML can be applied to blood tests as both an adjunct and alternative method to rRT-PCR for the fast and cost-effective identification of COVID-19-positive patients. This is especially useful in developing countries, or in countries facing an increase in contagions.
Affiliation(s)
- Andrea Campagner
  - IRCCS Istituto Ortopedico Galeazzi, Laboratory of Clinical Chemistry and Microbiology, Milan, Italy
- Chiara Di Resta
  - Vita-Salute San Raffaele University; Unit of Genomics for Human Disease Diagnosis, Division of Genetics and Cell Biology, Milan, Italy
- Daniele Ceriotti
  - Laboratory Medicine, IRCCS San Raffaele Scientific Institute, Milan, Italy
- Eleonora Sabetta
  - Laboratory Medicine, IRCCS San Raffaele Scientific Institute, Milan, Italy
- Alessandra Colombini
  - IRCCS Istituto Ortopedico Galeazzi, Laboratory of Clinical Chemistry and Microbiology, Milan, Italy
- Elena De Vecchi
  - IRCCS Istituto Ortopedico Galeazzi, Laboratory of Clinical Chemistry and Microbiology, Milan, Italy
- Giuseppe Banfi
  - IRCCS Istituto Ortopedico Galeazzi, Laboratory of Clinical Chemistry and Microbiology, Milan, Italy
- Massimo Locatelli
  - Laboratory Medicine, IRCCS San Raffaele Scientific Institute, Milan, Italy
- Anna Carobene
  - Laboratory Medicine, IRCCS San Raffaele Scientific Institute, Milan, Italy
|
22
|
Affiliation(s)
- Andrea Campagner
  - Dipartimento di Informatica, Sistemistica e Comunicazione, University of Milano–Bicocca, Milano, Italy
- Valentina Dorigatti
  - Dipartimento di Scienze Teoriche e Applicate, University of Insubria, Varese, Italy
- Davide Ciucci
  - Dipartimento di Informatica, Sistemistica e Comunicazione, University of Milano–Bicocca, Milano, Italy
|
23
|
Cabitza F, Campagner A, Sconfienza LM. As if sand were stone. New concepts and metrics to probe the ground on which to build trustable AI. BMC Med Inform Decis Mak 2020; 20:219. [PMID: 32917183 PMCID: PMC7488864 DOI: 10.1186/s12911-020-01224-9] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 01/16/2020] [Accepted: 08/17/2020] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND: We focus on the importance of interpreting the quality of the labeling used as the input of predictive models to understand the reliability of their output in support of human decision-making, especially in critical domains, such as medicine. METHODS: Accordingly, we propose a framework distinguishing the reference labeling (or Gold Standard) from the set of annotations from which it is usually derived (the Diamond Standard). We define a set of quality dimensions and related metrics: representativeness (are the available data representative of their reference population?); reliability (do the raters agree with each other in their ratings?); and accuracy (are the raters' annotations a true representation?). The metrics for these dimensions are, respectively, the degree of correspondence, Ψ; the degree of weighted concordance, ϱ; and the degree of fineness, Φ. We apply and evaluate these metrics in a diagnostic user study involving 13 radiologists. RESULTS: We evaluate Ψ against hypothesis-testing techniques, highlighting that our metrics can better evaluate distribution similarity in high-dimensional spaces. We discuss how Ψ could be used to assess the reliability of new predictions or for train-test selection. We report the value of ϱ for our case study and compare it with traditional reliability metrics, highlighting both their theoretical properties and the reasons that they differ. Then, we report the degree of fineness as an estimate of the accuracy of the collected annotations and discuss the relationship between this latter degree and the degree of weighted concordance, which we find to be moderately but significantly correlated. Finally, we discuss the implications of the proposed dimensions and metrics with respect to the context of Explainable Artificial Intelligence (XAI). CONCLUSION: We propose different dimensions and related metrics to assess the quality of the datasets used to build predictive models and Medical Artificial Intelligence (MAI). We argue that the proposed metrics are feasible for application in real-world settings for the continuous development of trustable and interpretable MAI systems.
Affiliation(s)
- Federico Cabitza
  - Dipartimento di Informatica, Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, Viale Sarca, 336, Milan, 20125 Italy
- Andrea Campagner
  - IRCCS Istituto Ortopedico Galeazzi, Via Riccardo Galeazzi 4, Milan, 20161 Italy
- Luca Maria Sconfienza
  - IRCCS Istituto Ortopedico Galeazzi, Via Riccardo Galeazzi 4, Milan, 20161 Italy
  - Department of Biomedical Sciences for Health, Università degli Studi di Milano, Via Mangiagalli 31, Milan, 20133 Italy
|
24
|
Seveso A, Campagner A, Ciucci D, Cabitza F. Ordinal labels in machine learning: a user-centered approach to improve data validity in medical settings. BMC Med Inform Decis Mak 2020; 20:142. [PMID: 32819345 PMCID: PMC7439656 DOI: 10.1186/s12911-020-01152-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/29/2020] [Accepted: 06/08/2020] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND: Despite the vagueness and uncertainty that are intrinsic to any medical act, interpretation and decision (including acts of data reporting and representation of relevant medical conditions), little research has yet focused on how to explicitly take this uncertainty into account. In this paper, we focus on the representation of a general and widespread medical terminology, which is grounded on a traditional and well-established convention, to represent severity of health conditions (for instance, pain, visible signs), ranging from Absent to Extreme. Specifically, we will study how both potential patients and doctors perceive the different levels of the terminology in both quantitative and qualitative terms, and whether the embedded user knowledge could improve the representation of ordinal values in the construction of machine learning models. METHODS: To this aim, we conducted a questionnaire-based research study involving a relatively large sample of 1,152 potential patients and 31 clinicians to represent numerically the perceived meaning of standard and widely-applied labels to describe health conditions. Using these collected values, we then present and discuss different possible fuzzy-set based representations that address the vagueness of medical interpretation by taking into account the perceptions of domain experts. We also apply the findings of this user study to evaluate the impact of different encodings on the predictive performance of common machine learning models in regard to a real-world medical prognostic task. RESULTS: We found significant differences in the perception of pain levels between the two user groups. We also show that the proposed encodings can improve the performance of specific classes of models, and discuss when this is the case. CONCLUSIONS: Looking ahead, we hope that the proposed techniques for ordinal scale representation and ordinal encoding will be useful to the research community, and that our methodology will be applied to other widely used ordinal scales to improve the validity of datasets and the results of machine learning tasks.
Affiliation(s)
- Andrea Seveso
  - Dipartimento di Informatica, Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, Viale Sarca 336, Milan, 20126, Italy
- Andrea Campagner
  - IRCCS Istituto Ortopedico Galeazzi, Via Riccardo Galeazzi 4, Milan, 20161, Italy
- Davide Ciucci
  - Dipartimento di Informatica, Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, Viale Sarca 336, Milan, 20126, Italy
- Federico Cabitza
  - Dipartimento di Informatica, Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, Viale Sarca 336, Milan, 20126, Italy
|
25
|
Campagner A, Sconfienza L, Cabitza F. H-Accuracy, an Alternative Metric to Assess Classification Models in Medicine. Stud Health Technol Inform 2020; 270:242-246. [PMID: 32570383 DOI: 10.3233/shti200159] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/15/2022]
Abstract
As is widely known, regular accuracy is a misleading and shallow indicator of the performance of a predictive model, especially in real-life domains like medicine, where decisions affect health or life. In this paper we present and discuss a new accuracy measure, the H-accuracy, as a more conservative alternative to regular accuracy, which we claim is more informative in the medical domain (and others with similar needs) for the elements it encompasses. In particular, the proposed measure takes into account important information such as the complexity of the cases and the case prevalence in the population. We also provide proof that the H-accuracy is a generalization of the balanced accuracy and illustrate the descriptive power of this score.
Affiliation(s)
- Andrea Campagner
  - IRCCS Istituto Ortopedico Galeazzi, Milano, Italy
  - University of Milano-Bicocca, Milano, Italy
|
26
|
Brinati D, Campagner A, Ferrari D, Locatelli M, Banfi G, Cabitza F. Detection of COVID-19 Infection from Routine Blood Exams with Machine Learning: A Feasibility Study. J Med Syst 2020. [PMID: 32607737 DOI: 10.1101/2020.04.22.20075143] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 05/01/2023]
Abstract
The COVID-19 pandemic due to the SARS-CoV-2 coronavirus, in the first 4 months since its outbreak, has to date reached more than 200 countries worldwide with more than 2 million confirmed cases (probably a much higher number of infected), and almost 200,000 deaths. Amplification of viral RNA by (real time) reverse transcription polymerase chain reaction (rRT-PCR) is the current gold standard test for confirmation of infection, although it presents known shortcomings: long turnaround times (3-4 hours to generate results), potential shortage of reagents, false-negative rates as large as 15-20%, the need for certified laboratories, expensive equipment and trained personnel. Thus there is a need for alternative, faster, less expensive and more accessible tests. We developed two machine learning classification models using hematochemical values from routine blood exams (namely: white blood cell counts, and the platelets, CRP, AST, ALT, GGT, ALP, LDH plasma levels) drawn from 279 patients who, after being admitted to the San Raffaele Hospital (Milan, Italy) emergency room with COVID-19 symptoms, were screened with the rRT-PCR test performed on respiratory tract specimens. Of these patients, 177 tested positive, whereas 102 tested negative. The two models discriminate between patients who are positive or negative for SARS-CoV-2: their accuracy ranges between 82% and 86%, and their sensitivity between 92% and 95%, which compares well with the gold standard. We also developed an interpretable Decision Tree model as a simple decision aid for clinicians interpreting blood tests (even off-line) for suspected COVID-19 cases. This study demonstrated the feasibility and clinical soundness of using blood test analysis and machine learning as an alternative to rRT-PCR for identifying COVID-19-positive patients. This is especially useful in countries, such as developing ones, suffering from shortages of rRT-PCR reagents and specialized laboratories. We made available a Web-based tool for clinical reference and evaluation (available at https://covid19-blood-ml.herokuapp.com/).
Affiliation(s)
- Davide Brinati
  - DISCo, Università degli Studi di Milano-Bicocca, Viale Sarca 336, Milano, 20126, Italy
- Andrea Campagner
  - DISCo, Università degli Studi di Milano-Bicocca, Viale Sarca 336, Milano, 20126, Italy
- Davide Ferrari
  - SCVSA Department, University of Parma, Parco Area delle Scienze 11/a, 43124, Parma, Italy
- Massimo Locatelli
  - Laboratory Medicine Service, San Raffaele Hospital, Via Olgettina, 60, 20132, Milano, Italy
- Giuseppe Banfi
  - IRCCS Istituto Ortopedico Galeazzi, Via Riccardo Galeazzi, 4, 20161, Milano, Italy
- Federico Cabitza
  - DISCo, Università degli Studi di Milano-Bicocca, Viale Sarca 336, Milano, 20126, Italy
|
27
|
Brinati D, Campagner A, Ferrari D, Locatelli M, Banfi G, Cabitza F. Detection of COVID-19 Infection from Routine Blood Exams with Machine Learning: A Feasibility Study. J Med Syst 2020; 44:135. [PMID: 32607737 PMCID: PMC7326624 DOI: 10.1007/s10916-020-01597-4] [Citation(s) in RCA: 132] [Impact Index Per Article: 33.0] [Received: 04/23/2020] [Accepted: 06/02/2020] [Indexed: 12/15/2022]
Abstract
The COVID-19 pandemic due to the SARS-CoV-2 coronavirus, in the first 4 months since its outbreak, has to date reached more than 200 countries worldwide with more than 2 million confirmed cases (probably a much higher number of infected), and almost 200,000 deaths. Amplification of viral RNA by (real time) reverse transcription polymerase chain reaction (rRT-PCR) is the current gold standard test for confirmation of infection, although it presents known shortcomings: long turnaround times (3-4 hours to generate results), potential shortage of reagents, false-negative rates as large as 15-20%, the need for certified laboratories, expensive equipment and trained personnel. Thus there is a need for alternative, faster, less expensive and more accessible tests. We developed two machine learning classification models using hematochemical values from routine blood exams (namely: white blood cell counts, and the platelets, CRP, AST, ALT, GGT, ALP, LDH plasma levels) drawn from 279 patients who, after being admitted to the San Raffaele Hospital (Milan, Italy) emergency room with COVID-19 symptoms, were screened with the rRT-PCR test performed on respiratory tract specimens. Of these patients, 177 tested positive, whereas 102 tested negative. The two models discriminate between patients who are positive or negative for SARS-CoV-2: their accuracy ranges between 82% and 86%, and their sensitivity between 92% and 95%, which compares well with the gold standard. We also developed an interpretable Decision Tree model as a simple decision aid for clinicians interpreting blood tests (even off-line) for suspected COVID-19 cases. This study demonstrated the feasibility and clinical soundness of using blood test analysis and machine learning as an alternative to rRT-PCR for identifying COVID-19-positive patients. This is especially useful in countries, such as developing ones, suffering from shortages of rRT-PCR reagents and specialized laboratories. We made available a Web-based tool for clinical reference and evaluation (available at https://covid19-blood-ml.herokuapp.com/).
Affiliation(s)
- Davide Brinati
  - DISCo, Università degli Studi di Milano-Bicocca, Viale Sarca 336, Milano, 20126, Italy
- Andrea Campagner
  - DISCo, Università degli Studi di Milano-Bicocca, Viale Sarca 336, Milano, 20126, Italy
- Davide Ferrari
  - SCVSA Department, University of Parma, Parco Area delle Scienze 11/a, 43124, Parma, Italy
- Massimo Locatelli
  - Laboratory Medicine Service, San Raffaele Hospital, Via Olgettina, 60, 20132, Milano, Italy
- Giuseppe Banfi
  - IRCCS Istituto Ortopedico Galeazzi, Via Riccardo Galeazzi, 4, 20161, Milano, Italy
- Federico Cabitza
  - DISCo, Università degli Studi di Milano-Bicocca, Viale Sarca 336, Milano, 20126, Italy
|
28
|
Campagner A, Cabitza F. Introducing New Measures of Inter- and Intra-Rater Agreement to Assess the Reliability of Medical Ground Truth. Stud Health Technol Inform 2020; 270:282-286. [PMID: 32570391 DOI: 10.3233/shti200167] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
In this paper, we present and discuss two new measures of inter- and intra-rater agreement to assess the reliability of the raters, and hence of their labeling, in multi-rater settings, which are common in the production of ground truth for machine learning models. Our proposal is more conservative than other existing agreement measures, as it considers a more articulated notion of agreement by chance, based on an empirical estimation of the precision (or reliability) of the single raters involved. We discuss the measures in light of a realistic annotation task that involved 13 expert radiologists in labeling the MRNet dataset.
Affiliation(s)
- Andrea Campagner
  - IRCCS Istituto Ortopedico Galeazzi, Milano, Italy
  - University of Milano-Bicocca, Milano, Italy
|
29
|
Gitto S, Campagner A, Messina C, Albano D, Cabitza F, Sconfienza LM. Collective Intelligence Has Increased Diagnostic Performance Compared with Expert Radiologists in the Evaluation of Knee MRI. Semin Musculoskelet Radiol 2020. [DOI: 10.1055/s-0040-1722499] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
|
30
|
Campagner A, Berjano P, Lamartina C, Langella F, Lombardi G, Cabitza F. Assessment and prediction of spine surgery invasiveness with machine learning techniques. Comput Biol Med 2020; 121:103796. [PMID: 32568677 DOI: 10.1016/j.compbiomed.2020.103796] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 03/14/2020] [Revised: 04/28/2020] [Accepted: 04/28/2020] [Indexed: 10/24/2022]
Abstract
BACKGROUND: Interest in Minimally Invasive Surgery (MIS) techniques has greatly increased in recent years due to their significant advantages, both in terms of outcome improvement and cost reduction. In spine surgery too, MIS is now applicable to several conditions and, above all, in low back pain (LBP) treatment. However, reliable and objective measures of invasiveness, necessary to compare different procedures, are still lacking. METHODS: In this article we study the application of Machine Learning (ML) techniques to define an invasiveness score for LBP procedures based on biological markers and inflammatory profiles. In so doing, we can assess the invasiveness of surgical procedures. We also propose a predictive model for treatment planning based on the evaluation of invasiveness of surgical alternatives for specific patients, using their pre-surgery biomarkers. The data used in the study were characterized by low sample size and high dimensionality; we therefore adopted a combination of feature selection, careful selection of ML models and conservative model selection choices in order to address these concerns. We also performed an external validation based on a statistically significantly different dataset in order to confirm the relevance of the findings. RESULTS: We report the results of an experimental study on real-world data, for which we obtained promising results for both considered applications: we report an AUC of 0.87 for the task of invasiveness score definition, and an AUC of 0.76 for the invasiveness prediction task. The results obtained on the external validation were in agreement with these results. Further, in both cases the performance was considered excellent by the involved clinicians, and the selected predictive features were biologically relevant and associated with invasiveness and biological impact in the relevant literature. CONCLUSION: Our results show that ML techniques could be effectively employed not only for diagnosis or prognosis, but also for treatment planning, a task of fundamental importance toward personalized and value-based healthcare. These results also show that ML approaches could be effectively used even in scenarios (e.g. pilot studies) where only small samples are available.
Affiliation(s)
- Andrea Campagner
  - IRCCS Istituto Ortopedico Galeazzi, Via Riccardo Galeazzi, 4, 20161, Milano, Italy
- Pedro Berjano
  - IRCCS Istituto Ortopedico Galeazzi, Via Riccardo Galeazzi, 4, 20161, Milano, Italy
- Claudio Lamartina
  - IRCCS Istituto Ortopedico Galeazzi, Via Riccardo Galeazzi, 4, 20161, Milano, Italy
- Francesco Langella
  - IRCCS Istituto Ortopedico Galeazzi, Via Riccardo Galeazzi, 4, 20161, Milano, Italy
- Giovanni Lombardi
  - Laboratory of Experimental Biochemistry & Molecular Biology, IRCCS Istituto Ortopedico Galeazzi, Via Riccardo Galeazzi, 4, 20161, Milano, Italy; Department of Athletics, Strength and Conditioning, Poznań University of Physical Education, Królowej Jadwigi 27/39, 61-871, Poznań, Poland
- Federico Cabitza
  - Dipartimento di Informatica, Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, Viale Sarca 336, 20126, Milano, Italy
|
31
|
|
32
|
Abstract
Interest in the application of machine learning (ML) techniques to medicine is growing fast and wide because of their ability to endow decision support systems with so-called artificial intelligence, particularly in those medical disciplines that extensively rely on digital imaging. Nonetheless, achieving a pragmatic and ecological validation of medical AI systems in real-world settings is difficult, even when these systems exhibit very high accuracy in laboratory settings. This difficulty has been called the “last mile of implementation.” In this review of the concept, we claim that this metaphorical mile presents two chasms: the hiatus of human trust and the hiatus of machine experience. The former hiatus encompasses all that can hinder the concrete use of AI at the point of care, including availability and usability issues, but also the contradictory phenomena of cognitive ergonomics, such as automation bias (overreliance on technology) and prejudice against the machine (clearly the opposite). The latter hiatus, on the other hand, relates to the production and availability of a sufficient amount of reliable and accurate clinical data that is suitable to be the “experience” with which a machine can be trained. In briefly reviewing the existing literature, we focus on this latter hiatus of the last mile, as it has been largely neglected by both ML developers and doctors. In doing so, we argue that efforts to cross this chasm require data governance practices and a focus on data work, including the practices of data awareness and data hygiene. To address the challenge of bridging the chasms in the last mile of medical AI implementation, we discuss the six main socio-technical challenges that must be overcome in order to build robust bridges and deploy potentially effective AI in real-world clinical settings.
Affiliation(s)
- Federico Cabitza
  - Dipartimento di Informatica, Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, Milano, Italy
- Clara Balsano
  - Dipartimento di Medicina Clinica, Sanità Pubblica, Scienze della Vita e dell'Ambiente, Università degli Studi dell'Aquila, L'Aquila, Italy
  - Francesco Balsano Foundation, Via Giovanni Battista Martini 6, Rome, Italy
|
33
|
Campagner A, Ciucci D, Dorigatti V. Approximate Reaction Systems Based on Rough Set Theory. Rough Sets 2020. [PMCID: PMC7338153 DOI: 10.1007/978-3-030-52705-1_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
|
34
|
|
35
|
Campagner A, Ciucci D. Three-Way and Semi-supervised Decision Tree Learning Based on Orthopartitions. Communications in Computer and Information Science 2018. [DOI: 10.1007/978-3-319-91476-3_61] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 12/13/2022]
|