1
Vallevik VB, Babic A, Marshall SE, Elvatun S, Brøgger HMB, Alagaratnam S, Edwin B, Veeraragavan NR, Befring AK, Nygård JF. Can I trust my fake data - A comprehensive quality assessment framework for synthetic tabular data in healthcare. Int J Med Inform 2024; 185:105413. [PMID: 38493547 DOI: 10.1016/j.ijmedinf.2024.105413] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2023] [Revised: 02/17/2024] [Accepted: 03/11/2024] [Indexed: 03/19/2024]
Abstract
BACKGROUND Ensuring safe adoption of AI tools in healthcare hinges on access to sufficient data for training, testing and validation. Synthetic data has been suggested in response to privacy concerns and regulatory requirements and can be created by training a generator on real data to produce a dataset with similar statistical properties. Competing metrics with differing taxonomies for quality evaluation have been proposed, resulting in a complex landscape. Optimising quality entails balancing considerations that make the data fit for use, yet relevant dimensions are left out of existing frameworks. METHOD We performed a comprehensive literature review on the use of quality evaluation metrics for synthetic tabular healthcare data generated with deep generative methods. Based on this and the collective team experiences, we developed a conceptual framework for quality assurance. The applicability was benchmarked against a practical case from the Dutch National Cancer Registry. CONCLUSION We present a conceptual framework for quality assurance of synthetic data for AI applications in healthcare that aligns diverging taxonomies, expands on common quality dimensions to include the dimensions of Fairness and Carbon footprint, and proposes stages necessary to support real-life applications. Building trust in synthetic data by increasing transparency and reducing the safety risk will accelerate the development and uptake of trustworthy AI tools for the benefit of patients. DISCUSSION Despite the growing emphasis on algorithmic fairness and carbon footprint, these metrics were scarce in the literature review. The overwhelming focus was on statistical similarity using distance metrics while sequential logic detection was scarce. A consensus-backed framework that includes all relevant quality dimensions can provide assurance for safe and responsible real-life applications of synthetic data. As the choice of appropriate metrics is highly context dependent, further research is needed on validation studies to guide metric choices and support the development of technical standards.
Affiliation(s)
- Vibeke Binz Vallevik
- University of Oslo, Boks 1072 Blindern, NO-0316 Oslo, Norway; DNV AS, Veritasveien 1, 1322 Høvik, Norway.
- Severin Elvatun
- Cancer Registry of Norway, Ullernchausseen 64, 0379 Oslo, Norway
- Helga M B Brøgger
- DNV AS, Veritasveien 1, 1322 Høvik, Norway; Oslo University Hospital, Sognsvannsveien 20, 0372 Oslo, Norway
- Bjørn Edwin
- University of Oslo, Boks 1072 Blindern, NO-0316 Oslo, Norway; The Intervention Centre and Department of HPB Surgery, Oslo University Hospital and Institute of Clinical Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway
- Jan F Nygård
- Cancer Registry of Norway, Ullernchausseen 64, 0379 Oslo, Norway; UiT - The Arctic University of Norway, Tromsø, Norway
2
Menegatti D, Giuseppi A, Delli Priscoli F, Pietrabissa A, Di Giorgio A, Baldisseri F, Mattioni M, Monaco S, Lanari L, Panfili M, Suraci V. CADUCEO: A Platform to Support Federated Healthcare Facilities through Artificial Intelligence. Healthcare (Basel) 2023; 11:2199. [PMID: 37570439 PMCID: PMC10418332 DOI: 10.3390/healthcare11152199] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2023] [Revised: 07/26/2023] [Accepted: 07/31/2023] [Indexed: 08/13/2023] Open
Abstract
Data-driven algorithms have proven to be effective for a variety of medical tasks, including disease categorization and prediction, personalized medicine design, and imaging diagnostics. Although their performance is frequently on par with that of clinicians, their widespread use is constrained by a number of obstacles, including the requirement for high-quality data that are representative of the population, the difficulty of explaining how they operate, and ethical and regulatory concerns. The use of data augmentation and synthetic data generation methodologies, such as federated learning and explainable artificial intelligence approaches, could provide a viable solution to the current issues, facilitating the widespread application of artificial intelligence algorithms in the clinical domain and reducing the time needed for prevention, diagnosis, and prognosis by up to 70%. To this end, a novel AI-based functional framework is conceived and presented in this paper.
Affiliation(s)
- Federico Baldisseri
- Department of Computer, Control and Management Engineering “Antonio Ruberti”, Sapienza University of Rome, Via Ariosto 25, 00185 Rome, Italy; (D.M.); (A.G.); (F.D.P.); (A.P.); (A.D.G.); (M.M.); (S.M.); (L.L.); (M.P.); (V.S.)
3
Pezoulas VC, Exarchos TP, Tachos NS, Goules A, Tzioufas AG, Fotiadis DI. Boosting the performance of MALT lymphoma classification in patients with primary Sjögren's Syndrome through data augmentation: a case study. Annu Int Conf IEEE Eng Med Biol Soc 2023; 2023:1-4. [PMID: 38083761 DOI: 10.1109/embc40787.2023.10340802] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/18/2023]
Abstract
Sjögren's Syndrome (SS) patients with mucosa-associated lymphoid tissue lymphomas (MALTLs) and diffuse large B-cell lymphomas (DLBCLs) have 10-year survival rates of 80% and 40%, respectively. This highlights the unique biologic burden of the two histologic forms, as well as the need for early detection and thorough monitoring of these patients. The lack of MALTL patients and the fact that most studies are single cohort and combine patients with different lymphoma subtypes narrow the understanding of MALTL progression. Here, we propose a data augmentation pipeline that utilizes an advanced synthetic data generator, trained on a pan-European data hub of primary SS (pSS) patients, to yield a high-quality synthetic data pool. The latter is used for the development of an enhanced MALTL classification model. Four scenarios were defined to assess the reliability of augmentation. Our results revealed an overall improvement in the accuracy, sensitivity, specificity, and AUC by 7%, 6.3%, 9%, and 6.3%, respectively. This is the first case study that utilizes data augmentation to reflect the progression of MALTL in pSS.
4
Hameed MAB, Alamgir Z. Improving mortality prediction in Acute Pancreatitis by machine learning and data augmentation. Comput Biol Med 2022; 150:106077. [PMID: 36137318 DOI: 10.1016/j.compbiomed.2022.106077] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 07/25/2022] [Revised: 08/28/2022] [Accepted: 09/03/2022] [Indexed: 11/22/2022]
Abstract
Acute Pancreatitis (AP) is the inflammation of the pancreas that can be fatal or lead to further complications based on the severity of the attack. Early detection of AP disease can help save lives by providing utmost care, rigorous treatment, and better resources. In this era of data and technology, instead of relying on manual scoring systems, scientists are employing advanced machine learning and data mining models for the early detection of patients with high chances of mortality. The current work on AP mortality prediction is negligible, and the few studies that exist have many shortcomings and are impractical for clinical deployment. In this research work, we tried to overcome the existing issues. One main issue is the lack of high-quality public datasets for AP, which are crucial for effectively training ML models. The available datasets are small in size, have many missing values, and suffer from high class imbalance. We augmented three public datasets, MIMIC-III, MIMIC-IV, and eICU, to obtain a larger dataset, and experiments showed that classifiers trained better on the augmented data than on the original small datasets. Moreover, we employed emerging advanced techniques to handle underlying issues in data. The results showed that iterative imputer is best for filling missing values in AP data. It beats not only the basic techniques but also the KNN-based imputation. Class imbalance is first addressed using data downsampling; apparently, it gave decent results on small test sets. However, we conducted numerous experiments on large test sets to prove that downsampling in the case of AP produced misleading and poor results. Next, we applied various techniques to upsample data in two different class splits, a 50 to 50 and a 70 to 30 majority-minority class split. Four different tabular generative adversarial networks, CTGAN, TGAN, CopulaGAN, and CTAB, and a variational autoencoder, TVAE, were deployed for synthetic data generation. SMOTE was also utilized for data upsampling. The computational results showed that the Random Forest (RF) classifier outperformed all other classifiers on a 50 to 50 class split data generated by CTGAN, with 0.702 Fβ and 0.833 recall. Results produced by RF on the TVAE dataset were also comparable, with 0.698 Fβ. In the case of SMOTE-based upsampling, DNN performed best with a 0.671 Fβ score.
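The Fβ score reported above combines precision and recall into a single number, with β controlling how strongly recall is weighted. The following is a minimal sketch of the standard definition, not the authors' code; the abstract does not state which β was used, so the function name `f_beta` and the β value in the usage note are illustrative assumptions.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score, the weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta < 1 weights precision more heavily.
    The default beta=1.0 gives the familiar F1 score.
    """
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    if beta <= 0:
        raise ValueError("beta must be positive")
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0.0:
        # Both precision and recall are zero: the score is defined as 0 here.
        return 0.0
    return (1.0 + b2) * precision * recall / denominator
```

For example, `f_beta(0.5, 1.0, beta=2.0)` rewards the perfect recall more than F1 would; these inputs are made-up values and are not taken from the study.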
Affiliation(s)
- M Asad Bin Hameed
- Department of Computer Science, National University of Computer and Emerging Sciences (NUCES), Lahore, Pakistan.
- Zareen Alamgir
- Department of Computer Science, National University of Computer and Emerging Sciences (NUCES), Lahore, Pakistan.
5
Pezoulas VC, Tachos NS, Olivotto I, Barlocco F, Fotiadis DI. A "smart" Imputation Approach for Effective Quality Control Across Complex Clinical Data Structures. Annu Int Conf IEEE Eng Med Biol Soc 2022; 2022:1049-1052. [PMID: 36086027 DOI: 10.1109/embc48229.2022.9871919] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
The need to improve the quality of complex data structures in healthcare is more pressing than ever. Although data quality has been the point of interest in many studies, none of them has focused on the development of quantitative and explainable methods for data imputation. In this work, we propose a "smart" imputation workflow to address missing data across complex data structures in the context of in silico clinical trials. AI algorithms were utilized to produce high-quality virtual patient profiles. A search algorithm was then developed to extract the best virtual patient profiles through the definition of a profile matching score (PMS). A case study was conducted, where the real dataset was randomly contaminated with multiple missing values (e.g., 10 to 50%). In total, 10,000 virtual patient profiles with less than 0.02 Kullback-Leibler (KL) divergence were produced to estimate the PMS distribution. The best generator achieved the lowest average squared absolute difference (0.4) and average correlation difference (0.02) with the real dataset, highlighting its increased effectiveness for data imputation across complex clinical data structures.
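The KL divergence threshold mentioned above can be illustrated for a single discretised feature. This is a minimal sketch under stated assumptions: the abstract does not describe how the divergence was computed, so the histogram-based approach, the function name `kl_divergence`, and the `eps` floor are all hypothetical choices for illustration.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """Return KL(P || Q) in nats for two discrete distributions.

    p and q are non-negative weights over the same bins (e.g. histogram
    counts of a real and a synthetic feature); they are normalised here.
    eps is an assumed floor that avoids division by zero for empty bins in q.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("p and q must each have positive total weight")
    divergence = 0.0
    for p_i, q_i in zip(p, q):
        p_i /= total_p
        q_i /= total_q
        if p_i > 0:  # bins with zero real mass contribute nothing
            divergence += p_i * math.log(p_i / max(q_i, eps))
    return divergence
```

A filter in the spirit of the abstract would then keep a synthetic profile set only if `kl_divergence(real_counts, synthetic_counts) < 0.02`, where both count vectors are hypothetical histograms over the same bins.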
6
Baeza-Delgado C, Cerdá Alberich L, Carot-Sierra JM, Veiga-Canuto D, Martínez de Las Heras B, Raza B, Martí-Bonmatí L. A practical solution to estimate the sample size required for clinical prediction models generated from observational research on data. Eur Radiol Exp 2022; 6:22. [PMID: 35641659 PMCID: PMC9156610 DOI: 10.1186/s41747-022-00276-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 06/28/2021] [Accepted: 04/12/2022] [Indexed: 12/23/2022] Open
Abstract
Background Estimating the required sample size is crucial when developing and validating clinical prediction models. However, there is no consensus about how to determine the sample size in such a setting. Here, the goal was to compare available methods to define a practical solution to sample size estimation for clinical predictive models, as applied to Horizon 2020 PRIMAGE as a case study. Methods Three different methods (Riley’s; “rule of thumb” with 10 and 5 events per predictor) were employed to calculate the sample size required to develop predictive models and to analyse the variation in sample size as a function of different parameters. Subsequently, the sample size for model validation was also estimated. Results To develop reliable predictive models, 1397 neuroblastoma patients, 1060 high-risk neuroblastoma patients, and 1345 diffuse intrinsic pontine glioma (DIPG) patients are required. This sample size can be lowered by reducing the number of variables included in the model, by including direct measures of the outcome to be predicted and/or by increasing the follow-up period. For model validation, the estimated sample size was 326 patients for neuroblastoma, 246 for high-risk neuroblastoma, and 592 for DIPG. Conclusions Given the variability of the different sample sizes obtained, we recommend using methods based on epidemiological data and the nature of the results, as the results are tailored to the specific clinical problem. In addition, sample size can be reduced by lowering the number of predictor parameters or by including direct measures of the outcome of interest.
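The "rule of thumb" methods named above can be sketched as a short calculation: the number of events needed is the events-per-predictor target times the number of predictors, and the total sample size follows from the expected event fraction. This is an illustrative sketch of the general rule only; Riley's method is more involved and is not shown. The function name `sample_size_epp` and all input values in the usage note are assumptions, not figures from the study.

```python
import math


def sample_size_epp(n_predictors: int, event_fraction: float,
                    events_per_predictor: int = 10) -> int:
    """Return the total sample size implied by an events-per-predictor rule.

    n_predictors: number of candidate predictor parameters in the model.
    event_fraction: expected proportion of patients who experience the outcome.
    events_per_predictor: rule-of-thumb target, commonly 10 or 5.
    """
    if n_predictors <= 0 or events_per_predictor <= 0:
        raise ValueError("n_predictors and events_per_predictor must be positive")
    if not 0.0 < event_fraction <= 1.0:
        raise ValueError("event_fraction must lie in (0, 1]")
    required_events = events_per_predictor * n_predictors
    # Round up so the expected number of events meets the target.
    return math.ceil(required_events / event_fraction)
```

With hypothetical inputs of 10 predictors and a 25% event fraction, `sample_size_epp(10, 0.25)` gives 400 patients under the 10-events rule and `sample_size_epp(10, 0.25, events_per_predictor=5)` gives 200, which shows how fewer predictors or a higher event fraction reduce the requirement.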
Affiliation(s)
- Carlos Baeza-Delgado
- Biomedical Imaging Research Group (GIBI230-PREBI) at La Fe Health Research Institute and the Imaging La Fe node of the Distributed Network for Biomedical Imaging (ReDIB) Unique Scientific and Technical Infrastructures (ICTS), Valencia, Spain
- Leonor Cerdá Alberich
- Biomedical Imaging Research Group (GIBI230-PREBI) at La Fe Health Research Institute and the Imaging La Fe node of the Distributed Network for Biomedical Imaging (ReDIB) Unique Scientific and Technical Infrastructures (ICTS), Valencia, Spain
- José Miguel Carot-Sierra
- Department of Applied Statistics, Operations Research and Quality, Universitat Politècnica de València, Valencia, Spain
- Diana Veiga-Canuto
- Radiology Department, Hospital Universitario y Politécnico La Fe, Valencia, Spain
- Ben Raza
- Biomedical Imaging Research Group (GIBI230-PREBI) at La Fe Health Research Institute and the Imaging La Fe node of the Distributed Network for Biomedical Imaging (ReDIB) Unique Scientific and Technical Infrastructures (ICTS), Valencia, Spain; Pediatric Oncology Department, Hospital Universitario y Politécnico La Fe, Valencia, Spain
- Luis Martí-Bonmatí
- Biomedical Imaging Research Group (GIBI230-PREBI) at La Fe Health Research Institute and the Imaging La Fe node of the Distributed Network for Biomedical Imaging (ReDIB) Unique Scientific and Technical Infrastructures (ICTS), Valencia, Spain.
7
Wu H, Liang Q, Zhang W, Zou Q, El-Latif Hesham A, Liu B. iLncDA-LTR: Identification of lncRNA-disease associations by learning to rank. Comput Biol Med 2022; 146:105605. [PMID: 35594681 DOI: 10.1016/j.compbiomed.2022.105605] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 04/01/2022] [Revised: 04/27/2022] [Accepted: 05/09/2022] [Indexed: 12/12/2022]
Abstract
Identifying the associations between lncRNAs and diseases is helpful for the treatment and diagnosis of complex diseases. The existing computational methods mainly focus on the identification of associations between known lncRNAs and known diseases. However, with the application of high-throughput sequencing in lncRNA research, more and more lncRNAs have been detected. Predicting diseases related to newly detected lncRNAs has not been fully explored. Therefore, there is an urgent need for developing powerful computational methods to predict diseases related to newly detected lncRNAs. In this paper, we propose a Learning to Rank (LTR)-based method called iLncDA-LTR to predict diseases related to newly detected lncRNAs. iLncDA-LTR treats this task as an information retrieval task. The newly detected lncRNAs and diseases are considered as queries and documents, respectively. For a given newly detected lncRNA (query), iLncDA-LTR integrates multiple types of relevant information into LTR for predicting candidate diseases associated with the query lncRNA. Experimental results show that iLncDA-LTR outperforms the other existing state-of-the-art predictors on an independent dataset. The corresponding web server of iLncDA-LTR has been constructed as well (http://bliulab.net/iLncDA-LTR/).
Affiliation(s)
- Hao Wu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China.
- Qi Liang
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China.
- Wenxiang Zhang
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China.
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.
- Abd El-Latif Hesham
- Genetics Department, Faculty of Agriculture, Beni-Suef University, Beni-Suef, 62511, Egypt.
- Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China; Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China.
8
Guan Q, Chen Y, Wei Z, Heidari AA, Hu H, Yang XH, Zheng J, Zhou Q, Chen H, Chen F. Medical image augmentation for lesion detection using a texture-constrained multichannel progressive GAN. Comput Biol Med 2022; 145:105444. [PMID: 35421795 DOI: 10.1016/j.compbiomed.2022.105444] [Citation(s) in RCA: 54] [Impact Index Per Article: 27.0] [Received: 10/13/2021] [Revised: 12/31/2021] [Accepted: 03/20/2022] [Indexed: 12/18/2022]
Abstract
Lesion detectors based on deep learning can assist doctors in diagnosing diseases. However, the performance of current detectors is likely to be unsatisfactory due to the scarcity of training samples. Therefore, it is beneficial to use image generation to augment the training set of a detector. However, when the imaging texture of the medical image is relatively delicate, the synthesized image generated by an existing method may be too poor in quality to meet the training requirements of the detectors. In this regard, a medical image augmentation method, namely, a texture-constrained multichannel progressive generative adversarial network (TMP-GAN), is proposed in this work. TMP-GAN uses joint training of multiple channels to effectively avoid the typical shortcomings of the current generation methods. It also uses an adversarial learning-based texture discrimination loss to further improve the fidelity of the synthesized images. In addition, TMP-GAN employs a progressive generation mechanism to steadily improve the accuracy of the medical image synthesizer. Experiments on the publicly available dataset CBIS-DDSM and our pancreatic tumor dataset show that the precision/recall/F1-score of the detector trained on the TMP-GAN augmented dataset improves by 2.59%/2.70%/2.77% and 2.44%/2.06%/2.36%, respectively, compared to the optimal results of other data augmentation methods. The FROC curve of the detector is also better than the curve from the contrast-augmented trained dataset. Therefore, we believe the proposed TMP-GAN is a practical technique to efficiently implement lesion detection case studies.
Affiliation(s)
- Qiu Guan
- College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China.
- Yizhou Chen
- College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China.
- Zihan Wei
- College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China.
- Ali Asghar Heidari
- School of Surveying and Geospatial Engineering, College of Engineering, University of Tehran, Tehran, Iran.
- Haigen Hu
- College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China.
- Xu-Hua Yang
- College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China.
- Jianwei Zheng
- College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China.
- Qianwei Zhou
- College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China.
- Huiling Chen
- College of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou, Zhejiang, 325035, China.
- Feng Chen
- The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China.