1
Vallevik VB, Babic A, Marshall SE, Elvatun S, Brøgger HMB, Alagaratnam S, Edwin B, Veeraragavan NR, Befring AK, Nygård JF. Can I trust my fake data - A comprehensive quality assessment framework for synthetic tabular data in healthcare. Int J Med Inform 2024; 185:105413. [PMID: 38493547] [DOI: 10.1016/j.ijmedinf.2024.105413] [Received: 12/15/2023] [Revised: 02/17/2024] [Accepted: 03/11/2024]
Abstract
BACKGROUND Ensuring the safe adoption of AI tools in healthcare hinges on access to sufficient data for training, testing and validation. Synthetic data has been suggested in response to privacy concerns and regulatory requirements and can be created by training a generator on real data to produce a dataset with similar statistical properties. Competing metrics with differing taxonomies for quality evaluation have been proposed, resulting in a complex landscape. Optimising quality entails balancing considerations that make the data fit for use, yet relevant dimensions are left out of existing frameworks. METHOD We performed a comprehensive literature review on the use of quality evaluation metrics for synthetic data, within the scope of synthetic tabular healthcare data produced by deep generative methods. Based on this review and the collective experience of the team, we developed a conceptual framework for quality assurance. Its applicability was benchmarked against a practical case from the Dutch National Cancer Registry. CONCLUSION We present a conceptual framework for quality assurance of synthetic data for AI applications in healthcare that aligns diverging taxonomies, expands the common quality dimensions to include Fairness and Carbon footprint, and proposes the stages necessary to support real-life applications. Building trust in synthetic data by increasing transparency and reducing safety risk will accelerate the development and uptake of trustworthy AI tools for the benefit of patients. DISCUSSION Despite the growing emphasis on algorithmic fairness and carbon footprint, these metrics were scarce in the reviewed literature. The overwhelming focus was on statistical similarity using distance metrics, while sequential logic detection was rare. A consensus-backed framework that includes all relevant quality dimensions can provide assurance for safe and responsible real-life applications of synthetic data. As the choice of appropriate metrics is highly context dependent, further research is needed on validation studies to guide metric choices and support the development of technical standards.
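The statistical-similarity dimension the review found dominant is typically scored with a distance metric between the real and synthetic distributions. As a minimal sketch (not from the paper; the data, binning and metric choice here are illustrative assumptions), a total variation distance between two numeric columns:

```python
import random
from collections import Counter

def total_variation_distance(real, synthetic, bins=10, lo=0.0, hi=1.0):
    """Distance between the empirical histograms of two numeric columns:
    0 means identical histograms, 1 means disjoint support."""
    width = (hi - lo) / bins
    def hist(xs):
        counts = Counter(min(bins - 1, int((x - lo) / width)) for x in xs)
        return [counts.get(i, 0) / len(xs) for i in range(bins)]
    p, q = hist(real), hist(synthetic)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

random.seed(0)
real = [random.random() for _ in range(5000)]
good = [random.random() for _ in range(5000)]      # generator matching the real distribution
bad = [random.random() ** 3 for _ in range(5000)]  # skewed, poorly fitted generator
assert total_variation_distance(real, good) < total_variation_distance(real, bad)
```

A low distance only certifies marginal similarity; the framework's other dimensions (privacy, fairness, sequential logic) need separate metrics.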
Affiliation(s)
- Vibeke Binz Vallevik
- University of Oslo, Boks 1072 Blindern, NO-0316 Oslo, Norway; DNV AS, Veritasveien 1, 1322 Høvik, Norway.
- Severin Elvatun
- Cancer Registry of Norway, Ullernchausseen 64, 0379 Oslo, Norway
- Helga M B Brøgger
- DNV AS, Veritasveien 1, 1322 Høvik, Norway; Oslo University Hospital, Sognsvannsveien 20, 0372 Oslo, Norway
- Bjørn Edwin
- University of Oslo, Boks 1072 Blindern, NO-0316 Oslo, Norway; The Intervention Centre and Department of HPB Surgery, Oslo University Hospital and Institute of Clinical Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway
- Jan F Nygård
- Cancer Registry of Norway, Ullernchausseen 64, 0379 Oslo, Norway; UiT - The Arctic University of Norway, Tromsø, Norway
2
He S, Chong P, Yoon BJ, Chung PH, Chen D, Marzouk S, Black KC, Sharp W, Safari P, Goldstein JN, Raja AS, Lee J. Entropy removal of medical diagnostics. Sci Rep 2024; 14:1181. [PMID: 38216607] [PMCID: PMC10786933] [DOI: 10.1038/s41598-024-51268-4] [Received: 04/21/2023] [Accepted: 01/03/2024]
Abstract
Shannon entropy is a core concept in machine learning and information theory, particularly in decision tree modeling. To date, no study has systematically and quantitatively applied Shannon entropy to the uncertainty of clinical situations using diagnostic variables (true/false positives and negatives). Decision tree representations of medical decision-making tools can be generated from diagnostic variables reported in the literature, and the entropy each tool removes can then be calculated. This concept of clinical entropy removal has significant potential to drive healthcare innovation, such as quantifying the impact of clinical guidelines and the value of care, with applications to Emergency Medicine scenarios where diagnostic accuracy within a limited time window is paramount. This analysis was performed for 623 diagnostic tools and provided unique insights into their utility. For studies that provided detailed data on medical decision-making algorithms, bootstrapped datasets were generated from the source data to perform a comprehensive machine learning analysis of these algorithms and their constituent steps, yielding a novel and thorough evaluation of medical diagnostic algorithms.
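The central quantity, entropy removed by a diagnostic test, can be sketched as the mutual information between disease state and test result, computed from prevalence, sensitivity and specificity. The numbers below are hypothetical, not taken from the 623 tools the study analyzed:

```python
from math import log2

def binary_entropy(p):
    """Shannon entropy (bits) of a binary event with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def entropy_removed(prevalence, sensitivity, specificity):
    """Expected reduction in diagnostic uncertainty from one test:
    H(disease) - H(disease | test result), i.e. mutual information."""
    p_pos = prevalence * sensitivity + (1 - prevalence) * (1 - specificity)
    ppv = prevalence * sensitivity / p_pos                       # P(disease | +)
    p_dis_neg = prevalence * (1 - sensitivity) / (1 - p_pos)     # P(disease | -)
    posterior = (p_pos * binary_entropy(ppv)
                 + (1 - p_pos) * binary_entropy(p_dis_neg))
    return binary_entropy(prevalence) - posterior

# A hypothetical test with 90% sensitivity/specificity at 10% prevalence
bits = entropy_removed(0.10, 0.90, 0.90)
```

A perfect test removes all of the pre-test entropy, while a test no better than chance removes none, which is what makes the quantity usable as a comparable score across tools.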
Affiliation(s)
- Shuhan He
- Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA.
- Paul Chong
- Campbell University School of Osteopathic Medicine, Lillington, NC, USA
- Byung-Jun Yoon
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA
- Brookhaven National Laboratory, Computational Science Initiative, Upton, NY, USA
- Pei-Hung Chung
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA
- David Chen
- Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada
- Sammer Marzouk
- Harvard University Department of Chemistry and Chemical Biology, Cambridge, MA, USA
- Wilson Sharp
- Campbell University School of Osteopathic Medicine, Lillington, NC, USA
- Pedram Safari
- Massachusetts General Hospital Institute of Health Professions, Boston, MA, USA
- Joshua N Goldstein
- Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Ali S Raja
- Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Jarone Lee
- Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
3
Chen YC, Chen SC, Liu YS. Reducing abnormal expenses in national health insurance based on a control chart and decision tree-driven define, measure, analyze, improve and control process. Health Informatics J 2023; 29:14604582231203757. [PMID: 37730249] [DOI: 10.1177/14604582231203757]
Abstract
This study examined the cost of medical insurance for "sepsis" treatment in Taiwan. We applied statistical tests, cost control charts, and C5.0 decision trees within the define, measure, analyze, improve and control (DMAIC) process to mine data on Diagnosis-Related Groups (DRGs) and clinics that reported expense anomalies and disposal costs. Analyzing 353 valid samples (application fees) from four DRGs, 70 clinics, and 15 input variables, abnormalities in application fees for adults (age ≥18 years) with comorbidities or complications were significant (95% confidence interval) in one DRG and nine clinics. Four input variables (ward charge, treatment fee, laboratory fee, and pharmaceutical service charge) had a significant impact. Improvements or controls should be prioritized for three clinics (Nos. 49, 44, and 14) and two input variables (treatment and laboratory fees). This model can be replicated to identify excess medical expenditures and improve the efficiency of medical resource use.
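The control-chart step of such a DMAIC pipeline amounts to flagging clinics whose mean application fee falls outside 95% control limits, the significance level the abstract cites. The clinic ids and fee values below are invented for illustration and do not come from the study:

```python
from statistics import mean, stdev

def control_limits(values, z=1.96):
    """95% control limits (z = 1.96) around the overall mean fee."""
    m, s = mean(values), stdev(values)
    return m - z * s, m + z * s

def flag_anomalies(clinic_costs, z=1.96):
    """Return ids of clinics whose mean application fee falls outside
    control limits computed over all clinics' fees pooled together."""
    all_fees = [fee for fees in clinic_costs.values() for fee in fees]
    lo, hi = control_limits(all_fees, z)
    return sorted(cid for cid, fees in clinic_costs.items()
                  if not (lo <= mean(fees) <= hi))

# Hypothetical fees: nine ordinary clinics and one clear outlier
data = {f"clinic{i}": [100, 101, 99, 100] for i in range(9)}
data["clinicX"] = [500, 505, 495, 500]
```

In the study's workflow, flagged clinics would then feed the C5.0 decision tree to identify which input variables (e.g. treatment or laboratory fees) drive the anomaly.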
Affiliation(s)
- Yen-Chang Chen
- School of Sport and Health, SanMing University, SanMing, China
- Shui-Chuan Chen
- Department of Industrial Engineering and Management, National Chin-Yi University of Technology, Taichung, Taiwan
- Ying-Sing Liu
- College of Humanities and Social Sciences, Chaoyang University of Technology, Taichung, Taiwan
4
Nikolentzos G, Vazirgiannis M, Xypolopoulos C, Lingman M, Brandt EG. Synthetic electronic health records generated with variational graph autoencoders. NPJ Digit Med 2023; 6:83. [PMID: 37120594] [PMCID: PMC10148837] [DOI: 10.1038/s41746-023-00822-x] [Received: 10/12/2022] [Accepted: 04/05/2023]
Abstract
Data-driven medical care delivery must always respect patient privacy, a requirement that is not easily met. This issue has impeded improvements to healthcare software and has delayed the long-predicted prevalence of artificial intelligence in healthcare. Until now, it has been very difficult to share data between healthcare organizations, resulting in poor statistical models built on unrepresentative patient cohorts. Synthetic data, i.e., artificial but realistic electronic health records, could overcome the data drought troubling the healthcare sector. Deep neural network architectures, in particular, have shown a remarkable ability to learn from complex data sets and to generate large amounts of unseen data points with the same statistical properties as the training data. Here, we present a generative neural network model that creates synthetic health records with realistic timelines. These clinical trajectories are generated on a per-patient basis and are represented as linear-sequence graphs of clinical events over time. We use a variational graph autoencoder (VGAE) to generate synthetic samples from real-world electronic health records. Our approach generates health records not seen in the training data. We show that these artificial patient trajectories are realistic, preserve patient privacy, and can therefore support the safe sharing of data across organizations.
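The linear-sequence graph representation described here can be sketched as a chain over time-ordered events, the kind of structure a graph autoencoder would consume. The event names and dictionary layout are hypothetical, not the paper's actual data format:

```python
def trajectory_to_graph(events):
    """Linear-sequence graph of one patient's trajectory: one node per
    clinical event in temporal order, edges only between consecutive events."""
    n = len(events)
    adjacency = [[1 if abs(i - j) == 1 else 0 for j in range(n)]
                 for i in range(n)]
    return {"nodes": list(events), "adjacency": adjacency}

graph = trajectory_to_graph(["diagnosis", "surgery", "chemotherapy", "follow-up"])
```

A VGAE then learns a latent distribution over such graphs and samples new chains from it, which is how unseen but statistically similar trajectories arise.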
Affiliation(s)
- Giannis Nikolentzos
- LIX, École Polytechnique, Institut Polytechnique de Paris, Palaiseau, France.
- Michalis Vazirgiannis
- LIX, École Polytechnique, Institut Polytechnique de Paris, Palaiseau, France
- Department of Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden
- Markus Lingman
- Department of Molecular and Clinical Medicine/Cardiology, Institute of Medicine, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
- Center for Applied Intelligent Systems Research, Halmstad University, Halmstad, Sweden
5
Alharbi A, Ahmad M, Alosaimi W, Alyami H, Sarkar AK, Agrawal A, Kumar R, Khan RA. Securing healthcare information system through fuzzy based decision-making methodology. Health Informatics J 2022; 28:14604582221135420. [DOI: 10.1177/14604582221135420]
Abstract
The purpose of a Healthcare Information System (HIS) is to replace the conventional method of data gathering and organization in hospitals with systematic data collection, maintenance and dissemination. There has recently been an unprecedented rise in malware and cyber-attacks on HIS, and cyber-attacks have become a major crisis for the healthcare industry. To address this, the present paper studies the security factors integral to the healthcare information system and analyzes their performance. To this end, the study employed a framework integrating the Fuzzy Analytic Hierarchy Process (F-AHP) with the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) to evaluate the performance of each factor. The factors that play a vital role in healthcare data security breaches were then prioritized according to their security weights. Furthermore, the validity of the results obtained by this methodology was established through sensitivity analysis and by comparing the results with those of other methods on the same data set. Based on these results, access control and software security were identified as the most significant security factors.
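The TOPSIS half of such a framework ranks factors by relative closeness to an ideal solution. A minimal sketch follows; the factor names, scores and criterion weights are hypothetical, and the fuzzy-AHP step that would normally supply the weights is omitted:

```python
from math import sqrt

def topsis(matrix, weights):
    """Rank alternatives by relative closeness to the ideal solution.
    matrix[i][j] is the score of alternative i on benefit criterion j."""
    norms = [sqrt(sum(v * v for v in col)) for col in zip(*matrix)]
    weighted = [[w * v / n for v, w, n in zip(row, weights, norms)]
                for row in matrix]
    best = [max(col) for col in zip(*weighted)]    # positive ideal solution
    worst = [min(col) for col in zip(*weighted)]   # negative ideal solution
    def dist(row, ref):
        return sqrt(sum((a - b) ** 2 for a, b in zip(row, ref)))
    return [dist(r, worst) / (dist(r, worst) + dist(r, best)) for r in weighted]

factors = ["access control", "software security", "network security"]
scores = topsis([[9, 8], [8, 7], [4, 5]],  # hypothetical factor scores
                [0.6, 0.4])                # hypothetical criterion weights
```

Closeness scores lie in [0, 1]; higher means nearer the ideal, so sorting factors by score yields the priority ranking the paper describes.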
Affiliation(s)
- Abdullah Alharbi
- Department of Information Technology, College of Computers and Information Technology, Taif University, Taif, Saudi Arabia
- Masood Ahmad
- Department of Information Technology, Babasaheb Bhimrao Ambedkar University, Lucknow, India
- Wael Alosaimi
- Department of Information Technology, College of Computers and Information Technology, Taif University, Taif, Saudi Arabia
- Hashem Alyami
- Department of Computer Science, College of Computers and Information Technology, Taif University, Taif, Saudi Arabia
- Amal Krishna Sarkar
- Department of Biostatistics and Health Informatics, Sanjay Gandhi Post Graduate Institute of Medical Sciences, Lucknow, India
- Alka Agrawal
- Department of Information Technology, Babasaheb Bhimrao Ambedkar University, Lucknow, India
- Rajeev Kumar
- Centre for Innovation and Technology, Administrative Staff College of India, Hyderabad, India
- Raees Ahamd Khan
- Department of Information Technology, Babasaheb Bhimrao Ambedkar University, Lucknow, India
6
Review of Artificial Intelligence and Machine Learning Technologies: Classification, Restrictions, Opportunities and Challenges. Mathematics 2022. [DOI: 10.3390/math10152552]
Abstract
Artificial intelligence (AI) is an evolving set of technologies used for solving a wide range of applied issues. The core of AI is machine learning (ML), a complex of algorithms and methods that address the problems of classification, clustering, and forecasting. The practical application of AI&ML holds promising prospects, and research in this area is therefore intensive. However, industrial applications of AI, and its more intensive use in society, are not yet widespread. The challenges of widespread AI application need to be considered from both the AI (internal problems) and the societal (external problems) perspective. This consideration identifies the priority steps for more intensive practical application of AI technologies and for their introduction into industry and society. The article identifies and discusses the challenges of employing AI technologies in the economy and society of resource-based countries. A systematization of AI&ML technologies is carried out based on publications in these areas, allowing the organizational, personnel, social and technological limitations to be specified. The paper outlines directions of study in AI and ML that would help overcome some of these limitations and expand the scope of AI&ML applications.
7
A Methodology for Controlling Bias and Fairness in Synthetic Data Generation. Applied Sciences (Basel) 2022. [DOI: 10.3390/app12094619]
Abstract
The development of algorithms based on machine learning techniques that support (or even replace) human judgment must take into account concepts such as data bias and fairness. Though the scientific literature proposes numerous techniques to detect and evaluate these problems, less attention has been dedicated to methods for generating intentionally biased datasets, which data scientists could use to develop and validate unbiased and fair decision-making algorithms. To this end, this paper presents a novel method to generate a synthetic dataset in which bias can be modeled using a probabilistic network that exploits structural equation modeling. The proposed methodology was validated on a simple dataset, to highlight the impact of the tuning parameters on bias and fairness, and on a more realistic example based on a loan approval status dataset. In particular, this methodology requires a limited number of parameters compared to other techniques for generating datasets with a controlled amount of bias and fairness.
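The core idea, a generator with an explicit knob controlling how strongly a sensitive attribute leaks into the outcome, can be sketched as follows. This toy uses a single Gaussian "merit" cause instead of the paper's full structural-equation network, and every name here is hypothetical:

```python
import random

def generate_biased_dataset(n, bias, seed=0):
    """Sample (group, merit, label) rows. `bias` is the single tuning knob:
    0 gives a fair generator; larger values let the sensitive group attribute
    shift the outcome alongside the legitimate cause."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        group = rng.random() < 0.5       # sensitive attribute
        merit = rng.gauss(0.0, 1.0)      # legitimate cause of the outcome
        label = merit + (bias if group else -bias) > 0.0
        rows.append((group, merit, label))
    return rows

def positive_rate(rows, group):
    """Fraction of favorable outcomes within one group."""
    labels = [lbl for g, _, lbl in rows if g == group]
    return sum(labels) / len(labels)

biased = generate_biased_dataset(20000, bias=0.8)
gap = positive_rate(biased, True) - positive_rate(biased, False)  # demographic parity gap
```

Because the bias parameter is explicit, a fairness metric such as the demographic parity gap can be dialed up or down to stress-test a decision-making algorithm, which is the validation use the paper targets.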