1
Vallevik VB, Babic A, Marshall SE, Elvatun S, Brøgger HMB, Alagaratnam S, Edwin B, Veeraragavan NR, Befring AK, Nygård JF. Can I trust my fake data - A comprehensive quality assessment framework for synthetic tabular data in healthcare. Int J Med Inform 2024; 185:105413. [PMID: 38493547 DOI: 10.1016/j.ijmedinf.2024.105413] [Received: 12/15/2023] [Revised: 02/17/2024] [Accepted: 03/11/2024] [Indexed: 03/19/2024]
Abstract
BACKGROUND Ensuring safe adoption of AI tools in healthcare hinges on access to sufficient data for training, testing and validation. Synthetic data has been suggested in response to privacy concerns and regulatory requirements and can be created by training a generator on real data to produce a dataset with similar statistical properties. Competing metrics with differing taxonomies for quality evaluation have been proposed, resulting in a complex landscape. Optimising quality entails balancing considerations that make the data fit for use, yet relevant dimensions are left out of existing frameworks. METHOD We performed a comprehensive literature review on the use of quality evaluation metrics on synthetic data within the scope of synthetic tabular healthcare data using deep generative methods. Based on this and the collective team experiences, we developed a conceptual framework for quality assurance. The applicability was benchmarked against a practical case from the Dutch National Cancer Registry. CONCLUSION We present a conceptual framework for quality assurance of synthetic data for AI applications in healthcare that aligns diverging taxonomies, expands on common quality dimensions to include the dimensions of Fairness and Carbon footprint, and proposes stages necessary to support real-life applications. Building trust in synthetic data by increasing transparency and reducing the safety risk will accelerate the development and uptake of trustworthy AI tools for the benefit of patients. DISCUSSION Despite the growing emphasis on algorithmic fairness and carbon footprint, these metrics were scarce in the literature review. The overwhelming focus was on statistical similarity using distance metrics while sequential logic detection was scarce. A consensus-backed framework that includes all relevant quality dimensions can provide assurance for safe and responsible real-life applications of synthetic data. As the choice of appropriate metrics is highly context dependent, further research is needed on validation studies to guide metric choices and support the development of technical standards.
Affiliation(s)
- Vibeke Binz Vallevik
  - University of Oslo, Boks 1072 Blindern, NO-0316 Oslo, Norway; DNV AS, Veritasveien 1, 1322 Høvik, Norway
- Severin Elvatun
  - Cancer Registry of Norway, Ullernchausseen 64, 0379 Oslo, Norway
- Helga M B Brøgger
  - DNV AS, Veritasveien 1, 1322 Høvik, Norway; Oslo University Hospital, Sognsvannsveien 20, 0372 Oslo, Norway
- Bjørn Edwin
  - University of Oslo, Boks 1072 Blindern, NO-0316 Oslo, Norway; The Intervention Centre and Department of HPB Surgery, Oslo University Hospital and Institute of Clinical Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway
- Jan F Nygård
  - Cancer Registry of Norway, Ullernchausseen 64, 0379 Oslo, Norway; UiT - The Arctic University of Norway, Tromsø, Norway
2
Alloza C, Knox B, Raad H, Aguilà M, Coakley C, Mohrova Z, Boin É, Bénard M, Davies J, Jacquot E, Lecomte C, Fabre A, Batech M. A Case for Synthetic Data in Regulatory Decision-Making in Europe. Clin Pharmacol Ther 2023; 114:795-801. [PMID: 37441734 DOI: 10.1002/cpt.3001] [Received: 11/16/2022] [Accepted: 07/05/2023] [Indexed: 07/15/2023]
Abstract
Regulators are faced with many challenges surrounding health data usage, including privacy, fragmentation, validity, and generalizability, especially in the European Union, for which synthetic data may provide innovative solutions. Synthetic data, defined as data artificially generated rather than captured in the real world, are increasingly being used for healthcare research purposes as a proxy for real-world data (RWD). Currently, barriers are particularly challenging in Europe, where sharing patients' data is strictly regulated, costly, and time-consuming, causing delays in evidence generation and regulatory approvals. Recent initiatives are encouraging the use of synthetic data in regulatory decision making and health technology assessment to overcome these challenges, but synthetic data have still to overcome realistic obstacles before their adoption by researchers and regulators in Europe. Thus, the emerging use of RWD and synthetic data by pharmaceutical and medical device industries calls on regulatory bodies to provide a framework for proper evidence generation and informed regulatory decision making. As the provision of data becomes more ubiquitous in scientific research, so will innovations in artificial intelligence, machine learning, and generation of synthetic data, making the exploration and intricacies of this topic all the more important and timely. In this review, we discuss the potential merits and challenges of synthetic data in the context of decision making in the European regulatory environment. We explore the current uses of synthetic data and ongoing initiatives, the value of synthetic data for regulatory purposes, and realistic barriers to the adoption of synthetic data in healthcare.
3
Theodorou B, Xiao C, Sun J. Synthesize high-dimensional longitudinal electronic health records via hierarchical autoregressive language model. Nat Commun 2023; 14:5305. [PMID: 37652934 PMCID: PMC10471716 DOI: 10.1038/s41467-023-41093-0] [Received: 03/01/2023] [Accepted: 08/23/2023] [Indexed: 09/02/2023]
Abstract
Synthetic electronic health records (EHRs) that are both realistic and privacy-preserving offer alternatives to real EHRs for machine learning (ML) and statistical analysis. However, generating high-fidelity EHR data in its original, high-dimensional form poses challenges for existing methods. We propose Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal, high-dimensional EHR data that preserve the statistical properties of real EHRs and can be used to train accurate ML models without privacy concerns. HALO generates a probability density function over medical codes, clinical visits, and patient records, allowing for generating realistic EHR data without requiring variable selection or aggregation. Extensive experiments demonstrated that HALO can generate high-fidelity data with high-dimensional disease code probabilities closely mirroring real EHR data (R² correlation above 0.9). HALO also enhances the accuracy of predictive modeling and enables downstream ML models to attain accuracy similar to that of models trained on genuine data.
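The code-probability comparison mentioned in this abstract can be illustrated with a minimal sketch. It assumes that "R2 correlation" means the squared Pearson correlation between per-code prevalences in the real and synthetic records; the abstract does not spell out the exact definition, so this is an interpretation and not the authors' evaluation code. The function names, the set-of-codes record layout, and the toy data are all hypothetical.

```python
def code_prevalence(records, vocabulary):
    """Fraction of patient records that contain each code in `vocabulary`.

    `records` is a list of sets of codes (a hypothetical, simplified layout).
    """
    n = len(records)
    return [sum(1 for r in records if code in r) / n for code in vocabulary]


def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Toy, made-up data: each patient is a set of diagnosis codes.
vocabulary = ["A", "B", "C"]
real_records = [{"A", "B"}, {"A"}, {"C"}, {"A", "C"}]
synthetic_records = [{"A"}, {"A", "B"}, {"A", "C"}, {"C"}]

real_prev = code_prevalence(real_records, vocabulary)
synth_prev = code_prevalence(synthetic_records, vocabulary)
r2 = pearson_r2(real_prev, synth_prev)
```

In this toy case the two datasets have identical per-code prevalences, so the score is 1 up to floating-point error. A high score says only that marginal code frequencies match; it says nothing about co-occurrence or temporal structure.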
Affiliation(s)
- Brandon Theodorou
  - University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL, USA
  - Medisyn Inc., Las Vegas, NV, USA
- Cao Xiao
  - Medisyn Inc., Las Vegas, NV, USA
- Jimeng Sun
  - University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL, USA
  - Medisyn Inc., Las Vegas, NV, USA
4
Affiliation(s)
- Jessica Morley
  - Oxford Internet Institute, University of Oxford, Oxford, UK
5
Mosquera L, El Emam K, Ding L, Sharma V, Zhang XH, Kababji SE, Carvalho C, Hamilton B, Palfrey D, Kong L, Jiang B, Eurich DT. A method for generating synthetic longitudinal health data. BMC Med Res Methodol 2023; 23:67. [PMID: 36959532 PMCID: PMC10034254 DOI: 10.1186/s12874-023-01869-w] [Received: 04/18/2022] [Accepted: 02/19/2023] [Indexed: 03/25/2023]
Abstract
Getting access to administrative health data for research purposes is a difficult and time-consuming process due to increasingly demanding privacy regulations. An alternative method for sharing administrative health data would be to share synthetic datasets where the records do not correspond to real individuals, but the patterns and relationships seen in the data are reproduced. This paper assesses the feasibility of generating synthetic administrative health data using a recurrent deep learning model. Our data comes from 120,000 individuals from Alberta Health's administrative health database. We assess how similar our synthetic data is to the real data using utility assessments that examine the structure and general patterns in the data, as well as by recreating a specific analysis commonly applied to this type of administrative health data. We also assess the privacy risks associated with the use of this synthetic dataset. Generic utility assessments used Hellinger distance to quantify the difference in distributions between real and synthetic datasets for event types (0.027), attributes (mean 0.0417), and Markov transition matrices (order 1: mean absolute difference 0.0896, sd 0.159; order 2: mean Hellinger distance 0.2195, sd 0.2724). The Hellinger distance between the joint distributions was 0.352, and the similarity of random cohorts generated from real and synthetic data had a mean Hellinger distance of 0.3 and mean Euclidean distance of 0.064, indicating small differences between the distributions in the real data and the synthetic data. By applying a realistic analysis to both real and synthetic datasets, Cox regression hazard ratios achieved a mean confidence interval overlap of 68% for adjusted hazard ratios among 5 key outcomes of interest, indicating synthetic data produces similar analytic results to real data. The privacy assessment concluded that the attribution disclosure risk associated with this synthetic dataset was substantially less than the typical 0.09 acceptable risk threshold. Based on these metrics our results show that our synthetic data is suitably similar to the real data and could be shared for research purposes, thereby alleviating concerns associated with the sharing of real data in some circumstances.
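The two utility measures named in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Hellinger distance uses the standard definition for discrete distributions (0 for identical, 1 for disjoint support), and the confidence interval overlap uses one common formulation (the intersection length averaged over the two interval lengths, clamped at 0), which may differ from the variant used in the paper. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def empirical(values):
    """Empirical probability of each distinct value in `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as dicts."""
    keys = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys
    )
    return math.sqrt(squared / 2.0)


def ci_overlap(real_ci, synth_ci):
    """Overlap of two (lower, upper) intervals, averaged over both lengths.

    Assumed formulation: returns 0.0 when the intervals do not intersect.
    """
    low_r, high_r = real_ci
    low_s, high_s = synth_ci
    intersection = min(high_r, high_s) - max(low_r, low_s)
    if intersection <= 0:
        return 0.0
    return 0.5 * (intersection / (high_r - low_r) + intersection / (high_s - low_s))


# Toy, made-up event-type columns from a real and a synthetic dataset.
real_events = ["visit", "visit", "lab", "drug"]
synth_events = ["visit", "lab", "lab", "drug"]
distance = hellinger(empirical(real_events), empirical(synth_events))

# Toy, made-up hazard-ratio confidence intervals.
overlap = ci_overlap((0.0, 2.0), (1.0, 3.0))
```

Here `overlap` is 0.5, because the intervals share half of each interval's length. Applied per attribute and then averaged, `hellinger` yields summary values of the kind quoted in the abstract.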
Affiliation(s)
- Lucy Mosquera
  - Replica Analytics Ltd, Ottawa, ON, Canada
  - Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, ON, K1J 8L1, Canada
- Khaled El Emam
  - Replica Analytics Ltd, Ottawa, ON, Canada
  - Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, ON, K1J 8L1, Canada
  - School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada
- Lei Ding
  - Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada
- Vishal Sharma
  - School of Public Health, University of Alberta, Edmonton, AB, Canada
- Samer El Kababji
  - Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, ON, K1J 8L1, Canada
- Dan Palfrey
  - Institute of Health Economics, Edmonton, Alberta, Canada
- Linglong Kong
  - Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada
- Bei Jiang
  - Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada
- Dean T Eurich
  - School of Public Health, University of Alberta, Edmonton, AB, Canada
6