1
|
Ibrahim M, Khalil YA, Amirrajab S, Sun C, Breeuwer M, Pluim J, Elen B, Ertaylan G, Dumontier M. Generative AI for synthetic data across multiple medical modalities: A systematic review of recent developments and challenges. Comput Biol Med 2025; 189:109834. [PMID: 40023073 DOI: 10.1016/j.compbiomed.2025.109834] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2024] [Revised: 01/03/2025] [Accepted: 02/08/2025] [Indexed: 03/04/2025]
Abstract
This paper presents a comprehensive systematic review of generative models (GANs, VAEs, DMs, and LLMs) used to synthesize various medical data types, including imaging (dermoscopic, mammographic, ultrasound, CT, MRI, and X-ray), text, time-series, and tabular data (EHR). Unlike previous narrowly focused reviews, our study encompasses a broad array of medical data modalities and explores various generative models. Our aim is to offer insights into their current and future applications in medical research, particularly in the context of synthesis applications, generation techniques, and evaluation methods, as well as providing a GitHub repository as a dynamic resource for ongoing collaboration and innovation. Our search strategy queries databases such as Scopus, PubMed, and ArXiv, focusing on recent works from January 2021 to November 2023, excluding reviews and perspectives. This period emphasizes recent advancements beyond GANs, which have been extensively covered in previous reviews. The survey also emphasizes the aspect of conditional generation, which is not focused on in similar work. Key contributions include a broad, multi-modality scope that identifies cross-modality insights and opportunities unavailable in single-modality surveys. While core generative techniques are transferable, we find that synthesis methods often lack sufficient integration of patient-specific context, clinical knowledge, and modality-specific requirements tailored to the unique characteristics of medical data. Conditional models leveraging textual conditioning and multimodal synthesis remain underexplored but offer promising directions for innovation. Our findings are structured around three themes: (1) Synthesis applications, highlighting clinically valid synthesis applications and significant gaps in using synthetic data beyond augmentation, such as for validation and evaluation; (2) Generation techniques, identifying gaps in personalization and cross-modality innovation; and (3) Evaluation methods, revealing the absence of standardized benchmarks, the need for large-scale validation, and the importance of privacy-aware, clinically relevant evaluation frameworks. These findings emphasize the need for benchmarking and comparative studies to promote openness and collaboration.
Collapse
Affiliation(s)
- Mahmoud Ibrahim
- Institute of Data Science, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands; Department of Advanced Computing Sciences, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands; VITO, Belgium.
| | - Yasmina Al Khalil
- Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands
| | - Sina Amirrajab
- Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands
| | - Chang Sun
- Institute of Data Science, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands; Department of Advanced Computing Sciences, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands
| | - Marcel Breeuwer
- Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands
| | - Josien Pluim
- Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands
| | | | | | - Michel Dumontier
- Institute of Data Science, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands; Department of Advanced Computing Sciences, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands
| |
Collapse
|
2
|
Hernandez M, Osorio-Marulanda PA, Catalina M, Loinaz L, Epelde G, Aginako N. Comprehensive evaluation framework for synthetic tabular data in health: fidelity, utility and privacy analysis of generative models with and without privacy guarantees. Front Digit Health 2025; 7:1576290. [PMID: 40343213 PMCID: PMC12058740 DOI: 10.3389/fdgth.2025.1576290] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2025] [Accepted: 04/09/2025] [Indexed: 05/11/2025] Open
Abstract
The generation of synthetic tabular data has emerged as a key privacy-enhancing technology to address challenges in data sharing, particularly in healthcare, where sensitive attributes can compromise patient privacy. Despite significant progress, balancing fidelity, utility, and privacy in complex medical datasets remains a substantial challenge. This paper introduces a comprehensive and holistic evaluation framework for synthetic tabular data, consolidating metrics and privacy risk measures across three key categories (fidelity, utility and privacy) and incorporating a fidelity-utility tradeoff metric. The framework was applied to three open-source medical datasets to evaluate synthetic tabular data generated by five generative models, both with and without differential privacy. Results showed that simpler models generally achieved better fidelity and utility, while more complex models provided lower privacy risks. The addition of differential privacy enhanced privacy preservation but often reduced fidelity and utility, highlighting the complexity of balancing fidelity, utility and privacy in synthetic data generation for medical datasets. Despite its contributions, this study acknowledges limitations, such as the lack of evaluation metrics neither privacy risk measures for required model training time and resource usage, reliance on default model parameters, and the assessment of models that incorporates differential privacy with only a single privacy budget. Future work should explore parameter optimization, alternative privacy mechanisms, broader applications of the framework to diverse datasets and domains, and collaborations with clinicians for clinical utility evaluation. This study provides a foundation for improving synthetic tabular data evaluation and advancing privacy-preserving data sharing in healthcare.
Collapse
Affiliation(s)
- Mikel Hernandez
- Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, Spain
- Computer Science and Artificial Intelligence Department, Computer Science Faculty, University of the Basque Country (UPV/EHU), Donostia-San Sebastián, Spain
- eHealth Group, Biogipuzkoa Health Research Institute, Donostia-San Sebastián, Spain
| | - Pablo A. Osorio-Marulanda
- Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, Spain
- School of Applied Sciences and Engineering, Universidad EAFIT, Medellín, Colombia
| | - Mikel Catalina
- Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, Spain
| | - Lorea Loinaz
- Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, Spain
- Computer Science and Artificial Intelligence Department, Computer Science Faculty, University of the Basque Country (UPV/EHU), Donostia-San Sebastián, Spain
| | - Gorka Epelde
- Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, Spain
- eHealth Group, Biogipuzkoa Health Research Institute, Donostia-San Sebastián, Spain
| | - Naiara Aginako
- Computer Science and Artificial Intelligence Department, Computer Science Faculty, University of the Basque Country (UPV/EHU), Donostia-San Sebastián, Spain
| |
Collapse
|
3
|
Umesh C, Mahendra M, Bej S, Wolkenhauer O, Wolfien M. Challenges and applications in generative AI for clinical tabular data in physiology. Pflugers Arch 2025; 477:531-542. [PMID: 39417878 PMCID: PMC11958401 DOI: 10.1007/s00424-024-03024-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2024] [Revised: 09/17/2024] [Accepted: 09/23/2024] [Indexed: 10/19/2024]
Abstract
Recent advancements in generative approaches in AI have opened up the prospect of synthetic tabular clinical data generation. From filling in missing values in real-world data, these approaches have now advanced to creating complex multi-tables. This review explores the development of techniques capable of synthesizing patient data and modeling multiple tables. We highlight the challenges and opportunities of these methods for analyzing patient data in physiology. Additionally, it discusses the challenges and potential of these approaches in improving clinical research, personalized medicine, and healthcare policy. The integration of these generative models into physiological settings may represent both a theoretical advancement and a practical tool that has the potential to improve mechanistic understanding and patient care. By providing a reliable source of synthetic data, these models can also help mitigate privacy concerns and facilitate large-scale data sharing.
Collapse
Affiliation(s)
- Chaithra Umesh
- Institute of Computer Science, Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany.
| | - Manjunath Mahendra
- Institute of Computer Science, Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany.
| | - Saptarshi Bej
- School of Data Science, Indian Institute of Science Education and Research (IISER), Thiruvananthapuram, India
| | - Olaf Wolkenhauer
- Institute of Computer Science, Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany
- Leibniz-Institute for Food Systems Biology, Technical University of Munich, Freising, Germany
| | - Markus Wolfien
- Faculty of Medicine Carl Gustav Carus, Institute for Medical Informatics and Biometry, TUD Dresden University of Technology, Dresden, Germany
- Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI), Dresden, Germany
| |
Collapse
|
4
|
Fournier SJ, Urbani P. Generative modeling through internal high-dimensional chaotic activity. Phys Rev E 2025; 111:045304. [PMID: 40411070 DOI: 10.1103/physreve.111.045304] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2024] [Accepted: 03/10/2025] [Indexed: 05/26/2025]
Abstract
Generative modeling aims to produce new data points whose statistical properties resemble those in a training dataset. In recent years, there has been a burst of machine learning techniques and settings that can achieve this goal with remarkable performances. In most of these settings, one uses the training dataset in conjunction with noise, which is added as a source of statistical variability and is essential for the generative task. Here, we explore the idea of using internal chaotic dynamics in high-dimensional chaotic systems as a way to generate new data points from a training dataset. We show that simple learning rules can achieve this goal within a set of vanilla architectures and characterize the quality of the generated data points through standard accuracy measures.
Collapse
Affiliation(s)
- Samantha J Fournier
- Institut de Physique Théorique, Université Paris Saclay, CNRS, CEA, F-91191 Gif-sur-Yvette, France
| | - Pierfrancesco Urbani
- Institut de Physique Théorique, Université Paris Saclay, CNRS, CEA, F-91191 Gif-sur-Yvette, France
| |
Collapse
|
5
|
Carbone A, Decelle A, Rosset L, Seoane B. Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2025; 47:1309-1316. [PMID: 39527442 DOI: 10.1109/tpami.2024.3495999] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2024]
Abstract
In this study, we address the challenge of using energy-based models to produce high-quality, label-specific data in complex structured datasets, such as population genetics, RNA or protein sequences data. Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing, which affects the diversity of synthetic data and increases generation times. To address these issues, we use a novel training algorithm that exploits non-equilibrium effects. This approach, applied to the Restricted Boltzmann Machine, improves the model's ability to correctly classify samples and generate high-quality synthetic data in only a few sampling steps. The effectiveness of this method is demonstrated by its successful application to five different types of data: handwritten digits, mutations of human genomes classified by continental origin, functionally characterized sequences of an enzyme protein family, homologous RNA sequences from specific taxonomies and real classical piano pieces classified by their composer.
Collapse
|
6
|
Sella N, Guinot F, Lagrange N, Albou LP, Desponds J, Isambert H. Preserving information while respecting privacy through an information theoretic framework for synthetic health data generation. NPJ Digit Med 2025; 8:49. [PMID: 39843957 PMCID: PMC11754479 DOI: 10.1038/s41746-025-01431-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2024] [Accepted: 01/02/2025] [Indexed: 01/24/2025] Open
Abstract
Generating synthetic data from medical records is a complex task intensified by patient privacy concerns. In recent years, multiple approaches have been reported for the generation of synthetic data, however, limited attention was given to jointly evaluate the quality and the privacy of the generated data. The quality and privacy of synthetic data stem from multivariate associations across variables, which cannot be assessed by comparing univariate distributions with the original data. Here, we introduce a novel algorithm (MIIC-SDG) for generating synthetic data from electronic records based on a multivariate information framework and Bayesian network theory. We also propose a new metric to quantitatively assess the trade-off between the Quality and Privacy Scores (QPS) of synthetic data generation methods. The performance of MIIC-SDG is demonstrated on different clinical datasets and favorably compares with state-of-the-art synthetic data generation methods, based on the QPS trade-off between several quality and privacy metrics.
Collapse
Affiliation(s)
- Nadir Sella
- Institut Roche, Boulogne-Billancourt, France.
| | | | - Nikita Lagrange
- Institut Curie, CNRS UMR168, PSL University, Sorbonne University, Paris, 75005, France
| | | | | | - Hervé Isambert
- Institut Curie, CNRS UMR168, PSL University, Sorbonne University, Paris, 75005, France
| |
Collapse
|
7
|
Alqulaity M, Yang P. Enhanced Conditional GAN for High-Quality Synthetic Tabular Data Generation in Mobile-Based Cardiovascular Healthcare. SENSORS (BASEL, SWITZERLAND) 2024; 24:7673. [PMID: 39686209 DOI: 10.3390/s24237673] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/06/2024] [Revised: 11/01/2024] [Accepted: 11/25/2024] [Indexed: 12/18/2024]
Abstract
The generation of synthetic tabular data has emerged as a critical task in various fields, particularly in healthcare, where data privacy concerns limit the availability of real datasets for research and analysis. This paper presents an enhanced Conditional Generative Adversarial Network (GAN) architecture designed for generating high-quality synthetic tabular data, with a focus on cardiovascular disease datasets that encompass mixed data types and complex feature relationships. The proposed architecture employs specialized sub-networks to process continuous and categorical variables separately, leveraging metadata such as Gaussian Mixture Model (GMM) parameters for continuous attributes and embedding layers for categorical features. By integrating these specialized pathways, the generator produces synthetic samples that closely mimic the statistical properties of the real data. Comprehensive experiments were conducted to compare the proposed architecture with two established models: Conditional Tabular GAN (CTGAN) and Tabular Variational AutoEncoder (TVAE). The evaluation utilized metrics such as the Kolmogorov-Smirnov (KS) test for continuous variables, the Jaccard coefficient for categorical variables, and pairwise correlation analyses. Results indicate that the proposed approach attains a mean KS statistic of 0.3900, demonstrating strong overall performance that outperforms CTGAN (0.4803) and is comparable to TVAE (0.3858). Notably, our approach shows lowest KS statistics for key continuous features, such as total cholesterol (KS = 0.0779), weight (KS = 0.0861), and diastolic blood pressure (KS = 0.0957), indicating its effectiveness in closely replicating real data distributions. Additionally, it achieved a Jaccard coefficient of 1.00 for eight out of eleven categorical variables, effectively preserving categorical distributions. These findings indicate that the proposed architecture captures both distributions and dependencies, providing a robust solution in supporting mobile personalized cardiovascular disease prevention systems.
Collapse
Affiliation(s)
- Malak Alqulaity
- Department of Computer Science, University of Sheffield, Sheffield S1 4DP, UK
| | - Po Yang
- Department of Computer Science, University of Sheffield, Sheffield S1 4DP, UK
| |
Collapse
|
8
|
Chen Y, Calabrese R, Martin-Barragan B. JointLIME: An interpretation method for machine learning survival models with endogenous time-varying covariates in credit scoring. RISK ANALYSIS : AN OFFICIAL PUBLICATION OF THE SOCIETY FOR RISK ANALYSIS 2024. [PMID: 39568231 DOI: 10.1111/risa.17679] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/22/2024]
Abstract
In this work, we introduce JointLIME, a novel interpretation method for explaining black-box survival (BBS) models with endogenous time-varying covariates (TVCs). Existing interpretation methods, like SurvLIME, are limited to BBS models only with time-invariant covariates. To fill this gap, JointLIME leverages the Local Interpretable Model-agnostic Explanations (LIME) framework to apply the joint model to approximate the survival functions predicted by the BBS model in a local area around a new individual. To achieve this, JointLIME minimizes the distances between survival functions predicted by the black-box survival model and those derived from the joint model. The outputs of this minimization problem are the coefficient values of each covariate in the joint model, serving as explanations to quantify their impact on survival predictions. JointLIME uniquely incorporates endogenous TVCs using a spline-based model coupled with the Monte Carlo method for precise estimations within any specified prediction period. These estimations are then integrated to formulate the joint model in the optimization problem. We illustrate the explanation results of JointLIME using a US mortgage data set and compare them with those of SurvLIME.
Collapse
Affiliation(s)
- Yujia Chen
- Business School, University of Edinburgh, Edinburgh, UK
| | - Raffaella Calabrese
- Business School, University of Edinburgh, Edinburgh, UK
- European University Institute, Via dei Roccettini, Fiesole, Tuscany, Italy
| | | |
Collapse
|
9
|
Tian M, Chen B, Guo A, Jiang S, Zhang AR. Reliable generation of privacy-preserving synthetic electronic health record time series via diffusion models. J Am Med Inform Assoc 2024; 31:2529-2539. [PMID: 39222376 PMCID: PMC11491591 DOI: 10.1093/jamia/ocae229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2023] [Revised: 08/04/2024] [Accepted: 08/12/2024] [Indexed: 09/04/2024] Open
Abstract
OBJECTIVE Electronic health records (EHRs) are rich sources of patient-level data, offering valuable resources for medical data analysis. However, privacy concerns often restrict access to EHRs, hindering downstream analysis. Current EHR deidentification methods are flawed and can lead to potential privacy leakage. Additionally, existing publicly available EHR databases are limited, preventing the advancement of medical research using EHR. This study aims to overcome these challenges by generating realistic and privacy-preserving synthetic EHRs time series efficiently. MATERIALS AND METHODS We introduce a new method for generating diverse and realistic synthetic EHR time series data using denoizing diffusion probabilistic models. We conducted experiments on 6 databases: Medical Information Mart for Intensive Care III and IV, the eICU Collaborative Research Database (eICU), and non-EHR datasets on Stocks and Energy. We compared our proposed method with 8 existing methods. RESULTS Our results demonstrate that our approach significantly outperforms all existing methods in terms of data fidelity while requiring less training effort. Additionally, data generated by our method yield a lower discriminative accuracy compared to other baseline methods, indicating the proposed method can generate data with less privacy risk. DISCUSSION The proposed model utilizes a mixed diffusion process to generate realistic synthetic EHR samples that protect patient privacy. This method could be useful in tackling data availability issues in the field of healthcare by reducing barrier to EHR access and supporting research in machine learning for health. CONCLUSION The proposed diffusion model-based method can reliably and efficiently generate synthetic EHR time series, which facilitates the downstream medical data analysis. Our numerical results show the superiority of the proposed method over all other existing methods.
Collapse
Affiliation(s)
- Muhang Tian
- Department of Computer Science, Duke University, Durham, NC 27708, United States
| | - Bernie Chen
- Department of Electrical & Computer Engineering, Duke University, Durham, NC 27708, United States
| | - Allan Guo
- Department of Computer Science, Duke University, Durham, NC 27708, United States
| | - Shiyi Jiang
- Department of Electrical & Computer Engineering, Duke University, Durham, NC 27708, United States
| | - Anru R Zhang
- Department of Computer Science, Duke University, Durham, NC 27708, United States
- Department of Biostatistics & Bioinformatics, Duke University, Durham, NC 27708, United States
| |
Collapse
|
10
|
Crouzet A, Lopez N, Riss Yaw B, Lepelletier Y, Demange L. The Millennia-Long Development of Drugs Associated with the 80-Year-Old Artificial Intelligence Story: The Therapeutic Big Bang? Molecules 2024; 29:2716. [PMID: 38930784 PMCID: PMC11206022 DOI: 10.3390/molecules29122716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2024] [Revised: 05/30/2024] [Accepted: 05/31/2024] [Indexed: 06/28/2024] Open
Abstract
The journey of drug discovery (DD) has evolved from ancient practices to modern technology-driven approaches, with Artificial Intelligence (AI) emerging as a pivotal force in streamlining and accelerating the process. Despite the vital importance of DD, it faces challenges such as high costs and lengthy timelines. This review examines the historical progression and current market of DD alongside the development and integration of AI technologies. We analyse the challenges encountered in applying AI to DD, focusing on drug design and protein-protein interactions. The discussion is enriched by presenting models that put forward the application of AI in DD. Three case studies are highlighted to demonstrate the successful application of AI in DD, including the discovery of a novel class of antibiotics and a small-molecule inhibitor that has progressed to phase II clinical trials. These cases underscore the potential of AI to identify new drug candidates and optimise the development process. The convergence of DD and AI embodies a transformative shift in the field, offering a path to overcome traditional obstacles. By leveraging AI, the future of DD promises enhanced efficiency and novel breakthroughs, heralding a new era of medical innovation even though there is still a long way to go.
Collapse
Affiliation(s)
- Aurore Crouzet
- UMR 8038 CNRS CiTCoM, Team PNAS, Faculté de Pharmacie, Université Paris Cité, 4 Avenue de l’Observatoire, 75006 Paris, France
- W-MedPhys, 128 Rue la Boétie, 75008 Paris, France
| | - Nicolas Lopez
- W-MedPhys, 128 Rue la Boétie, 75008 Paris, France
- ENOES, 62 Rue de Miromesnil, 75008 Paris, France
- Unité Mixte de Recherche «Institut de Physique Théorique (IPhT)» CEA-CNRS, UMR 3681, Bat 774, Route de l’Orme des Merisiers, 91191 St Aubin-Gif-sur-Yvette, France
| | - Benjamin Riss Yaw
- UMR 8038 CNRS CiTCoM, Team PNAS, Faculté de Pharmacie, Université Paris Cité, 4 Avenue de l’Observatoire, 75006 Paris, France
| | - Yves Lepelletier
- W-MedPhys, 128 Rue la Boétie, 75008 Paris, France
- Université Paris Cité, Imagine Institute, 24 Boulevard Montparnasse, 75015 Paris, France
- INSERM UMR 1163, Laboratory of Cellular and Molecular Basis of Normal Hematopoiesis and Hematological Disorders: Therapeutical Implications, 24 Boulevard Montparnasse, 75015 Paris, France
| | - Luc Demange
- UMR 8038 CNRS CiTCoM, Team PNAS, Faculté de Pharmacie, Université Paris Cité, 4 Avenue de l’Observatoire, 75006 Paris, France
| |
Collapse
|
11
|
Vallevik VB, Babic A, Marshall SE, Elvatun S, Brøgger HMB, Alagaratnam S, Edwin B, Veeraragavan NR, Befring AK, Nygård JF. Can I trust my fake data - A comprehensive quality assessment framework for synthetic tabular data in healthcare. Int J Med Inform 2024; 185:105413. [PMID: 38493547 DOI: 10.1016/j.ijmedinf.2024.105413] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2023] [Revised: 02/17/2024] [Accepted: 03/11/2024] [Indexed: 03/19/2024]
Abstract
BACKGROUND Ensuring safe adoption of AI tools in healthcare hinges on access to sufficient data for training, testing and validation. Synthetic data has been suggested in response to privacy concerns and regulatory requirements and can be created by training a generator on real data to produce a dataset with similar statistical properties. Competing metrics with differing taxonomies for quality evaluation have been proposed, resulting in a complex landscape. Optimising quality entails balancing considerations that make the data fit for use, yet relevant dimensions are left out of existing frameworks. METHOD We performed a comprehensive literature review on the use of quality evaluation metrics on synthetic data within the scope of synthetic tabular healthcare data using deep generative methods. Based on this and the collective team experiences, we developed a conceptual framework for quality assurance. The applicability was benchmarked against a practical case from the Dutch National Cancer Registry. CONCLUSION We present a conceptual framework for quality assuranceof synthetic data for AI applications in healthcare that aligns diverging taxonomies, expands on common quality dimensions to include the dimensions of Fairness and Carbon footprint, and proposes stages necessary to support real-life applications. Building trust in synthetic data by increasing transparency and reducing the safety risk will accelerate the development and uptake of trustworthy AI tools for the benefit of patients. DISCUSSION Despite the growing emphasis on algorithmic fairness and carbon footprint, these metrics were scarce in the literature review. The overwhelming focus was on statistical similarity using distance metrics while sequential logic detection was scarce. A consensus-backed framework that includes all relevant quality dimensions can provide assurance for safe and responsible real-life applications of synthetic data. As the choice of appropriate metrics are highly context dependent, further research is needed on validation studies to guide metric choices and support the development of technical standards.
Collapse
Affiliation(s)
- Vibeke Binz Vallevik
- University of Oslo, Boks 1072 Blindern, NO-0316 Oslo, Norway; DNV AS, Veritasveien 1, 1322 Høvik, Norway.
| | | | | | - Severin Elvatun
- Cancer Registry of Norway, Ullernchausseen 64, 0379 Oslo, Norway
| | - Helga M B Brøgger
- DNV AS, Veritasveien 1, 1322 Høvik, Norway; Oslo University Hospital, Sognsvannsveien 20, 0372 Oslo, Norway
| | | | - Bjørn Edwin
- University of Oslo, Boks 1072 Blindern, NO-0316 Oslo, Norway; The Intervention Centre and Department of HPB Surgery, Oslo University Hospital and Institute of Clinical Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway
| | | | | | - Jan F Nygård
- Cancer Registry of Norway, Ullernchausseen 64, 0379 Oslo, Norway; UiT - The Arctic University of Norway, Tromsø, Norway
| |
Collapse
|
12
|
Yan C, Zhang Z, Nyemba S, Li Z. Generating Synthetic Electronic Health Record Data Using Generative Adversarial Networks: Tutorial. JMIR AI 2024; 3:e52615. [PMID: 38875595 PMCID: PMC11074891 DOI: 10.2196/52615] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/10/2023] [Revised: 01/24/2024] [Accepted: 03/07/2024] [Indexed: 06/16/2024]
Abstract
Synthetic electronic health record (EHR) data generation has been increasingly recognized as an important solution to expand the accessibility and maximize the value of private health data on a large scale. Recent advances in machine learning have facilitated more accurate modeling for complex and high-dimensional data, thereby greatly enhancing the data quality of synthetic EHR data. Among various approaches, generative adversarial networks (GANs) have become the main technical path in the literature due to their ability to capture the statistical characteristics of real data. However, there is a scarcity of detailed guidance within the domain regarding the development procedures of synthetic EHR data. The objective of this tutorial is to present a transparent and reproducible process for generating structured synthetic EHR data using a publicly accessible EHR data set as an example. We cover the topics of GAN architecture, EHR data types and representation, data preprocessing, GAN training, synthetic data generation and postprocessing, and data quality evaluation. We conclude this tutorial by discussing multiple important issues and future opportunities in this domain. The source code of the entire process has been made publicly available.
Collapse
Affiliation(s)
- Chao Yan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
| | - Ziqi Zhang
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
| | - Steve Nyemba
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
| | - Zhuohang Li
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
| |
Collapse
|
13
|
Chen Y, Esmaeilzadeh P. Generative AI in Medical Practice: In-Depth Exploration of Privacy and Security Challenges. J Med Internet Res 2024; 26:e53008. [PMID: 38457208 PMCID: PMC10960211 DOI: 10.2196/53008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2023] [Revised: 12/12/2023] [Accepted: 01/31/2024] [Indexed: 03/09/2024] Open
Abstract
As advances in artificial intelligence (AI) continue to transform and revolutionize the field of medicine, understanding the potential uses of generative AI in health care becomes increasingly important. Generative AI, including models such as generative adversarial networks and large language models, shows promise in transforming medical diagnostics, research, treatment planning, and patient care. However, these data-intensive systems pose new threats to protected health information. This Viewpoint paper aims to explore various categories of generative AI in health care, including medical diagnostics, drug discovery, virtual health assistants, medical research, and clinical decision support, while identifying security and privacy threats within each phase of the life cycle of such systems (ie, data collection, model development, and implementation phases). The objectives of this study were to analyze the current state of generative AI in health care, identify opportunities and privacy and security challenges posed by integrating these technologies into existing health care infrastructure, and propose strategies for mitigating security and privacy risks. This study highlights the importance of addressing the security and privacy threats associated with generative AI in health care to ensure the safe and effective use of these systems. The findings of this study can inform the development of future generative AI systems in health care and help health care organizations better understand the potential benefits and risks associated with these systems. By examining the use cases and benefits of generative AI across diverse domains within health care, this paper contributes to theoretical discussions surrounding AI ethics, security vulnerabilities, and data privacy regulations. In addition, this study provides practical insights for stakeholders looking to adopt generative AI solutions within their organizations.
Collapse
Affiliation(s)
- Yan Chen
- Department of Information Systems and Business Analytics, College of Business, Florida International University, Miami, FL, United States
| | - Pouyan Esmaeilzadeh
- Department of Information Systems and Business Analytics, College of Business, Florida International University, Miami, FL, United States
| |
Collapse
|
14
|
Gwon H, Ahn I, Kim Y, Kang HJ, Seo H, Choi H, Cho HN, Kim M, Han J, Kee G, Park S, Lee KH, Jun TJ, Kim YH. LDP-GAN : Generative adversarial networks with local differential privacy for patient medical records synthesis. Comput Biol Med 2024; 168:107738. [PMID: 37995536 DOI: 10.1016/j.compbiomed.2023.107738] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2023] [Revised: 10/31/2023] [Accepted: 11/16/2023] [Indexed: 11/25/2023]
Abstract
Electronic medical records(EMR) have considerable potential to advance healthcare technologies, including medical AI. Nevertheless, due to the privacy issues associated with the sharing of patient's personal information, it is difficult to sufficiently utilize them. Generative models based on deep learning can solve this problem by creating synthetic data similar to real patient data. However, the data used for training these deep learning models run into the risk of getting leaked because of malicious attacks. This means that traditional deep learning-based generative models cannot completely solve the privacy issues. Therefore, we suggested a method to prevent the leakage of training data by protecting the model from malicious attacks using local differential privacy(LDP). Our method was evaluated in terms of utility and privacy. Experimental results demonstrated that the proposed method can generate medical data with reasonable performance while protecting training data from malicious attacks.
Collapse
Affiliation(s)
- Hansle Gwon
- Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
| | - Imjin Ahn
- Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
| | - Yunha Kim
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
| | - Hee Jun Kang
- Division of Cardiology, Asan Medical Center, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
| | - Hyeram Seo
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
| | - Heejung Choi
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
| | - Ha Na Cho
- Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
| | - Minkyoung Kim
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
| | - JiYe Han
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
| | - Gaeun Kee
- Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
| | - Seohyun Park
- Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
| | - Kye Hwa Lee
- Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
| | - Tae Joon Jun
- Big Data Research Center, Asan Institute for Life Sciences, Asan Medical Center, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea.
| | - Young-Hak Kim
- Division of Cardiology, Department of Information Medicine, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
| |
Collapse
|
15
|
Yelmen B, Decelle A, Boulos LL, Szatkownik A, Furtlehner C, Charpiat G, Jay F. Deep convolutional and conditional neural networks for large-scale genomic data generation. PLoS Comput Biol 2023; 19:e1011584. [PMID: 37903158 PMCID: PMC10635570 DOI: 10.1371/journal.pcbi.1011584] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2023] [Revised: 11/09/2023] [Accepted: 10/09/2023] [Indexed: 11/01/2023] Open
Abstract
Applications of generative models for genomic data have gained significant momentum in the past few years, with scopes ranging from data characterization to generation of genomic segments and functional sequences. In our previous study, we demonstrated that generative adversarial networks (GANs) and restricted Boltzmann machines (RBMs) can be used to create novel high-quality artificial genomes (AGs) which can preserve the complex characteristics of real genomes such as population structure, linkage disequilibrium and selection signals. However, a major drawback of these models is scalability, since the large feature space of genome-wide data increases computational complexity vastly. To address this issue, we implemented a novel convolutional Wasserstein GAN (WGAN) model along with a novel conditional RBM (CRBM) framework for generating AGs with high SNP number. These networks implicitly learn the varying landscape of haplotypic structure in order to capture complex correlation patterns along the genome and generate a wide diversity of plausible haplotypes. We performed comparative analyses to assess both the quality of these generated haplotypes and the amount of possible privacy leakage from the training data. As the importance of genetic privacy becomes more prevalent, the need for effective privacy protection measures for genomic data increases. We used generative neural networks to create large artificial genome segments which possess many characteristics of real genomes without substantial privacy leakage from the training dataset. In the near future, with further improvements in haplotype quality and privacy preservation, large-scale artificial genome databases can be assembled to provide easily accessible surrogates of real databases, allowing researchers to conduct studies with diverse genomic data within a safe ethical framework in terms of donor privacy.
Collapse
Affiliation(s)
- Burak Yelmen
- Université Paris-Saclay, CNRS, INRIA, LISN, Paris, France
- University of Tartu, Institute of Genomics, Tartu, Estonia
| | - Aurélien Decelle
- Université Paris-Saclay, CNRS, INRIA, LISN, Paris, France
- Universidad Complutense de Madrid, Departamento de Física Teórica, Madrid, Spain
| | - Leila Lea Boulos
- Université Paris-Saclay, CNRS, INRIA, LISN, Paris, France
- Université d’Évry Val-d’Essonne, Évry-Courcouronnes, France
| | | | | | | | - Flora Jay
- Université Paris-Saclay, CNRS, INRIA, LISN, Paris, France
| |
Collapse
|
16
|
Tagmatova Z, Abdusalomov A, Nasimov R, Nasimova N, Dogru AH, Cho YI. New Approach for Generating Synthetic Medical Data to Predict Type 2 Diabetes. Bioengineering (Basel) 2023; 10:1031. [PMID: 37760133 PMCID: PMC10525473 DOI: 10.3390/bioengineering10091031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/31/2023] [Revised: 08/28/2023] [Accepted: 08/30/2023] [Indexed: 09/29/2023] Open
Abstract
The lack of medical databases is currently the main barrier to the development of artificial intelligence-based algorithms in medicine. This issue can be partially resolved by developing a reliable high-quality synthetic database. In this study, an easy and reliable method for developing a synthetic medical database based only on statistical data is proposed. This method changes the primary database developed based on statistical data using a special shuffle algorithm to achieve a satisfactory result and evaluates the resulting dataset using a neural network. Using the proposed method, a database was developed to predict the risk of developing type 2 diabetes 5 years in advance. This dataset consisted of data from 172,290 patients. The prediction accuracy reached 94.45% during neural network training of the dataset.
Collapse
Affiliation(s)
- Zarnigor Tagmatova
- Department of Computer Engineering, Gachon University, Sujeong-Gu, Seongnam-Si 461-701, Republic of Korea
| | - Akmalbek Abdusalomov
- Department of Computer Engineering, Gachon University, Sujeong-Gu, Seongnam-Si 461-701, Republic of Korea
| | - Rashid Nasimov
- Department of Artificial Intelligence, Tashkent State University of Economics, Tashkent 100066, Uzbekistan
| | - Nigorakhon Nasimova
- Department of Artificial Intelligence, Tashkent State University of Economics, Tashkent 100066, Uzbekistan
| | - Ali Hikmet Dogru
- Department of Computer Science, University of Texas at San Antonio, San Antonio, TX 78249-0667, USA;
| | - Young-Im Cho
- Department of Computer Engineering, Gachon University, Sujeong-Gu, Seongnam-Si 461-701, Republic of Korea
| |
Collapse
|
17
|
Theodorou B, Xiao C, Sun J. Synthesize high-dimensional longitudinal electronic health records via hierarchical autoregressive language model. Nat Commun 2023; 14:5305. [PMID: 37652934 PMCID: PMC10471716 DOI: 10.1038/s41467-023-41093-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2023] [Accepted: 08/23/2023] [Indexed: 09/02/2023] Open
Abstract
Synthetic electronic health records (EHRs) that are both realistic and privacy-preserving offer alternatives to real EHRs for machine learning (ML) and statistical analysis. However, generating high-fidelity EHR data in its original, high-dimensional form poses challenges for existing methods. We propose Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal, high-dimensional EHR, which preserve the statistical properties of real EHRs and can train accurate ML models without privacy concerns. HALO generates a probability density function over medical codes, clinical visits, and patient records, allowing for generating realistic EHR data without requiring variable selection or aggregation. Extensive experiments demonstrated that HALO can generate high-fidelity data with high-dimensional disease code probabilities closely mirroring (above 0.9 R2 correlation) real EHR data. HALO also enhances the accuracy of predictive modeling and enables downstream ML models to attain similar accuracy as models trained on genuine data.
Collapse
Affiliation(s)
- Brandon Theodorou
- University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL, USA
- Medisyn Inc., Las Vegas, NV, USA
| | - Cao Xiao
- Medisyn Inc., Las Vegas, NV, USA
| | - Jimeng Sun
- University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL, USA.
- Medisyn Inc., Las Vegas, NV, USA.
| |
Collapse
|
18
|
Lenatti M, Paglialonga A, Orani V, Ferretti M, Mongelli M. Characterization of Synthetic Health Data Using Rule-Based Artificial Intelligence Models. IEEE J Biomed Health Inform 2023; 27:3760-3769. [PMID: 37018683 DOI: 10.1109/jbhi.2023.3236722] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/15/2023]
Abstract
The aim of this study is to apply and characterize eXplainable AI (XAI) to assess the quality of synthetic health data generated using a data augmentation algorithm. In this exploratory study, several synthetic datasets are generated using various configurations of a conditional Generative Adversarial Network (GAN) from a set of 156 observations related to adult hearing screening. A rule-based native XAI algorithm, the Logic Learning Machine, is used in combination with conventional utility metrics. The classification performance in different conditions is assessed: models trained and tested on synthetic data, models trained on synthetic data and tested on real data, and models trained on real data and tested on synthetic data. The rules extracted from real and synthetic data are then compared using a rule similarity metric. The results indicate that XAI may be used to assess the quality of synthetic data by (i) the analysis of classification performance and (ii) the analysis of the rules extracted on real and synthetic data (number, covering, structure, cut-off values, and similarity). These results suggest that XAI can be used in an original way to assess synthetic health data and extract knowledge about the mechanisms underlying the generated data.
Collapse
|
19
|
Lacan A, Sebag M, Hanczar B. GAN-based data augmentation for transcriptomics: survey and comparative assessment. Bioinformatics 2023; 39:i111-i120. [PMID: 37387181 DOI: 10.1093/bioinformatics/btad239] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION Transcriptomics data are becoming more accessible due to high-throughput and less costly sequencing methods. However, data scarcity prevents exploiting deep learning models' full predictive power for phenotypes prediction. Artificially enhancing the training sets, namely data augmentation, is suggested as a regularization strategy. Data augmentation corresponds to label-invariant transformations of the training set (e.g. geometric transformations on images and syntax parsing on text data). Such transformations are, unfortunately, unknown in the transcriptomic field. Therefore, deep generative models such as generative adversarial networks (GANs) have been proposed to generate additional samples. In this article, we analyze GAN-based data augmentation strategies with respect to performance indicators and the classification of cancer phenotypes. RESULTS This work highlights a significant boost in binary and multiclass classification performances due to augmentation strategies. Without augmentation, training a classifier on only 50 RNA-seq samples yields an accuracy of, respectively, 94% and 70% for binary and tissue classification. In comparison, we achieved 98% and 94% of accuracy when adding 1000 augmented samples. Richer architectures and more expensive training of the GAN return better augmentation performances and generated data quality overall. Further analysis of the generated data shows that several performance indicators are needed to assess its quality correctly. AVAILABILITY AND IMPLEMENTATION All data used for this research are publicly available and comes from The Cancer Genome Atlas. Reproducible code is available on the GitLab repository: https://forge.ibisc.univ-evry.fr/alacan/GANs-for-transcriptomics.
Collapse
Affiliation(s)
- Alice Lacan
- IBISC, University Paris-Saclay (Univ. Evry), Evry 91000, France
| | - Michèle Sebag
- TAU, CNRS-INRIA-LISN, University Paris-Saclay, Gif-sur-Yvette 91190, France
| | - Blaise Hanczar
- IBISC, University Paris-Saclay (Univ. Evry), Evry 91000, France
| |
Collapse
|
20
|
Nikolentzos G, Vazirgiannis M, Xypolopoulos C, Lingman M, Brandt EG. Synthetic electronic health records generated with variational graph autoencoders. NPJ Digit Med 2023; 6:83. [PMID: 37120594 PMCID: PMC10148837 DOI: 10.1038/s41746-023-00822-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2022] [Accepted: 04/05/2023] [Indexed: 05/01/2023] Open
Abstract
Data-driven medical care delivery must always respect patient privacy-a requirement that is not easily met. This issue has impeded improvements to healthcare software and has delayed the long-predicted prevalence of artificial intelligence in healthcare. Until now, it has been very difficult to share data between healthcare organizations, resulting in poor statistical models due to unrepresentative patient cohorts. Synthetic data, i.e., artificial but realistic electronic health records, could overcome the drought that is troubling the healthcare sector. Deep neural network architectures, in particular, have shown an incredible ability to learn from complex data sets and generate large amounts of unseen data points with the same statistical properties as the training data. Here, we present a generative neural network model that can create synthetic health records with realistic timelines. These clinical trajectories are generated on a per-patient basis and are represented as linear-sequence graphs of clinical events over time. We use a variational graph autoencoder (VGAE) to generate synthetic samples from real-world electronic health records. Our approach generates health records not seen in the training data. We show that these artificial patient trajectories are realistic and preserve patient privacy and can therefore support the safe sharing of data across organizations.
Collapse
Affiliation(s)
- Giannis Nikolentzos
- LIX, École Polytechnique, Institut Polytechnique de Paris, Palaiseau, France.
| | - Michalis Vazirgiannis
- LIX, École Polytechnique, Institut Polytechnique de Paris, Palaiseau, France
- Department of Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden
| | | | - Markus Lingman
- Department of Molecular and Clinical Medicine/Cardiology, Institute of Medicine, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
- Center for Applied Intelligent Systems Research, Halmstad University, Halmstad, Sweden
| | | |
Collapse
|
21
|
Mosquera L, El Emam K, Ding L, Sharma V, Zhang XH, Kababji SE, Carvalho C, Hamilton B, Palfrey D, Kong L, Jiang B, Eurich DT. A method for generating synthetic longitudinal health data. BMC Med Res Methodol 2023; 23:67. [PMID: 36959532 PMCID: PMC10034254 DOI: 10.1186/s12874-023-01869-w] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2022] [Accepted: 02/19/2023] [Indexed: 03/25/2023] Open
Abstract
Getting access to administrative health data for research purposes is a difficult and time-consuming process due to increasingly demanding privacy regulations. An alternative method for sharing administrative health data would be to share synthetic datasets where the records do not correspond to real individuals, but the patterns and relationships seen in the data are reproduced. This paper assesses the feasibility of generating synthetic administrative health data using a recurrent deep learning model. Our data comes from 120,000 individuals from Alberta Health's administrative health database. We assess how similar our synthetic data is to the real data using utility assessments that assess the structure and general patterns in the data as well as by recreating a specific analysis in the real data commonly applied to this type of administrative health data. We also assess the privacy risks associated with the use of this synthetic dataset. Generic utility assessments that used Hellinger distance to quantify the difference in distributions between real and synthetic datasets for event types (0.027), attributes (mean 0.0417), Markov transition matrices (order 1 mean absolute difference: 0.0896, sd: 0.159; order 2: mean Hellinger distance 0.2195, sd: 0.2724), the Hellinger distance between the joint distributions was 0.352, and the similarity of random cohorts generated from real and synthetic data had a mean Hellinger distance of 0.3 and mean Euclidean distance of 0.064, indicating small differences between the distributions in the real data and the synthetic data. By applying a realistic analysis to both real and synthetic datasets, Cox regression hazard ratios achieved a mean confidence interval overlap of 68% for adjusted hazard ratios among 5 key outcomes of interest, indicating synthetic data produces similar analytic results to real data. The privacy assessment concluded that the attribution disclosure risk associated with this synthetic dataset was substantially less than the typical 0.09 acceptable risk threshold. Based on these metrics our results show that our synthetic data is suitably similar to the real data and could be shared for research purposes thereby alleviating concerns associated with the sharing of real data in some circumstances.
Collapse
Affiliation(s)
- Lucy Mosquera
- Replica Analytics Ltd, Ottawa, ON, Canada
- Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, ON, K1J 8L1, Canada
| | - Khaled El Emam
- Replica Analytics Ltd, Ottawa, ON, Canada.
- Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, ON, K1J 8L1, Canada.
- School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada.
| | - Lei Ding
- Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada
| | - Vishal Sharma
- School of Public Health, University of Alberta, Edmonton, AB, Canada
| | | | - Samer El Kababji
- Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, ON, K1J 8L1, Canada
| | | | | | - Dan Palfrey
- Institute of Health Economics, Edmonton, Alberta, Canada
| | - Linglong Kong
- Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada
| | - Bei Jiang
- Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada
| | - Dean T Eurich
- School of Public Health, University of Alberta, Edmonton, AB, Canada
| |
Collapse
|
22
|
Hernadez M, Epelde G, Alberdi A, Cilla R, Rankin D. Synthetic Tabular Data Evaluation in the Health Domain Covering Resemblance, Utility, and Privacy Dimensions. Methods Inf Med 2023. [PMID: 36623830 DOI: 10.1055/s-0042-1760247] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/11/2023]
Abstract
BACKGROUND Synthetic tabular data generation is a potentially valuable technology with great promise for data augmentation and privacy preservation. However, prior to adoption, an empirical assessment of generated synthetic tabular data is required across dimensions relevant to the target application to determine its efficacy. A lack of standardized and objective evaluation and benchmarking strategy for synthetic tabular data in the health domain has been found in the literature. OBJECTIVE The aim of this paper is to identify key dimensions, per dimension metrics, and methods for evaluating synthetic tabular data generated with different techniques and configurations for health domain application development and to provide a strategy to orchestrate them. METHODS Based on the literature, the resemblance, utility, and privacy dimensions have been prioritized, and a collection of metrics and methods for their evaluation are orchestrated into a complete evaluation pipeline. This way, a guided and comparative assessment of generated synthetic tabular data can be done, categorizing its quality into three categories ("Excellent," "Good," and "Poor"). Six health care-related datasets and four synthetic tabular data generation approaches have been chosen to conduct an analysis and evaluation to verify the utility of the proposed evaluation pipeline. RESULTS The synthetic tabular data generated with the four selected approaches has maintained resemblance, utility, and privacy for most datasets and synthetic tabular data generation approach combination. In several datasets, some approaches have outperformed others, while in other datasets, more than one approach has yielded the same performance. CONCLUSION The results have shown that the proposed pipeline can effectively be used to evaluate and benchmark the synthetic tabular data generated by various synthetic tabular data generation approaches. Therefore, this pipeline can support the scientific community in selecting the most suitable synthetic tabular data generation approaches for their data and application of interest.
Collapse
Affiliation(s)
- Mikel Hernadez
- Digital Health and Biomedical Technologies Department, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastian, Spain
| | - Gorka Epelde
- Digital Health and Biomedical Technologies Department, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastian, Spain.,eHealth Group, Biodonostia Health Research Institute, Donostia-San Sebastian, Spain
| | - Ane Alberdi
- Biomedical Engineering Department, Mondragon Unibertsitatea, Arrasate-Mondragón, Spain
| | - Rodrigo Cilla
- Digital Health and Biomedical Technologies Department, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastian, Spain
| | - Debbie Rankin
- School of Computing, Engineering and Intelligent Systems, Ulster University, Derry-Londonderry, United Kingdom
| |
Collapse
|
23
|
McDonnell K, Howley E, Abram F. Critical evaluation of the use of artificial data for machine learning based de novo peptide identification. Comput Struct Biotechnol J 2023; 21:2732-2743. [PMID: 37168871 PMCID: PMC10165132 DOI: 10.1016/j.csbj.2023.04.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2023] [Revised: 04/16/2023] [Accepted: 04/16/2023] [Indexed: 05/13/2023] Open
Abstract
Proteins are essential components of all living cells and so the study of their in situ expression, proteomics, has wide reaching applications. Peptide identification in proteomics typically relies on matching high resolution tandem mass spectra to a protein database but can also be performed de novo. While artificial spectra have been successfully incorporated into database search pipelines to increase peptide identification rates, little work has been done to investigate the utility of artificial spectra in the context of de novo peptide identification. Here, we perform a critical analysis of the use of artificial data for the training and evaluation of de novo peptide identification algorithms. First, we classify the different fragment ion types present in real spectra and then estimate the number of spurious matches using random peptides. We then categorise the different types of noise present in real spectra. Finally, we transfer this knowledge to artificial data and test the performance of a state-of-the-art de novo peptide identification algorithm trained using artificial spectra with and without relevant noise addition. Noise supplementation increased artificial training data performance from 30% to 77% of real training data peptide recall. While real data performance was not fully replicated, this work provides the first steps towards an artificial spectrum framework for the training and evaluation of de novo peptide identification algorithms. Further enhanced artificial spectra may allow for more in depth analysis of de novo algorithms as well as alleviating the reliance on database searches for training data.
Collapse
Affiliation(s)
- Kevin McDonnell
- Functional Environmental Microbiology, School of Natural Sciences, Ryan Institute, University of Galway, Ireland
- School of Computer Science, University of Galway, Ireland
- Corresponding author at: Functional Environmental Microbiology, School of Natural Sciences, Ryan Institute, University of Galway, Ireland.
| | - Enda Howley
- School of Computer Science, University of Galway, Ireland
| | - Florence Abram
- Functional Environmental Microbiology, School of Natural Sciences, Ryan Institute, University of Galway, Ireland
- Corresponding author.
| |
Collapse
|
24
|
Yan C, Yan Y, Wan Z, Zhang Z, Omberg L, Guinney J, Mooney SD, Malin BA. A Multifaceted benchmarking of synthetic electronic health record generation models. Nat Commun 2022; 13:7609. [PMID: 36494374 PMCID: PMC9734113 DOI: 10.1038/s41467-022-35295-1] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2022] [Accepted: 11/28/2022] [Indexed: 12/13/2022] Open
Abstract
Synthetic health data have the potential to mitigate privacy concerns in supporting biomedical research and healthcare applications. Modern approaches for data generation continue to evolve and demonstrate remarkable potential. Yet there is a lack of a systematic assessment framework to benchmark methods as they emerge and determine which methods are most appropriate for which use cases. In this work, we introduce a systematic benchmarking framework to appraise key characteristics with respect to utility and privacy metrics. We apply the framework to evaluate synthetic data generation methods for electronic health records data from two large academic medical centers with respect to several use cases. The results illustrate that there is a utility-privacy tradeoff for sharing synthetic health data and further indicate that no method is unequivocally the best on all criteria in each use case, which makes it evident why synthetic data generation methods need to be assessed in context.
Collapse
Affiliation(s)
- Chao Yan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Yao Yan
- Sage Bionetworks, Seattle, WA, USA
| | - Zhiyu Wan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Ziqi Zhang
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA
| | | | - Justin Guinney
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
- Tempus Labs, Chicago, IL, USA
| | - Sean D Mooney
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA.
| | - Bradley A Malin
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA.
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA.
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA.
| |
Collapse
|
25
|
Venugopal R, Shafqat N, Venugopal I, Tillbury BMJ, Stafford HD, Bourazeri A. Privacy preserving Generative Adversarial Networks to model Electronic Health Records. Neural Netw 2022; 153:339-348. [DOI: 10.1016/j.neunet.2022.06.022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2021] [Revised: 05/13/2022] [Accepted: 06/16/2022] [Indexed: 11/25/2022]
|
26
|
Chen HY, Huang SH. Generating a trading strategy in the financial market from sensitive expert data based on the privacy-preserving generative adversarial imitation network. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.05.039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
|
27
|
Hernandez M, Epelde G, Alberdi A, Cilla R, Rankin D. Synthetic data generation for tabular health records: A systematic review. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.04.053] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]
|
28
|
Bhanot K, Pedersen J, Guyon I, Bennett KP. Investigating synthetic medical time-series resemblance. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.04.097] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
29
|
Incorporation of Synthetic Data Generation Techniques within a Controlled Data Processing Workflow in the Health and Wellbeing Domain. ELECTRONICS 2022. [DOI: 10.3390/electronics11050812] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/27/2023]
Abstract
To date, the use of synthetic data generation techniques in the health and wellbeing domain has been mainly limited to research activities. Although several open source and commercial packages have been released, they have been oriented to generating synthetic data as a standalone data preparation process and not integrated into a broader analysis or experiment testing workflow. In this context, the VITALISE project is working to harmonize Living Lab research and data capture protocols and to provide controlled processing access to captured data to industrial and scientific communities. In this paper, we present the initial design and implementation of our synthetic data generation approach in the context of VITALISE Living Lab controlled data processing workflow, together with identified challenges and future developments. By uploading data captured from Living Labs, generating synthetic data from them, developing analysis locally with synthetic data, and then executing them remotely with real data, the utility of the proposed workflow has been validated. Results have shown that the presented workflow helps accelerate research on artificial intelligence, ensuring compliance with data protection laws. The presented approach has demonstrated how the adoption of state-of-the-art synthetic data generation techniques can be applied for real-world applications.
Collapse
|
30
|
Zhou N, Wang L, Marino S, Zhao Y, Dinov ID. DataSifter II: Partially synthetic data sharing of sensitive information containing time-varying correlated observations. JOURNAL OF ALGORITHMS & COMPUTATIONAL TECHNOLOGY 2022; 16:10.1177/17483026211065379. [PMID: 36274750 PMCID: PMC9585991 DOI: 10.1177/17483026211065379] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 06/16/2023]
Abstract
There is a significant public demand for rapid data-driven scientific investigations using aggregated sensitive information. However, many technical challenges and regulatory policies hinder efficient data sharing. In this study, we describe a partially synthetic data generation technique for creating anonymized data archives whose joint distributions closely resemble those of the original (sensitive) data. Specifically, we introduce the DataSifter technique for time-varying correlated data (DataSifter II), which relies on an iterative model-based imputation using generalized linear mixed model and random effects-expectation maximization tree. DataSifter II can be used to generate synthetic repeated measures data for testing and validating new analytical techniques. Compared to the multiple imputation method, DataSifter II application on simulated and real clinical data demonstrates that the new method provides extensive reduction of re-identification risk (data privacy) while preserving the analytical value (data utility) in the obfuscated data. The performance of the DataSifter II on a simulation involving 20% artificially missingness in the data, shows at least 80% reduction of the disclosure risk, compared to the multiple imputation method, without a substantial impact on the data analytical value. In a separate clinical data (Medical Information Mart for Intensive Care III) validation, a model-based statistical inference drawn from the original data agrees with an analogous analytical inference obtained using the DataSifter II obfuscated (sifted) data. For large time-varying datasets containing sensitive information, the proposed technique provides an automated tool for alleviating the barriers of data sharing and facilitating effective, advanced, and collaborative analytics.
Collapse
Affiliation(s)
- Nina Zhou
- Statistics Online Computational Resource, University of Michigan, USA
- Department of Biostatistics, University of Michigan, USA
| | - Lu Wang
- Department of Biostatistics, University of Michigan, USA
| | - Simeone Marino
- Statistics Online Computational Resource, University of Michigan, USA
| | - Yi Zhao
- Department of Biostatistics, University of Michigan, USA
| | - Ivo D Dinov
- Statistics Online Computational Resource, University of Michigan, USA
- Michigan Institute for Data Science, University of Michigan, USA
| |
Collapse
|
31
|
Barrat-Charlaix P, Muntoni AP, Shimagaki K, Weigt M, Zamponi F. Sparse generative modeling via parameter reduction of Boltzmann machines: Application to protein-sequence families. Phys Rev E 2021; 104:024407. [PMID: 34525554 DOI: 10.1103/physreve.104.024407] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2021] [Accepted: 07/19/2021] [Indexed: 11/07/2022]
Abstract
Boltzmann machines (BMs) are widely used as generative models. For example, pairwise Potts models (PMs), which are instances of the BM class, provide accurate statistical models of families of evolutionarily related protein sequences. Their parameters are the local fields, which describe site-specific patterns of amino acid conservation, and the two-site couplings, which mirror the coevolution between pairs of sites. This coevolution reflects structural and functional constraints acting on protein sequences during evolution. The most conservative choice to describe the coevolution signal is to include all possible two-site couplings into the PM. This choice, typical of what is known as Direct Coupling Analysis, has been successful for predicting residue contacts in the three-dimensional structure, mutational effects, and generating new functional sequences. However, the resulting PM suffers from important overfitting effects: many couplings are small, noisy, and hardly interpretable; the PM is close to a critical point, meaning that it is highly sensitive to small parameter perturbations. In this work, we introduce a general parameter-reduction procedure for BMs, via a controlled iterative decimation of the less statistically significant couplings, identified by an information-based criterion that selects either weak or statistically unsupported couplings. For several protein families, our procedure allows one to remove more than 90% of the PM couplings, while preserving the predictive and generative properties of the original dense PM, and the resulting model is far away from criticality, hence more robust to noise.
Collapse
Affiliation(s)
- Pierre Barrat-Charlaix
- Biozentrum, Universität Basel, Switzerland, Swiss Institute of Bioinformatics, Basel 4056, Switzerland
| | - Anna Paola Muntoni
- Department of Applied Science and Technology (DISAT), Politecnico di Torino, Corso Duca degli Abruzzi 24, Torino 10129, Italy.,Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060 Candiolo (TO), Italy.,Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative-LCQB, F-75005 Paris, France.,Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005 Paris, France
| | - Kai Shimagaki
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative-LCQB, F-75005 Paris, France
| | - Martin Weigt
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative-LCQB, F-75005 Paris, France
| | - Francesco Zamponi
- Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005 Paris, France
| |
Collapse
|
32
|
Bhanot K, Qi M, Erickson JS, Guyon I, Bennett KP. The Problem of Fairness in Synthetic Healthcare Data. ENTROPY (BASEL, SWITZERLAND) 2021; 23:1165. [PMID: 34573790 PMCID: PMC8468495 DOI: 10.3390/e23091165] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/08/2021] [Revised: 08/25/2021] [Accepted: 08/30/2021] [Indexed: 11/16/2022]
Abstract
Access to healthcare data such as electronic health records (EHR) is often restricted by laws established to protect patient privacy. These restrictions hinder the reproducibility of existing results based on private healthcare data and also limit new research. Synthetically-generated healthcare data solve this problem by preserving privacy and enabling researchers and policymakers to drive decisions and methods based on realistic data. Healthcare data can include information about multiple in- and out- patient visits of patients, making it a time-series dataset which is often influenced by protected attributes like age, gender, race etc. The COVID-19 pandemic has exacerbated health inequities, with certain subgroups experiencing poorer outcomes and less access to healthcare. To combat these inequities, synthetic data must "fairly" represent diverse minority subgroups such that the conclusions drawn on synthetic data are correct and the results can be generalized to real data. In this article, we develop two fairness metrics for synthetic data, and analyze all subgroups defined by protected attributes to analyze the bias in three published synthetic research datasets. These covariate-level disparity metrics revealed that synthetic data may not be representative at the univariate and multivariate subgroup-levels and thus, fairness should be addressed when developing data generation methods. We discuss the need for measuring fairness in synthetic healthcare data to enable the development of robust machine learning models to create more equitable synthetic healthcare datasets.
Collapse
Affiliation(s)
- Karan Bhanot
- Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA; (M.Q.); (K.P.B.)
- OptumLabs, Eden Prairie, MN 55344, USA
| | - Miao Qi
- Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA; (M.Q.); (K.P.B.)
| | - John S. Erickson
- Rensselaer Institute for Data Exploration and Applications, Troy, NY 12180, USA;
| | - Isabelle Guyon
- LISN, CNRS/INRIA, Université Paris-Saclay, 91190 Gif-sur-Yvette, France;
- ChaLearn, San Francisco, CA 94115, USA
| | - Kristin P. Bennett
- Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA; (M.Q.); (K.P.B.)
- Rensselaer Institute for Data Exploration and Applications, Troy, NY 12180, USA;
- Department of Mathematics, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
| |
Collapse
|
33
|
Chen RJ, Lu MY, Chen TY, Williamson DFK, Mahmood F. Synthetic data in machine learning for medicine and healthcare. Nat Biomed Eng 2021; 5:493-497. [PMID: 34131324 PMCID: PMC9353344 DOI: 10.1038/s41551-021-00751-8] [Citation(s) in RCA: 221] [Impact Index Per Article: 55.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/30/2023]
Abstract
The proliferation of synthetic data in artificial intelligence for medicine and healthcare raises concerns about the vulnerabilities of the software and the challenges of current policy.
Collapse
Affiliation(s)
- Richard J Chen
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Cancer Program, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Ming Y Lu
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Cancer Program, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Tiffany Y Chen
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Cancer Program, Broad Institute of Harvard and MIT, Cambridge, MA, USA
| | - Drew F K Williamson
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Cancer Program, Broad Institute of Harvard and MIT, Cambridge, MA, USA
| | - Faisal Mahmood
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
- Cancer Program, Broad Institute of Harvard and MIT, Cambridge, MA, USA.
- Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA.
| |
Collapse
|