1. An evaluation of synthetic data augmentation for mitigating covariate bias in health data. Patterns (N Y) 2024; 5:100946. PMID: 38645766; PMCID: PMC11026977; DOI: 10.1016/j.patter.2024.100946.
Abstract
Data bias is a major concern in biomedical research, especially when evaluating large-scale observational datasets. It leads to imprecise predictions and inconsistent estimates in standard regression models. We compare the performance of commonly used bias-mitigating approaches (resampling, algorithmic, and post hoc approaches) against synthetic minority augmentation (SMA), a synthetic data-augmentation method that uses sequential boosted decision trees to synthesize under-represented groups. Through simulations and analysis of real health datasets on a logistic regression workload, the approaches are evaluated across bias scenarios of varying type and severity. Performance was assessed by area under the curve, calibration (Brier score), precision of parameter estimates, confidence interval overlap, and fairness. Overall, SMA produces results closest to the ground truth under low to medium bias (a missing proportion of 50% or less). Under high bias (a missing proportion of 80% or more), the advantage of SMA is less clear, and no single method consistently outperforms the others.
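The two headline evaluation criteria above, discrimination (AUC) and calibration (Brier score), can be reproduced with standard tooling. A minimal sketch, with made-up probabilities standing in for a fitted logistic regression (scikit-learn is assumed; this is not the paper's pipeline):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                 # binary outcome
# Noisy but informative probabilities, a stand-in for model output.
y_prob = np.clip(0.4 * y_true + 0.6 * rng.random(200), 0.0, 1.0)

auc = roc_auc_score(y_true, y_prob)                   # higher is better
brier = brier_score_loss(y_true, y_prob)              # lower is better
print(round(auc, 2), round(brier, 2))
```

Both metrics are computed per synthetic replicate and compared against the same metrics on the ground-truth data.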
2. CHeart: A Conditional Spatio-Temporal Generative Model for Cardiac Anatomy. IEEE Trans Med Imaging 2024; 43:1259-1269. PMID: 37948142; PMCID: PMC7615911; DOI: 10.1109/tmi.2023.3331982.
Abstract
Two key questions in cardiac image analysis are how to assess the anatomy and motion of the heart from images, and how these are associated with non-imaging clinical factors such as gender, age, and disease. While the first question can often be addressed by image segmentation and motion tracking algorithms, our capability to model and answer the second remains limited. In this work, we propose a novel conditional generative model that describes the 4D spatio-temporal anatomy of the heart and its interaction with non-imaging clinical factors. The clinical factors are integrated as conditions of the generative model, which allows us to investigate how these factors influence cardiac anatomy. We evaluate the model on two main tasks: anatomical sequence completion and sequence generation. The model achieves high performance on anatomical sequence completion, comparable to or outperforming other state-of-the-art generative models. For sequence generation, given clinical conditions, the model can generate realistic synthetic 4D sequential anatomies whose distributions closely match those of the real data. The code and the trained generative model are available at https://github.com/MengyunQ/CHeart.
3. PVS-GEN: Systematic Approach for Universal Synthetic Data Generation Involving Parameterization, Verification, and Segmentation. Sensors (Basel) 2024; 24:266. PMID: 38203126; PMCID: PMC10781314; DOI: 10.3390/s24010266.
Abstract
Synthetic data generation addresses the challenges of obtaining extensive empirical datasets, offering benefits such as cost-effectiveness, time efficiency, and robust model development. Nonetheless, synthetic data-generation methodologies still encounter significant difficulties, including the lack of standardized metrics for modeling different data types and comparing generated results. This study introduces PVS-GEN, an automated, general-purpose process for synthetic data generation and verification. The PVS-GEN method parameterizes time-series data with minimal human intervention and verifies model construction using a specific metric derived from the extracted parameters. For complex data, the process iteratively segments the empirical dataset until the extracted parameters can reproduce synthetic data that reflects the empirical characteristics, irrespective of the sensor data type. Moreover, we introduce the PoR metric to quantify the quality of the generated data by evaluating its time-series characteristics. Consequently, the proposed method can automatically generate diverse time-series data covering a wide range of sensor types. We compared PVS-GEN with existing synthetic data-generation methodologies: PVS-GEN demonstrated superior performance, improving similarity by up to 37.1% across multiple data types and by 19.6% on average under the proposed metric, irrespective of the data type.
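The parameterize-then-regenerate idea can be illustrated on a toy autoregressive signal. This is only an analogy for PVS-GEN's pipeline; the actual parameterization, segmentation, and PoR metric are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(1)

# "Empirical" series: an AR(2) process standing in for sensor data.
n = 500
x = np.zeros(n)
for t in range(2, n):
    x[t] = 1.2 * x[t - 1] - 0.5 * x[t - 2] + rng.normal(scale=0.1)

# Parameterization: estimate the AR(2) coefficients by least squares.
X = np.column_stack([x[1:-1], x[:-2]])        # [x[t-1], x[t-2]]
a, b = np.linalg.lstsq(X, x[2:], rcond=None)[0]

# Generation: replay the fitted model with fresh noise to get a
# synthetic series sharing the empirical dynamics.
y = np.zeros(n)
for t in range(2, n):
    y[t] = a * y[t - 1] + b * y[t - 2] + rng.normal(scale=0.1)

print(round(a, 2), round(b, 2))               # close to 1.2 and -0.5
```

Verification would then compare statistics of `y` against `x`, the role the PoR metric plays in the paper.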
4. A Synthetic Time-Series Generation Using a Variational Recurrent Autoencoder with an Attention Mechanism in an Industrial Control System. Sensors (Basel) 2023; 24:128. PMID: 38202989; PMCID: PMC10781275; DOI: 10.3390/s24010128.
Abstract
Data scarcity is a significant obstacle for the modern data science and artificial intelligence research communities. That abundant data are a key element of a powerful prediction model is well established by numerous past studies. However, industrial control systems (ICS) are operated in closed environments due to security and privacy concerns, so collected data are generally not disclosed. In this environment, synthetic data generation can be a good alternative. However, ICS datasets have time-series characteristics and include features with short- and long-term temporal dependencies. In this paper, we propose the attention-based variational recurrent autoencoder (AVRAE) for generating time-series ICS data. We first extend the evidence lower bound of variational inference to time-series data. Then, a recurrent neural-network-based autoencoder is designed to take this as its objective. AVRAE employs the attention mechanism to effectively learn the long- and short-term temporal dependencies that ICS data exhibit. Finally, we present an algorithm for generating synthetic ICS time-series data using the trained AVRAE. In a comprehensive evaluation using the ICS dataset HAI and various performance indicators, AVRAE successfully generated visually and statistically plausible synthetic ICS data.
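The sequential evidence lower bound referred to above commonly takes the following form in variational recurrent models (one standard factorization; the paper's exact objective may differ):

```latex
\log p_\theta(x_{1:T}) \;\ge\;
\mathbb{E}_{q_\phi(z_{1:T}\mid x_{1:T})}\!\left[
\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t},\, z_{\le t}\right)
\;-\;
\sum_{t=1}^{T} \mathrm{KL}\!\left(
q_\phi\!\left(z_t \mid x_{\le t},\, z_{<t}\right)
\,\middle\|\,
p_\theta\!\left(z_t \mid x_{<t},\, z_{<t}\right)
\right)\right]
```

Each timestep contributes a reconstruction term and a KL term against a learned prior conditioned on the history, which is what makes the bound a per-step extension of the static VAE ELBO.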
5. Synthetic Tabular Data Based on Generative Adversarial Networks in Health Care: Generation and Validation Using the Divide-and-Conquer Strategy. JMIR Med Inform 2023; 11:e47859. PMID: 37999942; DOI: 10.2196/47859.
Abstract
BACKGROUND Synthetic data generation (SDG) based on generative adversarial networks (GANs) is used in health care, but preserving logical relationships in synthetic tabular data (STD) remains challenging, and filtering methods for SDG can lead to the loss of important information. OBJECTIVE This study proposed a divide-and-conquer (DC) method to generate STD based on the GAN algorithm while preserving data with logical relationships. METHODS The proposed method was evaluated on data from the Korea Association for Lung Cancer Registry (KALC-R) and 2 benchmark data sets (breast cancer and diabetes). The DC-based SDG strategy comprises 3 steps: (1) we used 2 different partitioning methods (the class-specific criterion distinguished between survival and death groups, while the Cramer V criterion identified the highest correlation between columns in the original data); (2) the entire data set was divided into a number of subsets, which were then used as input to the conditional tabular GAN and the copula GAN to generate synthetic data; and (3) the generated synthetic data were consolidated into a single entity. For validation, we compared DC-based SDG and conditional sampling (CS)-based SDG through the performance of machine learning models. In addition, we generated imbalanced and balanced synthetic data for each of the 3 data sets and compared their performance using 4 classifiers: decision tree (DT), random forest (RF), Extreme Gradient Boosting (XGBoost), and light gradient-boosting machine (LGBM). RESULTS The synthetic data for the 3 diseases (non-small cell lung cancer [NSCLC], breast cancer, and diabetes) generated by the proposed model performed well across all 4 classifiers (DT, RF, XGBoost, and LGBM).
The CS- versus DC-based model performances were compared using the mean area under the curve (SD) values: 74.87 (SD 0.77) versus 63.87 (SD 2.02) for NSCLC, 73.31 (SD 1.11) versus 67.96 (SD 2.15) for breast cancer, and 61.57 (SD 0.09) versus 60.08 (SD 0.17) for diabetes (DT); 85.61 (SD 0.29) versus 79.01 (SD 1.20) for NSCLC, 78.05 (SD 1.59) versus 73.48 (SD 4.73) for breast cancer, and 59.98 (SD 0.24) versus 58.55 (SD 0.17) for diabetes (RF); 85.20 (SD 0.82) versus 76.42 (SD 0.93) for NSCLC, 77.86 (SD 2.27) versus 68.32 (SD 2.37) for breast cancer, and 60.18 (SD 0.20) versus 58.98 (SD 0.29) for diabetes (XGBoost); and 85.14 (SD 0.77) versus 77.62 (SD 1.85) for NSCLC, 78.16 (SD 1.52) versus 70.02 (SD 2.17) for breast cancer, and 61.75 (SD 0.13) versus 61.12 (SD 0.23) for diabetes (LGBM). In addition, we found that balanced synthetic data performed better. CONCLUSIONS This study is the first attempt to generate and validate STD based on a DC approach and shows improved performance using STD. The necessity for balanced SDG was also demonstrated.
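The Cramér V partitioning criterion in step (1) relies on a standard association measure between two categorical columns. A minimal sketch of the bias-uncorrected form, with illustrative data:

```python
import numpy as np

def cramers_v(x, y):
    """Cramér's V between two categorical columns (bias-uncorrected)."""
    cats_x, xi = np.unique(x, return_inverse=True)
    cats_y, yi = np.unique(y, return_inverse=True)
    table = np.zeros((len(cats_x), len(cats_y)))
    np.add.at(table, (xi, yi), 1)                 # contingency table
    n = table.sum()
    expected = np.outer(table.sum(1), table.sum(0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))

# Perfectly associated columns give V = 1.
a = np.array(["s", "s", "d", "d", "s", "d"])
b = np.array([0, 0, 1, 1, 0, 1])
print(cramers_v(a, b))                            # -> 1.0
```

Column pairs with the highest V would be kept together in the same partition so the GAN can learn their joint structure.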
6. Application of Gaussian Mixtures in a Multimodal Kalman Filter to Estimate the State of a Nonlinearly Moving System Using Sparse Inaccurate Measurements in a Cellular Radio Network. Sensors (Basel) 2023; 23:3603. PMID: 37050661; PMCID: PMC10098955; DOI: 10.3390/s23073603.
Abstract
The Kalman filter is a well-established accuracy-correction method in control, guidance, and navigation. With the spread of mobile communication and ICT, it has found many new applications in positioning based on spatiotemporal data from cellular networks. Despite its low accuracy compared with the Global Positioning System, the method is an excellent supplement to other positioning technologies and is often used as a complementary source in sensor fusion setups. One reason for the Kalman filter's inaccuracy lies in the naive radio-coverage approximations based on multivariate normal distributions assumed by previous studies. In this paper, we therefore evaluate those disadvantages and propose a Gaussian mixture model to capture the irregular shape of the radio cells' coverage areas. By incorporating the Gaussian mixture model into a switching Kalman filter, we achieve better positioning accuracy within the cellular network.
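For context, the building block that the paper extends with a Gaussian-mixture measurement model and switching is the standard linear predict/update cycle. A single cycle, with all matrices illustrative:

```python
import numpy as np

F = np.array([[1.0, 1.0], [0.0, 1.0]])   # constant-velocity transition
H = np.array([[1.0, 0.0]])               # observe position only
Q = 0.01 * np.eye(2)                     # process noise
R = np.array([[4.0]])                    # (inaccurate) measurement noise

x = np.array([0.0, 1.0])                 # state: position, velocity
P = np.eye(2)                            # state covariance

# Predict
x = F @ x
P = F @ P @ F.T + Q

# Update with a noisy position measurement z
z = np.array([1.5])
S = H @ P @ H.T + R                      # innovation covariance
K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
x = x + K @ (z - H @ x)
P = (np.eye(2) - K @ H) @ P
print(np.round(x, 3))
```

In the mixture extension, the single Gaussian measurement model `(H, R)` is replaced by a weighted sum of Gaussians approximating the cell coverage area, and a bank of such updates is combined.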
7. High-efficient Bloch simulation of magnetic resonance imaging sequences based on deep learning. Phys Med Biol 2023; 68. PMID: 36921351; DOI: 10.1088/1361-6560/acc4a6.
Abstract
OBJECTIVE Bloch simulation constitutes an essential part of magnetic resonance imaging (MRI) development. However, even with graphics processing unit (GPU) acceleration, the heavy computational load remains a major challenge, especially in large-scale, high-accuracy simulation scenarios. This work aims to develop a deep learning-based simulator to accelerate Bloch simulation. APPROACH The simulator model, called Simu-Net, is based on an end-to-end convolutional neural network and is trained with synthetic data generated by traditional Bloch simulation. It uses dynamic convolution to fuse spatial and physical information of different dimensions and introduces position-encoding templates to achieve position-specific labeling and overcome the receptive-field limitation of the convolutional network. MAIN RESULTS Compared with mainstream GPU-based MRI simulation software, Simu-Net accelerates simulations by hundreds of times in both traditional and advanced MRI pulse sequences. The accuracy and robustness of the proposed framework were verified qualitatively and quantitatively. In addition, the trained Simu-Net was applied to generate sufficient customized training samples for deep learning-based T2 mapping, and results comparable to conventional methods were obtained in the human brain. SIGNIFICANCE As a proof-of-concept work, Simu-Net shows the potential of deep learning for rapidly approximating the forward physical process of MRI and may increase the efficiency of Bloch simulation for the optimization of MRI pulse sequences and deep learning-based methods.
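The "traditional Bloch simulation" used to train Simu-Net reduces to repeated rotation-plus-relaxation steps on the magnetization vector. A minimal free-precession sketch with illustrative T1/T2/off-resonance values (not the Simu-Net pipeline):

```python
import numpy as np

def free_precess(M, dt, T1, T2, df):
    """Rotate M about z by 2*pi*df*dt, then apply T1/T2 relaxation."""
    phi = 2 * np.pi * df * dt
    Rz = np.array([[np.cos(phi), -np.sin(phi), 0.0],
                   [np.sin(phi),  np.cos(phi), 0.0],
                   [0.0, 0.0, 1.0]])
    E1, E2 = np.exp(-dt / T1), np.exp(-dt / T2)
    E = np.diag([E2, E2, E1])
    B = np.array([0.0, 0.0, 1.0 - E1])   # recovery toward M0 = [0, 0, 1]
    return E @ (Rz @ M) + B

M = np.array([1.0, 0.0, 0.0])            # magnetization tipped into x-y
for _ in range(100):                     # 100 ms of free precession
    M = free_precess(M, dt=0.001, T1=1.0, T2=0.1, df=10.0)
print(np.round(M, 3))
```

Repeating such steps per voxel across a pulse sequence is what makes large-scale simulation expensive, and what Simu-Net learns to approximate in one forward pass.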
8. Validating a membership disclosure metric for synthetic health data. JAMIA Open 2022; 5:ooac083. PMID: 36238080; PMCID: PMC9553223; DOI: 10.1093/jamiaopen/ooac083.
Abstract
BACKGROUND One increasingly accepted method to evaluate the privacy of synthetic data is to measure the risk of membership disclosure. This is the F1 score of an adversary attempting to ascertain whether a target individual from the same population as the real data was in the dataset used to train the generative model, and it is commonly estimated using a data-partitioning methodology with a 0.5 partitioning parameter. OBJECTIVE Validate the membership disclosure F1 score, evaluate and improve the parametrization of the partitioning method, and provide a benchmark for its interpretation. MATERIALS AND METHODS We performed a simulated membership disclosure attack on 4 population datasets: an Ontario COVID-19 dataset, a state hospital discharge dataset, a national health survey, and an international COVID-19 behavioral survey. Two generative methods were evaluated: sequential synthesis and a generative adversarial network. A theoretical analysis and a simulation were used to determine the partitioning parameter that would give the same F1 score as a ground-truth simulated membership disclosure attack. RESULTS The default 0.5 parameter can give quite inaccurate membership disclosure values. The proportion of records from the training dataset in the attack dataset must equal the sampling fraction of the real dataset from the population. The approach is demonstrated on 7 clinical trial datasets. CONCLUSIONS Our proposed parameterization, together with interpretation and generative-model training guidance, provides a theoretically and empirically grounded basis for evaluating and managing membership disclosure risk for synthetic data.
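The key parameterization point, that the attack set should contain training records in proportion to the population sampling fraction rather than the conventional 0.5 split, can be illustrated with a toy simulated attack. All data, the generator stand-in, and the distance threshold below are made up:

```python
import numpy as np

rng = np.random.default_rng(42)

# Population of 1000 records; 200 (a 0.2 sampling fraction) train the
# generator. Per the paper, the attack set should mirror that fraction.
population = rng.normal(size=(1000, 5))
train_idx = rng.choice(1000, size=200, replace=False)
in_train = np.zeros(1000, dtype=bool)
in_train[train_idx] = True
attack_idx = rng.choice(1000, size=100, replace=False)

# Stand-in "synthetic data": noisy copies of the training records
# (an intentionally leaky generator, so the attack has signal).
synthetic = population[train_idx] + rng.normal(scale=0.1, size=(200, 5))

# Adversary claims membership when a close synthetic neighbour exists.
d = np.linalg.norm(population[attack_idx, None, :] - synthetic[None], axis=2)
claim = d.min(axis=1) < 0.5

truth = in_train[attack_idx]
tp = np.sum(claim & truth)
precision = tp / max(claim.sum(), 1)
recall = tp / max(truth.sum(), 1)
f1 = 2 * precision * recall / max(precision + recall, 1e-12)
print(round(f1, 2))
```

Because the generator leaks, the attack F1 is high here; a private generator should drive it toward the naive-guess baseline.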
9. k-SALSA: k-anonymous synthetic averaging of retinal images via local style alignment. Computer Vision - ECCV 2022 (European Conference on Computer Vision); 13681:661-678. PMID: 37525827; PMCID: PMC10388376; DOI: 10.1007/978-3-031-19803-8_39.
Abstract
The application of modern machine learning to retinal image analysis offers valuable insights into a broad range of human health conditions beyond ophthalmic diseases. Additionally, data sharing is key to fully realizing the potential of machine learning models by providing a rich and diverse collection of training data. However, the personally identifying nature of retinal images, encompassing the unique vascular structure of each individual, often prevents this data from being shared openly. While prior works have explored image de-identification strategies based on synthetic averaging of images in other domains (e.g., facial images), existing techniques face difficulty in preserving both privacy and clinical utility in retinal images, as we demonstrate in our work. We therefore introduce k-SALSA, a generative adversarial network (GAN)-based framework for synthesizing retinal fundus images that summarize a given private dataset while satisfying the privacy notion of k-anonymity. k-SALSA brings together state-of-the-art techniques for training and inverting GANs to achieve practical performance on retinal images. Furthermore, k-SALSA leverages a new technique, called local style alignment, to generate a synthetic average that maximizes the retention of fine-grain visual patterns in the source images, thus improving the clinical utility of the generated images. On two benchmark datasets of diabetic retinopathy (EyePACS and APTOS), we demonstrate our improvement upon existing methods with respect to image fidelity, classification performance, and mitigation of membership inference attacks. Our work represents a step toward broader sharing of retinal images for scientific collaboration. Code is available at https://github.com/hcholab/k-salsa.
10. Deep Convolutional Generative Adversarial Networks to Enhance Artificial Intelligence in Healthcare: A Skin Cancer Application. Sensors (Basel) 2022; 22:6145. PMID: 36015906; PMCID: PMC9416026; DOI: 10.3390/s22166145.
Abstract
In recent years, researchers have designed several artificial intelligence solutions for healthcare applications, which often evolve into functional solutions for clinical practice. Deep learning (DL) methods are well suited to process the broad amounts of data acquired by wearable devices, smartphones, and other sensors employed in different medical domains. Conceived to serve as a diagnostic tool and surgical guidance, hyperspectral imaging has emerged as a non-contact, non-ionizing, and label-free technology. However, the lack of large datasets on which to efficiently train models limits DL applications in the medical field, so its usage with hyperspectral images is still at an early stage. We propose a deep convolutional generative adversarial network to generate synthetic hyperspectral images of epidermal lesions, targeting skin cancer diagnosis and overcoming the challenge that small datasets pose for training DL architectures. Experimental results show the effectiveness of the proposed framework, which is capable of generating synthetic data to train DL classifiers.
11. Contribution of Synthetic Data Generation towards an Improved Patient Stratification in Palliative Care. J Pers Med 2022; 12:1278. PMID: 36013227; PMCID: PMC9409663; DOI: 10.3390/jpm12081278.
Abstract
AI model development for synthetic data generation to improve machine learning (ML) methodologies is an integral part of computer science research and is currently being transferred to related medical fields, such as systems medicine and medical informatics. The idea of personalized decision support based on patient data has motivated researchers in the medical domain for more than a decade, but the overall sparsity and scarcity of data remain major limitations. This stands in contrast to currently available technology, which allows us to generate and analyze patient data in diverse forms, such as tabular health records, medical images, genomics data, or even audio and video. One emerging solution to these data limitations for medical records is the synthetic generation of tabular data based on real-world data, which allows ML-assisted decision support to draw on more relevant patient data. At a methodological level, several state-of-the-art ML algorithms generate and derive decisions from such data. However, key issues remain that hinder broad practical implementation in real-life clinical settings. In this review, we give first insights into current perspectives on, and the potential impact of, using synthetic data generation in palliative care screening, a challenging prime example of highly individualized, sparsely available patient information. Taken together, the reader will obtain initial starting points and suitable solutions for generating and using synthetic data for ML-based screenings in palliative care and beyond.
12. Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study. JMIR Med Inform 2022; 10:e35734. PMID: 35389366; PMCID: PMC9030990; DOI: 10.2196/35734.
Abstract
Background A regular task for developers and users of synthetic data generation (SDG) methods is to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data, but they have not been validated in general or for comparing SDG methods. Objective This study evaluates the ability of common utility metrics to rank SDG methods according to performance on a specific analytic workload: the use of synthetic data for logistic regression prediction models, which is a very frequent workload in health research. Methods We evaluated 6 utility metrics on 30 different health data sets and 3 different SDG methods (a Bayesian network, a generative adversarial network, and sequential tree synthesis). These metrics were computed by averaging across 20 synthetic data sets from the same generative model and were then tested on their ability to rank the SDG methods by prediction performance, defined as the difference in the area under the receiver operating characteristic curve and the area under the precision-recall curve between logistic regression prediction models built on synthetic data and those built on real data. Results The utility metric best able to rank SDG methods was the multivariate Hellinger distance based on a Gaussian copula representation of the real and synthetic joint distributions. Conclusions This study has validated a generative model utility metric, the multivariate Hellinger distance, which can be used to reliably rank competing SDG methods on the same data set and to evaluate and compare alternate SDG methods.
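The multivariate Hellinger distance has a closed form for Gaussians, which is what makes the Gaussian-copula representation convenient. A sketch of the squared distance between two multivariate normals (the copula-fitting step that maps data to these parameters is omitted):

```python
import numpy as np

def hellinger_mvn(mu1, S1, mu2, S2):
    """Squared Hellinger distance between N(mu1, S1) and N(mu2, S2),
    using the standard closed form."""
    S = (S1 + S2) / 2
    num = np.linalg.det(S1) ** 0.25 * np.linalg.det(S2) ** 0.25
    den = np.sqrt(np.linalg.det(S))
    diff = mu1 - mu2
    expo = -0.125 * diff @ np.linalg.solve(S, diff)
    return 1 - (num / den) * np.exp(expo)

mu, S = np.zeros(2), np.eye(2)
print(hellinger_mvn(mu, S, mu, S))                  # identical -> 0.0
print(round(hellinger_mvn(mu, S, mu + 1.0, S), 3))  # shifted -> 0.221
```

Applied to the copula parameters of real versus synthetic data, a smaller distance indicates a generative model whose joint distribution better matches the real one.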
13. Functional assessment of bidirectional cortical and peripheral neural control on heartbeat dynamics: a brain-heart study on thermal stress. Neuroimage 2022; 251:119023. PMID: 35217203; DOI: 10.1016/j.neuroimage.2022.119023.
Abstract
The study of functional brain-heart interplay (BHI) from non-invasive recordings has gained much interest in recent years. Previous endeavors aimed to understand how the two dynamical systems exchange information, providing novel holistic biomarkers and important insights into essential cognitive aspects and neural system functioning. However, the interplay between cardiac sympathovagal and cortical oscillations still leaves much room for further investigation. In this study, we introduce a new computational framework for functional BHI assessment, the Sympatho-Vagal Synthetic Data Generation Model, combining cortical (electroencephalography, EEG) and peripheral (cardiac sympathovagal) neural dynamics. The causal, bidirectional neural control of heartbeat dynamics was quantified on data gathered from 26 human volunteers undergoing a cold-pressor test. Results show that thermal stress induces heart-to-brain functional interplay sustained by EEG oscillations in the delta and gamma bands, primarily originating from sympathetic activity, whereas brain-to-heart interplay originates over central brain regions through sympathovagal control. The proposed methodology provides a viable computational tool for the functional assessment of the causal interplay between cortical and cardiac neural control.
14. DataSifter II: Partially synthetic data sharing of sensitive information containing time-varying correlated observations. J Algorithms Comput Technol 2022; 16. PMID: 36274750; PMCID: PMC9585991; DOI: 10.1177/17483026211065379.
Abstract
There is significant public demand for rapid data-driven scientific investigations using aggregated sensitive information, yet many technical challenges and regulatory policies hinder efficient data sharing. In this study, we describe a partially synthetic data generation technique for creating anonymized data archives whose joint distributions closely resemble those of the original (sensitive) data. Specifically, we introduce the DataSifter technique for time-varying correlated data (DataSifter II), which relies on iterative model-based imputation using a generalized linear mixed model and a random effects-expectation maximization tree. DataSifter II can be used to generate synthetic repeated-measures data for testing and validating new analytical techniques. Compared to the multiple imputation method, application of DataSifter II to simulated and real clinical data demonstrates that the new method provides an extensive reduction of re-identification risk (data privacy) while preserving the analytical value (data utility) of the obfuscated data. On a simulation with 20% artificially induced missingness, DataSifter II shows at least an 80% reduction in disclosure risk relative to the multiple imputation method, without a substantial impact on the data's analytical value. In a separate validation on clinical data (Medical Information Mart for Intensive Care III), model-based statistical inference drawn from the original data agrees with the analogous inference obtained using the DataSifter II obfuscated (sifted) data. For large time-varying datasets containing sensitive information, the proposed technique provides an automated tool for lowering the barriers to data sharing and facilitating effective, advanced, and collaborative analytics.
15. Synthetic Generation of Passive Infrared Motion Sensor Data Using a Game Engine. Sensors (Basel) 2021; 21:8078. PMID: 34884081; PMCID: PMC8662402; DOI: 10.3390/s21238078.
Abstract
Quantifying the number of occupants in an indoor space is useful for a wide variety of applications. Attempts have been made at solving the task using passive infrared (PIR) motion sensor data together with supervised learning methods. Collecting a large labeled dataset containing both PIR motion sensor data and ground truth people count is however time-consuming, often requiring one hour of observation for each hour of data gathered. In this paper, a method is proposed for generating such data synthetically. A simulator is developed in the Unity game engine capable of producing synthetic PIR motion sensor data by detecting simulated occupants. The accuracy of the simulator is tested by replicating a real-world meeting room inside the simulator and conducting an experiment where a set of choreographed movements are performed in the simulated environment as well as the real room. In 34 out of 50 tested situations, the output from the simulated PIR sensors is comparable to the output from the real-world PIR sensors. The developed simulator is also used to study how a PIR sensor’s output changes depending on where in a room a motion is carried out. Through this, the relationship between sensor output and spatial position of a motion is discovered to be highly non-linear, which highlights some of the difficulties associated with mapping PIR data to occupancy count.
16. Object Positioning Algorithm Based on Multidimensional Scaling and Optimization for Synthetic Gesture Data Generation. Sensors (Basel) 2021; 21:5923. PMID: 34502814; PMCID: PMC8434389; DOI: 10.3390/s21175923.
Abstract
This work studies the feasibility of a novel two-step algorithm for infrastructure and object positioning using pairwise distances. The proposal is based on two optimization algorithms: Scaling by Majorizing a Complicated Function (SMACOF) and limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS). A qualitative evaluation of these algorithms is performed for 3D positioning. As the final stage, smoothing filters are applied to estimate the trajectory from the previously obtained positions. This approach can also be used as a synthetic gesture-data generator framework. The framework is hardware-independent and can be used to simulate the estimation of trajectories from noisy distances gathered with a wide range of sensors by modifying the noise properties of the initial distances. The framework is validated using a system of ultrasound transceivers. The results show this framework to be an efficient and simple positioning and filtering approach, accurately reconstructing the real path followed by the mobile object while maintaining low latency. Furthermore, these capabilities can be exploited for synthetic data generation, as demonstrated in this work, where synthetic ultrasound gesture data are generated.
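SMACOF-style positioning from pairwise distances is available off the shelf. A sketch using scikit-learn's MDS (an assumed stand-in for the authors' implementation, with illustrative noise), recovering 3D positions from a noisy distance matrix:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(3)
true_pos = rng.random((8, 3))                     # 8 points in 3D
D = np.linalg.norm(true_pos[:, None] - true_pos[None], axis=2)

# Corrupt the distances, mimicking noisy sensor measurements.
D_noisy = D + rng.normal(scale=0.01, size=D.shape)
D_noisy = (D_noisy + D_noisy.T) / 2               # keep symmetric
np.fill_diagonal(D_noisy, 0.0)

# Metric MDS minimizes stress via SMACOF.
mds = MDS(n_components=3, dissimilarity="precomputed", random_state=0)
est = mds.fit_transform(D_noisy)

# Positions are recovered only up to rotation/translation, so compare
# the reconstructed pairwise distances instead of raw coordinates.
D_est = np.linalg.norm(est[:, None] - est[None], axis=2)
print(round(np.abs(D_est - D).max(), 3))
```

A subsequent local refinement (the paper's L-BFGS step) and smoothing filter would then operate on such per-frame position estimates to produce a trajectory.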
Collapse
|
17
|
Generation of Synthetic Chest X-ray Images and Detection of COVID-19: A Deep Learning Based Approach. Diagnostics (Basel) 2021; 11:895. [PMID: 34069841 PMCID: PMC8157360 DOI: 10.3390/diagnostics11050895] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2021] [Revised: 05/14/2021] [Accepted: 05/16/2021] [Indexed: 12/13/2022] Open
Abstract
COVID-19 is a disease caused by the SARS-CoV-2 virus, which spreads when a person comes into contact with an affected individual, mainly through drops of saliva or nasal discharge. Most affected people have mild symptoms, while some develop acute respiratory distress syndrome (ARDS), which damages organs such as the lungs and heart. Chest X-rays (CXRs) have been widely used to identify abnormalities that help in detecting COVID-19, and they have also served as an initial screening procedure for individuals highly suspected of being infected. However, the availability of radiographic CXRs is still scarce, which can limit the performance of deep learning (DL) based approaches for COVID-19 detection. To overcome these limitations, in this work we developed an Auxiliary Classifier Generative Adversarial Network (ACGAN) to generate CXRs, where each generated X-ray belongs to one of two classes: COVID-19 positive or normal. To verify the quality of the synthetic images, we experimented on them with recent Convolutional Neural Networks (CNNs) to detect COVID-19 in the CXRs. We fine-tuned the models and achieved more than 98% accuracy. We then performed feature selection using the Harmony Search (HS) algorithm, which reduces the number of features while retaining classification accuracy. We further release a GAN-generated dataset consisting of 500 COVID-19 radiographic images.
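What distinguishes an ACGAN from a plain GAN is its discriminator objective: a source term (real vs. generated) plus an auxiliary class term (here, COVID-19 positive vs. normal), so the generator learns to produce class-conditional images. A minimal numpy sketch of that combined loss on one image, with illustrative names not taken from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def acgan_discriminator_loss(src_logit, cls_logits, is_real, cls_label):
    """ACGAN discriminator objective for a single image:
    binary cross-entropy on the real/fake source head plus
    cross-entropy on the auxiliary class head."""
    p_real = 1.0 / (1.0 + np.exp(-src_logit))       # sigmoid
    src_loss = -np.log(p_real if is_real else 1.0 - p_real)
    cls_loss = -np.log(softmax(cls_logits)[cls_label])
    return src_loss + cls_loss
```

In training, the generator minimizes the same class term while maximizing the discriminator's source error, which is what lets one sample "COVID-19 positive" or "normal" CXRs on demand.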
Collapse
|
18
|
A framework for automated and objective modification of tubular structures: Application to the internal carotid artery. INTERNATIONAL JOURNAL FOR NUMERICAL METHODS IN BIOMEDICAL ENGINEERING 2020; 36:e3330. [PMID: 32125768 DOI: 10.1002/cnm.3330] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/16/2019] [Revised: 02/26/2020] [Accepted: 02/27/2020] [Indexed: 06/10/2023]
Abstract
Patient-specific, medical image-based computational fluid dynamics has been widely used to reveal fundamental insight into mechanisms of cardiovascular disease, for instance by correlating morphology to adverse vascular remodeling. However, segmentation of medical images is laborious, error-prone, and a bottleneck in the development of the large databases needed to capture the natural variability in morphology. Idealized models, in which morphological features are parameterized, have instead been used to investigate the correlation with flow features, but at the cost of a limited understanding of the complexity of cardiovascular flows. To combine the advantages of both approaches, we developed a tool that preserves the patient-specific detail inherent in medical images while allowing for parametric alteration of the morphology. In our open-source framework morphMan, we convert the segmented surface to a Voronoi diagram, modify the diagram to change the morphological features of interest, and then convert it back to a new surface. In this paper, we present algorithms for modifying bifurcation angles, location of branches, cross-sectional area, vessel curvature, shape of bends, and surface roughness. We show qualitative and quantitative validation of the algorithms, which in general achieve an accuracy exceeding 97%, and a proof of concept combining the tool with computational fluid dynamics. By combining morphMan with appropriate clinical measurements, one could explore the morphological parameter space and the resulting hemodynamic response using only a handful of segmented surfaces, effectively minimizing the main bottleneck in image-based computational fluid dynamics.
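One of the manipulations listed, changing cross-sectional area, has a simple geometric core once the vessel is represented as a Voronoi diagram (points carrying maximal inscribed-sphere radii): scaling each radius by the square root of the desired area factor scales the local lumen area by roughly that factor. The sketch below illustrates only this idea; the function name and interface are hypothetical and are not morphMan's actual API.

```python
import numpy as np

def rescale_voronoi_radii(radii, area_factor):
    """Conceptual sketch of a morphMan-style area manipulation:
    cross-sectional area scales with radius squared, so scaling every
    inscribed-sphere radius by sqrt(area_factor) changes the local
    cross-sectional area by approximately area_factor."""
    return np.asarray(radii, dtype=float) * np.sqrt(area_factor)
```

In the full framework this rescaling would be applied to the Voronoi points of a selected vessel segment (often tapered along the centerline) before reconstructing the surface.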
Collapse
|