1
Li L, Hoefsloot H, Bakker BM, Horner D, Rasmussen MA, Smilde AK, Acar E. Longitudinal Metabolomics Data Analysis Informed by Mechanistic Models. Metabolites 2024; 15:2. [PMID: 39852345 PMCID: PMC11766892 DOI: 10.3390/metabo15010002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2024] [Revised: 12/06/2024] [Accepted: 12/20/2024] [Indexed: 01/26/2025] Open
Abstract
Background: Metabolomics measurements are noisy, often characterized by a small sample size and missing entries. While data-driven methods have shown promise in terms of analyzing metabolomics data, e.g., revealing biomarkers of various phenotypes, metabolomics data analysis can significantly benefit from incorporating prior information about metabolic mechanisms. This paper introduces a novel data analysis approach to incorporate mechanistic models in metabolomics data analysis. Methods: We arranged time-resolved metabolomics measurements of plasma samples collected during a meal challenge test from the COPSAC2000 cohort as a third-order tensor: subjects by metabolites by time samples. Simulated challenge test data generated using a human whole-body metabolic model were also arranged as a third-order tensor: virtual subjects by metabolites by time samples. Real and simulated data sets were coupled in the metabolites mode and jointly analyzed using coupled tensor factorizations to reveal the underlying patterns. Results: Our experiments demonstrated that the joint analysis of simulated and real data had better performance in terms of pattern discovery, achieving higher correlations with a BMI (body mass index)-related phenotype compared to the analysis of only real data in males, while in females, the performance was comparable. We also demonstrated the advantages of such a joint analysis approach in the presence of incomplete measurements and its limitations in the presence of wrong prior information. Conclusions: The joint analysis of real measurements and simulated data (generated using a mechanistic model) through coupled tensor factorizations guides real data analysis with prior information encapsulated in mechanistic models and reveals interpretable patterns.
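The coupling described in this abstract (two third-order tensors sharing the metabolites mode) can be illustrated with a minimal sketch. It assumes a plain CP model for both tensors fitted by unweighted alternating least squares, which is a stand-in and not necessarily the algorithm used in the paper; the function names `cp_reconstruct` and `coupled_cp_als` and the toy tensor sizes are hypothetical.

```python
import numpy as np


def cp_reconstruct(A, B, C):
    """Rebuild a third-order tensor from CP factor matrices."""
    return np.einsum("ir,jr,kr->ijk", A, B, C)


def coupled_cp_als(X, Y, rank, n_iter=50, seed=0):
    """Jointly fit X ~ [[A, B, C]] and Y ~ [[D, B, E]], sharing B (mode 2)."""
    rng = np.random.default_rng(seed)
    (I, J, K), (L, J2, M) = X.shape, Y.shape
    if J != J2:
        raise ValueError("X and Y must have the same size in the coupled mode")
    A, B, C, D, E = (rng.standard_normal((n, rank)) for n in (I, J, K, L, M))
    pinv = np.linalg.pinv
    history = []
    for _ in range(n_iter):
        # Uncoupled modes: ordinary CP least-squares updates.
        A = np.einsum("ijk,jr,kr->ir", X, B, C) @ pinv((B.T @ B) * (C.T @ C))
        C = np.einsum("ijk,ir,jr->kr", X, A, B) @ pinv((A.T @ A) * (B.T @ B))
        D = np.einsum("ljm,jr,mr->lr", Y, B, E) @ pinv((B.T @ B) * (E.T @ E))
        E = np.einsum("ljm,lr,jr->mr", Y, D, B) @ pinv((D.T @ D) * (B.T @ B))
        # Coupled mode: both tensors contribute to the normal equations for B.
        rhs = (np.einsum("ijk,ir,kr->jr", X, A, C)
               + np.einsum("ljm,lr,mr->jr", Y, D, E))
        gram = (A.T @ A) * (C.T @ C) + (D.T @ D) * (E.T @ E)
        B = rhs @ pinv(gram)
        loss = (np.sum((X - cp_reconstruct(A, B, C)) ** 2)
                + np.sum((Y - cp_reconstruct(D, B, E)) ** 2))
        history.append(loss)
    return (A, B, C, D, E), history


# Toy data (hypothetical sizes): "real" subjects x metabolites x time and
# "virtual" subjects x metabolites x time, sharing the metabolite factors.
rng = np.random.default_rng(1)
B_true = rng.standard_normal((6, 2))
X = cp_reconstruct(rng.standard_normal((10, 2)), B_true, rng.standard_normal((4, 2)))
Y = cp_reconstruct(rng.standard_normal((8, 2)), B_true, rng.standard_normal((5, 2)))
factors, history = coupled_cp_als(X, Y, rank=2)
```

Because every block update is an exact least-squares solve, the summed squared error in `history` does not increase from one sweep to the next. In practice the two data-fit terms are often weighted against each other, which this sketch omits.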
Affiliation(s)
- Lu Li
  - School of Mathematics (Zhuhai), Sun Yat-sen University, Zhuhai 519000, China
  - Department of Data Science and Knowledge Discovery, Simula Metropolitan Center for Digital Engineering, 0130 Oslo, Norway
- Huub Hoefsloot
  - Swammerdam Institute for Life Sciences, University of Amsterdam, 1090 GE Amsterdam, The Netherlands
- Barbara M. Bakker
  - Laboratory of Pediatrics, Systems Medicine of Metabolism and Signaling, University of Groningen, University Medical Center Groningen, 9700 AD Groningen, The Netherlands
- David Horner
  - Copenhagen Prospective Studies on Asthma in Childhood (COPSAC), Herlev and Gentofte Hospital, DK-2820 Gentofte, Denmark
- Morten A. Rasmussen
  - Copenhagen Prospective Studies on Asthma in Childhood (COPSAC), Herlev and Gentofte Hospital, DK-2820 Gentofte, Denmark
  - Department of Food Science, University of Copenhagen, DK-1958 Frederiksberg, Denmark
- Age K. Smilde
  - Department of Data Science and Knowledge Discovery, Simula Metropolitan Center for Digital Engineering, 0130 Oslo, Norway
  - Swammerdam Institute for Life Sciences, University of Amsterdam, 1090 GE Amsterdam, The Netherlands
- Evrim Acar
  - Department of Data Science and Knowledge Discovery, Simula Metropolitan Center for Digital Engineering, 0130 Oslo, Norway
2
Ding S, Zhang S, Hu X, Zou N. Identify and mitigate bias in electronic phenotyping: A comprehensive study from computational perspective. J Biomed Inform 2024; 156:104671. [PMID: 38876452 DOI: 10.1016/j.jbi.2024.104671] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/01/2023] [Revised: 05/26/2024] [Accepted: 06/05/2024] [Indexed: 06/16/2024]
Abstract
Electronic phenotyping is a fundamental task that identifies specific groups of patients and plays an important role in precision medicine in the era of digital health. Phenotyping provides real-world evidence for other related biomedical research and clinical tasks, e.g., disease diagnosis, drug development, and clinical trials. With the development of electronic health records, the performance of electronic phenotyping has been significantly boosted by advanced machine learning techniques. In the healthcare domain, precision and fairness are both essential aspects that should be taken into consideration. However, most related efforts go into designing phenotyping models with higher accuracy, and little attention is paid to the fairness of phenotyping. Neglecting bias in phenotyping leads to subgroups of patients being underrepresented, which further affects subsequent healthcare activities such as patient recruitment in clinical trials. In this work, we aim to bridge this gap through a comprehensive experimental study that identifies the bias existing in electronic phenotyping models and evaluates the performance of widely used debiasing methods on these models. We choose pneumonia and sepsis as our phenotyping target diseases. We benchmark 9 kinds of electronic phenotyping methods, ranging from rule-based to data-driven methods. We also evaluate the performance of 5 bias mitigation strategies covering pre-processing, in-processing, and post-processing. From these extensive experiments, we summarize several insightful findings about the bias identified in phenotyping and key points of the bias mitigation strategies in phenotyping.
Affiliation(s)
- Sirui Ding
  - Department of Computer Science & Engineering, Texas A&M University, College Station, TX, United States
- Shenghan Zhang
  - Department of Biomedical Informatics, Harvard University, Boston, MA, United States
- Xia Hu
  - Department of Computer Science, Rice University, Houston, TX, United States
- Na Zou
  - Department of Industrial Engineering, University of Houston, Houston, TX, United States
3
Karimian Sichani E, Smith A, El Emam K, Mosquera L. Creating High-Quality Synthetic Health Data: Framework for Model Development and Validation. JMIR Form Res 2024; 8:e53241. [PMID: 38648097 PMCID: PMC11034549 DOI: 10.2196/53241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2023] [Revised: 01/09/2024] [Accepted: 03/01/2024] [Indexed: 04/25/2024] Open
Abstract
BACKGROUND Electronic health records are a valuable source of patient information that must be properly deidentified before being shared with researchers. This process requires expertise and time. In addition, synthetic data have considerably reduced the restrictions on the use and sharing of real data, allowing researchers to access it more rapidly with far fewer privacy constraints. Therefore, there has been a growing interest in establishing a method to generate synthetic data that protects patients' privacy while properly reflecting the data. OBJECTIVE This study aims to develop and validate a model that generates valuable synthetic longitudinal health data while protecting the privacy of the patients whose data are collected. METHODS We investigated the best model for generating synthetic health data, with a focus on longitudinal observations. We developed a generative model that relies on the generalized canonical polyadic (GCP) tensor decomposition. This model also involves sampling from a latent factor matrix of GCP decomposition, which contains patient factors, using sequential decision trees, copula, and Hamiltonian Monte Carlo methods. We applied the proposed model to samples from the MIMIC-III (version 1.4) data set. Numerous analyses and experiments were conducted with different data structures and scenarios. We assessed the similarity between our synthetic data and the real data by conducting utility assessments. These assessments evaluate the structure and general patterns present in the data, such as dependency structure, descriptive statistics, and marginal distributions. Regarding privacy disclosure, our model preserves privacy by preventing the direct sharing of patient information and eliminating the one-to-one link between the observed and model tensor records. This was achieved by simulating and modeling a latent factor matrix of GCP decomposition associated with patients. 
RESULTS The findings show that our model is a promising method for generating synthetic longitudinal health data that is similar enough to real data. It can preserve the utility and privacy of the original data while also handling various data structures and scenarios. In certain experiments, all simulation methods used in the model produced the same high level of performance. Our model is also capable of addressing the challenge of sampling patients from electronic health records. This means that we can simulate a variety of patients in the synthetic data set, which may differ in number from the patients in the original data. CONCLUSIONS We have presented a generative model for producing synthetic longitudinal health data. The model is formulated by applying the GCP tensor decomposition. We have provided 3 approaches for the synthesis and simulation of a latent factor matrix following the process of factorization. In brief, we have reduced the challenge of synthesizing massive longitudinal health data to synthesizing a nonlongitudinal and significantly smaller data set.
Affiliation(s)
- Aaron Smith
  - Department of Mathematics and Statistics, University of Ottawa, Ottawa, ON, Canada
- Khaled El Emam
  - Children's Hospital of Eastern Ontario Research Institute, Ottawa, ON, Canada
  - Replica Analytics Ltd, Ottawa, ON, Canada
  - School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada
- Lucy Mosquera
  - Children's Hospital of Eastern Ontario Research Institute, Ottawa, ON, Canada
  - Replica Analytics Ltd, Ottawa, ON, Canada
4
Ren Y, Lou J, Xiong L, Ho JC, Jiang X, Bhavani SV. MULTIPAR: Supervised Irregular Tensor Factorization with Multi-task Learning for Computational Phenotyping. Proceedings of Machine Learning Research 2023; 225:498-511. [PMID: 39624658 PMCID: PMC11611252] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/06/2024]
Abstract
Tensor factorization has received increasing interest due to its intrinsic ability to capture latent factors in multi-dimensional data, with many applications including Electronic Health Records (EHR) mining. PARAFAC2 and its variants have been proposed to address irregular tensors where one of the tensor modes is not aligned, e.g., different patients in EHRs may have records of different lengths. PARAFAC2 has been successfully applied to EHRs for extracting meaningful medical concepts (phenotypes). Despite recent advancements, current models' predictability and interpretability are not satisfactory, which limits their utility for downstream analysis. In this paper, we propose MULTIPAR: a supervised irregular tensor factorization with multi-task learning for computational phenotyping. MULTIPAR can flexibly incorporate both static (e.g., in-hospital mortality prediction) and continuous or dynamic (e.g., the need for ventilation) tasks. By supervising the tensor factorization with downstream prediction tasks and leveraging information from multiple related predictive tasks, MULTIPAR can yield not only more meaningful phenotypes but also better predictive performance for downstream tasks. We conduct extensive experiments on two real-world temporal EHR datasets to demonstrate that MULTIPAR is scalable and achieves better tensor fit with more meaningful subgroups and stronger predictive performance compared to existing state-of-the-art methods. The implementation of MULTIPAR is available.
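The irregular-tensor setting mentioned in this abstract can be made concrete with a small sketch of the standard (unsupervised) PARAFAC2 model, in which each patient slice is X_k = U_k diag(s_k) Vᵀ with U_k = Q_k H and Q_k having orthonormal columns. This only builds synthetic data with that structure; it is not MULTIPAR, and the sizes, variable names, and record lengths are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
rank, n_features = 3, 5
record_lengths = [4, 7, 10]  # hypothetical: number of visits per patient

H = rng.standard_normal((rank, rank))        # shared across all patients
V = rng.standard_normal((n_features, rank))  # shared feature factors (phenotypes)

slices, U_list = [], []
for n_visits in record_lengths:
    # Q has orthonormal columns, so U.T @ U == H.T @ H for every patient.
    Q, _ = np.linalg.qr(rng.standard_normal((n_visits, rank)))
    U = Q @ H
    s = rng.random(rank) + 0.5               # patient-specific factor weights
    slices.append(U @ np.diag(s) @ V.T)      # visits x features for this patient
    U_list.append(U)
```

Each slice has a different number of rows, yet all slices share `V` and the cross-product `U.T @ U`, which is the constraint that keeps PARAFAC2 factors comparable across patients with unaligned time axes.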
Affiliation(s)
- Xiaoqian Jiang
  - Health Science Center of University of Texas, United States
5
Khodadadi A, Ghanbari Bousejin N, Molaei S, Kumar Chauhan V, Zhu T, Clifton DA. Improving Diagnostics with Deep Forest Applied to Electronic Health Records. Sensors (Basel, Switzerland) 2023; 23:6571. [PMID: 37514865 PMCID: PMC10384165 DOI: 10.3390/s23146571] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/30/2023] [Revised: 07/08/2023] [Accepted: 07/14/2023] [Indexed: 07/30/2023]
Abstract
An electronic health record (EHR) is a vital, high-dimensional source of medical concepts. Discovering implicit correlations in these data, along with their research and informative aspects, can improve the treatment and management process. The challenge of concern is that limitations of the data sources make it difficult to find a stable model that relates medical concepts and exploits these existing connections. This paper presents Patient Forest, a novel end-to-end approach for learning patient representations from tree-structured data for readmission and mortality prediction tasks. By leveraging statistical features, the proposed model is able to provide an accurate and reliable classifier for predicting readmission and mortality. Experiments on the MIMIC-III and eICU datasets demonstrate that Patient Forest outperforms existing machine learning models, especially when the training data are limited. Additionally, a qualitative evaluation of Patient Forest is conducted by visualising the learnt representations in 2D space using t-SNE, which further confirms the effectiveness of the proposed model in learning EHR representations.
Affiliation(s)
- Atieh Khodadadi
  - Institute of Applied Informatics and Formal Description Methods, Karlsruhe Institute of Technology, 76133 Karlsruhe, Germany
- Soheila Molaei
  - Department of Engineering Science, University of Oxford, Oxford OX1 3AZ, UK
- Vinod Kumar Chauhan
  - Department of Engineering Science, University of Oxford, Oxford OX1 3AZ, UK
- Tingting Zhu
  - Department of Engineering Science, University of Oxford, Oxford OX1 3AZ, UK
- David A. Clifton
  - Department of Engineering Science, University of Oxford, Oxford OX1 3AZ, UK
  - Oxford-Suzhou Centre for Advanced Research (OSCAR), Suzhou 215123, China
6
Xie F, Yuan H, Ning Y, Ong MEH, Feng M, Hsu W, Chakraborty B, Liu N. Deep learning for temporal data representation in electronic health records: A systematic review of challenges and methodologies. J Biomed Inform 2021; 126:103980. [PMID: 34974189 DOI: 10.1016/j.jbi.2021.103980] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Received: 07/23/2021] [Revised: 11/07/2021] [Accepted: 12/20/2021] [Indexed: 12/21/2022]
Abstract
OBJECTIVE Temporal electronic health records (EHRs) contain a wealth of information for secondary uses, such as clinical events prediction and chronic disease management. However, challenges exist for temporal data representation. We therefore sought to identify these challenges and evaluate novel methodologies for addressing them through a systematic examination of deep learning solutions. METHODS We searched five databases (PubMed, Embase, the Institute of Electrical and Electronics Engineers [IEEE] Xplore Digital Library, the Association for Computing Machinery [ACM] Digital Library, and Web of Science) complemented with hand-searching in several prestigious computer science conference proceedings. We sought articles that reported deep learning methodologies on temporal data representation in structured EHR data from January 1, 2010, to August 30, 2020. We summarized and analyzed the selected articles from three perspectives: nature of time series, methodology, and model implementation. RESULTS We included 98 articles related to temporal data representation using deep learning. Four major challenges were identified, including data irregularity, heterogeneity, sparsity, and model opacity. We then studied how deep learning techniques were applied to address these challenges. Finally, we discuss some open challenges arising from deep learning. CONCLUSION Temporal EHR data present several major challenges for clinical prediction modeling and data utilization. To some extent, current deep learning solutions can address these challenges. Future studies may consider designing comprehensive and integrated solutions. Moreover, researchers should incorporate clinical domain knowledge into study designs and enhance model interpretability to facilitate clinical implementation.
Affiliation(s)
- Feng Xie
  - Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore
  - Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore
- Han Yuan
  - Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore
- Yilin Ning
  - Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore
- Marcus Eng Hock Ong
  - Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore
  - Department of Emergency Medicine, Singapore General Hospital, Singapore
- Mengling Feng
  - Saw Swee Hock School of Public Health, National University of Singapore, Singapore
- Wynne Hsu
  - School of Computing, National University of Singapore, Singapore
  - Institute of Data Science, National University of Singapore, Singapore
- Bibhas Chakraborty
  - Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore
  - Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore
  - Department of Statistics and Data Science, National University of Singapore, Singapore
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, United States
- Nan Liu
  - Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore
  - Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore
  - Institute of Data Science, National University of Singapore, Singapore
  - SingHealth AI Health Program, Singapore Health Services, Singapore
7
Spadon G, Hong S, Brandoli B, Matwin S, Rodrigues-Jr JF, Sun J. Pay Attention to Evolution: Time Series Forecasting with Deep Graph-Evolution Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 2021; PP:5368-5384. [PMID: 33905327 DOI: 10.1109/tpami.2021.3076155] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 06/12/2023]
Abstract
Time-series forecasting is one of the most active research topics in artificial intelligence. A still-open gap in the literature is that statistical and ensemble learning approaches systematically present lower predictive performance than deep learning methods. They generally disregard the data sequence aspect entangled with multivariate data represented in more than one time series. Conversely, this work presents a novel neural network architecture for time-series forecasting that combines the power of graph evolution with deep recurrent learning on distinct data distributions; we named our method Recurrent Graph Evolution Neural Network (ReGENN). The idea is to infer multiple multivariate relationships between co-occurring time series by assuming that the temporal data depend not only on inner variables and intra-temporal relationships (i.e., observations from itself) but also on outer variables and inter-temporal relationships (i.e., observations from other-selves). An extensive set of experiments was conducted comparing ReGENN with dozens of ensemble methods and classical statistical ones, showing sound improvement of up to 64.87% over the competing algorithms. Furthermore, we present an analysis of the intermediate weights arising from ReGENN, showing that by looking at inter- and intra-temporal relationships simultaneously, time-series forecasting improves substantially when attention is paid to how multiple multivariate data evolve synchronously.
8
Yin K, Afshar A, Ho JC, Cheung WK, Zhang C, Sun J. LogPar: Logistic PARAFAC2 Factorization for Temporal Binary Data with Missing Values. KDD: Proceedings. International Conference on Knowledge Discovery & Data Mining 2020; 2020:1625-1635. [PMID: 34109054 DOI: 10.1145/3394486.3403213] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 10/23/2022]
Abstract
Binary data with one-class missing values are ubiquitous in real-world applications. They can be represented by irregular tensors with varying sizes in one dimension, where a value of one means the presence of a feature while zero means unknown (i.e., either presence or absence of a feature). Learning accurate low-rank approximations from such binary irregular tensors is a challenging task. However, none of the existing models developed for factorizing irregular tensors take the missing values into account, and they assume Gaussian distributions, resulting in a distribution mismatch when applied to binary data. In this paper, we propose Logistic PARAFAC2 (LogPar) by modeling the binary irregular tensor with a Bernoulli distribution parameterized by an underlying real-valued tensor. Then we approximate the underlying tensor with a positive-unlabeled learning loss function to account for the missing values. We also incorporate uniqueness and temporal smoothness regularization to enhance interpretability. Extensive experiments using large-scale real-world datasets show that LogPar outperforms all baselines in both irregular tensor completion and downstream predictive tasks. For irregular tensor completion, LogPar achieves up to 26% relative improvement compared to the best baseline. In addition, LogPar obtains an average relative improvement of 13.2% for heart failure prediction and 14% for mortality prediction compared to the state-of-the-art PARAFAC2 models.
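The distribution mismatch mentioned in this abstract can be seen by writing out the Bernoulli likelihood that replaces the Gaussian (squared-error) fit. The sketch below assumes a standard logistic link on a low-rank matrix of logits and shows only the plain Bernoulli negative log-likelihood and its gradient; it does not implement the positive-unlabeled loss, the PARAFAC2 structure, or the regularizers of LogPar, and all names and sizes are hypothetical.

```python
import numpy as np


def bernoulli_nll(X, Theta):
    """Negative log-likelihood of binary X under Bernoulli(sigmoid(Theta))."""
    # Per entry: log(1 + exp(theta)) - x * theta, computed stably via logaddexp.
    return np.sum(np.logaddexp(0.0, Theta) - X * Theta)


def bernoulli_nll_grad(X, Theta):
    """Gradient of the loss with respect to the logits: sigmoid(Theta) - X."""
    return 1.0 / (1.0 + np.exp(-Theta)) - X


# Toy example: low-rank logits for a (visits x features) binary slice.
rng = np.random.default_rng(0)
U = rng.standard_normal((6, 2))
V = rng.standard_normal((4, 2))
Theta = U @ V.T                      # underlying real-valued parameters
X = (rng.random(Theta.shape) < 1.0 / (1.0 + np.exp(-Theta))).astype(float)

loss = bernoulli_nll(X, Theta)
grad_U = bernoulli_nll_grad(X, Theta) @ V   # chain rule through Theta = U V^T
```

The gradient with respect to the logits has the same "prediction minus observation" form as in the Gaussian case, but the prediction is passed through a sigmoid, so fitted values stay in (0, 1).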
Affiliation(s)
- Jimeng Sun
  - University of Illinois, Urbana-Champaign