1
Ren W, Liu Z, Wu Y, Zhang Z, Hong S, Liu H. Moving Beyond Medical Statistics: A Systematic Review on Missing Data Handling in Electronic Health Records. HEALTH DATA SCIENCE 2024; 4:0176. [PMID: 39635227 PMCID: PMC11615160 DOI: 10.34133/hds.0176] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2024] [Accepted: 07/23/2024] [Indexed: 12/07/2024]
Abstract
Background: Missing data in electronic health records (EHRs) present significant challenges in medical studies. Many methods have been proposed, but it remains unclear which missing-data methods are currently applied to EHRs and which strategies perform better in specific contexts. Methods: All studies referencing EHRs and missing data methods published from database inception until March 30, 2024 were searched via the MEDLINE, EMBASE, and Digital Bibliography and Library Project databases. The characteristics of the included studies were extracted. We also compared the performance of various methods under different missingness scenarios. Results: After screening, 46 studies published between 2010 and 2024 were included. Three missingness mechanisms were simulated when evaluating the missing data methods: missing completely at random (29/46), missing at random (20/46), and missing not at random (21/46). Multiple imputation by chained equations (MICE) was the most popular statistical method, whereas generative adversarial network-based methods and k-nearest neighbor (KNN) classification were the most common deep-learning-based and traditional machine-learning-based methods, respectively. Among the 26 articles comparing the performance of medical statistical and machine learning approaches, traditional machine learning or deep learning methods generally outperformed statistical methods. Med.KNN and context-aware time-series imputation performed better for longitudinal datasets, whereas probabilistic principal component analysis and MICE-based methods were optimal for cross-sectional datasets. Conclusions: Machine learning methods show significant promise for addressing missing data in EHRs. However, no single approach provides a universally generalizable solution. Standardized benchmarking analyses are essential to evaluate these methods across different missingness scenarios.
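The three missingness mechanisms named in this abstract can be illustrated with a small simulation. The sketch below is not taken from the review: the toy variables (`ages`, `labs`), the logistic missingness probabilities, and the helper names are all assumptions chosen for illustration. It hides a synthetic lab value under each mechanism and prints how far the mean of the remaining observed values drifts from the full-data mean.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def is_missing(age, lab, mechanism, rng):
    """Decide whether one lab value is hidden under the given mechanism."""
    if mechanism == "MCAR":
        p = 0.3                            # constant, unrelated to any data
    elif mechanism == "MAR":
        p = sigmoid((age - 60.0) / 10.0)   # depends only on the observed age
    elif mechanism == "MNAR":
        p = sigmoid((lab - 100.0) / 15.0)  # depends on the hidden value itself
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return rng.random() < p

def observed_mean(ages, labs, mechanism, seed=0):
    """Mean of the lab values that remain observed (complete-case mean)."""
    rng = random.Random(seed)
    kept = [lab for age, lab in zip(ages, labs)
            if not is_missing(age, lab, mechanism, rng)]
    return sum(kept) / len(kept)

# Synthetic cohort (illustrative assumption): lab value rises with age.
data_rng = random.Random(42)
ages = [data_rng.gauss(60.0, 12.0) for _ in range(2000)]
labs = [100.0 + 0.5 * (a - 60.0) + data_rng.gauss(0.0, 12.0) for a in ages]
full_mean = sum(labs) / len(labs)

for mech in ("MCAR", "MAR", "MNAR"):
    shift = observed_mean(ages, labs, mech) - full_mean
    print(mech, "complete-case mean shift:", round(shift, 2))
```

In this toy setup the complete-case mean stays close to the full-data mean under MCAR and drifts under MAR and MNAR. The difference between the last two is that the MAR drift is driven by a variable that is still observed (age), so a model conditioning on age can in principle correct it, whereas the MNAR drift depends on the hidden values themselves.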
Affiliation(s)
- Wenhui Ren
  - Department of Clinical Epidemiology and Biostatistics, Peking University People’s Hospital, Beijing, China
- Zheng Liu
  - Department of Clinical Epidemiology and Biostatistics, Peking University People’s Hospital, Beijing, China
- Yanqiu Wu
  - Department of Clinical Epidemiology and Biostatistics, Peking University People’s Hospital, Beijing, China
- Zhilong Zhang
  - National Institute of Health Data Science, Peking University, Beijing, China
  - Institute of Medical Technology, Health Science Center of Peking University, Beijing, China
- Shenda Hong
  - National Institute of Health Data Science, Peking University, Beijing, China
- Huixin Liu
  - Department of Clinical Epidemiology and Biostatistics, Peking University People’s Hospital, Beijing, China
2
Li J, Wang Z, Wu L, Qiu S, Zhao H, Lin F, Zhang K. Method for Incomplete and Imbalanced Data Based on Multivariate Imputation by Chained Equations and Ensemble Learning. IEEE J Biomed Health Inform 2024; 28:3102-3113. [PMID: 38483807 DOI: 10.1109/jbhi.2024.3376428] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/20/2024]
Abstract
The classification analysis of incomplete and imbalanced data is still a challenging task, since both issues can negatively impact the training of classifiers; we also encountered them in our study on the physical fitness assessments of patients. In fields such as healthcare, there are higher requirements for the accuracy of the generated imputation values. To train a high-performance classifier and pursue high accuracy, we attempted to resolve any potential negative impact by using a novel algorithmic approach based on the combination of multivariate imputation by chained equations and the ensemble learning method (MICEEN), which can solve the two problems simultaneously. We used multivariate imputation by chained equations to generate more accurate imputation values for the training set, which was then passed to ensemble learning to build a predictor. In addition, we introduced missing values into minority-class samples and used them to generate new minority-class samples in order to balance the class distribution. We performed extensive experiments on real-world datasets to assess our method and compare it with other state-of-the-art approaches. The advantages of the proposed method are demonstrated by experimental results on the benchmark datasets and on self-collected datasets of physical fitness assessments of tumor patients with varying missing rates.
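The chained-equations idea behind MICE can be sketched in a few lines. The example below is not the MICEEN method from this abstract and is deliberately simplified: it handles only two numeric columns, uses ordinary least squares for each conditional model, and fills in deterministic predictions, whereas real MICE draws imputations with random noise and repeats the process to produce several completed datasets. The function names `fit_line` and `chained_impute` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx > 0 else 0.0
    return my - b * mx, b

def chained_impute(rows, n_iter=10):
    """rows: list of [x, y] pairs with None for missing. Returns a filled copy."""
    miss = [[v is None for v in row] for row in rows]
    filled = [list(row) for row in rows]
    # Step 0: start from the column means of the observed values.
    for j in range(2):
        obs = [r[j] for r in rows if r[j] is not None]
        mean = sum(obs) / len(obs)
        for r in filled:
            if r[j] is None:
                r[j] = mean
    # Chained step: re-predict each column from the other, in turn.
    for _ in range(n_iter):
        for j in range(2):
            k = 1 - j  # index of the predictor column
            # Train only on rows where the target column was truly observed.
            train = [(r[k], r[j]) for r, m in zip(filled, miss) if not m[j]]
            a, b = fit_line([t[0] for t in train], [t[1] for t in train])
            for r, m in zip(filled, miss):
                if m[j]:
                    r[j] = a + b * r[k]
    return filled

# Toy data lying on y = 2x + 1, with one value missing in each column.
toy = [[1, 3], [2, 5], [3, None], [None, 9], [6, 13]]
for row in chained_impute(toy):
    print([round(v, 3) for v in row])
```

Each pass refits the model for one column using the current guesses for the other, which is why the procedure is described as "chained". On data this clean the imputed values settle quickly; on real records the per-column models, the number of iterations, and the number of repeated imputations would all need to be chosen with care.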
3
Zhou X, Xiang W, Huang T. A novel neural network for improved in-hospital mortality prediction with irregular and incomplete multivariate data. Neural Netw 2023; 167:741-750. [PMID: 37734273 DOI: 10.1016/j.neunet.2023.07.033] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/05/2023] [Revised: 07/10/2023] [Accepted: 07/25/2023] [Indexed: 09/23/2023]
Abstract
Accurate estimation of in-hospital mortality based on patients' physiological time series data improves the performance of clinical decision support systems and assists hospital providers in allocating resources. In practice, missing values are a ubiquitous data quality issue in electronic health records (EHRs). Since vital signs are usually observed at irregular temporal intervals and with different sampling rates, it is challenging to predict clinical outcomes from sparse and incomplete multivariate time series. We propose an auto-regressive recurrent neural network (RNN) based model, dubbed the bi-directional recursive encoder-decoder network (BiRED), to jointly perform data imputation and mortality prediction. To capture complex patterns of medical time sequences, a 2D cross-regression with an RNN unit (2DCR-RNN) and an imputation block with an RNN unit (IB-RNN) are designed as the recurrent components of the encoder and decoder, respectively. Furthermore, a state initialization method is proposed to alleviate errors accumulated in the generated sequence. The experimental results on two real EHR datasets show that our proposed method can predict hospital mortality with high AUC scores.
Affiliation(s)
- Xi Zhou
  - School of Computing, Engineering and Mathematical Sciences, La Trobe University, Melbourne 3086, Victoria, Australia
- Wei Xiang
  - School of Computing, Engineering and Mathematical Sciences, La Trobe University, Melbourne 3086, Victoria, Australia
- Tao Huang
  - College of Science and Engineering, James Cook University, Cairns 4878, Queensland, Australia
4
Old O, Friedrichson B, Zacharowski K, Kloka JA. Entering the new digital era of intensive care medicine: an overview of interdisciplinary approaches to use artificial intelligence for patients' benefit. EUROPEAN JOURNAL OF ANAESTHESIOLOGY AND INTENSIVE CARE 2023; 2:e0014. [PMID: 39916758 PMCID: PMC11783618 DOI: 10.1097/ea9.0000000000000014] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 02/09/2025]
Abstract
The idea of implementing artificial intelligence in medicine is as old as artificial intelligence itself. So far, technical difficulties have prevented the integration of artificial intelligence in day-to-day healthcare. During the coronavirus disease 2019 (COVID-19) pandemic, a substantial amount of funding went into projects to research and implement artificial intelligence in healthcare. So far, artificial intelligence-based tools have had little impact in the fight against COVID-19. The reasons for the lack of success are complex. With advancing digitalisation, newly developed data-based methods and research are finding their way into intensive care medicine. Data scientists and medical professionals, representing two different worlds, are slowly uniting. These two highly specialised fields do not yet speak a uniform language. Each field has its own interests and objectives. We took this idea as a starting point for this technical guide and aim to provide physicians with a deeper understanding of the terminology, applications, opportunities and risks of such applications. The most important terms in the field of machine learning are defined within a medical context to ensure that the same language is spoken. The future of artificial intelligence applications will largely depend on the ability of artificial intelligence experts and physicians to cooperate in order to release the true power of artificial intelligence. Large research consortia, covering both technical and medical expertise, will grow because of growing demand in the future.
Affiliation(s)
- Oliver Old
  - Department of Anaesthesiology, Intensive Care Medicine and Pain Therapy, University Hospital Frankfurt, Goethe University, Frankfurt, Germany
- Benjamin Friedrichson
  - Department of Anaesthesiology, Intensive Care Medicine and Pain Therapy, University Hospital Frankfurt, Goethe University, Frankfurt, Germany
- Kai Zacharowski
  - Department of Anaesthesiology, Intensive Care Medicine and Pain Therapy, University Hospital Frankfurt, Goethe University, Frankfurt, Germany
- Jan Andreas Kloka
  - Department of Anaesthesiology, Intensive Care Medicine and Pain Therapy, University Hospital Frankfurt, Goethe University, Frankfurt, Germany
5
Isgut M, Gloster L, Choi K, Venugopalan J, Wang MD. Systematic Review of Advanced AI Methods for Improving Healthcare Data Quality in Post COVID-19 Era. IEEE Rev Biomed Eng 2023; 16:53-69. [PMID: 36269930 DOI: 10.1109/rbme.2022.3216531] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Abstract
At the beginning of the COVID-19 pandemic, there was significant hype about the potential impact of artificial intelligence (AI) tools in combatting COVID-19 through diagnosis, prognosis, or surveillance. However, AI tools have not yet been widely successful. One key reason is that the COVID-19 pandemic demanded faster real-time development of AI-driven clinical and health support tools, including rapid data collection, algorithm development, validation, and deployment. However, there was not enough time for proper data quality control. Learning from the hard lessons of COVID-19, we summarize the important health data quality challenges during the COVID-19 pandemic, such as lack of data standardization, missing data, tabulation errors, and noise and artifacts. Then we conduct a systematic investigation of computational methods that address these issues, including emerging advanced AI data quality control methods that achieve better data quality outcomes and, in some cases, simplify or automate the data cleaning process. We hope this article can assist the healthcare community in improving health data quality going forward with novel AI development.
6
Li M, Du S. Current status and trends in researches based on public intensive care databases: A scientometric investigation. Front Public Health 2022; 10:912151. [PMID: 36187634 PMCID: PMC9521614 DOI: 10.3389/fpubh.2022.912151] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 04/04/2022] [Accepted: 08/08/2022] [Indexed: 01/22/2023]
Abstract
Objective: Public intensive care databases cover a wide range of data produced in intensive care units (ICUs). They draw great attention from researchers because they save time and money in obtaining data. This study aimed to explore the current status and trends of publications based on public intensive care databases. Methods: Articles and reviews based on public intensive care databases, published from 2001 to 2021, were retrieved from the Web of Science Core Collection (WoSCC) for investigation. Scientometric software (CiteSpace and VOSviewer) was used to generate network maps and reveal hot spots of studies based on public intensive care databases. Results: A total of 456 studies were collected. Zhang Zhongheng from Zhejiang University (China) and Leo Anthony Celi from Massachusetts Institute of Technology (MIT, USA) occupied important positions in studies based on public intensive care databases. Closer cooperation was observed between institutions in the same country. Six research topics were identified through keyword analysis. Citation burst analysis indicated that this field was in a stage of rapid development, with more diseases and clinical problems being investigated. Machine learning remains a popular research method in this field. Conclusions: This is the first time that scientometrics has been used to investigate studies based on public intensive care databases. Although more and more such studies have been published, public intensive care databases may not yet be fully explored. Moreover, this study could help researchers directly perceive the current status and trends in this field. Public intensive care databases could be more fully explored as researchers' knowledge of this field grows.
7
Mohammed YS, Abdelkader H, Pławiak P, Hammad M. A novel model to optimize multiple imputation algorithm for missing data using evolution methods. Biomed Signal Process Control 2022; 76:103661. [DOI: 10.1016/j.bspc.2022.103661] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/30/2023]
8
Zhu Y, Venugopalan J, Zhang Z, Chanani NK, Maher KO, Wang MD. Domain Adaptation Using Convolutional Autoencoder and Gradient Boosting for Adverse Events Prediction in the Intensive Care Unit. Front Artif Intell 2022; 5:640926. [PMID: 35481281 PMCID: PMC9036368 DOI: 10.3389/frai.2022.640926] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/12/2020] [Accepted: 02/23/2022] [Indexed: 11/13/2022]
Abstract
More than 5 million patients are admitted annually to intensive care units (ICUs) in the United States. The leading causes of mortality are cardiovascular failures, multi-organ failures, and sepsis. Data-driven techniques have been used in the analysis of patient data to predict adverse events, such as ICU mortality and ICU readmission. These models often make use of temporal or static features from a single ICU database to make predictions on subsequent adverse events. To explore the potential of domain adaptation, we propose a method of data analysis using gradient boosting and convolutional autoencoder (CAE) to predict significant adverse events in the ICU, such as ICU mortality and ICU readmission. We demonstrate our results from a retrospective data analysis using patient records from a publicly available database called Multi-parameter Intelligent Monitoring in Intensive Care-II (MIMIC-II) and a local database from Children's Healthcare of Atlanta (CHOA). We demonstrate that after adopting novel data imputation on patient ICU data, gradient boosting is effective in both the mortality prediction task and the ICU readmission prediction task. In addition, we use gradient boosting to identify top-ranking temporal and non-temporal features in both prediction tasks. We discuss the relationship between these features and the specific prediction task. Lastly, we find that CAE might not be effective in feature extraction on one dataset, but domain adaptation with CAE feature extraction across two datasets shows promising results.
Affiliation(s)
- Yuanda Zhu
  - School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, United States
- Janani Venugopalan
  - Biomedical Engineering Department, Georgia Institute of Technology, Emory University, Atlanta, GA, United States
- Zhenyu Zhang
  - Biomedical Engineering Department, Georgia Institute of Technology, Atlanta, GA, United States
  - Department of Biomedical Engineering, Peking University, Beijing, China
- Kevin O. Maher
  - Pediatrics Department, Emory University, Atlanta, GA, United States
- May D. Wang
  - Biomedical Engineering Department, Georgia Institute of Technology, Emory University, Atlanta, GA, United States
  - *Correspondence: May D. Wang
9
Xu Y, Liu X, Pan L, Mao X, Liang H, Wang G, Chen T. Explainable Dynamic Multimodal Variational Autoencoder for the Prediction of Patients with Suspected Central Precocious Puberty. IEEE J Biomed Health Inform 2021; 26:1362-1373. [PMID: 34388097 DOI: 10.1109/jbhi.2021.3103271] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/07/2022]
Abstract
Central precocious puberty (CPP) is the most common type of precocious puberty and has a significant effect on children. A gonadotropin-releasing hormone (GnRH)-stimulation test is the gold standard for confirming CPP. This test, however, is costly and unpleasant for patients. Therefore, it is critical to develop alternative methods for CPP diagnosis in order to alleviate patient suffering. This study aims to develop an artificial intelligence (AI) diagnostic system for predicting response to the GnRH-stimulation test using data from laboratory tests, electronic health records (EHRs), and pelvic ultrasonography and left-hand radiography reports. The challenge lies in integrating these multimodal features into a comprehensive deep learning model in order to achieve an accurate diagnosis while also accounting for missing or incomplete modalities. To begin, we developed a dynamic multimodal variational autoencoder (DMVAE) that can exploit intrinsic correlations between different modalities to impute features for missing modalities. Next, we combined features from all modalities to predict the outcome of a CPP diagnosis. The experimental results (AUROC 0.9086) demonstrate that our DMVAE model is superior to standard methods. Additionally, we showed that by setting appropriate operating thresholds, clinicians could diagnose about two-thirds of patients with confidence (1.0 specificity). Only about one-third of patients require confirmation of their diagnoses using GnRH (or GnRH analog)-stimulation tests. To interpret the results, we implemented a Shapley additive explanations (SHAP) explainer to analyze the local and global feature attributions.
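The abstract does not say how its operating thresholds were chosen. One simple way to obtain 1.0 specificity on a validation set, shown here purely as an assumption-laden sketch, is to call a case positive only when its score exceeds the highest score of any true negative. The scores, labels, and function names below are made up for illustration.

```python
def full_specificity_threshold(scores, labels):
    """Highest score among true negatives (label 0).

    Calling 'positive' only for scores strictly above this value produces
    no false positives on the data it was computed from.
    """
    return max(s for s, y in zip(scores, labels) if y == 0)

def confusion(scores, labels, threshold):
    """Return (tp, fp, tn, fn) when predicting positive for score > threshold."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        pred = 1 if s > threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Made-up validation scores and labels (1 = positive test response).
scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

thr = full_specificity_threshold(scores, labels)
tp, fp, tn, fn = confusion(scores, labels, thr)
specificity = tn / (tn + fp)
sensitivity = tp / (tp + fn)
print("threshold:", thr, "specificity:", specificity, "sensitivity:", sensitivity)
```

A threshold chosen this way only guarantees zero false positives on the data used to pick it; specificity on new patients can be lower, so it would need to be checked on held-out data. Cases scoring at or below the threshold are the ones that would still need a confirmatory test.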
10
Xu D, Sheng JQ, Hu PJH, Huang TS, Hsu CC. A Deep Learning-Based Unsupervised Method to Impute Missing Values in Patient Records for Improved Management of Cardiovascular Patients. IEEE J Biomed Health Inform 2021; 25:2260-2272. [PMID: 33095720 DOI: 10.1109/jbhi.2020.3033323] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 11/07/2022]
Abstract
Physicians increasingly depend on electronic health records (EHRs) to manage their patients. However, many patient records have substantial missing values that pose a fundamental challenge to their clinical use. To address this prevailing challenge, we propose an unsupervised deep learning-based method that can facilitate physicians' use of EHRs to improve their management of cardiovascular patients. By building on the deep autoencoder framework, we develop a novel method to impute missing values in patient records. To demonstrate its clinical applicability and value, we use data from cardiovascular patients and evaluate the proposed method's imputation effectiveness and predictive efficacy in comparison with six prevalent benchmark techniques. The proposed method can impute missing values and predict important patient outcomes more effectively than all the benchmark techniques. This study reinforces the importance of adequately addressing missing values in patient records. It further illustrates how effective imputations can enable greater predictive efficacy with regard to important patient outcomes, which are crucial to the use of EHRs and health analytics for improved patient management. Supported by the complete data imputed by the proposed method, physicians can make timely patient outcome estimations (predictions) and therapeutic treatment assessments.
11
Vivar G, Kazi A, Burwinkel H, Zwergal A, Navab N, Ahmadi SA. Simultaneous imputation and classification using Multigraph Geometric Matrix Completion (MGMC): Application to neurodegenerative disease classification. Artif Intell Med 2021; 117:102097. [PMID: 34127236 DOI: 10.1016/j.artmed.2021.102097] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/16/2020] [Revised: 05/04/2021] [Accepted: 05/05/2021] [Indexed: 10/21/2022]
Abstract
Large-scale population-based studies in medicine are a key resource towards better diagnosis, monitoring, and treatment of diseases. They also serve as enablers of clinical decision support systems, in particular computer-aided diagnosis (CADx) using machine learning (ML). Numerous ML approaches for CADx have been proposed in the literature. However, these approaches assume feature-complete data, which is often not the case in clinical data. To account for missing data, incomplete data samples are either removed or imputed, which could lead to data bias and may negatively affect classification performance. As a solution, we propose end-to-end learning of imputation and disease prediction on incomplete medical datasets via Multi-graph Geometric Matrix Completion (MGMC). MGMC uses multiple recurrent graph convolutional networks, where each graph represents an independent population model based on a key clinical meta-feature like age, sex, or cognitive function. Graph signal aggregation from local patient neighborhoods, combined with multi-graph signal fusion via self-attention, has a regularizing effect on both matrix reconstruction and classification performance. Our proposed approach is able to impute class-relevant features as well as perform accurate and robust classification on two publicly available medical datasets. We empirically show the superiority of our proposed approach in terms of classification and imputation performance when compared with state-of-the-art approaches. MGMC enables disease prediction in multimodal and incomplete medical datasets. These findings could serve as a baseline for future CADx approaches which utilize incomplete datasets.
Affiliation(s)
- Gerome Vivar
  - Department of Computer Aided Medical Procedures (CAMP), Technical University of Munich (TUM), Boltzmannstr. 3, 85748 Garching, Germany; German Center for Vertigo and Balance Disorders (DSGZ), Ludwig-Maximilians University (LMU), Fraunhoferstr. 20, 82152, Planegg, Germany
- Anees Kazi
  - Department of Computer Aided Medical Procedures (CAMP), Technical University of Munich (TUM), Boltzmannstr. 3, 85748 Garching, Germany
- Hendrik Burwinkel
  - Department of Computer Aided Medical Procedures (CAMP), Technical University of Munich (TUM), Boltzmannstr. 3, 85748 Garching, Germany
- Andreas Zwergal
  - German Center for Vertigo and Balance Disorders (DSGZ), Ludwig-Maximilians University (LMU), Fraunhoferstr. 20, 82152, Planegg, Germany
- Nassir Navab
  - Department of Computer Aided Medical Procedures (CAMP), Technical University of Munich (TUM), Boltzmannstr. 3, 85748 Garching, Germany
- Seyed-Ahmad Ahmadi
  - Department of Computer Aided Medical Procedures (CAMP), Technical University of Munich (TUM), Boltzmannstr. 3, 85748 Garching, Germany; German Center for Vertigo and Balance Disorders (DSGZ), Ludwig-Maximilians University (LMU), Fraunhoferstr. 20, 82152, Planegg, Germany
12
Venugopalan J, Chanani N, Maher K, Wang MD. Combination of static and temporal data analysis to predict mortality and readmission in the intensive care. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2018; 2017:2570-2573. [PMID: 29060424 DOI: 10.1109/embc.2017.8037382] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 12/24/2022]
Abstract
There are approximately 4 million intensive care unit (ICU) admissions each year in the United States, with costs accounting for 4.1% of national health expenditures. Unforeseen adverse events contribute disproportionately to these costs. Thus, there has been substantial research in developing clinical decision support systems to predict and improve ICU outcomes such as ICU mortality, prolonged length of stay, and ICU readmission. However, the data in the ICU are collected at diverse time intervals and include both static and temporal data. Common methods for static data mining, such as Cox and logistic regression, and methods for temporal data analysis, such as temporal association rule mining, do not model the combination of both static and temporal data. This work aims to overcome this challenge by combining static models such as logistic regression and feed-forward neural networks with temporal models such as conditional random fields (CRF). We demonstrate the results using adult patient records from a publicly available database called Multi-parameter Intelligent Monitoring in Intensive Care - II (MIMIC-II). We show that the combination models outperformed the individual models of logistic regression, feed-forward neural networks, and conditional random fields in predicting ICU mortality. The combination models also outperformed the static models of logistic regression and feed-forward neural networks for the prediction of 30-day ICU readmissions when tested using the Matthews correlation coefficient and accuracy as the metrics.
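The Matthews correlation coefficient named in this abstract is computed from the four cells of a binary confusion matrix. The sketch below uses the standard formula; returning 0.0 when the denominator is zero is a common convention assumed here, and the example counts are made up rather than taken from the paper.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (assumed convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

# Made-up imbalanced example: a model that predicts "no event" for everyone
# when only 5 of 100 patients experience the event.
counts = dict(tp=0, fp=0, tn=95, fn=5)
print("accuracy:", accuracy(**counts))
print("MCC:", mcc(**counts))
```

The example shows why the two metrics are often reported together on imbalanced outcomes such as mortality or readmission: a classifier that never predicts the rare event can still score high accuracy, while its MCC stays at zero.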