1. Zamanian A, von Kleist H, Ciora OA, Piperno M, Lancho G, Ahmidi N. Analysis of Missingness Scenarios for Observational Health Data. J Pers Med 2024; 14:514. [PMID: 38793096 PMCID: PMC11122060 DOI: 10.3390/jpm14050514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2024] [Revised: 04/29/2024] [Accepted: 05/08/2024] [Indexed: 05/26/2024] [Open Access]
Abstract
Despite the extensive literature on missing data theory and cautionary articles emphasizing the importance of realistic analysis for healthcare data, a critical gap persists in incorporating domain knowledge into missing data methods. In this paper, we argue that the remedy is to identify the key scenarios that lead to data missingness and investigate their theoretical implications. Based on this proposal, we first introduce an analysis framework in which we investigate how different observation agents, such as physicians, influence data availability, and then scrutinize each scenario with respect to the steps in the missing data analysis. We apply this framework to the case study of observational data in healthcare facilities. We identify ten fundamental missingness scenarios and show how they influence the identification step for missing data graphical models, inverse probability weighting estimation, and exponential tilting sensitivity analysis. To emphasize how domain-informed analysis can improve method reliability, we conduct simulation studies under the influence of various missingness scenarios. We compare the results of three common methods in medical data analysis: complete-case analysis, MissForest imputation, and inverse probability weighting estimation. The experiments are conducted for two objectives: variable mean estimation and classification accuracy. We advocate for our analysis approach as a reference for observational health data analysis. Beyond that, we also posit that the proposed analysis framework is applicable to other medical domains.
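The contrast between complete-case analysis and inverse probability weighting for mean estimation can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' experimental setup: the data-generating model, the logistic missingness mechanism, and the function names (`simulate`, `complete_case_mean`, `ipw_mean`) are all hypothetical, and the observation probability is assumed known rather than estimated.

```python
import math
import random


def simulate(n=20000, seed=0):
    """Generate (x, y, observed, p_obs) rows; every name here is illustrative."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)               # always-observed covariate
        y = 2.0 + x + rng.gauss(0.0, 1.0)     # outcome with true mean 2.0
        p_obs = 1.0 / (1.0 + math.exp(-x))    # assumed P(y observed | x)
        observed = rng.random() < p_obs       # missingness depends on x only
        rows.append((x, y, observed, p_obs))
    return rows


def complete_case_mean(rows):
    """Average y over the rows where y was observed, ignoring the rest."""
    ys = [y for _, y, obs, _ in rows if obs]
    return sum(ys) / len(ys)


def ipw_mean(rows):
    """Normalized inverse-probability-weighted mean of y over observed rows."""
    num = sum(y / p for _, y, obs, p in rows if obs)
    den = sum(1.0 / p for _, _, obs, p in rows if obs)
    return num / den


rows = simulate()
cc = complete_case_mean(rows)
ipw = ipw_mean(rows)
print(f"complete-case mean: {cc:.3f}")
print(f"IPW mean:           {ipw:.3f}")
```

In this toy setting, large values of `x` are observed more often, so the complete-case mean drifts above the true mean of 2.0, while the weighted estimate stays close to it. With real data the observation probabilities would have to be modeled, and that is where knowledge of the missingness scenario matters.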
Affiliation(s)
- Alireza Zamanian
  - Department of Computer Science, TUM School of Computation, Information and Technology, Technical University of Munich, 85748 Munich, Germany
  - Fraunhofer Institute for Cognitive Systems IKS, 80686 Munich, Germany
- Henrik von Kleist
  - Department of Computer Science, TUM School of Computation, Information and Technology, Technical University of Munich, 85748 Munich, Germany
  - Institute of Computational Biology, Helmholtz Center Munich, 80939 Munich, Germany
- Octavia-Andreea Ciora
  - Fraunhofer Institute for Cognitive Systems IKS, 80686 Munich, Germany
- Marta Piperno
  - Fraunhofer Institute for Cognitive Systems IKS, 80686 Munich, Germany
- Gino Lancho
  - Fraunhofer Institute for Cognitive Systems IKS, 80686 Munich, Germany
- Narges Ahmidi
  - Fraunhofer Institute for Cognitive Systems IKS, 80686 Munich, Germany
2. Zamanian A, Ahmidi N, Drton M. Assessable and interpretable sensitivity analysis in the pattern graph framework for nonignorable missingness mechanisms. Stat Med 2023; 42:5419-5450. [PMID: 37759370 DOI: 10.1002/sim.9920] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2022] [Revised: 06/12/2023] [Accepted: 09/13/2023] [Indexed: 09/29/2023]
Abstract
The pattern graph framework solves a wide range of missing data problems with nonignorable mechanisms. However, it faces two challenges, assessability and interpretability, that are particularly important in safety-critical problems such as clinical diagnosis: (i) How can one assess the validity of the framework's a priori assumption and make necessary adjustments to accommodate known information about the problem? (ii) How can one interpret the process of exponential tilting used for sensitivity analysis in the pattern graph framework and choose the tilt perturbations based on meaningful real-world quantities? In this paper, we introduce Informed Sensitivity Analysis, an extension of the pattern graph framework that enables us to incorporate substantive knowledge about the missingness mechanism. Our extension allows us to examine the validity of assumptions underlying pattern graphs and interpret sensitivity analysis results in terms of realistic problem characteristics. We apply our method to a prevalent nonignorable missing data scenario in clinical research. We validate our method's results and compare them with those of a number of widely used missing data methods, including unweighted CCA, KNN Imputer, MICE, and MissForest. The validation uses both bootstrapped simulated experiments and real-world clinical observations in the public MIMIC-III dataset.
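The idea of exponential tilting can be illustrated in its simplest univariate form. This is a sketch under stated assumptions and not the pattern graph method itself: it assumes the density of the missing values equals the observed density reweighted by `exp(delta * y)`, and the function name `tilted_mean` and the parameter `delta` are hypothetical.

```python
import math


def tilted_mean(observed, n_missing, delta):
    """Overall mean when missing values follow the observed distribution
    tilted by exp(delta * y); delta = 0 means no tilt."""
    # Subtract the largest exponent before exponentiating, for stability.
    shift = max(delta * y for y in observed)
    weights = [math.exp(delta * y - shift) for y in observed]
    # Mean of the unobserved values implied by the tilted distribution.
    mean_missing = sum(w * y for w, y in zip(weights, observed)) / sum(weights)
    n_obs = len(observed)
    mean_obs = sum(observed) / n_obs
    # Mix observed and implied-missing means by their sample shares.
    return (n_obs * mean_obs + n_missing * mean_missing) / (n_obs + n_missing)


observed = [1.0, 2.0, 3.0, 4.0]
for delta in (-0.5, 0.0, 0.5):
    print(delta, tilted_mean(observed, n_missing=4, delta=delta))
```

Sweeping `delta` over a range shows how strongly the estimate depends on the unverifiable assumption about the missing values. Choosing a plausible range for `delta` is the interpretability question the abstract raises.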
Affiliation(s)
- Alireza Zamanian
  - TUM School of Computation, Information and Technology, Department of Computer Science, Technical University of Munich, Munich, Germany
  - Department of Reasoned AI Decisions, Fraunhofer Institute for Cognitive Systems IKS, Munich, Germany
- Narges Ahmidi
  - Department of Reasoned AI Decisions, Fraunhofer Institute for Cognitive Systems IKS, Munich, Germany
  - Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany
- Mathias Drton
  - TUM School of Computation, Information and Technology, Department of Mathematics, Technical University of Munich, Munich, Germany
3. Chen H, Heitjan DF. Analysis of local sensitivity to nonignorability with missing outcomes and predictors. Biometrics 2022; 78:1342-1352. [PMID: 34297356 DOI: 10.1111/biom.13532] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2020] [Revised: 07/03/2021] [Accepted: 07/15/2021] [Indexed: 12/30/2022]
Abstract
The ISNI (index of sensitivity to local nonignorability) method quantifies the local sensitivity of parametric inferences to nonignorable missingness in an outcome variable. Here we extend ISNI to situations in which both outcomes and predictors can be missing and in which the missingness mechanism can be either parametric or semi-parametric. We define the quantity MinNI (minimum nonignorability) to be an approximation to the norm of the smallest value of the transformed nonignorability that gives a nonnegligible displacement of the estimate of the parameter of interest. We illustrate our method in a complete data set from which we synthetically delete observations according to various patterns. We then apply the method to real-data examples involving the normal linear model and conditional logistic regression.
Affiliation(s)
- Heng Chen
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
- Daniel F Heitjan
  - Department of Statistical Science, Southern Methodist University, Dallas, Texas, USA
  - Department of Population & Data Sciences, University of Texas Southwestern Medical Center, Dallas, Texas, USA
4. Zhuang W, Camacho L, Silva CS, Thomson M, Snyder K. A robust biostatistical method leverages informative but uncertainly determined qPCR data for biomarker detection, early diagnosis, and treatment. PLoS One 2022; 17:e0263070. [PMID: 35100319 PMCID: PMC8803186 DOI: 10.1371/journal.pone.0263070] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2021] [Accepted: 01/11/2022] [Indexed: 11/19/2022] [Open Access]
Abstract
As a common medium-throughput technique, qPCR (quantitative real-time polymerase chain reaction) is widely used to measure levels of nucleic acids. In addition to accurate and complete data, experimenters unavoidably observe some incomplete and uncertainly determined qPCR data because of intrinsically low overall amounts of biological material, such as nucleic acids present in biofluids. When there are samples with uncertainly determined qPCR data, some investigators apply the statistical complete-case method by excluding the subset of samples with uncertainly determined data from analysis (CO), while others simply choose not to analyze (CNA) these datasets altogether. To include as many observations as possible when analyzing differential changes of interest between groups, some investigators set incomplete observations equal to the maximum quality qPCR cycle (MC), such as 32 or 40. Although straightforward, these methods may decrease the sample size, skew the data distribution, and compromise statistical power and research reproducibility across replicate qPCR studies. To overcome the shortcomings of the existing, commonly used qPCR data analysis methods and to join the efforts in advancing statistical analysis in rigorous preclinical research, we propose a robust nonparametric statistical cycle-to-threshold method (CTOT) to analyze incomplete qPCR data for two-group comparisons. CTOT incorporates important characteristics of qPCR data and time-to-event statistical methodology, resulting in a novel analytical method for qPCR data that is built around good quality data from all subjects, certainly determined or not. Using the benchmark full data (BFD) as a reference, we compared the abilities of the CTOT, CO, MC, and CNA statistical methods to detect differential changes of interest between groups with informative but uncertainly determined qPCR data.
Our simulations and applications show that CTOT improves the power of detecting and confirming differential changes in many situations over the three commonly used methods without excess type I errors. The robust nonparametric statistical method of CTOT helps leverage qPCR technology and increase the power to detect differential changes that may assist decision making with respect to biomarker detection and early diagnosis, with the goal of improving the management of patient healthcare.
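To make the time-to-event view of qPCR data concrete, the sketch below treats each sample's threshold cycle as an event time and treats undetermined samples as right-censored at the maximum cycle, then computes a Kaplan-Meier curve. The abstract does not describe the CTOT procedure in detail, so this is an illustration of the general idea only: Kaplan-Meier is a swapped-in standard estimator, and the function name `kaplan_meier` and the example values are hypothetical.

```python
def kaplan_meier(cycles, detected):
    """Kaplan-Meier estimate of the fraction of samples that have NOT yet
    crossed the threshold by each cycle at which a crossing was seen.

    cycles:   threshold cycle per sample, or the maximum cycle if undetermined
    detected: 1 if the threshold was crossed, 0 if censored at the max cycle
    """
    data = sorted(zip(cycles, detected))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        leaving = 0
        # Group all samples that share the same cycle value.
        while i < len(data) and data[i][0] == t:
            events += data[i][1]
            leaving += 1
            i += 1
        if events > 0:
            surv *= 1.0 - events / n_at_risk
            curve.append((t, surv))
        # Both detected and censored samples leave the risk set.
        n_at_risk -= leaving
    return curve


# Hypothetical group: three determined samples, two undetermined at cycle 40.
cycles = [24.0, 26.0, 28.0, 40.0, 40.0]
detected = [1, 1, 1, 0, 0]
curve = kaplan_meier(cycles, detected)
for cycle, fraction in curve:
    print(cycle, round(fraction, 3))
```

Unlike the MC substitution, the two undetermined samples here are never assigned a value of 40. They only contribute the information that their threshold cycle exceeds 40, so all five samples stay in the analysis without distorting the distribution.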
Affiliation(s)
- Wei Zhuang
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas, United States of America
- Luísa Camacho
  - Division of Biochemical Toxicology, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas, United States of America
- Camila S. Silva
  - Division of Biochemical Toxicology, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas, United States of America
- Michael Thomson
  - Office of New Drugs, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland, United States of America
- Kevin Snyder
  - Office of New Drugs, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland, United States of America
5. Hu Z. Assessing conditional causal effect via characteristic score. Stat Med 2021; 40:5188-5198. [PMID: 34181277 DOI: 10.1002/sim.9119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2020] [Revised: 05/28/2021] [Accepted: 06/10/2021] [Indexed: 11/10/2022]
Abstract
Observational studies usually include participants representing a wide, heterogeneous population. The conditional causal effect, that is, the treatment effect conditional on baseline characteristics, is of practical importance. Its estimation is subject to two challenges. First, the causal effect is not observable in any individual due to counterfactuality. Second, high-dimensional baseline variables are involved to satisfy the ignorable treatment selection assumption and to attain better estimation efficiency. In this work, a nonparametric estimation procedure, along with a pseudo-response, is proposed to estimate the conditional treatment effect through a "characteristic score", a parsimonious representation of the influence of baseline variables on treatment benefit. Adopting sparse dimension reduction with variable prescreening in the proposed estimation, we aim to identify the key baseline variables that impact the conditional treatment effect and to uncover the characteristic score that best predicts the treatment effect. This approach is applied to an HIV study for assessing the benefit of antiretroviral regimens and identifying the beneficiary subpopulation.
Affiliation(s)
- Zonghui Hu
  - Biostatistics Research Branch, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland, USA
6. McKennan C, Ober C, Nicolae D. Estimation and inference in metabolomics with non-random missing data and latent factors. Ann Appl Stat 2020; 14:789-808. [PMID: 34221212 PMCID: PMC8248477 DOI: 10.1214/20-aoas1328] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 02/03/2023]
Abstract
High-throughput metabolomics data are fraught with both non-ignorable missing observations and unobserved factors that influence a metabolite's measured concentration, and it is well known that ignoring either of these complications can compromise estimators. However, current methods to analyze these data can account for either the missing data or the unobserved factors, but not both. We therefore developed MetabMiss, a statistically rigorous method to account for both non-random missing data and latent factors in high-throughput metabolomics data. Our methodology does not require the practitioner to specify a likelihood for the missing data, and makes investigating the relationship between the metabolome and tens, or even hundreds, of phenotypes computationally tractable. We demonstrate the fidelity of MetabMiss's estimates using both simulated and real metabolomics data, and prove their asymptotic correctness when the sample size and number of metabolites grow to infinity.