1
Shi X, Pan Z, Miao W. Data Integration in Causal Inference. WILEY INTERDISCIPLINARY REVIEWS. COMPUTATIONAL STATISTICS 2023; 15:e1581. [PMID: 36713955 PMCID: PMC9880960 DOI: 10.1002/wics.1581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/03/2021] [Revised: 02/24/2022] [Accepted: 03/01/2022] [Indexed: 04/12/2023]
Abstract
Integrating data from multiple heterogeneous sources has become increasingly popular as a way to achieve a large sample size and a diverse study population. This paper reviews developments in causal inference methods that combine multiple datasets collected under potentially different designs from potentially heterogeneous populations. We summarize recent advances on combining randomized clinical trials with external information from observational studies or historical controls, combining samples when no single sample has all relevant variables with application to two-sample Mendelian randomization, distributed data settings under privacy concerns for comparative effectiveness and safety research using real-world data, Bayesian causal inference, and causal discovery methods.
Affiliation(s)
- Xu Shi
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
- Ziyang Pan
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
- Wang Miao
  - Department of Probability and Statistics, Peking University, Beijing, China
2
Miao W, Li W, Hu W, Wang R, Geng Z. Invited Commentary: Estimation and Bounds Under Data Fusion. Am J Epidemiol 2022; 191:674-678. [PMID: 34240101 DOI: 10.1093/aje/kwab194] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/28/2021] [Revised: 05/02/2021] [Accepted: 05/17/2021] [Indexed: 11/12/2022] Open
Abstract
In their recent article, Ogburn et al. (Am J Epidemiol. 2021;190(6):1142-1147) raised a cautionary tale for epidemiologic data fusion: Bias may occur if a variable that is completely missing in the primary data set is imputed according to a regression model estimated from an auxiliary data set. However, in some specific settings, a solution may exist. Focusing on a linear outcome regression model with a missing covariate, we show that the bias can be eliminated if the underlying imputation model for the missing covariate is nonlinear in the common variables measured in both data sets. Otherwise, we describe 2 alternative approaches existing in the data fusion literature that could partially resolve this issue: One fits the outcome model by leveraging an additional validation data set containing joint observations of the outcome and the missing covariate, and the other offers informative bounds for the outcome regression coefficients without using validation data. We justify these 3 methods in a linear outcome model and briefly discuss their extension to general settings.
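The first point in this abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: it assumes a hypothetical data-generating process in which the missing covariate Z depends on the common variable X through X squared, and hypothetical true coefficients (1, 2, 3). The helper names `solve`, `ols`, and `simulate` are introduced here for illustration only. The imputation model is fitted on auxiliary data containing (X, Z), and the outcome model is then fitted on primary data containing (X, Y) with Z replaced by its imputed mean.

```python
import random

def solve(A, b):
    """Solve a small linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)

def simulate(n=20000, seed=0):
    rng = random.Random(seed)
    b0, bx, bz = 1.0, 2.0, 3.0  # hypothetical true outcome-model coefficients
    # Auxiliary data: (X, Z) observed; Z is nonlinear in X (assumption).
    xa = [rng.gauss(0, 1) for _ in range(n)]
    za = [x * x + rng.gauss(0, 1) for x in xa]
    # Primary data: (X, Y) observed; Z is generated but never used for fitting.
    xp = [rng.gauss(0, 1) for _ in range(n)]
    zp = [x * x + rng.gauss(0, 1) for x in xp]
    yp = [b0 + bx * x + bz * z + rng.gauss(0, 1) for x, z in zip(xp, zp)]
    # Step 1: fit the imputation model E[Z | X] = g0 + g1*X + g2*X^2 on auxiliary data.
    g = ols([[1.0, x, x * x] for x in xa], za)
    # Step 2: impute Z in the primary data and regress Y on (1, X, imputed Z).
    zhat = [g[0] + g[1] * x + g[2] * x * x for x in xp]
    return ols([[1.0, x, zh] for x, zh in zip(xp, zhat)], yp)

estimates = simulate()
print([round(e, 2) for e in estimates])
```

Under these assumptions the three estimates should land close to the hypothetical true values, because 1, X, and the imputed mean of Z are not collinear. If the imputation model were linear in X instead, the imputed covariate would be an exact linear combination of 1 and X, and the outcome regression could not separate the two slopes.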
3
Li H, Jia J, Yan R, Xue F, Geng Z. A causal data fusion method for the general exposure and outcome. Stat Med 2021; 41:328-339. [PMID: 34729799 DOI: 10.1002/sim.9239] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2020] [Revised: 10/10/2021] [Accepted: 10/12/2021] [Indexed: 11/10/2022]
Abstract
With the advent of the big data era, the need to combine multiple individual data sets to estimate causal effects arises naturally in many medical and biological applications, especially when no single data set measures enough confounders to infer the causal effect of an exposure on an outcome. In this article, we extend the method proposed by a previous study to causal data fusion of more than two data sets without external validation and to a more general (continuous or discrete) exposure and outcome. Theoretically, we obtain the condition for identifiability of exposure effects using multiple individual data sources for a continuous or discrete exposure and outcome. The simulation results show that our proposed causal data fusion method yields unbiased causal effect estimates and higher precision than traditional regression, meta-analysis, and statistical matching methods. We further apply our method to study the causal effect of BMI on glucose level in individuals with diabetes by combining two data sets. Our method is essential for causal data fusion and provides important insights into the ongoing discourse on the empirical analysis of merging multiple individual data sources.
Affiliation(s)
- Hongkai Li
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, Shandong, P. R. China; Institute for Medical Dataology, Cheeloo College of Medicine, Shandong University, Jinan, Shandong, P. R. China
- Jinzhu Jia
  - Department of Biostatistics, School of Public Health, Peking University, Beijing, P. R. China
- Ran Yan
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, Shandong, P. R. China; Institute for Medical Dataology, Cheeloo College of Medicine, Shandong University, Jinan, Shandong, P. R. China
- Fuzhong Xue
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, Shandong, P. R. China; Institute for Medical Dataology, Cheeloo College of Medicine, Shandong University, Jinan, Shandong, P. R. China
- Zhi Geng
  - Department of Biostatistics, School of Public Health, Peking University, Beijing, P. R. China; School of Mathematical Sciences, Peking University, Beijing, P. R. China