1. Ye X, Yang S, Wang X, Liu Y. Integrative analysis of high-dimensional RCT and RWD subject to censoring and hidden confounding. Lifetime Data Analysis 2025:10.1007/s10985-025-09654-1. [PMID: 40301269 DOI: 10.1007/s10985-025-09654-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2024] [Accepted: 03/31/2025] [Indexed: 05/01/2025]
Abstract
In this study, we focus on estimating the heterogeneous treatment effect (HTE) for survival outcomes. The outcome is subject to censoring and the covariates are high-dimensional. We utilize data from both a randomized controlled trial (RCT), considered the gold standard, and real-world data (RWD), possibly affected by hidden confounding factors. To achieve a more efficient HTE estimate, such integrative analysis requires great insight into the data generation mechanism, particularly the accurate characterization of unmeasured confounding effects/bias. With this aim, we propose a penalized-regression-based integrative approach that allows for the simultaneous estimation of parameters, selection of variables, and identification of the existence of unmeasured confounding effects. The consistency, asymptotic normality, and efficiency gains are rigorously established for the proposed estimate. Finally, we apply the proposed method to estimate the HTE of lobar/sublobar resection on the survival of lung cancer patients. The RCT is a multicenter non-inferiority randomized phase 3 trial, and the RWD comes from a clinical oncology cancer registry in the United States. The analysis reveals that unmeasured confounding exists and that the integrative approach does enhance the efficiency of the HTE estimation.
Affiliation(s)
- Xin Ye
  - School of Statistics and Mathematics, Guangdong University of Finance and Economics, Guangzhou, China
  - School of Mathematics and Statistics, Wuhan University, Wuhan, China
- Shu Yang
  - Department of Statistics, North Carolina State University, North Carolina, USA
- Xiaofei Wang
  - Department of Biostatistics and Bioinformatics, Duke University, Duke, USA
- Yanyan Liu
  - School of Mathematics and Statistics, Wuhan University, Wuhan, China

2. Chang C, Bu Z, Long Q. CEDAR: communication efficient distributed analysis for regressions. Biometrics 2023; 79:2357-2369. [PMID: 36305019 PMCID: PMC10133408 DOI: 10.1111/biom.13786] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 08/02/2021] [Accepted: 10/05/2022] [Indexed: 11/27/2022]
Abstract
Electronic health records (EHRs) offer great promise for advancing precision medicine and, at the same time, present significant analytical challenges. In particular, patient-level data in EHRs often cannot be shared across institutions (data sources) due to government regulations and/or institutional policies. As a result, there is growing interest in distributed learning over multiple EHR databases without sharing patient-level data. To tackle such challenges, we propose a novel communication efficient method that aggregates the optimal estimates of external sites, by turning the problem into a missing data problem. In addition, we propose incorporating posterior samples of remote sites, which can provide partial information on the missing quantities and improve efficiency of parameter estimates while having the differential privacy property and thus reducing the risk of information leakage. The proposed approach, without sharing the raw patient-level data, allows for proper statistical inference. We provide theoretical investigation of the asymptotic properties of the proposed method for statistical inference as well as differential privacy, and evaluate its performance in simulations and real data analyses in comparison with several recently developed methods.
Affiliation(s)
- C. Chang
  - University of Pennsylvania, PA 19104, USA
- Z. Bu
  - University of Pennsylvania, PA 19104, USA
- Q. Long
  - University of Pennsylvania, PA 19104, USA

3. Liu Y, Sun W, Hsu L, He Q. Statistical inference for high-dimensional pathway analysis with multiple responses. Comput Stat Data Anal 2022; 169. [PMID: 35125572 PMCID: PMC8813039 DOI: 10.1016/j.csda.2021.107418] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Abstract
Pathway analysis, i.e., grouping analysis, has important applications in genomic studies. Existing pathway analysis approaches are mostly focused on a single response and are not suitable for analyzing complex diseases that are often related to multiple response variables. Although a handful of approaches have been developed for multiple responses, these methods are mainly designed for pathways with a moderate number of features. A multi-response pathway analysis approach that is able to conduct statistical inference when the dimension is potentially higher than the sample size is introduced. Asymptotic properties of the test statistic are established and theoretical investigation of the statistical power is conducted. Simulation studies and real data analysis show that the proposed approach performs well in identifying important pathways that influence multiple expression quantitative trait loci (eQTL).

4. Hu Z, Zhou Y, Tong T. Meta-Analyzing Multiple Omics Data With Robust Variable Selection. Front Genet 2021; 12:656826. [PMID: 34290735 PMCID: PMC8288516 DOI: 10.3389/fgene.2021.656826] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2021] [Accepted: 05/24/2021] [Indexed: 12/03/2022] [Open Access]
Abstract
High-throughput omics data are becoming more and more popular in various areas of science. Given that many publicly available datasets address the same questions, researchers have applied meta-analysis to synthesize multiple datasets to achieve more reliable results for model estimation and prediction. Due to the high dimensionality of omics data, it is also desirable to incorporate variable selection into meta-analysis. Existing meta-analyzing variable selection methods are often sensitive to the presence of outliers, and may lead to missed detections of relevant covariates, especially for lasso-type penalties. In this paper, we develop a robust variable selection algorithm for meta-analyzing high-dimensional datasets based on logistic regression. We first search for an outlier-free subset from each dataset by borrowing information across the datasets through repeated use of the least trimmed squares estimates for the logistic model, together with a hierarchical bi-level variable selection technique. We then refine a reweighting step to further improve the efficiency after obtaining a reliable non-outlier subset. Simulation studies and real data analysis show that our new method can provide more reliable results than the existing meta-analysis methods in the presence of outliers.
Affiliation(s)
- Zongliang Hu
  - College of Mathematics and Statistics, Shenzhen University, Shenzhen, China
- Yan Zhou
  - College of Mathematics and Statistics, Shenzhen University, Shenzhen, China
- Tiejun Tong
  - Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong

5. Cai T, Liu M, Xia Y. Individual Data Protected Integrative Regression Analysis of High-Dimensional Heterogeneous Data. J Am Stat Assoc 2021; 117:2105-2119. [PMID: 37975021 PMCID: PMC10653033 DOI: 10.1080/01621459.2021.1904958] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/31/2019] [Revised: 03/05/2021] [Accepted: 03/13/2021] [Indexed: 01/29/2023]
Abstract
Evidence-based decision making often relies on meta-analyzing multiple studies, which enables more precise estimation and investigation of generalizability. Integrative analysis of multiple heterogeneous studies is, however, highly challenging in the ultra high-dimensional setting. The challenge is even more pronounced when the individual-level data cannot be shared across studies, known as the DataSHIELD constraint. Under sparse regression models that are assumed to be similar yet not identical across studies, we propose in this paper a novel integrative estimation procedure for data-Shielding High-dimensional Integrative Regression (SHIR). SHIR protects individual data through a summary-statistics-based integrating procedure, accommodates between-study heterogeneity in both the covariate distribution and model parameters, and attains consistent variable selection. Theoretically, SHIR is statistically more efficient than the existing distributed approaches that integrate debiased LASSO estimators from the local sites. Furthermore, the estimation error incurred by aggregating derived data is negligible compared to the statistical minimax rate and SHIR is shown to be asymptotically equivalent in estimation to the ideal estimator obtained by sharing all data. The finite-sample performance of our method is studied and compared with existing approaches via extensive simulation settings. We further illustrate the utility of SHIR to derive phenotyping algorithms for coronary artery disease using electronic health records data from multiple chronic disease cohorts.
Affiliation(s)
- Tianxi Cai
  - Department of Biostatistics, Harvard School of Public Health, Harvard University, Boston, USA
- Molei Liu
  - Department of Biostatistics, Harvard School of Public Health, Harvard University, Boston, USA
- Yin Xia
  - Department of Statistics, School of Management, Fudan University, Shanghai, China

6. Sheng Y, Sun Y, Huang CY, Kim MO. Synthesizing external aggregated information in the presence of population heterogeneity: A penalized empirical likelihood approach. Biometrics 2021; 78:679-690. [PMID: 33528028 DOI: 10.1111/biom.13429] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/14/2020] [Revised: 08/23/2020] [Accepted: 12/31/2020] [Indexed: 01/04/2023]
Abstract
With the increasing availability of data in the public domain, there has been a growing interest in exploiting information from external sources to improve the analysis of smaller scale studies. An emerging challenge in the era of big data is that the subject-level data are high dimensional, but the external information is at an aggregate level and of a lower dimension. Moreover, heterogeneity and uncertainty in the auxiliary information are often not accounted for in information synthesis. In this paper, we propose a unified framework to summarize various forms of aggregated information via estimating equations and develop a penalized empirical likelihood approach to incorporate such information in logistic regression. When the homogeneity assumption is violated, we extend the method to account for population heterogeneity among different sources of information. When the uncertainty in the external information is not negligible, we propose a variance estimator adjusting for the uncertainty. The proposed estimators are asymptotically more efficient than the conventional penalized maximum likelihood estimator and enjoy the oracle property even with a diverging number of predictors. Simulation studies show that the proposed approaches yield higher accuracy in variable selection compared with competitors. We illustrate the proposed methodologies with a pediatric kidney transplant study.
Affiliation(s)
- Ying Sheng
  - Department of Epidemiology & Biostatistics, University of California at San Francisco, San Francisco, California
- Yifei Sun
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, New York
- Chiung-Yu Huang
  - Department of Epidemiology & Biostatistics, University of California at San Francisco, San Francisco, California
  - UCSF Helen Diller Family Comprehensive Cancer Center, San Francisco, California
- Mi-Ok Kim
  - Department of Epidemiology & Biostatistics, University of California at San Francisco, San Francisco, California
  - UCSF Helen Diller Family Comprehensive Cancer Center, San Francisco, California

7. Hong C, Wang Y, Cai T. A divide-and-conquer method for sparse risk prediction and evaluation. Biostatistics 2020; 23:397-411. [PMID: 32909599 DOI: 10.1093/biostatistics/kxaa031] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/12/2019] [Revised: 07/06/2020] [Accepted: 07/10/2020] [Indexed: 11/12/2022] [Open Access]
Abstract
Divide-and-conquer (DAC) is a commonly used strategy to overcome the challenges of extraordinarily large data, by first breaking the dataset into a series of data blocks, then combining results from individual data blocks to obtain a final estimate. Various DAC algorithms have been proposed to fit a sparse predictive regression model in the $L_1$ regularization setting. However, many existing DAC algorithms remain computationally intensive when the sample size and the number of candidate predictors are both large. In addition, no existing DAC procedures provide inference for quantifying the accuracy of risk prediction models. In this article, we propose a screening and one-step linearization infused DAC (SOLID) algorithm to fit sparse logistic regression to massive datasets, by integrating the DAC strategy with a screening step and sequences of linearization. This enables us to maximize the likelihood with only selected covariates and perform penalized estimation via a fast approximation to the likelihood. To assess the accuracy of a predictive regression model, we develop a modified cross-validation (MCV) that utilizes the side products of SOLID, substantially reducing the computational burden. Compared with existing DAC methods, the MCV procedure is the first to make inference on accuracy. Extensive simulation studies suggest that the proposed SOLID and MCV procedures substantially outperform the existing methods with respect to computational speed and achieve statistical efficiency similar to that of the full sample-based estimator. We also demonstrate that the proposed inference procedure provides valid interval estimators. We apply the proposed SOLID procedure to develop and validate a classification model for disease diagnosis using narrative clinical notes based on electronic medical record data from Partners HealthCare.
Affiliation(s)
- Chuan Hong
  - Department of Biomedical Informatics, Harvard Medical School, Boston, 02115, MA, USA
- Yan Wang
  - Department of Biomedical Informatics, Harvard Medical School, Boston, 02115, MA, USA
- Tianxi Cai
  - Department of Biomedical Informatics, Harvard Medical School, Boston, 02115, MA, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, 02115, MA, USA

8. Yang T, Kim J, Wu C, Ma Y, Wei P, Pan W. An adaptive test for meta-analysis of rare variant association studies. Genet Epidemiol 2020; 44:104-116. [PMID: 31830326 PMCID: PMC6980317 DOI: 10.1002/gepi.22273] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 08/15/2019] [Revised: 11/12/2019] [Accepted: 11/25/2019] [Indexed: 01/02/2023]
Abstract
Single genome-wide studies may be underpowered to detect trait-associated rare variants with moderate or weak effect sizes. As a viable alternative, meta-analysis is widely used to increase power by combining different studies. The power of meta-analysis critically depends on the underlying association patterns and heterogeneity levels, which are unknown and vary from locus to locus. However, existing methods mainly focus on one or only a few combinations of the association pattern and heterogeneity level, and thus may lose power in many situations. To address this issue, we propose a general and unified framework by combining a class of tests including and beyond some existing ones, leading to high power across a wide range of scenarios. We demonstrate that the proposed test is more powerful than some existing methods in simulation studies, and then show its performance with the NHLBI Exome-Sequencing Project (ESP) data. One gene (B4GALNT2) was found by our proposed test, but not by others, to be statistically significantly associated with plasma triglyceride. The signal was driven by African-ancestry subjects but it was previously reported to be associated with coronary artery disease among European-ancestry subjects. We implemented our method in an R package aSPUmeta, publicly available at https://github.com/ytzhong/metaRV and will be on CRAN soon.
Affiliation(s)
- Tianzhong Yang
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
- Junghi Kim
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
- Chong Wu
  - Department of Statistics, Florida State University, Tallahassee, FL, USA
- Yiding Ma
  - Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Peng Wei
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Wei Pan
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA

9. Liu Y, Sun W, Reiner AP, Kooperberg C, He Q. Statistical inference of genetic pathway analysis in high dimensions. Biometrika 2019; 106:651. [PMID: 31427824 DOI: 10.1093/biomet/asz033] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 07/03/2017] [Indexed: 11/13/2022] [Open Access]
Abstract
Genetic pathway analysis has become an important tool for investigating the association between a group of genetic variants and traits. With dense genotyping and extensive imputation, the number of genetic variants in biological pathways has increased considerably and sometimes exceeds the sample size $n$. Conducting genetic pathway analysis and statistical inference in such settings is challenging. We introduce an approach that can handle pathways whose dimension $p$ could be greater than $n$. Our method can be used to detect pathways that have nonsparse weak signals, as well as pathways that have sparse but stronger signals. We establish the asymptotic distribution for the proposed statistic and conduct theoretical analysis on its power. Simulation studies show that our test has correct Type I error control and is more powerful than existing approaches. An application to a genome-wide association study of high-density lipoproteins demonstrates the proposed approach.
Affiliation(s)
- Yang Liu
  - Department of Mathematics and Statistics, Wright State University, 3640 Colonel Glenn Highway, Dayton, Ohio, U.S.A
- Wei Sun
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, Washington, U.S.A
- Alexander P Reiner
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, Washington, U.S.A
- Charles Kooperberg
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, Washington, U.S.A
- Qianchuan He
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, Washington, U.S.A

10. Silva-Fernández L, Carmona L. Meta-analysis in the era of big data. Clin Rheumatol 2019; 38:2027-2028. [PMID: 31273634 DOI: 10.1007/s10067-019-04666-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 06/20/2019] [Revised: 06/20/2019] [Accepted: 06/26/2019] [Indexed: 02/07/2023]
Affiliation(s)
- Lucía Silva-Fernández
  - Rheumatology Department, Complexo Hospitalario Universitario de A Coruña, A Coruña, Spain
- Loreto Carmona
  - Instituto de Salud Músculo-Esquelética (INMUSC), Madrid, Spain

11. el Bouhaddani S, Uh HW, Hayward C, Jongbloed G, Houwing-Duistermaat J. Probabilistic partial least squares model: Identifiability, estimation and application. J Multivariate Anal 2018. [DOI: 10.1016/j.jmva.2018.05.009] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 11/29/2022]