1
|
Hector EC, Song PXK. Joint integrative analysis of multiple data sources with correlated vector outcomes. Ann Appl Stat 2022. [DOI: 10.1214/21-aoas1563] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
2
|
Hector EC, Song PXK. A Distributed and Integrated Method of Moments for High-Dimensional Correlated Data Analysis. J Am Stat Assoc 2021; 116:805-818. [PMID: 34168390 DOI: 10.1080/01621459.2020.1736082] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 10/24/2022]
Abstract
This paper is motivated by a regression analysis of electroencephalography (EEG) neuroimaging data with high-dimensional correlated responses with multi-level nested correlations. We develop a divide-and-conquer procedure implemented in a fully distributed and parallelized computational scheme for statistical estimation and inference of regression parameters. Despite significant efforts in the literature, the computational bottleneck associated with high-dimensional likelihoods prevents the scalability of existing methods. The proposed method addresses this challenge by dividing responses into subvectors to be analyzed separately and in parallel on a distributed platform using pairwise composite likelihood. Theoretical challenges related to combining results from dependent data are overcome in a statistically efficient way using a meta-estimator derived from Hansen's generalized method of moments. We provide a rigorous theoretical framework for efficient estimation, inference, and goodness-of-fit tests. We develop an R package for ease of implementation. We illustrate our method's performance with simulations and the analysis of the EEG data, and find that iron deficiency is significantly associated with two auditory recognition memory related potentials in the left parietal-occipital region of the brain.
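As a rough illustration of why the combination step has to account for dependence between block-level results, the sketch below combines several correlated estimates of one scalar parameter using generalized-least-squares weights. It is not the paper's meta-estimator or its R package: the function name `combine_dependent_estimates`, the assumption that the joint covariance matrix `V` is known, and the numbers are all hypothetical and chosen only for illustration.

```python
import numpy as np

def combine_dependent_estimates(estimates, V):
    """Combine correlated estimates of one scalar parameter (illustrative only).

    estimates: length-K array of block-level estimates (hypothetical input).
    V: K x K joint covariance of those estimates, assumed known here.
    Returns the combined estimate, the weights, and the combined variance.
    """
    ones = np.ones(len(estimates))
    # Solve V x = 1 rather than forming the inverse explicitly.
    precision_times_ones = np.linalg.solve(V, ones)
    normaliser = ones @ precision_times_ones
    weights = precision_times_ones / normaliser
    combined = weights @ estimates
    variance = 1.0 / normaliser
    return combined, weights, variance

# Two correlated block estimates; the less variable one receives more weight.
estimates = np.array([1.2, 0.8])
V = np.array([[1.0, 0.5],
              [0.5, 2.0]])
combined, weights, variance = combine_dependent_estimates(estimates, V)
```

In this toy case the weights sum to one and the combined variance is no larger than the smallest single-block variance, which is the efficiency gain that a dependence-aware combination aims for.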
|
3
|
Liang W, Ma S, Zhang Q, Zhu T. Integrative sparse partial least squares. Stat Med 2021; 40:2239-2256. [PMID: 33559203 PMCID: PMC8071349 DOI: 10.1002/sim.8900] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/08/2020] [Revised: 11/26/2020] [Accepted: 01/20/2021] [Indexed: 01/28/2023]
Abstract
Partial least squares, as a dimension reduction technique, has become increasingly important for its ability to deal with problems with a large number of variables. Since noisy variables may weaken estimation performance, the sparse partial least squares (SPLS) technique has been proposed to identify important variables and generate more interpretable results. However, the small sample size of a single dataset limits the performance of conventional methods. An effective solution comes from gathering information from multiple comparable studies. Integrative analysis is essential when analyzing multiple datasets. The main idea is to improve performance by assembling raw data from multiple independent datasets and analyzing them jointly. In this article, we develop an integrative SPLS (iSPLS) method using penalization based on the SPLS technique. The proposed approach consists of two penalties. The first penalty conducts variable selection in the context of integrative analysis. The second penalty, a contrasted penalty, is imposed to encourage the similarity of estimates across datasets and generate more sensible and accurate results. Computational algorithms are developed. Simulation experiments are conducted to compare iSPLS with alternative approaches. The practical utility of iSPLS is shown in the analysis of two TCGA gene expression datasets.
Affiliation(s)
- Weijuan Liang: School of Statistics, Renmin University of China, Beijing, China
- Shuangge Ma: School of Public Health, Yale University, New Haven, Connecticut
- Qingzhao Zhang: MOE Key Laboratory of Econometrics, Department of Statistics, School of Economics, The Wang Yanan Institute for Studies in Economics, and Fujian Key Lab of Statistics, Xiamen University, Xiamen, China
- Tingyu Zhu: Department of Statistics, Oregon State University, Corvallis, Oregon
|
4
|
Tang L, Song PXK. Poststratification fusion learning in longitudinal data analysis. Biometrics 2020; 77:914-928. [PMID: 32683671 DOI: 10.1111/biom.13333] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2019] [Revised: 05/12/2020] [Accepted: 07/08/2020] [Indexed: 11/28/2022]
Abstract
Stratification is a very commonly used approach in biomedical studies to handle sample heterogeneity arising from, for example, clinical units, patient subgroups, or missing data. A key rationale behind such an approach is to overcome potential sampling biases in statistical inference. Two issues with such a stratification-based strategy are (i) whether individual strata are sufficiently distinctive to warrant stratification, and (ii) whether sample size attrition resulting from the stratification may lead to loss of statistical power. To address these issues, we propose a penalized generalized estimating equations approach to reducing the complexity of parametric model structures due to excessive stratification. Specifically, we develop a data-driven fusion learning approach for longitudinal data that improves estimation efficiency by integrating information across similar strata, yet still allows necessary separation for stratum-specific conclusions. The proposed method is evaluated by simulation studies and applied to a motivating example from a psychiatric study to demonstrate its usefulness in real-world settings.
Affiliation(s)
- Lu Tang: Department of Biostatistics, University of Pittsburgh, Pittsburgh, Pennsylvania
- Peter X-K Song: Department of Biostatistics, University of Michigan, Ann Arbor, Michigan
|
5
|
Tang L, Zhou L, Song PXK. Distributed Simultaneous Inference in Generalized Linear Models via Confidence Distribution. J Multivariate Anal 2019; 176. [PMID: 32863459 DOI: 10.1016/j.jmva.2019.104567] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 01/10/2023]
Abstract
We propose a distributed method for simultaneous inference for datasets with sample size much larger than the number of covariates, i.e., N ≫ p, in the generalized linear models framework. When such datasets are too big to be analyzed entirely by a single centralized computer, or when datasets are already stored in distributed database systems, the strategy of divide-and-combine has been the method of choice for scalability. Due to partition, the sub-dataset sample sizes may be uneven and some possibly close to p, which calls for regularization techniques to improve numerical stability. However, there is a lack of clear theoretical justification and practical guidelines to combine results obtained from separate regularized estimators, especially when the final objective is simultaneous inference for a group of regression parameters. In this paper, we develop a strategy to combine bias-corrected lasso-type estimates by using confidence distributions. We show that the resulting combined estimator achieves the same estimation efficiency as that of the maximum likelihood estimator using the centralized data. As demonstrated by simulated and real data examples, our divide-and-combine method yields nearly identical inference as the centralized benchmark.
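The divide-and-combine idea in the abstract above can be sketched in a deliberately simplified setting. The example below assumes a Gaussian linear model fitted by ordinary least squares on each block, and combines the block estimates with information-matrix weights; it swaps in unpenalized least squares for the paper's bias-corrected lasso and does not construct confidence distributions. The function name `combine_block_estimates` and the simulated data are hypothetical.

```python
import numpy as np

def combine_block_estimates(X, y, n_blocks):
    """Information-weighted combination of per-block least-squares fits.

    Simplified stand-in: each block is fitted by ordinary least squares and
    weighted by its information matrix X_k' X_k (assumption for illustration).
    """
    p = X.shape[1]
    info_sum = np.zeros((p, p))
    weighted_sum = np.zeros(p)
    blocks = zip(np.array_split(X, n_blocks), np.array_split(y, n_blocks))
    for X_k, y_k in blocks:
        info_k = X_k.T @ X_k                           # block information
        beta_k = np.linalg.solve(info_k, X_k.T @ y_k)  # block estimate
        info_sum += info_k
        weighted_sum += info_k @ beta_k
    # Combined estimate: (sum_k J_k)^{-1} sum_k J_k beta_k
    return np.linalg.solve(info_sum, weighted_sum)

# Simulated data with N much larger than p.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
beta_true = np.array([1.0, -2.0, 0.5, 0.0])
y = X @ beta_true + rng.normal(size=600)

beta_combined = combine_block_estimates(X, y, n_blocks=5)
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]  # centralized benchmark
```

In this linear special case the combined estimate reproduces the centralized least-squares fit up to floating-point error, which gives some intuition for the abstract's claim that the combined estimator matches the efficiency of the full-data analysis.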
Affiliation(s)
- Lu Tang: Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261, USA
- Ling Zhou: Center of Statistical Research, Southwestern University of Finance and Economics, Chengdu, Sichuan, China
- Peter X-K Song: Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
|
6
|
Abstract
We propose a fusion learning procedure to perform regression coefficients clustering in the Cox proportional hazards model when parameters are partially heterogeneous across certain predefined subgroups, such as age groups. One major issue is that the same covariate may have a different influence on the survival time across different subgroups. Learning differences in covariate effects is of critical importance to understanding the model heterogeneity resulting from between-group heterogeneity, especially when the number of subgroups is large. We establish a computationally efficient procedure to learn the heterogeneous patterns of regression coefficients across the subgroups in the Cox proportional hazards model. Using a fusion learning algorithm coupled with the estimated parameter ordering, the proposed method greatly reduces the computational burden with little loss of statistical power. Extensive simulation studies are conducted to evaluate the performance of our method. Finally, with a comparison to some popular conventional methods, we illustrate the proposed method with a vehicle leasing contract renewal analysis.
|
7
|
|
8
|
Xu C, Fang J, Shen H, Wang YP, Deng HW. EPS-LASSO: test for high-dimensional regression under extreme phenotype sampling of continuous traits. Bioinformatics 2018; 34:1996-2003. [PMID: 29385408 PMCID: PMC6454442 DOI: 10.1093/bioinformatics/bty042] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 10/09/2017] [Accepted: 01/24/2018] [Indexed: 01/19/2023] Open
Abstract
Motivation: Extreme phenotype sampling (EPS) is a broadly used design to identify candidate genetic factors contributing to the variation of quantitative traits. By enriching the signals in extreme phenotypic samples, EPS can boost the association power compared to random sampling. Most existing statistical methods for EPS examine the genetic factors individually, even though many quantitative traits have multiple genetic factors underlying their variation. It is desirable to model the joint effects of genetic factors, which may increase the power and identify novel quantitative trait loci under EPS. The joint analysis of genetic data in high-dimensional situations requires specialized techniques, e.g. the least absolute shrinkage and selection operator (LASSO). Although there is extensive research and application related to LASSO, the statistical inference and testing for the sparse model under EPS remain unknown. Results: We propose a novel sparse model (EPS-LASSO) with a hypothesis test for high-dimensional regression under EPS based on a decorrelated score function. The comprehensive simulation shows that EPS-LASSO outperforms existing methods with stable type I error and FDR control. EPS-LASSO can provide consistent power for both low- and high-dimensional situations compared with the other methods dealing with high-dimensional situations. The power of EPS-LASSO is close to that of other low-dimensional methods when the causal effect sizes are small and is superior when the effects are large. Applying EPS-LASSO to a transcriptome-wide gene expression study for obesity reveals 10 significant body mass index associated genes. Our results indicate that EPS-LASSO is an effective method for EPS data analysis, which can account for correlated predictors. Availability and implementation: The source code is available at https://github.com/xu1912/EPSLASSO. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chao Xu: Center of Bioinformatics and Genomics, Tulane University, New Orleans, LA, USA; Department of Global Biostatistics and Data Science, Tulane University, New Orleans, LA, USA
- Jian Fang: Center of Bioinformatics and Genomics, Tulane University, New Orleans, LA, USA; Department of Biomedical Engineering, Tulane University, New Orleans, LA, USA
- Hui Shen: Center of Bioinformatics and Genomics, Tulane University, New Orleans, LA, USA; Department of Global Biostatistics and Data Science, Tulane University, New Orleans, LA, USA
- Yu-Ping Wang: Center of Bioinformatics and Genomics, Tulane University, New Orleans, LA, USA; Department of Biomedical Engineering, Tulane University, New Orleans, LA, USA
- Hong-Wen Deng (to whom correspondence should be addressed): Center of Bioinformatics and Genomics, Tulane University, New Orleans, LA, USA; Department of Global Biostatistics and Data Science, Tulane University, New Orleans, LA, USA; Laboratory of Molecular and Statistical Genetics, College of Life Sciences, Hunan Normal University, Changsha, Hunan, China
|
9
|
Tao Y, Wang L. Adaptive contrast weighted learning for multi-stage multi-treatment decision-making. Biometrics 2016; 73:145-155. [DOI: 10.1111/biom.12539] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 11/01/2015] [Revised: 03/01/2016] [Accepted: 04/01/2016] [Indexed: 11/28/2022]
Affiliation(s)
- Yebin Tao: Department of Biostatistics; University of Michigan; Ann Arbor, Michigan 48109 U.S.A.
- Lu Wang: Department of Biostatistics; University of Michigan; Ann Arbor, Michigan 48109 U.S.A.
|
10
|
Tang L, Song PXK. Fused Lasso Approach in Regression Coefficients Clustering - Learning Parameter Heterogeneity in Data Integration. Journal of Machine Learning Research: JMLR 2016; 17:113. [PMID: 29056876] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
As data sets of related studies become more easily accessible, combining data sets of similar studies is often undertaken in practice to achieve a larger sample size and higher power. A major challenge arising from data integration pertains to data heterogeneity in terms of study population, study design, or study coordination. Ignoring such heterogeneity in data analysis may result in biased estimation and misleading inference. Traditional techniques of remedy to data heterogeneity include the use of interactions and random effects, which are inferior to achieving desirable statistical power or providing a meaningful interpretation, especially when a large number of smaller data sets are combined. In this paper, we propose a regularized fusion method that allows us to identify and merge inter-study homogeneous parameter clusters in regression analysis, without the use of hypothesis testing approach. Using the fused lasso, we establish a computationally efficient procedure to deal with large-scale integrated data. Incorporating the estimated parameter ordering in the fused lasso facilitates computing speed with no loss of statistical power. We conduct extensive simulation studies and provide an application example to demonstrate the performance of the new method with a comparison to the conventional methods.
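One way to picture how an estimated parameter ordering can speed up fusion is to compare the number of penalty terms involved. The sketch below sorts hypothetical study-specific estimates of a single coefficient and penalizes only differences between neighbours in that order, instead of all pairs. It illustrates the ordering idea only and is not the paper's algorithm or solver; the helper names `ordered_fusion_pairs` and `fusion_penalty`, the tuning value `lam`, and the numbers are assumptions.

```python
import numpy as np
from itertools import combinations

def ordered_fusion_pairs(initial_estimates):
    """Index pairs that are adjacent once studies are sorted by their
    initial estimates of the coefficient (illustrative helper)."""
    order = np.argsort(initial_estimates)
    return [(int(order[i]), int(order[i + 1])) for i in range(len(order) - 1)]

def fusion_penalty(coefs, pairs, lam):
    """L1 fusion penalty: lam times the sum of |coef_i - coef_j| over pairs."""
    return lam * sum(abs(coefs[i] - coefs[j]) for i, j in pairs)

# Hypothetical initial estimates of one coefficient from four studies.
initial = np.array([0.90, 0.10, 1.00, 0.12])

adjacent_pairs = ordered_fusion_pairs(initial)              # K - 1 terms
all_pairs = list(combinations(range(len(initial)), 2))      # K(K - 1)/2 terms
penalty = fusion_penalty(initial, adjacent_pairs, lam=1.0)
```

With K studies, the ordered version keeps K - 1 penalty terms rather than K(K - 1)/2, which suggests where the computational saving described in the abstract could come from; studies whose neighbouring coefficients are shrunk together would then form a cluster.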
Affiliation(s)
- Lu Tang: Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
- Peter X K Song: Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
|