1
Kang Z, Chen L, Wei P, Xu Z, Li C, Yang T. Estimation of total mediation effect for a binary trait in a case-control study for high-dimensional omics mediators. bioRxiv 2025:2025.01.28.635396. [PMID: 39975081 PMCID: PMC11838279 DOI: 10.1101/2025.01.28.635396] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/21/2025]
Abstract
Mediation analysis helps uncover how exposures impact outcomes through intermediate variables. Traditional mean-based total mediation effect measures can suffer from the cancellation of opposite component-wise effects, and existing methods often lack the power to capture weak effects in high-dimensional mediators. Additionally, most existing work has focused on continuous outcomes, with limited attention to binary outcomes, particularly in case-control studies. To fill this gap, we propose an R² total mediation effect measure under the liability framework, which provides a causal interpretation and is applicable to various high-dimensional mediation models. We develop a cross-fitted, modified Haseman-Elston regression-based estimation procedure tailored for case-control studies, which can also be applied to cohort studies with reduced efficiency. Our estimator remains consistent with non-mediators and weak effect sizes in extensive simulations. Theoretical justification of consistency is provided under mild conditions. In the Women's Health Initiative of 2150 individuals, we found that 89% (CI: 73% - 91%) of the variation in the underlying liability for coronary heart disease associated with BMI can be explained by metabolomics.
Affiliation(s)
- Zhiyu Kang
  - Division of Biostatistics and Health Data Science, University of Minnesota, MN 55455
- Li Chen
  - School of Statistics, University of Minnesota, MN 55455
- Peng Wei
  - Department of Biostatistics, University of Texas MD Anderson Cancer Center, TX 77030
- Zhichao Xu
  - Department of Biostatistics, University of Texas MD Anderson Cancer Center, TX 77030
- Chunlin Li
  - Department of Statistics, Iowa State University, IA 50011
- Tianzhong Yang
  - Division of Biostatistics and Health Data Science, University of Minnesota, MN 55455
2
Wang R, Fang L, Wang Y, Jin J. Identifying Effect Modification of Latent Population Characteristics on Risk Factors with a Sparse Varying Coefficient Regression. bioRxiv 2024:2024.11.30.626101. [PMID: 39677704 PMCID: PMC11642784 DOI: 10.1101/2024.11.30.626101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/17/2024]
Abstract
Leveraging observational data to understand the associations between risk factors and disease outcomes and conduct disease risk prediction is a common task in epidemiology. While traditional linear regression and other machine learning models have been extensively implemented for this task, the associations between risk factors and disease outcomes are typically deemed fixed. In many cases, however, such associations may vary by some underlying features of the individuals, which may involve certain subpopulation characteristics and environmental factors. While data for these latent features may not be available, the observed data on risk factors may have captured some proportion of the variation in these features. Thus extracting latent factors from risk factors and incorporating this effect modification into the model may better capture the underlying data structure and improve inference. We develop a novel regression model with some coefficients varying as functions of latent features extracted from the risk factors. We have demonstrated the superiority of our approach in various data settings via simulation studies. An application on a dataset for lung cancer patients from The Cancer Genome Atlas (TCGA) Program showed that our approach led to a 6% - 118% increase in (AUC-0.5) for distinguishing between different lung cancer stages compared to the classic lasso and elastic net regressions and identified interesting latent effect modifications associated with certain gene pathways.
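The "(AUC-0.5)" metric quoted above measures how far a classifier sits above chance, so a relative increase in it can be much larger than the change in raw AUC. A minimal sketch of that arithmetic, using hypothetical AUC values that do not come from the paper:

```python
def relative_gain_over_chance(auc_baseline, auc_new):
    """Relative increase in (AUC - 0.5), i.e. in performance above chance level."""
    excess_baseline = auc_baseline - 0.5
    excess_new = auc_new - 0.5
    if excess_baseline <= 0:
        raise ValueError("baseline AUC must exceed 0.5 for the ratio to be meaningful")
    return (excess_new - excess_baseline) / excess_baseline


# Hypothetical values for illustration only (not results from the paper).
baseline_auc = 0.60   # e.g. a lasso baseline
improved_auc = 0.65   # e.g. a competing model

gain = relative_gain_over_chance(baseline_auc, improved_auc)
print(f"Relative increase in (AUC - 0.5): {gain:.0%}")
```

With these made-up numbers, a 0.05 rise in raw AUC corresponds to roughly a 50% increase in (AUC-0.5), which is why the reported range can exceed 100%.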
3
Hansen B, Avalos-Pacheco A, Russo M, De Vito R. Fast Variational Inference for Bayesian Factor Analysis in Single and Multi-Study Settings. J Comput Graph Stat 2024; 34:96-108. [PMID: 40161999 PMCID: PMC11949465 DOI: 10.1080/10618600.2024.2356173] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2023] [Accepted: 05/10/2024] [Indexed: 04/02/2025]
Abstract
Factor models are commonly used to analyze high-dimensional data in both single-study and multi-study settings. Bayesian inference for such models relies on Markov Chain Monte Carlo (MCMC) methods, which scale poorly as the number of studies, observations, or measured variables increases. To address this issue, we propose new variational inference algorithms to approximate the posterior distribution of Bayesian latent factor models using the multiplicative gamma process shrinkage prior. The proposed algorithms provide fast approximate inference at a fraction of the time and memory of MCMC-based implementations while maintaining comparable accuracy in characterizing the data covariance matrix. We conduct extensive simulations to evaluate our proposed algorithms and show their utility in estimating the model for high-dimensional multi-study gene expression data in ovarian cancers. Overall, our proposed approaches enable more efficient and scalable inference for factor models, facilitating their use in high-dimensional settings. An R package VIMSFA implementing our methods is available on GitHub (github.com/blhansen/VI-MSFA).
Affiliation(s)
- Alejandra Avalos-Pacheco
  - Applied Statistics Research Unit, TU Wien, Harvard-MIT Center for Regulatory Science, Harvard Medical School
- Roberta De Vito
  - Department of Biostatistics and Data Science Institute, Brown University
4
Wan R, Zhang Y, Peng Y, Tian F, Gao G, Tang F, Jia J, Ge H. Unveiling gene regulatory networks during cellular state transitions without linkage across time points. Sci Rep 2024; 14:12355. [PMID: 38811747 PMCID: PMC11137113 DOI: 10.1038/s41598-024-62850-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2024] [Accepted: 05/22/2024] [Indexed: 05/31/2024]
Abstract
Time-stamped cross-sectional data, which lack linkage across time points, are commonly generated in single-cell transcriptional profiling. Many previous methods for inferring gene regulatory networks (GRNs) driving cell-state transitions relied on constructing single-cell temporal ordering. Introducing COSLIR (COvariance restricted Sparse LInear Regression), we presented a direct approach to reconstructing GRNs that govern cell-state transitions, utilizing only the first and second moments of samples between two consecutive time points. Simulations validated COSLIR's perfect accuracy in the oracle case and demonstrated its robust performance in real-world scenarios. When applied to single-cell RT-PCR and RNAseq datasets in developmental biology, COSLIR competed favorably with existing methods. Notably, its running time remained nearly independent of the number of cells. Therefore, COSLIR emerges as a promising addition to GRN reconstruction methods under cell-state transitions, bypassing the single-cell temporal ordering to enhance accuracy and efficiency in single-cell transcriptional profiling.
Affiliation(s)
- Ruosi Wan
  - Beijing International Center for Mathematical Research, Peking University, Beijing, China
- Yuhao Zhang
  - Biomedical Pioneering Innovation Center, Peking University, Beijing, China
- Yongli Peng
  - Beijing International Center for Mathematical Research, Peking University, Beijing, China
- Feng Tian
  - Biomedical Pioneering Innovation Center, Peking University, Beijing, China
- Ge Gao
  - Biomedical Pioneering Innovation Center, Peking University, Beijing, China
  - Beijing Advanced Innovation Center for Genomics, Peking University, Beijing, China
- Fuchou Tang
  - Biomedical Pioneering Innovation Center, Peking University, Beijing, China
  - Beijing Advanced Innovation Center for Genomics, Peking University, Beijing, China
- Jinzhu Jia
  - School of Public Health and Center for Statistical Science, Peking University, Beijing, China
- Hao Ge
  - Beijing International Center for Mathematical Research, Peking University, Beijing, China
  - Biomedical Pioneering Innovation Center, Peking University, Beijing, China
5
Cheng S, Morel R, Allys E, Ménard B, Mallat S. Scattering spectra models for physics. PNAS Nexus 2024; 3:pgae103. [PMID: 38560525 PMCID: PMC10978061 DOI: 10.1093/pnasnexus/pgae103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/17/2023] [Accepted: 02/16/2024] [Indexed: 04/04/2024]
Abstract
Physicists routinely need probabilistic models for a number of tasks such as parameter inference or the generation of new realizations of a field. Establishing such models for highly non-Gaussian fields is a challenge, especially when the number of samples is limited. In this paper, we introduce scattering spectra models for stationary fields and we show that they provide accurate and robust statistical descriptions of a wide range of fields encountered in physics. These models are based on covariances of scattering coefficients, i.e. wavelet decomposition of a field coupled with a pointwise modulus. After introducing useful dimension reductions taking advantage of the regularity of a field under rotation and scaling, we validate these models on various multiscale physical fields and demonstrate that they reproduce standard statistics, including spatial moments up to fourth order. The scattering spectra provide us with a low-dimensional structured representation that captures key properties encountered in a wide range of physical fields. These generic models can be used for data exploration, classification, parameter inference, symmetry detection, and component separation.
Affiliation(s)
- Sihao Cheng
  - School of Natural Sciences, Institute for Advanced Study, Princeton, NJ 08540, USA
- Rudy Morel
  - Departement d'informatique de l'ENS, ENS, CNRS, PSL University, 75014 Paris, France
- Erwan Allys
  - Laboratoire de Physique de l'Ecole normale supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université Paris Cité, 75014 Paris, France
- Brice Ménard
  - Department of Physics and Astronomy, Johns Hopkins University, Baltimore, MD 21218, USA
- Stéphane Mallat
  - Departement d'informatique de l'ENS, ENS, CNRS, PSL University, 75014 Paris, France
  - Collège de France, 75231 Paris, France
  - Center for Computational Mathematics, Flatiron Institute, New York, NY 10010, USA
6
Bellot A, van der Schaar M. Linear Deconfounded Score Method: Scoring DAGs With Dense Unobserved Confounding. IEEE Trans Neural Netw Learn Syst 2024; 35:4948-4962. [PMID: 38285579 DOI: 10.1109/tnnls.2024.3352657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/31/2024]
Abstract
This article deals with the discovery of causal relations from a combination of observational data and qualitative assumptions about the nature of causality in the presence of unmeasured confounding. We focus on applications where unobserved variables are known to have a widespread effect on many of the observed ones, which makes the problem particularly difficult for constraint-based methods, because most pairs of variables are conditionally dependent given any other subset, rendering the causal effect unidentifiable. In this article, we show that under the principle of independent mechanisms, unobserved confounding in this setting leaves a statistical footprint in the observed data distribution that allows for disentangling spurious and causal effects. Using this insight, we demonstrate that a sparse linear Gaussian directed acyclic graph (DAG) among observed variables may be recovered approximately and propose a simple adjusted score-based causal discovery algorithm that may be implemented with general-purpose solvers and scales to high-dimensional problems. We find, in addition, that despite the conditions we pose to guarantee causal recovery, performance in practice is robust to large deviations in model assumptions, and extensions to nonlinear structural models are possible.
7
Park S, Ceulemans E, Van Deun K. A critical assessment of sparse PCA (research): why (one should acknowledge that) weights are not loadings. Behav Res Methods 2024; 56:1413-1432. [PMID: 37540466 PMCID: PMC10991020 DOI: 10.3758/s13428-023-02099-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 02/16/2023] [Indexed: 08/05/2023]
Abstract
Principal component analysis (PCA) is an important tool for analyzing large collections of variables. It functions both as a pre-processing tool to summarize many variables into components and as a method to reveal structure in data. Different coefficients play a central role in these two uses. One focuses on the weights when the goal is summarization, while one inspects the loadings if the goal is to reveal structure. It is well known that the solutions to the two approaches can be found by singular value decomposition; weights, loadings, and right singular vectors are mathematically equivalent. What is often overlooked is that they are no longer equivalent in the setting of sparse PCA methods, which induce zeros either in the weights or the loadings. The lack of awareness of this difference has led to questionable research practices in sparse PCA. First, in simulation studies data is generated mostly based only on structures with sparse singular vectors or sparse loadings, neglecting the structure with sparse weights. Second, reported results represent local optima as the iterative routines are often initiated with the right singular vectors. In this paper we critically re-assess sparse PCA methods by also including data generating schemes characterized by sparse weights and different initialization strategies. The results show that relying on commonly used data generating models can lead to over-optimistic conclusions. They also highlight the impact of the choice between sparse weights and sparse loadings methods and of the initialization strategies. The practical consequences of this choice are illustrated with empirical datasets.
Affiliation(s)
- S Park
  - Tilburg University, Methods and Statistics, Tilburg, The Netherlands
- E Ceulemans
  - KU Leuven, Psychology and Educational Sciences, Leuven, Belgium
- K Van Deun
  - Tilburg University, Methods and Statistics, Tilburg, The Netherlands
8
Heiling HM, Rashid NU, Li Q, Peng XL, Yeh JJ, Ibrahim JG. Efficient computation of high-dimensional penalized generalized linear mixed models by latent factor modeling of the random effects. Biometrics 2024; 80:ujae016. [PMID: 38497825 PMCID: PMC10946237 DOI: 10.1093/biomtc/ujae016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2023] [Revised: 11/22/2023] [Accepted: 02/16/2024] [Indexed: 03/19/2024]
Abstract
Modern biomedical datasets are increasingly high-dimensional and exhibit complex correlation structures. Generalized linear mixed models (GLMMs) have long been employed to account for such dependencies. However, proper specification of the fixed and random effects in GLMMs is increasingly difficult in high dimensions, and computational complexity grows with increasing dimension of the random effects. We present a novel reformulation of the GLMM using a factor model decomposition of the random effects, enabling scalable computation of GLMMs in high dimensions by reducing the latent space from a large number of random effects to a smaller set of latent factors. We also extend our prior work to estimate model parameters using a modified Monte Carlo Expectation Conditional Minimization algorithm, allowing us to perform variable selection on both the fixed and random effects simultaneously. We show through simulation that through this factor model decomposition, our method can fit high-dimensional penalized GLMMs faster than comparable methods and more easily scale to larger dimensions not previously seen in existing approaches.
Affiliation(s)
- Hillary M Heiling
  - Department of Biostatistics, University of North Carolina Chapel Hill, Chapel Hill, NC 27599, United States
- Naim U Rashid
  - Department of Biostatistics, University of North Carolina Chapel Hill, Chapel Hill, NC 27599, United States
- Quefeng Li
  - Department of Biostatistics, University of North Carolina Chapel Hill, Chapel Hill, NC 27599, United States
- Xianlu L Peng
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States
- Jen Jen Yeh
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States
  - Department of Surgery, University of North Carolina Chapel Hill, Chapel Hill, NC 27599, United States
  - Department of Pharmacology, University of North Carolina Chapel Hill, Chapel Hill, NC 27599, United States
- Joseph G Ibrahim
  - Department of Biostatistics, University of North Carolina Chapel Hill, Chapel Hill, NC 27599, United States
9
Ma TF, Wang F, Zhu J. On generalized latent factor modeling and inference for high-dimensional binomial data. Biometrics 2023; 79:2311-2320. [PMID: 36200926 DOI: 10.1111/biom.13768] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2020] [Accepted: 09/23/2022] [Indexed: 11/30/2022]
Abstract
We explore a hierarchical generalized latent factor model for discrete and bounded response variables and in particular, binomial responses. Specifically, we develop a novel two-step estimation procedure and the corresponding statistical inference that is computationally efficient and scalable for the high dimension in terms of both the number of subjects and the number of features per subject. We also establish the validity of the estimation procedure, particularly the asymptotic properties of the estimated effect size and the latent structure, as well as the estimated number of latent factors. The results are corroborated by a simulation study and for illustration, the proposed methodology is applied to analyze a dataset in a gene-environment association study.
Affiliation(s)
- Ting Fung Ma
  - Department of Statistics, University of South Carolina, Columbia, South Carolina, USA
- Fangfang Wang
  - Department of Mathematical Sciences, Worcester Polytechnic Institute, Worcester, Massachusetts, USA
- Jun Zhu
  - Department of Statistics, University of Wisconsin-Madison, Madison, Wisconsin, USA
10
Leday GGR, Hemerik J, Engel J, van der Voet H. Improved family-wise error rate control in multiple equivalence testing. Food Chem Toxicol 2023; 178:113928. [PMID: 37406754 DOI: 10.1016/j.fct.2023.113928] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2023] [Revised: 06/20/2023] [Accepted: 06/30/2023] [Indexed: 07/07/2023]
Abstract
Equivalence testing is an important component of safety assessments, used for example by the European Food Safety Authority, to allow new food or feed products on the market. The aim of such tests is to demonstrate equivalence of characteristics of test and reference crops. Equivalence tests are typically univariate and applied to each measured analyte (characteristic) separately without multiplicity correction. This increases the probability of making false claims of equivalence (type I errors) when evaluating multiple analytes simultaneously. To solve this problem, familywise error rate (FWER) control using Hochberg's method has been proposed. This paper demonstrates that, in the context of equivalence testing, other FWER-controlling methods are more powerful than Hochberg's. Particularly, it is shown that Hommel's method is guaranteed to perform at least as well as Hochberg's and that an "adaptive" version of Bonferroni's method, which uses an estimator of the proportion of non-equivalent characteristics, often substantially outperforms Hommel's method. Adaptive Bonferroni takes better advantage of the particular context of food safety where a large proportion of true equivalences is expected, a situation where other methods are particularly conservative. The different methods are illustrated by their application to two compositional datasets and further assessed and compared using simulated data.
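To make the comparison of FWER-controlling procedures above concrete, here is a minimal sketch of Bonferroni and Hochberg adjusted p-values in plain Python. It is not the authors' code: the p-values are hypothetical, the function names are illustrative, and Hommel's method and the paper's adaptive Bonferroni estimator are not implemented here. Hochberg's step-up procedure is valid under independence or certain forms of positive dependence among the test statistics.

```python
def bonferroni_adjust(pvals):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]


def hochberg_adjust(pvals):
    """Hochberg step-up adjusted p-values, returned in the input order."""
    m = len(pvals)
    # Visit the p-values from largest to smallest.
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for k, idx in enumerate(order, start=1):
        # The k-th largest p-value is multiplied by k; a running minimum
        # keeps the adjusted values monotone in the raw p-values.
        running_min = min(running_min, k * pvals[idx])
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for five analytes (illustrative only).
pvals = [0.004, 0.020, 0.030, 0.040, 0.045]
alpha = 0.05

bonf = bonferroni_adjust(pvals)
hoch = hochberg_adjust(pvals)

n_reject_bonf = sum(p <= alpha for p in bonf)
n_reject_hoch = sum(p <= alpha for p in hoch)
print("Bonferroni rejections:", n_reject_bonf)
print("Hochberg rejections:", n_reject_hoch)
```

Hochberg's adjusted p-values are never larger than Bonferroni's, so it rejects at least as many hypotheses at the same level; in this made-up example the difference is large because all raw p-values lie below alpha.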
Affiliation(s)
- Gwenaël G R Leday
  - Wageningen University and Research, Biometris, Droevendaalsesteeg 1, 6708, PB, Wageningen, the Netherlands
- Jesse Hemerik
  - Wageningen University and Research, Biometris, Droevendaalsesteeg 1, 6708, PB, Wageningen, the Netherlands
- Jasper Engel
  - Wageningen University and Research, Biometris, Droevendaalsesteeg 1, 6708, PB, Wageningen, the Netherlands
- Hilko van der Voet
  - Wageningen University and Research, Biometris, Droevendaalsesteeg 1, 6708, PB, Wageningen, the Netherlands
11
Block-diagonal precision matrix regularization for ultra-high dimensional data. Comput Stat Data Anal 2023. [DOI: 10.1016/j.csda.2022.107630] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/05/2022]
12
Fan J, Lou Z, Yu M. Are Latent Factor Regression and Sparse Regression Adequate? J Am Stat Assoc 2023; 119:1076-1088. [PMID: 39268549 PMCID: PMC11390100 DOI: 10.1080/01621459.2023.2169700] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/13/2022] [Accepted: 01/13/2023] [Indexed: 01/19/2023]
Abstract
We propose the Factor Augmented (sparse linear) Regression Model (FARM) that not only admits both the latent factor regression and sparse linear regression as special cases but also bridges dimension reduction and sparse regression together. We provide theoretical guarantees for the estimation of our model under the existence of sub-Gaussian and heavy-tailed noises (with bounded (1 + ϑ)-th moment, for all ϑ > 0) respectively. In addition, the existing works on supervised learning often assume the latent factor regression or sparse linear regression is the true underlying model without justifying its adequacy. To fill this important gap in high-dimensional inference, we also leverage our model as the alternative model to test the sufficiency of the latent factor regression and the sparse linear regression models. To accomplish these goals, we propose the Factor-Adjusted deBiased Test (FabTest) and a two-stage ANOVA type test respectively. We also conduct large-scale numerical experiments including both synthetic and FRED macroeconomics data to corroborate the theoretical properties of our methods. Numerical results illustrate the robustness and effectiveness of our model against latent factor regression and sparse linear regression models.
Affiliation(s)
- Jianqing Fan
  - Frederick L. Moore '18 Professor of Finance, Professor of Statistics, and Professor of Operations Research and Financial Engineering at Princeton University
- Zhipeng Lou
  - Department of Operations Research and Financial Engineering, Princeton University
- Mengxin Yu
  - Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA
13
Vutov V, Dickhaus T. Multiple two-sample testing under arbitrary covariance dependency with an application in imaging mass spectrometry. Biom J 2023; 65:e2100328. [PMID: 36029271 DOI: 10.1002/bimj.202100328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/19/2021] [Revised: 05/12/2022] [Accepted: 07/04/2022] [Indexed: 11/12/2022]
Abstract
Large-scale hypothesis testing has become a ubiquitous problem in high-dimensional statistical inference, with broad applications in various scientific disciplines. One relevant application is constituted by imaging mass spectrometry (IMS) association studies, where a large number of tests are performed simultaneously in order to identify molecular masses that are associated with a particular phenotype, for example, a cancer subtype. Mass spectra obtained from matrix-assisted laser desorption/ionization (MALDI) experiments are dependent, when considered as statistical quantities. False discovery proportion (FDP) estimation and control under arbitrary dependency structure among test statistics is an active topic in modern multiple testing research. In this context, we are concerned with the evaluation of associations between the binary outcome variable (describing the phenotype) and multiple predictors derived from MALDI measurements. We propose an inference procedure in which the correlation matrix of the test statistics is utilized. The approach is based on multiple marginal models. Specifically, we fit a marginal logistic regression model for each predictor individually. Asymptotic joint normality of the stacked vector of the marginal regression coefficients is established under standard regularity assumptions, and their (limiting) correlation matrix is estimated. The proposed method extracts common factors from the resulting empirical correlation matrix. Finally, we estimate the realized FDP of a thresholding procedure for the marginal p-values. We demonstrate a practical application of the proposed workflow to MALDI IMS data in an oncological context.
Affiliation(s)
- Vladimir Vutov
  - Institute for Statistics, University of Bremen, Bremen, Germany
14
Cui J, Wang G, Zou C, Wang Z. Change-point testing for parallel data sets with FDR control. Comput Stat Data Anal 2023. [DOI: 10.1016/j.csda.2023.107705] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/04/2023]
15
Zhu W, Lévy-Leduc C, Ternès N. Identification of prognostic and predictive biomarkers in high-dimensional data with PPLasso. BMC Bioinformatics 2023; 24:25. [PMID: 36690931 PMCID: PMC9869528 DOI: 10.1186/s12859-023-05143-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2022] [Accepted: 01/09/2023] [Indexed: 01/24/2023]
Abstract
In clinical trials, identification of prognostic and predictive biomarkers has become essential to precision medicine. Prognostic biomarkers can be useful for the prevention of the occurrence of the disease, and predictive biomarkers can be used to identify patients with potential benefit from the treatment. Previous research has mainly focused on clinical characteristics, and the use of genomic data in this area has hardly been studied. A new method is required to simultaneously select prognostic and predictive biomarkers in high-dimensional genomic data where biomarkers are highly correlated. We propose a novel approach called PPLasso that integrates prognostic and predictive effects into one statistical model. PPLasso also takes into account the correlations between biomarkers that can alter the biomarker selection accuracy. Our method consists of transforming the design matrix to remove the correlations between the biomarkers before applying the generalized Lasso. In a comprehensive numerical evaluation, we show that PPLasso outperforms the traditional Lasso and other extensions on both prognostic and predictive biomarker identification in various scenarios. Finally, our method is applied to publicly available transcriptomic and proteomic data.
Affiliation(s)
- Wencan Zhu
  - Université Paris-Saclay, AgroParisTech, INRAE, UMR MIA Paris-Saclay, 91120, Palaiseau, France
  - Biostatistics and Programming Department, Sanofi R&D, 91380, Chilly Mazarin, France
- Céline Lévy-Leduc
  - Université Paris-Saclay, AgroParisTech, INRAE, UMR MIA Paris-Saclay, 91120, Palaiseau, France
- Nils Ternès
  - Biostatistics and Programming Department, Sanofi R&D, 91380, Chilly Mazarin, France
16
Wang P, Li Q, Shen D, Liu Y. High-Dimensional Factor Regression for Heterogeneous Subpopulations. Stat Sin 2023; 33:27-53. [PMID: 37854586 PMCID: PMC10583735 DOI: 10.5705/ss.202020.0145] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2023]
Abstract
In modern scientific research, data heterogeneity is commonly observed owing to the abundance of complex data. We propose a factor regression model for data with heterogeneous subpopulations. The proposed model can be represented as a decomposition of heterogeneous and homogeneous terms. The heterogeneous term is driven by latent factors in different subpopulations. The homogeneous term captures common variation in the covariates and shares common regression coefficients across subpopulations. Our proposed model attains a good balance between a global model and a group-specific model. The global model ignores the data heterogeneity, while the group-specific model fits each subgroup separately. We prove the estimation and prediction consistency for our proposed estimators, and show that they have better convergence rates than those of the group-specific and global models. We show that the extra cost of estimating latent factors is asymptotically negligible and the minimax rate is still attainable. We further demonstrate the robustness of our proposed method by studying its prediction error under a mis-specified group-specific model. Finally, we conduct simulation studies and analyze a data set from the Alzheimer's Disease Neuroimaging Initiative and an aggregated microarray data set to further demonstrate the competitiveness and interpretability of our proposed factor regression model.
Affiliation(s)
- Quefeng Li
  - University of North Carolina at Chapel Hill
- Dinggang Shen
  - ShanghaiTech University
  - Shanghai United Imaging Intelligence Co
  - Korea University
- Yufeng Liu
  - University of North Carolina at Chapel Hill
17
Zhang B, Huang H, Chen J. Estimation of Large-Dimensional Covariance Matrices via Second-Order Stein-Type Regularization. Entropy (Basel) 2022; 25:53. [PMID: 36673194 PMCID: PMC9857414 DOI: 10.3390/e25010053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2022] [Revised: 12/20/2022] [Accepted: 12/23/2022] [Indexed: 06/17/2023]
Abstract
This paper tackles the problem of estimating the covariance matrix in large-dimension and small-sample-size scenarios. Inspired by the well-known linear shrinkage estimation, we propose a novel second-order Stein-type regularization strategy to generate well-conditioned covariance matrix estimators. We model the second-order Stein-type regularization as a quadratic polynomial concerning the sample covariance matrix and a given target matrix, representing the prior information of the actual covariance structure. To obtain available covariance matrix estimators, we choose the spherical and diagonal target matrices and develop unbiased estimates of the theoretical mean squared errors, which measure the distances between the actual covariance matrix and its estimators. We formulate the second-order Stein-type regularization as a convex optimization problem, resulting in the optimal second-order Stein-type estimators. Numerical simulations reveal that the proposed estimators can significantly lower the Frobenius losses compared with the existing Stein-type estimators. Moreover, a real data analysis in portfolio selection verifies the performance of the proposed estimators.
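For orientation, the classical first-order linear shrinkage that the abstract cites as its inspiration can be sketched as below. This is not the second-order estimator proposed in the paper: the function name `linear_shrinkage` and the user-supplied weight `rho` are assumptions made for illustration, and the paper's data-driven optimal weights are not reproduced here.

```python
import numpy as np

def linear_shrinkage(X, rho):
    """First-order linear shrinkage toward a spherical target (illustrative only).

    X   : (n, p) data matrix, rows are observations.
    rho : shrinkage weight in [0, 1], chosen by the user in this sketch.
    """
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    S = np.cov(X, rowvar=False)              # p x p sample covariance
    target = (np.trace(S) / p) * np.eye(p)   # spherical target with the same trace
    return (1.0 - rho) * S + rho * target    # convex combination of S and target
```

When p exceeds n the sample covariance is singular, whereas any `rho > 0` gives a positive-definite estimate with the same trace.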
Affiliation(s)
- Bin Zhang
- College of Mathematics and Statistics, Guangxi Normal University, Guilin 541004, China
| | - Hengzhen Huang
- College of Mathematics and Statistics, Guangxi Normal University, Guilin 541004, China
| | - Jianbin Chen
- School of Mathematics and Statistics, Beijing Institute of Technology, Beijing 100081, China
|
18
|
Liebscher E, Okhrin O. Semiparametric estimation of the high-dimensional elliptical distribution. J MULTIVARIATE ANAL 2022. [DOI: 10.1016/j.jmva.2022.105142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
|
19
|
Kim D, Song X, Wang Y. Unified discrete-time factor stochastic volatility and continuous-time Itô models for combining inference based on low-frequency and high-frequency. J MULTIVARIATE ANAL 2022. [DOI: 10.1016/j.jmva.2022.105091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|
20
|
Zhu Z, Wang T, Samworth RJ. High-dimensional principal component analysis with heterogeneous missingness. J R Stat Soc Series B Stat Methodol 2022; 84:2000-2031. [PMID: 37065873 PMCID: PMC10098677 DOI: 10.1111/rssb.12550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2019] [Accepted: 08/11/2022] [Indexed: 11/22/2022]
Abstract
We study the problem of high-dimensional Principal Component Analysis (PCA) with missing observations. In a simple, homogeneous observation model, we show that an existing observed-proportion weighted (OPW) estimator of the leading principal components can (nearly) attain the minimax optimal rate of convergence, which exhibits an interesting phase transition. However, deeper investigation reveals that, particularly in more realistic settings where the observation probabilities are heterogeneous, the empirical performance of the OPW estimator can be unsatisfactory; moreover, in the noiseless case, it fails to provide exact recovery of the principal components. Our main contribution, then, is to introduce a new method, which we call primePCA, that is designed to cope with situations where observations may be missing in a heterogeneous manner. Starting from the OPW estimator, primePCA iteratively projects the observed entries of the data matrix onto the column space of our current estimate to impute the missing entries, and then updates our estimate by computing the leading right singular space of the imputed data matrix. We prove that the error of primePCA converges to zero at a geometric rate in the noiseless case, and when the signal strength is not too small. An important feature of our theoretical guarantees is that they depend on average, as opposed to worst-case, properties of the missingness mechanism. Our numerical studies on both simulated and real data reveal that primePCA exhibits very encouraging performance across a wide range of scenarios, including settings where the data are not Missing Completely At Random.
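The impute-then-update loop described in the abstract can be sketched roughly as follows. This is a simplified illustration, not the authors' primePCA implementation: the function name `impute_and_refine` is hypothetical, the initial estimate is a plain SVD of the zero-filled matrix standing in for the OPW estimator, and any safeguards in the published method are omitted.

```python
import numpy as np

def impute_and_refine(Y, observed, k, n_iter=50):
    """Iteratively impute missing entries and refine a rank-k right singular space.

    Y        : (n, d) data matrix; values at unobserved positions are ignored.
    observed : (n, d) boolean mask, True where an entry is observed.
    k        : number of principal components to estimate.
    """
    Y = np.asarray(Y, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    n, d = Y.shape
    filled = np.where(observed, Y, 0.0)

    # Initial estimate (assumption: zero-filled SVD instead of the OPW estimator).
    _, _, Vt = np.linalg.svd(filled, full_matrices=False)
    V = Vt[:k].T  # d x k

    for _ in range(n_iter):
        imputed = filled.copy()
        for i in range(n):
            obs = observed[i]
            if obs.all() or not obs.any():
                continue  # nothing to impute, or nothing to regress on
            # Least-squares scores computed from the observed coordinates only.
            u, *_ = np.linalg.lstsq(V[obs], Y[i, obs], rcond=None)
            # Fill the missing coordinates from the current subspace estimate.
            imputed[i, ~obs] = V[~obs] @ u
        # Update: leading right singular space of the imputed matrix.
        _, _, Vt = np.linalg.svd(imputed, full_matrices=False)
        V = Vt[:k].T
    return V
```

In a noiseless low-rank example with a small fraction of missing entries, the returned basis should span nearly the same subspace as the true loadings.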
Affiliation(s)
- Ziwei Zhu
- Statistical Laboratory, University of Cambridge, Cambridge, UK
- Department of Statistics, University of Michigan, Ann Arbor, Michigan, USA
| | - Tengyao Wang
- Statistical Laboratory, University of Cambridge, Cambridge, UK
- Department of Statistics, London School of Economics, London, UK
|
21
|
Gao LL, Bien J, Witten D. Selective Inference for Hierarchical Clustering. J Am Stat Assoc 2022; 119:332-342. [PMID: 38660582 PMCID: PMC11036349 DOI: 10.1080/01621459.2022.2116331] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2020] [Accepted: 08/16/2022] [Indexed: 10/17/2022]
Abstract
Classical tests for a difference in means control the type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, then applying a classical test yields an extremely inflated type I error rate. Notably, this problem persists even if two separate and independent data sets are used to define the groups and to test for a difference in their means. To address this problem, in this paper, we propose a selective inference approach to test for a difference in means between two clusters. Our procedure controls the selective type I error rate by accounting for the fact that the choice of null hypothesis was made based on the data. We describe how to efficiently compute exact p-values for clusters obtained using agglomerative hierarchical clustering with many commonly-used linkages. We apply our method to simulated data and to single-cell RNA-sequencing data.
Affiliation(s)
- Lucy L. Gao
- Department of Statistics, University of British Columbia
| | - Jacob Bien
- Department of Data Sciences and Operations, University of Southern California
| | - Daniela Witten
- Departments of Statistics and Biostatistics, University of Washington
|
22
|
Boileau P, Hejazi NS, van der Laan MJ, Dudoit S. Cross-Validated Loss-Based Covariance Matrix Estimator Selection in High Dimensions. J Comput Graph Stat 2022; 32:601-612. [PMID: 37273839 PMCID: PMC10237052 DOI: 10.1080/10618600.2022.2110883] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/06/2021] [Accepted: 07/28/2022] [Indexed: 10/15/2022]
Abstract
The covariance matrix plays a fundamental role in many modern exploratory and inferential statistical procedures, including dimensionality reduction, hypothesis testing, and regression. In low-dimensional regimes, where the number of observations far exceeds the number of variables, the optimality of the sample covariance matrix as an estimator of this parameter is well-established. High-dimensional regimes do not admit such a convenience. Thus, a variety of estimators have been derived to overcome the shortcomings of the canonical estimator in such settings. Yet, selecting an optimal estimator from among the plethora available remains an open challenge. Using the framework of cross-validated loss-based estimation, we develop the theoretical underpinnings of just such an estimator selection procedure. We propose a general class of loss functions for covariance matrix estimation and establish accompanying finite-sample risk bounds and conditions for the asymptotic optimality of the cross-validation selector. In numerical experiments, we demonstrate the optimality of our proposed selector in moderate sample sizes and across diverse data-generating processes. The practical benefits of our procedure are highlighted in a dimension reduction application to single-cell transcriptome sequencing data.
Affiliation(s)
- Philippe Boileau
- Graduate Group in Biostatistics and Center for Computational Biology, UC Berkeley
| | - Nima S. Hejazi
- Division of Biostatistics, Department of Population Health Sciences, Weill Cornell Medicine
| | - Mark J. van der Laan
- Division of Biostatistics, Department of Statistics, and Center for Computational Biology, UC Berkeley
| | - Sandrine Dudoit
- Department of Statistics, Division of Biostatistics, and Center for Computational Biology, UC Berkeley
|
23
|
Leday GGR, Engel J, Vossen JH, de Vos RCH, van der Voet H. Multivariate equivalence testing for food safety assessment. Food Chem Toxicol 2022; 170:113446. [PMID: 36191656 DOI: 10.1016/j.fct.2022.113446] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2022] [Revised: 09/20/2022] [Accepted: 09/24/2022] [Indexed: 11/28/2022]
Abstract
Products for food and feed derived from genetically modified (GM) crops are only allowed on the market when they are deemed to be safe for human health and the environment. The European Food Safety Authority (EFSA) performs safety assessment including a comparative approach: the compositional characteristics of a GM genotype are compared to those of reference genotypes that have a history of safe use. Statistical equivalence tests are used to carry out such a comparative assessment. These tests are univariate and therefore only consider one measured variable at a time. Phenotypic data, however, often comprise measurements on multiple variables that must be integrated to arrive at a single decision on acceptance in the regulatory process. The surge of modern molecular phenotyping platforms further challenges this integration, due to the large number of characteristics measured on the plants. This paper presents a new multivariate equivalence test that naturally extends a recently proposed univariate equivalence test and allows equivalence to be assessed across all variables simultaneously. The proposed test is illustrated on plant compositional data from a field study on maize grain and on untargeted metabolomic data of potato tubers, while its performance is assessed on simulated data.
Affiliation(s)
- Gwenaël G R Leday
- Biometris, Wageningen University and Research, Droevendaalsesteeg 1, 6708 PB, Wageningen, the Netherlands
| | - Jasper Engel
- Biometris, Wageningen University and Research, Droevendaalsesteeg 1, 6708 PB, Wageningen, the Netherlands
| | - Jack H Vossen
- Plant Breeding, Wageningen University and Research, Droevendaalsesteeg 1, 6700 AJ, Wageningen, the Netherlands
| | - Ric C H de Vos
- Business Unit Bioscience, Wageningen Plant Research, Wageningen University and Research, Droevendaalsesteeg 1, 6700 AA, Wageningen, the Netherlands
| | - Hilko van der Voet
- Biometris, Wageningen University and Research, Droevendaalsesteeg 1, 6708 PB, Wageningen, the Netherlands
|
24
|
Barigozzi M, Farnè M. An algebraic estimator for large spectral density matrices. J Am Stat Assoc 2022. [DOI: 10.1080/01621459.2022.2126780] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/14/2022]
|
25
|
Fan J, Jiang B, Sun Q. Bayesian Factor-adjusted Sparse Regression. JOURNAL OF ECONOMETRICS 2022; 230:3-19. [PMID: 35754940 PMCID: PMC9223477 DOI: 10.1016/j.jeconom.2020.06.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
Many sparse regression methods are based on the assumption that covariates are weakly correlated, which unfortunately does not hold in many economic and financial datasets. To address this challenge, we model the strongly-correlated covariates by a factor structure: strong correlations among covariates are explained by common factors and the remaining variations are interpreted as idiosyncratic components. We then propose a factor-adjusted sparse regression model with both common factors and idiosyncratic components as decorrelated covariates and develop a semi-Bayesian method. Parameter estimation rate-optimality and model selection consistency are established by non-asymptotic analyses. We show on simulated data that the semi-Bayesian method outperforms its Lasso analogue, manifests insensitivity to the overestimates of the number of common factors, pays a negligible price when covariates are not correlated, scales up well with increasing sample size, dimensionality and sparsity, and converges fast to the equilibrium of the posterior distribution. Numerical results on a real dataset of U.S. bond risk premia and macroeconomic indicators also lend strong support to the proposed method.
Affiliation(s)
- Jianqing Fan
- Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544
| | - Bai Jiang
- Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544
| | - Qiang Sun
- Department of Statistical Sciences, University of Toronto, Toronto, ON M5S 3G3
|
26
|
Comparing the Robustness of the Structural after Measurement (SAM) Approach to Structural Equation Modeling (SEM) against Local Model Misspecifications with Alternative Estimation Approaches. STATS 2022. [DOI: 10.3390/stats5030039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
Structural equation models (SEM), or confirmatory factor analysis as a special case, contain model parameters at the measurement part and the structural part. In most social-science SEM applications, all parameters are simultaneously estimated in a one-step approach (e.g., with maximum likelihood estimation). In a recent article, Rosseel and Loh (2022, Psychol. Methods) proposed a two-step structural after measurement (SAM) approach to SEM that estimates the parameters of the measurement model in the first step and the parameters of the structural model in the second step. Rosseel and Loh claimed that SAM is more robust to local model misspecifications (i.e., cross loadings and residual correlations) than one-step maximum likelihood estimation. In this article, it is demonstrated with analytical derivations and simulation studies that SAM is generally not more robust to misspecifications than one-step estimation approaches. Alternative estimation methods are proposed that provide more robustness to misspecifications. SAM suffers from finite-sample bias that depends on the size of factor reliability and factor correlations. A bootstrap-bias-corrected LSAM estimate provides less biased estimates in finite samples. Nevertheless, we argue in the discussion section that applied researchers should adopt SAM because robustness to local misspecifications is an irrelevant property when applying SAM. Parameter estimates in a structural model are of interest because intentionally misspecified SEMs frequently offer clearly interpretable factors. In contrast, SEMs with some empirically driven model modifications will result in biased estimates of the structural parameters because the meaning of factors is unintentionally changed.
|
27
|
A Log-Det Heuristics for Covariance Matrix Estimation: The Analytic Setup. STATS 2022. [DOI: 10.3390/stats5030037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022] Open
Abstract
This paper studies a new nonconvex optimization problem aimed at recovering high-dimensional covariance matrices with a low rank plus sparse structure. The objective is composed of a smooth nonconvex loss and a nonsmooth composite penalty. A number of structural analytic properties of the new heuristics are presented and proven, thus providing the necessary framework for further investigating the statistical applications. In particular, the first and the second derivative of the smooth loss are obtained, its local convexity range is derived, and the Lipschitzianity of its gradient is shown. This opens the path to solve the described problem via a proximal gradient algorithm.
|
28
|
Guo Z, Ćevid D, Bühlmann P. Doubly debiased lasso: High-dimensional inference under hidden confounding. Ann Stat 2022; 50:1320-1347. [DOI: 10.1214/21-aos2152] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
Affiliation(s)
- Zijian Guo
- Department of Statistics, Rutgers University
|
29
|
Zhang L, Zhou W, Wang H. Non-asymptotic properties of spectral decomposition of large Gram-type matrices and applications. BERNOULLI 2022. [DOI: 10.3150/21-bej1384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
Affiliation(s)
- Lyuou Zhang
- School of Statistics and Management, Shanghai University of Finance and Economics, 777 Guoding Road, Shanghai, 200433, P.R. China
| | - Wen Zhou
- Department of Statistics, Colorado State University, Fort Collins, CO 80523, USA
| | - Haonan Wang
- Department of Statistics, Colorado State University, Fort Collins, CO 80523, USA
|
30
|
Bing X, Bunea F, Wegkamp M. Inference in latent factor regression with clusterable features. BERNOULLI 2022. [DOI: 10.3150/21-bej1374] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
Affiliation(s)
- Xin Bing
- Department of Statistics and Data Science, Cornell University, Ithaca, New York, USA
| | - Florentina Bunea
- Department of Statistics and Data Science, Cornell University, Ithaca, New York, USA
| | - Marten Wegkamp
- Department of Statistics and Data Science, Cornell University, Ithaca, New York, USA
|
31
|
Gao Z, Tsay RS. Divide-and-Conquer: A Distributed Hierarchical Factor Approach to Modeling Large-Scale Time Series Data. J Am Stat Assoc 2022. [DOI: 10.1080/01621459.2022.2071279] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]
|
32
|
Bing X, Ning Y, Xu Y. Adaptive estimation in multivariate response regression with hidden variables. Ann Stat 2022. [DOI: 10.1214/21-aos2059] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
Affiliation(s)
- Xin Bing
- Department of Statistics and Data Science, Cornell University
| | - Yang Ning
- Department of Statistics and Data Science, Cornell University
| | - Yaosheng Xu
- Department of Statistics and Data Science, Cornell University
|
33
|
Xia Q, Wong H, Shen S, He K. Factor analysis for high‐dimensional time series: Consistent estimation and efficient computation. Stat Anal Data Min 2022. [DOI: 10.1002/sam.11557] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Affiliation(s)
- Qiang Xia
- College of Mathematics and Informatics, South China Agricultural University, Guangzhou, China
| | - Heung Wong
- University Research Facility in Big Data Analytics, The Hong Kong Polytechnic University, Hong Kong, China
| | - Shirun Shen
- Center for Applied Statistics and Institute of Statistics and Big Data, Renmin University of China, Beijing, China
| | - Kejun He
- Center for Applied Statistics and Institute of Statistics and Big Data, Renmin University of China, Beijing, China
|
34
|
Fan J, Fan Y, Han X, Lv J. SIMPLE: Statistical inference on membership profiles in large networks. J R Stat Soc Series B Stat Methodol 2022. [DOI: 10.1111/rssb.12505] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/01/2022]
Affiliation(s)
- Jianqing Fan
- Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey, USA
| | - Yingying Fan
- Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, California, USA
| | - Xiao Han
- International Institute of Finance, Department of Statistics and Finance, University of Science and Technology of China, Hefei, China
| | - Jinchi Lv
- Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, California, USA
|
35
|
Liu X, Zhang T. Estimating change-point latent factor models for high-dimensional time series. J Stat Plan Inference 2022. [DOI: 10.1016/j.jspi.2021.07.006] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
|
36
|
|
37
|
Payne NY, Gagnon-Bartsch JA. Separating and reintegrating latent variables to improve classification of genomic data. Biostatistics 2022; 23:1133-1149. [DOI: 10.1093/biostatistics/kxab046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2021] [Revised: 11/09/2021] [Accepted: 11/24/2021] [Indexed: 11/12/2022] Open
Abstract
Summary
Genomic data sets contain the effects of various unobserved biological variables in addition to the variable of primary interest. These latent variables often affect a large number of features (e.g., genes), giving rise to dense latent variation. This latent variation presents both challenges and opportunities for classification. While some of these latent variables may be partially correlated with the phenotype of interest and thus helpful, others may be uncorrelated and merely contribute additional noise. Moreover, whether potentially helpful or not, these latent variables may obscure weaker effects that impact only a small number of features but more directly capture the signal of primary interest. To address these challenges, we propose the cross-residualization classifier (CRC). Through an adjustment and ensemble procedure, the CRC estimates and residualizes out the latent variation, trains a classifier on the residuals, and then reintegrates the latent variation in a final ensemble classifier. Thus, the latent variables are accounted for without discarding any potentially predictive information. We apply the method to simulated data and a variety of genomic data sets from multiple platforms. In general, we find that the CRC performs well relative to existing classifiers and sometimes offers substantial gains.
Affiliation(s)
- Nora Yujia Payne
- Department of Statistics, University of Michigan, 1085 S. University Ave., Ann Arbor, MI 48109, USA
| | - Johann A Gagnon-Bartsch
- Department of Statistics, University of Michigan, 1085 S. University Ave., Ann Arbor, MI 48109, USA
|
38
|
Zhong X, Su C, Fan Z. Empirical Bayes PCA in high dimensions. J R Stat Soc Series B Stat Methodol 2022. [DOI: 10.1111/rssb.12490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Affiliation(s)
- Xinyi Zhong
- Department of Statistics and Data Science, Yale University, New Haven, USA
| | - Chang Su
- Department of Biostatistics, Yale University, New Haven, USA
| | - Zhou Fan
- Department of Statistics and Data Science, Yale University, New Haven, USA
|
39
|
High-Dimensional Conditional Covariance Matrices Estimation Using a Factor-GARCH Model. Symmetry (Basel) 2022. [DOI: 10.3390/sym14010158] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/07/2022] Open
Abstract
Estimation of a conditional covariance matrix is an interesting and important research topic in statistics and econometrics. However, modelling ultra-high dimensional dynamic (conditional) covariance structures is known to suffer from the curse of dimensionality or the problem of singularity. To partially solve this problem, this paper establishes a model by combining the ideas of a factor model and a symmetric GARCH model to describe the dynamics of a high-dimensional conditional covariance matrix. Quasi maximum likelihood estimation (QMLE) and least squares estimation (LSE) methods are used to estimate the parameters in the model, and the plug-in method is introduced to obtain an estimate of the conditional covariance matrix. Asymptotic properties are established for the proposed method, and simulation studies are given to demonstrate its performance. A financial application is presented to support the methodology.
|
40
|
Abstract
Multimodal data, where different types of data are collected from the same subjects, are fast emerging in a large variety of scientific applications. Factor analysis is commonly used in integrative analysis of multimodal data, and is particularly useful to overcome the curse of high dimensionality and high correlations. However, there is little work on statistical inference for factor analysis based supervised modeling of multimodal data. In this article, we consider an integrative linear regression model that is built upon the latent factors extracted from multimodal data. We address three important questions: how to infer the significance of one data modality given the other modalities in the model; how to infer the significance of a combination of variables from one modality or across different modalities; and how to quantify the contribution, measured by the goodness-of-fit, of one data modality given the others. When answering each question, we explicitly characterize both the benefit and the extra cost of factor analysis. Those questions, to our knowledge, have not yet been addressed despite wide use of factor analysis in integrative multimodal analysis, and our proposal bridges an important gap. We study the empirical performance of our methods through simulations, and further illustrate with a multimodal neuroimaging analysis.
|
41
|
Hörmann S, Jammoul F. Preprocessing noisy functional data: A multivariate perspective. Electron J Stat 2022. [DOI: 10.1214/22-ejs2083] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
Affiliation(s)
| | - Fatima Jammoul
- Institute of Software Design and Security, FH JOANNEUM, Austria
|
42
|
Shu H, Qu Z. CDPA: Common and distinctive pattern analysis between high-dimensional datasets. Electron J Stat 2022; 16:2475-2517. [DOI: 10.1214/22-ejs2008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
Affiliation(s)
- Hai Shu
- Department of Biostatistics, School of Global Public Health, New York University
| | - Zhe Qu
- Department of Mathematics, School of Science and Engineering, Tulane University
|
43
|
Han Y, Chen R, Zhang CH. Rank determination in tensor factor model. Electron J Stat 2022. [DOI: 10.1214/22-ejs1991] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
Affiliation(s)
- Yuefeng Han
- Department of Statistics, Rutgers University, Piscataway, NJ 08854, USA
| | - Rong Chen
- Department of Statistics, Rutgers University, Piscataway, NJ 08854, USA
| | - Cun-Hui Zhang
- Department of Statistics, Rutgers University, Piscataway, NJ 08854, USA
|
44
|
Shu H, Qu Z, Zhu H. D-GCCA: Decomposition-based Generalized Canonical Correlation Analysis for Multi-view High-dimensional Data. JOURNAL OF MACHINE LEARNING RESEARCH : JMLR 2022; 23:169. [PMID: 35983506 PMCID: PMC9380864] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
Modern biomedical studies often collect multi-view data, that is, multiple types of data measured on the same set of objects. A popular model in high-dimensional multi-view data analysis is to decompose each view's data matrix into a low-rank common-source matrix generated by latent factors common across all data views, a low-rank distinctive-source matrix corresponding to each view, and an additive noise matrix. We propose a novel decomposition method for this model, called decomposition-based generalized canonical correlation analysis (D-GCCA). The D-GCCA rigorously defines the decomposition on the $L^2$ space of random variables in contrast to the Euclidean dot product space used by most existing methods, thereby being able to provide the estimation consistency for the low-rank matrix recovery. Moreover, to well calibrate common latent factors, we impose a desirable orthogonality constraint on distinctive latent factors. Existing methods, however, inadequately consider such orthogonality and may thus suffer from substantial loss of undetected common-source variation. Our D-GCCA takes one step further than generalized canonical correlation analysis by separating common and distinctive components among canonical variables, while enjoying an appealing interpretation from the perspective of principal component analysis. Furthermore, we propose to use the variable-level proportion of signal variance explained by common or distinctive latent factors for selecting the variables most influenced. Consistent estimators of our D-GCCA method are established with good finite-sample numerical performance, and have closed-form expressions leading to efficient computation especially for large-scale data. The superiority of D-GCCA over state-of-the-art methods is also corroborated in simulations and real-world data examples.
Affiliation(s)
- Hai Shu
- Department of Biostatistics, New York University, New York, NY 10003, USA
| | - Zhe Qu
- Department of Mathematics, Tulane University, New Orleans, LA 70118, USA
| | - Hongtu Zhu
- Department of Biostatistics, Department of Computer Science, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
|
45
|
Liu W, Lin H, Zheng S, Liu J. Generalized Factor Model for Ultra-High Dimensional Correlated Variables with Mixed Types. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1999818] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
Affiliation(s)
- Wei Liu
- Center of Statistical Research and School of Statistics, Southwestern University of Finance and Economics, Chengdu, China
- Huazhen Lin
- Center of Statistical Research and School of Statistics, Southwestern University of Finance and Economics, Chengdu, China
- Shurong Zheng
- School of Mathematics and Statistics, Northeast Normal University, Changchun, China
- Jin Liu
- Centre for Quantitative Medicine, Program in Health Services & Systems Research, Duke-NUS Medical School, Singapore, Singapore
|
46
|
Kong XB, Lin JG, Liu C, Liu GY. Discrepancy Between Global and Local Principal Component Analysis on Large-Panel High-Frequency Data. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1996376] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Affiliation(s)
- Xin-Bing Kong
- Department of Statistics, Nanjing Audit University, Nanjing, China
- Jin-Guan Lin
- Department of Statistics, Nanjing Audit University, Nanjing, China
- Cheng Liu
- Department of Mathematical Economics and Finance, Wuhan University, Wuhan, China
- Guang-Ying Liu
- Department of Statistics, Nanjing Audit University, Nanjing, China
|
47
|
Zhang L, Zhou W, Wang H. A semiparametric latent factor model for large scale temporal data with heteroscedasticity. J Multivariate Anal 2021. [DOI: 10.1016/j.jmva.2021.104786] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
|
48
|
Yang Y, Yang Y, Shang HL. Feature extraction for functional time series: Theory and application to NIR spectroscopy data. J Multivariate Anal 2021. [DOI: 10.1016/j.jmva.2021.104863] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/19/2022]
|
49
|
|
50
|
Chen EY, Fan J. Statistical Inference for High-Dimensional Matrix-Variate Factor Models. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1970569] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Affiliation(s)
- Jianqing Fan
- EECS, University of California, Berkeley, CA
- ORFE, Princeton University, Princeton, NJ
|