1
Huang C, Zhu H. Functional hybrid factor regression model for handling heterogeneity in imaging studies. Biometrika 2022; 109:1133-1148. [PMID: 36531154 PMCID: PMC9754099 DOI: 10.1093/biomet/asac007] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/03/2023] Open
Abstract
This paper develops a functional hybrid factor regression modelling framework to handle the heterogeneity of many large-scale imaging studies, such as the Alzheimer's disease neuroimaging initiative study. Despite the numerous successes of those imaging studies, such heterogeneity may be caused by the differences in study environment, population, design, protocols or other hidden factors, and it has posed major challenges in integrative analysis of imaging data collected from multicentres or multistudies. We propose both estimation and inference procedures for estimating unknown parameters and detecting unknown factors under our new model. The asymptotic properties of both estimation and inference procedures are systematically investigated. The finite-sample performance of our proposed procedures is assessed by using Monte Carlo simulations and a real data example on hippocampal surface data from the Alzheimer's disease study.
Affiliation(s)
- C Huang: Department of Statistics, Florida State University, 117 N. Woodward Ave., Tallahassee, Florida 32304, U.S.A.
- H Zhu: Department of Biostatistics, The University of North Carolina at Chapel Hill, 135 Dauer Drive, Chapel Hill, North Carolina 27599, U.S.A.
2
Hörmann S, Jammoul F. Prediction in functional regression with discretely observed and noisy covariates. Comput Stat Data Anal 2022. [DOI: 10.1016/j.csda.2022.107600] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/14/2022]
3
Jumentier B, Caye K, Heude B, Lepeule J, François O. Sparse latent factor regression models for genome-wide and epigenome-wide association studies. Stat Appl Genet Mol Biol 2022; 21:sagmb-2021-0035. [PMID: 35245419 DOI: 10.1515/sagmb-2021-0035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2021] [Accepted: 02/06/2022] [Indexed: 11/15/2022]
Abstract
Association of phenotypes or exposures with genomic and epigenomic data faces important statistical challenges. One of these challenges is to account for variation due to unobserved confounding factors, such as individual ancestry or cell-type composition in tissues. This issue can be addressed with penalized latent factor regression models, where penalties are introduced to cope with high dimension in the data. If a relatively small proportion of genomic or epigenomic markers correlate with the variable of interest, sparsity penalties may help to capture the relevant associations, but the improvement over non-sparse approaches has not been fully evaluated yet. Here, we present least-squares algorithms that jointly estimate effect sizes and confounding factors in sparse latent factor regression models. In simulated data, sparse latent factor regression models generally achieved higher statistical performance than other sparse methods, including the least absolute shrinkage and selection operator and a Bayesian sparse linear mixed model. In generative model simulations, statistical performance was slightly lower (while being comparable) to non-sparse methods, but in simulations based on empirical data, sparse latent factor regression models were more robust to departure from the model than the non-sparse approaches. We applied sparse latent factor regression models to a genome-wide association study of a flowering trait for the plant Arabidopsis thaliana and to an epigenome-wide association study of smoking status in pregnant women. For both applications, sparse latent factor regression models facilitated the estimation of non-null effect sizes while overcoming multiple testing issues. The results were not only consistent with previous discoveries, but they also pinpointed new genes with functional annotations relevant to each application.
Affiliation(s)
- Basile Jumentier: Centre National de la Recherche Scientifique, Grenoble INP, TIMC-IMAG CNRS UMR 5525, Université Grenoble-Alpes, Grenoble, 38000, France; Centre National de la Recherche Scientifique, Institut National de la Santé et de la Recherche Médicale, Institute for Advanced Biosciences, INSERM U 1209, CNRS UMR 5309, Université Grenoble-Alpes, Grenoble, 38000, France
- Kevin Caye: Centre National de la Recherche Scientifique, Grenoble INP, TIMC-IMAG CNRS UMR 5525, Université Grenoble-Alpes, Grenoble, 38000, France
- Barbara Heude: Institut National de la Santé et de la Recherche Médicale, Centre of Research in Epidemiology and Statistics, INSERM UMR 1153, Université de Paris, F75004 Paris, France
- Johanna Lepeule: Centre National de la Recherche Scientifique, Institut National de la Santé et de la Recherche Médicale, Institute for Advanced Biosciences, INSERM U 1209, CNRS UMR 5309, Université Grenoble-Alpes, Grenoble, 38000, France
- Olivier François: Centre National de la Recherche Scientifique, Grenoble INP, TIMC-IMAG CNRS UMR 5525, Université Grenoble-Alpes, Grenoble, 38000, France; Inria Grenoble, Equipe Statify, Laboratoire Jean Kuntzmann, Rhône-Alpes Inovallée 655 Avenue de l'Europe - CS 90051, Montbonnot, 38334, France
4
Affiliation(s)
- Anru R. Zhang: Department of Statistics, University of Wisconsin-Madison
- T. Tony Cai: Department of Statistics, The Wharton School, University of Pennsylvania
- Yihong Wu: Department of Statistics and Data Science, Yale University
5
Miao W, Hu W, Ogburn EL, Zhou X. Identifying effects of multiple treatments in the presence of unmeasured confounding. J Am Stat Assoc 2022. [DOI: 10.1080/01621459.2021.2023551] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Affiliation(s)
- Wang Miao: Department of Probability and Statistics, Peking University, Beijing, PRC
- Wenjie Hu: Department of Probability and Statistics, Peking University, Beijing, PRC
- Elizabeth L. Ogburn: Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Xiaohua Zhou: Department of Biostatistics and Beijing International Center for Mathematical Research, Peking University, Beijing, PRC
6
Hörmann S, Jammoul F. Preprocessing noisy functional data: A multivariate perspective. Electron J Stat 2022. [DOI: 10.1214/22-ejs2083] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
Affiliation(s)
- Fatima Jammoul: Institute of Software Design and Security, FH JOANNEUM, Austria
7
McKennan C, Nicolae D. Estimating and accounting for unobserved covariates in high-dimensional correlated data. J Am Stat Assoc 2022; 117:225-236. [PMID: 35615339 PMCID: PMC9126075 DOI: 10.1080/01621459.2020.1769635] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 01/03/2023]
Abstract
Many high dimensional and high-throughput biological datasets have complex sample correlation structures, which include longitudinal and multiple tissue data, as well as data with multiple treatment conditions or related individuals. These data, as well as nearly all high-throughput 'omic' data, are influenced by technical and biological factors unknown to the researcher, which, if unaccounted for, can severely obfuscate estimation of and inference on the effects of interest. We therefore developed CBCV and CorrConf: provably accurate and computationally efficient methods to choose the number of and estimate latent confounding factors present in high dimensional data with correlated or nonexchangeable residuals. We demonstrate each method's superior performance compared to other state of the art methods by analyzing simulated multi-tissue gene expression data and identifying sex-associated DNA methylation sites in a real, longitudinal twin study.
Affiliation(s)
- Dan Nicolae: Department of Statistics, University of Chicago
8
9
Chen Y, Li X. Determining the number of factors in high-dimensional generalized latent factor models. Biometrika 2021. [DOI: 10.1093/biomet/asab044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022] Open
Abstract
As a generalization of the classical linear factor model, generalized latent factor models are useful for analysing multivariate data of different types, including binary choices and counts. This paper proposes an information criterion to determine the number of factors in generalized latent factor models. The consistency of the proposed information criterion is established under a high-dimensional setting, where both the sample size and the number of manifest variables grow to infinity, and data may have many missing values. An error bound is established for the parameter estimates, which plays an important role in establishing the consistency of the proposed information criterion. This error bound improves several existing results and may be of independent theoretical interest. We evaluate the proposed method by a simulation study and an application to Eysenck’s personality questionnaire.
Affiliation(s)
- Y Chen: Department of Statistics, London School of Economics and Political Science, Houghton Street, London, WC2A 2AE, U.K.
- X Li: School of Statistics, University of Minnesota, 224 Church Street SE, Minneapolis, Minnesota, 55455, U.S.A.
10
Dai F, Dutta S, Maitra R. A Matrix-Free Likelihood Method for Exploratory Factor Analysis of High-Dimensional Gaussian Data. J Comput Graph Stat 2020; 29:675-680. [DOI: 10.1080/10618600.2019.1704296] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
Affiliation(s)
- Fan Dai: Department of Statistics, Iowa State University, Ames, IA
- Somak Dutta: Department of Statistics, Iowa State University, Ames, IA
- Ranjan Maitra: Department of Statistics, Iowa State University, Ames, IA
11
McKennan C, Nicolae D. Accounting for unobserved covariates with varying degrees of estimability in high-dimensional biological data. Biometrika 2019; 106:823-840. [PMID: 31754283 DOI: 10.1093/biomet/asz037] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 01/23/2018] [Indexed: 12/18/2022] Open
Abstract
An important phenomenon in high-throughput biological data is the presence of unobserved covariates that can have a significant impact on the measured response. When these covariates are also correlated with the covariate of interest, ignoring or improperly estimating them can lead to inaccurate estimates of and spurious inference on the corresponding coefficients of interest in a multivariate linear model. We first prove that existing methods to account for these unobserved covariates often inflate Type I error for the null hypothesis that a given coefficient of interest is zero. We then provide alternative estimators for the coefficients of interest that correct the inflation, and prove that our estimators are asymptotically equivalent to the ordinary least squares estimators obtained when every covariate is observed. Lastly, we use previously published DNA methylation data to show that our method can more accurately estimate the direct effect of asthma on DNA methylation levels compared to existing methods, the latter of which likely fail to recover and account for latent cell type heterogeneity.
Affiliation(s)
- Chris McKennan: Department of Statistics, University of Chicago, 5747 S. Ellis Avenue, Chicago, Illinois, U.S.A.
- Dan Nicolae: Department of Statistics, University of Chicago, 5747 S. Ellis Avenue, Chicago, Illinois, U.S.A.
12
Chen Y, Li X, Zhang S. Structured Latent Factor Analysis for Large-scale Data: Identifiability, Estimability, and Their Implications. J Am Stat Assoc 2019. [DOI: 10.1080/01621459.2019.1635485] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 10/26/2022]
Affiliation(s)
- Yunxiao Chen: Department of Statistics, London School of Economics and Political Science, London, UK
- Xiaoou Li: School of Statistics, University of Minnesota, Minneapolis, MN
- Siliang Zhang: Shanghai Center for Mathematical Sciences, Fudan University, Shanghai, China
13
Hypothesis Tests for Principal Component Analysis When Variables are Standardized. Journal of Agricultural, Biological and Environmental Statistics 2019. [DOI: 10.1007/s13253-019-00355-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 10/27/2022]
14
Dobriban E, Owen AB. Deterministic parallel analysis: an improved method for selecting factors and principal components. J R Stat Soc Series B Stat Methodol 2018. [DOI: 10.1111/rssb.12301] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Indexed: 11/27/2022]
15
Abstract
We consider large-scale studies in which thousands of significance tests are performed simultaneously. In some of these studies, the multiple testing procedure can be severely biased by latent confounding factors such as batch effects and unmeasured covariates that correlate with both primary variable(s) of interest (e.g., treatment variable, phenotype) and the outcome. Over the past decade, many statistical methods have been proposed to adjust for the confounders in hypothesis testing. We unify these methods in the same framework, generalize them to include multiple primary variables and multiple nuisance variables, and analyze their statistical properties. In particular, we provide theoretical guarantees for RUV-4 [Gagnon-Bartsch, Jacob and Speed (2013)] and LEAPP [Ann. Appl. Stat. 6 (2012) 1664-1688], which correspond to two different identification conditions in the framework: the first requires a set of "negative controls" that are known a priori to follow the null distribution; the second requires the true nonnulls to be sparse. Two different estimators which are based on RUV-4 and LEAPP are then applied to these two scenarios. We show that if the confounding factors are strong, the resulting estimators can be asymptotically as powerful as the oracle estimator which observes the latent confounding factors. For hypothesis testing, we show the asymptotic z-tests based on the estimators can control the type I error. Numerical experiments show that the false discovery rate is also controlled by the Benjamini-Hochberg procedure when the sample size is reasonably large.
Affiliation(s)
- Jingshu Wang: Department of Statistics, The Wharton School, University of Pennsylvania, 400 Huntsman Hall, 3730 Walnut St, Philadelphia, Pennsylvania 19104, USA
- Qingyuan Zhao: Department of Statistics, The Wharton School, University of Pennsylvania, 400 Huntsman Hall, 3730 Walnut St, Philadelphia, Pennsylvania 19104, USA
- Trevor Hastie: Department of Statistics, Stanford University, 390 Serra Mall, Stanford, California 94305, USA
- Art B. Owen: Department of Statistics, Stanford University, 390 Serra Mall, Stanford, California 94305, USA
16
Shahdoust M, Pezeshk H, Mahjub H, Sadeghi M. F-MAP: A Bayesian approach to infer the gene regulatory network using external hints. PLoS One 2017; 12:e0184795. [PMID: 28938012 PMCID: PMC5609748 DOI: 10.1371/journal.pone.0184795] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 02/22/2017] [Accepted: 08/31/2017] [Indexed: 01/07/2023] Open
Abstract
The common topological features of gene regulatory networks in related species suggest that the network of one species can be reconstructed using additional information from the gene expression profiles of related species. We present an algorithm, named F-MAP, that reconstructs the gene regulatory network by applying knowledge about gene interactions from related species. Our algorithm sets up a Bayesian framework to estimate the precision matrix of one species' microarray gene expression dataset in order to infer the Gaussian graphical model of the network. The conjugate Wishart prior is used, and the information from related species is applied to estimate the hyperparameters of the prior distribution by factor analysis. Applying the proposed algorithm to six related species of Drosophila shows that the precision of the reconstructed networks is improved considerably compared with the precision of networks constructed by other Bayesian approaches.
Affiliation(s)
- Maryam Shahdoust: Department of Biostatistics, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran
- Hamid Pezeshk: School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, Iran
- Hossein Mahjub: Department of Biostatistics, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran
- Mehdi Sadeghi: National Institute of Genetic Engineering and Biotechnology, Tehran, Iran
17