1. Ma TF, Wang F, Zhu J. On generalized latent factor modeling and inference for high-dimensional binomial data. Biometrics 2023; 79:2311-2320. [PMID: 36200926] [DOI: 10.1111/biom.13768]
Abstract
We explore a hierarchical generalized latent factor model for discrete and bounded response variables, in particular binomial responses. Specifically, we develop a novel two-step estimation procedure, and the corresponding statistical inference, that is computationally efficient and scalable to high dimensions in both the number of subjects and the number of features per subject. We also establish the validity of the estimation procedure, particularly the asymptotic properties of the estimated effect size and latent structure, as well as the estimated number of latent factors. The results are corroborated by a simulation study, and for illustration the proposed methodology is applied to analyze a dataset from a gene-environment association study.
Affiliation(s)
- Ting Fung Ma
- Department of Statistics, University of South Carolina, Columbia, South Carolina, USA
- Fangfang Wang
- Department of Mathematical Sciences, Worcester Polytechnic Institute, Worcester, Massachusetts, USA
- Jun Zhu
- Department of Statistics, University of Wisconsin-Madison, Madison, Wisconsin, USA
2. Ye H, Zhang X, Wang C, Goode EL, Chen J. Batch-effect correction with sample remeasurement in highly confounded case-control studies. Nat Comput Sci 2023; 3:709-719. [PMID: 38177326] [PMCID: PMC10993308] [DOI: 10.1038/s43588-023-00500-8]
Abstract
Batch effects are pervasive in biomedical studies. One approach to addressing batch effects is to repeatedly measure a subset of samples in each batch; these remeasured samples are then used to estimate and correct the batch effects. However, rigorous statistical methods for batch-effect correction with remeasured samples are severely underdeveloped. Here we developed a framework for batch-effect correction using remeasured samples in highly confounded case-control studies. We provided theoretical analyses of the proposed procedure, evaluated its power characteristics, and provided a power calculation tool to aid in study design. We found that the number of samples that need to be remeasured depends strongly on the between-batch correlation: when the correlation is high, remeasuring a small subset of samples can rescue most of the power.
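The remeasurement idea can be illustrated with a simple additive batch-effect model (a sketch, not the authors' estimator; the additive assumption, the noise levels, and all names are ours): the per-feature batch effect is estimated from the paired remeasured samples and then subtracted from batch-2 measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
n, g, m = 30, 100, 8                  # samples, features, remeasured samples
shift = rng.normal(0, 2, size=g)      # true additive batch effect per feature

batch1 = rng.normal(size=(n, g))      # all samples measured in batch 1
# a subset of m samples is remeasured in batch 2 (small technical noise)
batch2_re = batch1[:m] + shift + 0.1 * rng.normal(size=(m, g))

# estimate the per-feature batch effect from the remeasured pairs
shift_hat = (batch2_re - batch1[:m]).mean(axis=0)

# correct batch-2 measurements by subtracting the estimate
batch2_corrected = batch2_re - shift_hat
```

With more remeasured samples, or higher between-batch correlation, `shift_hat` becomes more precise, which is the power trade-off the abstract describes.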
Affiliation(s)
- Hanxuan Ye
- Department of Statistics, Texas A&M University, College Station, TX, USA
- Xianyang Zhang
- Department of Statistics, Texas A&M University, College Station, TX, USA
- Chen Wang
- Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, USA
- Ellen L Goode
- Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, USA
- Jun Chen
- Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, USA
3. Huang C, Zhu H. Functional hybrid factor regression model for handling heterogeneity in imaging studies. Biometrika 2022; 109:1133-1148. [PMID: 36531154] [PMCID: PMC9754099] [DOI: 10.1093/biomet/asac007]
Abstract
This paper develops a functional hybrid factor regression modelling framework to handle the heterogeneity of many large-scale imaging studies, such as the Alzheimer's Disease Neuroimaging Initiative study. Despite the numerous successes of such imaging studies, heterogeneity may be caused by differences in study environment, population, design, protocols or other hidden factors, and it has posed major challenges for integrative analysis of imaging data collected from multiple centres or studies. We propose both estimation and inference procedures for estimating unknown parameters and detecting unknown factors under our new model. The asymptotic properties of both procedures are systematically investigated. The finite-sample performance of the proposed procedures is assessed using Monte Carlo simulations and a real data example on hippocampal surface data from the Alzheimer's disease study.
Affiliation(s)
- C Huang
- Department of Statistics, Florida State University, 117 N. Woodward Ave., Tallahassee, Florida 32304, U.S.A
- H Zhu
- Department of Biostatistics, The University of North Carolina at Chapel Hill, 135 Dauer Drive, Chapel Hill, North Carolina 27599, U.S.A
4. Guo Z, Ćevid D, Bühlmann P. Doubly debiased lasso: high-dimensional inference under hidden confounding. Ann Stat 2022; 50:1320-1347. [DOI: 10.1214/21-aos2152]
Affiliation(s)
- Zijian Guo
- Department of Statistics, Rutgers University
5. Bing X, Ning Y, Xu Y. Adaptive estimation in multivariate response regression with hidden variables. Ann Stat 2022. [DOI: 10.1214/21-aos2059]
Affiliation(s)
- Xin Bing
- Department of Statistics and Data Science, Cornell University
- Yang Ning
- Department of Statistics and Data Science, Cornell University
- Yaosheng Xu
- Department of Statistics and Data Science, Cornell University
6. Payne NY, Gagnon-Bartsch JA. Separating and reintegrating latent variables to improve classification of genomic data. Biostatistics 2022; 23:1133-1149. [DOI: 10.1093/biostatistics/kxab046]
Abstract
Genomic data sets contain the effects of various unobserved biological variables in addition to the variable of primary interest. These latent variables often affect a large number of features (e.g., genes), giving rise to dense latent variation. This latent variation presents both challenges and opportunities for classification. While some of these latent variables may be partially correlated with the phenotype of interest and thus helpful, others may be uncorrelated and merely contribute additional noise. Moreover, whether potentially helpful or not, these latent variables may obscure weaker effects that impact only a small number of features but more directly capture the signal of primary interest. To address these challenges, we propose the cross-residualization classifier (CRC). Through an adjustment and ensemble procedure, the CRC estimates and residualizes out the latent variation, trains a classifier on the residuals, and then reintegrates the latent variation in a final ensemble classifier. Thus, the latent variables are accounted for without discarding any potentially predictive information. We apply the method to simulated data and a variety of genomic data sets from multiple platforms. In general, we find that the CRC performs well relative to existing classifiers and sometimes offers substantial gains.
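The residualize-and-reintegrate idea can be sketched in a few lines of numpy. This is a deliberate simplification, not the CRC itself: the actual method uses a cross-residualization and ensemble procedure, while here the factor count, the nearest-centroid scorer, and the simulated data are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, g, k = 120, 60, 2
y = np.repeat([0, 1], n // 2)                        # class labels
latent = rng.normal(size=(n, k)) + 0.5 * y[:, None]  # dense latent variation
X = latent @ rng.normal(size=(k, g)) + rng.normal(size=(n, g))
X[:, :5] += 0.8 * y[:, None]                         # sparse direct signal

# estimate latent variation with PCA and residualize it out
Xc = X - X.mean(0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
L = U[:, :k] * s[:k]                                 # latent scores
resid = Xc - L @ Vt[:k]                              # residual features

def centroid_score(F, y):
    """Project onto the difference of class centroids: larger => class 1."""
    d = F[y == 1].mean(0) - F[y == 0].mean(0)
    return F @ d

# one classifier on the residuals, one on the latent scores, then combine,
# so the latent variation is reintegrated rather than discarded
score = centroid_score(resid, y) + centroid_score(L, y)
pred = (score > np.median(score)).astype(int)
```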
Affiliation(s)
- Nora Yujia Payne
- Department of Statistics, University of Michigan, 1085 S. University Ave., Ann Arbor, MI 48109, USA
- Johann A Gagnon-Bartsch
- Department of Statistics, University of Michigan, 1085 S. University Ave., Ann Arbor, MI 48109, USA
7. McKennan C, Nicolae D. Estimating and accounting for unobserved covariates in high-dimensional correlated data. J Am Stat Assoc 2022; 117:225-236. [PMID: 35615339] [PMCID: PMC9126075] [DOI: 10.1080/01621459.2020.1769635]
Abstract
Many high-dimensional and high-throughput biological datasets have complex sample correlation structures, including longitudinal and multiple-tissue data, as well as data with multiple treatment conditions or related individuals. These data, like nearly all high-throughput 'omic' data, are influenced by technical and biological factors unknown to the researcher, which, if unaccounted for, can severely obfuscate estimation of and inference on the effects of interest. We therefore developed CBCV and CorrConf: provably accurate and computationally efficient methods to choose the number of, and estimate, latent confounding factors present in high-dimensional data with correlated or nonexchangeable residuals. We demonstrate each method's superior performance compared with other state-of-the-art methods by analyzing simulated multi-tissue gene expression data and by identifying sex-associated DNA methylation sites in a real, longitudinal twin study.
Affiliation(s)
- Dan Nicolae
- Department of Statistics, University of Chicago
8. Jernigan R, Jia K, Ren Z, Zhou W. Large-scale multiple inference of collective dependence with applications to protein function. Ann Appl Stat 2021; 15:902-924. [DOI: 10.1214/20-aoas1431]
Affiliation(s)
- Robert Jernigan
- Department of Biochemistry, Biophysics, and Molecular Biology, Program of Bioinformatics and Computational Biology, Iowa State University
- Kejue Jia
- Department of Biochemistry, Biophysics, and Molecular Biology, Program of Bioinformatics and Computational Biology, Iowa State University
- Zhao Ren
- Department of Statistics, University of Pittsburgh
- Wen Zhou
- Department of Statistics, Colorado State University
9.
Abstract
BACKGROUND With the explosion in the number of methods designed to analyze bulk and single-cell RNA-seq data, there is a growing need for approaches that assess and compare these methods. The usual technique is to compare methods on data simulated according to some theoretical model. However, as real data often exhibit violations from theoretical models, this can result in unsubstantiated claims of a method's performance. RESULTS Rather than generate data from a theoretical model, in this paper we develop methods to add signal to real RNA-seq datasets. Since the resulting simulated data are not generated from an unrealistic theoretical model, they exhibit realistic (annoying) attributes of real data. This lets RNA-seq methods developers assess their procedures in non-ideal (model-violating) scenarios. Our procedures may be applied to both single-cell and bulk RNA-seq. We show that our simulation method results in more realistic datasets and can alter the conclusions of a differential expression analysis study. We also demonstrate our approach by comparing various factor analysis techniques on RNA-seq datasets. CONCLUSIONS Using data simulated from a theoretical model can substantially impact the results of a study. We developed more realistic simulation techniques for RNA-seq data. Our tools are available in the seqgendiff R package on the Comprehensive R Archive Network: https://cran.r-project.org/package=seqgendiff.
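The seqgendiff package adds signal to real counts by binomial thinning. A minimal numpy sketch of that idea follows; the function name, the simulated counts, and the restriction to non-positive log2 effects (so thinning probabilities stay in (0, 1]) are our illustrative choices, not the package's interface.

```python
import numpy as np

rng = np.random.default_rng(1)

def thin_counts(counts, log2_fold, design):
    """Add differential-expression signal to a real count matrix by
    binomial thinning.

    counts:    (genes, samples) nonnegative integer matrix
    log2_fold: (genes,) desired log2 effect per gene; must be <= 0 here
               so each thinning probability 2**(x_i * b_g) lies in (0, 1]
    design:    (samples,) 0/1 group indicator; group 1 is thinned
    """
    p = 2.0 ** np.outer(log2_fold, design)   # per-cell thinning probability
    return rng.binomial(counts, p)           # thinned counts keep real-data quirks

# toy example: 100 genes x 6 samples, first 10 genes halved in group 1
counts = rng.poisson(50, size=(100, 6))
beta = np.zeros(100)
beta[:10] = -1.0                             # 2-fold reduction
x = np.array([0, 0, 0, 1, 1, 1])
thinned = thin_counts(counts, beta, x)
```

Because the new counts are subsamples of real counts, they inherit the realistic attributes the abstract emphasizes.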
Affiliation(s)
- David Gerard
- Department of Mathematics and Statistics, American University, Massachusetts Ave NW, Washington, DC, 20016, USA.
10. Gerard D, Stephens M. Empirical Bayes shrinkage and false discovery rate estimation, allowing for unwanted variation. Biostatistics 2020; 21:15-32. [PMID: 29985984] [PMCID: PMC8204175] [DOI: 10.1093/biostatistics/kxy029]
Abstract
We combine two important ideas in the analysis of large-scale genomics experiments (e.g. experiments that aim to identify genes that are differentially expressed between two conditions). The first is use of Empirical Bayes (EB) methods to handle the large number of potentially-sparse effects, and estimate false discovery rates and related quantities. The second is use of factor analysis methods to deal with sources of unwanted variation such as batch effects and unmeasured confounders. We describe a simple modular fitting procedure that combines key ideas from both these lines of research. This yields new, powerful EB methods for analyzing genomics experiments that account for both sparse effects and unwanted variation. In realistic simulations, these new methods provide significant gains in power and calibration over competing methods. In real data analysis, we find that different methods, while often conceptually similar, can vary widely in their assessments of statistical significance. This highlights the need for care in both choice of methods and interpretation of results.
Affiliation(s)
- David Gerard
- Department of Human Genetics, Cummings Life Science Center, University of Chicago, 920 E 58th Street, Chicago, IL 60637, USA
- Matthew Stephens
- Department of Human Genetics, Cummings Life Science Center, University of Chicago, 920 E 58th Street, Chicago, IL 60637, USA and Department of Statistics, George Herbert Jones Laboratory, University of Chicago, 5747 S Ellis Avenue, Chicago, IL 60637, USA
11. McKennan C, Nicolae D. Accounting for unobserved covariates with varying degrees of estimability in high-dimensional biological data. Biometrika 2019; 106:823-840. [PMID: 31754283] [DOI: 10.1093/biomet/asz037]
Abstract
An important phenomenon in high-throughput biological data is the presence of unobserved covariates that can have a significant impact on the measured response. When these covariates are also correlated with the covariate of interest, ignoring or improperly estimating them can lead to inaccurate estimates of and spurious inference on the corresponding coefficients of interest in a multivariate linear model. We first prove that existing methods to account for these unobserved covariates often inflate Type I error for the null hypothesis that a given coefficient of interest is zero. We then provide alternative estimators for the coefficients of interest that correct the inflation, and prove that our estimators are asymptotically equivalent to the ordinary least squares estimators obtained when every covariate is observed. Lastly, we use previously published DNA methylation data to show that our method can more accurately estimate the direct effect of asthma on DNA methylation levels compared to existing methods, the latter of which likely fail to recover and account for latent cell type heterogeneity.
Affiliation(s)
- Chris McKennan
- Department of Statistics, University of Chicago, 5747 S. Ellis Avenue, Chicago, Illinois, U.S.A
- Dan Nicolae
- Department of Statistics, University of Chicago, 5747 S. Ellis Avenue, Chicago, Illinois, U.S.A
12. Zhou W, Koudijs KKM, Böhringer S. Influence of batch effect correction methods on drug induced differential gene expression profiles. BMC Bioinformatics 2019; 20:437. [PMID: 31438848] [PMCID: PMC6706913] [DOI: 10.1186/s12859-019-3028-6]
Abstract
Background Batch effects were not accounted for in most studies of computational drug repositioning based on gene expression signatures, and it is unknown how batch-effect removal methods impact the results of signature-based drug repositioning. Herein, we conducted differential analyses on the Connectivity Map (CMAP) database using several batch-effect correction methods, to evaluate and compare their influence on computational drug repositioning with microarray data. Results Differences in average signature size were observed across methods. The gene signatures identified by the Latent Effect Adjustment after Primary Projection (LEAPP) method and by the methods fitted with the Linear Models for Microarray Data (limma) software showed little agreement. The external validity of the gene signatures was evaluated by connectivity mapping between the CMAP database and the Library of Integrated Network-based Cellular Signatures (LINCS) database. The results indicate that the genes identified were not reliable for drugs with total sample size (drug plus control samples) smaller than 40, irrespective of the batch-effect correction method applied. With total sample size larger than 40, the methods correcting for batch effects produced significantly better results than no batch-effect correction. In a simulation study, power was generally low for simulated data with sample size smaller than 40; we observed the best performance when using limma with two principal components as covariates. Conclusion Batch-effect correction methods strongly impact differential gene expression analysis, and thus downstream drug repositioning, when the sample size is large enough to contain sufficient information. We recommend including two or three principal components as covariates when fitting models with limma, provided the sample size is sufficient (larger than 40 drug and control samples combined).
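The recommendation above — include a few principal components as covariates in the per-gene linear model — can be sketched with ordinary least squares standing in for limma's empirical-Bayes machinery. This is an illustrative simplification: the simulated data, the number of PCs, and all names are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, g = 60, 500
treat = np.repeat([0.0, 1.0], n // 2)             # treatment indicator
hidden = rng.normal(size=n)                       # unmeasured batch-like factor
Y = np.outer(hidden, rng.normal(size=g)) + rng.normal(size=(n, g))
Y[:, :25] += np.outer(treat, np.ones(25))         # true effects in 25 genes

# top two principal components of the centred expression matrix
Yc = Y - Y.mean(0)
U, s, _ = np.linalg.svd(Yc, full_matrices=False)
pcs = U[:, :2]

# per-gene linear model: intercept + treatment + two PCs as covariates
X = np.column_stack([np.ones(n), treat, pcs])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
treat_effect = beta[1]                            # estimated treatment effect per gene
```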
Affiliation(s)
- Wei Zhou
- Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands; Department of Internal Medicine, Erasmus Medical Center, Rotterdam, The Netherlands
- Karel K M Koudijs
- Department of Clinical Pharmacy & Toxicology, Leiden University Medical Center, Leiden, The Netherlands
- Stefan Böhringer
- Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands
13. Hornstein M, Fan R, Shedden K, Zhou S. Joint mean and covariance estimation with unreplicated matrix-variate data. J Am Stat Assoc 2019. [DOI: 10.1080/01621459.2018.1429275]
Affiliation(s)
- Roger Fan
- Department of Statistics, University of Michigan, Ann Arbor, MI
- Kerby Shedden
- Department of Statistics, University of Michigan, Ann Arbor, MI
- Shuheng Zhou
- Department of Statistics, University of Michigan, Ann Arbor, MI
- Department of Statistics, University of California, Riverside, CA
14. Dahl A, Guillemot V, Mefford J, Aschard H, Zaitlen N. Adjusting for principal components of molecular phenotypes induces replicating false positives. Genetics 2019; 211:1179-1189. [PMID: 30692194] [PMCID: PMC6456307] [DOI: 10.1534/genetics.118.301768]
Abstract
High-throughput measurements of molecular phenotypes provide an unprecedented opportunity to model cellular processes and their impact on disease. These highly structured datasets are usually strongly confounded, creating false positives and reducing power. This has motivated many approaches based on principal components analysis (PCA) to estimate and correct for confounders, which have become indispensable elements of association tests between molecular phenotypes and both genetic and nongenetic factors. Here, we show that these correction approaches induce a bias, and that it persists for large sample sizes and replicates out-of-sample. We prove this theoretically for PCA by deriving an analytic, deterministic, and intuitive bias approximation. We assess other methods with realistic simulations, which show that perturbing any of several basic parameters can cause false positive rate (FPR) inflation. Our experiments show the bias depends on covariate and confounder sparsity, effect sizes, and their correlation. Surprisingly, when the covariate and confounder have [Formula: see text], standard two-step methods all have [Formula: see text]-fold FPR inflation. Our analysis informs best practices for confounder correction in genomic studies, and suggests many false discoveries have been made and replicated in some differential expression analyses.
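The standard two-step procedure the authors analyze can be reproduced in a few lines. This sketch simulates a null (the covariate has no true effect on the phenotypes, but is correlated with the confounders); parameter choices and names are our illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, g, k = 200, 300, 3
confounder = rng.normal(size=(n, k))
# covariate correlated with the confounding, as in the paper's setting
covariate = 0.7 * confounder[:, 0] + 0.5 * rng.normal(size=n)
# null phenotypes: driven by confounders only, no covariate effect
Y = confounder @ rng.normal(size=(k, g)) + rng.normal(size=(n, g))

# step 1: estimate confounders as the top principal components of Y
Yc = Y - Y.mean(0)
U, s, _ = np.linalg.svd(Yc, full_matrices=False)
pc_hat = U[:, :k]

# step 2: per-feature regression of Y on the covariate, adjusting for PCs
X = np.column_stack([np.ones(n), covariate, pc_hat])
beta, rss, *_ = np.linalg.lstsq(X, Yc, rcond=None)
df = n - X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
se = np.sqrt(rss / df * XtX_inv[1, 1])
z = beta[1] / se                                  # per-feature test statistics
```

Under the paper's analysis, statistics from this pipeline are subtly biased rather than exactly standard normal, which is why the FPR can inflate and replicate out-of-sample.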
Affiliation(s)
- Andy Dahl
- Department of Medicine, University of California San Francisco, 94158 California
- Vincent Guillemot
- Centre de Bioinformatique, Biostatistique et Biologie Intégrative, Institut Pasteur, Paris, 75015 France
- Joel Mefford
- Department of Medicine, University of California San Francisco, 94158 California
- Hugues Aschard
- Centre de Bioinformatique, Biostatistique et Biologie Intégrative, Institut Pasteur, Paris, 75015 France
- Department of Epidemiology, Harvard TH Chan School of Public Health, Boston, 02115 Massachusetts
- Noah Zaitlen
- Department of Medicine, University of California San Francisco, 94158 California
15. Hung H. A robust removing unwanted variation-testing procedure via γ-divergence. Biometrics 2018; 75:650-662. [PMID: 30430537] [DOI: 10.1111/biom.13002]
Abstract
Identification of differentially expressed (DE) genes is commonly conducted in modern biomedical research. However, unwanted variation inevitably arises during the data collection process and can heavily bias the detection results. Various methods have been suggested for removing the unwanted variation while keeping the biological variation, to ensure reliable analysis results. Removing unwanted variation (RUV) has recently been proposed for this purpose, and works by virtue of negative control genes. On the other hand, outliers frequently appear in modern high-throughput genetic data and can heavily affect the performance of RUV and its downstream analysis. In this work, we propose a robust RUV-testing procedure via γ-divergence: a robust RUV procedure to remove unwanted variation, followed by a robust testing procedure to identify DE genes. The advantages of our method are twofold: (a) it does not involve any modeling of the outlier distribution, which makes it applicable to various situations; (b) it is easy to implement, in the sense that its robustness is controlled by a single tuning parameter γ of γ-divergence, and a data-driven criterion is developed to select γ. When applied to real datasets, our method can successfully remove unwanted variation and identify more DE genes than conventional methods.
Affiliation(s)
- Hung Hung
- Institute of Epidemiology and Preventive Medicine, College of Public Health, National Taiwan University, Taipei, Taiwan
16. Dobriban E, Owen AB. Deterministic parallel analysis: an improved method for selecting factors and principal components. J R Stat Soc Series B Stat Methodol 2018. [DOI: 10.1111/rssb.12301]
17.
Affiliation(s)
- Qingyuan Zhao
- Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA
18. Guillaume B, Wang C, Poh J, Shen MJ, Ong ML, Tan PF, Karnani N, Meaney M, Qiu A. Improving mass-univariate analysis of neuroimaging data by modelling important unknown covariates: application to epigenome-wide association studies. Neuroimage 2018; 173:57-71. [PMID: 29448075] [DOI: 10.1016/j.neuroimage.2018.01.073]
Abstract
Statistical inference on neuroimaging data is often conducted using a mass-univariate model, equivalent to fitting a linear model at every voxel with a known set of covariates. Due to the large number of linear models, it is challenging to check if the selection of covariates is appropriate and to modify this selection adequately. The use of standard diagnostics, such as residual plotting, is clearly not practical for neuroimaging data. However, the selection of covariates is crucial for linear regression to ensure valid statistical inference. In particular, the mean model of regression needs to be reasonably well specified. Unfortunately, this issue is often overlooked in the field of neuroimaging. This study aims to adopt the existing Confounder Adjusted Testing and Estimation (CATE) approach and to extend it for use with neuroimaging data. We propose a modification of CATE that can yield valid statistical inferences using Principal Component Analysis (PCA) estimators instead of Maximum Likelihood (ML) estimators. We then propose a non-parametric hypothesis testing procedure that can improve upon parametric testing. Monte Carlo simulations show that the modification of CATE allows for more accurate modelling of neuroimaging data and can in turn yield a better control of False Positive Rate (FPR) and Family-Wise Error Rate (FWER). We demonstrate its application to an Epigenome-Wide Association Study (EWAS) on neonatal brain imaging and umbilical cord DNA methylation data obtained as part of a longitudinal cohort study. Software for this CATE study is freely available at http://www.bioeng.nus.edu.sg/cfa/Imaging_Genetics2.html.
Affiliation(s)
- Bryan Guillaume
- Department of Biomedical Engineering, National University of Singapore, Singapore
- Changqing Wang
- Department of Biomedical Engineering, National University of Singapore, Singapore
- Joann Poh
- Department of Biomedical Engineering, National University of Singapore, Singapore; Singapore Institute for Clinical Sciences, Agency for Science, Technology, and Research, Singapore
- Mo Jun Shen
- Department of Biomedical Engineering, National University of Singapore, Singapore; Singapore Institute for Clinical Sciences, Agency for Science, Technology, and Research, Singapore
- Mei Lyn Ong
- Singapore Institute for Clinical Sciences, Agency for Science, Technology, and Research, Singapore
- Pei Fang Tan
- Singapore Institute for Clinical Sciences, Agency for Science, Technology, and Research, Singapore
- Neerja Karnani
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, 119228, Singapore; Singapore Institute for Clinical Sciences, Agency for Science, Technology, and Research, Singapore
- Michael Meaney
- Ludmer Centre for Neuroinformatics and Mental Health, Douglas Mental Health University Institute, McGill University, Canada; Sackler Program for Epigenetics and Psychobiology at McGill University, Canada; Singapore Institute for Clinical Sciences, Agency for Science, Technology, and Research, Singapore
- Anqi Qiu
- Department of Biomedical Engineering, National University of Singapore, Singapore; Clinical Imaging Research Centre, National University of Singapore, Singapore; Singapore Institute for Clinical Sciences, Agency for Science, Technology, and Research, Singapore
19. Controlling for confounding effects in single cell RNA sequencing studies using both control and target genes. Sci Rep 2017; 7:13587. [PMID: 29051597] [PMCID: PMC5648789] [DOI: 10.1038/s41598-017-13665-w]
Abstract
The single-cell RNA sequencing (scRNAseq) technique is becoming increasingly popular for unbiased, high-resolution transcriptome analysis of heterogeneous cell populations. Despite its many advantages, scRNAseq, like any other genomic sequencing technique, is susceptible to confounding effects, and controlling for them is a crucial step for accurate downstream analysis. Here, we present a novel statistical method, which we refer to as scPLS (single cell partial least squares), for robust and accurate inference of confounding effects. scPLS takes advantage of the fact that genes in a scRNAseq study can often be naturally classified into two sets: a control set of genes that are free of effects of the predictor variables, and a target set of genes that are of primary interest. By modeling the two sets of genes jointly using partial least squares regression, scPLS makes full use of the data to improve the inference of confounding effects. With extensive simulations and comparisons with other methods, we demonstrate the effectiveness of scPLS. Finally, we apply scPLS to two scRNAseq datasets to illustrate its benefits in removing technical confounding effects as well as cell cycle effects.
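scPLS fits the two gene sets jointly by partial least squares. As a simpler illustration of why control genes help, one can estimate the confounders by PCA on the control set alone and regress them out of the target set — an RUV-style sketch, not scPLS itself; the simulated data and all names are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, g_ctrl, g_tgt, k = 100, 50, 200, 2
Z = rng.normal(size=(n, k))                          # hidden confounders
ctrl = Z @ rng.normal(size=(k, g_ctrl)) + 0.3 * rng.normal(size=(n, g_ctrl))
tgt = Z @ rng.normal(size=(k, g_tgt)) + rng.normal(size=(n, g_tgt))

# estimate confounders from control genes (assumed free of primary effects)
U, s, _ = np.linalg.svd(ctrl - ctrl.mean(0), full_matrices=False)
z_hat = U[:, :k]

# remove their estimated effect from the target genes by least squares
coef, *_ = np.linalg.lstsq(z_hat, tgt - tgt.mean(0), rcond=None)
tgt_clean = (tgt - tgt.mean(0)) - z_hat @ coef
```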
20.
Abstract
We consider large-scale studies in which thousands of significance tests are performed simultaneously. In some of these studies, the multiple testing procedure can be severely biased by latent confounding factors such as batch effects and unmeasured covariates that correlate with both primary variable(s) of interest (e.g., treatment variable, phenotype) and the outcome. Over the past decade, many statistical methods have been proposed to adjust for the confounders in hypothesis testing. We unify these methods in the same framework, generalize them to include multiple primary variables and multiple nuisance variables, and analyze their statistical properties. In particular, we provide theoretical guarantees for RUV-4 [Gagnon-Bartsch, Jacob and Speed (2013)] and LEAPP [Ann. Appl. Stat. 6 (2012) 1664-1688], which correspond to two different identification conditions in the framework: the first requires a set of "negative controls" that are known a priori to follow the null distribution; the second requires the true nonnulls to be sparse. Two different estimators which are based on RUV-4 and LEAPP are then applied to these two scenarios. We show that if the confounding factors are strong, the resulting estimators can be asymptotically as powerful as the oracle estimator which observes the latent confounding factors. For hypothesis testing, we show the asymptotic z-tests based on the estimators can control the type I error. Numerical experiments show that the false discovery rate is also controlled by the Benjamini-Hochberg procedure when the sample size is reasonably large.
Collapse
Affiliation(s)
- Jingshu Wang
- Department of Statistics, The Wharton School, University of Pennsylvania, 400 Huntsman Hall, 3730 Walnut St, Philadelphia, Pennsylvania 19104, USA
| | - Qingyuan Zhao
- Department of Statistics, The Wharton School, University of Pennsylvania, 400 Huntsman Hall, 3730 Walnut St, Philadelphia, Pennsylvania 19104, USA
| | - Trevor Hastie
- Department of Statistics, Stanford University, 390 Serra Mall, Stanford, California 94305, USA
| | - Art B. Owen
- Department of Statistics, Stanford University, 390 Serra Mall, Stanford, California 94305, USA
| |
Collapse
|
21
|
Lee S, Sun W, Wright FA, Zou F. An improved and explicit surrogate variable analysis procedure by coefficient adjustment. Biometrika 2017; 104:303-316. [PMID: 29430031 PMCID: PMC5627626 DOI: 10.1093/biomet/asx018] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2015] [Indexed: 01/31/2023] Open
Abstract
Unobserved environmental, demographic and technical factors can adversely affect the estimation and testing of the effects of primary variables. Surrogate variable analysis, proposed to tackle this problem, has been widely used in genomic studies. To estimate hidden factors that are correlated with the primary variables, surrogate variable analysis performs principal component analysis either on a subset of features or on all features, but weighting each differently. However, existing approaches may fail to identify hidden factors that are strongly correlated with the primary variables, and the extra step of feature selection and weight calculation makes the theoretical investigation of surrogate variable analysis challenging. In this paper, we propose an improved surrogate variable analysis, using all measured features, that has a natural connection with restricted least squares, which allows us to study its theoretical properties. Simulation studies and real-data analysis show that the method is competitive with state-of-the-art methods.
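A bare-bones version of the residual-PCA step that classical surrogate variable analysis builds on can be sketched as follows. This is a simplified illustration only, not the coefficient-adjusted procedure proposed in the paper, and all variable names are hypothetical.

```python
import numpy as np

def surrogate_variables(Y, x, n_sv=1):
    """Estimate surrogate variables as the top principal components of
    the residuals after regressing each feature on the primary variable."""
    X = np.column_stack([np.ones(len(x)), x])        # design with intercept
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)     # per-feature OLS fits
    resid = Y - X @ beta                             # remove primary effects
    u, _, _ = np.linalg.svd(resid, full_matrices=False)
    return u[:, :n_sv]                               # samples x n_sv

# Toy data: expression driven by a primary variable plus a hidden factor
rng = np.random.default_rng(1)
n, p = 200, 100
x = rng.normal(size=n)                 # primary variable
h = rng.normal(size=n)                 # hidden factor to recover
Y = (np.outer(x, rng.normal(size=p))
     + np.outer(h, 2.0 * rng.normal(size=p))
     + 0.5 * rng.normal(size=(n, p)))
sv = surrogate_variables(Y, x, n_sv=1)
```

The recovered surrogate variable can then be included as a covariate in the downstream per-feature regressions to absorb the hidden factor's effect.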
Collapse
Affiliation(s)
- Seunggeun Lee
- Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, Michigan 48109,
| | - Wei Sun
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., Seattle, Washington 98109,
| | - Fred A Wright
- Bioinformatics Research Center, North Carolina State University, 1 Lampe Drive, Raleigh, North Carolina 27607,
| | - Fei Zou
- Department of Biostatistics, University of Florida, 2004 Mowry Rd, Gainesville, Florida 32611,
| |
Collapse
|
22
|
Du L, Zhang C. Estimation of false discovery proportion in multiple testing: From normal to chi-squared test statistics. Electron J Stat 2017. [DOI: 10.1214/17-ejs1256] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
23
|
Sheu CF, Perthame É, Lee YS, Causeur D. Accounting for time dependence in large-scale multiple testing of event-related potential data. Ann Appl Stat 2016. [DOI: 10.1214/15-aoas888] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
25
|
Delattre S, Roquain E. On empirical distribution function of high-dimensional Gaussian vector components with an application to multiple testing. BERNOULLI 2016. [DOI: 10.3150/14-bej659] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
26
|
Blum Y, Houée-Bigot M, Causeur D. Sparse factor model for co-expression networks with an application using prior biological knowledge. Stat Appl Genet Mol Biol 2016; 15:253-72. [DOI: 10.1515/sagmb-2015-0002] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
Inference on gene regulatory networks from high-throughput expression data turns out to be one of the main current challenges in systems biology. Such networks can be very insightful for a deep understanding of interactions between genes. Because gene-gene interactions are often viewed as joint contributions to known biological mechanisms, inference on the dependence among gene expressions is expected to be consistent to some extent with the functional characterization of genes that can be derived from ontologies (GO, KEGG, …). The present paper introduces a sparse factor model as a general framework either to account for prior knowledge on joint contributions of modules of genes to latent biological processes or to infer the corresponding co-expression network. We propose an
Collapse
|
27
|
Jiang Y, Oldridge DA, Diskin SJ, Zhang NR. CODEX: a normalization and copy number variation detection method for whole exome sequencing. Nucleic Acids Res 2015; 43:e39. [PMID: 25618849 PMCID: PMC4381046 DOI: 10.1093/nar/gku1363] [Citation(s) in RCA: 95] [Impact Index Per Article: 10.6] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/13/2014] [Accepted: 12/19/2014] [Indexed: 01/24/2023] Open
Abstract
High-throughput sequencing of DNA coding regions has become a common way of assaying genomic variation in the study of human diseases. Copy number variation (CNV) is an important type of genomic variation, but detecting and characterizing CNV from exome sequencing is challenging due to the high level of biases and artifacts. We propose CODEX, a normalization and CNV calling procedure for whole exome sequencing data. The Poisson latent factor model in CODEX includes terms that specifically remove biases due to GC content, exon capture and amplification efficiency, and latent systemic artifacts. CODEX also includes a Poisson likelihood-based recursive segmentation procedure that explicitly models the count-based exome sequencing data. CODEX is compared to existing methods on a population analysis of HapMap samples from the 1000 Genomes Project, and shown to be more accurate on three microarray-based validation data sets. We further evaluate performance on 222 neuroblastoma samples with matched normals and focus on a well-studied rare somatic CNV within the ATRX gene. We show that the cross-sample normalization procedure of CODEX removes more noise than normalizing the tumor against the matched normal and that the segmentation procedure performs well in detecting CNVs with nested structures.
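As an intuition for the kind of GC-content adjustment that CODEX performs within its Poisson latent factor model, here is a much simpler median-binning normalization. It is a hedged sketch only; `gc_normalize` is not part of CODEX, and the data below are synthetic.

```python
import numpy as np

def gc_normalize(counts, gc, n_bins=10):
    """Divide each exon's read count by the median count of exons with
    similar GC fraction, rescaled so the overall median is preserved.
    A toy stand-in for model-based GC-bias correction."""
    counts = np.asarray(counts, dtype=float)
    # Quantile bin edges over the GC fractions
    edges = np.quantile(gc, np.linspace(0, 1, n_bins + 1))
    which = np.clip(np.digitize(gc, edges[1:-1]), 0, n_bins - 1)
    overall = np.median(counts)
    out = np.empty_like(counts)
    for b in range(n_bins):
        idx = which == b
        med = np.median(counts[idx]) if idx.any() else overall
        out[idx] = counts[idx] * (overall / med)
    return out

# Synthetic exons whose coverage rises with GC fraction
rng = np.random.default_rng(2)
gc = rng.uniform(0.3, 0.7, size=2000)      # GC fraction per exon
depth = rng.poisson(100 * (0.5 + gc))      # GC-biased coverage
norm = gc_normalize(depth, gc)
```

After normalization, the residual correlation between coverage and GC fraction should be much weaker than in the raw counts, which is the prerequisite for calling copy number from coverage ratios.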
Collapse
Affiliation(s)
- Yuchao Jiang
- Genomics and Computational Biology Graduate Program, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Derek A Oldridge
- Medical Scientist Training Program, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Division of Oncology and Center for Childhood Cancer Research, The Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA; Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Sharon J Diskin
- Division of Oncology and Center for Childhood Cancer Research, The Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA; Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Abramson Family Cancer Research Institute, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Nancy R Zhang
- Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA
| |
Collapse
|