1
Chen Z, He Z, Chu BB, Gu J, Morrison T, Sabatti C, Candès E. Controlled Variable Selection from Summary Statistics Only? A Solution via Ghost Knockoffs and Penalized Regression. ArXiv 2024:arXiv:2402.12724v1. [PMID: 38463500 PMCID: PMC10925382] [Indexed: 03/12/2024]
Abstract
Identifying which variables do influence a response while controlling false positives pervades statistics and data science. In this paper, we consider a scenario in which we only have access to summary statistics, such as the values of marginal empirical correlations between each dependent variable of potential interest and the response. This situation may arise due to privacy concerns, e.g., to avoid the release of sensitive genetic information. We extend GhostKnockoffs (He et al., 2022) and introduce variable selection methods based on penalized regression achieving false discovery rate (FDR) control. We report empirical results in extensive simulation studies, demonstrating enhanced performance over previous work. We also apply our methods to genome-wide association studies of Alzheimer's disease, and evidence a significant improvement in power.
Affiliation(s)
- Zihuai He
- Department of Neurology and Neurological Sciences, Stanford University
- Department of Medicine (Biomedical Informatics Research), Stanford University
- Benjamin B Chu
- Department of Biomedical Data Science, Stanford University
- Jiaqi Gu
- Department of Neurology and Neurological Sciences, Stanford University
- Chiara Sabatti
- Department of Statistics, Stanford University
- Department of Biomedical Data Science, Stanford University
- Emmanuel Candès
- Department of Statistics, Stanford University
- Department of Mathematics, Stanford University
2
Luo D, Ebadi A, Emery K, He Y, Noble WS, Keich U. Competition-based control of the false discovery proportion. Biometrics 2023; 79:3472-3484. [PMID: 36652258 DOI: 10.1111/biom.13830] [Received: 03/31/2022] [Revised: 10/12/2022] [Accepted: 01/02/2023] [Indexed: 01/19/2023]
Abstract
Recently, Barber and Candès laid the theoretical foundation for a general framework for false discovery rate (FDR) control based on the notion of "knockoffs." A closely related FDR control methodology has long been employed in the analysis of mass spectrometry data, referred to there as "target-decoy competition" (TDC). However, any approach that aims to control the FDR, which is defined as the expected value of the false discovery proportion (FDP), suffers from a problem. Specifically, even when successfully controlling the FDR at level α, the FDP in the list of discoveries can significantly exceed α. We offer FDP-SD, a new procedure that rigorously controls the FDP in the knockoff/TDC competition setup by guaranteeing that the FDP is bounded by α at a desired confidence level. Compared with the recently published framework of Katsevich and Ramdas, FDP-SD generally delivers more power and often substantially so in simulated and real data.
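The distinction this abstract draws between the FDR (an expectation) and the realized FDP of one list of discoveries can be illustrated with a minimal sketch. It assumes the standard knockoff+ threshold of Barber and Candès applied to competition statistics `w` (positive when the original or target wins, negative when the knockoff or decoy wins). It is not the FDP-SD procedure, and the function names are hypothetical, chosen only for illustration.

```python
import numpy as np


def knockoff_plus_threshold(w, q):
    """Smallest t among the nonzero |w_j| such that
    (1 + #{j: w_j <= -t}) / max(1, #{j: w_j >= t}) <= q.
    Returns np.inf when no threshold qualifies (nothing is selected)."""
    w = np.asarray(w, dtype=float)
    candidates = np.unique(np.abs(w[w != 0]))  # np.unique returns sorted values
    for t in candidates:
        fdp_estimate = (1 + np.sum(w <= -t)) / max(1, np.sum(w >= t))
        if fdp_estimate <= q:
            return float(t)
    return np.inf


def select(w, q):
    """Indices whose statistic is at or above the knockoff+ threshold."""
    w = np.asarray(w, dtype=float)
    return np.flatnonzero(w >= knockoff_plus_threshold(w, q))


def false_discovery_proportion(selected, is_null):
    """Realized FDP of one selection; needs ground truth, so it is only
    computable in simulations. Defined as 0 when nothing is selected."""
    selected = np.asarray(selected, dtype=int)
    if selected.size == 0:
        return 0.0
    return float(np.mean(np.asarray(is_null, dtype=bool)[selected]))
```

In a simulation where the null variables are known, averaging `false_discovery_proportion` over many repetitions approximates the FDR, while any single repetition may return a value above `q`, which is the gap the abstract describes.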
Affiliation(s)
- Dong Luo
- School of Mathematics and Statistics, University of Sydney, New South Wales, Australia
- Arya Ebadi
- School of Mathematics and Statistics, University of Sydney, New South Wales, Australia
- Kristen Emery
- School of Mathematics and Statistics, University of Sydney, New South Wales, Australia
- Yilun He
- School of Mathematics and Statistics, University of Sydney, New South Wales, Australia
- Uri Keich
- School of Mathematics and Statistics, University of Sydney, New South Wales, Australia
3
Zhao T, Zhu G, Dubey HV, Flaherty P. Identification of significant gene expression changes in multiple perturbation experiments using knockoffs. Brief Bioinform 2023; 24:bbad084. [PMID: 36892174 PMCID: PMC10025447 DOI: 10.1093/bib/bbad084] [Received: 12/05/2022] [Revised: 01/20/2023] [Accepted: 02/13/2023] [Indexed: 03/10/2023]
Abstract
Large-scale multiple perturbation experiments have the potential to reveal a more detailed understanding of the molecular pathways that respond to genetic and environmental changes. A key question in these studies is which gene expression changes are important for the response to the perturbation. This problem is challenging because (i) the functional form of the nonlinear relationship between gene expression and the perturbation is unknown and (ii) identification of the most important genes is a high-dimensional variable selection problem. To deal with these challenges, we present here a method based on the model-X knockoffs framework and Deep Neural Networks to identify significant gene expression changes in multiple perturbation experiments. This approach makes no assumptions on the functional form of the dependence between the responses and the perturbations and it enjoys finite sample false discovery rate control for the selected set of important gene expression responses. We apply this approach to the Library of Integrated Network-Based Cellular Signature data sets which is a National Institutes of Health Common Fund program that catalogs how human cells globally respond to chemical, genetic and disease perturbations. We identified important genes whose expression is directly modulated in response to perturbation with anthracycline, vorinostat, trichostatin-a, geldanamycin and sirolimus. We compare the set of important genes that respond to these small molecules to identify co-responsive pathways. Identification of which genes respond to specific perturbation stressors can provide better understanding of the underlying mechanisms of disease and advance the identification of new drug targets.
Affiliation(s)
- Tingting Zhao
- Department of Information Systems and Analytics, College of Business, Bryant University, Smithfield, 02917, RI, USA
- Center for Health and Behavioral Sciences, Bryant University, Smithfield, 02917, RI, USA
- Guangyu Zhu
- Department of Computer Science and Statistics, University of Rhode Island, Kingston, 02881, RI, USA
- Harsh Vardhan Dubey
- Department of Mathematics & Statistics, University of Massachusetts Amherst, Amherst, 01003, MA, USA
- Patrick Flaherty
- Department of Mathematics & Statistics, University of Massachusetts Amherst, Amherst, 01003, MA, USA
5
Sesia M, Bates S, Candès E, Marchini J, Sabatti C. False discovery rate control in genome-wide association studies with population structure. Proc Natl Acad Sci U S A 2021; 118:e2105841118. [PMID: 34580220 PMCID: PMC8501795 DOI: 10.1073/pnas.2105841118] [Accepted: 08/18/2021] [Indexed: 12/25/2022]
Abstract
We present a comprehensive statistical framework to analyze data from genome-wide association studies of polygenic traits, producing interpretable findings while controlling the false discovery rate. In contrast with standard approaches, our method can leverage sophisticated multivariate algorithms but makes no parametric assumptions about the unknown relation between genotypes and phenotype. Instead, we recognize that genotypes can be considered as a random sample from an appropriate model, encapsulating our knowledge of genetic inheritance and human populations. This allows the generation of imperfect copies (knockoffs) of these variables that serve as ideal negative controls, correcting for linkage disequilibrium and accounting for unknown population structure, which may be due to diverse ancestries or familial relatedness. The validity and effectiveness of our method are demonstrated by extensive simulations and by applications to the UK Biobank data. These analyses confirm our method is powerful relative to state-of-the-art alternatives, while comparisons with other studies validate most of our discoveries. Finally, fast software is made available for researchers to analyze Biobank-scale datasets.
Affiliation(s)
- Matteo Sesia
- Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA 90089
- Stephen Bates
- Department of Statistics, University of California, Berkeley, CA 94720
- Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720
- Emmanuel Candès
- Department of Statistics, Stanford University, Stanford, CA 94305
- Department of Mathematics, Stanford University, Stanford, CA 94305
- Chiara Sabatti
- Department of Statistics, Stanford University, Stanford, CA 94305
- Department of Biomedical Data Sciences, Stanford University, Stanford, CA 94305
6
Zhu Z, Fan Y, Kong Y, Lv J, Sun F. DeepLINK: Deep learning inference using knockoffs with applications to genomics. Proc Natl Acad Sci U S A 2021; 118:e2104683118. [PMID: 34480002 PMCID: PMC8433583 DOI: 10.1073/pnas.2104683118] [Received: 07/29/2021] [Accepted: 07/16/2021] [Indexed: 11/18/2022]
Abstract
We propose a deep learning-based knockoffs inference framework, DeepLINK, that guarantees the false discovery rate (FDR) control in high-dimensional settings. DeepLINK is applicable to a broad class of covariate distributions described by the possibly nonlinear latent factor models. It consists of two major parts: an autoencoder network for the knockoff variable construction and a multilayer perceptron network for feature selection with the FDR control. The empirical performance of DeepLINK is investigated through extensive simulation studies, where it is shown to achieve FDR control in feature selection with both high selection power and high prediction accuracy. We also apply DeepLINK to three real data applications to demonstrate its practical utility.
Affiliation(s)
- Zifan Zhu
- Quantitative and Computational Biology Department, University of Southern California, Los Angeles, CA 90089
- Yingying Fan
- Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089
- Yinfei Kong
- Department of Information Systems and Decision Sciences, California State University, Fullerton, CA 92831
- Jinchi Lv
- Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089
- Fengzhu Sun
- Quantitative and Computational Biology Department, University of Southern California, Los Angeles, CA 90089
7
Xie F, Lederer J. Aggregating Knockoffs for False Discovery Rate Control with an Application to Gut Microbiome Data. Entropy (Basel) 2021; 23:230. [PMID: 33669462 DOI: 10.3390/e23020230] [Received: 01/20/2021] [Accepted: 02/11/2021] [Indexed: 12/31/2022]
Abstract
Recent discoveries suggest that our gut microbiome plays an important role in our health and wellbeing. However, the gut microbiome data are intricate; for example, the microbial diversity in the gut makes the data high-dimensional. While there are dedicated high-dimensional methods, such as the lasso estimator, they always come with the risk of false discoveries. Knockoffs are a recent approach to control the number of false discoveries. In this paper, we show that knockoffs can be aggregated to increase power while retaining sharp control over the false discoveries. We support our method both in theory and simulations, and we show that it can lead to new discoveries on microbiome data from the American Gut Project. In particular, our results indicate that several phyla that have been overlooked so far are associated with obesity.
8
Shen A, Fu H, He K, Jiang H. False Discovery Rate Control in Cancer Biomarker Selection Using Knockoffs. Cancers (Basel) 2019; 11:E744. [PMID: 31146393 DOI: 10.3390/cancers11060744] [Received: 05/10/2019] [Accepted: 05/23/2019] [Indexed: 11/17/2022]
Abstract
The discovery of biomarkers that are informative for cancer risk assessment, diagnosis, prognosis and treatment predictions is crucial. Recent advances in high-throughput genomics make it plausible to select biomarkers from the vast number of human genes in an unbiased manner. Yet, control of false discoveries is challenging given the large number of genes versus the relatively small number of patients in a typical cancer study. To ensure that most of the discoveries are true, we employ a knockoff procedure to control false discoveries. Our method is general and flexible, accommodating arbitrary covariate distributions, linear and nonlinear associations, and survival models. In simulations, our method compares favorably to the alternatives; its utility of identifying important genes in real clinical applications is demonstrated by the identification of seven genes associated with Breslow thickness in skin cutaneous melanoma patients.