1
Li C, Wang J, Wang P. Large-scale dependent multiple testing via higher-order hidden Markov models. J Biopharm Stat 2024:1-13. [PMID: 39494677 DOI: 10.1080/10543406.2024.2420657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2024] [Accepted: 10/17/2024] [Indexed: 11/05/2024]
Abstract
Taking into account the local dependence structure in large-scale multiple testing is expected to improve both the efficiency of the testing procedure and the interpretability of scientific findings. The hidden Markov model (HMM), as an effective model to describe the sequential dependence, has been successfully applied to large-scale multiple testing with local correlations. However, in many applications, the first-order Markov chain is not flexible enough to capture the complexity of local correlations. To address this issue, this paper proposes a novel multiple testing procedure that uses a higher-order Markov chain to better characterize local correlations among tests. The proposed procedure is validated by theoretical results and simulation studies, which show that it outperforms its competitors in terms of power. Finally, a real data analysis is presented to demonstrate the favorable performance of the proposed procedure.
Affiliation(s)
- Canhui Li
  - School of Mathematics and Statistics, Henan University, Kaifeng, China
- Jiangzhou Wang
  - College of Mathematics and Statistics, Institute of Statistical Sciences, Shenzhen University, Shenzhen, China
- Pengfei Wang
  - School of Statistics, Dongbei University of Finance and Economics, Dalian, China
2
Deng L, He K, Zhang X. Joint mirror procedure: controlling false discovery rate for identifying simultaneous signals. Biometrics 2024; 80:ujae142. [PMID: 39671277 PMCID: PMC11639532 DOI: 10.1093/biomtc/ujae142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2023] [Revised: 08/02/2024] [Accepted: 11/11/2024] [Indexed: 12/15/2024]
Abstract
In many applications, the process of identifying a specific feature of interest often involves testing multiple hypotheses for their joint statistical significance. Examples include mediation analysis, which simultaneously examines the existence of the exposure-mediator and the mediator-outcome effects, and replicability analysis, aiming to identify simultaneous signals that exhibit statistical significance across multiple independent studies. In this work, we present a new approach called the joint mirror (JM) procedure that effectively detects such features while maintaining false discovery rate (FDR) control in finite samples. The JM procedure employs an iterative method that gradually shrinks the rejection region based on progressively revealed information until a conservative estimate of the false discovery proportion is below the target FDR level. Additionally, we introduce a more stringent error measure known as the composite FDR (cFDR), which assigns weights to each false discovery based on its number of null components. We use the leave-one-out technique to prove that the JM procedure controls the cFDR in finite samples. To implement the JM procedure, we propose an efficient algorithm that can incorporate partial ordering information. Through extensive simulations, we show that our procedure effectively controls the cFDR and enhances statistical power across various scenarios, including the case that test statistics are dependent across the features. Finally, we showcase the utility of our method by applying it to real-world mediation and replicability analyses.
Affiliation(s)
- Linsui Deng
  - School of Data Science, The Chinese University of Hong Kong, Shenzhen 518172, China
  - Center for Applied Statistics, Institute of Statistics and Big Data, Renmin University of China, Beijing 100872, China
- Kejun He
  - Center for Applied Statistics, Institute of Statistics and Big Data, Renmin University of China, Beijing 100872, China
- Xianyang Zhang
  - Department of Statistics, Texas A&M University, College Station, TX 77843, United States
3
Zheng S, McLain AC, Habiger J, Rorden C, Fridriksson J. False Discovery Rate Control for Lesion-Symptom Mapping With Heterogeneous Data via Weighted p-Values. Biom J 2024; 66:e202300198. [PMID: 39162085 PMCID: PMC11420788 DOI: 10.1002/bimj.202300198] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2023] [Revised: 06/06/2024] [Accepted: 06/10/2024] [Indexed: 08/21/2024]
Abstract
Lesion-symptom mapping studies provide insight into what areas of the brain are involved in different aspects of cognition. This is commonly done via behavioral testing in patients with a naturally occurring brain injury or lesions (e.g., strokes or brain tumors). This results in high-dimensional observational data where lesion status (present/absent) is nonuniformly distributed, with some voxels having lesions in very few (or no) subjects. In this situation, mass univariate hypothesis tests have severe power heterogeneity where many tests are known a priori to have little to no power. Recent advancements in multiple testing methodologies allow researchers to weigh hypotheses according to side information (e.g., information on power heterogeneity). In this paper, we propose the use of p-value weighting for voxel-based lesion-symptom mapping studies. The weights are created using the distribution of lesion status and spatial information to estimate different non-null prior probabilities for each hypothesis test through some common approaches. We provide a monotone minimum weight criterion, which requires minimum a priori power information. Our methods are demonstrated on dependent simulated data and an aphasia study investigating which regions of the brain are associated with the severity of language impairment among stroke survivors. The results demonstrate that the proposed methods have robust error control and can increase power. Further, we showcase how weights can be used to identify regions that are inconclusive due to lack of power.
Affiliation(s)
- Siyu Zheng
  - Department of Epidemiology and Biostatistics, University of South Carolina, SC, United States
- Alexander C. McLain
  - Department of Epidemiology and Biostatistics, University of South Carolina, SC, United States
- Joshua Habiger
  - Department of Statistics, Oklahoma State University, OK, United States
- Christopher Rorden
  - Department of Psychology, University of South Carolina, SC, United States
- Julius Fridriksson
  - Department of Communication Sciences and Disorders, University of South Carolina, SC, United States
4
Al-Mekhlafi A, Klawonn F. HiPerMAb: a tool for judging the potential of small sample size biomarker pilot studies. Int J Biostat 2024; 20:157-167. [PMID: 36867668 DOI: 10.1515/ijb-2022-0063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2022] [Accepted: 02/01/2023] [Indexed: 03/04/2023]
Abstract
Common statistical approaches are not designed to deal with so-called "short fat data" in biomarker pilot studies, where the number of biomarker candidates exceeds the sample size by magnitudes. High-throughput technologies for omics data enable the measurement of ten thousands and more biomarker candidates for specific diseases or states of a disease. Due to the limited availability of study participants, ethical reasons and high costs for sample processing and analysis researchers often prefer to start with a small sample size pilot study in order to judge the potential of finding biomarkers that enable - usually in combination - a sufficiently reliable classification of the disease state under consideration. We developed a user-friendly tool, called HiPerMAb that allows to evaluate pilot studies based on performance measures like multiclass AUC, entropy, area above the cost curve, hypervolume under manifold, and misclassification rate using Monte-Carlo simulations to compute the p-values and confidence intervals. The number of "good" biomarker candidates is compared to the expected number of "good" biomarker candidates in a data set with no association to the considered disease states. This allows judging the potential in the pilot study even if statistical tests with correction for multiple testing fail to provide any hint of significance.
Affiliation(s)
- Amani Al-Mekhlafi
  - Department of Biostatistics, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - PhD Programme "Epidemiology" Hannover Medical School (MHH), Hannover, Germany
- Frank Klawonn
  - Department of Computer Science, Ostfalia University of Applied Sciences, Wolfenbuettel, Germany
5
Yang L, Wang P, Chen J. 2dGBH: Two-dimensional group Benjamini-Hochberg procedure for false discovery rate control in two-way multiple testing of genomic data. Bioinformatics 2024; 40:btae035. [PMID: 38244568 PMCID: PMC10873908 DOI: 10.1093/bioinformatics/btae035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2024] [Revised: 02/16/2024] [Accepted: 02/16/2024] [Indexed: 01/22/2024]
Abstract
MOTIVATION Emerging omics technologies have introduced a two-way grouping structure in multiple testing, as seen in single-cell omics data, where the features can be grouped by either genes or cell types. Traditional multiple testing methods have limited ability to exploit such two-way grouping structure, leading to potential power loss. RESULTS We propose a new 2D Group Benjamini-Hochberg (2dGBH) procedure to harness the two-way grouping structure in omics data, extending the traditional one-way adaptive GBH procedure. Using both simulated and real datasets, we show that 2dGBH effectively controls the false discovery rate across biologically relevant settings, and it is more powerful than the BH or q-value procedure and more robust than the one-way adaptive GBH procedure. AVAILABILITY AND IMPLEMENTATION 2dGBH is available as an R package at: https://github.com/chloelulu/tdGBH. The analysis code and data are available at: https://github.com/chloelulu/tdGBH-paper.
Affiliation(s)
- Lu Yang
  - Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55905, United States
  - Center for Individualized Medicine, Mayo Clinic, Rochester, MN 55905, United States
- Pei Wang
  - Department of Statistics, Miami University, Oxford, OH 45056, United States
- Jun Chen
  - Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55905, United States
  - Center for Individualized Medicine, Mayo Clinic, Rochester, MN 55905, United States
6
Hariprakash JM, Salviato E, La Mastra F, Sebestyén E, Tagliaferri I, Silva RS, Lucini F, Farina L, Cinquanta M, Rancati I, Riboni M, Minardi SP, Roz L, Gorini F, Lanzuolo C, Casola S, Ferrari F. Leveraging Tissue-Specific Enhancer-Target Gene Regulatory Networks Identifies Enhancer Somatic Mutations That Functionally Impact Lung Cancer. Cancer Res 2024; 84:133-153. [PMID: 37855660 PMCID: PMC10758689 DOI: 10.1158/0008-5472.can-23-1129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2023] [Revised: 08/29/2023] [Accepted: 10/17/2023] [Indexed: 10/20/2023]
Abstract
Enhancers are noncoding regulatory DNA regions that modulate the transcription of target genes, often over large distances along with the genomic sequence. Enhancer alterations have been associated with various pathological conditions, including cancer. However, the identification and characterization of somatic mutations in noncoding regulatory regions with a functional effect on tumorigenesis and prognosis remain a major challenge. Here, we present a strategy for detecting and characterizing enhancer mutations in a genome-wide analysis of patient cohorts, across three lung cancer subtypes. Lung tissue-specific enhancers were defined by integrating experimental data and public epigenomic profiles, and the genome-wide enhancer-target gene regulatory network of lung cells was constructed by integrating chromatin three-dimensional architecture data. Lung cancers possessed a similar mutation burden at tissue-specific enhancers and exons but with differences in their mutation signatures. Functionally relevant alterations were prioritized on the basis of the pathway-level integration of the effect of a mutation and the frequency of mutations on individual enhancers. The genes enriched for mutated enhancers converged on the regulation of key biological processes and pathways relevant to tumor biology. Recurrent mutations in individual enhancers also affected the expression of target genes, with potential relevance for patient prognosis. Together, these findings show that noncoding regulatory mutations have a potential relevance for cancer pathogenesis and can be exploited for patient classification. SIGNIFICANCE Mapping enhancer-target gene regulatory interactions and analyzing enhancer mutations at the level of their target genes and pathways reveal convergence of recurrent enhancer mutations on biological processes involved in tumorigenesis and prognosis.
Affiliation(s)
- Elisa Salviato
  - IFOM-ETS, the AIRC Institute of Molecular Oncology, Milan, Italy
- Endre Sebestyén
  - IFOM-ETS, the AIRC Institute of Molecular Oncology, Milan, Italy
- Federica Lucini
  - IFOM-ETS, the AIRC Institute of Molecular Oncology, Milan, Italy
- Lorenzo Farina
  - IFOM-ETS, the AIRC Institute of Molecular Oncology, Milan, Italy
- Ilaria Rancati
  - IFOM-ETS, the AIRC Institute of Molecular Oncology, Milan, Italy
- Luca Roz
  - Fondazione IRCCS—Istituto Nazionale Tumori, Milan, Italy
- Francesca Gorini
  - INGM, National Institute of Molecular Genetics “Romeo ed Enrica Invernizzi,” Milan, Italy
- Chiara Lanzuolo
  - INGM, National Institute of Molecular Genetics “Romeo ed Enrica Invernizzi,” Milan, Italy
  - Institute of Biomedical Technologies, National Research Council (ITB-CNR), Segrate, Italy
- Stefano Casola
  - IFOM-ETS, the AIRC Institute of Molecular Oncology, Milan, Italy
- Francesco Ferrari
  - IFOM-ETS, the AIRC Institute of Molecular Oncology, Milan, Italy
  - Institute of Molecular Genetics “Luigi Luca Cavalli-Sforza,” National Research Council (IGM-CNR), Pavia, Italy
7
Jeon H, Lim KS, Nguyen Y, Nettleton D. Adjusting for gene-specific covariates to improve RNA-seq analysis. Bioinformatics 2023; 39:btad498. [PMID: 37589589 PMCID: PMC10460482 DOI: 10.1093/bioinformatics/btad498] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2023] [Revised: 06/29/2023] [Accepted: 08/16/2023] [Indexed: 08/18/2023]
Abstract
SUMMARY This article suggests a novel positive false discovery rate (pFDR) controlling method for testing gene-specific hypotheses using a gene-specific covariate variable, such as gene length. We suppose the null probability depends on the covariate variable. In this context, we propose a rejection rule that accounts for heterogeneity among tests by using two distinct types of null probabilities. We establish a pFDR estimator for a given rejection rule by following Storey's q-value framework. A condition on a type 1 error posterior probability is provided that equivalently characterizes our rejection rule. We also present a suitable procedure for selecting a tuning parameter through cross-validation that maximizes the expected number of hypotheses declared significant. A simulation study demonstrates that our method is comparable to or better than existing methods across realistic scenarios. In data analysis, we find support for our method's premise that the null probability varies with a gene-specific covariate variable. AVAILABILITY AND IMPLEMENTATION The source code repository is publicly available at https://github.com/hsjeon1217/conditional_method.
Affiliation(s)
- Hyeongseon Jeon
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH, United States
  - Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH 43210, United States
- Kyu-Sang Lim
  - Department of Animal Resources Science, Kongju National University, Yesan-gun, Chungnam 32439, Republic of Korea
- Yet Nguyen
  - Department of Mathematics and Statistics, Old Dominion University, Norfolk, VA 23529, United States
- Dan Nettleton
  - Department of Statistics, Iowa State University, Ames, IA 50011, United States
8
Obry L, Dalmasso C. Weighted multiple testing procedures in genome-wide association studies. PeerJ 2023; 11:e15369. [PMID: 37337586 PMCID: PMC10276986 DOI: 10.7717/peerj.15369] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2022] [Accepted: 04/17/2023] [Indexed: 06/21/2023]
Abstract
Multiple testing procedures controlling the false discovery rate (FDR) are increasingly used in the context of genome wide association studies (GWAS), and weighted multiple testing procedures that incorporate covariate information are efficient to improve the power to detect associations. In this work, we evaluate some recent weighted multiple testing procedures in the specific context of GWAS through a simulation study. We also present a new efficient procedure called wBHa that prioritizes the detection of genetic variants with low minor allele frequencies while maximizing the overall detection power. The results indicate good performance of our procedure compared to other weighted multiple testing procedures. In particular, in all simulated settings, wBHa tends to outperform other procedures in detecting rare variants while maintaining good overall power. The use of the different procedures is illustrated with a real dataset.
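For orientation, the generic weighted Benjamini-Hochberg step that weighted procedures of this kind build on can be sketched as follows. This is a minimal illustration, not the wBHa procedure itself: the abstract does not describe how wBHa constructs its weights, so the function name `weighted_bh`, the weights, and the example p-values are all assumptions made for the sketch. Each p-value is divided by its weight (weights are rescaled to average 1), and the usual BH step-up rule is applied to the weighted p-values.

```python
def weighted_bh(pvalues, weights, alpha=0.05):
    """Generic weighted Benjamini-Hochberg step-up (illustrative sketch).

    Returns the sorted indices of the rejected hypotheses.
    """
    m = len(pvalues)
    if m == 0:
        return []
    if len(weights) != m:
        raise ValueError("pvalues and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    mean_w = sum(weights) / m
    if mean_w <= 0:
        raise ValueError("at least one weight must be positive")

    # Rescale so the weights average to 1, then form weighted p-values.
    # A hypothesis with zero weight can never be rejected.
    scaled = [w / mean_w for w in weights]
    q = [p / w if w > 0 else float("inf") for p, w in zip(pvalues, scaled)]

    # BH step-up: find the largest rank k with q_(k) <= alpha * k / m.
    order = sorted(range(m), key=lambda i: q[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if q[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])


# Hypothetical p-values: up-weighting the first test changes what is rejected.
pvals = [0.04, 0.04]
equal = weighted_bh(pvals, [1.0, 1.0], alpha=0.05)
skewed = weighted_bh(pvals, [1.9, 0.1], alpha=0.05)
```

With equal weights the sketch reduces to the ordinary BH procedure; unequal weights shift rejections toward the up-weighted hypotheses, which is the mechanism a procedure favoring rare variants would exploit.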
Affiliation(s)
- Ludivine Obry
  - Université Paris-Saclay, CNRS, Univ Evry, Laboratoire de Mathématiques et Modélisation d’Evry, Evry-Courcouronnes, France
- Cyril Dalmasso
  - Université Paris-Saclay, CNRS, Univ Evry, Laboratoire de Mathématiques et Modélisation d’Evry, Evry-Courcouronnes, France
9
Bello N, López-Kleine L. Prog-Plot - a visual method to determine functional relationships for false discovery rate regression methods. J Cell Sci 2023; 136:jcs260312. [PMID: 36482762 DOI: 10.1242/jcs.260312] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2022] [Accepted: 12/01/2022] [Indexed: 12/14/2022]
Abstract
Multiple test corrections are a fundamental step in the analysis of differentially expressed genes, as the number of tests performed would otherwise inflate the false discovery rate (FDR). Recent methods for P-value correction involve a regression model in order to include covariates that are informative of the power of the test. Here, we present Progressive proportions plot (Prog-Plot), a visual tool to identify the functional relationship between the covariate and the proportion of P-values consistent with the null hypothesis. The relationship between the proportion of P-values and the covariate to be included is needed, but there are no available tools to verify it. The approach presented here aims at having an objective way to specify regression models instead of relying on prior knowledge.
Affiliation(s)
- Nicolás Bello
  - Statistics Department, Universidad Nacional de Colombia, Ciudad Universitaria, Cra 30 No 45-03, Bogotá 111321, Colombia
- Liliana López-Kleine
  - Statistics Department, Universidad Nacional de Colombia, Ciudad Universitaria, Cra 30 No 45-03, Bogotá 111321, Colombia
10
Bryan JG, Hoff PD. Smaller p-values in genomics studies using distilled auxiliary information. Biostatistics 2022; 24:193-208. [PMID: 34269373 DOI: 10.1093/biostatistics/kxaa053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2020] [Revised: 09/25/2020] [Accepted: 11/15/2020] [Indexed: 12/16/2022]
Abstract
Medical research institutions have generated massive amounts of biological data by genetically profiling hundreds of cancer cell lines. In parallel, academic biology labs have conducted genetic screens on small numbers of cancer cell lines under custom experimental conditions. In order to share information between these two approaches to scientific discovery, this article proposes a "frequentist assisted by Bayes" (FAB) procedure for hypothesis testing that allows auxiliary information from massive genomics datasets to increase the power of hypothesis tests in specialized studies. The exchange of information takes place through a novel probability model for multimodal genomics data, which distills auxiliary information pertaining to cancer cell lines and genes across a wide variety of experimental contexts. If the relevance of the auxiliary information to a given study is high, then the resulting FAB tests can be more powerful than the corresponding classical tests. If the relevance is low, then the FAB tests yield as many discoveries as the classical tests. Simulations and practical investigations demonstrate that the FAB testing procedure can increase the number of effects discovered in genomics studies while still maintaining strict control of type I error and false discovery rate.
Affiliation(s)
- Jordan G Bryan
  - Department of Statistical Science, Duke University, 415 Chapel Drive, Durham, NC 27708, USA
- Peter D Hoff
  - Department of Statistical Science, Duke University, 415 Chapel Drive, Durham, NC 27708, USA
11
Wang J, Cui T, Zhu W, Wang P. Covariate-modulated large-scale multiple testing under dependence. Comput Stat Data Anal 2022. [DOI: 10.1016/j.csda.2022.107664] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/03/2022]
12
Transfer Learning in Genome-Wide Association Studies with Knockoffs. SANKHYA B 2022. [DOI: 10.1007/s13571-022-00297-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
This paper presents and compares alternative transfer learning methods that can increase the power of conditional testing via knockoffs by leveraging prior information in external data sets collected from different populations or measuring related outcomes. The relevance of this methodology is explored in particular within the context of genome-wide association studies, where it can be helpful to address the pressing need for principled ways to suitably account for, and efficiently learn from the genetic variation associated to diverse ancestries. Finally, we apply these methods to analyze several phenotypes in the UK Biobank data set, demonstrating that transfer learning helps knockoffs discover more associations in the data collected from minority populations, potentially opening the way to the development of more accurate polygenic risk scores.
13
Leung D, Sun W. ZAP: Z-value adaptive procedures for false discovery rate control with side information. J R Stat Soc Series B Stat Methodol 2022. [DOI: 10.1111/rssb.12557] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/03/2022]
Affiliation(s)
- Dennis Leung
  - School of Mathematics and Statistics, University of Melbourne, Parkville, Victoria, Australia
- Wenguang Sun
  - Center for Data Science, Zhejiang University, Hangzhou, China
14
Freestone J, Short T, Noble WS, Keich U. Group-walk: a rigorous approach to group-wise false discovery rate analysis by target-decoy competition. Bioinformatics 2022; 38:ii82-ii88. [PMID: 36124786 DOI: 10.1093/bioinformatics/btac471] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 12/25/2022]
Abstract
MOTIVATION Target-decoy competition (TDC) is a commonly used method for false discovery rate (FDR) control in the analysis of tandem mass spectrometry data. This type of competition-based FDR control has recently gained significant popularity in other fields after Barber and Candès laid its theoretical foundation in a more general setting that included the feature selection problem. In both cases, the competition is based on a head-to-head comparison between an (observed) target score and a corresponding decoy (knockoff) score. However, the effectiveness of TDC depends on whether the data are homogeneous, which is often not the case: in many settings, the data consist of groups with different score profiles or different proportions of true nulls. In such cases, applying TDC while ignoring the group structure often yields imbalanced lists of discoveries, where some groups might include relatively many false discoveries and other groups include relatively very few. On the other hand, as we show, the alternative approach of applying TDC separately to each group does not rigorously control the FDR. RESULTS We developed Group-walk, a procedure that controls the FDR in the target-decoy/knockoff setting while taking into account a given group structure. Group-walk is derived from the recently developed AdaPT, a general framework for controlling the FDR with side-information. We show using simulated and real datasets that when the data naturally divide into groups with different characteristics Group-walk can deliver consistent power gains that in some cases are substantial. These groupings include the precursor charge state (4% more discovered peptides at 1% FDR threshold), the peptide length (3.6% increase) and the mass difference due to modifications (26% increase). AVAILABILITY AND IMPLEMENTATION Group-walk is available at https://cran.r-project.org/web/packages/groupwalk/index.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
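The head-to-head competition described in this abstract can be illustrated with a minimal sketch of plain (ungrouped) target-decoy competition. It is not Group-walk, which additionally uses the group structure. The function name `tdc_discoveries`, the scores, and the tie rule (a tie is counted as a decoy win, a conservative simplification) are assumptions made for the example. The FDR among the top-k winning scores is estimated as (decoy wins + 1) / (target wins), and the largest k whose estimate stays at or below the threshold is kept.

```python
def tdc_discoveries(target_scores, decoy_scores, alpha=0.01):
    """Plain target-decoy competition (illustrative sketch, not Group-walk).

    Returns the sorted indices of hypotheses reported as discoveries.
    """
    if len(target_scores) != len(decoy_scores):
        raise ValueError("each target score needs exactly one decoy score")

    # Competition: keep the larger score and record which side won.
    # Ties are counted as decoy wins (an assumption of this sketch).
    wins = []
    for i, (t, d) in enumerate(zip(target_scores, decoy_scores)):
        wins.append((max(t, d), t > d, i))

    # Walk down the winning scores from best to worst.
    wins.sort(key=lambda w: w[0], reverse=True)
    targets = decoys = 0
    best_k = 0
    for k, (_, is_target, _) in enumerate(wins, start=1):
        if is_target:
            targets += 1
        else:
            decoys += 1
        # Estimated FDR among the top k: (decoy wins + 1) / target wins.
        if (decoys + 1) / max(targets, 1) <= alpha:
            best_k = k

    return sorted(i for _, is_target, i in wins[:best_k] if is_target)


# Hypothetical scores: four clear target wins and one decoy win.
found = tdc_discoveries([10, 9, 8, 7, 1], [1, 2, 3, 4, 6], alpha=0.25)
```

Because of the +1 in the numerator, a small list can never meet a strict threshold: with only four target wins the estimate cannot drop below 0.25, so the same scores yield no discoveries at a threshold of 0.2.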
Affiliation(s)
- Jack Freestone
  - School of Mathematics and Statistics F07, University of Sydney, Sydney 2006, Australia
- Temana Short
  - School of Mathematics and Statistics F07, University of Sydney, Sydney 2006, Australia
- Uri Keich
  - School of Mathematics and Statistics F07, University of Sydney, Sydney 2006, Australia
15
Page CM, Nøst TH, Djordjilović V, Thoresen M, Frigessi A, Sandanger TM, Veierød MB. Pre-diagnostic DNA methylation in blood leucocytes in cutaneous melanoma; a nested case-control study within the Norwegian Women and Cancer cohort. Sci Rep 2022; 12:14200. [PMID: 35987900 PMCID: PMC9392730 DOI: 10.1038/s41598-022-18585-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2021] [Accepted: 08/16/2022] [Indexed: 12/03/2022]
Abstract
The prognosis of cutaneous melanoma depends on early detection, and good biomarkers for melanoma risk may provide a valuable tool to detect melanoma development at a pre-clinical stage. By studying the epigenetic profile in pre-diagnostic blood samples of melanoma cases and cancer free controls, we aimed to identify DNA methylation sites conferring melanoma risk. DNA methylation was measured at 775,528 CpG sites using the Illumina EPIC array in whole blood in incident melanoma cases (n = 183) and matched cancer-free controls (n = 183) in the Norwegian Women and Cancer cohort. Phenotypic information and ultraviolet radiation exposure were obtained from questionnaires. Epigenome wide association (EWAS) was analyzed in future melanoma cases and controls with conditional logistic regression, with correction for multiple testing using the false discovery rate (FDR). We extended the analysis by including a public data set on melanoma (GSE120878), and combining these different data sets using a version of covariate modulated FDR (AdaPT). The analysis on future melanoma cases and controls did not identify any genome wide significant CpG sites (0.85 ≤ padj ≤ 0.99). In the restricted AdaPT analysis, 7 CpG sites were suggestive at the FDR level of 0.15. These CpG sites may potentially be used as pre-diagnostic biomarkers of melanoma risk.
Affiliation(s)
- Christian M Page
  - Oslo Centre for Biostatistics and Epidemiology, Division for Research Support, Oslo University Hospital, Oslo, Norway
  - Department of Mathematics, Faculty of Mathematics and Natural Sciences, University of Oslo, Oslo, Norway
  - Centre for Fertility and Health, Norwegian Institute of Public Health, Oslo, Norway
- Therese H Nøst
  - Department of Community Medicine, UiT The Arctic University of Norway, Tromsø, Norway
  - K.G. Jebsen Center for Genetic Epidemiology, Department of Public Health and Nursing, Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology, Trondheim, Norway
- Vera Djordjilović
  - Department of Economics, Ca' Foscari University of Venice, Venice, Italy
- Magne Thoresen
  - Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, Institute of Basic Medical Sciences, University of Oslo, Oslo, Norway
- Arnoldo Frigessi
  - Oslo Centre for Biostatistics and Epidemiology, Division for Research Support, Oslo University Hospital, Oslo, Norway
  - Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, Institute of Basic Medical Sciences, University of Oslo, Oslo, Norway
- Torkjel M Sandanger
  - Department of Community Medicine, UiT The Arctic University of Norway, Tromsø, Norway
- Marit B Veierød
  - Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, Institute of Basic Medical Sciences, University of Oslo, Oslo, Norway
16
Djordjilović V, Hemerik J, Thoresen M. On optimal two-stage testing of multiple mediators. Biom J 2022; 64:1090-1108. [PMID: 35426161 PMCID: PMC9544827 DOI: 10.1002/bimj.202100190] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/18/2021] [Revised: 10/18/2021] [Accepted: 11/28/2021] [Indexed: 11/27/2022]
Abstract
Mediation analysis in high-dimensional settings often involves identifying potential mediators among a large number of measured variables. For this purpose, a two-step familywise error rate procedure called ScreenMin has been recently proposed. In ScreenMin, variables are first screened and only those that pass the screening are tested. The proposed data-independent threshold for selection has been shown to guarantee asymptotic familywise error rate. In this work, we investigate the impact of the threshold on the finite-sample familywise error rate. We derive a power maximizing threshold and show that it is well approximated by an adaptive threshold of Wang et al. (2016, arXiv preprint arXiv:1610.03330). We illustrate the investigated procedures on a case-control study examining the effect of fish intake on the risk of colorectal adenoma. We also apply our procedure in the context of replicability analysis to identify single nucleotide polymorphisms (SNP) associated with crop yield in two distinct environments.
Affiliation(s)
- Vera Djordjilović
- Department of Economics, Ca' Foscari University of Venice, Dorsoduro, Venice, Italy
| | - Jesse Hemerik
- Biometris, Wageningen University & Research, Wageningen, The Netherlands
| | - Magne Thoresen
- Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, University of Oslo, Blindern, Oslo, Norway
| |
|
17
|
Li Y, Zhou X, Cao H. Statistical analysis of spatially resolved transcriptomic data by incorporating multiomics auxiliary information. Genetics 2022; 221:iyac095. [PMID: 35731210 PMCID: PMC9339334 DOI: 10.1093/genetics/iyac095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/22/2022] [Accepted: 06/14/2022] [Indexed: 11/13/2022] Open
Abstract
Effective control of false discovery rate is key for multiplicity problems. Here, we consider incorporating informative covariates from external datasets in the multiple testing procedure to boost statistical power while maintaining false discovery rate control. In particular, we focus on the statistical analysis of innovative high-dimensional spatial transcriptomic data while incorporating external multiomics data that provide distinct but complementary information to the detection of spatial expression patterns. We extend OrderShapeEM, an efficient covariate-assisted multiple testing procedure that incorporates one auxiliary study, to make it permissible to incorporate multiple external omics studies, to boost statistical power of spatial expression pattern detection. Specifically, we first use a recently proposed computationally efficient statistical analysis method, spatial pattern recognition via kernels, to produce the primary test statistics for spatial transcriptomic data. Afterwards, we construct the auxiliary covariate by combining information from multiple external omics studies, such as bulk and single-cell RNA-seq data using the Cauchy combination rule. Finally, we extend and implement the integrative analysis method OrderShapeEM on the primary P-values along with auxiliary data incorporating multiomics information for efficient covariate-assisted spatial expression analysis. We conduct a series of realistic simulations to evaluate the performance of our method with known ground truth. Four case studies in mouse olfactory bulb, mouse cerebellum, human breast cancer, and human heart tissues further demonstrate the substantial power gain of our method in detecting genes with spatial expression patterns compared to existing classic approaches that do not utilize any external information.
Affiliation(s)
- Yan Li
- School of Mathematics, Jilin University, Changchun, Jilin 130012, China
| | - Xiang Zhou
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Hongyuan Cao
- School of Mathematics, Jilin University, Changchun, Jilin 130012, China
- Department of Statistics, Florida State University, Tallahassee, FL 32306, USA
| |
|
18
|
Hutchinson A, Liley J, Wallace C. fcfdr: an R package to leverage continuous and binary functional genomic data in GWAS. BMC Bioinformatics 2022; 23:310. [PMID: 35907789 PMCID: PMC9338519 DOI: 10.1186/s12859-022-04838-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/11/2022] [Accepted: 07/13/2022] [Indexed: 02/02/2023] Open
Abstract
BACKGROUND: Genome-wide association studies (GWAS) are limited in power to detect associations that exceed the stringent genome-wide significance threshold. This limitation can be alleviated by leveraging relevant auxiliary data, such as functional genomic data. Frameworks utilising the conditional false discovery rate have been developed for this purpose, and have been shown to increase power for GWAS discovery whilst controlling the false discovery rate. However, the methods are currently only applicable for continuous auxiliary data and cannot be used to leverage auxiliary data with a binary representation, such as whether SNPs are synonymous or non-synonymous, or whether they reside in regions of the genome with specific activity states. RESULTS: We describe an extension to the cFDR framework for binary auxiliary data, called "Binary cFDR". We demonstrate FDR control of our method using detailed simulations, and show that Binary cFDR performs better than a comparator method in terms of sensitivity and FDR control. We introduce an all-encompassing user-oriented CRAN R package (https://annahutch.github.io/fcfdr/; https://cran.r-project.org/web/packages/fcfdr/index.html) and demonstrate its utility in an application to type 1 diabetes, where we identify additional genetic associations. CONCLUSIONS: Our all-encompassing R package, fcfdr, serves as a comprehensive toolkit to unite GWAS and functional genomic data in order to increase statistical power to detect genetic associations.
Affiliation(s)
- Anna Hutchinson
- MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
| | - James Liley
- MRC Human Genetics Unit, University of Edinburgh, Edinburgh, UK
- The Alan Turing Institute, London, UK
| | - Chris Wallace
- MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
- Cambridge Institute of Therapeutic Immunology and Infectious Disease (CITIID), University of Cambridge, Cambridge, UK
- Department of Medicine, University of Cambridge, Cambridge, UK
| |
|
19
|
Hyde R, O'Grady L, Green M. Stability selection for mixed effect models with large numbers of predictor variables: A simulation study. Prev Vet Med 2022; 206:105714. [PMID: 35843027 DOI: 10.1016/j.prevetmed.2022.105714] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/28/2022] [Revised: 07/08/2022] [Accepted: 07/10/2022] [Indexed: 10/17/2022]
Abstract
Covariate selection when the number of available variables is large relative to the number of observations is problematic in epidemiology and remains the focus of continued research. Whilst a variety of statistical methods have been developed to attempt to overcome this issue, at present very few methods are available for wide data that include a clustered outcome. The purpose of this research was to make an empirical evaluation of a new method for covariate selection in wide data settings when the dependent variable is clustered. We used 3300 simulated datasets with a variety of defined structures and known sets of true predictor variables to conduct an empirical evaluation of a mixed model stability selection procedure. Comparison was made with an alternative method based on regularisation using the least absolute shrinkage and selection operator (Lasso) penalty. Model performance was assessed using several metrics including the true positive rate (proportion of true covariates selected in a final model) and false discovery rate (proportion of variables selected in a final model that were non-true (false) variables). For stability selection, the false discovery rate was consistently low, generally remaining ≤ 0.02 indicating that on average fewer than 1 in 50 of the variables selected in a final model were false variables. This was in contrast to the Lasso-based method in which the false discovery rate was between 0.59 and 0.72, indicating that generally more than 60% of variables selected in a final model were false variables. In contrast however, the Lasso method attained higher true positive rates than stability selection, although both methods achieved good results. For the Lasso method, true positive rates remained ≥ 0.93 whereas for stability selection the true positive rate was 0.73-0.97. Our results suggest both methods may be of value for covariate selection with high dimensional data with a clustered outcome. 
When high specificity is needed for identification of true covariates, stability selection appeared to offer the better solution, although with a slight loss of sensitivity. Conversely when high sensitivity is needed, the Lasso approach may be useful, even if accompanied by a substantial loss of specificity. Overall, the results indicated the loss of sensitivity when employing stability selection is relatively small compared to the loss of specificity when using the Lasso and therefore stability selection may provide the better option for the analyst when evaluating data of this type.
Affiliation(s)
- Robert Hyde
- School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington Campus, Leicestershire, United Kingdom
| | - Luke O'Grady
- School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington Campus, Leicestershire, United Kingdom
| | - Martin Green
- School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington Campus, Leicestershire, United Kingdom
| |
|
20
|
Loper JH, Lei L, Fithian W, Tansey W. Smoothed Nested Testing on Directed Acyclic Graphs. Biometrika 2022; 109:457-471. [PMID: 38694183 PMCID: PMC11061840 DOI: 10.1093/biomet/asab041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2024] Open
Abstract
We consider the problem of multiple hypothesis testing when there is a logical nested structure to the hypotheses. When one hypothesis is nested inside another, the outer hypothesis must be false if the inner hypothesis is false. We model the nested structure as a directed acyclic graph, including chain and tree graphs as special cases. Each node in the graph is a hypothesis and rejecting a node requires also rejecting all of its ancestors. We propose a general framework for adjusting node-level test statistics using the known logical constraints. Within this framework, we study a smoothing procedure that combines each node with all of its descendants to form a more powerful statistic. We prove a broad class of smoothing strategies can be used with existing selection procedures to control the familywise error rate, false discovery exceedance rate, or false discovery rate, so long as the original test statistics are independent under the null. When the null statistics are not independent but are derived from positively-correlated normal observations, we prove control for all three error rates when the smoothing method is arithmetic averaging of the observations. Simulations and an application to a real biology dataset demonstrate that smoothing leads to substantial power gains.
Affiliation(s)
- J. H. Loper
- Department of Neuroscience, Columbia University, 716 Jerome L. Greene Building, New York, New York 10025, U.S.A
| | - L. Lei
- Department of Statistics, Stanford University, Sequoia Hall, Palo Alto, California 94305, U.S.A
| | - W. Fithian
- Department of Statistics, University of California, Berkeley, 367 Evans Hall, Berkeley, California 94720, U.S.A
| | - W. Tansey
- Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, 321 E 61st St., New York, New York 10065, U.S.A
| |
|
21
|
Ji Y, Chen R, Wang Q, Wei Q, Tao R, Li B. Leveraging Gene-Level Prediction as Informative Covariate in Hypothesis Weighting Improves Power for Rare Variant Association Studies. Genes (Basel) 2022; 13:381. [PMID: 35205424 PMCID: PMC8872452 DOI: 10.3390/genes13020381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2021] [Revised: 01/31/2022] [Accepted: 02/09/2022] [Indexed: 02/05/2023] Open
Abstract
Gene-based rare variant association studies (RVASs) have low power due to the infrequency of rare variants and the large multiple testing burden. To correct for multiple testing, traditional false discovery rate (FDR) procedures which depend solely on P-values are often used. Recently, Independent Hypothesis Weighting (IHW) was developed to improve the detection power while maintaining FDR control by leveraging prior information for each hypothesis. Here, we present a framework to increase power of gene-based RVASs by incorporating prior information using IHW. We first build supervised machine learning models to assign each gene a prediction score that measures its disease risk, using the input of multiple biological features, fed with high-confidence risk genes and local background genes selected near GWAS significant loci as the training set. Then we use the prediction scores as covariates to prioritize RVAS results via IHW. We demonstrate the effectiveness of this framework through applications to RVASs in schizophrenia and autism spectrum disorder. We found sizeable improvements in the number of significant associations compared to traditional FDR approaches, and independent evidence supporting the relevance of the genes identified by our framework but not traditional FDR, demonstrating the potential of our framework to improve power of gene-based RVASs.
Affiliation(s)
- Ying Ji
- Department of Molecular Physiology & Biophysics, Vanderbilt University, Nashville, TN 37232, USA
- Vanderbilt Genetics Institute, Vanderbilt University, Nashville, TN 37203, USA
| | - Rui Chen
- Department of Molecular Physiology & Biophysics, Vanderbilt University, Nashville, TN 37232, USA
- Vanderbilt Genetics Institute, Vanderbilt University, Nashville, TN 37203, USA
| | - Quan Wang
- Department of Molecular Physiology & Biophysics, Vanderbilt University, Nashville, TN 37232, USA
- Vanderbilt Genetics Institute, Vanderbilt University, Nashville, TN 37203, USA
| | - Qiang Wei
- Department of Molecular Physiology & Biophysics, Vanderbilt University, Nashville, TN 37232, USA
- Vanderbilt Genetics Institute, Vanderbilt University, Nashville, TN 37203, USA
| | - Ran Tao
- Department of Molecular Physiology & Biophysics, Vanderbilt University, Nashville, TN 37232, USA
- Department of Biostatistics, Vanderbilt University, Nashville, TN 37203, USA
| | - Bingshan Li
- Department of Molecular Physiology & Biophysics, Vanderbilt University, Nashville, TN 37232, USA
- Vanderbilt Genetics Institute, Vanderbilt University, Nashville, TN 37203, USA
| |
|
22
|
Cao H, Wu WB. Testing and estimation for clustered signals. Bernoulli 2022. [DOI: 10.3150/21-bej1355] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Hongyuan Cao
- School of Mathematics, Jilin University, 2699 Qianjing Street, Changchun, 130012, China
| | - Wei Biao Wu
- Department of Statistics, University of Chicago, 5747 South Ellis Avenue, Chicago, IL, 60637, USA
| |
|
23
|
Zhang X, Chen J. Covariate Adaptive False Discovery Rate Control With Applications to Omics-Wide Multiple Testing. J Am Stat Assoc 2022. [DOI: 10.1080/01621459.2020.1783273] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
Affiliation(s)
- Xianyang Zhang
- Department of Statistics, Texas A&M University, College Station, TX
| | - Jun Chen
- Division of Biomedical Statistics and Informatics, and Center for Individualized Medicine, Mayo Clinic, Rochester, MN
| |
|
24
|
Yun S, Zhang X, Li B. Detection of Local Differences in Spatial Characteristics Between Two Spatiotemporal Random Fields. J Am Stat Assoc 2022. [DOI: 10.1080/01621459.2020.1775613] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
Affiliation(s)
- Sooin Yun
- Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL
| | - Xianyang Zhang
- Department of Statistics, Texas A&M University, College Station, TX
| | - Bo Li
- Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL
| |
|
25
|
Katsevich E, Ramdas A. On the power of conditional independence testing under model-X. Electron J Stat 2022. [DOI: 10.1214/22-ejs2085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/12/2022]
Affiliation(s)
- Eugene Katsevich
- Department of Statistics and Data Science, University of Pennsylvania
| | - Aaditya Ramdas
- Department of Statistics and Data Science, Carnegie Mellon University, Machine Learning Department, Carnegie Mellon University
| |
|
26
|
Zhou H, Zhang X, Chen J. Covariate adaptive familywise error rate control for genome-wide association studies. Biometrika 2021; 108:915-931. [PMID: 34803516 DOI: 10.1093/biomet/asaa098] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/17/2020] [Indexed: 11/12/2022] Open
Abstract
The familywise error rate has been widely used in genome-wide association studies. With the increasing availability of functional genomics data, it is possible to increase detection power by leveraging these genomic functional annotations. Previous efforts to accommodate covariates in multiple testing focused on false discovery rate control, while covariate-adaptive procedures controlling the familywise error rate remain underdeveloped. Here, we propose a novel covariate-adaptive procedure to control the familywise error rate that incorporates external covariates which are potentially informative of either the statistical power or the prior null probability. An efficient algorithm is developed to implement the proposed method. We prove its asymptotic validity and obtain the rate of convergence through a perturbation-type argument. Our numerical studies show that the new procedure is more powerful than competing methods and maintains robustness across different settings. We apply the proposed approach to the UK Biobank data and analyse 27 traits with 9 million single-nucleotide polymorphisms tested for associations. Seventy-five genomic annotations are used as covariates. Our approach detects more genome-wide significant loci than other methods in 21 out of the 27 traits.
Affiliation(s)
- Huijuan Zhou
- Institute of Statistics and Big Data, Renmin University of China, Beijing 100872, China
| | - Xianyang Zhang
- Department of Statistics, Texas A&M University, College Station, Texas 77843, U.S.A
| | - Jun Chen
- Division of Biomedical Statistics and Informatics, Mayo Clinic, 200 First St. SW, Rochester, Minnesota 55905, U.S.A
| |
|
27
|
Gang B, Sun W, Wang W. Structure–Adaptive Sequential Testing for Online False Discovery Rate Control. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1955688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Affiliation(s)
- Bowen Gang
- Department of Statistics, Fudan University, Shanghai, China
| | - Wenguang Sun
- Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA
| | | |
|
28
|
Yurko R, Roeder K, Devlin B, G'Sell M. An approach to gene-based testing accounting for dependence of tests among nearby genes. Brief Bioinform 2021; 22:6359004. [PMID: 34459489 DOI: 10.1093/bib/bbab329] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2021] [Revised: 07/20/2021] [Accepted: 07/29/2021] [Indexed: 11/14/2022] Open
Abstract
In genome-wide association studies (GWAS), it has become commonplace to test millions of single-nucleotide polymorphisms (SNPs) for phenotypic association. Gene-based testing can improve power to detect weak signal by reducing multiple testing and pooling signal strength. While such tests account for linkage disequilibrium (LD) structure of SNP alleles within each gene, current approaches do not capture LD of SNPs falling in different nearby genes, which can induce correlation of gene-based test statistics. We introduce an algorithm to account for this correlation. When a gene's test statistic is independent of others, it is assessed separately; when test statistics for nearby genes are strongly correlated, their SNPs are agglomerated and tested as a locus. To provide insight into SNPs and genes driving association within loci, we develop an interactive visualization tool to explore localized signal. We demonstrate our approach in the context of weakly powered GWAS for autism spectrum disorder, which is contrasted to more highly powered GWAS for schizophrenia and educational attainment. To increase power for these analyses, especially those for autism, we use adaptive P-value thresholding, guided by high-dimensional metadata modeled with gradient boosted trees, highlighting when and how it can be most useful. Notably our workflow is based on summary statistics.
Affiliation(s)
- Ronald Yurko
- Department of Statistics & Data Science, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Kathryn Roeder
- Department of Computational Biology, Carnegie Mellon University, USA
| | - Bernie Devlin
- Department of Psychiatry, University of Pittsburgh School of Medicine, USA
| | - Max G'Sell
- Department of Statistics & Data Science, Carnegie Mellon University, Pittsburgh, PA, USA
| |
|
29
|
Wang W, Janson L. A High-Dimensional Power Analysis of the Conditional Randomization Test and Knockoffs. Biometrika 2021. [DOI: 10.1093/biomet/asab052] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/14/2022] Open
Abstract
In many scientific problems, researchers try to relate a response variable Y to a set of potential explanatory variables X = (X1,…,Xp), and start by trying to identify variables that contribute to this relationship. In statistical terms, this goal can be posed as trying to identify the Xj’s upon which Y is conditionally dependent. Sometimes it is of value to simultaneously test for each j, which is more commonly known as variable selection. The conditional randomization test, CRT, and model-X knockoffs are two recently proposed methods that respectively perform conditional independence testing and variable selection by, for each Xj, computing any test statistic on the data and assessing that test statistic’s significance by comparing it to test statistics computed on synthetic variables generated using knowledge of X’s distribution. Our main contribution is to analyse their power in a high-dimensional linear model where the ratio of the dimension p to the sample size n converges to a positive constant. We give explicit expressions for the asymptotic power of the CRT, variable selection with CRT p-values, and model-X knockoffs, each with a test statistic based on either the marginal covariance, the least squares coefficient, or the lasso. One useful application of our analysis is the direct theoretical comparison of the asymptotic powers of variable selection with CRT p-values and model-X knockoffs; in the instances with independent covariates that we consider, the CRT provably dominates knockoffs. We also analyse the power gain from using unlabelled data in the CRT when limited knowledge of X’s distribution is available, and the power of the CRT when samples are collected retrospectively.
Affiliation(s)
- Wenshuo Wang
- Department of Statistics, Harvard University, One Oxford Street, Cambridge, Massachusetts 02138, U.S.A
| | - Lucas Janson
- Department of Statistics, Harvard University, One Oxford Street, Cambridge, Massachusetts 02138, U.S.A
| |
|
30
|
Ge X, Chen YE, Song D, McDermott M, Woyshner K, Manousopoulou A, Wang N, Li W, Wang LD, Li JJ. Clipper: p-value-free FDR control on high-throughput data from two conditions. Genome Biol 2021; 22:288. [PMID: 34635147 PMCID: PMC8504070 DOI: 10.1186/s13059-021-02506-9] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 03/16/2021] [Accepted: 09/21/2021] [Indexed: 12/12/2022] Open
Abstract
High-throughput biological data analysis commonly involves identifying features such as genes, genomic regions, and proteins, whose values differ between two conditions, from numerous features measured simultaneously. The most widely used criterion to ensure the analysis reliability is the false discovery rate (FDR), which is primarily controlled based on p-values. However, obtaining valid p-values relies on either reasonable assumptions of data distribution or large numbers of replicates under both conditions. Clipper is a general statistical framework for FDR control without relying on p-values or specific data distributions. Clipper outperforms existing methods for a broad range of applications in high-throughput data analysis.
Affiliation(s)
- Xinzhou Ge
- Department of Statistics, University of California, Los Angeles, 90095, CA, USA
| | - Yiling Elaine Chen
- Department of Statistics, University of California, Los Angeles, 90095, CA, USA
| | - Dongyuan Song
- Interdepartmental Program in Bioinformatics, University of California, Los Angeles, 90095, CA, USA
| | - MeiLu McDermott
- Beckman Research Institute, City of Hope National Medical Center, Duarte, 91010, CA, USA
- The Quantitative and Computational Biology section, University of Southern California, Los Angeles, 90089, CA, USA
| | - Kyla Woyshner
- Beckman Research Institute, City of Hope National Medical Center, Duarte, 91010, CA, USA
| | - Antigoni Manousopoulou
- Beckman Research Institute, City of Hope National Medical Center, Duarte, 91010, CA, USA
| | - Ning Wang
- Interdepartmental Program in Bioinformatics, University of California, Los Angeles, 90095, CA, USA
| | - Wei Li
- Division of Computational Biomedicine, Department of Biological Chemistry, School of Medicine, University of California, Irvine, 92697, CA, USA
| | - Leo D Wang
- Beckman Research Institute, City of Hope National Medical Center, Duarte, 91010, CA, USA
| | - Jingyi Jessica Li
- Department of Statistics, University of California, Los Angeles, 90095, CA, USA
- Interdepartmental Program in Bioinformatics, University of California, Los Angeles, 90095, CA, USA
- Department of Human Genetics, University of California, Los Angeles, 90095, CA, USA
- Department of Computational Medicine, University of California, Los Angeles, 90095, CA, USA
- Department of Biostatistics, University of California, Los Angeles, 90095, CA, USA
| |
|
31
|
Hutchinson A, Reales G, Willis T, Wallace C. Leveraging auxiliary data from arbitrary distributions to boost GWAS discovery with Flexible cFDR. PLoS Genet 2021; 17:e1009853. [PMID: 34669738 PMCID: PMC8559959 DOI: 10.1371/journal.pgen.1009853] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/01/2021] [Revised: 11/01/2021] [Accepted: 09/30/2021] [Indexed: 12/15/2022] Open
Abstract
Genome-wide association studies (GWAS) have identified thousands of genetic variants that are associated with complex traits. However, a stringent significance threshold is required to identify robust genetic associations. Leveraging relevant auxiliary covariates has the potential to boost statistical power to exceed the significance threshold. In particular, abundant pleiotropy and the non-random distribution of SNPs across various functional categories suggest that leveraging GWAS test statistics from related traits and/or functional genomic data may boost GWAS discovery. While type 1 error rate control has become standard in GWAS, control of the false discovery rate can be a more powerful approach. The conditional false discovery rate (cFDR) extends the standard FDR framework by conditioning on auxiliary data to call significant associations, but current implementations are restricted to auxiliary data satisfying specific parametric distributions, typically GWAS p-values for related traits. We relax these distributional assumptions, enabling an extension of the cFDR framework that supports auxiliary covariates from arbitrary continuous distributions ("Flexible cFDR"). Our method can be applied iteratively, thereby supporting multi-dimensional covariate data. Through simulations we show that Flexible cFDR increases sensitivity whilst controlling FDR after one or several iterations. We further demonstrate its practical potential through application to an asthma GWAS, leveraging various functional genomic data to find additional genetic associations for asthma, which we validate in the larger, independent, UK Biobank data resource.
Affiliation(s)
- Anna Hutchinson
- MRC Biostatistics Unit, University of Cambridge, Cambridge, United Kingdom
| | - Guillermo Reales
- Cambridge Institute of Therapeutic Immunology and Infectious Disease (CITIID), University of Cambridge, Cambridge, United Kingdom
- Department of Medicine, University of Cambridge, Cambridge, United Kingdom
| | - Thomas Willis
- MRC Biostatistics Unit, University of Cambridge, Cambridge, United Kingdom
| | - Chris Wallace
- MRC Biostatistics Unit, University of Cambridge, Cambridge, United Kingdom
- Cambridge Institute of Therapeutic Immunology and Infectious Disease (CITIID), University of Cambridge, Cambridge, United Kingdom
- Department of Medicine, University of Cambridge, Cambridge, United Kingdom
| |
|
32
|
Salviato E, Djordjilović V, Hariprakash JM, Tagliaferri I, Pal K, Ferrari F. Leveraging three-dimensional chromatin architecture for effective reconstruction of enhancer-target gene regulatory interactions. Nucleic Acids Res 2021; 49:e97. [PMID: 34197622 PMCID: PMC8464068 DOI: 10.1093/nar/gkab547] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 02/16/2021] [Revised: 06/07/2021] [Accepted: 06/17/2021] [Indexed: 12/23/2022] Open
Abstract
A growing body of evidence in the literature suggests that germline sequence variants and somatic mutations in non-coding distal regulatory elements may be crucial for defining disease risk and prognostic stratification of patients, in genetic disorders as well as in cancer. Their functional interpretation is challenging because genome-wide enhancer-target gene (ETG) pairing is an open problem in genomics. The solutions proposed so far do not account for the hierarchy of structural domains which define chromatin three-dimensional (3D) architecture. Here we introduce a change of perspective based on the definition of multi-scale structural chromatin domains, integrated in a statistical framework to define ETG pairs. In this work (i) we develop a computational and statistical framework to reconstruct a comprehensive map of ETG pairs leveraging functional genomics data; (ii) we demonstrate that the incorporation of chromatin 3D architecture information improves ETG pairing accuracy; and (iii) we use multiple experimental datasets to extensively benchmark our method against previous solutions for the genome-wide reconstruction of ETG pairs. This solution will facilitate the annotation and interpretation of sequence variants in distal non-coding regulatory elements. We expect this to be especially helpful in clinically oriented applications of whole genome sequencing in cancer and undiagnosed genetic diseases research.
Affiliation(s)
- Elisa Salviato
- IFOM, the FIRC Institute of Molecular Oncology, Milan 20139, Italy
| | - Vera Djordjilović
- Department of Economics, Ca’ Foscari University of Venice, Venice 30100, Italy
| | | | | | - Koustav Pal
- IFOM, the FIRC Institute of Molecular Oncology, Milan 20139, Italy
| | - Francesco Ferrari
- IFOM, the FIRC Institute of Molecular Oncology, Milan 20139, Italy
- Institute of Molecular Genetics “Luigi Luca Cavalli-Sforza”, National Research Council, Pavia 27100, Italy
| |
|
33
|
Zhu Z, Fan Y, Kong Y, Lv J, Sun F. DeepLINK: Deep learning inference using knockoffs with applications to genomics. Proc Natl Acad Sci U S A 2021; 118:e2104683118. [PMID: 34480002 PMCID: PMC8433583 DOI: 10.1073/pnas.2104683118] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2021] [Accepted: 07/16/2021] [Indexed: 11/18/2022] Open
Abstract
We propose a deep learning-based knockoffs inference framework, DeepLINK, that guarantees the false discovery rate (FDR) control in high-dimensional settings. DeepLINK is applicable to a broad class of covariate distributions described by the possibly nonlinear latent factor models. It consists of two major parts: an autoencoder network for the knockoff variable construction and a multilayer perceptron network for feature selection with the FDR control. The empirical performance of DeepLINK is investigated through extensive simulation studies, where it is shown to achieve FDR control in feature selection with both high selection power and high prediction accuracy. We also apply DeepLINK to three real data applications to demonstrate its practical utility.
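The selection step that knockoff-based procedures generally rely on can be sketched as follows. This is a minimal illustration of the standard knockoff+ threshold applied to a vector of knockoff statistics; it assumes the statistics `w` have already been computed (large positive values suggesting a real signal), and it does not reproduce DeepLINK's autoencoder or perceptron networks. The function names are hypothetical.

```python
def knockoff_plus_threshold(w, q):
    """Return the knockoff+ threshold for statistics w at target FDR level q.

    The threshold is the smallest t among the nonzero |w_j| such that
    (1 + #{j: w_j <= -t}) / max(1, #{j: w_j >= t}) <= q.
    Returns infinity when no candidate satisfies the bound.
    """
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)
        positives = sum(1 for x in w if x >= t)
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def knockoff_select(w, q):
    """Indices of features whose statistic reaches the knockoff+ threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


if __name__ == "__main__":
    # Hypothetical statistics, for illustration only.
    w = [5.0, 4.0, 3.5, 3.0, -0.5, 2.5, 0.2, 2.0, -0.1, 1.5]
    print(knockoff_select(w, q=0.2))
```

The count of large negative statistics serves as an estimate of the number of false positives among the large positive ones, which is why the ratio is compared with the target level `q`.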
Affiliation(s)
- Zifan Zhu
- Quantitative and Computational Biology Department, University of Southern California, Los Angeles, CA 90089
| | - Yingying Fan
- Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089;
| | - Yinfei Kong
- Department of Information Systems and Decision Sciences, California State University, Fullerton, CA 92831
| | - Jinchi Lv
- Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089
| | - Fengzhu Sun
- Quantitative and Computational Biology Department, University of Southern California, Los Angeles, CA 90089;
| |
|
34
|
Cui T, Wang P, Zhu W. Covariate-adjusted multiple testing in genome-wide association studies via factorial hidden Markov models. TEST-SPAIN 2021. [DOI: 10.1007/s11749-020-00746-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|
35
|
Du L, Guo X, Sun W, Zou C. False Discovery Rate Control Under General Dependence By Symmetrized Data Aggregation. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1945459] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]
Affiliation(s)
- Lilun Du
- Department of ISOM, Hong Kong University of Science and Technology, Kowloon, Hong Kong
| | - Xu Guo
- Department of Mathematical Statistics, Beijing Normal University, Beijing, China
| | - Wenguang Sun
- Data Sciences and Operations, University of Southern California, Los Angeles, CA
| | - Changliang Zou
- Department of Statistics and Data Sciences, Nankai University, Tianjin, China
| |
|
36
|
Ignatiadis N, Huber W. Covariate powered cross‐weighted multiple testing. J R Stat Soc Series B Stat Methodol 2021. [DOI: 10.1111/rssb.12411] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Affiliation(s)
| | - Wolfgang Huber
- European Molecular Biology Laboratory Heidelberg Germany
| |
|
37
|
Yi S, Zhang X, Yang L, Huang J, Liu Y, Wang C, Schaid DJ, Chen J. 2dFDR: a new approach to confounder adjustment substantially increases detection power in omics association studies. Genome Biol 2021; 22:208. [PMID: 34256818 PMCID: PMC8276451 DOI: 10.1186/s13059-021-02418-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2020] [Accepted: 06/24/2021] [Indexed: 11/10/2022] Open
Abstract
One challenge facing omics association studies is the loss of statistical power when adjusting for confounders and multiple testing. The traditional statistical procedure involves fitting a confounder-adjusted regression model for each omics feature, followed by multiple testing correction. Here we show that the traditional procedure is not optimal and present a new approach, 2dFDR, a two-dimensional false discovery rate control procedure, for powerful confounder adjustment in multiple testing. Through extensive evaluation, we demonstrate that 2dFDR is more powerful than the traditional procedure, and in the presence of strong confounding and weak signals, the power improvement could be more than 100%.
Affiliation(s)
- Sangyoon Yi
- Department of Statistics, Texas A&M University, College Station, TX, 77843, USA
| | - Xianyang Zhang
- Department of Statistics, Texas A&M University, College Station, TX, 77843, USA.
| | - Lu Yang
- Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, 55905, USA
| | - Jinyan Huang
- State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, National Research Center for Translational Medicine, Rui-Jin Hospital, Shanghai Jiao Tong University, Shanghai, 200025, China
| | - Yuanhang Liu
- Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, 55905, USA
| | - Chen Wang
- Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, 55905, USA
| | - Daniel J Schaid
- Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, 55905, USA
| | - Jun Chen
- Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, 55905, USA.
| |
|
38
|
Menyhart O, Weltz B, Győrffy B. MultipleTesting.com: A tool for life science researchers for multiple hypothesis testing correction. PLoS One 2021; 16:e0245824. [PMID: 34106935 PMCID: PMC8189492 DOI: 10.1371/journal.pone.0245824] [Citation(s) in RCA: 69] [Impact Index Per Article: 17.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2021] [Accepted: 05/14/2021] [Indexed: 11/18/2022] Open
Abstract
Scientists from nearly all disciplines face the problem of simultaneously evaluating many hypotheses. Conducting multiple comparisons increases the likelihood that a non-negligible proportion of associations will be false positives, clouding real discoveries. Drawing valid conclusions requires taking into account the number of performed statistical tests and adjusting the statistical confidence measures. Several strategies exist to overcome the problem of multiple hypothesis testing. We aim to summarize critical statistical concepts and widely used correction approaches while also drawing attention to frequently misinterpreted notions of statistical inference. We provide a step-by-step description of each multiple-testing correction method with clear examples and present an easy-to-follow guide for selecting the most suitable correction technique. To facilitate multiple-testing corrections, we developed a fully automated solution not requiring programming skills or the use of a command line. Our registration-free online tool is available at www.multipletesting.com and compiles the five most frequently used adjustment tools: the Bonferroni, the Holm (step-down), and the Hochberg (step-up) corrections, as well as the calculation of False Discovery Rates (FDR) and q-values. The current summary provides a much-needed practical synthesis of basic statistical concepts regarding multiple hypothesis testing in a comprehensible language with well-illustrated examples. The web tool will fill the gap for life science researchers by providing a user-friendly substitute for command-line alternatives.
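The corrections named in this abstract can be sketched in a few lines of Python. This is a minimal illustration written for this listing, not the code behind www.multipletesting.com; the function names are hypothetical, each function returns adjusted p-values in the input order, and the q-value calculation is not covered (the FDR function returns Benjamini-Hochberg adjusted p-values only).

```python
def _ascending_order(pvals):
    """Indices of pvals sorted from smallest to largest p-value."""
    return sorted(range(len(pvals)), key=lambda i: pvals[i])


def bonferroni(pvals):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def holm(pvals):
    """Step-down: walk from the smallest p-value, enforcing monotonicity with a running maximum."""
    m = len(pvals)
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(_ascending_order(pvals)):
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def hochberg(pvals):
    """Step-up: walk from the largest p-value, enforcing monotonicity with a running minimum."""
    m = len(pvals)
    order = _ascending_order(pvals)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, (m - rank) * pvals[i])
        adjusted[i] = running_min
    return adjusted


def benjamini_hochberg(pvals):
    """FDR-adjusted p-values: scale the k-th smallest p-value by m / k, then take a running minimum."""
    m = len(pvals)
    order = _ascending_order(pvals)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, pvals[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical p-values, for illustration only.
    p = [0.01, 0.04, 0.03, 0.005]
    for correction in (bonferroni, holm, hochberg, benjamini_hochberg):
        print(correction.__name__, correction(p))
```

A hypothesis is rejected at level alpha when its adjusted p-value is at most alpha. On any input, the Bonferroni values are at least as large as the Holm values, which are at least as large as the Hochberg values.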
Affiliation(s)
- Otília Menyhart
- Department of Bioinformatics, Semmelweis University, Budapest, Hungary
- Research Centre for Natural Sciences, Cancer Biomarker Research Group, Institute of Enzymology, Budapest, Hungary
| | - Boglárka Weltz
- Research Centre for Natural Sciences, Cancer Biomarker Research Group, Institute of Enzymology, Budapest, Hungary
- A5 Genetics Ltd, Und, Hungary
| | - Balázs Győrffy
- Department of Bioinformatics, Semmelweis University, Budapest, Hungary
- Research Centre for Natural Sciences, Cancer Biomarker Research Group, Institute of Enzymology, Budapest, Hungary
- 2nd Department of Pediatrics, Semmelweis University, Budapest, Hungary
| |
|
39
|
Mussap M, Noto A, Piras C, Atzori L, Fanos V. Slotting metabolomics into routine precision medicine. EXPERT REVIEW OF PRECISION MEDICINE AND DRUG DEVELOPMENT 2021. [DOI: 10.1080/23808993.2021.1911639] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]
Affiliation(s)
- Michele Mussap
- Department of Surgical Science, University of Cagliari, Monserrato, Italy
| | - Antonio Noto
- Department of Medical Sciences and Public Health, University of Cagliari, Monserrato, Italy
| | - Cristina Piras
- Department of Surgical Science, University of Cagliari, Monserrato, Italy
- Department of Biomedical Sciences, University of Cagliari, Monserrato, Italy
| | - Luigi Atzori
- Department of Biomedical Sciences, University of Cagliari, Monserrato, Italy
| | - Vassilios Fanos
- Department of Surgical Science, University of Cagliari, Monserrato, Italy
| |
|
40
|
Deb N, Saha S, Guntuboyina A, Sen B. Two-Component Mixture Model in the Presence of Covariates. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1888739] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
Affiliation(s)
- Nabarun Deb
- Department of Statistics, Columbia University, New York, NY
| | | | | | | |
|
41
|
Liley J, Wallace C. Accurate error control in high-dimensional association testing using conditional false discovery rates. Biom J 2021; 63:1096-1130. [PMID: 33682201 PMCID: PMC7612315 DOI: 10.1002/bimj.201900254] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2019] [Revised: 12/05/2020] [Accepted: 12/30/2020] [Indexed: 01/13/2023]
Abstract
High-dimensional hypothesis testing is ubiquitous in the biomedical sciences, and informative covariates may be employed to improve power. The conditional false discovery rate (cFDR) is a widely used approach suited to the setting where the covariate is a set of p-values for the equivalent hypotheses for a second trait. Although related to the Benjamini–Hochberg procedure, it does not permit any easy control of type-1 error rate and existing methods are over-conservative. We propose a new method for type-1 error rate control based on identifying mappings from the unit square to the unit interval defined by the estimated cFDR and splitting observations so that each map is independent of the observations it is used to test. We also propose an adjustment to the existing cFDR estimator which further improves power. We show by simulation that the new method more than doubles potential improvement in power over unconditional analyses compared to existing methods. We demonstrate our method on transcriptome-wide association studies and show that the method can be used in an iterative way, enabling the use of multiple covariates successively. Our methods substantially improve the power and applicability of cFDR analysis.
Affiliation(s)
- James Liley
- MRC Biostatistics Unit, University of Cambridge, Cambridge, UK; Department of Medicine, Addenbrookes Hospital, University of Cambridge, Cambridge, UK
| | - Chris Wallace
- MRC Biostatistics Unit, University of Cambridge, Cambridge, UK; Department of Medicine, Addenbrookes Hospital, University of Cambridge, Cambridge, UK; Cambridge Institute of Therapeutic Immunology and Infectious Disease, Jeffrey Cheah Biomedical Centre, Cambridge Biomedical Campus, Cambridge, UK
| |
|
42
|
Cai TT, Sun W, Xia Y. LAWS: A Locally Adaptive Weighting and Screening Approach to Spatial Multiple Testing. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2020.1859379] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/31/2022]
Affiliation(s)
- T. Tony Cai
- Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA
| | - Wenguang Sun
- Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA
| | - Yin Xia
- Department of Statistics, School of Management, Fudan University, Shanghai, China
| |
|
43
|
Fu L, Gang B, James GM, Sun W. Heteroscedasticity-Adjusted Ranking and Thresholding for Large-Scale Multiple Testing. J Am Stat Assoc 2020. [DOI: 10.1080/01621459.2020.1840992] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
Affiliation(s)
- Luella Fu
- Department of Mathematics, San Francisco State University, San Francisco, CA
| | - Bowen Gang
- Department of Statistics, Fudan University, Shanghai, China
| | - Gareth M. James
- Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA
| | - Wenguang Sun
- Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA
| |
|
44
|
Katsevich E, Ramdas A. Simultaneous high-probability bounds on the false discovery proportion in structured, regression and online settings. Ann Stat 2020. [DOI: 10.1214/19-aos1938] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
45
|
Lei L, Ramdas A, Fithian W. A general interactive framework for false discovery rate control under structural constraints. Biometrika 2020. [DOI: 10.1093/biomet/asaa064] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Summary
We propose a general framework based on selectively traversed accumulation rules for interactive multiple testing with generic structural constraints on the rejection set. It combines accumulation tests from ordered multiple testing with data-carving ideas from post-selection inference, allowing highly flexible adaptation to generic structural information. Our procedure defines an interactive protocol for gradually pruning a candidate rejection set, beginning with the set of all hypotheses and shrinking the set with each step. By restricting the information at each step via a technique we call masking, our protocol enables interaction while controlling the false discovery rate in finite samples for any data-adaptive update rule that the analyst may choose. We suggest update rules for a variety of applications with complex structural constraints, demonstrate that selectively traversed accumulation rules perform well in problems ranging from convex region detection to false discovery rate control on directed acyclic graphs, and show how to extend the framework to regression problems where knockoff statistics are available in lieu of $p$-values.
Affiliation(s)
- Lihua Lei
- Department of Statistics, Stanford University, 202 Sequoia Hall, 390 Serra Mall, Stanford, California 94305, U.S.A
| | - Aaditya Ramdas
- Department of Statistics and Data Science, Carnegie Mellon University, 132H Baker Hall, Pittsburgh, Pennsylvania 15213, U.S.A
| | - William Fithian
- Department of Statistics, University of California, Berkeley, 301 Evans Hall, Berkeley, California 94720, U.S.A
| |
|
46
|
A selective inference approach for false discovery rate control using multiomics covariates yields insights into disease risk. Proc Natl Acad Sci U S A 2020; 117:15028-15035. [PMID: 32522875 PMCID: PMC7334489 DOI: 10.1073/pnas.1918862117] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022] Open
Abstract
Variation is rampant throughout human genomes: some of it affects disease risk, and most does not; to separate the two requires a plethora of hypothesis tests. This challenge of multiple testing, limiting false positives while maximizing power, arises in many “omics” studies and sciences. One approach is to control the false discovery rate (FDR), and a recent selective inference method for controlling FDR, adaptive P-value thresholding (AdaPT), facilitates incorporation of auxiliary information (covariates) related to each hypothesis test. How AdaPT performs on data is an open question. We apply AdaPT to results from genomic association studies and include many covariates. This adaptive search discovers a more complex and interpretable model with far greater power than classic multiple testing procedures. To correct for a large number of hypothesis tests, most researchers rely on simple multiple testing corrections. Yet, new methodologies of selective inference could potentially improve power while retaining statistical guarantees, especially those that enable exploration of test statistics using auxiliary information (covariates) to weight hypothesis tests for association. We explore one such method, adaptive P-value thresholding (AdaPT), in the framework of genome-wide association studies (GWAS) and gene expression/coexpression studies, with particular emphasis on schizophrenia (SCZ). Selected SCZ GWAS association P values play the role of the primary data for AdaPT; single-nucleotide polymorphisms (SNPs) are selected because they are gene expression quantitative trait loci (eQTLs). This natural pairing of SNPs and genes allows us to map the following covariate values to these pairs: GWAS statistics from genetically correlated bipolar disorder, the effect size of SNP genotypes on gene expression, and gene–gene coexpression, captured by subnetwork (module) membership.
In all, 24 covariates per SNP/gene pair were included in the AdaPT analysis using flexible gradient boosted trees. We demonstrate a substantial increase in power to detect SCZ associations using gene expression information from the developing human prefrontal cortex. We interpret these results in light of recent theories about the polygenic nature of SCZ. Importantly, our entire process for identifying enrichment and creating features with independent complementary data sources can be implemented in many different high-throughput settings to ultimately improve power.
|
47
|
Tian Z, Liang K, Li P. A powerful procedure that controls the false discovery rate with directional information. Biometrics 2020; 77:212-222. [PMID: 32277471 DOI: 10.1111/biom.13277] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2019] [Revised: 02/14/2020] [Accepted: 03/23/2020] [Indexed: 11/28/2022]
Abstract
In many multiple testing applications in genetics, the signs of the test statistics provide useful directional information, such as whether genes are potentially up- or down-regulated between two experimental conditions. However, most existing procedures that control the false discovery rate (FDR) are P-value based and ignore such directional information. We introduce a novel procedure, the signed-knockoff procedure, to utilize the directional information and control the FDR in finite samples. We demonstrate the power advantage of our procedure through simulation studies and two real applications.
Collapse
Affiliation(s)
- Zhaoyang Tian
- Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
| | - Kun Liang
- Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
| | - Pengfei Li
- Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
| |
|
48
|
Huang J, Bai L, Cui B, Wu L, Wang L, An Z, Ruan S, Yu Y, Zhang X, Chen J. Leveraging biological and statistical covariates improves the detection power in epigenome-wide association testing. Genome Biol 2020; 21:88. [PMID: 32252795 PMCID: PMC7132874 DOI: 10.1186/s13059-020-02001-7] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2019] [Accepted: 03/17/2020] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND: Epigenome-wide association studies (EWAS), which seek the association between epigenetic marks and an outcome or exposure, involve multiple hypothesis testing. False discovery rate (FDR) control has been widely used for multiple testing correction. However, traditional FDR control methods do not use auxiliary covariates, and they could be less powerful if the covariates could inform the likelihood of the null hypothesis. Recently, many covariate-adaptive FDR control methods have been developed, but application of these methods to EWAS data has not yet been explored. It is not clear whether these methods can significantly improve detection power, and if so, which covariates are more relevant for EWAS data.
RESULTS: In this study, we evaluate the performance of five covariate-adaptive FDR control methods with EWAS-related covariates using simulated as well as real EWAS datasets. We develop an omnibus test to assess the informativeness of the covariates. We find that statistical covariates are generally more informative than biological covariates, and the covariates of methylation mean and variance are almost universally informative. In contrast, the informativeness of biological covariates depends on specific datasets. We show that the independent hypothesis weighting (IHW) and covariate adaptive multiple testing (CAMT) methods are overall more powerful, especially for sparse signals, and could improve the detection power by a median of 25% and 68% on real datasets, compared to the ST procedure. We further validate the findings in various biological contexts.
CONCLUSIONS: Covariate-adaptive FDR control methods with informative covariates can significantly increase the detection power for EWAS. For sparse signals, IHW and CAMT are recommended.
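The general idea behind weighting hypotheses by a covariate can be sketched with the classical weighted Benjamini-Hochberg procedure. This is a minimal illustration that takes the weights as given; it does not reproduce IHW or CAMT, both of which learn the weights (or an equivalent covariate-dependent rule) from the data. The function name and the example inputs are hypothetical.

```python
def weighted_bh(pvals, weights, alpha):
    """Weighted Benjamini-Hochberg: return the sorted indices of rejected hypotheses.

    Weights are rescaled to average 1, each p-value is divided by its weight,
    and the ordinary BH step-up rule is applied to the rescaled p-values.
    A weight of 0 means the hypothesis can never be rejected.
    """
    m = len(pvals)
    mean_weight = sum(weights) / m
    normalized = [w / mean_weight for w in weights]
    scaled = [min(1.0, p / w) if w > 0 else 1.0 for p, w in zip(pvals, normalized)]

    order = sorted(range(m), key=lambda i: scaled[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose rescaled p-value is under its threshold.
        if scaled[i] <= alpha * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


if __name__ == "__main__":
    # Hypothetical p-values and weights, for illustration only.
    p = [0.001, 0.02, 0.04, 0.5]
    print(weighted_bh(p, [1, 1, 1, 1], alpha=0.05))  # unit weights reduce to plain BH
    print(weighted_bh(p, [1, 1, 2, 0], alpha=0.05))  # up-weighting the third hypothesis
```

With unit weights the third hypothesis narrowly misses its threshold; doubling its weight, at the cost of giving up on the fourth, is enough for it to be rejected. An informative covariate is useful precisely when it suggests such a reallocation.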
Affiliation(s)
- Jinyan Huang
- State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, National Research Center for Translational Medicine, Rui-Jin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai Jiao Tong University, 197 Ruijin Er Road, Shanghai, 200025, China.
| | - Ling Bai
- State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, National Research Center for Translational Medicine, Rui-Jin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai Jiao Tong University, 197 Ruijin Er Road, Shanghai, 200025, China
| | - Bowen Cui
- State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, National Research Center for Translational Medicine, Rui-Jin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai Jiao Tong University, 197 Ruijin Er Road, Shanghai, 200025, China
| | - Liang Wu
- State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, National Research Center for Translational Medicine, Rui-Jin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai Jiao Tong University, 197 Ruijin Er Road, Shanghai, 200025, China
| | - Liwen Wang
- Department of General Surgery, Rui-Jin Hospital, Shanghai Jiao Tong University, 197 Ruijin Er Road, Shanghai, 200025, China
| | - Zhiyin An
- State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, National Research Center for Translational Medicine, Rui-Jin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai Jiao Tong University, 197 Ruijin Er Road, Shanghai, 200025, China
| | - Shulin Ruan
- State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, National Research Center for Translational Medicine, Rui-Jin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai Jiao Tong University, 197 Ruijin Er Road, Shanghai, 200025, China
| | - Yue Yu
- Division of Digital Health Sciences, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA
| | - Xianyang Zhang
- Department of Statistics, Texas A&M University, Blocker 449D, College Station, TX, 77843, USA.
| | - Jun Chen
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research and Center for Individualized Medicine, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA.
| |
|
49
|
Durand G, Blanchard G, Neuvial P, Roquain E. Post hoc false positive control for structured hypotheses. Scand Stat Theory Appl 2020. [DOI: 10.1111/sjos.12453] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Affiliation(s)
- Guillermo Durand
- Laboratoire de probabilités Statistique et Modélisation, LPSM Sorbonne Université France
| | - Gilles Blanchard
- Laboratoire de Mathématiques d'Orsay Université Paris‐Sud, CNRS, Université Paris‐Saclay France
| | - Pierre Neuvial
- Institut de Mathématiques de Toulouse UMR 5219, Université de Toulouse, CNRS, UPS IMT France
| | - Etienne Roquain
- Laboratoire de probabilités Statistique et Modélisation, LPSM Sorbonne Université France
| |
|
50
|
Duan B, Ramdas A, Balakrishnan S, Wasserman L. Interactive martingale tests for the global null. Electron J Stat 2020. [DOI: 10.1214/20-ejs1790] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|