1
Samaddar A, Maiti T, de los Campos G. Bayesian hierarchical hypothesis testing in large-scale genome-wide association analysis. Genetics 2024; 228:iyae164. [PMID: 39560456 PMCID: PMC11631447 DOI: 10.1093/genetics/iyae164] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2024] [Accepted: 08/29/2024] [Indexed: 11/20/2024] Open
Abstract
Variable selection and large-scale hypothesis testing are techniques commonly used to analyze high-dimensional genomic data. Despite recent advances in theory and methodology, variable selection and inference with highly collinear features remain challenging. For instance, collinearity poses a great challenge in genome-wide association studies involving millions of variants, many of which may be in high linkage disequilibrium. In such settings, collinearity can significantly reduce the power of variable selection methods to identify individual variants associated with an outcome. To address such challenges, we developed Bayesian hierarchical hypothesis testing (BHHT), a novel multiresolution testing procedure that offers high power with adequate error control and fine-mapping resolution. We demonstrate through simulations that the proposed methodology has a power-FDR performance that is competitive with (and in many scenarios better than) state-of-the-art methods. Finally, we demonstrate the feasibility of using BHHT with a large sample size (n ∼ 300,000) and ultra-high-dimensional genotypes (∼ 15 million single-nucleotide polymorphisms, or SNPs) by applying it to eight complex traits using data from the UK Biobank. Our results show that the proposed methodology leads to many more discoveries than those obtained using traditional SNP-centered inference procedures. The article is accompanied by open-source software that implements the methods described in this study using algorithms that scale to biobank-size ultra-high-dimensional data.
Affiliation(s)
- Anirban Samaddar
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
- Tapabrata Maiti
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
- Gustavo de los Campos
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, USA
  - Institute for Quantitative Health Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
2
Pudjihartono N, Ho D, Golovina E, Fadason T, Kempa-Liehr AW, O'Sullivan JM. Juvenile idiopathic arthritis-associated genetic loci exhibit spatially constrained gene regulatory effects across multiple tissues and immune cell types. J Autoimmun 2023; 138:103046. [PMID: 37229810 DOI: 10.1016/j.jaut.2023.103046] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 03/05/2023] [Revised: 04/04/2023] [Accepted: 04/15/2023] [Indexed: 05/27/2023]
Abstract
Juvenile idiopathic arthritis (JIA) is an autoimmune, inflammatory joint disease with complex genetic etiology. Previous GWAS have found many genetic loci associated with JIA. However, the biological mechanism behind JIA remains unknown mainly because most risk loci are located in non-coding genetic regions. Interestingly, increasing evidence has found that regulatory elements in the non-coding regions can regulate the expression of distant target genes through spatial (physical) interactions. Here, we used information on the 3D genome organization (Hi-C data) to identify target genes that physically interact with SNPs within JIA risk loci. Subsequent analysis of these SNP-gene pairs using data from tissue and immune cell type-specific expression quantitative trait loci (eQTL) databases allowed the identification of risk loci that regulate the expression of their target genes. In total, we identified 59 JIA-risk loci that regulate the expression of 210 target genes across diverse tissues and immune cell types. Functional annotation of spatial eQTLs within JIA risk loci identified significant overlap with gene regulatory elements (i.e., enhancers and transcription factor binding sites). We found target genes involved in immune-related pathways such as antigen processing and presentation (e.g., ERAP2, HLA class I and II), the release of pro-inflammatory cytokines (e.g., LTBR, TYK2), proliferation and differentiation of specific immune cell types (e.g., AURKA in Th17 cells), and genes involved in physiological mechanisms related to pathological joint inflammation (e.g., LRG1 in arteries). Notably, many of the tissues where JIA-risk loci act as spatial eQTLs are not classically considered central to JIA pathology. Overall, our findings highlight the potential tissue and immune cell type-specific regulatory changes contributing to JIA pathogenesis. Future integration of our data with clinical studies can contribute to the development of improved JIA therapy.
Affiliation(s)
- N Pudjihartono
  - The Liggins Institute, The University of Auckland, Auckland, New Zealand
- D Ho
  - The Liggins Institute, The University of Auckland, Auckland, New Zealand
- E Golovina
  - The Liggins Institute, The University of Auckland, Auckland, New Zealand
- T Fadason
  - The Liggins Institute, The University of Auckland, Auckland, New Zealand
- A W Kempa-Liehr
  - Department of Engineering Science, The University of Auckland, Auckland, New Zealand
- J M O'Sullivan
  - The Liggins Institute, The University of Auckland, Auckland, New Zealand; The Maurice Wilkins Centre, The University of Auckland, Auckland, New Zealand; MRC Lifecourse Epidemiology Unit, University of Southampton, United Kingdom; Australian Parkinsons Mission, Garvan Institute of Medical Research, Sydney, New South Wales, 384 Victoria Street, Darlinghurst, NSW, 2010, Australia; A*STAR Singapore Institute for Clinical Sciences, Singapore, Singapore
3
Zhang R, Wang H, Xie Y. Online score statistics for detecting clustered change in network point processes. Seq Anal 2023. [DOI: 10.1080/07474946.2022.2164307] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/18/2023]
Affiliation(s)
- Rui Zhang
  - School of Industrial and Systems Engineering (ISyE), Georgia Institute of Technology, Atlanta, Georgia, USA
- Haoyun Wang
  - School of Industrial and Systems Engineering (ISyE), Georgia Institute of Technology, Atlanta, Georgia, USA
- Yao Xie
  - School of Industrial and Systems Engineering (ISyE), Georgia Institute of Technology, Atlanta, Georgia, USA
4
Cao H, Wu WB. Testing and estimation for clustered signals. Bernoulli 2022. [DOI: 10.3150/21-bej1355] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Hongyuan Cao
  - School of Mathematics, Jilin University, 2699 Qianjing Street, Changchun, 130012, China
- Wei Biao Wu
  - Department of Statistics, University of Chicago, 5747 South Ellis Avenue, Chicago, IL, 60637, USA
5
Chen H, Ren H, Yao F, Zou C. Data-driven selection of the number of change-points via error rate control. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1999820] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/19/2022]
Affiliation(s)
- Hui Chen
  - School of Statistics and Data Science, Nankai University, China
- Haojie Ren
  - School of Mathematical Sciences, Shanghai Jiao Tong University, China
- Fang Yao
  - School of Mathematical Sciences, Peking University, China
- Changliang Zou
  - School of Statistics and Data Science, Nankai University, China
6
Wang G, Zou C, Qiu P. Data-Driven Determination of the Number of Jumps in Regression Curves. Technometrics 2021. [DOI: 10.1080/00401706.2021.1978551] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Affiliation(s)
- Guanghui Wang
  - KLATASDS-MOE, Academy of Statistics and Interdisciplinary Sciences, East China Normal University, Shanghai, China
- Changliang Zou
  - School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin, China
- Peihua Qiu
  - Department of Biostatistics, University of Florida, Gainesville, FL
7
Bogomolov M, Peterson CB, Benjamini Y, Sabatti C. Hypotheses on a tree: new error rates and testing strategies. Biometrika 2021; 108:575-590. [PMID: 36825068 PMCID: PMC9945647 DOI: 10.1093/biomet/asaa086] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Indexed: 11/14/2022] Open
Abstract
We introduce a multiple testing procedure that controls global error rates at multiple levels of resolution. Conceptually, we frame this problem as the selection of hypotheses that are organized hierarchically in a tree structure. We describe a fast algorithm and prove that it controls relevant error rates given certain assumptions on the dependence between the p-values. Through simulations, we demonstrate that the proposed procedure provides the desired guarantees under a range of dependency structures and that it has the potential to gain power over alternative methods. Finally, we apply the method to studies on the genetic regulation of gene expression across multiple tissues and on the relation between the gut microbiome and colorectal cancer.
Affiliation(s)
- Marina Bogomolov
  - The William Davidson Faculty of Industrial Engineering and Management, Technion-Israel Institute of Technology, Technion City, Haifa 3200003, Israel
- Christine B. Peterson
  - Department of Biostatistics, Division of Basic Science Research, The University of Texas, MD Anderson Cancer Center, Houston, Texas 77030, U.S.A
- Yoav Benjamini
  - Department of Statistics and Operations Research, Tel-Aviv University, P.O. Box 39040, Tel-Aviv 6997801, Israel
- Chiara Sabatti
  - Department of Statistics, Stanford University, 50 Governor’s Lane, Stanford, California 94305, U.S.A
8
Katsevich E, Sabatti C, Bogomolov M. Filtering the rejection set while preserving false discovery rate control. J Am Stat Assoc 2021; 118:165-176. [PMID: 37346227 PMCID: PMC10281705 DOI: 10.1080/01621459.2021.1920958] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 04/22/2020] [Revised: 04/14/2021] [Accepted: 04/18/2021] [Indexed: 12/28/2022]
Abstract
Scientific hypotheses in a variety of applications have domain-specific structures, such as the tree structure of the International Classification of Diseases (ICD), the directed acyclic graph structure of the Gene Ontology (GO), or the spatial structure in genome-wide association studies. In the context of multiple testing, the resulting relationships among hypotheses can create redundancies among rejections that hinder interpretability. This leads to the practice of filtering rejection sets obtained from multiple testing procedures, which may in turn invalidate their inferential guarantees. We propose Focused BH, a simple, flexible, and principled methodology to adjust for the application of any pre-specified filter. We prove that Focused BH controls the false discovery rate under various conditions, including when the filter satisfies an intuitive monotonicity property and the p-values are positively dependent. We demonstrate in simulations that Focused BH performs well across a variety of settings, and illustrate this method's practical utility via analyses of real datasets based on ICD and GO.
Affiliation(s)
- Chiara Sabatti
  - Departments of Statistics and Biomedical Data Science, Stanford University
- Marina Bogomolov
  - Faculty of Industrial Engineering and Management, Technion - Israel Institute of Technology
9
Tibbs Cortes L, Zhang Z, Yu J. Status and prospects of genome-wide association studies in plants. The Plant Genome 2021; 14:e20077. [PMID: 33442955 DOI: 10.1002/tpg2.20077] [Citation(s) in RCA: 180] [Impact Index Per Article: 45.0] [Received: 08/05/2020] [Accepted: 11/18/2020] [Indexed: 05/22/2023]
Abstract
Genome-wide association studies (GWAS) have developed into a powerful and ubiquitous tool for the investigation of complex traits. In large part, this was fueled by advances in genomic technology, enabling us to examine genome-wide genetic variants across diverse genetic materials. The development of the mixed model framework for GWAS dramatically reduced the number of false positives compared with naïve methods. Building on this foundation, many methods have since been developed to increase computational speed or improve statistical power in GWAS. These methods have allowed the detection of genomic variants associated with either traditional agronomic phenotypes or biochemical and molecular phenotypes. In turn, these associations enable applications in gene cloning and in accelerated crop breeding through marker assisted selection or genetic engineering. Current topics of investigation include rare-variant analysis, synthetic associations, optimizing the choice of GWAS model, and utilizing GWAS results to advance knowledge of biological processes. Ongoing research in these areas will facilitate further advances in GWAS methods and their applications.
Affiliation(s)
- Zhiwu Zhang
  - Department of Crop and Soil Sciences, Washington State University, Pullman, WA, 99164, USA
- Jianming Yu
  - Department of Agronomy, Iowa State University, Ames, IA, 50010, USA
10
11
Fang X, Li J, Siegmund D. Segmentation and estimation of change-point models: False positive control and confidence regions. Ann Stat 2020. [DOI: 10.1214/19-aos1861] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/19/2022]
12
Wang Y, Wang G, Wang L, Ogden RT. Simultaneous confidence corridors for mean functions in functional data analysis of imaging data. Biometrics 2020; 76:427-437. [PMID: 31544958 PMCID: PMC7310608 DOI: 10.1111/biom.13156] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 12/18/2018] [Accepted: 09/09/2019] [Indexed: 11/30/2022]
Abstract
Motivated by recent work involving the analysis of biomedical imaging data, we present a novel procedure for constructing simultaneous confidence corridors for the mean of imaging data. We propose to use flexible bivariate splines over triangulations to handle an irregular domain of the images that is common in brain imaging studies and in other biomedical imaging applications. The proposed spline estimators of the mean functions are shown to be consistent and asymptotically normal under some regularity conditions. We also provide a computationally efficient estimator of the covariance function and derive its uniform consistency. The procedure is also extended to the two-sample case in which we focus on comparing the mean functions from two populations of imaging data. Through Monte Carlo simulation studies, we examine the finite sample performance of the proposed method. Finally, the proposed method is applied to analyze brain positron emission tomography data in two different studies. One data set used in preparation of this article was obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database.
Affiliation(s)
- Yueying Wang
  - Department of Statistics, Iowa State University, Ames, Iowa
- Guannan Wang
  - Department of Mathematics, College of William and Mary, Williamsburg, Virginia
- Li Wang
  - Department of Statistics, Iowa State University, Ames, Iowa
- R. Todd Ogden
  - Department of Biostatistics, Columbia University, New York, New York
13
Korthauer K, Chakraborty S, Benjamini Y, Irizarry RA. Detection and accurate false discovery rate control of differentially methylated regions from whole genome bisulfite sequencing. Biostatistics 2019; 20:367-383. [PMID: 29481604 PMCID: PMC6587918 DOI: 10.1093/biostatistics/kxy007] [Citation(s) in RCA: 98] [Impact Index Per Article: 16.3] [Received: 08/31/2017] [Accepted: 01/21/2018] [Indexed: 12/22/2022] Open
Abstract
With recent advances in sequencing technology, it is now feasible to measure DNA methylation at tens of millions of sites across the entire genome. In most applications, biologists are interested in detecting differentially methylated regions, composed of multiple sites with differing methylation levels among populations. However, current computational approaches for detecting such regions do not provide accurate statistical inference. A major challenge in reporting uncertainty is that a genome-wide scan is involved in detecting these regions, which needs to be accounted for. A further challenge is that sample sizes are limited due to the costs associated with the technology. We have developed a new approach that overcomes these challenges and assesses uncertainty for differentially methylated regions in a rigorous manner. Region-level statistics are obtained by fitting a generalized least squares regression model with a nested autoregressive correlated error structure for the effect of interest on transformed methylation proportions. We develop an inferential approach, based on a pooled null distribution, that can be implemented even when as few as two samples per population are available. Here, we demonstrate the advantages of our method using both experimental data and Monte Carlo simulation. We find that the new method improves the specificity and sensitivity of lists of regions and accurately controls the false discovery rate.
Affiliation(s)
- Keegan Korthauer
  - Department of Biostatistics & Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA, USA and Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA, USA
- Sutirtha Chakraborty
  - Novartis, Inorbit Mall Rd, Silpa Gram Craft Village, HITEC City, Hyderabad, Telangana, India
- Yuval Benjamini
  - The Statistics Department, Hebrew University, Mount Scopus, Jerusalem, Israel
- Rafael A Irizarry
  - Department of Biostatistics & Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA, USA and Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA, USA
14
Katsevich E, Sabatti C. Multilayer knockoff filter: controlled variable selection at multiple resolutions. Ann Appl Stat 2019; 13:1-33. [PMID: 31687060 PMCID: PMC6827557 DOI: 10.1214/18-aoas1185] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Indexed: 11/19/2022]
Abstract
We tackle the problem of selecting from among a large number of variables those that are "important" for an outcome. We consider situations where groups of variables are also of interest. For example, each variable might be a genetic polymorphism, and we might want to study how a trait depends on variability in genes, segments of DNA that typically contain multiple such polymorphisms. In this context, to discover that a variable is relevant for the outcome implies discovering that the larger entity it represents is also important. To guarantee meaningful results with high chance of replicability, we suggest controlling the rate of false discoveries for findings at the level of individual variables and at the level of groups. Building on the knockoff construction of Barber and Candès [Ann. Statist. 43 (2015) 2055-2085] and the multilayer testing framework of Barber and Ramdas [J. Roy. Statist. Soc. Ser. B 79 (2017) 1247-1268], we introduce the multilayer knockoff filter (MKF). We prove that MKF simultaneously controls the FDR at each resolution and use simulations to show that it incurs little power loss compared to methods that provide guarantees only for the discoveries of individual variables. We apply MKF to analyze a genetic dataset and find that it successfully reduces the number of false gene discoveries without a significant reduction in power.
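The idea of reporting discoveries at two resolutions can be illustrated with a much simpler stand-in than the multilayer knockoff filter itself. The sketch below is not MKF and uses no knockoffs: it assumes per-variant p-values are already available, applies the Benjamini-Hochberg step-up procedure at the variant level, and applies it again to Simes p-values computed per group. The function names, the group labels, and the choice of Simes combination are illustrative assumptions. Running the two layers separately does not by itself provide the coordinated multilayer guarantee the abstract describes.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def benjamini_hochberg(pvalues: Sequence[float], q: float) -> List[int]:
    """Indices rejected by the Benjamini-Hochberg step-up procedure at level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    # Find the largest rank k (1-based) with p_(k) <= k * q / m.
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            cutoff = rank
    return sorted(order[:cutoff])


def simes(pvalues: Sequence[float]) -> float:
    """Simes combination p-value for the global null of one group."""
    m = len(pvalues)
    ranked = sorted(pvalues)
    return min(m * p / rank for rank, p in enumerate(ranked, start=1))


def two_resolution_discoveries(
    pvalues: Sequence[float], groups: Sequence[Hashable], q: float
) -> Tuple[List[int], List[Hashable]]:
    """Run BH on variants, then BH on per-group Simes p-values (illustrative only)."""
    variant_hits = benjamini_hochberg(pvalues, q)

    # Collect the p-values belonging to each group (e.g. each gene).
    members: Dict[Hashable, List[float]] = {}
    for p, g in zip(pvalues, groups):
        members.setdefault(g, []).append(p)

    names = sorted(members)
    group_p = [simes(members[g]) for g in names]
    group_hits = [names[i] for i in benjamini_hochberg(group_p, q)]
    return variant_hits, group_hits
```

Calling `two_resolution_discoveries(pvalues, groups, q=0.1)` returns the indices of selected variants and the labels of selected groups, so the two lists can be compared for consistency across resolutions.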
Affiliation(s)
- Eugene Katsevich
  - Department of Statistics, Stanford University, 390 Serra Mall, Stanford, California 94305
- Chiara Sabatti
  - Department of Statistics, Stanford University, 390 Serra Mall, Stanford, California 94305
15
Benjamini Y, Taylor J, Irizarry RA. Selection-Corrected Statistical Inference for Region Detection With High-Throughput Assays. J Am Stat Assoc 2018; 114:1351-1365. [PMID: 36312875 PMCID: PMC9615469 DOI: 10.1080/01621459.2018.1498347] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2016] [Revised: 06/01/2018] [Indexed: 10/28/2022]
Abstract
Scientists use high-dimensional measurement assays to detect and prioritize regions of strong signal in a spatially organized domain. Examples include finding methylation-enriched genomic regions using microarrays, and active cortical areas using brain imaging. The most common procedure for detecting potential regions is to group neighboring sites where the signal passed a threshold. However, one needs to account for the selection bias induced by this procedure to avoid diminishing effects when generalizing to a population. This paper introduces pin-down inference, a model and an inference framework that permit population inference for these detected regions. Pin-down inference provides non-asymptotic point and confidence interval estimators for the mean effect in the region that account for local selection bias. Our estimators accommodate non-stationary covariances that are typical of these data, allowing researchers to better compare regions of different sizes and correlation structures. Inference is provided within a conditional one-parameter exponential family per region, with truncations that match the selection constraints. A secondary screening-and-adjustment step allows pruning the set of detected regions, while controlling the false-coverage rate over the reported regions. We apply the method to genomic regions with differing DNA-methylation rates across tissue. Our method provides superior power compared to other conditional and non-parametric approaches.
Affiliation(s)
- Yuval Benjamini
  - Department of Statistics, Hebrew University of Jerusalem, Israel
- Rafael A Irizarry
  - Department of Biostatistics and Computational Biology, Dana Farber Cancer Institute and Department of Biostatistics, Harvard University
16
Page CM, Vos L, Rounge TB, Harbo HF, Andreassen BK. Assessing genome-wide significance for the detection of differentially methylated regions. Stat Appl Genet Mol Biol 2018; 17:/j/sagmb.ahead-of-print/sagmb-2017-0050/sagmb-2017-0050.xml. [PMID: 30231014 DOI: 10.1515/sagmb-2017-0050] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 12/20/2022]
Abstract
DNA methylation plays an important role in human health and disease, and methods for the identification of differently methylated regions are of increasing interest. There is currently a lack of statistical methods which properly address multiple testing, i.e. control genome-wide significance for differentially methylated regions. We introduce a scan statistic (DMRScan), which overcomes these limitations. We benchmark DMRScan against two well established methods (bumphunter, DMRcate), using a simulation study based on real methylation data. An implementation of DMRScan is available from Bioconductor. Our method has higher power than alternative methods across different simulation scenarios, particularly for small effect sizes. DMRScan exhibits greater flexibility in statistical modeling and can be used with more complex designs than current methods. DMRScan is the first dynamic approach which properly addresses the multiple-testing challenges for the identification of differently methylated regions. DMRScan outperformed alternative methods in terms of power, while keeping the false discovery rate controlled.
Affiliation(s)
- Christian M Page
  - Department of Neurology, Institute of Clinical Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway; Department of Neurology, Division of Clinical Neuroscience, Oslo University Hospital, N-0407 Oslo, Norway; Department of Non-Communicable Diseases, Norwegian Institute of Public Health, N-0403 Oslo, Norway
- Linda Vos
  - Department of Research, Cancer Registry of Norway, Oslo, Norway
- Trine B Rounge
  - Department of Research, Cancer Registry of Norway, Oslo, Norway; Genetic Epidemiology Group, Folkhälsan Research Center, Helsinki, Finland
- Hanne F Harbo
  - Department of Neurology, Institute of Clinical Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway; Department of Neurology, Division of Clinical Neuroscience, Oslo University Hospital, N-0407 Oslo, Norway
17
Picard F, Reynaud-Bouret P, Roquain E. Continuous testing for Poisson process intensities: a new perspective on scanning statistics. Biometrika 2018. [DOI: 10.1093/biomet/asy044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022] Open
Affiliation(s)
- Franck Picard
  - Centre National de la Recherche Scientifique, Laboratoire de Biométrie et Biologie Evolutive, 43 Boulevard du 11 Novembre 1918, Villeurbanne, France
- Patricia Reynaud-Bouret
  - Centre National de la Recherche Scientifique, Laboratoire Jean Alexandre Dieudonné, Parc Valrose, Nice, France
- Etienne Roquain
  - Sorbonne Université, Laboratoire de Probabilités, Statistique et Modélisation, 4 Place Jussieu, Paris, France
18
Li J, Gahm JK, Shi Y, Toga AW. Topological false discovery rates for brain mapping based on signal height. Neuroimage 2018; 167:478-487. [PMID: 27838286 PMCID: PMC5423870 DOI: 10.1016/j.neuroimage.2016.09.045] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/30/2016] [Revised: 09/15/2016] [Accepted: 09/18/2016] [Indexed: 11/18/2022] Open
Abstract
Correcting the effect of multiple testing is important in statistical parametric mapping. If the threshold is too liberal, then spurious claims may flood in; if it is too conservative, then true hints may be overlooked. It is highly desirable to combine random field theory and the false discovery rate (FDR) to achieve more powerful detection under gauged topological errors. However, the current FDR method based on peak height does not fully meet this expectation, and sometimes is more conservative than the traditional family-wise error rate method, for unexplained reasons. In this paper, we introduce a new topological FDR method based on signal height. As analyzed in theory and validated with extensive experiments, it controls error rates much more accurately than the peak FDR method does, and substantially gains detection power. In addition, we discover reasons behind the peak FDR method's under-performance, and formulate equations to predict the two methods' behavior.
Affiliation(s)
- Junning Li
  - Laboratory of Neuro Imaging (LONI), Stevens Neuroimaging and Informatics Institute, Keck School of Medicine, University of Southern California, Los Angeles, USA
- Jin Kyu Gahm
  - Laboratory of Neuro Imaging (LONI), Stevens Neuroimaging and Informatics Institute, Keck School of Medicine, University of Southern California, Los Angeles, USA
- Yonggang Shi
  - Laboratory of Neuro Imaging (LONI), Stevens Neuroimaging and Informatics Institute, Keck School of Medicine, University of Southern California, Los Angeles, USA
- Arthur W Toga
  - Laboratory of Neuro Imaging (LONI), Stevens Neuroimaging and Informatics Institute, Keck School of Medicine, University of Southern California, Los Angeles, USA
19
Genetic Dissection of Nutrition-Induced Plasticity in Insulin/Insulin-Like Growth Factor Signaling and Median Life Span in a Drosophila Multiparent Population. Genetics 2017; 206:587-602. [PMID: 28592498 DOI: 10.1534/genetics.116.197780] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 01/07/2017] [Accepted: 03/13/2017] [Indexed: 11/18/2022] Open
Abstract
The nutritional environments that organisms experience are inherently variable, requiring tight coordination of how resources are allocated to different functions relative to the total amount of resources available. A growing body of evidence supports the hypothesis that key endocrine pathways play a fundamental role in this coordination. In particular, the insulin/insulin-like growth factor signaling (IIS) and target of rapamycin (TOR) pathways have been implicated in nutrition-dependent changes in metabolism and nutrient allocation. However, little is known about the genetic basis of standing variation in IIS/TOR or how diet-dependent changes in expression in this pathway influence phenotypes related to resource allocation. To characterize natural genetic variation in the IIS/TOR pathway, we used >250 recombinant inbred lines (RILs) derived from a multiparental mapping population, the Drosophila Synthetic Population Resource, to map transcript-level QTL of genes encoding 52 core IIS/TOR components in three different nutritional environments [dietary restriction (DR), control (C), and high sugar (HS)]. Nearly all genes, 87%, were significantly differentially expressed between diets, though not always in ways predicted by loss-of-function mutants. We identified cis (i.e., local) expression QTL (eQTL) for six genes, all of which are significant in multiple nutrient environments. Further, we identified trans (i.e., distant) eQTL for two genes, specific to a single nutrient environment. Our results are consistent with many small changes in the IIS/TOR pathways. A discriminant function analysis for the C and DR treatments identified a pattern of gene expression associated with the diet treatment. Mapping the composite discriminant function scores revealed a significant global eQTL within the DR diet. 
A correlation between the discriminant function scores and the median life span (r = 0.46) provides evidence that gene expression changes in response to diet are associated with longevity in these RILs.
|
20
|
Szulc P, Bogdan M, Frommlet F, Tang H. Joint genotype- and ancestry-based genome-wide association studies in admixed populations. Genet Epidemiol 2017; 41:555-566. [PMID: 28657151 DOI: 10.1002/gepi.22056] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 07/25/2016] [Revised: 04/01/2017] [Accepted: 04/25/2017] [Indexed: 12/21/2022]
Abstract
In genome-wide association studies (GWAS) genetic loci that influence complex traits are localized by inspecting associations between genotypes of genetic markers and the values of the trait of interest. On the other hand, admixture mapping, which is performed in case of populations consisting of a recent mix of two ancestral groups, relies on the ancestry information at each locus (locus-specific ancestry). Recently it has been proposed to jointly model genotype and locus-specific ancestry within the framework of single marker tests. Here, we extend this approach for population-based GWAS in the direction of multimarker models. A modified version of the Bayesian information criterion is developed for building a multilocus model that accounts for the differential correlation structure due to linkage disequilibrium (LD) and admixture LD. Simulation studies and a real data example illustrate the advantages of this new approach compared to single-marker analysis or modern model selection strategies based on separately analyzing genotype and ancestry data, as well as to single-marker analysis combining genotypic and ancestry information. Depending on the signal strength, our procedure automatically chooses whether genotypic or locus-specific ancestry markers are added to the model. This results in a good compromise between the power to detect causal mutations and the precision of their localization. The proposed method has been implemented in R and is available at http://www.math.uni.wroc.pl/~mbogdan/admixtures/.
Affiliation(s)
- Piotr Szulc
  - Faculty of Mathematics, Wroclaw University of Technology, Wroclaw, Poland
- Malgorzata Bogdan
  - Faculty of Mathematics and Computer Science, University of Wroclaw, Wroclaw, Poland
- Florian Frommlet
  - Department of Medical Statistics, CEMSIIS, Medical University of Vienna, Vienna, Austria
- Hua Tang
  - Departments of Genetics and Statistics, Stanford University, Stanford, California, United States of America
|
21
|
The Beavis Effect in Next-Generation Mapping Panels in Drosophila melanogaster. G3-GENES GENOMES GENETICS 2017; 7:1643-1652. [PMID: 28592647 PMCID: PMC5473746 DOI: 10.1534/g3.117.041426] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.6] [Indexed: 12/18/2022]
Abstract
A major goal in the analysis of complex traits is to partition the observed genetic variation in a trait into components due to individual loci and perhaps variants within those loci. However, in both QTL mapping and genetic association studies, the estimated percent variation attributable to a QTL is upwardly biased conditional on it being discovered. This bias was first described in two-way QTL mapping experiments by William Beavis, and has been referred to extensively as “the Beavis effect.” The Beavis effect is likely to occur in multiparent population (MPP) panels as well as collections of sequenced lines used for genome-wide association studies (GWAS). However, the strength of the Beavis effect is unknown—and often implicitly assumed to be negligible—when “hits” are obtained from an association panel consisting of hundreds of inbred lines tested across millions of SNPs, or in multiparent mapping populations where mapping involves fitting a complex statistical model with several d.f. at thousands of genetic intervals. To estimate the size of the effect in more complex panels, we performed simulations of both biallelic and multiallelic QTL in two major Drosophila melanogaster mapping panels, the GWAS-based Drosophila Genetic Reference Panel (DGRP), and the MPP the Drosophila Synthetic Population Resource (DSPR). Our results show that overestimation is determined most strongly by sample size and is only minimally impacted by the mapping design. When < 100, 200, 500, and 1000 lines are employed, the variance attributable to hits is inflated by factors of 6, 3, 1.5, and 1.1, respectively, for a QTL that truly contributes 5% to the variation in the trait. This overestimation indicates that QTL could be difficult to validate in follow-up replication experiments where additional individuals are examined. Further, QTL could be difficult to cross-validate between the two Drosophila resources. 
We provide guidelines for: (1) the sample sizes necessary to accurately estimate the percent variance attributable to an identified QTL, (2) the conditions under which one is likely to replicate a mapped QTL in a second study using the same mapping population, and (3) the conditions under which a QTL mapped in one mapping panel (DGRP or DSPR) is likely to replicate in the other.
|
22
|
Cheng D, Schwartzman A. Multiple testing of local maxima for detection of peaks in random fields. Ann Stat 2017; 45:529-556. [PMID: 31527989 PMCID: PMC6746560 DOI: 10.1214/16-aos1458] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Indexed: 11/19/2022]
Abstract
A topological multiple testing scheme is presented for detecting peaks in images under stationary ergodic Gaussian noise, where tests are performed at local maxima of the smoothed observed signals. The procedure generalizes the one-dimensional scheme of [31] to Euclidean domains of arbitrary dimension. Two methods are developed according to two different ways of computing p-values: (i) using the exact distribution of the height of local maxima, available explicitly when the noise field is isotropic [9, 10]; (ii) using an approximation to the overshoot distribution of local maxima above a pre-threshold, applicable when the exact distribution is unknown, such as when the stationary noise field is non-isotropic [9]. The algorithms, combined with the Benjamini-Hochberg procedure for thresholding p-values, provide asymptotic strong control of the False Discovery Rate (FDR) and power consistency, with specific rates, as the search space and signal strength get large. The optimal smoothing bandwidth and optimal pre-threshold are obtained to achieve maximum power. Simulations show that FDR levels are maintained in non-asymptotic conditions. The methods are illustrated in the analysis of functional magnetic resonance images of the brain.
Affiliation(s)
- Dan Cheng
  - Division of Biostatistics, University of California, San Diego
|
23
|
Brzyski D, Peterson CB, Sobczyk P, Candès EJ, Bogdan M, Sabatti C. Controlling the Rate of GWAS False Discoveries. Genetics 2017; 205:61-75. [PMID: 27784720 PMCID: PMC5223524 DOI: 10.1534/genetics.116.193987] [Citation(s) in RCA: 72] [Impact Index Per Article: 9.0] [Received: 07/18/2016] [Accepted: 10/11/2016] [Indexed: 01/13/2023] Open
Abstract
With the rise of both the number and the complexity of traits of interest, control of the false discovery rate (FDR) in genetic association studies has become an increasingly appealing and accepted target for multiple comparison adjustment. While a number of robust FDR-controlling strategies exist, the nature of this error rate is intimately tied to the precise way in which discoveries are counted, and the performance of FDR-controlling procedures is satisfactory only if there is a one-to-one correspondence between what scientists describe as unique discoveries and the number of rejected hypotheses. The presence of linkage disequilibrium between markers in genome-wide association studies (GWAS) often leads researchers to consider the signal associated to multiple neighboring SNPs as indicating the existence of a single genomic locus with possible influence on the phenotype. This a posteriori aggregation of rejected hypotheses results in inflation of the relevant FDR. We propose a novel approach to FDR control that is based on prescreening to identify the level of resolution of distinct hypotheses. We show how FDR-controlling strategies can be adapted to account for this initial selection both with theoretical results and simulations that mimic the dependence structure to be expected in GWAS. We demonstrate that our approach is versatile and useful when the data are analyzed using both tests based on single markers and multiple regression. We provide an R package that allows practitioners to apply our procedure on standard GWAS format data, and illustrate its performance on lipid traits in the North Finland Birth Cohort 66 cohort study.
Affiliation(s)
- Damian Brzyski
  - Institute of Mathematics, Jagiellonian University, 30-348 Kraków, Poland
  - Department of Epidemiology and Biostatistics, Indiana University, Bloomington, Indiana 47405
- Christine B Peterson
  - Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, Texas 77030
- Piotr Sobczyk
  - Faculty of Pure and Applied Mathematics, Wrocław University of Science and Technology, 50-370 Wroclaw, Poland
- Malgorzata Bogdan
  - Institute of Mathematics, University of Wrocław, 50-384 Wroclaw, Poland
- Chiara Sabatti
  - Department of Biomedical Data Science, Stanford University, California
|
24
|
Reiner-Benaim A. Scan Statistic Tail Probability Assessment Based on Process Covariance and Window Size. Methodol Comput Appl Probab 2016. [DOI: 10.1007/s11009-015-9447-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
|
25
|
Zhang NR, Yakir B, Xia LC, Siegmund D. Scan statistics on Poisson random fields with applications in genomics. Ann Appl Stat 2016. [DOI: 10.1214/15-aoas892] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Indexed: 12/20/2022]
|
26
|
Kang M. Optimal False Discovery Rate Control with Kernel Density Estimation in a Microarray Experiment. COMMUN STAT-SIMUL C 2016. [DOI: 10.1080/03610918.2013.875569] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
Affiliation(s)
- Moonsu Kang
  - Department of Information Statistics, Gangneung-Wonju National University, Gangneung-si, Republic of Korea
|
27
|
|
28
|
Doris SM, Smith DR, Beamesderfer JN, Raphael BJ, Nathanson JA, Gerbi SA. Universal and domain-specific sequences in 23S-28S ribosomal RNA identified by computational phylogenetics. RNA (NEW YORK, N.Y.) 2015; 21:1719-1730. [PMID: 26283689 PMCID: PMC4574749 DOI: 10.1261/rna.051144.115] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Received: 01/30/2015] [Accepted: 07/07/2015] [Indexed: 06/01/2023]
Abstract
Comparative analysis of ribosomal RNA (rRNA) sequences has elucidated phylogenetic relationships. However, this powerful approach has not been fully exploited to address ribosome function. Here we identify stretches of evolutionarily conserved sequences, which correspond with regions of high functional importance. For this, we developed a structurally aligned database, FLORA (full-length organismal rRNA alignment) to identify highly conserved nucleotide elements (CNEs) in 23S-28S rRNA from each phylogenetic domain (Eukarya, Bacteria, and Archaea). Universal CNEs (uCNEs) are conserved in sequence and structural position in all three domains. Those in regions known to be essential for translation validate our approach. Importantly, some uCNEs reside in areas of unknown function, thus identifying novel sequences of likely great importance. In contrast to uCNEs, domain-specific CNEs (dsCNEs) are conserved in just one phylogenetic domain. This is the first report of conserved sequence elements in rRNA that are domain-specific; they are largely a eukaryotic phenomenon. The locations of the eukaryotic dsCNEs within the structure of the ribosome suggest they may function in nascent polypeptide transit through the ribosome tunnel and in tRNA exit from the ribosome. Our findings provide insights and a resource for ribosome function studies.
Affiliation(s)
- Stephen M Doris
  - Department of Molecular Biology, Cell Biology and Biochemistry, Brown University Division of Biology and Medicine, Providence, Rhode Island 02912, USA
- Deborah R Smith
  - Department of Molecular Biology, Cell Biology and Biochemistry, Brown University Division of Biology and Medicine, Providence, Rhode Island 02912, USA
- Julia N Beamesderfer
  - Department of Molecular Biology, Cell Biology and Biochemistry, Brown University Division of Biology and Medicine, Providence, Rhode Island 02912, USA
- Benjamin J Raphael
  - Department of Computer Science and Center for Computational Molecular Biology, Brown University Division of Biology and Medicine, Providence, Rhode Island 02912, USA
- Judith A Nathanson
  - Department of Molecular Biology, Cell Biology and Biochemistry, Brown University Division of Biology and Medicine, Providence, Rhode Island 02912, USA
- Susan A Gerbi
  - Department of Molecular Biology, Cell Biology and Biochemistry, Brown University Division of Biology and Medicine, Providence, Rhode Island 02912, USA
|
29
|
|
30
|
Leveraging Multi-ethnic Evidence for Mapping Complex Traits in Minority Populations: An Empirical Bayes Approach. Am J Hum Genet 2015; 96:740-52. [PMID: 25892113 DOI: 10.1016/j.ajhg.2015.03.008] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 11/21/2014] [Accepted: 03/10/2015] [Indexed: 01/21/2023] Open
Abstract
Elucidating the genetic basis of complex traits and diseases in non-European populations is particularly challenging because US minority populations have been under-represented in genetic association studies. We developed an empirical Bayes approach named XPEB (cross-population empirical Bayes), designed to improve the power for mapping complex-trait-associated loci in a minority population by exploiting information from genome-wide association studies (GWASs) from another ethnic population. Taking as input summary statistics from two GWASs (a target GWAS from an ethnic minority population of primary interest and an auxiliary base GWAS, such as a larger GWAS in Europeans), our XPEB approach reprioritizes SNPs in the target population to compute local false-discovery rates. We demonstrated, through simulations, that whenever the base GWAS harbors relevant information, XPEB gains efficiency. Moreover, XPEB has the ability to discard irrelevant auxiliary information, providing a safeguard against inflated false-discovery rates due to genetic heterogeneity between populations. Applied to a blood-lipids study in African Americans, XPEB more than quadrupled the discoveries from the conventional approach, which used a target GWAS alone, bringing the number of significant loci from 14 to 65. Thus, XPEB offers a flexible framework for mapping complex traits in minority populations.
|
31
|
Affiliation(s)
- Klaus Frick
  - Interstate University of Applied Sciences of Technology, Buchs, Switzerland
- Axel Munk
  - University of Göttingen, Göttingen, Germany
  - Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
|
32
|
Berry CC, Ocwieja KE, Malani N, Bushman FD. Comparing DNA integration site clusters with scan statistics. Bioinformatics 2014; 30:1493-500. [PMID: 24489369 DOI: 10.1093/bioinformatics/btu035] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Indexed: 11/12/2022]
Abstract
MOTIVATION: Gene therapy with retroviral vectors can induce adverse effects when those vectors integrate in sensitive genomic regions. Vectors that target sensitive regions less frequently are preferred, motivating the search for localized clusters of integration sites and comparison of the clusters formed by integration of different vectors. Scan statistics allow the discovery of spatial differences in clustering and calculation of false discovery rates, providing statistical methods for comparing retroviral vectors. RESULTS: A scan statistic for comparing two vectors using multiple window widths is proposed, with software to detect clustering differentials and compute false discovery rates. Application to several sets of experimentally determined HIV integration sites demonstrates the software. Simulated datasets of various sizes and signal strengths are used to determine the power to discover clusters and evaluate a convenient lower bound. This provides a toolkit for planning evaluations of new gene therapy vectors. AVAILABILITY AND IMPLEMENTATION: The geneRxCluster R package, containing a simple tutorial and usage hints, is available from http://www.bioconductor.org.
Affiliation(s)
- Charles C Berry
  - Division of Biostatistics and BioInformatics, Department of Family and Preventive Medicine, University of California at San Diego, La Jolla, CA 92093-0901 and Department of Microbiology, Perelman School of Medicine at the University of Pennsylvania, 425 Johnson Pavilion, Philadelphia, PA 19104-6076, USA
- Karen E Ocwieja
  - Division of Biostatistics and BioInformatics, Department of Family and Preventive Medicine, University of California at San Diego, La Jolla, CA 92093-0901 and Department of Microbiology, Perelman School of Medicine at the University of Pennsylvania, 425 Johnson Pavilion, Philadelphia, PA 19104-6076, USA
- Nirav Malani
  - Division of Biostatistics and BioInformatics, Department of Family and Preventive Medicine, University of California at San Diego, La Jolla, CA 92093-0901 and Department of Microbiology, Perelman School of Medicine at the University of Pennsylvania, 425 Johnson Pavilion, Philadelphia, PA 19104-6076, USA
- Frederic D Bushman
  - Division of Biostatistics and BioInformatics, Department of Family and Preventive Medicine, University of California at San Diego, La Jolla, CA 92093-0901 and Department of Microbiology, Perelman School of Medicine at the University of Pennsylvania, 425 Johnson Pavilion, Philadelphia, PA 19104-6076, USA
|
33
|
|