1. Scott DAV, Benavente E, Libiseller-Egger J, Fedorov D, Phelan J, Ilina E, Tikhonova P, Kudryavstev A, Galeeva J, Clark T, Lewin A. Bayesian compositional regression with microbiome features via variational inference. BMC Bioinformatics 2023; 24:210. [PMID: 37217852 PMCID: PMC10201722 DOI: 10.1186/s12859-023-05219-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2022] [Accepted: 03/02/2023] [Indexed: 05/24/2023] Open
Abstract
The microbiome plays a key role in the health of the human body. Interest often lies in finding features of the microbiome, alongside other covariates, which are associated with a phenotype of interest. One important property of microbiome data, which is often overlooked, is its compositionality as it can only provide information about the relative abundance of its constituting components. Typically, these proportions vary by several orders of magnitude in datasets of high dimensions. To address these challenges we develop a Bayesian hierarchical linear log-contrast model which is estimated by mean field Monte-Carlo co-ordinate ascent variational inference (CAVI-MC) and easily scales to high dimensional data. We use novel priors which account for the large differences in scale and constrained parameter space associated with the compositional covariates. A reversible jump Monte Carlo Markov chain guided by the data through univariate approximations of the variational posterior probability of inclusion, with proposal parameters informed by approximating variational densities via auxiliary parameters, is used to estimate intractable marginal expectations. We demonstrate that our proposed Bayesian method performs favourably against existing frequentist state of the art compositional data analysis methods. We then apply the CAVI-MC to the analysis of real data exploring the relationship of the gut microbiome to body mass index.
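The compositional constraint mentioned in the abstract can be illustrated with a minimal sketch of a linear log-contrast predictor. This is not the authors' CAVI-MC implementation: the function names, the counts, and the coefficient values below are hypothetical and chosen only for illustration. The sketch shows the one property the abstract relies on: when the coefficients on the log-abundances sum to zero, the predictor depends only on relative abundances, so rescaling a sample leaves it unchanged.

```python
import math


def log_contrast_predictor(composition, beta, intercept=0.0):
    """Linear predictor intercept + sum_j beta_j * log(x_j) for one sample."""
    # The log-contrast model constrains the coefficients to sum to zero.
    if abs(sum(beta)) > 1e-9:
        raise ValueError("log-contrast coefficients must sum to zero")
    if any(x <= 0 for x in composition):
        raise ValueError("composition entries must be strictly positive")
    return intercept + sum(b * math.log(x) for b, x in zip(beta, composition))


def close(counts):
    """Rescale positive counts to proportions that sum to one."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical taxon counts for one sample and hypothetical coefficients.
counts = [120.0, 30.0, 5.0, 845.0]
beta = [0.8, -0.5, 0.0, -0.3]  # sums to zero

eta_counts = log_contrast_predictor(counts, beta)
eta_props = log_contrast_predictor(close(counts), beta)
# The two predictors agree up to floating-point error, because the
# log(total) term is multiplied by sum(beta) = 0.
```

Zero counts would need a pseudo-count or another replacement strategy before taking logs; the sketch simply rejects them.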
Affiliation(s)
- Darren A. V. Scott
  - Department of Medical Statistics, London School of Hygiene and Tropical Medicine, Keppel Street, London, United Kingdom
- Ernest Benavente
  - Laboratory of Experimental Cardiology, University Medical Center Utrecht, Utrecht University, Utrecht, Netherlands
- Julian Libiseller-Egger
  - Department of Medical Statistics, London School of Hygiene and Tropical Medicine, Keppel Street, London, United Kingdom
- Dmitry Fedorov
  - Federal Research and Clinical Center of Physical-Chemical Medicine, Moscow, Russia
- Jody Phelan
  - Department of Medical Statistics, London School of Hygiene and Tropical Medicine, Keppel Street, London, United Kingdom
- Elena Ilina
  - Federal Research and Clinical Center of Physical-Chemical Medicine, Moscow, Russia
- Polina Tikhonova
  - Federal Research and Clinical Center of Physical-Chemical Medicine, Moscow, Russia
  - Bioinformatics and Genomics Intercollege Graduate Program, Huck Institutes of Life Sciences, Pennsylvania State University, Pennsylvania, USA
- Julia Galeeva
  - Federal Research and Clinical Center of Physical-Chemical Medicine, Moscow, Russia
- Taane Clark
  - Department of Medical Statistics, London School of Hygiene and Tropical Medicine, Keppel Street, London, United Kingdom
- Alex Lewin
  - Department of Medical Statistics, London School of Hygiene and Tropical Medicine, Keppel Street, London, United Kingdom

2. Bottolo L, Banterle M, Richardson S, Ala-Korpela M, Järvelin MR, Lewin A. A computationally efficient Bayesian seemingly unrelated regressions model for high-dimensional quantitative trait loci discovery. J R Stat Soc Ser C Appl Stat 2021; 70:886-908. [PMID: 35001978 PMCID: PMC7612194 DOI: 10.1111/rssc.12490] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 01/02/2023]
Abstract
Our work is motivated by the search for metabolite quantitative trait loci (QTL) in a cohort of more than 5000 people. There are 158 metabolites measured by NMR spectroscopy in the 31-year follow-up of the Northern Finland Birth Cohort 1966 (NFBC66). These metabolites, as with many multivariate phenotypes produced by high-throughput biomarker technology, exhibit strong correlation structures. Existing approaches for combining such data with genetic variants for multivariate QTL analysis generally ignore phenotypic correlations or make restrictive assumptions about the associations between phenotypes and genetic loci. We present a computationally efficient Bayesian seemingly unrelated regressions model for high-dimensional data, with cell-sparse variable selection and sparse graphical structure for covariance selection. Cell sparsity allows different phenotype responses to be associated with different genetic predictors and the graphical structure is used to represent the conditional dependencies between phenotype variables. To achieve feasible computation of the large model space, we exploit a factorisation of the covariance matrix. Applying the model to the NFBC66 data with 9000 directly genotyped single nucleotide polymorphisms, we are able to simultaneously estimate genotype-phenotype associations and the residual dependence structure among the metabolites. The R package BayesSUR with full documentation is available at https://cran.r-project.org/web/packages/BayesSUR/.
Affiliation(s)
- Leonardo Bottolo
  - Department of Medical Genetics, University of Cambridge, Cambridge, UK
  - The Alan Turing Institute, London, UK
  - MRC Biostatistics Unit, Cambridge, UK
- Marco Banterle
  - Department of Medical Statistics, London School of Hygiene and Tropical Medicine, London, UK
- Sylvia Richardson
  - The Alan Turing Institute, London, UK
  - MRC Biostatistics Unit, Cambridge, UK
- Mika Ala-Korpela
  - Computational Medicine, Faculty of Medicine, University of Oulu and Biocenter Oulu, Oulu, Finland
  - NMR Metabolomics Laboratory, School of Pharmacy, University of Eastern Finland, Kuopio, Finland
- Marjo-Riitta Järvelin
  - Center for Life Course Health Research, University of Oulu, Oulu, Finland
  - Biocenter Oulu, University of Oulu, Oulu, Finland
  - Department of Epidemiology and Biostatistics, Imperial College London, London, UK
  - MRC-PHE Centre for Environment and Health, Imperial College London, London, UK
  - Department of Life Sciences, Brunel University London, Uxbridge, UK
- Alex Lewin
  - Department of Medical Statistics, London School of Hygiene and Tropical Medicine, London, UK

3. Ruffieux H, Fairfax BP, Nassiri I, Vigorito E, Wallace C, Richardson S, Bottolo L. EPISPOT: An epigenome-driven approach for detecting and interpreting hotspots in molecular QTL studies. Am J Hum Genet 2021; 108:983-1000. [PMID: 33909991 PMCID: PMC8206410 DOI: 10.1016/j.ajhg.2021.04.010] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 11/13/2020] [Accepted: 04/08/2021] [Indexed: 12/27/2022] Open
Abstract
We present EPISPOT, a fully joint framework which exploits large panels of epigenetic annotations as variant-level information to enhance molecular quantitative trait locus (QTL) mapping. Thanks to a purpose-built Bayesian inferential algorithm, EPISPOT accommodates functional information for both cis and trans actions, including QTL hotspot effects. It effectively couples simultaneous QTL analysis of thousands of genetic variants and molecular traits with hypothesis-free selection of biologically interpretable annotations which directly contribute to the QTL effects. This unified, epigenome-aided learning boosts statistical power and sheds light on the regulatory basis of the uncovered hits; EPISPOT therefore marks an essential step toward improving the challenging detection and functional interpretation of trans-acting genetic variants and hotspots. We illustrate the advantages of EPISPOT in simulations emulating real-data conditions and in a monocyte expression QTL study, which confirms known hotspots and finds other signals, as well as plausible mechanisms of action. In particular, by highlighting the role of monocyte DNase-I sensitivity sites from >150 epigenetic annotations, we clarify the mediation effects and cell-type specificity of major hotspots close to the lysozyme gene. Our approach forgoes the daunting and underpowered task of one-annotation-at-a-time enrichment analyses for prioritizing cis and trans QTL hits and is tailored to any transcriptomic, proteomic, or metabolomic QTL problem. By enabling principled epigenome-driven QTL mapping transcriptome-wide, EPISPOT helps progress toward a better functional understanding of genetic regulation.
Affiliation(s)
- Hélène Ruffieux
  - MRC Biostatistics Unit, University of Cambridge, Cambridge CB2 0SR, UK
- Benjamin P Fairfax
  - Department of Oncology, MRC Weatherall Institute for Molecular Medicine, University of Oxford, John Radcliffe Hospital, Oxford OX3 9DS, UK
- Isar Nassiri
  - Department of Oncology, MRC Weatherall Institute for Molecular Medicine, University of Oxford, John Radcliffe Hospital, Oxford OX3 9DS, UK
- Elena Vigorito
  - MRC Biostatistics Unit, University of Cambridge, Cambridge CB2 0SR, UK
- Chris Wallace
  - MRC Biostatistics Unit, University of Cambridge, Cambridge CB2 0SR, UK; Cambridge Institute of Therapeutic Immunology and Infectious Disease, Jeffrey Cheah Biomedical Centre, Cambridge Biomedical Campus, University of Cambridge, Cambridge CB2 0AW, UK
- Sylvia Richardson
  - MRC Biostatistics Unit, University of Cambridge, Cambridge CB2 0SR, UK; The Alan Turing Institute, London NW1 2DB, UK
- Leonardo Bottolo
  - MRC Biostatistics Unit, University of Cambridge, Cambridge CB2 0SR, UK; The Alan Turing Institute, London NW1 2DB, UK; Department of Medical Genetics, University of Cambridge, Cambridge CB2 0QQ, UK

4. Siebert JC, Stanislawski MA, Zaman A, Ostendorf DM, Konigsberg IR, Jambal P, Ir D, Bing K, Wayland L, Scorsone JJ, Lozupone CA, Görg C, Frank DN, Bessesen D, MacLean PS, Melanson EL, Catenacci VA, Borengasser SJ. Multiomic Predictors of Short-Term Weight Loss and Clinical Outcomes During a Behavioral-Based Weight Loss Intervention. Obesity (Silver Spring) 2021; 29:859-869. [PMID: 33811477 PMCID: PMC8085074 DOI: 10.1002/oby.23127] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 09/21/2020] [Revised: 12/15/2020] [Accepted: 01/08/2021] [Indexed: 12/24/2022]
Abstract
OBJECTIVE: Identifying predictors of weight loss and clinical outcomes may increase understanding of individual variability in weight loss response. We hypothesized that baseline multiomic features, including DNA methylation (DNAme), metabolomics, and gut microbiome, would be predictive of short-term changes in body weight and other clinical outcomes within a comprehensive weight loss intervention. METHODS: Healthy adults with overweight or obesity (n = 62, age 18-55 years, BMI 27-45 kg/m², 75.8% female) participated in a 1-year behavioral weight loss intervention. To identify baseline omic predictors of changes in clinical outcomes at 3 and 6 months, whole-blood DNAme, plasma metabolites, and gut microbial genera were analyzed. RESULTS: A network of multiomic relationships informed predictive models for 10 clinical outcomes (body weight, waist circumference, fat mass, hemoglobin A1c, homeostatic model assessment of insulin resistance, total cholesterol, triglycerides, C-reactive protein, leptin, and ghrelin) that changed significantly (P < 0.05). For eight of these, adjusted R² ranged from 0.34 to 0.78. Our models identified specific DNAme sites, gut microbes, and metabolites that were predictive of variability in weight loss, waist circumference, and circulating triglycerides and that are biologically relevant to obesity and metabolic pathways. CONCLUSIONS: These data support the feasibility of using baseline multiomic features to provide insight for precision nutrition-based weight loss interventions.
Affiliation(s)
- Janet C. Siebert
  - Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Adnin Zaman
  - Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Danielle M. Ostendorf
  - Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Iain R. Konigsberg
  - Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Purevsuren Jambal
  - Department of Pediatrics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Diana Ir
  - Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Kristen Bing
  - Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Liza Wayland
  - Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Jared J. Scorsone
  - Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Catherine A. Lozupone
  - Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Carsten Görg
  - Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, USA
- Daniel N. Frank
  - Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Daniel Bessesen
  - Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Paul S. MacLean
  - Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Edward L. Melanson
  - Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
  - Division of Geriatric Medicine, Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
  - Eastern Colorado Veterans Affairs Geriatric Research, Education, and Clinical Center, Denver, CO, USA
- Victoria A. Catenacci
  - Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Sarah J. Borengasser
  - Department of Pediatrics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA

5. Proteome-wide Systems Genetics to Identify Functional Regulators of Complex Traits. Cell Syst 2021; 12:5-22. [PMID: 33476553 DOI: 10.1016/j.cels.2020.10.005] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 07/22/2020] [Revised: 09/15/2020] [Accepted: 10/07/2020] [Indexed: 02/08/2023]
Abstract
Proteomic technologies now enable the rapid quantification of thousands of proteins across genetically diverse samples. Integration of these data with systems-genetics analyses is a powerful approach to identify new regulators of economically important or disease-relevant phenotypes in various populations. In this review, we summarize the latest proteomic technologies and discuss technical challenges for their use in population studies. We demonstrate how the analysis of correlation structure and loci mapping can be used to identify genetic factors regulating functional protein networks and complex traits. Finally, we provide an extensive summary of the use of proteome-wide systems genetics throughout fungi, plant, and animal kingdoms and discuss the power of this approach to identify candidate regulators and drug targets in large human consortium studies.

6. Chakraborty A, Bhattacharya A, Mallick BK. Bayesian sparse multiple regression for simultaneous rank reduction and variable selection. Biometrika 2020; 107:205-221. [PMID: 33100350 DOI: 10.1093/biomet/asz056] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/12/2022] Open
Abstract
We develop a Bayesian methodology aimed at simultaneously estimating low-rank and row-sparse matrices in a high-dimensional multiple-response linear regression model. We consider a carefully devised shrinkage prior on the matrix of regression coefficients which obviates the need to specify a prior on the rank, and shrinks the regression matrix towards low-rank and row-sparse structures. We provide theoretical support to the proposed methodology by proving minimax optimality of the posterior mean under the prediction risk in ultra-high dimensional settings where the number of predictors can grow sub-exponentially relative to the sample size. A one-step post-processing scheme induced by group lasso penalties on the rows of the estimated coefficient matrix is proposed for variable selection, with default choices of tuning parameters. We additionally provide an estimate of the rank using a novel optimization function achieving dimension reduction in the covariate space. We exhibit the performance of the proposed methodology in an extensive simulation study and a real data example.
Affiliation(s)
- Antik Chakraborty
  - Department of Statistics, Texas A&M University, College Station, Texas, 77843, USA
- Anirban Bhattacharya
  - Department of Statistics, Texas A&M University, College Station, Texas, 77843, USA
- Bani K Mallick
  - Department of Statistics, Texas A&M University, College Station, Texas, 77843, USA

7. Genetics meets proteomics: perspectives for large population-based studies. Nat Rev Genet 2020; 22:19-37. [PMID: 32860016 DOI: 10.1038/s41576-020-0268-2] [Citation(s) in RCA: 226] [Impact Index Per Article: 45.2] [Accepted: 07/14/2020] [Indexed: 12/22/2022]
Abstract
Proteomic analysis of cells, tissues and body fluids has generated valuable insights into the complex processes influencing human biology. Proteins represent intermediate phenotypes for disease and provide insight into how genetic and non-genetic risk factors are mechanistically linked to clinical outcomes. Associations between protein levels and DNA sequence variants that colocalize with risk alleles for common diseases can expose disease-associated pathways, revealing novel drug targets and translational biomarkers. However, genome-wide, population-scale analyses of proteomic data are only now emerging. Here, we review current findings from studies of the plasma proteome and discuss their potential for advancing biomedical translation through the interpretation of genome-wide association analyses. We highlight the challenges faced by currently available technologies and provide perspectives relevant to their future application in large-scale biobank studies.

8. Loika Y, Irincheeva I, Culminskaya I, Nazarian A, Kulminski AM. Polygenic risk scores: pleiotropy and the effect of environment. GeroScience 2020; 42:1635-1647. [PMID: 32488673 DOI: 10.1007/s11357-020-00203-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2020] [Accepted: 05/08/2020] [Indexed: 10/24/2022] Open
Abstract
Polygenic risk scores (PRSs) discriminate trait risks better than single genetic markers because they aggregate the effects of risk alleles from multiple genetic loci. Constructing pleiotropic PRSs and understanding the heterogeneity and replication of PRS-trait associations can strengthen their applications. By using variational Bayesian multivariate high-dimensional regression, we constructed pleiotropic PRSs jointly associated with body mass index, systolic and diastolic blood pressure, and total and high-density lipoprotein cholesterol in a sample of 18,108 Caucasians from three independent cohorts. We found that dissecting heterogeneity associated with birth year, which is a proxy of exogenous exposures, improved the replication of significant PRS-trait associations from 37.5% (6 of 16) in the entire sample to 90% (18 of 20) in the more homogeneous sample of individuals born before the year 1925. Our findings suggest that secular changes in exogenous exposures may substantially modify pleiotropic risk profiles, affecting translation of genetic discoveries into health care.
Affiliation(s)
- Yury Loika
  - Biodemography of Aging Research Unit, Social Science Research Institute, Duke University, Durham, NC, 27708-0408, USA
- Irina Irincheeva
  - Biodemography of Aging Research Unit, Social Science Research Institute, Duke University, Durham, NC, 27708-0408, USA
- Irina Culminskaya
  - Biodemography of Aging Research Unit, Social Science Research Institute, Duke University, Durham, NC, 27708-0408, USA
- Alireza Nazarian
  - Biodemography of Aging Research Unit, Social Science Research Institute, Duke University, Durham, NC, 27708-0408, USA
- Alexander M Kulminski
  - Biodemography of Aging Research Unit, Social Science Research Institute, Duke University, Durham, NC, 27708-0408, USA

9. A fully joint Bayesian quantitative trait locus mapping of human protein abundance in plasma. PLoS Comput Biol 2020; 16:e1007882. [PMID: 32492067 PMCID: PMC7295243 DOI: 10.1371/journal.pcbi.1007882] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 08/27/2019] [Revised: 06/15/2020] [Accepted: 04/16/2020] [Indexed: 11/19/2022] Open
Abstract
Molecular quantitative trait locus (QTL) analyses are increasingly popular to explore the genetic architecture of complex traits, but existing studies do not leverage shared regulatory patterns and suffer from a large multiplicity burden, which hampers the detection of weak signals such as trans associations. Here, we present a fully multivariate proteomic QTL (pQTL) analysis performed with our recently proposed Bayesian method LOCUS on data from two clinical cohorts, with plasma protein levels quantified by mass-spectrometry and aptamer-based assays. Our two-stage study identifies 136 pQTL associations in the first cohort, of which >80% replicate in the second independent cohort and have significant enrichment with functional genomic elements and disease risk loci. Moreover, 78% of the pQTLs whose protein abundance was quantified by both proteomic techniques are confirmed across assays. Our thorough comparisons with standard univariate QTL mapping on (1) these data and (2) synthetic data emulating the real data show how LOCUS borrows strength across correlated protein levels and markers on a genome-wide scale to effectively increase statistical power. Notably, 15% of the pQTLs uncovered by LOCUS would be missed by the univariate approach, including several trans and pleiotropic hits with successful independent validation. Finally, the analysis of extensive clinical data from the two cohorts indicates that the genetically-driven proteins identified by LOCUS are enriched in associations with low-grade inflammation, insulin resistance and dyslipidemia and might therefore act as endophenotypes for metabolic diseases. While considerations on the clinical role of the pQTLs are beyond the scope of our work, these findings generate useful hypotheses to be explored in future research; all results are accessible online from our searchable database. 
Thanks to its efficient variational Bayes implementation, LOCUS can analyze jointly thousands of traits and millions of markers. Its applicability goes beyond pQTL studies, opening new perspectives for large-scale genome-wide association and QTL analyses. Diet, Obesity and Genes (DiOGenes) trial registration number: NCT00390637.
Exploring the functional mechanisms between the genotype and disease endpoints in view of identifying innovative therapeutic targets has prompted molecular quantitative trait locus studies, which assess how genetic variants (single nucleotide polymorphisms, SNPs) affect intermediate gene (eQTL), protein (pQTL) or metabolite (mQTL) levels. However, conventional univariate screening approaches do not account for local dependencies and association structures shared by multiple molecular levels and markers. Conversely, the current joint modelling approaches are restricted to small datasets by computational constraints. We illustrate and exploit the advantages of our recently introduced Bayesian framework LOCUS in a fully multivariate pQTL study, with ≈300K tag SNPs (capturing information from 4M markers) and 100-1,000 plasma protein levels measured by two distinct technologies. LOCUS identifies novel pQTLs that replicate in an independent cohort, confirms signals documented in studies 2-18 times larger, and detects more pQTLs than a conventional two-stage univariate analysis of our datasets. Moreover, some of these pQTLs might be of biomedical relevance and would therefore deserve dedicated investigation. Our extensive numerical experiments on these data and on simulated data demonstrate that the increased statistical power of LOCUS over standard approaches is largely attributable to its ability to exploit shared information across outcomes while efficiently accounting for the genetic correlation structures at a genome-wide level.

10. Ruffieux H, Davison AC, Hager J, Inshaw J, Fairfax BP, Richardson S, Bottolo L. A Global-Local Approach for Detecting Hotspots in Multiple-Response Regression. Ann Appl Stat 2020; 14:905-928. [PMID: 34992707 PMCID: PMC7612176 DOI: 10.1214/20-aoas1332] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 11/19/2022]
Abstract
We tackle modelling and inference for variable selection in regression problems with many predictors and many responses. We focus on detecting hotspots, that is, predictors associated with several responses. Such a task is critical in statistical genetics, as hotspot genetic variants shape the architecture of the genome by controlling the expression of many genes and may initiate decisive functional mechanisms underlying disease endpoints. Existing hierarchical regression approaches designed to model hotspots suffer from two limitations: their discrimination of hotspots is sensitive to the choice of top-level scale parameters for the propensity of predictors to be hotspots, and they do not scale to large predictor and response vectors, for example, of dimensions 10³-10⁵ in genetic applications. We address these shortcomings by introducing a flexible hierarchical regression framework that is tailored to the detection of hotspots and scalable to the above dimensions. Our proposal implements a fully Bayesian model for hotspots based on the horseshoe shrinkage prior. Its global-local formulation shrinks noise globally and, hence, accommodates the highly sparse nature of genetic analyses while being robust to individual signals, thus leaving the effects of hotspots unshrunk. Inference is carried out using a fast variational algorithm coupled with a novel simulated annealing procedure that allows efficient exploration of multimodal distributions.
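The global-local idea behind the horseshoe prior can be sketched by simulation. The code below is a generic illustration on a plain vector of effects, not the authors' hierarchical hotspot model or their variational algorithm; the function names, the seed, and the choice `tau = 1.0` are assumptions made for the example. Each effect gets its own half-Cauchy local scale, a single global scale `tau` is shared by all effects, and the shrinkage weight `kappa` (computed here under the simplifying assumption of unit noise variance) indicates how strongly an effect would be pulled toward zero.

```python
import math
import random


def half_cauchy(rng, scale=1.0):
    """Draw from a half-Cauchy(0, scale) by inverting the Cauchy CDF."""
    return abs(scale * math.tan(math.pi * (rng.random() - 0.5)))


def horseshoe_draws(n, tau, rng):
    """Draw n pairs (beta_j, kappa_j) with beta_j ~ N(0, tau^2 * lambda_j^2)
    and local scales lambda_j ~ half-Cauchy(0, 1)."""
    draws = []
    for _ in range(n):
        lam = half_cauchy(rng)                  # local scale, one per effect
        beta = rng.gauss(0.0, tau * lam)        # tau is the shared global scale
        kappa = 1.0 / (1.0 + (tau * lam) ** 2)  # shrinkage weight in [0, 1]
        draws.append((beta, kappa))
    return draws


rng = random.Random(0)  # fixed seed so the sketch is reproducible
draws = horseshoe_draws(5000, tau=1.0, rng=rng)
kappas = [k for _, k in draws]

# Count weights near the ends of [0, 1] versus the middle of the interval.
extreme = sum(k < 0.1 or k > 0.9 for k in kappas)
middle = sum(0.45 < k < 0.55 for k in kappas)
```

With `tau = 1` the shrinkage weights pile up near 0 (effect left almost unshrunk) and near 1 (effect shrunk almost to zero), which is the behaviour the abstract describes as shrinking noise globally while leaving strong individual signals untouched.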
Affiliation(s)
- Jamie Inshaw
  - Wellcome Centre for Human Genetics, Oxford, University of Oxford
- Benjamin P. Fairfax
  - Department of Oncology, MRC Weatherall Institute for Molecular Medicine, University of Oxford
- Sylvia Richardson
  - MRC Biostatistics Unit, University of Cambridge
  - Alan Turing Institute
- Leonardo Bottolo
  - MRC Biostatistics Unit, University of Cambridge
  - Alan Turing Institute
  - Department of Medical Genetics, University of Cambridge

11. Li X, Wu D, Cui Y, Liu B, Walter H, Schumann G, Li C, Jiang T. Reliable heritability estimation using sparse regularization in ultrahigh dimensional genome-wide association studies. BMC Bioinformatics 2019; 20:219. [PMID: 31039742 PMCID: PMC6492418 DOI: 10.1186/s12859-019-2792-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 11/22/2018] [Accepted: 04/02/2019] [Indexed: 12/28/2022] Open
Abstract
BACKGROUND: Data from genome-wide association studies (GWASs) have been used to estimate the heritability of human complex traits in recent years. Existing methods are based on the linear mixed model, with the assumption that the genetic effects are random variables, which is opposite to the fixed effect assumption embedded in the framework of quantitative genetics theory. Moreover, heritability estimators provided by existing methods may have large standard errors, which calls for the development of reliable and accurate methods to estimate heritability. RESULTS: In this paper, we first investigate the influence of the fixed and random effect assumptions on heritability estimation, and prove theoretically that these two assumptions are equivalent under mild conditions. Second, we propose a two-stage strategy that first performs sparse regularization via cross-validated elastic net and then applies variance estimation methods to construct reliable heritability estimates. Results on both simulated data and real data show that our strategy achieves a considerable reduction in the standard error while preserving accuracy. CONCLUSIONS: The proposed strategy allows for reliable and accurate heritability estimation using GWAS data. It shows that reliable estimates can still be obtained even with a relatively restricted sample size, and it should be especially useful for large-scale heritability analyses in the genomics era.
Affiliation(s)
- Xin Li
  - School of Mathematical Sciences, Zhejiang University, 38 Zheda Road, Hangzhou, 310027 China
- Dongya Wu
  - Brainnetome Center, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
  - National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
  - University of Chinese Academy of Sciences, 19 Yuquan Road, Beijing, 100049 China
- Yue Cui
  - Brainnetome Center, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
  - National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
- Bing Liu
  - Brainnetome Center, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
  - National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
- Henrik Walter
  - Department of Psychiatry and Psychotherapy, Campus Charité Mitte, Charité, Universitätsmedizin Berlin, Berlin, Germany
- Gunter Schumann
  - Centre for Population Neuroscience and Stratified Medicine (PONS) and MRC-SGDP Centre, Institute of Psychiatry, Psychology & Neuroscience, King’s College London, London, United Kingdom
- Chong Li
  - School of Mathematical Sciences, Zhejiang University, 38 Zheda Road, Hangzhou, 310027 China
- Tianzi Jiang
  - Brainnetome Center, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
  - National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
  - CAS Center for Excellence in Brain Science and Intelligence Technology, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
  - The Clinical Hospital of Chengdu Brain Science Institute, MOE Key Lab for Neuroinformation, University of Electronic Science and Technology of China, 4 Section 2 North Jianshe Road, Chengdu, 610054 China
  - The Queensland Brain Institute, University of Queensland, Brisbane, QLD 4072 Australia
  - University of Chinese Academy of Sciences, 19 Yuquan Road, Beijing, 100049 China

12. Xia Y, Cai TT, Li H. Joint testing and false discovery rate control in high-dimensional multivariate regression. Biometrika 2019; 105:249-269. [PMID: 30799872 DOI: 10.1093/biomet/asx085] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 05/02/2017] [Indexed: 01/15/2023] Open
Abstract
Multivariate regression with high-dimensional covariates has many applications in genomic and genetic research, in which some covariates are expected to be associated with multiple responses. This paper considers joint testing for regression coefficients over multiple responses and develops simultaneous testing methods with false discovery rate control. The test statistic is based on inverse regression and bias-corrected group lasso estimates of the regression coefficients and is shown to have an asymptotic chi-squared null distribution. A row-wise multiple testing procedure is developed to identify the covariates associated with the responses. The procedure is shown to control the false discovery proportion and false discovery rate at a prespecified level asymptotically. Simulations demonstrate the gain in power, relative to entrywise testing, in detecting the covariates associated with the responses. The test is applied to an ovarian cancer dataset to identify the microRNA regulators that regulate protein expression.
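To make the notion of controlling the false discovery rate at a prespecified level concrete, the sketch below applies the standard Benjamini-Hochberg step-up rule to one p-value per covariate (row). This is a stand-in for illustration only: it is not the row-wise procedure developed in the paper, whose test statistic and thresholding are not given in the abstract, and the p-values and the function name are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return sorted indices of hypotheses rejected by the BH step-up rule."""
    m = len(p_values)
    # Indices ordered from the smallest to the largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value falls under its threshold.
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank
    # Step-up: reject every hypothesis ranked at or below the cutoff.
    return sorted(order[:cutoff])


# Hypothetical p-values, one per covariate (row of the coefficient matrix).
row_p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
                0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(row_p_values, alpha=0.05)
```

Testing one hypothesis per row, rather than one per coefficient, is what the abstract contrasts with entrywise testing; the sketch only shows the multiplicity correction that would sit on top of such row-level p-values.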
Affiliation(s)
- Yin Xia
  - Department of Statistics, School of Management, Fudan University, Shanghai, China
- T Tony Cai
  - Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania, U.S.A
- Hongzhe Li
  - Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, U.S.A