1. Kontou PI, Bagos PG. The goldmine of GWAS summary statistics: a systematic review of methods and tools. BioData Min 2024; 17:31. [PMID: 39238044 PMCID: PMC11375927 DOI: 10.1186/s13040-024-00385-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2024] [Accepted: 08/27/2024] [Indexed: 09/07/2024]
Abstract
Genome-wide association studies (GWAS) have revolutionized our understanding of the genetic architecture of complex traits and diseases. GWAS summary statistics have become essential tools for various genetic analyses, including meta-analysis, fine-mapping, and risk prediction. However, the increasing number of GWAS summary statistics and the diversity of software tools available for their analysis can make it challenging for researchers to select the most appropriate tools for their specific needs. This systematic review aims to provide a comprehensive overview of the currently available software tools and databases for GWAS summary statistics analysis. We conducted a comprehensive literature search to identify relevant software tools and databases. We categorized the tools and databases by their functionality, including data management, quality control, single-trait analysis, and multiple-trait analysis. We also compared the tools and databases based on their features, limitations, and user-friendliness. Our review identified a total of 305 functioning software tools and databases dedicated to GWAS summary statistics, each with unique strengths and limitations. We provide descriptions of the key features of each tool and database, including their input/output formats, data types, and computational requirements. We also discuss the overall usability and applicability of each tool for different research scenarios. This comprehensive review will serve as a valuable resource for researchers who are interested in using GWAS summary statistics to investigate the genetic basis of complex traits and diseases. By providing a detailed overview of the available tools and databases, we aim to facilitate informed tool selection and maximize the effectiveness of GWAS summary statistics analysis.
Affiliation(s)
- Pantelis G Bagos
- Department of Computer Science and Biomedical Informatics, University of Thessaly, 35131, Lamia, Greece.

2. Willis TW, Wallace C. Accurate detection of shared genetic architecture from GWAS summary statistics in the small-sample context. PLoS Genet 2023; 19:e1010852. [PMID: 37585442 PMCID: PMC10461826 DOI: 10.1371/journal.pgen.1010852] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2022] [Revised: 08/28/2023] [Accepted: 06/30/2023] [Indexed: 08/18/2023]
Abstract
Assessment of the genetic similarity between two phenotypes can provide insight into a common genetic aetiology and inform the use of pleiotropy-informed, cross-phenotype analytical methods to identify novel genetic associations. The genetic correlation is a well-known means of quantifying and testing for genetic similarity between traits, but its estimates are subject to comparatively large sampling error. This makes it unsuitable for use in a small-sample context. We discuss the use of a previously published nonparametric test of genetic similarity for application to GWAS summary statistics. We establish that the null distribution of the test statistic is modelled better by an extreme value distribution than a transformation of the standard exponential distribution. We show with simulation studies and real data from GWAS of 18 phenotypes from the UK Biobank that the test is to be preferred for use with small sample sizes, particularly when genetic effects are few and large, outperforming the genetic correlation and another nonparametric statistical test of independence. We find the test suitable for the detection of genetic similarity in the rare disease context.
Affiliation(s)
- Thomas W. Willis
- MRC Biostatistics Unit, University of Cambridge, Cambridge, United Kingdom
- Chris Wallace
- MRC Biostatistics Unit, University of Cambridge, Cambridge, United Kingdom
- Cambridge Institute of Therapeutic Immunology and Infectious Disease, University of Cambridge, Cambridge, United Kingdom
- Department of Medicine, University of Cambridge, Cambridge, United Kingdom

3. Deng Q, Gupta A, Jeon H, Nam JH, Yilmaz AS, Chang W, Pietrzak M, Li L, Kim HJ, Chung D. graph-GPA 2.0: improving multi-disease genetic analysis with integration of functional annotation data. Front Genet 2023; 14:1079198. [PMID: 37501720 PMCID: PMC10370274 DOI: 10.3389/fgene.2023.1079198] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2022] [Accepted: 06/21/2023] [Indexed: 07/29/2023]
Abstract
Genome-wide association studies (GWAS) have successfully identified a large number of genetic variants associated with traits and diseases. However, it still remains challenging to fully understand the functional mechanisms underlying many associated variants. This is especially the case when we are interested in variants shared across multiple phenotypes. To address this challenge, we propose graph-GPA 2.0 (GGPA 2.0), a statistical framework to integrate GWAS datasets for multiple phenotypes and incorporate functional annotations within a unified framework. Our simulation studies showed that incorporating functional annotation data using GGPA 2.0 not only improves the detection of disease-associated variants, but also provides a more accurate estimation of relationships among diseases. Next, we analyzed five autoimmune diseases and five psychiatric disorders with the functional annotations derived from GenoSkyline and GenoSkyline-Plus, along with the prior disease graph generated by biomedical literature mining. For autoimmune diseases, GGPA 2.0 identified enrichment for blood-related epigenetic marks, especially B cells and regulatory T cells, across multiple diseases. Psychiatric disorders were enriched for brain-related epigenetic marks, especially the prefrontal cortex and the inferior temporal lobe for bipolar disorder and schizophrenia, respectively. In addition, the pleiotropy between bipolar disorder and schizophrenia was also detected. Finally, we found that GGPA 2.0 is robust to the use of irrelevant and/or incorrect functional annotations. These results demonstrate that GGPA 2.0 can be a powerful tool to identify genetic variants associated with each phenotype or those shared across multiple phenotypes, while also promoting an understanding of functional mechanisms underlying the associated variants.
Affiliation(s)
- Qiaolan Deng
- The Interdisciplinary PhD Program in Biostatistics, The Ohio State University, Columbus, OH, United States
- Arkobrato Gupta
- The Interdisciplinary PhD Program in Biostatistics, The Ohio State University, Columbus, OH, United States
- Hyeongseon Jeon
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, United States
- Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH, United States
- Jin Hyun Nam
- Division of Big Data Science, Korea University Sejong Campus, Sejong, Republic of Korea
- Ayse Selen Yilmaz
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, United States
- Won Chang
- Division of Statistics and Data Science, University of Cincinnati, Cincinnati, OH, United States
- Maciej Pietrzak
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, United States
- Lang Li
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, United States
- Hang J. Kim
- Division of Statistics and Data Science, University of Cincinnati, Cincinnati, OH, United States
- Dongjun Chung
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, United States
- Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH, United States

4. Sutton M, Sugier PE, Truong T, Liquet B. Leveraging pleiotropic association using sparse group variable selection in genomics data. BMC Med Res Methodol 2022; 22:9. [PMID: 34996381 PMCID: PMC8742466 DOI: 10.1186/s12874-021-01491-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2021] [Accepted: 12/03/2021] [Indexed: 12/04/2022]
Abstract
Background: Genome-wide association studies (GWAS) have identified genetic variants associated with multiple complex diseases. We can leverage this phenomenon, known as pleiotropy, to integrate multiple data sources in a joint analysis. Often, integrating additional information such as gene pathway knowledge can improve statistical efficiency and biological interpretation. In this article, we propose statistical methods which incorporate both gene pathway and pleiotropy knowledge to increase statistical power and identify important risk variants affecting multiple traits. Methods: We propose novel feature selection methods for group variable selection in the multi-task regression problem. We develop penalised likelihood methods exploiting different penalties to induce structured sparsity at a gene (or pathway) and SNP level across all studies. We implement an alternating direction method of multipliers (ADMM) algorithm for our penalised regression methods. The performance of our approaches is compared to a subset-based meta-analysis approach on simulated data sets. A bootstrap sampling strategy is provided to explore the stability of the penalised methods. Results: Our methods are applied to identify potential pleiotropy in an application considering the joint analysis of thyroid and breast cancers. The methods were able to detect eleven potential pleiotropic SNPs and six pathways. A simulation study found that our method was able to detect more true signals than a popular competing method while retaining a similar false discovery rate. Conclusion: We developed feature selection methods for jointly analysing multiple logistic regression tasks where prior grouping knowledge is available. Our method performed well both in simulation studies and when applied to a real data analysis of multiple cancers.
Affiliation(s)
- Matthew Sutton
- Queensland University of Technology Centre for Data Science, Brisbane, Australia.
- Pierre-Emmanuel Sugier
- Laboratoire De Mathématiques et de leurs Applications de PAU E2S UPPA, CNRS, Pau, France
- University Paris-Saclay, UVSQ, Inserm, Gustave Roussy, CESP, Team "Exposome and Heredity", Villejuif, France
- Therese Truong
- University Paris-Saclay, UVSQ, Inserm, Gustave Roussy, CESP, Team "Exposome and Heredity", Villejuif, France
- Benoit Liquet
- Laboratoire De Mathématiques et de leurs Applications de PAU E2S UPPA, CNRS, Pau, France
- Department of Mathematics and Statistics, Macquarie University, Sydney, Australia

5. Wang T, Lu H, Zeng P. Identifying pleiotropic genes for complex phenotypes with summary statistics from a perspective of composite null hypothesis testing. Brief Bioinform 2021; 23:6375058. [PMID: 34571531 DOI: 10.1093/bib/bbab389] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 07/05/2021] [Revised: 08/06/2021] [Accepted: 08/28/2021] [Indexed: 12/13/2022]
Abstract
Pleiotropy has important implications for the genetic connections among complex phenotypes and facilitates our understanding of disease etiology. Genome-wide association studies provide an unprecedented opportunity to detect pleiotropic associations; however, efficient pleiotropy test methods are still lacking. We here consider pleiotropy identification from a methodological perspective of high-dimensional composite null hypothesis testing and propose a powerful gene-based method called MAIUP. MAIUP is constructed based on the traditional intersection-union test with two sets of independent P-values as input and follows a novel idea that was originally proposed under the high-dimensional mediation analysis framework. The key improvement of MAIUP is that it takes the composite null nature of the pleiotropy test into account by fitting a three-component mixture null distribution, which can ultimately generate well-calibrated P-values for effective control of the family-wise error rate and false discovery rate. Another attractive advantage of MAIUP is its ability to effectively address the issue of overlapping subjects commonly encountered in association studies. Simulation studies demonstrate that, compared with other methods, only MAIUP can maintain correct type I error control and has higher power across a wide range of scenarios. We apply MAIUP to detect shared associated genes among 14 psychiatric disorders with summary statistics and discover many new pleiotropic genes that are otherwise not identified if failing to account for the issue of composite null hypothesis testing. Functional and enrichment analyses offer additional evidence supporting the validity of these identified pleiotropic genes associated with psychiatric disorders. Overall, MAIUP represents an efficient method for pleiotropy identification.
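As a rough illustration of the classical intersection-union test that the abstract says MAIUP builds on, the sketch below combines two gene-level P-values by taking their maximum, which tests the composite null that a gene is associated with at most one of the two traits. It is not the MAIUP method: the three-component mixture null and the handling of overlapping subjects are not reproduced here, and the function name `iut_pvalue` and the example P-values are hypothetical.

```python
from typing import List, Sequence


def iut_pvalue(p_trait1: Sequence[float], p_trait2: Sequence[float]) -> List[float]:
    """Classical intersection-union test: per-gene maximum of the two P-values.

    A gene is called pleiotropic only if BOTH trait-specific P-values are small,
    so the combined P-value is the larger of the two.
    """
    if len(p_trait1) != len(p_trait2):
        raise ValueError("P-value vectors must have the same length")
    return [max(p1, p2) for p1, p2 in zip(p_trait1, p_trait2)]


# Hypothetical gene-level P-values for two traits (illustration only)
p1 = [1e-8, 0.03, 0.5]
p2 = [2e-6, 1e-9, 0.4]
combined = iut_pvalue(p1, p2)
print(combined)
```

The maximum-P rule is generally conservative when most genes are null for both traits, which is the calibration issue the abstract describes MAIUP as addressing with its mixture null distribution.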
Affiliation(s)
- Ting Wang
- Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
- Haojie Lu
- Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
- Ping Zeng
- Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
- Center for Medical Statistics and Data Analysis, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
- Key Laboratory of Human Genetics and Environmental Medicine, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China

6. The statistical practice of the GTEx Project: from single to multiple tissues. Quantitative Biology 2020. [DOI: 10.1007/s40484-020-0210-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]

7. Dai M, Wan X, Peng H, Wang Y, Liu Y, Liu J, Xu Z, Yang C. Joint analysis of individual-level and summary-level GWAS data by leveraging pleiotropy. Bioinformatics 2020; 35:1729-1736. [PMID: 30307540 DOI: 10.1093/bioinformatics/bty870] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 04/19/2018] [Revised: 09/06/2018] [Accepted: 10/09/2018] [Indexed: 12/18/2022]
Abstract
Motivation: A large number of recent genome-wide association studies (GWASs) for complex phenotypes confirm the early conjecture of polygenicity, suggesting the presence of a large number of variants with only tiny or moderate effects. However, due to the limited sample size of a single GWAS, many associated genetic variants are too weak to achieve genome-wide significance. These undiscovered variants further limit the prediction capability of GWAS. Restricted access to individual-level data and the increasing availability of published GWAS results motivate the development of methods integrating both individual-level and summary-level data. How the connection between the individual-level and summary-level data is built determines how efficiently the abundant existing summary-level resources can be used alongside limited individual-level data, and this issue has inspired further work in the area. Results: In this study, we propose a novel statistical approach, LEP, which provides a novel way of modeling the connection between the individual-level data and summary-level data. LEP integrates both types of data by LEveraging Pleiotropy to increase the statistical power of risk variant identification and the accuracy of risk prediction. The algorithm for parameter estimation is developed to handle genome-wide-scale data. Through comprehensive simulation studies, we demonstrated the advantages of LEP over the existing methods. We further applied LEP to perform integrative analysis of Crohn's disease from WTCCC and summary statistics from GWAS of some other diseases, such as Type 1 diabetes, Ulcerative colitis and Primary biliary cirrhosis. LEP was able to significantly increase the statistical power of identifying risk variants and improve the risk prediction accuracy from 63.39% (±0.58%) to 68.33% (±0.32%) using about 195 000 variants. Availability and implementation: The LEP software is available at https://github.com/daviddaigithub/LEP. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mingwei Dai
- Department of Applied Mathematics, School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China
- Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong, China
- Xiang Wan
- Shenzhen Research Institute of Big Data, Shenzhen, China
- Hao Peng
- School of Business Administration, Southwestern University of Finance and Economics, Chengdu, China
- Yao Wang
- Department of Applied Mathematics, School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China
- Yue Liu
- Xiyuan Hospital of China Academy of Chinese Medical Sciences, Beijing, China
- Jin Liu
- Centre for Quantitative Medicine, Program in Health Services and Systems Research, Duke-NUS Medical School, Singapore
- Zongben Xu
- Department of Applied Mathematics, School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China
- Can Yang
- Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong, China

8. Yang C, Wan X, Lin X, Chen M, Zhou X, Liu J. CoMM: a collaborative mixed model to dissecting genetic contributions to complex traits by leveraging regulatory information. Bioinformatics 2020; 35:1644-1652. [PMID: 30295737 DOI: 10.1093/bioinformatics/bty865] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 06/24/2018] [Revised: 09/15/2018] [Accepted: 10/05/2018] [Indexed: 12/12/2022]
Abstract
Motivation: Genome-wide association studies (GWASs) have been successful in identifying many genetic variants associated with complex traits. However, the mechanistic links between these variants and complex traits remain elusive. A scientific hypothesis is that genetic variants influence complex traits at the organismal level via affecting cellular traits, such as regulating gene expression and altering protein abundance. Although earlier works have already presented some scientific insights about this hypothesis and their findings are very promising, statistical methods that effectively harness multilayered data (e.g. genetic variants, cellular traits and organismal traits) on a large scale for functional and mechanistic exploration are in high demand. Results: In this study, we propose a collaborative mixed model (CoMM) to investigate the mechanistic role of associated variants in complex traits. The key idea is built upon the emerging scientific evidence that genetic effects at the cellular level are much stronger than those at the organismal level. Briefly, CoMM combines two models: the first model relates gene expression to genotype, and the second model relates phenotype to the gene expression predicted by the first model. The two models are fitted jointly in CoMM, such that the uncertainty in predicting gene expression is fully accounted for. To demonstrate the advantages of CoMM over existing methods, we conducted extensive simulation studies, and also applied CoMM to analyze 25 traits in the NFBC1966 and Genetic Epidemiology Research on Aging (GERA) studies by integrating transcriptome information from the Genetic European Variation in Health and Disease (GEUVADIS) Project. The results indicate that by leveraging regulatory information, CoMM can effectively improve the power of prioritizing risk variants. Regarding computational efficiency, CoMM can complete the analysis of the NFBC1966 and GERA datasets in 2 and 18 min, respectively. Availability and implementation: The developed R package is available at https://github.com/gordonliu810822/CoMM. Supplementary information: Supplementary data are available at Bioinformatics online.
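The two component models named in the abstract can be sketched as a plain two-stage procedure: a ridge regression of expression on genotype, followed by a regression of phenotype on the predicted expression. This is a simplified illustration and not CoMM itself, which according to the abstract fits both models jointly so that the uncertainty of the expression prediction is carried through. The function name `two_stage_sketch`, the ridge penalty, and the simulated data are all assumptions made for the example.

```python
import numpy as np


def two_stage_sketch(geno_ref, expr_ref, geno_gwas, pheno, ridge=1.0):
    """Naive two-stage analogue of the two models (NOT the joint CoMM fit).

    Stage 1: ridge fit of expression on genotype in a reference panel.
    Stage 2: least-squares slope of phenotype on predicted expression.
    """
    n_snps = geno_ref.shape[1]
    # Stage 1: solve (G'G + ridge * I) w = G'e for the SNP weights w
    weights = np.linalg.solve(
        geno_ref.T @ geno_ref + ridge * np.eye(n_snps),
        geno_ref.T @ expr_ref,
    )
    # Stage 2: regress centred phenotype on centred predicted expression
    predicted = geno_gwas @ weights
    x = predicted - predicted.mean()
    y = pheno - pheno.mean()
    slope = float(x @ y / (x @ x))
    return weights, slope


# Simulated data (hypothetical): 10 SNPs, true expression-to-trait effect of 2.0
rng = np.random.default_rng(0)
true_w = rng.normal(size=10)
geno_ref = rng.normal(size=(500, 10))    # reference panel with expression data
geno_gwas = rng.normal(size=(2000, 10))  # GWAS cohort without expression data
expr_ref = geno_ref @ true_w + 0.1 * rng.normal(size=500)
pheno = 2.0 * (geno_gwas @ true_w) + 0.1 * rng.normal(size=2000)

weights, slope = two_stage_sketch(geno_ref, expr_ref, geno_gwas, pheno)
print(round(slope, 2))
```

In this sketch the stage-1 weights are treated as fixed when the stage-2 slope is estimated, which is exactly the source of unaccounted uncertainty that a joint fit is meant to remove.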
Affiliation(s)
- Can Yang
- Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong, China
- Xiang Wan
- Shenzhen Research Institute of Big Data, Shenzhen, China
- Xinyi Lin
- Centre for Quantitative Medicine, Program in Health Services and Systems Research, Duke-NUS Medical School, Singapore
- Mengjie Chen
- Department of Medicine, University of Chicago, Chicago, IL, USA
- Xiang Zhou
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Jin Liu
- Centre for Quantitative Medicine, Program in Health Services and Systems Research, Duke-NUS Medical School, Singapore

9. van Rheenen W, Peyrot WJ, Schork AJ, Lee SH, Wray NR. Genetic correlations of polygenic disease traits: from theory to practice. Nat Rev Genet 2019; 20:567-581. [PMID: 31171865 DOI: 10.1038/s41576-019-0137-z] [Citation(s) in RCA: 202] [Impact Index Per Article: 33.7] [Indexed: 12/21/2022]
Abstract
The genetic correlation describes the genetic relationship between two traits and can contribute to a better understanding of the shared biological pathways and/or the causality relationships between them. The rarity of large family cohorts with recorded instances of two traits, particularly disease traits, has made it difficult to estimate genetic correlations using traditional epidemiological approaches. However, advances in genomic methodologies, such as genome-wide association studies, and widespread sharing of data now allow genetic correlations to be estimated for virtually any trait pair. Here, we review the definition, estimation, interpretation and uses of genetic correlations, with a focus on applications to human disease.
Affiliation(s)
- Wouter van Rheenen
- Department of Neurology, Brain Center Rudolf Magnus, University Medical Center Utrecht, Utrecht, Netherlands.
- Wouter J Peyrot
- Department of Psychiatry, Amsterdam UMC, VU University Medical Center, Amsterdam, Netherlands
- Andrew J Schork
- Institute for Biological Psychiatry, Mental Health Services Snct. Hans, Roskilde, Denmark
- S Hong Lee
- Australian Centre for Precision Health, University of South Australia Cancer Research Institute, University of South Australia, Adelaide, South Australia, Australia
- Naomi R Wray
- Institute for Molecular Bioscience, University of Queensland, Brisbane, Queensland, Australia.
- Queensland Brain Institute, University of Queensland, Brisbane, Queensland, Australia.

10. Zeng P, Hao X, Zhou X. Pleiotropic mapping and annotation selection in genome-wide association studies with penalized Gaussian mixture models. Bioinformatics 2019; 34:2797-2807. [PMID: 29635306 DOI: 10.1093/bioinformatics/bty204] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 09/26/2017] [Accepted: 04/02/2018] [Indexed: 12/11/2022]
Abstract
Motivation: Genome-wide association studies (GWASs) have identified many genetic loci associated with complex traits. A substantial fraction of these identified loci is associated with multiple traits, a phenomenon known as pleiotropy. Identification of pleiotropic associations can help characterize the genetic relationship among complex traits and can facilitate our understanding of disease etiology. Effective pleiotropic association mapping requires the development of statistical methods that can jointly model multiple traits together with genome-wide single nucleotide polymorphisms (SNPs). Results: We develop a joint modeling method, which we refer to as the integrative MApping of Pleiotropic association (iMAP). iMAP models summary statistics from GWASs, uses a multivariate Gaussian distribution to account for phenotypic correlation, simultaneously infers genome-wide SNP association patterns using mixture modeling and has the potential to reveal causal relationships between traits. Importantly, iMAP integrates a large number of SNP functional annotations to substantially improve association mapping power, and, with a sparsity-inducing penalty, is capable of selecting informative annotations from a large, potentially non-informative set. To enable scalable inference of iMAP for association studies with hundreds of thousands of individuals and millions of SNPs, we develop an efficient expectation maximization algorithm based on an approximate penalized regression algorithm. With simulations and comparisons to existing methods, we illustrate the benefits of iMAP in terms of both high association mapping power and accurate estimation of genome-wide SNP association patterns. Finally, we apply iMAP to perform a joint analysis of 48 traits from 31 GWAS consortia together with 40 tissue-specific SNP annotations generated from the Roadmap Project. Availability and implementation: iMAP is freely available at http://www.xzlab.org/software.html. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ping Zeng
- Department of Epidemiology and Biostatistics, Xuzhou Medical University, Xuzhou, Jiangsu, China
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Xingjie Hao
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Xiang Zhou
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA

11. Wu M, Lin Z, Ma S, Chen T, Jiang R, Wong WH. Simultaneous inference of phenotype-associated genes and relevant tissues from GWAS data via Bayesian integration of multiple tissue-specific gene networks. J Mol Cell Biol 2019; 9:436-452. [PMID: 29300920 DOI: 10.1093/jmcb/mjx059] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 08/30/2017] [Accepted: 12/20/2017] [Indexed: 02/07/2023]
Abstract
Although genome-wide association studies (GWAS) have successfully identified thousands of genomic loci associated with hundreds of complex traits in the past decade, the debate about such problems as missing heritability and weak interpretability has called for effective computational methods to facilitate the advanced analysis of the vast volume of existing and anticipated genetic data. Towards this goal, gene-level integrative GWAS analysis, with the assumption that genes associated with a phenotype tend to be enriched in biological gene sets or gene networks, has recently attracted much attention, due to such advantages as straightforward interpretation, a lower multiple-testing burden, and robustness across studies. However, existing methods in this category usually exploit non-tissue-specific gene networks and thus lack the ability to utilize informative tissue-specific characteristics. To overcome this limitation, we proposed a Bayesian approach called SIGNET (Simultaneously Inference of GeNEs and Tissues) to integrate GWAS data and multiple tissue-specific gene networks for the simultaneous inference of phenotype-associated genes and relevant tissues. Through extensive simulation studies, we showed the effectiveness of our method in finding both associated genes and relevant tissues for a phenotype. In applications to real GWAS data of 14 complex phenotypes, we demonstrated the power of our method in both deciphering the genetic basis and discovering biological insights of a phenotype. With this understanding, we expect to see SIGNET as a valuable tool for integrative GWAS analysis, thereby boosting the prevention, diagnosis, and treatment of human inherited diseases and eventually facilitating precision medicine.
Affiliation(s)
- Mengmeng Wu
- Department of Computer Science, Tsinghua University, Beijing 100084, China
- Ministry of Education Key Laboratory of Bioinformatics and Bioinformatics Division, Tsinghua National Laboratory for Information Science and Technology, Beijing 100084, China
- Department of Statistics, Stanford University, CA 94305, USA
- Zhixiang Lin
- Department of Statistics, Stanford University, CA 94305, USA
- Shining Ma
- Department of Statistics, Stanford University, CA 94305, USA
- Ting Chen
- Department of Computer Science, Tsinghua University, Beijing 100084, China
- Ministry of Education Key Laboratory of Bioinformatics and Bioinformatics Division, Tsinghua National Laboratory for Information Science and Technology, Beijing 100084, China
- Rui Jiang
- Ministry of Education Key Laboratory of Bioinformatics and Bioinformatics Division, Tsinghua National Laboratory for Information Science and Technology, Beijing 100084, China
- Department of Automation, Tsinghua University, Beijing 100084, China
- Wing Hung Wong
- Department of Statistics, Stanford University, CA 94305, USA

12. Yang Y, Dai M, Huang J, Lin X, Yang C, Chen M, Liu J. LPG: A four-group probabilistic approach to leveraging pleiotropy in genome-wide association studies. BMC Genomics 2018; 19:503. [PMID: 29954342 PMCID: PMC6022345 DOI: 10.1186/s12864-018-4851-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 11/02/2017] [Accepted: 06/04/2018] [Indexed: 01/30/2023]
Abstract
Background: To date, genome-wide association studies (GWAS) have successfully identified tens of thousands of genetic variants among a variety of traits/diseases, shedding light on the genetic architecture of complex disease. The polygenicity of complex diseases is a widely accepted phenomenon through which a vast number of risk variants, each with a modest individual effect, collectively contribute to the heritability of complex diseases. This imposes a major challenge on fully characterizing the genetic bases of complex diseases. An immediate implication of polygenicity is that a much larger sample size is required to detect individual risk variants with weak/moderate effects. Meanwhile, accumulating evidence suggests that different complex diseases can share genetic risk variants, a phenomenon known as pleiotropy. Results: In this study, we propose a statistical framework for Leveraging Pleiotropic effects in large-scale GWAS data (LPG). LPG utilizes a variational Bayesian expectation-maximization (VBEM) algorithm, making it computationally efficient and scalable for genome-wide-scale analysis. To demonstrate the advantages of LPG over existing methods that do not leverage pleiotropy, we conducted extensive simulation studies and applied LPG to analyze two pairs of disorders (Crohn’s disease and Type 1 diabetes, as well as rheumatoid arthritis and Type 1 diabetes). The results indicate that by leveraging pleiotropy, LPG can improve the power of prioritization of risk variants and the accuracy of risk prediction. Conclusions: Our methodology provides a novel and efficient tool to detect pleiotropy among GWAS data for multiple traits/diseases collected from different studies. The software is available at https://github.com/Shufeyangyi2015310117/LPG. Electronic supplementary material: The online version of this article (10.1186/s12864-018-4851-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Yi Yang
  - School of Statistics and Management, The Shanghai University of Finance and Economics, Guoding Road, Shanghai, China; Centre for Quantitative Medicine, Duke-NUS Medical School, 8 College Road, Singapore, Singapore
- Mingwei Dai
  - Institute for Information and System Sciences, Xi'an Jiaotong University, No.28, Xianning West Road, Xi'an, China; Department of Mathematics, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
- Jian Huang
  - Department of Applied Mathematics, Hong Kong Polytechnic University, Hung Hom, Hong Kong, China
- Xinyi Lin
  - Centre for Quantitative Medicine, Duke-NUS Medical School, 8 College Road, Singapore, Singapore
- Can Yang
  - Department of Mathematics, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
- Min Chen
  - School of Statistics and Management, The Shanghai University of Finance and Economics, Guoding Road, Shanghai, China
- Jin Liu
  - Centre for Quantitative Medicine, Duke-NUS Medical School, 8 College Road, Singapore, Singapore
|
13
|
Hackinger S, Zeggini E. Statistical methods to detect pleiotropy in human complex traits. Open Biol 2018; 7:rsob.170125. [PMID: 29093210 PMCID: PMC5717338 DOI: 10.1098/rsob.170125] [Citation(s) in RCA: 89] [Impact Index Per Article: 12.7] [Received: 05/22/2017] [Accepted: 09/29/2017] [Indexed: 12/13/2022]
Abstract
In recent years pleiotropy, the phenomenon of one genetic locus influencing several traits, has become a widely researched field in human genetics. With the increasing availability of genome-wide association study summary statistics, as well as the establishment of deeply phenotyped sample collections, it is now possible to systematically assess the genetic overlap between multiple traits and diseases. In addition to increasing power to detect associated variants, multi-trait methods can also aid our understanding of how different disorders are aetiologically linked by highlighting relevant biological pathways. A plethora of available tools to perform such analyses exists, each with their own advantages and limitations. In this review, we outline some of the currently available methods to conduct multi-trait analyses. First, we briefly introduce the concept of pleiotropy and outline the current landscape of pleiotropy research in human genetics; second, we describe analytical considerations and analysis methods; finally, we discuss future directions for the field.
|
14
|
Dai M, Ming J, Cai M, Liu J, Yang C, Wan X, Xu Z. IGESS: a statistical approach to integrating individual-level genotype data and summary statistics in genome-wide association studies. Bioinformatics 2018; 33:2882-2889. [PMID: 28498950 DOI: 10.1093/bioinformatics/btx314] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 12/07/2016] [Accepted: 05/10/2017] [Indexed: 01/24/2023]
Abstract
Motivation: Results from genome-wide association studies (GWAS) suggest that a complex phenotype is often affected by many variants with small effects, known as 'polygenicity'. Tens of thousands of samples are often required to ensure statistical power of identifying these variants with small effects. However, it is often the case that a research group can only get approval for access to individual-level genotype data with a limited sample size (e.g. a few hundred or a few thousand). Meanwhile, summary statistics generated using single-variant-based analysis are becoming publicly available. The sample sizes associated with the summary statistics datasets are usually quite large. How to make the most efficient use of existing abundant data resources largely remains an open question. Results: In this study, we propose a statistical approach, IGESS, to increase the statistical power of identifying risk variants and improve the accuracy of risk prediction by integrating individual-level genotype data and summary statistics. An efficient algorithm based on variational inference is developed to handle the genome-wide analysis. Through comprehensive simulation studies, we demonstrated the advantages of IGESS over the methods which take either individual-level data or summary statistics data as input. We applied IGESS to perform integrative analysis of Crohn's disease from WTCCC and summary statistics from other studies. IGESS was able to significantly increase the statistical power of identifying risk variants and improve the risk prediction accuracy from 63.2% (±0.4%) to 69.4% (±0.1%) using about 240 000 variants. Availability and implementation: The IGESS software is available at https://github.com/daviddaigithub/IGESS. Contact: zbxu@xjtu.edu.cn or xwan@comp.hkbu.edu.hk or eeyang@hkbu.edu.hk. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mingwei Dai
  - School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China; Department of Mathematics, Hong Kong Baptist University, Hong Kong
- Jingsi Ming
  - Department of Mathematics, Hong Kong Baptist University, Hong Kong
- Mingxuan Cai
  - Department of Mathematics, Hong Kong Baptist University, Hong Kong
- Jin Liu
  - Centre of Quantitative Medicine, Duke-NUS Medical School, Singapore
- Can Yang
  - Department of Mathematics, Hong Kong Baptist University, Hong Kong
- Xiang Wan
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong
- Zongben Xu
  - School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China
|
15
|
Su YR, Di C, Bien S, Huang L, Dong X, Abecasis G, Berndt S, Bezieau S, Brenner H, Caan B, Casey G, Chang-Claude J, Chanock S, Chen S, Connolly C, Curtis K, Figueiredo J, Gala M, Gallinger S, Harrison T, Hoffmeister M, Hopper J, Huyghe JR, Jenkins M, Joshi A, Le Marchand L, Newcomb P, Nickerson D, Potter J, Schoen R, Slattery M, White E, Zanke B, Peters U, Hsu L. A Mixed-Effects Model for Powerful Association Tests in Integrative Functional Genomics. Am J Hum Genet 2018; 102:904-919. [PMID: 29727690 PMCID: PMC5986723 DOI: 10.1016/j.ajhg.2018.03.019] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Received: 08/23/2017] [Accepted: 03/15/2018] [Indexed: 01/05/2023]
Abstract
Genome-wide association studies (GWASs) have successfully identified thousands of genetic variants for many complex diseases; however, these variants explain only a small fraction of the heritability. Recently, genetic association studies that leverage external transcriptome data have received much attention and shown promise for discovering novel variants. One such approach, PrediXcan, is to use predicted gene expression through genetic regulation. However, there are limitations in this approach. The predicted gene expression may be biased, resulting from regularized regression applied to moderately sample-sized reference studies. Further, some variants can individually influence disease risk through alternative functional mechanisms besides expression. Thus, testing only the association of predicted gene expression as proposed in PrediXcan will potentially lose power. To tackle these challenges, we consider a unified mixed effects model that formulates the association of intermediate phenotypes such as imputed gene expression through fixed effects, while allowing residual effects of individual variants to be random. We consider a set-based score testing framework, MiST (mixed effects score test), and propose two data-driven combination approaches to jointly test for the fixed and random effects. We establish the asymptotic distributions, which enable rapid calculation of p values for genome-wide analyses, and provide p values for fixed and random effects separately to enhance interpretability over GWASs. Extensive simulations demonstrate that our approaches are more powerful than existing ones. We apply our approach to a large-scale GWAS of colorectal cancer and identify two genes, POU5F1B and ATF1, which would have otherwise been missed by PrediXcan, after adjusting for all known loci.
Affiliation(s)
- Yu-Ru Su
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Chongzhi Di
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Stephanie Bien
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Licai Huang
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Xinyuan Dong
  - Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
- Goncalo Abecasis
  - Biostatistics, School of Public Health, University of Michigan, Ann Arbor, MI 48109, USA
- Sonja Berndt
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Rockville, MD 20850, USA
- Stephane Bezieau
  - Service de Génétique Médicale Centre Hospitalier Universitaire (CHU) Nantes, Nantes 44093, France
- Hermann Brenner
  - Division of Clinical Epidemiology and Aging Research, German Cancer Research Center (DKFZ), Heidelberg 69120, Germany
- Bette Caan
  - Division of Research, Kaiser Permanente Medical Care Program of Northern California, Oakland, CA 94612, USA
- Graham Casey
  - Public Health Sciences Division, University of Virginia, Charlottesville, VA 22908, USA
- Jenny Chang-Claude
  - Division of Cancer Epidemiology, German Cancer Research Center, Heidelberg 69009, Germany
- Stephen Chanock
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Rockville, MD 20850, USA
- Sai Chen
  - Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Charles Connolly
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Keith Curtis
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Jane Figueiredo
  - Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA
- Manish Gala
  - Division of Gastroenterology, Massachusetts General Hospital, Boston, MA 02114, USA
- Steven Gallinger
  - Department of Surgery, Mount Sinai Hospital, Toronto, ON M5G 1X5, Canada
- Tabitha Harrison
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Michael Hoffmeister
  - Division of Clinical Epidemiology and Aging Research, German Cancer Research Center (DKFZ), Heidelberg 69120, Germany
- John Hopper
  - Melbourne School of Population Health, The University of Melbourne, Carlton, VIC 3010, Australia
- Jeroen R Huyghe
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Mark Jenkins
  - Melbourne School of Population Health, The University of Melbourne, Carlton, VIC 3010, Australia
- Amit Joshi
  - Department of Epidemiology, Harvard School of Public Health, Boston, MA 02115, USA
- Loic Le Marchand
  - Epidemiology Program, University of Hawaii Cancer Center, Honolulu, HI 96813, USA
- Polly Newcomb
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA; Department of Epidemiology, University of Washington School of Public Health, Seattle, WA 98109, USA
- John Potter
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA; Department of Epidemiology, University of Washington School of Public Health, Seattle, WA 98109, USA
- Robert Schoen
  - Department of Medicine and Epidemiology, University of Pittsburgh Medical Center, Pittsburgh, PA 15213, USA
- Martha Slattery
  - Department of Internal Medicine, University of Utah Health Sciences Center, Salt Lake City, UT 84132, USA
- Emily White
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA; Department of Epidemiology, University of Washington School of Public Health, Seattle, WA 98109, USA
- Brent Zanke
  - Division of Hematology, Faculty of Medicine, The University of Ottawa, Ottawa, ON K1Y 4E9, USA
- Ulrike Peters
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA; Department of Epidemiology, University of Washington School of Public Health, Seattle, WA 98109, USA
- Li Hsu
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA; Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
|
16
|
Ming J, Dai M, Cai M, Wan X, Liu J, Yang C. LSMM: a statistical approach to integrating functional annotations with genome-wide association studies. Bioinformatics 2018; 34:2788-2796. [DOI: 10.1093/bioinformatics/bty187] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 10/23/2017] [Accepted: 03/27/2018] [Indexed: 01/27/2023]
Affiliation(s)
- Jingsi Ming
  - Department of Mathematics, Hong Kong Baptist University, Hong Kong
- Mingwei Dai
  - School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an, China
  - Department of Mathematics, The Hong Kong University of Science and Technology, Hong Kong
- Mingxuan Cai
  - Department of Mathematics, Hong Kong Baptist University, Hong Kong
- Xiang Wan
  - Shenzhen Research Institute of Big Data, Shenzhen, China
- Jin Liu
  - Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore
- Can Yang
  - Department of Mathematics, The Hong Kong University of Science and Technology, Hong Kong
|
17
|
Benner C, Havulinna AS, Järvelin MR, Salomaa V, Ripatti S, Pirinen M. Prospects of Fine-Mapping Trait-Associated Genomic Regions by Using Summary Statistics from Genome-wide Association Studies. Am J Hum Genet 2017; 101:539-551. [PMID: 28942963 DOI: 10.1016/j.ajhg.2017.08.012] [Citation(s) in RCA: 148] [Impact Index Per Article: 18.5] [Received: 03/21/2017] [Accepted: 08/17/2017] [Indexed: 01/15/2023]
Abstract
During the past few years, various novel statistical methods have been developed for fine-mapping with the use of summary statistics from genome-wide association studies (GWASs). Although these approaches require information about the linkage disequilibrium (LD) between variants, there has not been a comprehensive evaluation of how estimation of the LD structure from reference genotype panels performs in comparison with that from the original individual-level GWAS data. Using population genotype data from Finland and the UK Biobank, we show here that a reference panel of 1,000 individuals from the target population is adequate for a GWAS cohort of up to 10,000 individuals, whereas smaller panels, such as those from the 1000 Genomes Project, should be avoided. We also show, both theoretically and empirically, that the size of the reference panel needs to scale with the GWAS sample size; this has important consequences for the application of these methods in ongoing GWAS meta-analyses and large biobank studies. We conclude by providing software tools and by recommending practices for sharing LD information to more efficiently exploit summary statistics in genetics research.
Affiliation(s)
- Christian Benner
  - Institute for Molecular Medicine Finland, University of Helsinki, 00014 Helsinki, Finland; Department of Public Health, University of Helsinki, 00014 Helsinki, Finland
- Aki S Havulinna
  - Institute for Molecular Medicine Finland, University of Helsinki, 00014 Helsinki, Finland; National Institute for Health and Welfare, 00271 Helsinki, Finland
- Marjo-Riitta Järvelin
  - Center for Life-Course Health Research and Northern Finland Cohort Center, Biocenter Oulu, University of Oulu, 90014 Oulu, Finland; Faculty of Medicine, University of Oulu, 90014 Oulu, Finland; Unit of Primary Care, Oulu University Hospital, 90220 Oulu, Finland; Department of Epidemiology and Biostatistics, School of Public Health, Faculty of Medicine, Imperial College London, W2 1PG, UK
- Veikko Salomaa
  - National Institute for Health and Welfare, 00271 Helsinki, Finland
- Samuli Ripatti
  - Institute for Molecular Medicine Finland, University of Helsinki, 00014 Helsinki, Finland; Department of Public Health, University of Helsinki, 00014 Helsinki, Finland; Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, Cambridge, UK
- Matti Pirinen
  - Institute for Molecular Medicine Finland, University of Helsinki, 00014 Helsinki, Finland; Department of Public Health, University of Helsinki, 00014 Helsinki, Finland; Helsinki Institute for Information Technology and Department of Mathematics and Statistics, University of Helsinki, 00014 Helsinki, Finland
|