1
|
Kontou PI, Bagos PG. The goldmine of GWAS summary statistics: a systematic review of methods and tools. BioData Min 2024; 17:31. [PMID: 39238044 PMCID: PMC11375927 DOI: 10.1186/s13040-024-00385-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2024] [Accepted: 08/27/2024] [Indexed: 09/07/2024]
Abstract
Genome-wide association studies (GWAS) have revolutionized our understanding of the genetic architecture of complex traits and diseases. GWAS summary statistics have become essential tools for various genetic analyses, including meta-analysis, fine-mapping, and risk prediction. However, the increasing number of GWAS summary statistics and the diversity of software tools available for their analysis can make it challenging for researchers to select the most appropriate tools for their specific needs. This systematic review aims to provide a comprehensive overview of the currently available software tools and databases for GWAS summary statistics analysis. We conducted a comprehensive literature search to identify relevant software tools and databases. We categorized the tools and databases by their functionality, including data management, quality control, single-trait analysis, and multiple-trait analysis. We also compared the tools and databases based on their features, limitations, and user-friendliness. Our review identified a total of 305 functioning software tools and databases dedicated to GWAS summary statistics, each with unique strengths and limitations. We provide descriptions of the key features of each tool and database, including their input/output formats, data types, and computational requirements. We also discuss the overall usability and applicability of each tool for different research scenarios. This comprehensive review will serve as a valuable resource for researchers who are interested in using GWAS summary statistics to investigate the genetic basis of complex traits and diseases. By providing a detailed overview of the available tools and databases, we aim to facilitate informed tool selection and maximize the effectiveness of GWAS summary statistics analysis.
Affiliation(s)
| | - Pantelis G Bagos
- Department of Computer Science and Biomedical Informatics, University of Thessaly, 35131, Lamia, Greece.
| |
|
2
|
Yao X, Ouyang S, Lian Y, Peng Q, Zhou X, Huang F, Hu X, Shi F, Xia J. PheSeq, a Bayesian deep learning model to enhance and interpret the gene-disease association studies. Genome Med 2024; 16:56. [PMID: 38627848 PMCID: PMC11020195 DOI: 10.1186/s13073-024-01330-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Accepted: 04/02/2024] [Indexed: 04/19/2024]
Abstract
Despite the abundance of genotype-phenotype association studies, the resulting association outcomes often lack robustness and interpretations. To address these challenges, we introduce PheSeq, a Bayesian deep learning model that enhances and interprets association studies through the integration and perception of phenotype descriptions. By implementing the PheSeq model in three case studies on Alzheimer's disease, breast cancer, and lung cancer, we identify 1024 priority genes for Alzheimer's disease and 818 and 566 genes for breast cancer and lung cancer, respectively. Benefiting from data fusion, these findings represent moderate positive rates, high recall rates, and interpretation in gene-disease association studies.
Affiliation(s)
- Xinzhi Yao
- College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
- Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
| | - Sizhuo Ouyang
- College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
- Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
| | - Yulong Lian
- College of Science, Huazhong Agricultural University, Wuhan, China
| | - Qianqian Peng
- College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
- Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
| | - Xionghui Zhou
- College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
- Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
| | - Feier Huang
- College of Life Science and Technology, Huazhong Agricultural University, Wuhan, China
| | - Xuehai Hu
- College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
- Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
| | - Feng Shi
- College of Science, Huazhong Agricultural University, Wuhan, China
| | - Jingbo Xia
- College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China.
- Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China.
| |
|
3
|
Fu S, Deng L, Zhang H, Qin J, Yu K. Integrative analysis of individual-level data and high-dimensional summary statistics. Bioinformatics 2023; 39:7085950. [PMID: 36964712 PMCID: PMC10361352 DOI: 10.1093/bioinformatics/btad156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2022] [Revised: 03/19/2023] [Accepted: 03/22/2023] [Indexed: 04/23/2023]
Abstract
MOTIVATION Researchers usually conduct statistical analyses based on models built on raw data collected from individual participants (individual-level data). There is a growing interest in enhancing inference efficiency by incorporating aggregated summary information from other sources, such as summary statistics on genetic markers' marginal associations with a given trait generated from genome-wide association studies. However, combining high-dimensional summary data with individual-level data using existing integrative procedures can be challenging due to various numeric issues in optimizing an objective function over a large number of unknown parameters. RESULTS We develop a procedure to improve the fitting of a targeted statistical model by leveraging external summary data for more efficient statistical inference (both effect estimation and hypothesis testing). To make this procedure scalable to high-dimensional summary data, we propose a divide-and-conquer strategy by breaking the task into easier parallel jobs, each fitting the targeted model by integrating the individual-level data with a small proportion of summary data. We obtain the final estimates of model parameters by pooling results from multiple fitted models through the minimum distance estimation procedure. We improve the procedure for a general class of additive models commonly encountered in genetic studies. We further expand these two approaches to integrate individual-level and high-dimensional summary data from different study populations. We demonstrate the advantage of the proposed methods through simulations and an application to the study of the effect on pancreatic cancer risk by the polygenic risk score defined by BMI-associated genetic markers. AVAILABILITY AND IMPLEMENTATION R package is available at https://github.com/fushengstat/MetaGIM.
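The pooling step described in the abstract above can be illustrated with a minimal sketch. This is not the MetaGIM implementation: the function name `pool_estimates` is hypothetical, the parameter is treated as a scalar, and the per-job estimates are assumed to be independent. Under those assumptions, minimizing the weighted squared distance between the pooled value and the per-job estimates reduces to an inverse-variance weighted mean.

```python
from typing import Sequence, Tuple


def pool_estimates(estimates: Sequence[float],
                   variances: Sequence[float]) -> Tuple[float, float]:
    """Pool per-job estimates of one scalar parameter (hypothetical helper).

    Minimizes sum_k (estimates[k] - theta)^2 / variances[k] over theta.
    Assumes the per-job estimates are independent, which is a simplification:
    jobs that share the same individual-level data would be correlated.
    Returns the pooled estimate and its variance.
    """
    if len(estimates) == 0 or len(estimates) != len(variances):
        raise ValueError("need equally many estimates and variances, at least one")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    # Each job is weighted by its precision (inverse variance).
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)

    # Closed-form minimizer of the weighted squared distance.
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    return pooled, 1.0 / total_weight
```

For example, `pool_estimates([1.0, 3.0], [1.0, 1.0])` gives equal weight to both jobs and returns their average with half the per-job variance. A version that handles correlated jobs would replace the diagonal weights with the inverse of the full covariance matrix of the estimates.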
Affiliation(s)
- Sheng Fu
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA
| | - Lu Deng
- School of Statistics and Data Science, Nankai University, Tianjin 300071, China
| | - Han Zhang
- Information Management Services, Inc, Bethesda, MD 20892, USA
| | - Jing Qin
- National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA
| | - Kai Yu
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA
| |
|
4
|
Wang A, Liu W, Liu Z. A two-sample robust Bayesian Mendelian Randomization method accounting for linkage disequilibrium and idiosyncratic pleiotropy with applications to the COVID-19 outcomes. Genet Epidemiol 2022; 46:159-169. [PMID: 35192729 PMCID: PMC9648496 DOI: 10.1002/gepi.22445] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/21/2021] [Revised: 11/03/2021] [Accepted: 01/20/2022] [Indexed: 01/02/2023]
Abstract
Mendelian randomization (MR) is a statistical method exploiting genetic variants as instrumental variables to estimate the causal effect of modifiable risk factors on an outcome of interest. Although various popular two-sample MR methods based on genome-wide association study summary-level data are widely used, they can suffer from power loss and/or biased inference when the chosen genetic variants are in linkage disequilibrium (LD) and also have relatively large direct effects on the outcome whose distribution might be heavy-tailed, a phenomenon commonly referred to as idiosyncratic pleiotropy. To resolve those two issues, we propose a novel Robust Bayesian Mendelian Randomization (RBMR) model that uses the more robust multivariate generalized $t$-distribution to model such direct effects in a probabilistic model framework which can also incorporate the LD structure explicitly. The generalized $t$-distribution can be represented as a Gaussian scaled mixture so that our model parameters can be estimated by expectation maximization (EM)-type algorithms. We compute the standard errors by calibrating the evidence lower bound using the likelihood ratio test. Through extensive simulation studies, we show that our RBMR has robust performance compared with other competing methods. We further apply our RBMR method to two benchmark data sets and find that RBMR has smaller bias and standard errors. Using our proposed RBMR method, we find that coronary artery disease is associated with increased risk of critically ill coronavirus disease 2019. We also develop a user-friendly R package RBMR (https://github.com/AnqiWang2021/RBMR) for public use.
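The Gaussian scale-mixture representation mentioned in the abstract can be illustrated in the simplest case. The sketch below is not the RBMR model: it covers only the univariate standard Student $t$-distribution rather than the multivariate generalized one, and the function names are hypothetical. It draws a latent Gamma-distributed precision and then a Gaussian value given that precision, which is the structure that makes EM-type updates tractable.

```python
import math
import random


def sample_t_scale_mixture(nu: float, rng: random.Random) -> float:
    """Draw one Student-t variate with nu degrees of freedom (hypothetical helper)."""
    # Latent precision: w ~ Gamma(shape = nu/2, scale = 2/nu), so E[w] = 1.
    w = rng.gammavariate(nu / 2.0, 2.0 / nu)
    # Conditional on w, the draw is Gaussian with mean 0 and variance 1/w.
    return rng.gauss(0.0, 1.0 / math.sqrt(w))


def sample_variance(nu: float, n: int, seed: int = 0) -> float:
    """Empirical variance of n scale-mixture draws, for a sanity check."""
    rng = random.Random(seed)
    draws = [sample_t_scale_mixture(nu, rng) for _ in range(n)]
    mean = sum(draws) / n
    return sum((x - mean) ** 2 for x in draws) / (n - 1)
```

For nu > 2 the variance of a Student $t$ variate is nu / (nu - 2), so `sample_variance(10.0, 100000)` should land close to 1.25, above the unit variance of a standard Gaussian because of the heavier tails.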
Affiliation(s)
- Anqi Wang
- Department of Statistics and Actuarial Science, University of Hong Kong, Hong Kong SAR, China
| | - Wei Liu
- Department of Statistics and Actuarial Science, University of Hong Kong, Hong Kong SAR, China
| | - Zhonghua Liu
- Department of Statistics and Actuarial Science, University of Hong Kong, Hong Kong SAR, China
| |
|
5
|
Zhou K, Wang Y, Bretonnel Cohen K, Kim JD, Ma X, Shen Z, Meng X, Xia J. Bridging heterogeneous mutation data to enhance disease gene discovery. Brief Bioinform 2021; 22:6224263. [PMID: 33847357 DOI: 10.1093/bib/bbab079] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 11/04/2020] [Revised: 02/01/2021] [Accepted: 02/19/2021] [Indexed: 12/18/2022]
Abstract
Bridging heterogeneous mutation data fills in the gap between various data categories and propels discovery of disease-related genes. It is known that genome-wide association studies (GWAS) infer significant mutation associations that link genotype and phenotype. However, owing to differences in size and quality between GWAS studies, not all de facto vital variations pass multiple-testing correction. Meanwhile, mutation events widely reported in the literature unveil typical functional biological processes, including mutation types like gain of function and loss of function. To bring together the heterogeneous mutation data, we propose a 'Gene-Disease Association prediction by Mutation Data Bridging (GDAMDB)' pipeline with a statistical generative model. The model learns the distribution parameters of mutation associations and mutation types and recovers false-negative GWAS mutations that fail to pass the significance test but are supported by evidence of functional biological processes in the literature. We applied GDAMDB to Alzheimer's disease (AD) and predicted 79 AD-associated genes. Of these, 12 come from the original GWAS, 60 are supported as AD-related by other GWAS or literature reports, and the rest are newly predicted genes. Our model is capable of enhancing GWAS-based gene association discovery by combining it with text mining results. The positive result indicates that bridging heterogeneous mutation data contributes to novel disease-related gene discovery.
Affiliation(s)
- Kaiyin Zhou
- Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, Hubei Province, P.R. China
| | - Yuxing Wang
- Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, Hubei Province, P.R. China
| | - Kevin Bretonnel Cohen
- Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, Hubei Province, P.R. China
| | - Jin-Dong Kim
- Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, Hubei Province, P.R. China
| | - Xiaohang Ma
- Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, Hubei Province, P.R. China
| | - Zhixue Shen
- Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, Hubei Province, P.R. China
| | - Xiangyu Meng
- Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, Hubei Province, P.R. China
| | - Jingbo Xia
- Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, Hubei Province, P.R. China
| |
|
6
|
Yang Y, Shi X, Jiao Y, Huang J, Chen M, Zhou X, Sun L, Lin X, Yang C, Liu J. CoMM-S2: a collaborative mixed model using summary statistics in transcriptome-wide association studies. Bioinformatics 2020; 36:2009-2016. [PMID: 31755899 DOI: 10.1093/bioinformatics/btz880] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 04/16/2019] [Revised: 09/25/2019] [Accepted: 11/21/2019] [Indexed: 12/23/2022]
Abstract
MOTIVATION Although genome-wide association studies (GWAS) have deepened our understanding of the genetic architecture of complex traits, the mechanistic links that underlie how genetic variants cause complex traits remain elusive. To advance our understanding of the underlying mechanistic links, various consortia have collected a vast volume of genomic data that enable us to investigate the role that genetic variants play in gene expression regulation. Recently, a collaborative mixed model (CoMM) was proposed to jointly interrogate the genome on complex traits by integrating both the GWAS dataset and the expression quantitative trait loci (eQTL) dataset. Although CoMM is a powerful approach that leverages regulatory information while accounting for the uncertainty in using an eQTL dataset, it requires individual-level GWAS data and cannot fully make use of widely available GWAS summary statistics. Therefore, statistically efficient methods that leverage transcriptome information using only summary statistics from GWAS data are required. RESULTS In this study, we propose a novel probabilistic model, CoMM-S2, to examine the mechanistic role that genetic variants play, by using only GWAS summary statistics instead of individual-level GWAS data. Similar to CoMM, which uses individual-level GWAS data, CoMM-S2 combines two models: the first model examines the relationship between gene expression and genotype, while the second model examines the relationship between the phenotype and the predicted gene expression from the first model. Distinct from CoMM, CoMM-S2 requires only GWAS summary statistics. Using both simulation studies and real data analysis, we demonstrate that even though CoMM-S2 utilizes GWAS summary statistics, its performance is comparable to that of CoMM, which uses individual-level GWAS data. AVAILABILITY AND IMPLEMENTATION The implementation of CoMM-S2 is included in the CoMM package that can be downloaded from https://github.com/gordonliu810822/CoMM.
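The two-model structure described in the abstract can be written schematically as follows. The notation is an illustrative assumption rather than a quotation from the paper: $\mathbf{y}$ is gene expression in the eQTL panel, $\mathbf{W}_1$ the genotypes of the same individuals, $\boldsymbol{\gamma}$ the variant effects on expression, $\mathbf{z}$ the phenotype in the GWAS sample, $\mathbf{W}_2$ the GWAS genotypes, and $\alpha$ the effect of genetically predicted expression on the phenotype.

```latex
\begin{aligned}
\mathbf{y} &= \mathbf{W}_1 \boldsymbol{\gamma} + \mathbf{e}_1
  && \text{(expression on genotype, eQTL data)} \\
\mathbf{z} &= \alpha \, (\mathbf{W}_2 \boldsymbol{\gamma}) + \mathbf{e}_2
  && \text{(phenotype on predicted expression, GWAS data)}
\end{aligned}
```

In this sketch, a gene is associated with the trait when $\alpha \neq 0$. Because both lines share $\boldsymbol{\gamma}$, fitting them jointly carries the uncertainty in the expression model into the test of $\alpha$. A summary-statistics version would have to work without $\mathbf{z}$ and $\mathbf{W}_2$, using marginal association estimates in their place.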
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yi Yang
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China.,Centre for Quantitative Medicine, Program in Health Services & Systems Research, Duke-NUS Medical School, 169857, Singapore
| | - Xingjie Shi
- Centre for Quantitative Medicine, Program in Health Services & Systems Research, Duke-NUS Medical School, 169857, Singapore.,Department of Statistics, Nanjing University of Finance and Economics, Nanjing 210046, China
| | - Yuling Jiao
- School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan 430073, China
| | - Jian Huang
- Department of Statistics and Actuarial Science, University of Iowa, Iowa City, IA 52242, USA
| | - Min Chen
- Academy of Mathematics and Systems Science, The Chinese Academy of Sciences, Beijing 100190, China
| | - Xiang Zhou
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Lei Sun
- Cardiovascular and Metabolic Disorders Program, Duke-NUS Medical School, 169857, Singapore
| | - Xinyi Lin
- Centre for Quantitative Medicine, Program in Health Services & Systems Research, Duke-NUS Medical School, 169857, Singapore.,Singapore Clinical Research Institute, 138669, Singapore.,Singapore Institute for Clinical Sciences, A*STAR, 117609, Singapore
| | - Can Yang
- Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong 999077, China
| | - Jin Liu
- Centre for Quantitative Medicine, Program in Health Services & Systems Research, Duke-NUS Medical School, 169857, Singapore
| |
|
7
|
Turek C, Wróbel S, Piwowar M. OmicsON - Integration of omics data with molecular networks and statistical procedures. PLoS One 2020; 15:e0235398. [PMID: 32726348 PMCID: PMC7390260 DOI: 10.1371/journal.pone.0235398] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 11/11/2019] [Accepted: 06/15/2020] [Indexed: 12/05/2022]
Abstract
The huge amount of atomized biological data collected in various databases, together with the need to describe their relations by theoretical methods, drives the development of data integration methods. This paper presents omics data analysis that integrates biological knowledge with mathematical procedures implemented in the OmicsON R library. OmicsON is a tool for the integration of two sets of data: transcriptomics and metabolomics. The library's workflow applies functional grouping and statistical analysis. Subgroups among the transcriptomic and metabolomic sets are created based on the biological knowledge stored in the Reactome and String databases. This makes it possible to analyze such sets of data by multivariate statistical procedures like Canonical Correlation Analysis (CCA) or Partial Least Squares (PLS). The integration of metabolomic and transcriptomic data based on the methodology contained in OmicsON makes it easy to obtain information on the connection of data from two different sets. This information can significantly help in assessing the relationship between gene expression and metabolite concentrations, which in turn facilitates the biological interpretation of the analyzed process.
Affiliation(s)
- Cezary Turek
- Department of Bioinformatics and Telemedicine, Jagiellonian University–Medical College, Krakow, Poland
| | - Sonia Wróbel
- Department of Medical Physics, Jagiellonian University, Marian Smoluchowski Institute of Physics, Krakow, Poland
| | - Monika Piwowar
- Department of Bioinformatics and Telemedicine, Jagiellonian University–Medical College, Krakow, Poland
| |
|
8
|
Dai M, Wan X, Peng H, Wang Y, Liu Y, Liu J, Xu Z, Yang C. Joint analysis of individual-level and summary-level GWAS data by leveraging pleiotropy. Bioinformatics 2020; 35:1729-1736. [PMID: 30307540 DOI: 10.1093/bioinformatics/bty870] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 04/19/2018] [Revised: 09/06/2018] [Accepted: 10/09/2018] [Indexed: 12/18/2022]
Abstract
MOTIVATION A large number of recent genome-wide association studies (GWASs) for complex phenotypes confirm the early conjecture of polygenicity, suggesting the presence of a large number of variants with only tiny or moderate effects. However, due to the limited sample size of a single GWAS, many associated genetic variants are too weak to achieve genome-wide significance. These undiscovered variants further limit the prediction capability of GWAS. Restricted access to individual-level data and the increasing availability of published GWAS results motivate the development of methods integrating both individual-level and summary-level data. How the connection between the individual-level and summary-level data is built determines how efficiently the abundant existing summary-level resources can be used with limited individual-level data, and this issue calls for more effort in this area. RESULTS In this study, we propose a novel statistical approach, LEP, which provides a novel way of modeling the connection between the individual-level data and summary-level data. LEP integrates both types of data by LEveraging Pleiotropy to increase the statistical power of risk variant identification and the accuracy of risk prediction. The algorithm for parameter estimation is developed to handle genome-wide-scale data. Through comprehensive simulation studies, we demonstrated the advantages of LEP over the existing methods. We further applied LEP to perform integrative analysis of Crohn's disease from WTCCC and summary statistics from GWAS of some other diseases, such as Type 1 diabetes, Ulcerative colitis and Primary biliary cirrhosis. LEP was able to significantly increase the statistical power of identifying risk variants and improve the risk prediction accuracy from 63.39% (±0.58%) to 68.33% (±0.32%) using about 195 000 variants. AVAILABILITY AND IMPLEMENTATION The LEP software is available at https://github.com/daviddaigithub/LEP.
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mingwei Dai
- Department of Applied Mathematics, School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China.,Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong, China
| | - Xiang Wan
- ShenZhen Research Institute of Big Data, ShenZhen, China
| | - Hao Peng
- School of Business Administration, Southwestern University of Finance and Economics, Chengdu, China
| | - Yao Wang
- Department of Applied Mathematics, School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China
| | - Yue Liu
- Xiyuan Hospital of China Academy of Chinese Medical Sciences, Beijing, China
| | - Jin Liu
- Centre for Quantitative Medicine, Program in Health Services and Systems Research, Duke-NUS Medical School, Singapore
| | - Zongben Xu
- Department of Applied Mathematics, School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China
| | - Can Yang
- Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong, China
| |
|
9
|
Cai M, Chen LS, Liu J, Yang C. IGREX for quantifying the impact of genetically regulated expression on phenotypes. NAR Genom Bioinform 2020; 2:lqaa010. [PMID: 32118202 PMCID: PMC7034630 DOI: 10.1093/nargab/lqaa010] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 08/14/2019] [Revised: 01/08/2020] [Accepted: 02/05/2020] [Indexed: 12/20/2022]
Abstract
By leveraging existing GWAS and eQTL resources, transcriptome-wide association studies (TWAS) have achieved many successes in identifying trait-associations of genetically regulated expression (GREX) levels. TWAS analysis relies on the shared GREX variation across GWAS and the reference eQTL data, which depends on the cellular conditions of the eQTL data. Considering the increasing availability of eQTL data from different conditions and the often unknown trait-relevant cell/tissue-types, we propose a method and tool, IGREX, for precisely quantifying the proportion of phenotypic variation attributed to the GREX component. IGREX takes as input a reference eQTL panel and individual-level or summary-level GWAS data. Using eQTL data of 48 tissue types from the GTEx project as a reference panel, we evaluated the tissue-specific IGREX impact on a wide spectrum of phenotypes. We observed strong GREX effects on immune-related protein biomarkers. By incorporating trans-eQTLs and analyzing genetically regulated alternative splicing events, we evaluated new potential directions for TWAS analysis.
Affiliation(s)
- Mingxuan Cai
- Department of Mathematics, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
| | - Lin S Chen
- Department of Public Health Sciences, The University of Chicago, IL 60637, USA
| | - Jin Liu
- Center for Quantitative Medicine, Duke-NUS Medical School, 169856, Singapore
| | - Can Yang
- Department of Mathematics, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
| |
|
10
|
Dai M, Liu J, Yang C. LEP: A Statistical Method Integrating Individual-Level and Summary-Level Data of the Same Trait From Different Populations. Biomedical Informatics Insights 2019; 11:1178222619881624. [PMID: 31666794 PMCID: PMC6798161 DOI: 10.1177/1178222619881624] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2019] [Accepted: 09/19/2019] [Indexed: 11/24/2022]
Abstract
Statistical approaches for integrating multiple data sets in genome-wide association studies (GWASs) are increasingly important. Proper utilization of more relevant information is expected to improve statistical efficiency in the analysis. Among these approaches, LEP was proposed for joint analysis of individual-level data and summary-level data in the same population by leveraging pleiotropy. The key idea of LEP is to explore correlation of the association status among different data sets while accounting for the heterogeneity. In this commentary, we show that LEP is applicable to integrate individual-level data and summary-level data of the same trait from different populations, providing new insights into the genetic architecture of different populations.
Affiliation(s)
- Mingwei Dai
- Center of Statistical Research and School of Statistics, Southwestern University of Finance and Economics, Chengdu, China
| | - Jin Liu
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore
| | - Can Yang
- Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong, China
| |
|
11
|
Yang Y, Dai M, Huang J, Lin X, Yang C, Chen M, Liu J. LPG: A four-group probabilistic approach to leveraging pleiotropy in genome-wide association studies. BMC Genomics 2018; 19:503. [PMID: 29954342 PMCID: PMC6022345 DOI: 10.1186/s12864-018-4851-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 11/02/2017] [Accepted: 06/04/2018] [Indexed: 01/30/2023]
Abstract
Background To date, genome-wide association studies (GWAS) have successfully identified tens of thousands of genetic variants among a variety of traits/diseases, shedding light on the genetic architecture of complex disease. The polygenicity of complex diseases is a widely accepted phenomenon through which a vast number of risk variants, each with a modest individual effect, collectively contribute to the heritability of complex diseases. This imposes a major challenge on fully characterizing the genetic bases of complex diseases. An immediate implication of polygenicity is that a much larger sample size is required to detect individual risk variants with weak/moderate effects. Meanwhile, accumulating evidence suggests that different complex diseases can share genetic risk variants, a phenomenon known as pleiotropy. Results In this study, we propose a statistical framework for Leveraging Pleiotropic effects in large-scale GWAS data (LPG). LPG utilizes a variational Bayesian expectation-maximization (VBEM) algorithm, making it computationally efficient and scalable for genome-wide-scale analysis. To demonstrate the advantages of LPG over existing methods that do not leverage pleiotropy, we conducted extensive simulation studies and applied LPG to analyze two pairs of disorders (Crohn’s disease and Type 1 diabetes, as well as rheumatoid arthritis and Type 1 diabetes). The results indicate that by leveraging pleiotropy, LPG can improve the power of prioritization of risk variants and the accuracy of risk prediction. Conclusions Our methodology provides a novel and efficient tool to detect pleiotropy among GWAS data for multiple traits/diseases collected from different studies. The software is available at https://github.com/Shufeyangyi2015310117/LPG. Electronic supplementary material The online version of this article (10.1186/s12864-018-4851-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Yi Yang
- School of Statistics and Management, The Shanghai University of Finance and Economics, Guoding Road, Shanghai, China.,Centre for Quantitative Medicine, Duke-NUS Medical School, 8 College Road, Singapore, Singapore
| | - Mingwei Dai
- Institute for Information and System Sciences, Xian Jiaotong University, No.28, Xianning West Road, Xi'an, China.,Department of Mathematics, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
| | - Jian Huang
- Department of Applied Mathematics, Hong Kong Polytechnics University, Hung Hom, Hong Kong, China
| | - Xinyi Lin
- Centre for Quantitative Medicine, Duke-NUS Medical School, 8 College Road, Singapore, Singapore
| | - Can Yang
- Department of Mathematics, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
| | - Min Chen
- School of Statistics and Management, The Shanghai University of Finance and Economics, Guoding Road, Shanghai, China
| | - Jin Liu
- Centre for Quantitative Medicine, Duke-NUS Medical School, 8 College Road, Singapore, Singapore.
| |
|