1
Lee D, Bacanu SA. GAUSS: a summary-statistics-based R package for accurate estimation of linkage disequilibrium for variants, Gaussian imputation, and TWAS analysis of cosmopolitan cohorts. Bioinformatics 2024; 40:btae203. [PMID: 38632050 PMCID: PMC11052653 DOI: 10.1093/bioinformatics/btae203] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Revised: 02/25/2024] [Accepted: 04/16/2024] [Indexed: 04/19/2024] Open
Abstract
MOTIVATION As the availability of larger and more ethnically diverse reference panels grows, demand is increasing for ancestry-informed imputation of genome-wide association studies (GWAS) and for other downstream analyses, e.g. fine-mapping. Performing such analyses at the genotype level is computationally challenging and necessitates, at best, a laborious process to access individual-level genotype and phenotype data. Summary-statistics-based tools, which do not require individual-level data, provide an efficient alternative that streamlines computational requirements and promotes open science by simplifying the re-analysis and downstream analysis of existing GWAS summary data. However, existing tools perform only disparate parts of the needed analysis, have only command-line interfaces, and are difficult for applied researchers to extend or link.
RESULTS To address these challenges, we present Genome Analysis Using Summary Statistics (GAUSS), a comprehensive and user-friendly R package designed to facilitate the re-analysis/downstream analysis of GWAS summary statistics. GAUSS offers an integrated toolkit for a range of functionalities, including (i) estimating ancestry proportion of study cohorts, (ii) calculating ancestry-informed linkage disequilibrium, (iii) imputing summary statistics of unobserved variants, (iv) conducting transcriptome-wide association studies, and (v) correcting for "Winner's Curse" biases. Notably, GAUSS utilizes an expansive, multi-ethnic reference panel consisting of 32 953 genomes from 29 ethnic groups. This panel enhances the range and accuracy of imputable variants, including the ability to impute summary statistics of rarer variants. As a result, GAUSS elevates the quality and applicability of existing GWAS analyses without requiring access to subject-level genotypic and phenotypic information.
AVAILABILITY AND IMPLEMENTATION The GAUSS R package, complete with its source code, is readily accessible to the public via our GitHub repository at https://github.com/statsleelab/gauss. To further assist users, we provided illustrative use-case scenarios that are conveniently found at https://statsleelab.github.io/gauss/, along with a comprehensive user guide detailed in Supplementary Text S1.
Affiliation(s)
- Donghyung Lee
  - Department of Statistics, Miami University, Oxford, OH 45056, United States
- Silviu-Alin Bacanu
  - Department of Psychiatry, Virginia Commonwealth University, Richmond, VA 23298, United States
2
Moore A, Marks JA, Quach BC, Guo Y, Bierut LJ, Gaddis NC, Hancock DB, Page GP, Johnson EO. Evaluating 17 methods incorporating biological function with GWAS summary statistics to accelerate discovery demonstrates a tradeoff between high sensitivity and high positive predictive value. Commun Biol 2023; 6:1199. [PMID: 38001305 PMCID: PMC10673847 DOI: 10.1038/s42003-023-05413-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2022] [Accepted: 10/03/2023] [Indexed: 11/26/2023] Open
Abstract
Where sufficiently large genome-wide association study (GWAS) samples are not currently available or feasible, methods that leverage increasing knowledge of the biological function of variants may illuminate discoveries without increasing sample size. We comprehensively evaluated 17 functional weighting methods for identifying novel associations. We assessed the performance of these methods using published results from multiple GWAS waves across each of five complex traits. Although no method achieved both high sensitivity and high positive predictive value (PPV) for any trait, a subset of methods utilizing pleiotropy and expression quantitative trait loci nominated variants with high PPV (>75%) for multiple traits. Applying functional weighting methods to enhance GWAS power for locus discovery is unlikely to circumvent the need for larger sample sizes in truly underpowered GWAS, but these results suggest that functional weighting can accurately nominate additional novel loci from available samples for follow-up studies.
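The two evaluation metrics named in this abstract can be illustrated with a minimal sketch. It assumes that loci nominated by a weighting method in an earlier GWAS wave are scored against loci confirmed in a later, larger wave; the function name and the toy locus labels are hypothetical and are not taken from the paper.

```python
def sensitivity_and_ppv(nominated, confirmed):
    """Score nominated loci against confirmed loci.

    sensitivity = confirmed loci that were nominated / all confirmed loci
    PPV         = nominated loci that were confirmed / all nominated loci
    """
    nominated, confirmed = set(nominated), set(confirmed)
    true_pos = len(nominated & confirmed)
    sensitivity = true_pos / len(confirmed) if confirmed else float("nan")
    ppv = true_pos / len(nominated) if nominated else float("nan")
    return sensitivity, ppv


# Toy data (made up): a conservative method nominates few loci, most of them real.
confirmed = {"L1", "L2", "L3", "L4", "L5", "L6", "L7", "L8"}
nominated = {"L1", "L2", "L3", "X9"}
sens, ppv = sensitivity_and_ppv(nominated, confirmed)
print(f"sensitivity={sens:.3f}, PPV={ppv:.3f}")
```

In this toy case the method reaches a PPV of 0.75 while recovering only 3 of 8 confirmed loci, which is the kind of tradeoff the abstract describes.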
Affiliation(s)
- Amy Moore
  - Genomics and Translational Research Center, RTI International, Research Triangle Park, NC, 27709, USA
- Jesse A Marks
  - Genomics and Translational Research Center, RTI International, Research Triangle Park, NC, 27709, USA
- Bryan C Quach
  - Genomics and Translational Research Center, RTI International, Research Triangle Park, NC, 27709, USA
- Yuelong Guo
  - GeneCentric Therapeutics, Inc., Cary, NC, USA
- Laura J Bierut
  - Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, USA
- Nathan C Gaddis
  - Genomics and Translational Research Center, RTI International, Research Triangle Park, NC, 27709, USA
- Dana B Hancock
  - Genomics and Translational Research Center, RTI International, Research Triangle Park, NC, 27709, USA
- Grier P Page
  - Genomics and Translational Research Center, RTI International, Research Triangle Park, NC, 27709, USA
  - Fellow Program, RTI International, Research Triangle Park, NC, 27709, USA
- Eric O Johnson
  - Genomics and Translational Research Center, RTI International, Research Triangle Park, NC, 27709, USA
  - Fellow Program, RTI International, Research Triangle Park, NC, 27709, USA
3
Gedik H, Peterson RE, Riley BP, Vladimirov VI, Bacanu SA. Integrative Post-Genome-Wide Association Study Analyses Relevant to Psychiatric Disorders: Imputing Transcriptome and Proteome Signals. Complex Psychiatry 2023; 9:130-144. [PMID: 37588130 PMCID: PMC10425719 DOI: 10.1159/000530223] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2022] [Accepted: 03/09/2023] [Indexed: 08/18/2023] Open
Abstract
Background The genome-wide association study (GWAS) is a common tool to identify genetic variants associated with complex traits, including psychiatric disorders (PDs). However, post-GWAS analyses are needed to extend the statistical inference to biologically relevant entities, e.g., genes, proteins, and pathways. To achieve this goal, researchers developed methods that incorporate biologically relevant intermediate molecular phenotypes, such as gene expression and protein abundance, which are posited to mediate the variant-trait association. Transcriptome-wide association study (TWAS) and proteome-wide association study (PWAS) are commonly used methods to test the association between these molecular mediators and the trait.
Summary In this review, we discuss the most recent developments in TWAS and PWAS. These methods integrate existing "omic" information with the GWAS summary statistics for trait(s) of interest. Specifically, they impute transcript/protein data and test the association between the imputed gene expression/protein level and the phenotype of interest by using (i) GWAS summary statistics and (ii) reference transcriptomic/proteomic/genomic datasets. TWAS and PWAS are suitable as analysis tools for (i) primary association scan and (ii) fine-mapping to identify potentially causal genes for PDs.
Key Messages As post-GWAS analyses, TWAS and PWAS have the potential to highlight causal genes for PDs. These prioritized genes could indicate targets for the development of novel drug therapies. For researchers attempting such analyses, we recommend Mendelian randomization tools that use GWAS statistics for both trait and reference datasets, e.g., summary Mendelian randomization (SMR). We base our recommendation on three considerations: (i) the same tool can be used for both TWAS and PWAS, (ii) pre-computed weights are not required (which makes it easier to update for larger reference datasets), and (iii) most larger transcriptome reference datasets are publicly available and easy to transform into a compatible format for SMR analysis.
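To make the SMR recommendation concrete, the sketch below computes the approximate SMR test statistic as it is commonly stated in the SMR literature, using only the GWAS z-score and the eQTL (or pQTL) z-score of the top cis variant. The formula is not given in the abstract, so treat it as an assumption to check against the SMR documentation; the function name `smr_test` and the example z-scores are hypothetical.

```python
import math


def smr_test(z_gwas, z_eqtl):
    """Approximate SMR statistic for one gene/protein and its top cis variant.

    T_SMR = (z_gwas^2 * z_eqtl^2) / (z_gwas^2 + z_eqtl^2),
    treated as approximately chi-square with 1 degree of freedom under the null.
    """
    denom = z_gwas ** 2 + z_eqtl ** 2
    if denom == 0.0:
        return 0.0, 1.0  # no signal in either dataset
    t_smr = (z_gwas ** 2) * (z_eqtl ** 2) / denom
    # Upper tail of chi-square(1): P(X > t) = erfc(sqrt(t / 2))
    p_value = math.erfc(math.sqrt(t_smr / 2.0))
    return t_smr, p_value


# Made-up z-scores: a moderate GWAS signal at a strong eQTL.
t_smr, p_value = smr_test(z_gwas=4.0, z_eqtl=10.0)
print(f"T_SMR={t_smr:.3f}, p={p_value:.2e}")
```

Because the statistic can never exceed the smaller of the two squared z-scores, a weak instrument (small eQTL z-score) caps the evidence regardless of how strong the GWAS signal is.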
Affiliation(s)
- Huseyin Gedik
  - Integrative Life Sciences, Virginia Institute of Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, VA, USA
- Roseann E. Peterson
  - Institute for Genomics in Health, SUNY Downstate Health Sciences University, Brooklyn, NY, USA
- Brien P. Riley
  - Institute for Genomics in Health, SUNY Downstate Health Sciences University, Brooklyn, NY, USA
- Vladimir I. Vladimirov
  - Department of Psychiatry, College of Medicine-Phoenix, University of Arizona, Phoenix, AZ, USA
- Silviu-Alin Bacanu
  - Institute for Genomics in Health, SUNY Downstate Health Sciences University, Brooklyn, NY, USA
4
Gazal S, Weissbrod O, Hormozdiari F, Dey KK, Nasser J, Jagadeesh KA, Weiner DJ, Shi H, Fulco CP, O'Connor LJ, Pasaniuc B, Engreitz JM, Price AL. Combining SNP-to-gene linking strategies to identify disease genes and assess disease omnigenicity. Nat Genet 2022; 54:827-836. [PMID: 35668300 PMCID: PMC9894581 DOI: 10.1038/s41588-022-01087-y] [Citation(s) in RCA: 48] [Impact Index Per Article: 24.0] [Received: 07/29/2021] [Accepted: 04/27/2022] [Indexed: 02/04/2023]
Abstract
Disease-associated single-nucleotide polymorphisms (SNPs) generally do not implicate target genes, as most disease SNPs are regulatory. Many SNP-to-gene (S2G) linking strategies have been developed to link regulatory SNPs to the genes that they regulate in cis. Here, we developed a heritability-based framework for evaluating and combining different S2G strategies to optimize their informativeness for common disease risk. Our optimal combined S2G strategy (cS2G) included seven constituent S2G strategies and achieved a precision of 0.75 and a recall of 0.33, more than doubling the recall of any individual strategy. We applied cS2G to fine-mapping results for 49 UK Biobank diseases/traits to predict 5,095 causal SNP-gene-disease triplets (with S2G-derived functional interpretation) with high confidence. We further applied cS2G to provide an empirical assessment of disease omnigenicity; we determined that the top 1% of genes explained roughly half of the SNP heritability linked to all genes and that gene-level architectures vary with variant allele frequency.
Affiliation(s)
- Steven Gazal
  - Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
  - Center for Genetic Epidemiology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Omer Weissbrod
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Farhad Hormozdiari
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Kushal K Dey
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Joseph Nasser
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Karthik A Jagadeesh
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Huwenbo Shi
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Charles P Fulco
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
  - Bristol Myers Squibb, Cambridge, MA, USA
- Bogdan Pasaniuc
  - Departments of Computational Medicine, Human Genetics, Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Jesse M Engreitz
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
  - BASE Initiative, Betty Irene Moore Children's Heart Center, Lucile Packard Children's Hospital, Stanford University School of Medicine, Stanford, CA, USA
- Alkes L Price
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
5
Xie Y, Shan N, Zhao H, Hou L. Transcriptome wide association studies: general framework and methods. Quantitative Biology 2021. [DOI: 10.15302/j-qb-020-0228] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
6
Wang L, Xia Y, Chen Y, Dai R, Qiu W, Meng Q, Kuney L, Chen C. Brain Banks Spur New Frontiers in Neuropsychiatric Research and Strategies for Analysis and Validation. Genomics, Proteomics & Bioinformatics 2019; 17:402-414. [PMID: 31811942 PMCID: PMC6943778 DOI: 10.1016/j.gpb.2019.02.002] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 01/11/2019] [Revised: 02/13/2019] [Accepted: 03/01/2019] [Indexed: 12/27/2022]
Abstract
Neuropsychiatric disorders affect hundreds of millions of patients and families worldwide. To decode the molecular framework of these diseases, many studies use human postmortem brain samples. These studies reveal brain-specific genetic and epigenetic patterns via high-throughput sequencing technologies. Identifying best practices for collecting postmortem brain samples, analyzing the resulting large amounts of sequencing data, and interpreting the results is critical to advancing neuropsychiatry. We provide an overview of human brain banks worldwide, including progress in China, highlighting some well-known projects using human postmortem brain samples to understand molecular regulation in both normal brains and those with neuropsychiatric disorders. Finally, we discuss future research strategies, as well as state-of-the-art statistical and experimental methods that draw upon brain bank resources to improve our understanding of the agents of neuropsychiatric disorders.
Affiliation(s)
- Le Wang
  - Center for Medical Genetics & Hunan Key Laboratory of Medical Genetics, School of Life Sciences, Central South University, Changsha 410078, China; Child Health Institute of New Jersey, Department of Neuroscience, Rutgers Robert Wood Johnson Medical School, New Brunswick, NJ 08901, USA
- Yan Xia
  - Center for Medical Genetics & Hunan Key Laboratory of Medical Genetics, School of Life Sciences, Central South University, Changsha 410078, China; Psychiatry Department, SUNY Upstate Medical University, Syracuse, NY 13210, USA
- Yu Chen
  - Center for Medical Genetics & Hunan Key Laboratory of Medical Genetics, School of Life Sciences, Central South University, Changsha 410078, China
- Rujia Dai
  - Center for Medical Genetics & Hunan Key Laboratory of Medical Genetics, School of Life Sciences, Central South University, Changsha 410078, China; Psychiatry Department, SUNY Upstate Medical University, Syracuse, NY 13210, USA
- Wenying Qiu
  - Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100101, China
- Qingtuan Meng
  - Center for Medical Genetics & Hunan Key Laboratory of Medical Genetics, School of Life Sciences, Central South University, Changsha 410078, China; Affiliated Hospital of Guilin Medical University, Guilin 541000, China
- Liz Kuney
  - Psychiatry Department, SUNY Upstate Medical University, Syracuse, NY 13210, USA
- Chao Chen
  - Center for Medical Genetics & Hunan Key Laboratory of Medical Genetics, School of Life Sciences, Central South University, Changsha 410078, China; National Clinical Research Centre for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha 410000, China
7
Liu Y, Xie J. Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures. J Am Stat Assoc 2019; 115:393-402. [PMID: 33012899 DOI: 10.1080/01621459.2018.1554485] [Citation(s) in RCA: 153] [Impact Index Per Article: 30.6] [Indexed: 10/27/2022]
Abstract
Combining individual p-values to aggregate multiple small effects has long been of interest in statistics, dating back to the classic Fisher's combination test. In modern large-scale data analysis, correlation and sparsity are common features and efficient computation is a necessary requirement for dealing with massive data. To overcome these challenges, we propose a new test that takes advantage of the Cauchy distribution. Our test statistic has a simple form and is defined as a weighted sum of Cauchy transformations of individual p-values. We prove a non-asymptotic result that the tail of the null distribution of our proposed test statistic can be well approximated by a Cauchy distribution under arbitrary dependency structures. Based on this theoretical result, the p-value calculation of our proposed test is not only accurate, but also as simple as the classic z-test or t-test, making our test well suited for analyzing massive data. We further show that the power of the proposed test is asymptotically optimal in a strong sparsity setting. Extensive simulations demonstrate that the proposed test has both strong power against sparse alternatives and good accuracy with respect to p-value calculations, especially for very small p-values. The proposed test has also been applied to a genome-wide association study of Crohn's disease and compared with several existing tests.
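The "weighted sum of Cauchy transformation of individual p-values" can be sketched in a few lines. The version below assumes nonnegative weights that are rescaled to sum to one (equal weights by default) and uses the standard Cauchy tail for the combined p-value; the function name is hypothetical, and this is an illustration rather than the authors' reference implementation.

```python
import math


def cauchy_combination(pvalues, weights=None):
    """Combine p-values with the Cauchy combination statistic.

    T = sum_i w_i * tan((0.5 - p_i) * pi), with w_i >= 0 and sum_i w_i = 1.
    The combined p-value is the standard Cauchy upper tail: 0.5 - atan(T) / pi.
    """
    m = len(pvalues)
    if weights is None:
        weights = [1.0] * m
    total = sum(weights)
    weights = [w / total for w in weights]  # rescale so the weights sum to 1
    t_stat = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues))
    p_combined = 0.5 - math.atan(t_stat) / math.pi
    return t_stat, p_combined


# One very small p-value among unremarkable ones dominates the statistic.
t_stat, p_combined = cauchy_combination([1e-6, 0.5, 0.9])
print(f"T={t_stat:.1f}, combined p={p_combined:.2e}")
```

Note that this plain floating-point version loses accuracy for extremely small p-values, and inputs of exactly 0 or 1 should be handled separately in real use.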
Affiliation(s)
- Yaowu Liu
  - Department of Biostatistics, Harvard School of Public Health
- Jun Xie
  - Department of Statistics, Purdue University
8
Zhang H, Wheeler W, Song L, Yu K. Proper joint analysis of summary association statistics requires the adjustment of heterogeneity in SNP coverage pattern. Brief Bioinform 2018; 19:1337-1343. [PMID: 28981575 DOI: 10.1093/bib/bbx072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2017] [Indexed: 11/12/2022] Open
Abstract
As meta-analysis results published by consortia of genome-wide association studies (GWASs) become increasingly available, many association summary statistics-based multi-locus tests have been developed to jointly evaluate multiple single-nucleotide polymorphisms (SNPs) to reveal novel genetic architectures of various complex traits. The validity of these approaches relies on accurate estimates of z-score correlations at the considered SNPs, which in turn require knowledge of the set of SNPs assessed by each study participating in the meta-analysis. However, this exact SNP coverage information is usually unavailable from the meta-analysis results published by GWAS consortia. In the absence of the coverage information, researchers typically estimate the z-score correlations by making oversimplified coverage assumptions. We show through real studies that such a practice can generate highly inflated type I errors, and we demonstrate the proper way to incorporate correct coverage information into multi-locus analyses. We advocate that consortia should make SNP coverage information available when posting their meta-analysis results, and that investigators who develop analytic tools for joint analyses based on summary data should pay attention to the variation in SNP coverage and adjust for it appropriately.
Affiliation(s)
- Han Zhang
  - Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, USA
- Lei Song
  - Cancer Genomics Research Laboratory, Frederick National Laboratory for Cancer Research, Leidos Biomedical Research Inc., USA
- Kai Yu
  - Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, USA
9
Rüeger S, McDaid A, Kutalik Z. Evaluation and application of summary statistic imputation to discover new height-associated loci. PLoS Genet 2018; 14:e1007371. [PMID: 29782485 PMCID: PMC5983877 DOI: 10.1371/journal.pgen.1007371] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 12/01/2017] [Revised: 06/01/2018] [Accepted: 04/18/2018] [Indexed: 12/11/2022] Open
Abstract
As most of the heritability of complex traits is attributed to common and low frequency genetic variants, imputing them by combining genotyping chips and large sequenced reference panels is the most cost-effective approach to discover the genetic basis of these traits. Association summary statistics from genome-wide meta-analyses are available for hundreds of traits. Updating these to ever-increasing reference panels is very cumbersome as it requires reimputation of the genetic data, rerunning the association scan, and meta-analysing the results. A much more efficient method is to directly impute the summary statistics, termed summary statistics imputation, which we improved to accommodate variable sample size across SNVs. Its performance relative to genotype imputation and its practical utility have not yet been fully investigated. To this end, we compared the two approaches on real (genotyped and imputed) data from 120K samples from the UK Biobank and show that genotype imputation boasts a 3- to 5-fold lower root-mean-square error and better distinguishes true associations from null ones. We observed the largest differences in power for variants with low minor allele frequency and low imputation quality. For fixed false positive rates of 0.001, 0.01, 0.05, using summary statistics imputation yielded a decrease in statistical power by 9, 43 and 35%, respectively. To test its capacity to discover novel associations, we applied summary statistics imputation to the GIANT height meta-analysis summary statistics covering HapMap variants, and identified 34 novel loci, 19 of which replicated using data in the UK Biobank. Additionally, we successfully replicated 55 out of the 111 variants published in an exome chip study. Our study demonstrates that summary statistics imputation is a very efficient and cost-effective way to identify and fine-map trait-associated loci. Moreover, the ability to impute summary statistics is important for follow-up analyses, such as Mendelian randomisation or LD-score regression.
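The basic idea of summary statistics imputation can be sketched as a conditional expectation under a multivariate normal model: the z-score of an untyped variant is predicted from the observed z-scores and reference-panel LD. The sketch below is the textbook form only and does not include the variable-sample-size adjustment mentioned in the abstract; the function names, the optional `ridge` term, and the toy LD values are assumptions made for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x


def impute_z(z_obs, ld_obs, ld_cross, ridge=0.0):
    """Impute one untyped variant's z-score.

    z_obs    : z-scores of the typed variants
    ld_obs   : LD (correlation) matrix C among the typed variants
    ld_cross : correlations c between the untyped variant and each typed variant
    ridge    : optional value added to the diagonal of C for numerical stability
    Returns (z_imputed, r2) with z_imputed = c' C^{-1} z_obs and r2 = c' C^{-1} c.
    """
    n = len(z_obs)
    c_reg = [[ld_obs[i][j] + (ridge if i == j else 0.0) for j in range(n)] for i in range(n)]
    w = solve(c_reg, ld_cross)  # w = C^{-1} c (C is symmetric)
    z_imp = sum(wi * zi for wi, zi in zip(w, z_obs))
    r2 = sum(wi * ci for wi, ci in zip(w, ld_cross))
    return z_imp, r2


# Toy example: two uncorrelated typed variants, untyped variant tagged by the first.
z_imp, r2 = impute_z([3.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], [0.8, 0.0])
print(f"imputed z={z_imp:.2f}, imputation quality r2={r2:.2f}")
```

The `r2` value plays the role of an imputation quality score: when it is low, the imputed z-score is shrunk toward zero, which is consistent with the power loss the abstract reports for poorly tagged variants.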
Affiliation(s)
- Sina Rüeger
  - Institute of Social and Preventive Medicine, Lausanne University Hospital, Lausanne, 1010, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland
- Aaron McDaid
  - Institute of Social and Preventive Medicine, Lausanne University Hospital, Lausanne, 1010, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland
- Zoltán Kutalik
  - Institute of Social and Preventive Medicine, Lausanne University Hospital, Lausanne, 1010, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland
10
Chatzinakos C, Lee D, Webb BT, Vladimirov VI, Kendler KS, Bacanu SA. JEPEGMIX2: improved gene-level joint analysis of eQTLs in cosmopolitan cohorts. Bioinformatics 2018; 34:286-288. [PMID: 28968763 PMCID: PMC5860197 DOI: 10.1093/bioinformatics/btx509] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 04/13/2017] [Revised: 07/10/2017] [Accepted: 09/13/2017] [Indexed: 12/17/2022] Open
Abstract
Motivation To increase detection power, researchers use gene-level analysis methods to aggregate weak marker signals. Because gene expression controls biological processes, researchers proposed aggregating signals for expression Quantitative Trait Loci (eQTL). Most gene-level eQTL methods make statistical inferences based on (i) summary statistics from genome-wide association studies (GWAS) and (ii) linkage disequilibrium patterns from a relevant reference panel. While most such tools assume homogeneous cohorts, our Gene-level Joint Analysis of functional SNPs in Cosmopolitan Cohorts (JEPEGMIX) method accommodates cosmopolitan cohorts by using heterogeneous panels. However, JEPEGMIX relies on brain eQTLs from older gene expression studies and does not adjust for background enrichment in GWAS signals.
Results We propose JEPEGMIX2, an extension of JEPEGMIX. Compared with JEPEGMIX, it uses (i) cis-eQTL SNPs from the latest expression studies and (ii) brain-specific (sub)tissues and tissues other than brain. JEPEGMIX2 also (i) avoids accumulating averagely enriched polygenic information by adjusting for background enrichment and (ii) outputs gene q-values based on Holm adjustment of P-values, to avoid an increase in false positive rates for studies with numerous highly enriched (above the background) genes.
Availability and implementation https://github.com/Chatzinakos/JEPEGMIX2.
Supplementary information Supplementary data are available at Bioinformatics online.
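The Holm adjustment mentioned in this abstract is the standard step-down procedure, sketched below in generic form. This is not code from JEPEGMIX2; the function name and the example p-values are hypothetical, and the output corresponds to what the abstract calls gene q-values only under the assumption that they are plain Holm-adjusted P-values.

```python
def holm_adjust(pvalues):
    """Return Holm step-down adjusted p-values in the original input order.

    The k-th smallest p-value (k = 1..m) is multiplied by (m - k + 1), capped
    at 1, and a running maximum keeps the adjusted values monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):  # rank = 0 for the smallest p-value
        candidate = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Made-up gene-level p-values.
print(holm_adjust([0.01, 0.04, 0.03]))
```

For the three example p-values the multipliers are 3, 2 and 1 in rank order, and the running maximum lifts the largest p-value's adjusted value to match the second one.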
Affiliation(s)
- Chris Chatzinakos
  - Department of Psychiatry, Virginia Commonwealth University, Richmond, VA, USA
- Donghyung Lee
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Bradley T Webb
  - Department of Psychiatry, Virginia Commonwealth University, Richmond, VA, USA
- Kenneth S Kendler
  - Department of Psychiatry, Virginia Commonwealth University, Richmond, VA, USA
- Silviu-Alin Bacanu
  - Department of Psychiatry, Virginia Commonwealth University, Richmond, VA, USA
11
Benner C, Havulinna AS, Järvelin MR, Salomaa V, Ripatti S, Pirinen M. Prospects of Fine-Mapping Trait-Associated Genomic Regions by Using Summary Statistics from Genome-wide Association Studies. Am J Hum Genet 2017; 101:539-551. [PMID: 28942963 DOI: 10.1016/j.ajhg.2017.08.012] [Citation(s) in RCA: 129] [Impact Index Per Article: 18.4] [Received: 03/21/2017] [Accepted: 08/17/2017] [Indexed: 01/15/2023] Open
Abstract
During the past few years, various novel statistical methods have been developed for fine-mapping with the use of summary statistics from genome-wide association studies (GWASs). Although these approaches require information about the linkage disequilibrium (LD) between variants, there has not been a comprehensive evaluation of how estimation of the LD structure from reference genotype panels performs in comparison with that from the original individual-level GWAS data. Using population genotype data from Finland and the UK Biobank, we show here that a reference panel of 1,000 individuals from the target population is adequate for a GWAS cohort of up to 10,000 individuals, whereas smaller panels, such as those from the 1000 Genomes Project, should be avoided. We also show, both theoretically and empirically, that the size of the reference panel needs to scale with the GWAS sample size; this has important consequences for the application of these methods in ongoing GWAS meta-analyses and large biobank studies. We conclude by providing software tools and by recommending practices for sharing LD information to more efficiently exploit summary statistics in genetics research.
Affiliation(s)
- Christian Benner
  - Institute for Molecular Medicine Finland, University of Helsinki, 00014 Helsinki, Finland; Department of Public Health, University of Helsinki, 00014 Helsinki, Finland
- Aki S Havulinna
  - Institute for Molecular Medicine Finland, University of Helsinki, 00014 Helsinki, Finland; National Institute for Health and Welfare, 00271 Helsinki, Finland
- Marjo-Riitta Järvelin
  - Center for Life-Course Health Research and Northern Finland Cohort Center, Biocenter Oulu, University of Oulu, 90014 Oulu, Finland; Faculty of Medicine, University of Oulu, 90014 Oulu, Finland; Unit of Primary Care, Oulu University Hospital, 90220 Oulu, Finland; Department of Epidemiology and Biostatistics, School of Public Health, Faculty of Medicine, Imperial College London, W2 1PG, UK
- Veikko Salomaa
  - National Institute for Health and Welfare, 00271 Helsinki, Finland
- Samuli Ripatti
  - Institute for Molecular Medicine Finland, University of Helsinki, 00014 Helsinki, Finland; Department of Public Health, University of Helsinki, 00014 Helsinki, Finland; Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, Cambridge, UK
- Matti Pirinen
  - Institute for Molecular Medicine Finland, University of Helsinki, 00014 Helsinki, Finland; Department of Public Health, University of Helsinki, 00014 Helsinki, Finland; Helsinki Institute for Information Technology and Department of Mathematics and Statistics, University of Helsinki, 00014 Helsinki, Finland
12
Zhu X, Stephens M. Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. Ann Appl Stat 2017; 11:1561-1592. [PMID: 29399241 PMCID: PMC5796536 DOI: 10.1214/17-aoas1046] [Citation(s) in RCA: 78] [Impact Index Per Article: 11.1] [Indexed: 07/23/2023]
Abstract
Bayesian methods for large-scale multiple regression provide attractive approaches to the analysis of genome-wide association studies (GWAS). For example, they can estimate heritability of complex traits, allowing for both polygenic and sparse models; and by incorporating external genomic data into the priors, they can increase power and yield new biological insights. However, these methods require access to individual genotypes and phenotypes, which are often not easily available. Here we provide a framework for performing these analyses without individual-level data. Specifically, we introduce a "Regression with Summary Statistics" (RSS) likelihood, which relates the multiple regression coefficients to univariate regression results that are often easily available. The RSS likelihood requires estimates of correlations among covariates (SNPs), which also can be obtained from public databases. We perform Bayesian multiple regression analysis by combining the RSS likelihood with previously proposed prior distributions, sampling posteriors by Markov chain Monte Carlo. In a wide range of simulations RSS performs similarly to analyses using the individual data, both for estimating heritability and detecting associations. We apply RSS to a GWAS of human height that contains 253,288 individuals typed at 1.06 million SNPs, for which analyses of individual-level data are practically impossible. Estimates of heritability (52%) are consistent with, but more precise than, previous results using subsets of these data. We also identify many previously unreported loci that show evidence for association with height in our analyses. Software is available at https://github.com/stephenslab/rss.
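For orientation, the RSS likelihood is usually written in the following form. The abstract does not state it, so the notation here is an assumption to be checked against the paper: $\hat{\beta}$ collects the univariate effect estimates, $\beta$ the multiple regression coefficients, $\hat{S}$ is the diagonal matrix of their standard errors, and $\hat{R}$ is the estimated SNP correlation (LD) matrix.

```latex
\hat{\beta} \mid \hat{S}, \hat{R}, \beta \;\sim\;
\mathcal{N}\!\left( \hat{S}\,\hat{R}\,\hat{S}^{-1}\beta,\; \hat{S}\,\hat{R}\,\hat{S} \right),
\qquad
\hat{S} = \operatorname{diag}(\hat{s}_1, \dots, \hat{s}_p)
```

Read this way, each univariate estimate mixes the true effects of all correlated SNPs through $\hat{R}$, which is why the likelihood needs only summary statistics and an LD estimate rather than individual genotypes.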
|
13
|
Gordon D, Londono D, Patel P, Kim W, Finch SJ, Heiman GA. An Analytic Solution to the Computation of Power and Sample Size for Genetic Association Studies under a Pleiotropic Mode of Inheritance. Hum Hered 2017; 81:194-209. [PMID: 28315880 DOI: 10.1159/000457135] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/01/2016] [Accepted: 01/20/2017] [Indexed: 01/14/2023] Open
Abstract
Our motivation here is to calculate the power of 3 statistical tests used when there are genetic traits that operate under a pleiotropic mode of inheritance and when qualitative phenotypes are defined by use of thresholds for the multiple quantitative phenotypes. Specifically, we formulate a multivariate function that provides the probability that an individual has a vector of specific quantitative trait values conditional on having a risk locus genotype, and we apply thresholds to define qualitative phenotypes (affected, unaffected) and compute penetrances and conditional genotype frequencies based on the multivariate function. We extend the analytic power and minimum-sample-size-necessary (MSSN) formulas for 2 categorical data-based tests (genotype, linear trend test [LTT]) of genetic association to the pleiotropic model. We further compare the MSSN of the genotype test and the LTT with that of a multivariate ANOVA (Pillai). We approximate the MSSN for statistics by linear models using a factorial design and ANOVA. With ANOVA decomposition, we determine which factors most significantly change the power/MSSN for all statistics. Finally, we determine which test statistics have the smallest MSSN. In this work, MSSN calculations are for 2 traits (bivariate distributions) only (for illustrative purposes). We note that the calculations may be extended to address any number of traits. Our key findings are that the genotype test usually has lower MSSN requirements than the LTT. More inclusive thresholds (top/bottom 25% vs. top/bottom 10%) have higher sample size requirements. The Pillai test has a much larger MSSN than both the genotype test and the LTT, as a result of sample selection. With these formulas, researchers can specify how many subjects they must collect to localize genes for pleiotropic phenotypes.
Affiliation(s)
- Derek Gordon
- Department of Genetics, The State University of New Jersey, Piscataway, NJ, USA
|
14
|
Dissecting the genetics of complex traits using summary association statistics. Nat Rev Genet 2016; 18:117-127. [PMID: 27840428 DOI: 10.1038/nrg.2016.142] [Citation(s) in RCA: 248] [Impact Index Per Article: 31.0] [Indexed: 02/07/2023]
Abstract
During the past decade, genome-wide association studies (GWAS) have been used to successfully identify tens of thousands of genetic variants associated with complex traits and diseases. These studies have produced extensive repositories of genetic variation and trait measurements across large numbers of individuals, providing tremendous opportunities for further analyses. However, privacy concerns and other logistical considerations often limit access to individual-level genetic data, motivating the development of methods that analyse summary association statistics. Here, we review recent progress on statistical methods that leverage summary association data to gain insights into the genetic basis of complex traits and diseases.
|
15
|
Liu J, Wan X, Ma S, Yang C. EPS: an empirical Bayes approach to integrating pleiotropy and tissue-specific information for prioritizing risk genes. Bioinformatics 2016; 32:1856-64. [DOI: 10.1093/bioinformatics/btw081] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 05/15/2015] [Accepted: 02/05/2016] [Indexed: 12/12/2022] Open
Affiliation(s)
- Jin Liu
- Center of Quantitative Medicine, Duke-NUS Medical School, Singapore, Singapore
- Xiang Wan
- Department of Computer Science, Institute of Computational and Theoretical Studies, Hong Kong Baptist University, Kowloon, Hong Kong
- Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, CT, USA
- Can Yang
- Department of Mathematics, Hong Kong Baptist University, Kowloon, Hong Kong
|
16
|
Gusev A, Ko A, Shi H, Bhatia G, Chung W, Penninx BWJH, Jansen R, de Geus EJC, Boomsma DI, Wright FA, Sullivan PF, Nikkola E, Alvarez M, Civelek M, Lusis AJ, Lehtimäki T, Raitoharju E, Kähönen M, Seppälä I, Raitakari OT, Kuusisto J, Laakso M, Price AL, Pajukanta P, Pasaniuc B. Integrative approaches for large-scale transcriptome-wide association studies. Nat Genet 2016; 48:245-52. [PMID: 26854917 DOI: 10.1038/ng.3506] [Citation(s) in RCA: 1151] [Impact Index Per Article: 143.9] [Received: 06/22/2015] [Accepted: 01/14/2016] [Indexed: 02/07/2023]
Abstract
Many genetic variants influence complex traits by modulating gene expression, thus altering the abundance of one or multiple proteins. Here we introduce a powerful strategy that integrates gene expression measurements with summary association statistics from large-scale genome-wide association studies (GWAS) to identify genes whose cis-regulated expression is associated with complex traits. We leverage expression imputation from genetic data to perform a transcriptome-wide association study (TWAS) to identify significant expression-trait associations. We applied our approaches to expression data from blood and adipose tissue measured in ∼ 3,000 individuals overall. We imputed gene expression into GWAS data from over 900,000 phenotype measurements to identify 69 new genes significantly associated with obesity-related traits (BMI, lipids and height). Many of these genes are associated with relevant phenotypes in the Hybrid Mouse Diversity Panel. Our results showcase the power of integrating genotype, gene expression and phenotype to gain insights into the genetic basis of complex traits.
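The summary-statistics form of a TWAS test can be sketched in a few lines. The block below is a minimal illustration, not the authors' software: it assumes per-SNP expression weights `weights` (hypothetical values, as if fitted in an expression reference panel), GWAS z-scores `gwas_z` for the same SNPs, and an LD correlation matrix `ld`, and it computes the weighted sum of z-scores scaled by its standard deviation under the null. The function name `twas_z` is made up for this note.

```python
import math


def twas_z(weights, gwas_z, ld):
    """Return the gene-level z-score  w'z / sqrt(w' R w).

    weights : expression-prediction weight per SNP (hypothetical inputs)
    gwas_z  : GWAS z-score per SNP, same order as weights
    ld      : SNP-by-SNP correlation matrix as a list of lists
    """
    n = len(weights)
    if len(gwas_z) != n or len(ld) != n or any(len(row) != n for row in ld):
        raise ValueError("weights, gwas_z and ld must have matching dimensions")
    # Numerator: predicted-expression association, a weighted sum of z-scores.
    numerator = sum(w * z for w, z in zip(weights, gwas_z))
    # Denominator: variance of that weighted sum when SNPs are correlated by LD.
    variance = sum(weights[i] * ld[i][j] * weights[j]
                   for i in range(n) for j in range(n))
    if variance <= 0.0:
        raise ValueError("w' R w must be positive")
    return numerator / math.sqrt(variance)
```

With two SNPs in perfect LD the two z-scores carry the same information, so the gene-level statistic equals the shared z-score; with independent SNPs the same inputs give a larger statistic because the evidence accumulates.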
Affiliation(s)
- Alexander Gusev
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA; Program in Medical and Population Genetics, Broad Institute, Cambridge, Massachusetts, USA
- Arthur Ko
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, California, USA; Molecular Biology Institute, University of California, Los Angeles, Los Angeles, California, USA
- Huwenbo Shi
- Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, California, USA
- Gaurav Bhatia
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA; Program in Medical and Population Genetics, Broad Institute, Cambridge, Massachusetts, USA
- Wonil Chung
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
- Brenda W J H Penninx
- Department of Psychiatry, VU University Medical Center, Amsterdam, the Netherlands
- Rick Jansen
- Department of Psychiatry, VU University Medical Center, Amsterdam, the Netherlands
- Eco J C de Geus
- Department of Biological Psychology, VU University, Amsterdam, the Netherlands
- Dorret I Boomsma
- Department of Biological Psychology, VU University, Amsterdam, the Netherlands
- Fred A Wright
- Bioinformatics Research Center, Department of Statistics, Department of Biological Sciences, North Carolina State University, Raleigh, North Carolina, USA
- Patrick F Sullivan
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA; Department of Psychiatry, University of North Carolina, Chapel Hill, North Carolina, USA; Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
- Elina Nikkola
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, California, USA
- Marcus Alvarez
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, California, USA
- Mete Civelek
- Department of Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, California, USA
- Aldons J Lusis
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, California, USA; Department of Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, California, USA
- Terho Lehtimäki
- Department of Clinical Chemistry, Fimlab Laboratories and University of Tampere School of Medicine, Tampere, Finland
- Emma Raitoharju
- Department of Clinical Chemistry, Fimlab Laboratories and University of Tampere School of Medicine, Tampere, Finland
- Mika Kähönen
- Department of Clinical Physiology, Pirkanmaa Hospital District and University of Tampere School of Medicine, Tampere, Finland
- Ilkka Seppälä
- Department of Clinical Chemistry, Fimlab Laboratories and University of Tampere School of Medicine, Tampere, Finland
- Olli T Raitakari
- Research Centre of Applied and Preventive Cardiovascular Medicine, University of Turku, Turku, Finland; Department of Clinical Physiology and Nuclear Medicine, Turku University Hospital, Turku, Finland
- Johanna Kuusisto
- Department of Medicine, University of Eastern Finland and Kuopio University Hospital, Kuopio, Finland
- Markku Laakso
- Department of Medicine, University of Eastern Finland and Kuopio University Hospital, Kuopio, Finland
- Alkes L Price
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA; Program in Medical and Population Genetics, Broad Institute, Cambridge, Massachusetts, USA
- Päivi Pajukanta
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, California, USA; Molecular Biology Institute, University of California, Los Angeles, Los Angeles, California, USA
- Bogdan Pasaniuc
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, California, USA; Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, California, USA; Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, California, USA
|
17
|
Lamparter D, Marbach D, Rueedi R, Kutalik Z, Bergmann S. Fast and Rigorous Computation of Gene and Pathway Scores from SNP-Based Summary Statistics. PLoS Comput Biol 2016; 12:e1004714. [PMID: 26808494 PMCID: PMC4726509 DOI: 10.1371/journal.pcbi.1004714] [Citation(s) in RCA: 208] [Impact Index Per Article: 26.0] [Received: 06/16/2015] [Accepted: 12/17/2015] [Indexed: 12/17/2022] Open
Abstract
Integrating single nucleotide polymorphism (SNP) p-values from genome-wide association studies (GWAS) across genes and pathways is a strategy to improve statistical power and gain biological insight. Here, we present Pascal (Pathway scoring algorithm), a powerful tool for computing gene and pathway scores from SNP-phenotype association summary statistics. For gene score computation, we implemented analytic and efficient numerical solutions to calculate test statistics. We examined in particular the sum and the maximum of chi-squared statistics, which measure the average and the strongest association signals per gene, respectively. For pathway scoring, we use a modified Fisher method, which not only offers significant power improvement over more traditional enrichment strategies, but also eliminates the problem of arbitrary threshold selection inherent in any binary membership based pathway enrichment approach. We demonstrate the marked increase in power by analyzing summary statistics from dozens of large meta-studies for various traits. Our extensive testing indicates that our method not only excels in rigorous type I error control, but also results in more biologically meaningful discoveries.
Genome-wide association studies (GWAS) typically generate lists of trait- or disease-associated SNPs. Yet, such output sheds little light on the underlying molecular mechanisms, and tools are needed to extract biological insight from the results at the SNP level. Pathway analysis tools integrate signals from multiple SNPs at various positions in the genome in order to map associated genomic regions to well-established pathways, i.e., sets of genes known to act in concert. The nature of GWAS association results requires specifically tailored methods for this task. Here, we present Pascal (Pathway scoring algorithm), a tool that allows gene and pathway-level analysis of GWAS association results without the need to access the original genotypic data. Pascal was designed to be fast, accurate and to have high power to detect relevant pathways. We extensively tested our approach on a large collection of real GWAS association results and saw better discovery of confirmed pathways than with other popular methods. We believe that these results together with the ease-of-use of our publicly available software will allow Pascal to become a useful addition to the toolbox of the GWAS community.
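As background for the pathway-scoring step, the classical Fisher method that Pascal modifies can be sketched as follows. This is the textbook version, not Pascal's modified procedure: it assumes the gene-level p-values being combined are independent, and the function names `chi2_sf_even_df` and `fisher_combined_p` are made up for this note. The combined statistic is minus twice the sum of log p-values, compared against a chi-squared distribution with 2k degrees of freedom, whose tail has a closed form when the degrees of freedom are even.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable with even df.

    Uses the closed form exp(-x/2) * sum_{i < df/2} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    if x < 0:
        raise ValueError("x must be non-negative")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combined_p(p_values):
    """Combine independent p-values; returns (statistic, combined p-value)."""
    if not p_values or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    stat = -2.0 * sum(math.log(p) for p in p_values)
    return stat, chi2_sf_even_df(stat, 2 * len(p_values))
```

Because every gene contributes its full p-value, no significance threshold has to be chosen to decide which genes count as pathway members, which is the property the abstract contrasts with binary enrichment approaches.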
Affiliation(s)
- David Lamparter
- Department of Medical Genetics, University of Lausanne, Lausanne, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Daniel Marbach
- Department of Medical Genetics, University of Lausanne, Lausanne, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Rico Rueedi
- Department of Medical Genetics, University of Lausanne, Lausanne, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Zoltán Kutalik
- Department of Medical Genetics, University of Lausanne, Lausanne, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Institute of Social and Preventive Medicine (IUMSP), Centre Hospitalier Universitaire Vaudois (CHUV), Lausanne, Switzerland
- Sven Bergmann
- Department of Medical Genetics, University of Lausanne, Lausanne, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
|
18
|
Lee D, Williamson VS, Bigdeli TB, Riley BP, Webb BT, Fanous AH, Kendler KS, Vladimirov VI, Bacanu SA. JEPEGMIX: gene-level joint analysis of functional SNPs in cosmopolitan cohorts. Bioinformatics 2016; 32:295-7. [PMID: 26428293 PMCID: PMC4708106 DOI: 10.1093/bioinformatics/btv567] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 07/13/2015] [Revised: 09/01/2015] [Accepted: 09/22/2015] [Indexed: 12/26/2022] Open
Abstract
MOTIVATION To increase detection power, gene-level analysis methods are used to aggregate weak signals. To greatly increase computational efficiency, most methods use as input summary statistics from genome-wide association studies (GWAS). Subsequently, gene statistics are constructed using linkage disequilibrium (LD) patterns from a relevant reference panel. However, all methods, including our own Joint Effect on Phenotype of eQTL/functional single nucleotide polymorphisms (SNPs) associated with a Gene (JEPEG), assume homogeneous panels, e.g. European, which renders these tools unsuitable for the analysis of large cosmopolitan cohorts. RESULTS We propose a JEPEG extension, JEPEGMIX, which, like one of our other software tools, Direct Imputation of summary STatistics of unmeasured SNPs from MIXed ethnicity cohorts (DISTMIX), is capable of estimating accurate LD patterns for cosmopolitan cohorts. JEPEGMIX uses these accurate LD estimates to (i) impute the summary statistics at unmeasured functional variants and (ii) test for the joint effect of all measured and imputed functional variants which are associated with a gene. We illustrate the performance of our tool by analyzing the GWAS meta-analysis summary statistics from the multi-ethnic Psychiatric Genomics Consortium Schizophrenia stage 2 cohort. This practical application supports the immune system being one of the main drivers of the process leading to schizophrenia. AVAILABILITY AND IMPLEMENTATION Software, annotation database and examples are available at http://dleelab.github.io/jepegmix/. CONTACT donghyung.lee@vcuhealth.org SUPPLEMENTARY INFORMATION Supplementary material is available at Bioinformatics online.
Affiliation(s)
- Donghyung Lee
- Department of Psychiatry, Virginia Commonwealth University, Richmond, VA 23298, USA
- Vernell S Williamson
- Department of Psychiatry, Virginia Commonwealth University, Richmond, VA 23298, USA
- T Bernard Bigdeli
- Department of Psychiatry, Virginia Commonwealth University, Richmond, VA 23298, USA
- Brien P Riley
- Department of Psychiatry, Virginia Commonwealth University, Richmond, VA 23298, USA
- Bradley T Webb
- Department of Psychiatry, Virginia Commonwealth University, Richmond, VA 23298, USA
- Ayman H Fanous
- Department of Psychiatry, Virginia Commonwealth University, Richmond, VA 23298, USA
- Kenneth S Kendler
- Department of Psychiatry, Virginia Commonwealth University, Richmond, VA 23298, USA
- Silviu-Alin Bacanu
- Department of Psychiatry, Virginia Commonwealth University, Richmond, VA 23298, USA
|
19
|
Lee D, Bigdeli TB, Williamson VS, Vladimirov VI, Riley BP, Fanous AH, Bacanu SA. DISTMIX: direct imputation of summary statistics for unmeasured SNPs from mixed ethnicity cohorts. Bioinformatics 2015; 31:3099-104. [PMID: 26059716 PMCID: PMC4576696 DOI: 10.1093/bioinformatics/btv348] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 09/23/2014] [Accepted: 05/29/2015] [Indexed: 01/09/2023] Open
Abstract
Motivation: To increase the signal resolution for large-scale meta-analyses of genome-wide association studies, genotypes at unmeasured single nucleotide polymorphisms (SNPs) are commonly imputed using large multi-ethnic reference panels. However, the ever increasing size and ethnic diversity of both reference panels and cohorts make genotype imputation computationally challenging for moderately sized computer clusters. Moreover, genotype imputation requires subject-level genetic data, which, unlike summary statistics provided by virtually all studies, is not publicly available. While there are much less demanding methods which avoid the genotype imputation step by directly imputing SNP statistics, e.g. Directly Imputing summary STatistics (DIST) proposed by our group, their implicit assumptions make them applicable only to ethnically homogeneous cohorts. Results: To decrease computational and access requirements for the analysis of cosmopolitan cohorts, we propose DISTMIX, which extends DIST capabilities to the analysis of mixed ethnicity cohorts. The method uses a relevant reference panel to directly impute unmeasured SNP statistics based only on statistics at measured SNPs and estimated/user-specified ethnic proportions. Simulations show that the proposed method adequately controls the Type I error rates. The 1000 Genomes panel imputation of summary statistics from the ethnically diverse Psychiatric Genomics Consortium Schizophrenia Phase 2 suggests that, when compared to genotype imputation methods, DISTMIX offers comparable imputation accuracy using only a fraction of the computational resources. Availability and implementation: DISTMIX software, its reference population data, and usage examples are publicly available at http://code.google.com/p/distmix. Contact: dlee4@vcu.edu Supplementary information: Supplementary Data are available at Bioinformatics online.
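The idea of imputing a z-score directly from neighbouring z-scores and an ancestry-weighted LD estimate can be sketched as below. This is a simplified illustration under stated assumptions, not the DISTMIX implementation: the cohort LD matrix is taken to be a plain proportion-weighted average of per-population correlation matrices, the imputed value is the conditional mean of a multivariate normal, and all names (`mix_ld`, `solve`, `impute_z`, the optional `ridge` term) are hypothetical.

```python
def mix_ld(proportions, ld_by_group):
    """Proportion-weighted average of per-population LD (correlation) matrices.

    Simplifying assumption for this sketch; the abstract does not state the
    exact estimator used by DISTMIX.
    """
    n = len(ld_by_group[0])
    return [[sum(p * ld[i][j] for p, ld in zip(proportions, ld_by_group))
             for j in range(n)] for i in range(n)]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("LD submatrix is singular; try a ridge term")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def impute_z(ld, measured, target, z_measured, ridge=0.0):
    """Impute the z-score at SNP index `target` from measured SNPs.

    Returns (z_hat, info), where z_hat = R_um (R_mm + ridge*I)^-1 z_m and
    info = R_um (R_mm + ridge*I)^-1 R_mu is the variance explained.
    """
    r_mm = [[ld[i][j] + (ridge if i == j else 0.0) for j in measured]
            for i in measured]
    r_um = [ld[target][j] for j in measured]
    w = solve(r_mm, r_um)  # imputation weights (R_mm is symmetric)
    z_hat = sum(wi * zi for wi, zi in zip(w, z_measured))
    info = sum(wi * ri for wi, ri in zip(w, r_um))
    return z_hat, info
```

A usage sketch: with two hypothetical populations mixed 50/50, call `mix_ld([0.5, 0.5], [ld_a, ld_b])` and pass the result to `impute_z` together with the indices and z-scores of the measured SNPs. A low `info` value signals that the unmeasured SNP is poorly tagged in the mixed cohort and that its imputed statistic should be interpreted cautiously.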
Affiliation(s)
- Donghyung Lee
- Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, VA 23298, USA
- T Bernard Bigdeli
- Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, VA 23298, USA
- Vernell S Williamson
- Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, VA 23298, USA
- Vladimir I Vladimirov
- Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, VA 23298, USA; Center for Biomarker Research & Personalized Medicine, Virginia Commonwealth University, Richmond, VA 23298, USA; Lieber Institute for Brain Development, Johns Hopkins University, Baltimore, MD 21205, USA
- Brien P Riley
- Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, VA 23298, USA
- Ayman H Fanous
- Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, VA 23298, USA
- Silviu-Alin Bacanu
- Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, VA 23298, USA
|