1
Santorsola M, Lescai F. The promise of explainable deep learning for omics data analysis: Adding new discovery tools to AI. N Biotechnol 2023; 77:1-11. [PMID: 37329982 DOI: 10.1016/j.nbt.2023.06.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/27/2023] [Revised: 06/01/2023] [Accepted: 06/14/2023] [Indexed: 06/19/2023]
Abstract
Deep learning has already revolutionised the way a wide range of data is processed in many areas of daily life. The ability to learn abstractions and relationships from heterogeneous data has provided impressively accurate prediction and classification tools to handle increasingly big datasets. This has a significant impact on the growing wealth of omics datasets, with the unprecedented opportunity for a better understanding of the complexity of living organisms. While this revolution is transforming the way these data are analyzed, explainable deep learning is emerging as an additional tool with the potential to change the way biological data is interpreted. Explainability addresses critical issues such as transparency, which matters especially when computational tools are introduced in clinical environments. Moreover, it empowers artificial intelligence with the capability to provide new insights into the input data, thus adding an element of discovery to these already powerful resources. In this review, we provide an overview of the transformative effects explainable deep learning is having on multiple sectors, ranging from genome engineering and genomics to radiomics, drug design and clinical trials. We offer a perspective to life scientists, to better understand the potential of these tools, and a motivation to implement them in their research, by suggesting learning resources they can use to take their first steps in this field.
Affiliation(s)
- Francesco Lescai
- Department of Biology and Biotechnology, University of Pavia, Pavia, Italy.
2
Controlling for human population stratification in rare variant association studies. Sci Rep 2021; 11:19015. [PMID: 34561511 PMCID: PMC8463695 DOI: 10.1038/s41598-021-98370-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 10/05/2020] [Accepted: 08/25/2021] [Indexed: 12/05/2022]
Abstract
Population stratification is a confounder of genetic association studies. In analyses of rare variants, corrections based on principal components (PCs) and linear mixed models (LMMs) yield conflicting conclusions. Studies evaluating these approaches generally focused on limited types of structure and large sample sizes. We investigated the properties of several correction methods through a large simulation study using real exome data, and several within- and between-continent stratification scenarios. We considered different sample sizes, with situations including as few as 50 cases, to account for the analysis of rare disorders. Large samples showed that accounting for stratification was more difficult with a continental than with a worldwide structure. When considering a sample of 50 cases, an inflation of type I error was observed with PCs for small numbers of controls (≤ 100), and with LMMs for large numbers of controls (≥ 1000). We also tested a novel local permutation method (LocPerm), which maintained a correct type I error in all situations. Power was equivalent across all approaches, indicating that the key issue is to properly control type I error. Finally, we found that the power of analyses including small numbers of cases can be increased by adding a large panel of external controls, provided an appropriate stratification correction is used.
3
Mullaert J, Bouaziz M, Seeleuthner Y, Bigio B, Casanova JL, Alcaïs A, Abel L, Cobat A. Taking population stratification into account by local permutations in rare-variant association studies on small samples. Genet Epidemiol 2021; 45:821-829. [PMID: 34402542 DOI: 10.1002/gepi.22426] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/25/2020] [Revised: 06/07/2021] [Accepted: 07/15/2021] [Indexed: 11/08/2022]
Abstract
Many methods for rare variant association studies require permutations to assess the significance of tests. Standard permutations assume that all individuals are exchangeable and do not take population stratification (PS), a known confounding factor in genetic studies, into account. We propose a novel strategy, LocPerm, in which individual phenotypes are permuted only with their closest ancestry-based neighbors. We performed a simulation study, focusing on small samples, to evaluate and compare LocPerm with standard permutations and classical adjustment on first principal components. Under the null hypothesis, LocPerm was the only method providing an acceptable type I error, regardless of sample size and level of stratification. The power of LocPerm was similar to that of standard permutation in the absence of PS, and remained stable in different PS scenarios. We conclude that LocPerm is a method of choice for taking PS and/or small sample size into account in rare variant association studies.
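The idea of restricting permutations to ancestry-similar individuals can be illustrated with a small sketch. This is not the authors' LocPerm implementation: it is a simplified stand-in that groups individuals into disjoint blocks of nearest neighbours in principal-component space and shuffles phenotypes only within each block. The function names, the greedy block construction, the toy carrier-count statistic, and the example data are all assumptions made for illustration.

```python
import random


def ancestry_blocks(coords, block_size):
    """Greedily group individuals into blocks of ancestry-similar neighbours.

    coords: one tuple of ancestry coordinates (e.g. top PCs) per individual.
    Hypothetical helper, not part of any published LocPerm software.
    """
    unassigned = set(range(len(coords)))
    blocks = []
    while unassigned:
        seed = min(unassigned)
        # Rank remaining individuals by squared Euclidean distance to the seed;
        # the index is a tie-breaker so the result is deterministic.
        ranked = sorted(
            unassigned,
            key=lambda j: (sum((a - b) ** 2 for a, b in zip(coords[seed], coords[j])), j),
        )
        block = ranked[:block_size]
        blocks.append(block)
        unassigned -= set(block)
    return blocks


def local_permutation(phenotypes, blocks, rng):
    """Shuffle phenotypes within each block only, never across blocks."""
    permuted = list(phenotypes)
    for block in blocks:
        values = [phenotypes[i] for i in block]
        rng.shuffle(values)
        for i, value in zip(block, values):
            permuted[i] = value
    return permuted


def carriers_among_cases(phenotypes, carriers):
    """Toy test statistic: number of rare-variant carriers among cases."""
    return sum(c for p, c in zip(phenotypes, carriers) if p == 1)


def permutation_pvalue(phenotypes, carriers, coords, block_size, n_perm=999, seed=0):
    """One-sided permutation p-value using block-restricted shuffles."""
    rng = random.Random(seed)
    blocks = ancestry_blocks(coords, block_size)
    observed = carriers_among_cases(phenotypes, carriers)
    hits = sum(
        carriers_among_cases(local_permutation(phenotypes, blocks, rng), carriers) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)


# Toy stratified sample: all cases and all carriers sit in one ancestry cluster.
coords = [(0.0,), (0.1,), (0.2,), (0.3,), (10.0,), (10.1,), (10.2,), (10.3,)]
phenotypes = [1, 1, 1, 1, 0, 0, 0, 0]
carriers = [1, 1, 1, 0, 0, 0, 0, 0]

p_local = permutation_pvalue(phenotypes, carriers, coords, block_size=4)
p_standard = permutation_pvalue(phenotypes, carriers, coords, block_size=len(coords))
print(p_local, p_standard)
```

Setting `block_size` to the full sample size recovers a standard permutation, which treats all individuals as exchangeable. In this toy example the carrier excess is explained entirely by ancestry, so the block-restricted p-value is 1 while the standard permutation yields a much smaller one.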
Affiliation(s)
- Jimmy Mullaert
- Université de Paris, IAME, INSERM, Paris, France; AP-HP, Hôpital Bichat, DEBRC, Paris, France; Laboratory of Human Genetics of Infectious Diseases, Paris, EU, France
- Matthieu Bouaziz
- Laboratory of Human Genetics of Infectious Diseases, Paris, EU, France; Université de Paris, Imagine Institute, Paris, EU, France
- Yoann Seeleuthner
- Laboratory of Human Genetics of Infectious Diseases, Paris, EU, France; Université de Paris, Imagine Institute, Paris, EU, France
- Benedetta Bigio
- St. Giles Laboratory of Human Genetics of Infectious Diseases, Rockefeller Branch, The Rockefeller University, New York, New York, USA
- Jean-Laurent Casanova
- Laboratory of Human Genetics of Infectious Diseases, Paris, EU, France; Université de Paris, Imagine Institute, Paris, EU, France; St. Giles Laboratory of Human Genetics of Infectious Diseases, Rockefeller Branch, The Rockefeller University, New York, New York, USA; Howard Hughes Medical Institute, New York, New York, USA
- Alexandre Alcaïs
- Laboratory of Human Genetics of Infectious Diseases, Paris, EU, France; Université de Paris, Imagine Institute, Paris, EU, France
- Laurent Abel
- Laboratory of Human Genetics of Infectious Diseases, Paris, EU, France; Université de Paris, Imagine Institute, Paris, EU, France; St. Giles Laboratory of Human Genetics of Infectious Diseases, Rockefeller Branch, The Rockefeller University, New York, New York, USA
- Aurélie Cobat
- Laboratory of Human Genetics of Infectious Diseases, Paris, EU, France; Université de Paris, Imagine Institute, Paris, EU, France
4
Qi W, Allen AS, Li YJ. Family-based association tests for rare variants with censored traits. PLoS One 2019; 14:e0210870. [PMID: 30682063 PMCID: PMC6347269 DOI: 10.1371/journal.pone.0210870] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 06/30/2018] [Accepted: 12/27/2018] [Indexed: 11/30/2022]
Abstract
We propose a set of family-based burden and kernel tests for censored traits (FamBAC and FamKAC). Here, censored traits refer to time-to-event outcomes, for instance, age-at-onset of a disease. To model censored traits in family-based designs, we used the frailty model, which incorporated not only fixed genetic effects of rare variants in a region of interest but also random polygenic effects shared within families. We first partitioned genotype scores of rare variants into orthogonal between- and within-family components, and then derived their corresponding efficient score statistics from the frailty model. Finally, FamBAC and FamKAC were constructed by aggregating the weighted efficient scores of the within-family components across rare variants and subjects. FamBAC collapsed rare variants within subject first to form a burden test that followed a chi-squared distribution, whereas FamKAC was a variance component test following a mixture of chi-squared distributions. For FamKAC, p-values can be computed by permutation tests or, for computational efficiency, by approximation methods. Through simulation studies, we showed that type I error was correctly controlled by FamBAC for various variant weighting schemes (0.0371 to 0.0527). However, FamKAC type I error rates based on approximation methods were deflated (max 0.0376) but improved by permutation tests. Our simulations also demonstrated that the burden test FamBAC had higher power than the kernel test FamKAC when a high proportion (e.g. ≥ 80%) of causal variants had effects in the same direction. In contrast, when the effects of causal variants on the censored trait were in mixed directions, FamKAC outperformed FamBAC and had comparable or higher power than an existing method, RVFam. Our proposed framework has the flexibility to accommodate general nuclear families, and can be used to analyze sequence data for censored traits such as age-at-onset of a complex disease of interest.
Affiliation(s)
- Wenjing Qi
- Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, United States of America
- Duke Molecular Physiology Institute, Duke University, Durham, NC, United States of America
- Andrew S. Allen
- Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, United States of America
- Center for Statistical Genetics and Genomics, Duke University, Durham, NC, United States of America
- Yi-Ju Li
- Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, United States of America
- Duke Molecular Physiology Institute, Duke University, Durham, NC, United States of America
5
The impact of a fine-scale population stratification on rare variant association test results. PLoS One 2018; 13:e0207677. [PMID: 30521541 PMCID: PMC6283567 DOI: 10.1371/journal.pone.0207677] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 07/27/2018] [Accepted: 11/05/2018] [Indexed: 12/28/2022]
Abstract
Population stratification is a well-known confounding factor in both common and rare variant association analyses. Rare variants tend to be more geographically clustered than common variants, because of their more recent origin. However, it is not yet clear if population stratification at a very fine scale (neighboring administrative regions within a country) would lead to statistical bias in rare variant analyses. This problem is important because convenience controls from external studies are commonly included to increase the power to detect genetic associations. We studied through simulation the impact of a fine-scale population structure on different rare variant association strategies, assessing type I error and power. We showed that principal component analysis (PCA) based methods of adjustment for population stratification adequately corrected type I error inflation at the largest geographical scales, but not at the finest scales. We also showed in our simulations that adding controls increased power, but considerably less so when controls were drawn from another population.
6
Luo Y, Maity A, Wu MC, Smith C, Duan Q, Li Y, Tzeng JY. On the substructure controls in rare variant analysis: Principal components or variance components? Genet Epidemiol 2017; 42:276-287. [PMID: 29280188 DOI: 10.1002/gepi.22102] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 04/10/2017] [Revised: 10/07/2017] [Accepted: 10/19/2017] [Indexed: 11/09/2022]
Abstract
Recent studies showed that population substructure (PS) can have a more complex impact on rare variant tests and that similarity-based collapsing tests (e.g., SKAT) may be affected more severely by PS than burden-based tests. In this work, we evaluate the performance of SKAT coupled with principal components (PC) or variance components (VC) based PS correction methods. We consider confounding effects caused by PS including stratified populations, admixed populations, and spatially distributed nongenetic risk; we investigate which types of variants (e.g., common, less frequent, rare, or all variants) should be used to effectively control for confounding effects. We found that (i) PC-based methods can account for confounding effects in most scenarios except for admixture, although the number of sufficient PCs depends on the PS complexity and the type of variants used. (ii) PCs based on all variants (i.e., common + less frequent + rare) tend to require equal or fewer sufficient PCs and often achieve higher power than PCs based on other variant types. (iii) VC-based methods can effectively adjust for confounding in all scenarios (even for admixture), though the type of variants that should be used to construct VC may vary. (iv) VC based on all variants works consistently in all scenarios, though its power may sometimes be lower than VC based on other variant types. Given that the best-performing method and which variants to use depend on the underlying unknown confounding mechanisms, a robust strategy is to perform SKAT analyses using VC-based methods based on all variants.
Affiliation(s)
- Yiwen Luo
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America; Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Arnab Maity
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Michael C Wu
- Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Chris Smith
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
- Qing Duan
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Yun Li
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America; Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Jung-Ying Tzeng
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America; Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America; Department of Statistics, National Cheng-Kung University, Tainan, Taiwan; Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
7
Kundu K, Pal LR, Yin Y, Moult J. Determination of disease phenotypes and pathogenic variants from exome sequence data in the CAGI 4 gene panel challenge. Hum Mutat 2017; 38:1201-1216. [PMID: 28497567 PMCID: PMC5576720 DOI: 10.1002/humu.23249] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 10/25/2016] [Revised: 03/30/2017] [Accepted: 04/28/2017] [Indexed: 01/06/2023]
Abstract
The use of gene panel sequencing for diagnostic and prognostic testing is now widespread, but there are so far few objective tests of methods to interpret these data. We describe the design and implementation of a gene panel sequencing data analysis pipeline (VarP) and its assessment in a CAGI4 community experiment. The method was applied to clinical gene panel sequencing data of 106 patients, with the goal of determining which of 14 disease classes each patient has and the corresponding causative variant(s). The disease class was correctly identified for 36 cases, including 10 where the original clinical pipeline did not find causative variants. For a further seven cases, we found strong evidence of an alternative disease to that tested. Many of the potentially causative variants are missense, with no previous association with disease, and for these it proved hardest to correctly assign pathogenicity or otherwise. Post analysis showed that three-dimensional structure data could have helped for up to half of these cases. Over-reliance on HGMD annotation led to a number of incorrect disease assignments. We used a largely ad hoc method to assign probabilities of pathogenicity for each variant, and there is much work still to be done in this area.
Affiliation(s)
- Kunal Kundu
- Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850, USA
- Computational Biology, Bioinformatics and Genomics, Biological Sciences Graduate Program, University of Maryland, College Park, MD 20742, USA
- Lipika R. Pal
- Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850, USA
- Yizhou Yin
- Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850, USA
- Computational Biology, Bioinformatics and Genomics, Biological Sciences Graduate Program, University of Maryland, College Park, MD 20742, USA
- John Moult
- Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850, USA
- Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, MD 20742, USA
8
Nicolas G, Charbonnier C, Campion D. From Common to Rare Variants: The Genetic Component of Alzheimer Disease. Hum Hered 2016; 81:129-141. [PMID: 28002825 DOI: 10.1159/000452256] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Received: 09/06/2016] [Accepted: 09/29/2016] [Indexed: 12/26/2022]
Abstract
Alzheimer disease (AD) is a remarkable example of genetic heterogeneity. Extremely rare variants in the APP, PSEN1, or PSEN2 genes, or duplications of the APP gene cause autosomal dominant forms, generally with complete penetrance by the age of 65 years. Nonautosomal dominant forms are considered as a complex disorder with a high genetic component, whatever the age of onset. Although genetically heterogeneous, AD is defined by the same neuropathological criteria in all configurations. According to the amyloid cascade hypothesis, the Aβ peptide, which aggregates in AD brains, is a key player. APP, PSEN1, or PSEN2 gene mutations increase the production of more aggregation-prone forms of the Aβ peptide, triggering the pathological process. Several risk factors identified in association studies hit genes involved in Aβ production/secretion, aggregation, clearance, or toxicity. Among them, the APOE ε4 allele is a rare example of a common allele with a large effect size, the ORs ranging from 4 to 11-14 for heterozygous and homozygous carriers, respectively. In addition, genome-wide association studies have identified more than two dozen loci with a weak but significant association, the OR of the at-risk allele ranging from 1.08 to 1.30. Recently, the use of massive parallel sequencing has enabled the analysis of rare variants in a genome-wide manner. Two rare variants have been nominally associated with AD risk or protection (TREM2 p.R47H, MAF approximately 0.002, OR approximately 4 and APP p.A673T, MAF approximately 0.0005, OR approximately 0.2). Association analyses at the gene level identified rare loss-of-function and missense, predicted damaging, variants (MAF <0.01) in the SORL1 and ABCA7 genes associated with a moderate relative risk (OR approximately 5 and approximately 2.8, respectively). 
Although the latter analyses revealed association signals with moderately rare variants by collapsing them, the power to detect genes hit by extremely rare variants is still limited. An alternative approach is to consider the de novo paradigm, stating that de novo variants may contribute to AD genetics in sporadic patients. Here, we critically review AD genetics reports with a special focus on rare variants.
Affiliation(s)
- Gaël Nicolas
- CNR-MAJ, Rouen University Hospital, Rouen, France
9
Whole-exome sequencing to analyze population structure, parental inbreeding, and familial linkage. Proc Natl Acad Sci U S A 2016; 113:6713-8. [PMID: 27247391 DOI: 10.1073/pnas.1606460113] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.0] [Indexed: 12/21/2022]
Abstract
Principal component analysis (PCA), homozygosity rate estimations, and linkage studies in humans are classically conducted through genome-wide single-nucleotide variant arrays (GWSA). We compared whole-exome sequencing (WES) and GWSA for this purpose. We analyzed 110 subjects originating from different regions of the world, including North Africa and the Middle East, which are poorly covered by public databases and have high consanguinity rates. We tested and applied a number of quality control (QC) filters. Compared with GWSA, we found that WES provided an accurate prediction of population substructure using variants with a minor allele frequency > 2% (correlation = 0.89 with the PCA coordinates obtained by GWSA). WES also yielded highly reliable estimates of homozygosity rates using runs of homozygosity with a 1,000-kb window (correlation = 0.94 with the estimates provided by GWSA). Finally, homozygosity mapping analyses in 15 families including a single offspring with high homozygosity rates showed that WES provided 51% less genome-wide linkage information than GWSA overall but 97% more information for the coding regions. At the genome-wide scale, 76.3% of linked regions were found by both GWSA and WES, 17.7% were found by GWSA only, and 6.0% were found by WES only. For coding regions, the corresponding percentages were 83.5%, 7.4%, and 9.1%, respectively. With appropriate QC filters, WES can be used for PCA and adjustment for population substructure, estimating homozygosity rates in individuals, and powerful linkage analyses, particularly in coding regions.
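As a rough illustration of computing an ancestry axis from variants above a minor allele frequency (MAF) threshold, the sketch below filters variants at MAF > 2%, standardizes the retained genotypes, and extracts the first principal component by power iteration. It is a toy, standard-library-only example under stated assumptions: the function names, the standardization by sqrt(2p(1-p)), and the six-individual genotype matrix are illustrative and are not taken from the study, which would rely on dedicated tools and its own QC filters.

```python
import math
import random


def standardized_common_variants(genotypes, maf_min=0.02):
    """Keep variants with MAF > maf_min and standardize their allele counts.

    genotypes: one list of 0/1/2 alternate-allele counts per individual.
    Returns one standardized column (list over individuals) per kept variant.
    """
    n = len(genotypes)
    columns = []
    for j in range(len(genotypes[0])):
        column = [row[j] for row in genotypes]
        p = sum(column) / (2 * n)  # alternate allele frequency
        if min(p, 1 - p) > maf_min:
            sd = math.sqrt(2 * p * (1 - p))
            columns.append([(g - 2 * p) / sd for g in column])
    return columns


def first_pc(genotypes, maf_min=0.02, n_iter=100, seed=0):
    """First principal component of individuals, via power iteration."""
    columns = standardized_common_variants(genotypes, maf_min)
    n = len(genotypes)
    # Individual-by-individual similarity (Gram) matrix.
    gram = [
        [sum(c[i] * c[k] for c in columns) / len(columns) for k in range(n)]
        for i in range(n)
    ]
    rng = random.Random(seed)
    vector = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(n_iter):
        product = [sum(gram[i][k] * vector[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    return vector


# Toy data: three individuals from each of two populations with opposite
# genotypes at three variants, plus one monomorphic variant that the MAF
# filter removes.
genotypes = [
    [2, 2, 0, 0],
    [2, 2, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 2, 0],
    [0, 0, 2, 0],
    [0, 0, 2, 0],
]

pc1 = first_pc(genotypes)
print([round(x, 3) for x in pc1])
```

In this toy matrix the first component takes one sign for the first three individuals and the opposite sign for the last three. The overall sign of a principal component is arbitrary, so only the contrast between groups is meaningful.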
10
Abstract
Empirical studies and evolutionary theory support a role for rare variants in the etiology of complex traits. Given this motivation and increasing affordability of whole-exome and whole-genome sequencing, methods for rare variant association have been an active area of research for the past decade. Here, we provide a survey of the current literature and developments from the Genetic Analysis Workshop 19 (GAW19) Collapsing Rare Variants working group. In particular, we present the generalized linear regression framework and associated score statistic for the 2 major types of methods: burden and variance components methods. We further show that by simply modifying weights within these frameworks we arrive at many of the popular existing methods, for example, the cohort allelic sums test and sequence kernel association test. Meta-analysis techniques are also described. Next, we describe the 6 contributions from the GAW19 Collapsing Rare Variants working group. These included development of new methods, such as a retrospective likelihood for family data, a method using genomic structure to compare cases and controls, a haplotype-based meta-analysis, and a permutation-based method for combining different statistical tests. In addition, one contribution compared a mega-analysis of family-based and population-based data to meta-analysis. Finally, the power of existing family-based methods for binary traits was compared. We conclude with suggestions for open research questions.
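The contrast between the two families of score-based tests can be sketched numerically. The example below assumes a linear model with no covariates, so the null residual is simply the trait minus its mean, and uses the commonly cited forms of the statistics: the burden statistic squares the weighted sum of per-variant scores, while the variance component (SKAT-type) statistic sums the squared weighted scores. Function names, unit weights, and the toy data are assumptions for illustration; no p-values are computed.

```python
def score_statistics(genotypes, trait):
    """Per-variant score S_j = sum_i G_ij * (y_i - mean(y)).

    Assumes a linear model without covariates (an illustrative simplification).
    genotypes: one list of allele counts per individual.
    """
    n = len(trait)
    mean = sum(trait) / n
    residuals = [y - mean for y in trait]
    n_variants = len(genotypes[0])
    return [
        sum(genotypes[i][j] * residuals[i] for i in range(n))
        for j in range(n_variants)
    ]


def burden_statistic(scores, weights):
    """Burden-type statistic: square of the weighted sum of scores."""
    return sum(w * s for w, s in zip(weights, scores)) ** 2


def variance_component_statistic(scores, weights):
    """Variance-component (SKAT-type) statistic: sum of squared weighted scores."""
    return sum((w * s) ** 2 for w, s in zip(weights, scores))


trait = [1.0, 1.0, -1.0, -1.0, 0.0, 0.0]
weights = [1.0, 1.0]  # changing the weights yields different named tests

# Both variants are carried by individuals with high trait values.
same_direction = [[1, 0], [0, 1], [0, 0], [0, 0], [0, 0], [0, 0]]
# One variant sits on a high-trait carrier, the other on a low-trait carrier.
mixed_direction = [[1, 0], [0, 0], [0, 1], [0, 0], [0, 0], [0, 0]]

for label, genotypes in (("same", same_direction), ("mixed", mixed_direction)):
    scores = score_statistics(genotypes, trait)
    print(
        label,
        scores,
        burden_statistic(scores, weights),
        variance_component_statistic(scores, weights),
    )
```

The raw values of the two statistics are not on a common scale, because they have different null distributions. The point of the toy data is only that opposite-sign scores cancel in the burden statistic and do not cancel in the variance component statistic.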
Affiliation(s)
- Stephanie A Santorico
- Department of Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO, 80217-3364, USA.
- Audrey E Hendricks
- Department of Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO, 80217-3364, USA.
11
Rand KA, Rohland N, Tandon A, Stram A, Sheng X, Do R, Pasaniuc B, Allen A, Quinque D, Mallick S, Le Marchand L, Kaggwa S, Lubwama A, Stram DO, Watya S, Henderson BE, Conti DV, Reich D, Haiman CA. Whole-exome sequencing of over 4100 men of African ancestry and prostate cancer risk. Hum Mol Genet 2015; 25:371-81. [PMID: 26604137 DOI: 10.1093/hmg/ddv462] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 07/11/2015] [Accepted: 11/06/2015] [Indexed: 12/31/2022]
Abstract
Prostate cancer is the most common non-skin cancer in males, with a ∼1.5-2-fold higher incidence in African American men when compared with whites. Epidemiologic evidence supports a large heritable contribution to prostate cancer, with over 100 susceptibility loci identified to date that can explain ∼33% of the familial risk. To explore the contribution of both rare and common variation in coding regions to prostate cancer risk, we sequenced the exomes of 2165 prostate cancer cases and 2034 controls of African ancestry at a mean coverage of 10.1×. We identified 395 220 coding variants down to 0.05% frequency [57% non-synonymous (NS), 42% synonymous and 1% gain or loss of stop codon or splice site variant] in 16 751 genes with the strongest associations observed in SPARCL1 on 4q22.1 (rs13051, Ala49Asp, OR = 0.78, P = 1.8 × 10⁻⁶) and PTPRR on 12q15 (rs73341069, Val239Ile, OR = 1.62, P = 2.5 × 10⁻⁵). In gene-level testing, the two most significant genes were C1orf100 (P = 2.2 × 10⁻⁴) and GORAB (P = 2.3 × 10⁻⁴). We did not observe exome-wide significant associations (after correcting for multiple hypothesis testing) in single variant or gene-level testing in the overall case-control or case-case analyses of disease aggressiveness. In this first whole-exome sequencing study of prostate cancer, our findings do not provide strong support for the hypothesis that NS coding variants down to 0.5-1.0% frequency have large effects on prostate cancer risk in men of African ancestry. Higher-coverage sequencing efforts in larger samples will be needed to study rarer variants with smaller effect sizes associated with prostate cancer risk.
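The statement that no association reached exome-wide significance can be checked with simple arithmetic. The abstract does not say which multiple-testing correction was applied, so the sketch below assumes a Bonferroni correction at a family-wise level of 0.05, and assumes that all 16 751 genes and all 395 220 variants counted as tests. Both are illustrative assumptions and may differ from the thresholds actually used in the paper.

```python
ALPHA = 0.05          # assumed family-wise error rate
N_GENES = 16_751      # genes reported in the abstract
N_VARIANTS = 395_220  # coding variants reported in the abstract


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


gene_threshold = bonferroni_threshold(ALPHA, N_GENES)
variant_threshold = bonferroni_threshold(ALPHA, N_VARIANTS)

# Smallest p-values quoted in the abstract.
top_gene_p = {"C1orf100": 2.2e-4, "GORAB": 2.3e-4}
top_variant_p = {"rs13051": 1.8e-6, "rs73341069": 2.5e-5}

significant_genes = [g for g, p in top_gene_p.items() if p < gene_threshold]
significant_variants = [v for v, p in top_variant_p.items() if p < variant_threshold]

print(f"gene-level threshold:    {gene_threshold:.2e}")
print(f"variant-level threshold: {variant_threshold:.2e}")
print("significant genes:", significant_genes)
print("significant variants:", significant_variants)
```

Under these assumptions the thresholds are roughly 3 × 10⁻⁶ per gene and 1.3 × 10⁻⁷ per variant, so none of the quoted p-values passes, which is consistent with the abstract's conclusion.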
Affiliation(s)
- Kristin A Rand
- Department of Preventive Medicine, Keck School of Medicine, Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90033, USA
- Nadin Rohland
- Department of Genetics, Harvard Medical School, Harvard University, Boston, MA 02115, USA, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Arti Tandon
- Department of Genetics, Harvard Medical School, Harvard University, Boston, MA 02115, USA, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Alex Stram
- Department of Preventive Medicine, Keck School of Medicine
- Xin Sheng
- Department of Preventive Medicine, Keck School of Medicine
- Ron Do
- Department of Genetics, Harvard Medical School, Harvard University, Boston, MA 02115, USA, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Bogdan Pasaniuc
- Bioinformatics Interdepartmental Program, Department of Human Genetics, David Geffen School of Medicine, Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Alex Allen
- Department of Genetics, Harvard Medical School, Harvard University, Boston, MA 02115, USA, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Dominique Quinque
- Department of Genetics, Harvard Medical School, Harvard University, Boston, MA 02115, USA, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Swapan Mallick
- Department of Genetics, Harvard Medical School, Harvard University, Boston, MA 02115, USA, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA, Howard Hughes Medical Institute, Harvard Medical School, Boston, MA 02115, USA
- Loic Le Marchand
- Epidemiology Program, Cancer Research Center, University of Hawaii, Honolulu, HI 96813, USA
- Alex Lubwama
- School of Public Health, Makerere University College of Health Sciences, Kampala, Uganda and
- Daniel O Stram
- Department of Preventive Medicine, Keck School of Medicine, Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90033, USA
- Stephen Watya
- School of Public Health, Makerere University College of Health Sciences, Kampala, Uganda and Uro Care, Kampala, Uganda
- Brian E Henderson
- Department of Preventive Medicine, Keck School of Medicine, Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90033, USA
- David V Conti
- Department of Preventive Medicine, Keck School of Medicine, Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90033, USA
- David Reich
- Department of Genetics, Harvard Medical School, Harvard University, Boston, MA 02115, USA, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA, Howard Hughes Medical Institute, Harvard Medical School, Boston, MA 02115, USA
- Christopher A Haiman
- Department of Preventive Medicine, Keck School of Medicine, Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90033, USA
12
Sung YJ, Basson J, Cheng N, Nguyen KDH, Nandakumar P, Hunt SC, Arnett DK, Dávila-Román VG, Rao DC, Chakravarti A. The role of rare variants in systolic blood pressure: analysis of ExomeChip data in HyperGEN African Americans. Hum Hered 2015; 79:20-7. [PMID: 25765051 DOI: 10.1159/000375373] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 08/18/2014] [Accepted: 01/20/2015] [Indexed: 12/27/2022]
Abstract
Cardiovascular diseases are among the most significant health problems in the United States today, with their major risk factor, hypertension, disproportionately affecting African Americans (AAs). Although GWAS have identified dozens of common variants associated with blood pressure (BP) and hypertension in European Americans, these variants collectively explain <2.5% of BP variance, and most of the genetic variants remain yet to be identified. Here, we report the results from rare-variant analysis of systolic BP using 94,595 rare and low-frequency variants (minor allele frequency, MAF, <5%) from the Illumina exome array genotyped in 2,045 HyperGEN AAs. In addition to single-variant analysis, 4 gene-level association tests were used for analysis: burden and family-based SKAT tests using MAF cutoffs of 1 and 5%. The gene-based methods often provided lower p values than the single-variant approach. Some consistency was observed across these 4 gene-based analysis options. While neither the gene-based analyses nor the single-variant analysis produced genome-wide significant results, the top signals, which had supporting evidence from multiple gene-based methods, were of borderline significance. Though additional molecular validations are required, 6 of the 16 most promising genes are biologically plausible with physiological connections to BP regulation.
|
13
|
Pérez-Núñez I, Pérez-Castrillón JL, Zarrabeitia MT, García-Ibarbia C, Martínez-Calvo L, Olmos JM, Briongos LS, Riancho J, Camarero V, Muñoz Vives JM, Cruz R, Riancho JA. Exon array analysis reveals genetic heterogeneity in atypical femoral fractures. A pilot study. Mol Cell Biochem 2015; 409:45-50. [DOI: 10.1007/s11010-015-2510-3] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 05/03/2015] [Accepted: 07/04/2015] [Indexed: 10/23/2022]
|
14
|
Lin KH, Zöllner S. Robust and Powerful Affected Sibpair Test for Rare Variant Association. Genet Epidemiol 2015; 39:325-33. [PMID: 25966809 DOI: 10.1002/gepi.21903] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 11/05/2014] [Revised: 03/25/2015] [Accepted: 04/01/2015] [Indexed: 11/09/2022]
Abstract
Advances in DNA sequencing technology facilitate investigating the impact of rare variants on complex diseases. However, using a conventional case-control design, large samples are needed to capture enough rare variants to achieve sufficient power for testing the association between suspected loci and complex diseases. In such large samples, population stratification may easily cause spurious signals. One approach to overcome stratification is to use a family-based design. For rare variants, this strategy is especially appropriate, as power can be increased considerably by analyzing cases with affected relatives. We propose a novel framework for association testing in affected sibpairs by comparing the allele count of rare variants on chromosome regions shared identical by descent to the allele count of rare variants on nonshared chromosome regions, referred to as test for rare variant association with family-based internal control (TRAFIC). This design is generally robust to population stratification as cases and controls are matched within each sibpair. We evaluate the power analytically using a general model for the effect size of rare variants. For the same number of genotyped people, TRAFIC shows superior power over the conventional case-control study for variants with summed risk allele frequency f < 0.05; this power advantage is even more substantial when considering allelic heterogeneity. For complex models of gene-gene interaction, this power advantage depends on the direction of interaction and overall heritability. In sum, we introduce a new method for analyzing rare variants in affected sibpairs that is robust to population stratification, and provide freely available software.
Affiliation(s)
- Keng-Han Lin
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, United States of America; Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan, United States of America
- Sebastian Zöllner
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, United States of America; Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan, United States of America; Department of Psychiatry, University of Michigan, Ann Arbor, Michigan, United States of America
|
15
|
Leveraging ancestry to improve causal variant identification in exome sequencing for monogenic disorders. Eur J Hum Genet 2015; 24:113-9. [PMID: 25898925 DOI: 10.1038/ejhg.2015.68] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/31/2014] [Revised: 03/01/2015] [Accepted: 03/10/2015] [Indexed: 01/18/2023] Open
Abstract
Recent breakthroughs in exome-sequencing technology have made possible the identification of many causal variants of monogenic disorders. Although extremely powerful when closely related individuals (eg, child and parents) are simultaneously sequenced, sequencing of a single case is often unsuccessful due to the large number of variants that need to be followed up for functional validation. Many approaches filter out common variants above a given frequency threshold (eg, 1%), and then prioritize the remaining variants according to their functional, structural and conservation properties. Here we present methods that leverage the genetic structure across different populations to improve filtering performance while accounting for the finite sample size of the reference panels. We show that leveraging genetic structure reduces the number of variants that need to be followed up by 16% in simulations and by up to 38% in empirical data of 20 exomes from individuals with monogenic disorders for which the causal variants are known.
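The baseline frequency filter the abstract refers to (discarding variants above a threshold such as 1%) can be sketched as follows. This is not the authors' ancestry-aware method: it assumes the common simplification of dropping a variant when its frequency exceeds the threshold in any reference population, and it ignores the finite-sample correction the abstract mentions. Variant names, population labels, and frequencies are hypothetical.

```python
from typing import Dict, List

# Per-variant allele frequencies in each reference population (toy values).
PopulationFreqs = Dict[str, float]


def passes_frequency_filter(freqs: PopulationFreqs, threshold: float = 0.01) -> bool:
    """Return True if the variant is at or below the threshold in every population.

    A variant absent from all reference panels (empty dict) is retained.
    """
    return all(f <= threshold for f in freqs.values())


def filter_candidates(
    variants: Dict[str, PopulationFreqs], threshold: float = 0.01
) -> List[str]:
    """Keep the names of variants that remain candidates after filtering."""
    return [name for name, freqs in variants.items()
            if passes_frequency_filter(freqs, threshold)]


toy_variants = {
    "var_a": {"EUR": 0.002, "AFR": 0.03},   # common in one population: filtered out
    "var_b": {"EUR": 0.0, "AFR": 0.001},    # rare everywhere: kept
    "var_c": {},                             # unseen in reference panels: kept
}

candidates = filter_candidates(toy_variants)
```

The example shows why population structure matters for filtering: `var_a` would survive a filter based on the EUR frequency alone but is removed once the AFR frequency is considered.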
|
16
|
Pirie A, Wood A, Lush M, Tyrer J, Pharoah PDP. The effect of rare variants on inflation of the test statistics in case-control analyses. BMC Bioinformatics 2015; 16:53. [PMID: 25888290 PMCID: PMC4339749 DOI: 10.1186/s12859-015-0496-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 12/05/2014] [Accepted: 02/12/2015] [Indexed: 02/02/2023] Open
Abstract
BACKGROUND: The detection of bias due to cryptic population structure is an important step in the evaluation of findings of genetic association studies. The standard method of measuring this bias in a genetic association study is to compare the observed median association test statistic to the expected median test statistic. This ratio is inflated in the presence of cryptic population structure. However, inflation may also be caused by the properties of the association test itself, particularly in the analysis of rare variants. We compared the properties of the three most commonly used association tests: the likelihood ratio test, the Wald test and the score test when testing rare variants for association using simulated data. RESULTS: We found evidence of inflation in the median test statistics of the likelihood ratio and score tests for tests of variants with fewer than 20 heterozygotes across the sample, regardless of the total sample size. The test statistics for the Wald test were under-inflated at the median for variants below the same minor allele frequency. CONCLUSIONS: In a genetic association study, if a substantial proportion of the genetic variants tested have rare minor allele frequencies, the properties of the association test may mask the presence or absence of bias due to population structure. The use of either the likelihood ratio test or the score test is likely to lead to inflation in the median test statistic in the absence of population structure. In contrast, the use of the Wald test is likely to result in under-inflation of the median test statistic, which may mask the presence of population structure.
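The ratio described in the abstract (observed median test statistic over expected median) is commonly called the genomic inflation factor. A minimal sketch follows, assuming each association test yields a chi-square statistic with 1 degree of freedom, whose median under the null is approximately 0.4549; the function name and the toy statistics are hypothetical.

```python
import statistics
from typing import Sequence

# Median of the chi-square distribution with 1 degree of freedom
# (assumption: every test statistic is 1-df chi-square under the null).
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(chi2_stats: Sequence[float]) -> float:
    """Ratio of the observed median test statistic to its expected null median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


# Toy statistics whose median equals the null median: no apparent inflation.
null_like = [0.1, 0.4549364, 2.0]
# The same statistics doubled: the median, and hence the ratio, doubles.
inflated = [0.2, 0.9098728, 4.0]

lambda_null = inflation_factor(null_like)
lambda_inflated = inflation_factor(inflated)
```

A ratio near 1 is usually read as no evidence of bias, while the abstract's point is that with many rare variants the ratio can deviate from 1 because of the test itself rather than population structure.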
Affiliation(s)
- Ailith Pirie
- Department of Public Health and Primary Care, Strangeways Research Laboratory, University of Cambridge, 2 Worts' Causeway, Cambridge, CB1 8RN, UK.
- Angela Wood
- Department of Public Health and Primary Care, Strangeways Research Laboratory, University of Cambridge, 2 Worts' Causeway, Cambridge, CB1 8RN, UK.
- Michael Lush
- Department of Public Health and Primary Care, Strangeways Research Laboratory, University of Cambridge, 2 Worts' Causeway, Cambridge, CB1 8RN, UK.
- Jonathan Tyrer
- Department of Oncology, Strangeways Research Laboratory, University of Cambridge, 2 Worts' Causeway, Cambridge, CB1 8RN, UK.
- Paul D P Pharoah
- Department of Public Health and Primary Care, Strangeways Research Laboratory, University of Cambridge, 2 Worts' Causeway, Cambridge, CB1 8RN, UK.
- Department of Oncology, Strangeways Research Laboratory, University of Cambridge, 2 Worts' Causeway, Cambridge, CB1 8RN, UK.
|
17
|
Abstract
In humans, most of the genetic variation is rare and often population-specific. Whereas the role of rare genetic variants in familial monogenic diseases is firmly established, we are only now starting to explore the contribution of this class of genetic variation to human common diseases and other complex traits. Such large-scale experiments are possible due to the development of next-generation DNA sequencing. Early findings suggested that rare and low-frequency coding variation might have a large effect on human phenotypes (eg, PCSK9 missense variants on low-density lipoprotein-cholesterol and coronary heart diseases). This observation sparked excitement in prognostic and diagnostic medicine, as well as in genetics-driven strategies to develop new drugs. In this review, I describe results and present initial conclusions regarding some of the recent rare and low-frequency variant discoveries. We can already assume that most phenotype-associated rare and low-frequency variants have modest-to-weak phenotypical effect. Thus, we will need large cohorts to identify them, as for common variants in genome-wide association studies. As we expand the list of associated rare and low-frequency variants, we can also better recognise the current limitations: we need to develop better statistical methods to optimally test association with rare variants, including non-coding variation, and to account for potential confounders such as population stratification.
Affiliation(s)
- Guillaume Lettre
- Montreal Heart Institute, Montreal, Quebec, Canada; Faculty of Medicine, Department of Medicine, Université de Montréal, Montreal, Quebec, Canada
|