1
Harris L, McDonagh EM, Zhang X, Fawcett K, Foreman A, Daneck P, Sergouniotis PI, Parkinson H, Mazzarotto F, Inouye M, Hollox EJ, Birney E, Fitzgerald T. Genome-wide association testing beyond SNPs. Nat Rev Genet 2025; 26:156-170. [PMID: 39375560 DOI: 10.1038/s41576-024-00778-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 09/03/2024] [Indexed: 10/09/2024]
Abstract
Decades of genetic association testing in human cohorts have provided important insights into the genetic architecture and biological underpinnings of complex traits and diseases. However, for certain traits, genome-wide association studies (GWAS) for common SNPs are approaching signal saturation, which underscores the need to explore other types of genetic variation to understand the genetic basis of traits and diseases. Copy number variation (CNV) is an important source of heritability that is well known to functionally affect human traits. Recent technological and computational advances enable the large-scale, genome-wide evaluation of CNVs, with implications for downstream applications such as polygenic risk scoring and drug target identification. Here, we review the current state of CNV-GWAS, discuss current limitations in resource infrastructure that need to be overcome to enable the wider uptake of CNV-GWAS results, highlight emerging opportunities and suggest guidelines and standards for future GWAS for genetic variation beyond SNPs at scale.
Affiliation(s)
- Laura Harris
  - European Molecular Biology Laboratory (EMBL), European Bioinformatics Institute (EBI), Wellcome Genome Campus, Hinxton, UK
- Ellen M McDonagh
  - European Molecular Biology Laboratory (EMBL), European Bioinformatics Institute (EBI), Wellcome Genome Campus, Hinxton, UK
- Xiaolei Zhang
  - European Molecular Biology Laboratory (EMBL), European Bioinformatics Institute (EBI), Wellcome Genome Campus, Hinxton, UK
- Katherine Fawcett
  - European Molecular Biology Laboratory (EMBL), European Bioinformatics Institute (EBI), Wellcome Genome Campus, Hinxton, UK
  - Department of Population Health Sciences, University of Leicester, Leicester, UK
- Amy Foreman
  - European Molecular Biology Laboratory (EMBL), European Bioinformatics Institute (EBI), Wellcome Genome Campus, Hinxton, UK
- Petr Daneck
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Panagiotis I Sergouniotis
  - European Molecular Biology Laboratory (EMBL), European Bioinformatics Institute (EBI), Wellcome Genome Campus, Hinxton, UK
  - Division of Evolution, Infection and Genomics, School of Biological Sciences, University of Manchester, Manchester, UK
- Helen Parkinson
  - European Molecular Biology Laboratory (EMBL), European Bioinformatics Institute (EBI), Wellcome Genome Campus, Hinxton, UK
- Francesco Mazzarotto
  - Department of Molecular and Translational Medicine, University of Brescia, Brescia, Italy
  - National Heart and Lung Institute, Imperial College London, London, UK
- Michael Inouye
  - British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
  - Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
  - Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Australia
- Edward J Hollox
  - Department of Genetics and Genome Biology, University of Leicester, Leicester, UK
- Ewan Birney
  - European Molecular Biology Laboratory (EMBL), European Bioinformatics Institute (EBI), Wellcome Genome Campus, Hinxton, UK
- Tomas Fitzgerald
  - European Molecular Biology Laboratory (EMBL), European Bioinformatics Institute (EBI), Wellcome Genome Campus, Hinxton, UK
2
Browning SR, Browning BL. Estimating gene conversion rates from population data using multi-individual identity by descent. bioRxiv 2025:2025.02.22.639693. [PMID: 40060563 PMCID: PMC11888280 DOI: 10.1101/2025.02.22.639693] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/20/2025]
Abstract
In humans, homologous gene conversions occur at a higher rate than crossovers; however, gene conversion tracts are small and often unobservable. As a result, estimating gene conversion rates is more difficult than estimating crossover rates. We present a method for multi-individual identity-by-descent (IBD) inference that allows for mismatches due to genotype error and gene conversion. We use the inferred IBD to detect alleles that have changed due to gene conversion in the recent past. We analyze data from the TOPMed and UK Biobank studies to estimate autosome-wide maps of gene conversion rates. For 10 kb, 100 kb, and 1 Mb windows, the correlation between our TOPMed gene conversion map and the deCODE sex-averaged crossover map ranges from 0.56 to 0.67. We find that the strongest gene conversion hotspots typically die back to the baseline gene conversion rate within 1 kb. In 100 kb and 1 Mb windows, our estimated gene conversion map has higher correlation than the deCODE sex-averaged crossover map with PRDM9 binding enrichment (0.34 vs 0.29 for 100 kb windows and 0.52 vs 0.34 for 1 Mb windows), suggesting that the effect of PRDM9 is greater on gene conversion than on crossover recombination. Our TOPMed gene conversion maps are constructed from 55-fold more observed allele conversions than the recently published deCODE gene conversion maps. Our map provides sex-averaged estimates for 10 kb, 100 kb, and 1 Mb windows, whereas the deCODE gene conversion maps provide sex-specific estimates for 3 Mb windows.
Affiliation(s)
- Sharon R. Browning
  - Department of Biostatistics, University of Washington, Seattle, WA, 98195, USA
- Brian L. Browning
  - Department of Biostatistics, University of Washington, Seattle, WA, 98195, USA
  - Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA, 98195, USA
3
Czech E, Millar TR, Tyler W, White T, Elsworth B, Guez J, Hancox J, Jeffery B, Karczewski KJ, Miles A, Tallman S, Unneberg P, Wojdyla R, Zabad S, Hammerbacher J, Kelleher J. Analysis-ready VCF at Biobank scale using Zarr. bioRxiv 2025:2024.06.11.598241. [PMID: 38915693 PMCID: PMC11195102 DOI: 10.1101/2024.06.11.598241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/26/2024]
Abstract
Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed. Results: Zarr is a format for storing multi-dimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than standard VCF based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of three large human datasets (Genomics England: n=78,195; Our Future Health: n=651,050; All of Us: n=245,394) along with whole genome datasets for Norway Spruce (n=1,063) and SARS-CoV-2 (n=4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs. Conclusions: Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely-used, open-source technologies, has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows.
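The contrast between row-wise and field-or-sample access described in this abstract can be illustrated with a small sketch. It uses plain NumPy arrays as a stand-in for a columnar store; it does not use the VCF Zarr specification or any Zarr API, and the toy data, array shapes, and function names are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_variants, n_samples = 1000, 50

# Toy diploid genotype calls (0 = reference allele, 1 = alternate allele).
genotypes = rng.integers(0, 2, size=(n_variants, n_samples, 2), dtype=np.int8)
positions = np.arange(1, n_variants + 1) * 100

# Row-wise layout: one text record per variant, loosely like a VCF body line.
rows = [
    "\t".join([str(pos)] + [f"{a}|{b}" for a, b in calls])
    for pos, calls in zip(positions, genotypes)
]


def sample_alt_count_rowwise(rows, sample_index):
    """Count ALT alleles for one sample; every record must be split and parsed."""
    total = 0
    for row in rows:
        call = row.split("\t")[1 + sample_index]
        total += sum(int(allele) for allele in call.split("|"))
    return total


def sample_alt_count_columnar(genotypes, sample_index):
    """Same quantity from the array layout: a single slice along the sample axis."""
    return int(genotypes[:, sample_index, :].sum())


rowwise = sample_alt_count_rowwise(rows, 7)
columnar = sample_alt_count_columnar(genotypes, 7)
```

Both functions return the same count, but the row-wise version touches every field of every record to read one sample, whereas the array version reads only the slice it needs.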
Affiliation(s)
- Eric Czech
  - Open Athena AI Foundation, Lincoln, New Zealand
  - Related Sciences, Lincoln, New Zealand
- Timothy R. Millar
  - The New Zealand Institute for Plant & Food Research Ltd, Lincoln, New Zealand
  - Department of Biochemistry, School of Biomedical Sciences, University of Otago, Dunedin, New Zealand
- Tom White
  - Tom White Consulting Ltd., Manchester, UK
- Jérémy Guez
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
- Ben Jeffery
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
- Konrad J. Karczewski
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
  - Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Alistair Miles
  - Wellcome Sanger Institute, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Sam Tallman
  - Genomics England, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Per Unneberg
  - Department of Cell and Molecular Biology, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Shadi Zabad
  - School of Computer Science, McGill University, Montreal, QC, Canada
- Jeff Hammerbacher
  - Open Athena AI Foundation, Lincoln, New Zealand
  - Related Sciences, Lincoln, New Zealand
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
4
DeHaas D, Pan Z, Wei X. Enabling efficient analysis of biobank-scale data with genotype representation graphs. Nat Comput Sci 2025; 5:112-124. [PMID: 39639156 PMCID: PMC12054550 DOI: 10.1038/s43588-024-00739-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2024] [Accepted: 11/06/2024] [Indexed: 12/07/2024]
Abstract
Computational analysis of a large number of genomes requires a data structure that can represent the dataset compactly while also enabling efficient operations on variants and samples. However, encoding genetic data in existing tabular data structures and file formats has become costly and unsustainable. Here we introduce the genotype representation graph (GRG), a fully connected hierarchical data structure that losslessly encodes phased whole-genome polymorphisms. Exploiting variant-sharing across samples enables GRG to compress 200,000 UK Biobank phased human genomes to 5-26 gigabytes per chromosome, also enabling graph-traversal algorithms to reuse computed values in random access memory. Constructing and processing GRG files scales to a million whole genomes. Using allele frequencies and association effects as examples, we show that computation on GRG via graph traversal runs the fastest among all tested alternatives. GRG-based algorithms have the potential to increase the scalability and reduce the cost of analyzing large genomic datasets.
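The idea of reusing computed values during graph traversal can be sketched on a toy directed acyclic graph. This is not the GRG file format or the authors' library API; the graph, the node names, and the `samples_below` helper are hypothetical and serve only to show how allele frequencies could be read off shared substructure.

```python
from functools import lru_cache

# Hypothetical toy graph: internal nodes map to their children;
# integer leaves 0..5 stand for six sampled haplotypes.
children = {
    "n1": (0, 1),
    "n2": (2, 3),
    "n3": ("n1", "n2"),  # shares the subgraphs under n1 and n2
    "n4": ("n2", 4),     # shares the subgraph under n2
}

# Each mutation is attached to the node whose descendants carry it.
mutation_nodes = {"mutA": "n3", "mutB": "n4", "mutC": "n2", "mutD": "n1"}
n_haplotypes = 6


@lru_cache(maxsize=None)
def samples_below(node):
    """Return the set of haplotypes reachable from a node (cached per node)."""
    if isinstance(node, int):
        return frozenset([node])
    reached = frozenset()
    for child in children[node]:
        reached |= samples_below(child)
    return reached


# Allele frequency of a mutation = fraction of haplotypes below its node.
allele_freq = {
    mutation: len(samples_below(node)) / n_haplotypes
    for mutation, node in mutation_nodes.items()
}
```

Because `samples_below` is cached, the result for `n2` is computed once and reused by `n3`, `n4`, and `mutC`, which is the kind of value reuse the abstract attributes to graph-traversal algorithms.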
Affiliation(s)
- Drew DeHaas
  - Department of Computational Biology, Cornell University, Ithaca, NY, USA
- Ziqing Pan
  - Department of Computational Biology, Cornell University, Ithaca, NY, USA
- Xinzhu Wei
  - Department of Computational Biology, Cornell University, Ithaca, NY, USA
5
Masaki N, Browning SR. Mean gene conversion tract length in humans estimated to be 459 bp from UK Biobank sequence data. bioRxiv 2025:2024.12.30.630818. [PMID: 39868294 PMCID: PMC11761487 DOI: 10.1101/2024.12.30.630818] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2025]
Abstract
Non-crossover gene conversion is a type of meiotic recombination characterized by the non-reciprocal transfer of genetic material between homologous chromosomes. Gene conversions are thought to occur within relatively short tracts of DNA, estimated to be in the order of 100-1,000 bp in humans. However, the number of observable gene conversion tracts per study has so far been limited by the use of pedigree or sperm-typing data to detect gene conversion events. In this study, we propose a statistical method to estimate the mean length of gene conversion tracts in humans. Our method can handle a large number of gene conversion tracts, leading to more precise estimates of the mean tract length. We apply our method to gene conversion tracts detected in whole autosome sequence data from the UK Biobank using clusters of identity-by-descent segments. From this dataset, we estimate the mean gene conversion tract length in humans to be 459 bp (95% CI: [457, 461]). Stratifying detected gene conversion tracts by whether they overlapped with a recombination hotspot, we estimate the mean gene conversion tract length to be 418 bp (95% CI: [416, 420]) and 492 bp (95% CI: [489, 494]) respectively, for tracts that overlap and do not overlap with a recombination hotspot.
Affiliation(s)
- Nobuaki Masaki
  - Department of Biostatistics, University of Washington, Seattle, Washington, United States of America
- Sharon R. Browning
  - Department of Biostatistics, University of Washington, Seattle, Washington, United States of America
6
Shi S, Rubinacci S, Hu S, Moutsianas L, Stuckey A, Need AC, Palamara PF, Caulfield M, Marchini J, Myers S. A Genomics England haplotype reference panel and imputation of UK Biobank. Nat Genet 2024; 56:1800-1803. [PMID: 39134668 PMCID: PMC11387190 DOI: 10.1038/s41588-024-01868-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 11/21/2023] [Accepted: 07/11/2024] [Indexed: 09/12/2024]
Abstract
We built a reference panel with 342 million autosomal variants using 78,195 individuals from the Genomics England (GEL) dataset, achieving a phasing switch error rate of 0.18% for European samples and imputation quality of r² = 0.75 for variants with minor allele frequencies as low as 2 × 10⁻⁴ in white British samples. The GEL-imputed UK Biobank genome-wide association analysis identified 70% of associations found by direct exome sequencing (P < 2.18 × 10⁻¹¹), while extending testing of rare variants to the entire genome. Coding variants dominated the rare-variant genome-wide association results, implying less disruptive effects of rare non-coding variants.
Affiliation(s)
- Sinan Shi
  - Department of Statistics, University of Oxford, Oxford, UK
- Sile Hu
  - Novo Nordisk Research Centre, Oxford, UK
- Loukas Moutsianas
  - Genomics England, London, UK
  - Queen Mary University of London, London, UK
- Mark Caulfield
  - Genomics England, London, UK
  - Queen Mary University of London, London, UK
- Simon Myers
  - Department of Statistics, University of Oxford, Oxford, UK
7
DeHaas D, Pan Z, Wei X. Genotype Representation Graphs: Enabling Efficient Analysis of Biobank-Scale Data. bioRxiv 2024:2024.04.23.590800. [PMID: 38712040 PMCID: PMC11071416 DOI: 10.1101/2024.04.23.590800] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/08/2024]
Abstract
Computational analysis of a large number of genomes requires a data structure that can represent the dataset compactly while also enabling efficient operations on variants and samples. Current practice is to store large-scale genetic polymorphism data using tabular data structures and file formats, where rows and columns represent samples and genetic variants. However, encoding genetic data in such formats has become unsustainable. For example, the UK Biobank polymorphism data of 200,000 phased whole genomes has exceeded 350 terabytes (TB) in Variant Call Format (VCF), cumbersome and inefficient to work with. To mitigate the computational burden, we introduce the Genotype Representation Graph (GRG), an extremely compact data structure to losslessly present phased whole-genome polymorphisms. A GRG is a fully connected hierarchical graph that exploits variant-sharing across samples, leveraging ideas inspired by Ancestral Recombination Graphs. Capturing variant-sharing in a multitree structure compresses biobank-scale human data to the point where it can fit in a typical server's RAM (5-26 gigabytes (GB) per chromosome), and enables graph-traversal algorithms to trivially reuse computed values, both of which can significantly reduce computation time. We have developed a command-line tool and a library usable via both C++ and Python for constructing and processing GRG files which scales to a million whole genomes. It takes 160 GB disk space to encode the information in 200,000 UK Biobank phased whole genomes as a GRG, more than 13 times smaller than the size of compressed VCF. We show that summaries of genetic variants such as allele frequency and association effect can be computed on GRG via graph traversal that runs significantly faster than all tested alternatives, including vcf.gz, PLINK BED, tree sequence, XSI, and Savvy. Furthermore, GRG is particularly suitable for doing repeated calculations and interactive data analysis. We anticipate that GRG-based algorithms will improve the scalability of various types of computation and generally lower the cost of analyzing large genomic datasets.
Affiliation(s)
- Drew DeHaas
  - Department of Computational Biology, Cornell University, Ithaca, NY
- Ziqing Pan
  - Department of Computational Biology, Cornell University, Ithaca, NY
- Xinzhu Wei
  - Department of Computational Biology, Cornell University, Ithaca, NY
8
Wertenbroek R, Hofmeister RJ, Xenarios I, Thoma Y, Delaneau O. Improving population scale statistical phasing with whole-genome sequencing data. PLoS Genet 2024; 20:e1011092. [PMID: 38959269 PMCID: PMC11251608 DOI: 10.1371/journal.pgen.1011092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2023] [Revised: 07/16/2024] [Accepted: 06/11/2024] [Indexed: 07/05/2024]
Abstract
Haplotype estimation, or phasing, has gained significant traction in large-scale projects due to its valuable contributions to population genetics, variant analysis, and the creation of reference panels for imputation and phasing of new samples. To scale with the growing number of samples, haplotype estimation methods designed for population scale rely on highly optimized statistical models to phase genotype data, and usually ignore read-level information. Statistical methods excel at resolving common variants; however, they still struggle with rare variants due to the lack of statistical information. In this study we introduce SAPPHIRE, a new method that leverages whole-genome sequencing data to enhance the precision of haplotype calls produced by statistical phasing. SAPPHIRE achieves this by refining haplotype estimates through the realignment of sequencing reads, particularly targeting low-confidence phase calls. Our findings demonstrate that SAPPHIRE significantly enhances the accuracy of haplotypes obtained from state-of-the-art methods and also provides the subset of phase calls that are validated by sequencing reads. Finally, we show that our method scales to large data sets by its successful application to the extensive 3.6 petabytes of sequencing data of the last UK Biobank 200,031 sample release.
Affiliation(s)
- Rick Wertenbroek
  - University of Lausanne, Lausanne, Vaud, Switzerland
  - School of Engineering and Management Vaud (HEIG-VD), HES-SO University of Applied Sciences and Arts Western Switzerland, Yverdon-les-Bains, Vaud, Switzerland
- Yann Thoma
  - School of Engineering and Management Vaud (HEIG-VD), HES-SO University of Applied Sciences and Arts Western Switzerland, Yverdon-les-Bains, Vaud, Switzerland
- Olivier Delaneau
  - Regeneron Genetics Center, Tarrytown, New York, United States of America
9
Masaki N, Browning SR, Browning BL. Simultaneous estimation of genotype error and uncalled deletion rates in whole genome sequence data. PLoS Genet 2024; 20:e1011297. [PMID: 38787916 PMCID: PMC11156439 DOI: 10.1371/journal.pgen.1011297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Revised: 06/06/2024] [Accepted: 05/10/2024] [Indexed: 05/26/2024]
Abstract
Genotype data include errors that may influence conclusions reached by downstream statistical analyses. Previous studies have estimated genotype error rates from discrepancies in human pedigree data, such as Mendelian inconsistent genotypes or apparent phase violations. However, uncalled deletions, which generally have not been accounted for in these studies, can lead to biased error rate estimates. In this study, we propose a genotype error model that considers both genotype errors and uncalled deletions when calculating the likelihood of the observed genotypes in parent-offspring trios. Using simulations, we show that when there are uncalled deletions, our model produces genotype error rate estimates that are less biased than estimates from a model that does not account for these deletions. We applied our model to SNVs in 77 sequenced White British parent-offspring trios in the UK Biobank. We use the Akaike information criterion to show that our model fits the data better than a model that does not account for uncalled deletions. We estimate the genotype error rate at SNVs with minor allele frequency > 0.001 in these data to be [Formula: see text]. We estimate that 77% of the genotype errors at these markers are attributable to uncalled deletions [Formula: see text].
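Why an uncalled deletion can masquerade as a genotype error in a trio can be shown with a short sketch. This is not the likelihood model proposed in the paper; the allele labels ("R" reference, "A" alternate, "D" deleted), the assumption that a hemizygous site is reported as homozygous, and all function names are hypothetical and chosen only for illustration.

```python
from itertools import product


def transmissible(genotype):
    """Alleles (0 = REF, 1 = ALT) that a parent with this ALT-allele count can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def mendelian_consistent(father, mother, child):
    """True if the child's ALT count can arise from one allele of each parent."""
    return any(
        a + b == child
        for a, b in product(transmissible(father), transmissible(mother))
    )


def called_genotype(alleles):
    """ALT-allele count as reported by a caller that is unaware of deletions.

    Assumption: a site with one deleted copy is reported as homozygous
    for the allele that remains.
    """
    present = [allele for allele in alleles if allele != "D"]
    if not present:
        return None  # both copies deleted: no call
    if len(present) == 1:
        return 2 * present.count("A")
    return present.count("A")


# True (unobserved) allele pairs: the father carries a deletion and transmits it.
father_true = ("R", "D")
mother_true = ("A", "A")
child_true = ("D", "A")

father_call = called_genotype(father_true)
mother_call = called_genotype(mother_true)
child_call = called_genotype(child_true)

# The called trio is Mendelian inconsistent even though no genotype error occurred.
looks_like_error = not mendelian_consistent(father_call, mother_call, child_call)
```

In this toy trio the calls are REF/REF, ALT/ALT, and ALT/ALT, which a deletion-unaware analysis would count as a genotype error; a model that allows for the deletion can explain the same observation without one.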
Affiliation(s)
- Nobuaki Masaki
  - Department of Biostatistics, University of Washington, Seattle, Washington, United States of America
- Sharon R. Browning
  - Department of Biostatistics, University of Washington, Seattle, Washington, United States of America
- Brian L. Browning
  - Department of Biostatistics, University of Washington, Seattle, Washington, United States of America
  - Department of Medicine, Division of Medical Genetics, University of Washington, Seattle, Washington, United States of America
10
Browning SR, Browning BL. Biobank-scale inference of multi-individual identity by descent and gene conversion. Am J Hum Genet 2024; 111:691-700. [PMID: 38513668 PMCID: PMC11023918 DOI: 10.1016/j.ajhg.2024.02.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2023] [Revised: 02/26/2024] [Accepted: 02/27/2024] [Indexed: 03/23/2024]
Abstract
We present a method for efficiently identifying clusters of identical-by-descent haplotypes in biobank-scale sequence data. Our multi-individual approach enables much more computationally efficient inference of identity by descent (IBD) than approaches that infer pairwise IBD segments and provides locus-specific IBD clusters rather than IBD segments. Our method's computation time, memory requirements, and output size scale linearly with the number of individuals in the dataset. We also present a method for using multi-individual IBD to detect alleles changed by gene conversion. Application of our methods to the autosomal sequence data for 125,361 White British individuals in the UK Biobank detects more than 9 million converted alleles. This is 2,900 times more alleles changed by gene conversion than were detected in a previous analysis of familial data. We estimate that more than 250,000 sequenced probands and a much larger number of additional genomes from multi-generational family members would be required to find a similar number of alleles changed by gene conversion using a family-based approach. Our IBD clustering method is implemented in the open-source ibd-cluster software package.
Affiliation(s)
- Sharon R Browning
  - Department of Biostatistics, University of Washington, Seattle, WA, USA
- Brian L Browning
  - Department of Biostatistics, University of Washington, Seattle, WA, USA
  - Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA, USA
11
Kwong A, Zawistowski M, Fritsche LG, Zhan X, Bragg-Gresham J, Branham KE, Advani J, Othman M, Ratnapriya R, Teslovich TM, Stambolian D, Chew EY, Abecasis GR, Swaroop A. Whole genome sequencing of 4,787 individuals identifies gene-based rare variants in age-related macular degeneration. Hum Mol Genet 2024; 33:374-385. [PMID: 37934784 PMCID: PMC10840384 DOI: 10.1093/hmg/ddad189] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/19/2023] [Revised: 10/12/2023] [Accepted: 10/31/2023] [Indexed: 11/09/2023]
Abstract
Genome-wide association studies have contributed extensively to the discovery of disease-associated common variants. However, the genetic contribution to complex traits is still largely difficult to interpret. We report a genome-wide association study of 2394 cases and 2393 controls for age-related macular degeneration (AMD) via whole-genome sequencing, with 46.9 million genetic variants. Our study reveals significant single-variant association signals at four loci and independent gene-based signals in CFH, C2, C3, and NRTN. Using data from the Exome Aggregation Consortium (ExAC) for a gene-based test, we demonstrate an enrichment of predicted rare loss-of-function variants in CFH, CFI, and an as-yet unreported gene in AMD, ORMDL2. Our method of using a large variant list without individual-level genotypes as an external reference provides a flexible and convenient approach to leverage the publicly available variant datasets to augment the search for rare variant associations, which can explain additional disease risk in AMD.
Affiliation(s)
- Alan Kwong
  - Department of Biostatistics and Center for Statistical Genetics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, United States
- Matthew Zawistowski
  - Department of Biostatistics and Center for Statistical Genetics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, United States
- Lars G Fritsche
  - Department of Biostatistics and Center for Statistical Genetics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, United States
- Xiaowei Zhan
  - Southwestern Medical Center, University of Texas, 5323 Harry Hines Blvd, Dallas, TX 75390, United States
- Jennifer Bragg-Gresham
  - Kidney Epidemiology and Cost Center, Department of Internal Medicine-Nephrology, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, United States
- Kari E Branham
  - Department of Ophthalmology and Visual Sciences, University of Michigan Kellogg Eye Center, 1000 Wall St, Ann Arbor, MI 48105, United States
- Jayshree Advani
  - Neurobiology-Neurodegeneration and Repair Laboratory, National Eye Institute, National Institutes of Health, 6 Center Drive, MSC 0610, Bethesda, MD 20892, United States
- Mohammad Othman
  - Department of Ophthalmology and Visual Sciences, University of Michigan Kellogg Eye Center, 1000 Wall St, Ann Arbor, MI 48105, United States
- Rinki Ratnapriya
  - Neurobiology-Neurodegeneration and Repair Laboratory, National Eye Institute, National Institutes of Health, 6 Center Drive, MSC 0610, Bethesda, MD 20892, United States
- Tanya M Teslovich
  - Regeneron Pharmaceuticals Inc., 777 Old Saw Mill River Rd, Tarrytown, NY 10591, United States
- Dwight Stambolian
  - Department of Ophthalmology, Perelman School of Medicine, University of Pennsylvania Medical School, 51 N. 39th Street, Philadelphia, PA 19104, United States
- Emily Y Chew
  - Division of Epidemiology and Clinical Application, National Eye Institute, National Institutes of Health, 10 Center Drive Building 10-CRC, Bethesda, MD 20892, United States
- Gonçalo R Abecasis
  - Department of Biostatistics and Center for Statistical Genetics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, United States
  - Regeneron Pharmaceuticals Inc., 777 Old Saw Mill River Rd, Tarrytown, NY 10591, United States
- Anand Swaroop
  - Neurobiology-Neurodegeneration and Repair Laboratory, National Eye Institute, National Institutes of Health, 6 Center Drive, MSC 0610, Bethesda, MD 20892, United States
12
Avadhanam S, Williams AL. Phase-free local ancestry inference mitigates the impact of switch errors on phase-based methods. bioRxiv 2023:2023.12.02.569669. [PMID: 38106003 PMCID: PMC10723336 DOI: 10.1101/2023.12.02.569669] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/19/2023]
Abstract
Local ancestry inference (LAI) is an indispensable component of a variety of analyses in medical and population genetics, from admixture mapping to characterizing demographic history. However, the accuracy of LAI depends on a number of factors such as phase quality (for phase-based LAI methods), time since admixture of the population under study, and other factors. Here we present an empirical analysis of four LAI methods using simulated individuals of mixed African and European ancestry, examining the impact of variable phase quality and a range of demographic scenarios. We found that regardless of phasing options, calls from LAI methods that operate on unphased genotypes (phase-free LAI) have 2.6-4.6% higher Pearson correlation with the ground truth than methods that operate on phased genotypes (phase-based LAI). Applying the TRACTOR phase-correction algorithm led to modest improvements in phase-based LAI, but despite this, the Pearson correlation of phase-free LAI remained 2.4-3.8% higher than phase-corrected phase-based approaches (considering the best performing methods in each category). Phase-free and phase-based LAI accuracy differences can dramatically impact downstream analyses: estimates of the time since admixture using phase-based LAI tracts are upwardly biased by ≈10 generations using our highest quality phased data but have virtually no bias using phase-free LAI calls. Our study underscores the strong dependence of phase-based LAI accuracy on phase quality and highlights the merits of LAI approaches that analyze unphased genetic data.
|
13
|
Browning SR, Browning BL. Biobank-scale inference of multi-individual identity by descent and gene conversion. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.11.03.565574. [PMID: 37961601 PMCID: PMC10635131 DOI: 10.1101/2023.11.03.565574] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2023]
Abstract
We present a method for efficiently identifying clusters of identical-by-descent haplotypes in biobank-scale sequence data. Our multi-individual approach enables much more efficient collection and storage of identity by descent (IBD) information than approaches that detect and store pairwise IBD segments. Our method's computation time, memory requirements, and output size scale linearly with the number of individuals in the dataset. We also present a method for using multi-individual IBD to detect alleles changed by gene conversion. Application of our methods to the autosomal sequence data for 125,361 White British individuals in the UK Biobank detects more than 9 million converted alleles. This is 2900 times more alleles changed by gene conversion than were detected in a previous analysis of familial data. We estimate that more than 250,000 sequenced probands and a much larger number of additional genomes from multi-generational family members would be required to find a similar number of alleles changed by gene conversion using a family-based approach.
Affiliation(s)
- Brian L. Browning
- Department of Biostatistics, University of Washington, Seattle, WA
- Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA
|
14
|
Cai R, Browning BL, Browning SR. Identity-by-descent-based estimation of the X chromosome effective population size with application to sex-specific demographic history. G3 (BETHESDA, MD.) 2023; 13:jkad165. [PMID: 37497617 PMCID: PMC10542559 DOI: 10.1093/g3journal/jkad165] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 05/10/2023] [Revised: 05/10/2023] [Accepted: 07/14/2023] [Indexed: 07/28/2023]
Abstract
The effective size of a population (Ne) in the recent past can be estimated through analysis of identity-by-descent (IBD) segments. Several methods have been developed for estimating Ne from autosomal IBD segments, but no such effort has been made with X chromosome IBD segments. In this work, we propose a method to estimate the X chromosome effective population size from X chromosome IBD segments. We show how to use the estimated autosome Ne and X chromosome Ne to estimate the female and male effective population sizes. We demonstrate the accuracy of our autosome and X chromosome Ne estimation with simulated data. We find that the estimated female and male effective population sizes generally reflect the simulated sex-specific effective population sizes across the past 100 generations but that short-term differences between the estimated sex-specific Ne across tens of generations may not reliably indicate true sex-specific differences. We analyzed the effective size of populations represented by samples of sequenced UK White British and UK Indian individuals from the UK Biobank.
Affiliation(s)
- Ruoyi Cai
- Department of Biostatistics, University of Washington, Seattle, Washington, 98195, USA
- Brian L Browning
- Department of Biostatistics, University of Washington, Seattle, Washington, 98195, USA
- Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, Washington, 98195, USA
- Sharon R Browning
- Department of Biostatistics, University of Washington, Seattle, Washington, 98195, USA
|
15
|
Hofmeister RJ, Ribeiro DM, Rubinacci S, Delaneau O. Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nat Genet 2023:10.1038/s41588-023-01415-w. [PMID: 37386248 DOI: 10.1038/s41588-023-01415-w] [Citation(s) in RCA: 69] [Impact Index Per Article: 34.5] [Received: 10/27/2022] [Accepted: 05/04/2023] [Indexed: 07/01/2023]
Abstract
Phasing involves distinguishing the two parentally inherited copies of each chromosome into haplotypes. Here, we introduce SHAPEIT5, a new phasing method that quickly and accurately processes large sequencing datasets, and apply it to UK Biobank (UKB) whole-genome and whole-exome sequencing data. We demonstrate that SHAPEIT5 phases rare variants with low switch error rates of below 5% for variants present in just 1 sample out of 100,000. Furthermore, we outline a method for phasing singletons, which, although less precise, constitutes an important step towards future developments. We then demonstrate that the use of UKB as a reference panel improves the accuracy of genotype imputation, which is even more pronounced when phased with SHAPEIT5 compared with other methods. Finally, we screen the UKB data for loss-of-function compound heterozygous events and identify 549 genes where both gene copies are knocked out. These genes complement current knowledge of gene essentiality in the human genome.
Affiliation(s)
- Robin J Hofmeister
- Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
- Diogo M Ribeiro
- Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
- Simone Rubinacci
- Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
- Olivier Delaneau
- Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
|