Reference Citation Analysis

For: Browning BL, Browning SR. Statistical phasing of 150,119 sequenced genomes in the UK Biobank. Am J Hum Genet 2023;110:161-165. [PMID: 36450278 PMCID: PMC9892698 DOI: 10.1016/j.ajhg.2022.11.008] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 10/03/2022] [Accepted: 11/08/2022] [Indexed: 12/03/2022]

Cited by Other Article(s)

Harris L, McDonagh EM, Zhang X, Fawcett K, Foreman A, Danecek P, Sergouniotis PI, Parkinson H, Mazzarotto F, Inouye M, Hollox EJ, Birney E, Fitzgerald T. Genome-wide association testing beyond SNPs. Nat Rev Genet 2025;26:156-170. [PMID: 39375560 DOI: 10.1038/s41576-024-00778-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 09/03/2024] [Indexed: 10/09/2024]

Affiliation(s)

Laura Harris European Molecular Biology Laboratory (EMBL), European Bioinformatics Institute (EBI), Wellcome Genome Campus, Hinxton, UK
Ellen M McDonagh European Molecular Biology Laboratory (EMBL), European Bioinformatics Institute (EBI), Wellcome Genome Campus, Hinxton, UK
Xiaolei Zhang European Molecular Biology Laboratory (EMBL), European Bioinformatics Institute (EBI), Wellcome Genome Campus, Hinxton, UK
Katherine Fawcett European Molecular Biology Laboratory (EMBL), European Bioinformatics Institute (EBI), Wellcome Genome Campus, Hinxton, UK Department of Population Health Sciences, University of Leicester, Leicester, UK
Amy Foreman European Molecular Biology Laboratory (EMBL), European Bioinformatics Institute (EBI), Wellcome Genome Campus, Hinxton, UK
Petr Danecek Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
Panagiotis I Sergouniotis European Molecular Biology Laboratory (EMBL), European Bioinformatics Institute (EBI), Wellcome Genome Campus, Hinxton, UK Division of Evolution, Infection and Genomics, School of Biological Sciences, University of Manchester, Manchester, UK
Helen Parkinson European Molecular Biology Laboratory (EMBL), European Bioinformatics Institute (EBI), Wellcome Genome Campus, Hinxton, UK
Francesco Mazzarotto Department of Molecular and Translational Medicine, University of Brescia, Brescia, Italy National Heart and Lung Institute, Imperial College London, London, UK
Michael Inouye British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Australia
Edward J Hollox Department of Genetics and Genome Biology, University of Leicester, Leicester, UK
Ewan Birney European Molecular Biology Laboratory (EMBL), European Bioinformatics Institute (EBI), Wellcome Genome Campus, Hinxton, UK
Tomas Fitzgerald European Molecular Biology Laboratory (EMBL), European Bioinformatics Institute (EBI), Wellcome Genome Campus, Hinxton, UK.

Browning SR, Browning BL. Estimating gene conversion rates from population data using multi-individual identity by descent. bioRxiv 2025:2025.02.22.639693. [PMID: 40060563 PMCID: PMC11888280 DOI: 10.1101/2025.02.22.639693] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/20/2025]

Czech E, Millar TR, Tyler W, White T, Elsworth B, Guez J, Hancox J, Jeffery B, Karczewski KJ, Miles A, Tallman S, Unneberg P, Wojdyla R, Zabad S, Hammerbacher J, Kelleher J. Analysis-ready VCF at Biobank scale using Zarr. bioRxiv 2025:2024.06.11.598241. [PMID: 38915693 PMCID: PMC11195102 DOI: 10.1101/2024.06.11.598241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/26/2024]

Abstract

Background

Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed.

Results

Zarr is a format for storing multi-dimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than standard VCF based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of three large human datasets (Genomics England: n=78,195; Our Future Health: n=651,050; All of Us: n=245,394) along with whole genome datasets for Norway Spruce (n=1,063) and SARS-CoV-2 (n=4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs.

Conclusions

Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely-used, open-source technologies has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows.

Affiliation(s)

Eric Czech Open Athena AI Foundation, Lincoln, New Zealand Related Sciences, Lincoln, New Zealand
Timothy R. Millar The New Zealand Institute for Plant & Food Research Ltd, Lincoln, New Zealand Department of Biochemistry, School of Biomedical Sciences, University of Otago, Dunedin, New Zealand
Will Tyler Independent researcher, Manchester, UK
Tom White Tom White Consulting Ltd., Manchester, UK
Benjamin Elsworth Our Future Health, Manchester, UK
Jérémy Guez Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
Jonny Hancox NVIDIA Ltd, Reading, UK
Ben Jeffery Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
Konrad J. Karczewski Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
Alistair Miles Wellcome Sanger Institute
Sam Tallman Genomics England
Per Unneberg Department of Cell and Molecular Biology, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
Rafal Wojdyla Open Athena AI Foundation, Lincoln, New Zealand
Shadi Zabad School of Computer Science, McGill University, Montreal, QC, Canada
Jeff Hammerbacher Open Athena AI Foundation, Lincoln, New Zealand Related Sciences, Lincoln, New Zealand
Jerome Kelleher Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK

DeHaas D, Pan Z, Wei X. Enabling efficient analysis of biobank-scale data with genotype representation graphs. Nat Comput Sci 2025;5:112-124. [PMID: 39639156 PMCID: PMC12054550 DOI: 10.1038/s43588-024-00739-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2024] [Accepted: 11/06/2024] [Indexed: 12/07/2024]

Masaki N, Browning SR. Mean gene conversion tract length in humans estimated to be 459 bp from UK Biobank sequence data. bioRxiv 2025:2024.12.30.630818. [PMID: 39868294 PMCID: PMC11761487 DOI: 10.1101/2024.12.30.630818] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2025]

Shi S, Rubinacci S, Hu S, Moutsianas L, Stuckey A, Need AC, Palamara PF, Caulfield M, Marchini J, Myers S. A Genomics England haplotype reference panel and imputation of UK Biobank. Nat Genet 2024;56:1800-1803. [PMID: 39134668 PMCID: PMC11387190 DOI: 10.1038/s41588-024-01868-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 11/21/2023] [Accepted: 07/11/2024] [Indexed: 09/12/2024]

DeHaas D, Pan Z, Wei X. Genotype Representation Graphs: Enabling Efficient Analysis of Biobank-Scale Data. bioRxiv 2024:2024.04.23.590800. [PMID: 38712040 PMCID: PMC11071416 DOI: 10.1101/2024.04.23.590800] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/08/2024]

Abstract

Computational analysis of a large number of genomes requires a data structure that can represent the dataset compactly while also enabling efficient operations on variants and samples. Current practice is to store large-scale genetic polymorphism data using tabular data structures and file formats, where rows and columns represent samples and genetic variants. However, encoding genetic data in such formats has become unsustainable. For example, the UK Biobank polymorphism data of 200,000 phased whole genomes has exceeded 350 terabytes (TB) in Variant Call Format (VCF), cumbersome and inefficient to work with. To mitigate the computational burden, we introduce the Genotype Representation Graph (GRG), an extremely compact data structure to losslessly present phased whole-genome polymorphisms. A GRG is a fully connected hierarchical graph that exploits variant-sharing across samples, leveraging ideas inspired by Ancestral Recombination Graphs. Capturing variant-sharing in a multitree structure compresses biobank-scale human data to the point where it can fit in a typical server's RAM (5-26 gigabytes (GB) per chromosome), and enables graph-traversal algorithms to trivially reuse computed values, both of which can significantly reduce computation time. We have developed a command-line tool and a library usable via both C++ and Python for constructing and processing GRG files which scales to a million whole genomes. It takes 160GB disk space to encode the information in 200,000 UK Biobank phased whole genomes as a GRG, more than 13 times smaller than the size of compressed VCF. We show that summaries of genetic variants such as allele frequency and association effect can be computed on GRG via graph traversal that runs significantly faster than all tested alternatives, including vcf.gz, PLINK BED, tree sequence, XSI, and Savvy. Furthermore, GRG is particularly suitable for doing repeated calculations and interactive data analysis. We anticipate that GRG-based algorithms will improve the scalability of various types of computation and generally lower the cost of analyzing large genomic datasets.

Wertenbroek R, Hofmeister RJ, Xenarios I, Thoma Y, Delaneau O. Improving population scale statistical phasing with whole-genome sequencing data. PLoS Genet 2024;20:e1011092. [PMID: 38959269 PMCID: PMC11251608 DOI: 10.1371/journal.pgen.1011092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2023] [Revised: 07/16/2024] [Accepted: 06/11/2024] [Indexed: 07/05/2024]

Masaki N, Browning SR, Browning BL. Simultaneous estimation of genotype error and uncalled deletion rates in whole genome sequence data. PLoS Genet 2024;20:e1011297. [PMID: 38787916 PMCID: PMC11156439 DOI: 10.1371/journal.pgen.1011297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Revised: 06/06/2024] [Accepted: 05/10/2024] [Indexed: 05/26/2024]

Browning SR, Browning BL. Biobank-scale inference of multi-individual identity by descent and gene conversion. Am J Hum Genet 2024;111:691-700. [PMID: 38513668 PMCID: PMC11023918 DOI: 10.1016/j.ajhg.2024.02.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2023] [Revised: 02/26/2024] [Accepted: 02/27/2024] [Indexed: 03/23/2024]

Kwong A, Zawistowski M, Fritsche LG, Zhan X, Bragg-Gresham J, Branham KE, Advani J, Othman M, Ratnapriya R, Teslovich TM, Stambolian D, Chew EY, Abecasis GR, Swaroop A. Whole genome sequencing of 4,787 individuals identifies gene-based rare variants in age-related macular degeneration. Hum Mol Genet 2024;33:374-385. [PMID: 37934784 PMCID: PMC10840384 DOI: 10.1093/hmg/ddad189] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/19/2023] [Revised: 10/12/2023] [Accepted: 10/31/2023] [Indexed: 11/09/2023]

Affiliation(s)

Alan Kwong Department of Biostatistics and Center for Statistical Genetics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, United States
Matthew Zawistowski Department of Biostatistics and Center for Statistical Genetics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, United States
Lars G Fritsche Department of Biostatistics and Center for Statistical Genetics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, United States
Xiaowei Zhan Southwestern Medical Center, University of Texas, 5323 Harry Hines Blvd, Dallas, TX 75390, United States
Jennifer Bragg-Gresham Kidney Epidemiology and Cost Center, Department of Internal Medicine-Nephrology, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, United States
Kari E Branham Department of Ophthalmology and Visual Sciences, University of Michigan Kellogg Eye Center, 1000 Wall St, Ann Arbor, MI 48105, United States
Jayshree Advani Neurobiology-Neurodegeneration and Repair Laboratory, National Eye Institute, National Institutes of Health, 6 Center Drive, MSC 0610, Bethesda, MD 20892, United States
Mohammad Othman Department of Ophthalmology and Visual Sciences, University of Michigan Kellogg Eye Center, 1000 Wall St, Ann Arbor, MI 48105, United States
Rinki Ratnapriya Neurobiology-Neurodegeneration and Repair Laboratory, National Eye Institute, National Institutes of Health, 6 Center Drive, MSC 0610, Bethesda, MD 20892, United States
Tanya M Teslovich Regeneron Pharmaceuticals Inc., 777 Old Saw Mill River Rd, Tarrytown, NY 10591, United States
Dwight Stambolian Department of Ophthalmology, Perelman School of Medicine, University of Pennsylvania Medical School, 51 N. 39th Street, Philadelphia, PA 19104, United States
Emily Y Chew Division of Epidemiology and Clinical Application, National Eye Institute, National Institutes of Health, 10 Center Drive Building 10-CRC, Bethesda, MD 20892, United States
Gonçalo R Abecasis Department of Biostatistics and Center for Statistical Genetics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, United States Regeneron Pharmaceuticals Inc., 777 Old Saw Mill River Rd, Tarrytown, NY 10591, United States
Anand Swaroop Neurobiology-Neurodegeneration and Repair Laboratory, National Eye Institute, National Institutes of Health, 6 Center Drive, MSC 0610, Bethesda, MD 20892, United States

Avadhanam S, Williams AL. Phase-free local ancestry inference mitigates the impact of switch errors on phase-based methods. bioRxiv 2023:2023.12.02.569669. [PMID: 38106003 PMCID: PMC10723336 DOI: 10.1101/2023.12.02.569669] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/19/2023]

Browning SR, Browning BL. Biobank-scale inference of multi-individual identity by descent and gene conversion. bioRxiv 2023:2023.11.03.565574. [PMID: 37961601 PMCID: PMC10635131 DOI: 10.1101/2023.11.03.565574] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2023]

Cai R, Browning BL, Browning SR. Identity-by-descent-based estimation of the X chromosome effective population size with application to sex-specific demographic history. G3 (Bethesda) 2023;13:jkad165. [PMID: 37497617 PMCID: PMC10542559 DOI: 10.1093/g3journal/jkad165] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 05/10/2023] [Revised: 05/10/2023] [Accepted: 07/14/2023] [Indexed: 07/28/2023]

Hofmeister RJ, Ribeiro DM, Rubinacci S, Delaneau O. Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nat Genet 2023:10.1038/s41588-023-01415-w. [PMID: 37386248 DOI: 10.1038/s41588-023-01415-w] [Citation(s) in RCA: 69] [Impact Index Per Article: 34.5] [Received: 10/27/2022] [Accepted: 05/04/2023] [Indexed: 07/01/2023]