1
Betschart RO, Thalén F, Blankenberg S, Zoche M, Zeller T, Ziegler A. A benchmark study of compression software for human short-read sequence data. Sci Rep 2025; 15:15358. [PMID: 40316539 PMCID: PMC12048562 DOI: 10.1038/s41598-025-00491-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2024] [Accepted: 04/28/2025] [Indexed: 05/04/2025] Open
Abstract
Efficient data compression technologies are crucial to reduce the cost of long-term storage and file transfer in whole genome sequencing studies. This study benchmarked four specialized compression tools developed for paired-end fastq.gz files, namely DRAGEN ORA 4.3.4 (ORA), Genozip 15.0.62, repaq 0.3.0, and SPRING 1.1.1, using three subjects from the genome-in-a-bottle consortium that were sequenced 82 times on an Illumina NovaSeq 6000, with an average coverage of 35x. It additionally compared Genozip with SAMtools 1.20 for the compression of BAM files. All tools provided lossless compression. ORA and Genozip achieved compression ratios of approximately 1:6 when compressing fastq.gz. repaq and SPRING had lower compression ratios of 1:2 and 1:4, respectively. repaq and SPRING took longer for both compression and decompression than ORA and Genozip. Genozip had approximately 16% higher compression for BAM files than SAMtools. However, the BAM compression of SAMtools produces CRAM files, which are compatible with many software packages. ORA, repaq, and SPRING are limited to compressing fastq.gz files, while Genozip supports various file formats. Although Genozip requires an annual license, its source code is freely available, ensuring sustainability. In conclusion, paired-end short-read sequence data can be efficiently compressed using specialized compression software. Commercial tools offer higher compression ratios than freely available software.
Affiliation(s)
- Raphael O Betschart
  - Cardio-CARE, Medizincampus Davos, Herman-Burchard-Str. 12, Davos Wolfgang, 7265, Davos, Switzerland
  - Institute of Cardiogenetics, University of Lübeck, Lübeck, Germany
- Felix Thalén
  - Cardio-CARE, Medizincampus Davos, Herman-Burchard-Str. 12, Davos Wolfgang, 7265, Davos, Switzerland
- Stefan Blankenberg
  - Cardio-CARE, Medizincampus Davos, Herman-Burchard-Str. 12, Davos Wolfgang, 7265, Davos, Switzerland
  - Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
  - Centre for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
  - German Center for Cardiovascular Research, Partner Site Hamburg/Kiel/Lübeck, Hamburg, Germany
- Martin Zoche
  - Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
- Tanja Zeller
  - Institute of Cardiogenetics, University of Lübeck, Lübeck, Germany
  - Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
  - Centre for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
  - German Center for Cardiovascular Research, Partner Site Hamburg/Kiel/Lübeck, Hamburg, Germany
- Andreas Ziegler
  - Cardio-CARE, Medizincampus Davos, Herman-Burchard-Str. 12, Davos Wolfgang, 7265, Davos, Switzerland
  - Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
  - Centre for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
  - School of Mathematics, Statistics and Computer Science, Scottsville, Private Bag X01, Pietermaritzburg, 3209, South Africa
2
Lee H, Kim W, Kwon N, Kim C, Kim S, An JY. Lessons from national biobank projects utilizing whole-genome sequencing for population-scale genomics. Genomics Inform 2025; 23:8. [PMID: 40050991 PMCID: PMC11887102 DOI: 10.1186/s44342-025-00040-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2024] [Accepted: 01/27/2025] [Indexed: 03/09/2025] Open
Abstract
Large-scale national biobank projects utilizing whole-genome sequencing have emerged as transformative resources for understanding human genetic variation and its relationship to health and disease. These initiatives, which include the UK Biobank, All of Us Research Program, Singapore's PRECISE, Biobank Japan, and the National Project of Bio-Big Data of Korea, are generating unprecedented volumes of high-resolution genomic data integrated with comprehensive phenotypic, environmental, and clinical information. This review examines the methodologies, contributions, and challenges of major WGS-based national genome projects worldwide. We first discuss the landscape of national biobank initiatives, highlighting their distinct approaches to data collection, participant recruitment, and phenotype characterization. We then introduce recent technological advances that enable efficient processing and analysis of large-scale WGS data, including improvements in variant calling algorithms, innovative methods for creating multi-sample VCFs, optimized data storage formats, and cloud-based computing solutions. The review synthesizes key discoveries from these projects, particularly in identifying expression quantitative trait loci and rare variants associated with complex diseases. Our review introduces the latest findings from the National Project of Bio-Big Data of Korea, which has advanced our understanding of population-specific genetic variation and rare diseases in Korean and East Asian populations. Finally, we discuss future directions and challenges in maximizing the impact of these resources on precision medicine and global health equity. This comprehensive examination demonstrates how large-scale national genome projects are revolutionizing genetic research and healthcare delivery while highlighting the importance of continued investment in diverse, population-specific genomic resources.
Affiliation(s)
- Hyeji Lee
  - Department of Integrated Biomedical and Life Science, Korea University, Seoul, 02841, Republic of Korea
  - L-HOPE Program for Community-Based Total Learning Health Systems, Korea University, Seoul, 02841, Republic of Korea
- Wooheon Kim
  - School of Biosystem and Biomedical Science, College of Health Science, Korea University, Seoul, 02841, Republic of Korea
- Nahyeon Kwon
  - Department of Integrated Biomedical and Life Science, Korea University, Seoul, 02841, Republic of Korea
  - L-HOPE Program for Community-Based Total Learning Health Systems, Korea University, Seoul, 02841, Republic of Korea
- Chanhee Kim
  - Department of Integrated Biomedical and Life Science, Korea University, Seoul, 02841, Republic of Korea
  - L-HOPE Program for Community-Based Total Learning Health Systems, Korea University, Seoul, 02841, Republic of Korea
- Sungmin Kim
  - Department of Integrated Biomedical and Life Science, Korea University, Seoul, 02841, Republic of Korea
  - Division of Genome Science, Department of Precision Medicine, National Institute of Health, Cheongju, 28159, Republic of Korea
- Joon-Yong An
  - Department of Integrated Biomedical and Life Science, Korea University, Seoul, 02841, Republic of Korea
  - L-HOPE Program for Community-Based Total Learning Health Systems, Korea University, Seoul, 02841, Republic of Korea
  - School of Biosystem and Biomedical Science, College of Health Science, Korea University, Seoul, 02841, Republic of Korea
3
Czech E, Millar TR, Tyler W, White T, Elsworth B, Guez J, Hancox J, Jeffery B, Karczewski KJ, Miles A, Tallman S, Unneberg P, Wojdyla R, Zabad S, Hammerbacher J, Kelleher J. Analysis-ready VCF at Biobank scale using Zarr. bioRxiv 2025:2024.06.11.598241. [PMID: 38915693 PMCID: PMC11195102 DOI: 10.1101/2024.06.11.598241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/26/2024]
Abstract
Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed. Results: Zarr is a format for storing multi-dimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than standard VCF based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of three large human datasets (Genomics England: n=78,195; Our Future Health: n=651,050; All of Us: n=245,394) along with whole genome datasets for Norway Spruce (n=1,063) and SARS-CoV-2 (n=4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs. Conclusions: Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely-used, open-source technologies has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows.
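The contrast between row-wise and field-wise access described in this abstract can be illustrated with a toy sketch. It is not the VCF Zarr encoding and does not use the Zarr library: plain Python lists stand in for storage, and every name (`rows`, `to_columns`, `mean_qual_field_wise`, and so on) as well as the data values are hypothetical, chosen only for illustration.

```python
# Toy illustration only: plain Python lists stand in for on-disk storage.
# None of these names or values come from the VCF Zarr specification.

# Row-wise layout: one record per variant, with all of its fields kept together.
rows = [
    {"pos": 101, "qual": 50.0, "genotypes": [0, 1, 1]},
    {"pos": 205, "qual": 12.5, "genotypes": [0, 0, 1]},
    {"pos": 310, "qual": 99.0, "genotypes": [1, 1, 1]},
]


def to_columns(records):
    """Regroup row-wise records so that each field becomes its own array."""
    return {field: [rec[field] for rec in records] for field in records[0]}


columns = to_columns(rows)


def mean_qual_row_wise(records):
    """Walks every record to pull out a single field from each one."""
    return sum(rec["qual"] for rec in records) / len(records)


def mean_qual_field_wise(cols):
    """Reads only the 'qual' array; the other fields are never touched."""
    return sum(cols["qual"]) / len(cols["qual"])


def sample_genotypes(cols, sample_index):
    """Per-sample access: one column of the variants-by-samples matrix."""
    return [gts[sample_index] for gts in cols["genotypes"]]
```

Both mean functions return the same value; the difference is how much unrelated data each has to pass over, which is the inefficiency the abstract attributes to row-wise storage at biobank scale.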
Affiliation(s)
- Eric Czech
  - Open Athena AI Foundation, Lincoln, New Zealand
  - Related Sciences, Lincoln, New Zealand
- Timothy R. Millar
  - The New Zealand Institute for Plant & Food Research Ltd, Lincoln, New Zealand
  - Department of Biochemistry, School of Biomedical Sciences, University of Otago, Dunedin, New Zealand
- Tom White
  - Tom White Consulting Ltd., Manchester, UK
- Jérémy Guez
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
- Ben Jeffery
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
- Konrad J. Karczewski
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
  - Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Alistair Miles
  - Wellcome Sanger Institute, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Sam Tallman
  - Genomics England, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Per Unneberg
  - Department of Cell and Molecular Biology, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Shadi Zabad
  - School of Computer Science, McGill University, Montreal, QC, Canada
- Jeff Hammerbacher
  - Open Athena AI Foundation, Lincoln, New Zealand
  - Related Sciences, Lincoln, New Zealand
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
4
Adhisantoso YG, Körner T, Müntefering F, Ostermann J, Voges J. HiCMC: High-Efficiency Contact Matrix Compressor. BMC Bioinformatics 2024; 25:296. [PMID: 39256681 PMCID: PMC11389233 DOI: 10.1186/s12859-024-05907-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2024] [Accepted: 08/20/2024] [Indexed: 09/12/2024] Open
Abstract
BACKGROUND: Chromosome organization plays an important role in biological processes such as replication, regulation, and transcription. One way to study the relationship between chromosome structure and its biological functions is through Hi-C studies, a genome-wide method for capturing chromosome conformation. Such studies generate vast amounts of data. The problem is exacerbated by the fact that chromosome organization is dynamic, requiring snapshots at different points in time, further increasing the amount of data to be stored. We present a novel approach called the High-Efficiency Contact Matrix Compressor (HiCMC) for efficient compression of Hi-C data. RESULTS: By modeling the underlying structures found in the contact matrix, such as compartments and domains, HiCMC outperforms the state-of-the-art method CMC by approximately 8% and the other state-of-the-art methods cooler, LZMA, and bzip2 by over 50% across multiple cell lines and contact matrix resolutions. In addition, HiCMC integrates domain-specific information into the compressed bitstreams that it generates, and this information can be used to speed up downstream analyses. CONCLUSION: HiCMC is a novel compression approach that utilizes intrinsic properties of contact matrix, such as compartments and domains. It allows for a better compression in comparison to the state-of-the-art methods. HiCMC is available at https://github.com/sXperfect/hicmc.
Affiliation(s)
- Yeremia Gunawan Adhisantoso
  - Institut für Informationsverarbeitung and L3S Research Center, Leibniz University Hannover, Hannover, Germany
- Tim Körner
  - Institut für Informationsverarbeitung and L3S Research Center, Leibniz University Hannover, Hannover, Germany
- Fabian Müntefering
  - Institut für Informationsverarbeitung and L3S Research Center, Leibniz University Hannover, Hannover, Germany
- Jörn Ostermann
  - Institut für Informationsverarbeitung and L3S Research Center, Leibniz University Hannover, Hannover, Germany
- Jan Voges
  - CIMA University of Navarra, Pamplona, Spain
  - IdiSNA, Pamplona, Spain
5
Bergström A. Improving data archiving practices in ancient genomics. Sci Data 2024; 11:754. [PMID: 38987254 PMCID: PMC11236975 DOI: 10.1038/s41597-024-03563-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/19/2024] [Accepted: 06/21/2024] [Indexed: 07/12/2024] Open
Abstract
Ancient DNA is producing a rich record of past genetic diversity in humans and other species. However, unless the primary data is appropriately archived, its long-term value will not be fully realised. I surveyed publicly archived data from 42 recent ancient genomics studies. Half of the studies archived incomplete datasets, preventing accurate replication and representing a loss of data of potential future use. No studies met all criteria that could be considered best practice. Based on these results, I make six recommendations for data producers: (1) archive all sequencing reads, not just those that aligned to a reference genome, (2) archive read alignments too, but as secondary analysis files, (3) provide correct experiment metadata on samples, libraries and sequencing runs, (4) provide informative sample metadata, (5) archive data from low-coverage and negative experiments, and (6) document archiving choices in papers, and peer review these. Given the reliance on destructive sampling of finite material, ancient genomics studies have a particularly strong responsibility to ensure the longevity and reusability of generated data.
Affiliation(s)
- Anders Bergström
  - School of Biological Sciences, University of East Anglia, Norwich, UK
6
Wertenbroek R, Hofmeister RJ, Xenarios I, Thoma Y, Delaneau O. Improving population scale statistical phasing with whole-genome sequencing data. PLoS Genet 2024; 20:e1011092. [PMID: 38959269 PMCID: PMC11251608 DOI: 10.1371/journal.pgen.1011092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2023] [Revised: 07/16/2024] [Accepted: 06/11/2024] [Indexed: 07/05/2024] Open
Abstract
Haplotype estimation, or phasing, has gained significant traction in large-scale projects due to its valuable contributions to population genetics, variant analysis, and the creation of reference panels for imputation and phasing of new samples. To scale with the growing number of samples, haplotype estimation methods designed for population scale rely on highly optimized statistical models to phase genotype data, and usually ignore read-level information. Statistical methods excel in resolving common variants, however, they still struggle at rare variants due to the lack of statistical information. In this study we introduce SAPPHIRE, a new method that leverages whole-genome sequencing data to enhance the precision of haplotype calls produced by statistical phasing. SAPPHIRE achieves this by refining haplotype estimates through the realignment of sequencing reads, particularly targeting low-confidence phase calls. Our findings demonstrate that SAPPHIRE significantly enhances the accuracy of haplotypes obtained from state of the art methods and also provides the subset of phase calls that are validated by sequencing reads. Finally, we show that our method scales to large data sets by its successful application to the extensive 3.6 Petabytes of sequencing data of the last UK Biobank 200,031 sample release.
Affiliation(s)
- Rick Wertenbroek
  - University of Lausanne, Lausanne, Vaud, Switzerland
  - School of Engineering and Management Vaud (HEIG-VD), HES-SO University of Applied Sciences and Arts Western Switzerland, Yverdon-les-Bains, Vaud, Switzerland
- Yann Thoma
  - School of Engineering and Management Vaud (HEIG-VD), HES-SO University of Applied Sciences and Arts Western Switzerland, Yverdon-les-Bains, Vaud, Switzerland
- Olivier Delaneau
  - Regeneron Genetics Center, Tarrytown, New York, United States of America
7
Müntefering F, Adhisantoso YG, Chandak S, Ostermann J, Hernaez M, Voges J. Genie: the first open-source ISO/IEC encoder for genomic data. Commun Biol 2024; 7:553. [PMID: 38724695 PMCID: PMC11082222 DOI: 10.1038/s42003-024-06249-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2023] [Accepted: 04/26/2024] [Indexed: 05/12/2024] Open
Abstract
For the last two decades, the amount of genomic data produced by scientific and medical applications has been growing at a rapid pace. To enable software solutions that analyze, process, and transmit these data in an efficient and interoperable way, ISO and IEC released the first version of the compression standard MPEG-G in 2019. However, non-proprietary implementations of the standard are not openly available so far, limiting fair scientific assessment of the standard and, therefore, hindering its broad adoption. In this paper, we present Genie, to the best of our knowledge the first open-source encoder that compresses genomic data according to the MPEG-G standard. We demonstrate that Genie reaches state-of-the-art compression ratios while offering interoperability with any other standard-compliant decoder independent from its manufacturer. Finally, the ISO/IEC ecosystem ensures the long-term sustainability and decodability of the compressed data through the ISO/IEC-supported reference decoder.
Affiliation(s)
- Fabian Müntefering
  - Institut für Informationsverarbeitung (TNT), Leibniz University Hannover, Appelstraße 9a, Hannover, 30167, Germany
- Yeremia Gunawan Adhisantoso
  - Institut für Informationsverarbeitung (TNT), Leibniz University Hannover, Appelstraße 9a, Hannover, 30167, Germany
- Shubham Chandak
  - Department of Electrical Engineering, Stanford University, 350 Jane Stanford Way, Stanford, CA, 94305, USA
- Jörn Ostermann
  - Institut für Informationsverarbeitung (TNT), Leibniz University Hannover, Appelstraße 9a, Hannover, 30167, Germany
- Mikel Hernaez
  - Center for Applied Medical Research (CIMA), University of Navarra, Av. de Pío XII, 55, Pamplona, 31008, Navarra, Spain
- Jan Voges
  - Institut für Informationsverarbeitung (TNT), Leibniz University Hannover, Appelstraße 9a, Hannover, 30167, Germany
8
Herrick N, Walsh S. ILIAD: a suite of automated Snakemake workflows for processing genomic data for downstream applications. BMC Bioinformatics 2023; 24:424. [PMID: 37940870 PMCID: PMC10633908 DOI: 10.1186/s12859-023-05548-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2023] [Accepted: 10/27/2023] [Indexed: 11/10/2023] Open
Abstract
BACKGROUND: Processing raw genomic data for downstream applications such as imputation, association studies, and modeling requires numerous third-party bioinformatics software tools. It is highly time-consuming and resource-intensive with computational demands and storage limitations that pose significant challenges that increase cost. The use of software tools independent of one another, in a disjointed stepwise fashion, increases the difficulty and sets forth higher error rates because of fragmented job executions in alignment, variant calling, and/or build conversion complications. As sequencing data availability grows, the ability for biologists to process it using stable, automated, and reproducible workflows is paramount as it significantly reduces the time to generate clean and reliable data. RESULTS: The Iliad suite of genomic data workflows was developed to provide users with seamless file transitions from raw genomic data to a quality-controlled variant call format (VCF) file for downstream applications. Iliad benefits from the efficiency of the Snakemake best practices framework coupled with Singularity and Docker containers for repeatability, portability, and ease of installation. This feat is accomplished from the onset with download acquisitions of any raw data type (FASTQ, CRAM, IDAT) straight through to the generation of a clean merged data file that can combine any user-preferred datasets using robust programs such as BWA, Samtools, and BCFtools. Users can customize and direct their workflow with one straightforward configuration file. Iliad is compatible with Linux, MacOS, and Windows platforms and scalable from a local machine to a high-performance computing cluster. CONCLUSION: Iliad offers automated workflows with optimized time and resource management that are comparable to other workflows available but generates analysis-ready VCF files from the most common datatypes using a single command. The storage footprint challenge of genomic data is overcome by utilizing temporary intermediate files before the final VCF is generated. This file is ready for use in imputation, genome-wide association study (GWAS) pipelines, high-throughput population genetics studies, select gene candidate studies, and more. Iliad was developed to be portable, compatible, scalable, robust, and repeatable with a simplistic setup, so biologists that are less familiar with programming can manage their own big data with this open-source suite of workflows.
Affiliation(s)
- Noah Herrick
  - Department of Biology, Indiana University Indianapolis, 723 W. Michigan Street, Indianapolis, IN, USA
- Susan Walsh
  - Department of Biology, Indiana University Indianapolis, 723 W. Michigan Street, Indianapolis, IN, USA
9
Liu Y, Shen X, Gong Y, Liu Y, Song B, Zeng X. Sequence Alignment/Map format: a comprehensive review of approaches and applications. Brief Bioinform 2023; 24:bbad320. [PMID: 37668049 DOI: 10.1093/bib/bbad320] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/16/2023] [Revised: 08/16/2023] [Accepted: 08/18/2023] [Indexed: 09/06/2023] Open
Abstract
The Sequence Alignment/Map (SAM) format file is the text file used to record alignment information. Alignment is the core of sequencing analysis, and downstream tasks accept mapping results for further processing. Given the rapid development of the sequencing industry today, a comprehensive understanding of the SAM format and related tools is necessary to meet the challenges of data processing and analysis. This paper is devoted to retrieving knowledge in the broad field of SAM. First, the format of SAM is introduced to understand the overall process of the sequencing analysis. Then, existing work is systematically classified in accordance with generation, compression and application, and the involved SAM tools are specifically mined. Lastly, a summary and some thoughts on future directions are provided.
Affiliation(s)
- Yuansheng Liu
  - College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Xiangzhen Shen
  - College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Yongshun Gong
  - School of Software, Shandong University, 250100, Jinan, China
- Yiping Liu
  - College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Bosheng Song
  - College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Xiangxiang Zeng
  - College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
10
Florian K, Benet-Pagès A, Berner D, Teubert A, Eck S, Arnold N, Bauer P, Begemann M, Sturm M, Kleinle S, B. Haack T, Eggermann T. Quality assurance within the context of genome diagnostics (a german perspective). MED GENET-BERLIN 2023; 35:91-104. [PMID: 38840862 PMCID: PMC10842579 DOI: 10.1515/medgen-2023-2028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2024]
Abstract
The rapid and dynamic implementation of Next-Generation Sequencing (NGS)-based assays has revolutionized genetic testing, and in the near future, nearly all molecular alterations of the human genome will be diagnosable via massive parallel sequencing. While this progress will further corroborate the central role of human genetics in the multidisciplinary management of patients with genetic disorders, it must be accompanied by quality assurance measures in order to allow the safe and optimal use of knowledge ascertained from genome diagnostics. To achieve this, several valuable tools and guidelines have been developed to support the quality of genome diagnostics. In this paper, authors with experience in diverse aspects of genomic analysis summarize the current status of quality assurance in genome diagnostics, with the aim of facilitating further standardization and quality improvement in one of the core competencies of the field.
Affiliation(s)
- Kraft Florian
  - Medizinische Fakultät der RWTH Aachen, Institut für Humangenetik und Genommedizin, Aachen, Germany
- Anna Benet-Pagès
  - Institut für Neurogenomik, Helmholtz Zentrum München, Neuherberg, Germany
- Norbert Arnold
  - Universitätsklinikum Schleswig-Holstein, Zentrum für familiären Brust- und Eierstockkrebs; Klinik für Gynäkologie und Geburtshilfe, Kiel, Germany
- Matthias Begemann
  - Medizinische Fakultät der RWTH Aachen, Institut für Humangenetik und Genommedizin, Aachen, Germany
- Marc Sturm
  - Universität Tübingen, Institut für Medizinische Genetik und Angewandte Genomik, Tübingen, Germany
- Tobias B. Haack
  - Universität Tübingen, Institut für Medizinische Genetik und Angewandte Genomik, Tübingen, Germany
- Thomas Eggermann
  - Medizinische Fakultät der RWTH Aachen, Institut für Humangenetik und Genommedizin, Pauwelsstr. 30, 52074 Aachen, Germany
11
Berger B, Yu YW. Navigating bottlenecks and trade-offs in genomic data analysis. Nat Rev Genet 2023; 24:235-250. [PMID: 36476810 PMCID: PMC10204111 DOI: 10.1038/s41576-022-00551-z] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Accepted: 10/27/2022] [Indexed: 12/12/2022]
Abstract
Genome sequencing and analysis allow researchers to decode the functional information hidden in DNA sequences as well as to study cell to cell variation within a cell population. Traditionally, the primary bottleneck in genomic analysis pipelines has been the sequencing itself, which has been much more expensive than the computational analyses that follow. However, an important consequence of the continued drive to expand the throughput of sequencing platforms at lower cost is that often the analytical pipelines are struggling to keep up with the sheer amount of raw data produced. Computational cost and efficiency have thus become of ever increasing importance. Recent methodological advances, such as data sketching, accelerators and domain-specific libraries/languages, promise to address these modern computational challenges. However, despite being more efficient, these innovations come with a new set of trade-offs, both expected, such as accuracy versus memory and expense versus time, and more subtle, including the human expertise needed to use non-standard programming interfaces and set up complex infrastructure. In this Review, we discuss how to navigate these new methodological advances and their trade-offs.
Affiliation(s)
- Bonnie Berger
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Yun William Yu
  - Department of Computer and Mathematical Sciences, University of Toronto Scarborough, Toronto, Ontario, Canada
  - Tri-Campus Department of Mathematics, University of Toronto, Toronto, Ontario, Canada
12
Zhao Y, Gardner EJ, Tuke MA, Zhang H, Pietzner M, Koprulu M, Jia RY, Ruth KS, Wood AR, Beaumont RN, Tyrrell J, Jones SE, Lango Allen H, Day FR, Langenberg C, Frayling TM, Weedon MN, Perry JRB, Ong KK, Murray A. Detection and characterization of male sex chromosome abnormalities in the UK Biobank study. Genet Med 2022; 24:1909-1919. [PMID: 35687092 DOI: 10.1016/j.gim.2022.05.011] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 01/31/2022] [Revised: 05/15/2022] [Accepted: 05/16/2022] [Indexed: 11/21/2022] Open
Abstract
PURPOSE: The study aimed to systematically ascertain male sex chromosome abnormalities, 47,XXY (Klinefelter syndrome [KS]) and 47,XYY, and characterize their risks of adverse health outcomes. METHODS: We analyzed genotyping array or exome sequence data in 207,067 men of European ancestry aged 40 to 70 years from the UK Biobank and related these to extensive routine health record data. RESULTS: Only 49 of 213 men (23%) whom we identified with KS and only 1 of 143 (0.7%) with 47,XYY had a diagnosis of abnormal karyotype in their medical records or by self-report. We observed expected associations for KS with reproductive dysfunction (late puberty: risk ratio [RR] = 2.7; childlessness: RR = 4.2; testosterone concentration: -3.8 nmol/L; all P < 2 × 10⁻⁸), whereas XYY men appeared to have normal reproductive function. Despite this difference, we identified several higher disease risks shared across both KS and 47,XYY, including type 2 diabetes (RR = 3.0 and 2.6, respectively), venous thrombosis (RR = 6.4 and 7.4, respectively), pulmonary embolism (RR = 3.3 and 3.7, respectively), and chronic obstructive pulmonary disease (RR = 4.4 and 4.6, respectively) (all P < 7 × 10⁻⁶). CONCLUSION: KS and 47,XYY were mostly unrecognized but conferred substantially higher risks for metabolic, vascular, and respiratory diseases, which were only partially explained by higher levels of body mass index, deprivation, and smoking.
Affiliation(s)
- Yajie Zhao
- MRC Epidemiology Unit, Institute of Metabolic Science, School of Clinical Medicine, Cambridge University, Cambridge, United Kingdom
- Eugene J Gardner
- MRC Epidemiology Unit, Institute of Metabolic Science, School of Clinical Medicine, Cambridge University, Cambridge, United Kingdom
- Marcus A Tuke
- Genetics of Complex Traits, University of Exeter Medical School, University of Exeter, Royal Devon & Exeter Hospital, Exeter, United Kingdom
- Huairen Zhang
- Genetics of Complex Traits, University of Exeter Medical School, University of Exeter, Royal Devon & Exeter Hospital, Exeter, United Kingdom
- Maik Pietzner
- MRC Epidemiology Unit, Institute of Metabolic Science, School of Clinical Medicine, Cambridge University, Cambridge, United Kingdom; Computational Medicine, Berlin Institute of Health (BIH) at Charité, Universitätsmedizin Berlin, Berlin, Germany
- Mine Koprulu
- MRC Epidemiology Unit, Institute of Metabolic Science, School of Clinical Medicine, Cambridge University, Cambridge, United Kingdom
- Raina Y Jia
- MRC Epidemiology Unit, Institute of Metabolic Science, School of Clinical Medicine, Cambridge University, Cambridge, United Kingdom
- Katherine S Ruth
- Genetics of Complex Traits, University of Exeter Medical School, University of Exeter, Royal Devon & Exeter Hospital, Exeter, United Kingdom
- Andrew R Wood
- Genetics of Complex Traits, University of Exeter Medical School, University of Exeter, Royal Devon & Exeter Hospital, Exeter, United Kingdom
- Robin N Beaumont
- Genetics of Complex Traits, University of Exeter Medical School, University of Exeter, Royal Devon & Exeter Hospital, Exeter, United Kingdom
- Jessica Tyrrell
- Genetics of Complex Traits, University of Exeter Medical School, University of Exeter, Royal Devon & Exeter Hospital, Exeter, United Kingdom
- Samuel E Jones
- Genetics of Complex Traits, University of Exeter Medical School, University of Exeter, Royal Devon & Exeter Hospital, Exeter, United Kingdom; Institute for Molecular Medicine Finland (FIMM), Helsinki Institute of Life Science (HiLIFE), University of Helsinki, Helsinki, Finland
- Hana Lango Allen
- MRC Epidemiology Unit, Institute of Metabolic Science, School of Clinical Medicine, Cambridge University, Cambridge, United Kingdom
- Felix R Day
- MRC Epidemiology Unit, Institute of Metabolic Science, School of Clinical Medicine, Cambridge University, Cambridge, United Kingdom
- Claudia Langenberg
- MRC Epidemiology Unit, Institute of Metabolic Science, School of Clinical Medicine, Cambridge University, Cambridge, United Kingdom; Computational Medicine, Berlin Institute of Health (BIH) at Charité, Universitätsmedizin Berlin, Berlin, Germany
- Timothy M Frayling
- Genetics of Complex Traits, University of Exeter Medical School, University of Exeter, Royal Devon & Exeter Hospital, Exeter, United Kingdom
- Michael N Weedon
- Genetics of Complex Traits, University of Exeter Medical School, University of Exeter, Royal Devon & Exeter Hospital, Exeter, United Kingdom
- John R B Perry
- MRC Epidemiology Unit, Institute of Metabolic Science, School of Clinical Medicine, Cambridge University, Cambridge, United Kingdom
- Ken K Ong
- MRC Epidemiology Unit, Institute of Metabolic Science, School of Clinical Medicine, Cambridge University, Cambridge, United Kingdom
- Anna Murray
- Genetics of Complex Traits, University of Exeter Medical School, University of Exeter, Royal Devon & Exeter Hospital, Exeter, United Kingdom