1
|
Betschart RO, Riccio C, Aguilera-Garcia D, Blankenberg S, Guo L, Moch H, Seidl D, Solleder H, Thalén F, Thiéry A, Twerenbold R, Zeller T, Zoche M, Ziegler A. Biostatistical Aspects of Whole Genome Sequencing Studies: Preprocessing and Quality Control. Biom J 2024; 66:e202300278. [PMID: 38988195 DOI: 10.1002/bimj.202300278] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2023] [Revised: 03/21/2024] [Accepted: 05/14/2024] [Indexed: 07/12/2024]
Abstract
Rapid advances in high-throughput DNA sequencing technologies have enabled large-scale whole genome sequencing (WGS) studies. Before performing association analysis between phenotypes and genotypes, preprocessing and quality control (QC) of the raw sequence data need to be performed. Because many biostatisticians have not been working with WGS data so far, we first sketch Illumina's short-read sequencing technology. Second, we explain the general preprocessing pipeline for WGS studies. Third, we provide an overview of important QC metrics, which are applied to WGS data: on the raw data, after mapping and alignment, after variant calling, and after multisample variant calling. Fourth, we illustrate the QC with the data from the GENEtic SequencIng Study Hamburg-Davos (GENESIS-HD), a study involving more than 9000 human whole genomes. All samples were sequenced on an Illumina NovaSeq 6000 with an average coverage of 35× using a PCR-free protocol. For QC, one genome in a bottle (GIAB) trio was sequenced in four replicates, and one GIAB sample was successfully sequenced 70 times in different runs. Fifth, we provide empirical data on the compression of raw data using the DRAGEN original read archive (ORA). The most important quality metrics in the application were genetic similarity, sample cross-contamination, deviations from the expected Het/Hom ratio, relatedness, and coverage. The compression ratio of the raw files using DRAGEN ORA was 5.6:1, and compression time was linear by genome coverage. In summary, the preprocessing, joint calling, and QC of large WGS studies are feasible within a reasonable time, and efficient QC procedures are readily available.
Collapse
Affiliation(s)
| | | | - Domingo Aguilera-Garcia
- Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
| | - Stefan Blankenberg
- Cardio-CARE, Medizincampus Davos, Davos, Switzerland
- Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Center for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
| | - Linlin Guo
- Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
| | - Holger Moch
- Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
| | - Dagmar Seidl
- Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
| | - Hugo Solleder
- Cardio-CARE, Medizincampus Davos, Davos, Switzerland
| | - Felix Thalén
- Cardio-CARE, Medizincampus Davos, Davos, Switzerland
| | | | - Raphael Twerenbold
- Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Center for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- German Center for Cardiovascular Research (DZHK), partner site Hamburg/Kiel/Lübeck, Hamburg, Germany
| | - Tanja Zeller
- Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Center for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- German Center for Cardiovascular Research (DZHK), partner site Hamburg/Kiel/Lübeck, Hamburg, Germany
| | - Martin Zoche
- Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
| | - Andreas Ziegler
- Cardio-CARE, Medizincampus Davos, Davos, Switzerland
- Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Center for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, South Africa
| |
Collapse
|
2
|
Draft genome sequencing data of a pathogenic Pantoea stewartii subspecies stewartii strain SQT1 causing bronzing disease of jackfruit in Malaysia. Data Brief 2020; 30:105634. [PMID: 32395592 PMCID: PMC7205738 DOI: 10.1016/j.dib.2020.105634] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2020] [Revised: 04/20/2020] [Accepted: 04/22/2020] [Indexed: 12/04/2022] Open
Abstract
A Gram-negative bacterium, Pantoea stewartii subspecies stewartii (P. stewartii subsp. stewartii) has been recognized as the causative agent for jackfruit bronzing disease in Malaysia. Here, we report the whole genome sequencing dataset of P. stewartii subsp. stewartii strain SQT1 isolated from local infected jackfruit. The paired-end libraries with an insert size of 350 bp was subjected to the Illumina Hiseq 4000, generating a genome size of 4,783,993 bp with a G+C content of 53.7%. A total protein of 4,671 was identified including virulence factors, resistance factors and secretion systems. Pantoea stewartii subsp. stewartii strain DC283 (NCBI accession no. CP017581.1) was used as a reference genome, where the query hit 72% coverage and average sequencing depth of 68. In total, 28,717 nucleotide polymorphisms, 520 small insertion/deletions and 142 structure variants were identified. The complete genome was deposited at the European Nucleotide Archive under the sample accession number ERP119356 and study accession number PRJEB36196.
Collapse
|