1
|
Javadzadeh S, Adamson A, Park J, Jo SY, Ding YC, Bakhtiari M, Bansal V, Neuhausen SL, Bafna V. Analysis of targeted and whole genome sequencing of PacBio HiFi reads for a comprehensive genotyping of gene-proximal and phenotype-associated Variable Number Tandem Repeats. PLoS Comput Biol 2025; 21:e1012885. [PMID: 40193344 PMCID: PMC11975116 DOI: 10.1371/journal.pcbi.1012885] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2024] [Accepted: 02/17/2025] [Indexed: 04/09/2025] Open
Abstract
Variable Number Tandem repeats (VNTRs) refer to repeating motifs of size greater than five bp. VNTRs are an important source of genetic variation, and have been associated with multiple Mendelian and complex phenotypes. However, the highly repetitive structures require reads to span the region for accurate genotyping. Pacific Biosciences HiFi sequencing spans large regions and is highly accurate but relatively expensive. Therefore, targeted sequencing approaches coupled with long-read sequencing have been proposed to improve efficiency and throughput. In this paper, we systematically explored the trade-off between targeted and whole genome HiFi sequencing for genotyping VNTRs. We curated a set of 10 , 787 gene-proximal (G-)VNTRs, and 48 phenotype-associated (P-)VNTRs of interest. Illumina reads only spanned 46% of the G-VNTRs and 71% of P-VNTRs, motivating the use of HiFi sequencing. We performed targeted sequencing with hybridization by designing custom probes for 9,999 VNTRs and sequenced 8 samples using HiFi and Illumina sequencing, followed by adVNTR genotyping. We compared these results against HiFi whole genome sequencing (WGS) data from 28 samples in the Human Pangenome Reference Consortium (HPRC). With the targeted approach only 4,091 (41%) G-VNTRs and only 4 (8%) of P-VNTRs were spanned with at least 15 reads. A smaller subset of 3,579 (36%) G-VNTRs had higher median coverage of at least 63 spanning reads. The spanning behavior was consistent across all 8 samples. Among 5,638 VNTRs with low-coverage ( < 15), 67% were located within GC-rich regions ( > 60%). In contrast, the 40X WGS HiFi dataset spanned 98% of all VNTRs and 49 (98%) of P-VNTRs with at least 15 spanning reads, albeit with lower coverage. Spanning reads were sufficient for accurate genotyping in both cases. Our findings demonstrate that targeted sequencing provides consistently high coverage for a small subset of low-GC VNTRs, but WGS is more effective for broad and sufficient sampling of a large number of VNTRs.
Collapse
Affiliation(s)
- Sara Javadzadeh
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, California, United States of America
| | - Aaron Adamson
- Department of Population Sciences, Beckman Research Institute of City of Hope, Duarte, California, United States of America
| | - Jonghun Park
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, California, United States of America
| | - Se-Young Jo
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, California, United States of America
- Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Seoul, South Korea
| | - Yuan-Chun Ding
- Department of Population Sciences, Beckman Research Institute of City of Hope, Duarte, California, United States of America
| | - Mehrdad Bakhtiari
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, California, United States of America
| | - Vikas Bansal
- School of Medicine, University of California, San Diego La Jolla, California, United States of America
| | - Susan L. Neuhausen
- Department of Population Sciences, Beckman Research Institute of City of Hope, Duarte, California, United States of America
| | - Vineet Bafna
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, California, United States of America
| |
Collapse
|
2
|
English AC, Dolzhenko E, Ziaei Jam H, McKenzie SK, Olson ND, De Coster W, Park J, Gu B, Wagner J, Eberle MA, Gymrek M, Chaisson MJP, Zook JM, Sedlazeck FJ. Analysis and benchmarking of small and large genomic variants across tandem repeats. Nat Biotechnol 2025; 43:431-442. [PMID: 38671154 PMCID: PMC11952744 DOI: 10.1038/s41587-024-02225-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2023] [Accepted: 03/28/2024] [Indexed: 04/28/2024]
Abstract
Tandem repeats (TRs) are highly polymorphic in the human genome, have thousands of associated molecular traits and are linked to over 60 disease phenotypes. However, they are often excluded from at-scale studies because of challenges with variant calling and representation, as well as a lack of a genome-wide standard. Here, to promote the development of TR methods, we created a catalog of TR regions and explored TR properties across 86 haplotype-resolved long-read human assemblies. We curated variants from the Genome in a Bottle (GIAB) HG002 individual to create a TR dataset to benchmark existing and future TR analysis methods. We also present an improved variant comparison method that handles variants greater than 4 bp in length and varying allelic representation. The 8.1% of the genome covered by the TR catalog holds ~24.9% of variants per individual, including 124,728 small and 17,988 large variants for the GIAB HG002 'truth-set' TR benchmark. We demonstrate the utility of this pipeline across short-read and long-read technologies.
Collapse
Affiliation(s)
- Adam C English
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
| | | | - Helyaneh Ziaei Jam
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA
| | | | - Nathan D Olson
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Wouter De Coster
- Applied and Translational Neurogenomics Group, VIB Center for Molecular Neurology, VIB, Antwerp, Belgium
- Applied and Translational Neurogenomics Group, Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium
| | - Jonghun Park
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA
| | - Bida Gu
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | | | - Melissa Gymrek
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA
- Department of Medicine, University of California, San Diego, La Jolla, CA, USA
| | - Mark J P Chaisson
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Justin M Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA.
- Department of Computer Science, Rice University, Houston, TX, USA.
| |
Collapse
|
3
|
Ziaei Jam H, Zook JM, Javadzadeh S, Park J, Sehgal A, Gymrek M. LongTR: genome-wide profiling of genetic variation at tandem repeats from long reads. Genome Biol 2024; 25:176. [PMID: 38965568 PMCID: PMC11229021 DOI: 10.1186/s13059-024-03319-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2024] [Accepted: 06/21/2024] [Indexed: 07/06/2024] Open
Abstract
Tandem repeats are frequent across the human genome, and variation in repeat length has been linked to a variety of traits. Recent improvements in long read sequencing technologies have the potential to greatly improve tandem repeat analysis, especially for long or complex repeats. Here, we introduce LongTR, which accurately genotypes tandem repeats from high-fidelity long reads available from both PacBio and Oxford Nanopore Technologies. LongTR is freely available at https://github.com/gymrek-lab/longtr and https://zenodo.org/doi/10.5281/zenodo.11403979 .
Collapse
Affiliation(s)
- Helyaneh Ziaei Jam
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Justin M Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, Gaithersburg, MD, USA
| | - Sara Javadzadeh
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Jonghun Park
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Aarushi Sehgal
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Melissa Gymrek
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA.
- Department of Medicine, University of California San Diego, La Jolla, CA, USA.
| |
Collapse
|
4
|
Tanudisastro HA, Deveson IW, Dashnow H, MacArthur DG. Sequencing and characterizing short tandem repeats in the human genome. Nat Rev Genet 2024; 25:460-475. [PMID: 38366034 DOI: 10.1038/s41576-024-00692-3] [Citation(s) in RCA: 34] [Impact Index Per Article: 34.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/06/2023] [Indexed: 02/18/2024]
Abstract
Short tandem repeats (STRs) are highly polymorphic sequences throughout the human genome that are composed of repeated copies of a 1-6-bp motif. Over 1 million variable STR loci are known, some of which regulate gene expression and influence complex traits, such as height. Moreover, variants in at least 60 STR loci cause genetic disorders, including Huntington disease and fragile X syndrome. Accurately identifying and genotyping STR variants is challenging, in particular mapping short reads to repetitive regions and inferring expanded repeat lengths. Recent advances in sequencing technology and computational tools for STR genotyping from sequencing data promise to help overcome this challenge and solve genetically unresolved cases and the 'missing heritability' of polygenic traits. Here, we compare STR genotyping methods, analytical tools and their applications to understand the effect of STR variation on health and disease. We identify emergent opportunities to refine genotyping and quality-control approaches as well as to integrate STRs into variant-calling workflows and large cohort analyses.
Collapse
Affiliation(s)
- Hope A Tanudisastro
- Centre for Population Genomics, Garvan Institute of Medical Research, Sydney, New South Wales, Australia
- Centre for Population Genomics, Murdoch Children's Research Institute, Melbourne, Victoria, Australia
- Faculty of Medicine and Health, University of New South Wales, Sydney, New South Wales, Australia
- Faculty of Medicine and Health, University of Sydney, Sydney, New South Wales, Australia
| | - Ira W Deveson
- Faculty of Medicine and Health, University of New South Wales, Sydney, New South Wales, Australia
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales, Australia
| | - Harriet Dashnow
- Department of Human Genetics, University of Utah, Salt Lake City, UT, USA.
| | - Daniel G MacArthur
- Centre for Population Genomics, Garvan Institute of Medical Research, Sydney, New South Wales, Australia.
- Centre for Population Genomics, Murdoch Children's Research Institute, Melbourne, Victoria, Australia.
- Faculty of Medicine and Health, University of New South Wales, Sydney, New South Wales, Australia.
| |
Collapse
|
5
|
Chaisson MJP, Sulovari A, Valdmanis PN, Miller DE, Eichler EE. Advances in the discovery and analyses of human tandem repeats. Emerg Top Life Sci 2023; 7:361-381. [PMID: 37905568 PMCID: PMC10806765 DOI: 10.1042/etls20230074] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2023] [Revised: 10/18/2023] [Accepted: 10/18/2023] [Indexed: 11/02/2023]
Abstract
Long-read sequencing platforms provide unparalleled access to the structure and composition of all classes of tandemly repeated DNA from STRs to satellite arrays. This review summarizes our current understanding of their organization within the human genome, their importance with respect to disease, as well as the advances and challenges in understanding their genetic diversity and functional effects. Novel computational methods are being developed to visualize and associate these complex patterns of human variation with disease, expression, and epigenetic differences. We predict accurate characterization of this repeat-rich form of human variation will become increasingly relevant to both basic and clinical human genetics.
Collapse
Affiliation(s)
- Mark J P Chaisson
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, U.S.A
- The Genomic and Epigenomic Regulation Program, USC Norris Cancer Center, University of Southern California, Los Angeles, CA 90089, U.S.A
| | - Arvis Sulovari
- Computational Biology, Cajal Neuroscience Inc, Seattle, WA 98102, U.S.A
| | - Paul N Valdmanis
- Division of Medical Genetics, Department of Medicine, University of Washington School of Medicine, Seattle, WA 98195, U.S.A
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, U.S.A
- Department of Laboratory Medicine and Pathology, University of Washington, Seattle, WA 98195, U.S.A
| | - Danny E Miller
- Department of Laboratory Medicine and Pathology, University of Washington, Seattle, WA 98195, U.S.A
- Brotman Baty Institute for Precision Medicine, University of Washington, Seattle, WA 98195, U.S.A
- Department of Pediatrics, University of Washington, Seattle, WA 98195, U.S.A
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, U.S.A
- Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, U.S.A
| |
Collapse
|
6
|
English A, Dolzhenko E, Jam HZ, Mckenzie S, Olson ND, De Coster W, Park J, Gu B, Wagner J, Eberle MA, Gymrek M, Chaisson MJP, Zook JM, Sedlazeck FJ. Benchmarking of small and large variants across tandem repeats. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.10.29.564632. [PMID: 37961319 PMCID: PMC10634962 DOI: 10.1101/2023.10.29.564632] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/15/2023]
Abstract
Tandem repeats (TRs) are highly polymorphic in the human genome, have thousands of associated molecular traits, and are linked to over 60 disease phenotypes. However, their complexity often excludes them from at-scale studies due to challenges with variant calling, representation, and lack of a genome-wide standard. To promote TR methods development, we create a comprehensive catalog of TR regions and explore its properties across 86 samples. We then curate variants from the GIAB HG002 individual to create a tandem repeat benchmark. We also present a variant comparison method that handles small and large alleles and varying allelic representation. The 8.1% of the genome covered by the TR catalog holds ∼24.9% of variants per individual, including 124,728 small and 17,988 large variants for the GIAB HG002 TR benchmark. We work with the GIAB community to demonstrate the utility of this benchmark across short and long read technologies.
Collapse
|
7
|
Ziaei Jam H, Li Y, DeVito R, Mousavi N, Ma N, Lujumba I, Adam Y, Maksimov M, Huang B, Dolzhenko E, Qiu Y, Kakembo FE, Joseph H, Onyido B, Adeyemi J, Bakhtiari M, Park J, Javadzadeh S, Jjingo D, Adebiyi E, Bafna V, Gymrek M. A deep population reference panel of tandem repeat variation. Nat Commun 2023; 14:6711. [PMID: 37872149 PMCID: PMC10593948 DOI: 10.1038/s41467-023-42278-3] [Citation(s) in RCA: 30] [Impact Index Per Article: 15.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2023] [Accepted: 10/05/2023] [Indexed: 10/25/2023] Open
Abstract
Tandem repeats (TRs) represent one of the largest sources of genetic variation in humans and are implicated in a range of phenotypes. Here we present a deep characterization of TR variation based on high coverage whole genome sequencing from 3550 diverse individuals from the 1000 Genomes Project and H3Africa cohorts. We develop a method, EnsembleTR, to integrate genotypes from four separate methods resulting in high-quality genotypes at more than 1.7 million TR loci. Our catalog reveals novel sequence features influencing TR heterozygosity, identifies population-specific trinucleotide expansions, and finds hundreds of novel eQTL signals. Finally, we generate a phased haplotype panel which can be used to impute most TRs from nearby single nucleotide polymorphisms (SNPs) with high accuracy. Overall, the TR genotypes and reference haplotype panel generated here will serve as valuable resources for future genome-wide and population-wide studies of TRs and their role in human phenotypes.
Collapse
Affiliation(s)
- Helyaneh Ziaei Jam
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Yang Li
- Department of Medicine, University of California San Diego, La Jolla, CA, USA
| | - Ross DeVito
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Nima Mousavi
- Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, CA, USA
| | - Nichole Ma
- Department of Medicine, University of California San Diego, La Jolla, CA, USA
| | - Ibra Lujumba
- The African Center of Excellence in Bioinformatics and Data Intensive Sciences, the Infectious Diseases Institute, Makerere University, Kampala, Uganda
| | - Yagoub Adam
- Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun, 112233, Nigeria
| | - Mikhail Maksimov
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Bonnie Huang
- Department of Bioengineering, University of California San Diego, La Jolla, CA, USA
| | | | - Yunjiang Qiu
- Illumina Incorporated, San Diego, CA, 92122, USA
| | - Fredrick Elishama Kakembo
- The African Center of Excellence in Bioinformatics and Data Intensive Sciences, the Infectious Diseases Institute, Makerere University, Kampala, Uganda
| | - Habi Joseph
- The African Center of Excellence in Bioinformatics and Data Intensive Sciences, the Infectious Diseases Institute, Makerere University, Kampala, Uganda
| | - Blessing Onyido
- Department of Computer & Information Sciences, Covenant University, Ota, Ogun, 112233, Nigeria
- Covenant Applied Informatics and Communication Africa Centre of Excellence (CApIC-ACE), Covenant University, Ota, Ogun, 112233, Nigeria
| | - Jumoke Adeyemi
- Department of Computer & Information Sciences, Covenant University, Ota, Ogun, 112233, Nigeria
- Covenant Applied Informatics and Communication Africa Centre of Excellence (CApIC-ACE), Covenant University, Ota, Ogun, 112233, Nigeria
| | - Mehrdad Bakhtiari
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Jonghun Park
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Sara Javadzadeh
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Daudi Jjingo
- The African Center of Excellence in Bioinformatics and Data Intensive Sciences, the Infectious Diseases Institute, Makerere University, Kampala, Uganda
- Department of Computer Science, Makerere University, Kampala, Uganda
| | - Ezekiel Adebiyi
- Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun, 112233, Nigeria
- Department of Computer & Information Sciences, Covenant University, Ota, Ogun, 112233, Nigeria
- Covenant Applied Informatics and Communication Africa Centre of Excellence (CApIC-ACE), Covenant University, Ota, Ogun, 112233, Nigeria
- Applied Bioinformatics Division, German Cancer Research Center (DKFZ), Heidelberg, Baden-Württemberg, 69120, Germany
| | - Vineet Bafna
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Melissa Gymrek
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA.
- Department of Medicine, University of California San Diego, La Jolla, CA, USA.
| |
Collapse
|