1
Lang OW, Srivastava D, Pugh BF, Lai WKM. GenoPipe: identifying the genotype of origin within (epi)genomic datasets. Nucleic Acids Res 2023; 51:12054-12068. [PMID: 37933851 PMCID: PMC10711449 DOI: 10.1093/nar/gkad950] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/17/2023] [Revised: 09/19/2023] [Accepted: 10/11/2023] [Indexed: 11/08/2023]
Abstract
Confidence in experimental results is critical for discovery. As the scale of data generation in genomics has grown exponentially, experimental error has likely kept pace despite the best efforts of many laboratories. Technical mistakes can and do occur at nearly every stage of a genomics assay (i.e. cell line contamination, reagent swapping, tube mislabelling, etc.) and are often difficult to identify post-execution. However, the DNA sequenced in genomic experiments contains certain markers (e.g. indels) encoded within and can often be ascertained forensically from experimental datasets. We developed the Genotype validation Pipeline (GenoPipe), a suite of heuristic tools that operate together directly on raw and aligned sequencing data from individual high-throughput sequencing experiments to characterize the underlying genome of the source material. We demonstrate how GenoPipe validates and rescues erroneously annotated experiments by identifying unique markers inherent to an organism's genome (i.e. epitope insertions, gene deletions and SNPs).
Affiliation(s)
- Olivia W Lang
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853, USA
- Divyanshi Srivastava
  - Department of Biochemistry & Molecular Biology, Pennsylvania State University, University Park, PA 16801, USA
- B Franklin Pugh
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853, USA
- William K M Lai
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853, USA
  - Department of Computational Biology, Cornell University, Ithaca, NY 14850, USA
  - Cornell Institute of Biotechnology, Cornell University, Ithaca, NY 14850, USA
2
Lang O, Srivastava D, Pugh BF, Lai WK. GenoPipe: identifying the genotype of origin within (epi)genomic datasets. bioRxiv 2023:2023.03.14.532660. [PMID: 36993164 PMCID: PMC10055126 DOI: 10.1101/2023.03.14.532660] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
Confidence in experimental results is critical for discovery. As the scale of data generation in genomics has grown exponentially, experimental error has likely kept pace despite the best efforts of many laboratories. Technical mistakes can and do occur at nearly every stage of a genomics assay (i.e., cell line contamination, reagent swapping, tube mislabelling, etc.) and are often difficult to identify post-execution. However, the DNA sequenced in genomic experiments contains certain markers (e.g., indels) encoded within and can often be ascertained forensically from experimental datasets. We developed the Genotype validation Pipeline (GenoPipe), a suite of heuristic tools that operate together directly on raw and aligned sequencing data from individual high-throughput sequencing experiments to characterize the underlying genome of the source material. We demonstrate how GenoPipe validates and rescues erroneously annotated experiments by identifying unique markers inherent to an organism’s genome (i.e., epitope insertions, gene deletions, and SNPs).
3
Cherukuri PF, Soe MM, Condon DE, Bartaria S, Meis K, Gu S, Frost FG, Fricke LM, Lubieniecki KP, Lubieniecka JM, Pyatt RE, Hajek C, Boerkoel CF, Carmichael L. Establishing analytical validity of BeadChip array genotype data by comparison to whole-genome sequence and standard benchmark datasets. BMC Med Genomics 2022; 15:56. [PMID: 35287663 PMCID: PMC8919546 DOI: 10.1186/s12920-022-01199-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/08/2021] [Accepted: 02/28/2022] [Indexed: 11/10/2022]
Abstract
BACKGROUND Clinical use of genotype data requires high positive predictive value (PPV) and thorough understanding of the genotyping platform characteristics. BeadChip arrays, such as the Global Screening Array (GSA), potentially offer a high-throughput, low-cost clinical screen for known variants. We hypothesize that quality assessment and comparison to whole-genome sequence and benchmark data establish the analytical validity of GSA genotyping.
METHODS To test this hypothesis, we selected 263 samples from Coriell, generated GSA genotypes in triplicate, generated whole genome sequence (rWGS) genotypes, assessed the quality of each set of genotypes, and compared each set of genotypes to each other and to the 1000 Genomes Phase 3 (1KG) genotypes, a performance benchmark. For 59 genes (MAP59), we also performed theoretical and empirical evaluation of variants deemed medically actionable predispositions.
RESULTS Quality analyses detected sample contamination and increased assay failure along the chip margins. Comparison to benchmark data demonstrated that >82% of the GSA assays had a PPV of 1. GSA assays targeting transitions, genomic regions of high complexity, and common variants performed better than those targeting transversions, regions of low complexity, and rare variants. Comparison of GSA data to rWGS and 1KG data showed >99% performance across all measured parameters. Consistent with predictions from prior studies, the GSA detection of variation within the MAP59 genes was 3/261.
CONCLUSION We establish the analytical validity of GSA assays using quality analytics and comparison to benchmark and rWGS data. GSA assays meet the standards of a clinical screen although assays interrogating rare variants, transversions, and variants within low-complexity regions require careful evaluation.
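The abstract above reports a per-assay PPV against benchmark genotypes. As a rough illustration only, PPV for a single assay could be computed as below. This is a minimal sketch, not the authors' pipeline: the function name and the boolean encoding (True meaning the variant allele was called or is present in the benchmark) are assumptions made for the example.

```python
from typing import Optional, Sequence


def positive_predictive_value(
    called: Sequence[bool], truth: Sequence[bool]
) -> Optional[float]:
    """Return PPV = TP / (TP + FP) for one assay across samples.

    `called` and `truth` are hypothetical per-sample flags: True when the
    array (or the benchmark) reports the variant allele. Returns None when
    the assay makes no positive calls, since PPV is undefined in that case.
    """
    if len(called) != len(truth):
        raise ValueError("called and truth must have the same length")
    true_pos = sum(1 for c, t in zip(called, truth) if c and t)
    false_pos = sum(1 for c, t in zip(called, truth) if c and not t)
    if true_pos + false_pos == 0:
        return None
    return true_pos / (true_pos + false_pos)
```

A PPV of 1 for an assay, as reported for most GSA assays, would mean that every positive call on the array was also present in the benchmark data.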
Affiliation(s)
- Praveen F Cherukuri
  - Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA
  - Sanford School of Medicine, University of South Dakota, Sioux Falls, SD, USA
  - Sanford Research Center, Sioux Falls, SD, USA
- Melissa M Soe
  - Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA
- David E Condon
  - Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA
  - Sanford School of Medicine, University of South Dakota, Sioux Falls, SD, USA
- Shubhi Bartaria
  - Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA
- Kaitlynn Meis
  - Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA
- Shaopeng Gu
  - Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA
- Frederick G Frost
  - Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA
- Lindsay M Fricke
  - Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA
- Krzysztof P Lubieniecki
  - Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA
  - Sanford School of Medicine, University of South Dakota, Sioux Falls, SD, USA
  - Sanford Research Center, Sioux Falls, SD, USA
- Joanna M Lubieniecka
  - Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA
  - Sanford School of Medicine, University of South Dakota, Sioux Falls, SD, USA
  - Sanford Research Center, Sioux Falls, SD, USA
- Robert E Pyatt
  - Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA
  - Sanford School of Medicine, University of South Dakota, Sioux Falls, SD, USA
- Catherine Hajek
  - Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA
  - Sanford School of Medicine, University of South Dakota, Sioux Falls, SD, USA
- Cornelius F Boerkoel
  - Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA
- Lynn Carmichael
  - Imagenetics, Sanford Health, 1410 W 25th St. Room #302, Sioux Falls, SD, 57105, USA
4
Chan AW, Villwock SS, Williams AL, Jannink JL. Sexual dimorphism and the effect of wild introgressions on recombination in cassava (Manihot esculenta Crantz) breeding germplasm. G3 (Bethesda) 2022; 12:jkab372. [PMID: 34791172 PMCID: PMC8728042 DOI: 10.1093/g3journal/jkab372] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2019] [Accepted: 09/29/2021] [Indexed: 01/09/2023]
Abstract
Recombination has essential functions in meiosis, evolution, and breeding. The frequency and distribution of crossovers dictate the generation of new allele combinations and can vary across species and between sexes. Here, we examine recombination landscapes across the 18 chromosomes of cassava (Manihot esculenta Crantz) with respect to male and female meioses and known introgressions from the wild relative Manihot glaziovii. We used SHAPEIT2 and duoHMM to infer crossovers from genotyping-by-sequencing data and a validated multigenerational pedigree from the International Institute of Tropical Agriculture cassava breeding germplasm consisting of 7020 informative meioses. We then constructed new genetic maps and compared them to an existing map previously constructed by the International Cassava Genetic Map Consortium. We observed higher recombination rates in females compared to males, and lower recombination rates in M. glaziovii introgression segments on chromosomes 1 and 4, with suppressed recombination along the entire length of the chromosome in the case of the chromosome 4 introgression. Finally, we discuss hypothesized mechanisms underlying our observations of heterochiasmy and crossover suppression and discuss the broader implications for plant breeding.
Affiliation(s)
- Ariel W Chan
  - Section of Plant Breeding and Genetics, School of Integrative Plant Sciences, Cornell University, Ithaca, NY 14853, USA
- Seren S Villwock
  - Section of Plant Breeding and Genetics, School of Integrative Plant Sciences, Cornell University, Ithaca, NY 14853, USA
- Amy L Williams
  - Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY 14853, USA
- Jean-Luc Jannink
  - RW Holley Center for Agriculture and Health, United States Department of Agriculture-Agricultural Research Service, School of Integrative Plant Sciences, Cornell University, Ithaca, NY 14853, USA
5
Wolfe MD, Chan AW, Kulakow P, Rabbi I, Jannink JL. Genomic mating in outbred species: predicting cross usefulness with additive and total genetic covariance matrices. Genetics 2021; 219:6363799. [PMID: 34740244 PMCID: PMC8570794 DOI: 10.1093/genetics/iyab122] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 05/10/2021] [Accepted: 07/13/2021] [Indexed: 11/14/2022]
Abstract
Diverse crops are both outbred and clonally propagated. Breeders typically use truncation selection of parents and invest significant time, land, and money evaluating the progeny of crosses to find exceptional genotypes. We developed and tested genomic mate selection criteria suitable for organisms of arbitrary homozygosity level where the full-sibling progeny are of direct interest as future parents and/or cultivars. We extended cross variance and covariance variance prediction to include dominance effects and predicted the multivariate selection index genetic variance of crosses based on haplotypes of proposed parents, marker effects, and recombination frequencies. We combined the predicted mean and variance into usefulness criteria for parent and variety development. We present an empirical study of cassava (Manihot esculenta), a staple tropical root crop. We assessed the potential to predict the multivariate genetic distribution (means, variances, and trait covariances) of 462 cassava families in terms of additive and total value using cross-validation. Most variance (89%) and covariance (70%) prediction accuracy estimates were greater than zero. The usefulness of crosses was accurately predicted with good correspondence between the predicted and the actual mean performance of family members breeders selected for advancement as new parents and candidate varieties. We also used a directional dominance model to quantify significant inbreeding depression for most traits. We predicted 47,083 possible crosses of 306 parents and contrasted them to those previously tested to show how mate selection can reveal the new potential within the germplasm. We enable breeders to consider the potential of crosses to produce future parents (progeny with top breeding values) and varieties (progeny with top own performance).
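The abstract above describes combining a cross's predicted mean and variance into a usefulness criterion. One common form of that idea is the predicted mean plus a selection intensity times the predicted genetic standard deviation. The sketch below assumes that form; the function name, parameter names, and the intensity value in the usage note are illustrative assumptions and are not taken from the paper.

```python
import math


def usefulness_criterion(
    pred_mean: float, pred_variance: float, selection_intensity: float
) -> float:
    """Return pred_mean + selection_intensity * sqrt(pred_variance).

    Assumes the commonly used form of the usefulness criterion: the expected
    mean of the selected fraction of a family, given the family's predicted
    mean and genetic variance. All argument names are hypothetical.
    """
    if pred_variance < 0:
        raise ValueError("pred_variance must be non-negative")
    return pred_mean + selection_intensity * math.sqrt(pred_variance)


def rank_crosses(crosses: dict, selection_intensity: float) -> list:
    """Sort cross names from highest to lowest usefulness.

    `crosses` maps a cross name to a (predicted mean, predicted variance) pair.
    """
    return sorted(
        crosses,
        key=lambda name: usefulness_criterion(*crosses[name], selection_intensity),
        reverse=True,
    )
```

For example, with a hypothetical intensity of 2.0, a cross with a lower mean but higher variance (mean 9, variance 9) would outrank one with a higher mean and low variance (mean 10, variance 1). A criterion of this kind exists to surface such cases, which ranking on the mean alone would miss.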
Affiliation(s)
- Marnin D Wolfe
  - Section on Plant Breeding and Genetics, School of Integrative Plant Sciences, Cornell University, Ithaca, NY 14850, USA
- Ariel W Chan
  - Section on Plant Breeding and Genetics, School of Integrative Plant Sciences, Cornell University, Ithaca, NY 14850, USA
- Peter Kulakow
  - International Institute of Tropical Agriculture (IITA), Ibadan, Nigeria
- Ismail Rabbi
  - International Institute of Tropical Agriculture (IITA), Ibadan, Nigeria
- Jean-Luc Jannink
  - Section on Plant Breeding and Genetics, School of Integrative Plant Sciences, Cornell University, Ithaca, NY 14850, USA
  - USDA-ARS, Ithaca, NY 14850, USA
6
Ros-Freixedes R, Whalen A, Chen CY, Gorjanc G, Herring WO, Mileham AJ, Hickey JM. Accuracy of whole-genome sequence imputation using hybrid peeling in large pedigreed livestock populations. Genet Sel Evol 2020; 52:17. [PMID: 32248811 PMCID: PMC7132992 DOI: 10.1186/s12711-020-00536-8] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 09/16/2019] [Accepted: 03/27/2020] [Indexed: 12/30/2022]
Abstract
BACKGROUND The coupling of appropriate sequencing strategies and imputation methods is critical for assembling large whole-genome sequence datasets from livestock populations for research and breeding. In this paper, we describe and validate the coupling of a sequencing strategy with the imputation method hybrid peeling in real animal breeding settings.
METHODS We used data from four pig populations of different size (18,349 to 107,815 individuals) that were widely genotyped at densities between 15,000 and 75,000 markers genome-wide. Around 2% of the individuals in each population were sequenced (most of them at 1× or 2× and 37-92 individuals per population, totalling 284, at 15-30×). We imputed whole-genome sequence data with hybrid peeling. We evaluated the imputation accuracy by removing the sequence data of the 284 individuals with high coverage, using a leave-one-out design. We simulated data that mimicked the sequencing strategy used in the real populations to quantify the factors that affected the individual-wise and variant-wise imputation accuracies using regression trees.
RESULTS Imputation accuracy was high for the majority of individuals in all four populations (median individual-wise dosage correlation: 0.97). Imputation accuracy was lower for individuals in the earliest generations of each population than for the rest, due to the lack of marker array data for themselves and their ancestors. The main factors that determined the individual-wise imputation accuracy were the genotyping status, the availability of marker array data for immediate ancestors, and the degree of connectedness to the rest of the population, but sequencing coverage of the relatives had no effect. The main factors that determined variant-wise imputation accuracy were the minor allele frequency and the number of individuals with sequencing coverage at each variant site. Results were validated with the empirical observations.
CONCLUSIONS We demonstrate that the coupling of an appropriate sequencing strategy and hybrid peeling is a powerful strategy for generating whole-genome sequence data with high accuracy in large pedigreed populations where only a small fraction of individuals (2%) had been sequenced, mostly at low coverage. This is a critical step for the successful implementation of whole-genome sequence data for genomic prediction and fine-mapping of causal variants.
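The abstract above summarizes accuracy as an individual-wise dosage correlation. A minimal sketch of such a metric follows, assuming it is the Pearson correlation between one individual's true genotypes (coded 0, 1, 2) and its imputed allele dosages across variants. The function name and the coding are assumptions for illustration, and the paper may apply corrections that are not shown here.

```python
import math
from typing import Sequence


def dosage_correlation(
    true_genotypes: Sequence[float], imputed_dosages: Sequence[float]
) -> float:
    """Pearson correlation between true genotypes and imputed dosages.

    Both sequences hold one value per variant for a single individual.
    Raises ValueError when the correlation is undefined, that is, when
    there are fewer than two variants or either sequence has zero variance.
    """
    n = len(true_genotypes)
    if n != len(imputed_dosages) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 variants")
    mean_true = sum(true_genotypes) / n
    mean_imp = sum(imputed_dosages) / n
    cov = sum(
        (t - mean_true) * (d - mean_imp)
        for t, d in zip(true_genotypes, imputed_dosages)
    )
    var_true = sum((t - mean_true) ** 2 for t in true_genotypes)
    var_imp = sum((d - mean_imp) ** 2 for d in imputed_dosages)
    if var_true == 0 or var_imp == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_true * var_imp)
```

In a leave-one-out design like the one described, this would be evaluated once per high-coverage individual after masking that individual's sequence data, and the per-individual values would then be summarized, for example by their median.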
Affiliation(s)
- Roger Ros-Freixedes
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
  - Departament de Ciència Animal, Universitat de Lleida-Agrotecnio Center, Lleida, Spain
- Andrew Whalen
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
- Ching-Yi Chen
  - The Pig Improvement Company, Genus plc, 100 Bluegrass Commons Blvd Ste 2200, Hendersonville, TN 37075 USA
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
- William O. Herring
  - The Pig Improvement Company, Genus plc, 100 Bluegrass Commons Blvd Ste 2200, Hendersonville, TN 37075 USA
- John M. Hickey
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK