3
Mas-Sandoval A, Pope NS, Nielsen KN, Altinkaya I, Fumagalli M, Korneliussen TS. Fast and accurate estimation of multidimensional site frequency spectra from low-coverage high-throughput sequencing data. Gigascience 2022; 11:giac032. [PMID: 35579549 PMCID: PMC9112775 DOI: 10.1093/gigascience/giac032] [Received: 10/21/2021] [Revised: 12/16/2021] [Indexed: 11/26/2022]
Abstract
BACKGROUND: The site frequency spectrum summarizes the distribution of allele frequencies throughout the genome, and it is widely used as a summary statistic to infer demographic parameters and to detect signals of natural selection. The use of high-throughput low-coverage DNA sequencing data can lead to biased estimates of the site frequency spectrum due to high levels of uncertainty in genotyping.
RESULTS: Here we design and implement a method to efficiently and accurately estimate the multidimensional joint site frequency spectrum for large numbers of haploid or diploid individuals across an arbitrary number of populations, using low-coverage sequencing data. The method maximizes a likelihood function that represents the probability of the sequencing data observed given a multidimensional site frequency spectrum using genotype likelihoods. Notably, it uses an advanced binning heuristic paired with an accelerated expectation-maximization algorithm for a fast and memory-efficient computation, and can generate both unfolded and folded spectra and bootstrapped replicates for haploid and diploid genomes. On the basis of extensive simulations, we show that the new method requires remarkably less storage and is faster than previous implementations whilst retaining the same accuracy. When applied to low-coverage sequencing data from the fungal pathogen Neonectria neomacrospora, results recapitulate the patterns of population differentiation generated using the original high-coverage data.
CONCLUSION: The new implementation allows for accurate estimation of population genetic parameters from arbitrarily large, low-coverage datasets, thus facilitating cost-effective sequencing experiments in model and non-model organisms.
Affiliation(s)
- Alex Mas-Sandoval
  - Department of Life Sciences, Silwood Park campus, Imperial College London, SL5 7PY, Ascot, UK
- Nathaniel S Pope
  - Department of Entomology, The Pennsylvania State University, 201 Old Main, University Park, PA 16802, USA
- Knud Nor Nielsen
  - Department of Plant and Environmental Sciences, University of Copenhagen, Thorvaldsensvej 40, 1871 Frederiksberg C, Denmark
- Isin Altinkaya
  - GLOBE, Section for Geogenetics, Øster Voldgade 5-7, 1350, Copenhagen, Denmark
- Matteo Fumagalli
  - Department of Life Sciences, Silwood Park campus, Imperial College London, SL5 7PY, Ascot, UK
  - School of Biological and Behavioural Sciences, Queen Mary University of London, London, UK
4
Lou RN, Jacobs A, Wilder A, Therkildsen NO. A beginner's guide to low-coverage whole genome sequencing for population genomics. Mol Ecol 2021; 30:5966-5993. [PMID: 34250668 DOI: 10.1111/mec.16077] [Received: 12/01/2020] [Revised: 06/30/2021] [Accepted: 07/01/2021] [Indexed: 11/26/2022]
Abstract
Low-coverage whole genome sequencing (lcWGS) has emerged as a powerful and cost-effective approach for population genomic studies in both model and non-model species. However, with read depths too low to confidently call individual genotypes, lcWGS requires specialized analysis tools that explicitly account for genotype uncertainty. A growing number of such tools have become available, but it can be difficult to get an overview of what types of analyses can be performed reliably with lcWGS data, and how the distribution of sequencing effort between the number of samples analyzed and per-sample sequencing depths affects inference accuracy. In this introductory guide to lcWGS, we first illustrate how the per-sample cost for lcWGS is now comparable to RAD-seq and Pool-seq in many systems. We then provide an overview of software packages that explicitly account for genotype uncertainty in different types of population genomic inference. Next, we use both simulated and empirical data to assess the accuracy of allele frequency and genetic diversity estimation, detection of population structure, and selection scans under different sequencing strategies. Our results show that spreading a given amount of sequencing effort across more samples with lower depth per sample consistently improves the accuracy of most types of inference, with a few notable exceptions. Finally, we assess the potential for using imputation to bolster inference from lcWGS data in non-model species, and discuss current limitations and future perspectives for lcWGS-based population genomics research. With this overview, we hope to make lcWGS more approachable and stimulate its broader adoption.
Affiliation(s)
- Runyang Nicolas Lou
  - Department of Natural Resources and the Environment, Cornell University, Ithaca, NY, 14853, USA
- Arne Jacobs
  - Department of Natural Resources and the Environment, Cornell University, Ithaca, NY, 14853, USA
  - Institute of Biodiversity, Animal Health and Comparative Medicine, University of Glasgow, Glasgow, G12 8QQ, UK
- Aryn Wilder
  - San Diego Zoo Wildlife Alliance, Escondido, CA, 92027, USA
- Nina O Therkildsen
  - Department of Natural Resources and the Environment, Cornell University, Ithaca, NY, 14853, USA
5
Kwong AM, Blackwell TW, LeFaive J, de Andrade M, Barnard J, Barnes KC, Blangero J, Boerwinkle E, Burchard EG, Cade BE, Chasman DI, Chen H, Conomos MP, Cupples LA, Ellinor PT, Eng C, Gao Y, Guo X, Irvin MR, Kelly TN, Kim W, Kooperberg C, Lubitz SA, Mak ACY, Manichaikul AW, Mathias RA, Montasser ME, Montgomery CG, Musani S, Palmer ND, Peloso GM, Qiao D, Reiner AP, Roden DM, Shoemaker MB, Smith JA, Smith NL, Su JL, Tiwari HK, Weeks DE, Weiss ST, Scott LJ, Smith AV, Abecasis GR, Boehnke M, Kang HM. Robust, flexible, and scalable tests for Hardy-Weinberg equilibrium across diverse ancestries. Genetics 2021; 218:iyab044. [PMID: 33720349 PMCID: PMC8128395 DOI: 10.1093/genetics/iyab044] [Received: 11/24/2020] [Accepted: 02/03/2021] [Indexed: 11/13/2022]
Abstract
Traditional Hardy-Weinberg equilibrium (HWE) tests (the χ² test and the exact test) have long been used as a metric for evaluating genotype quality, as technical artifacts leading to incorrect genotype calls often can be identified as deviations from HWE. However, in data sets composed of individuals from diverse ancestries, HWE can be violated even without genotyping error, complicating the use of HWE testing to assess genotype data quality. In this manuscript, we present the Robust Unified Test for HWE (RUTH) to test for HWE while accounting for population structure and genotype uncertainty, and to evaluate the impact of population heterogeneity and genotype uncertainty on the standard HWE tests and alternative methods using simulated and real sequence data sets. Our results demonstrate that ignoring population structure or genotype uncertainty in HWE tests can inflate false-positive rates by many orders of magnitude. Our evaluations demonstrate different tradeoffs between false positives and statistical power across the methods, with RUTH consistently among the best across all evaluations. RUTH is implemented as a practical and scalable software tool to rapidly perform HWE tests across millions of markers and hundreds of thousands of individuals while supporting standard VCF/BCF formats. RUTH is publicly available at https://www.github.com/statgen/ruth.
Affiliation(s)
- Alan M Kwong
  - Department of Biostatistics, Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA
- Thomas W Blackwell
  - Department of Biostatistics, Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA
- Jonathon LeFaive
  - Department of Biostatistics, Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA
- John Barnard
  - Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44106, USA
- Kathleen C Barnes
  - Department of Medicine, Anschutz Medical Campus, University of Colorado, Aurora, CO 80045, USA
- John Blangero
  - Department of Human Genetics, South Texas Diabetes and Obesity Institute, University of Texas Rio Grande Valley School of Medicine, Brownsville, TX 78520, USA
- Eric Boerwinkle
  - Department of Epidemiology, Human Genetics Center, Human Genetics and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Esteban G Burchard
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA 94143, USA
  - Department of Medicine, University of California San Francisco, San Francisco, CA 94143, USA
- Brian E Cade
  - Division of Sleep and Circadian Disorders, Brigham and Women’s Hospital, Boston, MA 02115, USA
  - Division of Sleep Medicine, Harvard Medical School, Boston, MA 02115, USA
- Daniel I Chasman
  - Division of Preventive Medicine, Brigham and Women’s Hospital, Boston, MA 02215, USA
- Han Chen
  - Department of Epidemiology, Human Genetics Center, Human Genetics and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
  - Center for Precision Health, School of Public Health and School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Matthew P Conomos
  - Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
- L Adrienne Cupples
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
  - Framingham Heart Study, Framingham, MA 01702, USA
- Patrick T Ellinor
  - Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA 02114, USA
  - Cardiovascular Disease Initiative, The Broad Institute of MIT and Harvard, Cambridge, MA 02124, USA
- Celeste Eng
  - Department of Medicine, University of California San Francisco, San Francisco, CA 94143, USA
- Yan Gao
  - Department of Physiology and Biophysics, University of Mississippi Medical Center, Jackson, MS 39216, USA
- Xiuqing Guo
  - Department of Pediatrics, The Institute for Translational Genomics and Population Sciences, The Lundquist Institute at Harbor-UCLA Medical Center, Torrance, CA 90502, USA
- Marguerite Ryan Irvin
  - Department of Epidemiology, School of Public Health, University of Alabama at Birmingham, Birmingham, AL 35294, USA
- Tanika N Kelly
  - Department of Epidemiology, Tulane University, New Orleans, LA 70112, USA
- Wonji Kim
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02115, USA
- Steven A Lubitz
  - Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA 02114, USA
  - Cardiovascular Disease Initiative, The Broad Institute of MIT and Harvard, Cambridge, MA 02124, USA
- Angel C Y Mak
  - Department of Medicine, University of California San Francisco, San Francisco, CA 94143, USA
- Ani W Manichaikul
  - Department of Public Health Sciences, Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22908, USA
- Rasika A Mathias
  - GeneSTAR Research Program and Division of Allergy and Clinical Immunology, Department of Medicine, Johns Hopkins University, Baltimore, MD 21205, USA
- May E Montasser
  - Division of Endocrinology, Diabetes and Nutrition, Department of Medicine, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Courtney G Montgomery
  - Sarcoidosis Research Unit, Genes and Human Disease Research Program, Oklahoma Medical Research Foundation, Oklahoma City, OK 73104, USA
- Solomon Musani
  - Jackson Heart Study, University of Mississippi Medical Center, Jackson, MS 39216, USA
- Nicholette D Palmer
  - Department of Biochemistry, Wake Forest School of Medicine, Winston-Salem, NC 27157, USA
- Gina M Peloso
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
- Dandi Qiao
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02115, USA
- Dan M Roden
  - Departments of Medicine, Pharmacology, and Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- M Benjamin Shoemaker
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Jennifer A Smith
  - Department of Epidemiology, School of Public Health, University of Michigan, Ann Arbor, MI 48109, USA
- Nicholas L Smith
  - Department of Epidemiology, University of Washington, Seattle, WA 98195, USA
  - Kaiser Permanente Washington Health Research Institute, Kaiser Permanente Washington, Seattle, WA 98101, USA
  - Department of Veterans Affairs, Seattle Epidemiologic Research and Information Center, Office of Research and Development, Seattle, WA 98108, USA
- Jessica Lasky Su
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02115, USA
- Hemant K Tiwari
  - Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL 35294, USA
- Daniel E Weeks
  - Departments of Human Genetics and Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA
- Scott T Weiss
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02115, USA
- Laura J Scott
  - Department of Biostatistics, Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA
- Albert V Smith
  - Department of Biostatistics, Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA
- Gonçalo R Abecasis
  - Department of Biostatistics, Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA
- Michael Boehnke
  - Department of Biostatistics, Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA
- Hyun Min Kang
  - Department of Biostatistics, Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA
7
Shastry V, Adams PE, Lindtke D, Mandeville EG, Parchman TL, Gompert Z, Buerkle CA. Model-based genotype and ancestry estimation for potential hybrids with mixed-ploidy. Mol Ecol Resour 2021; 21:1434-1451. [PMID: 33482035 DOI: 10.1111/1755-0998.13330] [Received: 07/31/2020] [Revised: 12/11/2020] [Accepted: 01/11/2021] [Indexed: 11/29/2022]
Abstract
Non-random mating among individuals can lead to spatial clustering of genetically similar individuals and population stratification. This deviation from panmixia is commonly observed in natural populations. Consequently, individuals can have parentage in single populations or involving hybridization between differentiated populations. Accounting for this mixture and structure is important when mapping the genetics of traits and learning about the formative evolutionary processes that shape genetic variation among individuals and populations. Stratified genetic relatedness among individuals is commonly quantified using estimates of ancestry that are derived from a statistical model. Development of these models for polyploid and mixed-ploidy individuals and populations has lagged behind those for diploids. Here, we extend and test a hierarchical Bayesian model, called entropy, which can use low-depth sequence data to estimate genotype and ancestry parameters in autopolyploid and mixed-ploidy individuals (including sex chromosomes and autosomes within individuals). Our analysis of simulated data illustrated the trade-off between sequencing depth and genome coverage and found lower error associated with low-depth sequencing across a larger fraction of the genome than with high-depth sequencing across a smaller fraction of the genome. The model has high accuracy and sensitivity as verified with simulated data and through analysis of admixture among populations of diploid and tetraploid Arabidopsis arenosa.
Affiliation(s)
- Paula E Adams
  - Department of Biological Sciences, University of Alabama, Tuscaloosa, AL, USA
- Dorothea Lindtke
  - Institute of Plant Sciences, University of Bern, Bern, Switzerland
- C Alex Buerkle
  - Department of Botany, University of Wyoming, Laramie, WY, USA