1
Peng D, Mulder OJ, Edge MD. Evaluating ARG-estimation methods in the context of estimating population-mean polygenic score histories. Genetics 2025; 229:iyaf033. [PMID: 40048614 PMCID: PMC12005257 DOI: 10.1093/genetics/iyaf033] [Received: 01/07/2025] [Revised: 02/12/2025] [Accepted: 02/15/2025] [Indexed: 03/12/2025] Open
Abstract
Scalable methods for estimating marginal coalescent trees across the genome present new opportunities for studying evolution and have generated considerable excitement, with new methods extending scalability to thousands of samples. Benchmarking of the available methods has revealed general tradeoffs between accuracy and scalability, but performance in downstream applications has not always been easily predictable from general performance measures, suggesting that specific features of the ancestral recombination graph (ARG) may be important for specific downstream applications of estimated ARGs. To exemplify this point, we benchmark ARG estimation methods with respect to a specific set of methods for estimating the historical time course of a population-mean polygenic score (PGS) using the marginal coalescent trees encoded by the ARG. Here, we examine the performance in simulation of seven ARG estimation methods: ARGweaver, RENT+, Relate, tsinfer+tsdate, ARG-Needle, ASMC-clust, and SINGER, using their estimated coalescent trees and examining bias, mean squared error, confidence interval coverage, and Type I and II error rates of the downstream methods. Although it does not scale to the sample sizes attainable by other new methods, SINGER produced the most accurate estimated PGS histories in many instances, even when Relate, tsinfer+tsdate, ARG-Needle, and ASMC-clust used samples 10 or more times as large as those used by SINGER. In general, the best choice of method depends on the number of samples available and the historical time period of interest. In particular, the unprecedented sample sizes allowed by Relate, tsinfer+tsdate, ARG-Needle, and ASMC-clust are of greatest importance when the recent past is of interest; further back in time, most of the tree has coalesced, and differences in contemporary sample size are less salient.
Affiliation(s)
- Dandan Peng
  - Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA 90098, USA
- Obadiah J Mulder
  - Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA 90098, USA
- Michael D Edge
  - Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA 90098, USA
2
Hoscheit P, Desbiez C. Phylodynamics and phylogeography of watermelon mosaic virus: Multiple local invasion routes in southern France and recombination-driven limits to global analysis. Infect Genet Evol 2025; 129:105732. [PMID: 40020892 DOI: 10.1016/j.meegid.2025.105732] [Received: 12/10/2024] [Revised: 02/22/2025] [Accepted: 02/25/2025] [Indexed: 03/03/2025]
Abstract
Watermelon mosaic virus (WMV) is a major plant pathogen, infecting over 170 plant species, including cucurbits and legumes. Though mostly propagated locally by aphids in a non-persistent manner, long-range dispersal can occur through human-induced plant or vector movements. Understanding patterns of local and global spread of WMV is crucial to help formulate adequate control strategies. We used phylodynamic methods based on partial and whole-genome sequences collected in France between 2000 and 2017 to reconstruct the introduction of new lineages in the past 30 years and their subsequent diffusion in the country. We identified at least 11 different introduction events, hailing from different parts of the global diversity of WMV, highlighting the critical role international exchanges play in the spread of plant pathogens. For three of these lineages, we estimated the time and location of their introduction in the mid-1990s in the south of France and the speed at which they spread in this specific landscape. We also showed that the highly recombinogenic nature of WMV, as with most potyviruses, makes the use of whole genomes necessary to classify these viruses on a global scale and must be taken into consideration to reconstruct viral evolutionary history. Our results demonstrate how genomic sequencing of plant viruses can help reconstruct specific viral outbreaks and understand global circulation patterns of plant pathogens.
Affiliation(s)
- Patrick Hoscheit
  - Université Paris-Saclay, INRAE, MaIAGE, 78350 Jouy-en-Josas, France
3
Fan C, Cahoon JL, Dinh BL, Ortega-Del Vecchyo D, Huber CD, Edge MD, Mancuso N, Chiang CWK. A likelihood-based framework for demographic inference from genealogical trees. Nat Genet 2025; 57:865-874. [PMID: 40113903 DOI: 10.1038/s41588-025-02129-x] [Received: 10/23/2023] [Accepted: 02/14/2025] [Indexed: 03/22/2025]
Abstract
The demographic history of a population underlies patterns of genetic variation and is encoded in the gene-genealogical trees of the sampled haplotypes. Here we propose a demographic inference framework called the genealogical likelihood (gLike). Our method uses a graph-based structure to summarize the relationships among all lineages in a gene-genealogical tree with all possible trajectories of population memberships through time and derives the full likelihood across trees under a parameterized demographic model. We show through simulations and empirical applications that for populations that have experienced multiple admixtures, gLike can accurately estimate dozens of demographic parameters, including ancestral population sizes, admixture timing and admixture proportions, and it outperforms conventional demographic inference methods using the site frequency spectrum. Taken together, our proposed gLike framework harnesses underused genealogical information to offer high sensitivity and accuracy in inferring complex demographies for humans and other species.
Affiliation(s)
- Caoqi Fan
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Jordan L Cahoon
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
  - Department of Computer Science, University of Southern California, Los Angeles, CA, USA
- Bryan L Dinh
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Diego Ortega-Del Vecchyo
  - Laboratorio Internacional de Investigación sobre el Genoma Humano, Universidad Nacional Autónoma de México, Querétaro, México
- Christian D Huber
  - Department of Biology, Penn State University, University Park, PA, USA
- Michael D Edge
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Nicholas Mancuso
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Charleston W K Chiang
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
4
Adams R, Lozano JR, Duncan M, Green J, Assis R, DeGiorgio M. A Tale of Too Many Trees: A Conundrum for Phylogenetic Regression. Mol Biol Evol 2025; 42:msaf032. [PMID: 39930867 PMCID: PMC11884811 DOI: 10.1093/molbev/msaf032] [Received: 09/03/2024] [Revised: 12/20/2024] [Accepted: 01/21/2025] [Indexed: 03/08/2025] Open
Abstract
Just exactly which tree(s) should we assume when testing evolutionary hypotheses? This question has plagued comparative biologists for decades. Though all phylogenetic comparative methods require input trees, we seldom know with certainty whether even a perfectly estimated tree (if this is possible in practice) is appropriate for our studied traits. Yet, we also know that phylogenetic conflict is ubiquitous in modern comparative biology, and we are still learning about its dangers when testing evolutionary hypotheses. Here, we investigate the consequences of tree-trait mismatch for phylogenetic regression in the presence of gene tree-species tree conflict. Our simulation experiments reveal excessively high false positive rates for mismatched models with both small and large trees, simple and complex traits, and known and estimated phylogenies. In some cases, we find evidence of a directionality of error: assuming a species tree for traits that evolved according to a gene tree sometimes fares worse than the opposite. We also explored the impacts of tree choice using an expansive, cross-species gene expression dataset as an arguably "best-case" scenario in which one may have a better chance of matching tree with trait. Offering a potential path forward, we found promise in the application of a robust estimator as a potential, albeit imperfect, solution to some issues raised by tree mismatch. Collectively, our results emphasize the importance of careful study design for comparative methods, highlighting the need to fully appreciate the role of accurate and thoughtful phylogenetic modeling.
Affiliation(s)
- Richard Adams
  - Department of Entomology and Plant Pathology, University of Arkansas, Fayetteville, AR, USA
  - Center for Agricultural Data Analytics, University of Arkansas, Fayetteville, AR, USA
- Jenniffer Roa Lozano
  - Department of Entomology and Plant Pathology, University of Arkansas, Fayetteville, AR, USA
  - Center for Agricultural Data Analytics, University of Arkansas, Fayetteville, AR, USA
- Mataya Duncan
  - Department of Entomology and Plant Pathology, University of Arkansas, Fayetteville, AR, USA
  - Center for Agricultural Data Analytics, University of Arkansas, Fayetteville, AR, USA
- Jack Green
  - Department of Entomology and Plant Pathology, University of Arkansas, Fayetteville, AR, USA
  - Center for Agricultural Data Analytics, University of Arkansas, Fayetteville, AR, USA
- Raquel Assis
  - Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA
  - Institute for Human Health and Disease Intervention, Florida Atlantic University, Boca Raton, FL, USA
- Michael DeGiorgio
  - Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA
5
Lehmann B, Lee H, Anderson-Trocmé L, Kelleher J, Gorjanc G, Ralph PL. On ARGs, pedigrees, and genetic relatedness matrices. bioRxiv 2025:2025.03.03.641310. [PMID: 40093116 PMCID: PMC11908205 DOI: 10.1101/2025.03.03.641310] [Indexed: 03/19/2025]
Abstract
Genetic relatedness is a central concept in genetics, underpinning studies of population and quantitative genetics in human, animal, and plant settings. It is typically stored as a genetic relatedness matrix (GRM), whose elements are pairwise relatedness values between individuals. This relatedness has been defined in various contexts based on pedigree, genotype, phylogeny, coalescent times, and, recently, ancestral recombination graph (ARG). ARG-based GRMs have been found to better capture the structure of a population and improve association studies relative to the genotype GRM. However, calculating GRMs and further operations with them is fundamentally challenging due to inherent quadratic time and space complexity. Here, we first discuss the different definitions of relatedness in a unifying context, making use of the additive model of a quantitative trait to provide a definition of "branch relatedness" and the corresponding "branch GRM". We explore the relationship between branch relatedness and pedigree relatedness through a case study of French-Canadian individuals that have a known pedigree. Through the tree sequence encoding of an ARG, we then derive an efficient algorithm for computing products between the branch GRM and a general vector, without explicitly forming the branch GRM. This algorithm leverages the sparse encoding of genomes with the tree sequence and hence enables large-scale computations with the branch GRM. We demonstrate the power of this algorithm by developing a randomized principal components algorithm for tree sequences that easily scales to millions of genomes. All algorithms are implemented in the open source tskit Python package. Taken together, this work consolidates the different notions of relatedness as branch relatedness and by leveraging the tree sequence encoding of an ARG it provides efficient algorithms that enable computations with the branch GRM that scale to mega-scale genomic datasets.
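The idea of working with a relatedness matrix only through matrix-vector products can be sketched roughly as follows. This is not the tree-sequence algorithm or the tskit implementation the abstract refers to: it assumes a dense toy genotype-style matrix `X` and an implicit matrix G = X Xᵀ / m as a stand-in for the branch GRM, and it uses a generic randomized subspace iteration. The names `make_matvec` and `randomized_eigh` and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def make_matvec(X):
    """Return V -> G @ V for G = X @ X.T / m without forming the n x n matrix G."""
    m = X.shape[1]

    def matvec(V):
        # Two thin products cost O(n * m) per column and never store G itself.
        return X @ (X.T @ V) / m

    return matvec


def randomized_eigh(matvec, n, k, oversample=10, n_iter=4, seed=0):
    """Approximate the top-k eigenpairs of a symmetric PSD operator from products only."""
    rng = np.random.default_rng(seed)
    # Start from a random block, then alternate "apply G" and "orthonormalize".
    Q = rng.standard_normal((n, k + oversample))
    for _ in range(n_iter + 1):
        Q, _ = np.linalg.qr(matvec(Q))
    # Project G onto the captured subspace and solve the small eigenproblem.
    B = Q.T @ matvec(Q)
    B = (B + B.T) / 2  # symmetrize against round-off
    small_vals, small_vecs = np.linalg.eigh(B)
    order = np.argsort(small_vals)[::-1][:k]
    return small_vals[order], Q @ small_vecs[:, order]


# Toy data (assumption): 60 individuals, 300 markers, three dominant axes of structure.
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 3)) @ rng.standard_normal((3, 300))
X += 0.01 * rng.standard_normal(X.shape)
X -= X.mean(axis=0)  # center each marker across individuals

evals, evecs = randomized_eigh(make_matvec(X), n=X.shape[0], k=3)
```

In this sketch the only access to G is through `matvec`, so any routine that returns G times a block of vectors could be substituted for the dense stand-in without changing `randomized_eigh`.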
Affiliation(s)
- Brieuc Lehmann
  - Department of Statistical Science, University College London, WC1E 7HB, UK
- Hanbin Lee
  - Department of Statistics, University of Michigan, Ann Arbor MI 48109, USA
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, OX3 7LF, UK
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, UK
- Peter L Ralph
  - Institute of Ecology and Evolution, University of Oregon, Eugene OR 97402, USA
  - Department of Data Science, University of Oregon, Eugene OR 97402, USA
6
Tribble CM, Márquez-Corro JI, May MR, Hipp AL, Escudero M, Zenil-Ferguson R. Macroevolutionary inference of complex modes of chromosomal speciation in a cosmopolitan plant lineage. New Phytol 2025; 245:2350-2361. [PMID: 39722216 DOI: 10.1111/nph.20353] [Received: 11/04/2024] [Accepted: 11/28/2024] [Indexed: 12/28/2024]
Abstract
The effects of single chromosome number change (dysploidy) mediating diversification remain poorly understood. Dysploidy modifies recombination rates, linkage, or reproductive isolation, especially for one-fifth of all eukaryote lineages with holocentric chromosomes. Dysploidy effects on diversification have not been estimated because modeling chromosome numbers linked to diversification with heterogeneity along phylogenies is quantitatively challenging. We propose a new state-dependent diversification model of chromosome evolution that links diversification rates to dysploidy rates considering heterogeneity and differentiates between anagenetic and cladogenetic changes. We apply this model to Carex (Cyperaceae), a cosmopolitan flowering plant clade with holocentric chromosomes. We recover two distinct modes of chromosomal evolution and speciation in Carex. In one diversification mode, dysploidy occurs frequently and drives faster diversification rates. In the other mode, dysploidy is rare, and diversification is driven by hidden, unmeasured factors. When we use a model that excludes hidden states, we mistakenly infer a strong, uniformly positive effect of dysploidy on diversification, showing that standard models may lead to confident but incorrect conclusions about diversification. This study demonstrates that dysploidy can have a significant role in speciation in a large plant clade despite the presence of other unmeasured factors that simultaneously affect diversification.
Affiliation(s)
- Carrie M Tribble
  - Department of Biology, University of Washington, Seattle, WA, 98195, USA
  - Burke Museum of Natural History and Culture, University of Washington, Seattle, WA, 98195, USA
  - School of Life Sciences, University of Hawai'i at Mānoa, Honolulu, HI, 96822, USA
- José Ignacio Márquez-Corro
  - Royal Botanic Gardens, Kew, Richmond, Surrey, TW9 3AE, UK
  - Department of Molecular Biology and Biochemistry Engineering, Universidad Pablo de Olavide, Sevilla, 41013, Spain
- Michael R May
  - Department of Evolution and Ecology, University of California Davis, Davis, CA, USA
- Andrew L Hipp
  - Herbarium and Center for Tree Science, The Morton Arboretum, Lisle, IL, 60532, USA
- Marcial Escudero
  - Department of Plant Biology and Ecology, Faculty of Biology, University of Sevilla, Sevilla, 41012, Spain
7
Bisschop G, Kelleher J, Ralph P. Likelihoods for a general class of ARGs under the SMC. bioRxiv 2025:2025.02.24.639977. [PMID: 40060524 PMCID: PMC11888268 DOI: 10.1101/2025.02.24.639977] [Indexed: 03/22/2025]
Abstract
Ancestral recombination graphs (ARGs) are the focus of much ongoing research interest. Recent progress in inference has made ARG-based approaches feasible across a range of applications, and many new methods using inferred ARGs as input have appeared. This progress on the long-standing problem of ARG inference has proceeded in two distinct directions. First, the Bayesian inference of ARGs under the Sequentially Markov Coalescent (SMC) is now practical for tens-to-hundreds of samples. Second, approximate models and heuristics can now scale to sample sizes two to three orders of magnitude larger. Although these heuristic methods are reasonably accurate under many metrics, one significant drawback is that the ARGs they estimate do not have the topological properties required to compute a likelihood under models such as the SMC under present-day formulations. In particular, heuristic inference methods typically do not estimate precise details about recombination events, which are currently required to compute a likelihood. In this paper we present a backwards-time formulation of the SMC and derive a straightforward definition of the likelihood of a general class of ARG under this model. We show that this formulation does not require precise details of recombination events to be estimated, and is robust to the presence of polytomies. We discuss the possibilities for inference that this opens.
8
Fritze H, Pope N, Kelleher J, Ralph P. A forest is more than its trees: haplotypes and ancestral recombination graphs. bioRxiv 2025:2024.11.30.626138. [PMID: 40060605 PMCID: PMC11888177 DOI: 10.1101/2024.11.30.626138] [Indexed: 03/17/2025]
Abstract
Foreshadowing haplotype-based methods of the genomics era, it is an old observation that the "junction" between two distinct haplotypes produced by recombination is inherited as a Mendelian marker. In a genealogical context, this recombination-mediated information reflects the persistence of ancestral haplotypes across local genealogical trees in which they do not represent coalescences. We show how these non-coalescing haplotypes ("locally-unary nodes") may be inserted into ancestral recombination graphs (ARGs), a compact but information-rich data structure describing the genealogical relationships among recombinant sequences. The resulting ARGs are smaller, faster to compute with, and the additional ancestral information that is inserted is nearly always correct where the initial ARG is correct. We provide efficient algorithms to infer locally-unary nodes within existing ARGs, and explore some consequences for ARGs inferred from real data. To do this, we introduce new metrics of agreement and disagreement between ARGs that, unlike previous methods, consider ARGs as describing relationships between haplotypes rather than just a collection of trees.
Affiliation(s)
- Halley Fritze
  - Department of Mathematics, University of Oregon, Eugene, Oregon
- Nathaniel Pope
  - Institute of Evolution and Ecology and Department of Biology, University of Oregon, Eugene, Oregon
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford
- Peter Ralph
  - Institute of Evolution and Ecology and Department of Biology, University of Oregon, Eugene, Oregon
  - Department of Mathematics, University of Oregon, Eugene, Oregon
  - Department of Data Science, University of Oregon, Eugene, Oregon
9
Czech E, Millar TR, Tyler W, White T, Elsworth B, Guez J, Hancox J, Jeffery B, Karczewski KJ, Miles A, Tallman S, Unneberg P, Wojdyla R, Zabad S, Hammerbacher J, Kelleher J. Analysis-ready VCF at Biobank scale using Zarr. bioRxiv 2025:2024.06.11.598241. [PMID: 38915693 PMCID: PMC11195102 DOI: 10.1101/2024.06.11.598241] [Indexed: 06/26/2024]
Abstract
Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed. Results: Zarr is a format for storing multi-dimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than standard VCF based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of three large human datasets (Genomics England: n=78,195; Our Future Health: n=651,050; All of Us: n=245,394) along with whole genome datasets for Norway Spruce (n=1,063) and SARS-CoV-2 (n=4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs. Conclusions: Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely-used, open-source technologies has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows.
Affiliation(s)
- Eric Czech
  - Open Athena AI Foundation, Lincoln, New Zealand
  - Related Sciences, Lincoln, New Zealand
- Timothy R. Millar
  - The New Zealand Institute for Plant & Food Research Ltd, Lincoln, New Zealand
  - Department of Biochemistry, School of Biomedical Sciences, University of Otago, Dunedin, New Zealand
- Tom White
  - Tom White Consulting Ltd., Manchester, UK
- Jérémy Guez
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
- Ben Jeffery
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
- Konrad J. Karczewski
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
  - Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Alistair Miles
  - Wellcome Sanger Institute, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Sam Tallman
  - Genomics England, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Per Unneberg
  - Department of Cell and Molecular Biology, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Shadi Zabad
  - School of Computer Science, McGill University, Montreal, QC, Canada
- Jeff Hammerbacher
  - Open Athena AI Foundation, Lincoln, New Zealand
  - Related Sciences, Lincoln, New Zealand
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
10
DeHaas D, Pan Z, Wei X. Enabling efficient analysis of biobank-scale data with genotype representation graphs. Nat Comput Sci 2025; 5:112-124. [PMID: 39639156 PMCID: PMC12054550 DOI: 10.1038/s43588-024-00739-9] [Received: 05/08/2024] [Accepted: 11/06/2024] [Indexed: 12/07/2024]
Abstract
Computational analysis of a large number of genomes requires a data structure that can represent the dataset compactly while also enabling efficient operations on variants and samples. However, encoding genetic data in existing tabular data structures and file formats has become costly and unsustainable. Here we introduce the genotype representation graph (GRG), a fully connected hierarchical data structure that losslessly encodes phased whole-genome polymorphisms. Exploiting variant-sharing across samples enables GRG to compress 200,000 UK Biobank phased human genomes to 5-26 gigabytes per chromosome, also enabling graph-traversal algorithms to reuse computed values in random access memory. Constructing and processing GRG files scales to a million whole genomes. Using allele frequencies and association effects as examples, we show that computation on GRG via graph traversal runs the fastest among all tested alternatives. GRG-based algorithms have the potential to increase the scalability and reduce the cost of analyzing large genomic datasets.
Affiliation(s)
- Drew DeHaas
  - Department of Computational Biology, Cornell University, Ithaca, NY, USA
- Ziqing Pan
  - Department of Computational Biology, Cornell University, Ithaca, NY, USA
- Xinzhu Wei
  - Department of Computational Biology, Cornell University, Ithaca, NY, USA
11
Tang J, Chiang CWK. A genealogy-based approach for revealing ancestry-specific structures in admixed populations. bioRxiv 2025:2025.01.10.632475. [PMID: 39868281 PMCID: PMC11761683 DOI: 10.1101/2025.01.10.632475] [Indexed: 01/28/2025]
Abstract
Elucidating ancestry-specific structures in admixed populations is crucial for comprehending population history and mitigating confounding effects in genome-wide association studies. Existing methods for elucidating the ancestry-specific structures generally rely on frequency-based estimates of the genetic relationship matrix (GRM) among admixed individuals after masking segments from ancestry components not being targeted for investigation. However, these approaches disregard linkage information between markers, potentially limiting their resolution in revealing structure within an ancestry component. We introduce ancestry-specific expected GRM (as-eGRM), a novel framework for elucidating the relatedness within ancestry components between admixed individuals. The key design of as-eGRM consists of defining ancestry-specific pairwise relatedness between individuals based on genealogical trees encoded in the Ancestral Recombination Graph (ARG) and local ancestry calls and computing the expectation of the ancestry-specific relatedness across the genome. Comprehensive evaluations using both simulated stepping-stone models of population structure and empirical datasets based on three-way admixed Latino cohorts showed that analysis based on as-eGRM robustly outperforms existing methods in revealing the structure in admixed populations with diverse demographic histories. Taken together, as-eGRM has the promise to better reveal the fine-scale structure within an ancestry component of admixed individuals, which can help improve the robustness and interpretation of findings from association studies of disease or complex traits for these understudied populations.
Affiliation(s)
- Ji Tang
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA
- Charleston W K Chiang
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA
12
Owens GL, Caseys C, Mitchell N, Hübner S, Whitney KD, Rieseberg LH. Shared Selection and Genetic Architecture Drive Strikingly Repeatable Evolution in Long-Term Experimental Hybrid Populations. Mol Biol Evol 2025; 42:msaf014. [PMID: 39835697 PMCID: PMC11783286 DOI: 10.1093/molbev/msaf014] [Received: 09/16/2024] [Revised: 11/27/2024] [Accepted: 01/09/2025] [Indexed: 01/22/2025] Open
Abstract
The degree to which evolution repeats itself has implications regarding the major forces driving evolution and the potential for evolutionary biology to be a predictive (vs. solely historical) science. To understand the factors that control evolutionary repeatability, we experimentally evolved four replicate hybrid populations of sunflowers at natural sites for up to 14 years and tracked ancestry across the genome. We found that there was very strong negative selection against introgressed ancestry in several chromosomes, but positive selection for introgressed ancestry in one chromosome. Further, the strength of selection was influenced by recombination rate. High recombination regions had lower selection against introgressed ancestry due to more frequent recombination away from incompatible backgrounds. Strikingly, evolution was highly parallel across replicates, with shared selection driving 88% of variance in introgressed allele frequency change. Parallel evolution was driven by both high levels of sustained linkage in introgressed alleles and strong selection on large-effect quantitative trait loci. This work highlights the repeatability of evolution through hybridization and confirms the central roles that natural selection, genomic architecture, and recombination play in the process.
Affiliation(s)
- Gregory L Owens
  - Department of Biology, University of Victoria, Victoria, BC, Canada
- Celine Caseys
  - Department of Plant Science, University of California, Davis, CA, USA
- Nora Mitchell
  - Department of Biology, University of Wisconsin–Eau Claire, Eau Claire, WI, USA
  - Department of Biology, University of New Mexico, Albuquerque, NM, USA
- Sariel Hübner
  - Department of Bioinformatics and Galilee Research Institute (MIGAL), Tel Hai Academic College, Tel Hai, Israel
- Kenneth D Whitney
  - Department of Biology, University of New Mexico, Albuquerque, NM, USA
- Loren H Rieseberg
  - Department of Botany and Beaty Biodiversity Centre, University of British Columbia, Vancouver, BC, Canada
13
Speidel L, Silva M, Booth T, Raffield B, Anastasiadou K, Barrington C, Götherström A, Heather P, Skoglund P. High-resolution genomic history of early medieval Europe. Nature 2025; 637:118-126. [PMID: 39743601 PMCID: PMC11693606 DOI: 10.1038/s41586-024-08275-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Accepted: 10/23/2024] [Indexed: 01/04/2025]
Abstract
Many known and unknown historical events have remained below detection thresholds of genetic studies because subtle ancestry changes are challenging to reconstruct. Methods based on shared haplotypes [1,2] and rare variants [3,4] improve power but are not explicitly temporal and have not been possible to adopt in unbiased ancestry models. Here we develop Twigstats, an approach of time-stratified ancestry analysis that can improve statistical power by an order of magnitude by focusing on coalescences in recent times, while remaining unbiased by population-specific drift. We apply this framework to 1,556 available ancient whole genomes from Europe in the historical period. We are able to model individual-level ancestry using preceding genomes to provide high resolution. During the first half of the first millennium CE, we observe at least two different streams of Scandinavian-related ancestry expanding across western, central and eastern Europe. By contrast, during the second half of the first millennium CE, ancestry patterns suggest the regional disappearance or substantial admixture of these ancestries. In Scandinavia, we document a major ancestry influx by approximately 800 CE, when a large proportion of Viking Age individuals carried ancestry from groups related to central Europe not seen in individuals from the early Iron Age. Our findings suggest that time-stratified ancestry analysis can provide a higher-resolution lens for genetic history.
Affiliation(s)
- Leo Speidel
  - Ancient Genomics Laboratory, Francis Crick Institute, London, UK
  - Genetics Institute, University College London, London, UK
  - iTHEMS, RIKEN, Wako, Japan
- Marina Silva
  - Ancient Genomics Laboratory, Francis Crick Institute, London, UK
- Thomas Booth
  - Ancient Genomics Laboratory, Francis Crick Institute, London, UK
- Ben Raffield
  - Department of Archaeology and Ancient History, Uppsala University, Uppsala, Sweden
- Anders Götherström
  - Centre for Palaeogenetics, Stockholm University, Stockholm, Sweden
  - Department of Archaeology and Classical Studies, Stockholm University, Stockholm, Sweden
- Peter Heather
  - Department of History, King's College London, London, UK
- Pontus Skoglund
  - Ancient Genomics Laboratory, Francis Crick Institute, London, UK
14
Huang Z, Kelleher J, Chan YB, Balding D. Estimating evolutionary and demographic parameters via ARG-derived IBD. PLoS Genet 2025; 21:e1011537. [PMID: 39778081 PMCID: PMC11750106 DOI: 10.1371/journal.pgen.1011537] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2024] [Revised: 01/21/2025] [Accepted: 12/11/2024] [Indexed: 01/11/2025] Open
Abstract
Inference of evolutionary and demographic parameters from a sample of genome sequences often proceeds by first inferring identical-by-descent (IBD) genome segments. By exploiting efficient data encoding based on the ancestral recombination graph (ARG), we obtain three major advantages over current approaches: (i) no need to impose a length threshold on IBD segments, (ii) IBD can be defined without the hard-to-verify requirement of no recombination, and (iii) computation time can be reduced with little loss of statistical efficiency using only the IBD segments from a set of sequence pairs that scales linearly with sample size. We first demonstrate powerful inferences when true IBD information is available from simulated data. For IBD inferred from real data, we propose an approximate Bayesian computation inference algorithm and use it to show that even poorly-inferred short IBD segments can improve estimation. Our mutation-rate estimator achieves precision similar to a previously-published method despite a 4 000-fold reduction in data used for inference, and we identify significant differences between human populations. Computational cost limits model complexity in our approach, but we are able to incorporate unknown nuisance parameters and model misspecification, still finding improved parameter inference.
Affiliation(s)
- Zhendong Huang
  - Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Victoria, Australia
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, United Kingdom
- Yao-ban Chan
  - Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Victoria, Australia
- David Balding
  - Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Victoria, Australia
15
Peng D, Mulder OJ, Edge MD. Evaluating ARG-estimation methods in the context of estimating population-mean polygenic score histories. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.05.24.595829. [PMID: 38854009 PMCID: PMC11160635 DOI: 10.1101/2024.05.24.595829] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2024]
Abstract
Scalable methods for estimating marginal coalescent trees across the genome present new opportunities for studying evolution and have generated considerable excitement, with new methods extending scalability to thousands of samples. Benchmarking of the available methods has revealed general tradeoffs between accuracy and scalability, but performance in downstream applications has not always been easily predictable from general performance measures, suggesting that specific features of the ARG may be important for specific downstream applications of estimated ARGs. To exemplify this point, we benchmark ARG estimation methods with respect to a specific set of methods for estimating the historical time course of a population-mean polygenic score (PGS) using the marginal coalescent trees encoded by the ancestral recombination graph (ARG). Here we examine the performance in simulation of seven ARG estimation methods: ARGweaver, RENT+, Relate, tsinfer+tsdate, ARG-Needle, ASMC-clust, and SINGER, using their estimated coalescent trees and examining bias, mean squared error (MSE), confidence interval coverage, and Type I and II error rates of the downstream methods. Although it does not scale to the sample sizes attainable by other new methods, SINGER produced the most accurate estimated PGS histories in many instances, even when Relate, tsinfer+tsdate, ARG-Needle and ASMC-clust used samples ten or more times as large as those used by SINGER. In general, the best choice of method depends on the number of samples available and the historical time period of interest. In particular, the unprecedented sample sizes allowed by Relate, tsinfer+tsdate, ARG-Needle, and ASMC-clust are of greatest importance when the recent past is of interest; further back in time, most of the tree has coalesced, and differences in contemporary sample size are less salient.
Affiliation(s)
- Dandan Peng
  - Department of Quantitative and Computational Biology, University of Southern California
- Obadiah J. Mulder
  - Department of Quantitative and Computational Biology, University of Southern California
- Michael D. Edge
  - Department of Quantitative and Computational Biology, University of Southern California
16
Soni V, Jensen JD. Inferring demographic and selective histories from population genomic data using a two-step approach in species with coding-sparse genomes: an application to human data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.09.19.613979. [PMID: 39605418 PMCID: PMC11601476 DOI: 10.1101/2024.09.19.613979] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2024]
Abstract
The demographic history of a population, and the distribution of fitness effects (DFE) of newly arising mutations in functional genomic regions, are fundamental factors dictating both genetic variation and evolutionary trajectories. Although both demographic and DFE inference has been performed extensively in humans, these approaches have generally either been limited to simple demographic models involving a single population, or, where a complex population history has been inferred, without accounting for the potentially confounding effects of selection at linked sites. Taking advantage of the coding-sparse nature of the genome, we propose a 2-step approach in which coalescent simulations are first used to infer a complex multi-population demographic model, utilizing large non-functional regions that are likely free from the effects of background selection. We then use forward-in-time simulations to perform DFE inference in functional regions, conditional on the complex demography inferred and utilizing expected background selection effects in the estimation procedure. Throughout, recombination and mutation rate maps were used to account for the underlying empirical rate heterogeneity across the human genome. Importantly, within this framework it is possible to utilize and fit multiple aspects of the data, and this inference scheme represents a generalized approach for such large-scale inference in species with coding-sparse genomes.
Affiliation(s)
- Vivak Soni
  - School of Life Sciences, Center for Evolution & Medicine, Arizona State University, Tempe, AZ, US
- Jeffrey D. Jensen
  - School of Life Sciences, Center for Evolution & Medicine, Arizona State University, Tempe, AZ, US
17
Soni V, Terbot JW, Versoza CJ, Pfeifer SP, Jensen JD. A whole-genome scan for evidence of recent positive and balancing selection in aye-ayes (Daubentonia madagascariensis) utilizing a well-fit evolutionary baseline model. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.11.08.622667. [PMID: 39605496 PMCID: PMC11601216 DOI: 10.1101/2024.11.08.622667] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 11/29/2024]
Abstract
The aye-aye (Daubentonia madagascariensis) is one of the 25 most endangered primate species in the world, maintaining amongst the lowest genetic diversity of any primate measured to date. Characterizing patterns of genetic variation within aye-aye populations, and the relative influences of neutral and selective processes in shaping that variation, is thus important for future conservation efforts. In this study, we performed the first whole-genome scans for recent positive and balancing selection in the species, utilizing high-coverage population genomic data from newly sequenced individuals. We generated null thresholds for our genomic scans by creating an evolutionarily appropriate baseline model that incorporates the demographic history of this aye-aye population, and identified a small number of candidate genes. Most notably, a suite of genes involved in olfaction - a key trait in these nocturnal primates - were identified as experiencing long-term balancing selection. We also conducted analyses to quantify the expected statistical power to detect positive and balancing selection in this population using site frequency spectrum-based inference methods, once accounting for the potentially confounding contributions of population history, recombination and mutation rate variation, and purifying and background selection. This work, presenting the first high-quality, genome-wide polymorphism data across the functional regions of the aye-aye genome, thus provides important insights into the landscape of episodic selective forces in this highly endangered species.
Affiliation(s)
- Vivak Soni
  - Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ, USA
- John W. Terbot
  - Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ, USA
- Cyril J. Versoza
  - Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ, USA
- Susanne P. Pfeifer
  - Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ, USA
- Jeffrey D. Jensen
  - Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ, USA
18
Aluru N, Venkataraman YR, Murray CS, DePascuale V. Gene expression and DNA methylation changes in response to hypoxia in toxicant-adapted Atlantic killifish (Fundulus heteroclitus). BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.11.01.620405. [PMID: 39554046 PMCID: PMC11565929 DOI: 10.1101/2024.11.01.620405] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2024]
Abstract
Coastal fish populations are threatened by multiple anthropogenic impacts, including the accumulation of industrial contaminants and the increasing frequency of hypoxia. Some populations of the Atlantic killifish (Fundulus heteroclitus), like those in New Bedford Harbor (NBH), Massachusetts, have evolved a resistance to dioxin-like polychlorinated biphenyls (PCBs) that may influence their ability to cope with secondary stressors. To address this question, we compared hepatic gene expression and DNA methylation patterns in response to mild or severe hypoxia in killifish from NBH and Scorton Creek (SC), a reference population from a relatively pristine environment. We hypothesized that NBH fish would show altered responses to hypoxia due to trade-offs linked to toxicant resistance. Our results revealed substantial differences between populations. SC fish demonstrated dose-dependent changes in gene expression in response to hypoxia, while NBH fish exhibited a muted transcriptional response to severe hypoxia. Interestingly, NBH fish showed significant DNA methylation changes in response to hypoxia, while SC fish did not exhibit notable epigenetic alterations. These findings suggest that toxicant-adapted killifish may face trade-offs in their molecular response to environmental stress, potentially impacting their ability to survive severe hypoxia in coastal habitats. Further research is needed to elucidate the functional implications of these epigenetic modifications and their role in adaptive stress responses.
Affiliation(s)
- Neelakanteswar Aluru
  - Biology Department, Woods Hole, Massachusetts 02543
  - Woods Hole Center for Oceans and Human Health Woods Hole Oceanographic Institution, Woods Hole, Massachusetts 02543
- Veronica DePascuale
  - Biology Department, Woods Hole, Massachusetts 02543
  - College of Arts and Sciences, Oberlin College and Conservatory, Oberlin, Ohio 44074
19
Whitehouse LS, Ray DD, Schrider DR. Tree Sequences as a General-Purpose Tool for Population Genetic Inference. Mol Biol Evol 2024; 41:msae223. [PMID: 39460991 PMCID: PMC11600592 DOI: 10.1093/molbev/msae223] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Revised: 10/05/2024] [Accepted: 10/17/2024] [Indexed: 10/28/2024] Open
Abstract
As population genetic data increase in size, new methods have been developed to store genetic information in efficient ways, such as tree sequences. These data structures are computationally and storage efficient but are not interchangeable with existing data structures used for many population genetic inference methodologies such as the use of convolutional neural networks applied to population genetic alignments. To better utilize these new data structures, we propose and implement a graph convolutional network to directly learn from tree sequence topology and node data, allowing for the use of neural network applications without an intermediate step of converting tree sequences to population genetic alignment format. We then compare our approach to standard convolutional neural network approaches on a set of previously defined benchmarking tasks including recombination rate estimation, positive selection detection, introgression detection, and demographic model parameter inference. We show that tree sequences can be directly learned from using a graph convolutional network approach and can be used to perform well on these common population genetic inference tasks with accuracies roughly matching or even exceeding that of a convolutional neural network-based method. As tree sequences become more widely used in population genetic research, we foresee developments and optimizations of this work to provide a foundation for population genetic inference moving forward.
Affiliation(s)
- Logan S Whitehouse
  - Department of Genetics, University of North Carolina, Chapel Hill, NC 27599, USA
- Dylan D Ray
  - Department of Genetics, University of North Carolina, Chapel Hill, NC 27599, USA
- Daniel R Schrider
  - Department of Genetics, University of North Carolina, Chapel Hill, NC 27599, USA
20
Wong Y, Ignatieva A, Koskela J, Gorjanc G, Wohns AW, Kelleher J. A general and efficient representation of ancestral recombination graphs. Genetics 2024; 228:iyae100. [PMID: 39013109 PMCID: PMC11373519 DOI: 10.1093/genetics/iyae100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/22/2024] [Accepted: 06/05/2024] [Indexed: 07/18/2024] Open
Abstract
As a result of recombination, adjacent nucleotides can have different paths of genetic inheritance and therefore the genealogical trees for a sample of DNA sequences vary along the genome. The structure capturing the details of these intricately interwoven paths of inheritance is referred to as an ancestral recombination graph (ARG). Classical formalisms have focused on mapping coalescence and recombination events to the nodes in an ARG. However, this approach is out of step with some modern developments, which do not represent genetic inheritance in terms of these events or explicitly infer them. We present a simple formalism that defines an ARG in terms of specific genomes and their intervals of genetic inheritance, and show how it generalizes these classical treatments and encompasses the outputs of recent methods. We discuss nuances arising from this more general structure, and argue that it forms an appropriate basis for a software standard in this rapidly growing field.
Affiliation(s)
- Yan Wong
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Anastasia Ignatieva
  - School of Mathematics and Statistics, University of Glasgow, Glasgow G12 8TA, UK
  - Department of Statistics, University of Oxford, Oxford OX1 3LB, UK
- Jere Koskela
  - School of Mathematics, Statistics and Physics, Newcastle University, Newcastle NE1 7RU, UK
  - Department of Statistics, University of Warwick, Coventry CV4 7AL, UK
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Edinburgh EH25 9RG, UK
- Anthony W Wohns
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305-5101, USA
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
21
Chotai M, Wei X, Messer PW. Signatures of selective sweeps in continuous-space populations. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.07.26.605365. [PMID: 39091822 PMCID: PMC11291165 DOI: 10.1101/2024.07.26.605365] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/04/2024]
Abstract
Selective sweeps describe the process by which an adaptive mutation arises and rapidly fixes in the population, thereby removing genetic variation in its genomic vicinity. The expected signatures of selective sweeps are relatively well understood in panmictic population models, yet natural populations often extend across larger geographic ranges where individuals are more likely to mate with those born nearby. To investigate how such spatial population structure can affect sweep dynamics and signatures, we simulated selective sweeps in populations inhabiting a two-dimensional continuous landscape. The maximum dispersal distance of offspring from their parents can be varied in our simulations from an essentially panmictic population to scenarios with increasingly limited dispersal. We find that in low-dispersal populations, adaptive mutations spread more slowly than in panmictic ones, while recombination becomes less effective at breaking up genetic linkage around the sweep locus. Together, these factors result in a trough of reduced genetic diversity around the sweep locus that looks very similar across dispersal rates. We also find that the site frequency spectrum around hard sweeps in low-dispersal populations becomes enriched for intermediate-frequency variants, making these sweeps appear softer than they are. Furthermore, haplotype heterozygosity at the sweep locus tends to be elevated in low-dispersal scenarios as compared to panmixia, contrary to what we observe in neutral scenarios without sweeps. The haplotype patterns generated by these hard sweeps in low-dispersal populations can resemble soft sweeps from standing genetic variation that arose from substantially older alleles. Our results highlight the need for better accounting for spatial population structure when making inferences about selective sweeps.
Affiliation(s)
- Meera Chotai
  - Department of Computational Biology, Cornell University
- Xinzhu Wei
  - Department of Computational Biology, Cornell University
22
Marsh JI, Johri P. Biases in ARG-Based Inference of Historical Population Size in Populations Experiencing Selection. Mol Biol Evol 2024; 41:msae118. [PMID: 38874402 PMCID: PMC11245712 DOI: 10.1093/molbev/msae118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/15/2024] [Revised: 06/05/2024] [Accepted: 06/11/2024] [Indexed: 06/15/2024] Open
Abstract
Inferring the demographic history of populations provides fundamental insights into species dynamics and is essential for developing a null model to accurately study selective processes. However, background selection and selective sweeps can produce genomic signatures at linked sites that mimic or mask signals associated with historical population size change. While the theoretical biases introduced by the linked effects of selection have been well established, it is unclear whether ancestral recombination graph (ARG)-based approaches to demographic inference in typical empirical analyses are susceptible to misinference due to these effects. To address this, we developed highly realistic forward simulations of human and Drosophila melanogaster populations, including empirically estimated variability of gene density, mutation rates, recombination rates, purifying, and positive selection, across different historical demographic scenarios, to broadly assess the impact of selection on demographic inference using a genealogy-based approach. Our results indicate that the linked effects of selection minimally impact demographic inference for human populations, although it could cause misinference in populations with similar genome architecture and population parameters experiencing more frequent recurrent sweeps. We found that accurate demographic inference of D. melanogaster populations by ARG-based methods is compromised by the presence of pervasive background selection alone, leading to spurious inferences of recent population expansion, which may be further worsened by recurrent sweeps, depending on the proportion and strength of beneficial mutations. Caution and additional testing with species-specific simulations are needed when inferring population history with non-human populations using ARG-based approaches to avoid misinference due to the linked effects of selection.
Affiliation(s)
- Jacob I Marsh
  - Department of Biology, University of North Carolina, Chapel Hill, NC 27599, USA
- Parul Johri
  - Department of Biology, University of North Carolina, Chapel Hill, NC 27599, USA
  - Department of Genetics, University of North Carolina, Chapel Hill, NC 27599, USA
  - Integrative Program for Biological and Genome Sciences, University of North Carolina, Chapel Hill, NC 27599, USA
23
Tagami D, Bisschop G, Kelleher J. tstrait: a quantitative trait simulator for ancestral recombination graphs. Bioinformatics 2024; 40:btae334. [PMID: 38796683 PMCID: PMC11784591 DOI: 10.1093/bioinformatics/btae334] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/24/2024] [Indexed: 05/28/2024] Open
Abstract
SUMMARY: Ancestral recombination graphs (ARGs) encode the ensemble of correlated genealogical trees arising from recombination in a compact and efficient structure and are of fundamental importance in population and statistical genetics. Recent breakthroughs have made it possible to simulate and infer ARGs at biobank scale, and there is now intense interest in using ARG-based methods across a broad range of applications, particularly in genome-wide association studies (GWAS). Sophisticated methods exist to simulate ARGs using population genetics models, but there is currently no software to simulate quantitative traits directly from these ARGs. To apply existing quantitative trait simulators users must export genotype data, losing important information about ancestral processes and producing prohibitively large files when applied to the biobank-scale datasets currently of interest in GWAS. We present tstrait, an open-source Python library to simulate quantitative traits on ARGs, and show how this user-friendly software can quickly simulate phenotypes for biobank-scale datasets on a laptop computer. AVAILABILITY AND IMPLEMENTATION: tstrait is available for download on the Python Package Index. Full documentation with examples and workflow templates is available on https://tskit.dev/tstrait/docs/, and the development version is maintained on GitHub (https://github.com/tskit-dev/tstrait).
Affiliation(s)
- Daiki Tagami
  - Department of Statistics, University of Oxford, Oxford OX1 3LB, United Kingdom
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, United Kingdom
- Gertjan Bisschop
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, United Kingdom
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, United Kingdom
24
Tagami D, Bisschop G, Kelleher J. tstrait: a quantitative trait simulator for ancestral recombination graphs. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.03.13.584790. [PMID: 38559118 PMCID: PMC10980058 DOI: 10.1101/2024.03.13.584790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Summary: Ancestral recombination graphs (ARGs) encode the ensemble of correlated genealogical trees arising from recombination in a compact and efficient structure, and are of fundamental importance in population and statistical genetics. Recent breakthroughs have made it possible to simulate and infer ARGs at biobank scale, and there is now intense interest in using ARG-based methods across a broad range of applications, particularly in genome-wide association studies (GWAS). Sophisticated methods exist to simulate ARGs using population genetics models, but there is currently no software to simulate quantitative traits directly from these ARGs. To apply existing quantitative trait simulators users must export genotype data, losing important information about ancestral processes and producing prohibitively large files when applied to the biobank-scale datasets currently of interest in GWAS. We present tstrait, an open-source Python library to simulate quantitative traits on ARGs, and show how this user-friendly software can quickly simulate phenotypes for biobank-scale datasets on a laptop computer. Availability and Implementation: tstrait is available for download on the Python Package Index. Full documentation with examples and workflow templates is available on https://tskit.dev/tstrait/docs/, and the development version is maintained on GitHub (https://github.com/tskit-dev/tstrait). Contact: daiki.tagami@hertford.ox.ac.uk.
Affiliation(s)
- Daiki Tagami
  - Department of Statistics, University of Oxford, 24-29 St Giles’, Oxford OX1 3LB, United Kingdom
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Old Road Campus, Oxford OX3 7LF, United Kingdom
- Gertjan Bisschop
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Old Road Campus, Oxford OX3 7LF, United Kingdom
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Old Road Campus, Oxford OX3 7LF, United Kingdom
25
Huang Z, Kelleher J, Chan YB, Balding DJ. Estimating evolutionary and demographic parameters via ARG-derived IBD. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.03.07.583855. [PMID: 38559261 PMCID: PMC10979897 DOI: 10.1101/2024.03.07.583855] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Inference of demographic and evolutionary parameters from a sample of genome sequences often proceeds by first inferring identical-by-descent (IBD) genome segments. By exploiting efficient data encoding based on the ancestral recombination graph (ARG), we obtain three major advantages over current approaches: (i) no need to impose a length threshold on IBD segments, (ii) IBD can be defined without the hard-to-verify requirement of no recombination, and (iii) computation time can be reduced with little loss of statistical efficiency using only the IBD segments from a set of sequence pairs that scales linearly with sample size. We first demonstrate powerful inferences when true IBD information is available from simulated data. For IBD inferred from real data, we propose an approximate Bayesian computation inference algorithm and use it to show that poorly-inferred short IBD segments can improve estimation precision. We show estimation precision similar to a previously-published estimator despite a 4 000-fold reduction in data used for inference. Computational cost limits model complexity in our approach, but we are able to incorporate unknown nuisance parameters and model misspecification, still finding improved parameter inference.
Affiliation(s)
- Zhendong Huang
  - Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Australia
- Jerome Kelleher
  - Oxford Big Data Institute, University of Oxford, United Kingdom
- Yao-ban Chan
  - Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Australia
- David J. Balding
  - Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Australia